## Real-time motion detection in video on Zynq FPGA

UPDATE 23.2.2017: Since I no longer work with Xilinx products (more than 3 years!) I cannot provide any updates or support regarding the information given below. You may continue reading at your own risk!

Since I am leaving my current employer UTIA by the end of the year, and thus will likely not be working with FPGAs any more (at least not for a living), I wanted to implement a larger design in Zynq to enjoy the technology for the last time 😉

The selected application is real-time motion detection in a video stream. We implemented it in the SMECY project on a Spartan 6 FPGA using the master/worker abstractions. The original Spartan 6 design achieved 5 FPS at best, and 3 FPS when other features (morphology, labeling) were included.

Here I used the Zynq XC7Z020 FPGA on the ZC702 board with the IMAGEON daughter card. No code is reused from the SMECY solution. The video pipeline is realized using AXI Streams, HDMI is used for input and output, and the accelerator was implemented using Vivado HLS (high-level synthesis from C). The synthesis tool used is Vivado 2013.3 with IP Integrator (which replaces XPS).

One possible practical application of the motion detection is in smart cameras for surveillance (security, safety); see the second YouTube video below. The HDMI input would be replaced with a camera interface and the FPGA system could be integrated in the camera module.

Below is a demonstration video. The application runs at 8.2 FPS with one accelerator, and 14 FPS with two accelerators (not shown in the videos).

Video input and output are via HDMI on the IMAGEON extension card. The 1080p input video is fed via HDMI from a PC running Ubuntu. The output is 1280x720p to a monitor. The output image contains the top-right 640×480 corner of the input, which is also the input to the motion detection. The 640×480 black-and-white output mask is displayed next to it.

## Motion detection algorithm

The system implements real-time video motion detection, sometimes also called foreground/background pixel segmentation. The algorithm is derived from a paper by Kaewtrakulpong; the implementation does not use shadow detection and it has several modifications intended to lower compute complexity.

The goal of image segmentation is to mark each pixel in an image frame as part of the static background or the moving foreground. The decision depends on statistical models and their mixtures. All pixels in the image are considered independently. Each pixel is modelled by a mixture of the K strongest Gaussian models of the background, with K=4 in the implementation. Each Gaussian model k is defined by a set of 3 mean values $\mu_{R,k}$, $\mu_{G,k}$, $\mu_{B,k}$, corresponding to the three primary colours red, green, and blue; by a variance $\sigma_k$; and by a weight $w_k$. The models represent RGB colours that are considered to be ‘stationary background’ colours of the pixel. As K=4 independent Gaussian models are kept for each pixel, the algorithm allows for situations where the pixel periodically changes between two colours, such as moving escalators or trees in the wind; these scenes are classified as stationary. The weight parameter $w_k$ of each model indicates how often that particular model successfully described the background in the pixel.

The picture above shows how the algorithm updates the models; for simplicity the presentation ignores RGB colours and shows only three models. The first picture at the top shows the initial situation with three models M1=$(\mu_1,\sigma_1)$, M2=$(\mu_2,\sigma_2)$, and M3=$(\mu_3, \sigma_3)$. Mean values $\mu_i$ position the ‘bell’-shaped models on the horizontal colour (greyscale) axis; variances $\sigma_i$ define the widths of the ‘bells’; and model weights $w_i$ are represented by the heights of the ‘bells’. When a new pixel colour hits one of the models, the model is ‘strengthened’ by slightly increasing its weight, and the pixel colour is classified as stationary background. This situation is shown in the picture in the middle: the colour hits model M3, and the weight $w_3$ is increased. If the hit is not precise, the model is also shifted slightly towards the new colour.

However, when a new pixel colour does not hit any existing Gaussian model, the colour is classified as moving foreground. The weakest model is erased and replaced by a new model representing the new colour, albeit with a small initial weight. This is illustrated in the last subpicture above: the weakest model M3 has been replaced by a new model.
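The update rule described above can be sketched for the simplified greyscale case as follows. This is only an illustration, not the code that runs in the accelerator: the matching threshold, the learning rate, and the initial values of a new model are assumptions (the post does not give them), and the sketch leaves out variance updates and any weight decay or normalisation.

```python
from dataclasses import dataclass

@dataclass
class Model:
    mean: float    # mu: position of the bell on the greyscale axis
    sigma: float   # width of the bell
    weight: float  # how often this model described the background

# Assumed tuning constants; the real values are not given in the post.
MATCH_SIGMAS = 2.5   # a colour "hits" a model if within this many sigmas
ALPHA = 0.05         # learning rate for the weight and mean updates
INIT_SIGMA = 10.0    # width of a freshly created model
INIT_WEIGHT = 0.05   # small initial weight of a freshly created model

def update_pixel(models, colour):
    """Update the models of one pixel in place; return True for foreground."""
    for m in models:
        if abs(colour - m.mean) <= MATCH_SIGMAS * m.sigma:
            m.weight += ALPHA * (1.0 - m.weight)  # strengthen the hit model
            m.mean += ALPHA * (colour - m.mean)   # shift it towards the colour
            return False                          # stationary background
    # No model was hit: replace the weakest model with one for the new colour.
    weakest = min(range(len(models)), key=lambda i: models[i].weight)
    models[weakest] = Model(colour, INIT_SIGMA, INIT_WEIGHT)
    return True                                   # moving foreground

models = [Model(50.0, 5.0, 0.5), Model(120.0, 5.0, 0.3), Model(200.0, 5.0, 0.1)]
hit = update_pixel(models, 202.0)      # falls inside the third model
weight_after_hit = models[2].weight
miss = update_pixel(models, 20.0)      # matches nothing, replaces the weakest
```

In this example the first colour strengthens the third model and is reported as background; the second colour matches nothing, so the (still weakest) third model is replaced and the pixel is reported as foreground.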

This algorithm was selected and implemented in a “high-level” C code (intended for CPU execution) by Roman Bartosinsky, a colleague in the SMECY project.

## Implementation details

The picture below shows the annotated block diagram from IP Integrator.

The system consists of three main parts: video input pipeline, video output pipeline, and accelerator subsystem.

The video input path is highlighted using the yellow colour in the system image above. The pipeline consists of the following processing cores:

1. HDMI input and decoder
2. video to AXI-Stream converter
3. YUV 4:2:2 to 4:4:4 expander (16 to 24 bits) (custom core in VHDL)
4. YUV 4:4:4 to RGB colour space converter
5. Video DMA storing the input video stream into the main memory via
6. AXI memory interconnect and
7. Zynq HP0 port (150MHz, 64bits).

The video output path is highlighted using the blue colour. It basically mirrors the input path in the reverse order:

1. Zynq HP1 port (150MHz, 64bits),
2. AXI memory interconnect,
3. Video DMA reading via the above ports and producing pixel stream on its AXI-Stream output,
4. 32-to-24 bits trim
5. RGB to YUV 4:4:4 colour space converter
6. YUV 4:4:4 to YUV 4:2:2 conversion (24 to 16 bits)
7. AXI-Stream to video stream
8. HDMI output coder.

The accelerator subsystem uses the red path.

1. Pixel and context data is accessed using Zynq HP2 port (100MHz, 64bits),
2. by the Central DMA engine (in scatter/gather mode that automatically fetches new block descriptors via the ACP port)
3. and transferred via AXI interconnects to
4. AXI BRAM controllers
5. that connect to the actual BRAMs.
6. The BRAMs keep the working data for the accelerator – pixels in, context, pixels out.
7. The accelerator HW is implemented using Vivado HLS.

The accelerator HW block requires 17 DSP48E blocks, 5967 FFs, and 7229 LUTs. It runs at a 100 MHz clock. It can process 1024 pixels in one activation. This requires 4kB each for pixel input and output data, and 5*16kB=80kB for context data. The accelerator is internally pipelined: processing a single pixel takes 96 clock cycles, and a new pixel can be accepted into the pipeline every 9 clock cycles. A single accelerator instance delivers about 8.2 FPS.
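As a rough sanity check of these numbers, the arithmetic below estimates how long the pipeline alone would need for one 640×480 frame. It assumes the usual pipeline model (latency plus one initiation interval per additional pixel) and that activations run back to back without overlap; both are assumptions of this sketch, not statements from the design.

```python
CLOCK_HZ = 100_000_000   # accelerator clock (from the text)
LATENCY = 96             # cycles for one pixel to pass through the pipeline
INTERVAL = 9             # cycles between accepting consecutive pixels
BLOCK = 1024             # pixels processed in one activation
FRAME = 640 * 480        # pixels in the processed region

def block_cycles(n=BLOCK):
    """Cycles for one activation under the assumed pipeline model."""
    return LATENCY + (n - 1) * INTERVAL

activations = -(-FRAME // BLOCK)               # ceiling division: 300
frame_cycles = activations * block_cycles()    # 300 * 9303 = 2,790,900
fps_bound = CLOCK_HZ / frame_cycles            # about 35.8

print(f"{block_cycles()} cycles per activation, "
      f"{activations} activations per frame, "
      f"compute-only bound {fps_bound:.1f} FPS")
```

Under these assumptions the pipeline by itself could sustain roughly 35.8 FPS, so at the measured 8.2 FPS it would be busy for only about a quarter of each frame period. The post does not say where the remaining time goes; moving pixel and context data in and out of the BRAMs and the software control loop are plausible candidates.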

UPDATE: A configuration with two accelerator instances achieves 14 FPS.

## Xilinx Vivado 2013.3 on Fedora 18: Working around a D-Bus bug

Running Xilinx Vivado 2013.3 (webpack license) on Fedora 18 may fail with the following error message:

$ vivado

**** SW Build 329390 on Wed Oct 16 18:26:55 MDT 2013
**** IP Build 192953 on Wed Oct 16 08:44:02 MDT 2013

INFO: [Common 17-78] Attempting to get a license: Implementation
process 17688: arguments to dbus_move_error() were incorrect, assertion "(dest) == NULL || !dbus_error_is_set ((dest))" failed in file dbus-errors.c line 282.
This is normally a bug in some application using the D-Bus library.
D-Bus not built with -rdynamic so unable to print a backtrace
Abnormal program termination (6)
Please check '/home/jara/hdl/hs_err_pid17688.log' for details

The workaround is to make the D-Bus communication socket file /var/run/dbus/system_bus_socket unavailable when the Xilinx tools run.

Execute as root:

chmod o-rw /var/run/dbus/system_bus_socket

However, this workaround may interfere with other system software, so look out for potential side effects.

## Hardware/Software Co-Simulation (PORTAL)

I devised PORTAL as a means for validating our hardware accelerator cores (ASVP in SMECY) in a co-simulated environment together with control software. The basic structure of PORTAL is shown below:

PORTAL is a communication library that connects a hardware model simulated in ModelSim to its control software running on a PC. Communication is done over TCP/IP.

In PORTAL the primitive communication abstraction is a shared memory. All PORTAL clients have common access to a (virtually) shared 32-bit address space. Any client can dynamically claim and register any unoccupied memory range in the address space and start to serve read/write requests generated by other clients. Management of the virtual address space is delegated to the central server, PHUB.

In our use case, each ASVP core typically has 4 data memory banks and 2 to 3 control banks (firmware, control/status, vector partitions). In the co-simulation environment the VHDL top-level test-bench registers each bank as a memory extent with the PHUB server and serves accesses to the banks. Meanwhile, the ASVP master control program that runs on a PC also connects to the PHUB, discovers the ASVP core’s memory extents, and presents them to the software using WAL (Worker Abstraction Library).
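The address-space bookkeeping described above can be illustrated with a small sketch. The class and method names here are hypothetical and do not reflect the actual PHUB interface or its TCP/IP protocol; the sketch only shows the idea of claiming unoccupied extents and routing an address to the client that serves it.

```python
class AddressSpace:
    """Hypothetical stand-in for the hub's view of the shared 32-bit space."""

    LIMIT = 1 << 32

    def __init__(self):
        self.extents = []   # list of (start, end, client), end exclusive

    def claim(self, start, size, client):
        """Register [start, start+size) for a client; False if it overlaps."""
        end = start + size
        if start < 0 or size <= 0 or end > self.LIMIT:
            raise ValueError("extent outside the 32-bit address space")
        for s, e, _ in self.extents:
            if start < e and s < end:      # ranges intersect
                return False
        self.extents.append((start, end, client))
        return True

    def owner(self, addr):
        """Return the client serving this address, or None if unclaimed."""
        for s, e, client in self.extents:
            if s <= addr < e:
                return client
        return None

hub = AddressSpace()
first = hub.claim(0x1000_0000, 0x1000, "testbench:bank_a")   # accepted
second = hub.claim(0x1000_0800, 0x1000, "other_client")      # overlaps
served_by = hub.owner(0x1000_0004)
unclaimed = hub.owner(0x2000_0000)
```

In this picture the VHDL test-bench would claim one extent per memory bank, and a read or write issued by the control program would be forwarded to whichever client owns the target address.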

## MASSTEST: Automated validation/verification testing

In 2009 I joined UTIA and started working on the Apple-CORE project. The project was already in its second year, so plenty of implementation work had already been done. My first assignment was to prepare a test suite of simple assembly-level programs for validation of the UTLEON3 processor. The initial approach was quite naive, though: I produced a test program, observed that it did not run as intended, and reported a bug via e-mail to my colleagues. However, it quickly turned out that bugs were so widespread that fixing one place often broke other things… Running validation tests had to be automated, and MASSTEST was born.

MASSTEST is a collection of simple scripts written in Perl for running validation test suites. The main script is called mtest.pl. This script runs test-sets: first it compiles test programs using assembler/makefile toolchain and then it runs ModelSim VHDL simulator to execute the test program in a simulated UTLEON3. Simulation results are placed in dedicated directories. The second important script is called mcollect.pl: it gathers results from the directories, processes ModelSim output files (mainly the transcript) and generates summary tables.

The mtest.pl script is driven by a test-set configuration file. Configuration files contain lines that assign lists of values to parameters. For example:

# directories: software, hardware
prog_dir        ../../utbench
hdl_dir         ../../integration-V2/designs/utleon3-ml509-hwt03/_scripts
timeout         400
simscr          do-sim-c.sh
# hardware parameters
hw-PCNT_KERNEL  '1'
hw-RF_1W_PORT    1
# programs to run
program         t07-ut t07-ut-unroll2 t07-ut-unroll4 t07-ut-swch t07-ut-unroll2-swch t07-ut-unroll4-swch
# software parameters
sw-BLOCKSIZE_1  2 5 7
sw-BLOCKSIZE_2  6

The parameters prog_dir, hdl_dir, timeout, and simscr specify system values for running the test. The parameter “program” specifies a list of programs that will be run; each program is a directory under “prog_dir”. The parameters that start with “hw-” are passed to VHDL simulator as generics of the top-level test-bench. The parameters that start with “sw-” are passed to compiler/assembler as macro expansions.

Each parameter is really a list of space-separated values. The important thing about MASSTEST is that it creates the Cartesian product of the configuration parameters: in the script above, each combination of the values of ‘program’ and ‘sw-BLOCKSIZE_1’ creates an independent test run.
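The expansion into test runs can be sketched in a few lines. MASSTEST itself is written in Perl; the Python below is only an illustration of the idea, and the function names and the order in which runs are enumerated are assumptions.

```python
from itertools import product

CONFIG = """\
# programs to run
program         t07-ut t07-ut-unroll2 t07-ut-unroll4 t07-ut-swch t07-ut-unroll2-swch t07-ut-unroll4-swch
# software parameters
sw-BLOCKSIZE_1  2 5 7
sw-BLOCKSIZE_2  6
"""

def parse(text):
    """Map each parameter name to its list of space-separated values."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        name, *values = line.split()
        params[name] = values
    return params

def expand(params):
    """One dict per test run: the Cartesian product of all value lists."""
    names = list(params)
    return [dict(zip(names, combo))
            for combo in product(*(params[n] for n in names))]

runs = expand(parse(CONFIG))
print(len(runs))   # 6 programs x 3 block sizes x 1 block size = 18
```

Parameters with a single value, such as ‘sw-BLOCKSIZE_2’ here, do not multiply the number of runs; they simply appear unchanged in every run.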

Often, however, I don’t need the full Cartesian product of several parameters. Indeed, as individual simulation runs can take tens of minutes, it becomes too expensive to run so many tests. Therefore, bundled parameters were devised:

hw-TECHNO;hw-SPEED;hw-dfuclkspeed   1;A3M2;1.0  1;A4M3;1.5  1;A5M3;1.66  1;A6M3;2.0  3;A2M3;1.0  3;A4M4;2.5
cmd-WLEN                4 8 16 32 64 128 196

Bundled parameters are separated by semicolons. The example above bundles the three parameters ‘hw-TECHNO’, ‘hw-SPEED’, and ‘hw-dfuclkspeed’ and specifies a set of value triples that are assigned (or iterated) together. The other parameter, ‘cmd-WLEN’, is not bound to those three; hence, MASSTEST will create the Cartesian product of the triples and the individual values in ‘cmd-WLEN’.
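A minimal sketch of how bundles might be expanded is shown below, again in Python for illustration only (the function name is hypothetical and this is not the Perl code from MASSTEST). Each bundle is treated as one axis of the product and is split into its member parameters only after a combination has been chosen.

```python
from itertools import product

def expand_bundled(params):
    """Cartesian product where a 'a;b;c' key iterates its values together."""
    names = list(params)
    runs = []
    for combo in product(*(params[n] for n in names)):
        run = {}
        for name, value in zip(names, combo):
            # "hw-TECHNO;hw-SPEED" with value "1;A3M2" yields two entries.
            run.update(zip(name.split(";"), value.split(";")))
        runs.append(run)
    return runs

params = {
    "hw-TECHNO;hw-SPEED;hw-dfuclkspeed": [
        "1;A3M2;1.0", "1;A4M3;1.5", "1;A5M3;1.66",
        "1;A6M3;2.0", "3;A2M3;1.0", "3;A4M4;2.5",
    ],
    "cmd-WLEN": ["4", "8", "16", "32", "64", "128", "196"],
}

runs = expand_bundled(params)
print(len(runs))   # 6 triples x 7 word lengths = 42
```

With the values from the example this gives 42 runs, whereas crossing the distinct values of all four parameters independently would give 2 × 6 × 5 × 7 = 420.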

Even though all test-runs generated and executed in MASSTEST are independent of each other, parallelization of the runs is not so easy. The problem is that the program (software) and design (hardware) directories are re-used for each run. Indeed, VHDL compilation for ModelSim takes time and disk space, and redoing it for each test-run is wasteful.

So, parallelization is done differently. You first decide on the number of concurrent runs that should be executed together. Usually this is the number of processors in system, typically four. The program and design directories must be replicated four times and specified in the configuration file using a bracket notation:

# directories: software, hardware; replicated four times.
prog_dir[0]        ../../utbench_0
prog_dir[1]        ../../utbench_1
prog_dir[2]        ../../utbench_2
prog_dir[3]        ../../utbench_3
hdl_dir[0]         ../../integration-V2/designs/utleon3-ml509-hwt03-0/_scripts
hdl_dir[1]         ../../integration-V2/designs/utleon3-ml509-hwt03-1/_scripts
hdl_dir[2]         ../../integration-V2/designs/utleon3-ml509-hwt03-2/_scripts
hdl_dir[3]         ../../integration-V2/designs/utleon3-ml509-hwt03-3/_scripts

This tells MASSTEST that there are four independent copies of program and HDL design directories. Then you specify that the parallelization should be done over those four replicas:

parmodulo    4

Configuration scripts that use parallelization must be run using mptest.pl, not mtest.pl. The mptest.pl script is a wrapper that forks off four instances of mtest.pl (or whatever number is specified in parmodulo) and, importantly, tells each instance ‘i’ to execute only the tests whose index is congruent to i modulo four. All instances iterate over the whole test space, but the first instance executes tests 0, 4, 8, …, the second instance tests 1, 5, 9, …, and so on. The test runs are thus interleaved over the instances.

Finally, it is also possible to parallelize test runs over a cluster of computers. This requires that the computers in the cluster have a unified filesystem view, i.e. a file name on one computer refers to the same object on the others. In practice this means having a shared network disk mounted in the same place on all nodes. Secondly, remote command execution is facilitated by SSH with password-less public-key authentication. MASSTEST configuration scripts are extended with a new “remote” directive:

remote[0]    blue03
remote[1]    blue03
remote[2]    blue04
remote[3]    blue04

The “remote” directive specifies the cluster host names to which the mtest.pl instances should be delegated. The parmodulo mechanism is used to break the test-runs into independent groups, and the mtest.pl instance for each group is run on the remote host via ssh.