UPDATE 23.2.2017: Since I no longer work with Xilinx products (more than 3 years!) I cannot provide any updates or support regarding the information given below. You may continue reading at your own risk!
Since I am leaving my current employer UTIA at the end of the year, and thus will likely not be working with FPGAs any more (at least not for a living), I wanted to implement a larger design in Zynq to enjoy the technology one last time 😉
The selected application is real-time motion detection in a video stream. We previously implemented it in the SMECY project on a Spartan 6 FPGA using the master/worker abstractions. The original Spartan 6 design achieved 5 FPS at best, and 3 FPS when other features were included (morphology, labeling).
Here I used the Zynq XC7Z020 on the ZC702 board with the IMAGEON daughter card. No code is reused from the SMECY solution. The video pipeline is built from AXI Streams, HDMI is used for input and output, and the accelerator was implemented using Vivado HLS (high-level synthesis from C). The synthesis tool is Vivado 2013.3 with IP Integrator (which replaces XPS).
One possible practical application of motion detection is in smart cameras for surveillance (security, safety); see the second YouTube video below. The HDMI input would be replaced with a camera interface, and the FPGA system could be integrated into the camera module.
Below is a demonstration video. The application runs at 8.2 FPS with one accelerator, and 14 FPS with two accelerators (not shown in the videos).
Video input and output are via HDMI on the Imageon extension card. The 1080p input video is fed via HDMI from a PC running Ubuntu. The output is 1280x720p to a monitor. The output image contains a 640×480 top-right corner region, which is also the input to the motion detection. The resulting 640×480 black-and-white mask is displayed next to it.
Motion detection algorithm
The system implements real-time video motion detection, sometimes also called foreground/background pixel segmentation. The algorithm is derived from a paper by Kaewtrakulpong; the implementation does not use shadow detection and it has several modifications intended to lower compute complexity.
The goal of image segmentation is to mark each pixel in an image frame as part of the static background or the moving foreground. The decision depends on statistical models and their mixtures. All pixels in the image are considered independently. Each pixel is modelled by a mixture of the K strongest Gaussian models of background, with K=4 in the implementation. Each Gaussian model k is defined by a set of three mean values $\mu_{R,k}$, $\mu_{G,k}$, $\mu_{B,k}$, corresponding to the three primary colours red, green, and blue; by a variance $\sigma_k^2$; and by a weight $w_k$. The models represent RGB colours that are considered to be ‘stationary background’ colours of the pixel. As there are K=4 independent Gaussian models kept for each pixel, the algorithm allows for situations where the pixel periodically changes between two colours, such as moving escalators or trees in the wind; these scenes are classified as stationary. Each model also contains the weight parameter indicating how often that particular model successfully described the background in the pixel.
The picture above shows how the algorithm updates the models; for simplicity the presentation ignores RGB colours and shows only three models. The first picture at the top shows the initial situation with three models M1=$(\mu_1, \sigma_1^2, w_1)$, M2=$(\mu_2, \sigma_2^2, w_2)$, and M3=$(\mu_3, \sigma_3^2, w_3)$. The mean values position the ‘bell’ shaped models on the horizontal colour (greyscale) axis; the variances define the widths of the ‘bells’; and the model weights are represented by the heights of the ‘bells’. When a new pixel colour hits one of the models, the model is ‘strengthened’ by slightly increasing its weight, and the pixel colour is classified as stationary background. This situation is shown in the picture in the middle: the colour hits model M3, and its weight is increased. If the hit is not precise, the model is also shifted slightly towards the new colour.
However, when a new pixel colour does not hit any existing Gaussian model, the colour is classified as moving foreground. The weakest model is erased and replaced by a new model representing the new colour, albeit with a small initial weight. This is illustrated in the last subpicture above: the weakest model M3 has been replaced by a new model.
This algorithm was selected and implemented in “high-level” C code (intended for CPU execution) by Roman Bartosinsky, a colleague in the SMECY project.
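The update rule described above can be sketched in plain C for a single pixel. This is only an illustration under stated assumptions, not the code used in the project: the constants `ALPHA`, `MATCH_SIGMAS`, `INIT_VAR`, and `INIT_WEIGHT`, the `gauss_model` layout, and the function name `update_pixel` are all hypothetical. The sketch also leaves out the variance update and any weight normalisation that a full implementation would likely need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define K 4 /* Gaussian models kept per pixel, as stated in the text */

/* Hypothetical tuning constants: the article does not give the real values. */
#define ALPHA        0.01f  /* learning rate for weight and mean updates */
#define MATCH_SIGMAS 2.5f   /* "hit" if within this many standard deviations */
#define INIT_VAR     900.0f /* variance given to a newly created model */
#define INIT_WEIGHT  0.05f  /* small initial weight of a new model */

typedef struct {
    float mean[3]; /* mean red, green, blue */
    float var;     /* one variance shared by all three channels */
    float weight;  /* how often this model has described the background */
} gauss_model;

/* Updates the K models of one pixel with a new RGB sample.
 * Returns true if the sample is classified as moving foreground. */
bool update_pixel(gauss_model m[K], const float rgb[3])
{
    assert(m != NULL && rgb != NULL);

    int hit = -1;
    int weakest = 0;
    for (int k = 0; k < K; k++) {
        float d2 = 0.0f; /* squared RGB distance to the model mean */
        for (int c = 0; c < 3; c++) {
            float d = rgb[c] - m[k].mean[c];
            d2 += d * d;
        }
        if (hit < 0 && d2 < MATCH_SIGMAS * MATCH_SIGMAS * m[k].var)
            hit = k; /* the first matching model wins */
        if (m[k].weight < m[weakest].weight)
            weakest = k;
    }

    if (hit >= 0) {
        /* Background: strengthen the model and shift it towards the sample. */
        m[hit].weight += ALPHA * (1.0f - m[hit].weight);
        for (int c = 0; c < 3; c++)
            m[hit].mean[c] += ALPHA * (rgb[c] - m[hit].mean[c]);
        return false;
    }

    /* Foreground: replace the weakest model with one centred on the sample. */
    for (int c = 0; c < 3; c++)
        m[weakest].mean[c] = rgb[c];
    m[weakest].var = INIT_VAR;
    m[weakest].weight = INIT_WEIGHT;
    return true;
}
```

Calling `update_pixel` once per pixel per frame yields the black-and-white mask: a `true` result marks the pixel as moving foreground, `false` as stationary background.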
The picture below shows the annotated block diagram from IP Integrator.
The system consists of three main parts: video input pipeline, video output pipeline, and accelerator subsystem.
The video input path is highlighted using the yellow colour in the system image above. The pipeline consists of the following processing cores:
- HDMI input and decoder
- video to AXI-Stream converter
- YUV 4:2:2 to 4:4:4 expander (16 to 24 bits) (custom core in VHDL)
- YUV 4:4:4 to RGB colour space converter
- 24-to-32 bits pixel padding
- Video DMA storing the input video stream into the main memory via
- AXI memory interconnect and
- Zynq HP0 port (150MHz, 64bits).
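To illustrate the 16-to-24-bit expansion step in the list above, here is a minimal C sketch of a YUV 4:2:2 to 4:4:4 expander. The real core is written in VHDL and its internals are not described here, so everything in the sketch is an assumption: the sample layout (luma in bits 7:0, alternating Cb/Cr in bits 15:8 on input; Y, Cb, Cr in bits 7:0, 15:8, 23:16 on output), the choice of simple chroma replication instead of interpolation, and the function name `yuv422_to_444`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Expands one line of YUV 4:2:2 samples to YUV 4:4:4.
 * Assumed input layout:  even pixel = (Cb << 8) | Y0, odd pixel = (Cr << 8) | Y1.
 * Assumed output layout: (Cr << 16) | (Cb << 8) | Y, held in a 32-bit word.
 * Each pixel pair shares one Cb/Cr pair, which is simply replicated. */
void yuv422_to_444(const uint16_t *in, uint32_t *out, size_t width)
{
    assert(width % 2 == 0); /* 4:2:2 pixels come in pairs */

    for (size_t x = 0; x < width; x += 2) {
        uint32_t y0 = in[x] & 0xFFu;
        uint32_t cb = (uint32_t)(in[x] >> 8);
        uint32_t y1 = in[x + 1] & 0xFFu;
        uint32_t cr = (uint32_t)(in[x + 1] >> 8);
        uint32_t chroma = (cr << 16) | (cb << 8);

        out[x]     = chroma | y0;
        out[x + 1] = chroma | y1;
    }
}
```

In hardware the same operation amounts to holding the chroma values of a pixel pair in registers, which is why a small custom core is enough for this stage.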
The video output path is highlighted using the blue colour. It basically mirrors the input path in the reverse order:
- Zynq HP1 port (150MHz, 64bits),
- AXI memory interconnect,
- Video DMA reading via the above ports and producing pixel stream on its AXI-Stream output,
- 32-to-24 bits trim
- RGB to YUV 4:4:4 colour space converter
- YUV 4:4:4 to YUV 4:2:2 conversion (24 to 16 bits)
- AXI-Stream to video stream
- HDMI output coder.
The accelerator subsystem uses the red path.
- Pixel and context data is accessed using Zynq HP2 port (100MHz, 64bits),
- by the Central DMA engine (in scatter/gather mode that automatically fetches new block descriptors via the ACP port)
- and transferred via AXI interconnects to
- AXI BRAM controllers
- that connect to the actual BRAMs.
- The BRAMs keep the working data for the accelerator – pixels in, context, pixels out.
- The accelerator HW is implemented using Vivado HLS.
The accelerator HW block requires 17 DSP48E blocks, 5967 FFs, and 7229 LUTs. It runs at a 100 MHz clock. It can process 1024 pixels in one activation. This requires 4 kB for pixel input and output data, and 5*16 kB = 80 kB for context data. The accelerator is internally pipelined. Processing a single pixel takes 96 clock cycles; a new pixel can be accepted into the pipeline every 9 clock cycles. A single accelerator instance delivers about 8.2 FPS.
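The pipeline figures above allow a rough estimate of the compute-only frame rate. The C sketch below is a back-of-the-envelope calculation, not a measurement: it assumes the standard pipeline timing model (the first result after the full latency, then one result per initiation interval), a 640×480 frame split into blocks of 1024 pixels, and no overlap between blocks. The function names are hypothetical.

```c
#include <assert.h>

/* Clock cycles for a pipelined unit to process n items: the first item
 * takes `latency` cycles, each further item adds `interval` cycles. */
unsigned long pipeline_cycles(unsigned long n, unsigned long latency,
                              unsigned long interval)
{
    assert(n > 0);
    return latency + (n - 1) * interval;
}

/* Upper bound on frames per second from accelerator compute time alone,
 * ignoring all data transfers and control overhead. */
double compute_bound_fps(unsigned long frame_pixels, unsigned long block_pixels,
                         unsigned long latency, unsigned long interval,
                         double clock_hz)
{
    assert(block_pixels > 0);
    unsigned long blocks = (frame_pixels + block_pixels - 1) / block_pixels;
    unsigned long cycles = blocks * pipeline_cycles(block_pixels, latency, interval);
    return clock_hz / (double)cycles;
}
```

With the numbers from the text, `pipeline_cycles(1024, 96, 9)` gives 9303 cycles per activation, and `compute_bound_fps(640UL * 480UL, 1024, 96, 9, 100e6)` gives roughly 35.8 FPS for the 300 blocks of a frame. The gap to the measured 8.2 FPS is presumably time spent moving pixel and context data between main memory and the BRAMs, which this estimate ignores.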
UPDATE: A configuration with two accelerator instances achieves 14 FPS.
UPDATE 2: The source code can be downloaded from here.