This topic shows how to implement a cartoon-like filter that can be applied to video frames by using a data-flow network. The filter consists of two stages:
1. Color simplification: a Gaussian average of each pixel's neighborhood is computed and assigned to that pixel. This filter is iterative and is applied multiple times per frame; in this walkthrough it is applied three times.
2. Edge detection: edge pixels are assigned a black color.
Here is the serial implementation of both filters:
Color simplification serial code
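The original post's serial code is not reproduced here. The sketch below illustrates the idea under stated assumptions: a hypothetical single-channel `Frame` type (a real video filter would process RGB frames) and a standard 3x3 Gaussian kernel standing in for whatever averaging the original used.

```cpp
#include <cassert>
#include <vector>

// Hypothetical frame representation: grayscale intensities, row-major.
struct Frame {
    int width;
    int height;
    std::vector<float> pixels; // width * height values
};

// One pass of color simplification: each interior pixel is replaced by
// a Gaussian average of its 3x3 neighborhood.
Frame SimplifyColorsOnce(const Frame& src) {
    // 3x3 Gaussian kernel; weights sum to 16.
    static const float kKernel[3][3] = {
        {1, 2, 1},
        {2, 4, 2},
        {1, 2, 1}};
    Frame dst = src;
    for (int y = 1; y < src.height - 1; ++y) {
        for (int x = 1; x < src.width - 1; ++x) {
            float sum = 0.0f;
            for (int ky = -1; ky <= 1; ++ky)
                for (int kx = -1; kx <= 1; ++kx)
                    sum += kKernel[ky + 1][kx + 1] *
                           src.pixels[(y + ky) * src.width + (x + kx)];
            dst.pixels[y * src.width + x] = sum / 16.0f;
        }
    }
    return dst;
}

// The walkthrough applies the filter three times per frame.
Frame SimplifyColors(Frame frame, int iterations = 3) {
    for (int i = 0; i < iterations; ++i)
        frame = SimplifyColorsOnce(frame);
    return frame;
}
```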
Edge detection serial code
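Again, the original code is not shown; this is a minimal sketch of the stage's contract (pixels on an edge are painted black), assuming the same hypothetical `Frame` type and using a Sobel gradient with an illustrative threshold as a stand-in for the original operator.

```cpp
#include <cassert>
#include <vector>

// Hypothetical single-channel frame, as in the color simplification sketch.
struct Frame {
    int width;
    int height;
    std::vector<float> pixels; // width * height values, row-major
};

// Edge detection: interior pixels whose local gradient magnitude exceeds a
// threshold are painted black (0). Sobel is an assumption, not the post's
// exact operator.
Frame DetectEdges(const Frame& src, float threshold = 50.0f) {
    Frame dst = src;
    for (int y = 1; y < src.height - 1; ++y) {
        for (int x = 1; x < src.width - 1; ++x) {
            auto p = [&](int dx, int dy) {
                return src.pixels[(y + dy) * src.width + (x + dx)];
            };
            // Sobel gradients in x and y.
            float gx = -p(-1, -1) - 2 * p(-1, 0) - p(-1, 1)
                       + p(1, -1) + 2 * p(1, 0) + p(1, 1);
            float gy = -p(-1, -1) - 2 * p(0, -1) - p(1, -1)
                       + p(-1, 1) + 2 * p(0, 1) + p(1, 1);
            if (gx * gx + gy * gy > threshold * threshold)
                dst.pixels[y * src.width + x] = 0.0f; // edge pixel -> black
        }
    }
    return dst;
}
```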
Pipelining and data flow network
The pipeline resembles the above diagram. It has three color simplification stages and one edge detection stage. This pipeline, however, uses only half of the available computation resources when run on an eight-core machine. To extend it to use all CPUs, the frame can be divided into chunks, with each chunk passed to a network similar to the one above.
The number of chunks matches the number of networks, which is the number of CPUs divided by four (the number of stages inside each network). This network would drive CPU utilization to 100% on any machine it runs on, but it introduces a thread-safety issue in edge detection. The problem is the dependency between the two filters applied to each frame: color simplification must finish before edge detection begins, which the network above does not guarantee. Chunks of a frame might be in the edge detection stage while other chunks of the same frame are still inside color simplification stages.

To solve this problem, a frame must wait for color simplification to complete before entering edge detection. This is done with a join message block that waits for all color simplification stages to finish before allowing a frame to pass to edge detection, as in the diagram below.

After edge detection is done, the video reader block is signaled to send the next frame into the network. This feedback loop prevents the video reader from overwhelming the network with messages while the relatively slower processing of frames takes place. Initially, however, the video reader sends several frames to ensure the network is always busy with more than one frame. In our case it sends four times the number of color simplification stages (12 frames) into the network up front, so the network always has 12 frames to process.
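The join-then-edge-detect ordering can be sketched in portable standard C++. This is not the Agents Library code from the post: `std::async` tasks stand in for the per-chunk color simplification networks, waiting on all futures stands in for the join message block, and the per-pixel transforms are trivial placeholders chosen only to make the ordering observable.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Each chunk runs its own three-stage color simplification concurrently;
// edge detection (single stage) runs only after every chunk has finished,
// mirroring the join message block described in the text.
std::vector<float> ProcessFrame(std::vector<float> frame, int num_chunks) {
    std::size_t chunk_size = frame.size() / num_chunks;
    std::vector<std::future<std::vector<float>>> pending;
    for (int i = 0; i < num_chunks; ++i) {
        std::vector<float> chunk(frame.begin() + i * chunk_size,
                                 frame.begin() + (i + 1) * chunk_size);
        pending.push_back(std::async(
            std::launch::async,
            [](std::vector<float> c) {
                for (int pass = 0; pass < 3; ++pass) // three simplification stages
                    for (float& v : c)
                        v *= 0.5f; // placeholder for Gaussian averaging
                return c;
            },
            std::move(chunk)));
    }
    // "Join": block until every chunk of this frame is simplified.
    std::vector<float> simplified;
    for (auto& f : pending) {
        std::vector<float> c = f.get();
        simplified.insert(simplified.end(), c.begin(), c.end());
    }
    // Edge detection stand-in runs over the complete, reassembled frame.
    for (float& v : simplified)
        if (v > 10.0f) v = 0.0f; // placeholder "paint edge pixels black"
    return simplified;
}
```

The design point is that the second stage never sees a partially simplified frame, which is exactly the thread-safety property the join block provides in the real network.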
Now this is ready for implementation. The code below shows the video agent that behaves as the connection point between the UI and the network and signals the video reader to read the initial frames.
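The original post's agent is built on the Asynchronous Agents Library (`concurrency::agent`); since that code is not shown here, the portable sketch below models the same feedback protocol with standard threading primitives. The names (`VideoAgent`, `kInitialFrames`) are illustrative, and the network side is reduced to a completion signal.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the video agent: it primes the network with
// kInitialFrames frames, then releases one new frame per completion signal,
// so at most kInitialFrames frames are ever in flight.
class VideoAgent {
public:
    explicit VideoAgent(int total_frames) : total_frames_(total_frames) {}

    // Called by the network each time a frame finishes edge detection.
    void SignalFrameDone() {
        std::lock_guard<std::mutex> lock(mu_);
        ++done_;
        cv_.notify_one();
    }

    // Feedback loop: returns the frame indices sent, in order.
    std::vector<int> Run() {
        static const int kInitialFrames = 12; // 4x the three simplification stages
        std::vector<int> sent;
        int next = 0;
        while (next < total_frames_ && next < kInitialFrames)
            sent.push_back(next++); // prime the network
        while (next < total_frames_) {
            std::unique_lock<std::mutex> lock(mu_);
            cv_.wait(lock, [&] { return done_ > released_; });
            ++released_;
            lock.unlock();
            sent.push_back(next++); // one new frame per finished frame
        }
        return sent;
    }

private:
    int total_frames_;
    int done_ = 0;
    int released_ = 0;
    std::mutex mu_;
    std::condition_variable cv_;
};
```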
Using the data-flow network showed linear speedup up to 48 cores on a video stream with a frame size of 640×360.
Mohamed Magdy Mohamed, Parallel Computing Platform Team