Data Partitioning
One commonly used partitioning technique is called data partitioning, which relies on the ability to process data blocks in parallel. This technique is most commonly and easily applied at a coarse granularity, with each data block being a channel, a frame, or a slice, where a slice refers to a large area of a frame that is processed independently from the rest of the frame. Each data block is processed in parallel on a different processor. A master processor is generally responsible for ensuring synchronization among the processors and for combining the results as needed.
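A minimal sketch of slice-level data partitioning, written in Python for illustration. Here `encode_slice` is a hypothetical stand-in for a real per-slice encoder, and a thread pool stands in for the set of processors (a real codec would map slices onto separate cores); the frame is assumed to divide evenly into horizontal slices.

```python
from concurrent.futures import ThreadPoolExecutor

def encode_slice(slice_rows):
    # Placeholder for a real per-slice encoder: "compress" each row
    # by summing it, just to keep the sketch runnable.
    return [sum(row) for row in slice_rows]

def encode_frame(frame, num_slices):
    """Split a frame (a list of rows) into horizontal slices and
    encode each slice in parallel, one worker per slice."""
    rows_per_slice = len(frame) // num_slices  # assumes even division
    slices = [frame[i * rows_per_slice:(i + 1) * rows_per_slice]
              for i in range(num_slices)]
    with ThreadPoolExecutor(max_workers=num_slices) as pool:
        encoded = list(pool.map(encode_slice, slices))
    # The "master" recombines the independently encoded slices in order.
    return [row for enc in encoded for row in enc]
```

Because each slice is encoded without reference to the others, no communication is needed between workers until the final recombination step, which is exactly the property that makes this scheme easy to bolt onto single-core reference code.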
Applying this partitioning approach at a coarse granularity has the advantage of requiring only a minimal amount of inter-processor communication, since each block can be processed independently from the others. This approach is also easy to implement: only a few modifications to the existing single-core reference code are required for it to run in parallel on all processors and produce functional output. However, this technique also has inherent problems. First, it is difficult to ensure proper load balancing among processors because video codec algorithms have data-dependent processing requirements. For example, one slice of a frame (or one video sequence) may contain scenes with little movement and detail (e.g., a uniform background such as a wall or the sky), resulting in much lower processing requirements than another slice containing high-motion scenes with finer detail (e.g., the face of a person talking).
Because of these discrepancies, some processors remain idle for long periods while others are busy processing the most computationally intensive scenes: this imbalance translates into wasted processing resources and suboptimal performance. Second, simple data partitioning results in non-scalable implementations, since there is little flexibility in the number of blocks into which the data can be divided. For example, a 16-channel encoder may fit nicely on a 16-processor architecture by assigning one processor to each channel, but the code will need significant rework if another application must run in parallel on that architecture and claims one or more processors for extended periods. Dividing frames into slices offers slightly more flexibility, as the size and number of slices can generally be adjusted without extensive code changes. Unfortunately, dividing a frame into too many slices degrades the efficiency of the compression algorithms.
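The cost of this imbalance can be made concrete with a small back-of-the-envelope calculation (the helper below is purely illustrative, not part of any codec): with one processor statically assigned per block, the frame is finished only when the slowest block finishes, so utilization is the total useful work divided by that makespan times the processor count.

```python
def utilization(per_block_costs):
    """Fraction of processor time actually spent working when each
    data block is pinned to its own processor and all processors
    must wait for the slowest block to finish."""
    makespan = max(per_block_costs)   # the slowest block sets the pace
    busy = sum(per_block_costs)       # total useful work performed
    return busy / (makespan * len(per_block_costs))

# One high-motion slice costing 10 time units alongside three quiet
# slices costing 2 units each leaves four processors only 40% utilized.
print(utilization([10, 2, 2, 2]))  # 0.4
```

With perfectly balanced costs the same formula gives 1.0, which is why the more advanced partitioning strategies discussed later aim to keep per-processor workloads as even as possible.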
Data Pipelining
Another partitioning technique is called data pipelining, or functional partitioning. This technique consists of assigning each processor a different processing block, such as motion estimation or texture encoding in the case of video encoders. With this approach, the data is processed in a pipelined fashion: one processor applies the first processing block to the data and passes its output to another processor, which applies the second processing block to the modified data, and so on. To achieve better load balancing, the most demanding processing blocks can be subdivided and assigned to multiple processors while simpler ones are handled by a single processor.
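The pipeline structure can be sketched as follows, again in Python, with threads standing in for processors and queues standing in for the inter-processor links; the stage functions used in the example are trivial placeholders, not real motion-estimation or texture-encoding code.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream

def _stage(fn, inbox, outbox):
    """Run one processing block: pull data from the upstream
    processor, transform it, and push it downstream."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown downstream
            break
        outbox.put(fn(item))

def run_pipeline(frames, stages):
    """Chain the processing blocks with queues, one thread per block,
    and feed the frames through the pipeline in order."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [threading.Thread(target=_stage,
                                args=(fn, queues[i], queues[i + 1]))
               for i, fn in enumerate(stages)]
    for t in threads:
        t.start()
    for frame in frames:
        queues[0].put(frame)
    queues[0].put(SENTINEL)
    results = []
    while True:
        item = queues[-1].get()
        if item is SENTINEL:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results
```

For example, `run_pipeline([1, 2, 3], [lambda x: x * 2, lambda x: x + 1])` pushes each frame through both stages in order. Note how the queues make the inter-processor communication explicit: every item crosses a queue between every pair of stages, which is precisely the traffic that can strain memory resources or starve a stage on real hardware.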
Unfortunately, this second partitioning technique also introduces multiple challenges. First, the assignment of each processing block to a processor is generally done at compile time: this is the simplest approach, and sometimes the only one possible given the limitations of the architecture or the lack of a multi-core operating system running on it. Assigning roles to each processor at compile time does not allow for proper load balancing. For example, the processing requirements of motion estimation vary greatly with the video stream being processed: still and low-motion sequences place a much lighter load on the motion estimation block than high-motion ones.
As a result, processors assigned to the motion estimation algorithm are likely to sit underused during low-motion scenes and to become the bottleneck during high-motion scenes. Second, splitting an algorithm into multiple processing blocks run on different processors introduces inter-processor communication that may hurt performance by straining memory resources or by causing processors to stall from data starvation. Finally, a simple data-pipelining approach suffers from the same limitations noted earlier for data partitioning: algorithms cannot be divided into an arbitrary number of blocks to perfectly match the number of processors available on a multi-core architecture, and assigning processing blocks to individual processors at compile time results in non-scalable implementations.
Next: More Efficient Partitioning Strategies and the CT3600 MDSP family