Exploiting thread-level parallelism is an attractive approach to improving the performance of multimedia applications that are running on multithreading general-purpose processors. Given dual-core and multicore processors, the earlier you start to design for multithreading, the better. This article focuses on the design for increased multithreading performance in a larger project. This case study describes the design and implementation of the multithreaded H.264 encoder. Parallelized using the OpenMP programming model, the H.264 encoder is an example of how you can leverage the advanced compiler technologies in the Intel C++ Compiler for Intel architectures with Hyper-Threading Technology (HT Technology). Taken together, these two efficient methods for multi-level data partitioning can improve the performance of a large and complex application like the multithreaded H.264 encoder.
Furthermore, this study shows how you can exploit different options in the OpenMP programming. One implementation that uses the taskqueuing model is slightly slower than optimal performance, but the application program easier to read. The other method goes for speed. The results have shown speed increases ranging from 3.97x to 4.69x over the well-optimized sequential code performance on a system of four Intel Xeon processors with HT Technology.
Initial Performance of the H.264 Encoder
H.264 (ISO/IEC 2002) is an emerging standard for video coding, which has been proposed by the Joint Video Team (JVT). The new standard is aimed at high-quality coding of video contents at very low bit-rates. H.264 uses the same model for hybrid block-based motion compensation and transform coding that is used by existing standards, such as those for H.263 and MPEG-4 (ISO/IEC 1998). Moreover, a number of new features and capabilities in H.264 improve the performance of the code. As the standard becomes more complex, the encoding process requires much greater computation power than most existing standards. Hence, you need a number of mechanisms to improve the speed of the encoder.
One way to improve the application’s speed is to process tasks in parallel. Zhou and Chen demonstrated that using MMX/SSE/SSE2 technology increased the H.264 decoder’s performance by a factor ranging from two to four (Zhou 2003). Intel has applied the same technique to the H.264 reference encoder, achieving the results in Table 1 using only SIMD optimization.
Although the encoder is two-to-three times faster with SIMD optimization, the speed is still not fast enough to meet the expectations of real-time video processing. Furthermore, the optimized sequential code cannot take advantage of Hyper-Threading Technology and multiprocessor load-sharing, two key performance boosters that are supported by the Intel architecture. In other words, you still can improve the performance of the H.264 encoder a lot by exploiting thread-level parallelism.
Parallelization of the H.264 Encoder
By exploiting thread-level parallelism at different levels, you can take advantage of potential opportunities to increase performance. To achieve the greatest speed increases over well-tuned sequential code on a processor with HT Technology, you should consider the following characteristics as you re-design the H.264 encoder for parallel programming:
- The criteria of choosing data or task partitions
- The judgments of thread granularity
- How the first implementation uses two slice queues
- How the second implementation uses one task queue
Task and Data Domain Decomposition
You can divide the H.264 encoding process into multiple threads using either functional decomposition or data-domain decomposition:
- Functional decomposition. Each frame should experience a number of functional steps: motion estimation, motion compensation, integral transformation, quantization and entropy coding. The reference frames also need inverse qualification, inverse integral transformation, and filtering. These functions could be explored for opportunities to make these tasks parallel.
- Data domain decomposition. As shown Figure 1, the H.264 encoder treats a video sequence as many groups of pictures (GOP). Each GOP includes a number of frames. Each frame is divided into slices. Each slice is an encoding unit and is independent of other slices in the same frame. The slice can be further decomposed into a macroblock, which is the unit of motion estimation and entropy coding. Finally, the macroblock can be separated into block and sub-block units. All are possible places to parallelize the encoder.
To choose the optimal task or data partition scheme, compare the advantages and disadvantages of two schemes below:
- Scalability. In the data-domain decomposition, to increase the number of threads, you can decrease the size of the processing unit of each thread. Because of the hierarchical structure in GOPs, frames, slice, macroblocks, and blocks, you have many choices for the size of processing unit, thereby achieving good scalability. In functional decomposition, each thread has different function. To increase the number of threads, partition a function into two or more threads, unless the function is unbreakable.
- Load balance. In the data domain decomposition, each thread performs the same operation on different data block that has the same dimension. In theory, without cache misses or other nondeterministic factors, all threads should have the same processing time. On the other hand, it is difficult to achieve good load balance among functions because the chosen algorithm determines the execution time of each function. Furthermore, any attempt to functionally decompose the video encoder to achieve a good load balance depends on algorithms, too. As the standard keeps improving, the algorithms are sure to change over time to exploiting thread-level parallelism at multiple levels to achieve a good load balance.
Considering these factors, you could use the data-domain decomposition as your multithreading scheme.
Slice-Level Parallelism. When you have decided on the functional decomposition or data domain decomposition scheme, the next step is to decide the granularity for each thread. One possible scheme of data domain decomposition is to divide a frame into small slices.
Parallelizing the slices has both advantages and disadvantages. The advantage lies in the independence of slices in a frame. Since they are independent, you can simultaneously encode all slices in any order. On the other hand, the disadvantage is the resulting increase in the bit rate. Figure 2 shows the video encoder rate-distortion when you divide a frame into varying numbers of slices. When a frame is divided into nine slices but quality is held at the same level, the bit-rate increases about 15 to 20 percent because slices break the dependence between macroblocks. The compression efficiency decreases when a macroblock in one slice cannot exploit a macroblock in another slice for compression. To avoid increasing the bit-rate at the same video quality, you should exploit other areas of parallelism in the video encoder.
Frame-Level Parallelism. Another possible scheme for exploiting parallelism is to identify independent frames. Normally, you would encode a sequence of frames using an IBBPBBP… structure. (Frame notation definitions: I frame in video codecs stands for intra frames, which can be encoded or decoded independently. Normally, there is an I frame per 15~60 frames. P frame stands for predicted frames, each of which is predicted from a previously encoded I frame or P frame. Because a P frame is predicted from the previously encoded I/P frame, the dependency makes it harder to encode two P frames simultaneously. B frame stands for bi-directional predicted frames, which are predicted from a two previously encoded I/P frames. No frame depends on B frames.)
In each sequence, you have two B frames between each of the P frames. P frames are reference frames, on which other P or B frames depend, but B frames are not. The dependence among the frames is shown in Figure 3. In this PBB encoding structure, the completion of encoding a particular P frame makes the subsequent P frame and its two B frames ready for encoding. The more frames encoded simultaneously, the more parallelism you can exploit. Therefore, P frames are the critical point in the encoder. Accelerating P-frame encoding makes more frames ready for encoding and avoids the possibility of idle threads. In your implementation, encode I or P frames first, then encode the B frames.
Unlike dividing a frame into slices, utilizing parallelism among frames does not increase the bit rate. However, the dependence among them does limit the threads’ scalability. The trade-off is to combine the two approaches, data-domain decomposition and functional decomposition, into one implementation. First explore the parallelism among frames to gain performance from it without bit-rate increase. At some point, you are going to reach the upper limit of the number of threads to which you can apply frame-level parallelism. When you do, explore the parallelism among slices. As a final result, the application utilizes processor resources as much as possible, keeping the compression ratio as high as possible and the bit-rate as low as possible.