In summary, having the number of threads equal to the number of logical processors strikes the best balance between speed-up and parallelism. But what happens to the performance when the number of threads is greater or less than the number of logical processors? Figure 11 shows how the speed-up changes with the number of threads for an implementation using two slice queues. The speed-up increases with the number of threads, reaching peak performance when the number of threads equals the number of logical processors.
An interesting observation is that the speed-up is essentially flat, or drops only slightly, when the number of threads exceeds the number of logical processors. Thus, the threading overhead is minor. In other words, the multithreaded code generated by the compiler exploits the available parallelism efficiently, and the overhead of the multithreaded run-time library is small. Furthermore, because the performance is not sensitive to the number of threads, the multithreaded H.264 encoder should scale well to medium-scale multiprocessor systems, such as the one shown in Figure 12.
Further Performance Tuning
This case study of the first parallel implementation of the H.264 encoder on a multithreading architecture has illustrated the tradeoffs between video quality and parallelization. In other studies, researchers took the most straightforward approach and encoded the video sequences either by pictures or by slices. Our approach is slightly more involved in that it exploits both slice-level and frame-level parallelism.
Even when the expected performance gain is achieved, one can always find further work to do. In this case, you could analyze the performance impact of different image resolutions. While the resolution of the source image can scale from QCIF and CIF through SD to HDTV, most of our current analysis focused on the CIF resolution. Figure 5 shows that the speed-up for the SD (720x480) format is slightly lower than for the CIF (352x288) format. The speed-up is determined by factors such as synchronization cost and the degree of parallelism, yet Figure 13 shows that encoding SD video incurs fewer synchronizations per second than encoding CIF video, and SD also has a higher degree of parallelism. More work is needed to understand why the speed-up for higher-resolution video is nevertheless lower than for lower-resolution video.
As codec standards grow more complex, encoding and decoding require ever more computational power. The H.264 standard includes a number of new features and demands far more computation than most existing standards, such as MPEG-2 and MPEG-4. Even after media-instruction optimization, the H.264 encoder at CIF resolution is still not fast enough to meet the expectations of real-time video processing. Thus, exploiting thread-level parallelism to improve the performance of H.264 encoders is becoming more attractive.
The case study presented here shows that multithreading based on the OpenMP programming model is a simple yet effective way to exploit parallelism: it requires only a few additional pragmas in the serial code. Developers can rely on the compiler to convert the serial code to multithreaded code automatically once the OpenMP pragmas are added. The performance results show that the code generated by the Intel compiler delivers a near-optimal speed-up over the well-optimized sequential code on the architecture with Hyper-Threading Technology, often boosting performance by 20 percent on top of the native parallel speed-up (approximately 4x without HT in this case) at very little additional cost.
In summary, when parallelizing an application, remember the following key points:
- Understand the application so you can choose the best task and data decomposition schemes for optimal scalability and load balancing.
- Carefully choose the granularity of the parallelism, such as frame-level or slice-level parallelism, to exploit the right amount of parallelism with minimal synchronization overhead.
- Use tools such as the Intel VTune Performance Analyzer and the Intel Thread Profiler to measure performance at multiple levels, from micro-architecture metrics to the breakdown of thread busy and wait times, so that you can understand your performance gains or losses and identify the remaining headroom for further tuning.
This article is based on material found in the book The Software Optimization Cookbook, Second Edition, by Richard Gerber, Aart J.C. Bik, Kevin B. Smith, and Xinmin Tian (http://www.intel.com/intelpress/sum_swcb2.htm).