Performance measurements for the multithreaded encoder are the result of experiments conducted on the following systems:
- A Dell Precision 530 system, built with dual Intel Xeon processors (four logical processors) running at 2.0 GHz with HT Technology, a 512 KB L2 Cache, and 1 GB of memory.
- An IBM eServer xSeries 360 system, built with quad Intel Xeon processors (eight logical processors) running at 1.5 GHz with HT Technology, a 256 KB L2 Cache, a 512 KB L3 Cache, and 2 GB of memory.
Unless specified otherwise, the resolution of the input video is 352x288 pixels, or 22x18 macroblocks. To provide enough work for eight threads, the program uses the slice as the basic encoding unit for each thread.
Tradeoff Between Increased Speed and Effective Compression
A frame can be partitioned into a maximum of 18 slices. Taking a slice as the base encoding unit for a thread can reduce the synchronization overhead because no data dependency among slices occurs within a single frame during the encoding process. As mentioned earlier, partitioning the frame into multiple slices can increase the degree of parallelism, but it also increases the bit rate. One of the challenges is to increase execution speed while keeping the bit rate low, without sacrificing any image quality. Therefore, you should choose the slicing threshold carefully.
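The frame geometry and the slicing described above can be sketched in a few lines. This is a minimal illustration, not the encoder's actual code: the function names are hypothetical, and the sketch assumes that each slice is a contiguous run of whole macroblock rows, which is consistent with the 18-slice maximum for an 18-row frame.

```python
MB_SIZE = 16  # an H.264 macroblock covers 16x16 luma pixels


def macroblock_grid(width_px, height_px):
    """Return (columns, rows) of macroblocks for a frame.

    This sketch only handles dimensions that are multiples of 16.
    """
    if width_px % MB_SIZE or height_px % MB_SIZE:
        raise ValueError("frame dimensions must be multiples of 16 in this sketch")
    return width_px // MB_SIZE, height_px // MB_SIZE


def partition_rows(mb_rows, num_slices):
    """Split macroblock rows into contiguous (start, end) ranges, end exclusive.

    Assumption: a slice is a whole number of macroblock rows, so a frame
    with mb_rows rows can hold at most mb_rows slices.
    """
    if not 1 <= num_slices <= mb_rows:
        raise ValueError("num_slices must be between 1 and the number of rows")
    base, extra = divmod(mb_rows, num_slices)
    ranges = []
    start = 0
    for i in range(num_slices):
        size = base + (1 if i < extra else 0)  # spread leftover rows over the first slices
        ranges.append((start, start + size))
        start += size
    return ranges


cols, rows = macroblock_grid(352, 288)
print(cols, rows)                 # macroblock columns and rows
print(partition_rows(rows, 9))    # nine slices of two rows each
```

Fewer slices mean larger row ranges per thread and less bit rate overhead, while more slices expose more parallel work, which is the tradeoff discussed in this section.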
Figure 6 and Figure 7 show the encoding speedup and the associated bit rate for different numbers of slices per frame. In Figure 6, the number of slices ranges from 1 to 18, while the quality level of the encoded frames is held constant. On the Dell 530 platform, speed increases as the number of slices per frame goes from 1 to 2, and the speedup is almost flat from 2 to 18 slices. Meanwhile, the bit rate increase is small when a frame has fewer than 3 slices, but it grows as the number of slices rises from 3 to 18. One important observation is that partitioning a frame into 2 or 3 slices is the best tradeoff on this platform, one that achieves a higher speedup and a lower bit rate.
Figure 7 shows that we need more than three slices to keep eight logical processors busy on the IBM x360 platform. Essentially, we need nine threads to achieve an optimal performance level for four physical processors with HT Technology enabled. A simple approach that achieves higher performance is to keep the number of slices roughly the same as the number of logical processors. That way, you can maintain good image quality with an optimal tradeoff while generating enough slices to keep the encoding threads busy.
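The rule of thumb above, roughly one slice per logical processor, can be sketched with a worker pool. This is only a structural illustration: `encode_slice` is a hypothetical stand-in for real slice encoding, the cap of 18 slices assumes the 352x288 input used in these experiments, and Python threads do not run CPU-bound work in parallel the way the encoder's native threads do.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def encode_slice(slice_id):
    """Hypothetical placeholder for encoding one slice."""
    return f"slice-{slice_id}-encoded"


def encode_frame(num_slices, num_threads):
    """Encode every slice of one frame on a pool of worker threads.

    Slices of the same frame have no data dependencies on each other,
    so they can be handed to the pool in any order. Results come back
    in slice order.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(encode_slice, range(num_slices)))


logical_cpus = os.cpu_count() or 1
# Roughly one slice per logical processor, capped at the 18-slice maximum.
num_slices = min(18, logical_cpus)
results = encode_frame(num_slices, logical_cpus)
print(len(results), "slices encoded")
```

Choosing the slice count from the processor count, rather than always using the maximum, keeps the bit rate overhead of slicing as low as the available parallelism allows.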
Performance on Multiprocessor with HT Technology
Table 2 shows the speed increase for the threaded encoder on the IBM x360 quad-processor system with HT Technology. In this implementation, a picture frame was partitioned into nine slices. In general, across five different input video sequences, the multithreaded H.264 encoder increased its execution speed in the following ranges: 1.9x to 2.01x on a two-processor system, 3.61x to 3.99x on a four-processor system, and 3.97x to 4.69x on a four-processor system with HT Technology enabled.
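One way to read these ranges is as parallel efficiency, that is, speedup divided by the number of physical processors. The short sketch below applies that standard definition to the ranges quoted above; the function name and the dictionary layout are illustrative choices, and no per-sequence data beyond the quoted ranges is assumed.

```python
def parallel_efficiency(speedup, physical_processors):
    """Speedup per physical processor; 1.0 means perfectly linear scaling."""
    return speedup / physical_processors


# (physical processors, lowest speedup, highest speedup) as quoted in the text
reported = {
    "two processors": (2, 1.90, 2.01),
    "four processors": (4, 3.61, 3.99),
    "four processors + HT": (4, 3.97, 4.69),
}

for label, (procs, low, high) in reported.items():
    low_eff = parallel_efficiency(low, procs)
    high_eff = parallel_efficiency(high, procs)
    print(f"{label}: efficiency {low_eff:.2f} to {high_eff:.2f}")
```

In the HT row, the efficiency is computed against four physical processors, so values near or above 1.0 reflect the contribution of the additional logical processors.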
Table 3 shows some performance differences between the first implementation, which uses two slice queues, and the second implementation, which uses only one task queue. The performance gap is larger when the system contains more processors. Because the first implementation uses two queues to accelerate the encoding of I or P frames, it can make more slices ready for encoding, especially when a large number of processors is available to do the work. On the other hand, the taskqueuing model in OpenMP maintains only one queue, so all slices are treated equally. Therefore, the execution threads spend more time in an idle state when the system has a lot of processors.
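The difference between the two queueing policies can be sketched as follows. The text does not spell out what each of the two queues holds, so this sketch assumes one queue for slices of I or P frames and one for slices of B frames, with the first always served first. The class and function names are hypothetical, and the code is single-threaded for clarity.

```python
from collections import deque


class TwoQueueScheduler:
    """Assumed policy: slices of I/P frames are always dequeued before B slices."""

    def __init__(self):
        self.ip_queue = deque()
        self.b_queue = deque()

    def put(self, frame_type, slice_id):
        target = self.ip_queue if frame_type in ("I", "P") else self.b_queue
        target.append((frame_type, slice_id))

    def get(self):
        if self.ip_queue:
            return self.ip_queue.popleft()
        if self.b_queue:
            return self.b_queue.popleft()
        return None  # nothing ready


class SingleQueueScheduler:
    """One FIFO queue: every slice is treated equally."""

    def __init__(self):
        self.queue = deque()

    def put(self, frame_type, slice_id):
        self.queue.append((frame_type, slice_id))

    def get(self):
        return self.queue.popleft() if self.queue else None


def drain(scheduler):
    """Dequeue until the scheduler is empty and return the service order."""
    order = []
    while (item := scheduler.get()) is not None:
        order.append(item)
    return order


arrivals = [("B", 0), ("B", 1), ("P", 0), ("I", 0)]
two, one = TwoQueueScheduler(), SingleQueueScheduler()
for frame_type, slice_id in arrivals:
    two.put(frame_type, slice_id)
    one.put(frame_type, slice_id)

print(drain(two))  # I/P slices come out ahead of the B slices
print(drain(one))  # plain arrival order
```

Serving I and P slices first finishes the frames that other frames depend on sooner, which is one plausible reason the two-queue version has more slices ready when many processors are waiting for work.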
With HT Technology enabled, the program achieved a 1.2x speed increase. The explanation for this improvement lies in the microarchitecture metrics in the next section.
Understanding the Performance
Table 4 shows the distribution of the number of instructions retired per cycle on a Dell Precision 530 dual-processor system with the second processor disabled. Although no instruction is retired for almost half of the execution time, the probability of retiring more instructions is higher with HT Technology. This statistic indicates that higher processor utilization is achieved with HT Technology.
Table 5 and Table 6 show mixed results. Without HT Technology, the trace cache spends about 80 percent of the time in deliver mode, which is good for performance, and about 18 percent of the time in build mode, which is bad for performance. However, when HT Technology is enabled, the deliver mode percentage drops to 70 percent while the build mode percentage increases to 25 percent. This shift indicates that the front end of the system with HT Technology cannot provide enough micro-ops to the execution unit. The miss rate for first-level cache loads worsens in a similar way: you see a 50-percent increase in the number of first-level cache misses when HT Technology is enabled. The miss rate rises from 6 percent to 9 percent because the two logical processors in one physical package share a first-level cache of only 8 kilobytes. In short, performance gains for HT Technology are limited by the trace cache and the L1 cache for our multithreaded H.264 encoder.
The front-side-bus utilization rate is the only microarchitecture metric that is noticeably affected in the multiprocessor configuration. The number of bus activities does not increase significantly as the number of threads increases. The execution time, however, is reduced due to the better use of processor resources that you get by exploiting enough thread-level parallelism. Because about the same amount of bus activity occurs in a shorter time, the front-side-bus utilization rate increases.
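The reasoning above is a simple ratio: if utilization is taken as busy bus time divided by total execution time, then holding the bus activity roughly constant while shortening the run raises the ratio. The sketch below illustrates this with made-up cycle counts; the numbers and the function name are hypothetical and are not measurements from these experiments.

```python
def bus_utilization(busy_cycles, total_cycles):
    """Fraction of the run during which the front-side bus is busy."""
    return busy_cycles / total_cycles


# Hypothetical cycle counts, chosen only to illustrate the ratio.
busy_cycles = 2_000_000            # bus activity stays about the same
single_thread_run = 10_000_000     # total cycles with one thread
multi_thread_run = 2_500_000       # shorter run with more threads

before = bus_utilization(busy_cycles, single_thread_run)
after = bus_utilization(busy_cycles, multi_thread_run)
print(f"utilization: {before:.0%} -> {after:.0%}")
```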
Table 3 also shows that, when the number of slices is small, the execution time is even longer on a quad-processor system with HT Technology (QP+HT) than on a quad-processor system (QP). The thread profiles explain this increase. Figure 8 shows the profile when a frame contains only one slice. The encoder thread waits for about 61.8 percent of the execution time due to insufficient parallelism.
Figure 9 shows the profile when a frame contains 18 slices. The eight encoder threads are all busy except during the set-up time, and they wait for only 1.4 percent of the execution time. In this case, all processor resources are used fully.
Therefore, when you analyze the tradeoff, you should carefully choose the number of slices in a frame. The criterion is to keep the number of slices low while providing enough slices to keep all encoder threads busy. If the number of slices is smaller than the number of threads, the execution speed decreases.
Figure 10 shows the execution time profile of the second implementation, which uses one task queue. As mentioned earlier, all slices are treated equally because the taskqueuing model in OpenMP maintains only one queue. Therefore, the system can have too few ready-to-encode slices, as you can see from the amount of idle time in the execution threads. Compared to Figure 9, Figure 10 shows that the processors are utilized less efficiently.