Implementation Using Two Slice Queues
The H.264 encoder is divided into three parts: input pre-processing, encoding, and output post-processing. Input pre-processing reads uncompressed images, performs some preliminary processing, and then issues the images to the encoding threads. The pre-processed images are placed in a buffer called the "image buffer". Output post-processing checks the encoding status of each frame and commits the encoded results to the output bit-stream in sequence. After a frame is committed, its entry in the image buffer is freed and reused to prepare the next image for encoding. Although the input and output stages of the encoder must remain sequential (frames are read and committed in stream order), their computational cost is insignificant compared with that of the encode stage. Therefore, you can use a single thread to handle both the input and output processes. This thread becomes the master thread, in charge of checking all the data dependencies.
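To make the buffer handling above concrete, the image buffer can be thought of as a small ring of entries whose state the master thread polls. The following is a minimal sketch of such a layout in C; every name here (frame_state, image_entry, IMAGE_BUFFER_SIZE) is an illustrative assumption, not taken from the actual encoder source.

```c
/* Sketch of an image-buffer entry as described above (names are hypothetical). */
typedef enum {
    ENTRY_FREE,        /* slot can accept a new pre-processed frame         */
    ENTRY_ENCODING,    /* slices dispatched, encoding still in progress     */
    ENTRY_ENCODED      /* all slices done, waiting to be committed in order */
} frame_state;

typedef struct {
    frame_state    state;
    int            frame_num;       /* position in output (stream) order           */
    int            slices_pending;  /* master thread commits the frame when this    */
                                    /* reaches zero and all earlier frames are out  */
    unsigned char *raw;             /* pre-processed input picture                  */
    unsigned char *bitstream;       /* encoded output for this frame                */
} image_entry;

#define IMAGE_BUFFER_SIZE 8         /* assumed ring depth */
static image_entry image_buffer[IMAGE_BUFFER_SIZE];
```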
You would use another buffer, called the "slice buffer", to exploit the parallelism among slices. After each image is pre-processed, its slices go into the slice buffer. Slices placed in the slice buffer are independent and ready for encoding, because the readiness of their reference frames is checked during the input process; as a result, these slices can be encoded out of order. To reflect the priority difference between the slices of B frames and the slices of I or P frames, use two separate slice queues to handle them. The pseudocode in Example 1 implements this two-queue scheme.
```
// Pseudo-code of Threaded H.264 Encoder using OpenMP
omp_set_nested( 1 );   // enable nested parallelism; # of encoding threads + 1 threads are used in total
#pragma omp parallel sections
{
    #pragma omp section
    {
        while ( there is frame to encode )
        {
            if ( there is free entry in image buffer )
                issue new frame to image buffer
            else if ( there is encoded frame in image buffer )
                commit the encoded frame, release the entry
            else            // dependencies are handled here
                wait;
        }
    }
    #pragma omp section
    {
        #pragma omp parallel num_threads( # of encoding threads )
        {
            while ( 1 )
            {
                if ( there is slice in slice queue 0 )        // higher priority: I/P-frame slices
                    encode one slice
                else if ( there is slice in slice queue 1 )   // lower priority: B-frame slices
                    encode one slice
                else if ( all frames are encoded )
                    exit;
                else        // wait for the master thread to put more slices
                    wait
            }
        }
    }
}
```
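Example 1 leaves implicit how an encoding thread actually picks its next slice from the two queues. The sketch below shows one possible way to express that priority-ordered dequeue in C with an OpenMP lock; the types and names (slice_t, slice_queue, next_slice, and so on) are illustrative assumptions, not the encoder's actual code.

```c
#include <omp.h>
#include <stddef.h>

/* Hypothetical slice-queue types; layout is illustrative only. */
typedef struct slice {
    struct slice *next;
    /* frame pointer, macroblock range, etc. */
} slice_t;

typedef struct { slice_t *head, *tail; } slice_queue;

static slice_queue queue_ip;   /* queue 0: slices of I/P frames (higher priority) */
static slice_queue queue_b;    /* queue 1: slices of B frames   (lower priority)  */
static omp_lock_t  queue_lock; /* initialised once with omp_init_lock()           */

static slice_t *pop(slice_queue *q)
{
    slice_t *s = q->head;
    if (s) {
        q->head = s->next;
        if (q->head == NULL)
            q->tail = NULL;
    }
    return s;
}

/* Called by an encoding thread to get work: I/P slices first, then B slices. */
static slice_t *next_slice(void)
{
    slice_t *s;
    omp_set_lock(&queue_lock);
    s = pop(&queue_ip);
    if (s == NULL)
        s = pop(&queue_b);
    omp_unset_lock(&queue_lock);
    return s;   /* NULL means: wait for the master thread to enqueue more slices */
}
```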
Figure 4 shows how the video stream is processed by this multithreaded implementation of the H.264 encoder. In the code segment, one thread handles the input and output in order, while the other threads encode slices out of order.

Implementation Using Task Queuing Model
The implementation in Example 1 uses OpenMP parallel sections, which makes the structure of the parallel code very different from that of the sequential code. A second proposed implementation uses the taskqueuing model supported by the Intel C++ Compiler.
Essentially, for any given program with taskqueuing constructs, the run-time library creates a team of threads when the main thread encounters a parallel region. Figure 5 shows the taskqueuing execution model. Of all the threads that encounter a taskq pragma, the run-time thread scheduler chooses one thread (TK) to execute it initially; all the other threads wait for work to be put on the work queue. Conceptually, the taskq pragma triggers this sequence of actions:
- Causes an empty queue to be created by the chosen thread TK
- Enqueues each task that it encounters
- Executes the code inside the taskq block as a single thread
The task pragma specifies a unit of work, potentially to be executed by a different thread. When a task pragma is encountered lexically within a taskq block, the code inside the task block is placed on the queue associated with the taskq pragma. The conceptual queue is disbanded when all work enqueued on it finishes and the end of the taskq block is reached.
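The taskq and task pragmas described here are specific to the Intel compiler's workqueuing extension. Since OpenMP 3.0, essentially the same pattern can be written with the standard task construct: one thread generates the tasks while the whole team executes them. The minimal, self-contained sketch below (not taken from the article) illustrates that behavior.

```c
#include <omp.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel
    {
        #pragma omp single   /* one thread plays the role of the taskq block */
        {
            for (int i = 0; i < 16; i++) {
                #pragma omp task firstprivate(i)   /* one unit of work per task */
                printf("task %d executed by thread %d\n", i, omp_get_thread_num());
            }
        }   /* implicit barrier: all queued tasks complete before the team leaves */
    }
    return 0;
}
```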

The first proposed multithreaded H.264 scheme uses two FIFO buffers: an image buffer and a slice buffer. The main thread is in charge of three activities:
- Moving raw images into the image buffer when the image buffer has space
- Moving slices of an image from the image buffer into the slice buffer when the slice buffer has space and the image has not yet been dispatched
- Moving encoded images out of the image buffer when encoding is complete
The working threads are in charge of encoding a new slice whenever one is waiting in the slice buffer. All these operations are synchronized through the buffers. Hence, you would find it natural to use the taskqueuing model supported by the Intel compiler.
The code segment in Example 2 shows the pseudo-code of the multithreaded H.264 encoder using the taskqueuing model. This multithreaded source code is closer to the way you would write single-threaded code; the only difference is the pragmas, which is a key characteristic of OpenMP. Furthermore, in this scheme there is no longer a dedicated control thread, only working threads.
```
// Pseudo-code of Threaded H.264 Encoder using Taskqueuing
#pragma intel omp parallel taskq
{
    while ( there is frame to encode )
    {
        if ( there is no free entry in image buffer )
        {
            (1) commit the encoded frame;
            (2) release the entry;
        }
        (3) load the original picture to memory;
        (4) prepare for encoding;
        for ( all slices in this frame )
        {
            #pragma intel omp task
            {
                encode one slice;
            }
        }
    }
}
```
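For comparison with current compilers, the same loop can be sketched with the standard OpenMP task construct. In the sketch below, every helper (frames_remaining, image_buffer_full, commit_oldest_frame, load_and_prepare_frame, encode_slice) and the frame_t type are hypothetical placeholders for the encoder's real routines, so treat this as an outline rather than working encoder code.

```c
#include <omp.h>
#include <stdbool.h>

/* Hypothetical helpers standing in for the encoder's real routines. */
typedef struct { int num_slices; /* ... */ } frame_t;
bool     frames_remaining(void);
bool     image_buffer_full(void);
void     commit_oldest_frame(void);       /* blocks until the oldest frame is encoded,
                                             then commits it and frees its entry      */
frame_t *load_and_prepare_frame(void);    /* load the original picture and prepare it */
void     encode_slice(frame_t *f, int s);

void encode_stream(void)
{
    #pragma omp parallel
    #pragma omp single                    /* one thread generates the slice tasks */
    {
        while (frames_remaining()) {
            if (image_buffer_full())
                commit_oldest_frame();
            frame_t *f = load_and_prepare_frame();
            for (int s = 0; s < f->num_slices; s++) {
                #pragma omp task firstprivate(f, s)
                encode_slice(f, s);       /* each slice is one independent unit of work */
            }
        }
    }                                     /* implicit barrier: all slice tasks finish here */
}
```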