Dr. Dobb's is part of the Informa Tech Division of Informa PLC


Motion Estimation & MPEG Encoding


May 2004

Algorithmic improvements

Shehrzad is an engineer at Labcyte Inc. He can be contacted at shehrzad_q@hotmail.com.


The principal goal of compression is the utilization of any information redundancy in the input to reduce the size of the output. In particular, motion within video sequences illustrates how video compression systems such as MPEG can use redundant data to reduce the size of the resulting bitstream (or file). To accomplish this, however, the encoder needs to first analyze signal content to discern such motion ("motion estimation"), then apply this information such that it reduces the entropy of the coded video ("motion compensation").

Motion estimation (ME) is the most computationally demanding subsystem of an MPEG encoder. Efficient and robust ME is critical for real-time encoding of MPEG video. (For nonreal-time applications, on the other hand, it's more important to err on the side of computing the optimal motion vector, no matter the computational costs involved.) In this article, I examine how motion estimation relates to MPEG encoding. I also present algorithmic improvements and source code (available electronically; see "Resource Center," page 5) to the MPEG Software Simulation Group (MSSG) MPEG2 encoder (http://www.mpeg.org/MPEG/MSSG/#source), which is a reference MPEG2 codec implementation.

Motion Estimation

Although the MPEG specification does not explicitly describe the encoding process, it does dictate how the encoder must generate the video bitstream so that it can be decoded by a "model decoder." Consequently, you are free to experiment with various encoding algorithms, so long as the integrity of the output bitstream is maintained. In the simplified MPEG encoder in Figure 1, motion estimation refers to the encoding step where a video frame is partitioned into nonoverlapping 16×16 macroblocks (MBs). For each MB, the encoder attempts to estimate the motion with respect to some reference frame. The output of the ME (a motion vector for each MB) is then fed into the motion compensation system, where the differences between the MB and the predicted 16×16 blocks in the reference frame are entropy coded. Essentially, the ME attempts to exploit the temporal redundancy present in a video sequence. Figure 2 illustrates the ME process.

ME plays a different role in each of the three frame types defined by the MPEG coding standard:

  • I (Intra) frames utilize no motion estimation, and are intracoded (think JPEG).

  • P (Predicted) frames utilize forward prediction.

  • B (Bidirectional Predicted) frames utilize both forward and backward prediction.

Figure 3 shows the relationship between frame type and motion estimation/compensation. The motion estimator estimates the motion displacement in the predicted frame with respect to some source frame. With P frames, the source (or reference) frame is the nearest temporally preceding I or P frame in the video sequence. For B frames, both forward and backward prediction are utilized, meaning that the prediction system takes into account the nearest preceding and/or upcoming I or P frames in the sequence.

In MPEG, ME attempts to model only rigid-body translational motion. The nonlinear effects of rotational motion (or nonrigid-body motion) are too hard to model effectively within the constraints of an MPEG encoder. Most ME algorithms fall into a class of algorithms known as "block-based motion estimators," where a motion vector is found for a contiguous group of video frame pixels (macroblock). How do you go about finding these motion vectors? One way is the exhaustive search brute-force method.

Motion Estimation Algorithms

Example 1 is pseudocode for the exhaustive search procedure; the original fullsearch() function in motion.c (available electronically; see "Resource Center," page 5) implements this search algorithm. The distortion metric in this pseudocode is parameterized. A common metric is the sum of absolute differences (SAD); see Example 2. The dist1() function in motion.c implements the SAD metric. In terms of minimizing distortion, exhaustive search is guaranteed to find the optimal motion vector; all other search methods are suboptimal, because only the exhaustive scheme compares every possible displacement within the search range.

To understand the scale of the computational task, consider the following complexity analysis: For a search range of +/-R pixels (in both directions), 4R+4 block comparisons must be performed per motion vector (the extra 4 comes from a half-pixel search). Using 16×16 macroblocks and the SAD distortion metric requires 768 integer operations per block comparison (one pixel subtraction, absolute value, and accumulation per pixel in a 16×16 block), meaning each motion-vector search consumes (4R+4)(768) integer operations. For an M×N video frame partitioned into (M/16)×(N/16) MBs, the overall cost per frame is approximately (4R+4)(768)(M/16)(N/16) integer operations (ignoring boundary effects and the cost of interpolation for the final half-pixel search). Choosing realistic parameters, such as R=16, M=360, and N=240 (truncating to 22×15=330 whole macroblocks), yields 17,233,920 integer operations per frame! MPEG frame rates typically range from 24 to 30 frames per second (fps), so any method that reduces the number of calculations by even a single order of magnitude is worth investigating.

Fast Motion Estimation Algorithms

Fast block-matching search algorithms work by subsampling the search space on the premise of a slowly varying, continuous distortion function, where the distortion increases monotonically as one moves away from the minimum distortion. If this assumption holds true, then using only a fraction of the pixels to approximate the shape of the distortion function would appear to be a good choice in reducing computational complexity, while still providing the means to find an accurate motion vector.

A generalization of the 1D logarithmic search algorithm popularized by Donald Knuth, extended to incorporate 2D translational displacements, is known as "N-step 2D logarithmic search," as in Figure 4 (where N=3). A series of five block comparisons is performed at each step. A logarithmic refinement of the search pattern (halving the step size) is triggered if either the best match is in the center of the five-point pattern, or the center of the search pattern touches the border of the search range. After N refinements, the search procedure terminates and returns the last minimum point as the motion vector estimate.

Another fast search method that depends on the assumption of a slowly varying, monotonic distortion function is the "conjugate direction search." In this scheme, a modified 1D search is performed along one direction. Immediately thereafter, another modified 1D search is performed in the orthogonal direction, starting at the minimum found by the first 1D search.

By decreasing the number of block comparisons, both of these methods improve upon exhaustive search, yet both share the same weakness: as with most minimization techniques, they can get stuck on local minima. There are numerous places in the video capture chain where noise can enter the picture and distort the distortion function. Furthermore, recall that the ME has no way of dealing with rotations, which manifest themselves in odd ways. Consequently, the distortion space contains areas of peaks and valleys. In the absence of a reasonable initial guess, these techniques converge to a local minimum instead of the global minimum, which is where the true minimum distortion is located. The technique I implement uses a hybrid scheme to get around this problem.

A Hybrid Motion Estimation Method

The hybrid ME method presented here proceeds in three distinct stages:

1. An initial guess at a motion vector is found via a block-hierarchical search.

2. Once this stage completes, the ME transitions into one of the fast block-matching search algorithms, which initiate where the first search terminated.

3. Finally, the ME performs a half-pixel search to further refine the motion vector estimate.

What makes this scheme unique is that it uses the 2D discrete wavelet transform (DWT) as the basis for the block-hierarchical search.

Figure 5 illustrates the block-hierarchical search, in which a cascading series of block-matching searches proceeds from coarse (highly subsampled) to fine (less subsampled) resolution. The location of the MB defines a region-of-interest (ROI) within the reference frame that completely encloses the search area. Both the MB and ROI are repeatedly filtered and downsampled; this dyadic decomposition is accomplished via the 2D DWT. Using a three-level DWT on a 16×16 MB, the coarse-to-fine cascade starts with a 2×2 heavily filtered and downsampled version of the MB, transitions to a 4×4 MB, and ends with an 8×8 MB. Likewise, given a search range of 16 pixels in either direction (thus, the ROI is a 48×48 block of pixels; see Figure 6), the reference ROI cascade consists of 6×6, 12×12, and 24×24 resampled subimages.

The 2D DWT is a subband decomposition whose low-low (LL) subband is a low-pass filtered version of the input. So the first block-hierarchical search operation is a three-level wavelet decomposition of the input MB and reference-frame ROI. At each wavelet decomposition level, the LL band is extracted, which constitutes one level of a multiresolution pyramid (see Figure 7). The end result of the wavelet decompositions is two pyramidal multiresolution analyses. At each level of both pyramids, starting at the top (coarsest), a block comparison is performed with the objective of minimizing the distortion at the current level. The result of each level (the location of the minimum distortion) is passed on to the next lower (finer) level, where it becomes the starting point for the next search. Example 3 details this pyramidal search scheme. By choosing an appropriate form of the DWT, the low-pass filtered versions of the MB and ROI exhibit high "energy compaction," whereby the important features of the subimages are preserved by the decomposition. Consequently, the algorithm should be able to track the minimum distortion as the coarse-to-fine search proceeds from the top of the pyramid to its base.

At the coarsest level in Example 3, a list of the three minimum distortions is returned instead of a single minimum. Upon traversing to the next level, the search scheme uses these three candidates as starting points for its next search. But why bother with three minimum distortions instead of just one? The answer is to help the algorithm steer itself in the correct direction. Recall that these search schemes assume a distortion function that increases monotonically as you spiral away from the minimum. This assumption does not always hold, especially in video containing slow-moving objects. Furthermore, sources of random noise can introduce local phenomena that tend to steer the block-hierarchical search in the wrong direction, particularly at the coarsest decimation levels, where a significant amount of information has been filtered away. To aid the search scheme, the middle level (step 2 in Example 3) is actually a three-part process. This comes at the expense of more block comparisons downstream, but greatly increases the robustness of the motion estimator.

A more traditional fast motion search algorithm kicks in at the original resolution level, starting from wherever the block-hierarchical search terminates. In essence, the block-hierarchical search algorithm is used as the means to provide a fairly accurate guess for either a conjugate directional search or an N-step logarithmic search. Finally, a half-pixel search (which requires interpolation) completes the motion estimation procedure.

Performance

For test purposes, I've included 10 video frames (in PPM format) of a hummingbird in flight (from the CD accompanying Video Compression Demystified, by Peter Symes, McGraw-Hill, 2001; ISBN 0071363246). When running the encoder program, the parameter file (also included) and the frame files must all reside in the same directory. The encoder then encodes these frames into an MPEG movie.

Calculating the number of operations for the new search scheme is a bit more involved than for exhaustive search, as the new method is less structured. Using the dimensions in Figure 7, the topmost level of the pyramid requires nine block comparisons. The middle level requires three series of block comparisons, for a total of 3(16)=48 block comparisons in the worst case (ignoring edge effects). The bottommost level entails 25 block comparisons in the worst case (again ignoring edge effects). So, in the worst case, the top-down search consumes 9+48+25=82 block comparisons, compared to 68 block comparisons for the exhaustive search. However, these block comparisons are not created equal, due to the coarse-to-fine pyramid progression. Computing the SAD distortion metric for the top level requires only 12 integer operations (versus 768 at the original scale), while the middle and bottommost levels require 48 and 192 integer operations, respectively. So the total cost per motion vector for just the top-down portion is (9)(12)+(48)(48)+(25)(192)=7212 integer operations.

Assuming the three-step logarithmic search as the algorithm of choice for the second stage of the hybrid algorithm, which operates in the original resolution space (16×16 SAD calculations), it is clear from the description of the algorithm that it requires 13 block comparisons, which comes out to 9984 integer operations. Adding the cost of the four half-pixel block comparisons, the total cost for this new algorithm (in the worst case) comes out to 7212+9984+3072=20,268 calculations, versus 52,224 calculations for exhaustive search. In other words, hybrid search requires well under half the calculations of exhaustive search. Of course, performing the DWT on the MB and the search areas requires computation cycles, but the DWT is well known for efficient implementations.

It's one thing to drastically reduce the number of operations required to perform motion estimation, but an equally important question is what happens to the visual quality of the resulting MPEG movie. Recall that exhaustive search (with sub-pixel interpolation) is guaranteed to be optimal: Because the entire ROI is searched, the minimum distortion will always be found. Any other method can only hope to approximate the minimum of the distortion function.

To assess this qualitative aspect of the new search method, the input frames were encoded into an MPEG bitstream twice—once using exhaustive search and the other using the algorithm described here (with Conjugate Direction Search used as the fast block matching algorithm). The mpeg2enc program (available electronically) can be configured via the parameter file mp2movie.par to save the reconstructed frames during the encoding process. The reconstructed luminance frames were used to evaluate the visual degradation of the MPEG bitstream. Comparing the reconstructed frames with the original (input) video frames gives a good sense of how well the encoder is doing its job, in terms of signal fidelity.

What's the best way to compare the reconstructed video frames with the originals? There are a variety of metrics, but the Peak Signal-to-Noise Ratio (PSNR) is particularly useful for comparing image-restoration results. Figure 8 plots two PSNR curves: one comparing the original frames against exhaustive search (the baseline), and the other comparing the original frames against the hybrid motion-estimation technique. Keep in mind that the larger the PSNR, the better; these results indicate that the hybrid motion-estimation technique yields acceptable quantitative results.

Software

Using the mpeg2enc program is straightforward (see mpeg2enc.doc, the MSSG documentation, available electronically). My focus here is on the software I wrote to implement the hybrid motion-estimation algorithm. The preprocessor directive USE_HYBRID_SEARCH (see motion.c) controls which code base is used for motion estimation; if it is set to 0, the MPEG2 encoder uses exhaustive search. Given the complexity of the algorithms implemented, the user time when encoding MPEG bitstreams with the provided implementation may actually be longer than with exhaustive search. The source code is not optimally coded: I strove for simplicity of implementation and debugging rather than optimization of the low-level code. In particular, the use of C++ Standard Library constructs such as vector and valarray slows things considerably (especially in debug builds), as does favoring floating-point over integer routines.

When grafting additions to the base MSSG source code, I implemented the six C++ classes in Table 1. For the most part, I followed the Strategy design pattern (see Design Patterns: Elements of Reusable Object-Oriented Software, by Erich Gamma et al., Addison-Wesley, 1995; ISBN 0201633612), so they can be fairly interchangeable and easy to manipulate. Toggling the value of the preprocessor directive USE_LOGARITHMIC_SEARCH in HybridSearch.cpp switches between Conjugate Direction Search and N-Step Logarithmic Search. The Logarithmic Search strategy object defaults to a three-step search, but the final argument to the constructor does allow the client to specify how many levels the search should use.

The HierarchicalDecomposition class relies heavily on the Intel Integrated Performance Primitives Library (http://developer.intel.com/software/products/ipp/ipp30/) for low-level wavelet functions. To build mpeg2enc, you need to download the distribution and install it locally on your system. You also need to update the MSVC Project Settings to include directory preprocessor directives (Project->Settings->C/C++ Tab->Preprocessor Category, under Additional Include Directories) as well as linker directives (Project->Settings->Link Tab->Input Category, under Additional Library path). The 2D DWT is implemented as a forward wavelet transform on the rows, followed by a forward wavelet transform on the columns of the resulting coefficients. The method where all this takes place is Decomp::doWaveletFwdXform(). The wavelet basis chosen for these test results was a biorthogonal spline wavelet, as this particular basis exhibits exceptional energy compaction. It is a simple matter to experiment with other wavelet bases: Simply update the filter coefficients and the calls to ippsWTFwdInitAlloc_32f() and ippsWTFwd_32f() accordingly (remembering to change the code in two places, for both the row and the column decompositions).

Conclusion

The top-down pyramid search, followed by a more traditional fast search algorithm, produces acceptable performance in both subjective video quality and more quantitative measures such as PSNR. MPEG hardware encoders often use an exhaustive search method, which lends itself to highly efficient parallelization. Software-based encoders do exist, however, and any real-time software encoder must take the cost of motion estimation into account. Other algorithms (such as phase correlation, which operates in the frequency domain) are available, but this hybrid search scheme is a good place to start. MPEG4 explicitly uses wavelets for still-image compression; in fact, I chose the biorthogonal spline wavelet basis partly because it is the one used in MPEG4. Part of the cost of performing the DWT could thus be amortized, given that MPEG4 encoders may need to compute the DWT for other purposes anyway.

DDJ

