H.264 and Video Compression

Producing video compression of acceptable quality and very low bit-rate


August 31, 2007
URL:http://www.drdobbs.com/h264-and-video-compression/201203492

Stewart Taylor is a software architect at Intel Corporation and was the lead designer of the Intel IPP functions. He is also author of Optimizing Applications for Multi-Core Processors, from which this article is adapted. Copyright (c) 2007 Intel Corporation. All rights reserved.


The two series of video codec nomenclature, H.26x and MPEG-x, overlap. MPEG-2 is named H.262 in the H.26x scheme. Likewise, another popular codec, H.264, is part of MPEG-4, also known as MPEG-4 Advanced Video Coding (AVC). Its intent, like that of all of MPEG-4, was to produce video compression of acceptable quality at a very low bit-rate -- around half that of its predecessors MPEG-2 and H.263.

Like its predecessors in the H.26x video codec family, H.264 has two encoding modes for individual video frames -- intra and inter. In the former, a frame of video is encoded as a stand-alone image without reference to other images in the sequence. In the latter, the previous and possibly future frames are used to predict the values. Figure 1 shows the high-level blocks involved in intra-frame encoding and decoding of H.264. Figure 2 shows the encoding and decoding process for inter frames.

Figure 1: Intra-Mode Encoding and Decoding in H.264

Whether in inter or intra frames, blocks in H.264 can be expressed relative to previous and subsequent blocks or frames. In inter frames, this is called "motion estimation" and is relative to blocks in other frames. This is the source of considerable compression. As with other video compression techniques, this exploits the fact that there is considerably less entropy in the difference between similar blocks than in the absolute values of the blocks. This is particularly true if the difference can be between a block and a constructed block at an offset from that block in another frame.

Figure 2: Inter-Mode Encoding and Decoding in H.264

H.264 has very flexible support for motion estimation. The estimation can choose from 32 other frames as reference images, and is allowed to refer to blocks that have to be constructed by interpolation.

The encoder is responsible for determining a reference image, block and motion vector. This block is generally chosen using some search among the possibilities, starting with the most likely options. The encoder then calculates and encodes the difference between previously encoded blocks and the new data.
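To make the search concrete, here is a minimal sketch of a brute-force integer-pixel search over a 16x16 block. It is an illustration only, not the algorithm used by the Intel IPP sample; the helper names sad_16x16 and full_search are invented, and a real encoder would use faster search patterns, predicted starting vectors, and sub-pixel refinement.

#include <climits>
#include <cstdint>
#include <cstdlib>

// Sum of absolute differences between a 16x16 source block and a candidate
// reference block; pitch is the width of a full frame row in pixels.
static int sad_16x16(const uint8_t *src, const uint8_t *ref, int pitch)
{
  int sad = 0;
  for (int y = 0; y < 16; y++)
    for (int x = 0; x < 16; x++)
      sad += std::abs(src[y * pitch + x] - ref[y * pitch + x]);
  return sad;
}

// Exhaustive search over integer offsets within +/- range pixels. The caller
// must guarantee the reference frame is padded so every candidate block
// stays inside allocated memory.
static void full_search(const uint8_t *src, const uint8_t *ref, int pitch,
                        int range, int *best_dx, int *best_dy)
{
  int best = INT_MAX;
  for (int dy = -range; dy <= range; dy++)
    for (int dx = -range; dx <= range; dx++)
    {
      int cost = sad_16x16(src, ref + dy * pitch + dx, pitch);
      if (cost < best)
      {
        best = cost;
        *best_dx = dx;
        *best_dy = dy;
      }
    }
}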

On the decoding end, after decoding the reference blocks, the code adds the reference data and the decoded difference data. The blocks and frames are likely to be decoded in non-temporal order, since the frames can be encoded relative to forward-looking blocks and frames.

H.264 encoding supports sub-pixel resolution for motion vectors, meaning that the reference block is actually calculated by interpolating inside a block of real pixels. The motion vectors for luma blocks are expressed at quarter-pixel resolution, and those for chroma blocks can be as fine as eighth-pixel resolution.

This sub-pixel resolution increases the algorithmic and computational complexity significantly. The decoding portion, which requires performing sub-pixel motion compensation only once per block, takes about 10 to 20 percent of the decoding pipeline. The bulk of this time is spent interpolating values between pixels to generate the sub-pixel-offset reference blocks. The cost of performing sub-pixel estimation varies with the encoding algorithm, but may require performing motion compensation more than once per block.

The interpolation algorithm to generate offset reference blocks is defined differently for luma and chroma blocks. For luma, interpolation is performed in two steps, half-pixel and then quarter-pixel interpolation. The half-pixel values are created by filtering with this kernel horizontally and vertically:

[1 -5 20 20 -5 1]/32

Quarter-pixel interpolation is then performed by linearly averaging adjacent half-pixel values.
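The following is a minimal sketch of that luma interpolation for a single sample, assuming the reference frame is padded so that all six taps fall inside allocated memory; the helper names are invented for illustration and are not Intel IPP calls.

#include <algorithm>
#include <cstdint>

static inline uint8_t clip255(int v)
{
  return (uint8_t)std::min(std::max(v, 0), 255);
}

// Horizontal half-pixel sample midway between p[0] and p[1], using the
// six-tap kernel [1 -5 20 20 -5 1] with rounding before the divide by 32.
static uint8_t interp_half_h(const uint8_t *p)
{
  int acc = p[-2] - 5 * p[-1] + 20 * p[0] + 20 * p[1] - 5 * p[2] + p[3];
  return clip255((acc + 16) >> 5);
}

// A quarter-pixel sample is the rounded average of two neighboring
// integer- or half-pixel values.
static uint8_t quarter_avg(uint8_t a, uint8_t b)
{
  return (uint8_t)((a + b + 1) >> 1);
}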

Motion compensation for chroma blocks uses bilinear interpolation with quarter-pixel or eighth-pixel accuracy, depending on the chroma format. Each sub-pixel position is a linear combination of the neighboring pixels.

Figure 3 illustrates which pixels are used by each of these interpolation approaches.

Figure 3: Sub-pixel Interpolation for Motion Compensation in H.264

After interpolating to generate the reference block, the algorithm adds that reference block to the decoded difference information to get the reconstructed block. The encoder executes this step to get reconstructed reference frames, and the decoder executes this step to get the output frames.

Intra Prediction

Intra frames by their nature don't depend on earlier or later frames for reconstruction. However, in H.264 the encoder can use earlier blocks from within the same frame as reference for new blocks. This process, intra prediction, can give additional compression for intra macroblocks, and can be particularly effective if a sufficiently appropriate reference block can be found.

The reference blocks are not used in the way that inter prediction blocks are, by taking the pixel-by-pixel difference of actual blocks in adjacent frames. Instead, a prediction of the current block is calculated as an average of some of the pixels bordering it. Which pixels are chosen and how they are used to calculate the block is dependent on the intra prediction mode. Figure 4 shows the directions that pixels may be used, along with the mode numbers as defined in the H.264 specification.

Figure 4: Mode Numbers for Intra Prediction in H.264
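As a concrete illustration of two of these modes, the sketch below builds the prediction for a 4x4 luma block in the vertical (mode 0) and DC (mode 2) cases, assuming the reconstructed neighbors above and to the left are available. The helpers and array layout are invented for illustration; they are not the Intel IPP API.

#include <cstdint>

// Mode 0 (vertical): each column is a copy of the reconstructed pixel above it.
static void predict_4x4_vertical(const uint8_t above[4], uint8_t pred[4][4])
{
  for (int y = 0; y < 4; y++)
    for (int x = 0; x < 4; x++)
      pred[y][x] = above[x];
}

// Mode 2 (DC): every pixel is the rounded mean of the eight neighbors when
// both the row above and the column to the left are available.
static void predict_4x4_dc(const uint8_t above[4], const uint8_t left[4],
                           uint8_t pred[4][4])
{
  int sum = 0;
  for (int i = 0; i < 4; i++)
    sum += above[i] + left[i];
  uint8_t dc = (uint8_t)((sum + 4) >> 3);
  for (int y = 0; y < 4; y++)
    for (int x = 0; x < 4; x++)
      pred[y][x] = dc;
}

The encoder would then compute the residual against each candidate prediction and keep the mode with the lowest cost.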

This can also be one of the most computationally intensive parts of the encoding process. For the encoder to exhaustively search through all options, it would have to compare each 16x16 luma or 8x8 chroma block against four candidate prediction blocks, and each 4x4 or 8x8 luma block against nine.

Because the encoder can consider a variety of block sizes, a scheme that optimizes the trade-off between the number of bits necessary to represent the video and the fidelity of the result is desirable.
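One common way to formalize that trade-off is a Lagrangian rate-distortion cost, in which the encoder picks the block size and mode minimizing distortion plus a weighting factor times the bit cost. The sketch below is a generic formulation of this idea, not the mode-decision scheme of the Intel IPP sample encoder.

// Candidate cost for one prediction mode and block size.
struct ModeCost
{
  int distortion;  // e.g., SAD or SSD against the original block
  int bits;        // estimated bits to code the mode and residual
};

// Lagrangian cost: smaller is better; lambda trades fidelity against rate.
static double rd_cost(const ModeCost &m, double lambda)
{
  return m.distortion + lambda * m.bits;
}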

Transformation

Instead of the DCT, the H.264 algorithm uses an integer transform as its primary transform to translate the difference data between the spatial and frequency domains. The transform is an approximation of the DCT that is both lossless and computationally simpler. The core transform, illustrated in Figure 5, can be implemented using only shifting and adding.

Figure 5: Matrices for Transformation in H.264
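As a sketch of how the core transform reduces to shifts and adds, the code below applies the 4x4 forward transform as a butterfly over the rows followed by the same butterfly over the columns. The helper names are invented for illustration; this is not the Intel IPP implementation.

// One-dimensional core transform of four residual samples using only
// additions, subtractions, and shifts.
static void transform_1d(const int in[4], int out[4])
{
  int s0 = in[0] + in[3];
  int s1 = in[1] + in[2];
  int s2 = in[1] - in[2];
  int s3 = in[0] - in[3];
  out[0] = s0 + s1;              //  in[0] +  in[1] +  in[2] +  in[3]
  out[1] = (s3 << 1) + s2;       // 2*in[0] +  in[1] -  in[2] - 2*in[3]
  out[2] = s0 - s1;              //  in[0] -  in[1] -  in[2] +  in[3]
  out[3] = s3 - (s2 << 1);       //  in[0] - 2*in[1] + 2*in[2] -  in[3]
}

// Full 4x4 forward core transform: rows first, then columns.
static void forward_transform_4x4(const int block[4][4], int coeff[4][4])
{
  int tmp[4][4], col[4], res[4];
  for (int y = 0; y < 4; y++)
    transform_1d(block[y], tmp[y]);
  for (int x = 0; x < 4; x++)
  {
    for (int y = 0; y < 4; y++)
      col[y] = tmp[y][x];
    transform_1d(col, res);
    for (int y = 0; y < 4; y++)
      coeff[y][x] = res[y];
  }
}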

This 4x4 transform is only one flavor of the H.264 transform. H.264 defines transforms on 2x2 and 4x4 blocks in the baseline profile, and the High profile additionally supports an 8x8 transform.

The algorithm applies separate transforms to the DC coefficients of the chroma and luma components. In the baseline profile, H.264 uses a 2x2 transform for the chroma DC coefficients, a 4x4 transform for the luma DC coefficients, and the main 4x4 transform for all other coefficients.

Quantization

The quantization stage reduces the amount of information by dividing each coefficient by a particular number, reducing the number of possible values that coefficient can take. Because the values then fall into a narrower range, entropy coding can express them more compactly.

Quantization in H.264 is arithmetically expressed as a two-stage operation. The first stage is multiplying each coefficient in the 4x4 block by a fixed coefficient-specific value. This stage allows the coefficients to be scaled unequally according to importance or information. The second stage is dividing by an adjustable quantization parameter (QP) value. This stage provides a single "knob" for adjusting the quality and resultant bitrate of the encoding. The two operations can be combined into a single multiplication and single shift operation.

The QP is expressed as an integer from 0 to 51. This integer is converted to a quantization step size (QStep) nonlinearly. Every six steps of QP doubles the step size, and between each pair of power-of-two step sizes N and 2N there are five intermediate steps: 1.125N, 1.25N, 1.375N, 1.625N, and 1.75N.
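A minimal sketch of this mapping, assuming the conventional base step sizes for QP 0 through 5 (values consistent with the intermediate steps listed above, giving QStep = 1.0 at QP 4 and QStep = 224 at QP 51):

// QStep doubles every six QP values: QStep(QP) = base[QP % 6] * 2^(QP / 6).
static double quant_step(int qp)  // qp in the range 0..51
{
  static const double base[6] = { 0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125 };
  return base[qp % 6] * (double)(1 << (qp / 6));
}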

Reordering

When encoding the coefficients of each macroblock using entropy coding, the codec processes the coefficients of each block in a particular scan order. The order helps increase the number of consecutive zeros by placing the high-frequency coefficients, which are most likely to be quantized to zero, toward the end of the scan.

It's natural to handle this ordering when writing the output of the transform and quantization stage.
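A sketch of that reordering for a single 4x4 block, assuming the conventional frame zig-zag scan order; the table and helper are invented for illustration and are not Intel IPP calls.

// Zig-zag scan positions for a 4x4 block (row-major indices).
static const int zigzag_4x4[16] =
{
  0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15
};

// Read the quantized coefficients out in scan order so that the
// high-frequency positions, usually zero, form one run at the end.
static void reorder_4x4(const short coeff[16], short scanned[16])
{
  for (int i = 0; i < 16; i++)
    scanned[i] = coeff[zigzag_4x4[i]];
}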

Entropy Coding

H.264 defines two entropy coding modes, Context Adaptive Variable Length Coding (CAVLC) and Context Adaptive Binary Arithmetic Coding (CABAC).

CAVLC can be considered the "baseline" VLC. It is a conventional variable-length coding algorithm, with a table of uniquely-prefixed, variable-bit-length codes, but for greater efficiency the standard specifies several such tables. The selection among these tables, and the length of the fixed-length coefficient value suffix, is based on the local statistics of the current stream, termed the context.

CAVLC employs 12 additional code tables: 6 for characterizing the content of the transform block as a whole, 4 for indicating the number of coefficients, 1 for indicating the overall magnitude of a quantized coefficient value, and 1 for representing consecutive runs of zero-valued quantized coefficients. Given the execution efficiency of VLC tables, combined with this limited adaptive coding to boost coding efficiency, CAVLC provides a good tradeoff between execution speed and compression performance.

The CABAC mode has been shown to increase compression efficiency by roughly 10 percent relative to the CAVLC mode, although CABAC is significantly more computationally complex. In a first step, a suitable model is chosen according to a set of past observations of relevant syntax elements; this is called context modeling. If a given symbol is non-binary valued, it is mapped onto a sequence of binary decisions, so-called bins, in a second step. This binarization is done according to a specified binarization scheme, using a tree structure similar to a VLC code. Then each bin is encoded with an adaptive binary arithmetic coding engine using probability estimates that depend on the specific context. This pipeline is shown in Figure 6.

Figure 6: Arithmetic Coding Pipeline in H.264
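To make the binarization step concrete, the sketch below applies unary binarization, one of the schemes CABAC draws on (alongside truncated unary, fixed-length, and Exp-Golomb forms). It illustrates only this stage; the context modeling and the arithmetic coding engine are omitted.

#include <vector>

// Unary binarization: a non-negative value maps to that many 1 bins
// followed by a terminating 0 bin. For example, 0 -> "0", 1 -> "10",
// 3 -> "1110". Each bin is then passed to the adaptive binary arithmetic
// coder with the probability estimate of its context.
static std::vector<int> unary_binarize(unsigned value)
{
  std::vector<int> bins(value, 1);
  bins.push_back(0);
  return bins;
}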

Deblocking Filter

The last stage before reconstruction is a deblocking filter. This filter is intended to smooth the visual discontinuities between transform blocks, and as such is only applied to the pixels nearest these boundaries, at most four on either side of a block boundary. The filter consists of separable horizontal and vertical filters. Figure 7 shows the boundaries in a macroblock and the pixels of interest for a horizontal filter across a vertical boundary.

Figure 7: Horizontal Deblocking Filter in H.264

H.264 specifies that the filter be applied to frames after de-quantization and before the image is used as a reference for motion compensation. For intra frames it should be applied after intra prediction.

This filtering is a very computationally expensive portion of the decoder, taking 15 to 30 percent of the CPU for low-bitrate streams that require the most filtering.

The deblocking filter is an adaptive filter, the strength of which is automatically adjusted according to the boundary strength and the differences between pixel values at the border. The boundary strength is higher for intra blocks than for inter blocks, higher when the blocks in question have different reference images, and higher across a macroblock boundary. The pixel value differences must be less than a threshold that decreases with increasing quality. When the quantization parameter is small, increasing the fidelity of the compressed data, any significant difference is assumed to be an image feature rather than an error, so the strength of the filter is reduced. When the quantization step size is very small, the filter is shut off entirely. The encoder can also disable the filter explicitly or adjust its strength at the slice level.
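As a sketch of that per-edge decision, using the usual naming with p0 and q0 the samples adjacent to the boundary and p1 and q1 their neighbors (an illustration of the threshold logic, not the Intel IPP code):

#include <cstdlib>

// Filter a line of samples across a block edge only when the boundary
// strength is nonzero and the local gradients are below the QP-derived
// alpha and beta thresholds; larger differences are treated as real
// image features and left alone.
static bool filter_this_edge(int p1, int p0, int q0, int q1,
                             int boundary_strength, int alpha, int beta)
{
  return boundary_strength > 0 &&
         std::abs(p0 - q0) < alpha &&   // step across the boundary
         std::abs(p1 - p0) < beta &&    // gradient on the p side
         std::abs(q1 - q0) < beta;      // gradient on the q side
}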

H.264 in Intel IPP

The most computationally intensive part of motion compensation in H.264 is generating the reference blocks. Since H.264 permits sub-pixel offsets from the actual data, the implementation must use a particular interpolation filter to calculate the blocks.

The Intel IPP defines a set of interpolation functions to handle interpolation at different locations in the image. The functions are the following:

These functions are divided into those handling the luma or brightness plane and those handling the chroma or color planes. They are also divided between functions that handle blocks for which all the data is present and functions that handle blocks lying on a frame boundary, outside which there is no data.

The functions that handle all blocks not on the edge of a frame, functions ippiInterpolateLuma_H264 and ippiInterpolateChroma_H264, do not consider the integral portion of the motion vectors. They only perform the interpolation. The input pointer for the reference data should already point to the integral-offset reference block. The functions then calculate the interpolated reference block, using the 2 or 3 bits specifying the fractional motion vector at quarter- or eighth-pixel resolution.

Of the other functions, those with Top or Bottom in the function name interpolate data at the edge of the image. The parameters tell them how far outside the image the reference block lies. The function generates the missing data outside the image by replicating the border row, then performs the interpolation as usual.

The remaining function type, that with Block in the function name, performs the interpolation on a reference block entirely within the image, but also takes the entire motion vector so that it can take care of the offset calculation. Listing 1 shows these functions in action.

The function SelectPredictionMethod determines whether the algorithm needs to employ the border versions of the functions. The rest of the code is from another, unspecified function.

The bulk of the function prepares all of the arguments to the interpolation functions. The variables mvx and mvy hold the complete motion vectors. This code sets the variables xh and yh to the low bits of the motion vector, the fractional portion. Then, after clipping the motion vectors to lie within a maximum range, the code sets the variables xint and yint to the integral portion of the motion vector. Finally, it calculates the pointer to the offset reference block and calls the appropriate Intel IPP function.

Note that the edge replication seems only to be an issue at the top and bottom and not the sides. This is because the replication at the top and bottom boundaries takes place at the macroblock level, but the left and right boundaries are replicated at the frame level.

inline
Ipp8s SelectPredictionMethod(Ipp32s MBYoffset,Ipp32s mvy,
  Ipp32s sbheight,Ipp32s height)
{
  // A fractional vertical MV component means the interpolation filter needs
  // extra rows beyond the block; afterward, mvy is reduced to whole pixels.
  Ipp32s padded_y = (mvy&3)>0?3:0;
  mvy>>=2;
  if (mvy-padded_y+MBYoffset<0)
  {
    return PREDICTION_FROM_TOP;
  }
  if (mvy+padded_y+MBYoffset+sbheight>=height)
  {
    return PREDICTION_FROM_BOTTOM;
  }

  return ALLOK;
}

{
  ...
  // set pointers for this subblock
  pMV_sb = pMV + (xpos>>2) + (ypos>>2)*4;
  mvx = pMV_sb->mvx;
  mvy = pMV_sb->mvy;

  ...

  xh = mvx & (INTERP_FACTOR-1);
  yh = mvy & (INTERP_FACTOR-1);


  Ipp8u pred_method = 0;
  if (ABS(mvy) < (13 << INTERP_SHIFT))
  {
    if (is_need_check_expand)
    {
      pred_method = SelectPredictionMethod(
        mbYOffset+ypos,
        mvy,
        roi.height,
        height);
    }
  } else {
    pred_method = SelectPredictionMethod(
      mbYOffset+ypos,
      mvy,
      roi.height,
      height);

    mvy = MIN(mvy, (height - ((Ipp32s)mbYOffset + ypos + 
      roi.height -
      1 - D_MV_CLIP_LIMIT))*INTERP_FACTOR);
    mvy = MAX(mvy, -((Ipp32s)(mbYOffset + ypos + 
      D_MV_CLIP_LIMIT)*INTERP_FACTOR));
  }

  if (ABS(mvx) > (D_MV_CLIP_LIMIT << INTERP_SHIFT))
  {
    mvx = MIN(mvx, (width - ((Ipp32s)mbXOffset + xpos + 
      roi.width -
      1 - D_MV_CLIP_LIMIT))*INTERP_FACTOR);
    mvx = MAX(mvx, -((Ipp32s)(mbXOffset + xpos +
      D_MV_CLIP_LIMIT)*INTERP_FACTOR));
  }

  mvyc = mvy;

  xint = mvx >> INTERP_SHIFT;
  yint = mvy >> INTERP_SHIFT;

  pRef = pRefY_sb + xint + yint * pitch;
  switch(pred_method)
  {
    case ALLOK:
      ippiInterpolateLuma_H264_8u_C1R(pRef, pitch,
        pTmpY, nTmpPitch,
        xh, yh, roi);
      break;
    case PREDICTION_FROM_TOP:
      ippiInterpolateLumaTop_H264_8u_C1R(pRef, pitch,
        pTmpY, nTmpPitch,
        xh, yh, - ((Ipp32s)mbYOffset+ypos+yint),roi);
      break;
    case PREDICTION_FROM_BOTTOM:
      ippiInterpolateLumaBottom_H264_8u_C1R(pRef, pitch,
        pTmpY, nTmpPitch,
        xh, yh, ((Ipp32s)mbYOffset+ypos+yint+roi.height)-
        height,roi);
      break;

    default:VM_ASSERT(0);
        break;
  }
}
Listing 1: Framework for Interpolation in H.264

Intra Prediction

The Intel IPP has three functions for prediction as applied to intra blocks. They are ippiPredictIntra_4x4_H264_8u_C1IR for 4x4 blocks, ippiPredictIntra_16x16_H264_8u_C1IR for 16x16 blocks, and ippiPredictIntraChroma8x8_H264_8u_C1IR for chroma blocks.

These functions take as arguments a pointer to the location of the block start and the buffer's step value, the prediction mode as in Listing 2, and a set of flags indicating which data blocks up or to the left are available. Listing 2 lists code using these functions to perform prediction.

There are three paths in this code: 16x16, 8x8, and 4x4. The 16x16 blocks call ippiPredictIntra_16x16 immediately. The 8x8 blocks call AddResidualAndPredict_8x8 and the 4x4 blocks call AddResidualAndPredict. The smaller blocks are handled in separate functions because they are relatively complicated: they involve many types of boundaries with other blocks, and a loop within the macroblock. Of these functions, only the 4x4 version is shown. The 8x8 version is nearly identical.

These prediction functions use a particular algorithm from the standard to calculate a reference block from previous blocks. The mode determines the direction of the data of interest, and then the algorithm calculates a prediction for each pixel based on average of one or more available pixels in that direction.

This code takes the mode, already calculated elsewhere, as an argument. So the bulk of the code is dedicated to determining which outside reference blocks are available and to calculating the block locations in memory. A bordering block is considered unavailable only if the predicted block lies on that edge of the macroblock and the edge_type variable indicates that the macroblock itself lies on a global (frame) edge. After calculating the predicted block, each of the two AddResidualAndPredict functions adds the residual using one of the motion compensation functions starting with ippiMC, at full-pixel resolution.

void AddResidualAndPredict(Ipp16s ** luma_ac,
    Ipp8u * pSrcDstPlane,
    Ipp32u step,
    Ipp32u cbp4x4,
    const IppIntra4x4PredMode_H264 *pMBIntraTypes,
    Ipp32s edge_type,
    bool   is_half,
    Ipp32s bit_depth)

{
  Ipp32s srcDstStep = step;
  Ipp8u * pTmpDst = pSrcDstPlane;

  /* bit var to isolate cbp for block being decoded */
  Ipp32u uCBPMask = (1 << IPPVC_CBP_1ST_LUMA_AC_BITPOS);
  for (Ipp32s uBlock = 0; uBlock < (is_half ? 8 : 16);
       uBlock++, uCBPMask <<= 1)
  {
    pTmpDst = pSrcDstPlane;
    Ipp32s left_edge_subblock = left_edge_tab16[uBlock];
    Ipp32s top_edge_subblock = top_edge_tab16[uBlock];
    Ipp32s top = top_edge_subblock  &&
      (edge_type & IPPVC_TOP_EDGE);
    Ipp32s left = left_edge_subblock &&
      (edge_type & IPPVC_LEFT_EDGE);
    Ipp32s top_left = ((top || left) && (uBlock != 0)) ||
      ((edge_type & IPPVC_TOP_LEFT_EDGE) && (uBlock == 0));
    Ipp32s top_right = (top && (uBlock != 5)) ||
      (!above_right_avail_4x4[uBlock]) ||
      ((edge_type & IPPVC_TOP_RIGHT_EDGE) && (uBlock == 5));

    Ipp32s avail = (left == 0)*IPP_LEFT +
      (top_left == 0)*IPP_UPPER_LEFT +
      (top_right == 0)*IPP_UPPER_RIGHT +
      (top == 0)*IPP_UPPER;

    ippiPredictIntra_4x4_H264_8u_C1IR(pTmpDst,
        srcDstStep, pMBIntraTypes[uBlock], avail);

    if ((cbp4x4 & uCBPMask) != 0)
    {
      const Ipp8u * pTmp = pSrcDstPlane;
      ippiMC4x4_8u_C1(pTmp, srcDstStep, *luma_ac, 8,
        pSrcDstPlane, srcDstStep, IPPVC_MC_APX_FF, 0);

      *luma_ac += 16;
    }
    pSrcDstPlane +=
      xyoff[uBlock][0] + xyoff[uBlock][1]*srcDstStep;
  }
}

{
 ...

Ipp32s availability =
  ((edge_type & IPPVC_LEFT_EDGE) == 0)*IPP_LEFT +
  ((edge_type & IPPVC_TOP_LEFT_EDGE) == 0)*IPP_UPPER_LEFT +
  ((edge_type & IPPVC_TOP_RIGHT_EDGE) == 0)*IPP_UPPER_RIGHT +
  ((edge_type & IPPVC_TOP_EDGE) == 0)*IPP_UPPER;

if (mbtype == MBTYPE_INTRA_16x16)
{
  ippiPredictIntra_16x16(
    context->pYPlane + offsetY,
    rec_pitch_luma,
    (IppIntra16x16PredMode_H264) pMBIntraTypes[0],
    availability);

  if (luma_ac)
    AddResidual(luma_ac,
      context->pYPlane + offsetY,
      rec_pitch_luma,
      sd->m_cur_mb.LocalMacroblockInfo->cbp4x4_luma,
      sd->bit_depth_luma);
}
else // if (intra16x16)
{
  if (is_high_profile)
  {
    switch (special_MBAFF_case)
    {
      default:
        if (pGetMB8x8TSFlag(sd->m_cur_mb.GlobalMacroblockInfo))
        {
          AddResidualAndPredict_8x8(
            &luma_ac,
            context->pYPlane + offsetY,
            rec_pitch_luma,
            sd->m_cur_mb.LocalMacroblockInfo->cbp,
            (IppIntra8x8PredMode_H264 *) pMBIntraTypes,
            edge_type_2t,
            true,
            sd->bit_depth_luma);
          AddResidualAndPredict_8x8(
            &luma_ac,
            context->pYPlane + offsetY + 8*rec_pitch_luma,
            rec_pitch_luma,
            sd->m_cur_mb.LocalMacroblockInfo->cbp >> 2,
            (IppIntra8x8PredMode_H264 *) pMBIntraTypes + 2,
            edge_type_2b,
            true,
            sd->bit_depth_luma);
        }
        else
        {
          AddResidualAndPredict(
            &luma_ac,
            context->pYPlane + offsetY,
            rec_pitch_luma,
            sd->m_cur_mb.LocalMacroblockInfo->cbp4x4_luma,
            (IppIntra4x4PredMode_H264 *) pMBIntraTypes,
            edge_type_2t,
            true,
            sd->bit_depth_luma);
          AddResidualAndPredict(
            &luma_ac,
            context->pYPlane + offsetY + 8*rec_pitch_luma,
            rec_pitch_luma,
            sd->m_cur_mb.LocalMacroblockInfo->cbp4x4_luma >> 8,
            (IppIntra4x4PredMode_H264 *) pMBIntraTypes + 8,
            edge_type_2b,
            true,
            sd->bit_depth_luma);
        }
        break;
      case 0:
        if (pGetMB8x8TSFlag(sd->m_cur_mb.GlobalMacroblockInfo))
        {
          AddResidualAndPredict_8x8(
            &luma_ac,
            context->pYPlane + offsetY,
            rec_pitch_luma,
            sd->m_cur_mb.LocalMacroblockInfo->cbp,
            (IppIntra8x8PredMode_H264 *) pMBIntraTypes,
            edge_type,
            false,
            sd->bit_depth_luma);
        }
        else
        {
          AddResidualAndPredict(
            &luma_ac,
            context->pYPlane + offsetY,
            rec_pitch_luma,
            sd->m_cur_mb.LocalMacroblockInfo->cbp4x4_luma,
            (IppIntra4x4PredMode_H264 *) pMBIntraTypes,
            edge_type,
            false,
            sd->bit_depth_luma);
        }
        break;
    }
  }
  else
  {
    switch (special_MBAFF_case)
    {
      default:
        AddResidualAndPredict(
          &luma_ac,
          context->pYPlane + offsetY,
          rec_pitch_luma,
          sd->m_cur_mb.LocalMacroblockInfo->cbp4x4_luma,
          (IppIntra4x4PredMode_H264 *) pMBIntraTypes,
          edge_type_2t,
          true,
          sd->bit_depth_luma);
        AddResidualAndPredict(
          &luma_ac,
          context->pYPlane + offsetY + 8*rec_pitch_luma,
          rec_pitch_luma,
          sd->m_cur_mb.LocalMacroblockInfo->cbp4x4_luma >> 8,
          (IppIntra4x4PredMode_H264 *) pMBIntraTypes + 8,
          edge_type_2b,
          true,
          sd->bit_depth_luma);
        break;
      case 0:
        AddResidualAndPredict(
          &luma_ac,
          context->pYPlane + offsetY,
          rec_pitch_luma,
          sd->m_cur_mb.LocalMacroblockInfo->cbp4x4_luma,
          (IppIntra4x4PredMode_H264 *) pMBIntraTypes,
          edge_type,
          false,
          sd->bit_depth_luma);
        break;
    }
  }
 ...
}
Listing 2: Prediction for Intra Frames in H.264.

Transformation and Quantization

In the Intel IPP, transform and quantization functionality is merged for greater efficiency. There are four functions for the decoding of H.264:

There are analogous functions for encoding:

Additional functions handle 8x8 blocks.

Listing 3 lists a block of code from the H.264 decoder that uses these functions.

The cbp4x4 variable is a bitmask indicating whether there are any DC coefficients within the macroblock that have any data, and individually whether each residual (AC) block within the macroblock has any data. The QP variable contains the quantization parameter that specifies the degree of quantization.

If the bitmask indicates that there is any DC luma data, the code transforms it with the ippiTransformDequantLumaDC function. Then the code iterates over the 16 blocks within the macroblock. For each block, if there is either DC data or residual data, the code transforms and dequantizes the block. It passes in the decoded DC coefficient, which might be 0, the buffer of residual data along with a flag indicating whether the residual data is valid, and the quantization parameter.

if ((cbp4x4 & (IPPVC_CBP_LUMA_AC | IPPVC_CBP_LUMA_DC)) != 0)
{
  Ipp16s *pDC;
  Ipp16s DCCoeff;

  Ipp16s *tmpbuf;

  /* bit var to isolate cbp for block being decoded */
  Ipp32u uCBPMask = (1 << IPPVC_CBP_1ST_LUMA_AC_BITPOS);

  if ((cbp4x4 & IPPVC_CBP_LUMA_DC) != 0)
  {
    luma_dc = (*ppSrcCoeff);
    *ppSrcCoeff += 16;
    ippiTransformDequantLumaDC_H264_16s_C1I(luma_dc, QP);
  }

  tmpbuf = 0;  /* init as no ac coeffs */
  pDC = 0;  /* init as no dc */

  ac_coeffs = pDstCoeff;

  for (Ipp32s uBlock = 0; uBlock < 16;
       uBlock++, uCBPMask <<= 1)
  {
    DCCoeff = (Ipp16s)luma_dc[block_subblock_mapping[uBlock]];
    if (DCCoeff != 0)
      pDC = &DCCoeff; /* dc coeff presents */

    if ((cbp4x4 & uCBPMask) != 0)
    {
      memcpy(pDstCoeff, *ppSrcCoeff, 16*sizeof(Ipp16s));
      tmpbuf = pDstCoeff;
      pDstCoeff += 16;
      *ppSrcCoeff += 16;
    }

    Ipp32s hasAC = tmpbuf != 0;
    if (tmpbuf || pDC)
    {
      if (!pDC)
      {
        if (tmpbuf)
        {
          if (dc_present)
            tmpbuf[0] = 0;
        }
      }
      else
      {
        if (!tmpbuf)
        {
          tmpbuf = pDstCoeff;
          pDstCoeff += 16;
          cbp4x4 |= uCBPMask;
        }
      }
      ippiDequantTransformResidual_H264_16s_C1I(tmpbuf, 8, pDC,
        hasAC, QP);
      tmpbuf = 0;
      pDC = 0;
    }
  }
}
Listing 3: Transformation and Quantization in H.264

Deblocking Filter

The Intel IPP functions that perform filtering on the edges of macroblocks are divided according to horizontal and vertical edges, luma and chroma blocks, block size, bit depth, and sampling rate. They are the following:

The MBAFF versions of the functions filter 16x8 blocks instead of 16x16 and are intended for use with interlaced video.

Slightly different variations of some of these functions take a structure of parameters instead of pushing all of the parameters on the stack. These provide a slight performance improvement due to decreased stack usage.

Listing 4 shows a code snippet that executes a deblocking filter. The behavior of the filters is determined by the alpha, beta, and clipping thresholds, and the filter strength arrays. The alpha parameter is the threshold for the gradient across an edge, while the beta parameter is the threshold for the gradient on either side of an edge. The clipping thresholds, held in the array Clipping and called tc0 in the standard, limit the effect of the filter. The threshold parameters are based on fixed tables, indexed by the quantization parameter (QP) plus a tuning factor. The strength parameter pStrength, which is referred to as bS in the standard, affects the deblocking filter in a number of ways, including the choice of basic algorithm. Both the tables and the formulas used to calculate the indices are taken from the H.264 standard.

For simplicity, this code uses simple wrapper functions around each of the Intel IPP functions. The wrappers adapt the arguments and provide a uniform prototype for all the deblocking filters, but do not do any computation. Since they have a uniform prototype, the code can call them indirectly through a function-pointer table, indexed by a value set elsewhere.

Ipp8u BETA_TABLE[52] =
{
  0,  0,  0,  0,  0,  0,  0,  0,
  0,  0,  0,  0,  0,  0,  0,  0,
  2,  2,  2,  3,  3,  3,  3,  4,
  4,  4,  6,  6,  7,  7,  8,  8,
  9,  9,  10, 10, 11, 11, 12, 12,
  13, 13, 14, 14, 15, 15, 16, 16,
  17, 17, 18, 18
};

{
  ...
  IppStatus ( *(IppDeblocking[])) (Ipp8u *, Ipp32s, Ipp8u *,
    Ipp8u *, Ipp8u *, Ipp8u *, Ipp32s ) =
  {
    &(FilterDeblockingLuma_VerEdge),
    &(FilterDeblockingLuma_HorEdge),
    &(FilterDeblockingChroma_VerEdge),
    &(FilterDeblockingChroma_HorEdge),
    &(FilterDeblockingChroma422_VerEdge),
    &(FilterDeblockingChroma422_HorEdge),
    &(FilterDeblockingChroma444_VerEdge),
    &(FilterDeblockingChroma444_HorEdge),
    &(FilterDeblockingLuma_VerEdge_MBAFF),
    &(FilterDeblockingChroma_VerEdge_MBAFF)
  };

  IppStatus ( *(IppDeblocking16u[])) (Ipp16u *, Ipp32s, Ipp8u *,
    Ipp8u *, Ipp8u *, Ipp8u *, Ipp32s ) =
  {
    &(FilterDeblockingLuma_VerEdge),
    &(FilterDeblockingLuma_HorEdge),
    &(FilterDeblockingChroma_VerEdge),
    &(FilterDeblockingChroma_HorEdge),
    &(FilterDeblockingChroma422_VerEdge),
    &(FilterDeblockingChroma422_HorEdge),
    &(FilterDeblockingChroma444_VerEdge),
    &(FilterDeblockingChroma444_HorEdge),
    &(FilterDeblockingLuma_VerEdge_MBAFF),
    &(FilterDeblockingChroma_VerEdge_MBAFF)
  };

  // internal edge variables
  QP = pmq_QP;

  index = IClip(0, 51, QP + BetaOffset);
  Beta[1] = (Ipp8u) (BETA_TABLE[index]);

  index = IClip(0, 51, QP + AlphaC0Offset);
  Alpha[1] = (Ipp8u) (ALPHA_TABLE[index]);
  pClipTab = CLIP_TAB[index];

  // create clipping values
  {
    Ipp32s edge;

    for (edge = 1;edge < 4;edge += 1)
    {
      if (*((Ipp32u *) (pStrength + edge * 4)))
      {
        // create clipping values
        Clipping[edge * 4 + 0] =
          (Ipp8u) (pClipTab[pStrength[edge * 4 + 0]]);
        Clipping[edge * 4 + 1] =
          (Ipp8u) (pClipTab[pStrength[edge * 4 + 1]]);
        Clipping[edge * 4 + 2] =
          (Ipp8u) (pClipTab[pStrength[edge * 4 + 2]]);
        Clipping[edge * 4 + 3] =
          (Ipp8u) (pClipTab[pStrength[edge * 4 + 3]]);
      }
    }
  }

  if (pParams->bitDepthLuma > 8)
  {
    IppDeblocking16u[dir]((Ipp16u*)pY,
      pic_pitch,
      Alpha,
      Beta,
      Clipping,
      pStrength,
      pParams->bitDepthLuma);
  }
  else
  {
    IppDeblocking[dir](pY,
      pic_pitch,
      Alpha,
      Beta,
      Clipping,
      pStrength,
      pParams->bitDepthLuma);
  }
}
Listing 4: Deblocking Filters in H.264

Threading and Video Coding

H.264 and MPEG-4 in general are amenable to threading. Listing 5 shows the key piece of code from the Intel IPP codec sample for H.264 that uses one OpenMP pragma to parallelize this encoder.

The key aspect of this code is the slice. A slice is defined as an independent segment of the image, one that neither uses other slices for reference in prediction nor is used for reference by other slices. That makes it the perfect level for parallelization, as the codec can process multiple slices simultaneously without being forced into serial execution by prediction dependencies.

template <class PixType, class CoeffsType> Status
  H264CoreEncoder<PixType,CoeffsType>::CompressFrame(
   EnumPicCodType &    ePictureType,
   EnumPicClass   &    ePic_Class,
   MediaData*        dst)
{
  Status      status = UMC_OK;
  Ipp32s  slice;

  for (m_field_index = 0;
       m_field_index <= (Ipp8u)
         (m_pCurrentFrame->m_PictureStructureForDec < FRM_STRUCTURE);
       m_field_index++)
  {
    ...

#if defined _OPENMP
      vm_thread_priority mainTreadPriority = vm_get_current_thread_priority();
#pragma omp parallel for private(slice)
#endif // _OPENMP
      for (slice = (Ipp32s)m_info.num_slices*m_field_index;
           slice < m_info.num_slices*(m_field_index+1);
           slice++)
      {
#if defined _OPENMP
        vm_set_current_thread_priority(mainTreadPriority);
#endif // _OPENMP

        UpdateRefPicList(m_Slices + slice,
          m_pCurrentFrame->GetRefPicLists(slice),
          m_SliceHeader, &m_ReorderInfoL0,
          &m_ReorderInfoL1);

        // Compress one slice
        if (m_is_cur_pic_afrm)
          m_Slices[slice].status =
            Compress_Slice_MBAFF(m_Slices + slice);
        else{
          m_Slices[slice].status =
            Compress_Slice(m_Slices + slice,
            slice == m_info.num_slices*m_field_index);
        }
      ...
      }
Listing 5: Threading the H.264 Encoder
