Parallel

Algorithm Alley

By Tim Kientzle, March 01, 1999

The Discrete Cosine Transform (DCT) is a crucial part of modern image and sound compression. Tim discusses several fast algorithms for computing the 8-point DCT and IDCT.

Mar99: Algorithm Alley

Tim is senior technical editor for DDJ. He can be reached at [email protected].

The Discrete Cosine Transform (DCT) is a crucial part of modern image and sound compression. It is used in JPEG, MPEG video, MPEG audio, and Digital VCR, to name just a few. Mathematically, the DCT provides a way to break a complex signal down into separate components, and does so in a way that's nearly optimal for compression.

However, the computational overhead of the DCT is an obstacle to efficient implementation of these compression algorithms. In a typical DCT-based encoder/ decoder, DCT calculations alone can easily take up 50 percent of the CPU time for the entire program.

In this article, I'll discuss several fast algorithms for computing the 8-point DCT and IDCT. Since these algorithms are typically expressed using a diagram notation, I'll present both the diagrams and a somewhat mechanical technique for translating these diagrams into efficient C code. Using this technique, building efficient DCT and IDCT implementations is straightforward.

Why DCT?

Lossy compression -- whether for video, audio, or still graphics -- rests on a simple idea: You want to identify imperceptible features and remove them without changing features that are noticeable. To do this, you need a good understanding of human perception and a mathematical tool that can separate features.

From a purely mathematical standpoint, the Karhunen-Loeve Transform (KLT) is an ideal tool for separating a data stream into separate features. The KLT works by first analyzing the data to identify certain statistical properties, then using those properties to construct an optimal decomposition. Unfortunately, this analysis is extremely time consuming -- you have to analyze the data, construct a transform tailored to that data, then compute the final transform.

The DCT closely matches the KLT for common types of data, and is simpler to compute. The DCT converts a single block of data into a collection of DCT coefficients. Intuitively, you can think of these coefficients as representing different frequency components. The first coefficient (the DC coefficient) is simply the average of the entire block. Later coefficients (the AC coefficients) represent successively higher frequencies. (For graphics compression, "higher frequency" roughly corresponds to "finer detail.")

Typical DCT Use

A typical compression algorithm starts by breaking the data into small blocks. A DCT is applied to each block. Each coefficient is then multiplied by a fixed weight; higher-frequency coefficients typically use smaller weights. The result is that the higher-frequency values become small, zeros will usually predominate. Finally, standard compression techniques such as Huffman, arithmetic coding, or simple run-length coding are used to pack the coefficients into a small number of bits. Often, this process is iterative; if the final encoded data is too large, the weighting is adjusted and the coefficients are compressed again. Usually, the compressed output indicates the weighting used.

Decompression works in reverse. First, the bits are decoded to yield a series of weighted coefficients. Then, each coefficient is divided by the corresponding weight and an Inverse DCT is used to recover the final values.

The DCT and Inverse DCT are not lossy except for minor errors in calculation. The lossy part of this process is the weighting and inverse weighting that effectively rounds less-important coefficients to one of a few numbers. Of course, designing a good weighting requires a detailed knowledge of human perception.

1D and 2D DCTs

DCT-based graphics compression usually employs an 8×8 DCT. For this reason, there has been extensive study of this particular DCT. Figure 1 shows the equations for the DCT. Figure 1(a) is the one-dimensional (1D), 8-element DCT, while Figure 1(b) is the corresponding equation for the two-dimensional 8×8 Forward DCT.

Figure 1(c) is the same as Figure 1(b), but factored to illustrate an important property. The square brackets in Figure 1(c) enclose a 1D, 8-element DCT computed over each row of the input samples. Once this is computed, the outer sum is another 1D, 8-element DCT. In simpler language, you can compute a 2D DCT by first computing a 1D DCT over each row, then computing a 1D DCT over each column. This separation is the first crucial step to developing a fast DCT algorithm. Even in the most naive implementation, a separated 2D DCT can easily be 10 times faster than the nonseparated version.

Figure 2 shows the corresponding equations for the IDCT. The IDCT can be separated in the same fashion as the Forward DCT.

DCT Flow Diagrams

The starting point, then, for a fast 2D DCT is a fast 1D DCT. Figure 3 is the flow diagram for a typical fast algorithm.

The diagram notation used in Figure 3 is fairly standard; some slight variation of this notation is used in almost every paper discussing the DCT. Figure 4 shows the basic building blocks of this notation.

Figure 3 is typical in several respects. After the first stage, the algorithm splits into even and odd halves; the even half (in the purple area) is just a four-point DCT. Similarly, this four-point DCT splits into two halves, one of which is a two-point DCT in the red area. All of the additions in Figure 3 appear in symmetric add/subtract pairs. In fact, the entire first stage is simply four such pairs in a very typical cross-over pattern; note that the four-point DCT diagram starts with a similar pattern.

Translating Flow Diagrams into Code

If you step through the diagram, it's remarkably routine to convert these diagrams into efficient code. I'll demonstrate by converting the four-point DCT diagram in Figure 5 into C code. The inputs in Figure 5(a) are labeled x0 through x3. I'll also need one additional variable which I call x4. The red lines in Figure 5(b) indicate a sum; x4 is currently unused, so I put the sum into x4. I now have to compute the difference in Figure 5(c); I place that result in x0, and x3 is now available to be used in Figure 5(d). The remaining add/subtract pairs are calculated following the same pattern. (An animated GIF version of Figure 5 is available electronically; see "Resource Center," page 5).

The rotation is a bit trickier. First, as in Figure 6, you can use a temporary variable to reduce the calculation to only three multiplications. (k is a constant here.) Once you've calculated the temporary variable, D only requires A, and C only requires B. In this code, by the time you get to calculate the rotation, x3 is available and can be used for the temporary variable. By letting x0 be C and B, and using x1 for A and D, you end up with the efficient code in the last part of Figure 6.

I've used 10-bit fixed point for the constants in the rotation. This means that at the end of the algorithm, I have to shift the outputs of the rotation to the right by 10 bits. Before I do this, I add 512, which is 1/2 in 10-bit fixed-point arithmetic. This properly rounds the result, and significantly improves the accuracy. (The 512 is actually added to x3 earlier so that both x0 and x1 will be correctly biased.) The only remaining step is to assign the final values to the correct output locations.

Implementing the full algorithm in Figure 3 is similar. The only tricky part is that the outputs of the bottom two rotations must later be multiplied by another constant. This requires care to prevent overflow; if you use 10-bit fixed-point values for both multiplications and 32-bit arithmetic, then the inputs to these rotations can't exceed 12 bits. Fortunately, to get 11-bit accuracy, the square root of 2 is 2896/2048, which is exactly equal to 181/128. Thus, you can get 11-bit accuracy with 7-bit fixed point, reducing overflow problems. Listing One is the resulting code, while Listing Two extends this to the full 2D 8×8 DCT.

The inverse DCT can be directly implemented from the same diagram. You follow the same rules, only working from left to right instead of right to left.

Performance

Listings One and Two are fast DCT implementations. To illustrate, I implemented the basic formulas given at the beginning of this article, and timed them. Table 1 shows the results on a 266-MHz Pentium-II. The source code for this article (available electronically) includes test programs that exercise these DCT implementations and provide timing information.

The literature on DCTs usually categorizes implementations according to the number of multiplications. The theoretical minimum for a complete 1D, 8-element DCT is 11 multiplications. (Figure 3 achieves this minimum.) When using fixed-point arithmetic on newer processors, however, the number of multiplications is not the only concern.

Further Optimizations

In typical applications, the forward DCT is followed by a weighting stage that multiplies every DCT output by a constant value. In many DCT algorithms, there are several multiplications that occur at the end. These factors can be combined with the weighting, effectively eliminating several multiplications. The algorithm in Figure 3 can be reduced to eight internal multiplications: the rotation 2 R6 can be written as in Example 2, and of course the two multiplications by 2 are already at the end.

Figure 7 shows another DCT algorithm that pushes this idea to the extreme. This algorithm, due to Arai, Agui, and Nakajima, uses only five multiplications within the algorithm, and eight postmultiplications. Better yet, those postmultiplications are all by powers of two. This gives a total of 80 internal multiplications for a full 8×8 DCT.

Finally, since DCT implementations are fairly compact and speed critical, hand-coded assembly is often a good idea. In particular, newer processor extensions, including Intel's MMX, directly support simple parallelism, allowing you to compute multiple 1D DCTs at a time. This works extremely well for separable 2D DCT algorithms, such as the ones I've discussed in this article.

Accuracy

Although optimizing a program for speed often requires sacrificing accuracy, fast DCT algorithms can actually be more accurate than slower algorithms. Remember that each multiplication involves some loss of accuracy due to roundoff. Algorithms with fewer multiplications are therefore not only faster, but also more accurate. The only place that practical DCT implementations sacrifice accuracy is by using fixed-point rather than floating-point arithmetic. With careful attention to rounding, however, even fixed-point implementations can be extremely accurate.

The listings that are available electronically use several different measurements to assess the accuracy of a DCT implementation. The basic approach is to generate random test data, compute the DCT in two different ways, and compare the results. The reference version is a direct implementation of the formulas using double-precision floating point. Among the measurements that are of interest are:

How many outputs are in error, and by how much?
Is the output correct for certain special inputs?

For a fixed-point implementation using 32-bit arithmetic with 8-bit inputs, no output should be in error by more than one, and a typical block should have no more than one error for each eight elements. If all the inputs are the same, the AC coefficients should all be precisely zero.

Another way to measure the error is to compute the "mean square error" -- simply take the difference between your test output and your reference output, square the results, and compute the average for each block of data and for several thousand random tests. If you have no errors larger than one, the mean square error is simply the percentage of incorrect outputs. The test programs (available electronically) measure and display several of these error measurements.

Weaknesses

DCT-based image compression is based on two principles. The first is that the DCT -- because it approximates the Karhunen-Loeve Transform -- is a nearly optimal mathematical decomposition. The other is that the human eye is less sensitive to errors in fine details. Since those fine details are represented by high-frequency components, you can safely discard some or all of that information.

The problem is that the eye is quite sensitive to edges, which are also represented by high-frequency components. DCT-based image compression does not do well on images that have a lot of hard edges, such as line art or text. Also, because each block of data is compressed independently, DCT-based compression is subject to "edge artifacts," in which the reconstructed blocks no longer match at the edges.

References

Rao, K.R. and P. Yip. Discrete Cosine Transform, Academic Press, 1990.

"Practical Fast One-Dimensional DCT Algorithms with 11 Multiplications" by Loeffler, Ligtenberg, and Moschytz. ICASSP-89, IEEE, 1989.

Pennebaker and Mitchell. JPEG: Still Image Data Compression Standard, Van Nostrand Reinhold, 1993.

Sherlock and Monro. "Algorithm 749: Fast Discrete Cosine Transform," ACM Transactions on Mathematical Software, December 1995.

DDJ

Listing One

static voiddct1dTest(int *dctBlock) {
  static const int c1=1004 /*cos(pi/16)<<10*/, s1=200 /*sin(pi/16)<<10*/;
  static const int c3=851 /*cos(3pi/16)<<10*/, s3=569 /*sin(3pi/16)<<10*/;
  static const int r2c6=554 /*sqrt(2)*cos(6pi/16)<<10*/, r2s6=1337;
  static const int r2=181; /* sqrt(2)<<7 */
  int x0=dctBlock[0], x1=dctBlock[1], x2=dctBlock[2], x3=dctBlock[3],
    x4=dctBlock[4], x5=dctBlock[5], x6=dctBlock[6], x7=dctBlock[7];
  int x8;


</p>
  /* Stage 1 */
  x8=x7+x0; x0-=x7;  x7=x1+x6; x1-=x6;
  x6=x2+x5; x2-=x5;  x5=x3+x4; x3-=x4;


</p>
  /* Stage 2 */
  x4=x8+x5; x8-=x5;  x5=x7+x6; x7-=x6;
  x6=c1*(x1+x2); x2=(-s1-c1)*x2+x6; x1=(s1-c1)*x1+x6;
  x6=c3*(x0+x3); x3=(-s3-c3)*x3+x6; x0=(s3-c3)*x0+x6;


</p>
  /* Stage 3 */
  x6=x4+x5; x4-=x5;
  x5=r2c6*(x7+x8); x7=(-r2s6-r2c6)*x7+x5; x8=(r2s6-r2c6)*x8+x5;
  x5=x0+x2;x0-=x2; x2=x3+x1; x3-=x1;


</p>
  /* Stage 4, round, and output */
  dctBlock[0]=x6;  dctBlock[4]=x4;
  dctBlock[2]=(x8+512)>>10; dctBlock[6] = (x7+512)>>10;
  dctBlock[7]=(x2-x5+512)>>10; dctBlock[1]=(x2+x5+512)>>10;
  dctBlock[3]=(x3*r2+65536)>>17; dctBlock[5]=(x0*r2+65536)>>17;
}


</p>

Back to Article

Listing Two

static voiddct2dTest(int (*dctBlock)[8]) {
  static const int c1=1004 /*cos(pi/16)<<10*/, s1=200 /*sin(pi/16)<<10*/;
  static const int c3=851 /*cos(3pi/16)<<10*/, s3=569 /*sin(3pi/16)<<10*/;
  static const int r2c6=554 /*sqrt(2)*cos(6pi/16)<<10*/, r2s6=1337;
  static const int r2=181; /* sqrt(2)<<7 */
  int row,col;


</p>
  for(row=0;row<8;row++) {
    int x0=dctBlock[row][0], x1=dctBlock[row][1], x2=dctBlock[row][2],
      x3=dctBlock[row][3], x4=dctBlock[row][4], x5=dctBlock[row][5],
      x6=dctBlock[row][6], x7=dctBlock[row][7], x8;
    /* Stage 1 */
    x8=x7+x0; x0-=x7; x7=x1+x6; x1-=x6; x6=x2+x5; x2-=x5; x5=x3+x4; x3-=x4;


</p>
    /* Stage 2 */
    x4=x8+x5; x8-=x5; x5=x7+x6; x7-=x6;
    x6=c1*(x1+x2); x2=(-s1-c1)*x2+x6; x1=(s1-c1)*x1+x6;
    x6=c3*(x0+x3); x3=(-s3-c3)*x3+x6; x0=(s3-c3)*x0+x6;


</p>


</p>
    /* Stage 3 */
    x6=x4+x5; x4-=x5; x5=x0+x2;x0-=x2; x2=x3+x1; x3-=x1;
    x1=r2c6*(x7+x8); x7=(-r2s6-r2c6)*x7+x1; x8=(r2s6-r2c6)*x8+x1;


</p>
    /* Stage 4 and output */
    dctBlock[row][0]=x6;  dctBlock[row][4]=x4;
    dctBlock[row][2]=x8>>10; dctBlock[row][6] = x7>>10;
    dctBlock[row][7]=(x2-x5)>>10; dctBlock[row][1]=(x2+x5)>>10;
    dctBlock[row][3]=(x3*r2)>>17; dctBlock[row][5]=(x0*r2)>>17;
  }
  for(col=0;col<8;col++) {
    int x0=dctBlock[0][col], x1=dctBlock[1][col], x2=dctBlock[2][col],
      x3=dctBlock[3][col], x4=dctBlock[4][col], x5=dctBlock[5][col],
      x6=dctBlock[6][col], x7=dctBlock[7][col], x8;


</p>
    /* Stage 1 */
    x8=x7+x0; x0-=x7; x7=x1+x6; x1-=x6; x6=x2+x5; x2-=x5; x5=x3+x4; x3-=x4;


</p>
    /* Stage 2 */
    x4=x8+x5; x8-=x5; x5=x7+x6; x7-=x6;
    x6=c1*(x1+x2); x2=(-s1-c1)*x2+x6; x1=(s1-c1)*x1+x6;
    x6=c3*(x0+x3); x3=(-s3-c3)*x3+x6; x0=(s3-c3)*x0+x6;


</p>
    /* Stage 3 */
    x6=x4+x5; x4-=x5; x5=x0+x2;x0-=x2; x2=x3+x1; x3-=x1;
    x1=r2c6*(x7+x8); x7=(-r2s6-r2c6)*x7+x1; x8=(r2s6-r2c6)*x8+x1;


</p>
    /* Stage 4 and output */
    dctBlock[0][col]=(x6+16)>>5;  dctBlock[4][col]=(x4+16)>>5;
    dctBlock[2][col]=(x8+16384)>>15; dctBlock[6][col] = (x7+16384)>>15;
    dctBlock[7][col]=(x2-x5+16384)>>15; dctBlock[1][col]=(x2+x5+16384)>>15;
    dctBlock[3][col]=((x3>>8)*r2+8192)>>14;
    dctBlock[5][col]=((x0>>8)*r2+8192)>>14;
  }
}

Back to Article

1 2 3 4 5 6 7 8 9 10 Next

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Parallel