Design

Optimization Techniques

By Tim Kientzle, May 01, 2004

When it comes to performance, a good understanding of processor architecture pays big dividends.

May04: Optimization Techniques

Optimization Techniques

Dr. Dobb's Journal May 2004

Sharing hard-won insights

By Tim Kientzle

Tim is a DDJ contributing editor and consultant. He can be contacted at [email protected].

Some years ago, a client asked me to write a simple decompressor for extracting still images from DV camcorder data. As requirements changed, I was repeatedly called in to add new features and improve performance. Today, my decoder is one of the fastest available for that format.

With requirements ratcheting steadily higher for each iteration, I was forced to learn a lot about optimizing code for modern processors. In this article, I share some of those lessons.

The DV video format is a fairly standard DCT-based image compression system, similar in concept to motion JPEG. There are three major steps to the decoder: variable-length decoding, Inverse DCT, and color-space conversion.

The DV format stores data using a varying number of bits for each piece of data, similar to Huffman compression. Decompressing requires first splitting the data stream into varying-length pieces at appropriate bit boundaries. The resulting data goes through an Inverse DCT operation, which transforms the resulting 8×8 block of frequency information into actual pixel values. This is a standard and well-studied transformation used by many audio and video compression systems. The final pixel values specify YCbCr color information, which must be converted into a color format suitable for display, usually YUV or RGB.

Test, Test, Test

For the most part, cutting-edge optimization is just a lot of hard work. You have to constantly experiment with different approaches to find the ones that work best for a particular problem. To know which approaches are best, you must constantly measure the real performance.

Profiling tools such as GNU gprof (http://www.gnu.org/software/binutils/) or Intel's VTune (http://www.intel.com/software/products/vtune/) are useful for determining which parts of your code are consuming the most time. Be prepared to spend a lot of time examining profiling data. Look at the amount of time being spent in each function, and carefully consider whether that matches your expectations.

In my video decoder, for example, I was not surprised to find that the Inverse DCT calculation required nearly half of the processor time. I was surprised, however, that simply clearing my temporary work variables in preparation for each new block of data required nearly 15 percent of the processor time. A small piece of hand-tuned assembly made a big difference there and resulted in an easy 10 percent speedup of the entire decoder.

Building-in Test Code

You can use the Time Stamp Counter (TSC) on modern processors to build accurate self-measurement into your code. By checking the TSC before and after critical code sections, you can know exactly how many clock cycles are being used by your code. Example 1 shows how to read the TSC using two popular compilers.

For my application, I developed a test harness to read several hundred test frames from a file into memory, then iteratively decompress each one. My decoder reads the TSC at key points and stores the values into a public structure so that timing information can be accumulated by the driver program and reported. In this way, I was able to quickly assess the impact of each change. If a code change resulted in a slowdown, I could immediately back it out and try something else.

In some cases, you need to remove the timing and test code for production release because the timing and test code itself requires time. In my case, this was not required. The full decoder required millions of cycles per frame; the timing calculations added only a few hundred. This let me simply leave the timing code in as a permanent feature.

Assembly Won't Save You

Video decoding involves doing many repeated calculations on blocks of data. This is a perfect fit for parallel instructions such as those provided by the MMX, SSE, or 3DNow extensions. Many key algorithms were made significantly faster by careful translation into assembly.

In particular, I was able to find some exceptionally good implementations of the Inverse DCT calculation that were carefully tuned for parallel operation using MMX and SSE2 instruction sets. A little time spent searching the Internet or asking around can often save you a lot of time.

However, assembly is no silver bullet. Remember that assembly code is generally more difficult to modify than higher level code. As a result, it takes much longer to fix bugs or implement algorithmic improvements to assembly code. Stick with C where you can and put off assembly until late in the development cycle. This gives you more time to focus on architectural and algorithmic improvements, which often provide much bigger speedups.

In my case, the initial variable-length decoding was particularly frustrating. I tried several times to use assembly to optimize that code. Each such attempt resulted in modest improvements, but those improvements turned out to be algorithmic; in each case, fitting those algorithmic improvements back into the C version resulted in the C version being faster than the assembly.

Use Vector Tables

For top performance, you need processor-specific versions of key routines. For example, my video decoder contains C, MMX, and SSE2 versions of the critical Inverse DCT calculation. Managing processor-specific code is not easy. You must determine the processor when you start, then install the correct versions of critical functions. The only effective way to manage this is to define a function pointer for each major routine. Example 2 shows typical code for supporting this. In this case, you would never invoke Foo(x) directly, you would instead invoke it via the function pointer as (*Foo_Ptr)(x).

Determining the processor is easier today than it used to be, thanks to the CPUID instruction on x86 processors and similar capabilities on other processor families. Listing One provides basic processor detection for current x86 processors. Unlike many published processor- detection routines, this one is almost all in high-level C, which makes it easy to add checks for new features.

In practice, of course, you won't need a separate version of every function for every processor. I manage this in my work by keeping each major routine in a separate source file, and pairing each source file with a header file. Example 3 shows a typical header that defines four separate names with only two underlying implementations. This prevents you from needing to update the code in Example 2 whenever you implement a new processor-specific function.

Factor Carefully

Function calls take time, especially if you have to go through a vector table. It is imperative that you not factor your code too finely. For example, one decoder I studied called a function to read each binary bit of input. This obviously swamped performance. On the other hand, you do need to identify pieces of code that can benefit from processor-specific optimizations and factor them out. I found early on that simply clearing my work variables before processing each block of data was a time-consuming task; defining separate functions to exploit MMX or SSE registers when they were available helped significantly.

Avoid Branches, Unroll Loops

Branches can be very slow on modern processors and should be avoided for maximal performance. Example 4 shows four different ways to compute the same thing; the first two require a branch, the second two do not. In many cases, the latter versions will be significantly faster.

Branches commonly appear in loops. Consider the routine in Listing Two for clearing a 4K block of memory. I've unrolled the loop significantly to reduce the number of branches required. Of course, extreme loop unrolling also increases the size of the code and hence increases the amount of time required to load instructions from memory. Experiment carefully to find the right balance point. Also note the technique I've used here of carefully separating data changes (the two add instructions) from uses of that data. Modern processors can execute multiple instructions at a time only if there are no dependencies. Usually, of course, it's not this simple to separate the data dependencies, which is one reason that the code from a good optimizing compiler often outperforms hand-generated assembly.

Right-Size Your Data

Think carefully about data storage. Most processors are more efficient when all of your data is the same width; converting between 16-bit and 32-bit values can be expensive. As a result, you'll generally want to stick with 32-bit values on 32-bit processors. On the other hand, memory bandwidth is a real constraint. Storing large blocks of data as 16-bit values rather than 32-bit values can double performance by halving the amount of memory traffic. In particular, remember that single-precision floating-point values are 32-bit values. Integer algorithms using 16-bit values will almost always be much faster.

Be Aware of the Processor Cache

Every time you have to go to the L2 cache or main memory, you pay a penalty. As a result, a well-optimized program will organize its work to fit into the L1 cache where possible. My decoder, for example, takes one small block of data and processes it completely before going to the next block. One competing decoder I had a chance to study instead performed complete passes over the entire image. Because each pass processed more data than would fit in the cache, the intermediate data had to be written all the way to main memory and then read back from main memory for the next step. My decoder is significantly faster.

Conclusion

As processor performance improves and compilers become smarter, assembly language is becoming less of a cure-all for performance problems. However, a good understanding of processor architecture still pays big dividends. Even more important is taking the time to set up a good testing system—one that is convenient for you to use constantly as you test and refine your code.

DDJ

Listing One

typedef struct {
    /* Registers */
    int eax;
    int ebx;
    int ecx;
    int edx;
    /* Basic CPUID data */
    int maxCPUID;
    int maxExtendedCPUID;
    int features;
    int extendedFeaturesAMD;
    int family;
    int model;
    int steppingId;
    /* Identification Strings */
    char manufacturer[16];
    char brand[50];
    /* Specific features */
    char hasCPUID;
    char hasTSC;
    char hasCMOV;
    char hasMMX;
    char hasSSE;
    char hasSSE2;
} PROCESSOR;
void
ProcessorCPUID(int arg, PROCESSOR *p) {
  if(!p->hasCPUID) return;
  p->eax = p->ebx = p->ecx = p->edx = 0;
#if defined(_M_IX86) && defined(_MSC_VER)
  /* Visual C++ syntax */
  __asm {
    mov edi,p
    mov eax,arg
    cpuid
    mov [edi]PROCESSOR.eax,eax
    mov [edi]PROCESSOR.ebx,ebx
    mov [edi]PROCESSOR.ecx,ecx
    mov [edi]PROCESSOR.edx,edx
  }
#elif defined(__GNUC__) && defined(i386)
  /* GNU C/C++ Syntax */
  __asm__("mov %4,%%eax;cpuid;mov %%eax,%0;mov %%ebx,%1;"
          "mov %%ecx,%2;mov %%edx,%3"
          : "=mr" (p->eax),"=mr"(p->ebx),
              "=mr"(p->ecx),"=mr"(p->edx) /* outputs */
          : "mr" (arg) /* inputs */
          : "eax","ebx","ecx","edx" /* Registers clobbered */);
#endif
}
void
ProcessorInfo(PROCESSOR *p) {
  int t=0; /* Scratch C register */
  memset(p,0,sizeof(*p));
  /* Step 1: Determine if this processor supports CPUID. */
#if defined(_M_IX86) && defined(_MSC_VER)
  /* Processor supports CPUID if you can set and clear bit 21 of EFLAGS */
  /* This sequence sets bit 21 of t if bit 21 can be both set and cleared */
  __asm {
    pushfd        /* Start with current EFLAGS value */
    pop eax
    or eax,0x00200000  /* Set bit 21 */
    push eax
    popfd         /* Store into EFLAGS */
    pushfd        /* Read back from EFLAGS */
    pop ebx       /* Result into EBX */
    xor eax,0x00200000 /* Clear bit 21 */
    push eax
    popfd         /* Store into EFLAGS */
    pushfd        /* Read back from EFLAGS */
    pop ecx       /* Result into ECX */
    xor ebx,ecx   /* See if bit 21 is different in EBX and ECX */
    mov t,ebx     /* Store result into 't' */
  }
#elif defined(__GNUC__) && defined(i386)
  /* Processor supports CPUID if you can set and clear bit 21 of EFLAGS */
  /* This sequence sets bit 21 of t if bit 21 can be both set and cleared */
  __asm__("pushf;pop %%eax;"
          "or $0x00200000,%%eax;push %%eax;popf;pushf;pop %%ebx;"
          "xor $0x00200000,%%eax;push %%eax;popf;pushf;pop %%ecx;"
          "xor %%ecx,%%ebx;mov %%ebx,%0"
          : "=mr" (t) /* Output 't' can be a memory argument or register */
          : /* No inputs */
          : "eax","ebx","ecx" /* Registers clobbered */);
#endif
  if(t & 0x00200000) p->hasCPUID=1;
  /* No CPUID? Then there's not a lot we can determine. */
  if(!p->hasCPUID) return;
  /* Use CPUID(0) to determine manufacturer and extent of CPUID support */
  ProcessorCPUID(0,p);
  p->maxCPUID = p->eax; /* Maximum CPUID argument */
  {
    int *s = (int *)(p->manufacturer);
    s[0] = p->ebx;
    s[1] = p->edx;
    s[2] = p->ecx;
    p->manufacturer[12] = 0;
  }
  if(p->maxCPUID < 1) return;
  /* Identify standard features */
  ProcessorCPUID(1,p);
  p->features = p->edx;
  if(p->features & (1<<4)) p->hasTSC = 1;
  if(p->features & (1<<15)) p->hasCMOV = 1;
  if(p->features & (1<<23)) p->hasMMX = 1;
  if(p->features & (1<<25)) p->hasSSE = 1;
  if(p->features & (1<<26)) p->hasSSE2 = 1;
  p->family = (p->eax >> 8) & 15;
  if(p->family == 0x0F) p->family |= (p->eax >> 16) & 0xFF0;
  p->model = (p->eax >> 4) & 0x0F;
  if(p->model == 0x0F) p->model |= (p->eax >> 8) & 0xF0;
  p->steppingId = p->eax & 0x0F;
  /* De facto standard: AMD 3DNow */
  ProcessorCPUID(0x80000000U,p);
  p->maxExtendedCPUID = p->eax;
  if(p->maxExtendedCPUID >= 0x80000001U) {
    ProcessorCPUID(0x80000001U,p);
    p->extendedFeaturesAMD = p->edx;
    if(p->extendedFeaturesAMD & (1<<31)) p->has3DNow = 1;
  }
  if(p->maxExtendedCPUID >= 0x80000004U) {
    int *s = (int *)(p->brand);
    ProcessorCPUID(0x80000002U,p);
    s[0] = p->eax; s[1] = p->ebx; s[2] = p->ecx; s[3] = p->edx;
    ProcessorCPUID(0x80000003U,p);
    s[4] = p->eax; s[5] = p->ebx; s[6] = p->ecx; s[7] = p->edx;
    ProcessorCPUID(0x80000004U,p);
    s[8] = p->eax; s[9] = p->ebx; s[10] = p->ecx; s[11] = p->edx;
  }
  if(strcmp(p->manufacturer,"GenuineIntel")==0) {
    /* Intel-specific feature recognition */
  }
  if(strcmp(p->manufacturer,"AuthenticAMD")==0) {
    /* Check extended features for AMD-specific extensions */
    /* if(p->extendedFeaturesAMD & (1<<22)) p->hasAMDMMX = 1;*/
    /* if(p->extendedFeaturesAMD & (1<<31)) p->has3DNow = 1; */
    /* if(p->extendedFeaturesAMD & (1<<30)) p->has3DNowExtensions = 1; */
  }
  /* Clear out register fields before returning */
  p->eax = p->ebx = p->ecx = p->edx = 0;
}

Back to Article

Listing Two

void ClearWorkArea_MMX(char *work) {
  _asm {
    mov   esi,work
    mov   edi,esi
    sub   edi,128
    pxor  mm0,mm0
    mov   cl,15
loopTop:
    movq  qword ptr [esi],mm0
    movq  qword ptr [esi+8],mm0
    movq  qword ptr [esi+16],mm0
    movq  qword ptr [esi+24],mm0
    movq  qword ptr [esi+32],mm0
    movq  qword ptr [esi+40],mm0
    movq  qword ptr [esi+48],mm0
    movq  qword ptr [esi+56],mm0
    movq  qword ptr [esi+64],mm0
    movq  qword ptr [esi+72],mm0
    movq  qword ptr [esi+80],mm0
    movq  qword ptr [esi+88],mm0
    add   edi,256
    movq  qword ptr [esi+96],mm0
    movq  qword ptr [esi+104],mm0
    movq  qword ptr [esi+112],mm0
    movq  qword ptr [esi+120],mm0
    movq  qword ptr [edi],mm0
    movq  qword ptr [edi+8],mm0
    movq  qword ptr [edi+16],mm0
    movq  qword ptr [edi+24],mm0
    movq  qword ptr [edi+32],mm0
    movq  qword ptr [edi+40],mm0
    movq  qword ptr [edi+48],mm0
    movq  qword ptr [edi+56],mm0
    movq  qword ptr [edi+64],mm0
    movq  qword ptr [edi+72],mm0
    add   esi,256
    movq  qword ptr [edi+80],mm0
    movq  qword ptr [edi+88],mm0
    movq  qword ptr [edi+96],mm0
    dec   cl
    movq  qword ptr [edi+104],mm0
    movq  qword ptr [edi+112],mm0
    movq  qword ptr [edi+120],mm0
    jne   loopTop
    emms
  }
}

Back to Article

1 2 3 4 5 Next

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Design

Optimization Techniques

Optimization Techniques