Dr. Dobb's | MMX Technology Code Optimization

MMX Technology Code Optimization

Max examines MMX-code optimization techniques and shows how you can achieve maximum speed on the Intel Pentium II and AMD K6-2 processors.

September 01, 1999
URL:http://www.drdobbs.com/mmx-technology-code-optimization/184411044

Sep99: MMX Technology Code Optimization

Max is a programmer and Ph.D. student. He can be contacted at [email protected].

It hasn't been that long since computationally intensive, real-time graphics applications required digital signal processors (DSPs) or other special processors. With the introduction of Single-Instruction-stream Multiplex-Data-stream (SIMD) extensions to general-purpose processors, however, things have changed. With multimedia-instruction-set extensions such as Intel's MMX, you can execute up to 16 integer operations on 8-bit data per clock cycle or up to four integer operations on 32-bit data per cycle. The introduction of 3DNow! by AMD and streaming SIMD extensions by Intel has brought floating-point performance up to speed with number crunching rates of four single-precision floating-point operations per cycle. These technologies open possibilities for real-time image processing, speech recognition, audio/video compression, and 3D rendering. Raw CPU speed, however, is not enough to make applications work faster. Programmers must use optimization techniques to achieve optimal performance of critical code. In this article, I'll discuss MMX code optimization and suggest techniques for achieving maximum speed on two common PC CPUs -- the Intel Pentium II and AMD K6-2.

AMD K6-2 Versus Intel Pentium II

Both Intel's Pentium II and AMD's K6-2 are sophisticated CPUs with complex internal structures. Both CPU families employ superscalar pipelining, dynamic execution, and branch prediction -- and both can execute up to 6 m-operations per cycle.

Of course, there is a difference in the internal architecture. The Pentium II, for instance, has three instruction decoders and the K6-2 has two. Aimed to speedup existing software, AMD's K6-2 is less sensitive to code selection and instruction scheduling. Specific details on the internal architecture of these two CPUs can be found in DirectX, RDX, RSX, and MMX Technology: A Jump-start Guide to High Performance APIs, by Rohan Coelho and Maher Hawash (Addison-Wesley, 1998); The Pentium II Processor Developer's Manual (Intel Corporation, Order #243502-001, October 1997); and the AMD K6-2 Processor Data Sheet (Document # 21850, http:// www.amd.com/K6/ k6docs/).

As Table 1 illustrates, however, there is another striking difference between Pentium II and K6-2 -- cache organization. AMD did not include L2 cache in the K6-2 chip package. Instead, the original Socket 7 was improved to work at 100 MHz and renamed "Super 7." It connects the CPU with external L2 cache via a 100-MHz 64-bit wide bus. On the other hand, the Pentium II, famous for its integrated L2 cache, works at half the speed of the CPU core (full speed for Pentium II Xeon).

At 300-MHz, Pentium II L2 cache works at 150 MHz, or 50 percent faster than 100-MHz K6-2 L2 cache. At 450 MHz, the difference in speed is 125 percent. To compensate for this without risking major processor and motherboard redesigning, AMD doubled the CPU's internal cache from 32 KB to 64 KB and implemented a sector L1 cache structure, where two 32-byte cache lines are combined in a sector. When cache misses occur and a cache line is filled from L2 cache, the other line in the sector is automatically fetched. Theoretically, this approach compensates for the excessive L2 cache latency by reducing the number of misses by a factor of two.

While the Pentium III did not introduce any major changes in cache organization, the same can't be said about the AMD K6-III (see AMD K6-III Processor Data Sheet #21918, http://www .amd.com/K6/k6docs/). Still fitting into a Super 7 socket, the K6-III contains a 256 KB integrated L2 cache working at the same speed with the CPU core and supports external L3 cache. The K6-III is capable of fetching data from L2 cache twice as fast as Pentium III, but the cache pool is two times smaller. Thus, expensive L2 cache misses happen twice as often. This creates a confusing situation where it is hard to guess whether these tradeoffs will compensate each other.

MMX Data Optimization

Efficient processing of a continuous stream of data requires both a fast CPU and fast memory throughput. For instance, when you encode a video sequence, each frame has to be fetched from system memory, compressed, and stored in another location. The performance is limited by the memory bandwidth, no matter how fast the CPU is. However, it is possible to reduce, if not eliminate, the computational compressor overhead by aggressive instruction scheduling and data prefetching.

Still, by having a continuous data stream or static data, the performance can be improved by changing the data organization (if possible) or by changing the way data is processed. If the video compressor has to process the same frame several times and the frame is too large to fit in cache, the data will have to be fetched from memory again and again, thus reducing performance dramatically. However, if the frame can be split into several blocks (each small enough to fit in cache) and each block can be processed separately, then the data will be fetched from system memory only once, and all consecutive loads will hit the cache. Thus, with multipass processing (typical for most applications), it is highly desirable to partition data in the smallest blocks possible (16 KB to fit in Pentium II/III), do multipass processing on each block while it resides in cache, store results, fetch another block, and so on. This approach may not work for all applications, however. Convolution filtering, for example, may create undesirable edge effects and image artifacts when applied to a block-split image.

Another important consideration is whether complex data should be organized as an array of structures (AoS) or a structure of arrays (SoA). It depends. With MMX, SIMD processing, it is vital that the data, which can be processed in parallel, is packed into 8-byte chunks. Consider the example of when alpha values of an RGBA image must be adjusted. In such a case, the SoA approach eliminates unnecessary unpacking. At the same time, when the intensity of the entire image has to be adjusted (and each RGBA component must be offset by the same value), the AoS approach is favorable.

In short, the necessary steps to achieve MMX data optimization are:

1. Determine optimal data packing format for SIMD processing (AoS, SoA, or data structure layout).

2. Determine in how many passes data is processed.

3. If data is large and processed in more than one pass, determine the minimal block size the data can be split into (the preferable block size is L1 cache size).

5. When possible, the data processing should be in place, rather than out-of-place, to minimize the number of memory references and cache misses.

MMX Code Optimization

Once the data format and organization is selected, it is time to write code. Despite the CPU's rescheduling and out-of-order execution, it is critical to arrange MMX code correctly.

Therefore, the code optimization guidelines are as follows:

Pair and schedule instructions to fill both pipelines.
Maximize register usage and minimize memory references.
Minimize branching and unroll small loops.
Align code, branch targets, and data.
Avoid using long (more than 7 bytes) and complex instructions (loop, enter, leave, and so on).

MMX-specific code optimization guidelines include the following:

Parallel processing of multiple data streams along with loop unrolling can improve pairing.
MMX instructions do not mix well with floating-point instructions.
MMX instructions, which reference memory or integer registers, do not mix well with integer instructions referencing memory or registers.
Column-wise processing can be better than sequential row-wise processing.
MMX code sections should end with emms instructions if floating-point operations are to be used later in the program.

In general, MMX instructions can be easily paired with each other and with integer instructions. The exceptions are:

MMX instructions that reference memory or integer registers. These should not be paired with integer instructions referencing memory or the registers; see Listing One.
MMX shift/pack/unpack instructions. These should not be paired because there is a single shifter unit; see Listing Two.
MMX multiplication instructions pmull/ pmulh/pmadd. These should not be paired because there is a single MMX multiplication unit; see Listing Three.
The destination register of the first instruction. This should not match the source or destination register of the second instruction in the pair, except for certain movq instructions; see Listing Four.

Sometimes, when it seems that it is impossible to improve instruction pairing in a block of MMX code, multiple data stream processing or loop unrolling helps.

In Listing Five, which is an example of thresholding, each instruction depends on the results of previous instruction, and almost no MMX instruction pairing takes place. However, as Listing Six illustrates, you can improve pairing by processing two data streams in parallel. You can improve this code even further by rearranging instructions, where pairing is improved from 40 to 100 percent and code performance is improved 5/3 (1.7 times); see Listing Seven.

Multiple data stream processing has the effect of loop unrolling combined with aggressive instruction scheduling and reordering. For large loops, the branching overhead is virtually nonexistent because the correctly predicted branch takes only one cycle to execute. There is only one expensive mispredicted branch in such a loop, and it happens only in the last iteration.

Loop unrolling is a powerful technique in speeding up MMX code. However, excessive unrolling may result in a large code footprint and more instruction cache misses. Thus, the size of the inner loop should be kept in the 16-KB and 32-KB range for the Pentium II and K6-2, respectively.

The side effect of loop unrolling is that the processed data size or image height/ width (in bytes) must be a multiplicative of eight times the number of loop unrolls. If it is not possible to achieve such data granularity, then the edges must be processed separately.

Data alignment is also important for efficient data processing. Misaligned data incurs a 3-cycle penalty on the Pentium II and a 1-cycle penalty on the K6-2. Thus 32-bit data should be aligned on a 32-bit boundary, and 64-bit data on a 64-bit boundary. If misaligned data access crosses the cache line boundary, the penalty is even greater (6-9 cycles on the Pentium II).

Most compilers (including Visual C++) align data automatically. However, when some addresses happen to be misaligned, they can be corrected manually, as in Listing Eight. Misalignments of 64-bit data can occur when data arrays are declared as members of C++ classes/C-structures having the default 32-bit alignment.

Code alignment is also vital. Compilers automatically align function entry points and branch targets. However, the branch targets in inline assembly code may be misaligned, thus causing a 1-cycle penalty per branch on the Pentium II. To avoid this, the code should be analyzed (debugged or disassembled), and extra nop instructions should be used to align the branch; see Listing Nine.

Likewise, data arrangement plays a critical role in MMX code performance. In the case of horizontal and vertical 8-bit image decimation with averaging, for instance, each new column/row transforms into a sum of two original columns/rows. In column-wise processing, it is easy to add and summate 8 pixels at once; see Listing Ten.

A row-wise approach is more computationally expensive, however. Each 8-byte chunk will have to be unpacked, bytes rearranged so that bytes 0, 2, 4, 6 could be added with bytes 1, 3, 5, 7, and the result then packed back into an 8-byte chunk.

Another important code optimization issue is instruction scheduling. Though most instructions have a latency of 1 cycle, multiplication (pmul and pmadd) and memory referencing instructions have a latency of 3 cycles (3 cycles is a minimum cache hit latency for load instructions) on the Pentium II/III, and 2 cycles on the K6-2/K6-III. Thus, in order to avoid wasting latency cycles, the results of multiplication/load instructions should not be referenced immediately after the instruction is issued; see Listing Eleven.

Properly utilized latency cycles provide room for other instructions. Up to four such instructions can be executed on the Pentium II to fill latency cycles incurred by MMX multiply/load instructions.

Another performance issue comes from preemptive multitasking, which is commonly found in desktop operating systems. Real-time applications require immediate data processing and full CPU power.

However, the OS task scheduler may suspend the application while executing some other process. Windows 95/98/NT rely on preemptive multitasking and do not provide any means for CPU monopolization by high-level applications. It is possible to boost the priority of the currently executing time-critical thread by using the Win32 API calls:

SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);

SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);

These calls set thread priority to the highest value of 31. On NT, the thread blocks even system processes, including mouse input and disk buffer flushing. Thus, at the end of a time-critical section, the process and thread priorities should be downgraded to their normal values:

SetPriorityClass(GetCurrentProcess(), NORMAL_PRIORITY_CLASS);

SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_NORMAL);

Surprisingly, Windows 9x responds differently to the priority boost. System processes do not get suspended, and VMM activity eats up 30 percent of the CPU -- and this is without an active network or other tasks running.

You can write a VxD for Windows 9x that executes in ring 0 and blocks system processes for time-critical operations by calling Adjust_Execution_Time or Adjust_Exec_ Priority; see Listing Twelve. But Windows 9x can crash if you use MMX instructions in VxD (see How to Use Floating-Point or MMX Instructions in Ring 0 or a VxD under Windows 95, http://developer.intel .com/drg/mmx/appnotes/). Until I inserted the two statements in Listing Thirteen in front of an MMX code block in my device driver, it often crashed. These calls preserve and restore the FP state, which is ultimately violated by the execution of MMX instructions.

Performance Measurement

A quick and dirty way of measuring the performance of critical code is to measure clock ticks before and after the code section. To improve the accuracy of measurement, the code should run in a loop. Also, the application priority should be boosted to its maximum value (Listing Fourteen) for more accurate results. It helps to terminate all running tasks and services and wait a few minutes to let the OS settle (flush disk buffers, complete page file I/O, and so on).

The ideal running time is 1-2 seconds. Longer running times increase chances of preemption and introduce OS interference jitter. Because Windows 9x is less sensitive to priority changes, the performance measurements done under this OS are less accurate and can vary by 10-20 percent between runs.

Code Samples

Many of the code optimization techniques discussed here were developed in connection with the development of a real-time, PC-based ultrasound imaging system. Inside the imaging system, the data is constantly acquired by a specially designed PCI interface card, which notifies the CPU about the readiness of data by invocation of a hardware interrupt. An interrupt handler located in a device driver reads the data as fast as possible and initiates another data-acquisition cycle. The interrupt handler notifies a user-level GUI application by sending a message. The application must process and display acquired data as fast as possible to achieve maximum frame rate.

I initially wrote image processing functions in C and operated on 32-bit data. Because they were sluggish and could barely yield 2-3 frames per second on AMD's K6-2/350, I changed the data format to 8-bit and wrote all critical code using MMX instructions. This made a big difference -- frame rates soared to over 30 frames per second. In this section, I'll discuss some useful 8-bit signed image-processing functions and detail their implementation. The complete code that implements these techniques is available electronically; see "Resource Center," page 5.

90-Degree Image Rotation (Matrix Transposition)

Matrix transposition may seem trivial, but efficient MMX implementation of it is not obvious. A detailed discussion of 16-bit square matrix in-place transposition and 16-bit rectangular matrix transposition can be found in Using MMX Instructions to Transpose a Matrix (http://developer.intel .com/drg/mmx/appnotes/).

The basic idea is that a matrix is split on square blocks of the same size, and the transposition of each block results in the transposition of the whole matrix. For 8-bit matrix transposition, the size of each block is 8×8 bytes (Figure 1). The width of the block is selected to match MMX register size. The transposition is affected through the use of an MMX unpack operation (Figure 2). Because only one unpack operation per cycle can be executed, they were intermixed with other instructions to improve parallelism (code that illustrates this is available electronically). Matrix transposition cannot be done in place unless the matrix is square.

Vertical Decimation with Averaging

Vertical decimation by a factor of 2 with averaging reduces image size and produces new images in which each new row is a sum of two original image rows divided by 2. This process exhibits inherent parallelism (since eight 8-bit points can be added at once), and can be efficiently coded (code available electronically).

Summation requires conversion from 8-bit to 16-bit; otherwise, you encounter the effects of saturation. As mentioned in the previous section, 8-bit signed numbers are unpacked. The main loop is unrolled twice for better parallelism.

Horizontal Decimation

Horizontal decimation by a factor of 2 is a process of packing two 8-byte chunks together (Listing Fifteen). To do this, high bytes of each 16-bit chunk must be separated from low bytes, added together, and packed back into a single 8-byte value (code that does this is available electronically).

Horizontal Zoom

Horizontal zoom is part of a 2× image scaling algorithm described in Using MMX Instructions to Implement 2× 8-bit Image Scaling (http://developer.intel.com/drg/ mmx/appnotes/). Vertical zoom is essentially row duplication and its implementation has nothing to do with MMX.

Horizontal 2× and 4× zoom has some interesting points. First, both operations can be done in place. For this, the image has to be processed back to front; see Figure 3. In 2× zooming, as Listing Sixteen shows, each byte in an 8-byte chunk of the original image is duplicated using the unpack instruction (complete code is available electronically).

In 4×, zooming each byte in an 8-byte chunk of the original image is duplicated four times; see Listing Seventeen (the complete code is available electronically). Data is stored in reverse order (high bytes are written first) to match the reverse back-to-front processing direction.

Digital Image Filtering

Digital filtering is a key operation for image compression and feature detection. Digital filtering of an input sequence s_j can be expressed as a discrete convolution of the input sequence with the FIR filter coefficients f_i in Figure 4, where d_j is an output (filtered) sequence, N is the number of elements in the input sequence, and m the number of filter coefficients (taps). The convolution operation in Figure 4 is a mere sequence of multiply-accumulate operations (MAC). The number of MACs for each input sequence element is equivalent to the number of filter taps m. The total number MACs for the sequence is m×N. Considering the fact that a typical filter has over 16 taps, this operation is extremely computationally expensive.

Efficient MMX algorithms implementing FIR filtering can be found in Using MMX Technology Instructions to Compute a 16-Bit FIR Filter, Using MMX Instructions to Implement a Column Filter, and Using MMX Instructions to Implement a Row Filter (http://developer.intel .com/ drg/mmx/ appnotes/). The sequential image filtering (row filtering) algorithm should be optimized to avoid misaligned-data-access penalties. When targeting AMD K6-2/K6-III CPUs, however, this optimization may be unnecessary because a misaligned data access penalty is only one cycle.

Column filtering does not suffer from misaligned access because data is processed in 8-byte chunks and the chunks do not overlap (see Figure 5). Normally, row filters utilize pmadd instructions. However, column filters are faster to implement as a sequence of pmul and padd instructions.

When the source image consists of 8-bit values, the data must be unpacked for pmulhw/pmullw operations. To avoid the overhead of signed 8-bit value unpacking (that is, eliminate shift operations), the data can be unpacked into high-order bytes; see Listing Eighteen. This multiplies the source data by 256. If the filter coefficients can also be scaled by 256, then a pmulhw instruction can be used for multiplication and no further result scaling is needed (code that does this is available electronically).

Finally, pay attention to the pmul instruction scheduling. Although it is possible to schedule one such instruction every cycle, the results are available with 3-cycle delay. Thus, for optimal performance, the results should be referenced no sooner than 3 cycles after the pmul instruction is issued.

Proper scheduling of the pmul instructions in the original code resulted in a dramatic (20 percent) performance increase. No actual code changes were made except for the order of instructions.

Contrast Enhancement

Contrast enhancement is a useful operation in image processing (code available electronically). In general, contrast enhancement of a color pixel value is done according to the formula d_j=s_jv+c, where d_j is the resulting pixel value, s_j the original pixel value, and v and c are constants. (Assume in this example that the value of c is zero.)

Contrast enhancement of the image consisting of 8-bit grayscale signed values makes positive pixels brighter and negative pixels darker, while preserving the color of zero level.

The contrast enhancement routine utilizes the same trick as the column filter -- signed 8-bit values are unpacked into high-order bytes and the value of v is scaled by 256. MMX saturation prevents pixel value wraparound and effectively blocks sudden color changes.

Color Keying

Overlay transparency blitting is important for many graphics applications. In some cases, black (zero) may not be the best choice for transparent color. Thus, it might be useful to have an overlay blitting routine that accepts an arbitrary color key (that is, a value corresponding to transparent pixels).

As Listing Nineteen illustrates, the implementation of arbitrary color-key blitting is straightforward. The overlay pixels are compared against color-key values and the mask is calculated. The mask is then applied to the overlay to clear transparent pixels. The same mask is applied to destination pixels to clear everything except the overlay background area. Both pixels are then combined using a logical operation (code is available electronically). Efficient sprite overlay routine with a black color key can be found in Using MMX Instructions to Implement 2D Sprite Overlay (http://developer.intel.com/drg/ mmx/appnotes/).

Conclusion

MMX code generation and optimization is a complex and time-consuming process requiring an understanding of the processor architecture and the specifics of different processor families. In brief, MMX code generation and optimization steps can be summarized as follows:

1. Determine minimal suitable precision for the processed values and the corresponding packed data type for SIMD processing.

2. Arrange data in the best way for SIMD processing (SoA/AoS, row-wise, and column-wise arrangements).

3. Produce straightforward MMX code.

4. Unroll loops and reorder/pair instructions to improve parallelism.

5. Schedule instructions to avoid latency cycle wasting.

6. Use an optimization tool (such as Intel's VTune Analyzer, http://developer.intel .com/vtune/) to fine tune the code.

Where applications require fast but not necessarily time-critical processing, consider using Intel's Performance Library Suite (http://developer.intel.com/vtune/perflibst/), which offers a variety of image processing, mathematical, primitive recognition, and DSP functions -- and can be freely downloaded.

DDJ

Listing One

movq    mm0,[esi]   ; these two instructions won't execute in same cycle
add     eax,ebx
movq    mm0,mm1     ; these two would
add     eax,ebx