Optimizing Matrix Math on the Pentium



Hand-tuning using C and assembler

Harlan W. Stockman

Harlan, who earned his doctorate in geochemistry at MIT, works in the geochemistry department at Sandia National Labs. He can be contacted at [email protected].


In many applications, matrix math is the rate-limiting step. With the 486, programmers optimized matrix math by minimizing the number of instructions, putting little emphasis on the flow of data onto or off the CPU, since floating-point addition and multiplication were two to ten times slower than loads and stores.

Times have changed. The Pentium can add or multiply floating-point numbers in one or two clock cycles; cache misses and data-induced pipeline stalls are now limiting factors in numeric performance. Pentium programmers must embrace new concepts, familiar to RISC programmers, to achieve the highest speed. Yet the Pentium retains a distinctly non-RISC architecture--the comparatively small number of registers and the stack structure are hard for compilers to handle. On the other hand, it is extremely easy to improve Pentium performance with tiny fragments of assembly language; the performance gains are typically much larger than you get with a 486 and hint at the power of future, Pentium-optimized compilers.

This article examines methods to speed up matrix operations on the Pentium and explores some of the common lore about optimizing C and assembly code. I'll begin with a matrix-multiplication example, then move to the LINPACK routines for solving simultaneous linear equations. Some of these methods have been discussed elsewhere [Ross, 1993], but those discussions tend to be qualitative. My intent is to provide a quantitative assessment of each method.

I used three machines for my testing: a 60-MHz Pentium (Gateway P5-60 with 256 Kbytes of write-through L2 cache); a 33-MHz 486DX (Gateway 486/33C with 64 Kbytes of write-through L2 cache); and a 100-MHz MIPS R4000 (SGI Elan with 1 Mbyte of write-back L2 cache). The Pentium code was compiled with Symantec C++ 6.0 for DOS, using the -mx -5 -o -f switches. This compiler aligns data on 8-byte boundaries, but otherwise doesn't perform Pentium-specific optimizations for the floating-point unit (FPU). I compiled the R4000 code with SGI's UNIX cc version 3.1 under IRIX 4.0.5H, using the -O3 -mips2 switches; together these turn on all optimizations, including loop unrolling and global register allocation, and allow 64-bit loads and stores.

Matrix Multiplication: The Cache Counts

Table 1 compares the times required to multiply two double-precision matrices, a[][] and b[][], to form a product matrix c[][], using five different algorithms. Example 1, adapted from Mark Smotherman's mm.c benchmark, shows the first three (and simplest) of these methods. By compiling with the -DASMLOOP switch, you can replace the standard inner loop with one of the handcoded assembly routines discussed shortly. Despite the simplicity of the operation, there is nearly a twelve-fold variation in speed for the Pentium, and a factor of 8.2 for the R4000. The slowest algorithm, by far, is the "normal" method; see Example 1(a). Simple variations on the normal method, such as transposing the b[][] matrix or switching the loop order, as in Examples 1(b) and 1(c), cause a two- to eight-fold increase in speed.

To understand why the normal method is so bad, consider how the processor fills the L1 data cache. On the Pentium, the L1 cache is divided into 128 two-line sets, each line containing 32 bytes (or four double-precision numbers). Each time a matrix element is read from memory, the processor checks if that datum is available in one of the lines; if not, the element and its three neighbors are read into a cache line. Thus, if your program tends to use data in consecutive memory locations, the processor will generally find the datum it needs in the cache, and won't have to look in the much slower main memory. However, the number of lines is limited, so if a program reads elements from widely separated places in memory, it will typically overwrite cache lines before they are reused.
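
To make the loop-order effect concrete, the sketch below (mine, not part of the mm.c benchmark) sums one row and one column of a 500x500 double array. The row walk touches consecutive 8-byte doubles, so each 32-byte line fill serves four loads; the column walk skips 8N = 4000 bytes between loads, so nearly every load forces a new line fill.

#define N 500
double b[N][N];

/* Cache-friendly: consecutive doubles share a cache line. */
double sum_row(int i){
   int j;
   double s = 0.0;
   for (j=0;j<N;j++){
      s += b[i][j];        /* stride = 8 bytes */
   }
   return s;
}

/* Cache-hostile: successive loads are 8*N bytes apart. */
double sum_col(int j){
   int i;
   double s = 0.0;
   for (i=0;i<N;i++){
      s += b[i][j];        /* stride = 4000 bytes */
   }
   return s;
}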

Now you can see why the normal method (see Figure 1) is so hard on the cache. The inner loop forms the dot product, where the bkj elements are accessed sequentially along the jth column and each element is separated by 8N bytes (N being the number of elements in a row). Each time you read a bkj element, the cache tries to read its three neighbors, assuming they will soon be used. However, these neighbors are in the next three columns, and the program won't try to fetch them until it cycles through the next outer loop--by then, the cache line will have been overwritten. Translating the inner loop into assembly language (the third entry in Table 1) has little effect, since the time taken to read data into the CPU far overshadows the time for addition and multiplication. In contrast, the inner loops of Examples 1(b) and 1(c) access the a[][], b[][], and c[][] arrays along rows and use the cache much more efficiently.

There are two lessons to be learned from the matrix-multiplication example. First, write C routines so that the innermost loops operate on rows whenever possible, or at least reuse data as much as you can while it is in the cache. Second, before you spend a lot of time writing assembly language, make sure the overall algorithm is cache-efficient.

While searching for the most efficient C algorithm, be aware that many methods are implicitly optimized for RISC chips and may not be suited for the Pentium. An example is the warner() method (lines 11 and 12 of Table 1). In the pure C version, this algorithm does not substantially improve Pentium performance over the simpler reg_loops() method. However, the R4000 speed is greatly improved by warner(). The latter algorithm copies array elements into numerous temporary variables to keep the program from constantly recalculating addresses in the innermost loop; this approach works well on a RISC machine, since the compiler will keep these variables in registers. The Pentium has far fewer registers than the R4000, so the C compiler tends to keep the temporary variables in memory; thus the Pentium sees little benefit from the warner() method unless one uses hand-optimized assembly language for the inner loop, carefully keeping the temporary variables on the FPU stack.
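
To give a feel for the technique (a sketch of the style in my own words, not the actual warner() source, which ships with mm.c), the multiplication can be recoded with explicit temporaries; unrolling the j loop by four also lets each pass consume an entire cache line of b[][]:

/* warner()-style temporaries; a, b, c, and N as in Example 1.
   Assumes N is divisible by 4. */
void warner_style(){
   int i,j,k;
   double t0,t1,t2,t3,aik;
   for (i=0;i<N;i++){
      for (j=0;j<N;j+=4){
         t0 = t1 = t2 = t3 = 0.0;
         for (k=0;k<N;k++){
            aik = a[i][k];       /* fetched and addressed once */
            t0 += aik*b[k][j];
            t1 += aik*b[k][j+1];
            t2 += aik*b[k][j+2];
            t3 += aik*b[k][j+3];
         }
         c[i][j]   = t0;
         c[i][j+1] = t1;
         c[i][j+2] = t2;
         c[i][j+3] = t3;
      }
   }
}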

Which Array Declaration is Best?

It is generally accepted [Press et al., 1992] that 2-D arrays are more efficient when declared as pointers to pointers to double--that is, as double **a instead of double a[500][500]. The latter declaration ostensibly requires a multiplication by the row length to calculate the address of an element; since integer multiplications are comparatively slow (10 or 11 clock cycles on a Pentium or R4000), it is reasonable to assume that the a[500][500] declaration will be less efficient. However, lines 2 and 5 in Table 1 tell a different story. For the "normal" algorithm, the **a type declaration produces substantially slower code on both the Pentium and R4000.

To find the source of this discrepancy, I disassembled the object code produced by the R4000 and Pentium compilers. For both processors, the object code for normal() contains 1.5 times as many loads from memory when the arrays are declared **a instead of a[500][500], since the a[], b[], and c[] pointers must also be read from memory. But equally important, with the a[500][500]-type declaration both compilers managed to generate inner loops free of integer multiplication. Both the Pentium and SGI compilers effectively transformed the innermost loop of normal() into Example 2, which requires no integer multiplication to calculate successive b[k][j] addresses.
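
The extra indirection is easy to see in a minimal sketch (mine, for illustration):

/* With double**, the row pointer itself must first be loaded
   from memory before the element can be fetched. */
double get_pp(double **b, int k, int j){
   return b[k][j];       /* load b[k], then load b[k][j] */
}

/* With a fixed-size array, the address is pure arithmetic,
   which the compiler strength-reduces as in Example 2. */
double get_2d(double b[500][500], int k, int j){
   return b[k][j];       /* address = b + (500*k + j)*8 bytes */
}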

Optimizing With ASM

To eke more performance out of the Pentium, you can replace the inner loops with two simple assembly-language routines. The function ddot() (Listing One) returns the dot product of the vectors x and y; and daxpy() (Listing Two) forms the vector sum a·x+y, where a is a scalar, replacing the old y values with the new. Both ddot() and daxpy() process n elements in an unrolled loop of n/4 iterations, relying on a cleanup block to process the last (n mod 4) elements. These functions have wide application and are at the core of the LINPACK linear-algebra package.
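
For reference, here is what the two routines compute, written as straightforward C (my paraphrase; the tuned assembly appears in the listings). Like Listing Two, this daxpy() returns immediately when the scalar is zero:

double ddot(int n, double *x, double *y){
   int i;
   double sum = 0.0;
   for (i=0;i<n;i++){
      sum += x[i]*y[i];
   }
   return sum;
}

void daxpy(int n, double *aptr, double *x, double *y){
   int i;
   double a = *aptr;
   if (a == 0.0) return;     /* matches the +/-0.0 test in Listing Two */
   for (i=0;i<n;i++){
      y[i] += a*x[i];
   }
}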

From the assembly-language programmer's perspective, the Pentium FPU looks much like its 80x87 predecessors. There are eight 80-bit registers (st0 through st7) arranged in a stack; for most instructions, one of the operands must be st0, which leads to a "top-of-stack" bottleneck. However, there are several significant differences between the Pentium and the 486 and 386/387. First, with the Pentium it is much more important to avoid "data collisions"--that is, successive FPU operations that reference the same location in memory and stall the pipeline. Second, the fxch instruction--which switches the contents of two registers--can often be executed in parallel with another FPU instruction, making it essentially clockless. Careful use of fxch can make the Pentium FPU stack look much like a set of general registers and removes the top-of-stack bottleneck. Perhaps more important, fxch can be used to avoid data collisions. However, lacing code with abundant fxch instructions will slow down the 387 and 486, so you must be judicious. Third, while it is still possible to achieve simultaneous execution of integer and floating-point instructions on the Pentium, the value of this trick is diminished. FPU and integer instructions share the first three stages of the U and V pipelines, then feed into separate execution units. I found that interleaving integer instructions with fadd, fmul, and fst generally slowed the code by a few percent, since the smooth flow of the 8-stage FPU pipeline was disrupted.

To illustrate the importance of assembly-level optimization, let's examine the object code produced by the compiler. Example 3 is C code for the inner loop of daxpy() and the corresponding assembly language produced by Symantec C++ 6.0. This compiler-produced ASM is pretty good, but it has several inefficiencies. The calculation of addresses is complicated; the compiler doesn't seem to notice the simple relationship among the addresses of x[i], x[i+1], and so on, so offsets are stored on the memory stack (referenced by ebp) and reloaded into registers with each iteration of the loop. The value of a (in a·xi+yi) is also kept in memory, even though it could easily be stored in one of the six free FPU registers. Equally important are the pairs of fadd, fstp instructions that set up a "data collision" that stalls the pipeline.

Example 4 shows three different ways to recode the loop of daxpy(), from the least efficient, (a), to the most efficient, (c). (Example 5 is a schematic of the FPU stack, so you can track the register contents after each operation.) All three algorithms keep a on the FPU stack and have simple addressing, which results in substantial savings on the Pentium. However, Example 4(a) retains adjacent instructions like fadd qword ptr [ebx] and fstp qword ptr [ebx]. Since each instruction references the same address in memory, data collisions stall the pipeline. The function in Example 4(b) removes three of the collisions by accumulating four a·xi+yi results on the stack, then successively popping and storing the results. Example 4(c) removes the last collision with a single fxch instruction.

Table 2 gives the time to perform 100 million floating-point operations with each version of daxpy(), for both the Pentium and the 33-MHz 486. For reference, the first line of Table 2 gives the speed of a pure C version of the code. On the Pentium, the speedup from Example 4(a) to 4(b) is 17 to 26 percent, with an additional 6 to 9 percent from Example 4(b) to 4(c). The most efficient Pentium algorithm is 1.75 to 2.14 times as fast as the optimized, compiled C code.

Table 2 also shows several dramatic differences between the Pentium and the 486. First of all, the advantage of assembly language over compiled C code is much smaller for the 486; the fastest ASM routine is only 1.10 to 1.28 times faster than the C code. Second, there is very little difference in speed--less than 3 percent--among the various ASM routines. The data collisions have almost no effect on 486 speed, and adding the extra fxch slows the 486 algorithm by a mere 1.6 to 2.7 percent. Lastly, with optimized code the Pentium is 7.4 to 9.0 times faster than the 486/33; normalized to the same clock speed, the Pentium is 4.1 to 5.0 times faster.

Note that daxpy() is similar in purpose to the da() function given by Subramaniam and Kundargi ("Programming the Pentium Processor," DDJ, June 1993). The code used in da() is easier to generate with a compiler, but requires four times as many fxch instructions to avoid data collisions, penalizing the 486 by an additional 7 percent.

Returning to Table 1, you see that the assembly-language routines speed up matrix multiplication by a factor of 1.6 to 2.66 on the Pentium. In fact, these simple ASM routines allow the Pentium to match or surpass the 100-MHz R4000, while carrying an extra 16 bits of precision. The comparison is not completely fair, since I've not hand-optimized the R4000 code. However, it is generally not wise to hand-tune R4000 code on an SGI. First of all, SGI's compilers are extremely good; I disassembled the R4000 code for daxpy() and several other matrix routines and was hard pressed to find any improvements. Second, SGI's assembler does a lot of instruction shuffling to take advantage of load-delay slots. The instruction sequence in the object code can be quite different from the sequence specified in your assembly-language source, and hand optimizations may make the code run slower. But most important, SGI's global register allocation (the -O3 compiler switch) makes assembly tuning almost impossible, since you can't specify registers at will; individual program modules are compiled to an intermediate ucode, and the linker decides how to allocate registers only after examining all the modules. The global-allocation scheme is a key to the R4000's excellent performance; the next-lower optimization level (-O2) generated code up to 40 percent slower.

LINPACK

LINPACK is an extensive linear-algebra package for manipulating matrices and solving linear systems [Dongarra et al., 1979]. The code was originally written in Fortran, but portions have been translated into C, as in the clinpack benchmark. Among the most widely used routines in LINPACK are dgefa() and dgesl(), which can be combined to solve a linear system A·x=y by the LU decomposition method, where the vector x is the "solution." This method first factors the matrix A into lower and upper triangular matrices L and U, such that A=L·U. Thus (L·U)·x=y, or L·(U·x)=y. We define U·x=y*, where y* is a column vector, and solve L·y*=y for y*; since L is triangular, the solution follows very simply from forward substitution. We then solve U·x=y* for x by back substitution. In LINPACK, dgefa() factors the matrix A into L and U, and dgesl() solves the system by substitution. Typically, dgefa() takes the lion's share of computation time.
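
The two substitution passes are short enough to sketch in C (my simplified version; the real dgesl() also applies the row pivoting recorded by dgefa(), omitted here, and L is taken to have a unit diagonal):

#define N 100

/* Solve L*ystar = y by forward substitution. */
void forward_sub(double l[N][N], double *ystar, double *y){
   int i,k;
   for (i=0;i<N;i++){
      ystar[i] = y[i];
      for (k=0;k<i;k++){
         ystar[i] -= l[i][k]*ystar[k];
      }
   }
}

/* Solve U*x = ystar by back substitution. */
void back_sub(double u[N][N], double *x, double *ystar){
   int i,k;
   for (i=N-1;i>=0;i--){
      x[i] = ystar[i];
      for (k=i+1;k<N;k++){
         x[i] -= u[i][k]*x[k];
      }
      x[i] /= u[i][i];
   }
}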

Though the LU method may seem indirect and complicated, it is actually quite fast and robust; moreover, it can be written so that the innermost loops perform operations purely on rows, minimizing the cache thrashing we discussed in the matrix-multiplication example. In fact, the inner loops of dgesl() and dgefa() call the same daxpy() and ddot() routines discussed in the matrix-multiplication example. I use the LU method extensively, and the matrix A commonly contains at least 300x300 double-precision elements and is solved up to a thousand times in a single program. Thus, the speed of these routines is very important to me.

Table 3 shows results for the clinpack.c benchmark, which calls dgesl() and dgefa() repeatedly to solve several sets of 100x100 linear systems, using double precision. The matrices are embedded in larger (200x200) arrays to ensure the cache is exercised. The standard benchmark runs ten iterations; I found it necessary to use 100 iterations, due to the poor resolution of the PC system clock. The assembly-language versions of daxpy() and ddot() were used for the first line of the table; the second line is for the pure C algorithm. The third line gives results of the Fortran version of the benchmark (Lahey compiler, 486 optimizations), and the fourth, a C-compiled, f2c translation of the original Fortran benchmark. (F2c is a public-domain program to translate Fortran source into C [Feldman et al., 1992]; coupled with a good C compiler, it is widely regarded as an inexpensive substitute for a real Fortran compiler.)

Table 3 has several significant points. First, with a little ASM, the 60-MHz Pentium surpasses a 100-MHz R4000. Second, the ASM routines once again have far more of an effect on the Pentium than on the 486. Third, don't expect compiled f2c translations to match the quality of a real Fortran compiler or well-optimized, native C code. The f2c translation is perhaps acceptable for the 486, but on the Pentium it is nearly three times slower than the best C/ASM code.

The problem with f2c lies partly in the way it maps arrays. A 2-D Fortran array a(i,k) is translated to a 1-D C array, each element referenced as a[i+N*k], where N is the number of elements in a Fortran column, and i and k are adjusted for the Fortran indexing offset. Consequently, the translated code is peppered with integer multiplications, and most C compilers can't optimize them away. On a 486, integer multiplication is roughly the same speed as a floating-point operation, but on the Pentium, an imul takes 10 or 11 clock ticks, compared to just 1 or 2 for fadd or fmul. Thus, on the Pentium, the f2c-translated code can spend much more time using integer multiplication to calculate addresses than it spends actually performing floating-point operations.
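
A one-function sketch (hypothetical, for illustration) shows where the multiplications come from. The Fortran element a(i,k), from an array whose columns hold ncol elements, becomes:

/* f2c-style access: the ncol*(k-1) term costs an imul on every
   reference unless the compiler can strength-reduce it. */
double fortran_get(double *a, int ncol, int i, int k){
   return a[(i-1) + ncol*(k-1)];
}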

While LINPACK is still in wide use, it has a very powerful successor, called LAPACK [Anderson, 1992]. The LAPACK routines carry cache optimization even further, with sophisticated "blocking" algorithms that divide large matrices into chunks small enough to fit in the L1 and L2 caches. Like LINPACK, LAPACK uses basic linear-algebra subprograms (or BLAS) to perform low-level matrix operations. However, the LAPACK BLAS are much more complicated than the simple daxpy() and ddot() routines described earlier. Some workstation manufacturers include hand-coded LAPACK BLAS in their compiler libraries, offering dramatic improvements over simple, compiled C and Fortran code.
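
In outline, the blocking idea looks like the sketch below (my illustration, in the spirit of the tiling() entry in Table 1, not actual LAPACK code). The tile edge B is a tuning parameter, chosen so three BxB tiles of doubles fit in the target cache; c[][] is assumed to be zeroed beforehand:

#define B 20     /* three 20x20 tiles of doubles = 9600 bytes */

void blocked_mm(int n, double a[][500], double b[][500],
                double c[][500]){
   int i0,j0,k0,i,j,k;
   double aik;
   for (i0=0;i0<n;i0+=B){
      for (k0=0;k0<n;k0+=B){
         for (j0=0;j0<n;j0+=B){
            /* multiply one tile; it stays cache-resident */
            for (i=i0;i<i0+B && i<n;i++){
               for (k=k0;k<k0+B && k<n;k++){
                  aik = a[i][k];
                  for (j=j0;j<j0+B && j<n;j++){
                     c[i][j] += aik*b[k][j];
                  }
               }
            }
         }
      }
   }
}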

Conclusions

The Pentium is not just a fast 486. The design of the Pentium pipeline and the radical changes in the relative speeds of floating-point and integer operations require much more attention to the flow of data on and off the FPU. Currently, the two best ways to speed Pentium matrix operations are to use cache-efficient C or Fortran code and to handcode the inner loops to simplify address calculations and avoid pipeline stalls.

It is obvious from Table 1 and Table 3 that pure C code can be 1.3 to 2.5 times faster on the 100-MHz R4000 than on the 60-MHz Pentium. After disassembling object code from the R4000 and Pentium, I concluded that much of this speed difference was due to the sophistication of SGI's R4000 compiler, rather than a hardware advantage. The SGI compilers unroll loops, minimize the time lost in address calculations, and reshuffle instructions to avoid load delays and data collisions. The simple hand optimizations discussed in this article mimic those performed by the SGI compiler and allow the Pentium to match or surpass R4000 performance. Let's hope that as Pentium compilers mature, the hand optimizations will be less and less necessary.

Availability

The complete mm.c and LINPACK benchmarks are too large to include in this article. However, they can be obtained from many ftp sites around the world (including ftp.nosc.mil). They can be located via archie servers. The f2c translator is available from netlib.att.com, and from many bulletin boards. The mm.c benchmark was developed by Mark Smotherman, and contains contributions from numerous programmers. In particular, the reg_loops() and tiling() algorithms were developed by Monica Lam, and the warner() method was modified from an algorithm by Dan Warner.

References

Anderson, E., et al. LAPACK Users' Guide. SIAM (1992).

Dongarra, J.J., J.R. Bunch, C.B. Moler, and G.W. Stewart. LINPACK Users' Guide. SIAM (1979).

Feldman, S.I., D.M. Gay, M.W. Maimone, and N.L. Schryer. A Fortran-to-C Converter. Computing Science Technical Report No. 149, AT&T Bell Laboratories, Murray Hill, NJ (1992).

Press, W.H., S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery. Numerical Recipes in C. Cambridge, U.K.: Cambridge University Press (1992).

Ross, J.W. "Calling C Functions with Variably Dimensioned Arrays." Dr. Dobb's Journal (August 1993).

Subramaniam, R. and K. Kundargi. "Programming the Pentium Processor." Dr. Dobb's Journal (June 1993).

Figure 1: The dot product in normal().

Table 1: Time in seconds to multiply two double-precision 500x500 matrices.

===========================================================================
                                          Pentium      R4000
                                          60 MHz       100 MHz
===========================================================================
    normal()                              145.7        72.34
    normal(), **a                         186.2        128.6
    normal(), asm inner loop              121.6         --
    transpose()                            58.66        29.87
    transpose(), **a                       61.08        30.61
    transpose(), asm ddot() inner loop     23.07        --
    reg_loops()                            43.44        34.82
    reg_loops(), asm daxpy() inner loop    26.60        --
    tiling()                               53.28        36.18
    tiling(), asm daxpy() inner loop       22.91        --
    warner()                               42.89        15.67
    warner(), asm inner loop               16.09        --
===========================================================================

Example 1: Matrix multiplication.

(a)

void normal(){
   int i,j,k;
   for (i=0;i<N;i++){
      for (j=0;j<N;j++){
         c[i][j] = 0.0;
         for (k=0;k<N;k++){
            c[i][j] += a[i][k]*b[k][j];
         }
      }
   }
}

(b)

void transpose(){
   int i,j,k;
   double temp;
   for (i=0;i<N;i++){
      for (j=0;j<N;j++){
         bt[j][i] = b[i][j];
      }
   }
   for (i=0;i<N;i++){
      for (j=0;j<N;j++){
#ifndef ASMLOOP
         temp = a[i][0]*bt[j][0];
         for (k=1;k<N;k++){
            temp += a[i][k]*bt[j][k];
         }
         c[i][j] = temp;
#else
         c[i][j] = ddot(N, a[i], bt[j]);
#endif
      }
   }
}

(c)

void reg_loops(){
   int i,j,k;
   double a_entry;
   for (i=0;i<N;i++){
      for (j=0;j<N;j++){
         c[i][j] = 0.0;
      }
   }
   for (i=0;i<N;i++){
      for (k=0;k<N;k++){
#ifndef ASMLOOP
         a_entry = a[i][k];
         for (j=0;j<N;j++){
            c[i][j] += a_entry*b[k][j];
         }
#else
         daxpy(N, a[i]+k, b[k], c[i]);
#endif
      }
   }
}

Example 2: Inner loop of normal() after being transformed by both the Pentium and SGI compilers.

aptr = a[i];
cptr = c[i] + j;
bptr = b[0] + j;
for(k=0; k<500; k++, bptr+=500){
    *cptr += *(aptr + k) * *bptr;
}


Table 2: Time in seconds to perform 100 million operations with daxpy(). N is the number of elements per row.

                           Pentium, 60 MHz      486, 33 MHz
                           N=15     N=500     N=15      N=500
-------------------------------------------------------------
Compiled C                 18.13    11.26     80.12     63.42
ASM, Example 4(a)          10.49     9.12     62.82     58.14
ASM, Example 4(b)           8.95     7.25     62.87     57.87
ASM, Example 4(c)           8.46     6.65     64.25     59.41
ASM, Example 4(c),          8.90     6.43     63.64     58.64
   loops unrolled by 8

Table 3: LINPACK benchmarks (MFLOPS).

                                  Pentium      486       R4000
                                  60 MHz      33 MHz    100 MHz
---------------------------------------------------------------
clinpack.c/asm                    10.59        1.64       --
clinpack.c                         6.35        1.44      9.22
linpack.for (Lahey FORTRAN)        6.83        1.43       --
f2c translation of linpack.for     3.57        1.09       --

Example 3: (a) Main loop of daxpy() in C; (b) corresponding assembly language produced by compiler.

(a)

for (i=m; i<n; i=i+4) {
    y[i]   = y[i]   + a*x[i];
    y[i+1] = y[i+1] + a*x[i+1];
    y[i+2] = y[i+2] + a*x[i+2];
    y[i+3] = y[i+3] + a*x[i+3];
}

(b)

L124: mov   ECX, EBX
      sub   ECX, ESI
      mov   EAX, 014h[EBP]
      fld   qword ptr [EAX][ECX]
      fmul  qword ptr 0Ch[EBP]
      mov   EAX, 01Ch[EBP]
      fadd  qword ptr [EAX][ECX]
      fstp  qword ptr [EAX][ECX]
      fld   qword ptr [EDX][ECX]
      fmul  qword ptr 0Ch[EBP]
      mov   EAX, 010h[EBP]
      fadd  qword ptr [EAX][ECX]
      fstp  qword ptr [EAX][ECX]
      wait
      mov   EAX, 0Ch[EBP]
      fld   qword ptr [EAX][ECX]
      fmul  qword ptr 0Ch[EBP]
      mov   EAX, 8[EBP]
      fadd  qword ptr [EAX][ECX]
      fstp  qword ptr [EAX][ECX]
      fld   qword ptr [EBX]
      fmul  qword ptr 0Ch[EBP]
      mov   EAX, 4[EBP]
      fadd  qword ptr [EAX][ECX]
      fstp  qword ptr [EAX][ECX]
      wait
      add   EBX, 020h
      add   EDI, 4
      cmp   EDI, 8[EBP]
      jl    L124

Example 4: Three ways to code daxpy() for the Pentium, from least efficient (a) to most efficient (c).


Example 5: (a) FPU stack for Example 4(a)--at top of loop, a is already loaded onto stack; (b) FPU stack for Example 4(c)--at top of loop, a is already loaded onto stack, skipping several instructions near bottom of loop.

[LISTING ONE]



          .386P
          .model    small
.data
ALIGN 4
dstorage DQ 0.0
.code
;******************* ddot() *********************
; double ddot(int n, double *xptr, double *yptr)
;   ..forms the dot product of two row vectors..
; RETURNS product in edx:eax

          public    _ddot
_ddot     proc
          push ebx
          ;---STACK:---
;+-------+------------+-------+--------+--------+
;|  ebx  |  ret addr  |  n    |  xptr  |  yptr  |
;^esp    ^esp+4       ^esp+8  ^esp+12  ^esp+16
;-----------------------------------------------+
          mov ecx, dword ptr [esp+8]
          test ecx, ecx  ;<= 0 iterations ?
          jle badboy
          mov eax, dword ptr [esp+12]   ;eax = xptr
          mov ebx, dword ptr [esp+16]   ;ebx = yptr
          fldz           ;initialize accumulator..
          ;---determine length of cleanup, main loop iterations---
          mov edx, ecx
          and edx, 3     ;edx is length of cleanup...
          shr ecx, 2     ;loops unrolled by 4, so adjust counter...
          jz cleanup1
          ;=======loop1=======
   loop1: fld qword ptr [eax]
          fmul qword ptr [ebx]
          fadd
          fld qword ptr [eax+8]
          fmul qword ptr [ebx+8]
          fadd
          fld qword ptr [eax+16]
          fmul qword ptr [ebx+16]
          fadd
          fld qword ptr [eax+24]
          fmul qword ptr [ebx+24]
          fadd
          add eax,32
          add ebx,32
          dec ecx        ;faster than "loop" on pentium...
          jnz loop1
          ;=====END loop1=====
cleanup1: or edx,edx
          jz store1
          fld qword ptr [eax]
          fmul qword ptr [ebx]
          fadd
          dec edx
          jz store1
          fld qword ptr [eax+8]
          fmul qword ptr [ebx+8]
          fadd
          dec edx
          jz store1
          fld qword ptr [eax+16]
          fmul qword ptr [ebx+16]
          fadd
          ;----store result-------
  store1: fstp dstorage  ;Zortech expects to see result in edx:eax...
          fwait          ;Needed for 387...
          mov eax, dword ptr dstorage
          mov edx, dword ptr dstorage+4
          ;-------
          pop ebx
          ret
  badboy: xor eax,eax
          xor edx,edx
          pop ebx
          ret
_ddot     endp


[LISTING TWO]



;******************* daxpy() ********************
; void daxpy(int n, double *aptr, double *xptr, double *yptr)
;   ..forms the sum of a*x[i] + y[i], and stores in y[]..
; RETURNS nothing.
          public    _daxpy
_daxpy    proc
          push ebx
          ;---STACK:---
;+--------+------------+--------+--------+--------+--------+
;|  ebx   |  ret addr  |  n     |  aptr  |  xptr  |  yptr  |
;^esp     ^esp+4       ^esp+8   ^esp+12  ^esp+16  ^esp+20
;-----------------------------------------------------------+
          mov ecx, dword ptr [esp+8]
          test ecx, ecx                 ;<=0 iterations ?
          jle badboy5
          ;---load *aptr onto fp stack
          mov eax, dword ptr [esp+12]   ;address of multiplier (aptr)..
          ;---test if *aptr is positive or negative 0.0---
          mov edx, dword ptr [eax+4]    ;upper dword of *aptr..
          and edx, 01111111111111111111111111111111B ;mask off sign bit
          or edx, dword ptr [eax]
          jz badboy5
          ;---load *aptr onto stack if not 0.0---
          fld qword ptr [eax]           ;multiplier now in ST(0)..
          mov eax, dword ptr [esp+16]
          mov ebx, dword ptr [esp+20]
          ;---determine length of cleanup, main loop iterations---
          mov edx, ecx
          and edx, 3     ;edx is length of cleanup...
          shr ecx, 2     ;loops unrolled by 4, so adjust counter...
          jz cleanup5
          ;=======loop5=======
   loop5: fld qword ptr [eax]
          fmul st,st(1)
          fadd qword ptr [ebx]
          ;---next element---
          fld qword ptr [eax+8]
          fmul st,st(2)
          fadd qword ptr [ebx+8]
          ;---next element---
          fld qword ptr [eax+16]
          fmul st,st(3)
          fadd qword ptr [ebx+16]
          ;---next element---
          fld qword ptr [eax+24]
          fmul st,st(4)
          fadd qword ptr [ebx+24]
          ;---store new y[]'s, clean stack---
          fxch st(2)     ; !!! Avoid data collision !!!
          fstp qword ptr [ebx+8]
          fstp qword ptr [ebx+16]
          fstp qword ptr [ebx+24]
          fstp qword ptr [ebx]
          add eax,32
          add ebx,32
          dec ecx        ;faster than "loop" on pentium,
          jnz loop5      ;due to instruction pairing...
          ;=====END loop5=====
cleanup5: or edx,edx
          jz stckcln5
          ;---1st cleanup element---
          fld qword ptr [eax]
          fmul st,st(1)
          fadd qword ptr [ebx]
          fstp qword ptr [ebx]
          dec edx
          jz stckcln5
          ;---next element---
          fld qword ptr [eax+8]
          fmul st,st(1)
          fadd qword ptr [ebx+8]
          fstp qword ptr [ebx+8]
          dec edx
          jz stckcln5
          ;---next element---

          fld qword ptr [eax+16]
          fmul st,st(1)
          fadd qword ptr [ebx+16]
          fstp qword ptr [ebx+16]
          ;---------------------
stckcln5: fstp st(0)     ;must clean *aptr off stack
 badboy5: pop ebx
          ret
_daxpy    endp


Copyright © 1994, Dr. Dobb's Journal

