Parallel

Getting Numeric Coprocessors Up to Speed

By John H. Letcher, May 01, 1991

Memory-mapped math coprocessors can boost performance without sacrificing compatibility.

MAY91: GETTING NUMERIC COPROCESSORS UP TO SPEED

John is a professor of Computer Sciences at the University of Tulsa, 600 South College Ave., Tulsa, Oklahoma 74014.

Over the past decade, a revolution has occurred in numeric processing, as coprocessors have usurped many of the CPU's number-crunching chores. This article describes these trends and shows how you can take advantage of these developments, using as examples floating point (FP) numeric coprocessors that support the 80x86 integer CPU.

As the 80x86 family (8088, 8086, 80188, 80186, 80286, and 80386) evolved, each generation became successively faster and more potent than its predecessor. And even as each CPU model continued to obey the instructions of its predecessor, new instructions were added to the repertoire. This led to unavoidable conflicts between the integer CPUs and their numeric coprocessors, ultimately requiring that new coprocessors be developed for each CPU. This pattern persisted until the 80486, which incorporates the FP unit as part of its onboard circuitry. A parallel scenario has occurred with Motorola's 680x0 family.

More Speed, More Power

It's no secret that historically, floating point instructions have executed more slowly than their integer counterparts: A given integer multiply or divide, for instance, executes much faster than the corresponding FP instruction.

It's less well known, however, that with each successive processor/coprocessor generation, the time required to perform an FP multiply has moved closer to its integer counterpart. For example, Table 1 lists the execution time for selected instructions on a variety of Intel and Cyrix processors/coprocessors.

Table 1: Execution time (in clock cycles) for selected instructions on a variety of Intel and Cyrix processors/coprocessors.

  CPU       IMUL<reg>               Coprocessor   FMUL ST(O), ST(I)
            Instruction(in clocks)                Instruction (in clocks)
  -----------------------------------------------------------------------

  8088           128-154               8087             130-145
  80188           34-37                8087             130-145
  80296            21                  80287            130-145
  80386           9-22                 80387               52
  80386           9-22                 83D87               19
  80386           9-22                 EMC87               10

Note the performance improvement in integer instructions from the original 4.77-MHz 8088 PC to the 33-MHz 80386. As measured by Norton's SI, the 386 provides a performance improvement factor of over 40. Granted, SI is a notoriously misleading benchmark: Nevertheless, this is an impressive boost. More significantly, note the even greater improvement in coprocessor performance--both the Cyrix 83D87 and EMC87 execute FP multiply instructions faster than integer multiply instructions on the 386!

How are these dramatic performance gains achieved? In the case of the Cyrix coprocessors listed in Table 1(as well as other FP coprocessors such as the Weitek 3167), performance gains come from improvements in algorithms expressed in microcode, from improved internal design, and from utilizing what I've termed a "memory-mapped" operation mode. This mode operates in contrast to what I call the "80387 compatible" mode, which utilizes 80387 instructions per se.

Memory-mapped coprocessors take advantage of the fact that in protected mode on the 386, there is a page high in the 32-bit address space where the CPU does not insert internal wait states on accessing these locations. (These internal wait states are not to be confused with the wait states inserted into ordinary memory references and I/O instructions.) By sending or receiving bytes, words, or doublewords to or from these locations, you can double up on sending information to the processor about the instructions to be carried out by the coprocessor. The coprocessor decodes the address location and its instruction opcode, and uses the operand values passed on the data bus in the conventional manner. Each FP instruction is assigned a unique address within the 4K block of memory that starts at C0000000h. A MOV instruction to or from a location in this block tells the coprocessor that the FP instruction is to be executed. The coprocessor uses the address alone to determine which instruction is specified. The MOV instruction (usually to or from the register EAX) also causes the transfer of operand data. Because the address bus and the data bus work in parallel, the instructions work faster. For the purposes of this article, I'll refer to the FP units that operate in this way as "memory-mapped coprocessors."

Unfortunately, this memory-mapped scheme is incompatible with the standard mode of operation in the Intel 80387. In this mode of operation (the "387-compatible" mode), the FP unit expects opcodes to be embedded within the code, not as instructions to be poked into memory addresses. This can be a real problem if you have existing programs written for the 80387.

Fortunately, both the Cyrix and Weitek coprocessors provide solutions to the incompatibilities between memory-mapped and 387-compatible modes that give you higher performance without making you rewrite your code. The two processors approach the problem differently, however.

With the Weitek processor, you must purchase a small circuit board, a 3167, and an 80387. These two processors are plugged into the circuit board, which is in turn plugged into the coprocessor socket on the motherboard. (One such circuit board, the mW3167/80387 Board, is available from Microway.)

By contrast, the Cyrix EMC87 can operate in 387-compatible mode because its register set is the same. To use it, you do not need the small circuit board mentioned earlier, only the EMC87 (which plugs directly into the 121-pin 80387 socket) on the motherboard.

(One other difference between the Weitek approach and that of Cyrix is that the Weitek coprocessor is not compliant with the IEEE 754 FP Standard, while the Intel and Cyrix coprocessors are. Weitek has, in effect, traded numerical accuracy for speed. This, of course, is fine for those applications that don't require precise accuracy, of which there are many.)

There is a significant performance difference between the memory-mapped and 387-compatible modes, see Table2, which shows the performance of two mathematical algorithms I've used in designing medical imaging devices. The first is a 256-point complex Fast Fourier Transform (each complex point consisting of a 32-bit REAL for the real and imaginary parts, respectively). The second is a 256-point Daubechies D[4] wavelet transformation (each point is represented by a 32-bit REAL and produces a 256-point array of REALs plus a 255-point array of REALs in the multispectral decomposition of the input data).

Table 2: Performance of two complex mathematical algorithms.

  Processor            256-point     256-point Forward  256-point Inverse
                       Complex FFT   D[4] Wavelet       D[4] Wavelet
                                     Transformation     Transformation
  -----------------------------------------------------------------------

  10MHz 80286 with
    80287
                        160.0 msec         -                   -

  33MHz/80386
    with Intel 80387    45.0 msec       6.30 msec           7.00 msec


  33MHz/80386
    with Cyrix          17.6 msec       5.75 msec           4.88 msec
    83D87 or the
    Cyrix EMC87
    operated in
    compatible
    mode

  33MHz/80386 with
    Cyrix EMC87
    operated in         9.2 msec        4.00 msec           3.60 msec
    memory-mapped
    mode

As evident from Table 2, the 83D87 and EMC87 are faster than the 80387. Note also that the performance gains were achieved without sacrificing compatibility with the IEEE FP Standard. If a program functioned correctly on your older 8088-based PC with 8087, it should function perfectly on any of the more powerful Intel coprocessors, the Cyrix 83D87 or EMC87. Comparing the numeric results down to the last bit reveals minor discrepancies. The affected instructions are not the usual FP add, subtract, or multiply, but rather those instructions whose accuracies have not been defined by IEEE FP Standard -- for example, the transcendental functions. Therefore, the accuracy of one processor may not be as great as another even though each is still fully compliant with the Standard. These differences are slight and should not significantly affect the results of typical calculations.

Although not listed in Table 2, the Weitek 3167 comes in at speeds essentially equivalent to the Cyrix coprocessors.

Porting to Memory-Mapped Mode

Programmers are faced with complications when using memory-mapped coprocessors. Although some compilers (like Microway's NDP family) can generate memory-mapped FP instructions, there are still many programmers who want to keep their existing assembly language code without a major rewrite. Fortunately, translation programs that allow you to continue using compatible instructions are normally supplied with the coprocessor.

To use the memory-mapped scheme, the computer must think it is operating in protected mode while the programmer thinks the computer is in real mode.

The first step in accomplishing this is to find an unused 4K block of memory below the 1-Mbyte boundary. Between 640K and 1-Mbyte, there is usually some address space for which no physical memory is installed. These locations will vary from machine to machine (Cyrix supplies a program to find these blocks for you). Let us say that the program finds an unused 4K block at D000:0000 (0D0000h through 0D0FFFh). This number will be used in the examples later on.

The second step is to place the computer into the Virtual 8086 mode, in which the 386 runs in protected mode and is made to think that memory is broken into 1-Mbyte bundles (which may be placed anywhere). The no wait page at C0000000h is then mapped to somewhere within the 1-Mbyte limit of the apparent real mode, in this case to 0D0000h. The real-mode program can then use this block of address space to communicate with the memory-mapped coprocessor.

To put the computer into Virtual 8086 mode when using the Cyrix coprocessors, run a program (supplied by Cyrix) called CRXV86M.EXE, the Virtual 8086 Monitor. You can also execute the program from within the AUTOEXEC.BAT file by adding the line CRXV86M/m5. (The option /m5 places the monitor at the end of extended memory so that you can still use the standard RAM disk driver.) The monitor remains in memory and intercepts operating system requests from real-mode programs. The monitor converts the requests to the equivalent protected-mode requests. From this point on, the computer is in protected mode even though the real-mode programmer doesn't realize it.

Generating FP Code

The process by which "387 compatible" routines are translated into memory-mapped instructions is best explained in an example.

Consider the Fortran subroutine ABC in Example 1(a), to be called by Ryan-McFarland F-77 compiled routines. The Fortran compiler produces a table of addresses that the the calling program uses to pass the locations of its arguments to the subroutine. Note that this Fortran calls by reference, and not by value, as C usually does. Also notice that the argument locations are not pushed onto the stack. Example 1(b), shows the resulting table of pointers. The compiled call to ABC is shown in Example 1(c).

Example 1: (a) A Fortran subroutine ABC to be called by Ryan-McFarland F-77 compiled routines; (b)the resulting table of pointers; (c)the compiled call to ABC.

  (a) SUBROUTINE ABC(X,Y,Z)
      Z=X+Y
      RETURN
      END

  (b) ARGLOC    DW  OFFSET X ; argument 1
      DW        SEG    X
      DW        OFFSET Y ;     argument 2
      DW        SEG    Y
      DW        OFFSET Z ;     argument 3
      DW        SEG    Z

  (c) MOV    AX, SEG ARGLOC
      MOV    ES, AX
      MOV    BX, OFFSET ARGLOC
      CALL   ABC

Example 2 shows ABC after compilation into an assembly language module that uses 387-compatible FP instructions. To convert this module to one using memory-mapped instructions, you must run the translation program provided by the chip manufacturer (the Cyrix version is called CX.EXE). The translator produces an assembly language module, which can be assembled and linked in the usual manner.

Example 2: ABC after compilation into an assembly language module that uses 387-compatible FP instructions.

           TITLE ABC
           CODE SEGMENT 'CODE'
  ;        SUBROUTINE ABC(X,Y,Z)
  ABC        PROC       FAR
            PUBLIC  ABC
            LDS  SI,ES:[BX]       ; NOW DS:SI POINTS TO X
            FLD  DWORD PTR [SI]   ; LOADS THE VALUE OF X
            LDS  SI,ES:4[BX]      ; NOW DS:SI POINTS TO Y
            FADD DWORD PTR [SI]   ; ADDS Y TO X
            LDS  SI,ES:8[BX]      ; NOW DS:SI POINTS TO Z
            FSTP DWORD PTR [SI]   ; STORES VALUE INTO Z
            RET
  ABC       ENDP
  CODE      ENDS
      END

Running the translator on ABC produces the code shown in Example3. Notice that the 386 register FS is used to map the absolute location C0000000h into D000:0000 (0D0000h). The offset mflds is an equate (having, in this case, a value of 0200h) defined in the include file FCODE.INC (supplied by Cyrix, along with the translator). This equate, along with other entries such as mfldd, mfstps, and mfadds, represents the offsets in the address space that correspond to specific FP instructions.

Example 3: Assembly language module resulting from the translation into memory-mapped instructions.

          INCLUDE     FCODE.INC
                TITLE  ABC
          CODE        SEGMENT 'CODE'
  ;                   SUBROUTINE ABC(X,Y,Z)
  .386
  ABC                 PROC    FAR
  mov eax, 0d000h                       ; EMC
  mov fs,eax                            ; EMC
        PUBLIC   ABC
        LDS    SI,ES:[BX]               ; NOW DS:SI POINTS TO X
  ;     FLD    DWORD PTR [SI]           ; LOADS THE VALUE OF X
  mov eax, DWORD PTR [SI]               ; EMC
  mov DWORD PTR FS:mflds, eax           ; EMC
      LDS     SI,ES:4[BX]               ; NOW DS:SI POINTS TO Y
  ;   FADD    DWORD PTR [SI]            ; ADDS Y TO X
  mov eax, DWORD PTR [SI]               ; EMC
  mov DWORD PTR FS:mfadds, eax          ; EMC
      LDS  SI,ES:8[BX]                  ; NOW DS:SI POINTS TO Z
  ;   FSTP DWORD PTR [SI]               ; STORES VALUE INTO Z
  mov eax, DWORD PTR FS:mfstps          ; EMC
  mov DWORD PTR [SI], eax               ; EMC
      RET
  ABC       ENDP
  CODE      ENDS
     END

The translated assembly language module can be run through any assembler (such as Borland's Turbo Assembler) that understands the .386 pseudo-op and 80386 mnemonics. The new object module is then used in lieu of the untranslated one.

Conclusion

The impact of the numeric processing performance described in this article is dramatically affecting how PCs are being used. For instance, most medical Magnetic Resonance Imager(MRI) manufactures currently used computers that cost hundreds of thousands of dollars. An MRI unit I've designed, however, was built using a PC--that uses numeric coprocessors--as the only computer within the device.

A single MRI image is an array of 256 x 256 pixels. These data are calculated by means of 512 Fourier transforms. Image reconstruction in under two seconds is possible! Integral transforms (both Fourier and wavelet) make it possible to produce very inexpensive medical imaging devices in the near future.

As mentioned earlier, the speed of both integer and FP instructions has increased dramatically over the past five years, with FP instructions all but overtaking their integer counterparts. This will impact compiler design, because the overhead associated with control statements (DO, for instance) is becoming more important. In the past, index calculation time didn't constitute a significant percentage of the execution of a program. It does now! In short, hardware designers have done their jobs and compiler writers should be on notice that now it's their turn.

References

Letcher, J. "The Use of Weiner Deconvolution (An Optimal Filter) in Nuclear Magnetic Resonance Imaging." International Journal of Imaging Systems & Technology (vol. 1, 1989).

Letcher, J. "The Use of Weiner Deconvolution to Improve the Quality of MR Images." 74th Scientific Assembly and Annual Meeting, Radiology Society of North America, 1988.

Birth of the IEEE 754 Floating Point Standard

In the early 1960s, computer manufacturers couldn't agree on how to represent floating point numbers and in the definition of language used to calculate these numbers. Consequently, programs written for one computer couldn't be ported intact to another. A push for standardization evolved, particularly in the area of Fortran-66, that eventually leading to improved language specifications and to the ANSI-blessed Fortran-77.

This language standardization provided benefits to the programmer, but a serious problem remained: A program could be ported from one machine to another, compiled, and executed without errors--but sometimes, to the programmer's horror, different answers were obtained. To deal with such issues, a committee of several hundred scientists and engineers convened under the auspices of the IEEE. They explored, studied, and proposed a standard way of representing floating point (FP) numbers and the precise characteristics of how the arithmetic should behave.

The result of this committee's work was the IEEE 754 Standard for FP arithmetic that allowed the programmer to specify rounding structures and other arithmetic properties. The stage was set for programmers to port files from one machine to another with identical answers and a high degree of numerical stability.

The IEEE FP Standard is not simple. Routines written using integer instructions to fully comply with the Standard are large, cumbersome, and time-consuming in execution--a limitation particularly important to PCs. Nevertheless, work had begun at Zilog, Intel, Motorola, and others to implement the IEEE FP Standard in the form of a single integrated circuit, one which would share (in parallel) almost all of the signal lines to the integer CPU. The Intel 8087 and Motorola 68881 were born.

To illustrate by example, an ingenious scheme was implemented on the Intel 8086 whereby the integer CPU (the 8086) would encounter a single byte instruction called ESC (Escape). When it did, the 8086 knew that the upcoming instruction was destined for the companion circuit (the 8087, the IEEE Standard compliant coprocessor) and passed the instruction and operand bytes, as appropriate, to the coprocessor. It also helped with address calculations. This scheme allowed the 8086 to look ahead in the instruction stream and execute in parallel whatever it could while 8087 was busy with integer calculations. A simple instruction, FWAIT OR WAIT, caused the 8086 to freeze until the instant that the coprocessor had finished its previously assigned work.

This gave PC computers accurate and predictable numeric processing capabilities (by way of IEEE 754 compliance), which were undoubtedly one of the strongest contributing factors to the explosive growth and acceptance of PC and workstation computers by serious scientists and engineers. --J.H.L.




_GETTING NUMERIC COPROCESSORS UP TO SPEED_
by John H. Letcher


Example 1:

(a)

        SUBROUTINE ABC(X,Y,Z)
        Z=X+Y
        RETURN
        END

(b)

   ARGLOC  DW   OFFSET X ; argument 1
           DW   SEG    X
           DW   OFFSET Y ; argument 2
           DW   SEG    Y
           DW   OFFSET Z ; argument 3
           DW   SEG    Z

(c)
       MOV   AX,SEG ARGLOC
       MOV   ES,AX
       MOV   BX,OFFSET ARGLOC
       CALL  ABC


Example 2:

               TITLE  ABC
               CODE   SEGMENT 'CODE'
         ;   SUBROUTINE ABC(X,Y,Z)
         ABC     PROC    FAR
               PUBLIC  ABC
               LDS   SI,ES:[BX]      ; NOW DS:SI POINTS TO X
               FLD   DWORD PTR [SI]  ; LOADS THE VALUE OF X
               LDS   SI,ES:4[BX]     ; NOW DS:SI POINTS TO Y
               FADD  DWORD PTR [SI]  ; ADDS Y TO X
               LDS   SI,ES:8[BX]     ; NOW DS:SI POINTS TO Z
               FSTP  DWORD PTR [SI]  ; STORES VALUE INTO Z
               RET
         ABC    ENDP
         CODE   ENDS
               END


Example 3:

 INCLUDE   FCODE.INC
     TITLE  ABC
 CODE      SEGMENT 'CODE'
 ;         SUBROUTINE ABC(X,Y,Z)
 .386
 ABC       PROC    FAR
 mov eax,0d000h              ; EMC
 mov fs,eax                  ; EMC
     PUBLIC  ABC
     LDS    SI,ES:[BX]       ; NOW DS:SI POINTS TO X
;    FLD    DWORD PTR [SI]   ; LOADS THE VALUE OF X
mov eax,DWORD PTR [SI]       ; EMC
mov DWORD PTR FS:mflds,eax   ; EMC
    LDS     SI,ES:4[BX]      ; NOW DS:SI POINTS TO Y
;   FADD    DWORD PTR [SI]   ; ADDS Y TO X
mov eax,DWORD PTR [SI]       ; EMC
mov DWORD PTR FS:mfadds,eax  ; EMC
    LDS     SI,ES:8[BX]      ; NOW DS:SI POINTS TO Z
;   FSTP    DWORD PTR [SI]   ; STORES VALUE INTO Z
mov eax,DWORD PTR FS:mfstps  ; EMC
mov DWORD PTR [SI],eax       ; EMC
    RET
ABC       ENDP
CODE      ENDS
    END

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Parallel

Getting Numeric Coprocessors Up to Speed

More Speed, More Power

Table 1: Execution time (in clock cycles) for selected instructions on a variety of Intel and Cyrix processors/coprocessors.

Table 2: Performance of two complex mathematical algorithms.

Porting to Memory-Mapped Mode

Generating FP Code

Example 1: (a) A Fortran subroutine ABC to be called by Ryan-McFarland F-77 compiled routines; (b)the resulting table of pointers; (c)the compiled call to ABC.

Example 2: ABC after compilation into an assembly language module that uses 387-compatible FP instructions.

Example 3: Assembly language module resulting from the translation into memory-mapped instructions.

Conclusion

References

Birth of the IEEE 754 Floating Point Standard

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Parallel Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

Parallel

Getting Numeric Coprocessors Up to Speed

Related Reading

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Parallel Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content