A COPROCESSOR FOR A COPROCESSOR?
The 34082 floating point coprocessor for the 34020 graphics processor
Warren Davis and Kan Yabumoto
When it was introduced in 1985, the Texas Instruments TMS34010 Graphics System Processor (GSP) faced an identity crisis. Was it really a general-purpose microprocessor that happened to have built-in graphics-related instructions and video control circuitry, or was it merely an unusually powerful programmable graphics coprocessor? In truth it is both, although the first description is the more accurate from a technical standpoint. And while there have been many systems designed in which a TMS34010 is the sole (or main) microprocessor, it is in the PC graphics arena that this device has the potential to flourish by being used to offload graphics-related tasks from a host processor (usually an 80x86 or 680x0). At the very least, TI hopes the GSP will become a major player in this field, as evidenced by TIGA (Texas Instruments Graphics Architecture), a standard for communication between a host and a target (TMS340-based) graphics system.
But the 34010 was just the beginning. In 1989, TI began mass producing the TMS34020, which includes speed and functionality improvements over its predecessor, and is designed to accommodate an optional floating point coprocessor, the 34082. A coprocessor for a coprocessor? We shudder to think what might be next. But let's look a little deeper into the workings of these devices. Who knows, they might even make sense!
The 34010: Processor or Coprocessor?
There is no doubt that TI's GSPs are complete microprocessors in their own right. They contain internal registers, a stack pointer, a status register, and interrupt vectors. They fetch instructions and data from a local memory, have the ability to make conditional jumps, and are supported by all the standard language tools (assembler, C compiler, linker, and so on). And of course, there are the graphics-related features that make them unique. In fact, the 34010 was the first device to incorporate video signal generation and efficient graphics-related operations with an instruction set for general-purpose computing. In addition, there is a host interface built into the silicon which simplifies the hardware connection between a GSP and another computer's bus. This is a somewhat unusual feature for a microprocessor, but looking at the real world filled with PCs, Macs, and Unix-based systems, you see the logic of it. The simpler the interface, the easier it is to develop GSP programs on the host computer and then download them to the GSP's memory.
But such an interface can also be used to communicate between a host and target processor while programs are running on both. The host could download parameters -- say, the position and radius of a circle along with a fill color -- to the GSP, which would then perform some graphics-related operations, such as drawing a filled circle on the screen. In fact, most graphics coprocessors have a similar means of receiving graphics commands from a host. For this reason, no doubt, many people originally thought of the 34010 as a glorified graphics coprocessor. (The term "graphics coprocessor" is actually somewhat vague. The history of devices which assist a host processor in performing graphics tasks covers a wide spectrum of "processing" ability.)
Anyway, all "graphics coprocessors" are treated as peripheral devices by a host processor, and this is certainly true of the 34010 as well. Once the analogy was made, some pointed out that the 34010 was actually slower in performing certain graphics tasks than some graphics controllers, which implemented a fixed set of functions internally and performed them at lightning speed. The beauty of the GSP, however, is in its ability to be tailored to a specific task.
Let's look at an example. Say we want to draw a series of filled circles along a path represented by an equation. Say also that we can divide the computational tasks into four sections fairly easily. Figure 1 shows us how our processing time would probably be spent using a typical graphics controller. The host processor takes some amount of time (Tm) to calculate the position of the next circle. When it comes time to draw the circle, we offload that task to the controller. In doing so, we incur a small bit of overhead (To) which is usually more than made up for by the speed of the controller (Tg). Presuming that the host does not need to acknowledge the completion of the graphics task, the total time for the loop is Tm + To, where Tm = TA + TB + TC + TD.
If Tg is less than Tm, the graphics controller could be spending most of its time waiting for the host to send a command. Unfortunately, there isn't any work for the graphics device to do while the host is busy with other things. Now look at Figure 2, which shows a possible way of implementing this program using a GSP. We can increase the parallelism between the two processors by adjusting the division of tasks. So instead of having the GSP just draw the circle, we can send it some interim values, have it complete the computation, and then draw the circle. Even if the actual circle drawing time of the GSP is slower, the throughput of the system is faster.
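To make the timing argument concrete, here is a minimal C sketch of the two schedules. All time values are hypothetical units, not measurements of any of the devices discussed; the point is only that moving sections TC and TD onto the GSP lets the host's next iteration overlap the GSP's work.

```c
/* Serial schedule: the host computes all four sections (Tm = TA + TB +
   TC + TD), then pays the hand-off overhead To before the graphics
   device can start. Pipelined schedule: sections TC and TD run on the
   GSP, so in steady state the loop time is set by whichever side is
   slower. Times are in arbitrary hypothetical units. */
static int serial_loop_time(int ta, int tb, int tc, int td, int to)
{
    return ta + tb + tc + td + to;      /* Tm + To */
}

static int pipelined_loop_time(int ta, int tb, int tc, int td, int to)
{
    int host = ta + tb + to;            /* host's share of each loop    */
    int gsp  = tc + td;                 /* GSP's share of each loop     */
    return host > gsp ? host : gsp;     /* the slower side sets the pace */
}
```

With four equal sections of 10 units and an overhead of 2, the serial loop costs 42 units per circle while the pipelined loop settles at 22, even if the GSP's own drawing time is no better than the host's.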
Most graphics controllers contain hardcoded primitives, so the host has little or no choice in how to divide its tasks between itself and the controller. But because the GSP is completely programmable and capable of performing any standard computational task as well, there is no restriction on how much or how little it does at a time. The division of tasks between a host and GSP can be tailored to a particular need and tweaked to perfection (or as near to perfection as a deadline will allow).
So it seems pretty clear that the TMS340 GSPs must be accepted as more than just graphics coprocessors, although if that's the way you want to use them, they are more than equipped to handle the job exceptionally well.
Enter the 34020
Flexibility is one thing, but performance is another. The 34020, TI's newest GSP, provides a 32-bit external data path (which by itself virtually doubles the speed of pixel transfers over its predecessor), faster cycle times, a larger internal cache, support for a variety of VRAM capabilities, and a multiprocessor interface to allow multiple 34020s to share a memory space. Most relevant to the scope of this article, however, is the inclusion of a coprocessor interface. This notion was completely missing from the 34010, but its need becomes apparent as soon as you try to perform floating point arithmetic on the 34010. While the performance is respectable, it is nowhere near remarkable.
The 34020's coprocessor interface is general-purpose in a somewhat limited sense. Some of the 34020's local memory interface signals are used to tell a coprocessor that a command is being directed to it. Naturally, the coprocessor must be designed to listen properly, and at present there is only one device (the 34082) which will do that. Also, the 34020 is capable of working with more than one coprocessor. Through an ID field in its coprocessor instructions, the 34020 can control up to five coprocessors. Up to four of these can be 34082s, and only one may be a coprocessor of another origin which conforms to the 34020's coprocessor interface conventions.
The 34020 communicates with its coprocessors through a set of general coprocessor instructions, shown in Table 1. One of these instructions, CEXEC, simply involves the transfer of a command, embedded into the instruction, to a coprocessor. All the others involve the additional transfer of data between the 34020 and coprocessor. As Table 1 shows, data can be sent to or returned from the coprocessor using 34020 registers or memory. When executing any coprocessor instruction, the 34020 first generates a particular combination of control signals on its address/data bus to signal the coprocessor. The coprocessor command is placed onto the bus along with some other information including the coprocessor ID. Transferral of data, if any, follows. The 34020 controls these transfers, but the coprocessor needs to know what to do with data it is receiving or, if it is expected to return data to the 34020, what data to send back. This information must be inherently present in the command field sent by the 34020.
Table 1: 34020 coprocessor instructions
These are all of the 34020's coprocessor instructions.
The size field is a bit which indicates whether the operation is to be performed on 32-bit values (size = 0) or 64-bit values (size = 1).
The command field tells the coprocessor what operation to perform. If data is being transferred from the 34020 (that is, CMOVGC or CMOVMC), the command should indicate where it is to go. If data is being transferred to the 34020 (that is, CMOVCG, CMOVCM, or CMOVCS), the command should indicate what data is to be returned.
The ID field is used to select a particular coprocessor (or all coprocessors) when there is more than one in the system. When omitted, a default value (which can be changed with an assembler directive) is used.
Execute Coprocessor Command without Data Transfer
  CEXEC  size,command[,ID][,L]          Long Form
  CEXEC  size,command[,ID]              Short Form

Move from Coprocessor to 34020 Registers
  CMOVCG Rd,command[,ID]                Move one register
  CMOVCG Rd1,Rd2,size,command[,ID]      Move two registers

Move from Coprocessor to Memory
  CMOVCM *Rd+,cnt,size,command[,ID]     Post increment
  CMOVCM *-Rd,cnt,size,command[,ID]     Pre decrement

Move from Coprocessor to 34020 Status Register
  CMOVCS command[,ID]                   Replaces N, C, Z, and V bits of
                                        34020's status register

Move from 34020 Register(s) to Coprocessor
  CMOVGC Rs,command[,ID]                Move one register
  CMOVGC Rs1,Rs2,size,command[,ID]      Move two registers

Move from Memory to Coprocessor
  CMOVMC *Rs+,cnt,size,command[,ID]     Post increment, Constant count
  CMOVMC *-Rs,cnt,size,command[,ID]     Pre decrement, Constant count
  CMOVMC *Rs+,Rd,size,command[,ID]      Post increment, Register count
Presenting the 34082
As we mentioned before, the 34082 is currently the only device designed to work with the 34020's coprocessor interface. Because these devices have been designed to work so closely together, TI's TMS340 language tools support a special set of so-called "pseudo-ops" which consists entirely of variations on the instructions shown in Table 1.
For example, the 34020 instruction, ADD CRs1,CRs2,CRd (Add Integer), is actually a CEXEC instruction which sends a command to the 34082, instructing it to add two of its registers (CRs1 and CRs2) as integers and place the result in another register, CRd. The ADDF (Add Float) instruction is identical to ADD except that a different coprocessor command is sent to indicate floating point addition. The ADDD instruction is identical to ADDF except the size field is 1 to indicate an operation on 64-bit values.
The 34082 has a built-in command set contained in its internal ROM. The "commands" sent by the 34020 are actually nothing more than addresses of microcoded programs in this ROM. So when the 34020 issues the ADD instruction mentioned before, it is really just triggering the 34082 to execute a one-line program consisting of a native 34082 ADD instruction. Some 34020 pseudo-ops trigger more complex 34082 programs, such as matrix multiplications or polynomial expansions.
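The "command = ROM address" idea can be sketched with a function-pointer table in C. The opcode values and routine names below are our own inventions for illustration; on the real part, the command field sent by the 34020 selects a microcoded routine in the 34082's internal ROM.

```c
/* A toy "internal ROM": each command code indexes a table of routines,
   just as a 34082 command selects a microcoded program. The opcodes
   and routine names here are hypothetical. */
typedef double (*microcode_fn)(double a, double b);

static double mc_add(double a, double b) { return a + b; }  /* "ADD" */
static double mc_mul(double a, double b) { return a * b; }  /* "MPY" */

static microcode_fn rom[] = { mc_add, mc_mul };

static double run_command(int command, double a, double b)
{
    return rom[command](a, b);   /* dispatch: command is a ROM index */
}
```

A more complex "command" would simply be a longer routine at its slot, which is exactly how the matrix and polynomial operations fit the same dispatch model.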
Looking at the specs of the 34082 would lead one to conclude something very exciting. It is fast! The 34082-32 has a 67.5 ns instruction cycle time, a three-operand Floating Point Unit (FPU) with two levels of internal pipelining, and can perform most single precision operations in one cycle when executing out of its own local memory. (When commands are sent from the 34020, the minimum timing is equal to one 34020 cycle or 125 ns.) It supports three data types: 32-bit integer, 32-bit IEEE float, and 64-bit IEEE double.
A configuration register allows you to set the rounding mode and pipeline configuration. The 34082's native instruction set allows for conditional branches, jumps to subroutines (nested up to two-deep), loops, and interrupt service routines.
This degree of programmability within the 34082 itself is no accident. As if the "processor vs. coprocessor" issue were not muddy enough, the 34082 has the ability to act as a standalone processor. In this mode, called the "host-independent mode", programs are executed from an external memory (up to 64K long words of program and 64K long words of data) made up of either Static RAM (SRAM) or EPROM, which connects to the 34082 without any glue logic! A bootstrap loader is provided to simplify the initialization of SRAM. And TI provides not only a macro assembler and linker for the 34082, but a C compiler as well! This external memory is required for host-independent operation, but it can still be present even when the 34082 is in coprocessor mode. Communication between the 34082 and this memory occurs over a local bus, independent of the 34020 (see Figure 3). You can actually develop custom routines for the 34082, download them to SRAM or burn them into EPROM, and use them just as you would the commands built into the internal ROM! The improvement in execution speed can be remarkable, as you will see shortly.
Programming Notes
Many programmers stay away from multiplication operations, replacing them with additions when possible (for example, adding a number to itself instead of multiplying by two). On the 34082 this becomes a moot point. As long as you use the float format, most operations are a single clock cycle, so you gain nothing by replacing a multiplication with an addition. In fact, the 34082 runs so fast on many instructions that the bottleneck ends up being in the 34020-to-34082 communication.
To squeeze every drop of 34082 performance, you should focus your effort on optimizing the allocation of registers so that the restrictions placed on source operands do not force you to shuffle data between registers. The 34082's three-operand FPU allows many instructions to specify two source operands and a destination operand. The restriction on most instructions of this type is that the first source register must come from the A-file and the second from the B-file. Some instructions requiring a single-source operand require it to reside in a particular file (for example, the source register for SQR (square) instructions must be from the A-file, and the source register for INV (invert) instructions must be from the B-file). There is a mode bit in the 34082's CONFIG register which allows you to remove these restrictions by making the A- and B-files equivalent. The trade-off is that you then have only 10 registers available instead of 20.
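The two-source operand rule can be modeled in a few lines of C. The register numbering below (0-9 for the A-file, 10-19 for the B-file) is invented for illustration, not the 34082's actual encoding.

```c
#include <stdbool.h>

/* Model of the two-source operand rule: the first source must come
   from the A-file and the second from the B-file, unless the CONFIG
   mode bit makes the files equivalent (at the cost of halving the
   usable register count). Register numbering is hypothetical:
   0-9 = A-file, 10-19 = B-file. */
enum { A_FILE, B_FILE };

static int reg_file(int reg) { return reg < 10 ? A_FILE : B_FILE; }

static bool operands_legal(int src1, int src2, bool files_equivalent)
{
    if (files_equivalent)
        return true;                   /* mode bit set: no restriction */
    return reg_file(src1) == A_FILE && reg_file(src2) == B_FILE;
}
```

A register allocator, human or machine, can apply a check like this to decide when a value must first be copied into the other file before an instruction can issue.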
Some of the more complex instructions act like subroutines and use specific registers as inputs. This is right along the lines of the GSP's graphics instructions, which expect operands to be stored in specific B-file registers. The Feedback Registers, C and CT, which are primarily used for temporary storage by some instructions, are also available and can be used to minimize any inconveniences. Thankfully, there are no file restrictions on the destination register.
Fractals
Now for the fun part. To evaluate the performance of the TMS34082 floating point processor, we wrote a simple C program that displays a picture of the Mandelbrot set. The screen represents a rectangle of arbitrary dimension at some position in the complex plane. The X axis represents real number components and the Y axis represents imaginary number components. The Mandelbrot plot is created by computing successive iterations of the equation An = (An-1)^2 + C, where A and C are complex numbers, the initial value of A is 0+0i, and C is a constant which is represented by a pixel in the complex plane. For all values of C which are visible on our screen, we determine how many iterations it takes for A to diverge. For our purposes, that means how many iterations until the magnitude of A becomes greater than 2. In plotting these results, we use the number of iterations until divergence as an index into our color map. If, after 256 iterations, A has still not diverged, we simply use color 0.
To refresh your algebra in complex numbers, if we represent a complex number, A, as follows:

A = aR + aI*i = (aR, aI)

then

A + B = (aR + bR, aI + bI)
A^2 = (aR^2 - aI^2, 2*aR*aI)
magnitude of A = sqrt(aR^2 + aI^2)
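These identities translate directly into a per-point routine. Here is a minimal C sketch of the computation just described (it mirrors the inner loop of Listing One): iterate A = A^2 + C from A = 0+0i, comparing the squared magnitude against 4 to avoid a square root.

```c
/* Color for one point C = (cr, ci) of the Mandelbrot plot: the number
   of iterations remaining when A diverges (|A|^2 > 4), or 0 if A is
   still bounded after 256 iterations. */
static int mandel_color(float cr, float ci)
{
    float ar = 0.0F, ai = 0.0F;        /* A's real and imaginary parts */
    float arsq = 0.0F, aisq = 0.0F;    /* their squares, reused below  */
    int color;

    for (color = 256; --color > 0;) {
        ai = (ar * ai * 2.0F) + ci;    /* imaginary part of A^2 + C    */
        ar = arsq - aisq + cr;         /* real part of A^2 + C         */
        arsq = ar * ar;
        aisq = ai * ai;
        if (arsq + aisq > 4.0F)        /* |A|^2 > 2^2, so A diverges   */
            break;
    }
    return color;                      /* 0 means "never diverged"     */
}
```

Note that the squares arsq and aisq feed both the next iteration and the divergence test, so each is computed only once per pass.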
Our Test Program
Listing One (page 84) shows a C program written to run under any environment and on any graphics display. The main routine starts by calling a black box function called initialize( ) that performs all hardware-dependent tasks -- it initializes the display board, clears the display screen, and loads a predetermined set of 256 colors into the display board's palette memory. Under some environments, you make a query to find out what your pixel resolution is, so initialize( ) also sets the global variables screenx and screeny. There is another "black box" function which is dependent on the display board used: put_pixel( ), which writes a color at a given position on the screen. To port this program, all you need to do is write your own initialize( ) and put_pixel( ).
The only other purpose of the main routine is to set up the parameters for compute_fractal( ). The four parameters form two complex numbers, which determine what chunk of the complex plane appears on our display screen. The origin parameter becomes the upper left corner of the screen, and the size parameter gives the dimensions of the screen in the complex plane. You can see by the initial values of origin and size that we will map an area from -4.0 to +4.0 along the real (X) axis and from -3.0 to +3.0 along the imaginary (Y) axis. These numbers were chosen to approximate the aspect ratio of a typical monitor so that each pixel represents a true square. It also gives a nice encompassing picture of the Mandelbrot set. By varying these parameters, you can achieve a limitless variety of fractal landscapes, some of which are quite breathtaking.
The compute_fractal routine begins by computing DeltaR and DeltaI, which essentially represent the width and height of a single pixel in the complex plane. For every pixel on the screen, we need to determine a color. Therefore, we have two outer "for" loops, which encompass the entire screen, and an inner loop, which performs the calculations. The inner loop essentially performs complex arithmetic to determine how many iterations it takes to meet our divergence criterion. If we detect divergence, we break out of the loop and plot a pixel using the loop count as a color index. Otherwise, we fall through and plot a pixel of color 0.
(Rather than compute a true magnitude, which involves a square root, we compare the square of the magnitude to the square of our comparison value.)
Although the program is fairly simple, it is obviously a real number cruncher, so we tried to optimize the code as much as possible without losing its readability: All variables have been declared as register; we save the squares of the real and imaginary portions of A at each iteration. This is because they are used in computing both the next iteration and the square of the magnitude. By storing them, we save ourselves a multiplication.
The program in Listing One (page 84) was compiled under two environments -- Microsoft C 6.0 and Texas Instruments TMS340 C 5.01. The host computer was an 80386/25 MHz MS-DOS machine with an 80387. A TI SDB20 board, which is built around a 32-MHz 34020 processor and 34082 coprocessor, was plugged into a slot on the host computer. In both cases, we used the display buffer of the SDB20 board connected to an NEC 3D Multisync monitor to view our images. The screen resolution was 640 x 480 pixels with 256 colors.
We compiled the program for each environment in two ways. First, we had the compilers generate floating point library calls. Next, we had them generate coprocessor instructions. The timing results are shown in Table 2. We were also fortunate enough to try our program out on an 80486 machine (at 25 MHz). The 80486 is essentially an 80386 married to an 80387 on a single chip with speed enhancements, and is therefore software-compatible with the 387 version of our program.
Table 2: Results of fractal comparison (Times are shown in seconds and hr:min:sec.)
                              Image 1      Image 2      Image 3
----------------------------------------------------------------
80386/FP Library                 2231        13251        31059
                              0:37:11      3:40:51      8:37:39
34010/FP Library                 1077         5199        15528
                              0:17:57      1:26:39      2:18:48
34020/FP Library                  443         2534         6304
                              0:07:23      0:42:14      1:45:04
80386/80387                        97          569         1319
                              0:01:37      0:09:29      0:21:59
80486                              23          126          293
                              0:00:23      0:02:06      0:04:53
34020/34082                        18           93          216
                              0:00:18      0:01:33      0:03:36

*** Above entries used C program as source.
*** Following entries used assembler.

Tweaked 34020/34082                11           64          149
                              0:00:11      0:01:04      0:02:29
34082 running out of
its local SRAM                      4           17           38
Note: The times shown in this table do NOT include the overhead of writing the 640x480 pixels to the display screen. Each program was run in a mode where all pixel writing was inhibited. So the results shown above are the computation times of the algorithm only.
Although it's nice to know that the TMS340 C compiler is capable of generating 34082 instructions, anyone who's ever done any graphics programming knows that for performance, nothing beats assembler language. For that reason, we created a hand-tweaked assembler version of compute_fractal( ) based on code generated by the C compiler. The original output of the C compiler is shown in Listing Two. Compare this against the assembly code we tweaked in Listing Three (page 87).
The first thing to notice in Listing Two is that only one of the variables we declared to be "register" was placed in local memory, namely DeltaR. Every other variable is maintained in a register. Not only that, the float variables have been assigned to 34082 registers while the integers reside in 34020 registers. This is done by the C compiler automatically!
The Ultimate Method
We mentioned before that the 34082 can have its own local memory which can contain user programmed commands. In the case of the SDB20 board, there is a piggyback card available which plugs into the 34082 socket to provide the 34082 with external SRAM. By using this card, we were able to port the Mandelbrot algorithm to the 34082's SRAM. The particular programming techniques used are beyond the scope of this article, but we would be happy to answer any inquiries from interested readers. Basically, we created three new 34082 "commands." The first initializes the 34082's registers. The second performs the computations for a single point, returns the color of that point, and adjusts all registers to prepare for the next point. The third is called at the end of each line and adjusts all registers to prepare for the beginning of the next line. The 34020 simply maintains the row and column loops while sending these newly defined commands to the 34082.
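Here is a plain-C sketch of that division of labor. The xfp_* functions below emulate in software what our three custom 34082 commands do; their names, signatures, and state layout are invented for illustration (on the hardware, this state lives in 34082 registers and each call is a coprocessor bus exchange).

```c
static float c_r, c_i;       /* current point C                       */
static float row_r;          /* real part at the start of current row */
static float d_r, d_i;       /* per-pixel step through the plane      */

/* Command 1: initialize the "34082 registers". */
static void xfp_init(float base_r, float base_i, float dr, float di)
{
    c_r = row_r = base_r;  c_i = base_i;  d_r = dr;  d_i = di;
}

/* Command 2: compute the color for the current point, then advance C
   to the next pixel on the line. */
static int xfp_next_point(void)
{
    float ar = 0.0F, ai = 0.0F, arsq = 0.0F, aisq = 0.0F;
    int color;

    for (color = 256; --color > 0;) {
        ai = (ar * ai * 2.0F) + c_i;
        ar = arsq - aisq + c_r;
        if (((arsq = ar * ar) + (aisq = ai * ai)) > 4.0F)
            break;
    }
    c_r += d_r;
    return color;
}

/* Command 3: rewind to the start of the line and step down one row. */
static void xfp_next_line(void)
{
    c_r = row_r;
    c_i += d_i;
}

/* The host (34020) side keeps only the loops and the pixel writes;
   here the pixels land in a caller-supplied buffer. */
static void compute_fractal_sram(int screenx, int screeny,
                                 float base_r, float base_i,
                                 float span_r, float span_i,
                                 unsigned char *dest)
{
    int row, col;

    xfp_init(base_r, base_i, span_r / screenx, span_i / screeny);
    for (row = 0; row < screeny; row++) {
        for (col = 0; col < screenx; col++)
            *dest++ = (unsigned char)xfp_next_point();
        xfp_next_line();
    }
}
```

The host's per-pixel work shrinks to one command send and one store, which is why moving the math into the 34082's SRAM pays off so dramatically.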
And the Winner Is ...
In examining the timing results in Table 2, keep in mind that this was a test of curiosity more than anything else. The timings for the 386/387 are very dependent on the compiler and library used. The three Mandelbrot images we chose represent a wide variation in the amount of computing necessary.
The coprocessors boosted performance by a factor between 20 and 30 for both the TI and Intel chips. That isn't too surprising. After all, the existence of math coprocessors cannot be justified if the gain is marginal. However, we were very surprised to see TI's chip outperform the 80387 by a factor of 6. To explain this difference in performance, we must look into the underlying processor architectures. The entire compute_fractal( ) function fits into the on-chip instruction cache of the 34020, eliminating all instruction fetches. In this case, the 34020 executes over 80 percent of typical instructions in one machine cycle. All of its coprocessor instructions are also executed in one machine cycle. And because the TI C compiler puts 11 local variables into registers (many of which stay entirely inside the coprocessor), there are hardly any memory accesses. In the tweaked assembler version of the program, there are no memory accesses at all except for the outer loop initialization and pixel drawing.
Normally, when you replace a routine written in C with a tweaked assembler version, you would expect performance to improve by a factor of 3 or more. Not so in this case. We did not achieve even a two-fold increase in speed. Whereas many C programmers may have been skeptical of declaring register variables in the past, GSP C programmers should now get in the habit of declaring all automatic variables to be "register," keeping in mind that the compiler assigns registers in the order in which the declarations appear. By the way, we did not write a hand-tweaked version of the program for the 386/387 because it was not our purpose to provide an official benchmark, just a rough comparison. We would be happy to hear about anyone else's results from similar comparisons.
The times for the 486 machine are about four times faster than those of the 386/387 combination, which is as we expected. However, the 34020/82 combination was still faster by about 35 percent. Part of the speed improvements of the 486 come from the fact that there is no bus overhead in communicating with a coprocessor. This is almost the case when the 34082 is running our custom commands from its SRAM. The amount of communication between the 34020 and 34082 is reduced considerably, though not entirely, and yet we still see an improvement of close to a factor of 4 over the tweaked version which uses the 34082's built-in commands.
One can typically experience frustration while waiting for a Mandelbrot plot to complete. Using the 34020/34082 combination, we have practically exhausted our curiosity in this area by viewing image after image, many within a few seconds, using an interactive version of our program. Having observed this incredible performance, we wonder why we haven't yet seen an add-on card interfacing a 34082 to a PC, because a bus connection is technically feasible. At present, the price of a 34082 is about one-third that of an 80387. With some software support, it could turn a regular PC into a super number cruncher.
80x86 vs. TMS340 Philosophies
A 34082 connected to a 34020 is a floating point coprocessor in the truest sense. The 34020 does not treat it as a peripheral device but as an extension of itself. Even the hardware interface between the two devices has been optimized to make it as direct as possible. This is similar to the relationship between the Intel 80x86 and 80x87 devices. Just for grins, let's compare the Intel and Texas Instruments way of doing things.
Intel's processors are built upon a classic CISC architecture where the CPU contains a relatively small number of registers but allows most of the arithmetic and logical instructions to use memory locations as operands. This approach results in fewer move instructions than the TMS340 processors, which are influenced by the RISC philosophy. They have many more registers (30 general-purpose 32-bit wide registers) and cannot perform arithmetic and logical operations out of memory. Memory accesses are slower than register accesses, so the idea is to keep as much information as possible in registers. These philosophies were carried over to some extent to both companies' floating point math coprocessors. The 80x87 processors have relatively few (8) registers in a stack-like organization. The 34082 math coprocessor comes with many registers (20 general-purpose 64-bit wide registers plus two Feedback Registers) that can be accessed more freely.
Another concept carried over from the TMS340 processors to the 34082 is that of A-file and B-file registers. The 30 general-purpose registers of the GSPs are divided into 15 A registers (A0-A14) and 15 B registers (B0-B14). Many instructions require that both register operands be within the same file. The 34082's 20 registers are also organized in A- and B-files. Like the 34010 and 34020, there are some restrictions on register usage.
Both the 80x87 and 34082 have synchronization instructions to allow a lengthy coprocessor operation to take place concurrently with main CPU execution. Both coprocessors can also transmit/receive data to/from system memory directly. And in both cases, the main CPU is responsible for coprocessor instruction decoding and memory access for optional operands. In the Intel case, when a special "ESC" prefix is encountered by the CPU, the CPU generates a special I/O cycle to communicate with the 80x87. In the TI case, when a coprocessor instruction opcode is detected by the 34020, the 34020 initiates a special coprocessor bus cycle to which the 34082 responds. The data which actually appears on the data bus has been massaged by the 34020 to look very much like a microcoded instruction, with the "command" field being a pointer into the 34082's internal ROM.
Another interesting comparison between the 80x87 and 34082 is that the Intel chips perform 80-bit "temporary real" floating point math which provides more range and accuracy than the IEEE 64-bit double format used in the 34082. Also, while Intel's parts contain built-in logarithmic, exponential, and trigonometric functions, TI's device has none of these. These were sacrificed in favor of a variety of matrix and vector arithmetic and other graphics oriented functions. However, using the optional external memory, you can write your own functions as needed and expect the performance to be as fast or faster than other numeric processors.
All this is fascinating, I'm sure, but what about performance? Well, Table 3 shows a comparison of the speed of some floating point instructions among the latest math coprocessors. In comparing the performance of these coprocessors, we should note that the move/load/store functions of the 80387 devices create a significant overhead (20 to 93 cycles) which is negligible in the 34082. This is because Intel chose to convert all numbers to/from the "temporary real" format. TI maintains three distinct formats (int, float, and double) and gives you the choice of transferring data as is, or transferring and converting to a desired representation in one breath. We should also note that a comparison of instruction cycles alone is not very meaningful. The overall architecture of the processing environment can become very significant in evaluating the device's performance.
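That overhead point can be put in back-of-the-envelope form. The helper below is ours, not from either data sheet; the cycle counts plugged in are illustrative values in the ranges just quoted (20 to 93 cycles for an 80387 transfer, effectively zero for the 34082).

```c
/* Effective per-operation time = raw operation time plus the cost of
   moving the operands, where each transfer cycle at clock_mhz costs
   1000/clock_mhz nanoseconds. */
static double effective_ns(double op_ns, int xfer_cycles, double clock_mhz)
{
    return op_ns + xfer_cycles * (1000.0 / clock_mhz);
}
```

At 33 MHz, even the cheapest 20-cycle transfer adds roughly 600 ns on top of the 690 ns the 80387 spends on the add itself, nearly doubling the effective cost. This is why raw instruction timings alone are a poor predictor of delivered performance.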
Table 3: Comparison of Instruction Execution Times in nanoseconds for 80387, 80486, and 34082
Operation     80486 (33 MHz)      80387 (33 MHz)     34082 (32 MHz)
--------------------------------------------------------------------
abs               90  (FABS)           660           125/125   (ABSx)
compare          120  (FCOM)           720           125/125   (CMPx)
add              300  (FADD)           690           125/125   (ADDx)
multiply         480  (FMUL)           870           250/125   (MPYx)
divide          2190  (FDIV)          2640           1500/750  (DIVx)
sqrt            2550  (FSQRT)         3660           1875/1125 (SQRTx)
int2real     480/330  (FILD)      1680/600           125/125   (CVIx)
Note 1: Currently, TI is only shipping 34082s rated at 32 MHz (40 MHz will be available later).
Note 2: The two numbers separated by a slash correspond to double and float operations, respectively. The integer operations of the 34082 are equal to or slightly slower than their double precision counterparts. On the other hand, the Intel parts always operate in "temporary real" format.
Note 3: The third column reflects the timings of these operations when executed as 34020 coprocessor instructions. The minimum possible execution time is one 34020 instruction cycle (or 125 ns). On the other hand, if the 34082 were executing instructions from its local memory, the timings would be different. Specifically, the single cycle functions (abs, cmp, add, and mult) would execute in one 34082 instruction cycle (or 67.5 ns).
--W.D. and K.Y.
[LISTING ONE]
/* C program to perform display of the Mandelbrot set.  Needs to be linked
   with a module containing the initialize() and put_pixel() routines. */

int screenx, screeny;   /* These values represent the size of the display */
                        /* screen in pixels.  They are initialized in the */
                        /* initialize() routine called by main().         */

/**************************************************************************
   compute_fractal is the heart of our program.  Four parameters are
   passed from main() representing two complex numbers.  The first two
   parameters, base_R and base_I, are the real and imaginary portions of
   the upper left corner of the screen in the complex plane.  The last
   two, span_R and span_I, give the size of the area of the complex plane
   visible on the screen.

   SOME BACKGROUND...
   This routine computes successive iterations of the equation,

        An = (An-1 ** 2) + C

   where A and C are complex numbers, and C represents a point in the
   complex plane.  The initial value of A is 0+0i, and when the magnitude
   of A becomes greater than 2.0, it is assumed that the series will
   eventually diverge.  The color of the pixel at C becomes the number of
   iterations before divergence.  If after 256 iterations there is no
   divergence, color 0 is written.  The color is used as an index into the
   color palette of the display board.

   COMPLEX ARITHMETIC...
   For those of you a little rusty on your complex arithmetic, the
   following formulas are supplied...  If W and Z are complex numbers,
   then each has two parts, real and imaginary (i.e. W = W_real +
   W_imag * i).

   W + Z means (W_real + Z_real) + (W_imag + Z_imag) * i
   W * W means (W_real * W_real) - (W_imag * W_imag)
                                 + (2 * W_real * W_imag) * i
   The magnitude of Z would be SQRT((Z_real * Z_real) + (Z_imag * Z_imag))
**************************************************************************/
void compute_fractal(float BaseR, float BaseI, float SpanR, float SpanI)
{
    register float AR, AI;           /* Real and Imaginary components of A */
    register float ConstR, ConstI;   /* Real and Imaginary components of C */
    register float DeltaR, DeltaI;   /* increment values for C             */
    register float ARsqr, AIsqr;     /* squares of AR and AI               */
    register int   row, col, color;  /**** See NOTE 1 ****/

    DeltaR = SpanR / (float)screenx;
    DeltaI = SpanI / (float)screeny;
    ConstI = BaseI;
    for (row = 0; row < screeny; row++) {        /* Scan top to bottom */
        ConstR = BaseR;
        for (col = 0; col < screenx; col++) {    /* Scan left to right */
            AR = AI = ARsqr = AIsqr = 0.0F;      /**** See NOTE 2 ****/
            for (color = 256; --color > 0;) {    /* Find color for this C */
                AI = (AR * AI * 2.0F) + ConstI;  /* Compute next   */
                AR = ARsqr - AIsqr + ConstR;     /* iteration of A */
                if ( ((ARsqr = AR * AR) + (AIsqr = AI * AI)) > 4.0F )
                    break;                       /**** See NOTE 3 ****/
            }
            put_pixel(color, col, row); /* Write color to display buffer. */
            ConstR += DeltaR;
        }
        ConstI += DeltaI;
    }
}
/* NOTE 1: We declare everything to be register variables.  For some
   processors this may not have much of an effect, but on others (like
   the 34020 and 34082) you may be surprised.
   NOTE 2: For each point on the screen, we begin computing iterations of
   the Mandelbrot equation.  The initial value of A is 0+0i.  Since the
   values A_real*A_real and A_imag*A_imag are used in computing both the
   next iteration of A and its magnitude, we maintain these values as
   separate variables so the multiplications need only be computed once.
   NOTE 3: For our magnitude comparison, we actually compare the SQUARE
   of the magnitude against the square of our divergence value.  This
   saves us from computing a square root.
*/
/**************************************************************************
   The main() function serves only to pass initial values to
   compute_fractal.  We will leave the initialize() routine as a "black
   box".  Interested programmers may want to write their own routine for
   whatever display board is available.  The values used in this test
   program show the familiar picture of the Mandelbrot set.  By varying
   these numbers, you can obtain some breathtaking fractal landscapes.
**************************************************************************/
main()
{
    float origin_R, origin_I, size_R, size_I;

    /* The initialize() routine must initialize the display board, clear
       the display buffer, load a table of 256 colors into the color
       palette, and set the global variables screenx and screeny.  If
       successful, it returns 0.  If it encounters any problems it
       returns a non-zero value. */
    if (initialize())
        return (1);
    origin_R = -4.0;    /* origin represents the upper left corner of */
    origin_I = -3.0;    /* the screen.                                */
    size_R = 8.0;       /* size represents the domain of the screen   */
    size_I = 6.0;       /* in the complex plane.                      */
    compute_fractal(origin_R, origin_I, size_R, size_I);
}

[LISTING TWO]
****************************************************************************
* Assembly code generated by TMS340 C Compiler using the -mc option for
* generating coprocessor instructions.
****************************************************************************
; gspac -mc -v20 mandel.gc mandel.if
; gspcg -o -c -v20 -o mandel.if mandel.asm mandel.tmp
        .version  20
        .ieeefl
FP      .set      A13
STK     .set      A14
        .file     "mandel.gc"
        .globl    _screenx
        .globl    _screeny
        .sym      _compute_fractal,_compute_fractal,32,2,0
        .globl    _compute_fractal
        .func     50
;>>>> void compute_fractal(float BaseR,float BaseI,float SpanR, float SpanI)
;>>>> register float AR, AI, ConstR, ConstI;
;>>>> register float ARsqr, AIsqr, DeltaI, DeltaR;
;>>>> register int row,col,color;
******************************************************
* FUNCTION DEF : _compute_fractal
******************************************************
_compute_fractal:
        MMTM      SP,A7,A9,A10,A11,FP
        SUBI      448,SP
        MOVE      SP,A11
        MOVD      RA5,*A11+,4
        MOVD      RB6,*A11+,3
        MOVE      STK,FP
        ADDK      32,STK
        MOVE      SP,*STK+,1    ;; DEBUGGER TRACEBACK AID
        .sym      _BaseR,-32,6,9,32
        .sym      _BaseI,-64,6,9,32
        .sym      _SpanR,-96,6,9,32
        .sym      _SpanI,-128,6,9,32
        .sym      _AR,32,6,4,32
        .sym      _AI,33,6,4,32
        .sym      _ConstR,30,6,4,32
        .sym      _ConstI,31,6,4,32
        .sym      _ARsqr,28,6,4,32
        .sym      _AIsqr,29,6,4,32
        .sym      _DeltaR,26,6,4,32
        .sym      _DeltaI,0,6,1,32
        .sym      _row,9,4,4,32
        .sym      _col,10,4,4,32
        .sym      _color,11,4,4,32
        .line     9
;>>>> DeltaR = SpanR / (float)screenx;
        MOVE      @_screenx,A7,1
        MOVE      A7,RA0        ; screenx --> RA0
        CVIF      RA0,RB0       ; convert RA0 from int to float, put in RB0
        MOVE      FP,A7
        SUBI      96,A7
        MOVF      *A7+,RA0      ; move parameter SpanR --> RA0
        DIVF      RA0,RB0,RB0   ; RA0 / RB0 --> RB0. Result is DeltaR
        ADDI      64,A7
        MOVF      RB0,*A7+      ; Store DeltaR as a local variable.
        .line     10
;>>>> DeltaI = SpanI / (float)screeny;
        MOVE      @_screeny,A7,1
        MOVE      A7,RA1        ; screeny --> RA1
        CVIF      RA1,RB1       ; convert to float and put in RB1
        MOVE      FP,A7
        SUBI      128,A7
        MOVF      *A7+,RA1      ; get SpanI
        DIVF      RA1,RB1,RA5   ; compute DeltaI and LEAVE IN RA5!!!
                                ; DeltaI is used as a register variable!
        .line     12
;>>>> ConstI = BaseI;
        ADDK      32,A7
        MOVF      *A7+,RB7      ; BaseI --> ConstI (RB7)
        .line     13
;>>>> for (row=0; row < screeny; row++) {
; NOTICE here that both ConstI and row are used as register variables. Yet
; ConstI, which is a float, is kept in a 34082 register and row, which is an
; int, is kept in a 34020 register! The C compiler is smart enough to know
; which variables should be maintained on which processor!
;
        CLRS      A9            ; 0 --> row (A9)
        MOVE      @_screeny,A7,1
        CMP       A7,A9
        JRGE      L2
L1:
        .line     15
;>>>> ConstR = BaseR;
        MOVE      FP,A7
        SUBK      32,A7
        MOVF      *A7+,RA7      ; BaseR --> ConstR (RA7)
        .line     16
;>>>> for (col=0; col < screenx; col++) {
        CLRS      A10           ; 0 --> col (A10)
        MOVE      @_screenx,A7,1
        CMP       A7,A10
        JRGE      L4
L3:
        .line     18
;>>>> AR = AI = ARsqr = AIsqr = 0.0F;
        CLRF      RB6           ; clear AIsqr (RB6)
        MOVF      RB6,RA6       ; clear ARsqr (RA6)
        MOVF      RB6,RB8       ; clear AI (RB8)
        MOVF      RB6,RA8       ; clear AR (RA8)
        .line     20
;>>>> for (color = 256; --color > 0;)
        MOVI      256,A11
        SUBK      1,A11         ; 255 --> color (A11)
        JRLE      L6
L5:
        .line     22
;>>>> AI = (AR * AI * 2.0F) + ConstI;
        MPYF      RA8,RB8,RA0   ; AR * AI --> RA0
        TWOF      RB0           ; 2.0F --> RB0
        MPYF      RA0,RB0,RA0   ; AR * AI * 2.0 --> RA0
        ADDF      RA0,RB7,RB8   ; RA0 + ConstI --> AI (RB8)
        .line     23
;>>>> AR = ARsqr - AIsqr + ConstR;
        SUBF      RA6,RB6,RB1   ; ARsqr - AIsqr --> RB1
        ADDF      RA7,RB1,RA8   ; ConstR + RB1 --> AR (RA8)
        .line     25
;>>>> if ( ((ARsqr = AR*AR)+
        MOVF      RA8,RB1       ; AR --> RB1
        MPYF      RA8,RB1,RA6   ; Compute new ARsqr
        MOVF      RB8,RA0       ; AI --> RA0
        MPYF      RA0,RB8,RB6   ; Compute new AIsqr
        ADDF      RA6,RB6,RA0   ; Sum of squares --> RA0
        MOVI      FS3,A7        ; FS3 is a pointer to a float constant, 4.0
        MOVF      *A7+,RB1      ; 4.0 --> RB1
        CMPF      RA0,RB1       ; if square of magnitude > 4.0, break
        GETCST
        JRGT      L6
        .line     26
;>>>> (AIsqr = AI*AI)) > 4.0F ) break;
        .line     20
        SUBK      1,A11         ; Otherwise, decrement color and see
        JRGT      L5            ; if loop ended.
L6:
        .line     29
;>>>> put_pixel(color,col,row);
        MOVE      STK,-*SP,1    ; Call display_board dependent routine
        MOVE      A9,*STK+,1    ; to place a pixel on the screen.
        MOVE      A10,*STK+,1
        MOVE      A11,*STK+,1
        CALLA     _put_pixel
        .line     30
;>>>> ConstR += DeltaR;
        MOVE      FP,A8
        MOVF      *A8+,RB0
        ADDF      RA7,RB0,RA7
        .line     16
        ADDK      1,A10         ; col++
        MOVE      @_screenx,A7,1
        CMP       A7,A10        ; If col >= screenx, end middle loop
        JRLT      L3            ; Otherwise, jump back
L4:
        .line     32
;>>>> ConstI += DeltaI;
        ADDF      RA5,RB7,RB7
        .line     13
        ADDK      1,A9          ; row++
        MOVE      @_screeny,A7,1
        CMP       A7,A9         ; If row >= screeny, end outer loop
        JRLT      L1            ; Otherwise, jump back
L2:
EPI0_1:
        .line     34
        MOVE      *SP(640),STK,1 ; C cleanup
        MOVD      *SP+,RA5,4
        MOVD      *SP+,RB6,3
        MMFM      SP,A7,A9,A10,A11,FP
        RETS      2
        .endfunc  83,00000ee80H,32
        .sym      _main,_main,36,2,0
        .globl    _main
        .func     103
;>>>> main()
;>>>> float origin_R,origin_I,size_R,size_I;
******************************************************
* FUNCTION DEF : _main
******************************************************
_main:
        MOVE      FP,-*SP,1
        MOVE      STK,FP
        ADDI      128,STK
        MOVE      SP,*STK+,1    ;; DEBUGGER TRACEBACK AID
        .sym      _origin_R,0,6,1,32
        .sym      _origin_I,32,6,1,32
        .sym      _size_R,64,6,1,32
        .sym      _size_I,96,6,1,32
        .line     12
;>>>> if (initialize()) return(1);
        CALLA     _initialize
        MOVE      A8,A8
        JRZ       L8
        MOVK      1,A8
        JR        EPI0_2
L8:
        .line     14
;>>>> origin_R = -4.0;
        MOVE      @FS4,A8,1
        MOVE      A8,*FP,1
        .line     15
;>>>> origin_I = -3.0;
        MOVE      @FS5,A8,1
        MOVE      A8,*FP(32),1
        .line     16
;>>>> size_R = 8.0;
        MOVE      @FS6,A8,1
        MOVE      A8,*FP(64),1
        .line     17
;>>>> size_I = 6.0;
        MOVE      @FS7,A8,1
        MOVE      A8,*FP(96),1
        .line     19
;>>>> compute_fractal(origin_R,origin_I,size_R,size_I);
        MOVE      STK,-*SP,1
        MOVE      *FP(96),*STK+,1
        MOVE      *FP(64),*STK+,1
        MOVE      *FP(32),*STK+,1
        MOVE      *FP(0),*STK+,1
        CALLA     _compute_fractal
EPI0_2:
        .line     20
        SUBI      160,STK
        MOVE      *SP+,FP,1
        RETS      0
        .endfunc  140,00000a000H,128
        .sym      _screenx,_screenx,4,2,32
        .globl    _screenx
        .bss      _screenx,32,32
        .sym      _screeny,_screeny,4,2,32
        .globl    _screeny
        .bss      _screeny,32,32
*************************************************
*       DEFINE FLOATING POINT CONSTANTS         *
*************************************************
        .text
        .even     32
FS1:    .float    0.0
FS3:    .float    4.0
FS4:    .float    -4.0
FS5:    .float    -3.0
FS6:    .float    8.0
FS7:    .float    6.0
*****************************************************
*       UNDEFINED REFERENCES                        *
*****************************************************
        .ref      _put_pixel
        .ref      _initialize
        .end

[LISTING THREE]
* Hand-tweaked assembler code using Listing 2 as a basis.
*
        .version  20
        .ieeefl
        .globl    _screenx
        .globl    _screeny
* Register Nicknames are used for program clarity
* 34020 Registers...
FP          .set  A13         ; C function Frame Pointer
STK         .set  A14         ; C function Stack
DPTCH       .set  B3          ; Destination Pitch of Screen
OFFSET      .set  B4          ; Offset of Screen
* 34082 Registers...
RA0_2       .set  RA0         ; 2.0 constant
RA1_4       .set  RA1         ; 4.0 constant
RA2_TMP     .set  RA2         ; temporary storage
RA5_DI      .set  RA5         ; DeltaI
RA6_AR2     .set  RA6         ; AR squared
RA7_CR      .set  RA7         ; ConstR
RA8_AR      .set  RA8         ; AR
RB1_DR      .set  RB1         ; DeltaR
RB2_TMP     .set  RB2         ; temporary storage
RB4_BI      .set  RB4         ; BaseI
RB5_BR      .set  RB5         ; BaseR
RB6_AI2     .set  RB6         ; AI squared
RB7_CI      .set  RB7         ; ConstI
RB8_AI      .set  RB8         ; AI
TubeOffset  .set  2000H       ; These definitions apply for the
TubePitch   .set  (1024 * 8)  ; SDB20 board which we used.
        .globl    _compute_fractal
******************************************************
* FUNCTION DEF : _compute_fractal
******************************************************
_compute_fractal:
        MMTM      SP,A0,A1,A2,A3,A4,A11,FP
* Since we are creating a highly efficient tweaked program, we have the
* main program place the 4 parameters used in compute_fractal directly
* into 34082 registers.  Specifically, BaseI has been placed in RB4,
* BaseR has been placed in RB5, SpanI has been placed in RA0, and SpanR
* has been placed in RA1.
;>>>> DeltaR = SpanR / (float)screenx;
        MOVE      @_screenx,A3,1          ; screenx --> A3 (stays there)
        MOVE      A3,RA2_TMP
        CVIF      RA2_TMP,RB0             ; (float)screenx --> RB0
        DIVF      RA1,RB0,RB1_DR          ; SpanR / screenx = DeltaR --> RB1
                                          ; (stays there)
;>>>> DeltaI = SpanI / (float)screeny;
        MOVE      @_screeny,A4,1          ; screeny --> A4 (stays there)
        MOVE      A4,RA2_TMP
        CVIF      RA2_TMP,RB0             ; (float)screeny --> RB0
        DIVF      RA0,RB0,RA5_DI          ; SpanI / screeny = DeltaI --> RA5
                                          ; (stays there)
* Set up initializations outside any loops
        TWOF      RA0_2                   ; constant 2.0 in RA0
        SQRF      RA0_2,RA1_4             ; constant 4.0 in RA1
;>>>> for (ConstI = BaseI, row=0; row < screeny; row++,ConstI += DeltaI)
        MOVF      RB4_BI,RB7_CI           ; BaseI --> ConstI (RB7)
        CLRS      A0                      ; 0 --> row (A0)
L1:
;>>>> for (ConstR = BaseR, col=0; col < screenx; col++,ConstR += DeltaR)
        MOVF      RB5_BR,RA7_CR           ; BaseR --> ConstR (RA7)
        CLRS      A1                      ; 0 --> col (A1)
L3:
;>>>> AR = AI = ARsqr = AIsqr = 0.0F;
        CLRF      RB8_AI                  ; 0.0 --> AI (RB8)
        MOVF      RB8_AI,RB6_AI2          ; 0.0 --> AI squared (RB6)
        CLRF      RA8_AR                  ; 0.0 --> AR (RA8)
        MOVF      RA8_AR,RA6_AR2          ; 0.0 --> AR squared (RA6)
;>>>> for (color = 256; --color > 0;)
        MOVI      255,A2                  ; 255 --> color (A2)
L5:
;>>>> AI = ( AR * AI * 2.0F ) + ConstI;
        MPYF      RA8_AR,RB8_AI,RB2_TMP   ; AR * AI --> tmp (RB2)
        MPYF      RB2_TMP,RA0_2,RA2_TMP   ; tmp * 2.0 --> tmp (RA2)
        ADDF      RA2_TMP,RB7_CI,RB8_AI   ; tmp + ConstI --> AI
;>>>> AR = ARsqr - AIsqr + ConstR;
        SUBF      RA6_AR2,RB6_AI2,RB2_TMP ; AR**2 - AI**2 --> tmp (RB2)
        ADDF      RB2_TMP,RA7_CR,RA8_AR   ; tmp + ConstR --> AR
;>>>> if ( ((ARsqr = AR*AR)+
;>>>>      (AIsqr = AI*AI)) > 4.0F ) break;
        SQRF      RA8_AR,RA6_AR2          ; Compute new ARsqr
        MOVF      RB8_AI,RA2_TMP          ; SQRF must be performed on an A reg.
        SQRF      RA2_TMP,RB6_AI2         ; Compute new AIsqr
        ADDF      RA6_AR2,RB6_AI2,RB2_TMP ; sum of squares in RB2
        CMPF      RA1_4,RB2_TMP           ; if sum of squares > 4.0, break
        GETCST
        JRLE      L6
        DSJ       A2,L5                   ; dec color and loop back if not 0
L6:
;>>>> put_pixel(color,col,row);
        MOVE      A0,A8                   ; row becomes Y
        SLL       16,A8                   ; shift Y into upper 16 bits
        MOVA      A1,A8                   ; col becomes X, Y:X now in A8
        PIXT      A2,*A8.XY               ; write the pixel
; bottom of 'col' loop
        ADDF      RB1_DR,RA7_CR,RA7_CR    ; ConstR += DeltaR
        INC       A1                      ; col++
        CMP       A3,A1                   ; if col < screenx, jump back
        JRLT      L3
; bottom of 'row' loop
L4:
        ADDF      RA5_DI,RB7_CI,RB7_CI    ; ConstI += DeltaI
        INC       A0                      ; row++
        CMP       A4,A0                   ; if row < screeny, jump back
        JRLT      L1
L2:
EPI0_1:
        MMFM      SP,A0,A1,A2,A3,A4,A11,FP
        RETS
        .globl    _main
******************************************************
* FUNCTION DEF : _main
******************************************************
_main:
        MOVE      FP,-*SP,1
        MOVE      STK,FP
        ADDI      128,STK
        MOVE      SP,*STK+,1    ;; DEBUGGER TRACEBACK AID
        CALLA     _initialize
        MOVE      A8,A8
        JRZ       L8
        MOVK      1,A8
        JR        EPI0_2
L8:
        MOVE      @ORG_I,A8,1   ; We can place the initial parameters
        MOVF      A8,RB4_BI     ; directly into the 34082 registers
        MOVE      @ORG_R,A8,1   ; where they will be used by the
        MOVF      A8,RB5_BR     ; compute_fractal routine.
        MOVE      @SIZE_I,A8,1
        MOVF      A8,RA0
        MOVE      @SIZE_R,A8,1
        MOVF      A8,RA1
        CALLA     _compute_fractal
EPI0_2:
        MOVE      *SP+,FP,1
        RETS      0
        .globl    _screenx
        .bss      _screenx,32,32
        .globl    _screeny
        .bss      _screeny,32,32
*************************************************
*       DEFINE FLOATING POINT CONSTANTS         *
*************************************************
        .text
        .even     32
ORG_R:  .float    -4.0
ORG_I:  .float    -3.0
SIZE_R: .float    8.0
SIZE_I: .float    6.0
        .ref      _initialize
        .end
Copyright © 1991, Dr. Dobb's Journal