A COPROCESSOR FOR A COPROCESSOR?
The 34082 floating point coprocessor for the 34020 graphics processor
Warren Davis and Kan Yabumoto
When it was introduced in 1985, the Texas Instruments TMS34010 Graphics System Processor (GSP) faced an identity crisis. Was it really a general-purpose microprocessor that happened to have built-in graphics-related instructions and video control circuitry, or was it merely an unusually powerful programmable graphics coprocessor? In truth it is both, although the first description is the more accurate from a technical standpoint. And while there have been many systems designed in which a TMS34010 is the sole (or main) microprocessor, it is in the PC graphics arena that this device has the potential to flourish by being used to offload graphics-related tasks from a host processor (usually an 80x86 or 680x0). At the very least, TI hopes the GSP will become a major player in this field, as evidenced by TIGA (Texas Instruments Graphics Architecture), a standard for communication between a host and a target (TMS340-based) graphics system.
But the 34010 was just the beginning. In 1989, TI began mass producing the TMS34020, which includes speed and functionality improvements over its predecessor, and is designed to accommodate an optional floating point coprocessor, the 34082. A coprocessor for a coprocessor? We shudder to think what might be next. But let's look a little deeper into the workings of these devices. Who knows, they might even make sense!
The 34010: Processor or Coprocessor?
There is no doubt that TI's GSPs are complete microprocessors in their own right. They contain internal registers, a stack pointer, a status register, and interrupt vectors. They fetch instructions and data from a local memory, have the ability to make conditional jumps, and are supported by all the standard language tools (assembler, C compiler, linker, and so on). And of course, there are the graphics-related features that make them unique. In fact, the 34010 was the first device to incorporate video signal generation and efficient graphics-related operations with an instruction set for general-purpose computing. In addition, there is a host interface built into the silicon which simplifies the hardware connection between a GSP and another computer's bus. This is a somewhat unusual feature for a microprocessor, but looking at the real world filled with PCs, Macs, and Unix-based systems, you see the logic of it. The simpler the interface, the easier it is to develop GSP programs on the host computer and then download them to the GSP's memory.
But such an interface can also be used to communicate between a host and target processor while programs are running on both. The host could download parameters -- say, the position and radius of a circle along with a fill color -- to the GSP, which would then perform some graphics-related operations, such as drawing a filled circle on the screen. In fact, most graphics coprocessors have a similar means of receiving graphics commands from a host. For this reason, no doubt, many people originally thought of the 34010 as a glorified graphics coprocessor. (The term "graphics coprocessor" is actually somewhat vague. The history of devices which assist a host processor in performing graphics tasks covers a wide spectrum of "processing" ability.)
Anyway, all "graphics coprocessors" are treated as peripheral devices by a host processor, and this is certainly true of the 34010 as well. Once the analogy was made, some pointed out that the 34010 was actually slower in performing certain graphics tasks than some graphics controllers, which implemented a fixed set of functions internally and performed them at lightning speed. The beauty of the GSP, however, is in its ability to be tailored to a specific task.
Let's look at an example. Say we want to draw a series of filled circles along a path represented by an equation. Say also that we can divide the computational tasks into four sections fairly easily. Figure 1 shows us how our processing time would probably be spent using a typical graphics controller. The host processor takes some amount of time (Tm) to calculate the position of the next circle. When it comes time to draw the circle, we offload that task to the controller. In doing so, we incur a small bit of overhead (To) which is usually more than made up for by the speed of the controller (Tg). Presuming that the host does not need to acknowledge the completion of the graphics task, the total time for the loop is Tm + To, where Tm = TA + TB + TC + TD.
If Tg is less than Tm, the graphics controller could be spending most of its time waiting for the host to send a command. Unfortunately, there isn't any work for the graphics device to do while the host is busy with other things. Now look at Figure 2, which shows a possible way of implementing this program using a GSP. We can increase the parallelism between the two processors by adjusting the division of tasks. So instead of having the GSP just draw the circle, we can send it some interim values, have it complete the computation, and then draw the circle. Even if the actual circle drawing time of the GSP is slower, the throughput of the system is faster.
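To make the timing argument concrete, here is a minimal C sketch of the two schedules. All time values are hypothetical units, not measurements of any of the devices discussed; the point is only that moving sections TC and TD onto the GSP lets the host's next iteration overlap the GSP's work.

```c
/* Serial schedule: the host computes all four sections (Tm = TA + TB +
   TC + TD), then pays the hand-off overhead To before the graphics
   device can start. Pipelined schedule: sections TC and TD run on the
   GSP, so in steady state the loop time is set by whichever side is
   slower. Times are in arbitrary hypothetical units. */
static int serial_loop_time(int ta, int tb, int tc, int td, int to)
{
    return ta + tb + tc + td + to;      /* Tm + To */
}

static int pipelined_loop_time(int ta, int tb, int tc, int td, int to)
{
    int host = ta + tb + to;            /* host's share of each loop    */
    int gsp  = tc + td;                 /* GSP's share of each loop     */
    return host > gsp ? host : gsp;     /* the slower side sets the pace */
}
```

With four equal sections of 10 units and an overhead of 2, the serial loop costs 42 units per circle while the pipelined loop settles at 22, even if the GSP's own drawing time is no better than the host's.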
Most graphics controllers contain hardcoded primitives, so the host has little or no choice in how to divide its tasks between itself and the controller. But because the GSP is completely programmable and capable of performing any standard computational task as well, there is no restriction on how much or how little it does at a time. The division of tasks between a host and GSP can be tailored to a particular need and tweaked to perfection (or as near to perfection as a deadline will allow).
So it seems pretty clear that the TMS340 GSPs must be accepted as more than just graphics coprocessors, although if that's the way you want to use them, they are more than equipped to handle the job exceptionally well.
Enter the 34020
Flexibility is one thing, but performance is another. The 34020, TI's newest GSP, provides a 32-bit external data path (which by itself virtually doubles the speed of pixel transfers over its predecessor), faster cycle times, a larger internal cache, support for a variety of VRAM capabilities, and a multiprocessor interface to allow multiple 34020s to share a memory space. Most relevant to the scope of this article, however, is the inclusion of a coprocessor interface. This notion was completely missing from the 34010, but its need becomes apparent as soon as you try to perform floating point arithmetic on the 34010. While the performance is respectable, it is nowhere near remarkable.
The 34020's coprocessor interface is general-purpose in a somewhat limited sense. Some of the 34020's local memory interface signals are used to tell a coprocessor that a command is being directed to it. Naturally, the coprocessor must be designed to listen properly, and at present there is only one device (the 34082) which will do that. Also, the 34020 is capable of working with more than one coprocessor. Through an ID field in its coprocessor instructions, the 34020 can control up to five coprocessors. Up to four of these can be 34082s, and only one may be a coprocessor of another origin which conforms to the 34020's coprocessor interface conventions.
The 34020 communicates with its coprocessors through a set of general coprocessor instructions, shown in Table 1. One of these instructions, CEXEC, simply involves the transfer of a command, embedded into the instruction, to a coprocessor. All the others involve the additional transfer of data between the 34020 and coprocessor. As Table 1 shows, data can be sent to or returned from the coprocessor using 34020 registers or memory. When executing any coprocessor instruction, the 34020 first generates a particular combination of control signals on its address/data bus to signal the coprocessor. The coprocessor command is placed onto the bus along with some other information including the coprocessor ID. Transferral of data, if any, follows. The 34020 controls these transfers, but the coprocessor needs to know what to do with data it is receiving or, if it is expected to return data to the 34020, what data to send back. This information must be inherently present in the command field sent by the 34020.
Table 1: 34020 coprocessor instructions
These are all of the 34020's coprocessor instructions.
The size field is a bit which indicates whether the operation is to be performed on 32-bit values (size = 0) or 64-bit values (size = 1).
The command field tells the coprocessor what operation to perform. If data is being transferred from the 34020 (that is, CMOVGC or CMOVMC), the command should indicate where it is to go. If data is being transferred to the 34020 (that is, CMOVCG, CMOVCM, or CMOVCS), the command should indicate what data is to be returned.
The ID field is used to select a particular coprocessor (or all coprocessors) when there is more than one in the system. When omitted, a default value (which can be changed with an assembler directive) is used.
Execute Coprocessor Command without Data Transfer
  CEXEC  size,command[,ID][,L]          Long Form
  CEXEC  size,command[,ID]              Short Form

Move from Coprocessor to 34020 Registers
  CMOVCG Rd,command[,ID]                Move one register
  CMOVCG Rd1,Rd2,size,command[,ID]      Move two registers

Move from Coprocessor to Memory
  CMOVCM *Rd+,cnt,size,command[,ID]     Post increment
  CMOVCM *-Rd,cnt,size,command[,ID]     Pre decrement

Move from Coprocessor to 34020 Status Register
  CMOVCS command[,ID]                   Replaces N, C, Z, and V bits of
                                        34020's status register

Move from 34020 Register(s) to Coprocessor
  CMOVGC Rs,command[,ID]                Move one register
  CMOVGC Rs1,Rs2,size,command[,ID]      Move two registers

Move from Memory to Coprocessor
  CMOVMC *Rs+,cnt,size,command[,ID]     Post increment, Constant count
  CMOVMC *-Rs,cnt,size,command[,ID]     Pre decrement, Constant count
  CMOVMC *Rs+,Rd,size,command[,ID]      Post increment, Register count
Presenting the 34082
As we mentioned before, the 34082 is currently the only device designed to work with the 34020's coprocessor interface. Because these devices have been designed to work so closely together, TI's TMS340 language tools support a special set of so-called "pseudo-ops" which consists entirely of variations on the instructions shown in Table 1.
For example, the 34020 instruction, ADD CRs1,CRs2,CRd (Add Integer), is actually a CEXEC instruction which sends a command to the 34082, instructing it to add two of its registers (CRs1 and CRs2) as integers and place the result in another register, CRd. The ADDF (Add Float) instruction is identical to ADD except that a different coprocessor command is sent to indicate floating point addition. The ADDD instruction is identical to ADDF except the size field is 1 to indicate an operation on 64-bit values.
The 34082 has a built-in command set contained in its internal ROM. The "commands" sent by the 34020 are actually nothing more than addresses of microcoded programs in this ROM. So when the 34020 issues the ADD instruction mentioned before, it is really just triggering the 34082 to execute a one-line program consisting of a native 34082 ADD instruction. Some 34020 pseudo-ops trigger more complex 34082 programs, such as matrix multiplications or polynomial expansions.
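The "command = ROM address" idea can be sketched with a function-pointer table in C. The opcode values and routine names below are our own inventions for illustration; on the real part, the command field sent by the 34020 selects a microcoded routine in the 34082's internal ROM.

```c
/* A toy "internal ROM": each command code indexes a table of routines,
   just as a 34082 command selects a microcoded program. The opcodes
   and routine names here are hypothetical. */
typedef double (*microcode_fn)(double a, double b);

static double mc_add(double a, double b) { return a + b; }  /* "ADD" */
static double mc_mul(double a, double b) { return a * b; }  /* "MPY" */

static microcode_fn rom[] = { mc_add, mc_mul };

static double run_command(int command, double a, double b)
{
    return rom[command](a, b);   /* dispatch: command is a ROM index */
}
```

A more complex "command" would simply be a longer routine at its slot, which is exactly how the matrix and polynomial operations fit the same dispatch model.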
Looking at the specs of the 34082 would lead one to conclude something very exciting. It is fast! The 34082-32 has a 67.5 ns instruction cycle time, a three-operand Floating Point Unit (FPU) with two levels of internal pipelining, and can perform most single precision operations in one cycle when executing out of its own local memory. (When commands are sent from the 34020, the minimum timing is equal to one 34020 cycle or 125 ns.) It supports three data types: 32-bit integer, 32-bit IEEE float, and 64-bit IEEE double.
A configuration register allows you to set the rounding mode and pipeline configuration. The 34082's native instruction set allows for conditional branches, jumps to subroutines (nested up to two-deep), loops, and interrupt service routines.
This degree of programmability within the 34082 itself is no accident. As if the "processor vs. coprocessor" issue were not muddy enough, the 34082 has the ability to act as a standalone processor. In this mode, called the "host-independent mode", programs are executed from an external memory (up to 64K long words of program and 64K long words of data) made up of either Static RAM (SRAM) or EPROM, which connects to the 34082 without any glue logic! A bootstrap loader is provided to simplify the initialization of SRAM. And TI provides not only a macro assembler and linker for the 34082, but a C compiler as well! This external memory is required for host-independent operation, but it can still be present even when the 34082 is in coprocessor mode. Communication between the 34082 and this memory occurs over a local bus, independent of the 34020 (see Figure 3). You can actually develop custom routines for the 34082, download them to SRAM or burn them into EPROM, and use them just as you would the commands built into the internal ROM! The improvement in execution speed can be remarkable, as you will see shortly.
Programming Notes
Many programmers stay away from multiplication operations, replacing them with additions when possible (for example, adding a number to itself instead of multiplying by two). On the 34082 this becomes a moot point. As long as you use the float format, most operations are a single clock cycle, so you gain nothing by replacing a multiplication with an addition. In fact, the 34082 runs so fast on many instructions that the bottleneck ends up being in the 34020-to-34082 communication.
To squeeze every drop of 34082 performance, you should focus your effort on optimizing the allocation of registers so that the restrictions placed on source operands do not force you to shuffle data between registers. The 34082's three-operand FPU allows many instructions to specify two source operands and a destination operand. The restriction on most instructions of this type is that the first source register must come from the A-file and the second from the B-file. Some instructions requiring a single-source operand require it to reside in a particular file (for example, the source register for SQR (square) instructions must be from the A-file, and the source register for INV (invert) instructions must be from the B-file). There is a mode bit in the 34082's CONFIG register which allows you to remove these restrictions by making the A- and B-files equivalent. The trade-off is that you then have only 10 registers available instead of 20.
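The two-source operand rule can be modeled in a few lines of C. The register numbering below (0-9 for the A-file, 10-19 for the B-file) is invented for illustration, not the 34082's actual encoding.

```c
#include <stdbool.h>

/* Model of the two-source operand rule: the first source must come
   from the A-file and the second from the B-file, unless the CONFIG
   mode bit makes the files equivalent (at the cost of halving the
   usable register count). Register numbering is hypothetical:
   0-9 = A-file, 10-19 = B-file. */
enum { A_FILE, B_FILE };

static int reg_file(int reg) { return reg < 10 ? A_FILE : B_FILE; }

static bool operands_legal(int src1, int src2, bool files_equivalent)
{
    if (files_equivalent)
        return true;                   /* mode bit set: no restriction */
    return reg_file(src1) == A_FILE && reg_file(src2) == B_FILE;
}
```

A register allocator, human or machine, can apply a check like this to decide when a value must first be copied into the other file before an instruction can issue.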
Some of the more complex instructions act like subroutines and use specific registers as inputs. This is right along the lines of the GSP's graphics instructions, which expect operands to be stored in specific B-file registers. The Feedback Registers, C and CT, which are primarily used for temporary storage by some instructions, are also available and can be used to minimize any inconveniences. Thankfully, there are no file restrictions on the destination register.
Fractals
Now for the fun part. To evaluate the performance of the TMS34082 floating point processor, we wrote a simple C program that displays a picture of the Mandelbrot set. The screen represents a rectangle of arbitrary dimension at some position in the complex plane. The X axis represents real number components and the Y axis represents imaginary number components. The Mandelbrot plot is created by computing successive iterations of the equation An = (An-1)^2 + C, where A and C are complex numbers, the initial value of A is 0+0i, and C is a constant which is represented by a pixel in the complex plane. For all values of C which are visible on our screen, we determine how many iterations it takes for A to diverge. For our purposes, that means how many iterations until the magnitude of A becomes greater than 2. In plotting these results, we use the number of iterations until divergence as an index into our color map. If, after 256 iterations, A has still not diverged, we simply use color 0.
To refresh your algebra in complex numbers, if we represent a complex number, A, as follows:

A = aR + aI*i = (aR, aI)

then

A + B = (aR + bR, aI + bI)
A^2 = (aR^2 - aI^2, 2*aR*aI)
magnitude of A = sqrt(aR^2 + aI^2)
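These identities translate directly into a per-point routine. Here is a minimal C sketch of the computation just described (it mirrors the inner loop of Listing One): iterate A = A^2 + C from A = 0+0i, comparing the squared magnitude against 4 to avoid a square root.

```c
/* Color for one point C = (cr, ci) of the Mandelbrot plot: the number
   of iterations remaining when A diverges (|A|^2 > 4), or 0 if A is
   still bounded after 256 iterations. */
static int mandel_color(float cr, float ci)
{
    float ar = 0.0F, ai = 0.0F;        /* A's real and imaginary parts */
    float arsq = 0.0F, aisq = 0.0F;    /* their squares, reused below  */
    int color;

    for (color = 256; --color > 0;) {
        ai = (ar * ai * 2.0F) + ci;    /* imaginary part of A^2 + C    */
        ar = arsq - aisq + cr;         /* real part of A^2 + C         */
        arsq = ar * ar;
        aisq = ai * ai;
        if (arsq + aisq > 4.0F)        /* |A|^2 > 2^2, so A diverges   */
            break;
    }
    return color;                      /* 0 means "never diverged"     */
}
```

Note that the squares arsq and aisq feed both the next iteration and the divergence test, so each is computed only once per pass.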
Our Test Program
Listing One (page 84) shows a C program written to run under any environment and on any graphics display. The main routine starts by calling a black box function called initialize( ) that performs all hardware-dependent tasks -- it initializes the display board, clears the display screen, and loads a predetermined set of 256 colors into the display board's palette memory. Under some environments, you make a query to find out what your pixel resolution is, so initialize( ) also sets the global variables screenx and screeny. There is another "black box" function which is dependent on the display board used: put_pixel( ), which writes a color at a given position on the screen. To port this program, all you need to do is write your own initialize( ) and put_pixel( ).
The only other purpose of the main routine is to set up the parameters for compute_fractal( ). The four parameters form two complex numbers, which determine what chunk of the complex plane appears on our display screen. The origin parameter becomes the upper left corner of the screen, and the size parameter gives the dimensions of the screen in the complex plane. You can see by the initial values of origin and size that we will map an area from -4.0 to +4.0 along the real (X) axis and from -3.0 to +3.0 along the imaginary (Y) axis. These numbers were chosen to approximate the aspect ratio of a typical monitor so that each pixel represents a true square. It also gives a nice encompassing picture of the Mandelbrot set. By varying these parameters, you can achieve a limitless variety of fractal landscapes, some of which are quite breathtaking.
The compute_fractal routine begins by computing DeltaR and DeltaI, which essentially represent the width and height of a single pixel in the complex plane. For every pixel on the screen, we need to determine a color. Therefore, we have two outer "for" loops, which encompass the entire screen, and an inner loop, which performs the calculations. The inner loop essentially performs complex arithmetic to determine how many iterations it takes to meet our divergence criterion. If we detect divergence, we break out of the loop and plot a pixel using the loop count as a color index. Otherwise, we fall through and plot a pixel of color 0.
(Rather than compute a true magnitude, which involves a square root, we compare the square of the magnitude to the square of our comparison value.)
Although the program is fairly simple, it is obviously a real number cruncher, so we tried to optimize the code as much as possible without losing its readability: All variables have been declared as register; we save the squares of the real and imaginary portions of A at each iteration. This is because they are used in computing both the next iteration and the square of the magnitude. By storing them, we save ourselves a multiplication.
The program in Listing One (page 84) was compiled under two environments -- Microsoft C 6.0 and Texas Instruments TMS340 C 5.01. The host computer was an 80386/25 MHz MS-DOS machine with an 80387. A TI SDB20 board, which is built around a 32-MHz 34020 processor and 34082 coprocessor, was plugged into a slot on the host computer. In both cases, we used the display buffer of the SDB20 board connected to an NEC 3D Multisync monitor to view our images. The screen resolution was 640 x 480 pixels with 256 colors.
We compiled the program for each environment in two ways. First, we had the compilers generate floating point library calls. Next, we had them generate coprocessor instructions. The timing results are shown in Table 2. We were also fortunate enough to try our program out on an 80486 machine (at 25 MHz). The 80486 is essentially an 80386 married to an 80387 on a single chip with speed enhancements, and is therefore software-compatible with the 387 version of our program.
Table 2: Results of fractal comparison (Times are shown in seconds and hr:min:sec.)
                              Image 1      Image 2      Image 3
----------------------------------------------------------------
80386/FP Library                 2231        13251        31059
                              0:37:11      3:40:51      8:37:39
34010/FP Library                 1077         5199        15528
                              0:17:57      1:26:39      2:18:48
34020/FP Library                  443         2534         6304
                              0:07:23      0:42:14      1:45:04
80386/80387                        97          569         1319
                              0:01:37      0:09:29      0:21:59
80486                              23          126          293
                              0:00:23      0:02:06      0:04:53
34020/34082                        18           93          216
                              0:00:18      0:01:33      0:03:36

*** Above entries used C program as source.
*** Following entries used assembler.

Tweaked 34020/34082                11           64          149
                              0:00:11      0:01:04      0:02:29
34082 running out of
its local SRAM                      4           17           38
Note: The times shown in this table do NOT include the overhead of writing the 640x480 pixels to the display screen. Each program was run in a mode where all pixel writing was inhibited. So the results shown above are the computation times of the algorithm only.
Although it's nice to know that the TMS340 C compiler is capable of generating 34082 instructions, anyone who's ever done any graphics programming knows that for performance, nothing beats assembler language. For that reason, we created a hand-tweaked assembler version of compute_fractal( ) based on code generated by the C compiler. The original output of the C compiler is shown in Listing Two. Compare this against the assembly code we tweaked in Listing Three (page 87).
The first thing to notice in Listing Two is that only one of the variables we declared to be "register" was placed in local memory, namely DeltaR. Every other variable is maintained in a register. Not only that, the float variables have been assigned to 34082 registers while the integers reside in 34020 registers. This is done by the C compiler automatically!
The Ultimate Method
We mentioned before that the 34082 can have its own local memory which can contain user programmed commands. In the case of the SDB20 board, there is a piggyback card available which plugs into the 34082 socket to provide the 34082 with external SRAM. By using this card, we were able to port the Mandelbrot algorithm to the 34082's SRAM. The particular programming techniques used are beyond the scope of this article, but we would be happy to answer any inquiries from interested readers. Basically, we created three new 34082 "commands." The first initializes the 34082's registers. The second performs the computations for a single point, returns the color of that point, and adjusts all registers to prepare for the next point. The third is called at the end of each line and adjusts all registers to prepare for the beginning of the next line. The 34020 simply maintains the row and column loops while sending these newly defined commands to the 34082.
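Here is a plain-C sketch of that division of labor. The xfp_* functions below emulate in software what our three custom 34082 commands do; their names, signatures, and state layout are invented for illustration (on the hardware, this state lives in 34082 registers and each call is a coprocessor bus exchange).

```c
static float c_r, c_i;       /* current point C                       */
static float row_r;          /* real part at the start of current row */
static float d_r, d_i;       /* per-pixel step through the plane      */

/* Command 1: initialize the "34082 registers". */
static void xfp_init(float base_r, float base_i, float dr, float di)
{
    c_r = row_r = base_r;  c_i = base_i;  d_r = dr;  d_i = di;
}

/* Command 2: compute the color for the current point, then advance C
   to the next pixel on the line. */
static int xfp_next_point(void)
{
    float ar = 0.0F, ai = 0.0F, arsq = 0.0F, aisq = 0.0F;
    int color;

    for (color = 256; --color > 0;) {
        ai = (ar * ai * 2.0F) + c_i;
        ar = arsq - aisq + c_r;
        if (((arsq = ar * ar) + (aisq = ai * ai)) > 4.0F)
            break;
    }
    c_r += d_r;
    return color;
}

/* Command 3: rewind to the start of the line and step down one row. */
static void xfp_next_line(void)
{
    c_r = row_r;
    c_i += d_i;
}

/* The host (34020) side keeps only the loops and the pixel writes;
   here the pixels land in a caller-supplied buffer. */
static void compute_fractal_sram(int screenx, int screeny,
                                 float base_r, float base_i,
                                 float span_r, float span_i,
                                 unsigned char *dest)
{
    int row, col;

    xfp_init(base_r, base_i, span_r / screenx, span_i / screeny);
    for (row = 0; row < screeny; row++) {
        for (col = 0; col < screenx; col++)
            *dest++ = (unsigned char)xfp_next_point();
        xfp_next_line();
    }
}
```

The host's per-pixel work shrinks to one command send and one store, which is why moving the math into the 34082's SRAM pays off so dramatically.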
And the Winner Is ...
In examining the timing results in Table 2, keep in mind that this was a test of curiosity more than anything else. The timings for the 386/387 are very dependent on the compiler and library used. The three Mandelbrot images we chose represent a wide variation in the amount of computing necessary.
The coprocessors boosted performance by a factor between 20 and 30 for both the TI and Intel chips. That isn't too surprising. After all, the existence of math coprocessors cannot be justified if the gain is marginal. However, we were very surprised to see TI's chip outperform the 80387 by a factor of 6. To explain this difference in performance, we must look into the underlying processor architectures. The entire compute_fractal( ) function fits into the on-chip instruction cache of the 34020, eliminating all instruction fetches. In this case, the 34020 executes over 80 percent of typical instructions in one machine cycle. All of its coprocessor instructions are also executed in one machine cycle. And because the TI C compiler puts 11 local variables into registers (many of which stay entirely inside the coprocessor), there are hardly any memory accesses. In the tweaked assembler version of the program, there are no memory accesses at all except for the outer loop initialization and pixel drawing.
Normally, when you replace a routine written in C with a tweaked assembler version, you would expect performance to improve by a factor of 3 or more. Not so in this case. We did not achieve even a two-fold increase in speed. Whereas many C programmers may have been skeptical of declaring register variables in the past, GSP C programmers should now get in the habit of declaring all automatic variables to be "register," keeping in mind that the compiler assigns registers in the order in which the declarations appear. By the way, we did not write a hand-tweaked version of the program for the 386/387 because it was not our purpose to provide an official benchmark, just a rough comparison. We would be happy to hear about anyone else's results from similar comparisons.
The times for the 486 machine are about four times faster than those of the 386/387 combination, which is as we expected. However, the 34020/82 combination was still faster by about 35 percent. Part of the speed improvements of the 486 come from the fact that there is no bus overhead in communicating with a coprocessor. This is almost the case when the 34082 is running our custom commands from its SRAM. The amount of communication between the 34020 and 34082 is reduced considerably, though not entirely, and yet we still see an improvement of close to a factor of 4 over the tweaked version which uses the 34082's built-in commands.
One can typically experience frustration while waiting for a Mandelbrot plot to complete. Using the 34020/34082 combination, we have practically exhausted our curiosity in this area by viewing image after image, many within a few seconds, using an interactive version of our program. Having observed this incredible performance, we wonder why we haven't yet seen an add-on card interfacing a 34082 to a PC, because a bus connection is technically feasible. At present, the price of a 34082 is about one-third that of an 80387. With some software support, it could turn a regular PC into a super number cruncher.
80x86 vs. TMS340 Philosophies
A 34082 connected to a 34020 is a floating point coprocessor in the truest sense. The 34020 does not treat it as a peripheral device but as an extension of itself. Even the hardware interface between the two devices has been optimized to make it as direct as possible. This is similar to the relationship between the Intel 80x86 and 80x87 devices. Just for grins, let's compare the Intel and Texas Instruments way of doing things.
Intel's processors are built upon a classic CISC architecture where the CPU contains a relatively small number of registers but allows most of the arithmetic and logical instructions to use memory locations as operands. This approach results in fewer move instructions than the TMS340 processors, which are influenced by the RISC philosophy. They have many more registers (30 general-purpose 32-bit wide registers) and cannot perform arithmetic and logical operations out of memory. Memory accesses are slower than register accesses, so the idea is to keep as much information as possible in registers. These philosophies were carried over to some extent to both companies' floating point math coprocessors. The 80x87 processors have relatively few (8) registers in a stack-like organization. The 34082 math coprocessor comes with many registers (20 general-purpose 64-bit wide registers plus two Feedback Registers) that can be accessed more freely.
Another concept carried over from the TMS340 processors to the 34082 is that of A-file and B-file registers. The 30 general-purpose registers of the GSPs are divided into 15 A registers (A0-A14) and 15 B registers (B0-B14). Many instructions require that both register operands be within the same file. The 34082's 20 registers are also organized in A- and B-files. Like the 34010 and 34020, there are some restrictions on register usage.
Both the 80x87 and 34082 have synchronization instructions to allow a lengthy coprocessor operation to take place concurrently with main CPU execution. Both coprocessors can also transmit/receive data to/from system memory directly. And in both cases, the main CPU is responsible for coprocessor instruction decoding and memory access for optional operands. In the Intel case, when a special "ESC" prefix is encountered by the CPU, the CPU generates a special I/O cycle to communicate with the 80x87. In the TI case, when a coprocessor instruction opcode is detected by the 34020, the 34020 initiates a special coprocessor bus cycle to which the 34082 responds. The data which actually appears on the data bus has been massaged by the 34020 to look very much like a microcoded instruction, with the "command" field being a pointer into the 34082's internal ROM.
Another interesting comparison between the 80x87 and 34082 is that the Intel chips perform 80-bit "temporary real" floating point math which provides more range and accuracy than the IEEE 64-bit double format used in the 34082. Also, while Intel's parts contain built-in logarithmic, exponential, and trigonometric functions, TI's device has none of these. These were sacrificed in favor of a variety of matrix and vector arithmetic and other graphics oriented functions. However, using the optional external memory, you can write your own functions as needed and expect the performance to be as fast or faster than other numeric processors.
All this is fascinating, I'm sure, but what about performance? Well, Table 3 shows a comparison of the speed of some floating point instructions among the latest math coprocessors. In comparing the performance of these coprocessors, we should note that the move/load/store functions of the 80387 devices create a significant overhead (20 to 93 cycles) which is negligible in the 34082. This is because Intel chose to convert all numbers to/from the "temporary real" format. TI maintains three distinct formats (int, float, and double) and gives you the choice of transferring data as is, or transferring and converting to a desired representation in one breath. We should also note that a comparison of instruction cycles alone is not very meaningful. The overall architecture of the processing environment can become very significant in evaluating the device's performance.
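That overhead point can be put in back-of-the-envelope form. The helper below is ours, not from either data sheet; the cycle counts plugged in are illustrative values in the ranges just quoted (20 to 93 cycles for an 80387 transfer, effectively zero for the 34082).

```c
/* Effective per-operation time = raw operation time plus the cost of
   moving the operands, where each transfer cycle at clock_mhz costs
   1000/clock_mhz nanoseconds. */
static double effective_ns(double op_ns, int xfer_cycles, double clock_mhz)
{
    return op_ns + xfer_cycles * (1000.0 / clock_mhz);
}
```

At 33 MHz, even the cheapest 20-cycle transfer adds roughly 600 ns on top of the 690 ns the 80387 spends on the add itself, nearly doubling the effective cost. This is why raw instruction timings alone are a poor predictor of delivered performance.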
Table 3: Comparison of Instruction Execution Times in nanoseconds for 80387, 80486, and 34082
Operation     80486 (33 MHz)      80387 (33 MHz)     34082 (32 MHz)
--------------------------------------------------------------------
abs               90  (FABS)           660           125/125   (ABSx)
compare          120  (FCOM)           720           125/125   (CMPx)
add              300  (FADD)           690           125/125   (ADDx)
multiply         480  (FMUL)           870           250/125   (MPYx)
divide          2190  (FDIV)          2640           1500/750  (DIVx)
sqrt            2550  (FSQRT)         3660           1875/1125 (SQRTx)
int2real     480/330  (FILD)      1680/600           125/125   (CVIx)
Note 1: Currently, TI is only shipping 34082s rated at 32 MHz (40 MHz will be available later).
Note 2: The two numbers separated by a slash correspond to double and float operations, respectively. The integer operations of the 34082 are equal to or slightly slower than their double precision counterparts. On the other hand, the Intel parts always operate in "temporary real" format.
Note 3: The third column reflects the timings of these operations when executed as 34020 coprocessor instructions. The minimum possible execution time is one 34020 instruction cycle (or 125 ns). On the other hand, if the 34082 were executing instructions from its local memory, the timings would be different. Specifically, the single cycle functions (abs, cmp, add, and mult) would execute in one 34082 instruction cycle (or 67.5 ns).
--W.D. and K.Y.
[LISTING ONE]
/* C program to perform display of the Mandelbrot set.  Needs to be linked
   with a module containing the initialize() and put_pixel() routines. */

int screenx, screeny;   /* These values represent the size of the display */
                        /* screen in pixels.  They are initialized in the */
                        /* initialize() routine called by main().         */

/**************************************************************************
   compute_fractal is the heart of our program.  Four parameters are
   passed from main() representing two complex numbers.  The first two
   parameters, base_R and base_I, are the real and imaginary portions of
   the upper left corner of the screen in the complex plane.  The last
   two, span_R and span_I, give the size of the area of the complex plane
   visible on the screen.

   SOME BACKGROUND...
   This routine computes successive iterations of the equation,

        An = (An-1 ** 2) + C

   where A and C are complex numbers, and C represents a point in the
   complex plane.  The initial value of A is 0+0i, and when the magnitude
   of A becomes greater than 2.0, it is assumed that the series will
   eventually diverge.  The color of the pixel at C becomes the number of
   iterations before divergence.  If after 256 iterations there is no
   divergence, color 0 is written.  The color is used as an index into the
   color palette of the display board.

   COMPLEX ARITHMETIC...
   For those of you a little rusty on your complex arithmetic, the
   following formulas are supplied...  If W and Z are complex numbers,
   then each has two parts, real and imaginary (i.e. W = W_real +
   W_imag * i).

   W + Z means (W_real + Z_real) + (W_imag + Z_imag) * i
   W * W means (W_real * W_real) - (W_imag * W_imag)
                                 + (2 * W_real * W_imag) * i
   The magnitude of Z would be SQRT((Z_real * Z_real) + (Z_imag * Z_imag))
**************************************************************************/
void compute_fractal(float BaseR, float BaseI, float SpanR, float SpanI)
{
    register float AR, AI;           /* Real and Imaginary components of A */
    register float ConstR, ConstI;   /* Real and Imaginary components of C */
    register float DeltaR, DeltaI;   /* increment values for C             */
    register float ARsqr, AIsqr;     /* squares of AR and AI               */
    register int   row, col, color;  /**** See NOTE 1 ****/

    DeltaR = SpanR / (float)screenx;
    DeltaI = SpanI / (float)screeny;
    ConstI = BaseI;
    for (row = 0; row < screeny; row++) {        /* Scan top to bottom */
        ConstR = BaseR;
        for (col = 0; col < screenx; col++) {    /* Scan left to right */
            AR = AI = ARsqr = AIsqr = 0.0F;      /**** See NOTE 2 ****/
            for (color = 256; --color > 0;) {    /* Find color for this C */
                AI = (AR * AI * 2.0F) + ConstI;  /* Compute next   */
                AR = ARsqr - AIsqr + ConstR;     /* iteration of A */
                if ( ((ARsqr = AR * AR) + (AIsqr = AI * AI)) > 4.0F )
                    break;                       /**** See NOTE 3 ****/
            }
            put_pixel(color, col, row); /* Write color to display buffer. */
            ConstR += DeltaR;
        }
        ConstI += DeltaI;
    }
}
/* NOTE 1: We declare everything to be register variables.  For some
   processors this may not have much of an effect, but on others (like
   the 34020 and 34082) you may be surprised.
   NOTE 2: For each point on the screen, we begin computing iterations of
   the Mandelbrot equation.  The initial value of A is 0+0i.  Since the
   values A_real*A_real and A_imag*A_imag are used in computing both the
   next iteration of A and its magnitude, we maintain these values as
   separate variables so the multiplications need only be computed once.
   NOTE 3: For our magnitude comparison, we actually compare the SQUARE
   of the magnitude against the square of our divergence value.  This
   saves us from computing a square root.
*/
/**************************************************************************
   The main() function serves only to pass initial values to
   compute_fractal.  We will leave the initialize() routine as a "black
   box".  Interested programmers may want to write their own routine for
   whatever display board is available.  The values used in this test
   program show the familiar picture of the Mandelbrot set.  By varying
   these numbers, you can obtain some breathtaking fractal landscapes.
**************************************************************************/
main()
{
    float origin_R, origin_I, size_R, size_I;

    /* The initialize() routine must initialize the display board, clear
       the display buffer, load a table of 256 colors into the color
       palette, and set the global variables screenx and screeny.  If
       successful, it returns 0.  If it encounters any problems it
       returns a non-zero value. */
    if (initialize())
        return (1);
    origin_R = -4.0;    /* origin represents the upper left corner of */
    origin_I = -3.0;    /* the screen.                                */
    size_R = 8.0;       /* size represents the domain of the screen   */
    size_I = 6.0;       /* in the complex plane.                      */
    compute_fractal(origin_R, origin_I, size_R, size_I);
}

[LISTING TWO]
****************************************************************************
* Assembly code generated by TMS340 C Compiler using the -mc option for
* generating coprocessor instructions.
****************************************************************************
; gspac -mc -v20 mandel.gc mandel.if
; gspcg -o -c -v20 -o mandel.if mandel.asm mandel.tmp
        .version  20
        .ieeefl
FP      .set      A13
STK     .set      A14
        .file     "mandel.gc"
        .globl    _screenx
        .globl    _screeny
        .sym      _compute_fractal,_compute_fractal,32,2,0
        .globl    _compute_fractal
        .func     50
;>>>> void compute_fractal(float BaseR,float BaseI,float SpanR, float SpanI)
;>>>> register float AR, AI, ConstR, ConstI;
;>>>> register float ARsqr, AIsqr, DeltaI, DeltaR;
;>>>> register int row,col,color;
******************************************************
* FUNCTION DEF : _compute_fractal
******************************************************
_compute_fractal:
        MMTM      SP,A7,A9,A10,A11,FP
        SUBI      448,SP
        MOVE      SP,A11
        MOVD      RA5,*A11+,4
        MOVD      RB6,*A11+,3
        MOVE      STK,FP
        ADDK      32,STK
        MOVE      SP,*STK+,1    ;; DEBUGGER TRACEBACK AID
        .sym      _BaseR,-32,6,9,32
        .sym      _BaseI,-64,6,9,32
        .sym      _SpanR,-96,6,9,32
        .sym      _SpanI,-128,6,9,32
        .sym      _AR,32,6,4,32
        .sym      _AI,33,6,4,32
        .sym      _ConstR,30,6,4,32
        .sym      _ConstI,31,6,4,32
        .sym      _ARsqr,28,6,4,32
        .sym      _AIsqr,29,6,4,32
        .sym      _DeltaR,26,6,4,32
        .sym      _DeltaI,0,6,1,32
        .sym      _row,9,4,4,32
        .sym      _col,10,4,4,32
        .sym      _color,11,4,4,32
        .line     9
;>>>> DeltaR = SpanR / (float)screenx;
        MOVE      @_screenx,A7,1
        MOVE      A7,RA0        ; screenx --> RA0
        CVIF      RA0,RB0       ; convert RA0 from int to float, put in RB0
        MOVE      FP,A7
        SUBI      96,A7
        MOVF      *A7+,RA0      ; move parameter SpanR --> RA0
        DIVF      RA0,RB0,RB0   ; RA0 / RB0 --> RB0. Result is DeltaR
        ADDI      64,A7
        MOVF      RB0,*A7+      ; Store DeltaR as a local variable.
        .line     10
;>>>> DeltaI = SpanI / (float)screeny;
        MOVE      @_screeny,A7,1
        MOVE      A7,RA1        ; screeny --> RA1
        CVIF      RA1,RB1       ; convert to float and put in RB1
        MOVE      FP,A7
        SUBI      128,A7
        MOVF      *A7+,RA1      ; get SpanI
        DIVF      RA1,RB1,RA5   ; compute DeltaI and LEAVE IN RA5!!!
                                ; DeltaI is used as a register variable!
        .line     12
;>>>> ConstI = BaseI;
        ADDK      32,A7
        MOVF      *A7+,RB7      ; BaseI --> ConstI (RB7)
        .line     13
;>>>> for (row=0; row < screeny; row++) {
; NOTICE here that both ConstI and row are used as register variables. Yet
; ConstI, which is a float, is kept in a 34082 register and row, which is an
; int, is kept in a 34020 register! The C compiler is smart enough to know
; which variables should be maintained on which processor!
;
        CLRS      A9            ; 0 --> row (A9)
        MOVE      @_screeny,A7,1
        CMP       A7,A9
        JRGE      L2
L1:
        .line     15
;>>>> ConstR = BaseR;
        MOVE      FP,A7
        SUBK      32,A7
        MOVF      *A7+,RA7      ; BaseR --> ConstR (RA7)
        .line     16
;>>>> for (col=0; col < screenx; col++) {
        CLRS      A10           ; 0 --> col (A10)
        MOVE      @_screenx,A7,1
        CMP       A7,A10
        JRGE      L4
L3:
        .line     18
;>>>> AR = AI = ARsqr = AIsqr = 0.0F;
        CLRF      RB6           ; clear AIsqr (RB6)
        MOVF      RB6,RA6       ; clear ARsqr (RA6)
        MOVF      RB6,RB8       ; clear AI (RB8)
        MOVF      RB6,RA8       ; clear AR (RA8)
        .line     20
;>>>> for (color = 256; --color > 0;)
        MOVI      256,A11
        SUBK      1,A11         ; 255 --> color (A11)
        JRLE      L6
L5:
        .line     22
;>>>> AI = (AR * AI * 2.0F) + ConstI;
        MPYF      RA8,RB8,RA0   ; AR * AI --> RA0
        TWOF      RB0           ; 2.0F --> RB0
        MPYF      RA0,RB0,RA0   ; AR * AI * 2.0 --> RA0
        ADDF      RA0,RB7,RB8   ; RA0 + ConstI --> AI (RB8)
        .line     23
;>>>> AR = ARsqr - AIsqr + ConstR;
        SUBF      RA6,RB6,RB1   ; ARsqr - AIsqr --> RB1
        ADDF      RA7,RB1,RA8   ; ConstR + RB1 --> AR (RA8)
        .line     25
;>>>> if ( ((ARsqr = AR*AR)+
        MOVF      RA8,RB1       ; AR --> RB1
        MPYF      RA8,RB1,RA6   ; Compute new ARsqr
        MOVF      RB8,RA0       ; AI --> RA0
        MPYF      RA0,RB8,RB6   ; Compute new AIsqr
        ADDF      RA6,RB6,RA0   ; Sum of squares --> RA0
        MOVI      FS3,A7        ; FS3 is a pointer to a float constant, 4.0
        MOVF      *A7+,RB1      ; 4.0 --> RB1
        CMPF      RA0,RB1       ; if square of magnitude > 4.0, break
        GETCST
        JRGT      L6
        .line     26
;>>>> (AIsqr = AI*AI)) > 4.0F ) break;
        .line     20
        SUBK      1,A11         ; Otherwise, decrement color and see
        JRGT      L5            ; if loop ended.
L6:
        .line     29
;>>>> put_pixel(color,col,row);
        MOVE      STK,-*SP,1    ; Call display_board dependent routine
        MOVE      A9,*STK+,1    ; to place a pixel on the screen.
        MOVE      A10,*STK+,1
        MOVE      A11,*STK+,1
        CALLA     _put_pixel
        .line     30
;>>>> ConstR += DeltaR;
        MOVE      FP,A8
        MOVF      *A8+,RB0
        ADDF      RA7,RB0,RA7
        .line     16
        ADDK      1,A10         ; col++
        MOVE      @_screenx,A7,1
        CMP       A7,A10        ; If col >= screenx, end middle loop
        JRLT      L3            ; Otherwise, jump back
L4:
        .line     32
;>>>> ConstI += DeltaI;
        ADDF      RA5,RB7,RB7
        .line     13
        ADDK      1,A9          ; row++
        MOVE      @_screeny,A7,1
        CMP       A7,A9         ; If row >= screeny, end outer loop
        JRLT      L1            ; Otherwise, jump back
L2:
EPI0_1:
        .line     34
        MOVE      *SP(640),STK,1 ; C cleanup
        MOVD      *SP+,RA5,4
        MOVD      *SP+,RB6,3
        MMFM      SP,A7,A9,A10,A11,FP
        RETS      2
        .endfunc  83,00000ee80H,32
        .sym      _main,_main,36,2,0
        .globl    _main
        .func     103
;>>>> main()
;>>>> float origin_R,origin_I,size_R,size_I;
******************************************************
* FUNCTION DEF : _main
******************************************************
_main:
        MOVE      FP,-*SP,1
        MOVE      STK,FP
        ADDI      128,STK
        MOVE      SP,*STK+,1    ;; DEBUGGER TRACEBACK AID
        .sym      _origin_R,0,6,1,32
        .sym      _origin_I,32,6,1,32
        .sym      _size_R,64,6,1,32
        .sym      _size_I,96,6,1,32
        .line     12
;>>>> if (initialize()) return(1);
        CALLA     _initialize
        MOVE      A8,A8
        JRZ       L8
        MOVK      1,A8
        JR        EPI0_2
L8:
        .line     14
;>>>> origin_R = -4.0;
        MOVE      @FS4,A8,1
        MOVE      A8,*FP,1
        .line     15
;>>>> origin_I = -3.0;
        MOVE      @FS5,A8,1
        MOVE      A8,*FP(32),1
        .line     16
;>>>> size_R = 8.0;
        MOVE      @FS6,A8,1
        MOVE      A8,*FP(64),1
        .line     17
;>>>> size_I = 6.0;
        MOVE      @FS7,A8,1
        MOVE      A8,*FP(96),1
        .line     19
;>>>> compute_fractal(origin_R,origin_I,size_R,size_I);
        MOVE      STK,-*SP,1
        MOVE      *FP(96),*STK+,1
        MOVE      *FP(64),*STK+,1
        MOVE      *FP(32),*STK+,1
        MOVE      *FP(0),*STK+,1
        CALLA     _compute_fractal
EPI0_2:
        .line     20
        SUBI      160,STK
        MOVE      *SP+,FP,1
        RETS      0
        .endfunc  140,00000a000H,128
        .sym      _screenx,_screenx,4,2,32
        .globl    _screenx
        .bss      _screenx,32,32
        .sym      _screeny,_screeny,4,2,32
        .globl    _screeny
        .bss      _screeny,32,32
*************************************************
*       DEFINE FLOATING POINT CONSTANTS         *
*************************************************
        .text
        .even     32
FS1:    .float    0.0
FS3:    .float    4.0
FS4:    .float    -4.0
FS5:    .float    -3.0
FS6:    .float    8.0
FS7:    .float    6.0
*****************************************************
*       UNDEFINED REFERENCES                        *
*****************************************************
        .ref      _put_pixel
        .ref      _initialize
        .end

[LISTING THREE]
* Hand-tweaked assembler code using Listing 2 as a basis.
*
        .version  20
        .ieeefl
        .globl    _screenx
        .globl    _screeny
* Register Nicknames are used for program clarity
* 34020 Registers...
FP          .set  A13         ; C function Frame Pointer
STK         .set  A14         ; C function Stack
DPTCH       .set  B3          ; Destination Pitch of Screen
OFFSET      .set  B4          ; Offset of Screen
* 34082 Registers...
RA0_2       .set  RA0         ; 2.0 constant
RA1_4       .set  RA1         ; 4.0 constant
RA2_TMP     .set  RA2         ; temporary storage
RA5_DI      .set  RA5         ; DeltaI
RA6_AR2     .set  RA6         ; AR squared
RA7_CR      .set  RA7         ; ConstR
RA8_AR      .set  RA8         ; AR
RB1_DR      .set  RB1         ; DeltaR
RB2_TMP     .set  RB2         ; temporary storage
RB4_BI      .set  RB4         ; BaseI
RB5_BR      .set  RB5         ; BaseR
RB6_AI2     .set  RB6         ; AI squared
RB7_CI      .set  RB7         ; ConstI
RB8_AI      .set  RB8         ; AI
TubeOffset  .set  2000H       ; These definitions apply for the
TubePitch   .set  (1024 * 8)  ; SDB20 board which we used.
        .globl    _compute_fractal
******************************************************
* FUNCTION DEF : _compute_fractal
******************************************************
_compute_fractal:
        MMTM      SP,A0,A1,A2,A3,A4,A11,FP
* Since we are creating a highly efficient tweaked program, we have the
* main program place the 4 parameters used in compute_fractal directly
* into 34082 registers.  Specifically, BaseI has been placed in RB4,
* BaseR has been placed in RB5, SpanI has been placed in RA0, and SpanR
* has been placed in RA1.
;>>>> DeltaR = SpanR / (float)screenx;
        MOVE      @_screenx,A3,1          ; screenx --> A3 (stays there)
        MOVE      A3,RA2_TMP
        CVIF      RA2_TMP,RB0             ; (float)screenx --> RB0
        DIVF      RA1,RB0,RB1_DR          ; SpanR / screenx = DeltaR --> RB1
                                          ; (stays there)
;>>>> DeltaI = SpanI / (float)screeny;
        MOVE      @_screeny,A4,1          ; screeny --> A4 (stays there)
        MOVE      A4,RA2_TMP
        CVIF      RA2_TMP,RB0             ; (float)screeny --> RB0
        DIVF      RA0,RB0,RA5_DI          ; SpanI / screeny = DeltaI --> RA5
                                          ; (stays there)
* Set up initializations outside any loops
        TWOF      RA0_2                   ; constant 2.0 in RA0
        SQRF      RA0_2,RA1_4             ; constant 4.0 in RA1
;>>>> for (ConstI = BaseI, row=0; row < screeny; row++,ConstI += DeltaI)
        MOVF      RB4_BI,RB7_CI           ; BaseI --> ConstI (RB7)
        CLRS      A0                      ; 0 --> row (A0)
L1:
;>>>> for (ConstR = BaseR, col=0; col < screenx; col++,ConstR += DeltaR)
        MOVF      RB5_BR,RA7_CR           ; BaseR --> ConstR (RA7)
        CLRS      A1                      ; 0 --> col (A1)
L3:
;>>>> AR = AI = ARsqr = AIsqr = 0.0F;
        CLRF      RB8_AI                  ; 0.0 --> AI (RB8)
        MOVF      RB8_AI,RB6_AI2          ; 0.0 --> AI squared (RB6)
        CLRF      RA8_AR                  ; 0.0 --> AR (RA8)
        MOVF      RA8_AR,RA6_AR2          ; 0.0 --> AR squared (RA6)
;>>>> for (color = 256; --color > 0;)
        MOVI      255,A2                  ; 255 --> color (A2)
L5:
;>>>> AI = ( AR * AI * 2.0F ) + ConstI;
        MPYF      RA8_AR,RB8_AI,RB2_TMP   ; AR * AI --> tmp (RB2)
        MPYF      RB2_TMP,RA0_2,RA2_TMP   ; tmp * 2.0 --> tmp (RA2)
        ADDF      RA2_TMP,RB7_CI,RB8_AI   ; tmp + ConstI --> AI
;>>>> AR = ARsqr - AIsqr + ConstR;
        SUBF      RA6_AR2,RB6_AI2,RB2_TMP ; AR**2 - AI**2 --> tmp (RB2)
        ADDF      RB2_TMP,RA7_CR,RA8_AR   ; tmp + ConstR --> AR
;>>>> if ( ((ARsqr = AR*AR)+
;>>>>      (AIsqr = AI*AI)) > 4.0F ) break;
        SQRF      RA8_AR,RA6_AR2          ; Compute new ARsqr
        MOVF      RB8_AI,RA2_TMP          ; SQRF must be performed on an A reg.
        SQRF      RA2_TMP,RB6_AI2         ; Compute new AIsqr
        ADDF      RA6_AR2,RB6_AI2,RB2_TMP ; sum of squares in RB2
        CMPF      RA1_4,RB2_TMP           ; if sum of squares > 4.0, break
        GETCST
        JRLE      L6
        DSJ       A2,L5                   ; dec color and loop back if not 0
L6:
;>>>> put_pixel(color,col,row);
        MOVE      A0,A8                   ; row becomes Y
        SLL       16,A8                   ; shift Y into upper 16 bits
        MOVA      A1,A8                   ; col becomes X, Y:X now in A8
        PIXT      A2,*A8.XY               ; write the pixel
; bottom of 'col' loop
        ADDF      RB1_DR,RA7_CR,RA7_CR    ; ConstR += DeltaR
        INC       A1                      ; col++
        CMP       A3,A1                   ; if col < screenx, jump back
        JRLT      L3
; bottom of 'row' loop
L4:
        ADDF      RA5_DI,RB7_CI,RB7_CI    ; ConstI += DeltaI
        INC       A0                      ; row++
        CMP       A4,A0                   ; if row < screeny, jump back
        JRLT      L1
L2:
EPI0_1:
        MMFM      SP,A0,A1,A2,A3,A4,A11,FP
        RETS
        .globl    _main
******************************************************
* FUNCTION DEF : _main
******************************************************
_main:
        MOVE      FP,-*SP,1
        MOVE      STK,FP
        ADDI      128,STK
        MOVE      SP,*STK+,1    ;; DEBUGGER TRACEBACK AID
        CALLA     _initialize
        MOVE      A8,A8
        JRZ       L8
        MOVK      1,A8
        JR        EPI0_2
L8:
        MOVE      @ORG_I,A8,1   ; We can place the initial parameters
        MOVF      A8,RB4_BI     ; directly into the 34082 registers
        MOVE      @ORG_R,A8,1   ; where they will be used by the
        MOVF      A8,RB5_BR     ; compute_fractal routine.
        MOVE      @SIZE_I,A8,1
        MOVF      A8,RA0
        MOVE      @SIZE_R,A8,1
        MOVF      A8,RA1
        CALLA     _compute_fractal
EPI0_2:
        MOVE      *SP+,FP,1
        RETS      0
        .globl    _screenx
        .bss      _screenx,32,32
        .globl    _screeny
        .bss      _screeny,32,32
*************************************************
*       DEFINE FLOATING POINT CONSTANTS         *
*************************************************
        .text
        .even     32
ORG_R:  .float    -4.0
ORG_I:  .float    -3.0
SIZE_R: .float    8.0
SIZE_I: .float    6.0
        .ref      _initialize
        .end
Copyright © 1991, Dr. Dobb's Journal