Embedded Systems

Programming the 29050

By David L. Moore, January 01, 1992

RISC processors present programming challenges, particularly in terms of fast floating-point events.

JAN92: PROGRAMMING THE 29050

Dave wrote the code generator and optimizer for YARC Systems' 29000 Fortran compiler. He is the author of FTL Modula-2 for Z80, MS-DOS, and 68000 systems. Dave can be contacted as DaMoore on BIX, or as [email protected].

Reduced Instruction Set Computers (RISC) give consistently higher performance than Complex Instruction Set Computers (CISC). In fact, the current generation of RISC microprocessors is typically three times more powerful than the 80486 and 68040, the fastest CISC chips presently available.

Although RISC processors are used in the current generation of workstations (where CISC machines are too slow), the greatest success RISC has enjoyed, at least in terms of actual numbers of processors shipped, has been in embedded systems applications. In these applications, the processor is used as part of some machine--a laser printer or optical character recognition system--rather than as a general-purpose computer.

Some embedded systems designers continue to use CISC because they believe RISC processors are difficult to program. As the chips have become more sophisticated, and the supporting tools more robust, these objections have been overcome. Many projects that would have been coded entirely in assembler for earlier CISC processors can now be written in a high-level language for the faster RISC chips. Assembler coding is then left to the final tuning phase of a project, if used at all. And with many RISC processors, assembly language programming isn't that difficult anyway. In fact, because of flat address space and more registers, RISC assembly programming is much easier than programming on the 8086 family.

In this article, I concentrate on the AM29050, the most powerful member of the Advanced Micro Devices' 29000 family. The AM29050 implements floating-point operations on the chip using conventional instructions. In the earlier members of the family, floating point is performed either in software, or by the AM29027 coprocessor. The single-chip AM29050, however, provides much higher performance than the 29000 with attached coprocessor; better still, the 29050 does not require special programming.

The AM29050 Instruction Set

The AM29050 is a three-address machine. Arithmetic and logic instructions contain two operand fields and a result field. One of the operands can be a literal value in the range 0-255 as in Example 1(a). The first operand is the destination field. The first two instructions are straightforward adds. The third instruction ("subtract reverse") subtracts gr97 from 2 and puts the result in gr99. gr96 through gr99 are examples of global registers, which I'll discuss shortly.

Example 1: (a) One of the operands can be a literal value in the range 0 to 255; (b) loads and stores always require the address of the load or store be contained in a register; (c) commonly used values for cases in which the first operand is 0, and the second controls the type of memory access.

  (a)
  [1] add  gr96,gr97,gr98
  [2] add  gr97,gr97,1
  [3] subr  gr99,gr97,2

  (b)
  [1] load  0,0,gr96,gr97
  [2] store  0,0,gr96,gr98

  (c)
  [1] load  0,0,gr96,gr97        ;load full 32 bit word
  [2] load  0,0x01,gr96,gr97     ;load byte, zero fill
  [3] load  0,0x11,gr96,gr97     ;load byte, sign extend
  [4] load  0, 0x02, gr96, gr97  ;load half word, zero fill
  [5] load  0,0x12,gr96,gr97     ;load half word, sign extended

Loads and stores always require that the address of the load or store be contained in a register; see Example 1(b). The first line loads the value stored at the address given by gr97 into gr96. The second line stores this value into the address given by gr98.

In addition to the source and destination register, the load and store instructions have 8 bits of additional information. This is specified by the first two operands of the instruction. The first operand specifies a 1-bit value, while the second specifies a 7-bit value.

The first operand is normally 0. It is set to 1 to perform a load or store to an attached coprocessor, such as the now obsolete AM29027, instead of to memory. All communication with the AM29027 was carried out in this way. When talking to the coprocessor, the second (7-bit) field was a coprocessor command. I'll not go into these here because the AM29050 doesn't need an attached floating-point coprocessor. The interface, however, is still supported.

When the first operand is 0, the second controls the type of memory access. The commonly used values are shown in Example 1 (c). (Note that the AM29050 assembler adopts the C notation "0xnnnn" for hexadecimal numbers.)

CISC assembly language programmers will find the operation of the AM29050's jump instructions surprising. As with most RISC chips, each jump instruction has an associated "delay instruction." The delay instruction is placed immediately after the jump. It is executed while the jump is being processed.

When first programming the chip, it is easy to forget to code a delay instruction for every jump. Experience does not remove the obligation of trying to do something useful with this instruction, rather than simply coding a noop. This is one of the few genuine difficulties in programming the AM29050.

As an example, Example 2(a) shows a loop that adds together an array of 100 numbers on an 8086 processor. On the AM29000, this loop translates to the code in Example 2(b). Lines [1a] through [1d] declare new names for some global registers. This facility makes it easy to document the contents of registers and is essential given the large number of registers on the AM29050. Lines [2] through [5] correspond to lines [1] through [3] in the 8086 code. Notice the loop counter (line 3) starts at two less than the loop count. This is because the jmpfdec instruction at line [8] tests the counter for a negative value before performing the decrement. Line [9] is the delay instruction for the jmpfdec.

Example 2: (a) A loop that adds together an array of 100 numbers on an 8086 processor; (b) the loop shown in (a) translated to the code on an AM29000.

  (a)
  [1]  xor  ax,ax
  [2]  mov  cx,100
  [3]  lea  bx,array
  [4]  addloop: add  ax, word ptr [bx]
  [5]  add  bx,2
  [6]  loop  addloop

  (b)
  [1a]  .reg  counter,gr96
  [1b]  .reg  addr,gr97
  [1c]  .reg  sum,gr98
  [1d]  .reg  temp,gr99
  [2]  const  sum,0
  [3]  const  counter,100-2
  [4]  const  addr,array
  [5]  consth  addr,array
  [6]  addloop:load  0,0,temp,addr
  [7]  add  addr,addr,4
  [8]  jmpfdec  counter,addloop
  [9]  add  sum,sum,temp

This code runs very quickly on the AM29050 because we are not using the value loaded at line [6] until line [9], three cycles after the load started. There is no delay waiting for the value to be loaded. On the 29050, this loop takes four clock cycles each time around. On the 80486, the loop takes 11 clock cycles if the data value is in cache and around 16 if it is not.

Registers Galore!

Because of the reduced instruction set, a RISC chip requires less space for microcode, or for hardwired logic. Instead, this space can be used for registers. The AM29000 range has over 150 general-purpose registers, all 32 bits wide.

On a CISC machine, there is a fixed number of registers, each with a unique name. The same few registers are used by all procedures, so registers have to be saved and restored across procedure calls.

On the AM29050, there are two sets of registers. The global registers are fixed, just as on a CISC machine. gr96 always refers to a particular 32-bit register in the processor: global register 96. The local registers are dynamic. Each procedure allocates its own set of local registers. The hardware register corresponding to a given local register changes from procedure to procedure. A procedure only allocates as many local registers as it requires, up to a maximum of 126. Within the procedure, these registers are referenced as lr0, lr1, and so on.

The AM29050 processor contains 128 hardware registers for local registers. When deeply nested procedures are called, the total number of registers allocated by all the procedures may exceed this number. When this happens, the oldest registers are spilled to a stack in memory. Like most microprocessor stacks, this stack grows downwards in memory.

Register filling and spilling is "lazy": A spilled register is not restored to the hardware until it is needed. This results in a pool of unused local registers in the hardware. Whenever a procedure allocates no more registers than exist in the free pool, no spill or fill is required for that procedure. As a result, the overhead of saving and restoring registers across procedure calls is often completely avoided.

What's Left Out?

The original RISC processors implemented only instructions that could be executed in a single cycle. More complex instructions had to be implemented as procedure calls or software interrupts. Almost all of the usual logical and arithmetic instructions that you would expect to find on a CISC machine are present and implemented in hardware on the AM29050, rather than as software traps.

The AM29050 instruction set is "reduced" from a CISC instruction set in only three ways. One of these we have already seen: The processor can only perform data manipulation on registers; arithmetic and logical operations cannot contain implicit loads and stores.

The second reduction is that all instructions are exactly 4 bytes in length. This restriction is noticed when loading constants. A general constant load requires two instructions, as in Example 3. The first instruction sets the bottom half of global register 96 to 0x5678 and clears the top half to 0. The second instruction sets the top half to 0x1234 without altering the bottom half of the register.

Example 3: A general constant load requires two instructions.

  [1]  const  gr96,0x12345678
  [2]  consth  gr96,0x12345678

Clearly, loads of constants with 0s in the top 16 bits only require the first instruction. In addition, there is a constn instruction which loads a constant with all 1s in the top 16 bits.

Conquering Divide

The final reduction is that, on the AM29050, there is no hardware integer divide instruction. The divide instruction causes a trap to a software routine which requires about 40 machine cycles. This is still comparable to an 80486, which typically takes 43 cycles, but there are a couple of faster ways to do integer divides.

The AM29050 does support floating-point divide in hardware. In addition, the AM29050 floating point supports denormalized reals in hardware. A denormalized real is a real number with an exponent of 0. We can use this to divide positive integers; see Example 4(a). This will do an integer division between two positive integers smaller than about eight million (23 significant bits) in 14 cycles.

Example 4: (a) Using a denormalized real to divide positive integers; (b) dividing a positive integer by 3.

  (a)
  [1]  fdiv        gr96,gr97,gr98
  [2]convertgr96,  gr96,unsigned
  truncate, single,single

  (b)
  [1] const   gr98,0x55555556
  [2] consth  gr98,0x55555556
  [3] multmu   gr96,gr97,gr98

Alternatively, if the divisor is a known constant, we can implement the divide of a positive value as a multiply! We first divide 0x100000000 by the constant, then use this value to do a multmu, which produces the higher 32 bits of a 64-bit unsigned multiply.

For example, to divide a positive integer by 3, we calculate 0x100000000 /3=0x55555556, and the code comes to resemble Example 4(b). This divide takes five cycles! (I first saw this trick in the Pascal compiler for the CDC 6600, written by Urs Amman.)

Fast Floating Point

The floating-point instructions take several cycles to complete, but the floating-point units are pipelined so that, in many cases, a new operation can be started every cycle. (See Table 1.) You don't have to wait for one operation to complete before starting the next.

Table 1: Times for some slow operations

  Operation  Time  Issue every
  ----------------------------
  MULTIPLY     3        1
  FADD         3        1
  FMUL         3        1
  FDIV        11       10
  DADD         3        1
  DMUL         6        4
  DDIV        18       17

  (All times in cycles)

The pipelining means that, with a 40-MHz clock, you can perform 40 Megaflops on a single-chip computer! In fact, the actual limit is twice this because two instructions, FMSM and FMAC, start a combined multiply and add, and a new one can be started every cycle. Hence, the peak rate is 80 Megaflops!

To get the maximum floating-point performance, you need to carefully consider the order in which operations are to be performed. Evaluating a polynomial is a good example.

The traditional way to evaluate a polynomial is to use Horner's rule. For example, the polynomial x^4-10x^3 +35x^2-50x+24 would be evaluated as (((x-10)*x+35)*x-50)*x+24. As each step uses the result of the previous step, no advantage can be taken of the pipelining. It will take 21 cycles to evaluate this polynomial in single precision using Horner's rule. If, instead, you find the roots of the polynomial, you can evaluate the polynomial as ((x-1)*(x-2))*((x-3)*(x-4)). The subtractions can all be done in parallel so that all four subtracts are complete after six cycles, even though each subtract requires three cycles. The first two multiplies are then executed in parallel. Only the final multiply must be executed alone. In this way, you can evaluate the polynomial in 12 cycles.

Using Burst Mode

Floating-point calculations on the AM29050 are so fast, that the limit on calculation speed is often not the floating-point performance of the chip, but the speed of main memory.

Consider taking a Euclidean norm, for example. This simply squares the elements of an array and adds the squares together. In C, this would look like Example 5(a). This piece of code requires one memory load per two floating-point operations. In AM29050 code, in its simplest form, the loop would look like that in Example 5(b). By convention, functions always pass parameters starting in lr2 and return results in gr96.

Example 5: (a) Squaring the elements of an array and adding the squares together in C; (b) doing the same thing in AM29050 code.

  (a)
  float Euc_Norm(float) x[],int len)
  {
      int i;
      float a=0.0;
      for (i=0;i<len;i++) a+=x[i] *x[i];
      }
  (b)
  [1]        sub      lr3,lr3,2   ;prepare counter for jmpfdec
  [2]        const    gr96,0
  [3]        mtacc    gr96,single_precision,0  ;clear acc 0
  [4]  loop: load     0,0,gr97,lr2
  [5]        add      lr2,lr2,4
  [7]        jmpfdec  lr3,loop
  [8]        fmac     single_precision,0,gr97,gr97
  [9]        jmpi     lr0
  [10]       mfacc    gr96,single_precision,0  ;result from acc 0

The fmac instruction at line [8] multiplies its arguments and adds the result to one of four accumulators. These accumulators are special-purpose registers, not part of the global or local registers. The mtacc instruction (move to accumulator) at line [3] moves the value of gr96 (0) into accumulator zero. The mfacc instruction (move from accumulator) at line [10] retrieves the result. It is the delay instruction for the jump indirect instruction at line [9].

Each iteration of this loop requires four cycles. Each fmac requires six cycles each but can be started every three cycles. The accumulate for one loop is finishing while the multiply for the next loop is being performed.

At 40 MHz, you do this loop ten million times per second, and each iteration does two floating-point operations, so this simple loop gives us 20 Megaflops.

Burst mode allows you to access memory faster than you could by accessing memory one location at a time. The memory in a memory chip is organized into rows and columns. When a normal access is made, the chip must first select the column for the bit to be read, and then select the row.

In burst mode, the processor takes advantage of static-column mode, in which the column does not change from one access to the next. The column-select step is avoided, resulting in a faster read access. The three cycles normally required to load a word can be reduced to two.

Burst mode is always used to load instructions into the processor. This is the only way the memory system can keep up with the instruction execution rate required by the processor. Extra hardware, external to the AM29050, is used on the instruction bus so a word of instructions can be loaded every cycle, rather than every other cycle. This special hardware could be duplicated on the data bus to produce single-cycle burst data loads, but this adds cost that few users will find justified. In this discussion, I'll assume this hardware is not present.

To use the burst mode, we must use a special instruction, loadm, that loads multiple registers. Using this instruction, our Euclidean norm looks like Example 6. For simplicity, I'm assuming the number of elements is a multiple of four. I've accumulated to all four special accumulators and then added the separate totals at the end. This allows all the fmac instructions to run in parallel.

Example 6: Using loadm to use burst mode

  [1]               srl lr3,lr3,2    ;divide count by 4
  [2]               sub  lr3,lr3,2   ;prepare counter for jmpfdec
  [3]   const       gr96,0
  [4]   mtacc       gr96,single_precision,0  ;clear accumulators
  [5]   mtacc       gr96,single_precision,1
  [6]   mtacc       gr96,single_precision,2
  [7]   mtacc       gr96,single_precision,3
  [8]   const       gr96,3           ;operations per loop, less 1
  [9]   loop: mtsr  CR,gr96          ;set load count special register
  [10]  loadm       0,0,gr97,lr2      ;load 4 words
  [11]  add         lr2,lr2,16
  [12]  fmac        single_precision,1,gr97,gr97
  [13]  fmac        single_precision,2,gr98,gr98
  [14]  fmac        single_precision,3,gr99,gr99
  [15]  jmpfdec     lr3,loop
  [16]  fmac        single_precision,4,gr100,gr100
  [17]  mfacc       gr96,single_precision,0  ;combine acc values
  [18]  mfacc       gr97,single_precision,1
  [19]  fadd        gr96,gr96,gr97
  [20]  mfacc       gr98,single_precision,2
  [21]  mfacc       gr99,single_precision,3
  [22]  fadd        gr96,gr96,gr98
  [23]  jmpi        lr0
  [24]  fadd        gr96,gr96,gr99

This rather complex loop takes ten cycles per loop, performs eight floating-point operations, and equates to 32 Megaflops at 40 MHz. Using burst mode, you can improve the performance of the routine by 50 percent. With special hardware to load a word every cycle, this would increase to 40 Megaflops.

How Hard is it to Program?

You've seen some optimizations that can be performed to extract very high performance out of the AM29050. Most of the time, however, straightforward code will run adequately. The regularity of the instruction set and the huge number of available registers make it relatively simple to produce such code. These attributes also enable compilers to produce fast code, so the performance penalty for using a high-level language is fairly small. Any programmer experienced in assembly language programming on another processor will find little difficulty in turning out code for this chip.

More Speed!

There is constant competition between microprocessor developers to produce more powerful chips. The next generation of CISC chips will seek to execute one instruction per cycle, just as RISC chips do now. To do this, they will have to be organized internally as a data-flow processor, in which instructions are triggered by the arrival of their operands. This organization is very different from the organization of the current generation of microprocessors, so designing these chips will require a major effort.

Superscalar architectures, the next generation of RISC chips, will execute several instructions per cycle. RISC architecture lends itself to superscalar execution because all the instructions are the same size. On a CISC machine, you have to decode the current instruction before you can even find the following instruction.

Because of their complexity and novelty, it is likely to be at least a couple of years before we see the next generation of CISC chips in real machines. These chips will bring CISC performance up to the level currently enjoyed by users of RISC processors. By then, however, RISC chips can be expected to have advanced to yet higher levels of performance.

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Embedded Systems

Programming the 29050

The AM29050 Instruction Set

Example 1: (a) One of the operands can be a literal value in the range 0 to 255; (b) loads and stores always require the address of the load or store be contained in a register; (c) commonly used values for cases in which the first operand is 0, and the second controls the type of memory access.

Example 2: (a) A loop that adds together an array of 100 numbers on an 8086 processor; (b) the loop shown in (a) translated to the code on an AM29000.

Registers Galore!

What's Left Out?

Example 3: A general constant load requires two instructions.

Conquering Divide

Example 4: (a) Using a denormalized real to divide positive integers; (b) dividing a positive integer by 3.

Fast Floating Point

Table 1: Times for some slow operations

Using Burst Mode

Example 5: (a) Squaring the elements of an array and adding the squares together in C; (b) doing the same thing in AM29050 code.

Example 6: Using loadm to use burst mode

How Hard is it to Program?

More Speed!

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Embedded Systems Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

Embedded Systems

Programming the 29050

Related Reading

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Embedded Systems Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content