Steve is free-lance writer and co-founder of Project Notify, a non-profit, emergency communications network. He can be reached at P.O. Box 8656, Incline Village, NV 89450 or on CompuServe at 70007,3351.
The newest entry in the CPU chip wars is now ready for the system builders: The Motorola 68040. The first available chips will work at 25 MHz, with 33 MHz and faster parts becoming available later this year. Don't think, though, this is just a faster 68030: Motorola built in some nifty features to make multiprocessing hardware much easier to design and build.
Motorola has gone to great pains to make a line of compatible 32-bit microcomputer chips. Like IBM did with the System/360 mainframe computers of the mid-1960s, Motorola made sure that applications code written for the earlier members of the 68000 family would run without modification on later chips. This scheme makes the assumption that programmers segregate I/O and chip control code from the rest of the system.
The general programming model for the 68000 family is the same: Eight 32-bit data registers, seven 32-bit address registers, one 32-bit user stack pointer, one 32-bit supervisor stack pointer, and chip-specific registers. The 68000 family supports operations on individual bits, 8-bit bytes, 16-bit words, 32-bit longwords, and packed binary coded decimal (BCD) data. Address calculations are all 32 bits, although some CPUs have limited addressing capability.
The 68008 (1980) is much the same as the Intel 8088 in that it talks to the outside world over a 20-bit address bus and an 8-bit data bus.
The 68000 (1979), the first CPU in the family, and the low-power CMOS 68HC000 use a 24-bit address bus and 16-bit data bus.
The 68010 (1982) takes the 68000 and adds virtual memory support, using an external memory management unit (MMU) and a special three-instruction "loop mode" that lets the 68010 execute a tight three-instruction loop repeatedly without fetching the instructions from memory more than once.
The 68020 (1984) is the first true 32-bit member of the 68000 family. The address and data busses are both a full 32-bits wide, allowing the chip to directly access four gigabytes (4096 Mbytes) of memory, up to 32 bits at a time. Memory management is provided by an external MMU. Instead of the 68010's "loop mode," the 68020 implements a 256-byte (64 x 4 direct mapped) instruction cache so that most loops run out of on-chip cache memory -- improving execution time 33 percent and reducing the load on the system bus. Bit-field instructions let you deal with data of varying bit lengths. Instructions for multiprocessing were added into the 68020 as well.
The 68030 (1987) moves demand-page memory management on-chip, and adds a 256-byte (64 x 4 direct mapped) data cache on-chip to complement the 68020's 256-byte instruction cache. The data cache uses a write-through philosophy. The bus system implements a burst transfer mode, that lets the chip effectively use page-mode, nibble-mode, and static-column DRAM to load data and instructions into cache memory quickly.
The newest member of the 68000 family, the 68040, essentially combines a beefed-up 68030 and the low-level functions of the 68881 floating-point coprocessor onto the same chip. The improvements, however, go much beyond that. Motorola's goal appears to be to make the 68040 as suitable as possible for large-scale multiprocessing systems.
Instead of one MMU trying to serve the entire chip, the 68040 gives you two: One for instructions, one for data. This keeps data and instruction accesses from causing page table entry faults (not to be confused with page faults) so as to minimize the amount of time the 68040 has to go to RAM to fetch address translation information.
The two on-chip memory caches are completely changed. Not only do you have a 4-Kbyte data cache and a 4-Kbyte instruction cache, but the cache system -- particularly the data cache -- is designed to minimize the number of times you have to go to the system bus. The two caches are organized as 64 four-way associative maps (256 locations), with 16 bytes of data in each cache location. The data cache can be write through, as it is in the 68030, or the 68040 can use a copyback philosophy that delays the write to memory until the chip needs the cache location for something else or the CPU's supervisor empties the cache.
When using cache in a multiprocessing system, you can have data that is one value in cache and another value in main memory. This problem is called "cache coherency." The 68040 takes care of this problem with "bus snooping" -- the chip looks at the system bus, and when a write memory cycle is detected, any on-chip cache location containing data for the changed location is marked invalid.
What happens, though, when one 68040 has changed data, but hasn't written it back to DRAM yet? The bus snoop hardware has another trick up its sleeve. When a read memory cycle is detected, the 68040 checks its data cache to see if it changed the requested location; if so, it inhibits the RAM memory cycle and sends the correct data to the other CPU. This reduces the amount of work programmers have to do to keep data up-to-date.
If you do a lot of scientific work, watch out for the floating-point unit. On the 68040, the only floating-point operations supported are absolute value, add, branch on condition, compare, decrement and branch conditionally, divide, move, move multiple, multiply, negate, nop, restore internal state, save internal state, set on condition, square root, subtract, trap on condition, and test. Other operations supported by the 68881, such as the trig and logarithmic functions, have to be handled by software emulation.
Portability When writing code that needs to run on different systems, you need to limit yourself to those instructions common to all the 68000 family. (See Table 1 for those instructions to avoid.) In particular, pay attention to addressing modes. The 68020, '30, and '40 support some additional modes not found on the '00, '08, and '10. Also try to segregate chip-dependent functions from the rest of your program. This limits how much code has to be replaced as you shift from CPU to CPU. The majority of your code should be running in user mode anyway.
Table 1: 680x0 family instruction set differences. An instruction or capability added or changed is in the open. An instruction or capability removed is in parens. For example, the CALLM instruction was removed in the 68030, so in the table it shows as (CALLM).
68010 from 68000 and 68008 Move from CCR Move from Condition Code register Move from SR Move from Status register MOVEC Move Control register MOVES Move Status register RTD Return and Deallocate 68020 from 68010 Data alignment restriction dropped Bcc Branch conditionally (allow 32-bit displacements) BFCHG Test Bit Field and Change BFCLR Test Bit Field and Clear BFEXTS Bit Field Extract Signed BFEXTU Bit Field Extract Unsigned BFFFO Bit Field Find First One-bit BFINS Bit Field Insert BFSET Test Bit Field and Set BFTST Test Bit Field BKPT Breakpoint CALLM Call Module CAS Compare and Swap Operands CAS2 Compare and Swap Dual Operands CHK2 Check register against upper and lower bound CMP2 Compare register against upper and lower bound (between) cpBcc Branch on CoProcessor condition cpDBcc Test CoProcessor condition Decrement and Branch cpGEN CoProcessor General function cpRESTORE CoProcessor Restore function cpSAVE CoProcessor Save function cpScc Set on CoProcessor condition cpTRAPcc Trap on CoProcessor condition DIVSL Long signed divide DIVUL Long unsigned divide EXTB Extend byte to long PACK Pack binary coded decimal (BCD) RTM Return from Module (*not* "Read the manual") TRAPcc Trap conditionally UNPK Unpack binary coded decimal (BCD) 68030 from 68020 (CALLM) PFLUSH Invalidates specific entry in the address translation cache (ATC) PFLUSHA Invalidates all entries in the address translation cache (ATC) PLOAD Load an entry into the address translation cache PMOVE Load an entry into the address translation cache PTEST Get information about a logical address (RTM) 68040 from 68030 CINV Invalidate cache entries (cpBcc) (cpDBcc) (cpGEN) (cpRESTORE) (cpSAVE) (cpScc) (cpTRAPcc) CPUSH Push, then invalidate, cache entries Floating-point Instructions MOVE16 Move 16-byte block; block must be aligned (PFLUSHA) (PLOAD) (PMOVE)
Loops The loop mode of the '10 is of limited use, being composed of a loop-able instruction and a DBcc instruction. Use this construct when you can on the off chance you end up running on a '10, such as one of the older Sun workstations. Where possible, try to keep loops under 256 bytes, the size of the instruction cache on the '20. If a much-repeating loop can't be squeezed down that far, move seldom-executed code such as exception code outside of the loop. The longer you can stay in the cache, the faster that loop executes.
Loop Data In assembler, it is usually easier to whip through an array word by adjacent word, so most assembler language programmers won't have to concentrate on what order data gets accessed. If you are writing a table-driven package, though, pay attention to how table information makes you access data. Where possible, the table should be optimized so your program sweeps through any array. This is somewhat important on the '30, and much more important on the '40 -- particularly in multiprocessing systems.
Tests Many times, you have to load one of two values into a register or location based on some test condition. The "IF ... THEN ... ELSE ..." construction is easy to understand, but the multiple branches can play hob with instruction fetching. Instead, try "... IF ... THEN ... " where you set the less common value, perform the test, and conditionally branch around the more common value. The penalty on '00, '08, and '10 CPUs is almost zero, but the savings on the '20, '30, and '40 can be significant. In fact, the first way requires at least five instructions (test, branch-false, set-1, branch, set-2) while the other way saves one instruction (set-2, test, branch-false, set-1).
Portability Chip-dependent functions usually have to be written in assembler, so make sure the design of the system routines are as generic as possible so you don't have to change applications code when the next gee-whiz feature is introduced in the 68050. You'll need to package separate interface modules for each chip. High-level code should always be run in user mode.
Loops If your compiler can optimize for the loop mode on the '10 or if the library includes routines to perform functions using loop mode, use them. When structuring loops that are executed often consider dropping structured programming practices to pack the loop as tight as possible. The goal is to get the loop within the 256-byte window of the instruction cache of the '20. Branches are much cheaper than function calls to get the seldom-used code out of the loop. You have more latitude with the larger cache on the '30 and '40.
Loop Data Be very careful when transversing arrays that you know exactly how your compiler is working. Fortran programmers need to remember that they have to vary the first subscript first in order to walk through data sequentially. For PL/I and Pascal programmers, most compilers require you to vary the last subscript first to sweep an array. C programmers need to remember that when accessing a multidimensional array using the array operators that are in the construct "a[i][j]", the fragment "a[i]" loads a pointer, then "<e>[j]" loads the desired word; use an intermediate pointer where possible to limit the amount of pointer loading when the first subscript is held locally constant.
Tests You are at the mercy of the compiler when it comes to ordering tests to save time. Because compilers vary so much in what they do, it probably isn't worth it to change the way you select values.
The 68040 is more than "just a 68030 with floating point" and more than Motorola's weapon to fight the Intel 80486. It is a well-designed product in its own right. Graphics programmers like the support for manipulating bits, particularly the bit-field instructions introduced by the '20 and continued in the '40.