DTACK Revisited

SP95: DTACK REVISITED

Hal is a hardware engineer who sometimes programs. He is the former editor of DTACK Grounded and can be contacted through the DDJ offices.

I'm going to explain some important stuff about computer architecture, stuff that you really need to know. I'll cover the Pentium, PowerPC 601, and the P6. We have to discuss a few basics before we come to the important stuff.

The term "computer architecture" is widely misunderstood. It has little to do with the design of a computer system or microprocessor chip. The computer architect is best known as the person who gets to use a clean piece of paper to define which instructions the computer will be able to execute. But the most important job the architect does is decide on the length(s), in bits, of the computer instructions and assign the bit fields within that length to perform the necessary computer operations.

If a large proportion of the possible bit combinations goes unused, the architect has done a lousy job. But a few should be set aside. Some combinations left undefined when Intel designed the 8086 later became the basis for adding a (very) few more registers in the 386 generation.

The 8086 was designed back when 64K was a huge memory space and Pascal seemed to be taking over the personal-computer marketplace. So the 8086 was given exactly enough registers to run compiled Pascal.

Because memory was then an extremely limited resource, the 8086's basic instruction-field length was made eight bits, and some of the operations most common in compiled Pascal (LOOP, for example) were fitted into those eight bits. Eight bits does not leave room to specify a lot of registers.

When the 68000 was designed, larger memories were common, so the architect selected a 16-bit basic instruction field. Two 4-bit register fields were assigned. Eight bits, half the 16-bit instruction field, went to defining the source and destination registers.

But the ability to place many transistors on a single die was exploding, and soon designs with 32 registers of 32 bits each started showing up, for instance David Patterson's Berkeley RISC I. Five bits are required to select one of 32 registers. If two-address (SRC, DEST) operands were to be used, then ten bits of the instruction bit field were needed to specify the registers. That leaves only six bits of a 16-bit instruction field, not enough to be useful.

So computers with 32 registers moved up to a 32-bit instruction field. All the computer architects made the decision to use three-address operands (SRC1, SRC2, DEST) and so assigned 15 bits just for register selection--again, about half the instruction field.
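The bit budgets in the last few paragraphs are easy to check. What follows is a minimal sketch in Python; the function names are made up for illustration, and it assumes every register field in an instruction has the same width.

```python
def register_field_bits(num_registers: int, operands: int) -> int:
    """Bits needed to name `operands` registers out of `num_registers`."""
    # (n - 1).bit_length() is the number of bits needed to count from 0 to n - 1.
    bits_per_register = (num_registers - 1).bit_length()
    return bits_per_register * operands


def leftover_bits(instruction_bits: int, num_registers: int, operands: int) -> int:
    """Bits left for everything else (opcode, modes) after register selection."""
    return instruction_bits - register_field_bits(num_registers, operands)


# 68000-style: two 4-bit register fields in a 16-bit instruction field.
print(register_field_bits(16, 2), leftover_bits(16, 16, 2))

# 32 registers, two-address operands, 16-bit instruction field.
print(register_field_bits(32, 2), leftover_bits(16, 32, 2))

# 32 registers, three-address operands, 32-bit instruction field.
print(register_field_bits(32, 3), leftover_bits(32, 32, 3))
```

The three calls correspond to the 68000 case, the two-address case that leaves too few bits, and the three-address case in a 32-bit field.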

The microprocessor went from a register-starved, 8-bit instruction field in 1977 to a register-rich, 32-bit instruction field in 1982. These architectural decisions were dictated by the then-current state of the chip-fabrication art. Let me repeat--these were architectural decisions.

And architectural innovations stopped right there in 1982, because a personal computer does not (yet) need a 64-bit instruction field. Yep. Architecture for personal computers essentially froze in 1982.

How do you upgrade a computer to a new architecture? In other words, how do you get your hands on more registers while continuing to run your old software? The answer is, you don't. The only way to get more registers is to abandon your software--all your software--and move to a new computer. I understand the MIPS-based ACE computer systems (which run both UNIX and Windows NT) are particularly good examples of desktop computers with register-rich environments.

Oh? You don't have an ACE system on your desktop? You still use, and program for, a register-starved computer architecture? Gee. It appears that computer architecture, while fundamental, is not important.

The personal-computer marketplace doesn't care about architectural hardware issues. The marketplace responds to fast and cheap. "Fast" means internal caches, floating-point accelerators, superscalar techniques, and the like--none of which has anything to do with architecture. (The presence or absence of an internal cache is independent of the instruction field.)

"Cheap" means economy of scale. More than 50 million personal computers will be sold this year, and to a first-order approximation, 100 percent of them will be based on the x86 architecture. If you want a cheap computer, buy one based on the x86.

But the marketplace still wants to run the software it acquired ten years ago. Software compatibility is, in fact, an architectural issue, and it matters in the marketplace.

The people who designed the Pentium and the P6 and who are currently designing the P7 are not computer architects. But they're pretty good engineers, based on the results I've seen. I call them "chip designers."

Back when the world was young and children were respectful of their elders, the chip designer's job was simple: The design had to execute any instruction as quickly as possible. Then it had to execute the next instruction as quickly as possible. That's how the 8086, 286, 386, and 486 work.

But with the advent of the Pentium, those days are gone. The Pentium--sometimes--executes more than one instruction in the same clock cycle. That "sometimes" is pretty important to those of you who need to write code that runs fast, and has afforded my colleague Michael Abrash the opportunity to publish several articles on optimizing code for the Pentium.

The Pentium is the first x86 generation that uses a "superscalar" implementation. Let's compare it to the PowerPC 601, which was primarily designed by IBM, with a little bus-interface assistance from Motorola. To a first-order approximation, the 60x architecture has 0 percent of the personal-computer market.

The 601 is based on the latest computer architecture: the 32-bit model with 32 registers. Like the Pentium, its implementation uses superscalar techniques, but not those used by the Pentium. The 601 can issue up to three instructions each clock cycle, one each of integer, floating point (fp), and branch.

You are the software experts, not me, so let's pretend you just explained to me that most application programs in the personal-computer market execute instructions in the ratio 85 percent integer, 0 percent fp, and 15 percent branch. This means the 601's ability to simultaneously execute fp instructions with integer and branch instructions is useless. The only improvement the superscalar 601 offers is the ability to simultaneously issue integer and branch instructions. And since there are roughly six times as many integer as branch instructions, this isn't terribly useful. In fact, the 601's superscalar ability means that, at best, it can execute 100 instructions in 85 clocks (assuming one clock per instruction). All that superscalar design effort provides, at best, a 17.6-percent performance improvement.

The Pentium's designers were much more crude. If either an fp or branch instruction is issued on a given clock cycle, then no other instruction can be issued at that time. In practice, this means that during the 15 percent of the time that branch instructions are being issued, the Pentium ain't superscalar. But in the 85 percent of the time that integer instructions are being issued, the Pentium can--sometimes--issue two integer instructions on the same clock cycle. This means the Pentium can, at best, execute a 100-instruction mix (assuming one clock per instruction) in 85/2+15=57.5 clocks--a 73.9 percent performance improvement.
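The two best-case figures can be reproduced with a few lines of Python. This is only a sketch of the simplified issue models as described in the two paragraphs above, using the pretend 85/0/15 instruction mix; the variable and function names are made up for illustration.

```python
# The pretend instruction mix, per 100 instructions.
INTEGER, FP, BRANCH = 85, 0, 15
SCALAR_CLOCKS = INTEGER + FP + BRANCH  # one instruction per clock: 100 clocks


def improvement_percent(superscalar_clocks: float) -> float:
    """Percent speedup over issuing one instruction per clock."""
    return (SCALAR_CLOCKS / superscalar_clocks - 1) * 100


# 601 model: one integer, one fp, and one branch may issue each clock,
# so the best case is limited by the most numerous instruction class.
clocks_601 = max(INTEGER, FP, BRANCH)

# Pentium model as described above: integer instructions pair up,
# while fp and branch instructions each issue alone.
clocks_pentium = INTEGER / 2 + FP + BRANCH

speedup_601 = improvement_percent(clocks_601)
speedup_pentium = improvement_percent(clocks_pentium)

print(f"601:     {clocks_601} clocks, {speedup_601:.1f} percent")
print(f"Pentium: {clocks_pentium} clocks, {speedup_pentium:.1f} percent")
```

Changing the three mix constants shows how sensitive both results are to the assumed ratio of integer to branch instructions.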

Okay, instructions sometimes need more than a single clock to execute, and the Pentium cannot always issue two integer instructions in the same clock period; hence Abrash's fine articles on optimization. But Intel's chip designers focused on improving performance during the 85 percent of the time that integer instructions are being issued, while IBM's designers concentrated their efforts on the 15 percent of the time that branch instructions are being issued.

Which design team best earned its paycheck?

I sent a copy of the penultimate draft of this article to some folks who used to design microprocessor chips for a living. One of them, John Wharton, called me back and said "Hal, the Pentium doesn't work like that!" (The last four digits of John's home phone number are 8051, which is one of Intel's most popular 8-bit micros.)

So I was wrong. A Pentium can issue a branch instruction after an integer instruction in the same clock (but not an integer instruction after a branch instruction). And under rare circumstances the Pentium can issue two FP instructions in the same clock--if one of them is an FXCH instruction.

In the pairing rules, a "complex" instruction is a microprogrammed instruction, such as one of the string instructions (MOVS or SCAS, for example). When one of the integer pipes goes into microprogrammed mode, both pipes do. That's why only one "complex" instruction can be active at a time.
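The issue rules as John corrected them can be written down as a toy check. This sketch covers only what the text above states. It ignores register dependencies and every other real pairing restriction, and the instruction categories and function name are hypothetical. The order "fp followed by FXCH" is an assumption; the text says only that one of the two must be an FXCH.

```python
def can_issue_together(first: str, second: str) -> bool:
    """Toy pairing check; kinds are 'int', 'branch', 'fp', 'fxch', 'complex'."""
    # A microprogrammed ("complex") instruction takes over both pipes.
    if "complex" in (first, second):
        return False
    # An integer instruction may be followed by another integer instruction
    # (sometimes, in the real chip) or by a branch, but not the reverse.
    if first == "int" and second in ("int", "branch"):
        return True
    # Two fp instructions pair only when one is FXCH (assumed here: the second).
    if first == "fp" and second == "fxch":
        return True
    return False


for pair in [("int", "int"), ("int", "branch"), ("branch", "int"),
             ("fp", "fxch"), ("fp", "fp"), ("complex", "int")]:
    print(pair, can_issue_together(*pair))
```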

John also explained floating-point processing:

A cute trick the Pentium designers came up with was getting the result of a 64-bit FP operation back to the internal cache quickly. FP operations use the integer pipes, each of which is 32 bits wide. So the Pentium uses both pipes to move 64 bits in parallel. It saves one clock and at Pentium speeds, one clock is important.

(The most interesting thing John told me was about the infighting--I call it civil war--over Intel's upcoming P7. But that's another story.)

The Pentium design team set up two on-chip production lines, like Ford using one line for Escorts and another for Taurii. With a budget of 5.5 million transistors, the P6 design team was able to use more advanced techniques. Continuing with the automotive analogy, the P6 makes intensive efforts to build a car in the shortest time.

In the P6, we find a large crowd gathered at the input ends of several parallel production lines (pipes), and another large crowd at the output ends.

The input crowd looks for tasks ready to proceed and issues them to one of the production lines. It also looks for tasks that might be ready to proceed and speculatively issues them, too. A list of 30 tasks to select from is kept.

The crowd at the output accepts and temporarily stores all the results the several production lines deliver. Not everything that comes off the production lines proves to be useful. Some "product" is ultimately discarded. ("We can't use that blue trunk assembly on this red car, Fred. Throw it away!")

A scoreboard keeps track of everything that's going on. The P6 has a lot more registers than the programmer's model asserts, and renames them for efficiency. How did Intel's designers get so smart? They probably read Chaitin et al.'s tutorial, "Register Allocation via Coloring," which is part of the June 1982 SIGPLAN Proceedings on compiler construction. Yes, tutorial. In 1982. You didn't think this stuff was new, did you?

[Abstract: Register allocation may be viewed as a graph coloring problem... Preliminary results... suggest that global register allocation approaching that of hand-coded assembly language may be attainable.]
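To make the coloring idea in that abstract concrete, here is a minimal Python sketch. It is a simplified remove-then-color heuristic in the spirit of the approach, not the algorithm from the cited tutorial, and the function name, the sample variables, and the choice to mark a spilled variable with None are all assumptions made for illustration.

```python
def color_registers(interference, k):
    """Assign each variable one of k registers so interfering variables differ.

    interference: dict mapping each variable to the set of variables that are
    live at the same time (assumed symmetric). Returns a dict mapping each
    variable to a register index, or to None if it would have to be spilled.
    """
    graph = {v: set(neighbors) for v, neighbors in interference.items()}
    removal_order = []
    while graph:
        # Remove the least-constrained variable first; it will be colored last.
        node = min(graph, key=lambda v: len(graph[v]))
        removal_order.append(node)
        for neighbor in graph.pop(node):
            if neighbor in graph:
                graph[neighbor].discard(node)

    assignment = {}
    for node in reversed(removal_order):
        # Registers already taken by neighbors that have been colored.
        used = {assignment[n] for n in interference[node]
                if assignment.get(n) is not None}
        free = [r for r in range(k) if r not in used]
        assignment[node] = free[0] if free else None
    return assignment


# a, b, and c are all live together; d overlaps only with a.
live_together = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
print(color_registers(live_together, 3))
print(color_registers(live_together, 2))
```

With three registers every variable in the sample gets one; with two, the mutually interfering trio cannot all fit, so at least one variable is marked as spilled.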

Now you should have a grasp of what Intel means when it says the P6 uses scoreboarding techniques and issues instructions speculatively. Specifically, the P6 guesses which branch paths will be taken and speculatively executes the instructions following those branches (assuming no data dependencies). If those branches are taken, then the instruction results are already available. Otherwise, the results are discarded. The P6 speculatively executes instructions past up to five (!) branches, assuming they're available in the 30-instruction queue at the front end.
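The keep-or-discard behavior can be sketched as a toy model in Python. This is not how the P6 is built; it only illustrates the idea that work done past a correctly guessed branch is kept, while work done past a wrong guess, and everything after it, is thrown away. The function name and data layout are hypothetical, and the depth limit of five is the figure given above.

```python
MAX_SPECULATED_BRANCHES = 5  # figure quoted in the text


def kept_results(blocks, predictions, outcomes):
    """Return the speculative results that survive once branches resolve.

    blocks[i] holds the results computed past branch i on the assumption that
    predictions[i] was right; outcomes[i] is what the branch actually did.
    """
    kept = []
    for depth, (block, guess, actual) in enumerate(
            zip(blocks, predictions, outcomes)):
        if depth >= MAX_SPECULATED_BRANCHES:
            break  # nothing was speculated this far ahead
        if guess != actual:
            break  # wrong guess: this block and all deeper ones are discarded
        kept.extend(block)
    return kept


blocks = [["r1"], ["r2", "r3"], ["r4"]]
print(kept_results(blocks, [True, True, False], [True, True, False]))
print(kept_results(blocks, [True, False, True], [True, True, True]))
```

In the second call the guess for the second branch is wrong, so only the results from the first block survive.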

The P6 is Intel's first x86 that does not always directly execute x86 instructions. If you've read Abrash's articles on Pentium optimization, you know the performance benefits of breaking some complex instructions down into two simpler, yet equivalent, x86 instructions. Well, the P6 takes this a step further. The P6's instruction decoder will often break a complex x86 instruction into simpler instructions that may not be x86 instructions at all.

Since P6 continually looks at the next 30 instructions and begins execution of each as soon as possible, and automatically breaks up complex instructions when beneficial, you won't have to optimize P6 code.

The P6 self-optimizes all that shrink-wrapped code, no matter what generation of optimizing compiler was used. Poor Michael Abrash! He'll have nothing to write about, and the bank will foreclose on his mortgage.

The philosophical design differences underlying the 486, Pentium, and P6 generations have nothing whatever to do with computer architecture and everything to do with chip design. The best chips are designed by persons familiar with happenings in the mainframe and minicomputer arenas a dozen or more years back.

Intel's Andrew Grove once publicly asserted that there wasn't any use for a million-transistor-plus chip except for memory. If he'd known his x86 chip designers would soon be crafting microprocessors that performed useless instructions and wouldn't even directly execute x86 code, do you suppose he'd have fired them?
