P6 Processor in the Pipeline

By Ray Valdes, April 01, 1995

DDDU/ Vol 2 No 4 April 95/ Feature

Those who say that there's no longer much difference between hardware and software have not visited San Francisco recently. If you were attending the Software Development '95 conference, it was quite possible to take a wrong turn and end up at the International Solid State Circuits (ISSC) meeting a block away. There you would have immediately noticed a stark contrast in the audience: ISSC attendees are almost uniformly clean-cut, with pressed shirts, sports jackets, and a sense of discipline that is all-too-often lacking in some programming departments, where one dresses up by donning a baseball cap or a Viking helmet.

To judge from SD '95 exhibits, there seems little that is new and exciting in the software-development industry, with the sole exception of Borland's Delphi, which drew large crowds. By contrast, some presentations at ISSC covered a topic that would excite any programmer: the relentless quest for speed in CPUs. It was at ISSC that Intel chose to unveil its P6 processor --- the newest member of the x86 family, the successor to the Pentium. Unlike software-product introductions, which are often accompanied by laser light shows and a dance troupe or two, the presentation by Robert Colwell, architecture manager for the P6, was a dry, 20-minute technical summary in keeping with similar talks that day by CPU designers from Motorola, Sun, NexGen, IBM, Hal, and DEC.

By now you've probably heard about the basic P6 specs: 5.5 million transistors (compared to 3.1 million for the Pentium), a performance rating of 200 SpecInt92 at 133 MHz (roughly twice as fast as a 100-MHz Pentium, rated at 96 SpecInt92), 2.9 volts, 20 watts peak power consumption (compared to the Pentium's 16 watts), with an average consumption of 14 watts. The chip is built with 0.6-micron BiCMOS technology, the same process now used in the 100-MHz Pentium (which originally used a 0.8-micron process). Intel plans to move the P6 to a 0.35-micron process, which will allow a significant clock speedup. (Historically, Intel has been able to double the clock frequency after product introduction by moving to a faster process.) The chip is supposed to ship in the second half of 1995; the official name has not yet been revealed. The price is also not yet determined, but Intel has historically released its CPUs at around the $1000 price point.
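To separate the clock-speed gain from the architectural gain in those ratings, a quick back-of-the-envelope calculation helps. The sketch below uses only the SpecInt92 and MHz figures quoted above; the "per-clock gain" is a derived number of our own, not one Intel has published.

```python
# Figures quoted in the article. The per-clock metric is our own derivation,
# not an Intel-published number.
P6_SPECINT92, P6_MHZ = 200, 133
PENTIUM_SPECINT92, PENTIUM_MHZ = 96, 100

# Overall speedup: ratio of the two benchmark ratings.
speedup = P6_SPECINT92 / PENTIUM_SPECINT92

# Portion of the speedup explained by the faster clock alone.
clock_ratio = P6_MHZ / PENTIUM_MHZ

# What remains is work done per clock cycle (the microarchitecture's share).
per_clock_gain = speedup / clock_ratio

print(f"overall speedup : {speedup:.2f}x")
print(f"clock ratio     : {clock_ratio:.2f}x")
print(f"per-clock gain  : {per_clock_gain:.2f}x")
```

By this rough measure, about a third of the improvement comes from the higher clock, and the rest from doing more work per cycle.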

After the ISSC conference, Intel made available a P6 prototype system for hands-on demos. Running Windows and mainstream PC applications, the system only crashed once (which may have had nothing to do with the CPU, given the fragility of today's software). One jaded journalist was moved to describe the system as ``wicked fast'' --- when you launch Windows, the initial logo display is now just a blur.

Intel has made available some technical papers at its Web site (http://www.intel.com/procs/p6/) as well as in its forum on CompuServe, but programmer-level documentation won't be released until after the formal product introduction later this year. Colwell's presentation, which he said was prepared on a P6 system, provided much detail about the chip's microarchitecture, down to the level of describing a typical BiCMOS gate (which gives a 15 percent performance advantage over CMOS technology) and the ``delayed precharge domino logic'' used in speed-critical paths such as instruction decoding. However, Colwell did not say anything about new instructions or new operating modes. Lew Paceley, the P6 marketing director, later said that the P6 offers no new operating modes (that is, real versus protected): ``The only mode is fast.'' According to author and DDJ contributing editor Andrew Schulman, one new instruction is a conditional move, which is meant to substitute for the typical test-and-jump sequence of instructions. Schulman also reports that the P6 offers a number of hooks (registers, counters, instructions) for performance tuning, analogous to undocumented registers in the Pentium.
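Since programmer-level documentation is not yet available, the following is only a toy model of why a conditional move is attractive. It computes a maximum two ways and counts how many conditional jumps each version asks the branch predictor to guess. The pseudo-assembly in the comments and the function names are illustrative assumptions, not the P6's documented syntax.

```python
def max_with_jump(a, b):
    """Models the classic sequence:  cmp a,b / jump-if-ge skip / mov a,b / skip:"""
    conditional_jumps = 1      # one jump whose direction must be predicted
    result = a
    if a < b:                  # the jump falls through only when a < b
        result = b
    return result, conditional_jumps


def max_with_conditional_move(a, b):
    """Models:  cmp a,b / move-if-less a,b  (no change in control flow)"""
    flag_less = a < b          # the compare sets a condition flag
    # The move always flows down the pipeline; it writes only if the flag is set.
    result = b if flag_less else a
    return result, 0           # nothing for the branch predictor to guess


for a, b in [(3, 7), (9, 2), (5, 5)]:
    assert max_with_jump(a, b)[0] == max_with_conditional_move(a, b)[0] == max(a, b)
```

Both versions produce the same result; the difference is that the second never exposes the pipeline to a mispredicted branch, which matters more as pipelines get deeper.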

The P6 has been in the works for several years, even before the Pentium saw first silicon. Intel is one of the few semiconductor manufacturers with enough resources to deploy two CPU design teams, which work in parallel and leapfrog each other every couple of years. The Pentium was designed in Santa Clara, California, while the P6 comes out of Intel's facility in Hillsboro, Oregon. Even now, the design teams are at work on the P7 and P8.

Die-hards may still continue the CISC-versus-RISC debate, but for many others, the focus has shifted to using performance-oriented techniques that originated with RISC implementations to gain the most speed, regardless of whether the instruction set is CISC or RISC. The superscalar approach to processor implementation uses multiple pipelines to execute several instructions at the same time. This concept is not new, having been used in the dual-pipelined Pentium and other machines. Today, what is different is the complexity and sophistication of this approach, evident in the P6 and other implementations showcased at ISSC.

Regarding performance optimization, Peter Deutsch, one of the principal implementors of Smalltalk-80, said: ``It's okay to cheat as long as you don't get caught.'' Deutsch's comment can apply to hardware optimization as well, not just software. For microprocessor designers, the Intel x86 instruction set has become almost a high-level API, below which all kinds of speed-oriented tricks can be played, including translating x86 instructions into smaller, RISC-like instructions that can then be processed in a superscalar fashion.

Intel press releases tout ``an innovation called Dynamic Execution,'' which is a ``unique combination of technologies'' that ``deliver superior performance.'' Dynamic execution refers to processing techniques such as multiple-branch prediction, dataflow analysis, and speculative execution, which comprise a superscalar, superpipelined design. These design techniques, while sophisticated and clever, are not very different from techniques used in other modern processors such as the PPC620 or other processors described at the ISSC conference.

For example, both the P6 and NexGen processors start with a CISC front end which translates x86 instructions into an equivalent stream of RISC-like instructions. The NexGen processor calls these ``RISC86'' instructions, while Intel refers to them as ``micro-ops.'' This instruction stream is passed to a core engine, which puts instructions into a common pool, from which they can be retrieved and executed as they are ready (this is known as ``out-of-order'' execution or ``dataflow'' processing). The P6 can work on as many as 40 instructions at one time. While this is an astounding number, it does not compare to the Hal R1 CPU, which can work on 64 instructions at a time (in a 2.7-million-transistor implementation that results in 256 SPECInt performance at 154 MHz). But it is certainly better than the NexGen, which supports 14 active macro instructions at a time.
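The idea of a common pool from which instructions execute as soon as they are ready can be sketched in a few lines. This is a simplified model, not Intel's design: the `MicroOp` fields, the latencies, and the one-register-per-result convention are assumptions made for illustration, and the issue width of five merely echoes the dispatch figure Intel has quoted.

```python
from dataclasses import dataclass


@dataclass
class MicroOp:
    name: str
    dest: str          # register written (assumed unique, i.e. already renamed)
    srcs: tuple        # registers read
    latency: int = 1   # cycles until the result is available (hypothetical)


def run_dataflow(pool, initial_regs, width=5, max_cycles=1000):
    """Issue micro-ops as their operands become ready; return [(cycle, name)]."""
    ready_at = {reg: 0 for reg in initial_regs}   # register -> cycle it is valid
    pending = list(pool)                          # kept in program order
    schedule = []
    cycle = 0
    while pending:
        if cycle > max_cycles:
            raise RuntimeError("some micro-op never became ready")
        # Pick up to `width` ops whose sources are all available this cycle.
        issued = [op for op in pending
                  if all(ready_at.get(s, float("inf")) <= cycle for s in op.srcs)]
        issued = issued[:width]
        for op in issued:
            pending.remove(op)
            ready_at[op.dest] = cycle + op.latency
            schedule.append((cycle, op.name))
        cycle += 1
    return schedule


program = [
    MicroOp("load", "t1", ("a",), latency=3),   # slow memory access
    MicroOp("add",  "t2", ("t1", "b")),         # must wait for the load
    MicroOp("mul",  "t3", ("c", "d")),          # independent of the load
    MicroOp("sub",  "t4", ("t3", "b")),         # depends only on the mul
]
schedule = run_dataflow(program, initial_regs=("a", "b", "c", "d"))
print(schedule)
```

In this run the `mul` and `sub` complete while the `add` is still waiting on the slow load, so the execution order differs from program order even though every data dependency is honored.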

The P6 can execute three to five instructions simultaneously (it can decode three x86 instructions and/or dispatch five micro-ops at a time). This is in the same ballpark as the NexGen, PPC620, and Sun UltraSparc processors, all of which can handle four instructions at a time. By contrast, the Hal R1 can dispatch four instructions and execute as many as ten at a time. (These comparisons are meant to place the P6 within the context of other implementations, not to imply that all these products are directly competitive. For example, the NexGen processor is intended to compete with the Pentium, not the P6, while the Hal R1 is for higher-end systems.)

Concurrent execution requires a technique called ``register renaming,'' which maps the nominal set of eight x86 registers into a larger set of internal physical registers. The NexGen has 22 physical registers; Colwell says only that the P6 has ``a large number'' of such registers.
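A minimal sketch of renaming follows. Each instruction that writes an architectural register is handed a fresh physical register, so a later reuse of the same name (EAX, say) no longer conflicts with earlier instructions still in flight. The physical register count of 22 is borrowed from the NexGen figure purely for illustration, since the P6's count has not been disclosed; the sketch also never recycles physical registers, which a real design must do.

```python
X86_REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def rename(instructions, arch_regs=X86_REGS, num_physical=22):
    """Map (dest, src, src) tuples of architectural names to physical numbers.

    num_physical=22 is an illustrative assumption, not a P6 parameter.
    """
    mapping = {reg: i for i, reg in enumerate(arch_regs)}   # arch -> physical
    free = list(range(len(arch_regs), num_physical))        # unused physical regs
    renamed = []
    for dest, *srcs in instructions:
        # Sources read whichever physical register currently holds the value.
        phys_srcs = tuple(mapping[s] for s in srcs)
        if not free:
            raise RuntimeError("out of physical registers (sketch never frees)")
        # The destination gets a brand-new physical register.
        mapping[dest] = free.pop(0)
        renamed.append((mapping[dest], *phys_srcs))
    return renamed


program = [
    ("eax", "ebx", "ecx"),   # eax = ebx op ecx
    ("edx", "eax", "eax"),   # edx = eax op eax   (true dependency on line 1)
    ("eax", "esi", "edi"),   # eax = esi op edi   (reuses the name eax)
]
renamed = rename(program)
print(renamed)
```

After renaming, the third instruction writes a different physical register than the first, so it can execute before the second instruction has finished reading the old EAX value.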

Another technique for facilitating superscalar processing is speculative execution and branch prediction. As the name implies, the CPU starts executing instructions after a branch point, based on the history of that branch. In the P6, the branch history is stored in a branch target buffer, which contains 512 entries. By contrast, the Hal R1 branch history table has room for 1024 entries, while the NexGen supports 96 entries.
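Intel has not described the prediction algorithm itself, so the sketch below assumes the textbook scheme: a direct-mapped table of two-bit saturating counters, sized at the 512 entries quoted for the P6. The class and method names are hypothetical.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB with 2-bit counters (an assumed, textbook scheme)."""

    def __init__(self, entries=512):
        self.entries = entries
        self.table = {}                      # index -> [tag, counter, target]

    def predict(self, pc):
        """Return the predicted target, or None for 'predict not taken'."""
        entry = self.table.get(pc % self.entries)
        if entry and entry[0] == pc and entry[1] >= 2:
            return entry[2]
        return None

    def update(self, pc, taken, target):
        """Record the actual outcome once the branch has resolved."""
        index = pc % self.entries
        entry = self.table.get(index)
        if entry is None or entry[0] != pc:
            if taken:                        # allocate as 'weakly taken'
                self.table[index] = [pc, 2, target]
            return
        entry[1] = min(3, entry[1] + 1) if taken else max(0, entry[1] - 1)
        if taken:
            entry[2] = target


def count_mispredictions(btb, pc, target, outcomes):
    misses = 0
    for taken in outcomes:
        predicted_taken = btb.predict(pc) is not None
        if predicted_taken != taken:
            misses += 1
        btb.update(pc, taken, target)
    return misses


# A loop branch: taken nine times, then falls through; the loop runs three times.
outcomes = ([True] * 9 + [False]) * 3
misses = count_mispredictions(BranchTargetBuffer(), pc=0x4010, target=0x4000,
                              outcomes=outcomes)
print(f"{misses} mispredictions out of {len(outcomes)} branches")
```

Under these assumptions the predictor misses the first encounter and each loop exit, which is why history-based prediction pays off so well on loop-heavy code.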

After out-of-order dataflow processing, instructions become ready for ``retirement,'' which means that an instruction execution is ready to be committed or finalized. As you might expect, instructions must be retired in the order that they were issued. Therefore, in addition to the in-order front end and the out-of-order core, the P6 has an ``in-order back end'' unit that retires instructions. Each of these three major components in the P6 has its own pipeline; the retirement back-end pipe is three cycles long.
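In-order retirement can be modeled as a queue: instructions enter in program order, finish whenever their operands allow, but leave only from the head of the queue. The sketch below is a simplified model; the retirement width of three per cycle and the completion times are assumptions for illustration, not figures from Colwell's talk.

```python
from collections import deque


def retire_in_order(program_order, completion_cycle, width=3):
    """Return {instruction: cycle retired}.

    program_order    -- instruction names in the order they were issued
    completion_cycle -- cycle at which each one finished executing
    width            -- max retirements per cycle (hypothetical parameter)
    """
    queue = deque(program_order)
    retired = {}
    cycle = 0
    while queue:
        count = 0
        # Only the oldest instruction may retire, and only once it has finished.
        while queue and count < width and completion_cycle[queue[0]] <= cycle:
            retired[queue.popleft()] = cycle
            count += 1
        cycle += 1
    return retired


program_order = ["load", "add", "mul", "sub"]
completion_cycle = {"load": 3, "add": 4, "mul": 1, "sub": 2}   # out of order
retired = retire_in_order(program_order, completion_cycle)
print(retired)
```

Here the `mul` and `sub` finish early but must wait behind the slow `load` and `add`; their results become architecturally visible only once everything older has retired.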

One unique aspect of the P6 is its initial two-die implementation, consisting of the 5.5-million-transistor CPU integrated with a separate 256-Kbyte level-two cache (estimated at 15 million transistors). By contrast, the Pentium has 3.1 million transistors on a single die, while the 486DX has 1.2 million. The record for the largest number of transistors on a single chip is held by DEC's 300-MHz Alpha, which contains 9.3 million transistors (including two 8-Kbyte L1 caches and a 96-Kbyte L2 cache). Also at the high end, the Hal R1 CPU consists of 2.7 million transistors, but is packaged on a multichip module containing the CPU, an MMU, four cache chips, plus a clock chip, for a grand total of 25 million transistors. On the low end, the NexGen consists of 3.5 million transistors, including two on-chip level-one caches of 16 Kbytes each (compared to the two 8-Kbyte caches on both the P6 and the Pentium).
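The 15-million-transistor estimate for the cache die is easy to sanity-check. The sketch below assumes a conventional six-transistor static RAM cell per bit; Intel has not said what cell design the L2 cache uses, so treat that figure as a guess.

```python
def sram_data_transistors(kbytes, transistors_per_cell=6):
    """Transistors in the data array alone, assuming 6T SRAM cells (a guess)."""
    bits = kbytes * 1024 * 8
    return bits * transistors_per_cell


ESTIMATED_L2_TOTAL = 15_000_000          # estimate quoted in the article
data_array = sram_data_transistors(256)  # 256-Kbyte level-two cache
overhead = ESTIMATED_L2_TOTAL - data_array

print(f"data array : {data_array:,} transistors")
print(f"remainder  : {overhead:,} for tags, decoders, and control logic")
```

Under that assumption the data array alone accounts for roughly 12.6 million transistors, leaving a plausible couple of million for tags and support logic.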

In addition to the CPU, the P6 project team has designed a whole new chipset in order to support the P6 external bus, which is a 64-bit bus that connects memory and I/O as well as other P6 processors. This bus supports up to four P6 processors.

In his talk, Colwell was pretty candid about the checkered history of the Pentium FDIV bug. He asked, rhetorically: ``Why should you believe this will work?'' The answer is that 300 engineer-years were spent on validation, in addition to tens of billions of simulation cycles. The simulation included not just the processor chip, but the entire chipset, plus four multi-processed CPUs. Colwell emphasized the importance of catching errors at the design stage, rather than trying to deal with them gracefully after the product is shipping. But, he added, ``dealing with them gracefully later beats the heck out of not dealing with them gracefully later.''

And in case you were wondering about bugs that may appear after shipping, Intel's Lew Paceley stated, in an online discussion on CompuServe: ``We will follow exactly the same errata policy as the Pentium processor. When we enter production, we'll also publish our errata.''

Setting qualms about bugs aside, there's no denying the speed and sophistication of the P6 design. Experienced hardware engineers like to say: ``No matter how fast we make the hardware, the software boys just p*** it away.'' Software types will have to micturate furiously to keep up with the P6's performance.

Ray is senior technical editor at Dr. Dobb's Journal and can be contacted at [email protected].
