Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Pentium Crosses the Great Divide


Pentium Crosses the Great Divide

Marketers like to use words like "unprecedented," "landmark," and "milestone" when touting a product. Unfortunately for those in Intel's marketing, the Pentium has given new meaning to these words by becoming the first microprocessor to displace Madonna and the Clintons as the butt of jokes in David Letterman's monologue. Fresh from multiple appearances on CNN and the "Today" show, the Pentium now seems destined to follow the parabolic arc of other media celebrities by moving up to Ted Koppel's "Nightline" before descending to "Oprah" and "Hard Copy."

Let's start with a quick chronology. The story begins with a coal miner's son, Thomas Nicely, pursuing a research project in an area of pure mathematics called "computational number theory" at Lynchburg College in Virginia. Professor Nicely's program, which enumerates the primes, twin primes, prime triplets, and prime quadruplets for all positive integers up to an extremely large limit, has been running for over a year simultaneously on half a dozen systems. Most are 486s, but a Pentium box was added in March 1994. Simultaneously with the calculation of the unknown quantities, a number of checks are maintained by calculating values which are already known (such as p(x), the number of primes <= x).

On June 13, 1994, Nicely found that the computed check value for p(x) disagreed with the published value. This led to a long search for logic errors and sources of reduced precision in his program (which consisted of 3000 lines of code). Along the way, Nicely found a bug in the Borland C++ 4.02 compiler, which was producing erroneous code when compiled in 32-bit mode with certain optimizations (-Op -Om -Og) enabled. For a while, Nicely thought this was the source of discrepancy. After a series of events (which Nicely describes in a document dated December 12, 1994, via ftp from acavax.lyncburg.edu), Nicely isolated the bug. He found the Pentium floating-point unit (FPU) was returning erroneous values for certain division operations. For example, in the reciprocal 1/824633702441, all digits beyond the eighth significant digit are in error (in general, coprocessor results should contain 19 significant decimal digits).

On October 24, 1994, Nicely contacted Intel tech support. After six days, Intel still didn't have an answer to the problem. Unknown to Nicely then--but now well-known--Intel had discovered the problem in the summer of 1994, according to an Internet posting by Intel CEO Andrew Grove. Regarding the seriousness of the bug, an Intel spokesman later said: "This doesn't even qualify as an errata. We fixed it in a subsequent stepping." (Quoted by Alex Wolfe, EE Times, November 7, 1994.)

Nicely writes: "In the absence of any meaningful response from Intel tech support, on 30 October I sent e-mail to a number of individuals and organizations who I felt would have access to many other Pentium systems, and asked them to check for the problem."

Intel's response left much to be desired. Steven Smith, a Pentium engineering manager for the company, emphasized that the anomaly would not affect the average user, and said, speaking of Nicely: "He's the most extreme user. He spends round-the-clock time calculating reciprocals." (EE Times, November 9, 1994).

The day after the EE Times article appeared, Nicely signed a nondisclosure agreement (NDA) with Intel, having been contacted by the company regarding possible work as a consultant. As Nicely later admitted, he took the NDA to heart, and from November 10--22, deflected all inquiries regarding the Pentium FPU to Intel. Nicely also removed his files from his college's ftp server, later explaining his behavior by saying: "This was a consequence of my own mistaken interpretation of the NDA; I was treating it in the manner of a security clearance." This had the effect of silencing the person most knowledgeable about the bug outside of Intel (at that time). After receiving complaints about Nicely's silence, Intel then told him that he was free to discuss any of his work conducted prior to signing the NDA. By then, other parties had taken up the bug chase, and the hue-and-cry ran widely across Internet newsgroups such as comp.arch and comp.sys.intel.

Interestingly, Andrew Grove has in recent years been a columnist for a national women's magazine, writing a Q&A column about people management. His advice tends to be direct, to-the-point, and highly pragmatic--something that makes the Pentium PR fiasco more ironic. The consensus among the mainstream press is that this is a textbook case of how not to handle a commercial crisis. For example, George Gunset says in the Chicago Tribune: "For the appearance of arrogance and disdain for customers, it would be hard to top the handling by Intel Corp. of disclosure of a flaw in its top-drawer Pentium computer chips." Likewise, columnist Walter Mossberg of the Wall Street Journal took Intel to task for not listening to its customers (the same column brought up the long-standing bug in Windows CALC and forced Microsoft to, almost simultaneously, announce a fix for that egregious bug). In short, the mainstream press is just now discovering what PC trade journalists have known for years--Intel has never understood the end-user market, while at the same time has aggressively pitched a consumer-oriented campaign with its "Intel Inside" promotions.

So what exactly was the bug, and why the Pentium? It seems that the bug was a result of five missing entries in a lookup table implemented in a PLA (programmable logic array). I spoke with Paul Ries, a chip designer who has worked on the 386, the MIPS R3000, and some specialized chips for Silicon Graphics (SGI). As a fellow engineer, Ries felt empathy: "The error they made is not that they had a bad algorithm_. Engineers always make mistakes, you usually have lots of people looking for those mistakes. Actually, this bug could have been introduced in a later stepping [later revision]. It could have happened to me, I'm glad it wasn't."

This willingness to cut some slack for fellow engineers was also expressed by Mike Schmit, a long-time assembly-language programmer, and author of the QuantASM optimized-floating-library for the 80x86 architecture (as well as the recently published book, Pentium Processor Optimization Tools). Schmit told me: "This wasn't a conceptual design error, and doesn't change my confidence in Intel engineers and the Intel architecture." Missing data is less disturbing than a deep-logic flaw.

During this time, a number of people on the Internet started looking into the bug's behavior a bit more deeply. Tim Coe, an FPU designer at Vitesse Semiconductor, found that "The best characteristic and only characteristic of the bug to come from Intel is its 1 in 9 billion probability of occurring with random operands. The worst characteristic of the bug is that the specific operands that are most at risk are integers plus-or-minus very small deltas. The integers 3, 9, 15, 21, and 27 minus very small deltas are THE at-risk divisors." (November 28, 1994) For example, a divisor greater than 3.0 and less than 3.0-36*(2-22) is at risk.

Vaughan Pratt, a professor of computer science at Stanford, picked up where Coe left off in characterizing the bug more closely: "The worst FDIV bug with regard to magnitude is Tim Coe's pair 4195835/3145727, for which the Pentium gets 1.333739 instead of the correct 1.33382045."

Pratt approached the problem from an empirical point of view: "I wrote a program that knocked 10^-6 off all one million pairs of integers from 1 to 1000, did the million divisions on a Pentium, and checked which were buggy, and by how much. There were 627 bugs, of which 426 had a relative error worse than 10^-7, and 14 worse than 10^-5."

The denominator in Coe's pair is 3*220--1, which is 23.99999237060547*217 in decimal, and 4147FFFF80000000 in hex representation. As you can see, this number has 20 1-bits in a row. The relative error from dividing into 4195835 is .999 times 2--14, the largest error observed to date, perhaps the maximum possible error for this bug.

How important is a relative error of this magnitude? All the technical people I spoke to, even those sympathetic to Intel, agreed that for users who are doing numerically-intensive computing (rocket science, options trading, real-time machine control), this is a very serious error indeed. Pratt's own work has to do with automatically converting scanned drawings to a Postscript representation of circles and arcs, and as such, is "highly sensitive to numerical instability." Pratt adds: "This bug is maximally insidious: It is about as large as it could possibly be without actually triggering warning bells when people review their columns of data."

As new findings rolled into the newsgroups on an almost-daily basis, Intel's actions seemed to follow a bizarre logic of its own. Although semiconductor engineers may have been empathetic, many other users were not. The "scandal" snowballed, while Intel stonewalled. On November 27, 1994, Andrew Grove posted a message to the Internet, through an intermediary: "We would like to find all users of the Pentium processor who are engaged in work involving heavy duty scientific/floating point calculations and resolve their problem in the most appropriate fashion including, if necessary, by replacing their chips with new ones. We don't know how to set precise rules on this so we decided to do it thru individual discussions between each of you and a technically trained Intel person."

Although this policy has perhaps some semblance of sense from a purely technical or logistical point of view, it was not what the villagers of the Internet wanted to hear. A battle of white papers ensued between Intel and IBM.

In a widely quoted statement, Intel claimed that the possibility of encountering the error (using a random set of data) was 1 in 9 billion. Given 15 minutes per day of intensive FPU usage, a user would encounter this error once every 27,000 years. As is well-known now, IBM announced it would stop shipping Pentium-based machines as a result of its analysis that a user could encounter the error as much as once every 28 days.

The key difference between the two papers is not any technical details of how the bug works, but on somewhat arbitrarily-chosen assumptions of how users work and what kind of data they process. In both the white papers, the numbers (15 minutes per day, 1 fdiv instruction every x thousand cycles) seemed picked out of a hat. PC Week labs weighed in with an analysis that was somewhere between the Intel and IBM extremes, although the white paper did not provide detail about the assumptions made, making it impossible to evaluate. Although this blizzard of white papers was based in part on painstaking and tedious work by Internet participants, there was little acknowledgment of this.

Vaughan Pratt came up with an analysis that was much more pessimistic than IBM's: "If you're unlucky enough to do all your computing with such bruised integers' [also known as truncated integers, which it's possible are common in financial spreadsheet calculations], and you do lots of divisions (the Pentium can do two million divisions a second) you'll see the big 10^-5 errors 14*2 = 28 times a second or once every 35 microseconds, and the smaller errors, between 10^-5 and 10^-7, 627*2 = 1254 times a second, or more than one a millisecond!"

The volume and tenor of the complaints increased, to the point where a lot of the original participants (such as Nicely, Pratt, and Coe) found it difficult to sort out signal from noise in the newsgroups, and gave up altogether. Some users illustrated the situation with an automotive analogy: Say that GM manufactured a truck that turned out to have a serious brake defect, and, rather than a general recall, decided to interview each and every user to find out what speeds they drove, and how often they tapped on the brakes. This was not at all satisfactory.

A number of bug fixes and workarounds were proposed. Cleve Moler of MathWorks came up with a simple technique that does not rely on software emulation of floating-point arithmetic (which would slow processing down by a factor of ten). Instead, a quick comparison is made to see if the leading digits contain the "at-risk" values, and if so, these values are scaled proportionately so that the result is the same, but the dangerous bit patterns (strings of consecutive 1 bits) are scrambled. Initially, Moler used a scale factor of 3/4, but others showed that, in certain cases, an at-risk divisor would be scaled to a number that also contained the risky bit strings.

In mid-December, Intel tried another tactic. Intel staff pored through Internet postings, e-mailing each complainant a customized solicitation for a replacement chip. User after user reported receiving "the call" from Intel and, rather than be appreciative, took it as one more sign of cluelessness and manipulation. Finally, on December 20, Intel caved and announced a no-questions-asked replacement policy for an estimated two million CPUs.

At this writing, it is unclear how this will pan out. Will this be the scandal that won't go away? End-user lawsuits so far amounting to more than $600 million and sales of Pentium PCs dropping more than 20 percent in December alone suggest so. Or will the 15 minutes of fame for the Pentium wind down as this event sinks like a stone into the public's collective unconsciousness? One thing is clear: If Andy Grove ever writes another column on management, people will interpret his words in a whole new light.


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.