Embedded Systems

Inside the Pentium II Math Bug

By Robert R. Collins, August 01, 1997

Two days before Intel's biggest processor announcement in years, a math bug in the Pentium Pro and Pentium II came to light. Robert takes you inside the Dan-0411 "flag erratum," and tells how the story unfolded.

Dr. Dobb's Journal August 1997: Inside the Pentium II Math Bug

Dan-0411 rocks the industry

Robert is a DDJ contributing editor and independent consultant on the x86 architecture. He can be reached at [email protected].

Sidebar: The Chronology of Dan-0411

Just two days before its biggest processor announcement in years, Intel was hit by reports of a math bug in its Pentium Pro and (soon-to-be-announced) Pentium II processors. The bad timing prompted reports that the bug disclosure was deliberately timed to coincide with the Pentium II announcement, thereby maximizing the embarrassment to Intel. Another early rumor put AMD behind the bug report. Yet another industry rumor said that the Pentium II used for the tests was illegally obtained. As intriguing as these theories may be, none of them is true. How do I know? Because I wrote the bug report.

The news media and Internet community knew the bug as the Dan-0411 bug. Intel had its own name for it -- the Flag Erratum.

The Facts

I received e-mail from "Dan" who asked if I could reproduce what he thought was a bug in the Pentium Pro. After contemplating my involvement for ten days, I finally decided to help out (see the accompanying text box). I wrote an assembly-language program that checked into the problem. I ran the test on Pentium Pro, Pentium II, Pentium classic (P54C), Pentium MMX (P55C), and AMD K6 processors. (I had purchased the Pentium II over the counter at Fry's Electronics in Sunnyvale, California, six weeks before its official introduction. There was nothing illegal about the acquisition of the Pentium II processor.) After running the test on these various processors, I came to the conclusion that a bug did exist in the Pentium Pro and Pentium II.

Why Dan-0411?

These days, astronomers name new stars and comets by combining the discoverer's name and some number. Why should microprocessor bugs be different? In this case, "Dan" is the discoverer of the bug, and 04-11 (1997) is the date on which I got my first e-mail about it. So I've named the bug "Dan-0411" after its discoverer and the date he first reported it to me. (Please refer to http://www.x86.org/secrets/Dan0411.html for the text of the original bug announcement.)

What is the Bug and What Does it Affect?

The bug relates to operations that convert floating-point numbers into integer numbers. All floating-point numbers are stored inside the microprocessor in an 80-bit format. Even though the external representation of a number may not be an 80-bit format, once the number is loaded into the microprocessor, it is converted to an 80-bit format. Integer numbers are stored externally in more than one size. The two sizes that matter here are 16 bits (a word integer in Intel's terminology) and 32 bits (a short integer). It is often desirable to store floating-point numbers as integer numbers. On occasion, a converted number won't fit into the smaller integer format. This overflow case is where the bug occurs.

The host software is supposed to be warned by the microprocessor when such a floating-point conversion error occurs; a specific error flag is supposed to be set in the floating-point status register. If the microprocessor fails to set this flag, it does not comply with the IEEE 754 floating-point standard, which mandates such behavior. With the Dan-0411 bug, the Pentium II and Pentium Pro fail to set this error flag in many cases.

A launch failure of the Ariane 5 rocket, which happened less than a minute into the flight, was traced to an unhandled overflow of exactly this kind. In that case, it was a software bug, not a microprocessor bug, that caused the problem. One of the computers on board performed a floating-point to integer conversion that overflowed. The overflow was not expected and, therefore, not handled by the computer software. As a result, the computer did a dump of its memory. Unfortunately, this memory dump was interpreted by the rocket as instructions to its rocket nozzles. Result: Boom!

The case of the Ariane rocket is a sensational example of the drastic consequences of an unhandled float-to-integer overflow. Pentium Pro and Pentium II users, on the other hand, are most likely to see the results of this bug in their graphics displays or in heavy-duty numerical analysis programs. Intel says ordinary users might see a temporary screen glitch on some games when this bug occurs.

The Nature of the Bug

The Dan-0411 bug occurs when a large negative floating-point number is stored to memory in an integer format. Under normal operation, the largest negative integer (MAXNEG) is stored in memory when a floating-point number is too large to fit in the integer format. The FPU Status Word is supposed to indicate that an Invalid Operand Exception (IE) occurred (FSW.IE = 1).
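
The expected behavior with exceptions masked can be modeled in a few lines. The following Python sketch only illustrates the rules just described and is not an emulation of the FPU. The function name fist_expected and the flag constants are hypothetical names chosen for the example, and Python's round(), which rounds halves to even, is assumed to stand in for the FPU's default round-to-nearest mode. Only finite inputs are handled.

```python
# Illustrative model of a masked float-to-integer store (FIST) as described
# in the text. Names are hypothetical; this is not a hardware emulation.
FSW_IE = 0x01  # invalid-operation flag, bit 0 of the FPU status word
FSW_PE = 0x20  # precision (inexact) flag, bit 5 of the FPU status word


def fist_expected(value, bits=16):
    """Return (stored_integer, status_flags) for a finite input value."""
    maxneg = -(1 << (bits - 1))       # -32768 for a 16-bit destination
    maxpos = (1 << (bits - 1)) - 1    # +32767 for a 16-bit destination
    rounded = round(value)            # round half to even (assumed default mode)
    if rounded < maxneg or rounded > maxpos:
        # Overflow: MAXNEG goes to memory and IE is flagged.
        return maxneg, FSW_IE
    # In range: flag PE only if rounding changed the value.
    flags = FSW_PE if rounded != value else 0
    return rounded, flags
```

Under this model, fist_expected(-65536.0) reports MAXNEG together with the IE flag for a 16-bit destination, while the same operand stored with bits=32 fits and raises no flag at all.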

Floating-point numbers that overflow the "real number" format are supposed to behave differently from floating-point numbers that overflow the "integer number" format. Float-to-real overflows are supposed to set the overflow flag (FSW.OE = 1); float-to-integer overflows are supposed to set the Invalid Operand Exception flag (FSW.IE = 1). Section 7.8.4 of the Pentium Pro Family Developer's Manual, Volume 2 makes this difference quite clear:

Float-to-real overflows:

The FPU reports a floating-point numeric overflow exception (#O) whenever the rounded result of an arithmetic instruction exceeds the largest allowable finite value that will fit into the real format of the destination operand. For example, if the destination format is extended-real (80 bits), overflow occurs when the rounded result falls outside the unbiased range of -1.0 × 2¹⁶³⁸⁴ to 1.0 × 2¹⁶³⁸⁴ (exclusive). Numeric overflow can occur on arithmetic operations where the result is stored in an FPU data register. It can also occur on store-real operations (with the FST and FSTP instructions), where a within-range value in a data register is stored in memory in a single- or double-real format. The overflow threshold range for the single-real format is -1.0 × 2¹²⁸ to 1.0 × 2¹²⁸; the range for the double-real format is -1.0 × 2¹⁰²⁴ to 1.0 × 2¹⁰²⁴.

Float-to-integer overflows:

The numeric overflow exception cannot occur when overflow occurs when storing values in an integer or BCD integer format. Instead, the invalid-arithmetic-operand exception is signaled.

When the Dan-0411 bug occurs, the processor does not set the Invalid Operand Exception (FSW.IE) bit; only the precision-exception (FSW.PE) bit is set. The precision-exception flag indicates that a result can't be precisely represented by the floating-point operation -- in this case, the float-to-integer store operation. In most cases, programmers ignore this bit. Therefore, when the conditions are met for the Dan-0411 bug to occur, programmers may never know that an error occurred. If that isn't bad enough, it gets worse.
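
To see why the missing flag is so easy to overlook, consider how software typically inspects the status word. The sketch below assumes the standard x87 layout of the six exception flags in the low bits of the FSW (IE in bit 0 through PE in bit 5). The helper names decode_fsw and overflow_detected, and the sample status values, are hypothetical.

```python
# Exception flags in the low six bits of the x87 FPU status word.
FSW_FLAGS = {
    "IE": 0x01,  # invalid operation
    "DE": 0x02,  # denormalized operand
    "ZE": 0x04,  # divide by zero
    "OE": 0x08,  # numeric overflow
    "UE": 0x10,  # numeric underflow
    "PE": 0x20,  # precision (inexact result)
}


def decode_fsw(fsw):
    """List the names of the exception flags set in a status word."""
    return [name for name, mask in FSW_FLAGS.items() if fsw & mask]


def overflow_detected(fsw):
    """A typical check after a FIST: only the IE bit is examined."""
    return bool(fsw & FSW_FLAGS["IE"])
```

A status word of 0x01 decodes to IE and passes the check. A status word of 0x20, which is what the erratum leaves behind, decodes to PE alone, so overflow_detected returns False and the overflow goes unnoticed.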

The Dan-0411 bug occurs in three out of four rounding modes, and it occurs whether exceptions are masked or unmasked. In the case of masked exceptions, the correct value is stored to memory; only the Floating-Point Status Word (FSW) is incorrectly set. For unmasked exceptions, the errant behavior is more serious:

No exception occurs. The floating-point exception handler is not invoked. Therefore, the errant condition is undetectable.
MAXNEG is returned to memory. Storing MAXNEG to memory is an errant condition. When exceptions are unmasked, nothing is supposed to be stored to memory. This means the microprocessor is spuriously storing data to memory when no data is expected.
In the case of the FISTP instruction, the floating-point value is popped from the floating-point stack. When exceptions are unmasked, the floating-point stack is supposed to remain unchanged to allow for error recovery. With the bug, the value is popped from the stack and gone forever. Even if the errant condition were detectable, it would be unrecoverable after the FISTP instruction.
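
The three differences listed above can be put side by side in a small model. This Python sketch is an illustration only. The FPU stack is represented by a list and memory by a dictionary, the names fistp_unmasked and InvalidOperation are hypothetical, and the erratum switch simply selects the errant outcome; it does not model which operands actually trigger the bug. The value on top of the stack is assumed to be out of range for a 16-bit destination.

```python
class InvalidOperation(Exception):
    """Stands in for the unmasked floating-point exception."""


MAXNEG16 = -32768  # value stored for a 16-bit float-to-integer overflow


def fistp_unmasked(stack, memory, addr, erratum=False):
    """Model a 16-bit FISTP of an out-of-range value with IE unmasked."""
    if not erratum:
        # Expected: the handler is invoked, and memory and the
        # FPU stack are left untouched so recovery is possible.
        raise InvalidOperation("float-to-integer overflow")
    # Errant behavior described in the text: no exception,
    # MAXNEG is written to memory, and the operand is popped.
    memory[addr] = MAXNEG16
    stack.pop()
```

In the expected case the caller catches the exception and still finds the operand on the stack. In the errant case the call returns normally, memory holds -32768, and the stack is empty.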

Why Wasn't this Bug Detected Before?

I'm not sure why this bug wasn't detected sooner, but there are clues that could help provide an explanation. Professor William Kahan of the University of California, Berkeley, has written a suite of floating-point test programs in Fortran (see http://http.cs.berkeley.edu/~wkahan/). These programs are commonly used to test the Float-to-Integer Store instructions (FIST and FISTP). Dan ported Dr. Kahan's Fortran programs to C and ran the tests against the Pentium II -- this is when the bug came to light. So in the end, either Intel failed to run Dr. Kahan's tests on the Pentium II, Intel misconfigured the programs, or a Fortran compiler hid the bug in the chip.

Source Code and Programs

Available electronically (see "Availability," page 3) are a source-code file and two executable programs. The programs are executable versions of the stand-alone assembly-language source code. The first program, FISTBUG.EXE, demonstrates the bug in a straightforward manner. When you run the program, all that appears on the screen is either the simple message "*** Dan-0411 bug found. ***", or "Dan-0411 not found."

The second program, FISTBUGV.EXE, runs exactly the same tests as the first, but is much more verbose. This program shows the microprocessor stepping information and itemized results. Each operand under test is printed to the screen, along with pass/fail status for four different testing methods.

DDJ
