The Temptation For Implementations To Cut Corners
Last week, I noted that C++ programs convert floating-point numbers from decimal strings to internal binary form in several different contexts, and suggested that it would be useful if all such conversions were as accurate as possible and, independently, were always done in exactly the same way. These desirable qualities may seem so obvious as to raise the question of why an implementation would ever want to do anything else. The reasons, as is so often the case, are complicated.
Consider these statements:
float one = 1, ten = 10, tenth = one / ten;
double percent = tenth * tenth;
What is the value of percent? Obviously it should be close to 0.01, but how close?
On an implementation that conforms to the IEEE floating-point standard, there is no question: The value of tenth is the closest single-precision floating-point approximation to 0.1, and the value of percent is obtained by squaring that (approximate) value, rounding the result to single precision (because the product of two float values is a float, not a double), and then converting that value (presumably without changing it) to double.
To make this discussion concrete, let's assume that float numbers have 24 significant bits. Then the value of tenth will be a floating-point number with a binary value of exactly 0.000110011001100110011001100. This number has 27 bits after the binary point: 24 significant bits preceded by three zeroes that are not significant.
Remember that every binary floating-point number can be represented exactly as a decimal number with the same number of digits after the decimal point as the binary number has after the binary point. If we disregard the trailing zeroes, this particular value has 25 bits after the binary point, which means that we can represent it exactly as a decimal value with 25 digits after the decimal point, namely 0.0999999940395355224609375. This is the best approximation to 0.1 that is possible in 24-bit binary floating-point.
If we now square this number, we get a binary value of exactly 0.00000010100011110101110000100011110101110000101001. With its 50 bits after the binary point, this value translates into a decimal value of exactly 0.00999999880790714001932428800500929355621337890625, which has 50 digits after the decimal point. If our typical IEEE-compliant implementation has 53 significant bits in double-precision floating-point numbers, then this value can be represented exactly as a double but not as a float. If we wish to represent it as a float, we must round it to 24 significant bits, which gives us 0.000000101000111101011100001001 in binary. The last bit changes from 0 to 1 as a consequence of rounding. Because this value has 30 bits after the binary point, we can convert it exactly to a decimal value with 30 digits after the decimal point, namely 0.009999998845160007476806640625.
If you've spaced out while you were trying to follow this intricate discussion, here's the point. If we assume that float variables have 24 significant bits and double variables have 53 significant bits, then:
- The closest float approximation to 0.1 is 0.0999999940395355224609375, so that is the value that tenth will have.
- The exact value of tenth * tenth is 0.00999999880790714001932428800500929355621337890625, which can be represented exactly as a double but not as a float.
- If we round tenth * tenth to float, the result is exactly 0.009999998845160007476806640625.
- The result of rounding tenth * tenth to float differs from the exact value of tenth * tenth in the 9th significant (decimal) digit, and from the ideal value 0.01 in the 7th significant digit.
In other words, rounding tenth * tenth to float limits its precision to about seven significant digits.
I now invite you to ponder the following question:
Suppose you are implementing C++ on a computer that always produces a double result as the product of two float values. On such hardware, storing the double value of tenth * tenth directly in percent might be substantially faster than rounding that result to float and then converting it back to double. In this case, the decision about whether to round would affect the 9th and subsequent significant digits of a value that has only seven significant digits in the first place.
As an implementor, surely you will be tempted to generate the faster code, because doing so is at least as accurate as computing the result as a double, rounding it to float, and then converting it back to double. What possible disadvantage could there be to this faster, more accurate approach?
We shall see next week.