The Temptation For Implementations To Cut Corners
Last week, I noted that C++ programs convert floating-point numbers from decimal strings to internal binary form in several different contexts, and suggested that it would be useful if all such conversions were as accurate as possible and, independently, were always done in exactly the same way. These desirable qualities may seem so obvious as to raise the question of why an implementation would ever want to do anything else. The reasons, as is so often the case, are complicated.
Consider these statements:
float one = 1, ten = 10, tenth = one / ten; double percent = tenth * tenth;
What is the value of percent? Obviously it should be close to 0.01, but how close?
On an implementation that conforms to the IEEE floating-point standard, there is no question: The value of tenth is the closest single-precision floating-point approximation to 0.1, and the value of percent is obtained by squaring that (approximate) value, rounding the result to single precision (because the product of two float values is a float, not a double), and then converting that value (presumably without changing it) to double.
To make this discussion concrete, let's assume that float numbers have 24 significant bits. Then the value of tenth will be a floating-point number with a binary value of exactly 0.000110011001100110011001100. This number is written with 27 bits after the binary point because the first three zeroes after the binary point are not significant: only the 24 bits that follow them count toward the precision.
Remember that every binary floating-point number can be represented exactly as a decimal number with the same number of digits after the decimal point as the binary number has after the binary point. If we disregard the trailing zeroes, this particular value has 25 bits after the binary point, which means that we can represent it exactly as a decimal value with 25 digits after the decimal point, namely 0.0999999940395355224609375. This is the best approximation to 0.1 that is possible in 24-bit binary floating-point.
If we now square this number, we get a binary value of exactly 0.00000010100011110101110000100011110101110000101001. With its 50 bits after the binary point, this value translates into a decimal value of exactly 0.00999999880790714001932428800500929355621337890625, which has 50 digits after the decimal point. If our typical IEEE-compliant implementation has 53 significant bits in double-precision floating-point numbers, then this value can be represented exactly as a double but not as a float. If we wish to represent it as a float, we must round it to 24 significant bits, which gives us 0.000000101000111101011100001001 in binary. The last bit changes from 0 to 1 as a consequence of rounding. Because this value has 30 bits after the binary point, we can convert it exactly to a decimal value with 30 digits after the decimal point, namely 0.009999998845160007476806640625.
If you've spaced out while you were trying to follow this intricate discussion, here's the point. If we assume that float variables have 24 significant bits and double variables have 53 significant bits, then:
- The closest float to 0.1 is 0.0999999940395355224609375, so that is the value that tenth will have.
- The exact value of tenth * tenth is 0.00999999880790714001932428800500929355621337890625, which can be represented exactly as a double.
- If we round tenth * tenth to float, the result is exactly 0.009999998845160007476806640625.
- The result of rounding tenth * tenth to float differs from the exact value of tenth * tenth in the 9th significant (decimal) digit, and from the ideal value 0.01 in the 7th significant digit.
In other words, rounding tenth to float limits the precision of tenth * tenth to about seven significant digits.
I now invite you to ponder the following question:
Suppose you are implementing C++ on a computer that always produces a double result as the product of two float values. On such hardware, storing the double value of tenth * tenth directly in percent might be substantially faster than rounding that result to float and then converting it back to double. In this case, the decision about whether to round would affect the 9th and subsequent significant digits of a value that has only seven significant digits in the first place.
As an implementor, surely you will be tempted to generate the faster code, because doing so is at least as accurate as computing the result as a double, rounding it to float, and then converting it back to double. What possible disadvantage could there be to this faster, more accurate approach?
We shall see next week.