# The Temptation For Implementations To Cut Corners

Last week, I noted that C++ programs convert floating-point numbers from decimal strings to internal binary form in several different contexts, and suggested that it would be useful if all such conversions were as accurate as possible and, independently, were always done in exactly the same way. These desirable qualities may seem so obvious as to raise the question of why an implementation would ever want to do anything else. The reasons, as is so often the case, are complicated.

Consider these statements:

```cpp
float one = 1, ten = 10, tenth = one / ten;
double percent = tenth * tenth;
```

What is the value of `percent`? Obviously it should be close to `0.01`, but how close?

On an implementation that conforms to the IEEE floating-point standard, there is no question: The value of `tenth` is the closest single-precision floating-point approximation to `0.1`, and the value of `percent` is obtained by squaring that (approximate) value, rounding the result to single precision (because the product of two `float` values is a `float`, not a `double`), and then converting that value (presumably without changing it) to `double`.

To make this discussion concrete, let's assume that `float` numbers have 24 significant bits. Then the value of `tenth` will be a floating-point number with a binary value of exactly `0.000110011001100110011001101`. This number has 27 bits after the binary point, but only 24 of them are significant, because the three zeroes immediately after the binary point serve only to locate it. Note that round-to-nearest rounds the last bit up, because the discarded bits of 0.1's infinite binary expansion are worth more than half a unit in the last place; the result is therefore slightly greater than 0.1.

Remember that every binary floating-point number can be represented exactly as a decimal number with the same number of digits after the decimal point as the binary number has after the binary point. This particular value has 27 bits after the binary point, which means that we can represent it exactly as a decimal value with 27 digits after the decimal point, namely `0.100000001490116119384765625`. This is the best approximation to 0.1 that is possible in 24-bit binary floating-point.

If we now square this number, we get a binary value of exactly `0.000000101000111101011100001010100011110101110000101001`. With its 54 bits after the binary point, this value translates into a decimal value of exactly `0.010000000298023226097399174250313080847263336181640625`, which has 54 digits after the decimal point. Although this value has 54 bits after the binary point, only 48 of them are significant, so if our typical IEEE-compliant implementation has 53 significant bits in double-precision floating-point numbers, this value can be represented exactly as a `double` but not as a `float`. If we wish to represent it as a `float`, we must round it to 24 significant bits, which gives us `0.000000101000111101011100001011` in binary. The last bit changes from 0 to 1 as a consequence of rounding. Because this value has 30 bits after the binary point, we can convert it exactly to a decimal value with 30 digits after the decimal point, namely `0.010000000707805156707763671875`.

If you've spaced out while trying to follow this intricate discussion, here's the point. If we assume that `float` variables have 24 significant bits and `double` variables have 53 significant bits, then:

- The closest `float` to `0.1` is `0.100000001490116119384765625`, so that is the value that `tenth` will have.
- The exact value of `tenth * tenth` is `0.010000000298023226097399174250313080847263336181640625`, which can be represented exactly as a `double`.
- If we round `tenth * tenth` to `float`, the result is exactly `0.010000000707805156707763671875`.
- The result of rounding `tenth * tenth` to `float` differs from the exact value of `tenth * tenth` in the 9th significant (decimal) digit, and it also differs from the ideal value 0.01 in the 9th significant digit.

In other words, rounding `tenth` to `float` limits the precision of `tenth * tenth` to about seven significant digits, because 24 bits corresponds to roughly seven decimal digits.

I now invite you to ponder the following question:

*Suppose you are implementing C++ on a computer that always produces a `double` result as the product of two `float` values. On such hardware, storing the `double` value of `tenth * tenth` directly in `percent` might be substantially faster than rounding that result to `float` and then converting it back to `double`. In this case, the decision about whether to round would affect the 9th and subsequent significant digits of a value that has only seven significant digits in the first place.*

As an implementor, surely you will be tempted to generate the faster code, because doing so is at least as accurate as computing the result as a `double`, rounding it to `float`, and then converting it back to `double`. What possible disadvantage could there be to this faster, more accurate approach?

We shall see next week.