Dr. Dobb's

# The Temptation For Implementations To Cut Corners

February 20, 2014

Last week, I noted that C++ programs convert floating-point numbers from decimal strings to internal binary form in several different contexts, and suggested that it would be useful if all such conversions were as accurate as possible and, independently, were always done in exactly the same way. These desirable qualities may seem so obvious as to raise the question of why an implementation would ever want to do anything else. The reasons, as is so often the case, are complicated.

Consider these statements:

```
float one = 1, ten = 10, tenth = one / ten;
double percent = tenth * tenth;
```

What is the value of `percent`? Obviously it should be close to `0.01`, but how close?

On an implementation that conforms to the IEEE floating-point standard, there is no question: The value of `tenth` is the closest single-precision floating-point approximation to `0.1`, and the value of `percent` is obtained by squaring that (approximate) value, rounding the result to single precision (because the product of two `float` values is a `float`, not a `double`), and then converting that value (presumably without changing it) to `double`.

To make this discussion concrete, let's assume that `float` numbers have 24 significant bits. Then the value of `tenth` will be a floating-point number with a binary value of exactly `0.000110011001100110011001101`. This number has 27 bits after the binary point, of which the first three zeroes are not significant, leaving 24 significant bits. Note that the final bit is 1: the infinite binary expansion of 0.1 continues `1100 1100…` beyond the 24th significant bit, so rounding to nearest rounds the last bit up.

Remember that every binary floating-point number can be represented exactly as a decimal number with the same number of digits after the decimal point as the binary number has after the binary point. This particular value has 27 bits after the binary point, which means that we can represent it exactly as a decimal value with 27 digits after the decimal point, namely `0.100000001490116119384765625`. This is the best approximation to 0.1 that is possible in 24-bit binary floating-point; notice that it is slightly greater than 0.1.

If we now square this number, we get a binary value of exactly `0.000000101000111101011100001010100011110101110000101001`. With its 54 bits after the binary point, this value translates into a decimal value of exactly `0.010000000298023226097399174250313080847263336181640625`, which has 54 digits after the decimal point. If our typical IEEE-compliant implementation has 53 significant bits in double-precision floating-point numbers, then this value, with its 48 significant bits, can be represented exactly as a `double` but not as a `float`. If we wish to represent it as a `float`, we must round it to 24 significant bits, which gives us `0.000000101000111101011100001011` in binary. The last bit changes from 0 to 1 as a consequence of rounding. Because this value has 30 bits after the binary point, we can convert it exactly to a decimal value with 30 digits after the decimal point, namely `0.010000000707805156707763671875`.

If you've spaced out while you were trying to follow this intricate discussion, here's the point. If we assume that `float` variables have 24 significant bits and `double` variables have 53 significant bits, then:

• The closest `float` to `0.1` is `0.100000001490116119384765625`, so that is the value that `tenth` will have.
• The exact value of `tenth * tenth` is `0.010000000298023226097399174250313080847263336181640625`, which can be represented exactly as a `double`.
• If we round `tenth * tenth` to `float`, the result is exactly `0.010000000707805156707763671875`.
• The result of rounding `tenth * tenth` to `float` differs from the exact value of `tenth * tenth` in the 9th significant (decimal) digit, and both of these values differ from the ideal value 0.01 in the 9th significant digit as well.

In other words, rounding `tenth` to `float`, and then rounding the product again, limits the precision of `tenth * tenth` to about seven significant digits.

I now invite you to ponder the following question:

Suppose you are implementing C++ on a computer that always produces a `double` result as the product of two `float` values. On such hardware, storing the `double` value of `tenth * tenth` directly in `percent` might be substantially faster than rounding that result to `float` and then converting it back to `double`. In this case, the decision about whether to round would affect the 9th and subsequent significant digits of a value that has only seven significant digits in the first place.

As an implementor, surely you will be tempted to generate the faster code, because doing so is at least as accurate as computing the result as a `double`, rounding it to `float`, and then converting it back to `double`. What possible disadvantage could there be to this faster, more accurate approach? We shall see next week.
