Why Is Exact Floating-Point Output So Hard?

January 30, 2014

Last week, I observed that every binary floating-point number has an exact decimal representation with the same number of digits after the decimal point as the binary version has bits after the binary point. Since then, I have thought of an easier way to explain this phenomenon.

We can think of a binary floating-point number as comprising a fraction, which we can think of as a binary point followed by some number of bits, together with an exponent, which we can think of as the number of bits that we should move the binary point to the left or right to obtain the final value. As a result, every binary floating-point number is equivalent to what we might call a binary fixed-point number: a sequence of bits with a binary point separating the integer part from the fractional part.

Let's pick a binary fixed-point number to represent in decimal:

```
10010.01011
```

If we were to remove the binary point from this number, we would have `1001001011` , which, of course, is an integer. Like all integers, it has an obvious, exact decimal representation — in this case, 587. In order to see that it has an exact decimal representation even if we put the binary point back in, let's take `1001001011` and move the binary point one bit to the left to get `100100101.1`. Doing so effectively divides our value by 2, so we get (exactly) 293.5.
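As a quick sanity check (a Python sketch, not part of the original article), we can confirm both steps: the bits with the point removed form the integer 587, and shifting the binary point one place left halves the value.

```python
# The bits of 10010.01011 with the binary point removed form an integer.
v = int('1001001011', 2)
print(v)      # 587

# Moving the binary point one place to the left divides the value by 2.
print(v / 2)  # 293.5
```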

More generally, if we have a value that is represented as a decimal number with n digits after the decimal point, and we divide it by 2, the result of the division can always be represented exactly as a decimal number with n+1 digits after the decimal point. We can see this by noting that dividing by 2 is equivalent to multiplying by 5 and then dividing by 10. Multiplying by 5 yields a number with n digits after the decimal point; dividing by 10 is equivalent to moving the decimal point one position to the left, putting one additional digit after the decimal point. In other words, if we multiply 587 by 5, we get 2935; moving the decimal point one position to the left gives 293.5.
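The multiply-by-5-then-shift rule is easy to mechanize with Python's `decimal` module, whose arithmetic on these small values is exact; the `halve` helper below is my own illustration, not code from the article.

```python
from decimal import Decimal

def halve(d):
    """Divide an exact Decimal by 2: multiply by 5, then move the
    decimal point one position to the left (multiply by 10**-1)."""
    return (d * 5).scaleb(-1)

x = Decimal(587)
for _ in range(5):   # one halving per bit after the binary point
    x = halve(x)
print(x)             # 18.34375
```

Each call adds exactly one digit after the decimal point, matching the argument above.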

Similar reasoning allows us to divide by 2 four more times, yielding 146.75, 73.375, 36.6875, and, finally, 18.34375. We have learned that `10010.01011` in binary is exactly equivalent to 18.34375 in decimal, and both numbers have exactly the same number of positions after the radix (decimal or binary) point. We note in passing that all of these decimal digits are necessary: each division by 2 is a multiplication by 5 followed by a shift, and multiplying by 5 yields a final digit of zero only when the number being halved is even. That cannot happen here, because the last bit after the binary point in a binary fixed-point number is 1 by definition, so the integer we start with is odd and every intermediate result ends in 5. In other words, if we were to try to express `10010.0101100` in decimal, we would start by discarding the two irrelevant zeroes at the end of the fraction, after which we would get exactly the same result as we did before.
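The whole procedure collapses into a few lines of exact integer arithmetic: moving the binary point n places to the right gives an integer v, so the value is v / 2^n = v × 5^n / 10^n. Here is a sketch of a converter along those lines (the function name and its handling of input are my own choices, not the article's):

```python
def bin_fixed_to_decimal(s):
    """Exact decimal string for a binary fixed-point literal
    such as '10010.01011'.

    Removing the binary point gives an integer v with n fractional
    bits; the value is v / 2**n == v * 5**n / 10**n, i.e. the
    integer v * 5**n with a decimal point n digits from the right.
    """
    int_part, _, frac_part = s.partition('.')
    n = len(frac_part)
    v = int(int_part + frac_part, 2) * 5**n
    digits = str(v).rjust(n + 1, '0')   # pad so a leading '0.' appears
    return digits[:-n] + '.' + digits[-n:] if n else digits

print(bin_fixed_to_decimal('10010.01011'))  # 18.34375
```

Note that this sketch does not strip trailing zero bits first, so `10010.0101100` would come out as `18.3437500` rather than `18.34375`.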

In light of this discussion, it might appear that converting a binary floating-point number to its exact decimal equivalent is a fairly easy problem. However, this belief runs into trouble because of floating-point numbers' exponents. Floating-point numbers are intended to encompass a substantial range, to which end they are typically defined so that the exponent can be large. For example, the 64-bit floating-point format most widely used these days allows 11 bits for the exponent, stored in biased form, and therefore allows for floating-point numbers with exponents as small as -1022. Representing such a number exactly therefore requires a number of digits after the decimal point equal to 1,022 plus the number of bits in the fraction (perhaps plus or minus 1 or 2 to account for fine points, such as denormalized numbers, in the floating-point format's definition). Computing such a representation is not exactly a quick or simple computational task.
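Python lets us see this extreme case directly: constructing a `Decimal` from a `float` is an exact conversion, so the smallest positive 64-bit double, the denormalized value 2^-1074, shows just how long an exact expansion can get.

```python
from decimal import Decimal

# The smallest positive 64-bit double is the subnormal 2**-1074.
# Decimal(float) converts exactly, with no rounding.
tiny = 2.0 ** -1074
t = Decimal(tiny).as_tuple()

print(-t.exponent)    # 1074 digits after the decimal point
print(len(t.digits))  # 751 of them are significant (the rest leading zeros)
```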

Of course, it is safe to say that very few programmers would ever want to print a floating-point number with more than a thousand digits after the decimal point. For one thing, most of the time, almost all of those digits will be zero. Moreover, even if they are not zero, such a large number of decimal digits is a gross misstatement of the amount of precision that is inherent in the binary number to begin with. In fact, the very reason for using floating-point arithmetic in the first place is to decouple a number's precision from its magnitude. In practice, then, there does not seem to be a pragmatic need to use that many decimal digits to represent a binary floating-point number.

Unfortunately, even though we can say with some confidence that we do not have a pragmatic need to use more than a thousand digits after the decimal point to represent a 64-bit floating-point number, it is a much harder problem to decide just how many digits we do need to be able to produce. To get a handle on this problem, we can start with a few desirable properties:

• A given floating-point number should always produce the same decimal representation whenever we print it with maximum precision.
• Two different floating-point numbers should always produce different decimal representations in maximum precision.
• If we write a floating-point number in decimal, and then read it back in, we should get exactly the same floating-point number back again.
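The third property can be seen in action in Python, whose `repr` for floats (in CPython 3.1 and later) is guaranteed to produce the shortest decimal string that reads back as exactly the same double; this example is an illustration of the property, not part of the original article.

```python
# Round-tripping a double through its decimal representation.
x = 0.1 + 0.2
s = repr(x)                  # shortest string that reads back exactly
print(s)                     # 0.30000000000000004
assert float(s) == x         # reading it back recovers the same number

# 17 significant digits always suffice for a 64-bit double.
assert float(f'{x:.17g}') == x
```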

It may not be immediately obvious that we can implement all of these properties; but at least we should try. Because the last of these properties involves reading a floating-point value, our next order of business will be to understand the problems that surround floating-point input. As we shall see next week, these problems turn out to be quite different from the problems surrounding floating-point output.
