Floating-Point Input and Output Are Not Symmetric
Last week, I observed that converting a binary floating-point number to printable decimal form is both easy and hard. The problem is easy because every binary floating-point number has an exact decimal representation. Moreover, it is possible to determine in advance how many decimal digits that exact representation needs, which means that we can always compute in advance an upper bound on how much memory the conversion might require. On the other hand, the problem is difficult because very small floating-point numbers may require a large number of digits after the decimal point for an accurate representation. Moreover, last week we said that it was desirable to guarantee that writing a floating-point number in decimal and reading it back in again yields exactly the same value that we wrote initially. It is far from clear how to figure out how many digits are needed to meet that requirement.
In short, floating-point output is difficult because
- The number of digits needed to represent the result exactly might be very large, and
- It is not easy to figure out how many of those digits are needed to preserve the values being converted.
Floating-point input has almost the opposite problems. When we convert a decimal value to binary floating-point, we normally do so in order to put that value into a variable with a type such as double. Accordingly, we know the type of the result, which means that we know in advance how many bits that result must have. Moreover, we know that there are many decimal values, such as 0.1, that cannot be represented exactly in binary no matter how many bits we have available. Accordingly, we do not have to worry about representing the exact value of a decimal number as a binary floating-point variable, because it is not always possible to do so.
You might think that the a priori limit on the number of bits in a floating-point value makes input an easier problem than output. However, even if we cannot generally represent input exactly, we have to be able to say something sensible about the relationship between the decimal input that we are reading and the binary floating-point number that results. One straightforward claim that we might wish to make about floating-point input is:
The result of reading a decimal floating-point number is the binary floating-point value that would be obtained by rounding the decimal input to the number of bits in the result.
This rule is possible only because we know in advance how many bits the result will have. Otherwise, we would have to do the conversion in order to figure out how to do the conversion. However, because the size of a floating-point input variable is known in advance, the rule makes a great deal of sense. In particular, IEEE floating-point arithmetic uses the same rule for computations such as division. For example, if we divide 1.0 by 10.0 on a machine with IEEE floating-point arithmetic, we are guaranteed a result that is equal to what would be obtained by converting the infinite-precision result of such a division to the type we want for our result according to whatever rounding rules are in effect. Our proposed rule therefore has the desirable effect that reading 0.1 into a floating-point variable would give that variable exactly the same value as would assigning to it the result of dividing 1.0 by 10.0.
Perhaps surprisingly, this simple-sounding rule is far from easy to implement. To see why, consider a machine with 32-bit IEEE floating-point arithmetic. On such a machine, the fraction part of a floating-point number has 24 bits, which means that 16777216 (2^24) can be represented exactly, but 16777217 (2^24+1) cannot. If we try to convert 16777217 to this machine's binary floating-point format, the values 16777216 and 16777218 are both equally close to 16777217. Which one should we choose?
IEEE rounding rules say that in general, when there is a tie, we should prefer the result with a low-order bit of 0. If we represent 2^24 as a floating-point number with a 24-bit fraction, the low-order bit signifies a value of 2, just as when we represent 1000 (10^3) as a 3-digit decimal floating-point number (i.e., 0.100 times 10^4), the low-order digit signifies a value of 10.
If you didn't follow the preceding two paragraphs, the thing to remember is that IEEE floating-point rules require rounding 16777217 to 16777216 rather than to 16777218. However, a value that is even a tiny bit larger than 16777217 cannot be rounded to 16777216, because such a value is closer to 16777218 than it is to 16777216. In other words, 16777217 must be rounded to 16777216, but 16777217.1 must be rounded to 16777218, 16777217.000000000000000001 must be rounded to 16777218, and so on.
Now consider the plight of a program that is trying to convert a decimal value to binary floating-point. If it reads 16777217.0, it does not know yet whether the final result will be 16777216 or 16777218. Moreover, that last 0 can be followed by any number of additional zeroes; it is not until the converter sees a nonzero digit (or the end of the input) that it knows what the final result will be. Accordingly, the seemingly simple rule that the result of floating-point input must be as accurate as the result of arithmetic is anything but simple to implement.
Of course, at the moment we're thinking about design rather than implementation, so let's toss the problem of implementation complexity over the fence to the implementers and continue thinking about the consequences of input conversions that follow this rule. We'll start exploring those consequences next week.