Accurate Floating-Point Input: Several Contexts, One Behavior
Last week, I suggested that it would be reasonable for an implementation to follow a simple rule for converting strings to floating-point numbers: Every digit string represents a specific number; the result of the conversion should be the result of rounding that number to the given floating-point precision according to the implementation's normal rounding rules. So, for example, if you read the string 0.1 into a variable of type double, the result should be what you would get from rounding 0.1 (the exact mathematical value 0.1) to a double value according to whatever your implementation's rounding rules might be.
I also suggested last week that this kind of conversion might have technical problems in its implementation. Surely, however, those technical problems are surmountable; so let's imagine that we had an implementation that worked this way and look at some consequences.
Note first that there are several different ways of converting a character sequence that represents a floating-point number to the corresponding internal floating-point type:
double d = 0.1;
stringstream("0.1") >> d;
sscanf("0.1", "%lf", &d);
d = strtod("0.1", NULL);
The first of these conversions happens during compilation; the second is done by the C++ standard library; the third and fourth are done by the C standard library. If the implementation is as accurate as possible, all four of these techniques should result in exactly the same value for d.
Moreover, if we execute
d = 1.0 / 10.0;
assert(d == 0.1);
we would expect this assertion to pass if we were to run it on an implementation that complies with IEEE floating-point arithmetic. The reason is that IEEE requires the results of the five fundamental operations (addition, subtraction, multiplication, division, and square root) to be exactly equal to the results of doing those operations in infinite precision and then rounding to the target precision. Therefore, when we execute
d = 1.0 / 10.0;
IEEE requires the result to be the same as that of rounding the exact value 0.1 to the precision of d. That result, in turn, is exactly the same as the compiler's interpretation of the literal 0.1 as a double constant. The same argument applies to the other three conversion contexts.
Trying these various conversions on my desktop computer reveals that they all seem to work correctly. Therefore, it may come as a surprise that they are not required to work. For example, the C99 standard says the following about converting decimal floating-point constants to their internal representation (subclause 6.4.4.2, paragraph 3):
For decimal floating constants …, the result is either the nearest representable value, or the larger or smaller representable value immediately adjacent to the nearest representable value, chosen in an implementation-defined manner.
In other words, the implementation is permitted to get the low-order bit wrong.
The C++11 standard is less forgiving (subclause 2.14.4, paragraph 1):
If the scaled value is in the range of representable values for its type, the result is the scaled value if representable, else the larger or smaller representable value nearest the scaled value, chosen in an implementation-defined manner.
In other words, C++ requires that if a floating-point constant has an exact representation as a floating-point number (0.5, for example), then the implementation must yield that exact representation. If, however, the constant cannot be exactly represented in floating-point (0.1, for example), then the implementation is permitted to round in either direction.
In other words, neither C99 nor C++11 guarantees that the assertion in the following code will succeed:
double d = 1.0 / 10.0;
assert(d == 0.1);
Why on earth not? The reasons are social and historical; we shall begin exploring them next week.