Andrew Koenig

Dr. Dobb's Bloggers

Accurate Floating-Point Input: Several Contexts, One Behavior

February 17, 2014

Last week, I suggested that it would be reasonable for an implementation to follow a simple rule for converting strings to floating-point numbers: Every digit string represents a specific number; the result of the conversion should be the result of rounding that number to the given floating-point precision according to the implementation's normal rounding rules. So, for example, if you read the string 0.1 into a variable of type double, the result should be what you would get from rounding 0.1 (the exact mathematical value 0.1) to a double value according to whatever your implementation's rounding rules might be.

I also suggested last week that this kind of conversion might have technical problems in its implementation. Surely, however, those technical problems are surmountable; so let's imagine that we had an implementation that worked this way and look at some consequences.

Note first that there are several different ways of converting a character sequence that represents a floating-point number to the corresponding internal floating-point type:

 
     double d = 0.1;
     stringstream("0.1") >> d;
     
     sscanf("0.1", "%lf", &d);
     d = strtod("0.1", NULL);

The first of these conversions happens during compilation; the second is done by the C++ standard library; the third and fourth are done by the C standard library. If the implementation is as accurate as possible, all four of these techniques should result in exactly the same value for d.
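One way to check this claim on a particular implementation is to perform all four conversions and compare the results. The following is a minimal sketch, not a definitive test: the function name `four_contexts_agree` is made up for illustration, and it uses `std::istringstream` in place of the temporary `stringstream` shown above.

```cpp
#include <cstdio>
#include <cstdlib>
#include <sstream>

// Hypothetical helper: returns true if the three run-time conversions of
// the string "0.1" all yield the same double as the literal 0.1.
bool four_contexts_agree()
{
    const double from_literal = 0.1;              // converted by the compiler

    double from_stream = 0.0;
    std::istringstream in("0.1");
    in >> from_stream;                            // C++ standard library

    double from_sscanf = 0.0;
    std::sscanf("0.1", "%lf", &from_sscanf);      // C standard library

    const double from_strtod =
        std::strtod("0.1", nullptr);              // C standard library

    return from_literal == from_stream
        && from_literal == from_sscanf
        && from_literal == from_strtod;
}
```

A `true` result says only that this compiler and this library agree with each other for this one input; it does not show that either of them rounds correctly in general.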

Moreover, if we execute

 
     d = 1.0 / 10.0;
     assert(d == 0.1);

we would expect this assertion to pass if we were to run it on an implementation that complies with IEEE floating-point arithmetic. The reason is that IEEE requires the results of the five fundamental operations (addition, subtraction, multiplication, division, and square root) to be exactly equal to the results of doing those operations in infinite precision and then rounding to the target precision. Therefore, when we execute

 
     d = 1.0 / 10.0;

IEEE requires the result to be the same as that of rounding the exact value 0.1 to the precision of d. That result, in turn, is exactly the same as the compiler's interpretation of the literal 0.1 as a double constant. The same argument applies to the other three conversion contexts.
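To see the comparison more concretely, it can help to do the division at run time and to print the values with enough digits to tell neighboring doubles apart. The sketch below assumes that `double` is an IEEE 754 64-bit type and that arithmetic is not carried out in extended precision; the names `show17` and `quotient_matches_literal` are illustrative only.

```cpp
#include <cstdio>
#include <string>

// Formats a double with 17 significant decimal digits, which is enough
// to distinguish any two distinct IEEE 754 64-bit values.
std::string show17(double x)
{
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.17g", x);
    return buf;
}

// Divides 1.0 by 10.0 at run time and compares the quotient with the
// compiler's conversion of the literal 0.1. The volatile operands are
// there to discourage the compiler from folding the division away.
bool quotient_matches_literal()
{
    volatile double one = 1.0;
    volatile double ten = 10.0;
    const double quotient = one / ten;
    return quotient == 0.1;
}
```

When both the literal conversion and the output conversion round correctly, `show17(0.1)` produces `0.10000000000000001`, and `quotient_matches_literal()` returns `true`.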

Trying these various conversions on my desktop computer reveals that they all seem to work correctly. Therefore, it may come as a surprise that they are not required to work. For example, the C99 standard says the following about converting decimal floating-point constants to their internal representation (subclause 6.4.4.2, paragraph 3):

For decimal floating constants …, the result is either the nearest representable value,
 or the larger or smaller representable value immediately adjacent to the nearest representable
 value, chosen in an implementation-defined manner.

In other words, the implementation is permitted to get the low-order bit wrong.

The C++11 standard is less forgiving (subclause 2.14.4, paragraph 1):

If the scaled value is in the range of representable values for its type,
 the result is the scaled value if representable, else the larger or smaller
 representable value nearest the scaled value, chosen in an implementation-defined manner.

In other words, C++ requires that if a floating-point constant has an exact representation as a floating-point number (0.5, for example), then the implementation must yield that exact representation. If, however, the constant cannot be exactly represented in floating-point (0.1, for example), then the implementation is permitted to round in either direction.
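The latitude these two passages grant is small, and `std::nextafter` can make it visible. The sketch below takes a double assumed to be the nearest representable value to some decimal constant and returns it with its two immediate neighbors. The `Candidates` struct and the `neighbors_of` function are hypothetical names introduced for this example.

```cpp
#include <cmath>
#include <limits>

// The nearest double to a decimal constant, plus the representable
// values immediately below and above it.
struct Candidates {
    double below;
    double nearest;
    double above;
};

// Assumes `nearest` really is the correctly rounded value; the
// neighbors are then one unit in the last place away on either side.
Candidates neighbors_of(double nearest)
{
    const double inf = std::numeric_limits<double>::infinity();
    return { std::nextafter(nearest, -inf),
             nearest,
             std::nextafter(nearest, inf) };
}
```

For the constant 0.1, the nearest double is slightly larger than the mathematical value 0.1. Under the C99 wording any of the three values returned by `neighbors_of(0.1)` would be acceptable; under the C++11 wording only the two that bracket the exact value, `below` and `nearest`, would be. This assumes the compiler in use converted the literal `0.1` to the nearest double in the first place.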

Consequently, neither a C99 nor a C++11 implementation is required to execute

 
     double d = 1.0 / 10.0;
     assert(d == 0.1);

without failure.

Why on earth not? The reasons are social and historical; we shall begin exploring them next week.
