Channels ▼

Andrew Koenig

Dr. Dobb's Bloggers

Floating-Point Input and Output Are Not Symmetric

February 06, 2014

Last week, I observed that converting a binary floating-point number to printable decimal form is both easy and hard. The problem is easy because every binary floating-point number has an exact decimal representation. Moreover, it is possible to determine in advance how many decimal digits are needed to contain that exact representation, which means that it is always possible to compute in advance an upper bound to how much memory might be required to obtain that exact representation. On the other hand, the problem is difficult because very small floating-point numbers may require a large number of digits after the decimal point for an accurate representation. Moreover, last week we said that it was desirable to be able to guarantee that writing a floating-point number in decimal and reading it back in again yields exactly the same value that we wrote initially. It is far from clear how to figure out how many digits are needed to meet that requirement.

In short, floating-point output is difficult because

  • The number of digits needed to represent the result exactly might be very large, and
  • It is not easy to figure out how many of those digits are needed to preserve the values being converted.

Floating-point input has almost the opposite problems. When we convert a decimal value to binary floating-point, we normally do so in order to put that value into a variable with a type such as float or double. Accordingly, we know the type of the result, which means that we know in advance how many bits that result must have. Moreover, we know that there are many decimal values, such as 0.1, that cannot be represented exactly in binary no matter how many bits we have available. Accordingly, we do not have to worry about representing the exact value of a decimal number as a binary floating-point variable, because it is not always possible to do so.

You might think that the a priori limit on the number of bits in a floating-point value makes input an easier problem than output. However, even if we cannot generally represent input exactly, we have to be able to say something sensible about the relationship between the decimal input that we are reading and the binary floating-point number that is the result of that input. One straightforward claim that we might wish to make about floating-point input might be

The result of reading a decimal floating-point number is the binary floating-point value that would be obtained by rounding the decimal input to the number of bits in the result.

This rule is possible only because we know in advance how many bits the result will have. Otherwise, we would have to do the conversion in order to figure out how to do the conversion. However, because the size of a floating-point input variable is known in advance, the rule makes a great deal of sense. In particular, IEEE floating-point arithmetic uses the same rule for computations such as division. For example, if we divide 1.0 by 10.0 on a machine with IEEE floating-point arithmetic, we are guaranteed a result that is equal to what would be obtained by converting the infinite-precision result of such a division to the type we want for our result according to whatever rounding rules are in effect. Our proposed rule therefore has the desirable effect that reading 0.1 into a floating-point variable would give that variable exactly the same value as would assigning to it the result of dividing 1.0 by 10.0.

Perhaps surprisingly, this simple-sounding rule is far from easy to implement. To see why, consider a machine with 32-bit IEEE floating-point arithmetic. On such a machine, the fraction part of a floating-point number will have 24 bits, which means that 16777216 (224) can be represented exactly, but 16777217 (224+1) cannot. If we try to convert 16777217 to this machine's binary floating-point format, the values 16777216 and 16777218 are both equally close to 16777217. Which one should we choose?

IEEE rounding rules say that in general, when there is a tie, we should prefer the result with a low-order bit of 0. If we represent 224 as a floating-point number with a 24-bit fraction, the low-order bit signifies a value of 2, just as when we represent 1000 (103) as a 3-digit decimal floating-point number (i.e., 0.100 times 104), the low-order digit signifies a value of 10.

If you didn't follow the preceding two paragraphs, the thing to remember is that IEEE floating-point rules require rounding 16777217 to 16777216, not 16777218. However, a value that is even a tiny bit larger than 16777217 cannot be rounded to 16777216 because such a value is closer to 16777218 than it is to 16777216. Therefore:

16777217 must be rounded to 16777216
16777217.1 must be rounded to 16777218
16777217.000000000000000001 must be rounded to 16777218

and so on.

Now consider the plight of a program that is trying to convert a decimal value to binary floating-point. If it reads 16777217.0, it does not know yet whether the final result will be 16777216 or 16777218. Moreover, that last 0 can be followed by any number of additional zeroes; it is not until the converter sees a nonzero digit that it knows what the final result will be. Accordingly, the seemingly simple rule that the result of floating-point input must be as accurate as the result of arithmetic is anything but simple to implement.

Of course, at the moment, we're thinking about design rather than implementation — so let's toss the problem of implementation complexity over the fence to the implementors for the moment and continue to think about the consequences of input conversions that follow this rule. We'll start exploring those consequences next week.

Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 


Video