Channels ▼

Andrew Koenig

Dr. Dobb's Bloggers

It's Hard To Compare Floating-Point Numbers

March 01, 2013

Last week I posed a problem: Suppose you have an inheritance hierarchy that lets you represent integers or floating-point numbers. How would you define comparison within your hierarchy? We can restate this problem in a language-independent way: How can we compare two numbers, either of which might be integer or floating-point?

More Insights

White Papers

More >>


More >>


More >>

There are three cases: (1) Both numbers are integers; (2) Both numbers are floating-point; (3) One is an integer and the other is floating-point. We shall treat each case separately.

The first case, comparing two integers, is easy: Integer comparison is well defined and works on any integer value. However, even an operation seemingly as simple as comparing two floating-point numbers can cause trouble. The source of this trouble is a special value called "not a number," usually abbreviated NaN, that is implemented in most floating-point hardware as an extension to what C or C++ requires from a machine's floating-point arithmetic.

The purpose of NaN is to provide a value — or, more accurately, a family of values — that can be used to indicate that something went wrong during the process of computing that value. NaN values have the curious property that they compare as "unordered" with all values, even with themselves. In other words, if x is a NaN, and y is any floating-point value, NaN or not, then x<y, x>y, x<=y, x>=y, and x==y are all false, and x!=y is true.

This definition means that < on floating-point values fails to meet the requirements for C++ order relations! The reason for this failure is that if x and y are unequal ordinary values, and z is NaN, then x is unrelated to z, z is unrelated to y, but x and y are not unrelated to each other. Therefore, "unrelated" is not transitive, as C++ requires.

Remember that the requirements on "unrelated" cause it to partition the set of all values into equivalence classes. Because all NaN values are mutually unrelated, it makes sense to include them in a single equivalence class. As soon as we do so, we are saying that all NaN values are indistinguishable from each other for comparison purposes. In order for comparison to make sense, then, we must define it so that NaN is distinguishable from other values, so that non-NaN values can consistently be distinguished from each other.

The most straightforward strategy for defining comparison on floating-point numbers that might include NaN is to decide either that every NaN is "less than" every other number or "greater than" every other number. For the sake of the discussion, we'll define NaN as "greater than" every other number. Then we can define comparison between floating-point numbers as follows:

     bool compare(double x, double y)
           if (isnan(y))
                return !isnan(x);
           return x < y;

We use a function named isnan, which tests whether its argument is a NaN. This function is part of C++11, but might not be available in earlier versions of C++. Fortunately, it is easy to implement:

     bool isnan(double x)
           return x != x;

Our definition of compare collapses the case analysis. If y is NaN, then either x is also NaN, in which case we must return false (because all NaN values are unrelated to each other), or x is an ordinary value, in which case we must return true (because we decided earlier that NaN is "greater than" any ordinary value).

We are left with the case in which y is an ordinary number. If x is NaN, we've seen that < will return false — which is what we want because x should be "greater than" y. If x is not NaN, then < does comparison on ordinary numbers, which is again what we want.

In short, comparing two floating-point numbers is harder than it looks if your goal is to meet the C++ order-relation requirements. What about comparing a floating-point number with an integer? C++ defines such comparisons by converting the integer to floating-point. Such conversions are accurate as long as the floating-point number has at least as many bits in its fraction as the integer.

Unfortunately, integers sometimes have more bits than do the fractions of floating-point numbers. In particular, this sad state of affairs will occur on machines in which integers and floating-point numbers have the same total number of bits, because floating-point numbers devote some of those bits to the exponent. So, for example, most implementations devote only 24 bits of a 32-bit floating-point number to the fraction, and only 53 bits of a 64-bit floating-point number.

For many years, it was reasonable to assume that the double type had at least 64 bits, and the int and long types had at most 32 bits. Therefore, one could be pragmatically confident in the result of comparing integer and double values, though not necessarily float values. This assumption no longer applies: Now that both disk and memory addresses routinely exceed 232, it is not uncommon to find implementations with 64-bit integers and 64-bit floating-point values. Because 64-bit floating-point values have fewer than 64 bits in their floating-point fractions, comparisons between 64-bit integers and 64-bit floating-point numbers will not meet the C++ ordering requirements on such implementations.

How can we do such comparisons correctly? I'll show you one way of doing so next week.

Related Reading

Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.



"For many years, it was reasonable to assume that the double type " ..... its never reasonable fro a programmer to assume this. How old the DEC ALpha? It's this fast and loose attitude to well documented, defined and prescibed rules that defines all that is wrong with our industry.


Are NaN and -NaN in the same equivalence class?


It most certainly does affect sorting. The simplistic reason is that if < does not behave according to the language requirements, sorting is not required to work. The pragmatic reason is that on systems in which comparison is broken in the way I describe, I have personally encountered sequences in which sorting changes the order, and then sorting again changes the order again.


There are two types of NaNs, quiet and signaling. They are denoted with exponent bits all set to 1 and differ only in the sign bit. You can freely set the 52 bits of the base part to anything you like (if the base is zero, you have Inf).

Your base part should propagate through operations. If you set different base part to different inputs, you can see them propagate to the results.

Personally I never had the need to do this. Read more on


Sure, but this doesn't affect sorting.
This still holds true: A < B and B < C then A < C

Let's look at it more closely, where X is the float, while A & B & C are integers:
X < B and B < C then X < C
A < X and X < C then A < C
A < B and B < X then A < X

The worst you can get is approximate equality:
X = B and B < C then X = C
A = X and X = C then A < C
A < B and B = X then A = X

But this is not possible, if I revise my proposed solution:
Anyhow, you could make a struct of a "double" and an "int". If double comparison yields equal, then return as the result comparison of the integer parts or convert the double parameter to long long if (double_param < 2^63) and (double_param >= -2^63) before the integer comparison.


Rather than calculating with NaN's expecting NaN's out would it not be better to raise an exception on finding you got a NaN instead of a numerical value *before* trying to perform some numerical function on it? If the function itself can produce a NaN answer from non-NaN input then your way could mask where the NaN originated.


Good point!


Consider two unequal integers M and N that convert to the same floating-point number X because they have more significant bits than the floating-point format can handle, Then M==X and N==X, but M != N.


Definitely not. Doing so would make it impossible to sort any data structure with even a single NaN in it. It's much more useful to have the NaNs sort *somewhere* well defined, and then when they're eventually printed, have them show up in a useful, recognizable form.


NaNs are a way to handle missing and invalid values. I worked with survey data and stock exchange quotes and there are always some missing values (missing survey answers, non-trading day or an illiquid share).

What's the most meaningful way to handle missing and invalid values? That's up for debate, but it depends on your particular statistics / modelling you are trying to get from the data. You can interpolate, extrapolate in a linear on non-linear fashion...

The "beuaty" of IEEE 754 NaNs is that they will propagate. If your formula uses a missing value for the input, the result of the formula will also be a missing value.


Not sure about NaN's in C++, but shouldn't you just throw an exception if comparison is attempted on NaN's?


You are right, these are not NaNs, but they are also not finite numbers.

One would think that "(x * 0) == 0" always holds true...

bool isNaN(double val)
return val != val;

bool isNotNaN(double val)
return val == val;

bool isNaNOrInfinity(double val)
return (val * 0) != 0;

bool isNotNaNOrInfinity(double val)
return (val * 0) == 0;


> So, if double==double because of the lost precision, int > int would be performed with full integer precision.
But the problem is that double has much larger _range_ than int. So you could for example come across {double=2^80, int=x} and {double=2^80, int=y}. What can you say about these two numbers? 2^80 is far beyond the range of the ints, so the best you can hope for is that the ints retained the LSBs of all the operations that led up to these two values. But the middle bits are still unknown (lost).


There is no particular difficulty in dealing with infinities; they are different from NaN and already behave in comparisons as one would expect. In particular, +infinity is greater than any number and -infinity is less than any number.

In general, systems that support signed zeroes treat +0 and -0 as equal; I see no obvious reason to change that behavior.


The struct solution is interesting, but I think you would have difficulty implementing the traditional arithmetic operators over this number representation.
I didn't have MSB&LSB system in mind, where the integer part of the struct would extend the precision of the float (there are already quadruple precision libraries, I'm not going to reimplement them). In case of comparing an integer to an integer, the full precision is used. So, if double==double because of the lost precision, int > int would be performed with full integer precision.

There's an additional problem with comparing NaNs: where would you put items with positive infinity, and those with negative infinity? Then you have positive and negative zeroes, underflow and overflow. How would you compare all those values? The proposed solution to put all NaNs at the beginning might not be the most correct solution after all.

In C++ you can overload operators, so the usage of complex types is generally not a problem.

If it's only precision, you are worried about:
- use "long double", this is 80-bit x87 float (64bit+16bit)
- use a quadruple precision compiler / library, such as GCC's __float128 / libquadmath


I think the implication was that the numbers should be ordered as they would be on the real number line. It would be pretty strange for the program to consider 2^30 equivalent (unrelated) to 2^30+1.

The struct solution is interesting, but I think you would have difficulty implementing the traditional arithmetic operators over this number representation. You're effectively storing the N MSBs (significand of the float) and M LSBs (integer) of the number. I think it's possible to construct a number with >= N+M+1 bits where the middle bits are ambiguous and you're mistakenly relying on the LSBs to define your relation. Would have to think more about it to come up with a concrete example though.


As pointed out, the issue happens when integers have more significant digits than doubles. For the sake of concreteness, I will consider comparison between int64_t and double.

I believe the code below is correct but it's a bit hard to explain why. The first comparison function taking a double and an int64_t can be written as follows:

bool compare(double x, int64_t y) {
double dy = static_cast<double>(y);
if (x != dy)
return x < dy;
return static_cast<int64_t>(x) < y;

The other overload (taking an int64_t and a double) can be written in a similar way. Both can be simplified by taking advantage of the fact that the compiler implicitly casts from int64_t to double (the semantics remains the same as the code above). The simplified versions are:

bool compare(double x, int64_t y) {
if (x != y)
return x < y;
return static_cast<int64_t>(x) < y;

bool compare(int64_t x, double y) {
if (x != y)
return x < y;
return x < static_cast<int64_t>(y);

I hope this is correct.

(PS: There's a small typo in the article. It says "memory addresses routinely exceed 232". It should be "2^32" rather than "232".)


How can we do such comparisons correctly?
I'd use "long double", that is 80-bit floating point format, with 64-bit base and 16-bit exponent. There should be no precision loss when converting between this format and 64-bit integer.

comparisons between 64-bit integers and 64-bit floating-point numbers will not meet the C++ ordering requirements on such implementations
Could you please explain that. I don't see how precision loss could violate the transitivity requirement. There would just be more equal values. If you then represented a mixed double/long long list as the result, then yes, some items could be wrongly ordered - but C++ ordering requirements wouldn't be violated.

Anyhow, you could make a struct of a "double" and an "int". If double comparison yields equal, then return as the result comparison of the integer parts.