# Comparing an Integer With a Floating-Point Number, Part 1: Strategy

Last week, I started discussing the problem of comparing two numbers, each of which might be integer or floating-point. I pointed out that integers are easy to compare with each other, but a program that compares two floating-point numbers must take NaN (Not a Number) into account.

##
More Insights

### White Papers

More >>### Reports

More >>### Webcasts

- IT and LOB Win When Your Business Adopts Flexible Social Cloud Collaboration Tools
- Big Data and Customer Interaction Analytics: How To Create An Innovative Customer Experience

That discussion omitted the case in which one number is an integer and the other is floating-point. As before, we must decide how to handle NaN; presumably, we shall make this decision in a way that is consistent with what we did for pure floating-point values.

Aside from dealing with NaN, the basic problem is easy to state: We have two numbers, one integer and one floating-point, and we want to compare them. For convenience, we'll refer to the integer as `N`

and the floating-point number as `X`

. Then there are three possibilities:

`N`

<`X`

.`X`

<`N`

.- Neither of the above.

It's easy to write the comparisons `N < X`

and `X < N`

directly as C++ expressions. However, the definition of these comparisons is that `N`

gets converted to floating-point and the comparison is done in floating-point. This language-defined comparison works only when converting `N`

to floating-point yields an accurate result. On every computer I have ever encountered, such conversions fail whenever the "fraction" part of the floating-point number — that is, the part that is neither the sign nor the exponent — does not have enough capacity to contain the integer. In that case, one or more of the integer's low-order bits will be rounded or discarded in order to make it fit.

To make this discussion concrete, consider the floating-point format usually used for the `float`

type these days. The fraction in this format has 24 significant bits, which means that `N`

can be converted to floating-point only when `|N| < 2`

. For larger integers, the conversion will lose one or more bits. So, for example, 2^{24}^{24} and 2^{24}+1 might convert to the same floating-point number, or perhaps 2^{24}+1 and 2^{24}+2 might do so, depending on how the machine handles rounding. Either of these possibilities implies that there are values of `N`

and `X`

such that `N == X`

, `N+1 == X`

, and (of course) `N < N+1`

. Such behavior clearly violates the conditions for C++ comparison operators.

In general, there will be a number — let's call it `B`

for *big* — such that integers with absolute value greater than `B`

cannot always be represented exactly as floating-point numbers. This number will usually be 2^{k}, where `k`

is the number of bits in a floating-point fraction. I claim that "greater" is correct rather than "greater than or equal" because even though the actual value 2^{k} doesn't quite fit in `k`

bits, it can still be accurately represented by setting the exponent so that the low-order bit of the fraction represents 2 rather than 1. So, for example, a 24-bit fraction can represent 2^{24} exactly but cannot represent 2^{24}+1, and therefore we will say that `B`

is 2^{24} on such an implementation.

With this observation, we can say that we are safe in converting a positive integer `N`

to floating-point unless `N > B`

. Moreover, on implementations in which floating-point numbers have more bits in their fraction than integers have (excluding the sign bit), `N > B`

will always be false, because there is no way to generate an integer larger than `B`

on such an implementation.

Returning to our original problem of comparing `X`

with `N`

, we see that the problems arise only when `N > B`

. In that case we cannot convert `N`

to floating-point successfully. What can we do? The key observation is that if `X`

is large enough that it might possibly be larger than `N`

, the low-order bit of `X`

must represent a power of two greater than 1. In other words, if `X > B`

, then `X`

*must be an integer*. Of course, it might be such a large integer that it is not possible to represent it in integer format; but nevertheless, the mathematical value of `X`

is an integer.

This final observation leads us to a strategy:

- If
`N < B`

, then we can safely convert`N`

to floating-point for comparison with`X`

; this conversion will be exact. - Otherwise, if
`X`

is larger than the largest possible integer (of the type of`N`

), then`X`

must be larger than`N`

. - Otherwise,
`X > B`

, and therefore`X`

can be represented exactly as an integer of the type of`N`

. Therefore, we can convert`X`

to integer and compare`X`

and`N`

as integers.

I noted at the beginning of this article that we still need to do something about NaN. In addition, we need to handle negative numbers: If `X`

and `N`

have opposite signs, we do not need to compare them further; and if they are both negative, we have to take that fact into account in our comparison. There is also the problem of determining the value of `B`

.

However, none of these problems is particularly difficult once we have the strategy figured out. Accordingly, I'll leave the rest of the problem as an exercise, and go over the whole solution next week.

## Comments:

2013-03-14T16:53:19

Now we're finally getting down the the real stuff. When comparing floating point numbers for equality, we cannot directly use:

float1 == float2

because of the chance that they will not always be equal. For example, in the following code, what value is areSame:

double x = 49.0;

double y = 1 / x;

double calculatedResult = x * y;

double expectedResult = 1.0;

bool areSame = calculatedResult == expectedResult;

It will be false. calculatedResult does not turn out to be the same as 1.0. When comparing doubles for equality, we must use something as:

areSame = !(abs(calculatedResult - expectedResultabs) > FLT_EPSILON);

where FLT_EPSILON is defined as 1.192092896e-07F. If the difference between the floats is greater than FLT_EPSILON or some other very small number, then they are not the same (and areSame should be set to false).

Permalink

2013-03-13T15:15:34

Mr. Smith makes a valid observation. This is why a programmer who bases loop termination on two floating-point values being equal is asking for an infinite loop unless it is certain that both values are really fully-accurate integers.

Permalink

2013-03-13T05:28:59

Comparison is actually a subtraction:

cmp eax, ebxsets the same flags assub eax, ebx.If the difference between two operands is small enough, eg. 1e-06, you can say they are equal.

Permalink

2013-03-12T21:16:03

That is an interesting point with large numbers. I hope you plan to discuss small numbers as well. The significant bits problem is much worse for small numbers. Many floating point math libraries will get the following to show false.

(9/7-4/7-5/7)==0

vs.

(9/7-9/7)==0

((9-4-5)/7)==0

will show true. This has nothing to do with order of operations, it is because of the number of significant digits in the floating point math.

Very few programmers even know about know, much less understand the limitations of floating point numbers coming out of school.

Permalink

2013-03-11T04:26:14

Who said anything about Assembler? It's possible to solve the problem in reasonably portable C or C++.

Permalink

2013-03-09T02:39:28

mctom never programmed in Assembler...(in an 8 bits processor with no floating point...).

Permalink

2013-03-08T16:07:37

If you think it's so easy, how about posting code that you think solves the problem correctly? I've told you how to do it; all you have to do is implement it.

Permalink

2013-03-08T08:56:40

Dude, comparing floats and ints is brain-dead easy. Comparing two floats is not as simple but isn't rocket surgery either. Why are you wasting space writing about this? Is stackoverflow.com blocked by your ISP?

Permalink