Comparing an Integer With a Floating-Point Number, Part 1: Strategy
Last week, I started discussing the problem of comparing two numbers, each of which might be integer or floating-point. I pointed out that integers are easy to compare with each other, but a program that compares two floating-point numbers must take NaN (Not a Number) into account.
White PapersMore >>
- High Performance Computing in Finance: Best Practices Revealed
- Advanced Threat Protection For Dummies ebook and Using Big Data Security Analytics to Identify Advanced Threats Webcast
That discussion omitted the case in which one number is an integer and the other is floating-point. As before, we must decide how to handle NaN; presumably, we shall make this decision in a way that is consistent with what we did for pure floating-point values.
Aside from dealing with NaN, the basic problem is easy to state: We have two numbers, one integer and one floating-point, and we want to compare them. For convenience, we'll refer to the integer as
N and the floating-point number as
X. Then there are three possibilities:
- Neither of the above.
It's easy to write the comparisons
N < X and
X < N directly as C++ expressions. However, the definition of these comparisons is that
N gets converted to floating-point and the comparison is done in floating-point. This language-defined comparison works only when converting
N to floating-point yields an accurate result. On every computer I have ever encountered, such conversions fail whenever the "fraction" part of the floating-point number — that is, the part that is neither the sign nor the exponent — does not have enough capacity to contain the integer. In that case, one or more of the integer's low-order bits will be rounded or discarded in order to make it fit.
To make this discussion concrete, consider the floating-point format usually used for the
float type these days. The fraction in this format has 24 significant bits, which means that
N can be converted to floating-point only when
|N| < 224. For larger integers, the conversion will lose one or more bits. So, for example, 224 and 224+1 might convert to the same floating-point number, or perhaps 224+1 and 224+2 might do so, depending on how the machine handles rounding. Either of these possibilities implies that there are values of
X such that
N == X,
N+1 == X, and (of course)
N < N+1. Such behavior clearly violates the conditions for C++ comparison operators.
In general, there will be a number — let's call it
B for big — such that integers with absolute value greater than
B cannot always be represented exactly as floating-point numbers. This number will usually be 2k, where
k is the number of bits in a floating-point fraction. I claim that "greater" is correct rather than "greater than or equal" because even though the actual value 2k doesn't quite fit in
k bits, it can still be accurately represented by setting the exponent so that the low-order bit of the fraction represents 2 rather than 1. So, for example, a 24-bit fraction can represent 224 exactly but cannot represent 224+1, and therefore we will say that
B is 224 on such an implementation.
With this observation, we can say that we are safe in converting a positive integer
N to floating-point unless
N > B. Moreover, on implementations in which floating-point numbers have more bits in their fraction than integers have (excluding the sign bit),
N > B will always be false, because there is no way to generate an integer larger than
B on such an implementation.
Returning to our original problem of comparing
N, we see that the problems arise only when
N > B. In that case we cannot convert
N to floating-point successfully. What can we do? The key observation is that if
X is large enough that it might possibly be larger than
N, the low-order bit of
X must represent a power of two greater than 1. In other words, if
X > B, then
X must be an integer. Of course, it might be such a large integer that it is not possible to represent it in integer format; but nevertheless, the mathematical value of
X is an integer.
This final observation leads us to a strategy:
N < B, then we can safely convert
Nto floating-point for comparison with
X; this conversion will be exact.
- Otherwise, if
Xis larger than the largest possible integer (of the type of
Xmust be larger than
X > B, and therefore
Xcan be represented exactly as an integer of the type of
N. Therefore, we can convert
Xto integer and compare
I noted at the beginning of this article that we still need to do something about NaN. In addition, we need to handle negative numbers: If
N have opposite signs, we do not need to compare them further; and if they are both negative, we have to take that fact into account in our comparison. There is also the problem of determining the value of
However, none of these problems is particularly difficult once we have the strategy figured out. Accordingly, I'll leave the rest of the problem as an exercise, and go over the whole solution next week.