Numeric Types, Representation, and Other Fictions

Numerical accuracy: Is it for real?


June 16, 2007
URL:http://www.drdobbs.com/architecture-and-design/numeric-types-representation-and-other-f/199904923

God created the natural numbers; all else is man made.
--L. Kronecker

There I was years ago, wandering the exhibit floor of a Software Development conference with a piece of paper in my hand. Stan Kelly-Bootle had just wired me with five cups of hyper-caffeinated java, and I was rocketing from compiler vendor to compiler vendor like Diogenes, looking for an honest result. I sidled up to the representatives, sporting my most innocent smile, and said, "I'm testing the numerical accuracy of various compilers. Would you be kind enough to solve this [Example 1] polynomial for me?"

f(x,y) = 333.75y^6 + x^2(11x^2y^2 - y^6 - 121y^4 - 2) + 5.5y^8 + x/(2y)

for x = 77,617 and y = 33,096
Example 1

A few vendors coded the procedure and proudly gave me the result (usually 1.1926039400531...). Surprisingly, some larger vendors claimed they didn't have any "techies" available to program a few lines of floating-point code. They promised to send me the results (none ever did).

As I left the booths of vendors who had attempted to solve the problem, my sadistic nature bubbled to the surface, and I cackled at the consistently wrong results. Naturally, I had given them a problem that I knew was impossible to solve using floating-point arithmetic.

Stop now, and use your favorite language to solve Example 1 to see how cruel life can be. Imagine my surprise (and severe depression) when the Mansfield Software Group, which was then the vendor of Personal Rexx (now distributed by Quercus Systems), returned the result correct to 500 decimals.
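If you do stop and try it in C, a minimal sketch of Example 1 in IEEE-754 double precision looks like the code below. The inputs are the values from Example 1, and the "exact" line prints the rational result derived later in this article, rounded to a double for display.

/* A minimal sketch of Example 1 evaluated in IEEE-754 double precision.
   The exact value is -54,767/66,192 (about -0.827396), but the huge
   intermediate terms cancel almost completely, so a straightforward
   double-precision evaluation loses every significant digit. */
#include <stdio.h>

int main(void) {
    double x = 77617.0, y = 33096.0;
    double x2 = x * x;
    double y2 = y * y, y4 = y2 * y2, y6 = y4 * y2, y8 = y4 * y4;

    double f = 333.75 * y6
             + x2 * (11.0 * x2 * y2 - y6 - 121.0 * y4 - 2.0)
             + 5.5 * y8
             + x / (2.0 * y);

    printf("double precision:       %.17g\n", f);
    printf("exact value (rounded):  %.17g\n", -54767.0 / 66192.0);
    return 0;
}

Exactly which wrong answer the first line prints depends on your compiler and hardware, which is the point: the disagreement with the exact value is typically total, not just in the trailing digits.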

In many of my articles, I've commented on the inaccuracies of floating-point arithmetic. It is now time to re-examine my claim that floating-point approximations are inherently inaccurate. More correctly, any fixed-format representation is inherently inaccurate. I pick on floating-point arithmetic because it is the most common, but each format has its own quirks.

I've previously described both the magnitude and the resolution I consider necessary to describe the universe. Those requirements are a resolution of one part in 10 and a range that encompasses our largest and smallest measures. In this article, on our road to understanding arithmetic errors, we will examine some common numeric representations to see if they meet my requirements.

The simplest numbers are integers, or whole numbers (including zero and negative whole numbers). If a quantity is countable, it is exactly representable as an integer. The history of counting didn't begin with integers; it started with positive numbers, then added zero and, finally, negative numbers. Present studies suggest that the Chinese recognized zero (represented as a blank space) in the fourth century B.C. Western mathematicians did not see the significance of zero until approximately the ninth century A.D. Negative numbers were used routinely in China in the second century B.C. but were not introduced to the west until the mid-sixteenth century. (They were suggested by Greek mathematician Diophantus but were rejected as absurd; see Robert Temple's The Genius of China.) The natural numbers Kronecker referred to in the lead quotation are zero and the positive integers.

In computers, for convenience in the hardware implementation of arithmetic operations, we typically represent integers as a contiguous block (word) of bits. PCs have word lengths ranging from eight bits (8086 machines) to 16 bits (80286 systems) and 32 bits (80386 family). If we let one of these bits represent the sign, we have short integers, 2^7 (-128...127); integers 2^15 (-32,768...32,767); and long integers, 2^31 (-2,147,483,648... 2,147,483,647). The asymmetry in the representation is due to the inclusion of zero.
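As a quick check of those ranges, here is a small C sketch; it assumes the fixed-width types of <stdint.h>, a convenience the article itself does not mention. Note that each negative bound reaches one further than its positive partner.

/* Print the ranges of 8-, 16-, and 32-bit two's-complement integers.
   The low end of each range is one larger in magnitude than the high
   end because zero occupies one of the non-negative bit patterns. */
#include <stdio.h>
#include <stdint.h>

int main(void) {
    printf(" 8-bit: %ld ... %ld\n", (long)INT8_MIN,  (long)INT8_MAX);
    printf("16-bit: %ld ... %ld\n", (long)INT16_MIN, (long)INT16_MAX);
    printf("32-bit: %ld ... %ld\n", (long)INT32_MIN, (long)INT32_MAX);
    return 0;
}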

Some languages, such as LISP dialects, do not place machine-dictated limits on integer size. The penalty associated with using large integers (thousands of decimals) is the lack of hardware support for the arithmetic operations. This lack of support results in a severe speed-of-execution penalty for exact operations with large integers. These large integers can be represented internally by several methods. One way is binary-coded decimals (BCD); another way represents the large integer as an array of smaller integers. Both methods use the concept of positional notation.
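As a rough sketch of the second method (the digit size and names here are invented for illustration; real packages use whole machine words as their "digits"), here is a toy C big-integer addition using the grade-school carry rule.

/* Toy "array of smaller integers" arithmetic: each large number is an
   array of base-10 digits stored least-significant digit first, and
   addition is the familiar right-to-left carry algorithm. */
#include <stdio.h>

#define NDIGITS 64   /* arbitrary capacity chosen for this illustration */

static void big_add(const int a[], const int b[], int sum[]) {
    int carry = 0;
    for (int i = 0; i < NDIGITS; i++) {
        int d = a[i] + b[i] + carry;
        sum[i] = d % 10;
        carry  = d / 10;
    }
}

int main(void) {
    int a[NDIGITS] = {0}, b[NDIGITS] = {0}, s[NDIGITS] = {0};
    for (int i = 0; i < 20; i++) a[i] = 9;   /* a = a run of twenty 9s */
    b[0] = 1;                                /* b = 1 */
    big_add(a, b, s);                        /* s = 10^20, exactly */
    for (int i = NDIGITS - 1; i >= 0; i--)   /* print, leading zeros included */
        putchar('0' + s[i]);
    putchar('\n');
    return 0;
}

Because every digit passes through the ordinary integer hardware one at a time, exactness is preserved, but each operation costs many machine instructions; that is the speed-of-execution penalty mentioned above.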

The problem with integers is that they can't represent real numbers unless they are used as ratios. When we were taught fractions in grade school, we learned to represent numbers as the ratio of two integers, known as a rational number. The rational numbers give us a way to use integers to approximate real numbers. Rational arithmetic is exact in nature since it involves only integers. In computers, however, problems arise when the integers needed to approximate a real number expand to fill memory or overflow the integer format available. These problems can be solved by choosing a limit on the precision of the numbers involved and applying an appropriate rounding technique.

Algebraic packages usually use rational arithmetic as their internal representation. This representation is a part of their LISP heritage. When I entered Example 1 in Derive, it changed the representation of the decimal fractions before performing any operation: the number 333.75 became 1,335/4 and 5.5 became 11/2. If you solve Example 1 using rational arithmetic, you will find that the exact answer is -54,767/66,192. This exact rational number has no finite decimal representation, so when we try to express the result as a decimal fraction it becomes an inexact approximation.

[Editor's Note: Texas Instruments and Soft Warehouse Europe will discontinue selling Derive after July 31, 2007, except for the editions in Polish, Czech, Hungarian, and Japanese, which will be sold by the local distributors until December 31, 2008. Derive will be replaced by Texas Instruments' new product TI-Nspire. TI-Nspire carries the technology of the legendary TI-92 (TI-89, Voyage 200) further into a new-generation math/CAS tool.]
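Here is a minimal C sketch of rational arithmetic in this spirit. It is not Derive's implementation, and it uses fixed-size long long components, so it would overflow where a real CAS simply grows its integers; each number is kept as a reduced numerator/denominator pair.

/* Exact rational arithmetic on numerator/denominator pairs, reduced
   with a gcd after every operation. */
#include <stdio.h>

typedef struct { long long num, den; } rat;

static long long gcd_ll(long long a, long long b) {
    if (a < 0) a = -a;
    if (b < 0) b = -b;
    while (b != 0) { long long t = a % b; a = b; b = t; }
    return a;
}

static rat rat_make(long long n, long long d) {
    long long g = gcd_ll(n, d);
    if (g != 0) { n /= g; d /= g; }
    if (d < 0)  { n = -n; d = -d; }
    rat r = { n, d };
    return r;
}

static rat rat_add(rat a, rat b) {
    return rat_make(a.num * b.den + b.num * a.den, a.den * b.den);
}

int main(void) {
    rat a = rat_make(1335, 4);   /* 333.75 as an exact ratio */
    rat b = rat_make(11, 2);     /* 5.5 as an exact ratio    */
    rat s = rat_add(a, b);       /* 1335/4 + 11/2 = 1357/4, exactly */
    printf("%lld/%lld\n", s.num, s.den);
    return 0;
}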

For an example of another rational number without a finite decimal representation, look at 1/3. Some common numbers, like 1/10, have no finite representation in binary notation. This mismatch between exact rational numbers and any fixed-base positional notation causes rounding or truncation errors. Rational arithmetic also cannot represent all of the numbers necessary to ensure exact solutions: between any two distinct rational numbers lie infinitely many irrational numbers. And no matter how large our computer's memory, we can represent only an insignificantly small sample of the rational numbers.
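The 1/10 case is easy to demonstrate in C: the stored double value of 0.1 is only the nearest representable binary fraction, so adding it to itself ten times does not reproduce 1.0 exactly.

/* Because 1/10 has no finite binary representation, the stored 0.1 is
   slightly off and the errors accumulate: the sum compares unequal to 1. */
#include <stdio.h>

int main(void) {
    double sum = 0.0;
    for (int i = 0; i < 10; i++)
        sum += 0.1;
    printf("sum = %.17g\n", sum);    /* typically prints 0.99999999999999989 */
    printf("sum == 1.0 ? %s\n", sum == 1.0 ? "yes" : "no");
    return 0;
}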

The most common method for representing real numbers in computers is the IEEE-754 standard for floating-point arithmetic. Almost every modern language and general-purpose hardware platform supports floating-point arithmetic. In floating-point format, a number is represented by a fixed-length mantissa f (a normalized fractional part sometimes called the "significand"), a fixed-length binary exponent q that scales the fractional part, and a sign bit s as:

s × 2^q × f.

The lengths of the two formats used in the IEEE standard are 32 and 64 bits, which gives a selection of 2,147,483,648 and 9,223,372,036,854,775,808 points, respectively. The selection available for the 64-bit format seems huge, but remember that an infinite number of real points are between each adjacent pair of represented points. Since I don't have enough room in this article to print each point in either the 32- or 64-bit representation, I will illustrate floating-point numbers using a 5-bit representation (Figure 1).
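To look at the s, q, and f of an actual double, the standard frexp() function from <math.h> splits a value into a normalized fraction in [0.5, 1) and a binary exponent (the IEEE bit layout also involves an exponent bias and a hidden leading bit, which frexp() hides). The value -6.75 below is an arbitrary choice for illustration.

/* Decompose a double into sign, binary exponent, and normalized
   fraction, so that value = sign * f * 2^q with f in [0.5, 1). */
#include <stdio.h>
#include <math.h>

int main(void) {
    double x = -6.75;
    int q;
    double f = frexp(fabs(x), &q);   /* magnitude = f * 2^q */
    int s = (x < 0.0);               /* sign bit */
    printf("%g = %s%g * 2^%d\n", x, s ? "-" : "+", f, q);   /* -6.75 = -0.84375 * 2^3 */
    return 0;
}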

Binary Decimal
00 001 0.125
00 010 0.250
00 011 0.375
00 100 0.500
00 101 0.625
00 110 0.750
00 111 0.875
Figure 1(a): exponent bits 00

Binary Decimal
01 001 0.250
01 010 0.500
01 011 0.750
01 100 1.000
01 101 1.250
01 110 1.500
01 111 1.750
Figure 1(b): exponent bits 01

My simplified 5-bit numbers have neither a sign bit nor an exponent offset (unlike the IEEE representation). In Figure 1, the exponent is the first two bits and the significand (mantissa) is the last three bits. I mentioned that the mantissa is normalized; that means it should always start with a one unless it represents zero. (Zero is a special representation, usually all zeros.) The shaded numbers in Figure 1 are available but not normalized; that shaded region holds the denormals (jargon for numbers that are not normalized). Nine of the 12 denormals are duplicates. In the IEEE standard, a small gap is left near zero due to the normalized representation.

Since five bits are available to represent numbers, we expect 2^5 (32) points. Actually, only 17 discrete numbers are available due to the redundancy of the denormals and the four ways of expressing zero (any exponent with a mantissa of zero).

Examining the normalized line in Figure 1, a striking asymmetry appears in the spacing of the representable points. The points are scattered along the line with a power-of-two spacing. In the IEEE 64-bit format, this irregular spacing means the distance between representable points ranges from about 10^-324 near zero to about 10^292 near the maximum magnitude. If the user is not aware of the anomalies in spacing and the gap near zero, there can be more surprises than usual in floating-point arithmetic.
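The uneven spacing is easy to observe directly with nextafter() from <math.h>, which returns the adjacent representable double; subtracting the two shows the local gap. The sample points below are arbitrary, but the first and last echo the 10^-324 and 10^292 figures quoted above.

/* Show the gap between adjacent representable doubles at several
   magnitudes; the gap grows in power-of-two steps as the numbers grow. */
#include <stdio.h>
#include <math.h>

static void show_gap(double x) {
    double next = nextafter(x, INFINITY);
    printf("gap just above %g is %g\n", x, next - x);
}

int main(void) {
    show_gap(0.0);       /* smallest positive denormal, about 5e-324 */
    show_gap(1.0);       /* machine epsilon, about 2.2e-16 */
    show_gap(1.0e18);    /* already larger than 1: whole integers are skipped */
    show_gap(1.0e308);   /* about 2e292, near the top of the double range */
    return 0;
}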

Another method for representing the real line is fixed-point arithmetic. Ada represents a fixed-point value as sign × mantissa × small, where sign is the sign bit, mantissa is an integer, and small is a negative power of two that scales the mantissa. The compiler knows the scale factor, so it is implied rather than stored in the representation. Figure 2 demonstrates the distribution of fixed-point numbers; again I've used a 5-bit representation, which yields 32 evenly spaced numbers.
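Here is a toy C sketch of the same scheme; the type and helper names are mine, and the scale factor is fixed at small = 2^-2 = 0.25 to match the 5-bit numbers of Figure 2. The stored value is just an integer mantissa, and the scale never appears in the data.

/* Toy fixed-point numbers: an integer mantissa with an implied scale
   factor of 0.25. Addition is ordinary integer addition; only the
   conversions in and out need the scale. */
#include <stdio.h>

#define SMALL 0.25           /* implied scale factor, known only to the code */

typedef int fixed;           /* the mantissa; the scale is implicit */

static fixed  to_fixed(double x) { return (fixed)(x / SMALL); }
static double to_real(fixed f)   { return f * SMALL; }

int main(void) {
    fixed a = to_fixed(1.75);    /* mantissa 7  */
    fixed b = to_fixed(3.50);    /* mantissa 14 */
    fixed sum = a + b;           /* plain integer addition */
    printf("1.75 + 3.50 = %.2f\n", to_real(sum));   /* 5.25 */
    return 0;
}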

Both the fixed-point and floating-point representations are subsets of the rational numbers. There is no way to explicitly represent any irrational number. One drawback of fixed-point numbers is that there are no standards of which I am aware. Caveat executor!

Another primitive representation is available on most computers: BCD. In BCD notation, a number is represented as a string of decimal digits ranging in value from 0 to 9. For the 80x86 family of processors, each byte can store either one decimal digit in unpacked mode or two digits in packed mode (see any good assembly-language reference for details). These bytes are assembled into a string, and we perform our arithmetic on these decimal digits just as we learned in school. Rexx got the correct approximation to Example 1 by using BCD arithmetic and carrying the result out to 500 decimal digits of precision. Hardware support for BCD arithmetic makes it a good way to obtain high-accuracy results. Since BCD results are unscaled decimal numbers, they have an even distribution along the real number line.
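As a small illustration of the packed mode (the helper names below are invented, and the sketch shows only the encoding, not the digit-by-digit arithmetic), two decimal digits share one byte, one per 4-bit nibble.

/* Pack and unpack one byte of packed BCD: the tens digit lives in the
   high nibble, the ones digit in the low nibble. */
#include <stdio.h>
#include <stdint.h>

static uint8_t to_packed_bcd(unsigned v) {        /* v must be 0..99 */
    return (uint8_t)(((v / 10) << 4) | (v % 10));
}

static unsigned from_packed_bcd(uint8_t b) {
    return (b >> 4) * 10u + (b & 0x0Fu);
}

int main(void) {
    uint8_t b = to_packed_bcd(59);
    printf("59 packs to byte 0x%02X and unpacks to %u\n", b, from_packed_bcd(b));
    return 0;
}

The packed byte prints as 0x59, which is exactly why BCD is convenient for decimal work and for the digit-oriented arithmetic instructions of the 80x86 family.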

Figure 2: 5-bit fixed-point numbers.


Binary Integer Fixed
00000 0 0.00
00001 1 0.25
00010 2 0.50
00011 3 0.75
00100 4 1.00
00101 5 1.25
00110 6 1.50
00111 7 1.75
01000 8 2.00
01001 9 2.25
01010 10 2.50
01011 11 2.75
01100 12 3.00
01101 13 3.25
01110 14 3.50
01111 15 3.75
10000 16 4.00
10001 17 4.25
10010 18 4.50
10011 19 4.75
10100 20 5.00
10101 21 5.25
10110 22 5.50
10111 23 5.75
11000 24 6.00
11001 25 6.25
11010 26 6.50
11011 27 6.75
11100 28 7.00
11101 29 7.25
11110 30 7.50
11111 31 7.75