Channels ▼
RSS

Floating-point Summation


September 1996/Floating-point Summation

Addition may be one of the first operations you master when learning mathematics, but it is probably one of the last ones you learn how to do safely on a computer.


Introduction

The summation of large quantities of floating-point numbers is a simple task that arises frequently in practical applications. Some of the most common applications are numerical integration and calculating averages. But this task turns out not to be as straightforward and predictable as it sounds.

First let's look at the obvious way. Writing a routine to sum a series of numbers may seem trivial. Listing 1 is my version of a Simple Summation routine. It uses a local float variable sum to accumulate a running total of all elements in the array flt_arr. This routine is simple, straightforward, and frequently gives bad results.

The problem is that addition of floating-point numbers is different from addition of integers. Integer addition is like doing addition by hand -- you line up two numbers one over the other and add one digit at a time, starting from the least significant. Of course computers do this in binary. Floating-point addition includes a step like this too, but first the numbers must be shifted so that corresponding digits line up. But if the two numbers have sufficiently different exponents then some least significant digits will have to be discarded.

For example, if we have a base ten system with one digit of accuracy then valid values include 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, and so on. In this system, 10.0 + 1.0 = 10.0 because 11.0 is not a valid value. In such a system, 10.0 would indeed be the best possible answer for the sum of these two numbers, so there is no problem. But if three or more numbers are to be summed then errors accumulate, and the order of the additions becomes important.

If, for another example, there are not enough bits of mantissa to resolve a difference of one part in ten billion, then:

(1.0 + 1.0e10) + -1.0e10
  = 1.0e10 + -1.0e10
  = 0.0

but

1.0 + (1.0e10 + -1.0e10)
  = 1.0 + 0.0
  = 1.0

In this example, all the information from one summand was lost because two of the summands were much larger than the other. Usually when you are summing large sets of floating-point numbers there will not be such pronounced differences between individual terms, but errors will grow from the accumulation of many small errors, and from the difference between the running total and the individual terms.

An obvious fix is to add numbers with higher precision. If we start with an array of float we should then keep our internal result as a double. If we started with an array of double we should then keep our result as a long double.

Listing 2 is almost like Listing 1 but with the type of sum changed to double. This method gives much higher precision than Simple Summation, and is an excellent approach, when available. But if calculations are already being performed in the highest precision available then extending precision is not an option.

Order in the Court

Much of the imprecision stems from the order of operations. Precision is lost most quickly when the magnitude of the two numbers to be added is most different. With summation, one of the operands is always the sum so far, and in general it is much larger than the numbers being added to it. This effect can be minimized by adding the numbers in order from the smallest to the largest. (Smallest in magnitude, that is. A very negative number is still large in the sense that its least significant bit has little precision. Bits will be discarded in adding -0.001 to 1000.0 but not in adding -1000.0 and 1000.0.)

This effect can be minimized by sorting the numbers by magnitude before summing. Listing 3 does this using qsort and a comparison function. First it makes a local copy of flt_arr in lcl_arr, then it sorts this array with qsort. Finally, it sums the sorted terms with a running total like the other methods shown so far.

The results are still not good. Sorting is a much slower operation than addition, and the sum still quickly becomes much larger than the numbers being added, so accuracy is not greatly enhanced.

In fact, particular details of the numbers being summed critically impacts how well sorting works. The best case for sorting is where the sum is near zero and the distribution is nearly symmetric. These conditions will keep the sum near the same magnitude as the numbers being added. These conditions are satisfied in test cases 2 and 3 in the test program (Listing 8) , but rarely in real life.

Sorting does bring predictability. Sorted Summation always gives the same answer for the same set of numbers, even if those numbers are presented in different orders. This is not entirely true if you use the comparison function in Listing 4. A number and its negation compare equal, and so may show up in any order in the sorted sequence. When you need predictability, use a comparison function like that in Listing 5. This comparison function breaks such ties to attain complete predictability.

Another approach to keeping the operands of each addition similar in magnitude is to add in pairs. Listing 6 presents a routine that loops through the array adding pairs of numbers and reducing the size of the array by half each iteration. At each addition, the summands represent the same number of original entries, and so should have similar magnitudes.

Pairwise Summation proves to be both faster and more accurate than Sorted Summation, and further refinements are possible which would improve it further. But it still cannot compete with ...

The Right Way

An even better way to sum many numbers is Kahan Summation [1] . With this method, a correction is maintained along with the current running total, effectively increasing the precision. Listing 7 implements Kahan Summation. Each term is first corrected for the error that has accumulated so far. Then the new sum is calculated by adding this corrected term to the running total. Finally, a new correction term is calculated as the difference between the change in the sums and the corrected term.

Algebraically, this routine seems to do nothing special. Substituting the line:

new_sum = sum + corrected_next_term;

into:

correction
  = (new_sum - sum) -
       corrected_next_term;

gives:

correction
  = ((sum +
     corrected_next_term) - sum) -     
     corrected_next_term;

which would simplify to:

correction
  = corrected_next_term -
    corrected_next_term 
  = 0;

for exact numbers. But for real floating-point numbers, the correction term will frequently be nonzero. Using it does improve accuracy. This method is more complicated than the other methods shown here, but that complexity is easily hidden in a function.

Here is an example program. It is organized around a pair of arrays,

float flt_array[ARR_SIZE]; double dbl_array[ARR_SIZE];

The main routine (Listing 8) runs four tests. For each test the code repeatedly fills these arrays (OUTER_ITER repetitions) and calls the routine loop to shuffle and sum the values.

  • The first test fills the arrays with uniform random values ranging from 0.0 to 1.0. I use four calls to rand for each element. This assures me of at least 60 bits of randomness e+-ven when rand produces only 15 bits (the minimum permitted by the ANSI/ISO standard).
  • The second test is like the first except that the arrays are rescaled so that they run from -1.0 to 1.0.
  • The third test uses random values with a bell-shaped distribution.
  • The final test is a numerical integration to estimate the value of pi by integrating one quarter of a circle. loop applies all five summation methods on each of the two data sizes (float and double), and then shuffles the arrays and repeats for SHUFFLE_ITERS iterations.

Results

Over a large number of runs a pattern becomes obvious: Simple and Sorted Summation are tied for least accurate, Pairwise Summation is a little better, Kahan Summation is better still, and the champion is Extended Precision, as shown in Listing 2. (I also provide a function, fkx_add, to do Extended Precision Kahan Summation, which is not shown, but is available on the code disk and via ftp. This function is just like the regular precision fk_add, except that the local floating-point variables are defined as doubles, instead of floats. See page 3 for downloading details.)

In my test on one machine I found Simple Summation to be consistently the fastest. Extended Precision was a bit faster than Simple Summation when Simple Summation used float and Extended Precision used double, but was about 50 times slower when Simple Summation used double and Extended Precision used long double. Apparently this platform favors double over float and does not support long double in hardware but simulates it in software.

Sorted Summation was typically about 100 times slower than Simple Summation, rendering it unsuitable for most applications. Because sorting scales as n*log(n) (where n is the number of operands) while all other methods discussed here scale as n, sorting becomes less suitable for larger data sets. Pairwise Summation is about 3-4 times slower than Simple Summation with doubles, and so would be a good bet if it were not eclipsed by Kahan Summation. Kahan Summation turns out to be about half as fast as Simple Summation when using double precision.

Simple Summation, Extended Precision, and Kahan Summation all use very little extra storage space. As implemented here, Sorted Summation uses an additional array the same size as the original and Pairwise Summation uses an array half the original size. But in some applications the summation would be the last operation performed on the array and so the same storage could be reused for the sorted array or the pairwise partial sums. Alternatively, these algorithms could be optimized somewhat to reduce storage if needed.

Mix and Match

Various combinations of the algorithms discussed here may also prove useful. On platforms like mine, where float operations are always slower than double operations, you should always sum with double precision even when the summands are float. This gives you higher precision and greater speed. Kahan Summation with all local variables of type double would be faster and more accurate than the version of Kahan Summation shown here.

Similarly, on platforms where long double operations are not more expensive than double operations, any of these methods should be performed with long double.

Even though sorting is slow and does not improve accuracy, it does guarantee to give the same sum for different shufflings of a given data set. When you need this predictability, sort first and then perform Kahan Summation or some other method.

Kahan Summation is always a good bet if you don't have specific timing information available for your target platform. If you can do timing on the target platform, then explore the possibility of combining Extended Precision and Kahan Summation.

Final Thoughts

It seemed natural to write the code for this article in C instead of C++ because C++ features might prove a distraction for some readers and because the code is relatively simple. To my surprise, I found the lack of some of my favorite features of C++ jarring. I particularly missed // comments, function templates, reference parameters, and the ability to mix declarations in with my code.

I recommend Reference [1] to all readers; it really does deliver "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (But Was Afraid to Ask).

References

[1] David Goldberg. "What Every Computer Scientist Should Know About Floating-Point Arithmetic,'' ACM Computing Surveys, Vol. 23, #1, March 1991, pp. 5-48.

Evan Manning has degrees in Applied Physics from Caltech and Stanford. He has been working as a self-taught C programmer in defense and space applications for the past 8 years. Currently he works for Telos Information Systems as a consultant at NASA's Jet Propulsion Laboratory. He can be reached at manning@alumni.caltech.edu.


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video