Discontiguous Exponential Averaging

Exponentially decaying averages are easy to program, widely used, and fundamentally flawed. John takes a close look at this standard technique and shows how to mend it.


September 01, 1998
URL:http://www.drdobbs.com/tools/discontiguous-exponential-averaging/184410671

Sep98: Algorithm Alley

John is a lead technical programmer at BetzDearborn. He can be contacted at [email protected].


Sidebar: What Averaging Window?

In the March 1998 "Algorithm Alley," William Stallings showed how exponential smoothing is used to help network protocols adapt to changing conditions.

Although simple, exponential smoothing is based on some interesting statistics. As with all statistical measures, you can get outrageous answers from insufficient input data. This is one reason why the initial estimates from exponential smoothing are often unreliable -- with only a few data points, you can't get meaningful averages.

As John discusses this month, however, that's not the only problem with exponential smoothing. By untangling a simple algorithm, John uncovered some interesting facts that help address a variety of shortcomings. Anyone who has spent time working with this common technique should appreciate John's results.

-- Tim Kientzle

Exponential averaging algorithms are used when you need to know the average recent behavior of a system. According to philosopher Daniel Dennett, useful ideas gain mind-share in a manner analogous to how survival-enhancing DNA spreads through the gene pool. From this perspective, the popularity of exponential averaging (also called "exponentially weighted moving average," "exponential smoothing," or "decaying average") is understandable: It is easily communicated, and quite useful for estimating the average recent behavior of a system.

Listing One shows the traditional exponential averaging algorithm. After initializing the average with the first measurement, each update sets the average to the weighted average of the new data point, with a weight (multiplier) of 1-alpha, and the previous average, with a weight of alpha (0<=alpha<1). Alphas closer to 1 result in more smoothing and longer averaging windows; alphas closer to 0 result in less smoothing and shorter averaging windows.
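
For readers who prefer to experiment in a concrete language, here is a minimal Python sketch of Listing One (the class and method names are my own, not the article's):

class TraditionalEwAverage:
    """Traditional exponential average (Listing One), shown for illustration only."""
    def __init__(self, alpha, first_data):
        assert 0 <= alpha < 1
        self.alpha = alpha
        self.average = first_data            # the first point starts with a weight of 1

    def update(self, new_data):
        # Old weights shrink by a factor of alpha; the new point gets weight 1 - alpha.
        self.average = self.alpha * self.average + (1 - self.alpha) * new_data

With alpha=0.9, for example, each update leaves 90 percent of the total weight on the history and gives 10 percent to the newest point.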

This algorithm is simple, and does not require a buffer of recent samples. However, Figure 1 illustrates two basic problems with this algorithm -- it suffers from startup transients, and it fails to recognize and compensate for gaps in the data. Figure 1 also shows the output of a modified exponential average that corrects both of these problems. In this article, I'll explain why the traditional algorithm is flawed, and present corrections for these flaws.

Debugging the Traditional Algorithm

The update formula of the traditional algorithm can be broken into two steps:

  1. Reduce past weights. Since the current exponential average is a weighted sum of all past measurements, multiplying it by alpha effectively reduces every past weight by a factor of alpha.
  2. Add the new data point, with an appropriate weight.

As in all weighted-average formulas, the sum of the weights is always 1. When the average is initialized to 1*firstData, the total weight is 1. At each update, the multiplication by alpha reduces the total weight to alpha. You then add the new point with a weight of 1-alpha to bring the total weight back to 1.

The key to the exponential average's memory efficiency is that it reduces all of the past weights with a single multiplication, without having to store each point explicitly. After a sequence of such updates, the weight on the ith point in the exponential average will be initialWeight(i)*alpha^updates(i), where initialWeight(i) is the weight given to the ith point when it is first added to the average, and updates(i) is the number of updates that have been performed since. The cleverness of the algorithm lies in the manner in which it keeps the sum of weights equal to 1 and makes those weights decay exponentially, using just a single accumulator.
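
As a quick sanity check (my own illustration, not from the article), you can list the weights implied by this formula after n updates and confirm that they decay exponentially and still sum to 1:

alpha, n = 0.8, 5
recent_weights = [(1 - alpha) * alpha**k for k in range(n)]   # newest point first
first_point_weight = alpha**n                                  # weight still carried by firstData
print(sum(recent_weights) + first_point_weight)                # 1.0, up to rounding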

But this cleverness introduces bugs. The first bug occurs when you initialize the average with 1*firstData. This biases the average toward the first point: Its initial weight of 1 can be much bigger than the initial weight of 1-alpha given to every subsequent point. In other words, to force the initial sum of weights to be 1, you give the first point not only the weight it deserves, but the weights of all those missing samples before the first point as well. Since a single measurement is often much more variable than the average, the result is the startup transient illustrated in Figure 1. A pernicious aspect of this bug is that the bias, and thus the erratic behavior, is worst for those longer-term averages for which the algorithm's zero-memory footprint provides the most-significant practical advantages.

The second bug is that the weight on each point is an exponentially decreasing function of the number of updates, rather than of the elapsed time since each point was first added to the weighted average. As long as the data points are regularly spaced, the two numbers are equivalent, and this insidious bug lurks unseen. However, gaps in the data sequence expose the bug, which, like Rip van Winkle after a long sleep, treats 30-year-old data as if it had been collected yesterday (see Figure 1).
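
Using the small Python sketch above (with hypothetical numbers), the Rip van Winkle effect is easy to reproduce: the update has no notion of elapsed time, so the first point after a 30-year gap is treated exactly like a point that arrived one sample later.

ew = TraditionalEwAverage(alpha=0.9, first_data=100.0)
# ...imagine the data source now goes silent for 30 years...
ew.update(50.0)       # the first reading after the gap
print(ew.average)     # about 95.0: the stale history still carries 90 percent of the weight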

Handling Irregularly Spaced Data

Suppose a commercial phosphate analyzer takes 10 minutes to produce each new measurement. If a monitoring computer polls such an analyzer once per minute, most of these samples are redundant. Such oversampling is typical of computer-acquired data sequences.

If the monitor becomes busy with other tasks, it might sometimes check the phosphate analyzer only every other minute. As a result, some measurements will be read (and averaged) 10 times (once per minute for 10 minutes), while others might be read only five times. Since we are oversampling anyway, there is really no appreciable information loss. However, the traditional algorithm in Listing One would give data collected during such busy periods only around half the weight it would otherwise have had.

In the traditional algorithm, you multiply the old values by alpha to the power of one because those old values are now one time unit older. If the time intervals vary, so should the multiplier. In Listing Two, I compute the value weightReductionFactor by taking the appropriate power of alpha. (Remember that this is "exponential" averaging.) If one time unit has elapsed, this degenerates to the value used in the traditional algorithm. For larger intervals, the reduced weight on past data produces a counterbalancing increase in new data weight, which corrects the problem of bias against less-frequently, but still adequately, sampled data.
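
Numerically (a small check of my own, using the same alpha^dt rule as Listing Two), a two-minute gap with alpha=0.9 shrinks the old weights by 0.81 rather than 0.9, so the new point receives weight 0.19 rather than 0.1:

alpha = 0.9
for dt in (1, 2, 5):
    weight_reduction_factor = alpha ** dt
    new_data_weight = 1 - weight_reduction_factor
    print(dt, round(weight_reduction_factor, 4), round(new_data_weight, 4))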

Eliminating Postgap Bias

Listing Two properly reduces the weight of older data. As a result, any data that follows a long gap (such as the first data point) will be weighted heavily because of the calculation of newDataWeight. In particular, this algorithm will produce the traditional algorithm's erratic startup behavior after any sufficiently large data gap (an improvement in bug consistency it takes a programmer to appreciate).

The algorithm in Listing Three corrects this postgap bias. The basic idea is to place an upper limit on newDataWeight. This limits the total contribution of any one data point. Hence, it limits startup transients and the postgap effect. One impact of this change is that the weights will no longer always sum to 1. To compensate, I track the sum of the weights separately. As this algorithm progresses, provided there are no additional gaps in the data, the sum of the weights will gradually approach 1. The sum of weights is itself a useful statistic: You can use it to tell whether current averages reflect sufficient recent data to provide reliable estimates.

Listing Three computes this upper bound on new data weights from a new parameter maxDt, which defines how large a gap to fill in before data is treated as missing.

The choice of maxDt represents an educated guess as to how long it takes the sampled data sequence to change significantly. For the phosphate analyzer example, 10 would be a reasonable maxDt; for a data sequence dominated by large random variations, a maxDt of 1 might be more appropriate. Note that the algorithm of Listing Three reduces to Listing Two when maxDt=INFINITY.
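
To watch the cap and the sum-of-weights bookkeeping in action, here is a Python sketch that follows Listing Three (the class name, snake_case identifiers, and test values are mine), run against the phosphate-analyzer scenario of one reading per minute with maxDt=10:

import math

class DiscontiguousEwAverage:
    """Gap-aware exponential average, following Listing Three (illustrative sketch)."""
    def __init__(self, alpha, max_dt):
        assert 0 <= alpha < 1 and max_dt > 0
        self.alpha = alpha
        self.sum_of_weights = 0.0
        self.sum_of_data = 0.0
        self.previous_time = -math.inf
        self.new_data_weight_upper_bound = 1 - alpha ** max_dt

    def update(self, new_data, time):
        assert time > self.previous_time
        weight_reduction_factor = self.alpha ** (time - self.previous_time)
        new_data_weight = min(1 - weight_reduction_factor,
                              self.new_data_weight_upper_bound)
        self.previous_time = time
        self.sum_of_weights = weight_reduction_factor * self.sum_of_weights + new_data_weight
        self.sum_of_data = weight_reduction_factor * self.sum_of_data + new_data_weight * new_data
        return self.sum_of_data / self.sum_of_weights

ew = DiscontiguousEwAverage(alpha=0.9, max_dt=10)
for minute in range(10):                  # ten one-minute readings
    ew.update(5.0, time=minute)
print(ew.sum_of_weights)                  # starts at about 0.65 and climbs toward 1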

Figure 2 illustrates this data-filling process. The discrete samples are treated as representative of some continuous function: each data point defines the height of a rectangle, and the width of that rectangle is the period of time over which the point is considered relevant. Beyond maxDt, data is considered missing.

The Final Algorithm

Listing Four is the final algorithm, which builds on previous examples by also computing the standard deviation and a value I call the "completeness fraction."

The ewCompletenessFraction function returns the fraction of the total weight underlying the current average and standard deviation that was actually backed by available data. Note how ewCompletenessFraction accounts for any data missing since the last call to ewUpdate by multiplying ew.sumOfWeights by an appropriate power of alpha. This function could be used, for example, to enforce a constraint that control actions not be based upon averages computed from only a small fraction of the data of interest (to avoid control actions based upon erratic and/or outdated averages).
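
For instance (a hypothetical guard of my own, assuming the Listing Four routines have been carried over into Python), a control loop might simply hold its previous output whenever the completeness fraction drops below some chosen threshold:

if ewCompletenessFraction(ew, current_time) >= 0.5:   # the 50 percent threshold is an arbitrary example
    setpoint = ewAverage(ew)                          # enough recent data: act on the average
else:
    pass                                              # too little recent data: keep the old setpoint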

The formula used within ewStdDev is a well-known shortcut formula for evaluation of the standard deviation: Take the average of the squares, subtract the square of the average, then take the square root of the result. Although you have probably seen this formula applied in the equal weight case, its validity is independent of the manner in which the data is weighted. For further discussion of this formula, see Statistics for Experimenters, by Box, Hunter, and Hunter (John Wiley & Sons, 1978). One potential pitfall: If the squared standard deviation is much smaller than the squared average, precision can be lost. If you encounter this problem (and using a higher precision data type isn't an option), try shifting the data by a fixed offset so as to bring the average closer to zero.
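
The weighted version of the shortcut is easy to verify numerically (my own check): for any weights that sum to 1, the directly computed weighted variance sum(w*(x-mean)^2) equals sum(w*x^2) - mean^2.

import math
x = [3.0, 5.0, 9.0]
w = [0.5, 0.3, 0.2]                                   # arbitrary weights summing to 1
mean = sum(wi * xi for wi, xi in zip(w, x))
direct   = math.sqrt(sum(wi * (xi - mean) ** 2 for wi, xi in zip(w, x)))
shortcut = math.sqrt(sum(wi * xi * xi for wi, xi in zip(w, x)) - mean ** 2)
print(direct, shortcut)                               # agree to within rounding error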

Discontiguous Exponential Regression Analysis

Most people are unaware that simply computing the average and standard deviation also serves as the simplest possible linear regression analysis -- fitting a constant function to the data. The best fitting constant turns out to be the average, and the root-mean-squared error of the fit is the standard deviation. Thus, you already know how to perform simple discontiguous exponential regression analysis -- just use the algorithm in Listing Four.

Stated in terms of formal regression analysis, when fitting the function y=Const*1 to a data set, there are technically two data sequences (or vectors) that come into play: the data sequence, y(i), and a sequence that consists of all 1s that the to-be-fitted constant multiplies. You can view the computation of the three sums in Listing Four as computing exponentially weighted sums of all possible product pairs formed from these two sequences: 1*1, 1*y, and y*y. Listing Four calls these ew.sumOfWeights, ew.sumOfData, and ew.sumOfSquaredData. In regression analysis, these weighted sums are known variously as the "vector dot products" or "the upper triangular elements of the X transpose X matrix," which is always symmetrical and, in this case, is a 2×2 matrix.

More complex linear regression analyses merely add more data sequences. Hence, they add more of these weighted sums of products. For example, fitting the model y=m*x+b*1 involves the product pairs 1*1, 1*x, 1*y, x*x, x*y, and y*y. These form the upper triangular elements of a corresponding symmetrical 3×3 X transpose X matrix. By expanding the three lines in ewUpdate that update the exponentially weighted sums of products to instead update these six new sums of products, and making ewInitialize initialize each of the corresponding six sums to 0, you can easily modify the algorithm so that it maintains the X transpose X matrix associated with the problem of fitting a straight line to an (x,y) data set with weights that are exponentially decreasing functions of the data's age.

Linear algebra teaches how the best fitting parameters of the associated exponentially weighted linear regression can be found by solving a system of linear equations (the so-called normal equations) in which these sums of products appear as coefficients. In Listing Four, that system of equations is so simple that trivial algebraic expressions involving these weighted sums, such as that in ewAverage, are all you need to determine the parameters of interest. Although Cramer's rule could, in principle, be used to write explicit algebraic expressions for m and b in terms of these six weighted sums, experts recommend using one of several alternative, more numerically stable, linear-equation-solving algorithms instead. The important point is that these easily updated, missing-data-aware, weighted sums of products are the only inputs that such algorithms require. For a detailed description of such algorithms, see any good book on applied linear algebra. I've found Gilbert Strang's Introduction to Linear Algebra (Wellesley-Cambridge Press, 1993) both accessible and invaluable.
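
As a concrete illustration of the straight-line case (a sketch of my own, following the recipe above; all names are hypothetical), the update maintains the six exponentially weighted sums and the fit solves the resulting 2x2 normal equations directly -- fine for two parameters, though a more numerically stable solver is preferable for larger models, as noted above:

import math

class DiscontiguousEwLineFit:
    """Exponentially weighted, gap-aware fit of y = m*x + b (illustrative sketch)."""
    def __init__(self, alpha, max_dt):
        assert 0 <= alpha < 1 and max_dt > 0
        self.alpha = alpha
        self.cap = 1 - alpha ** max_dt                  # upper bound on any one point's weight
        self.previous_time = -math.inf
        # The six exponentially weighted sums of products: 1*1, 1*x, 1*y, x*x, x*y, y*y.
        self.s1 = self.sx = self.sy = self.sxx = self.sxy = self.syy = 0.0

    def update(self, x, y, time):
        assert time > self.previous_time
        f = self.alpha ** (time - self.previous_time)   # weight reduction factor
        w = min(1 - f, self.cap)                        # new data weight
        self.previous_time = time
        self.s1  = f * self.s1  + w
        self.sx  = f * self.sx  + w * x
        self.sy  = f * self.sy  + w * y
        self.sxx = f * self.sxx + w * x * x
        self.sxy = f * self.sxy + w * x * y
        self.syy = f * self.syy + w * y * y

    def fit(self):
        # Normal equations: m*sxx + b*sx = sxy and m*sx + b*s1 = sy.
        det = self.sxx * self.s1 - self.sx * self.sx
        assert det > 0                                  # requires at least two distinct x values
        m = (self.sxy * self.s1 - self.sx * self.sy) / det
        b = (self.sxx * self.sy - self.sx * self.sxy) / det
        return m, b

Feeding this sketch the collinear points (0,1), (1,3), (2,5) at times 0, 1, 2 recovers m=2 and b=1 regardless of the weights, since weighted least squares reproduces an exact fit exactly.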

One caveat: The number of such sums of products you need to update and store is NVectors*(NVectors+1)/2, so if your models reach neural network-like levels of complexity, much of the memory you save by not having to buffer the data sequences will be squandered. But for small numbers of fitted parameters and reasonably large data-fitting windows, the memory savings obtained by not having to buffer the fitted data sequences will generally predominate. Okay, so it's not exactly a zero-memory footprint, but what are a few floating-point variables between programmers?

Conclusion

Correcting problems with the traditional exponential smoothing algorithm involves introducing a sensible characterization of when you have data and when you don't, and applying that criterion consistently when forming the exponentially weighted sums. Fortunately, it's easy to do this while preserving the simplicity and memory efficiency that have made traditional exponential smoothing a classic data processing algorithm. And the same methods can provide similar benefits with more general kinds of exponentially weighted regression analysis.

DDJ

Listing One

ewInitialize(ew, alpha, firstData) {  assert(0 <= alpha < 1)
  ew.alpha = alpha
  ew.average = firstData
}
ewUpdate(ew, newData) {  
  ew.average = ew.alpha*ew.average + (1-ew.alpha)*newData
}


Listing Two

ewInitialize(ew, alpha) {  assert(0 <= alpha < 1)  
  ew.previousTime =  - INFINITY 
  ew.alpha = alpha
}
ewUpdate(ew, newData, time) {  
  weightReductionFactor  =  ew.alpha^(time - ew.previousTime)
  newDataWeight = 1 - weightReductionFactor
  ew.previousTime = time
  ew.average = weightReductionFactor*ew.average + newDataWeight*newData
}


Listing Three

ewInitialize(ew, alpha, maxDt) {  assert(0 <= alpha < 1)  
  assert(maxDt > 0)
  ew.sumOfWeights = 0
  ew.sumOfData = 0
  ew.previousTime =  - INFINITY 
  ew.alpha = alpha
  ew.newDataWeightUpperBound = 1 - alpha^maxDt
}
ewUpdate(ew, newData, time) {  
  assert (time > ew.previousTime)
  weightReductionFactor  =  ew.alpha^(time - ew.previousTime)
  newDataWeight = min(1 - weightReductionFactor, ew.newDataWeightUpperBound)
  ew.previousTime = time
  ew.sumOfWeights = weightReductionFactor*ew.sumOfWeights + newDataWeight
  ew.sumOfData = weightReductionFactor*ew.sumOfData + newDataWeight*newData
  ew.average = ew.sumOfData / ew.sumOfWeights
}


Listing Four

ewInitialize(ew, alpha, maxDt) {  assert(0 <= alpha < 1)  
  assert(maxDt > 0)
  ew.sumOfWeights = 0
  ew.sumOfData = 0
  ew.sumOfSquaredData = 0  
  ew.previousTime =  - INFINITY 
  ew.alpha = alpha
  ew.newDataWeightUpperBound = 1 - alpha^maxDt 
}
ewUpdate(ew, newData, time) {  
  assert (time > ew.previousTime)
  weightReductionFactor  =  ew.alpha^(time - ew.previousTime)
  newDataWeight = min(1 - weightReductionFactor, ew.newDataWeightUpperBound)
  ew.sumOfWeights = weightReductionFactor* ew.sumOfWeights + newDataWeight
  ew.sumOfData = weightReductionFactor * ew.sumOfData + newDataWeight*newData
  ew.sumOfSquaredData = 
      weightReductionFactor* ew.sumOfSquaredData + newDataWeight*newData^2
  ew.previousTime = time
}
ewCompletenessFraction(ew, time) { 
  assert (time >= ew.previousTime)
  return(ew.alpha^(time - ew.previousTime) * ew.sumOfWeights) 
}
ewAverage(ew) {
  assert (ew.sumOfWeights > 0)
  return(ew.sumOfData/ew.sumOfWeights) 
}
ewStdDev(ew) {       
  assert (ew.sumOfWeights > 0)
  return(Sqrt(ew.sumOfSquaredData/ew.sumOfWeights  - ewAverage(ew)^2))
}



Figure 1: Application of traditional and discontiguous algorithms to simulated data.


Figure 2: Approximating a continuous function via discrete samples.


Sidebar: What Averaging Window?


Call me fussy, but if I'm using an average whose weights decrease exponentially with the data's age, I'd like the characteristic time that governs this exponential decay to appear explicitly. The relationship between the more widely used alpha and the less widely used characteristic time (Tau) that governs the decay of the exponential weights is alpha=exp(-dt/Tau), or Tau=-dt/ln(alpha). Here, dt is the sampling interval.

Although Tau can be loosely interpreted as the moving averaging window, bear in mind that exponential averaging windows are not as clear cut as ordinary equal weight-averaging windows. For example, although around 95 percent of the weights are associated with points less than 3*Tau time units old, a single monster data point -- despite being many more than three Taus old -- could still have a sizable impact on the average.

The largest possible averaging window is a function of numeric precision, with Taumax=-dt/ln(1-EPS)~=dt/EPS (EPS is the smallest number such that 1+EPS>1).
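
The conversions are one-liners (my own helpers, using the formulas above and IEEE double precision for EPS):

import math, sys

def alpha_from_tau(tau, dt=1.0):
    return math.exp(-dt / tau)                  # alpha = exp(-dt/Tau)

def tau_from_alpha(alpha, dt=1.0):
    return -dt / math.log(alpha)                # Tau = -dt/ln(alpha)

eps = sys.float_info.epsilon                    # machine epsilon (the article's EPS)
tau_max = -1.0 / math.log(1 - eps)              # ~= dt/EPS with dt=1; about 4.5e15 sampling intervals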

-- J.C.G.


Copyright © 1998, Dr. Dobb's Journal
