Finding Significance in Noisy Data



Roy E. Kimbrell

Roy is a senior staff member at Planning Research Corp. (PRC). He has managed the software development of a large military intelligence project. In his spare time he is building a simulation development system for the PC. Roy can be reached at PRC, 1410 Wall St., Bellevue, NE 68005.


Data containing cyclic variations can be treacherous. Despite our ability to find the frequencies of the variations, certain features of the data remain elusive. Suppose you are counting cars traversing a well-traveled (but hypothetical) section of road. You will find the usual daily variations, but you also will find that taking advantage of your knowledge of the traffic's typical behavior is tricky. The question, "Is today's traffic significantly heavy or light?" is hard to answer.

If you have been driving long, you have come to expect the following kind of behavior from traffic: During weekdays, there will be a high point in the morning and in the evening, the rush hour you've grown to hate. There will be another peak about noon and a low point in the late night and early morning hours. Weekends have their own cycles. The peaks and valleys are at different times and the traffic density is probably lower. Looking at the data's weekly cycle, the traffic density is heavy during the week and light on weekends. We might find monthly variations as well--heavier traffic on the first weekend of the month (after payday). There will be yearly variations, perhaps due to weather, and even longer trends. Despite these cycles and trends, there will be random, probably unexplained, variations in the traffic density--this is noise. It is an unavoidable and even necessary feature of the kind of data we are examining. Now suppose you were to ask questions about specific days: For example, is this day's traffic heavier or lighter than normal? You wish to decide this despite the noise in the data.

For example, a city engineer might want to know if a newly opened shopping center has changed the local traffic patterns. The engineer would probably decide that the relative density of the traffic depended on the day of the week and the season of the year. If the city asked him to write a program to flag days with significantly high or low traffic, he might approach the problem head-on. He might establish high and low norms at each collection point for weekdays and weekends for each season of the year. Then, if a day's traffic count exceeded a high point or failed to reach a low point, the program would flag that day. With enough data, the engineer might even establish an average and a standard deviation for the weekdays and weekends in each season. With an average and a standard deviation he could calculate the significance of a particular day's variation from the average.
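
A minimal sketch of that head-on approach follows. The category structure, field names, and threshold here are hypothetical illustrations, not from the article:

#include <math.h>

/* Hypothetical norms table for the head-on approach: one mean and
   standard deviation per category (weekday vs. weekend, per season),
   established beforehand from historical counts. */
typedef struct {
    double mean;    /* average daily count for the category */
    double sd;      /* standard deviation of the counts     */
} Norm;

/* Flag a day whose count lies more than 'threshold' standard
   deviations from its category's mean. */
int flag_day(double count, const Norm *norm, double threshold)
{
    double z = (count - norm->mean) / norm->sd;
    return fabs(z) > threshold;
}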

Unfortunately, this approach ties the program to a particular cycle (weekly) and fails to capture trends. There may be unrecognized monthly cycles where some weeks have heavy traffic and others light. Thus, a day in a normally heavy week might be marked as significant when it really isn't. A new business in the neighborhood could destroy the statistics you've captured and require new analyses. If unrecognized changes occur, two kinds of errors might result: The program might fail to flag significant days, or it might flag days with no particular significance.

Problems with finding significance in noisy data occur everywhere: Traffic (vehicular, pedestrian, message, telephone, animal, and aircraft), work loads (in computers, assembly lines, and repair shops), and events (such as crimes, accidents, occurrences of disease, insect invasions, and animal population counts) are just a few examples. These problems have at least one feature in common. They have a cyclic character but the cycles change in amplitude and frequency and noise overlays these fundamental frequencies. The simple methods of solving these problems may miss significant events or mistakenly flag others. As the cycles change over time, the solutions only get worse. Adjusting to changing cycles requires continuous work. More complicated methods may be statistically sophisticated but still have difficulty adjusting to the fundamental changes in the data. Unless the statistical basis is somehow continually updated, eventually significant events will be missed or insignificant ones flagged.

Fortunately, there is a better way. There is a class of methods known as Digital Signal Processing (DSP). Speech analysis, generation, and filtering use digital signal processing techniques heavily. The filtering techniques are of special interest in solving our problem. In particular, a subset of digital filtering called "adaptive filters" can give us an excellent solution to our noisy data problem.

Adaptive Filtering

The particular adaptive filter I'll use is a lattice filter, so named because of the shape of its schematic; see Figure 1(c). A lattice filter has one or more stages. Each stage computes an error or deviation from the expected value and passes it on to the next. To do this it correlates the present error value with previous error values at that stage. At each stage there is feedback making the filter self-correcting. Up to a point, the more stages there are in a lattice filter, the better the filtering action. The multiple stages in the filter capture the cycles in the data. More stages in the filter allow the filter to capture lower-frequency cycles. The effective limit to the number of stages is the composition of the data itself. (If there aren't any low-frequency cycles, it makes little sense to have many stages.)

Figure 1 is a schematic of an adaptive filter. Each stage accepts "forward" errors (e^f) and "backward" errors (e^b) from previous stages and emits new forward and backward errors. The initial forward and backward errors are the input data point itself. Eventually, the last forward error corrects the original value to produce a filtered value. (The filtered value is the value expected in the absence of noise. Notice that we base the definition of "noise" solely on the history of previous values.)

In the lattice filter in Figure 1(a) and in the filter element in Figure 1(b) there are filter components labeled "prior." Here the filter uses the prior value rather than the current value; that is, we simply store the current value and use the one we stored previously. In each stage of the filter, we are using prior values to influence the forward error and current values to influence the backward error. Finally, at the output of the filter, we use the prior value to correct the input.

Inside each stage of the filter, the input forward and backward errors get multiplied by a number derived from the trend in changes of the data values. After the multiplication, the backward error subtracts from the forward error and vice versa. The results are the new forward and backward errors created by the stage. Selecting a value for the multiplier--k_i in Figure 1(b)--is somewhat arbitrary. We want a value that reflects the trend of the previous inputs to the stage. A new value of k_i may be computed every time there are new inputs or less frequently. How often you need to compute k_i depends on how responsive the filter must be to changes in the data.
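
As a minimal sketch (using scalar local names rather than the article's arrays, which appear later in Listing One), one stage's update can be written as:

/* One lattice stage, sketched as a function: ef_in is the previous
   stage's current forward error, eb_prior its stored prior backward
   error, k the stage's multiplier. */
static void lattice_stage(double ef_in, double eb_prior, double k,
                          double *ef_out, double *eb_out)
{
    *ef_out = ef_in - k * eb_prior;     /* backward error subtracts from forward */
    *eb_out = eb_prior - k * ef_in;     /* and vice versa                        */
}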

One way to compute k_i requires a history of forward- and backward-error values at each stage. Suppose we have these arrays, indexed now through last. Now is the current value, prior(t) returns the value immediately prior to t, and last is the oldest value in the array. We index in this fashion because later we will want to implement a circular array to avoid copying values. We would like the value of k to reflect trends in the relative differences between forward- and (prior) backward-error values. The approach in Example 1 will work fairly well. Note that the method of computing the output values in each stage of the filter seen in Figure 1(b) is linear; essentially, output=k*input. You could devise nonlinear schemes for both output and k computation.

This method of computing k requires several multiplications--many, if the history of values saved is long. But because we are continuously adding new values to the E^fb_s and E^ff_s sums, we should only have to add new values and subtract the oldest values, thus avoiding many of the multiplications. We use this method in the final algorithm given later.
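
Concretely, that incremental update is what the stage loop in Listing One (given later) performs; the fragment below uses the array and index names defined in Tables 2 and 3:

/* Sliding-window update of the error sums at stage s: add the newest
   products and subtract the ones falling out of the T-value window,
   instead of recomputing the whole sums. */
Efb[s] += ef[s-1][now]  * eb[s-1][prior]     /* newest cross product in  */
        - ef[s-1][last] * eb[s-1][plast];    /* oldest cross product out */
Eff[s] += ef[s-1][now]  * ef[s-1][now]       /* newest square in         */
        - ef[s-1][last] * ef[s-1][last];     /* oldest square out        */
k = Efb[s] / Eff[s];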

At some point, addition of stages or an increase in the number of historical values kept at each stage adds nothing to the effectiveness of the filter. To find the optimum number of stages, use a representative set of data. Then compute the average magnitude of the forward error generated by the last stage of the filter. Increase the number of stages and continue to compute this average. The average will decrease as you add stages; when it levels off, new stages are no longer yielding an advantage. You can compute the optimum history size in the same way. Start with a short history and gradually increase it as you repetitively compute the average forward error generated by the last stage. At some point there will be no benefit to increasing the size of the history.
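
A sketch of that tuning sweep follows. The helper run_filter_avg_error is hypothetical, not part of Listing One (where the stage count and history size are compile-time constants); it is assumed to run the filter over the representative data with the given parameters and return the average magnitude of the last stage's forward error:

/* Hypothetical tuning sweep: increase the stage count until the average
   |forward error| from the last stage stops improving noticeably.
   run_filter_avg_error() is an assumed helper, not in Listing One. */
double run_filter_avg_error(const double *data, int n, int stages, int history);

int find_stage_count(const double *data, int n, int max_stages, int history)
{
    double prev = 1.0e30;
    double avg;
    int stages;

    for (stages = 1; stages <= max_stages; stages++) {
        avg = run_filter_avg_error(data, n, stages, history);
        if (prev - avg < 0.01 * prev)   /* improvement under 1 percent */
            return stages;              /* adding stages has leveled off */
        prev = avg;
    }
    return max_stages;
}

The same loop, sweeping the history size instead of the stage count, finds the optimum history length.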

Determining Significance

The filtered values are the expected values based on the cycles in the data captured by the filter stages. The difference between the raw and filtered values is the noise or error in the value. When is this error significant?

Here we can rely on statistics for help. We can compute the average and the standard deviation of the errors. The standard deviation is a measure of the variability of the errors. If the random noise in the data has a high amplitude, the standard deviation will be large; otherwise it will be small. (Notice that the frequency of the noise will be the frequency of occurrence of the data values themselves.)

We subtract the current error from the average error and divide the result by the standard deviation to obtain the value z. z is measured in standard deviations, so we can speak of the difference between the current error and the average in terms of the number of standard deviations between them.
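
A self-contained sketch of that calculation over a window of recent errors is shown below. The function and parameter names are illustrative; Listing One computes the same quantity incrementally from running sums rather than from a separate error array. Only the magnitude of z matters when it is compared against the significance threshold, so the sign convention is immaterial:

#include <math.h>

/* Compute z for the current error over a window of n recent errors:
   the distance of 'current' from the window's mean, measured in
   standard deviations. */
double error_z(const double *errors, int n, double current)
{
    double sum = 0.0, sumsq = 0.0, mean, var, sd;
    int i;

    for (i = 0; i < n; i++) {
        sum += errors[i];
        sumsq += errors[i] * errors[i];
    }
    mean = sum / n;
    var = (sumsq - n * mean * mean) / (n - 1);   /* sample variance */
    sd = var > 0.0 ? sqrt(var) : 0.0;
    return sd > 0.0 ? (current - mean) / sd : 0.0;
}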

It won't hurt to assume a normal distribution for the errors. We can then estimate about how often we can expect to find errors of various magnitudes. Errors within one standard deviation of the average will occur about 68 percent of the time. Errors within two standard deviations of the average will occur about 95 percent of the time. If we take these frequencies as probabilities, we can use them as a determiner of significance.

We would call an error significant if it fell more than a few standard deviations from the average error. The exact number of standard deviations would depend on the level of significance (probability) we desire.

Thus we can automatically compute the significance of any error based on our choice of the probability. In our algorithm, the chosen number of standard deviations (probability) is called the confidence. Table 1 shows some confidence values from the normal distribution for useful probabilities.

Table 1: Confidence values from the normal distribution.

  Probability that the       Confidence=number
  value is significant    of standard deviations
  ----------------------------------------------

       68.3%                     1.0
       90.0%                     1.64485
       95.0%                     1.95996
       95.4%                     2.0
       99.0%                     2.57580
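
As a quick check on Table 1, the two-sided probability that a normally distributed error falls within z standard deviations of its mean is erf(z/sqrt(2)); the short program below reproduces the table (erf is declared in <math.h> on most systems):

#include <math.h>
#include <stdio.h>

/* Print the probability of falling within z standard deviations of the
   mean for each confidence value in Table 1. */
int main(void)
{
    double z[5] = { 1.0, 1.64485, 1.95996, 2.0, 2.57580 };
    int i;

    for (i = 0; i < 5; i++)
        printf("%.5f standard deviations -> %.1f%%\n",
               z[i], 100.0 * erf(z[i] / sqrt(2.0)));
    return 0;
    /* Prints roughly 68.3%, 90.0%, 95.0%, 95.4%, and 99.0%. */
}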

The final algorithm, given next, contains the method for computing the significance of the input value.

The Final Algorithm

The basic lattice-filter algorithm is simple (simple enough that you can get it in silicon); Table 2 lists the values it uses. The e^f_st and e^b_st arrays are circular over the t index. We use a circular array to avoid having to copy values down as new values enter the array; see Table 3. With these definitions and initial values we calculate the filtered value as shown in Example 2. This algorithm is implemented in Listing One.

Table 2: Values used in the basic lattice-filter algorithm.

  Value        Meaning
  -------------------------------------------------------------------------

  Y            Input data value
  Y^           Output (filtered) value
  S            Number of stages in the filter
  T            Number of prior forward and backward error values saved at
               each stage
  e^f_st       Array of forward error values: 0 <= s <= S is the stage
               index, 0 <= t < T is the history-values index
  e^b_st       Array of backward error values, 0 <= s <= S, 0 <= t < T
  E^ff_s       Array of sums of the squares of the forward errors at each
               stage, 0 <= s <= S.  The sum is over all the prior values
               saved, 0 <= t < T.
  E^fb_s       Array of sums of the products of the forward and backward
               errors at each stage, 0 <= s <= S.  The sum is over all the
               prior values saved, 0 <= t < T.
  E^f          Sum of the forward errors from the output of the last stage.
               The sum is over all the prior values saved, 0 <= t < T.
  priorE^ff    Prior E^ff_S, the E^ff_s at the last stage
  priorE^f     Prior E^f
  k            Multiplier
  sd           Standard deviation of the forward-error values produced by
               the final stage
  confidence   Number of standard deviations used as the measure of
               significance
  iteration    Count of the number of times the filter has been executed;
               also a count of the number of data values entering the
               filter

Table 3: Using a circular array to avoid having to copy values down as new values enter the array.

  Value   Meaning
  ------------------------------------------------------------------------

  now     Index of the current time:
             int now;
  prior   Index of the immediately prior time:
             #define prior (now==0 ? T+1 : now-1)
  last    Index of the last time in the array:
             #define last \
             (now==T ? 0 : (now==T+1 ? 1 : now+2))
  plast   Index of the time prior to the last time in the array:
             #define plast (now==T+1 ? 0 : now+1)
  cycle   Increments the current time index in preparation for new values:
             #define cycle \
             (now==T+1 ? now=0 : now++)
  These variables take on the following initial values:
     e^f_st = e^b_st = 0,  0 <= s <= S, 0 <= t <= T;
     E^ff_s = E^fb_s = 0,  0 <= s <= S;
     E^f = priorE^ff = priorE^f = sd = 0;
     iteration = 0;
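
As a worked illustration of the circular indexing, here is the sequence the macros produce for a hypothetical T = 3 (so the buffer holds T+2 = 5 slots); each row shows the index values before cycle advances now:

/* Circular-index trace for T = 3 (buffer size T+2 = 5):

     now   prior   last   plast
      0      4       2      1
      1      0       3      2
      2      1       4      3
      3      2       0      4
      4      3       1      0    <- now == T+1, so cycle wraps now back to 0
*/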

An Example

To test the ability of the lattice filter to catch significant events, we applied the filter to a year's count of daily flights at Eppley Airfield, the Omaha, Nebraska city airport. All commercial airlines use this airport, as well as many company "executive" and private aircraft. The greatest number of flights to and from Eppley are by these private and company aircraft. Weather at the airport is good most of the year: dry but not arid. Days with low ceilings caused by rain, snow, or fog are rare. High winds are a bigger problem for the light company and private aircraft.

The Federal Aviation Administration graciously summarized for us the total daily flights (inbound and outbound) for all of 1989. We had 365 data points. We set the number of stages in the filter at 30 and the number of historical values used also at 30. (In retrospect, this was a considerable overkill--15 for each would have been more than sufficient. The number of stages must span the longest cycle you wish to capture. At Eppley there appeared to be a weekly cycle and a yearly cycle, but no monthly cycle. The 15 stages would have nicely spanned the weekly cycle, always keeping one whole cycle in the filter.)

We fed the flight counts to the filter in sequence and plotted the actual and expected values. We set the confidence at two standard deviations (about 95 percent) and the filter returned a significant/not_significant flag for each day. Figure 2 shows the graphs of the actual and filtered counts for the last three months of the year. The bars are the actual counts, the small diamonds mark the filtered counts, and the small black squares mark significant deviations. The first few days marked significant are due only to start-up transients.

We checked the Omaha World Herald, the local newspaper, for those days flagged as significant. This was not a study of the aircraft traffic at Eppley Airfield, so we did not attempt to correlate all events with the traffic. Instead, we were looking for reasonable explanations for the significant counts. The events we found that might have affected the aircraft traffic were weather and holidays. Generally, high winds reduced traffic, sometimes for several days. There would then be a significantly higher count on the first good-weather weekday. In December we noticed that the traffic was considerably higher on a Friday preceding a forecast storm--so forecasts of bad weather could cause significant counts as well. A significant event occurred close to Labor Day (September 4) and two significant events occurred near New Year's Eve. Traffic generally tapered off toward Christmas, so the low traffic over Christmas was not flagged as significant. The weather over the Christmas holidays was extremely cold (record-setting). Perhaps that contributed to the low traffic.

When we changed the number of historical values from 30 to 15, the number of days flagged as significant increased considerably. This is because with the smaller number of historical values, the filter reacted more quickly to changes in the data, making the filtered data vary more.

Using Adaptive Filters to Flag Significant Events

We have shown how adaptive filters can identify significant deviations from expected values even in highly variable data. Allowing for and filtering the cycles in the data lets us find deviations from the expected. When the deviations are large, we suppose something significant may have happened to cause them.

Adaptive filters can catch significant events, but because they adapt to changing conditions, they will miss trends. Slow changes, despite their significance, will fool a single adaptive filter. There are two ways to catch the slow changes. One is to use one adaptive filter with a history window and number of stages sized to watch short cycles, and another with a larger history window and more stages to watch longer cycles. The other way is to use a filter designed to watch the short cycles, then sum the data over intervals and use another filter to watch the summed data.
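
A sketch of the second approach follows; the wrapper below is hypothetical, not part of Listing One. It sums the raw daily counts into weekly totals and runs them through the filter in a separate pass. Because Lfilter keeps its state in static variables, the filter must be reinitialized with I_Lfilter before this pass, and the declarations from Listing One are assumed to be visible:

#include <stdio.h>

/* Hypothetical second pass for slow trends: sum daily counts into weekly
   totals and filter the sums.  Assumes Lfilter, I_Lfilter, and SIGNIFICANT
   from Listing One are visible. */
#define DAYS_PER_WEEK 7

void flag_weekly(const double *daily, int n_days, double confidence)
{
    double sum = 0.0;
    int d, week = 0;

    I_Lfilter();                            /* reset the filter's static state */
    for (d = 0; d < n_days; d++) {
        sum += daily[d];
        if ((d + 1) % DAYS_PER_WEEK == 0) { /* a whole week collected */
            if (Lfilter(sum, confidence) == SIGNIFICANT)
                printf("week %d: significant\n", week);
            sum = 0.0;
            week++;
        }
    }
}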

The advantages of using adaptive filters outweigh their drawbacks. They select significant events with precision. The level of significance can be changed using a single parameter. It is necessary to keep only the raw data; the parameter data used for other significance-determining systems is no longer necessary.



[LISTING ONE]

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

void I_Lfilter(void);
unsigned Lfilter(double Y, double confidence);

#define MAIN

#define T           15  /* number of historical values */
#define S           30  /* number of stages */
#define SIGNIFICANT     1
#define NOT_SIGNIFICANT !SIGNIFICANT

/* The following macros define the circular buffer. There are T+2 values in
the buffer to allow exactly T historical values. The "last" and "plast" values
are additional to the T values and are not used in the computations--they
exist in the buffer to allow them to be subtracted from current values.
*/

#define prior           (now == 0 ? T+1 : now-1)
#define last            (now == T ? 0 : (now == T+1 ? 1 : now+2))
#define plast           (now == T+1 ? 0 : now+1)
#define cycle           (now == T+1 ? now = 0 : now++)

static int now; /* index of the current historical value */

static double Ef;           /* final error sum */
static double priorEf, priorEff;    /* prior error sums */
static double ef[S+1][T+2];     /* array of forward errors */
static double eb[S+1][T+2];     /* array of backward errors */
double Y_hat;               /* expected Y */
static double Efb[S+1], Eff[S+1];

void I_Lfilter(void){

    int s, t;
    now = 0;
    Ef = priorEf = priorEff = Y_hat = 0.0;
    for(s=0; s<=S; s++){
        Efb[s] = Eff[s] = 0;
        for(t=0; t<=T+1; t++) ef[s][t] = eb[s][t] = 0.0;
        }
    } /* I_Lfilter */

unsigned Lfilter(double Y, double confidence){

    double k, z=0.0, sd=0.0;
    int s;
    static unsigned iteration;

    ef[0][now] = eb[0][now] = Y;
    for (s=1; s<=S; s++){
        /* slide the error sums: newest products in, oldest products out */
        Efb[s] += ef[s-1][now] * eb[s-1][prior] - ef[s-1][last] * eb[s-1][plast];
        Eff[s] += ef[s-1][now] * ef[s-1][now] - ef[s-1][last] * ef[s-1][last];
        k = (Eff[s] != 0.0) ? Efb[s] / Eff[s] : 0.0;  /* guard against a zero sum at start-up */
        ef[s][now] = ef[s-1][now] - k * eb[s-1][prior];
        eb[s][now] = eb[s-1][prior] - k * ef[s-1][now];
        }
    Y_hat = Y - ef[S][prior];
    Ef += ef[S][now] - ef[S][last];
    if (iteration != 0){
        /* standard deviation of the last stage's forward errors over the window */
        sd = sqrt((T * priorEff - priorEf * priorEf)/(T*(T-1)));
        z = (sd != 0.0) ? (ef[S][prior] - Ef / T) / sd : 0.0;
        }
    priorEff = Eff[S];  /* use the output of the last stage */
    priorEf = Ef;
    iteration++;
    cycle;
    if (fabs(z) > confidence)
        return SIGNIFICANT;
    else
        return NOT_SIGNIFICANT;
    } /* Lfilter */

#ifdef MAIN

#define CONFIDENCE      2.0

int main(void){

    char buf[80];
    unsigned i=0, significant;
    double Y;

    I_Lfilter();            /* initialize the filter's state */
    while (fgets(buf,sizeof(buf),stdin)){
        Y = atof(buf);
        significant = Lfilter(Y,CONFIDENCE);
        printf("%u\t%.4f\t%.4f",i++,Y,Y_hat);
        if (significant)
            printf("\tSIGNIFICANT\n");
        else
            printf("\n");
        }
    return 0;
    } /* main */

#endif


Example 1: The value of k_s reflects trends in the relative
differences between forward- and (prior) backward-error values.


At stage s,
   Efb_s = Eff_s = 0;
   for (t = now; t != last; t = prior(t)){
      Efb_s += ef[s-1][t] * eb[s-1][prior(t)];
      Eff_s += ef[s-1][t] * ef[s-1][t];
      }
   k_s = Efb_s / Eff_s;


Example 2: Calculating the filtered value.

ef[0][now] = eb[0][now] = Y;
for (s = 1; s <= S; s++){
   Efb[s] += ef[s-1][now] * eb[s-1][prior] - ef[s-1][last] * eb[s-1][plast];
   Eff[s] += ef[s-1][now] * ef[s-1][now]   - ef[s-1][last] * ef[s-1][last];
   k = Efb[s] / Eff[s];
   ef[s][now] = ef[s-1][now] - k * eb[s-1][prior];
   eb[s][now] = eb[s-1][prior] - k * ef[s-1][now];
/*    The following happens to be one way of computing Y^,
      but we actually do it a different way:
      Y^ += k * eb[s-1][prior];
*/
   }
Y^ = Y - ef[S][prior];
Ef += ef[S][now] - ef[S][last];
if (iteration != 0){
   sd = sqrt((T * priorEff - priorEf * priorEf) / (T * (T-1)));
   z = (ef[S][prior] - Ef / T) / sd;
   }
priorEff = Eff[S];
priorEf = Ef;
iteration++;
cycle;
if (|z| > confidence)
   return SIGNIFICANT;
else
   return NOT_SIGNIFICANT;


Copyright © 1992, Dr. Dobb's Journal

