Statistics In Java


Mold the Math Instead of Maxing the Machine

Variance is a measure of dispersion: the average squared distance of the data points from the mean. It is therefore usually calculated after the mean. Example 5 calculates variance once the mean is already known.


double calculateVariance() {
     double sumSquaredDelta = 0.0;
     int validCount = 0;
     // Return a variance of 0 if there are no data points
     if (getList().isEmpty()) {
           return 0.0;
     }
     // The mean must already have been calculated in an earlier pass
     double mean = getNipMean();
     for (int count = 0; count < getList().size(); ++count) {
           // Only valid data points contribute to the variance
           if (getList().get(count).GetValid()) {
                  double delta = getList().get(count).getNumber() - mean;
                  sumSquaredDelta += delta * delta;
                  ++validCount;
           }
     }
     // Population variance: divide by the number of valid points only
     return validCount == 0 ? 0.0 : sumSquaredDelta / validCount;
}

Example 5

This is called the "two-pass" method. It cycles through the data points once to calculate a mean. It then cycles back through the points to calculate the variance. Note in Example 5 that the method getNipMean assumes the mean has already been calculated.
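Example 5 shows only the second pass. As a minimal, self-contained sketch of both passes together, assuming the data fits in a plain array of doubles (the class name TwoPassVariance and its method are hypothetical and not part of the article's code), the idea might look like this:

```java
// Hypothetical helper class for illustration; not part of the article's code.
class TwoPassVariance {

    // Population variance computed in two passes over stored data.
    static double variance(double[] points) {
        if (points.length == 0) {
            return 0.0; // mirror Example 5: no data yields a variance of 0
        }

        // Pass 1: compute the mean.
        double sum = 0.0;
        for (double x : points) {
            sum += x;
        }
        double mean = sum / points.length;

        // Pass 2: accumulate the squared deviations from the mean.
        double sumSquaredDelta = 0.0;
        for (double x : points) {
            double delta = x - mean;
            sumSquaredDelta += delta * delta;
        }
        return sumSquaredDelta / points.length;
    }

    public static void main(String[] args) {
        double[] points = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.println(variance(points)); // prints 4.0
    }
}
```

The second loop cannot start until the first has finished, which is why the data must be stored or replayable.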

Performing a two-pass calculation is not a problem for a hundred persistent points, such as records returned from a database. You can quickly calculate the mean and rewind to the beginning of the records to calculate the variance. However, that method is slow when processing a few million records, and it is impossible if your data is not persistent: you cannot rewind to the first point for the second pass unless you store the data. That may not be possible with a huge stream of data, such as hours of streamed audio. Fortunately, the standard variance formula can be rewritten into the raw score method, which allows calculating variance without knowing the mean in advance.

Let's see how the normal variance formula can be morphed into the raw score method. We start with the normal two-pass variance formula used in Example 5 above.

Step 1: The first step is to perform the squaring operation.

Step 2: The second step is to remove the parentheses by attaching the sigma notation to each term.

Step 3: The third step recognizes that the last part of the third term, for i=1 to N multiply by 1, is functionally the same thing as adding 1 N number of times. Thus, the last part of the third term can be rewritten as simply N.

Step 4: The fourth step simplifies the third term by removing N because N/N = 1.

Step 5: The fifth step recognizes that the second term contains the formula for the mean (everything after 2μ). Thus, the second term is merely 2 * μ * μ. This further reduces the second term to 2μ².

Step 6: The sixth step combines the second and third terms by performing the addition.

Step 7: The seventh step substitutes the mean formula for μ in the second term.
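The formulas that accompany these steps can be written out as follows. This is a reconstruction from the step descriptions, assuming the population variance σ² of N points x_i with mean μ; the notation is an assumption and may differ from the original figures.

```latex
\begin{align*}
\sigma^2 &= \frac{1}{N}\sum_{i=1}^{N}\left(x_i-\mu\right)^2
  && \text{(two-pass formula)}\\
&= \frac{1}{N}\sum_{i=1}^{N}\left(x_i^2 - 2\mu x_i + \mu^2\right)
  && \text{(Step 1)}\\
&= \frac{1}{N}\sum_{i=1}^{N}x_i^2 \;-\; 2\mu\,\frac{1}{N}\sum_{i=1}^{N}x_i \;+\; \frac{\mu^2}{N}\sum_{i=1}^{N}1
  && \text{(Step 2)}\\
&= \frac{1}{N}\sum_{i=1}^{N}x_i^2 \;-\; 2\mu\,\frac{1}{N}\sum_{i=1}^{N}x_i \;+\; \frac{\mu^2 N}{N}
  && \text{(Step 3)}\\
&= \frac{1}{N}\sum_{i=1}^{N}x_i^2 \;-\; 2\mu\,\frac{1}{N}\sum_{i=1}^{N}x_i \;+\; \mu^2
  && \text{(Step 4)}\\
&= \frac{1}{N}\sum_{i=1}^{N}x_i^2 \;-\; 2\mu^2 \;+\; \mu^2
  && \text{(Step 5)}\\
&= \frac{1}{N}\sum_{i=1}^{N}x_i^2 \;-\; \mu^2
  && \text{(Step 6)}\\
&= \frac{1}{N}\sum_{i=1}^{N}x_i^2 \;-\; \left(\frac{1}{N}\sum_{i=1}^{N}x_i\right)^{2}
  && \text{(Step 7)}
\end{align*}
```

The last line is algebraically the same as (Σx² − (Σx)²/N) / N, which needs only the running count, the running sum, and the running sum of squares.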

This leaves us with the raw score method. We can now calculate variance in a single pass. The raw score method is the formula used by all calculators and most computer programs.


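void addPoint(String number) {
     DataPoint d = new DataPoint(number);
     // Only add valid data to the list; invalid points are merely counted
     if (d.GetValid()) {
          list.add(d);
          setNipCount(list.size());
     } else {
          setCountInvalidValues(getCountInvalidValues() + 1);
     }
     setFullCount(getNipCount() + getCountInvalidValues());
     calculateMeanVariance(d);
}

void calculateMeanVariance(DataPoint d) {
     // Update the running sums; only valid points contribute a value
     if (d.GetValid()) {
          setSum(getSum() + d.getNumber());
          setSumSquared(getSumSquared() + Math.pow(d.getNumber(), 2));
     }
     // "Nip" statistics divide by the count of valid points only;
     // skip them until at least one valid point has arrived
     if (getNipCount() > 0) {
          setNipMean(getSum() / getNipCount());
          setNipVariance((getSumSquared() - Math.pow(getSum(), 2) / getNipCount())
                         / getNipCount());
     }
     // "Nil" statistics divide by the full count (valid plus invalid points)
     setNilMean(getSum() / getFullCount());
     setNilVariance((getSumSquared() - Math.pow(getSum(), 2) / getFullCount())
                    / getFullCount());
}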

Example 6
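Example 6 depends on a surrounding class that is not shown here. As a stripped-down sketch of the same raw score bookkeeping, assuming plain double inputs and no invalid points (the class RunningVariance and its method names are hypothetical), a one-pass accumulator might look like this:

```java
// Hypothetical accumulator for illustration; not the article's class.
class RunningVariance {
    private long count;        // N: number of points seen so far
    private double sum;        // running sum of x
    private double sumSquared; // running sum of x squared

    // Fold one point into the running totals; the point itself is not stored.
    void add(double x) {
        count++;
        sum += x;
        sumSquared += x * x;
    }

    double mean() {
        return count == 0 ? 0.0 : sum / count;
    }

    // Raw score method: (sum of squares - (sum)^2 / N) / N
    double variance() {
        if (count == 0) {
            return 0.0;
        }
        return (sumSquared - sum * sum / count) / count;
    }

    public static void main(String[] args) {
        RunningVariance rv = new RunningVariance();
        for (double x : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            rv.add(x);
        }
        System.out.println(rv.mean());     // prints 5.0
        System.out.println(rv.variance()); // prints 4.0
    }
}
```

The mean and variance are available after every point, and nothing needs to be rewound. One caveat of this form: because it subtracts two large, nearly equal quantities, it can lose floating-point precision when the mean is large relative to the spread of the data.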

This kind of efficiency is unlike the old debates from the early days of ANSI C, when we argued over the efficiency of x++ versus x = x + 1. Some of us believed that the post-increment operator (x++) required fewer CPU cycles than adding one (x = x + 1), because it would translate to the increment CPU instruction instead of the add instruction, and the increment instruction took fewer cycles to process. However, we came to a different conclusion after looking at the assembler listings from most compilers. The compilers had recognized that x = x + 1 is the same as x++ and emitted the increment opcode on their own. That was a simple recognition and a simple translation. No compiler will completely rework something like the standard variance equation into the raw score method. Right now, that is our job.

Conclusion

I have only scratched the surface. All of our calculations were for full populations, not samples. All of our continuous data was rounded into discrete bins. There is a wealth of formulas and concepts left untouched. However, I hope we are all reminded that there are more ways of measuring central tendency than the average and more ways of measuring dispersion than the min/max range. Occasionally open your college math books. You can hammer a screw, but it usually does not hold well.

