# Statistics In Java

### Mold the Math Instead of Maxing the Machine

Variance is one of the measures of dispersion. Usually, variance is calculated after calculating the mean. Example 5 calculates variance after the mean is already known.

```
double calculateVariance() {
    double variance = 0.0;
    double sumSquaredDelta = 0.0;
    int elementsInArray = getList().size();
    int count = 0;
    int validCount = 0;
    // Leave variance at 0 and return if there are no data points
    if (elementsInArray == 0) {
        return variance;
    }
    while (count < elementsInArray) {
        // Only valid data points contribute to the NIP variance
        if (getList().get(count).GetValid()) {
            sumSquaredDelta += Math.pow(getList().get(count).getNumber()
                    - getNipMean(), 2);
            ++validCount;
        }
        ++count;
    }
    // Divide by the number of valid points, the same basis as getNipMean()
    if (validCount > 0) {
        variance = sumSquaredDelta / validCount;
    }
    return variance;
}

```
Example 5

This is called the "two-pass" method. It cycles through the data points once to calculate a mean. It then cycles back through the points to calculate the variance. Note in Example 5 that the method getNipMean assumes the mean has already been calculated.
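For readers who want to run the two-pass method outside the article's class, here is a minimal standalone sketch. It assumes the data is a plain `double[]` held in memory; the class name `TwoPassVariance` and its `variance` method are illustrative names and are not part of the article's code.

```java
public class TwoPassVariance {
    /** Population variance computed in two passes over an in-memory array. */
    public static double variance(double[] values) {
        if (values.length == 0) {
            return 0.0; // same convention as Example 5: no data, report 0
        }
        // Pass 1: compute the mean.
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        double mean = sum / values.length;
        // Pass 2: average the squared distances from the mean.
        double sumSquaredDelta = 0.0;
        for (double v : values) {
            double delta = v - mean;
            sumSquaredDelta += delta * delta;
        }
        return sumSquaredDelta / values.length;
    }

    public static void main(String[] args) {
        double[] data = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.println(variance(data)); // prints 4.0
    }
}
```

The second loop is the pass that requires the data to still be available after the mean is known.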

Performing a two-pass calculation is not a problem for a hundred persistent points, such as records returned from a database. You can quickly calculate the mean and rewind to the beginning of the records to calculate the variance. However, that method is slow across a few million records, and it is impossible if your data is not persistent: you cannot rewind to the first point for the second pass unless you store the data. That may not be possible with a huge stream of data, such as hours of streamed audio. Fortunately, the standard variance formula can be rewritten into the raw score method, which allows calculating variance without knowing the mean in advance.

Let's see how the normal variance formula can be morphed into the raw score method. We start with the normal two-pass variance formula used in Example 5 above.

Step 1: The first step is to perform the squaring operation.

Step 2: The second step is to remove the parentheses by attaching the sigma notation to each term.

Step 3: The third step recognizes that the last part of the third term, the sum of 1 for i = 1 to N, is the same thing as adding 1 a total of N times. Thus, the last part of the third term can be rewritten as simply N.

Step 4: The fourth step simplifies the third term by removing N because N/N = 1.

Step 5: The fifth step recognizes that the second term contains the formula for the mean (everything after 2μ). Thus, the second term is merely 2 * μ * μ. This further reduces the second term to 2μ².

Step 6: The sixth step combines the second and third terms by performing the addition.

Step 7: The seventh step substitutes the mean formula for μ in the second term.
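The equations that accompanied these steps are not shown in the text. The following is one reconstruction that is consistent with the step descriptions, assuming the population variance σ² of N points x_i with mean μ = (1/N) Σ x_i. The notation is an assumption; only the sequence of steps comes from the text.

$$
\begin{aligned}
\sigma^2 &= \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2 && \text{(two-pass formula)}\\
&= \frac{1}{N}\sum_{i=1}^{N}\left(x_i^2 - 2\mu x_i + \mu^2\right) && \text{(Step 1)}\\
&= \frac{\sum_{i=1}^{N}x_i^2}{N} - 2\mu\,\frac{\sum_{i=1}^{N}x_i}{N} + \mu^2\,\frac{\sum_{i=1}^{N}1}{N} && \text{(Step 2)}\\
&= \frac{\sum_{i=1}^{N}x_i^2}{N} - 2\mu\,\frac{\sum_{i=1}^{N}x_i}{N} + \mu^2\,\frac{N}{N} && \text{(Step 3)}\\
&= \frac{\sum_{i=1}^{N}x_i^2}{N} - 2\mu\,\frac{\sum_{i=1}^{N}x_i}{N} + \mu^2 && \text{(Step 4)}\\
&= \frac{\sum_{i=1}^{N}x_i^2}{N} - 2\mu^2 + \mu^2 && \text{(Step 5)}\\
&= \frac{\sum_{i=1}^{N}x_i^2}{N} - \mu^2 && \text{(Step 6)}\\
&= \frac{\sum_{i=1}^{N}x_i^2}{N} - \left(\frac{\sum_{i=1}^{N}x_i}{N}\right)^2 && \text{(Step 7)}
\end{aligned}
$$

The last line needs only two running totals, the sum of the values and the sum of their squares, plus the count N.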

This leaves us with the raw score method. We can now calculate variance in a single pass. The raw score method is the formula used by many calculators and computer programs.

```
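// The method header is missing from the original listing;
// addNumber(double number) is a placeholder name.
void addNumber(double number) {
    DataPoint d = new DataPoint(number);
    if (d.GetValid()) {
        setNipCount(list.size());
    } else {
        setCountInvalidValues(getCountInvalidValues() + 1);
    }
    setFullCount(getNipCount() + getCountInvalidValues());
    calculateMeanVariance(d);
}

void calculateMeanVariance(DataPoint d) {
    // Update the running sum, then the mean over valid points
    setSum(getSum() + d.getNumber());
    setNipMean(getSum() / getNipCount());
    // Update the running sum of squares, then apply the raw score formula
    setSumSquared(getSumSquared() + Math.pow(d.getNumber(), 2));
    setNipVariance((getSumSquared() - Math.pow(getSum(), 2) / getNipCount())
            / getNipCount());
    // Same formulas, divided by the count of all points
    setNilMean(getSum() / getFullCount());
    setNilVariance((getSumSquared() - Math.pow(getSum(), 2) / getFullCount())
            / getFullCount());
}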

```
Example 6
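Stripped of the `DataPoint` bookkeeping, the raw score method can be sketched as a small accumulator that keeps only a count, a sum, and a sum of squares. This is an illustrative sketch and not the article's class: the name `RunningVariance` and its methods are assumptions, every value is treated as valid, and an empty accumulator reports 0 by the same convention as Example 5.

```java
public class RunningVariance {
    private long count = 0;
    private double sum = 0.0;
    private double sumSquared = 0.0;

    /** Fold one value into the running totals; the value itself is not stored. */
    public void add(double x) {
        count++;
        sum += x;
        sumSquared += x * x;
    }

    public double mean() {
        return count == 0 ? 0.0 : sum / count;
    }

    /** Population variance via the raw score formula. */
    public double variance() {
        if (count == 0) {
            return 0.0;
        }
        return (sumSquared - (sum * sum) / count) / count;
    }

    public static void main(String[] args) {
        RunningVariance rv = new RunningVariance();
        for (double x : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            rv.add(x);
        }
        System.out.println(rv.mean() + " " + rv.variance()); // prints 5.0 4.0
    }
}
```

Because each value is discarded after `add`, this works on a stream that cannot be rewound. One caveat worth knowing: the formula subtracts two totals that can be large and close together, so with floating-point arithmetic it can lose precision when the mean is large compared with the spread of the data.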

This kind of efficiency is different from the old debates in the early days of ANSI C. We argued about the efficiency of x++ over x = x + 1. Some of us believed that the post-increment operator (x++) required fewer CPU cycles than adding one (x = x + 1), because it would translate to the increment CPU instruction instead of the add instruction, and the increment instruction took fewer cycles to process. However, we came to a different conclusion after looking at the assembler listings from most compilers. The compilers had recognized that x = x + 1 is the same as x++ and emitted the increment opcode on their own. That was a simple recognition and a simple translation. No compiler will completely rework something like the standard variance equation into the raw score method. Right now that is our job.

### Conclusion

I have only scratched the surface. All of our calculations were on full populations and not samples. All of our continuous data was rounded into discrete bins. There is a wealth of formulas and concepts left untouched. However, I hope we are all reminded that there are more ways of measuring central tendency than the average and more ways of measuring dispersion than the min/max range. Occasionally open your college math books. You can hammer a screw, but it usually does not hold well.
