Calculating Mean
Mean is a simple arithmetic average -- add the values up and divide by the number of values. The formal statistical formula, which will be important later when we discuss variance, is as follows:

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$

Here N is the number of values and x_i is the i-th value.
The formula looks more complicated than the actual operation in Example 3.
    void addPoint(String number) {
        DataPoint d = new DataPoint(number);
        // Only add valid data
        if (d.GetValid()) {
            list.add(d);
            setNipCount(list.size());
        } else {
            setCountInvalidValues(getCountInvalidValues() + 1);
        }
        setFullCount(getNipCount() + getCountInvalidValues());
        // Note: this also runs for invalid points, so it relies on
        // getNumber() contributing nothing to the sum for them.
        calculateMean(d);
    }

    void calculateMean(DataPoint d) {
        setSum(getSum() + d.getNumber());
        // Mean over the valid points only
        setNipMean(getSum() / getNipCount());
        // Mean over all points, with invalid points counted as zero
        setNilMean(getSum() / getFullCount());
    }
I am calculating two flavors of the mean. The nipMean is the mean with nulls thrown away, and the nilMean is the mean with nulls converted to zero. Let's look at nipMean first. Its advantage is that it stores a more accurate mean based on the data as we know it. For example, nipMean stores the mean of (0, 1, 2, null, null) as one ((0 + 1 + 2) / 3). The nipMean value, however, gives a false total if someone uses it to extrapolate the aggregate sum of a population. Going back to my example of (0, 1, 2, null, null), we have five data elements and a mean of one. If we extrapolate the aggregate total with the formula N * μ, we get a population sum of five (5 data points x 1 mean value). That is not the case. The true sum is actually 3 (0 + 1 + 2).
I am storing the mean with nulls converted to zero in nilMean. For the same population (0, 1, 2, null, null), it calculates the mean as 0.60 ((0 + 1 + 2 + 0 + 0) / 5). The advantage of nilMean is that the population sum is correct when we multiply the mean by the number of data points (0.60 x 5 = 3).
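The arithmetic above can be checked with a small standalone sketch. It is not the article's class: MeanSketch and its two static methods are hypothetical names, the data is held in a plain list with null marking a missing value, and returning 0.0 for an empty input is an assumption made only to avoid dividing by zero.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of the article's code.
class MeanSketch {
    // Mean over the non-null values only (nulls thrown away).
    static double nipMean(List<Double> values) {
        double sum = 0.0;
        int count = 0;
        for (Double v : values) {
            if (v != null) {
                sum += v;
                count++;
            }
        }
        // Assumption: report 0.0 when there is nothing to average.
        return count == 0 ? 0.0 : sum / count;
    }

    // Mean over every slot, with null counted as zero.
    static double nilMean(List<Double> values) {
        double sum = 0.0;
        for (Double v : values) {
            if (v != null) {
                sum += v;
            }
        }
        // Assumption: report 0.0 for an empty list.
        return values.isEmpty() ? 0.0 : sum / values.size();
    }

    public static void main(String[] args) {
        // Arrays.asList is used because it accepts null elements.
        List<Double> data = Arrays.asList(0.0, 1.0, 2.0, null, null);
        double nip = nipMean(data);
        double nil = nilMean(data);
        // Multiplying each mean by N shows which one recovers the true sum.
        System.out.println("nipMean = " + nip + ", N * nipMean = " + data.size() * nip);
        System.out.println("nilMean = " + nil + ", N * nilMean = " + data.size() * nil);
    }
}
```

Only the nilMean line should come back to the true sum of 3 when multiplied by the five data points, which is the trade-off described above.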
Random Tests for Determined Validity
Only testing convenience data may be the most common testing flaw other than the my-code-has-no-bugs delusion. Probably every programmer has blown a foot off at least once by only testing the first convenient section of data records. I am a little ashamed to admit it, but I have blown a foot off myself. Don't be embarrassed yourself. Use Example 4 to retrieve a random sample of records to test on your next project.
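    import java.util.Random;

    public class Sample {
        private int numberSamples;
        private int maxValue;
        private Integer[] samples;
        private Random generator;
        // Per-instance position; a static field would be shared by every Sample.
        private int currentSample = 0;

        Sample(int maximum, int numberSamples) {
            setGenerator(new Random());
            setNumberSamples(numberSamples);
            setMaxValue(maximum);
            setSamples(new Integer[numberSamples]);
            init();
        }

        // Returns the next random row number, wrapping after the last sample.
        public int next() {
            // Compare against the sample count, not maxValue, to stay inside the array.
            if (getCurrentSample() == getNumberSamples()) {
                setCurrentSample(0);
            }
            setCurrentSample(getCurrentSample() + 1);
            return getSamples()[getCurrentSample() - 1];
        }

        // Fills the samples array with random row numbers in 0 .. maxValue - 1.
        void init() {
            for (int count = 0; count < getNumberSamples(); count++) {
                getSamples()[count] = getGenerator().nextInt(getMaxValue());
            }
        }

        // Trivial getters and setters for the fields above are omitted here.
    }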
The example above is pretty simple. It is designed for testing rows of data. The parameter maximum is the size of our row population. Because Random.nextInt(maximum) returns values from 0 up to but not including maximum, the generated row numbers run from 0 to maximum - 1. The number of samples you wish to test is passed in numberSamples. The init() method populates the samples array with a list of random row numbers. The next() method returns the next random row number, so you can use it to retrieve a row of data and then verify the data is correct.
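For readers who want something they can paste and run, here is a compact sketch of the same idea. It is a variant rather than the listing above: RowSampler is a hypothetical name, plain fields replace the accessor methods, and the Random instance is passed in by the caller. It assumes maximum and numberSamples are both at least 1.

```java
import java.util.Random;

// Hypothetical compact variant of the sampling idea; names are illustrative.
class RowSampler {
    private final int[] samples;
    private int current = 0;

    // Pre-generates numberSamples random row numbers in 0 .. maximum - 1.
    RowSampler(int maximum, int numberSamples, Random generator) {
        samples = new int[numberSamples];
        for (int i = 0; i < numberSamples; i++) {
            samples[i] = generator.nextInt(maximum);
        }
    }

    // Returns the next stored row number, wrapping to the start after the last one.
    int next() {
        int row = samples[current];
        current = (current + 1) % samples.length;
        return row;
    }

    public static void main(String[] args) {
        // One Random instance for the whole run, left to seed itself.
        RowSampler sampler = new RowSampler(1000, 5, new Random());
        for (int i = 0; i < 5; i++) {
            // In a real test, fetch this row and verify its contents here.
            System.out.println("verify row " + sampler.next());
        }
    }
}
```

Passing the generator in keeps a single Random per run and lets a test supply a seeded one when it needs a repeatable list of rows.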
A couple of things to remember about the standard Random class. First, you should instantiate Random only once during any run. It is possible to generate the same list of random numbers if you repeatedly instantiate Random in quick succession, particularly on older JVMs that seed from the system clock. Second, if you seed Random with the same seed number, it will always yield the same list of random numbers. Thus, computer randomness is really an illusion. The random numbers are predetermined given any known seed. That is not really a problem. Some of us who believe in the noetical nature of predeterminism think the physical universe works the same way -- chaos dissolves as knowledge becomes deeper and broader. For practical purposes, however, it is only important to remember to either let the library self-seed by not supplying a seed number or seed Random with a different number every time.
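The seeding behavior is easy to see for yourself. The sketch below is illustrative only: SeedSketch and draw are hypothetical names, and the seed 42 is arbitrary. It draws a few numbers from two generators built with the same seed, then from a self-seeded one.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical demo class; not part of the article's code.
class SeedSketch {
    // Draws count values in 0 .. bound - 1 from the given generator.
    static int[] draw(Random generator, int count, int bound) {
        int[] values = new int[count];
        for (int i = 0; i < count; i++) {
            values[i] = generator.nextInt(bound);
        }
        return values;
    }

    public static void main(String[] args) {
        // Two generators with the same (arbitrary) seed yield the same list.
        int[] first = draw(new Random(42L), 5, 100);
        int[] second = draw(new Random(42L), 5, 100);
        System.out.println("same seed, same list: " + Arrays.equals(first, second));

        // A self-seeded generator gives a list that changes from run to run.
        int[] unseeded = draw(new Random(), 5, 100);
        System.out.println("self-seeded list: " + Arrays.toString(unseeded));
    }
}
```

A fixed seed is still handy when you want to replay the exact sample that exposed a bug. For everyday random testing, the self-seeded form is the one to reach for.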