Statistics In Java

Calculating Mean

Mean is a simple arithmetic average -- add the values up and divide by the number of values. The formal statistical formula, which will be important later when we discuss variance, is as follows:
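The formula itself did not survive in this copy of the article. Assuming it was the standard definition of a population mean, with N the number of values and x_i the i-th value (symbols chosen here for illustration), it would read:

```latex
\mu = \frac{1}{N} \sum_{i=1}^{N} x_i
```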

The formula looks more complicated than the actual operation in Example 3.

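void addPoint(String number) {
     DataPoint d = new DataPoint(number);
     // Only add valid data
     if (d.GetValid()) {
          // Assumed: this branch's body is missing from the listing;
          // counting the valid point is implied by getNipCount().
          setNipCount(getNipCount() + 1);
     } else {
          setCountInvalidValues(getCountInvalidValues() + 1);
     }
     setFullCount(getNipCount() + getCountInvalidValues());
     // Assumed: update the means once both counts are current.
     if (d.GetValid()) {
          calculateMean(d);
     }
}

void calculateMean(DataPoint d) {
     setSum(getSum() + d.getNumber());
     setNipMean(getSum() / getNipCount());   // nulls thrown away
     setNilMean(getSum() / getFullCount());  // nulls counted as zero
}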

Example 3

I am calculating two flavors of the mean. The nipMean is the mean with nulls thrown away, and the nilMean is the mean with nulls converted to zero. Let's look at nipMean first. The advantage of nipMean is that it stores a more accurate mean based on the data as we know it. For example, nipMean would store the mean of (0, 1, 2, null, null) as one ((0 + 1 + 2) / 3). The nipMean value, however, gives a false total if someone uses it to extrapolate the aggregate sum of a population. Going back to my example of (0, 1, 2, null, null), we have five data elements and a mean of one. If we extrapolate the aggregate total with the formula N × μ, it gives us a population sum of five (5 data points × 1 mean value). That is not the case. The true sum is actually 3 (0 + 1 + 2).

I am storing the mean with nulls converted to zero in nilMean. For the same population (0, 1, 2, null, null), it calculates the mean as 0.60 ((0 + 1 + 2 + 0 + 0) / 5). The advantage of nilMean is that the population sum is correct when we multiply the mean by the number of data points (0.60 × 5 = 3).
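The arithmetic in the two paragraphs above can be checked with a small standalone sketch. It is not the article's class: MeanDemo and its two static methods are hypothetical names chosen for illustration, and a List of Integer with null entries stands in for the DataPoint objects.

```java
import java.util.Arrays;
import java.util.List;

public class MeanDemo {
    // Mean with nulls thrown away (what the article calls nipMean).
    static double nipMean(List<Integer> values) {
        double sum = 0;
        int count = 0;
        for (Integer v : values) {
            if (v != null) {
                sum += v;
                count++;
            }
        }
        return sum / count;
    }

    // Mean with nulls converted to zero (what the article calls nilMean).
    static double nilMean(List<Integer> values) {
        double sum = 0;
        for (Integer v : values) {
            if (v != null) {
                sum += v;
            }
        }
        return sum / values.size();
    }

    public static void main(String[] args) {
        // Arrays.asList is used because List.of rejects null elements.
        List<Integer> data = Arrays.asList(0, 1, 2, null, null);
        double nip = nipMean(data);
        double nil = nilMean(data);
        System.out.println("nipMean = " + nip);
        System.out.println("nilMean = " + nil);
        // Extrapolated population sums: only the nilMean version recovers 3.
        System.out.println("N x nipMean = " + data.size() * nip);
        System.out.println("N x nilMean = " + data.size() * nil);
    }
}
```

Which mean to report depends on the question being asked: nipMean describes the values that are actually known, while nilMean keeps the total consistent with the row count.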

Random Tests for Determined Validity

Only testing convenience data may be the most common testing flaw other than the my-code-has-no-bugs delusion. Probably every programmer has blown a foot off at least once by only testing the first convenient section of data records. I am a little ashamed to admit it, but I have blown a foot off myself. Don't be embarrassed yourself. Use Example 4 to retrieve a random sample of records to test on your next project.

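import java.util.Random;

public class Sample {
    private int numberSamples;
    private int maxValue;
    private Integer[] samples;
    private Random generator;
    // Per-instance cursor; the listing declared it static.
    private int currentSample = 0;
    Sample(int maximum, int numberSamples) {
        this.maxValue = maximum;
        this.numberSamples = numberSamples;
        this.generator = new Random();
        this.samples = new Integer[numberSamples];
        init();  // assumed: populate the samples up front
    }
    public int next() {
        // Assumed: the body of this test is missing from the listing; wrap around
        // at the end of the samples array (the listing compared against maxValue).
        if (currentSample == numberSamples) {
            currentSample = 0;
        }
        currentSample++;
        return samples[currentSample - 1];
    }
    void init() {
        for (int count = 0; count < numberSamples; count++) {
            // Assumed: the right-hand side is cut off in the listing; rows are
            // taken to be numbered 1 through maxValue.
            samples[count] = generator.nextInt(maxValue) + 1;
        }
    }
}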

Example 4

The example above is pretty simple. It is designed for testing rows of data. The parameter, maximum, is the highest row number of our population. The number of samples you wish to test is passed in numberSamples. The init() method populates the samples array with a list of random row numbers. The next() method returns the next random row number, so you can use it to retrieve a row of data and then verify that the data is correct.
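The workflow described above (pick random row numbers, fetch each row, verify it) might look like the following sketch. Everything in it is an assumption made for illustration: RowSpotCheck, pickRows, isValid and failedRows are hypothetical names, the rows are an in-memory list of strings with zero-based indexes, and "valid" simply means non-empty.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RowSpotCheck {
    // Picks numberSamples random row indexes in the range [0, rowCount).
    static int[] pickRows(Random generator, int rowCount, int numberSamples) {
        int[] picks = new int[numberSamples];
        for (int i = 0; i < numberSamples; i++) {
            picks[i] = generator.nextInt(rowCount);
        }
        return picks;
    }

    // Hypothetical validity rule: a row passes if it is non-null and non-empty.
    static boolean isValid(String row) {
        return row != null && !row.isEmpty();
    }

    // Returns the sampled row indexes whose rows failed the check.
    static List<Integer> failedRows(List<String> rows, int[] picks) {
        List<Integer> failed = new ArrayList<>();
        for (int index : picks) {
            if (!isValid(rows.get(index))) {
                failed.add(index);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // Build 1,000 made-up rows; every 100th row is deliberately empty.
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            rows.add(i % 100 == 99 ? "" : "row-" + i);
        }
        int[] picks = pickRows(new Random(), rows.size(), 25);
        System.out.println("Failed sampled rows: " + failedRows(rows, picks));
    }
}
```

Because the picks are drawn independently, the same row can come up more than once, and a small sample will often miss rare bad rows entirely.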

A couple of things to remember about the standard Random class. First, you should instantiate Random only once during any run. On older JVMs, which seeded the default constructor from the system clock, it is possible to generate the same list of random numbers if you repeatedly instantiate Random in quick succession. Second, if you seed Random with the same seed number, it will always yield the same list of random numbers. Thus, computer randomness is really an illusion. The random numbers are predetermined given any known seed. That is not really a problem. Some of us who believe in the noetical nature of predeterminism think the physical universe works the same way -- chaos dissolves as knowledge becomes deeper and broader. For practical purposes, however, it is only important to remember to either let the library self-seed by not supplying a seed number or seed Random with a different number every time.
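The seeding behavior is easy to see in a few lines. This is a minimal sketch, not part of the article's listings: SeedDemo and draw are hypothetical names, and the seed value 42 is arbitrary.

```java
import java.util.Arrays;
import java.util.Random;

public class SeedDemo {
    // Draws howMany values in [0, bound) from the given generator.
    static int[] draw(Random generator, int howMany, int bound) {
        int[] values = new int[howMany];
        for (int i = 0; i < howMany; i++) {
            values[i] = generator.nextInt(bound);
        }
        return values;
    }

    public static void main(String[] args) {
        // Two generators built from the same seed produce the same sequence.
        int[] first = draw(new Random(42L), 5, 100);
        int[] second = draw(new Random(42L), 5, 100);
        System.out.println(Arrays.equals(first, second)); // prints "true"

        // A self-seeded generator will almost certainly differ from run to run.
        int[] selfSeeded = draw(new Random(), 5, 100);
        System.out.println(Arrays.toString(selfSeeded));
    }
}
```

A fixed seed is still useful on purpose: recording the seed of a failing test run lets you replay exactly the same sample of rows.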
