*Don has 25 years of professional programming experience in more languages than he can remember, from as obscure as Forth to as popular as C. Don can be contacted at [email protected].*

Years ago I reviewed a complete bookkeeping system built with nothing but SQL. Many tasks were a stretch for a single-purpose language. Adding wrappers in a general-purpose language (like C) would have significantly simplified the code and certainly its operation. I asked the developer about it. He said, "If all you have is a hammer, everything looks like a nail." I have regularly borrowed the quip over the years. It holds true for the software tools that finally have an approved purchase order, the languages we learn, the design paradigms we assimilate, and, if we honestly admit it, the little math we remember from college. Too often we form our solutions in the image of the tools that are convenient.

Most programming projects rarely use statistics beyond a simple average or a minimum to maximum range. That is a shame because many programming projects would be improved with a quick thumb through our old college statistics textbook. Let's review some math, talk about a few physical realities, and look at a few concrete examples of realizing math in a virtual world with Java.

The examples I present here are all univariate, for simple illustration. All of them also assume we are looking at a full population and not a sample. Samples require a slight modification to some of the formulas. Simple univariate population calculations make better examples in this basic article.
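To make the population-versus-sample remark concrete, here is a minimal sketch using variance as the example formula. The class and method names (`VarianceDemo`, `populationVariance`, `sampleVariance`) are hypothetical and not part of the article's source code. The only difference is the divisor: N for a population, N - 1 for a sample.

```java
// Hypothetical illustration; not part of the article's downloadable source.
class VarianceDemo {
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // Sum of squared deviations from the mean
    static double sumOfSquares(double[] values) {
        double m = mean(values);
        double total = 0.0;
        for (double v : values) {
            total += (v - m) * (v - m);
        }
        return total;
    }

    // Population formula: divide by N
    static double populationVariance(double[] values) {
        return sumOfSquares(values) / values.length;
    }

    // Sample formula: divide by N - 1
    static double sampleVariance(double[] values) {
        return sumOfSquares(values) / (values.length - 1);
    }

    public static void main(String[] args) {
        double[] data = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.println(populationVariance(data)); // prints 4.0
        System.out.println(sampleVariance(data));     // 32/7, roughly 4.57
    }
}
```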

I encourage you to download the complete source code that's available here and keep it at hand while reading the article. The source code can be executed with the supplied junit classes. That is my way of encouraging test-driven development, one of the best improvements to our craft in the last 25 years.

### Calculating Univariate Frequency and Mode

A univariate frequency distribution is a single variable distribution. For example, tallying the weight ranges of a population or tallying age groups of a population are univariate frequency distributions. Univariate analysis tallies weight and age separately. Bivariate analysis would tally weight ranges by age groups. Multivariate analysis tallies variables grouped by any number of variables.
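The difference between univariate and bivariate tallies can be sketched in a few lines. Everything here is hypothetical and chosen only for illustration: the `Person` type, the group labels, and the sample data are assumptions, not taken from the article's source.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; not part of the article's downloadable source.
class TallyDemo {
    // Hypothetical observation: one person with an age group and a weight range
    static final class Person {
        final String ageGroup;
        final String weightRange;

        Person(String ageGroup, String weightRange) {
            this.ageGroup = ageGroup;
            this.weightRange = weightRange;
        }
    }

    // Univariate: tally age groups on their own
    static Map<String, Integer> tallyAgeGroups(List<Person> people) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Person p : people) {
            counts.merge(p.ageGroup, 1, Integer::sum);
        }
        return counts;
    }

    // Bivariate: tally weight ranges within each age group
    static Map<String, Integer> tallyWeightByAge(List<Person> people) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Person p : people) {
            counts.merge(p.ageGroup + " / " + p.weightRange, 1, Integer::sum);
        }
        return counts;
    }

    // Made-up sample data
    static List<Person> sample() {
        return Arrays.asList(
                new Person("18-39", "50-69kg"),
                new Person("18-39", "70-89kg"),
                new Person("40-64", "70-89kg"),
                new Person("40-64", "70-89kg"));
    }

    public static void main(String[] args) {
        System.out.println(tallyAgeGroups(sample()));
        // prints {18-39=2, 40-64=2}
        System.out.println(tallyWeightByAge(sample()));
        // prints {18-39 / 50-69kg=1, 18-39 / 70-89kg=1, 40-64 / 70-89kg=2}
    }
}
```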

In addition to the number of variables, there are also different types of variables. A nominal, or qualitative, variable differs in quality but not quantity. An example of a nominal variable is color. We normally create a map of colors (for example, (0,black), (1,red), … ) and use the numeric values to represent color. Tallying the modal (most frequently occurring) qualitative value(s) makes perfect sense. For example, determining that black occurred 24 times, that red occurred more than any other value, or that red and yellow occurred more than the other values is meaningful. However, some statistical calculations on qualitative values are meaningless. Red cannot be the mean, or average, color. The numbers assigned in the map are merely symbols for the colors and not actual measured quantities. So statistics like mean and variance have no meaning for nominal variables.
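As a small sketch of the point about nominal variables, the following tallies color codes and reports the modal color or colors. The class name, the three-color map, and the sample data are hypothetical. The mode is meaningful here, while the arithmetic mean of the codes is not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; not part of the article's downloadable source.
class NominalModeDemo {
    // Hypothetical code-to-name map in the spirit of (0,black), (1,red), ...
    static final String[] COLOR_NAMES = {"black", "red", "yellow"};

    // Returns the name of every color that shares the highest frequency
    static List<String> modalColors(List<Integer> codes) {
        Map<Integer, Integer> counts = new TreeMap<>();
        int highest = 0;
        for (int code : codes) {
            int count = counts.merge(code, 1, Integer::sum);
            highest = Math.max(highest, count);
        }
        List<String> modes = new ArrayList<>();
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            // Integer compared with int: unboxed, so this is a numeric comparison
            if (entry.getValue() == highest) {
                modes.add(COLOR_NAMES[entry.getKey()]);
            }
        }
        return modes;
    }

    public static void main(String[] args) {
        // red, black, red, yellow, yellow
        List<Integer> observed = Arrays.asList(1, 0, 1, 2, 2);
        System.out.println(modalColors(observed)); // prints [red, yellow]
        // The mean of the codes is 1.2, which corresponds to no color at all.
    }
}
```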

Rank order or ordinal variables have a dimension of quantity but not with precision. The pain scale used in hospitals is an example of an ordinal variable. It is immaterial if the scale starts at 1 and ends with 10, starts with 0 and ends with 5, or even if the larger number indicates more pain. The exact numbers have no meaning. It is only a vehicle to gauge relative pain. You can calculate the mean for an ordinal value. However, it is sometimes not as useful as the mean of quantitative variables. Often ordinal scales are constructed with the average condition as the middle value. For example, a middle value described as, "I feel normal" would naturally calculate as the mean/median value because the scale was constructed with the middle value being the norm.

Another type of variable is a quantitative variable. A quantitative variable is a variable that has numeric significance. For example, the maximum year of school, the number of children in a family, and the liters of water per container are all quantitative variables.

Quantitative variables can be either discrete or continuous. A discrete variable has a finite number of possible values between any two points. For example, there are no values between three and four children. The number of children per family is a discrete variable. A continuous variable has an infinite number of possible values between any two points. For example, between 1.0 liters and 1.1 liters, a container may hold 1.01, 1.001, 1.0001, or any other arbitrarily precise number of liters. The number of liters in a container is a continuous variable. However, the distinction between discrete and continuous is often blurry, and is more a function of our capacity to measure than of the kind of data involved. Deciliters may be the smallest unit of measurement possible with our instruments. The instrument may randomly report 1.03, 1.02, 0.99, or 1.01, but we are only confident that the container holds more than 0.95 and less than 1.05 liters in those measurements. Our measurement is only significant to one decimal place. So, the frequency for those four values should be four containers with 1.0 liters.
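The four-container example can be checked with a short sketch. The class and method names (`BinningDemo`, `roundToPlaces`, `tally`) are hypothetical. It assumes half-up rounding to one decimal place through `BigDecimal`, which is one reasonable way to form the bins.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; not part of the article's downloadable source.
class BinningDemo {
    // Rounds a reading to the given number of decimal places (half-up)
    static double roundToPlaces(double value, int places) {
        return new BigDecimal(Double.toString(value))
                .setScale(places, RoundingMode.HALF_UP)
                .doubleValue();
    }

    // Tallies how many readings fall into each rounded bin
    static Map<Double, Integer> tally(double[] readings, int places) {
        Map<Double, Integer> counts = new TreeMap<>();
        for (double reading : readings) {
            counts.merge(roundToPlaces(reading, places), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] liters = {1.03, 1.02, 0.99, 1.01};
        System.out.println(tally(liters, 1)); // prints {1.0=4}
    }
}
```

All four readings land in the same 1.0-liter bin, so the tally reports a frequency of four.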

Example 1 works for all the different types of variables I have discussed: quantitative and qualitative; continuous and discrete. The **roundData** method converts continuous variables into discrete variables by rounding data into discrete bins. The method **getAccuracy** returns the number of decimal digits that should be considered significant. If it returns zero, data is rounded to the nearest integer. If it returns one, data is rounded to the nearest tenth. A negative one indicates no rounding. We always bypass rounding for qualitative variables because rounding distorts the data. The qualitative pairs (9, red) and (11, brown) should never round to (10, green).

Truly continuous data requires a different approach because the math is different. For example, the probability of observing any absolutely exact value is effectively zero, because a probability density function assigns probability to ranges of values rather than to single points. There is practically no chance of measuring a sample with a mass of exactly 1 gram, to the atom. So, Example 1 assumes that all continuous data will be rounded into discrete bins. These rounding bins mimic discrete data points to make my examples work. Example 1 also correctly handles multimodal populations because, unlike sword-carrying immortals, there can be more than one. The mode(s) are stored in the modes **ArrayList**.

```java
// Excerpt from the statistics class. DataPoint, getList(), and the
// accuracy field are defined elsewhere in the full source.
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.ArrayList;
import java.util.Map;

private ArrayList<Double> modes;
private Map<Double, Integer> uniqueValues;

public void tallyFrequencyModes() {
    // Primitive int so the comparisons below are numeric, not by reference
    int highestFrequency = 0;
    for (DataPoint dataPoint : getList()) {
        // Only tally data that is valid
        if (dataPoint.isValid()) {
            // Round data based on the configured accuracy
            double effectiveValue = roundData(dataPoint);
            Integer frequency = uniqueValues.get(effectiveValue);
            frequency = (frequency == null) ? 1 : frequency + 1;
            uniqueValues.put(effectiveValue, frequency);
            if (frequency > highestFrequency) {
                // Set the max frequency to the new max
                highestFrequency = frequency;
            }
        }
    }
    // Important note: there can be more than one mode,
    // so the modal values are stored in a list
    for (Map.Entry<Double, Integer> entry : uniqueValues.entrySet()) {
        if (entry.getValue() == highestFrequency) {
            modes.add(entry.getKey());
        }
    }
}

double roundData(DataPoint d) {
    if (getAccuracy() != -1) {
        BigDecimal bigDecimal = new BigDecimal(Double.toString(d.getNumber()));
        bigDecimal = bigDecimal.setScale(this.accuracy, RoundingMode.HALF_UP);
        return bigDecimal.doubleValue();
    } else {
        return d.getNumber();
    }
}

// If accuracy is -1, no rounding occurs.
// If accuracy is greater than -1, it is the number of decimal digits to round to.
public int getAccuracy() {
    return this.accuracy;
}
```
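One detail matters when frequencies are held as boxed `Integer` values, as they are in **uniqueValues**. In Java, `==` between two `Integer` objects compares references, and only values from -128 to 127 are guaranteed to share cached objects. Two equal counts above that range can therefore fail an `==` test, which would silently drop a tied mode. The sketch below uses a hypothetical helper, `sameCount`, to show a comparison by value.

```java
// Hypothetical illustration; not part of the article's downloadable source.
class BoxedCompareDemo {
    // Compares two boxed counts by value rather than by reference
    static boolean sameCount(Integer a, Integer b) {
        return a != null && a.equals(b);
    }

    public static void main(String[] args) {
        Integer first = 1000;
        Integer second = 1000;
        // Reference comparison: typically false for values outside the cache
        System.out.println(first == second);
        // Value comparison: true
        System.out.println(sameCount(first, second));
        // Unboxing one side also forces a numeric comparison
        System.out.println(first.intValue() == second);
    }
}
```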