Algorithm Alley

Tom uses frequency distribution to explore if pseudorandom numbers really are. In particular, he examines a statistical method known as the "chi-square distribution."

December 01, 1993
URL:http://www.drdobbs.com/database/algorithm-alley/184409132

Figure 1

Figure 1

DEC93: ALGORITHM ALLEY

ALGORITHM ALLEY

Heads I Win, Tails You Lose

Tom Swan

Among programmers, the search for a better mousetrap is matched only by the hunt for the ultimate random-number generator. Or, perhaps I ought to call it a random-sequence generator, because, as you probably know, computers can't generate random numbers. They can produce only pseudo-random numbers--values that, when juxtaposed, seem genuinely random.

Stock-market prices and lottery numbers are good examples of true random numbers. Never mind whether those values satisfy someone's favorite numerical test for randomness--if the Dow Jones industrial average didn't behave randomly, everyone would be wealthy. In fact, the Daily Lottorama is probably a better source of unpredictable numbers than the random-number generator supplied with many compilers.

I'll probably hear from mathematicians about this, but when it comes to unpredictability, only real-world future events can possibly be random. Numerical sequences generated by a computer are repeatable and are therefore unpredictable by definition. They are not random. They merely possess the characteristics of randomness. A random sequence, for instance, may have an even frequency distribution, an unpredictable "gap length," uniformly distributed pairs of successive numbers, and other qualities devised by researchers as tests of randomness. Some have even suggested that a definition for a random sequence is impossible, but I'll stick my neck out and offer my own recursive rule of thumb: A random sequence is any series of values that can't be proved to be nonrandom.

It's a clear case of heads I win, tails you lose. You cannot prove that a random-number generator is working; all you can say for sure is whether the algorithm behind the function has failed.

In Rand We Trust

Why not trust the random-number generator that came with your compiler? There are several reasons why that may not be wise:

Few random-number generators are supplied with test results. At the very least, a random-number generator should be accompanied with a suitable test suite, though this is rarely done. When was the last time you tested your compiler's Rand function?
A random-number generator should include an option for producing non-random sequences, useful for debugging. You might, for example, test your code with a broken generator--one that consistently returns 0 or 1, or that produces an obviously nonrandom sequence such as 1, 2, 3, . . ., N.
Compiler manufacturers have been known to alter a random function's algorithm without notice, adversely affecting simulation, cryptography, game, and other software that might be sensitive to a particular generator. It's especially important to use your own random-number generator in programs that must produce a repeatable sequence from a specific starting value, or seed.
Programs that depend heavily on random numbers might benefit from a generator that has been optimized for the computer's word size, or one written in assembly language.
Random-number generators can be combined to produce sequences more random than those produced by individual methods. Because this technique tends to waste memory, and the functions are less efficient than those that use simpler algorithms, few compiler manufacturers use this method for stock generators. If you need a combination generator, you probably have to write one yourself.

In his quintessential opus, The Art of Computer Programming, Vol. 2, Seminumerical Algorithms, Donald Knuth proposed nine tests for determining randomness. Perhaps, however, we should consider them to be tests of nonrandomness, because the algorithms are particularly adept at weeding out generators that are not producing usable random sequences. Before writing your own random-number generator, it's important to know how to implement at least one of these tests.

The best-known analysis examines a random sequence's frequency of distribution. The test is easily performed, but the results are often misinterpreted. Let's take a look at the method's inner workings, then apply it to your compiler's random-number generator. Next month, I'll list random-number algorithms that you can implement and test using the algorithm explained here.

To measure a frequency distribution, we need to select a number of categories (k) and a number of tests to perform (N). The tests must have independent outcomes--that is, test N-1 must have no bearing on the results of any other test. Too many categories are unwieldy, so K is usually set to a small value such as 100. In other words, we will test numbers selected at random in 100 categories from 0 to 99. N should be relatively large, say 1000, 10,000, or even higher. (Robert Sedgewick suggests performing at least K*10 tests.) The algorithm uses an array of integers to record the frequency of numbers generated at random within the defined range. After the test, the array's counts are processed using a statistical method known as the chi-square distribution; see Figure 1.

The chi-square distribution formula has been published in a variety of derivations, but the one in Knuth's text is among the simplest to understand and implement. If mathematical formulas make you glassy eyed, bear with me. It's not as difficult to comprehend as it may appear, and besides, you don't have to understand the formula to use it. (Knuth provides an excruciatingly detailed description of the chi-square statistical method.) Briefly stated, the formula sums the squares of the random-number counts (Y) divided by probability (P). Because each number should be equally probable if the generator is working, P should be set to 1/K. (Remember, K is the number of test categories, 100 in this example.) The reason for using the chi-square formula rather than directly comparing random-number occurrences and their probabilities is that the statistical method mirrors true randomness as found in the real world (wherever that is). We don't want each number to be generated with a frequency of exactly 1/100. That would hardly be random!

Table 1 shows a portion of a chi-square distribution table that you can use to analyze the formula's results. I extracted the table's data from a portion of a Microsoft Excel spreadsheet included with this month's files. If you have Excel, open file CHISQR.XLS and enter a starting value into cell F2 (not shown here). For example, to produce the figures in Table 1, enter 95. The other cells are protected, so you can't accidentally enter a value into the wrong location. When creating new tables, be patient--a complete recalculation takes a half minute or longer if your system doesn't have a math coprocessor.

To use Table 1, determine the number of degrees of freedom used for the test, equal to one less than the number of categories. For example, given 100 categories, examine the expected values in row 99. According to the table, the chi-square formula should yield 98.33414 about 50 percent of the time (the middle column). The leftmost column tells you that, 99 percent of the time, V should be greater than 69.23. Or, to put that another way, V should be less than 69.23 no more than about 1 percent of the time. The right-most columns in row 99 indicate that V should be greater than 117.41 no more than about 10 percent of the time. A simple way to use the table is to look at the 70-40 percent columns and ignore the others. If V is between 91.12 and 101.93 most of the time, the random-number generator is probably working.

Example 1, Algorithm #14, lists the Pascal code for the frequency distribution test using the chi-square formula. The algorithm assumes that the array Data holds the counts of random numbers generated for a range 0. . .degree-1. RANDTEST.PAS, Listing One (page 136) implements the formula and uses it to test Borland Pascal's random-number generator. When I ran the program several times, specifying 1000 iterations for each run, I received the chi-square results 112.0, 100.8, 94.8, 99.6, 87.2, and 105.8. These values are clustered around the 50 percent column for row 99, so it's reasonable to conclude that Borland Pascal's random-number generator is working (or, at least, it isn't broken according to the frequency-distribution test). Enable the test program's DEBUG symbol by deleting the space before the dollar sign in $DEFINE, and rerun RANDTEST to examine the results of a degenerate generator, which I included in the program just for fun.

On one occasion, I received a chi-square result of 77.4 from RANDTEST. Does this mean that Borland Pascal's random-number generator sometimes produces nonrandom sequences? Not at all! From Table 1, we should expect to receive a similarly low value about once in every ten trials. In fact, if the results were right on the money in the 50 percent column every time, the generator would be as suspect as it would be if it produced results that were out of whack on every go-around. After all, it's a random-number generator, not a Swiss watch.

Your Turn

I almost forgot to mention some notable advice from Robert Sedgewick's Algorithms in C++. Keep this one in mind the next time your statistical-analysis application develops a bug. If the program fails, always blame the random-number generator.

Next time: random number algorithms you can implement in most programming languages.

Figure 1: Chi-square distribution formula.

Example 1: Pascal code for Algorithm #14 (chi-square distribution).

const
  degree = 100;  { Test range is 0 .. degree - 1 }
var
  Data: array[0 .. degree - 1] of LongInt;
function ChiSquare(N: LongInt; R: Word): Double;
var
  V, P: Double;
  I: Integer;
begin
  V := 0;     { Initialize result }
  P := 1 / R; { Probability of Random(R) }
  for I := 0 to R - 1 do
    V := V + (Sqr(Data[I]) / P);
  ChiSquare := ((1.0 / N) * V) - N
end;

Table 1. Chi-square distribution probabilities

Degrees of
freedom 99%       90%       70%       50%       40%       10%       1%
95      65.89826  77.81844  87.31749  94.33416  97.8549   113.0377  129.9725
96      66.73003  78.72541  88.27938  95.33417  98.8733   114.1307  131.1411
97      67.56234  79.63287  89.24148  96.33415  99.8916   115.2232  132.3089
98      68.39571  80.54082  90.20378  97.33415  100.9098  116.3153  133.4756
99      69.22986  81.44925  91.16627  98.33414  101.9279  117.4069  134.6415
100     70.065    82.35813  92.12895  99.33413  102.9459  118.498   135.8069
101     70.90072  83.26747  93.09182  100.3341  103.9639  119.5887  136.9711
102     71.7373   84.17727  94.05486  101.3341  104.9817  120.6789  138.1343



[LISTING ONE]

(* ----------------------------------------------------------- *(
**  randtest.pas -- Random sequence test program               **
**  Copyright (c) 1993 by Tom Swan. All rights reserved.       **
)* ----------------------------------------------------------- *)

{ $DEFINE DEBUG}    { Delete space before $ for debugging }

{$N+,E+}  { Required for 8087 mode }
{$R+}     { Halt on range errors }

program RandTest;
uses Dos;
const
  degree = 100;     { Test range is 0 .. degree - 1 }
var
  Data: array[0 .. degree - 1] of LongInt;

{$IFDEF DEBUG}
const
  a: LongInt = 0;

{ Degenerate generator }
function Rand: Word;
begin
  inc(a);
  Rand := a;
end;

{ Degenerate randomizer }
procedure Scramble;
begin
  a := -1
end;

{$ELSE}

{ Call Borland Pascal's Random function }
function Rand: Word;
begin
  Rand := Random(65535)
end;

{ Call Borland Pascal's Randomize function }
procedure Scramble;
begin
  Randomize
end;

{$ENDIF}

{ Return a word at random in the range 0 to M-1 }
function RandMod(M: Word): Word;
begin
  RandMod := Rand MOD M
end;

{ Perform Chi-Square analysis of counts in Data array }
function ChiSquare(N: LongInt; R: Word): Double;
var
  V, P: Double;
  K: Word;
  I: Integer;
begin
  V := 0;     { Initialize result }
  P := 1 / R; { Probability of RandMod(R) }
  for I := 0 to R - 1 do
    V := V + (Sqr(Data[I]) / P);
  ChiSquare := ((1.0 / N) * V) - N
end;

var
  N: LongInt;     { Number of tests to perform }
  I: LongInt;     { For-loop control variable }
  E: Integer;     { Error result for val function }
begin
  if ParamCount = 0 then
  begin
    Writeln;
    Writeln('Random Sequence Test');
    Writeln('(C) 1993 by Tom Swan');
    Writeln;
    Writeln('Enter number of tests to perform.');
    Writeln;
    Writeln('ex. randtest 10000')
  end else
  begin
    Scramble;
    val(ParamStr(1), N, E);
    if E <> 0 then
    begin
      Writeln(ParamStr(1));
      Writeln('^':E, '--- Error!')
    end else
    begin
      if N < 10 * degree then
      begin
        Writeln;
        Writeln('WARNING: For accurate results, run program');
        Writeln(' with at least ', 10 * degree, ' tests.')
      end;
      Writeln;
      Writeln('Test range: 0 to ', degree - 1);
      Write('Performing ', N, ' tests...');
      for I := 0 to degree - 1 do
        Data[I] := 0;
      for I := 1 to N do
        Inc(Data[RandMod(degree)]);
      Writeln;
      Writeln('Chi-Square result = ',
        ChiSquare(N, degree):0:4)
    end
  end
end.