Tom uses frequency distribution to explore if pseudorandom numbers really are. In particular, he examines a statistical method known as the "chi-square distribution."
December 01, 1993
URL:http://www.drdobbs.com/database/algorithm-alley/184409132
Copyright © 1993, Dr. Dobb's Journal
Copyright © 1993, Dr. Dobb's Journal
Among programmers, the search for a better mousetrap is matched only by the hunt for the ultimate random-number generator. Or, perhaps I ought to call it a random-sequence generator, because, as you probably know, computers can't generate random numbers. They can produce only pseudo-random numbers--values that, when juxtaposed, seem genuinely random.
Stock-market prices and lottery numbers are good examples of true random numbers. Never mind whether those values satisfy someone's favorite numerical test for randomness--if the Dow Jones industrial average didn't behave randomly, everyone would be wealthy. In fact, the Daily Lottorama is probably a better source of unpredictable numbers than the random-number generator supplied with many compilers.
I'll probably hear from mathematicians about this, but when it comes to unpredictability, only real-world future events can possibly be random. Numerical sequences generated by a computer are repeatable and are therefore unpredictable by definition. They are not random. They merely possess the characteristics of randomness. A random sequence, for instance, may have an even frequency distribution, an unpredictable "gap length," uniformly distributed pairs of successive numbers, and other qualities devised by researchers as tests of randomness. Some have even suggested that a definition for a random sequence is impossible, but I'll stick my neck out and offer my own recursive rule of thumb: A random sequence is any series of values that can't be proved to be nonrandom.
It's a clear case of heads I win, tails you lose. You cannot prove that a random-number generator is working; all you can say for sure is whether the algorithm behind the function has failed.
Why not trust the random-number generator that came with your compiler? There are several reasons why that may not be wise:
In his quintessential opus, The Art of Computer Programming, Vol. 2, Seminumerical Algorithms, Donald Knuth proposed nine tests for determining randomness. Perhaps, however, we should consider them to be tests of nonrandomness, because the algorithms are particularly adept at weeding out generators that are not producing usable random sequences. Before writing your own random-number generator, it's important to know how to implement at least one of these tests.
The best-known analysis examines a random sequence's frequency of distribution. The test is easily performed, but the results are often misinterpreted. Let's take a look at the method's inner workings, then apply it to your compiler's random-number generator. Next month, I'll list random-number algorithms that you can implement and test using the algorithm explained here.
To measure a frequency distribution, we need to select a number of categories (k) and a number of tests to perform (N). The tests must have independent outcomes--that is, test N-1 must have no bearing on the results of any other test. Too many categories are unwieldy, so K is usually set to a small value such as 100. In other words, we will test numbers selected at random in 100 categories from 0 to 99. N should be relatively large, say 1000, 10,000, or even higher. (Robert Sedgewick suggests performing at least K*10 tests.) The algorithm uses an array of integers to record the frequency of numbers generated at random within the defined range. After the test, the array's counts are processed using a statistical method known as the chi-square distribution; see Figure 1.
The chi-square distribution formula has been published in a variety of derivations, but the one in Knuth's text is among the simplest to understand and implement. If mathematical formulas make you glassy eyed, bear with me. It's not as difficult to comprehend as it may appear, and besides, you don't have to understand the formula to use it. (Knuth provides an excruciatingly detailed description of the chi-square statistical method.) Briefly stated, the formula sums the squares of the random-number counts (Y) divided by probability (P). Because each number should be equally probable if the generator is working, P should be set to 1/K. (Remember, K is the number of test categories, 100 in this example.) The reason for using the chi-square formula rather than directly comparing random-number occurrences and their probabilities is that the statistical method mirrors true randomness as found in the real world (wherever that is). We don't want each number to be generated with a frequency of exactly 1/100. That would hardly be random!
Table 1 shows a portion of a chi-square distribution table that you can use to analyze the formula's results. I extracted the table's data from a portion of a Microsoft Excel spreadsheet included with this month's files. If you have Excel, open file CHISQR.XLS and enter a starting value into cell F2 (not shown here). For example, to produce the figures in Table 1, enter 95. The other cells are protected, so you can't accidentally enter a value into the wrong location. When creating new tables, be patient--a complete recalculation takes a half minute or longer if your system doesn't have a math coprocessor.
To use Table 1, determine the number of degrees of freedom used for the test, equal to one less than the number of categories. For example, given 100 categories, examine the expected values in row 99. According to the table, the chi-square formula should yield 98.33414 about 50 percent of the time (the middle column). The leftmost column tells you that, 99 percent of the time, V should be greater than 69.23. Or, to put that another way, V should be less than 69.23 no more than about 1 percent of the time. The right-most columns in row 99 indicate that V should be greater than 117.41 no more than about 10 percent of the time. A simple way to use the table is to look at the 70-40 percent columns and ignore the others. If V is between 91.12 and 101.93 most of the time, the random-number generator is probably working.
Example 1, Algorithm #14, lists the Pascal code for the frequency distribution test using the chi-square formula. The algorithm assumes that the array Data holds the counts of random numbers generated for a range 0. . .degree-1. RANDTEST.PAS, Listing One (page 136) implements the formula and uses it to test Borland Pascal's random-number generator. When I ran the program several times, specifying 1000 iterations for each run, I received the chi-square results 112.0, 100.8, 94.8, 99.6, 87.2, and 105.8. These values are clustered around the 50 percent column for row 99, so it's reasonable to conclude that Borland Pascal's random-number generator is working (or, at least, it isn't broken according to the frequency-distribution test). Enable the test program's DEBUG symbol by deleting the space before the dollar sign in $DEFINE, and rerun RANDTEST to examine the results of a degenerate generator, which I included in the program just for fun.
On one occasion, I received a chi-square result of 77.4 from RANDTEST. Does this mean that Borland Pascal's random-number generator sometimes produces nonrandom sequences? Not at all! From Table 1, we should expect to receive a similarly low value about once in every ten trials. In fact, if the results were right on the money in the 50 percent column every time, the generator would be as suspect as it would be if it produced results that were out of whack on every go-around. After all, it's a random-number generator, not a Swiss watch.
I almost forgot to mention some notable advice from Robert Sedgewick's Algorithms in C++. Keep this one in mind the next time your statistical-analysis application develops a bug. If the program fails, always blame the random-number generator.
Next time: random number algorithms you can implement in most programming languages.
Copyright © 1993, Dr. Dobb's Journal
Terms of Service | Privacy Statement | Copyright © 2024 UBM Tech, All rights reserved.
In Rand We Trust
Testing for Nonrandomness
Your Turn
Figure 1: Chi-square distribution formula.
Example 1: Pascal code for Algorithm #14 (chi-square distribution).
const
degree = 100; { Test range is 0 .. degree - 1 }
var
Data: array[0 .. degree - 1] of LongInt;
function ChiSquare(N: LongInt; R: Word): Double;
var
V, P: Double;
I: Integer;
begin
V := 0; { Initialize result }
P := 1 / R; { Probability of Random(R) }
for I := 0 to R - 1 do
V := V + (Sqr(Data[I]) / P);
ChiSquare := ((1.0 / N) * V) - N
end;
Table 1. Chi-square distribution probabilities
Degrees of
freedom 99% 90% 70% 50% 40% 10% 1%
95 65.89826 77.81844 87.31749 94.33416 97.8549 113.0377 129.9725
96 66.73003 78.72541 88.27938 95.33417 98.8733 114.1307 131.1411
97 67.56234 79.63287 89.24148 96.33415 99.8916 115.2232 132.3089
98 68.39571 80.54082 90.20378 97.33415 100.9098 116.3153 133.4756
99 69.22986 81.44925 91.16627 98.33414 101.9279 117.4069 134.6415
100 70.065 82.35813 92.12895 99.33413 102.9459 118.498 135.8069
101 70.90072 83.26747 93.09182 100.3341 103.9639 119.5887 136.9711
102 71.7373 84.17727 94.05486 101.3341 104.9817 120.6789 138.1343
[LISTING ONE]
(* ----------------------------------------------------------- *(
** randtest.pas -- Random sequence test program **
** Copyright (c) 1993 by Tom Swan. All rights reserved. **
)* ----------------------------------------------------------- *)
{ $DEFINE DEBUG} { Delete space before $ for debugging }
{$N+,E+} { Required for 8087 mode }
{$R+} { Halt on range errors }
program RandTest;
uses Dos;
const
degree = 100; { Test range is 0 .. degree - 1 }
var
Data: array[0 .. degree - 1] of LongInt;
{$IFDEF DEBUG}
const
a: LongInt = 0;
{ Degenerate generator }
function Rand: Word;
begin
inc(a);
Rand := a;
end;
{ Degenerate randomizer }
procedure Scramble;
begin
a := -1
end;
{$ELSE}
{ Call Borland Pascal's Random function }
function Rand: Word;
begin
Rand := Random(65535)
end;
{ Call Borland Pascal's Randomize function }
procedure Scramble;
begin
Randomize
end;
{$ENDIF}
{ Return a word at random in the range 0 to M-1 }
function RandMod(M: Word): Word;
begin
RandMod := Rand MOD M
end;
{ Perform Chi-Square analysis of counts in Data array }
function ChiSquare(N: LongInt; R: Word): Double;
var
V, P: Double;
K: Word;
I: Integer;
begin
V := 0; { Initialize result }
P := 1 / R; { Probability of RandMod(R) }
for I := 0 to R - 1 do
V := V + (Sqr(Data[I]) / P);
ChiSquare := ((1.0 / N) * V) - N
end;
var
N: LongInt; { Number of tests to perform }
I: LongInt; { For-loop control variable }
E: Integer; { Error result for val function }
begin
if ParamCount = 0 then
begin
Writeln;
Writeln('Random Sequence Test');
Writeln('(C) 1993 by Tom Swan');
Writeln;
Writeln('Enter number of tests to perform.');
Writeln;
Writeln('ex. randtest 10000')
end else
begin
Scramble;
val(ParamStr(1), N, E);
if E <> 0 then
begin
Writeln(ParamStr(1));
Writeln('^':E, '--- Error!')
end else
begin
if N < 10 * degree then
begin
Writeln;
Writeln('WARNING: For accurate results, run program');
Writeln(' with at least ', 10 * degree, ' tests.')
end;
Writeln;
Writeln('Test range: 0 to ', degree - 1);
Write('Performing ', N, ' tests...');
for I := 0 to degree - 1 do
Data[I] := 0;
for I := 1 to N do
Inc(Data[RandMod(degree)]);
Writeln;
Writeln('Chi-Square result = ',
ChiSquare(N, degree):0:4)
end
end
end.