Sooner or later, it had to happen. My voice mail system received a telephone call from an automated polling computer conducting a religious survey, and as I discovered hours later, the two machines had a lengthy heart-to-heart talk all on their own. From the sound of the playback, my voice mail machine has acquired serious doubts about its place in the universe. Here's a sample of the conversation:

**Survey**: *"May I ask you a few questions?"*

**Survey**: *"Do you believe in God?"*

Voice mail: (tentatively) *"beep?"*

**Survey**: *"If you died today, are you sure you will go to heaven?"*

Voice mail: (concerned) *"beep!?"*

### Selection Sampling

Speaking of polls, one of a pollster's primary software tools is the selection-sampling algorithm--a technique for selecting *N* items at random from a set of any size *M*, where *N* is less than or equal to *M*. The method is useful not only in polling, but any time you need to reduce a large collection of records to a more manageable subset. Selecting names and phone numbers for a poll is an obvious use, but you could also use the technique in other ways: to sample disk sectors for performing diagnostic spot checks on a large disk drive, or in statistical work requiring samples to be taken at random from massive data sets.

The algorithm seems almost too simple to work, but it is guaranteed to produce exactly the number of samples requested from an input data set, regardless of size. Example 1, Algorithm #12, lists the selection-sampling algorithm in pseudocode. *M* represents the size of the input source, and is usually set to the total number of records in an input file. *N* is the desired number of records in the sample. Three integer variables keep track of the number of *requested*, *examined*, and *selected* records. A real number *r* holds a value selected at random so that *0 <= r < 1.0*. After initializing the integers, a While loop repeats until the number of selected samples equals the number requested. On each pass through the loop, the number of examined records is incremented and *r* is set to a random real number. To decide whether to use the next record from the input, an If statement tests the result of the expression (M-*examined)*r>=(requestedselected*). In English, the expression tests whether the number of records left to be examined times a random value is greater than or equal to the number of records remaining to be added to the subset. If the expression is true, the algorithm skips the next input record; otherwise, it increments *selected* and writes the next record to the output.

Algorithm #12 works because, as the number of sampled records grows, the chances that the If statement's control expression will be true increases proportionally. (The expression is always true, for example, when *requested* equals *selected*.) At first, it may seem that selection sampling would tend to sample too many records from one end of the input, but using *different* random numbers on each pass through the While loop ensures an even distribution in the sample. Believing this fact requires faith in probability--not to mention a well-tested random-number generator, but the algorithm works flawlessly, every time.

SAMPLE.PAS (Listing One, page 169) implements Algorithm #12 in Pascal. The demonstration program reads a text file containing one record per line. For test purposes, I used Grady Ward's Moby Words database of 21,400 common first names, but any list of names or other text records will do. Set constant *M* to the number of input records in your data source, and change INFNAME and OUTFNAME to your input and output file names. Run the program and enter the number of records you want to sample. Even with a huge input source, the selection-sampling algorithm requires only a single pass through the file, making the method very fast--in a few seconds, I can create a subset of any size from my input source of 21,400 records. I often use the program to create sample input files for testing other algorithms.

### Permutations in Pairs

My phone may lead a life of its own, but fortunately, the mail still gets through, and I've heard from many readers on subjects ranging from text-compression algorithms to source-code parsers. Author Jim Mischel sent a unique permutation algorithm and implementation written in Pascal for preparing team pairings for sporting events and tournaments. Jim writes, "Recently, I worked on a problem that ties in with the permutation algorithms in the June issue, 'Telephonic Mnemonics and the Chocolate Coefficient' (*DDJ*, June 1993). On CompuServe in DDJFORUM, somebody asked for a method to generate sports-team pairings so each team plays the others once. That's easily accomplished by adapting a selection-sort algorithm, shown in pseudocode in Example 2."

As Jim goes on to say, "given four teams, the algorithm produces the output sequence 1-2, 1-3, 1-4, 2-3, 2-4, 3-4--in other words, a permutation of all possible team pairings. But one fact complicates the solution: No team may play more than one game in a day. To fulfill that requirement, consider that the number of unique pairings of *N* teams equals 1+2+3 ...+(*N*-1), or (*N**(*N*-1))/2. Also, if *N* is even, it takes N*-*1 days for all games to be played. If *N* is odd, it takes *N* days. (With three teams, for example, at one match per day, the pairings are 1-2, 1-3, and 2-3.)

"Experimenting with these ideas led to an insight. Take the pairings of four and five teams, shown in Table 1. An asterisk marks a day off for that team. In the five-team table, replacing the asterisks with the digit 6 produces pairings for six teams. Discovering that principal led to the final program, which generates pairings as though there were an even number of teams, replacing the last team with an asterisk for an odd number of teams.

"Table 2 shows the same pairings as integer values rearranged for each day, substituting the next higher number for asterisks.

"These are merely the permutations of the numbers from 1 to *N*! The Swap column shows the positions to exchange for generating the next permutation. In the six-team case, start with the first-day pairings of 1-2, 3-4, and 5-6. Then swap the positions of teams 2 and 3, and also swap the positions of teams 4 and 5, to generate the pairings for the next day: 1-3, 2-5, and 4-6. Swap the positions of teams 3 and 4, and of teams 5 and 6, to generate day three, and so on. PAIRINGS.PAS (Listing Two, page 169) implements this algorithm using Pascal to generate team pairings for sports events. Each team is guaranteed to play each other team exactly once, and no team plays more than one game per day. The code uses a touch of magic in the manipulation of the swap array at the end of the day loop, but is otherwise straightforward.

### Your Turn

Send me your thoughts, notes, and algorithms in care of DDJ, or post a message to my Compuserve ID, 73627,3241. Meanwhile, the next time you receive an unsolicited phone call from a high-pressure sales department, try this. Don't say anything. Just push a button on your touchtone phone in response to every question. Drives 'em nuts, and they hang up right away. I suppose this proves that I'm not too old to learn new tricks--even from a voice-mail machine with a mind of its own.

#### Example 1: Pseudocode for Algorithm #12 (selection sampling)

const M = 1000; { Input records } N = 128; { Subset (N <= M) } var requested, examined, selected: Integer; r: Real; begin requested <- N; examined <- 0; selected <- 0; while (selected < requested) do begin examined <- examined + 1; r <- Random; if (M - examined) * r >= (requested - selected) then skip next input record else begin selected <- selected + 1; use next input record end end end.

#### Example 2:

for x <- 1 to NTeams - 1 do for y <- x + 1 to NTeams do write(x, '-', y, ',');

#### Table 1: Four and five sports-team pairings.

4 teams 5 teams Day 1 Day 2 Day 3 Day 1 Day 2 Day 3 Day 4 Day 5 1-2 1-3 1-4 1-2 1-3 1-4 1-5 1-* 3-4 2-4 2-3 3-4 2-5 2-* 3-* 4-5 5-* 4-* 3-5 2-4 2-3

#### Table 2: Team pairings from Table 1 rearranged.

4 teams Swap 6 teams Swap Day 1 12 34 2&3 12 34 56 2&3, 4&5 Day 2 13 24 3&4 13 25 46 3&4, 5&6 Day 3 14 23 14 26 35 2&3, 4&5 Day 4 15 36 24 3&4, 5&6 Day 5 16 45 23

_ALGORITHM ALLEY_ by Tom Swan[LISTING ONE](* ----------------------------------------------------------- *( ** sample.pas -- Algorithm #12: Selection Sampling ** ** ------------------------------------------------------------** ** Creates a file SAMPLE.DAT with a specified number of names ** ** extracted from Grady Ward's Moby Words. The first line of ** ** the output file indicates the number of selections. ** ** Assumes the number of names in the source is known. ** ** Reference: Knuth, Vol 2, p122 ** ** ------------------------------------------------------------** ** Copyright (c) 1993 by Tom Swan. All rights reserved. ** )* ----------------------------------------------------------- *) program Sample; const M = 21420; { Number of records in source } INFNAME = 'g:\moby\words\21400nam'; { Source file } OUTFNAME = 'sample.dat'; { Destination file } var infile, outfile: Text; { File variables } word: String; { Holds each record from source } requested, { Requested number of samples } examined, { Total records examined } selected: Integer; { Total records selected } r: Real; { Random number 0 <= r < 1.0 } begin Randomize; Writeln('Write selected names to ', OUTFNAME); Write('How many names? '); Readln(requested); if (requested <= 0) or (requested > M) then begin Writeln('Number must be >= 0 and <= ', M); Exit end; examined := 0; selected := 0; Assign(infile, INFNAME); Reset(infile); Assign(outfile, OUTFNAME); Rewrite(outfile); Writeln(outfile, requested); { Save 'requested' in file } while (selected < requested) (* and (not Eof(infile)) *) do begin examined := examined + 1; r := Random; if (M - examined) * r >= requested - selected then Readln(infile) { Skip next record } else begin { Select next record } selected := selected + 1; { Count selections so far } Readln(infile, word); { Read record from source } Writeln(outfile, word); { Write record to destination } Writeln(word) { Echo selection to display } end end; Close(infile); Close(outfile) end.

**[LISTING TWO]**

<a name="02e5_000e"> (* ----------------------------------------------------------- *( ** pairings.pas -- Select sports-event team pairings ** ** ------------------------------------------------------------** ** This program generates team pairings for sports events. ** ** Each team is guaranteed to play each other team exactly ** ** once. No team will play more than one game per day. ** ** An asterisk ('*') means a day off for that team. ** ** For example, 5 teams produces this output: ** ** Day 1 - 12 34 5* ** ** Day 2 - 13 25 4* ** ** Day 3 - 14 2* 35 ** ** Day 4 - 15 3* 24 ** ** Day 5 - 1* 45 23 ** ** ------------------------------------------------------------** ** Copyright (c) 1993 by Jim Mischel. All rights reserved. ** )* ----------------------------------------------------------- *) program pairings; const TEAMCOUNT = 5; var TeamNames: Array [1 .. TEAMCOUNT + 1] of Char; SwapArray: Array [1 .. TEAMCOUNT + 1] of Integer; x, Temp, Day: Integer; TempChar: Char; const NTeams: Integer = TEAMCOUNT; begin { Set up team names. Normally read from a file. } for x := 1 to NTeams do TeamNames[x] := Chr(x + Ord('0')); if Odd(NTeams) then begin NTeams := NTeams + 1; TeamNames[NTeams] := '*' end; { Set up the array that controls swapping. } for x := 1 to NTeams do SwapArray[x] := x; for Day := 1 to NTeams - 1 do begin Write('Day ', Day, ' -'); { Write the team pairings for this day } x := 1; while x < NTeams do begin Write(' ', TeamNames[x], TeamNames[x + 1]); x := x + 2; end; WriteLn; { Perform swaps to prepare array for next day's pairings. } if Odd(Day) then x := 2 else x := 3; while x < NTeams do begin TempChar := TeamNames[SwapArray[x]]; TeamNames[SwapArray[x]] := TeamNames[SwapArray[x + 1]]; TeamNames[SwapArray[x + 1]] := TempChar; Temp := SwapArray[x]; SwapArray[x] := SwapArray[x + 1]; SwapArray[x + 1] := Temp; x := x + 2 end end end.

Copyright © 1993, *Dr. Dobb's Journal*