Database

Algorithm Alley

By Tom Swan, October 01, 1993

Tom presents a selection-sampling algorithm--a technique useful when reducing a large collection of records to a more manageable subset.

OCT93: ALGORITHM ALLEY

Sooner or later, it had to happen. My voice mail system received a telephone call from an automated polling computer conducting a religious survey, and as I discovered hours later, the two machines had a lengthy heart-to-heart talk all on their own. From the sound of the playback, my voice mail machine has acquired serious doubts about its place in the universe. Here's a sample of the conversation:

Survey: "May I ask you a few questions?"

Voice mail: (eagerly) "beep!"

Survey: "Do you believe in God?"

Voice mail: (tentatively) "beep?"

Survey: "If you died today, are you sure you will go to heaven?"

Voice mail: (concerned) "beep!?"

This goes on for 20 minutes! Apparently, my voice mail's "leave a message" beep was music to the polling system's ears, which seemed to accept any sound as a valid response. In turn, my machine's start-recording switch was reactivated by each new question. I doubt the exchange qualifies as a successful Turing Test, but if you tried my number and couldn't get through, at least now you know why. Some families have teenagers who monopolize the line, but in our house, we can now tell our friends, "Sorry to have missed your call, but our phone was on the phone."

Selection Sampling

Speaking of polls, one of a pollster's primary software tools is the selection-sampling algorithm--a technique for selecting N items at random from a set of any size M, where N is less than or equal to M. The method is useful not only in polling, but any time you need to reduce a large collection of records to a more manageable subset. Selecting names and phone numbers for a poll is an obvious use, but you could also use the technique in other ways: to sample disk sectors for performing diagnostic spot checks on a large disk drive, or in statistical work requiring samples to be taken at random from massive data sets.

The algorithm seems almost too simple to work, but it is guaranteed to produce exactly the number of samples requested from an input data set, regardless of size. Example 1, Algorithm #12, lists the selection-sampling algorithm in pseudocode. M represents the size of the input source, and is usually set to the total number of records in an input file. N is the desired number of records in the sample. Three integer variables keep track of the number of requested, examined, and selected records. A real number r holds a value selected at random so that 0 <= r < 1.0. After initializing the integers, a While loop repeats until the number of selected samples equals the number requested. On each pass through the loop, the number of examined records is incremented and r is set to a random real number. To decide whether to use the next record from the input, an If statement tests the result of the expression (M-examined)*r>=(requestedselected). In English, the expression tests whether the number of records left to be examined times a random value is greater than or equal to the number of records remaining to be added to the subset. If the expression is true, the algorithm skips the next input record; otherwise, it increments selected and writes the next record to the output.

Algorithm #12 works because, as the number of sampled records grows, the chances that the If statement's control expression will be true increases proportionally. (The expression is always true, for example, when requested equals selected.) At first, it may seem that selection sampling would tend to sample too many records from one end of the input, but using different random numbers on each pass through the While loop ensures an even distribution in the sample. Believing this fact requires faith in probability--not to mention a well-tested random-number generator, but the algorithm works flawlessly, every time.

SAMPLE.PAS (Listing One, page 169) implements Algorithm #12 in Pascal. The demonstration program reads a text file containing one record per line. For test purposes, I used Grady Ward's Moby Words database of 21,400 common first names, but any list of names or other text records will do. Set constant M to the number of input records in your data source, and change INFNAME and OUTFNAME to your input and output file names. Run the program and enter the number of records you want to sample. Even with a huge input source, the selection-sampling algorithm requires only a single pass through the file, making the method very fast--in a few seconds, I can create a subset of any size from my input source of 21,400 records. I often use the program to create sample input files for testing other algorithms.

Permutations in Pairs

My phone may lead a life of its own, but fortunately, the mail still gets through, and I've heard from many readers on subjects ranging from text-compression algorithms to source-code parsers. Author Jim Mischel sent a unique permutation algorithm and implementation written in Pascal for preparing team pairings for sporting events and tournaments. Jim writes, "Recently, I worked on a problem that ties in with the permutation algorithms in the June issue, 'Telephonic Mnemonics and the Chocolate Coefficient' (DDJ, June 1993). On CompuServe in DDJFORUM, somebody asked for a method to generate sports-team pairings so each team plays the others once. That's easily accomplished by adapting a selection-sort algorithm, shown in pseudocode in Example 2."

As Jim goes on to say, "given four teams, the algorithm produces the output sequence 1-2, 1-3, 1-4, 2-3, 2-4, 3-4--in other words, a permutation of all possible team pairings. But one fact complicates the solution: No team may play more than one game in a day. To fulfill that requirement, consider that the number of unique pairings of N teams equals 1+2+3 ...+(N-1), or (N*(N-1))/2. Also, if N is even, it takes N-1 days for all games to be played. If N is odd, it takes N days. (With three teams, for example, at one match per day, the pairings are 1-2, 1-3, and 2-3.)

"Experimenting with these ideas led to an insight. Take the pairings of four and five teams, shown in Table 1. An asterisk marks a day off for that team. In the five-team table, replacing the asterisks with the digit 6 produces pairings for six teams. Discovering that principal led to the final program, which generates pairings as though there were an even number of teams, replacing the last team with an asterisk for an odd number of teams.

"Table 2 shows the same pairings as integer values rearranged for each day, substituting the next higher number for asterisks.

"These are merely the permutations of the numbers from 1 to N! The Swap column shows the positions to exchange for generating the next permutation. In the six-team case, start with the first-day pairings of 1-2, 3-4, and 5-6. Then swap the positions of teams 2 and 3, and also swap the positions of teams 4 and 5, to generate the pairings for the next day: 1-3, 2-5, and 4-6. Swap the positions of teams 3 and 4, and of teams 5 and 6, to generate day three, and so on. PAIRINGS.PAS (Listing Two, page 169) implements this algorithm using Pascal to generate team pairings for sports events. Each team is guaranteed to play each other team exactly once, and no team plays more than one game per day. The code uses a touch of magic in the manipulation of the swap array at the end of the day loop, but is otherwise straightforward.

Your Turn

Send me your thoughts, notes, and algorithms in care of DDJ, or post a message to my Compuserve ID, 73627,3241. Meanwhile, the next time you receive an unsolicited phone call from a high-pressure sales department, try this. Don't say anything. Just push a button on your touchtone phone in response to every question. Drives 'em nuts, and they hang up right away. I suppose this proves that I'm not too old to learn new tricks--even from a voice-mail machine with a mind of its own.

Example 1: Pseudocode for Algorithm #12 (selection sampling)

const
  M = 1000;  { Input records }
  N = 128;   { Subset (N <= M) }
var
  requested,
  examined,
  selected: Integer;
  r: Real;
begin
  requested <- N;
  examined <- 0;
  selected <- 0;
  while (selected < requested) do
  begin
    examined <- examined + 1;
    r <- Random;
    if (M - examined) * r
      >= (requested - selected)
      then skip next input record
    else begin
      selected <- selected + 1;
      use next input record
    end
  end
end.

Example 2:

for x <- 1 to NTeams - 1 do
  for y <- x + 1 to NTeams do
    write(x, '-', y, ',');

Table 1: Four and five sports-team pairings.

     
          4 teams                          5 teams
Day 1   Day 2   Day 3       Day 1   Day 2   Day 3   Day 4   Day 5
1-2     1-3     1-4         1-2     1-3     1-4     1-5     1-*
3-4     2-4     2-3         3-4     2-5     2-*     3-*     4-5
                            5-*     4-*     3-5     2-4     2-3

Table 2: Team pairings from Table 1 rearranged.

     
      4 teams             Swap          6 teams         Swap
Day 1     12   34         2&3         12   34   56      2&3, 4&5
Day 2     13   24         3&4         13   25   46      3&4, 5&6
Day 3     14   23                     14   26   35      2&3, 4&5
Day 4                                 15   36   24      3&4, 5&6
Day 5                                 16   45   23



_ALGORITHM ALLEY_
by Tom Swan

[LISTING ONE]

(* ----------------------------------------------------------- *(
**  sample.pas -- Algorithm #12: Selection Sampling            **
** ------------------------------------------------------------**
**  Creates a file SAMPLE.DAT with a specified number of names **
**  extracted from Grady Ward's Moby Words. The first line of  **
**  the output file indicates the number of selections.        **
**  Assumes the number of names in the source is known.        **
**  Reference: Knuth, Vol 2, p122                              **
** ------------------------------------------------------------**
**     Copyright (c) 1993 by Tom Swan. All rights reserved.    **
)* ----------------------------------------------------------- *)

program Sample;
const
  M = 21420;  { Number of records in source }
  INFNAME = 'g:\moby\words\21400nam';  { Source file }
  OUTFNAME = 'sample.dat'; { Destination file }
var
  infile, outfile: Text;  { File variables }
  word: String;           { Holds each record from source }
  requested,              { Requested number of samples }
  examined,               { Total records examined }
  selected: Integer;      { Total records selected }
  r: Real;                { Random number 0 <= r < 1.0 }
begin
  Randomize;
  Writeln('Write selected names to ', OUTFNAME);
  Write('How many names? ');
  Readln(requested);
  if (requested <= 0) or (requested > M) then
  begin
    Writeln('Number must be >= 0 and <= ', M);
    Exit
  end;
  examined := 0;
  selected := 0;
  Assign(infile, INFNAME);
  Reset(infile);
  Assign(outfile, OUTFNAME);
  Rewrite(outfile);
  Writeln(outfile, requested);  { Save 'requested' in file }
  while (selected < requested) (* and (not Eof(infile)) *) do
  begin
    examined := examined + 1;
    r := Random;
    if (M - examined) * r >= requested - selected
    then
      Readln(infile)             { Skip next record }
    else
    begin                        { Select next record }
      selected := selected + 1;  { Count selections so far }
      Readln(infile, word);      { Read record from source }
      Writeln(outfile, word);    { Write record to destination }
      Writeln(word)              { Echo selection to display }
    end
  end;
  Close(infile);
  Close(outfile)
end.

[LISTING TWO]

<a name="02e5_000e">

(* ----------------------------------------------------------- *(
**  pairings.pas -- Select sports-event team pairings          **
** ------------------------------------------------------------**
**   This program generates team pairings for sports events.   **
**   Each team is guaranteed to play each other team exactly   **
**   once. No team will play more than one game per day.       **
**   An asterisk ('*') means a day off for that team.          **
**   For example, 5 teams produces this output:                **
**     Day 1 - 12 34 5*                                        **
**     Day 2 - 13 25 4*                                        **
**     Day 3 - 14 2* 35                                        **
**     Day 4 - 15 3* 24                                        **
**     Day 5 - 1* 45 23                                        **
** ------------------------------------------------------------**
**   Copyright (c) 1993 by Jim Mischel. All rights reserved.   **
)* ----------------------------------------------------------- *)

program pairings;
const
  TEAMCOUNT = 5;
var
  TeamNames: Array [1 .. TEAMCOUNT + 1] of Char;
  SwapArray: Array [1 .. TEAMCOUNT + 1] of Integer;
  x, Temp, Day: Integer;
  TempChar: Char;
const
  NTeams: Integer = TEAMCOUNT;
begin
{ Set up team names. Normally read from a file. }
  for x := 1 to NTeams do
    TeamNames[x] := Chr(x + Ord('0'));
  if Odd(NTeams) then
  begin
    NTeams := NTeams + 1;
    TeamNames[NTeams] := '*'
  end;
{ Set up the array that controls swapping. }
  for x := 1 to NTeams do
    SwapArray[x] := x;
  for Day := 1 to NTeams - 1 do
  begin
    Write('Day ', Day, ' -');
{ Write the team pairings for this day }
    x := 1;
    while x < NTeams do
    begin
      Write(' ', TeamNames[x], TeamNames[x + 1]);
      x := x + 2;
    end;
    WriteLn;
{ Perform swaps to prepare array for next day's pairings. }
    if Odd(Day)
      then x := 2
      else x := 3;
    while x < NTeams do
    begin
      TempChar := TeamNames[SwapArray[x]];
      TeamNames[SwapArray[x]] := TeamNames[SwapArray[x + 1]];
      TeamNames[SwapArray[x + 1]] := TempChar;
      Temp := SwapArray[x];
      SwapArray[x] := SwapArray[x + 1];
      SwapArray[x + 1] := Temp;
      x := x + 2
    end
  end
end.

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Database

Algorithm Alley

Selection Sampling

Permutations in Pairs

Your Turn

Example 1: Pseudocode for Algorithm #12 (selection sampling)

Example 2:

Table 1: Four and five sports-team pairings.

Table 2: Team pairings from Table 1 rearranged.

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Database Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

Database

Algorithm Alley

Related Reading

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Database Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content