Dr. Dobb's | Feb02: Algorithm Alley

Feb02: Algorithm Alley

Here's yet another record-selection algorithm for your database toolbox.

February 01, 2002
URL:http://www.drdobbs.com/database/feb02-algorithm-alley/184404981

Feb02: Algorithm Alley

William is the senior vice president of Technical Support and a part-time computer science faculty member at the University of Nebraska at Omaha. He can be contacted at [email protected].

Suppose you are asked to select a random sample of, say, 500 records from a file of known length. In the article "Reservoir Sampling" (DDJ, January 2001), the late Paul Hultquist and I discussed one possible method that lets you select the 500 records from a file of unknown length. Certainly, you could pretend that you do not know the length of the file and use reservoir sampling. However, knowing the length of the file eliminates the need for the reservoir and also lets you select the records in the order of their appearance in the file. The latter is important as it eliminates any possible need to sort the sample file when you are done.

The method for doing this is to select the records from the file, one at a time, in such a way that the likelihood for selection is increased if you are behind schedule and decreased if you are ahead of schedule. Although this necessarily changes the prob- ability of selection for a record as the algorithm runs, it is not difficult to see that the a priori probability of selection for a record in the file is the same for each record.

For example, suppose that the file contains N_T total records, of which you want to retrieve N of them, where usually 0<N<N_T. Suppose that s is the number of records selected thus far as the algorithm runs. This value is initially zero and ends up, you hope, at N. Let k be the file position. It starts at zero and is incremented for each record through the last record, which is number N_T-1.

At each record, that particular part of the file may be added to your N elements with a probability of (N-s)/(N_T-k). That is, the number of records that you still need, over the number of records that are remaining in the file. If a random number R is chosen, and if R<(N-s)/(N_T-k), you include the record in the output file. The algorithm is straightforward, as Listing One shows.

However, you really want to assure yourself that the a priori likelihood for selecting each record from the file is the same.

The first record you encounter will have s=0 and k=0, so that the likelihood for picking the initial record is just N/N_T. There is nothing that is conditional on the first record being selected. After deciding whether R<(N/N_T), you increment s if it was selected, and increment k regardless.

The probability of including the second record, though, depends on whether we included the first record. It may be N/(N_T-1) if it was not included, or it may be (N-1)/(N_T-1) if it was. The total probability of some event E, if it relies on some other events A,B,C,... is:

P(E)=P(E|A)P(A)+P(E|B)P(B)+P(E|C)P(C)...

where P(E|A) is the probability that E happens assuming A happened. So this is just the total of the probabilities for all of the preceding events. If we think about the second record in the file, it has probability:

P(E)=P(E|A)P(A)+P(E|B)P(B)

and the event A, in this case, is the first record having been chosen, while event B is the first record having not been chosen. The first record was chosen with probability N/N_T and was not chosen with probability 1-(N/N_T). If it was chosen, then (N-s)/(N_T-k) is equal to (N-1)/ (N_T-1), and if it was not chosen, (N-s)/ (N_T-k) is equal to N/(N_T-1). Thus, Example 1(a), which is N/N_T.

The third record is even more complex. There are now three cases to consider: The first two records were chosen (call that event A), one of the first two was chosen (event B), and neither were chosen (event C). Remember that the likelihood of event B is double, since you could have picked the first but not the second, or the second but not the first; see Example 1(b). (Trust me on this!)

This might convince everyone that the selection probability for each record is N/N_T. But it leaves open the possibility that you could pick fewer than N records from the file if things work out wrong at the end. The algorithm needs to terminate when s=N, which may happen early.

But you might think it could also happen late; for example, you may need two more records when you get to the last two records in the file. Maybe you pick them, maybe you don't, leaving your desired sample with too few records.

But if there are only two left, and you need two more — the second to last record is chosen with a probability of (N-s)/(N_T-k)=2/2=1.0. It is certainly picked. This increments s and k. And then the last record is chosen with a probability of (N-s)/(N_T-k)=1/1=1.0, so it is also picked. (As an aside, adding the last record to your collection happens with a probability of N/N_T, adding the last two happens with a probability of (N/N_T)2, the last three are (N/N_T)³, and so on. The selection criterion (N-s)/(N_T-k) tends to anticipate whether records are needed and increase or decrease the likelihood that they will be chosen.)

Finally, asking for N records out of a file with N_T records where N=N_T, naturally causes each record to be chosen, since (N-s)=(N_T-k) for the entire length of the run. Each record is picked with a probability of 1.0. And if N>N_T, the probability is greater than 1.0 (how about those odds!), causing all records to be selected but leaving the number short of N. If this is not a desired behavior, a simple check to see that N<=N_T is sufficient.

So you have a simple method for selecting the elements from the file. Pick each with probability (N-s)/(N_T-k) and it is guaranteed to work, even if it is a random algorithm.

DDJ

Listing One

/* =================================================================
random record selection demo
Author: Bill Mahoney
================================================================== */
#include <stdlib.h> // for rand/random
#include <iostream>
const int N_sub_t = 967; // Some "known length" for our "file"
struct file_s {
              int data; // the "data" in the "file"
              char more[ 3 ]; // "More" data in the "file"
              };
void sample( file_s *file, file_s *selected, int N );
int main()
{
     int N, i;
     file_s *file = new file_s[ N_sub_t ]; // pretend file
     file_s *selected; // The ones we pick at random
     // Fill up the file with data; of course, for a real application one 
     // would be accessing records from a file in the "select" function
     // below; the file would already have data.
     for( i = 0; i < N_sub_t; i++ )
        // Just number them sequentially so that
        // it is easy to see which are selected.
        file[ i ].data = i;
    // Ask how many the user wants; we'll assume for now that 0 <= N < N_sub_t
    cout << "How many records do you need? ";
    cin >> N;
    cout << "Selecting " << N << " elements...\n";
    selected = new file_s[ N ];
    // Select the sample of N records from N_sub_t
    sample( file, selected, N );
    for( i = 0; i < N; i++ )
        cout << selected[ i ].data
             << ( ( ( i + 1 ) % 5 == 0 ) ? "\n" : " " );
    cout << "Done!\n";
    delete [] selected;
    delete [] file;
    return( 0 );
}

/* =================================================================
sample
Randomly select N records from the "file" and copy records into "selected".
================================================================= */
void sample( file_s *file, file_s *selected, int N )
{
    int s, k; // See article
    float R; // probability to test
    float limit; // (N-s)/(N_sub_t-k)
    for( k = s = 0; k < N_sub_t; k++ )
    {
       // random(3) returns 0..RAND_MAX; the older C function rand(3) would 
       // also work. Since the algorithm is explained in terms of floating 
       // point, we will do this program to match. Integer math would, 
       // of course, be faster.
       R = ( (float) random() ) / ( (float) RAND_MAX );
       limit = ( (float)( N - s ) ) / ( (float)( N_sub_t - k ) );
       if ( R < limit )
           // R < (N-s)/(N_sub_t-k) -- include the record.
           selected[ s++ ] = file[ k ];
       if ( s == N )
          break; // get out early
    }
}

Back to Article

Feb02: Algorithm Alley

Listing One

Example 1: (a) Second record chosen; (b) third record chosen.