Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Database

Feb02: Algorithm Alley


Feb02: Algorithm Alley

William is the senior vice president of Technical Support and a part-time computer science faculty member at the University of Nebraska at Omaha. He can be contacted at [email protected].


Suppose you are asked to select a random sample of, say, 500 records from a file of known length. In the article "Reservoir Sampling" (DDJ, January 2001), the late Paul Hultquist and I discussed one possible method that lets you select the 500 records from a file of unknown length. Certainly, you could pretend that you do not know the length of the file and use reservoir sampling. However, knowing the length of the file eliminates the need for the reservoir and also lets you select the records in the order of their appearance in the file. The latter is important as it eliminates any possible need to sort the sample file when you are done.

The method for doing this is to select the records from the file, one at a time, in such a way that the likelihood for selection is increased if you are behind schedule and decreased if you are ahead of schedule. Although this necessarily changes the prob- ability of selection for a record as the algorithm runs, it is not difficult to see that the a priori probability of selection for a record in the file is the same for each record.

For example, suppose that the file contains NT total records, of which you want to retrieve N of them, where usually 0<N<NT. Suppose that s is the number of records selected thus far as the algorithm runs. This value is initially zero and ends up, you hope, at N. Let k be the file position. It starts at zero and is incremented for each record through the last record, which is number NT-1.

At each record, that particular part of the file may be added to your N elements with a probability of (N-s)/(NT-k). That is, the number of records that you still need, over the number of records that are remaining in the file. If a random number R is chosen, and if R<(N-s)/(NT-k), you include the record in the output file. The algorithm is straightforward, as Listing One shows.

However, you really want to assure yourself that the a priori likelihood for selecting each record from the file is the same.

The first record you encounter will have s=0 and k=0, so that the likelihood for picking the initial record is just N/NT. There is nothing that is conditional on the first record being selected. After deciding whether R<(N/NT), you increment s if it was selected, and increment k regardless.

The probability of including the second record, though, depends on whether we included the first record. It may be N/(NT-1) if it was not included, or it may be (N-1)/(NT-1) if it was. The total probability of some event E, if it relies on some other events A,B,C,... is:

P(E)=P(E|A)P(A)+P(E|B)P(B)+P(E|C)P(C)...

where P(E|A) is the probability that E happens assuming A happened. So this is just the total of the probabilities for all of the preceding events. If we think about the second record in the file, it has probability:

P(E)=P(E|A)P(A)+P(E|B)P(B)

and the event A, in this case, is the first record having been chosen, while event B is the first record having not been chosen. The first record was chosen with probability N/NT and was not chosen with probability 1-(N/NT). If it was chosen, then (N-s)/(NT-k) is equal to (N-1)/ (NT-1), and if it was not chosen, (N-s)/ (NT-k) is equal to N/(NT-1). Thus, Example 1(a), which is N/NT.

The third record is even more complex. There are now three cases to consider: The first two records were chosen (call that event A), one of the first two was chosen (event B), and neither were chosen (event C). Remember that the likelihood of event B is double, since you could have picked the first but not the second, or the second but not the first; see Example 1(b). (Trust me on this!)

This might convince everyone that the selection probability for each record is N/NT. But it leaves open the possibility that you could pick fewer than N records from the file if things work out wrong at the end. The algorithm needs to terminate when s=N, which may happen early.

But you might think it could also happen late; for example, you may need two more records when you get to the last two records in the file. Maybe you pick them, maybe you don't, leaving your desired sample with too few records.

But if there are only two left, and you need two more — the second to last record is chosen with a probability of (N-s)/(NT-k)=2/2=1.0. It is certainly picked. This increments s and k. And then the last record is chosen with a probability of (N-s)/(NT-k)=1/1=1.0, so it is also picked. (As an aside, adding the last record to your collection happens with a probability of N/NT, adding the last two happens with a probability of (N/NT)2, the last three are (N/NT)3, and so on. The selection criterion (N-s)/(NT-k) tends to anticipate whether records are needed and increase or decrease the likelihood that they will be chosen.)

Finally, asking for N records out of a file with NT records where N=NT, naturally causes each record to be chosen, since (N-s)=(NT-k) for the entire length of the run. Each record is picked with a probability of 1.0. And if N>NT, the probability is greater than 1.0 (how about those odds!), causing all records to be selected but leaving the number short of N. If this is not a desired behavior, a simple check to see that N<=NT is sufficient.

So you have a simple method for selecting the elements from the file. Pick each with probability (N-s)/(NT-k) and it is guaranteed to work, even if it is a random algorithm.

DDJ

Listing One

/* =================================================================
random record selection demo
Author: Bill Mahoney
================================================================== */
#include <stdlib.h> // for rand/random
#include <iostream>
const int N_sub_t = 967; // Some "known length" for our "file"
struct file_s {
              int data; // the "data" in the "file"
              char more[ 3 ]; // "More" data in the "file"
              };
void sample( file_s *file, file_s *selected, int N );
int main()
{
     int N, i;
     file_s *file = new file_s[ N_sub_t ]; // pretend file
     file_s *selected; // The ones we pick at random
     // Fill up the file with data; of course, for a real application one 
     // would be accessing records from a file in the "select" function
     // below; the file would already have data.
     for( i = 0; i < N_sub_t; i++ )
        // Just number them sequentially so that
        // it is easy to see which are selected.
        file[ i ].data = i;
    // Ask how many the user wants; we'll assume for now that 0 <= N < N_sub_t
    cout << "How many records do you need? ";
    cin >> N;
    cout << "Selecting " << N << " elements...\n";
    selected = new file_s[ N ];
    // Select the sample of N records from N_sub_t
    sample( file, selected, N );
    for( i = 0; i < N; i++ )
        cout << selected[ i ].data
             << ( ( ( i + 1 ) % 5 == 0 ) ? "\n" : " " );
    cout << "Done!\n";
    delete [] selected;
    delete [] file;
    return( 0 );
}

/* =================================================================
sample
Randomly select N records from the "file" and copy records into "selected".
================================================================= */
void sample( file_s *file, file_s *selected, int N )
{
    int s, k; // See article
    float R; // probability to test
    float limit; // (N-s)/(N_sub_t-k)
    for( k = s = 0; k < N_sub_t; k++ )
    {
       // random(3) returns 0..RAND_MAX; the older C function rand(3) would 
       // also work. Since the algorithm is explained in terms of floating 
       // point, we will do this program to match. Integer math would, 
       // of course, be faster.
       R = ( (float) random() ) / ( (float) RAND_MAX );
       limit = ( (float)( N - s ) ) / ( (float)( N_sub_t - k ) );
       if ( R < limit )
           // R < (N-s)/(N_sub_t-k) -- include the record.
           selected[ s++ ] = file[ k ];
       if ( s == N )
          break; // get out early
    }
}

   

Back to Article


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.