Reservoir Sampling

By Paul F. Hultquist and William R. Mahoney, January 01, 2001

Reservoir sampling lets you find precisely the right number N of records without any prior knowledge of the number of records in the file—provided that there are enough!

Jan01: Algorithm Alley

Paul was a consultant to Technical Support Inc. in Omaha, Nebraska. William is the senior vice president of Technical Support Inc., a part-time computer science faculty member at the University of Nebraska at Omaha, and the owner of a car sporting the only "UNIX" license plate in Nebraska. He can be contacted at [email protected].

Suppose your boss asks you to find a random sample of, say, 500 records from a file of customer records. Unfortunately, you may not know the length of the file, in terms of the number of records; or if you do, you may have a qualification such as "include only customers who have spent more than $200 with us this year." There is a good method for drawing exactly N samples randomly from a file of known length, but that is not applicable unless you first run through the file and count the exact number of records in the file. This may not be convenient, especially if there is a qualification on the chosen records. Reservoir sampling, originally developed by Alan Waterman, lets you find precisely the right number N of records without any prior knowledge of the number of records in the file (provided that there are enough). Also, the records are chosen with equal a priori probabilities.

Reservoir sampling involves the establishment of a reservoir of candidate records. The allotted reservoir space must be several times larger than that of the final sample. (Later, we will show how to estimate the reservoir size.) To begin, we copy the first N (eligible, if there is a qualification) records into the reservoir. An array of length N contains pointers to the N records in the reservoir.

The second stage of the process consists of going through the input file, starting with record N+1, and staging a knock-out tournament with records that are "in" the reservoir. By "in" we mean records that have not been knocked out. Going through the file requires a pointer P that is set initially to record N+1 in the file. For the new record, generate a random integer K on the interval 1KP. If KN, then copy the record at P in the file to the first available space (past N) in the reservoir, and switch the reservoir pointer from the record at K to the new record. Thus, the record at K is no longer in the reservoir — it has been replaced by the new one. Dead records still occupy reservoir storage space, which makes it grow in size. After the replacement, if it occurs, or if K>N, advance the file pointer P to the next record.

Continue this process until there are no more records in the file. At the beginning of the tournament, many of the first N records will be evicted, but as the file pointer P grows, the probability of a new file record evicting an old one drops. By the time P=2N, the probability of an eviction drops to one half. We can show that the a priori probability of any of the first N records being chosen for the final cut is N/N_T, where N_T is the (unknown beforehand) file length.

Let N be the desired sample size and N_T be the unknown file size. Choose a pet record from among the initial N records and follow it to the end of the process. The tournament begins when the file pointer P=N+1. The probability that this record is chosen is N/(N+1); and if it is chosen, the probability that it evicts our pet record is 1/N. Thus, the probability that our pet record is evicted at this point is the product of these probabilities, 1/(N+1). The probability that our chosen record survives is 1-1/(N+1)=(N+1-1)/(N+1)= N/(N+1). Now, for record at P=N+2, the probability that it is chosen is N/(N+2), and the probability that it evicts our pet record, if it is chosen, is 1/N; so the probability of our pet record being evicted is 1/(N+2). The probability of survival at this point is 1-1/(N+2)=(N+2-1)/(N+2)= (N+1)/(N+2). The probability of survival for our pet after two new records being added is:

If we continue this tiresome process, we find that the survival probability for our pet record is:

What about the survival probabilities for records chosen after the first N? It is easy to see that the last record is chosen with probability N/N_T. We can see that the record at the next-to-last position in the file is chosen with probability N/(N_T-1), but it must survive being knocked out by the last record. The probability of the last record being chosen and knocking out the next-to-last one is (N/N_T)(1/N)=1/N_T. The probability of that not happening is 1-1/N_T=(N_T -1)/N_T, which means that the probability that the next-to-last record in the file will be chosen and will survive is:

This not intended as a rigorous proof, but as a demonstration of how the algorithm works.

How large does the reservoir grow? That depends on the logarithm of the ratio of the file size N_T to N. We know that, first of all, the reservoir is loaded with N records. Following this, the remaining N_T-N records are chosen to occupy reservoir space with decreasing probabilities — starting with N/(N+1) and steadily decreasing to N/N_T. (Eliminated records still occupy reservoir space.) It is clear that the statistical expectation of the size S of the reservoir can be calculated as:

The summation can be slightly overestimated by replacing it with an integral:

Thus the reservoir size estimate is:

Table 1 gives typical values of S/N for values of N_T/N.

The values of S/N given, times the value of N, are the expectations of the mean number of records chosen. Users should allow an extra 25 percent or more for the fact that this is a random process. Likewise, it is also necessary to allow for error in the estimates of N_T.

One other observation: The source file may very well be alphabetized or sorted on some key or parameter, but the reservoir file will not be. That may not matter for some uses, such as gathering statistics; but for other applications, it may be necessary to copy the active records to a new file and then to sort them properly.

Finally, the algorithm as shown will work for gruesome applications such as those involving multiple input files (tapes, multiple disks, and so on). Copying the records to a reservoir in fast memory simplifies the selection and output parts of the process. However, for small files that can be contained in central memory, the reservoir can be replaced by an array of pointers to records in the active reservoir with no copying necessary until output.

Listing One demonstrates the reservoir sampling of a large file of records (an executable version is available electronically; see "Resource Center," page 5). In this example, the file is really just an array, and the records are simple integers. The simulated file on disk is handled by the array big_file. It consists of a set of integers selected at random from the range 1..100. The file ranges from 0..BIG_FILE_ SIZE-1. There are several arrays that are allocated dynamically; no error checking is done on these. In a real application, you would want to take care of that. Also, Listing One includes some shortcuts that are relevant because we are dealing with integer records. In a real program, these would contain more than just an int, so the shortcuts are omitted.

In Passing

Sadly, we learned Paul Hultquist passed away just prior to publication of his article. We'd like to extend our heartfelt condolences to his family, friends, and coworkers.

— The editors

DDJ

Listing One

/* ===========================================================================
Demonstration of reservoir sampling.
Author:  Bill Mahoney
=========================================================================== */

#include <iostream.h>
#include <time.h>
#include <stdlib.h>
#include <math.h>
#define  BIG_FILE_SIZE 967  /* Any old number will do */
void reservoir_sample( int *big_file, int *sample, int N );

int main()
{
    int *sample, N, i;
    int big_file[ BIG_FILE_SIZE ];

    srand( (unsigned int) time( (time_t *) NULL ) );
    for( i = 0; i < BIG_FILE_SIZE; i++ )
        /* We could put in random numbers, or put in numbers that are the */ 
        /* same as the record number. This'll make it simpler for us to   */
        /* see which records are in fact selected.                        */
#if 0
        big_file[ i ] = rand() % 1000;
#else
        big_file[ i ] = i;
#endif
    /* For simplicity, assume the user enters a valid number. */
    cout << "How many elements would you like to retrieve? ";
    cin >> N;
    cout << "Selecting " << N << " elements...";
    sample = new int[ N ];

    reservoir_sample( big_file, sample, N );

    for( i = 0; i < N; i++ )
        cout << sample[ i ] << ( ( ( i + 1 ) % 5 == 0 ) ? "" : " " );
    cout << "Done!";

    delete [] sample;
    return( 0 );
} 
/* ===========================================================================
reservoir_sample
Here we do the actual guts. Select the first N records from the front of the 
"file", then selectively knock out candidates and replace them with new ones.
Obviously, since we have just an array of int's there is no need for the 
temporary reservour array at all. We could just keep the index into the array 
"file" instead of an index into "res", the reservoir. Likewise, we could omit
"active" and just use "sample" to hold indices into the array. In an actual 
application, though, you would need to allocate the reservoir and fill it from
the disk file; thus the use of the dynamic array "res" - to illustrate this. 
Same thing with "active". An alternative in a real world application would be 
to have "res" maintain the file position (i.e. ftell() it) of record on disk.
=========================================================================== */

void reservoir_sample( int *file, int *sample, int N )
{
    int end_res, file_record, *res, res_size, *active, i;
    /* Here we create a reservoir with sufficient space.  */
    /* See the formula in the article for an explanation. */
    res_size = (int) ( 1.25 * (double) N * 
               ( 1.0 + log( (double) BIG_FILE_SIZE / (double) N ) ) );
    cout << "Creating a temporary storage area of " << res_size << " elements.";
    res = new int[ res_size ];
    active = new int[ N ];
    /* First off, copy the first N records into the reservoir. */
    for( i = 0; i < N; i++ )
    {
        res[ i ] = file[ i ];
        active[ i ] = i;
    } 
    /* end_res is the # of elements in the reservoir, */
    /* file_record is the next record in the "file".  */
    end_res = file_record = N;

    while ( file_record < BIG_FILE_SIZE )
    {
        /* Position is 0..file_record-1 */
        int position = rand() % file_record;
        if ( position < N )
        {
            /* Copy the record into the reservoir */
            res[ end_res ] = file[ file_record ];
            /* Save the index to this record.     */
            active[ position ] = end_res;
            end_res++;
            if ( end_res >= res_size )
               cerr << "Out of space in the reservoir!";
               /* Do some kind of error recovery here. */
        } 
        file_record++;
    } 
    /* Give the caller the desired records. */
    for( i = 0; i < N; i++ )
        sample[ i ] = res[ active[ i ] ];
    delete[] res;
    delete [] active;
}

Back to Article

1 2 Next

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Reservoir Sampling

In Passing

Listing One

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

Reservoir Sampling

In Passing

Listing One

Related Reading

News

Commentary

Slideshow

Video

Most Popular

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content