*
Jon is a Member of Technical Staff at Bell Labs. He can be reached at [email protected] .bell-labs.com. This article is based on Column 1 of Jon's Programming Pearls, Second Edition (Addison-Wesley, 1999, ISBN 0-201-65788-0).*

When I answered the phone, the programmer's question was simple: "How do I sort a disk file?" My mistake was answering his question. I gave him a thumbnail sketch of how to implement a Merge sort on disk. My suggestion that he dig into an algorithms text met with little enthusiasm -- he was more concerned about solving the problem than furthering his education. I then told him about a disk sorting program in a popular programming book. The program consisted of about 200 lines of code in a dozen functions; I estimated that implementing and testing the code would have taken him at most a week.

I thought that I solved his problem, but his hesitation led me back to the right track. The conversation then went something like this:

Why do you want to write your own sort at all? Why not use a sort provided by your system?

I need the sort in the middle of a large system, and for obscure technical reasons, I can't use the system file-sorting program.

What exactly are you sorting? How many records are in the file? What is the format of each record?

The file contains at most 10 million records; each record is a seven-digit integer.

Wait a minute. If the file is that small, why bother going to disk at all? Why not just sort it in main memory?

Although the machine has many megabytes of main memory, this function is part of a big system. I expect that I'll have only about a megabyte free at that point.

Is there anything else you can tell me about the records?

Each one is a seven-digit positive integer with no other associated data, and no integer can appear more than once.

The context made the problem clearer. In the United States, telephone numbers consist of a three-digit area code followed by seven additional digits. Telephone calls to numbers with the toll-free area code of 800 (the only such code at the time) were not charged. A real database of toll-free telephone numbers includes a great deal of information: the toll-free telephone number, the real number to which calls are routed (sometimes several numbers, with rules on which calls go where and when), the name and address of the subscriber, and so on.

The programmer was building a small corner of a system for processing such a database, and the integers to be sorted were toll-free telephone numbers. The input file was a list of numbers (with all other information removed), and it was an error to include the same number twice. The desired output was a file of the numbers, sorted in increasing numeric order. The context also defines the performance requirements. During a long session with the system, users requested a sorted file roughly once an hour and could do nothing until the sort was completed. The sort therefore couldn't take more than a few minutes, while 10 seconds was a more desirable run time.

### Precise Problem Statement

To the programmer, these requirements added up to, "How do I sort a disk file?" Before attacking the problem, let's arrange what we know in a less biased and more useful form.

- Input. A file containing at most
*n*positive integers, each less than*n*, where*n*=10^{7}. It is a fatal error if any integer occurs twice in the input. No other data is associated with the integer. - Output. A list of the input integers sorted in increasing order.
- Constraints. At most (roughly) a megabyte of storage is available in main memory; ample disk storage is available. The run time can be at most several minutes; a run time of 10 seconds need not be decreased.

Think for a minute about this problem specification. How would you advise the programmer now?

### Program Design

The obvious program uses a general disk-based Merge sort as a starting point, but trims it to exploit the fact that we are sorting integers. That reduces the 200 lines of code by a few dozen lines, and also makes it run faster. It might still take a few days to get the code up and running.

A second solution makes even more use of the particular nature of this sorting problem. If the program stores each number in 7 bytes, then it can store about 143,000 numbers in the available megabyte. If it represents each number as a 32-bit integer, though, then it can store 250,000 numbers in the megabyte. The program will therefore make 40 passes over the input file. On the first pass, it reads into memory any integer between 0 and 249,999, sorts (at most) 250,000 integers, and writes them to the output file. The second pass sorts the integers from 250,000 to 499,999, and so on to the 40th pass, which sorts 9,750,000 to 9,999,999. A Quicksort would be quite efficient for the main memory sorts, and it requires only 20 lines of code. The entire program could therefore be implemented in a page or two of code. It also has the desirable property that it no longer uses intermediate disk files; unfortunately, in exchange for that benefit, the Quicksort pays the price of reading the entire input file 40 times.

A Merge sort program reads the file once from the input, sorts it with the aid of work files that are read and written many times, then writes it once; see Figure 1. The 40-pass algorithm (Figure 2) reads the input file many times and writes the output just once, using no intermediate files. One would prefer the following scheme (which combines the advantages of the previous two). It reads the input just once, and uses no intermediate files; see Figure 3. A program can do this only if it represents all the integers in the input file in the available megabyte of main memory. Thus, the problem boils down to whether it can represent at most 10 million distinct integers in about 8 million available bits. Think about an appropriate representation.

### Implementation Sketch

Viewed in this light, the bitmap or bit vector representation of a set screams out to be used. It represents a toy set of nonnegative integers less than 20 by a string of 20 bits. In particular, it represents the set {1, 2, 3, 5, 8, 13} by this string:

01110100100001000000

The bits representing numbers in the set are 1, and all other bits are 0.

In the real problem, the seven decimal digits of each integer denote a number less than 10 million. We'll represent the file by a string of 10 million bits in which the *i*th bit is on if -- and only if -- the integer *i* is in the file. (The programmer found 2 million spare bits; Problem 5 [at the end of the article] investigates what happens when a megabyte is a firm limit.) This representation uses three attributes of this problem not usually found in sorting problems: The input is from a relatively small range, it contains no duplicates, and no data is associated with each record beyond the single integer.

Given the bitmap data structure to represent the set of integers in the file, the program can be written in three natural phases. The first phase initializes the set to empty by turning off all bits. The second phase builds the set by reading each integer in the file and turning on the appropriate bit. The third phase produces the sorted output file by inspecting each bit and writing out the appropriate integer if the bit is 1. If *n* is the number of bits in the vector (in this case 10,000,000), the program can be expressed in pseudocode:

/* phase 1: initialize set to empty */

for (i = 0; i < n; i++)

bit[i] = 0

/* phase 2: insert present elements */

for each i in the input file

bit[i] = 1

/* phase 3: write sorted output */

for (i = 0; i < n; i++)

if bit[i] == 1

write i on the output file

This sketch was sufficient for the programmer to solve his problem. Some of the implementation details he faced are described in Problems 2, 5, and 7.

### Principles

The programmer told me about his problem in a phone call; it took us about 15 minutes to get to the real problem and find the bitmap solution. It took him a couple of hours to write the few dozen lines of code (far more efficient than the hundreds of lines and the week of programming time that we had feared earlier). And the program was lightning fast: While a Merge sort might have taken many minutes, this program took about 10 seconds. In addition to the advertising for clever programming, this case illustrates the following general principles:

- The Right Problem. Defining the problem was about 90 percent of this battle -- I'm glad that the programmer didn't settle for the first program I described. Problem 9 has an elegant solution once you pose the right problem.
- The Bitmap Data Structure. It represents a dense set over a finite domain when each element occurs at most once, and no other data is associated with the element. Even if these conditions aren't satisfied (when there are multiple elements or extra data, for instance), a key from a finite domain can be used as an index into a table with more complicated entries.
- Multiple-Pass Algorithms. Such algorithms make several passes over their input data, accomplishing a little more each time. We saw a 40-pass algorithm earlier; Problem 5 encourages you to develop a two-pass algorithm.
- A Time-Space Tradeoff and One That Isn't. Programming folklore and theory abound with time/space tradeoffs: By using more time, a program can run in less space. The two-pass algorithm in Solution 5, for instance, doubles a program's run time to halve its space. It has been my experience more frequently, though, that reducing a program's space requirements also reduces its run time. Space-efficient bitmaps dramatically reduce the run time of sorting. There are two reasons that the reduction in space led, in the previously mentioned example, to a reduction in time: Less data to process means less time to process it, and keeping data in main memory rather than on disk avoids the overhead of disk accesses.
- A Simple Design. The French writer and aircraft designer Antoine de Saint-Exupéry said, "A designer knows he has arrived at perfection not when there is no longer anything to add, but when there is no longer anything to take away." More programmers should judge their work by this criterion. Simple programs are usually more reliable, secure, robust, and efficient than their complex cousins, and they're easier to build and to maintain.

### Problems

1. If memory were not scarce, how would you implement a sort in a language with libraries for representing and sorting sets?

2. How would you implement bit vectors using bitwise logical operations (such as *and*, *or*, and *shift*)?

3. Run-time efficiency was an important part of the design goal, and the resulting program was efficient enough. Implement the bitmap sort on your system, and measure its run time. How does it compare to the system sort and to the sorts in Problem 1? Assume that *n* is 10,000,000, and that the input file contains 1,000,000 integers.

4. For the testing and timing in Problem 3, how could you generate a file of *k* unique random integers between 0 and *n*-1 in random order?

5. The programmer said that he had about a megabyte of free storage, but the code uses 1.25 megabytes. He was able to scrounge the extra space without much trouble. If the megabyte had been a hard boundary, what would you have recommended? What is the run time of your algorithm?

6. What would you recommend to the programmer if, instead of saying that each integer could appear at most once, he told you that each integer could appear at most 10 times? How would your solution change as a function of the amount of available storage?

7. In this problem, suggested by Roy Weil, the program assumes that no integer appears twice in the input. What happens if one does show up more than once? What happens when an input integer is less than zero or greater than or equal to *n*? What if an input is not numeric? What other sanity checks could the program incorporate? Describe small data sets that test the program, including its proper handling of these and other ill-behaved cases.** **

8. When the programmer faced the problem, all toll-free phone numbers in the United States had the 800 area code. Toll-free codes now include 800, 877, and 888, and the list is growing. How would you sort all of the toll-free numbers using only a megabyte? How can you store a set of toll-free numbers to allow rapid lookup capabilities to determine whether a given toll-free number is available or already taken?

9. Pioneers of human space flight soon realized the need for writing implements that work well in the extreme environment of space. A popular urban legend asserts that NASA solved the problem with a million dollars of research to develop a special pen. According to the legend, how did the Soviets solve the same problem?

### Solutions

1. Listing One is a C program that use the Standard Library qsort to sort a file of integers, while Listing Two is a C++ program that uses the set container from the Standard Template Library for the same job. (This code and more is also available at http://www.programmingpearls.com/.)

2. These functions use the constants to set, clear, and test the value of a bit:

#define BITSPERWORD 32

#define SHIFT 5

#define MASK 0x1F

#define N 10000000

int a[1 + N/BITSPERWORD];

void set(int i) { a[i>>SHIFT] |= (1<<(i & MASK)); }

void clr(int i) { a[i>>SHIFT] &= ~(1<<(i & MASK)); }

int test(int i){ return a[i>>SHIFT] & (1<<(i & MASK)); }

3. This C code implements the sorting algorithm, using the functions defined in Solution 2.

int main(void)

{ int i;

for (i = 0; i < N; i++)

clr(i);

while (scanf("%d", &i) != EOF)

set(i);

for (i = 0; i < N; i++)

if (test(i))

printf("%d\n", i);

return 0;

}

Table 1 reports the cost of sorting 1 million distinct positive integers, each less than 10 million, with the system command-line sort, the C++ and C programs in Solution 1, and the bitmap code.

4. This code assumes that *randint(l,u) *returns a random integer in *l..u *and that the *swap *function exchanges the two elements of *x.*

for (i = 0; i < n; i++)

x[i] = i

for (i = 0; i < k; i++)

swap(i, randint(i, n-1))

print x[i]

5. I used the program in Solution 4 to generate a file of 1 million distinct positive integers, each less than 10 million. Table 1 reports the cost of sorting them with the system command-line sort, the C++ and C programs in Solution 1, and the bitmap code.

The first line reports the total time, and the second line subtracts out the 10.2 seconds of input/output required to read and write the files. Even though the general C++ program uses 50 times the memory and CPU time of the specialized C program, it requires just half the code and is much easier to extend to other problems.

Representing all 10 million numbers requires that many bits, or 1.25 million bytes. Employing the fact that no phone numbers begin with the digits 0 or 1 reduces the memory to exactly 1 million bytes. Alternatively, a two-pass algorithm first sorts the integers 0 through 4,999,999 using 5,000,000/8=625,000 words of storage, then sorts 5,000,000 through 9,999,999 in a second pass. A *k*-pass algorithm sorts at most *n *nonrepeated positive integers less than *n *in time *kn *and space *n/k.*

6. If each integer appears at most 10 times, then we can count its occurrences in a 4-bit half-byte (or nybble). Using the solution to Problem 5, we can sort the complete file in a single pass with 10,000,000/2 bytes, or in *k* passes with 10,000,000/2*k* bytes.

7. This solution is left as an exercise to the reader.

8. This solution also is left as an exercise to the reader.

9. According to the urban legend, the Soviets solved their problem with a pencil, of course (see http://www.spacepen .com/). The Fisher Space Pen company was founded in 1948, and its writing instruments have been used by the Russian Space Agency, underwater explorers, and Himalayan climbing expeditions.

**DDJ**

#### Listing One

/* Copyright (C) 1999 Lucent Technologies */ /* From 'Programming Pearls' by Jon Bentley */ /* qsortints.c -- Sort input set of integers using qsort */ #include <stdio.h> #include <stdlib.h> int intcomp(int *x, int *y) { return *x - *y; } int a[1000000]; int main() { int i, n=0; while (scanf("%d", &a[n]) != EOF) n++; qsort(a, n, sizeof(int), intcomp); for (i = 0; i < n; i++) printf("%d\n", a[i]); return 0; }

#### Listing Two

/* Copyright (C) 1999 Lucent Technologies */ /* From 'Programming Pearls' by Jon Bentley */ /* sortints.cpp -- Sort input set of integers using STL set */ #include <iostream> #include <set> using namespace std; int main() { set<int> S; int i; set<int>::iterator j; while (cin >> i) S.insert(i); for (j = S.begin(); j != S.end(); ++j) cout << *j << "\n"; return 0; }

*Copyright © 1999, Dr. Dobb's Journal*