Which sorting algorithm is the fastest? Ask this question to any group of programmers and you'll get an animated discussion. Of course, there is no one answer. It depends not only on the algorithm, but also on the computer, data, and implementation. However, if you count the number of operations needed to sort integer numbers on a standard von Neumann computer, there is a clear winner -- the algorithm presented in the paper "Sorting In Linear Time?" by A. Andersson, T. Hagerup, S. Nilsson, and R. Raman (*Proceedings of the 27th Annual ACM Symposium on the Theory of Computing*, 1995). It sorts *n* integers in time proportional to *n log log n*. In this article, I'll give you a complete description of this algorithm.

Can it be done even faster? No one knows. We only know that it can't possibly be done using fewer than *n* operations: An algorithm using fewer operations than that can't look at each of the *n* numbers and, therefore, might leave some of the numbers out of order.

Even though the *n log log n*-time sorting algorithm came about as a theoretical game, its real-life performance is good. A C implementation like nloglogn.c (available electronically) with no particular optimizations runs faster on a typical 32-bit machine than many standard textbook sorting algorithms.

### The Problem

To achieve agreement when discussing the speed of a sorting algorithm, it's necessary to establish some common ground. How do we measure speed? There is a well-established model -- the von Neumann computer, or unit-cost RAM model -- widely used by computer scientists to evaluate algorithms without introducing all the complexities of real-world computers. This model doesn't account for some important things, such as the performance gap between RAM and CPU, but typically gives good approximations of real-life performance.

There is nothing strange about the unit-cost RAM model. It's probably similar to your mental image of a computer. The computer has a RAM divided into words of *w* bits -- the "word length" of the machine. Each word has an address, and a word can be accessed in constant time given its address. The computer has a CPU capable of performing a small number of instructions, such as reading and writing words in RAM, and performing basic arithmetical and logical operations. Each of these operations can be performed in a constant number of machine cycles.

If you don't use the full power of the RAM model, but only consider sorting algorithms that operate by comparing elements pairwise, it's well known that at least *n log n* comparisons are needed to sort *n* elements in the worst case. There are plenty of algorithms, such as mergesort and heapsort, that achieve this bound.

Other sorting techniques may be faster. For example, using radix sorting, you can sort *n* word-sized integers by scanning the data a few bits at a time. Each scan can be done in linear time, and the number of scans depends on the word length *w*. Hence, the time for radix sorting is proportional to *nw*. Here, it's easy to make a mistake. On most computers, *w* is a small number (typically 32 or 64), and you might be tempted to state that radix sorting is a linear time sorting algorithm. Not so. The algorithm will only work if *w >= log n*. If not, you wouldn't even be able to store the numbers. Since each memory address is a word consisting of *w* bits, the address space won't accommodate *n* numbers if *w < log n*. Hence, in the unit-cost RAM model, radix sort also runs in time proportional to *n log n*.
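As a concrete baseline, here is a minimal sketch of the radix sorting just described: a least-significant-digit radix sort for 32-bit keys, scanning 8 bits per pass (so four passes for *w*=32). The function name and the 8-bit digit width are illustrative choices, not prescribed by the algorithm.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* LSD radix sort: four stable counting-sort passes over 8-bit digits
   sort 32-bit keys. Each pass is linear, and the number of passes is
   w/8 -- proportional to the word length w, as discussed above. */
void radix_sort_u32(uint32_t *a, size_t n)
{
    uint32_t *buf = malloc(n * sizeof *buf);
    if (!buf) return;
    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[257] = {0};
        for (size_t i = 0; i < n; i++)
            count[((a[i] >> shift) & 0xFF) + 1]++;
        for (int d = 0; d < 256; d++)      /* prefix sums -> slot positions */
            count[d + 1] += count[d];
        for (size_t i = 0; i < n; i++)     /* stable distribution pass */
            buf[count[(a[i] >> shift) & 0xFF]++] = a[i];
        memcpy(a, buf, n * sizeof *a);
    }
    free(buf);
}
```

Because each pass is stable, keys that agree on the higher digits keep the order established by the lower-digit passes, which is what makes the LSD approach correct.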

I'll now describe an algorithm that does not depend on the word length. For any word length *w*, *w >= log n*, it sorts *n* word-sized integers in time proportional to *n log log n*.

### The Algorithm

With this algorithm, the sorting is performed in two phases. In the first phase, you reduce the size of the numbers using radix sorting techniques. The elements become smaller and the number of elements doesn't increase. In fact, in linear time, it's possible to cut the number of bits of each number in half.

In the second phase, you perform a mergesort on the shortened numbers. Because the numbers are now much shorter, several numbers will fit in a single machine word. With several numbers in one word, it's possible to do several operations on these numbers in parallel using the shift and bitwise AND/OR/XOR operations found on most computers. In fact, using a scheme known as "packed merging," you can merge two sorted sequences of *k* numbers, each of which fits in one word, in time proportional to *log k*. This is an improvement on the standard merging algorithm, which requires time proportional to *k*.
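The full packed-merging routine is too intricate for a short example, but the word-parallel trick at its heart can be sketched. Assume 7-bit keys stored in 8-bit slots of a 64-bit word, with the top bit of each slot reserved as a "test bit" (this layout and the names are my illustrative choices, not the paper's exact scheme). A single subtraction then compares all eight packed pairs at once:

```c
#include <stdint.h>

/* One 1 bit at the top of each 8-bit slot. */
#define TEST_BITS 0x8080808080808080ULL

/* Word-parallel comparison: the result has the test bit of slot i set
   iff field i of x >= field i of y. Setting the test bits of x first
   guarantees no borrow crosses a slot boundary, since each 7-bit y
   field is smaller than its slot's test bit. */
uint64_t parallel_ge(uint64_t x, uint64_t y)
{
    return ((x | TEST_BITS) - y) & TEST_BITS;
}

/* Helpers to place and extract 7-bit fields (illustrative). */
uint64_t pack_field(uint64_t w, int i, uint64_t v) { return w | (v << (8 * i)); }
uint64_t get_field(uint64_t w, int i) { return (w >> (8 * i)) & 0x7F; }
```

One subtraction and two bitwise operations replace eight separate comparisons; packed merging builds on exactly this kind of parallelism to merge two packed words in *log k* steps.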

Given these building blocks, we are almost done. In the first phase, you perform *log log n* reductions. This is done in *n log log n* time, because each reduction takes linear time. After halving the number of bits *log log n* times, the numbers will be so small that you can fit *k = log n* numbers within one word.
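To see why *log log n* halvings are enough: after *r* halvings a *w*-bit number has *w/2^r* bits, and with *r = log log n* that is *w/log n* bits, so *log n* numbers fit per word. A quick check with illustrative values (*w* = 64, *n* = 2^16, so *log n* = 16 and *log log n* = 4; the helper name is mine):

```c
/* Number of bits per key after r halving reductions, starting from
   w-bit keys. Each reduction runs in linear time, as described above. */
int bits_after(int w, int r)
{
    while (r-- > 0)
        w /= 2;       /* one reduction halves the key length */
    return w;
}
```

Here `bits_after(64, 4)` is 4 bits per key, so a 64-bit word holds 64/4 = 16 = *log n* keys, as claimed.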

In the second phase, you use the packed merging as a subroutine in a standard mergesort. You perform the merging bottom up. Start by sorting subsequences of length *k*. Each of the *n/k* subsequences is sorted in *k log k* time using a standard comparison-based algorithm, such as mergesort. Plugging in *k=log n*, you see that this adds up to *n log log n*.

You now combine these sorted sequences pairwise using packed merging, and thereby create longer sorted sequences. Each sequence is packed in a single machine word. In this first round, you need to merge *n/2k* pairs; each merge is done in *log k* time using packed merging. This sums to *n/2k × log k*.

In the next round, you need to merge *n/4k* pairs, each consisting of two words. This is done in *n/4k × 2 log k = n/2k × log k* time, the same time bound as before. In fact, you'll get the same result for each round.

Because the length of the sequences doubles in each round, the number of rounds is *log(n/k)*. Hence, the total time for this version of mergesort is *log(n/k) × n/2k × log k*. Plugging in *k = log n*, you see that this expression is also no more than *n log log n*.

At this point, the data is sorted and no more than *n log log n* time has elapsed, since both the reduction phase and the mergesorting phase run in time proportional to *n log log n*.

### Reducing the Number Size

There are several alternative algorithms that can be used for reducing a sorting problem to one with shorter numbers. Which to choose depends on whether you want to optimize the rate of reduction, the speed of the algorithm, or the amount of memory required. Here, I'll show you a simple algorithm that reduces the number of bits by a factor of 2. It runs in linear time, but uses plenty of memory.

Assume that you want to sort *n* *b*-bit numbers. I'll use a bucketing scheme with two bucket tables: High and Low. Each table entry can hold a linked list of numbers. I'll explain the algorithm using a small example. In Figure 1, you see seven 6-bit numbers to be sorted. Figure 1 also shows the two bucket tables and a batch list used to record which buckets are actually being used. This is necessary because you cannot afford to inspect every single bucket: The number of buckets may be much larger than the number of elements.

##### Figure 1: Bucketing on high-order bits. All active buckets are recorded in the batch list.

```
Data: 100101, 011001, 100111, 001100, 011111, 110111, 001000

High                         Low
------------------------     ------------------------
000:                         000:
001: 001100, 001000          001:
010:                         010:
011: 011001, 011111          011:
100: 100101, 100111          100:
101:                         101:
110: 110111                  110:
111:                         111:

Batch: 100, 011, 001, 110
```

The first step in Figure 1 is to perform bucketing on the high-order bits; in this example, the first 3 bits of each number. Every time a bucket is used for the first time, the number of this bucket is recorded in the batch list.

Figure 2 shows the second step of the algorithm, where you traverse each non-empty bucket in the High bucket table, moving elements into the Low bucket table. There are several important things to note.

##### Figure 2: Bucketing on low-order bits. The minimum element in each High bucket is left behind.

```
High                         Low
------------------------     ------------------------
000:                         000:
001: 001000                  001:
010:                         010:
011: 011001                  011:
100: 100101                  100: 001100
101:                         101:
110: 110111                  110:
111:                         111: 100111, 011111

Batch: 100, 011, 001, 110, 111, 100
```

- You use the batch to find the nonempty buckets. In this way, the whole operation can be performed in linear time.
- When a Low bucket is used for the first time, the bucket is recorded in the batch.
- You don't move all of the elements. The minimum element in each High bucket is left behind. This is crucial. The batch will be our reduced sorting problem and it must not contain more than *n* elements. By leaving one element behind in each nonempty High bucket, you ensure that the total number of nonempty buckets, and hence the number of entries in the batch, is at most *n*.
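The two steps above can be sketched in C for the 6-bit keys of the figures. This is a minimal illustration, not the paper's implementation: the array and variable names are mine, and the linked lists are threaded through an index array rather than heap-allocated nodes.

```c
#include <stdint.h>

#define B 6                      /* key length in bits (as in the figures) */
#define HALF (B / 2)
#define BUCKETS (1 << HALF)
#define MAXN 64

/* Lists are threaded through next_[]; -1 terminates a list.
   batch[] logs every bucket the first time it is used. */
int high_head[BUCKETS], low_head[BUCKETS];
int next_[MAXN];
int batch[2 * MAXN], batch_len;

void reduce(const uint32_t *key, int n)
{
    for (int j = 0; j < BUCKETS; j++) high_head[j] = low_head[j] = -1;
    batch_len = 0;

    /* Step 1 (Figure 1): bucket on the high-order bits. */
    for (int i = 0; i < n; i++) {
        int h = key[i] >> HALF;
        if (high_head[h] < 0) batch[batch_len++] = h;   /* first use */
        next_[i] = high_head[h];
        high_head[h] = i;
    }

    /* Step 2 (Figure 2): in each active High bucket, keep the minimum
       and move the rest to Low buckets by low-order bits. The batch
       is scanned so only nonempty buckets are ever touched. */
    int n_high = batch_len;
    for (int t = 0; t < n_high; t++) {
        int h = batch[t];
        int min_i = high_head[h];
        for (int i = next_[min_i]; i >= 0; i = next_[i])
            if (key[i] < key[min_i]) min_i = i;         /* stays behind */
        int i = high_head[h], nxt;
        high_head[h] = min_i;
        for (; i >= 0; i = nxt) {
            nxt = next_[i];
            if (i == min_i) { next_[i] = -1; continue; }
            int l = key[i] & (BUCKETS - 1);
            if (low_head[l] < 0) batch[batch_len++] = l;
            next_[i] = low_head[l];
            low_head[l] = i;
        }
    }
}
```

Running this on the figure's seven keys reproduces the pictures: High[001], High[011], High[100], and High[110] each retain their minimum, and the batch ends as 100, 011, 001, 110, 111, 100. Note that the batch holds 3-bit bucket numbers, half the length of the original keys, which is exactly the reduction the phase is meant to achieve.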