# The Fastest Sorting Algorithm?

The next step is to sort the batch. This is the reduced sorting problem -- the batch contains at most n numbers and each number has b/2 bits.

Using the sorted batch, it's straightforward to assemble the numbers in ascending order. You start by moving numbers back from the Low bucket list to the High bucket list. Figure 3 is the result. Once again, you use the batch to find the nonempty buckets. But because the batch is now sorted, the Low buckets will be visited in order. Hence, each High bucket will end up containing a sorted list.

##### Figure 3: Using the now sorted batch, the elements from the Low buckets have been moved back to High buckets and can be extracted in ascending order.

```High                       Low
------------------------   --------
000:                       000:
001: 001000, 001100         001:
010:                       010:
011: 011001, 011111         011:
100: 100101, 100111         100:
101:                       101:
110: 110111                110:
111:                       111:

Batch:  001, 011, 100, 100, 110, 111

```

Finally, you traverse the nonempty High buckets, once again using the sorted batch, to produce the final sorted list.

You might have noticed a problem with this algorithm. How do you know if a bucket is empty? The obvious solution is to initialize each bucket to a null pointer. But you can't afford that: The number of buckets might be larger than n and the algorithm is supposed to run in linear time. The solution is a bit tricky.

### How to Avoid Initializing Memory

Consider an algorithm that uses a large memory area. If the running time of the algorithm is smaller than the size of the memory, initializing the memory will take longer than running the algorithm. However, using a shrewd trick, it's possible to refrain from initializing the memory.

The idea is to keep track of which memory positions have been used so far. Using this information, you may postpone the initialization until the first time a memory cell is encountered.

One solution would be to use an array to keep track of active memory positions. Every time a memory position is accessed, check whether the address is in this array of active memory positions. If not, initialize this memory cell and add the address to the array.

For this scheme to work without increasing the overall running time, it's crucial that the lookups in the array can be performed in constant time. To achieve this, use double bookkeeping. Not only do you use an array of active memory positions, but each memory position is also extended to contain a pointer into this array. In Figure 4, for instance, the memory has three active memory cells: cell numbers 0, 4, and 6.

##### Figure 4: How to avoid initializing memory using double bookkeeping.

```Memory                        Active
-------------------------      -------

0       134431   X---------->  0
1       938434      -------->  4
2       432754      |  ----->  6
3       292343      |  |
4       874944   X---  |
5       002345         |
6       654243   X------
7       112903

```

Checking whether a memory cell is active is a two-step procedure. First, you check whether the pointer attached to this memory position is valid: Does it point to a position in the active array? If not, the memory position contains garbage and has not been initialized. If the pointer is valid, you also need to check whether the corresponding pointer in the active array actually points to this memory address. If it does, the memory cell is active.

This check only requires two memory accesses. Hence, it can be performed in constant time. Adding a new address to the active list is also done in constant time: Store the memory address in the next free position in the active array and set up the pointer associated with the memory location accordingly.

### Fast Merging of Short Numbers

All that remains is to merge short sequences of short numbers fast. The idea is to use a parallel algorithm that only requires very simple operations. If the algorithm is simple enough, it may be possible to simulate the parallelism using the inherent parallel nature of bitwise logical operations. For example, the bitwise OR operation performs w OR operations in parallel in just one machine cycle.

Batcher's bitonic merging fulfills the requirements. The algorithm is based on a property of so called "bitonic sequences." A bitonic sequence is almost sorted: It is the concatenation of an ascending and a descending sequence, or it can be obtained by a rotation (cyclic shift) of such a sequence. For example, the sequence 2, 3, 4, 3, 2, 1 is bitonic since it is the concatenation of the ascending sequence 2, 3, 4 and the descending sequence 3, 2, 1. The sequence 3, 4, 3, 2, 1, 2 is also bitonic, because it can be obtained by rotating the previous bitonic sequence. Bitonic sequences have a special property described in the lemma in Figure 5.

##### Figure 5: The bitonic lemma.

```If the sequence x(1), x(2), ..., x(2k) is bitonic,
the two sequences

L = min(x(1), x(k+1)), min(x(2), x(k+2)), min(x(k), x(2k))

and
R = max(x(1), x(k+1)), max(x(2), x(k+2)), max(x(k), x(2k))

are also bitonic. Furthermore, each element of L is smaller
than or equal to each element of R.

```

In other words:

1. Start with any bitonic sequence with an even number of elements, cut it in half, and form two new sequences by comparing the two halves element by element.
2. Form one sequence, L, containing the minimum elements of each pair and an other, R, which contains the maximum elements.
3. Then the sequences L and R will both be bitonic and each element of L will be smaller than or equal to each element of R.

Using this lemma, you can efficiently merge sorted sequences. For instance, by reversing one sequence in Figure 6 and appending the other, you get a bitonic sequence. In the first phase of the merging, you use the lemma to produce two subsequences L and R, where each element of L is no larger than each element of R. To complete the merging, you only need to sort L and R and concatenate the two sequences. This can be done using the lemma once again. Remember that both L and R are bitonic, so in the second phase, you can apply the bitonic lemma to each of them. The process is repeated recursively until the subsequences consist of just one element.

##### Figure 6: The first phase of merging two sorted sequences using the bitonic lemma.

```X = 0, 2, 3, 5
Y = 2, 3, 6, 7

// Form a bitonic sequence S by reversing X and appending Y:
S = 5, 3, 2, 0, 2, 3, 6, 7

// Compute the min and max sequences L and R:
L = min(5, 2), min(3, 3), min(2, 6), min(0, 7) = 2, 3, 2, 0
R = max(5, 2), max(3, 3), max(2, 6), max(0, 7) = 5, 3, 6, 7

```

Packing all of the numbers in one machine word, it's possible to perform each phase of this merging algorithm in constant time; see Figure 7. In this example, the numbers consist of 3 bits. You'll also need a special test bit for each number, so you can fit eight numbers into one 32-bit word. The special bitmask N, which indicates which elements are smaller, is computed using subtraction and the basic bitwise AND/OR operations. The second phase of the merging can be performed in a similar fashion. In fact, both subproblems can be solved in parallel using the same technique. (The full details can be found in nloglogn.c, available electronically at the top of the page.)

##### Figure 7: Implementing the computation of Figure 6 in parallel with standard operations only.

```A = 0101 0011 0010 0000 0000 0000 0000 0000   // 5 3 2 0 0 0 0 0
B = 0010 0011 0110 0111 0000 0000 0000 0000   // 2 3 6 7 0 0 0 0

// Compute a bitmask N, showing which elements of A that are
// smaller than or equal to the corresponding elements of B.
// M is a precomputed constant.

M = 1000 1000 1000 1000 1000 1000 1000 1000     // 8 8 8 8 8 8 8 8
N = ((B OR M) - A) AND M                      	// 0 8 8 8 8 8 8 8
N = N - (N>>3)                                	// 0 7 7 7 7 7 7 7

// Compute the max and min sequence and concatenate them.

Z =    (B AND N)                             // 0 3 6 7 0 0 0 0
OR (A - (A AND N))                       // 5 0 0 0 0 0 0 0
OR ((A AND N)>>16)                       // 0 0 0 0 0 3 2 0
OR ((B - (B AND N))>>16)                 // 0 0 0 0 2 0 0 0

```

Each phase of this merging algorithm is implemented using just a constant number of basic operations so the total time to merge k numbers, small enough to fit within one word, is proportional to log k.

Stefan is a lecturer in the Department of Computer Science at the Royal Institute of Technology (KTH) in Stockholm, Sweden. He can be reached at [email protected]

### More Insights

 To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.