This article discusses practical approaches to generic sorting in C++. I will tell you upfront that I do assume you've gotten acquainted in the past with sorting algorithms (beyond Bubble Sort, by the way!) and that you need only to refresh the afferent neurons. Explaining in great detail how Quicksort works, presenting its challenges, and building a cutting-edge C++ implementation of a sorting algorithm would be too much for one article. So if your only intuition of Quicksort is that it might be an improved version of Slowsort, you may want to refer to [1] first.

The code in alexandr.zip is extracted from the legendary YASLI (Yet Another Standard Library Implementation), the one STL implementation that is better than all others.

### Errata

The **duff_find** code in my previous article [2] has a bug. The unrolled loop does seven steps instead of doing eight, as it should. I corrected the code in the online version (ah, the joys of electronic publishing) but be warned that I didn't redo the speed measurements, which are likely to become even more favorable to the Duff implementation.

### Quickly Implementing a Quick Version of Quicksort

There are many sorting algorithms, but the one that works best on arrays and other randomly accessible sequences is the well-known Quicksort.

Quicksort has the following easy steps:

- If there's nothing to sort (empty input sequence), then stop.
- Select some element in the sequence, called the pivot.
- Partition the sequence into two adjacent ranges: one contains elements less than or equal to the pivot; the other contains elements greater than the pivot. This is a straightforward linear-time algorithm.
- Quicksort the first range.
- Quicksort the second range.

A high-level language can express Quicksort almost as tersely and clearly as you would describe it using English and mathematics. For example, here's a *complete* Quicksort implementation in Haskell:

quicksort :: [a] -> [a] quicksort [] = [] quicksort (p:xs) = quicksort [ x | x <- xs, x <= p ] ++ [ p ] ++ quicksort [ x | x <- xs, x > p ]

The code above simply says: the empty sequence sorts to itself; otherwise, you pick the first element in the sequence and make it the pivot. The result is a concatenation of three sequences: Quicksort on those elements in the sequence's tail that are less than or equal to the pivot; the pivot itself; and finally, Quicksort on the greater elements.

Figuring out how to take this four-liner and transform it *automatically* into an efficient implementation is one of those fascinating problems with which Computer Science never ceases to confront us. For now, we'll follow the advice of the tailor shop ad: "We don't tear your clothes with machines; we do it carefully by hand."

### Choosing the Pivot

The major problem with Quicksort is its abysmal worst-case performance.

The power of Quicksort comes from dividing its input into two disjoint subsequences having roughly the same size. If the two subsequences are roughly equal, Quicksort will do a linear pass on sequences of which lengths halve every stage, which yields a *O(n log n)* complexity.

If, however, one of the subsequences is always much larger than the other, this means the work is not balanced — there is only little progress at each recursive step.

For example, say you try to sort a sequence that's already sorted, and you always choose the first element of the sequence as the pivot. Consequently, the left subsequence will only contain one element (the pivot itself), and the right subsequence will contain all others. Given that the selection process takes linear time and that at each step you are getting rid of only one element, you don't have to be a Knuth to figure out that the "quick" sort algorithm has *O(n ^{2})* complexity. Ouch!

Therefore, much effort has been put into figuring out good guesses for the pivot. Ideally, the pivot's value would be the median of the sequence so that it nicely divides the input into equal halves.

Choosing the first element as pivot, as we saw, is not lucrative for sequences that are already sorted or almost sorted. If you think maybe the last element would consequently be best, sequences that are sorted in decreasing order become troublesome.

Choosing the element in the middle is not bad, but still there's no guarantee against the dreaded *O(n ^{2})*. What many modern Quicksort implementations do is to select the median of the first, last, and mid element as the pivot. Some others choose a more representative sample. Beyond a point, however, too much work for finding the best pivot requires too much time to make it worthwhile.

Given that we really don't have much of a clue about what the best pivot is, one possibility is to totally give up and just choose some random element of the sequence as a pivot. I know, this sounds crazy, but let's look a little into this heretical approach.

### Randomized Algorithms

Randomized algorithms are exactly what you fear them to be: at a point, a randomized algorithm makes a random choice out of a set of possibilities and carries on from there.

Starting in the 1990s (see [3] for a nice overview written by David R. Karger, a prominent contributor to the field), randomized algorithms began gaining more and more stature and popularity with algorithm theoreticians. Given that it takes about a decade for a successful concept in Computer Science to make a career in the industry, I predict it's about time that the industry gives in and gets randomized.

Why randomized algorithms? Well, there are multiple reasons for looking at them:

- There are many useful problems for which no algorithm running in reasonable time exists. Some of these problems have satisfactory solutions based on randomized algorithms. These range from practical problems such as the traveling salesman problem to much more "computerish" problems such as finding the optimal register allocation when compiling a high-level language to machine language.
- There are many problems that do have a reasonably fast solution, but you'd be glad to trade a little accuracy for a lot of speed. For example, in a network router, you'd gladly opt for a quick, pretty good routing solution instead of waiting longer for the perfect routing.
- Most randomized algorithms are considerably simpler than their non-randomized counterparts.
- Concocting and analyzing randomized algorithms is sheer fun.
- There are situations in which no reasonable heuristic can be made, so shooting at random can be very effective in fighting the multi-headed "complexity hydra." Examples: memory allocation (for any deterministic heuristic you use, there is a program that makes it look really bad) or, well, Quicksort!

The last point is instrumental to understanding randomized algorithms. Think of your heuristic as your weapon, and of the problem as your enemy. If your heuristic is deterministic, the problem could "figure it out" and "fight against it." If, *au contraire*, you shoot at random, no matter where it attacks from, the problem can't get you.

Randomized algorithms come in two flavors:

- Monte Carlo algorithms. They offer complexity guarantees, but don't guarantee returning the correct result. That's why such algorithms are typically run several times (in a so-called Monte Carlo simulation), until you find the right solution (or give up in frustration).
- Las Vegas algorithms. These always return the correct result, yet offer no complexity guarantees. (In other words, they can take a long time under pathological circumstances.)

Choosing a pivot randomly in Quicksort yields a Las Vegas algorithm: it always works, but still can take *O(n ^{2})* time in the worst case. The point is that hitting that worst case is very improbable. Plus, don't forget that the investment in making a random choice is small: generating good pseudorandom numbers is cheap.

We abstract away the pivot selection in a generic function bearing the signature:

template <class Iter> Iter SelectPivot(Iter begin, Iter end);

**SelectPivot** is templated by the iterator type (oftentimes, a pointer). It takes a range specified in the classic STL manner and just returns some randomly chosen iterator in between the two.

### Fit Pivot

Another source of variation in Quicksort implementations is treating equal ranges. You see, if the input has many elements equal to the pivot, you can accelerate the algorithm by dividing the sequence into *three* (less than, equal to, and greater than) subsequences instead of two. If the second sequence is quite large, then sorting goes substantially faster. The average-case complexity decreases from *O(n log n)* down to *O(n log m)*, where *m* is the number of *distinct* elements in the sequence.

This technique is called "fat pivot" by Quicksort junkies because the pivot becomes a range of elements instead of one element.

On the downside, however, maintaining a fat pivot naturally demands more work, and that work doesn't come for free. So if your sequence doesn't contain a lot of identical elements, you end up doing more work for no benefit.

An intermediate solution — which we're going to use — is to keep a "fit pivot," which is neither skinny nor fat. The way it works is to grab the low-hanging fruit by applying the following partitioning algorithm:

- Starting from left, find the first element strictly greater than the pivot.
- Starting from right, find the first element strictly smaller than the input.
- If the result of step 1 is to the left of the result of step 2, swap the two elements, rinse, lather, and repeat from step 1 with the reduced sequence.
- Otherwise, the discovered range is the fit pivot.

The algorithm above divides the input sequence into three ranges. From left to right:

- Elements smaller than or equal to the pivot.
- Elements equal to the pivot.
- Elements greater than or equal to the pivot.

If there is a continuous run of equal elements, our algorithm can find it and have the effectiveness of a fat pivot algorithm. On the other hand, if many equal elements are dispersed through the sequence, then our algorithm will remain just as effective as a skinny pivot algorithm. The point is our algorithm doesn't do any more work (comparisons and swapping) than the skinny pivot algorithm, yet it is smart enough to grab an opportunity. Quite fit I'd say.

The **Partition** routine that our algorithm uses has the signature:

template <class Iter> std::pair<Iter, Iter> Partition(Iter begin, Iter end, Iter pivot);

**Partition** returns the fit pivot as a range of two iterators.

### To Iterate Is Human; to Recurse Is Divine; to Do Both Is Demiurgic

Assuming we have **SelectPivot** and **Partition** under our belt, implementing Quicksort would look like this:

template <class Iter> void Sort(Iter b, Iter e) { const pair<Iter, Iter> midRange = Partition(b, e, SelectPivot(b, e)); Sort(b, midRange.first); Sort(midRange.second, e); }

Now *that*'s easy. Recursion is very elegant, but iteration is slightly more efficient. The last recursive branch can be trivially transformed in an iteration. (It's an instance of the so-called "tail recursion," which many modern languages transform in iteration automatically.)

template <class Iter> void Sort(Iter b, Iter e) { while (e - b > 1) { const pair<Iter, Iter> midRange = Partition(b, e, SelectPivot(b, e)); Sort(b, midRange.first); b = midRange.second; } }

The final code actually performs one more optimization — it transforms the largest range of the two into an iteration, so as to put minimal strain on the stack in the recursive step.

### Seattle's Best Coffee

Coffee aficionados know that you get the best coffee by blending several kinds of beans. Similarly, you obtain a better sorting algorithm by combining several algorithms and put each to work when the circumstances would favor it.

For example, Quicksort doesn't work very well on short sequences. It swaps elements more than it should, and all that work and shuffling makes for a steep constant in front of that *O(n log n)* average complexity.

For short sequences, insertion sort would work better. So a good sorting algorithm is to use Quicksort as above, until the size of the sequence to sort falls below a threshold. Then, insertion sort would take care of the rest:

template <class Iter> void Sort(Iter b, Iter e) { while (e - b > 1) { enum { threshold = 10 }; if (e - b <= threshold) { ... do an insertion sort on [b, e) ... } else { ... carry on with the Quicksort as above ... } }

As you can see, this strategy improves sorting speed for all inputs, small or large. Quicksort chops the input in coarse pieces, and insertion sort finishes the process.

Insertion sort works by "growing" the sorted array in-place. It successively inserts elements from the unsorted portion of the array into the sorted portion. Here's the algorithm:

- Assign
**sortedLength = 1**. The array from**0**to**sortedLength**is sorted; the array from**sortedLength**to**length**is unsorted. - Do a binary search on
**array[sortedLength]**in the sequence from**0**to**sortedLength**. Obtain the position of the first element greater than**array[sortedLength]**, if any. - Insert
**array[sortedLength]**into the found position by shifting up the positions to the right of the found position. - Increment
**sortedLength**. - Continue to step 2 until
**sortedLength == length**.

Now, how short is "short"? This depends on a ton of factors difficult to evaluate other than by sheer measurements. Values between 8 and 16 are commonly used.

### Loop Unrolling

Ok, now if insertion sort is used only for short sequences, how about that loop unrolling that is supposed to make short loops faster?

Indeed, both the binary search and the insertion sort itself can be nicely unrolled. If you think that can be done only by writing the same code again and again, think *templates*. Specifically, think using integral constants as template arguments.

Loop unrolling using templates is simple — you define your loop as a tail recursion and then define a template that implements that tail recursion. For example, here is a binary search algorithm that doesn't use any loops at all:

template <class Iter, unsigned int size> struct BinaryFinder; template <class Iter> struct BinaryFinder<Iter, 1> { typedef typename std::iterator_traits<Iter>::value_type ValueType; static inline Run(Iter begin, const ValueType& v) { if (!(v < *begin)) ++begin; return begin; } }; template <class Iter, unsigned int size> struct BinaryFinder { typedef typename std::iterator_traits<Iter>::value_type ValueType; static Iter Run(Iter begin, const ValueType& v) { const Iter mid = begin + (size / 2); if (v < *mid) { return BinaryFinder<Iter, size / 2>::Run(begin, v); } return BinaryFinder<Iter, size - size / 2>::Run(mid, v); } };

The **size** argument specifies the length of the sequence to be sorted. The algorithm successively recourses to "smaller" versions of itself, until it inevitably reaches the corner case **size==1**. That case stops the recursion and is implemented separately as a specialization.

In source code, **BinaryFinder** is a nice and simple recursive algorithm. After inlining, optimizations and all, **BinaryFinder** reveals itself as a big, ugly, hairy ogre that does it all by brute force. It's really nice that you don't have to deal with that beast directly; you can keep your white sleeves clean and lie to people that you're conducting honorable business.

As you can see in the included code, **InsertionSorter** is equally aggressive with inlining. It uses the above-defined **BinarySearch**, Duff's device for copying, and recursion to itself for unrolling.

Too much unrolling can be harmful, though. Long loops fill the instruction cache, and at a point the cache misses overcome the gains made by not computing indices and loops at run time. For example, in my tests, a threshold of 16 was slower than a threshold of 12.

### Dumb and Dumber: Who's Better?

For sake of curiosity, let's try another algorithm for short sequences: selection sort.

- Find the smallest element in the input sequence.
- Swap it into the first position.
- Repeat from step 1, this time starting from the next element in the input sequence (given that the first has already put in its final position).

Selection sort's claim to fame is that it does the minimum amount of swapping possible. However, selection sort is quite generous with the number of comparisons, in the order of magnitude of *O(n ^{2})*.

It turns out selection sort does even better than insertion sort for short sequences. I decided to use an unrolled selection sort algorithm for the final implementation.

The included **Sort** routine therefore combines Quicksort (with the tweaks mentioned above) and unrolled selection sort. One thing worth trying would be Quicksort followed by not-unrolled (would that be called "rolled"?) insertion (or selection) sort for mid-sized sequences followed by unrolled selection sort for small sequences.

### Conclusion

To some extent, sorting is a pretext to illustrate useful generic programming techniques. However, our **Sort** routine is not just a toy. I decided to leave it as homework for the interested reader to test it for speed against popular STL implementations. You might be in for a little surprise. For now, suffice to say that on random integers, our **Sort** leaves behind Microsoft Visual C++ .NET 2003 Beta's **std::sort** by a comfortable 34% margin. That's like driving 80 mph instead of 60. Kudos to Howard Hinnant, the author of a very neatly optimized **std::sort** for Metrowerks' CoreWarrior 7.0. The legendary YASLI beats it by just an insignificant 2.5% margin (but hey, it's a positive margin nonetheless). I will leave it up to you to play with other inputs and compilers. Download the code, take a look at it, run it, and let me know your results. If you have further optimization ideas, I'd be glad if you shared them with me so I can share them with everybody on your behalf.

### Bibliography and Notes

[1] The original paper on Quicksort is: C. A. R. Hoare. "Quicksort," *Computer Journal*, 1962, 5, p.p. 10-15. You can find a very nice Quicksort tutorial at <http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/qsort.html>.

[2] A. Alexandrescu. "Generic<Programming>: Efficient Generic Sorting and Searching in C++ (I)," *C/C++ Users Journal*, October 2001, <www.cuj.com/experts/2010/alexandr.htm>.

[3] David R. Karger. *Randomization in Graph Optimization Problems: A Survey*, 1998 , <http://citeseer.nj.nec.com/karger98randomization.html>.

* Andrei Alexandrescu is a Ph.D. student at University of Washington in Seattle, and author of the acclaimed book *Modern C++ Design

*. He may be contacted at*

**www.moderncppdesign.com**. Andrei is also one of the featured instructors of The C++ Seminar (<http://thecppseminar.com>).