# Planet of the Iterative Quicksort

I like sorting. I'm not sure if I could tell you why, but it has always been an interesting problem for me to think about, investigate, and implement. It must be more than the fact that sorting is estimated to consume anywhere from 50 to 90 percent of all computing cycles (depending on who you ask and in what decade you ask them). Maybe the reason is because it's easy to tell if you've gotten the right answer, especially in parallel. Plus, there are so many different ways to approach the same problem.

One of the best-known algorithms for sorting is Quicksort. It is a quintessential example of recursion. In serial, the first step takes the array of things to be sorted, chooses one of them as the "pivot" element, and partitions the array such that you now have two sub-arrays — one with items less than (or equal to) the pivot value, and one with items greater than (or equal to) the pivot. If duplicate keys are allowed, you can arbitrarily decide which sub-array gets the keys equal to the pivot. A recursive call to the Quicksort routine on each sub-array completes the algorithm. The recursion ends when the number of items in a sub-array becomes too small to sort. Alternately, a threshold can be set, and a simpler sort routine (e.g., insertion sort) can be called when the size of a sub-array dips below that level, which foregoes the overhead of recursive function calls.

In parallel, since each of the recursive calls is independent, you can spawn a pair of tasks with each task responsible for the execution of one recursive call. To take advantage of data locality and a warm cache, another option is to spawn a single task on one of the recursive calls and simply loop back to the beginning of the routine to partition the other sub-array in the same task (thread).

This is all review for you, right? Sure. And if this is all you want to know about parallelizing Quicksort, jump out of here and see if Gaston Hillar has something new this week. (Feel free to pass "Go" and collect $200 on your way, too.)

For those of you who are still here, let's take a look at an iterative version of Quicksort. Recursive algorithms can be imitated by an iterative variation and (typically) a stack to simulate the pattern of recursive function calls. So, what would an iterative (serial) version of Quicksort look like? The following code is one example:

void QuickSort(void *pArg) { int p, r, q; qSortIndex *d, *d1, *d2; while (Q.notEmpty()) { d = Q.dequeue(); //pull out next index pair, unless queue is empty p = d->lo; r = d->hi; delete d; if (p < r) { // if there is one or more things to sort... q = Partition(p, r); // encapsulate the indices for the lower portion of the partition and enqueue d1 = new qSortIndex; d1->lo = p; d1->hi = q-1; Q.enqueue(d1); // encapsulate the indices for the upper portion of the partition and enqueue d2 = new qSortIndex; d2->lo = q+1; d2->hi = r; Q.enqueue(d2); } } // end while }

For this code, I'm assuming a structure that holds objects of type `qSortIndex`

, which is a pair of integers denoting the lower and upper index values of a sub-array to be sorted. Before the first call to `QuickSort()`

, an initial pair of indices (delimiting the entire array to be sorted) is put into the data structure, `Q`

. The code loops while `Q`

is not empty. For each iteration of the `while`

-loop, an index pair of a sub-array to be sorted is taken out of `Q`

. If there is more than one item to be sorted, the sub-array is partitioned around some pivot element. The index pairs for the sub-array of items less than the pivot are put into `Q`

, and the index pairs for the sub-array of items greater than the pivot are put into `Q`

. The loop cycles to the top and moves on to another index pair from a non-empty `Q`

.

It doesn't too much matter if `Q`

is a queue or stack. The sort will still work correctly, but the order in which the sub-arrays are handled changes. (For clarity and brevity, I'll refer to the data structure as a queue.) If you feel ambitious, you can modify the aforementioned algorithm to stuff just one index pair back onto `Q`

and continue by partitioning the other sub-array. Once the "kept" array is too small to sort, the next item on `Q`

is taken out and the algorithm continues processing with that retrieved sub-array.

At this point, I could insert a mathematical proof that this algorithm does sort an array of items. However, I'm sure you can see that each call to `Partition()`

places one item in its final sorted position, and thus, the sizes of the sub-arrays constructed are smaller than the original. And since one thing by itself is in sorted order (Breshears's Fundamental Law of Sorting), there will be a point where no new index pairs are added to `Q`

and the algorithm will eventually terminate when `Q`

is empty. So, let's just move on to something more interesting.

What about parallelizing this iterative algorithm? First and most obvious, you need to use a thread-safe queue. You can't lose an index pair, nor do you want to sort the same sub-array more than once for best performance. If one thread is putting something in while another thread is taking something out, or if two separate threads are both taking something out, it is imperative that it be done in such a way that the queue or an entry in the queue is not corrupted. Also, it must be the case that if a thread finds the queue not empty, then there is an element in the queue that the thread can retrieve and process. We don't want some other thread purloining the lone item in the queue that both threads believe belongs to them.

I could go on citing other bad situations for another paragraph or two, but I think you get the idea. There are many issues and potential situations that must be considered and covered correctly in a thread-safe data structure. Luckily, we won't spend any time on those details. Let's just click our heels together three times and say, "There are thread-safe data structures."

Now, armed with our favorite queue-like thread-safe data structure, to parallelize the iterative code above, we just prime the queue with the initial array bounds, launch some threads, and then point them at the `QuickSort()`

function, right? Well, not quite. There's one more piece to figure out and that piece is… *urrk* *choke* *cough* *arrrgh*

(Don't worry, I'm fine. I've just always wanted to write an interrupted reveal ending.)

Next time: the missing piece!