Dr. Dobb's is part of the Informa Tech Division of Informa PLC


# Rise of the Planet of the Iterative Quicksort

October 24, 2011

The genesis of my study of this less-than-intuitive implementation of Quicksort was actually a challenge from a colleague to do Iterative Quicksort with OpenMP. It's easy to use OpenMP tasks to parallelize the recursive version of the algorithm. Voila!

```
void QuickSort(SomeDataType *A, int p, int r)
{
    if (p < r) {  // if there is one or more things to sort...
        int q = Partition(A, p, r);
        #pragma omp task
        QuickSort(A, p, q-1);
        #pragma omp task
        QuickSort(A, q+1, r);
    }
}
. . .
#pragma omp parallel
{
    #pragma omp single
    QuickSort(A, 0, N-1); // initial call in parallel region; r is inclusive
}
```

The initial call to get the Quicksort recursion started is done from a single region within a parallel region. That way, only one thread will actually execute the call to `QuickSort()` while all other threads on the team wait at the implicit barrier of the `single` construct. However, not wanting to leave threads spinning or twiddling their thumbs or knitting sweaters or doing whatever idle threads do, the OpenMP runtime will assign tasks spawned from `QuickSort()` to the blocked threads.

Using OpenMP on an iterative Quicksort can be as easy as I showed in the previous installment. Take a thread-safe queue, add a semaphore to keep count of items in the queue, and count as items are put into the final position to know when the sort is complete. The OpenMP thread team simply becomes a thread pool calling the iterative function as if they were explicitly spawned. If you haven't been there and done that, go back to my previous post and then come back. I'll wait.

Okay, what can we do differently? How about using an Intel Threading Building Blocks (TBB) `concurrent_queue` container? It may not sound much different from using a brute-force queue implementation, but it has at least one advantage over the previous code I've posted. The TBB `concurrent_queue` object has a `try_pop()` method. This method looks into the queue and, if there is something there, retrieves the item at the head of the queue; otherwise, it returns `false`. (Sound familiar?) And all of this is done atomically.

By using the TBB queue there is no need to allocate a semaphore object to track the number of items in the queue. However, we will need to set up a spin-wait loop to keep checking the queue status. Every time the queue is empty, threads will check the `Done` variable to determine if the empty queue is indicative of the sort being done. If `Done` has been set to `TRUE`, the sort is done and threads can return from `QuickSort()`; if not, check the queue again for an index pair of a sub-array to be sorted.

Everything else is pretty much the same as we've seen previously. Because we are using OpenMP, the protected update to `tCount` is done within an `atomic` region.

Here's the code for the OpenMP `QuickSort()` function. I've added the shared declaration of the TBB `concurrent_queue` to hold items of type `qSortIndex` (index pairs of integers defining a sub-array to be sorted), but not the inclusion of the proper TBB header file nor the calling of `QuickSort()` from a parallel region.

```
tbb::concurrent_queue<qSortIndex> Q;
. . .
void QuickSort(SomeDataType *A, int N)
{
    int p, r, q;
    qSortIndex d, d1, d2;
    int t = 0;

    while (1) {
        while (!Q.try_pop(d)) {
            if (Done) return;  // if Done, no more need to process anything
        }
        p = d.lo; r = d.hi;
        if (p < r) {
            q = Partition(A, p, r);
            #pragma omp atomic
            ++tCount;                  // pivot is now in its final position
            d1.lo = p; d1.hi = q-1;
            Q.push(d1);
            d2.lo = q+1; d2.hi = r;
            Q.push(d2);
        }
        else if (p == r) {
            #pragma omp atomic capture
            t = ++tCount;              // atomically capture the new count
            if (t == N) Done = TRUE;   // Signal threads that sorting is done
        }
    }
}
```

I know I said I'm not a fan of spin-wait loops when there is an alternative. I've already provided one, but where's the fun in rehashing something we've already done? Besides, another advantage of the implementation above is that there is no need to signal an external thread that sorting is done and have that thread fill up the queue to "free" the threads sitting on some synchronization object, waiting for something in the queue before they are able to check the status of the sort. This should yield simpler code: no spawning a team-of-one OpenMP thread to play the "external" role, and none of the extra signals and synchronization objects that a spin-wait-free version forces on all the threads.

One last thing before I wind up this post. Below is a code snippet from an older version of the OpenMP algorithm I wrote. This is the two `while`-loops and one statement that follows the inner `while`-loop from the above function. The code works when run with multiple threads, but it looked funny when I was reviewing things to write up this post. (Helpful Hint to Readers: Include documenting comments so you won't be mystified by code you wrote just one month before.)

```
while (1) {
    while (!Q.try_pop(d)) { // pause? or else spin-wait with no wait
        if (Done) break;  // if Done, no more need to process anything
    }
    if (Done) break;  // signaled to terminate
    . . .
```

The two conditional tests so close together looked redundant. I knew that one of them was required. However, when I tried to simplify the code by removing one or the other, the execution hung. So, let's end with a quiz. The quiz has two questions: 1) When running with two or more threads, why does the code hang if the first `if (Done) break` (line 3) is removed? 2) When running with two or more threads, why does the code hang if the second `if (Done) break` (line 5) is removed?

If you have answers, show off and leave them in the comments.
