Rise of the Planet of the Iterative Quicksort
The genesis of my study of this less-than-intuitive implementation of Quicksort actually was a challenge from a colleague to do Iterative Quicksort with OpenMP. It's easy to use OpenMP tasks for parallelizing the recursive version of the algorithm. Voila!
void QuickSort(SomeDataType *A, int p, int r)
{
if (p < r) { // if there is one or more things to sort...
q = Partition(A, p, r);
#pragma omp task
QuickSort(A, p, q-1);
#pragma omp task
QuickSort(A, q+1, r);
}
}
. . .
#pragma omp parallel
{
#pragma omp single
QuickSort(A, 0, N); // initial call in parallel region
}
The initial call to get the Quicksort recursion started is done from a single region within a parallel region. That way, only one thread will actually execute the call to QuickSort() while all other threads on the team wait at the implicit barrier of the single construct. However, not wanting to leave threads spinning or twiddling their thumbs or knitting sweaters or doing whatever idle threads do, the OpenMP runtime will assign tasks spawned from QuickSort() to the blocked threads.
Using OpenMP on an iterative Quicksort can be as easy as I showed in the previous installment. Take a thread-safe queue, add a semaphore to keep count of items in the queue, and count as items are put into the final position to know when the sort is complete. The OpenMP thread team simply becomes a thread pool calling the iterative function as if they were explicitly spawned. If you haven't been there and done that, go back to my previous post and then come back. I'll wait.
Ready?
Okay, what can we do different? How about using an Intel Threading Building Blocks (TBB) concurrent_queue container? It may not sound much different than using a brute force queue implementation, but it does have at least one an advantage over the previous code I've posted. The TBB concurrent_queue object has a try_pop() method. This method looks into the queue and, if there is something there, it retrieves the item at the head of the queue; otherwise, it returns an error code. (Sound familiar?) And all of this is done atomically.
By using the TBB queue there is no need to allocate a semaphore object to track the number of items in the queue. However, we will need to set up a spin-wait loop to keep checking the queue status. Every time the queue is empty, threads will check the Done variable to determine if the empty queue is indicative of the sort being done. If Done has been set to TRUE, the sort is done and threads can return from QuickSort(); if not, check the queue again for an index pair of a sub-array to be sorted.
Everything else is pretty much the same as we've seen previously. Because we are using OpenMP, the protected update to tCount is done within an atomic region.
Here's the code for the OpenMP QuickSort() function. I've added the shared declaration of the TBB concurrent_queue to hold items of type qSortIndex (index pairs of integers defining a sub-array to be sorted), but not the inclusion of the proper TBB header file nor the calling of QuickSort() from a parallel region.
tbb::concurrent_queue<qSortIndex> Q;
. . .
void QuickSort(SomeDataType *A, int N)
{
int p, r, q;
qSortIndex d, d1, d2;
int t = 0;
while (1) {
while (!Q.try_pop(d)) {
if (Done) return; // if Done, no more need to process anything
}
p = d.lo; r = d.hi;
if (p < r) {
q = Partition(A, p, r);
#pragma omp atomic
++tCount;
d1.lo = p; d1.hi = q-1;
Q.push(d1);
d2.lo = q+1; d2.hi = r;
Q.push(d2);
}
else if (p == r) {
#pragma omp atomic
++tCount;
if (tCount == N) Done = TRUE; // Signal threads that sorting is done
}
}
}
I know I said I was not a fan of spin-wait loops if there was an alternative. I've already provided an alternative, but where's the fun in rehashing something we've already done? Besides, another advantage of the above implementation is that there is no need to signal an external thread that sorting is done and have that external thread fill up the queue to "free" those threads sitting on some synchronization object waiting for something in the queue before they are able to check the status of the sort. This should yield simpler code. There is no need to spawn a team of one OpenMP thread to be the "external" thread and not have all threads deal with all the signals and objects of a spin-wait free version.
One last thing before I wind up this post. Below is a code snippet from an older version of the OpenMP algorithm I wrote. This is the two while-loops and one statement that follows the inner while-loop from the above function. The code works when run with multiple threads, but it looked funny when I was reviewing things to write up this post. (Helpful Hint to Readers: Include documenting comments so you won't be mystified by code you wrote just one month before.)
while (1) {
while (!Q.try_pop(d)) { //pause? or else spin-wait with no wait
if (Done) break; // if Done, no more need to process anything
}
if (Done) break; // signaled to terminate
. . .
The two conditional tests so close together looked redundant. I knew that one of them was required. However, when I tried to simplify the code by removing one or the other, the execution hung. So, let's end with a quiz. The quiz has two questions: 1) When running with two or more threads, why does the code hang if the first if (Done) break (line 3) is removed? 2) When running with two or more threads, why does the code hang if the second if (Done) break (line 5) is removed?
If you have an answer, show off and leave them in the comments.

