Quicksort Partition via Prefix Scan
So, here's what I've been building up to in the two previous posts. I'm going to describe how to put into effect the partition functionality of the Quicksort algorithm using the Prefix Scan operation. What this will do, if we have a manycore processor with a core for each of the M elements in the unsorted list, is to reduce the span (execution time) from the serial, O(M), to the seriously parallel O(log M). I saw a description of this attempt to make use of manycore parallelism and shorten the execution time of parallel codes in slides from the keynote address Guy Blelloch gave at the 2009 Principles and Practice of Parallel Programming (PPoPP) conference entitled "Parallel Thinking."
The operation that I'll use is a Binary Selection, which gathers together elements from a set that share a property. (I'm not sure if I just coined the term "Binary Selection," but I wanted a different name to avoid confusion with the other Selection algorithm that is used to pick the K-th largest element from an unordered set. Both of these operations can be done in parallel and rely on Prefix Scan as a component.) For the Partition phase of Quicksort I need to generate subsets of elements with three mutually exclusive properties: less than the pivot element, equal to the pivot element, and greater than the pivot element. I'm going to describe how to create the subset of "less than" elements since the same algorithm can be used for determining the other two subsets.
If you're like me, you can better learn a new algorithm (or most anything unfamiliar) if you have an example to illustrate the data manipulations of the algorithm. So, let me take a set of eight integers that we need to partition for upcoming Quicksort processing. Since I need to deal with the relative order between items, and it will likely be stored in an array structure, I'll think of my data as being stored in an array or vector.
Data vector to sort: [6 1 7 4 0 3 5 2]
The first step is to create a mask vector of the same length as the data vector. This vector will be used to show whether or not the associated data item has the desired property. Since this is a binary decision the values in the mask vector are 0 and 1. These should be the integer values, not the logical FALSE and TRUE, because I will need to do some arithmetic operations on the mask values in the next step.
To know whether or not a data item is less than the pivot value, I first need to have a pivot element. For this example, I will pick the value '4' to be my pivot value. Thus, my 8-element mask vector will be [0 1 0 0 1 1 0 1]. Since I'm assuming that there are eight processors available to do the simple comparison of one data element to the pivot value, this step takes constant time, O(1).
The second step is to run a Prefix Scan with addition on the mask vector. If I have zero-based indexing (C/C++, Java), I use an exclusive scan; if I have 1-based indexing (Fortran), I use an inclusive scan. As I showed in my previous post, this will be an O(log M) operation with M processors.
The result of the exclusive scan on my example mask vector is [0 0 1 1 1 2 3 3]. This result will best be copied into an array separate from the vector mask. It's not essential since there is a way to "recover" the original mask value of each position, but keeping the results separate from the source makes for a cleaner implementation.
The third step is a data movement step. Each processor will copy the associated value in the data vector that has a '1' in the mask to a new subset vector in preparation of a Quicksort recursive call (or new task for parallel execution) on that subset vector. The moves will be done in parallel, so I must ensure that each element is moved to a unique location. To do that, I use the scan result values as the destination indices to the new subset vector.
Look what I get when I line up the mask vector with the scan result vector:
[0 1 0 0 1 1 0 1] [0 0 1 1 1 2 3 3]
Everywhere I have a "1" in the mask vector, there is a different number in the corresponding position of the scan results vector. In fact, reading from left to right, these "marked" values are consecutive from zero to one less than the number of data elements that are less than the pivot value. So, any processor with a "1" in the corresponding mask vector position can move, in parallel, its data value to the new subset vector indexed by the corresponding value in the scan results vector. And it's all done in O(1) time.

