Quicksort Partition via Prefix Scan
In the three steps that I've just outlined here, I can create one of three subsets needed from the Partition operation for Quicksort. The other two partition subsets can be constructed in a similar manner, all in O(log M) time. After that the "less than" and "greater than" subsets are ready for sorting via the Quicksort algorithm.
There were a few ideas that occurred to me as I was writing the above. If you're in an indulgent mood, please read on. If you've got something else to get to soon or you have something cooking as you're reading this, simply skip down to the last two paragraphs.
First, the creation of the three masks can be done at the same time with a 3-part if-then-else structure to place a "1" in the corresponding position of the proper vector mask. Each processor is only looking at one element compared to the pivot value and it's still an O(1) operation. The prefix scan still needs to be done three times. (I suppose, if you're really tricky and you've written your own prefix scan subroutine, you can do all of them at the same time by handling each of the three vector masks all at once.)
Moving the data needs some care when building all three subsets concurrently. For the less-than subset, nothing changes from what was described earlier. If you are using separately allocated vectors for the other two subsets, these are handled just like the less-than vector. However, if you allocate a single vector the same length as the vector to be partitioned, you need to calculate the offsets into that one vector for the second and third subsets. The offsets are going to be found using the last element of the post-prefix scan vector masks. For example, look back to the example prefix scan results. The last value stored there is "3", which means that the pivot subset would start in the [4] (3 + 1) index position and the other elements equal to the pivot value will be indexed by adding 4 to the appropriate prefix scan vector values. The index of the first element in the greater-than subset will be the sum of the final elements from the less-than and equal prefix scan results plus the sum of the two final mask vector elements. (Only one of the three mask vectors will have a "1" in the final position and since I'm using an exclusive scan, the prefix scan on that mask vector will be one less than the actual count of the number of elements that match the criteria.)
Second, if you choose the pivot element to be the final vector element, you can skip the prefix scan for the equal subset run the partition algorithm on the first M-1 elements. Simply place that one pivot value in the proper position between the other two subset vectors when you've computed them. One of the partitioned subsets needs to hold any item from the unsorted vector that might be equal to the pivot value, too.
Finally, this variation of Quicksort is stable. By using the prefix scan and the method for choosing which elements go into which subset, the relative order of equal keys is preserved. Once the subsets become small enough, you simply need to use a stable sort algorithm to complete the process. If you pick the final element as the pivot to reduce the number of prefix scan operations (and because there may not be too many equal-valued keys), then the first subset must be "less than or equal to" in order to preserve the stability of this sort.
If you've been playing along with the other underlying theme of my last two postings, you should be able to see the advantages of this Partition variation in terms of work and span. In the traditional and the scan version of parallel Quickosrt, the work will be O(n log n). This shouldn't be a surprise since we are doing a sort. The big difference is that the span of the prefix scan version is much improved. With the traditional serial Partition code, the span for parallel Quicksort will be O(n) while the scan version is going to be only O(log^2 n).
With manycore processors, I hope you can see that there may be more parallelism available to speed up algorithms for which we already have parallel implementations on multi-core systems. Finding the algorithms that can take advantage of the bounty of cores will be a challenge and will require looking at things from a different perspective. Next up, I'll detail an actual implementation of the scan-based Partition operation described above.

