Parallel Merge Sort
Improving the algorithm via performance measurement
In last month's article in this series, a Parallel Merge algorithm was introduced and performance was optimized to the point of being limited by system memory bandwidth. Of course, the natural next step is to use it as a core building block for Parallel Merge Sort, since Parallel Merge does most of the work. Let's leap right in and explore this Parallel Merge Sort algorithm.
Simplest Parallel Merge Sort
Listing One shows the simplest Parallel Merge Sort, which uses the Parallel Merge. At the core, the algorithm takes the input range (l to r), computes the mid-point (m), calls itself recursively twice — once with the lower half range, and the second time with the upper half range — followed by merging the two resulting sorted halves into a single output. This is exactly how the standard Merge Sort works — divide, conquer, and merge. The two recursive calls are executed in parallel, followed by a parallel merge. The parallelism is Θ(n/lg2n), derived in [1].
template< class _Type >
inline void parallel_merge_sort_simplest_r( _Type* src, int l, int r, _Type* dst, bool srcToDst = true ) // srcToDst specifies direction for this level of recursion
{
if ( r == l ) { // termination/base case of sorting a single element
if ( srcToDst ) dst[ l ] = src[ l ]; // copy the single element from src to dst
return;
}
int m = ( r + l ) / 2;
//tbb::parallel_invoke( // Intel's Threading Building Blocks (TBB)
Concurrency::parallel_invoke( // Microsoft's Parallel Pattern Library (PPL)
[&] { parallel_merge_sort_simplest_r( src, l, m, dst, !srcToDst ); }, // reverse direction of srcToDst for the next level of recursion
[&] { parallel_merge_sort_simplest_r( src, m + 1, r, dst, !srcToDst ); } // reverse direction of srcToDst for the next level of recursion
);
if ( srcToDst ) merge_parallel_L5( src, l, m, m + 1, r, dst, l );
else merge_parallel_L5( dst, l, m, m + 1, r, src, l );
}
template< class _Type >
inline void parallel_merge_sort_simplest( _Type* src, int l, int r, _Type* dst, bool srcToDst = true ) // srcToDst specifies direction for this level of recursion
{
if ( r < l ) return;
parallel_merge_sort_simplest_r( src, l, r, dst, srcToDst );
Listing One
Table 1 and Graph 1 show performance of the simplest parallel merge sort implementation as compared with STL stable_sort and STL sort.
STL stable_sort is slower than STL sort, as expected — it takes more effort and time to ensure stability. The simplest Parallel Merge Sort implementation outperforms both STL sorts for arrays larger than about 10K elements. Parallel Merge Sort is stable, and is as much as 3.5X faster than STL stable_sort, and is as much as 2.7X faster than STL sort, on a quad-core processor.
Reversals
The Parallel Merge Sort is not an in-place algorithm. Source array elements are sorted and the result is placed in the destination array. Note that the last argument to the recursive function specifies the direction, with the default being "from the source to the destination," which is set at the top-level of recursion. Then at each level of recursion the direction is reversed. Figure 2 illustrates this directional reversal at each recursion level of the call tree.
At the top level of recursion hierarchy the sorted destination (dst) array is produced by merging two sorted source (src) arrays. During this merge operation elements are moved from the source (src) array to the destination array. At the next level, the sorted source (src) array is produced by merging two sorted destination (dst) arrays. This continues until the bottom of the hierarchy is reached with only a single element in the array to be processed.
The termination condition (base case) handles the two possible cases at this point:
- By doing nothing if the destination array is
src. In this case, we are sorting a single element that was provided in thesrcarray and are asked to place the result back in thesrcarray. With a single element, there is nothing to do when sorting it, and it's already in thesrcarray. So, we just leave it there. - By copying that single element from the
srcarray to thedstarray. In this case, we are sorting a single element that was provided in thesrcarray and are asked to place the result in thedstarray. With a single element there is nothing to do when sorting it. To place it in thedstarray we simply copy it.
The recursive implementation in Listing 1 does not handle the condition of zero elements in the input array, which is signified by r being smaller than l. The implementation could test for this case, but it would be wasteful as it can never be generated by the recursive implementation (and can only be generated by the user on the initial function call). So, for the sake of efficiency and simplicity, the check for zero size array is pulled out into a wrapper function shown in Listing Two. Actually, this is also the reason for l and r being int's instead of unsigned long. For example, for an array of zero elements, argument l would be set to zero, requiring r to be set to -1, which can be done using an int but not unsigned long. This issue can probably be avoided by using a pointer to the start of each array and to the one-beyond-the-end of each array or iterators as STL does, instead of indexes. Another option is to use a pointer and an array size as function call arguments.
This implementation is a variation on that presented in the Introduction to Algorithms book [1]. By using direction reversal, memory allocation operations have been eliminated.








