Optimization
One performance-optimization strategy used successfully for other parallel sorting algorithms is the hybrid approach, in which a non-parallel algorithm terminates the recursion of the parallel algorithm early. Listing Three shows the non-parallel hybrid merge sort implementation.
Listing Three
template< class _Type >
inline void merge_sort_in_place_1( _Type* a, int l, int r )
{
    if ( r <= l ) return;
    if (( r - l ) <= 32 ) {
        insertionSortSimilarToSTLnoSelfAssignment( a + l, r - l + 1 );
        return;
    }
    int m = ( r + l ) / 2;
    merge_sort_in_place_1( a, l,     m );   // recursive call left  half
    merge_sort_in_place_1( a, m + 1, r );   // recursive call right half
    merge_in_place_L1( a, l, m, r );        // merge the results
}
This implementation uses Insertion Sort to terminate recursion early. An in-place sorting algorithm must be used in this case, and Insertion Sort is a good choice due to its cache-friendly behavior. It wouldn't be fair to use STL sort here, because STL sort is faster overall than non-parallel merge sort, and thus could entirely replace it at any level of recursion.
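The helper insertionSortSimilarToSTLnoSelfAssignment is referenced but not shown. A minimal sketch of what such a routine might look like is below, assuming (as the name suggests) that its key property is shifting elements rather than swapping them, so that no element is ever assigned to itself; the name insertionSortSketch and the exact loop structure are illustrative, not the article's code.

```cpp
#include <cassert>
#include <cstddef>

// Hedged sketch of an in-place insertion sort in the style of STL's, which
// shifts larger elements right and performs a single final write of the
// element being inserted, avoiding any self-assignment.
template< class _Type >
inline void insertionSortSketch( _Type* a, size_t a_size )
{
    for ( size_t i = 1; i < a_size; i++ )
    {
        if ( a[ i ] < a[ i - 1 ] )          // already in order: no work to do
        {
            _Type currentElement = a[ i ];
            size_t j = i;
            do {                            // shift larger elements right by one
                a[ j ] = a[ j - 1 ];
                j--;
            } while ( j > 0 && currentElement < a[ j - 1 ] );
            a[ j ] = currentElement;        // single write; never self-assigns
        }
    }
}
```

Because each iteration touches a contiguous, recently accessed run of elements, this access pattern is what makes insertion sort cache-friendly on the small subarrays at the recursion leaves.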
Listing Four shows the parallel in-place hybrid implementation, with several potential in-place non-parallel sort algorithms as early recursion terminators (the first two are commented out). All of these are fair game, since each is slower than the parallel algorithm itself.
Listing Four
template< class _Type >
inline void p_merge_sort_in_place_2( _Type* a, int l, int r )
{
    if ( r <= l ) return;
    //if (( r - l ) <= 100 ) { merge_sort_in_place( a, l, r ); return; }
    //if (( r - l ) <= 32  ) { insertionSortSimilarToSTLnoSelfAssignment( a + l, r - l + 1 ); return; }
    if (( r - l ) <= 1024 )  { sort( a + l, a + r + 1 ); return; }
    int m = ( r + l ) / 2;
    //tbb::parallel_invoke(
    Concurrency::parallel_invoke(
        [&] { p_merge_sort_in_place_2( a, l,     m ); },  // parallel recursive call left  half
        [&] { p_merge_sort_in_place_2( a, m + 1, r ); }   // parallel recursive call right half
    );
    p_merge_in_place_2( a, l, m, r );   // parallel in-place merge
}
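For readers without Microsoft PPL (Concurrency::parallel_invoke) or Intel TBB, the fork-join structure of Listing Four can be approximated with standard C++ facilities. The sketch below uses std::async to run the left half on another thread, and substitutes std::inplace_merge for the article's p_merge_in_place_2, which is not reproduced here; it illustrates only the recursion shape, not the article's parallel merge.

```cpp
#include <algorithm>
#include <future>

// Hedged portable sketch of the parallel hybrid recursion from Listing Four.
// std::async stands in for Concurrency::parallel_invoke, and std::inplace_merge
// stands in for the article's parallel in-place merge (p_merge_in_place_2).
template< class _Type >
void p_merge_sort_sketch( _Type* a, int l, int r )
{
    if ( r <= l ) return;
    if (( r - l ) <= 1024 ) { std::sort( a + l, a + r + 1 ); return; }
    int m = ( r + l ) / 2;
    auto left = std::async( std::launch::async,
                            p_merge_sort_sketch< _Type >, a, l, m );  // left half on another thread
    p_merge_sort_sketch( a, m + 1, r );                               // right half on this thread
    left.wait();                                                      // join before merging
    std::inplace_merge( a + l, a + m + 1, a + r + 1 );                // serial stand-in merge
}
```

Note that std::async spawns a thread per fork rather than scheduling tasks onto a thread pool the way PPL and TBB do, so this sketch will not match the measured performance of Listing Four; it is meant only to make the control flow concrete.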
Hybrid Performance
Table 2 shows performance measurements of pure and hybrid, parallel and non-parallel, in-place merge sort algorithm implementations.
Number of items               10         100        1K         10K        100K       1M         10M        100M       1B
Parallel Merge Sort pure      2.6E-06    1.1E-05    6.6E-05    5.7E-04    5.52E-03   5.35E-02   6.42E-01   7.62E+00   8.77E+01
Parallel Merge Sort hybrid    1.10E-07   2.64E-06   4.26E-05   1.87E-04   2.17E-03   2.6E-02    3.6E-01    4.8E+00    6.4E+01
Merge Sort pure               3.55E-07   9.00E-06   1.52E-04   2.21E-03   2.96E-02   3.7E-01    4.5E+00    5.4E+01    6.4E+02
Merge Sort hybrid             8.35E-08   4.54E-06   1.04E-04   1.79E-03   2.46E-02   3.2E-01    4.1E+00    5.0E+01    5.9E+02
Parallel pure/hybrid          24.1       4.2        1.5        3.1        2.5        2.1        1.8        1.6        1.4
Non-parallel pure/hybrid      4.3        2.0        1.5        1.2        1.2        1.2        1.1        1.1        1.1
Table 2. Time per run, in seconds, for pure and hybrid implementations.
The parallel implementation shows the larger benefit from the hybrid approach, gaining about 40% (1.4X) at the largest array size; the non-parallel hybrid implementation benefits by about 10%. Interestingly, both show diminishing gains from the hybrid approach as the array size increases. This is because the non-parallel portion, which runs only at the recursion leaves, makes a smaller contribution to the total work as the array grows. One possibility is that in the limit the speedup from the hybrid approach diminishes to zero and becomes unnecessary, while still being beneficial for the array sizes presented. Another possibility is that in the limit the speedup approaches a constant, because the recursion forms a binary tree in which half of the nodes are leaf nodes, where the non-parallel algorithm executes.
Parallel in-place merge sort is now 2.2X faster than STL sort. The hybrid approach also reduces the performance differential between the in-place and not-in-place versions from 4X to 3X, with in-place remaining the slower of the two.
Conclusion
Parallel in-place merge led to the development of a simple, compact, and generic parallel in-place merge sort that performs competitively. Both parallel merge sort algorithms, in-place and not-in-place, utilize all the cores of a multi-core CPU, reaching 100% CPU utilization (versus only 12.5% for the single-threaded STL sort). Being in-place carries a roughly 3X performance penalty versus not-in-place. Reducing this penalty and improving performance further will be the focus of future development.
Related Articles
Algorithm Improvement through Performance Measurement: Part 1
Parallel In-Place Radix Sort Simplified
Victor Duvanenko is a frequent contributor to Dr. Dobb's on algorithms for parallel programming.