Optimization
One performance-optimization strategy used successfully for other parallel sorting algorithms is the hybrid approach, in which a non-parallel algorithm terminates the recursion of the parallel algorithm early. Listing Three shows the non-parallel hybrid merge sort implementation.
Listing Three
template< class _Type >
inline void merge_sort_in_place_1( _Type* a, int l, int r )
{
    if ( r <= l ) return;
    if (( r - l ) <= 32 ) {
        insertionSortSimilarToSTLnoSelfAssignment( a + l, r - l + 1 );
        return;
    }
    int m = ( r + l ) / 2;
    merge_sort_in_place_1( a, l,     m );   // recursive call left  half
    merge_sort_in_place_1( a, m + 1, r );   // recursive call right half
    merge_in_place_L1( a, l, m, r );        // merge the results
}
This implementation uses Insertion Sort to terminate recursion early. An in-place sorting algorithm must be used in this case, and Insertion Sort is a good choice due to its cache-friendly behavior. It wouldn't be fair to use STL sort here, because STL sort is faster overall than non-parallel merge sort, and thus could entirely replace it at any level of recursion.
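The helper insertionSortSimilarToSTLnoSelfAssignment is referenced but not shown. A minimal sketch of what such a routine might look like is below, assuming (as the name suggests) that its key property is shifting elements rather than swapping them, so that no element is ever assigned to itself; the name insertionSortSketch and the exact loop structure are illustrative, not the article's code.

```cpp
#include <cassert>
#include <cstddef>

// Hedged sketch of an in-place insertion sort in the style of STL's, which
// shifts larger elements right and performs a single final write of the
// element being inserted, avoiding any self-assignment.
template< class _Type >
inline void insertionSortSketch( _Type* a, size_t a_size )
{
    for ( size_t i = 1; i < a_size; i++ )
    {
        if ( a[ i ] < a[ i - 1 ] )          // already in order: no work to do
        {
            _Type currentElement = a[ i ];
            size_t j = i;
            do {                            // shift larger elements right by one
                a[ j ] = a[ j - 1 ];
                j--;
            } while ( j > 0 && currentElement < a[ j - 1 ] );
            a[ j ] = currentElement;        // single write; never self-assigns
        }
    }
}
```

Because each iteration touches a contiguous, recently accessed run of elements, this access pattern is what makes insertion sort cache-friendly on the small subarrays at the recursion leaves.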
Listing Four shows the parallel in-place hybrid implementation, with several potential in-place non-parallel sort algorithms as early recursion terminators (the first two are commented out). All of these are fair game, since each is slower than the parallel algorithm itself.
Listing Four
template< class _Type >
inline void p_merge_sort_in_place_2( _Type* a, int l, int r )
{
    if ( r <= l ) return;
    //if (( r - l ) <= 100 ) { merge_sort_in_place( a, l, r ); return; }
    //if (( r - l ) <= 32  ) { insertionSortSimilarToSTLnoSelfAssignment( a + l, r - l + 1 ); return; }
    if (( r - l ) <= 1024 )  { sort( a + l, a + r + 1 ); return; }
    int m = ( r + l ) / 2;
    //tbb::parallel_invoke(
    Concurrency::parallel_invoke(
        [&] { p_merge_sort_in_place_2( a, l,     m ); },  // parallel recursive call left  half
        [&] { p_merge_sort_in_place_2( a, m + 1, r ); }   // parallel recursive call right half
    );
    p_merge_in_place_2( a, l, m, r );   // parallel in-place merge
}
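For readers without Microsoft PPL (Concurrency::parallel_invoke) or Intel TBB, the fork-join structure of Listing Four can be approximated with standard C++ facilities. The sketch below uses std::async to run the left half on another thread, and substitutes std::inplace_merge for the article's p_merge_in_place_2, which is not reproduced here; it illustrates only the recursion shape, not the article's parallel merge.

```cpp
#include <algorithm>
#include <future>

// Hedged portable sketch of the parallel hybrid recursion from Listing Four.
// std::async stands in for Concurrency::parallel_invoke, and std::inplace_merge
// stands in for the article's parallel in-place merge (p_merge_in_place_2).
template< class _Type >
void p_merge_sort_sketch( _Type* a, int l, int r )
{
    if ( r <= l ) return;
    if (( r - l ) <= 1024 ) { std::sort( a + l, a + r + 1 ); return; }
    int m = ( r + l ) / 2;
    auto left = std::async( std::launch::async,
                            p_merge_sort_sketch< _Type >, a, l, m );  // left half on another thread
    p_merge_sort_sketch( a, m + 1, r );                               // right half on this thread
    left.wait();                                                      // join before merging
    std::inplace_merge( a + l, a + m + 1, a + r + 1 );                // serial stand-in merge
}
```

Note that std::async spawns a thread per fork rather than scheduling tasks onto a thread pool the way PPL and TBB do, so this sketch will not match the measured performance of Listing Four; it is meant only to make the control flow concrete.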
Hybrid Performance
Table 2 shows performance measurements of pure and hybrid, parallel and non-parallel, in-place merge sort algorithm implementations.
Number of items               10         100        1K         10K        100K       1M         10M        100M       1B
Parallel Merge Sort pure      2.6E-06    1.1E-05    6.6E-05    5.7E-04    5.52E-03   5.35E-02   6.42E-01   7.62E+00   8.77E+01
Parallel Merge Sort hybrid    1.10E-07   2.64E-06   4.26E-05   1.87E-04   2.17E-03   2.6E-02    3.6E-01    4.8E+00    6.4E+01
Merge Sort pure               3.55E-07   9.00E-06   1.52E-04   2.21E-03   2.96E-02   3.7E-01    4.5E+00    5.4E+01    6.4E+02
Merge Sort hybrid             8.35E-08   4.54E-06   1.04E-04   1.79E-03   2.46E-02   3.2E-01    4.1E+00    5.0E+01    5.9E+02
Parallel pure/hybrid          24.1       4.2        1.5        3.1        2.5        2.1        1.8        1.6        1.4
Non-parallel pure/hybrid      4.3        2.0        1.5        1.2        1.2        1.2        1.1        1.1        1.1
Table 2. Time per run, in seconds, for pure and hybrid implementations.
The parallel implementation shows the larger benefit from the hybrid approach, gaining about 40% (1.4X) at the largest array size; the non-parallel hybrid implementation benefits by about 10%. Interestingly, both show diminishing gains from the hybrid approach as the array size increases. This is because the non-parallel portion, which runs only at the recursion leaves, makes a smaller contribution to the total work as the array grows. One possibility is that in the limit the speedup from the hybrid approach diminishes to zero and becomes unnecessary, while still being beneficial for the array sizes presented. Another possibility is that in the limit the speedup approaches a constant, because the recursion forms a binary tree in which half of the nodes are leaf nodes, where the non-parallel algorithm executes.
Parallel in-place merge sort is now 2.2X faster than STL sort. The hybrid approach also reduces the performance differential between the in-place and not-in-place versions from 4X to 3X, with in-place remaining the slower of the two.
Conclusion
Parallel in-place merge led to the development of a simple, compact, and generic parallel in-place merge sort that performs competitively. Both parallel merge sort algorithms, in-place and not-in-place, utilize all the cores of a multi-core CPU, reaching 100% CPU utilization (versus only 12.5% for the single-threaded STL sort). Being in-place carries a roughly 3X performance penalty versus not-in-place. Reducing this penalty and improving performance further will be the focus of future development.
Related Articles
Algorithm Improvement through Performance Measurement: Part 1
Parallel In-Place Radix Sort Simplified
Victor Duvanenko is a frequent contributor to Dr. Dobb's on algorithms for parallel programming.