# Parallel In-Place Merge Sort

### Optimization

One performance optimization strategy successfully used for other parallel sorting algorithms has been the hybrid approach, where the non-parallel algorithm is used to terminate recursion of the parallel algorithm early. Listing Three shows the non-parallel hybrid merge sort implementation.

Listing Three

```template< class _Type >
inline void merge_sort_in_place_1( _Type* a, int l, int r )
{
if ( r <= l ) return;
if (( r - l ) <=  32 ) { insertionSortSimilarToSTLnoSelfAssignment( a+l, r-l+1 ); return; }
int m = (( r + l ) / 2 );
merge_sort_in_place_1( a, l,     m );  // recursive call left  half
merge_sort_in_place_1( a, m + 1, r );  // recursive call right half
merge_in_place_L1( a, l, m, r );     // merge the results
}```

This implementation uses Insertion Sort to terminate recursion earlier. An in-place sorting algorithm must be used in this case, and insertion sort is a good choice due to its cache-friendly disposition. It wouldn't be fair to use STL sort in this case, because STL sort is overall faster than non-parallel merge sort, and thus could entirely replace it at any level of recursion.

Listing Four shows the parallel in-place hybrid implementation, using several potential in-place non-parallel sort algorithms as early recursion terminators. All of these are fair game since they are all slower than the parallel algorithm itself.

Listing Four

```template< class _Type >
inline void p_merge_sort_in_place_2( _Type* a, int l, int r )
{
if ( r <= l ) return;
//	if (( r - l ) <= 100 ) { merge_sort_in_place( a, l, r ); return; }
//	if (( r - l ) <=  32 ) { insertionSortSimilarToSTLnoSelfAssignment( a+l, r-l+1 ); return; }
if (( r - l ) <= 1024 ) { sort( a+l, a+r+1 ); return; }
int m = (( r + l ) / 2 );
//tbb::parallel_invoke(
Concurrency::parallel_invoke(
[&] { p_merge_sort_in_place_2( a, l,     m ); },	// parallel recursive call left  half
[&] { p_merge_sort_in_place_2( a, m + 1, r ); }		// parallel recursive call right half
);
p_merge_in_place_2( a, l, m, r );	// parallel in-place merge
}```

### Hybrid Performance

Table 2 shows performance measurements of pure and hybrid parallel and non-parallel in-place merge sort algorithm implementations.

 Time Per Run Number of items 10 100 1K 10K 100K 1M 10M 100M 1B Parallel Merge Sort pure 2.6E-06 1.1E-05 6.6E-05 5.7E-04 5.52E-03 5.35E-02 6.42E-01 7.62E+00 8.77E+01 Parallel Merge Sort hybrid 1.10E-07 2.64E-06 4.26E-05 1.87E-04 2.17E-03 2.6E-02 3.6E-01 4.8E+00 6.4E+01 Merge Sort pure 3.55E-07 9.00E-06 1.52E-04 2.21E-03 2.96E-02 3.7E-01 4.5E+00 5.4E+01 6.4E+02 Merge Sort hybrid 8.35E-08 4.54E-06 1.04E-04 1.79E-03 2.46E-02 3.2E-01 4.1E+00 5.0E+01 5.9E+02 Parallel pure/hybrid 24.1 4.2 1.5 3.1 2.5 2.1 1.8 1.6 1.4 Non-Parallel pure/hybrid 4.3 2.0 1.5 1.2 1.2 1.2 1.1 1.1 1.1

Table 2.

Parallel implementation shows the highest performance improvement of 40% from a hybrid implementation. Non-parallel hybrid implementation benefits by about 10%. Interestingly, both show diminishing gain from the hybrid approach as the array size increases. This is due to the non-parallel portion making a smaller contribution as array size increases. One possibility is that within this limit, the speed-up from the hybrid approach would diminish to zero and would not be necessary, while still being beneficial for the array sizes presented. Another possibility is that within this limit, the speed-up approaches a constant due to recursion causing a binary tree where half of the nodes are leaf nodes — the non-parallel algorithm execution.

Parallel in-place merge sort is now 2.2X faster than the STL sort. The hybrid approach shows a reduced performance differential between in-place and not-in-place from 4X to 3X, with in-place being slower.

### Conclusion

Parallel in-place merge led to the development of a simple, compact and generic parallel in-place merge sort that performed competitively. Both parallel merge sort algorithms, in-place and not-in-place, utilized all the cores within the multicore CPU, leading to 100% CPU utilization (versus only 12.5% CPU utilization by STL sort). Being in-place carries a penalty of 3X performance loss versus not-in-place. Reducing this performance penalty and improving performance further will be the focus of future development.

### Related Articles

Algorithm Improvement through Performance Measurement: Part 1

Parallel Counting Sort

Parallel Merge

Parallel In-Place Merge

Victor Duvanenko is a frequent contributor to Dr. Dobb's on algorithms for parallel programming.

### More Insights

 To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.