Merge sort is a wonderful, widely used sorting algorithm, with consistent dataindependent performance. When not inplace merge sorting, that is when the source and destination array are not the same, performance is O(nlgn). When inplace, so the source and destination array are the same, performance is slightly slower: O(nlg^{2}n). Because notinplace sorting algorithms are faster, implementations frequently allocate memory for the destination array, copy the sorted result back into the source array, and proceed to deallocate the destination array. STL uses this type of a strategy to construct a fast inplace sort from a faster notinplace sort whenever memory is available. Of course, when memory is not available, inplace sort is necessary, and STL also provides such an implementation.
However, the STL sort is a nonparallel implementation and runs only on a single core of today's multicore processors, utilizing just a fraction of the available computational resources. Several parallel sorting algorithms have been developed: Parallel Radix Sort, Parallel Counting Sort, and Parallel Merge Sort.
At the core of merge sort is the merge algorithm. Fast notinplace Parallel Merge Sort was developed using a Parallel Merge that is notinplace. Also, a Parallel InPlace Merge has been developed. Now, it's time to use this merge to construct an inplace merge sort that is parallel, and then see how well it performs.
Implementation
Listing One shows a simple implementation of InPlace Merge Sort, which calls itself recursively twice — once to sort the left half and once to sort the right half, followed by an inplace merge. This inplace merge merges the two sorted left and right half results into a single sorted result. The implementation is generic, being able to sort any data type that can use comparisons.
Listing One
template< class _Type > inline void merge_sort_in_place( _Type* a, int l, int r ) { if ( r <= l ) return; int m = (( r + l ) / 2 ); merge_sort_in_place( a, l, m ); // recursive call left half merge_sort_in_place( a, m + 1, r ); // recursive call right half merge_in_place_L1( a, l, m, r ); // merge the results }
Listing Two shows a parallel implementation, which makes the two recursive calls in parallel, followed by a parallel inplace merge. It uses Intel's opensource Threading Building Blocks (TBB) or Microsoft's Parallel Pattern Library (PPL). The use of C++ lambda functions simplifies the implementation significantly. This implementation is also generic.
Listing Two
template< class _Type > inline void p_merge_sort_in_place( _Type* a, int l, int r ) { if ( r <= l ) return; int m = (( r + l ) / 2 ); //tbb::parallel_invoke( Concurrency::parallel_invoke( [&] { p_merge_sort_in_place( a, l, m ); }, // parallel recursive call left half [&] { p_merge_sort_in_place( a, m + 1, r ); } // parallel recursive call right half ); p_merge_in_place_2( a, l, m, r ); // parallel inplace merge }
The nonparallel and parallel versions are strikingly similar. Both are fully functional algorithms and are compact. Inplace merge can be found in my Parallel InPlace Merge article, Listing One for nonparallel and Listing Three for parallel. Having these inplace merge implementations makes these compact merge sort implementations possible. If you want to contrast inplace merge with a notinplace merge sort, take a look at Listing One in Parallel Merge Sort.
Let's take a look at performance of these algorithms.
Performance
Table 1 and Figure 1 compare performance of the nonparallel inplace merge sort with the parallel implementation, both in their initial pure forms, as well as STL sort and parallel notinplace merge sort. Sorting of arrays of 32bit unsigned integers were benchmarked on Intel i7 3630QM quadcore CPU with hyperthreading running at 3.2 GHz, using 16 GB of system memory.
Time Per Run 









Number of items 
10 
100 
1K 
10K 
100K 
1M 
10M 
100M 
1B 
InPlace Merge Sort 
3.55E07 
9.00E06 
1.52E04 
2.21E03 
2.96E02 
3.7E01 
4.5E+00 
5.4E+01 
6.4E+02 
Parallel InPlace Merge Sort 
2.6E06 
1.1E05 
6.6E05 
5.7E04 
5.52E03 
5.35E02 
6.42E01 
7.62E+00 
8.77E+01 
STL sort 
1.07E07 
2.62E06 
4.17E05 
5.80E04 
7.33E03 
8.9E02 
1.0E+00 
1.2E+01 
1.4E+02 
Parallel Merge Sort hybrid 
2.5E07 
3.6E06 
2.0E05 
1.8E04 
1.38E03 
1.42E02 
1.64E01 
1.81E+00 
2.13E+01 
Merge Sort / Parallel Merge Sort 
0.1 
0.8 
2.3 
3.9 
5.4 
7.0 
7.1 
7.1 
7.3 
STL sort / Parallel InPlace 
0.0 
0.2 
0.6 
1.0 
1.3 
1.7 
1.6 
1.6 
1.6 
Parallel InPlace / Parallel NotInPlace 
10.4 
3.0 
3.3 
3.2 
4.0 
3.8 
3.9 
4.2 
4.1 
Table 1.
Figure 1.
Performance measurements show that the nonparallel, inplace merge sort is the slowest algorithm. Parallel implementation accelerates performance by over 7X and is faster than STL sort by 1.6X. The notinplace parallel merge sort is the fastest merge sorting algorithm, beating the parallel inplace version by more than 4X. It also beat STL by more than 6X.
However, the parallel notinplace merge sort is a hybrid implementation, using a nonparallel algorithm to terminate parallel recursion early, whereas the parallel inplace merge sort is not a hybrid. Let's see if the parallel inplace merge sort can be optimized.