In-place hybrid merge implementation is over 1.5x faster than the pure merge, for large arrays. Interestingly, the hybrid approach for in-place merge yields less gain in performance than it did for not-in-place algorithm, where over 8.5x gain was obtained for large arrays, bringing the performance of the divide-and-conquer algorithm to nearly that of the “simple merge” (see Parallel Merge page 2, table 3). One possible reason is that in-place merge algorithm performs more memory accesses, in the block swap, becoming memory bandwidth bound instead of computationally bound. This effect is visible in Graph 2, where the performance on the hybrid algorithm starts out equal to STL inplace_merge and drifts toward in-place merge pure as the array size increases.
The order of the sequential divide-and-conquer in-place merge algorithm is O(nlgn), where the divide-and-conquer portion produces a tree with O(lgn) levels, with each level processing O(n) elements. The order of not-in-place merge is O(n).
Parallel
A simple, but only partially parallel algorithm falls out naturally from the implementation in Listing Two. This implementation is shown in Listing Three, where the two recursive calls have been parallelized using the parallel_invoke() of Intel TBB or Microsoft PPL. However, block swap has not yet been parallelized.
template< class _Type >
inline void p_merge_in_place_2( _Type* t, int l, int m, int r )
{
int length1 = m - l + 1;
int length2 = r - m;
if ( length1 >= length2 )
{
if ( length2 <= 0 ) return;
if (( length1 + length2 ) <= 1024 ) { std::inplace_merge( t + l, t + m + 1, t + r + 1 ); return; }
int q1 = ( l + m ) / 2; // q1 is mid-point of the larger segment
int q2 = my_binary_search( t[ q1 ], t, m + 1, r ); // q2 is q1 partitioning element within the smaller sub-array (and q2 itself is part of the sub-array that does not move)
int q3 = q1 + ( q2 - m - 1 );
block_exchange_mirror( t, q1, m, q2 - 1 );
//tbb::parallel_invoke(
Concurrency::parallel_invoke(
[&] { p_merge_in_place_2( t, l, q1 - 1, q3 - 1 ); }, // note that q3 is now in its final place and no longer participates in further processing
[&] { p_merge_in_place_2( t, q3 + 1, q2 - 1, r ); }
);
}
else {
if ( length1 <= 0 ) return;
if (( length1 + length2 ) <= 1024 ) { std::inplace_merge( t + l, t + m + 1, t + r + 1 ); return; }
int q1 = ( m + 1 + r ) / 2; // q1 is mid-point of the larger segment
int q2 = my_binary_search( t[ q1 ], t, l, m ); // q2 is q1 partitioning element within the smaller sub-array (and q2 itself is part of the sub-array that does not move)
int q3 = q2 + ( q1 - m - 1 );
block_exchange_mirror( t, q2, m, q1 );
//tbb::parallel_invoke(
Concurrency::parallel_invoke(
[&] { p_merge_in_place_2( t, l, q2 - 1, q3 - 1 ); }, // note that q3 is now in its final place and no longer participates in further processing
[&] { p_merge_in_place_2( t, q3 + 1, q1, r ); }
);
}
}
Listing Three
Table 3 and Graph 3 summarize performance of this partially parallel implementation.
|
Time per Run |
||||||||
|---|---|---|---|---|---|---|---|---|
Number of items |
50 |
500 |
5000 |
50000 |
500000 |
5.0E+06 |
5.0E+07 |
5.0E+08 |
STL inplace_merge |
1.9E-06 |
6.6E-06 |
5.8E-05 |
9.1E-04 |
6.8E-03 |
8.2E-02 |
8.1E-01 |
8.8E+00 |
In-Place Merge hybrid |
2.8E-06 |
7.0E-06 |
8.1E-05 |
9.9E-04 |
1.3E-02 |
1.6E-01 |
1.8E+00 |
1.9E+01 |
In-Place Merge hybrid parallel |
2.5E-06 |
6.9E-06 |
9.6E-05 |
5.7E-04 |
4.3E-03 |
6.0E-02 |
6.4E-01 |
8.0E+00 |
In-Place hybrid / |
1.1 |
1.0 |
0.8 |
1.7 |
2.9 |
2.6 |
2.8 |
2.4 |
STL inplace_merge / |
0.7 |
1.0 |
0.6 |
1.6 |
1.6 |
1.4 |
1.3 |
1.1 |
Table 3 Performance of a partially parallel implementation.
The parallel implementation is the fastest for large arrays, running on a hyper-threaded, quad-core Intel i7-860 processor. It scaled to more than 2.4x for large array sizes over the sequential divide-and-conquer implementation, but only 10% faster than STL inplace_merge() (possibly due to the dynamic nature of STL implementation). So far, in-place merge has not scaled as well as not-in-place merge, most likely due to much higher memory bandwidth usage (in block swap, which has not been parallelized yet).
Table 4 shows how performance varies with the number of threads used to execute tasks.
|
Time per Run |
|||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Number of items |
50 |
500 |
5000 |
50000 |
500000 |
5.0E+06 |
5.0E+07 |
5.0E+08 |
|||||||
Parallel 8 threads |
2.5E-06 |
6.9E-06 |
9.6E-05 |
5.7E-04 |
4.3E-03 |
6.0E-02 |
6.4E-01 |
8.0E+00 |
|||||||
Parallel 4 threads |
2.0E-06 |
1.7E-05 |
6.2E-05 |
4.9E-04 |
4.8E-03 |
6.3E-02 |
6.5E-01 |
8.0E+00 |
|||||||
Parallel 2 threads |
3.2E-06 |
1.1E-05 |
7.2E-05 |
5.5E-04 |
4.5E-03 |
6.1E-02 |
6.3E-01 |
8.0E+00 |
|||||||
Parallel 1 threads |
3.1E-06 |
8.2E-06 |
7.4E-05 |
6.1E-04 |
4.4E-03 |
6.3E-02 |
6.4E-01 |
7.9E+00 |
|||||||
4 threads / 8 threads |
0.8 |
2.4 |
0.6 |
0.9 |
1.1 |
1.0 |
1.0 |
1.0 |
|||||||
2 threads / 8 threads |
1.3 |
1.6 |
0.7 |
1.0 |
1.0 |
1.0 |
1.0 |
1.0 |
|||||||
1 thread / 8 threads |
1.2 |
1.2 |
0.8 |
1.1 |
1.0 |
1.0 |
1.0 |
1.0 |
|||||||
Table 4 How performance varies with the number of threads used to execute tasks.
Performance is nearly constant with the number of threads, which is most likely due to memory bandwidth bottleneck limiting parallel performance gain. This needs further investigation, especially because the single thread divide-and-conquer outperforms the sequential divide-and-conquer (although not by a significant amount).
Conclusion
Parallel in-place stable merge algorithm has been developed. The sequential algorithm is not new and goes back to [1], with a similar implementation in STL as inplace_merge(). The parallel implementation is an extension of Parallel Merge algorithm, using block swap/exchage algorithm to swap two inner sub-arrays of unequal size in-place. This in-place algorithm does not scale as well with the number of cores, most likely due to becoming memory bandwidth limited, yet outperforms the sequential algorithm slightly. The order of the algorithm is O(nlgn), and should extend easily into in-place stable parallel merge sort. Further performance gains could come from parallelization of block swap/exchange, along with possible elimination of redundant rotations.
References
[1] S. Dvorak, and B. Durian, "Merging by Decomposition Revisited", The Computer Journal, Vol. 31, No. 6, 1988.
Victor Duvanenko is a frequent contributor to Dr. Dobb's on algorithms for parallel programming.


