# Parallel In-Place Merge

The in-place hybrid merge implementation is over 1.5x faster than the pure in-place merge for large arrays. Interestingly, the hybrid approach yields a smaller performance gain for the in-place merge than it did for the not-in-place algorithm, where the gain was over 8.5x for large arrays, bringing the performance of the divide-and-conquer algorithm to nearly that of the "simple merge" (see Parallel Merge, page 2, Table 3). One possible reason is that the in-place merge algorithm performs more memory accesses, in the block swap, becoming memory-bandwidth bound instead of compute bound. This effect is visible in Graph 2, where the performance of the hybrid algorithm starts out equal to that of STL inplace_merge and drifts toward that of the pure in-place merge as the array size increases.

Graph 2

The order of the sequential divide-and-conquer in-place merge algorithm is O(n lg n): the divide-and-conquer recursion produces a tree with O(lg n) levels, and each level processes O(n) elements. The order of the not-in-place merge, by contrast, is O(n).
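Written as a recurrence (assuming the splits stay roughly balanced, which the algorithm arranges by always bisecting the larger segment), each call performs a binary search plus a block exchange over its range before recursing on the two halves:

```latex
T(n) = 2\,T\!\left(\frac{n}{2}\right) + O(\lg n) + O(n)
     = 2\,T\!\left(\frac{n}{2}\right) + O(n)
     = O(n \lg n)
```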

### Parallel

A simple, but only partially parallel, algorithm falls out naturally from the implementation in Listing Two. This implementation is shown in Listing Three, where the two recursive calls have been parallelized using `parallel_invoke()` from Intel TBB or Microsoft PPL. The block swap, however, has not yet been parallelized.

```cpp
template< class _Type >
inline void p_merge_in_place_2( _Type* t, int l, int m, int r )
{
    int length1 = m - l + 1;
    int length2 = r - m;
    if ( length1 >= length2 )
    {
        if ( length2 <= 0 )	return;
        if (( length1 + length2 ) <= 1024 )	{ std::inplace_merge( t + l, t + m + 1, t + r + 1 );  return; }
        int q1 = ( l + m ) / 2;								// q1 is mid-point of the larger segment
        int q2 = my_binary_search( t[ q1 ], t, m + 1, r );	// q2 is q1 partitioning element within the smaller sub-array (and q2 itself is part of the sub-array that does not move)
        int q3 = q1 + ( q2 - m - 1 );
        block_exchange_mirror( t, q1, m, q2 - 1 );
        //tbb::parallel_invoke(
        Concurrency::parallel_invoke(
            [&] { p_merge_in_place_2( t, l,      q1 - 1, q3 - 1 ); },	// note that q3 is now in its final place and no longer participates in further processing
            [&] { p_merge_in_place_2( t, q3 + 1, q2 - 1, r      ); }
        );
    }
    else {
        if ( length1 <= 0 )	return;
        if (( length1 + length2 ) <= 1024 )	{ std::inplace_merge( t + l, t + m + 1, t + r + 1 );  return; }
        int q1 = ( m + 1 + r ) / 2;							// q1 is mid-point of the larger segment
        int q2 = my_binary_search( t[ q1 ], t, l, m );		// q2 is q1 partitioning element within the smaller sub-array (and q2 itself is part of the sub-array that does not move)
        int q3 = q2 + ( q1 - m - 1 );
        block_exchange_mirror( t, q2, m, q1 );
        //tbb::parallel_invoke(
        Concurrency::parallel_invoke(
            [&] { p_merge_in_place_2( t, l,      q2 - 1, q3 - 1 ); },	// note that q3 is now in its final place and no longer participates in further processing
            [&] { p_merge_in_place_2( t, q3 + 1, q1,     r      ); }
        );
    }
}
```

Listing Three
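Listing Three depends on two helpers not shown here: `my_binary_search()` and `block_exchange_mirror()`. The following is a minimal sketch of plausible stand-ins, assuming the search is a lower bound over an inclusive index range and the exchange is the classic three-reversal rotation; the article's actual helpers may differ (in particular, in how ties are broken, which matters for stability):

```cpp
#include <algorithm>

// Hypothetical stand-in: returns the index of the first element in
// t[l..r] that is >= key, or r + 1 if there is none (a lower-bound
// search over the inclusive range [l, r]).
template< class _Type >
int my_binary_search( _Type key, const _Type* t, int l, int r )
{
    int lo = l, hi = r + 1;             // search the half-open range [l, r+1)
    while ( lo < hi )
    {
        int mid = lo + ( hi - lo ) / 2;
        if ( t[ mid ] < key )  lo = mid + 1;
        else                   hi = mid;
    }
    return lo;
}

// Hypothetical stand-in: exchanges the adjacent blocks t[l..m] and
// t[m+1..r] in place via three reversals (the "mirror" rotation trick),
// preserving the internal order of each block.
template< class _Type >
void block_exchange_mirror( _Type* t, int l, int m, int r )
{
    std::reverse( t + l,     t + m + 1 );   // reverse the left block
    std::reverse( t + m + 1, t + r + 1 );   // reverse the right block
    std::reverse( t + l,     t + r + 1 );   // reverse the whole range
}
```

The three-reversal exchange performs O(r - l) element moves regardless of the two block sizes, which is the extra memory traffic discussed above.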

Table 3 and Graph 3 summarize performance of this partially parallel implementation.

Time per Run

| Number of items | 50 | 500 | 5000 | 50000 | 500000 | 5.0E+06 | 5.0E+07 | 5.0E+08 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| STL inplace_merge | 1.9E-06 | 6.6E-06 | 5.8E-05 | 9.1E-04 | 6.8E-03 | 8.2E-02 | 8.1E-01 | 8.8E+00 |
| In-Place Merge hybrid | 2.8E-06 | 7.0E-06 | 8.1E-05 | 9.9E-04 | 1.3E-02 | 1.6E-01 | 1.8E+00 | 1.9E+01 |
| In-Place Merge hybrid parallel | 2.5E-06 | 6.9E-06 | 9.6E-05 | 5.7E-04 | 4.3E-03 | 6.0E-02 | 6.4E-01 | 8.0E+00 |
| In-Place hybrid / In-Place parallel | 1.1 | 1.0 | 0.8 | 1.7 | 2.9 | 2.6 | 2.8 | 2.4 |
| STL inplace_merge / In-Place parallel | 0.7 | 1.0 | 0.6 | 1.6 | 1.6 | 1.4 | 1.3 | 1.1 |

Table 3 Performance of a partially parallel implementation.

Graph 3. A graphical view of the data in Table 3.

Running on a hyper-threaded, quad-core Intel i7-860 processor, the parallel implementation is the fastest for large arrays. It scaled to more than 2.4x over the sequential divide-and-conquer implementation for large array sizes, but ran only 10% faster than STL `inplace_merge()` (possibly due to the dynamic nature of the STL implementation). So far, in-place merge has not scaled as well as not-in-place merge, most likely due to much higher memory bandwidth usage in the block swap, which has not yet been parallelized.
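Since the block exchange is built from reversals, it is a natural first target for parallelization: the element swaps that make up a reversal are independent of one another, so the swap range can simply be split across tasks. Below is a sketch of that idea, my illustration using `std::async` as a stand-in for a TBB/PPL task, not code from the article:

```cpp
#include <algorithm>
#include <future>
#include <utility>

// Reverses t[0..n-1] using two tasks: each handles half of the
// independent (i, n-1-i) swap pairs. With k tasks this generalizes
// to a k-way split of the pair range.
template< class _Type >
void parallel_reverse( _Type* t, int n )
{
    int pairs = n / 2;                                   // number of independent swaps
    auto task = std::async( std::launch::async, [=] {
        for ( int i = 0; i < pairs / 2; i++ )            // first half of the pairs
            std::swap( t[ i ], t[ n - 1 - i ] );
    } );
    for ( int i = pairs / 2; i < pairs; i++ )            // second half, on this thread
        std::swap( t[ i ], t[ n - 1 - i ] );
    task.wait();                                         // join before returning
}
```

Within a three-reversal block exchange, the first two reversals also touch disjoint ranges and could run concurrently with each other; only the final full-range reversal must wait for both.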

Table 4 shows how performance varies with the number of threads used to execute tasks.

Time per Run

| Number of items | 50 | 500 | 5000 | 50000 | 500000 | 5.0E+06 | 5.0E+07 | 5.0E+08 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Run 1 | 2.5E-06 | 6.9E-06 | 9.6E-05 | 5.7E-04 | 4.3E-03 | 6.0E-02 | 6.4E-01 | 8.0E+00 |
| Run 2 | 2.0E-06 | 1.7E-05 | 6.2E-05 | 4.9E-04 | 4.8E-03 | 6.3E-02 | 6.5E-01 | 8.0E+00 |
| Run 3 | 3.2E-06 | 1.1E-05 | 7.2E-05 | 5.5E-04 | 4.5E-03 | 6.1E-02 | 6.3E-01 | 8.0E+00 |
| Run 4 | 3.1E-06 | 8.2E-06 | 7.4E-05 | 6.1E-04 | 4.4E-03 | 6.3E-02 | 6.4E-01 | 7.9E+00 |
| Run 2 / Run 1 | 0.8 | 2.4 | 0.6 | 0.9 | 1.1 | 1.0 | 1.0 | 1.0 |
| Run 3 / Run 1 | 1.3 | 1.6 | 0.7 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Run 4 / Run 1 | 1.2 | 1.2 | 0.8 | 1.1 | 1.0 | 1.0 | 1.0 | 1.0 |

Table 4 How performance varies with the number of threads used to execute tasks.

Performance is nearly constant with the number of threads, most likely because a memory bandwidth bottleneck limits the parallel performance gain. This needs further investigation, especially because the single-thread run of the parallel implementation outperforms the sequential divide-and-conquer implementation (although not by a significant amount).
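The article does not show how the thread count was varied, but the effect of a thread cap is easy to reproduce with a depth budget on the recursion: spawn a task only while budget remains, so roughly 2^budget leaves run concurrently. The sketch below is my illustration, not the article's harness; for brevity it always bisects the left run (Listing Three bisects the larger run for better balance) and uses `std::lower_bound` and `std::rotate` in place of the article's helpers:

```cpp
#include <algorithm>
#include <future>

// In-place merge of the sorted runs t[l..m] and t[m+1..r]; depth_budget
// bounds how many levels of the recursion may spawn an extra thread.
template< class _Type >
void merge_in_place_limited( _Type* t, int l, int m, int r, int depth_budget )
{
    if ( m < l || r <= m )  return;                         // one run is empty
    if ( r - l + 1 <= 32 )                                  // small: merge sequentially
    {
        std::inplace_merge( t + l, t + m + 1, t + r + 1 );
        return;
    }
    int q1 = ( l + m ) / 2;                                 // bisect the left run
    int q2 = int( std::lower_bound( t + m + 1, t + r + 1, t[ q1 ] ) - t );
    int q3 = q1 + ( q2 - m - 1 );                           // final position of t[q1]
    std::rotate( t + q1, t + m + 1, t + q2 );               // block-exchange t[q1..m] with t[m+1..q2-1]
    if ( depth_budget > 0 )                                 // budget left: use another thread
    {
        auto left = std::async( std::launch::async, [=] {
            merge_in_place_limited( t, l, q1 - 1, q3 - 1, depth_budget - 1 ); } );
        merge_in_place_limited( t, q3 + 1, q2 - 1, r, depth_budget - 1 );
        left.wait();
    }
    else                                                    // budget spent: stay on this thread
    {
        merge_in_place_limited( t, l, q1 - 1, q3 - 1, 0 );
        merge_in_place_limited( t, q3 + 1, q2 - 1, r, 0 );
    }
}
```

With `depth_budget = 0` this degenerates to a purely sequential divide-and-conquer merge.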

### Conclusion

A parallel in-place stable merge algorithm has been developed. The sequential algorithm is not new: it goes back to [1], and a similar implementation exists in STL as `inplace_merge()`. The parallel implementation is an extension of the Parallel Merge algorithm, using a block swap/exchange algorithm to swap two inner sub-arrays of unequal size in place. This in-place algorithm does not scale as well with the number of cores, most likely because it becomes memory-bandwidth limited, yet it slightly outperforms the sequential algorithm. The algorithm is O(n lg n) and should extend easily into an in-place stable parallel merge sort. Further performance gains could come from parallelizing the block swap/exchange, along with possible elimination of redundant rotations.
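The merge sort extension anticipated above can be sketched directly: recursively sort the two halves (in parallel, down to a depth budget) and then merge them in place. This is my illustration, not the article's code, and `std::inplace_merge` stands in for the parallel in-place merge developed above:

```cpp
#include <algorithm>
#include <future>

// In-place parallel merge sort of t[l..r]; depth bounds thread spawning.
template< class _Type >
void p_merge_sort_in_place( _Type* t, int l, int r, int depth = 3 )
{
    if ( r - l + 1 <= 1 )  return;                          // 0 or 1 element: already sorted
    int m = l + ( r - l ) / 2;
    if ( depth > 0 )                                        // sort the halves on two threads
    {
        auto left = std::async( std::launch::async, [=] {
            p_merge_sort_in_place( t, l, m, depth - 1 ); } );
        p_merge_sort_in_place( t, m + 1, r, depth - 1 );
        left.wait();
    }
    else                                                    // budget spent: sort sequentially
    {
        p_merge_sort_in_place( t, l, m, 0 );
        p_merge_sort_in_place( t, m + 1, r, 0 );
    }
    std::inplace_merge( t + l, t + m + 1, t + r + 1 );      // stable in-place merge of the halves
}
```

Because each level's merge is stable and in place, the whole sort is stable and needs no O(n) scratch buffer beyond what `inplace_merge` may allocate internally (the article's block-exchange merge would remove even that).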

### References

[1] S. Dvorak and B. Durian, "Merging by Decomposition Revisited," The Computer Journal, Vol. 31, No. 6, 1988.

Victor Duvanenko is a frequent contributor to Dr. Dobb's on algorithms for parallel programming.
