Parallel In-Place Merge


The in-place hybrid merge implementation is over 1.5x faster than the pure divide-and-conquer merge for large arrays. Interestingly, the hybrid approach yields less of a performance gain for in-place merge than it did for the not-in-place algorithm, where over an 8.5x gain was obtained for large arrays, bringing the performance of the divide-and-conquer algorithm to nearly that of the "simple merge" (see Parallel Merge, page 2, Table 3). One possible reason is that the in-place merge algorithm performs more memory accesses, in the block swap, becoming memory-bandwidth bound instead of compute bound. This effect is visible in Graph 2, where the performance of the hybrid algorithm starts out equal to that of STL inplace_merge and drifts toward the pure in-place merge as the array size increases.

Graph 2. Performance of the hybrid in-place merge versus STL inplace_merge and the pure in-place merge.

The order of the sequential divide-and-conquer in-place merge algorithm is O(n lg n): the divide-and-conquer recursion produces a tree with O(lg n) levels, and each level processes O(n) elements. The not-in-place merge, by contrast, is O(n).
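As a rough standard-C++ sketch of the same divide-and-conquer scheme (using std::rotate in place of the article's block exchange, and std::lower_bound/std::upper_bound in place of my_binary_search, so names and details here are stand-ins rather than the article's code), the O(n lg n) recursion looks like:

```cpp
#include <algorithm>
#include <cassert>

// Sequential divide-and-conquer in-place merge of the sorted ranges
// [first, mid) and [mid, last). The recursion has O(lg n) levels and each
// level touches O(n) elements, giving O(n lg n) overall.
void seq_merge_in_place( int* first, int* mid, int* last )
{
	if ( first == mid || mid == last ) return;
	long len1 = mid - first, len2 = last - mid;
	if ( len1 + len2 == 2 ) {						// two elements: a single compare
		if ( *mid < *first ) std::swap( *first, *mid );
		return;
	}
	int *cut1, *cut2;
	if ( len1 >= len2 ) {							// split the larger half at its mid-point
		cut1 = first + len1 / 2;
		cut2 = std::lower_bound( mid, last, *cut1 );
	} else {
		cut2 = mid + len2 / 2;
		cut1 = std::upper_bound( first, mid, *cut2 );
	}
	int* new_mid = std::rotate( cut1, mid, cut2 );	// exchange the two inner blocks
	seq_merge_in_place( first, cut1, new_mid );		// the two sides are now independent
	seq_merge_in_place( new_mid, cut2, last );
}
```

Each recursive call works on a strictly smaller range, and the two calls never overlap, which is exactly what makes the parallel version below possible.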

Parallel

A simple, but only partially parallel, algorithm falls out naturally from the implementation in Listing Two. This implementation is shown in Listing Three, where the two recursive calls have been parallelized using parallel_invoke() from Intel TBB or Microsoft PPL. The block swap, however, has not yet been parallelized.
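parallel_invoke() simply runs its callables concurrently and joins them all before returning. Readers without TBB or PPL can approximate the two-way case in standard C++ with std::async; the name invoke_both below is mine, not part of either library:

```cpp
#include <future>
#include <utility>
#include <cassert>

// Minimal two-way fork-join, approximating parallel_invoke( f1, f2 ):
// run f1 on another thread, f2 on the calling thread, then join.
template< class F1, class F2 >
void invoke_both( F1&& f1, F2&& f2 )
{
	auto handle = std::async( std::launch::async, std::forward< F1 >( f1 ) );
	f2();			// do the second task's work on this thread
	handle.get();	// join (re-throws any exception from f1)
}
```

This keeps the calling thread busy with one of the two tasks, rather than idling while both run elsewhere.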

template< class _Type >
inline void p_merge_in_place_2( _Type* t, int l, int m, int r )
{
	int length1 = m - l + 1;	// length of sub-array t[l..m]
	int length2 = r - m;		// length of sub-array t[m+1..r]
	if ( length1 >= length2 )
	{
		if ( length2 <= 0 )	return;
		if (( length1 + length2 ) <= 1024 )	{ std::inplace_merge( t + l, t + m + 1, t + r + 1 );  return; }
		int q1 = ( l + m ) / 2;								// q1 is the mid-point of the larger segment
		int q2 = my_binary_search( t[ q1 ], t, m + 1, r );	// q2 is where t[q1] partitions the smaller sub-array (t[q2] itself stays in the sub-array that does not move)
		int q3 = q1 + ( q2 - m - 1 );						// q3 is the final position of t[q1]
		block_exchange_mirror( t, q1, m, q2 - 1 );
		//tbb::parallel_invoke(
		Concurrency::parallel_invoke(
			[&] { p_merge_in_place_2( t, l,      q1 - 1, q3 - 1 ); },	// t[q3] is now in its final place and no longer participates in further processing
			[&] { p_merge_in_place_2( t, q3 + 1, q2 - 1, r      ); }
		);
	}
	else {
		if ( length1 <= 0 )	return;
		if (( length1 + length2 ) <= 1024 )	{ std::inplace_merge( t + l, t + m + 1, t + r + 1 );  return; }
		int q1 = ( m + 1 + r ) / 2;							// q1 is the mid-point of the larger segment
		int q2 = my_binary_search( t[ q1 ], t, l, m );		// q2 is where t[q1] partitions the smaller (left) sub-array
		int q3 = q2 + ( q1 - m - 1 );						// q3 is the final position of t[q1]
		block_exchange_mirror( t, q2, m, q1 );
		//tbb::parallel_invoke(
		Concurrency::parallel_invoke(
			[&] { p_merge_in_place_2( t, l,      q2 - 1, q3 - 1 ); },	// t[q3] is now in its final place and no longer participates in further processing
			[&] { p_merge_in_place_2( t, q3 + 1, q1,     r      ); }
		);
	}
}

Listing Three
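Listing Three relies on two helpers defined earlier in this series and not repeated here. Reading their contracts off the call sites: block_exchange_mirror(t, l, m, r) swaps the adjacent blocks t[l..m] and t[m+1..r] using three reversals, and my_binary_search(v, t, lo, hi) is a lower_bound-style search over the inclusive range t[lo..hi]. The following are plausible standard-C++ stand-ins, not the article's originals (in particular, the exact handling of equal keys in the original search may differ):

```cpp
#include <algorithm>
#include <cassert>

// Swap the adjacent blocks t[l..m] and t[m+1..r] in place via three
// reversals (the "mirror" trick): reverse each block, then the whole range.
inline void block_exchange_mirror( int* t, int l, int m, int r )
{
	std::reverse( t + l,     t + m + 1 );
	std::reverse( t + m + 1, t + r + 1 );
	std::reverse( t + l,     t + r + 1 );
}

// lower_bound-style search over the inclusive range t[lo..hi]: returns the
// first index whose element is not less than v, or hi + 1 if there is none.
inline int my_binary_search( int v, const int* t, int lo, int hi )
{
	return static_cast< int >( std::lower_bound( t + lo, t + hi + 1, v ) - t );
}
```

The three-reversal exchange touches every element of the range twice on average, which is the extra memory traffic the performance discussion below attributes the bandwidth bound to.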

Table 3 and Graph 3 summarize the performance of this partially parallel implementation.

Time per Run (in seconds)

Number of items                      50       500      5000     50000    500000   5.0E+06  5.0E+07  5.0E+08
STL inplace_merge                    1.9E-06  6.6E-06  5.8E-05  9.1E-04  6.8E-03  8.2E-02  8.1E-01  8.8E+00
In-Place Merge hybrid                2.8E-06  7.0E-06  8.1E-05  9.9E-04  1.3E-02  1.6E-01  1.8E+00  1.9E+01
In-Place Merge hybrid parallel       2.5E-06  6.9E-06  9.6E-05  5.7E-04  4.3E-03  6.0E-02  6.4E-01  8.0E+00
In-Place hybrid / In-Place parallel  1.1      1.0      0.8      1.7      2.9      2.6      2.8      2.4
STL inplace_merge / In-Place parallel  0.7    1.0      0.6      1.6      1.6      1.4      1.3      1.1

Table 3. Performance of the partially parallel implementation.

Graph 3. A graphical view of the data in Table 3.

The parallel implementation is the fastest for large arrays, running on a hyper-threaded quad-core Intel i7-860 processor. It is more than 2.4x faster than the sequential divide-and-conquer implementation for large array sizes, but only about 10% faster than STL inplace_merge() (possibly due to the adaptive nature of the STL implementation). So far, the in-place merge has not scaled as well as the not-in-place merge, most likely due to much higher memory-bandwidth usage in the block swap, which has not been parallelized yet.

Table 4 shows how performance varies with the number of threads used to execute tasks.

Time per Run (in seconds)

Number of items        50       500      5000     50000    500000   5.0E+06  5.0E+07  5.0E+08
Parallel 8 threads     2.5E-06  6.9E-06  9.6E-05  5.7E-04  4.3E-03  6.0E-02  6.4E-01  8.0E+00
Parallel 4 threads     2.0E-06  1.7E-05  6.2E-05  4.9E-04  4.8E-03  6.3E-02  6.5E-01  8.0E+00
Parallel 2 threads     3.2E-06  1.1E-05  7.2E-05  5.5E-04  4.5E-03  6.1E-02  6.3E-01  8.0E+00
Parallel 1 thread      3.1E-06  8.2E-06  7.4E-05  6.1E-04  4.4E-03  6.3E-02  6.4E-01  7.9E+00
4 threads / 8 threads  0.8      2.4      0.6      0.9      1.1      1.0      1.0      1.0
2 threads / 8 threads  1.3      1.6      0.7      1.0      1.0      1.0      1.0      1.0
1 thread / 8 threads   1.2      1.2      0.8      1.1      1.0      1.0      1.0      1.0

Table 4. How performance varies with the number of threads used to execute tasks.

Performance is nearly constant across thread counts, most likely because a memory-bandwidth bottleneck limits the parallel performance gain. This needs further investigation, especially because even the single-threaded run of the parallel implementation outperforms the sequential divide-and-conquer version (although not by a significant amount).

Conclusion

A parallel in-place stable merge algorithm has been developed. The sequential algorithm is not new; it goes back to [1], with a similar implementation available in the STL as inplace_merge(). The parallel implementation is an extension of the Parallel Merge algorithm, using a block swap/exchange algorithm to swap two inner sub-arrays of unequal size in place. The in-place algorithm does not scale as well with the number of cores, most likely because it becomes memory-bandwidth limited, yet it slightly outperforms the sequential algorithm. The order of the algorithm is O(n lg n), and it should extend easily into an in-place stable parallel merge sort. Further performance gains could come from parallelizing the block swap/exchange, along with possible elimination of redundant rotations.
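As a sketch of that merge-sort extension (sequential here; the two recursive calls could be forked with parallel_invoke exactly as in Listing Three, and std::inplace_merge stands in for the in-place merge developed above):

```cpp
#include <algorithm>
#include <cassert>

// In-place stable merge sort over the inclusive range t[l..r]:
// sort each half, then merge the two sorted halves in place.
void merge_sort_in_place( int* t, int l, int r )
{
	if ( r - l <= 0 ) return;						// 0 or 1 element: already sorted
	int m = l + ( r - l ) / 2;
	merge_sort_in_place( t, l, m );					// these two calls are independent
	merge_sort_in_place( t, m + 1, r );				// and could be run via parallel_invoke
	std::inplace_merge( t + l, t + m + 1, t + r + 1 );	// stand-in for the in-place merge
}
```

Because both the sort's recursion and the merge itself can be forked, parallelism is available at every level of the tree, not just at the top.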

References

[1] S. Dvorak, and B. Durian, "Merging by Decomposition Revisited", The Computer Journal, Vol. 31, No. 6, 1988.


Victor Duvanenko is a frequent contributor to Dr. Dobb's on algorithms for parallel programming.

