Dr. Dobb's is part of the Informa Tech Division of Informa PLC



Parallel Merge Sort

In last month's article in this series, a Parallel Merge algorithm was introduced and performance was optimized to the point of being limited by system memory bandwidth. Of course, the natural next step is to use it as a core building block for Parallel Merge Sort, since Parallel Merge does most of the work. Let's leap right in and explore this Parallel Merge Sort algorithm.

Simplest Parallel Merge Sort

Listing One shows the simplest Parallel Merge Sort, which uses the Parallel Merge. At the core, the algorithm takes the input range (l to r), computes the mid-point (m), calls itself recursively twice — once with the lower half range, and the second time with the upper half range — followed by merging the two resulting sorted halves into a single output. This is exactly how the standard Merge Sort works — divide, conquer, and merge. The two recursive calls are executed in parallel, followed by a parallel merge. The parallelism is Θ(n/lg2n), derived in [1].

template< class _Type >
inline void parallel_merge_sort_simplest_r( _Type* src, int l, int r, _Type* dst, bool srcToDst = true )	// srcToDst specifies direction for this level of recursion
{
	if ( r == l ) {    // termination/base case of sorting a single element
		if ( srcToDst )  dst[ l ] = src[ l ];    // copy the single element from src to dst
		return;
	}
	int m = ( r + l ) / 2;
	//tbb::parallel_invoke(				// Intel's     Threading Building Blocks (TBB)
	Concurrency::parallel_invoke(		// Microsoft's Parallel Pattern Library  (PPL)
		[&] { parallel_merge_sort_simplest_r( src, l,     m, dst, !srcToDst ); },		// reverse direction of srcToDst for the next level of recursion
		[&] { parallel_merge_sort_simplest_r( src, m + 1, r, dst, !srcToDst ); }		// reverse direction of srcToDst for the next level of recursion
	);
	if ( srcToDst ) merge_parallel_L5( src, l, m, m + 1, r, dst, l );
	else			merge_parallel_L5( dst, l, m, m + 1, r, src, l );
}

template< class _Type >
inline void parallel_merge_sort_simplest( _Type* src, int l, int r, _Type* dst, bool srcToDst = true )	// srcToDst passed along to the recursive implementation
{
	if ( r < l ) return;	// zero-size array: nothing to sort
	parallel_merge_sort_simplest_r( src, l, r, dst, srcToDst );
}

Listing One

Table 1 and Figure 1 show performance of the simplest parallel merge sort implementation as compared with STL stable_sort and STL sort.

Table 1.

Figure 1.

STL stable_sort is slower than STL sort, as expected: it takes more effort and time to ensure stability. The simplest Parallel Merge Sort implementation outperforms both STL sorts for arrays larger than about 10K elements. On a quad-core processor, Parallel Merge Sort is stable and is as much as 3.5X faster than STL stable_sort and as much as 2.7X faster than STL sort.


The Parallel Merge Sort is not an in-place algorithm. Source array elements are sorted and the result is placed in the destination array. Note that the last argument to the recursive function specifies the direction, with the default being "from the source to the destination," which is set at the top-level of recursion. Then at each level of recursion the direction is reversed. Figure 2 illustrates this directional reversal at each recursion level of the call tree.

[Click image to view at full size]
Figure 2.

At the top level of recursion hierarchy the sorted destination (dst) array is produced by merging two sorted source (src) arrays. During this merge operation elements are moved from the source (src) array to the destination array. At the next level, the sorted source (src) array is produced by merging two sorted destination (dst) arrays. This continues until the bottom of the hierarchy is reached with only a single element in the array to be processed.

The termination condition (base case) handles the two possible cases at this point:

  • By doing nothing if the destination array is src. In this case, we are sorting a single element that was provided in the src array and are asked to place the result back in the src array. With a single element, there is nothing to do when sorting it, and it's already in the src array. So, we just leave it there.
  • By copying that single element from the src array to the dst array. In this case, we are sorting a single element that was provided in the src array and are asked to place the result in the dst array. With a single element there is nothing to do when sorting it. To place it in the dst array we simply copy it.

The recursive implementation in Listing One does not handle an input of zero elements, which is signified by r being smaller than l. It could test for this case, but the test would be wasted work: the recursion itself can never generate it, and it can arise only from the user's initial call. So, for efficiency and simplicity, the check for a zero-size array is pulled out into the wrapper function at the bottom of Listing One. This is also the reason l and r are ints instead of unsigned longs: for an array of zero elements, l would be set to zero, requiring r to be set to -1, which an int can represent but an unsigned long cannot. The issue can be avoided by passing a pointer to the start of each array and a pointer one beyond its end (or iterators, as STL does) instead of indexes. Another option is to pass a pointer and an array size as function call arguments.

This implementation is a variation on that presented in the Introduction to Algorithms book [1]. By using direction reversal, memory allocation operations have been eliminated.
