For the next several articles in this series, we'll explore parallel and sequential merge algorithms. We'll utilize Intel Threading Building Blocks as well as Microsoft Parallel Patterns Library (PPL), which is part of Visual Studio 2010. We'll implement several different parallel and sequential merge algorithms, and investigate their performance on today's multicore processors. In this article, we'll leap right into a very interesting parallel merge, see how well it performs, and attempt to improve it.

### Merge

Merge is a fundamental operation, where two sets of presorted items are combined into a single set that remains sorted. For example, given two sets of integers

```
5, 11, 12, 18, 20
```

`2, 4, 7, 11, 16, 23, 28`

The resulting merged set is

`2, 4, 5, 7, 11, 11, 12, 16, 18, 20, 23, 28`

The first set has five elements, whereas the second set has seven, and the result set has twelve. Merge does not require the source sets to be of equal size. Merge can operate on two input arrays, and produce an output array. The output array has to be large enough to hold the result, which is generated in `O(n)`

time.

Listing One shows a simple sequential merge algorithm implementation, where two read-only input arrays of different sizes are merged into a single output array. The input arrays must be presorted, monotonically incrementing. The resulting output array will also be sorted.

// _end pointer point not to the last element, but one past and never access it. template< class _Type > inline void merge_ptr( const _Type* a_start, const _Type* a_end, const _Type* b_start, const _Type* b_end, _Type* dst ) { while( a_start < a_end && b_start < b_end ) { if ( *a_start <= *b_start ) *dst++ = *a_start++; // if elements are equal, then a[] element is output else *dst++ = *b_start++; } while( a_start < a_end ) *dst++ = *a_start++; while( b_start < b_end ) *dst++ = *b_start++; }

**Listing One**

The first if statement compares the first element from `a[]`

array and the first element from `b[]`

array, outputting whichever is smaller, with ties choosing the element from `a[]`

to retain stability. Each array has a pointer that keeps track of the current position within that array. Only the pointer of the source array that is output gets incremented. When the current position of either array goes past the end of that array, the first `while`

loop terminates. When all of the elements from one of the arrays have been used, the other array may still have elements left. At that point, comparisons between input arrays are no longer possible. The second and third `while`

loops output any remaining elements.

Merge is `O(n)`

, where `n`

is the number of output elements, since one element is output during each iteration of the `while`

loop(s). More complex merges support more than two input arrays, in-place operation, and can support other data structures such as linked lists.

### Divide-and-Conquer Merge

The aforementioned sequential merge algorithm does not parallelize in an obvious fashion. Divide-and-conquer algorithms typically parallelize well. Merge is tricky since it's `O(n)`

and making the order worse is not desirable. In other words, we want an `O(n)`

or better divide-and-conquer merge algorithm. The authors of the latest (3rd edition) *Introduction to Algorithms* book [1] develop, describe and analyze such an algorithm, in the new chapter on multithreaded algorithms. Listing Two shows a direct translation of the pseudocode parallel implementation into sequential C++ implementation. A binary search algorithm is also shown, which is a translation of the pseudocode on p. 799. Figure 1 illustrates how the algorithm works by showing the process within one level of recursion.

// This version is borrowed from "Introduction to Algorithms" 3rd edition, p. 799. template< class _Type > inline int my_binary_search( _Type value, const _Type* a, int left, int right ) { long low = left; long high = __max( left, right + 1 ); while( low < high ) { long mid = ( low + high ) / 2; if ( value <= a[ mid ] ) high = mid; else low = mid + 1; // because we compared to a[mid] and the value was larger than a[mid]. // Thus, the next array element to the right from mid is the next possible // candidate for low, and a[mid] can not possibly be that candidate. } return high; } template < class Item > inline void exchange( Item& A, Item& B ) { Item t = A; A = B; B = t; } // Merge two ranges of source array T[ p1 .. r1 ] and T[ p2 .. r2 ] into destination array A starting at index p3. // From 3rd ed. of "Introduction to Algorithms" p. 798-802 // Listing 2 (which also needs to include the binary search implementation as well) template< class _Type > inline void merge_dac( const _Type* t, int p1, int r1, int p2, int r2, _Type* a, int p3 ) { int length1 = r1 - p1 + 1; int length2 = r2 - p2 + 1; if ( length1 < length2 ) { exchange( p1, p2 ); exchange( r1, r2 ); exchange( length1, length2 ); } if ( length1 == 0 ) return; int q1 = ( p1 + r1 ) / 2; int q2 = my_binary_search( t[ q1 ], t, p2, r2 ); int q3 = p3 + ( q1 - p1 ) + ( q2 - p2 ); a[ q3 ] = t[ q1 ]; merge_dac( t, p1, q1 - 1, p2, q2 - 1, a, p3 ); merge_dac( t, q1 + 1, r1, q2, r2, a, q3 + 1 ); }

**Listing Two**

The input to the divide-and-conquer merge algorithm comes from two sub-arrays of `T`

, and the output is a single sub-array `A`

. The two input sub-arrays of `T`

are from `[p`

and from `[p`

. The output sub-array of `A`

is from `[p`

. The divide step is done by choosing the middle element of the larger of the two input sub-arrays (at index `q`

in Figure 1 and in Listing Two) . The value at this index is then used as the partition element to split the other input sub-array into two sections — less than `X`

in Figure 1 and greater than or equal to `X`

. The partition value `X`

(at index `q`

of `T`

) is copied to array `A`

at index `q`

. The conquer step recursively merges the two portions that are smaller than or equal to `X`

— indicated by light gray box and arrow shading. It also recursively merges the two portions that are larger than or equal to `X`

— indicated by darker gray box and arrow shading.

The algorithm proceeds recursively until the termination condition — when the shortest of the two input sub-arrays has no elements. At each step, the algorithm reduces the size of the array by one element. One merge is split into two smaller merges, with the output element placed in between. Each of the two smaller merges will contain at least` N/4`

elements, since the left input array is split in half.

The algorithm is stable, because each recursion step is stable. At each step, the elements of the left array are smaller (or equal to) the elements of the right array. When the split is made using the mid-left element, only right array elements that are strictly less than the split element (`X`

) are used in the subsequent right smaller merge, which is the crucial step ensuring stability.

Table 1 and Graph 1 show performance measurements for merging of two random input arrays that are pre-sorted, for the simple merge, STL merge, and divide-and-conquer merge.

The simple merge and STL merge algorithms are always faster than the divide-and-conquer merge algorithm, by 2-5 times. Performance measurements confirm all algorithms to be linear. The number of items in Table 1 and Graph 1 indicates the size of each of the two input arrays (of equal size), producing the resulting array of twice that size. Each item is a 32-bit unsigned number. Thus, the simple merge algorithm outputs items at a rate of 153 Million/second. The total memory bandwidth is 1.2 GB/second for reads and writes.