### Parallel Merge

Transforming the divide-and-conquer sequential merge algorithm into a parallel algorithm turns out to be very simple when using the Microsoft Parallel Patterns Library (PPL) or Intel Threading Building Blocks (TBB) along with C++ lambda functions. Lambda functions are new to C++ and are part of the C++0x standard, which is supported by Microsoft Visual Studio 2010 and the Intel C++ Compiler.

Listing Three shows the initial parallel implementation for both PPL and TBB, which are identical except for the namespace: `Concurrency::` for PPL, and `tbb::` for TBB. The parallel function `parallel_invoke()` is used to implement concurrency in both libraries. Multiple functions are passed as arguments to `parallel_invoke()`, which evaluates them, possibly in parallel.

```cpp
template< class _Type >
inline void merge_parallel_L3( _Type* t, int p1, int r1, int p2, int r2, _Type* a, int p3 )
{
    int length1 = r1 - p1 + 1;
    int length2 = r2 - p2 + 1;
    if ( length1 < length2 )
    {
        exchange( p1, p2 );
        exchange( r1, r2 );
        exchange( length1, length2 );
    }
    if ( length1 == 0 ) return;
    int q1 = ( p1 + r1 ) / 2;
    int q2 = my_binary_search( t[ q1 ], t, p2, r2 );
    int q3 = p3 + ( q1 - p1 ) + ( q2 - p2 );
    a[ q3 ] = t[ q1 ];
    tbb::parallel_invoke(    //Concurrency::parallel_invoke(
        [&] { merge_parallel_L3( t, p1,     q1 - 1, p2, q2 - 1, a, p3     ); },
        [&] { merge_parallel_L3( t, q1 + 1, r1,     q2, r2,     a, q3 + 1 ); }
    );
}
```

**Listing Three**
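The listings rely on three helper routines defined in earlier parts of this series but not shown here: `exchange()`, `my_binary_search()`, and `merge_ptr()`. The sketch below gives one plausible set of definitions matching how the listings call them; the exact bodies are assumptions, not the original code. `my_binary_search()` returns the index within `a[ l .. r ]` at which `value` would be inserted to keep the range sorted, and `merge_ptr()` is the simple sequential merge.

```cpp
#include <algorithm>

// Swap two values in place (used to make the first range the larger one).
// Assumed implementation, matching the calls exchange( p1, p2 ) on int indices.
template< class _Type >
inline void exchange( _Type &a, _Type &b )
{
    _Type tmp = a;  a = b;  b = tmp;
}

// Return the index within a[ l .. r ] where 'value' would be inserted to keep
// the range sorted, i.e., the split point used by the divide-and-conquer merge.
// For an empty range ( r < l ) it simply returns l.
template< class _Type >
inline int my_binary_search( _Type value, const _Type* a, int l, int r )
{
    int lo = l, hi = std::max( l, r + 1 );
    while ( lo < hi )
    {
        int mid = lo + ( hi - lo ) / 2;
        if ( value <= a[ mid ] ) hi = mid;
        else                     lo = mid + 1;
    }
    return lo;
}

// Simple sequential merge of half-open ranges [ a_start, a_end ) and
// [ b_start, b_end ) into dst, preserving sorted order.
template< class _Type >
inline void merge_ptr( const _Type* a_start, const _Type* a_end,
                       const _Type* b_start, const _Type* b_end, _Type* dst )
{
    while ( a_start < a_end && b_start < b_end )
    {
        if ( *a_start <= *b_start ) *dst++ = *a_start++;
        else                        *dst++ = *b_start++;
    }
    while ( a_start < a_end ) *dst++ = *a_start++;
    while ( b_start < b_end ) *dst++ = *b_start++;
}
```

These signatures line up with the call sites in Listings Three through Five: `exchange()` operates on `int` indices, `my_binary_search()` may receive an empty second range, and `merge_ptr()` takes half-open pointer ranges.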

Lambda functions simplify the code by enabling the functions to be defined in-place. The `[&]` indicates that all variables used inside the lambda are captured by reference. `parallel_invoke()` is a good choice for this algorithm because it allows complete control over the ranges of the input and output arrays, which the algorithm needs. Other parallel functions, such as `parallel_reduce()` and `parallel_for()`, generate array ranges "automagically," which makes them useful in other cases (as shown in previous parts of this series).

Table 2 and Graph 2 show performance measurements for two sequential and two parallel implementations: simple merge, divide-and-conquer merge, PPL parallel merge, and TBB parallel merge.

Both parallel divide-and-conquer implementations perform poorly and are several times slower than the sequential implementations. These parallel implementations running on four cores are 29-69 times slower than the simple merge sequential algorithm running on a single core. This is a substantial performance gap to close. However, so far only pure algorithms have been utilized, and now is a good time to mix things up.

### Hybrid Approach

In previous articles of this series, the hybrid algorithm approach was used to accelerate performance of several sequential and parallel algorithms. Since the simple merge is faster than the divide-and-conquer algorithm for merging small arrays, it could be used to terminate recursion early to handle small merges. Listing Four shows the hybrid divide-and-conquer merge algorithm.

```cpp
// The hybrid divide-and-conquer algorithm implementation
template< class _Type >
inline void merge_dac_hybrid( const _Type* t, int p1, int r1, int p2, int r2, _Type* a, int p3 )
{
    int length1 = r1 - p1 + 1;
    int length2 = r2 - p2 + 1;
    if ( length1 < length2 )
    {
        exchange( p1, p2 );
        exchange( r1, r2 );
        exchange( length1, length2 );
    }
    if ( length1 == 0 ) return;
    if (( length1 + length2 ) <= 8192 )
        merge_ptr( &t[ p1 ], &t[ p1 + length1 ], &t[ p2 ], &t[ p2 + length2 ], &a[ p3 ] );
    else
    {
        int q1 = ( p1 + r1 ) / 2;
        int q2 = my_binary_search( t[ q1 ], t, p2, r2 );
        int q3 = p3 + ( q1 - p1 ) + ( q2 - p2 );
        a[ q3 ] = t[ q1 ];
        merge_dac_hybrid( t, p1, q1 - 1, p2, q2 - 1, a, p3 );
        merge_dac_hybrid( t, q1 + 1, r1, q2, r2, a, q3 + 1 );
    }
}
```

**Listing Four**

Table 3 shows performance measurements for the sequential pure and hybrid divide-and-conquer merge algorithm implementations, with the threshold for switching between divide-and-conquer and the simple merge algorithm varied (shown in parentheses).

The hybrid approach substantially accelerates the sequential divide-and-conquer algorithm. As the threshold value is increased, the performance also increases. For a threshold of 32, the acceleration is 2-3 times, whereas a threshold of 8192 increased performance by 5-10 times. The hybrid divide-and-conquer sequential merge algorithm even outperforms the sequential simple merge algorithm, which seems incorrect and should be investigated in the future (it seems the performance should only approach that of the simple merge).

For parallel algorithms, grain size has been shown to be one of the key performance parameters. Grain size is the minimum amount of work that is worthwhile to perform in parallel as a task. Parallel tasks may get moved between cores, which has a finite overhead cost. Thus, parallel tasks must be large enough to keep this overhead a small percentage of the overall work being performed. This is the main reason the aforementioned pure parallel algorithm performs poorly: the grain size is potentially as small as a single array element, which is too small to be executed efficiently in parallel. Intel TBB recommends keeping the grain size at more than 10K CPU clock cycles. The hybrid algorithm approach works well for this issue (as we've seen in previous articles of this series). Listing Five shows the hybrid divide-and-conquer parallel merge algorithm.
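The grain-size argument can be made concrete with a small counting model. A pure divide-and-conquer merge places one element per recursive step and spawns two subtasks for the remainder, while a base-case threshold stops the recursion much earlier. The sketch below counts spawned tasks under the simplifying assumption that each split divides the remaining elements exactly in half; it is a back-of-the-envelope model, not a measurement of PPL or TBB.

```cpp
// Count the tasks spawned by a divide-and-conquer merge of n total elements
// when recursion switches to a simple merge once a sub-merge holds at most
// 'threshold' elements. Model assumption: each recursive step places one
// element and splits the remainder in half.
long long task_count( long long n, long long threshold )
{
    if ( n <= threshold ) return 1;      // one leaf task (a simple merge)
    long long rest = n - 1;              // one element placed at this step
    return 1 + task_count( rest / 2, threshold )
             + task_count( rest - rest / 2, threshold );
}
```

For a merge of about one million elements, this model yields on the order of a million tasks at a threshold of 1, but only a few hundred at a threshold of 8192, each leaf then being large enough to amortize task-scheduling overhead.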

```cpp
template< class _Type >
inline void merge_parallel_L5( _Type* t, int p1, int r1, int p2, int r2, _Type* a, int p3 )
{
    int length1 = r1 - p1 + 1;
    int length2 = r2 - p2 + 1;
    if ( length1 < length2 )
    {
        exchange( p1, p2 );
        exchange( r1, r2 );
        exchange( length1, length2 );
    }
    if ( length1 == 0 ) return;
    if (( length1 + length2 ) <= 16 )
        merge_ptr( &t[ p1 ], &t[ p1 + length1 ], &t[ p2 ], &t[ p2 + length2 ], &a[ p3 ] );
    else
    {
        int q1 = ( p1 + r1 ) / 2;
        int q2 = my_binary_search( t[ q1 ], t, p2, r2 );
        int q3 = p3 + ( q1 - p1 ) + ( q2 - p2 );
        a[ q3 ] = t[ q1 ];
        tbb::parallel_invoke(    //Concurrency::parallel_invoke(
            [&] { merge_parallel_L5( t, p1,     q1 - 1, p2, q2 - 1, a, p3     ); },
            [&] { merge_parallel_L5( t, q1 + 1, r1,     q2, r2,     a, q3 + 1 ); }
        );
    }
}
```

**Listing Five**

Tables 4 and 5 show performance measurements for the pure and hybrid parallel divide-and-conquer merge algorithm implementations (PPL and TBB), with the threshold for switching between divide-and-conquer and the simple merge algorithm varied (shown in parentheses).

The hybrid approach also substantially accelerates the parallel divide-and-conquer algorithms. As the threshold value is increased, the performance also increases. For a threshold of 32, acceleration is 2-15 times, whereas a threshold of 8192 increased performance by over 20 times. The parallel hybrid divide-and-conquer merge algorithm is the highest-performance algorithm developed so far.

The hybrid algorithm approach works well for the following reason. A pure divide-and-conquer merge algorithm splits the two input arrays recursively, creating a binary recursion tree, until one of the arrays has no elements. The algorithm performs a fair amount of computation during each recursion, while outputting a single array element. A hybrid algorithm approach terminates the recursion tree earlier, set by the threshold, and uses the faster simple merge algorithm to process merges of small arrays. The faster simple merge performs very little work per output element.

In the hybrid approach, some of the work is performed by the divide-and-conquer structure, and some by the simple merge. The threshold specifies how small the input arrays need to be before switching to the simple merge algorithm. As the threshold is increased, the amount of work performed by the simple merge increases, which brings performance closer to that of the simple merge. The resulting hybrid algorithm approaches the performance of the simple merge, but retains the divide-and-conquer structure that enables parallelism.
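This division of labor can be observed directly by instrumenting a sequential version of the hybrid merge. The sketch below is an illustrative reimplementation, with `std::merge` and `std::lower_bound` standing in for the series' `merge_ptr()` and `my_binary_search()` helpers; counters record how many output elements are produced by the simple-merge base case versus one at a time by the split steps.

```cpp
#include <algorithm>

// Sequential hybrid divide-and-conquer merge with the structure of Listing
// Four, instrumented to show the split of work between the base case and the
// recursive splits. Illustrative sketch, not the article's exact code.
static long long g_by_simple = 0;   // elements written by the base-case simple merge
static long long g_by_split  = 0;   // elements written singly at split steps

void merge_hybrid_counted( const int* t, int p1, int r1, int p2, int r2,
                           int* a, int p3, int threshold )
{
    int length1 = r1 - p1 + 1;
    int length2 = r2 - p2 + 1;
    if ( length1 < length2 )
    {
        std::swap( p1, p2 );  std::swap( r1, r2 );  std::swap( length1, length2 );
    }
    if ( length1 == 0 ) return;
    if (( length1 + length2 ) <= threshold )
    {
        // Base case: simple merge handles the whole sub-problem at once.
        std::merge( t + p1, t + p1 + length1, t + p2, t + p2 + length2, a + p3 );
        g_by_simple += length1 + length2;
        return;
    }
    int q1 = ( p1 + r1 ) / 2;
    // Index of the first element in t[ p2 .. r2 ] that is >= t[ q1 ].
    int q2 = (int)( std::lower_bound( t + p2, t + r2 + 1, t[ q1 ] ) - t );
    int q3 = p3 + ( q1 - p1 ) + ( q2 - p2 );
    a[ q3 ] = t[ q1 ];                  // split step places a single element
    g_by_split++;
    merge_hybrid_counted( t, p1, q1 - 1, p2, q2 - 1, a, p3, threshold );
    merge_hybrid_counted( t, q1 + 1, r1, q2, r2, a, q3 + 1, threshold );
}
```

Merging two 1,000-element arrays with a threshold of 32, the large majority of elements still flow through the simple merge and only the remainder are placed singly by split steps; with a threshold of 8192, the entire merge is handled by a single simple-merge call.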

The hybrid approach also improves memory access pattern of the divide-and-conquer algorithm, since the simple merge reads and writes memory sequentially, whereas the pure divide-and-conquer algorithm does not.

The two parallel merge implementations (PPL and TBB) also benefit from the hybrid approach by having a larger parallel grain size, that is, a larger parallel work quantum. Each task performs enough work to keep the overhead of task scheduling negligible. Thus, the hybrid parallel merge benefits both from a larger grain size and from combining with the faster simple merge. The PPL implementation speedup is 23-48 times, and the TBB speedup is 20-58 times.

Table 6 and Graph 3 compare performance of the best sequential and parallel merge algorithms developed in this article.

The parallel hybrid divide-and-conquer merge algorithm performs 0.9-5.8 times better than the sequential STL merge and simple merge algorithms on a quad-core processor, with larger arrays outperforming by over 5 times. Parallel processing combined with a hybrid algorithm approach provides a powerful, high-performance result.

The overall memory bandwidth used by the fastest algorithm is 7.1 GB/second for reads and writes, which is approaching the maximum available when using scalar code, as shown in Parallel Counting Sort (Part 2). Thus, parallel algorithms are bumping up against the system memory bandwidth limit for merge, as sorting algorithms did in previous articles of this series. To continue improving performance, memory-bandwidth-frugal algorithms would need to be developed.