Conclusion
Parallel hybrid merge algorithm was developed that outperformed sequential simple merge as well as STL merge by 0.9-5.8 times overall and by over 5 times for larger arrays, running on a quad-core processor. The pure parallel merge algorithm was shown to perform poorly — substantially slower than the simple sequential merge.
Microsoft PPL and Intel TBB technologies were used for parallel algorithm development, with comparable performance and identical implementations, simplifying development significantly. Lambda functions were used to simplify the parallel implementation.
A parallel algorithm may use a different strategy than a sequential algorithm. The simple sequential merge iterated through the two input arrays, while the parallel divide-and-conquer algorithm split the problem into two sub-problems recursively.
Pure divide-and-conquer sequential and parallel algorithms failed to deliver competitive performance due to large computational overhead per output element and large task-switching overhead. Simple merge has a small constant as it performs little work per output element. The hybrid algorithm approach yielded performance gains for sequential and parallel merge algorithms. For sequential merge, the hybrid divide-and-conquer algorithm outperformed the simple merge algorithm, which seems questionable.
For parallel algorithms, divide-and-conquer proved to be a good strategy, especially coupled with the hybrid approach, enabling parallel processing with high performance. A fast sequential merge was shown to be a critical component of the fast parallel hybrid algorithm.
The hybrid algorithm improved the memory access pattern of the divide-and-conquer algorithm. Pure divide-and-conquer has a nonsequential memory access pattern. The hybrid approach used the simple merge, which has incrementing memory addressing for both inputs and the output. The hybrid parallel algorithm approach provided over 20 times the performance gain over the pure parallel divide-and-conquer algorithm. Otherwise, the pure parallel algorithm was measured to be slower than the simple sequential merge.
As long as task-switching overhead continues to be fairly high, a high-performance parallel algorithm will require a high-performance sequential algorithm to use as a recursion termination case or within the minimal parallel work quanta.
The sequential simple merge algorithm measured at 1.2 GB/second of memory bandwidth. The parallel hybrid merge algorithm used about 7.1 GB/second, nearing the memory bandwidth limit for scalar code, and possibly limiting performance.
References
[1] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein, Introduction to Algorithms, Third Edition, MIT Press, Sept. 2009, pp. 797-805
Parallel In-Place Radix Sort Simplified
Parallel In-Place N-bit-Radix Sort
Quality of Random Numbers versus Performance
In-Place Hybrid Binary-Radix Sort
Optimizing Performance of Sorting Algorithms (Insertion and Selection Sort)
In-Place Hybrid N-bit-Radix Sort
Parallel Counting Sort, Part 2
Stable Hybrid MSD N-bit-Radix Sort
Victor Duvanenko is a long-time contributor to Dr. Dobb's. He can be contacted at [email protected]