Introduction
In my article on parallel merge, I developed and optimized a generic parallel merge algorithm. It utilized multiple CPU cores and scaled well in performance. The algorithm was stable, leading directly to Parallel Merge Sort. However, it was not an in-place implementation. In this article, I develop an in-place parallel merge, which enables in-place parallel merge sort.
The STL implements sequential merge and in-place merge algorithms, as merge() and inplace_merge(), which run on a single CPU core. The merge algorithm is O(n), leading to O(nlgn) merge sort, while the in-place merge is O(nlgn), leading to O(n(lgn)2) in-place merge sort. Because merge is faster than in-place merge, but requires extra memory, STL favors using merge whenever memory is available, copying the result back to the input array (under the hood).
Revisit Not-In-Place
The divide-and-conquer merge algorithm described in Parallel Merge, which is not-in-place, is illustrated in Figure 1. At each step, this algorithm moves a single element X from the source array T to the destination array A, as follows.

Figure 1. A merge algorithm that is not in place.
The two input sub-arrays of T are from [p1 to r1] and from [p2 to r2]. The output is a single sub-array of A from [p3 to r3]. The divide step is done by choosing the middle element within the larger of the two input sub-arrays—at index q1. The value at this index is then used as the partition element to split the other input sub-array into two sections - less than X and greater than or equal to X. The partition value X (at index q1 of T) is copied to array A at index q3. The conquer step recursively merges the two portions that are smaller than or equal to X— indicated by light gray boxes and light arrows. It also recursively merges the two portions that are larger than or equal to X—indicated by darker gray boxes and dark arrows.
The algorithm proceeds recursively until the termination condition —when the shortest of the two input sub-arrays has no elements. At each step, the algorithm reduces the size of the array by one element. One merge is split into two smaller merges, with the output element placed in between. Each of the two smaller merges will contain at least N/4 elements, since the left input array is split in half.
This algorithm is O(n) and is stable. It was shown not to be performance competitive in its pure form. However, performance became equivalent to other merges in a hybrid form, which utilized a simple merge to terminate recursion early. The hybrid version enabled parallelism by using of divide-and-conquer method, scaled well across multiple cores, outperforming STL merge by over 5X on quad-core CPU, being limited by memory bandwidth. It was also used as the core of Parallel Merge Sort, which scaled well across multiple cores, utilizing a further hybrid approach to gain more speed, and providing a high degree of parallelism, Θ(n/lg2n).


