Figure 2 shows two sorted arrays on the top line, in which an overall median is to be found. On the lower line, the two arrays have been merged into a single sorted array with the median shown, as the mid-element at offset 4 labeled with an M. Using a binary search within the first array, element G (for "guess") is selected as a guess for the overall median. Two elements on the right of G must be greater than or equal to G, since the array A is sorted. For G to be the overall median (M), it must be less than or equal to 4 elements, within the overall sorted array C. Two of these 4 elements come from array A, which implies that the rest of the elements (two more) must come from array B. Also, G must be greater than or equal to the rest of the elements in array B, i.e. B[0-1]. Since B is a sorted array, G only has to be greater than or equal to B, as B ≤ B. In other words, if B ≤ G ≤ B, then G is the overall median M.
Figure 2: Two sorted arrays and a merge.
In Figure 2, the first guess G has a value of 3, and needs to satisfy B ≤ 3 ≤ B, where B=5 and B=6. This guess G of 3 cannot be the overall median, since it fails to satisfy 5 ≤ 3 comparison. As G failed to be bigger than B, the modified binary search algorithm moves the guess to the right within A. An additional explanation video is available here.
A simple and quick, constant time, test has now been developed to see whether an element is an overall median. In principle, the algorithm consists of a binary search within the first sorted array A, where the split of array B is tested to see if the current guess G is an overall median. If the search fails within array A, then array B is searched, since the median could be in either array. Thus, in the worst case, both arrays are binary searched, resulting in O( lg(|A|) + lg(|B|)), where |A| is the size of A..
Listing One shows the implementation of the "Median of Two Sorted Arrays" algorithm.
The algorithm is broken into two parts: the outer part and the inner part. The outer part (
medianOfTwoSortedArrays_m2), which is called by the user, performs setup, handles the special cases of one or both sorted arrays having zero elements, leaving the rest of the work for the inner algorithm. Interestingly, the same inner routine is performed twice: once for array A, and the second time for array B, but only if the overall median was not found in A.
The inner algorithm (
medianOfTwoSortedArrays_m2_inner on line 8) performs a modified binary search on one of the arrays. It starts out with a full range over that array, and chooses the mid-element within that range. It then computes the split point within the other array and performs the comparisons (if possible) for the median test. Note the condition of the split being outside the bounds of the second array, or right at the edge, are handled graciously (lines 20 and 22). In case of being right at the boundary, only a single comparison is possible (lines 22 and 33).
When the current guess is found to be too small and needs to be increased for the next guess, the lower range boundary is adjusted to the current location plus one, to remove the current guess location, which failed to be the median from further consideration. When the current guess is found to be too large, the high boundary is adjusted similarly (lines 28 and 50). The search continues in this fashion, until either the overall median is found or the low and high boundaries cross, and thus there are no more elements that can be searched for within the current array.
Let's say that we have 101 elements in sorted array A and only 2 elements in sorted array B. If the overall median is within A, then it's going to be close to the median of A itself, no matter what the values are in B. Thus, there is no reason to search all of A for the overall median. Narrowing the search range should speed-up the search. How much should the range be around the median of A? Maybe +/- length of B, or possibly even less…
There are only a few possible scenarios. In the first, all elements of B are smaller than median of A. In the second, all elements of B are larger than median of A. Or elements of B straddle the median of A. Of course, this is the case when B has only two elements. Otherwise, there are many more straddling possibilities. If all elements of B are smaller than median of A (first case), then the overall median starts at the median of A and moves higher by half the length of B. In case two, the overall median starts at the median of A and moves lower by half the length of B. The third case is somewhere in between cases one and two. Thus, it's only necessary to search +/- (1/2 length of B) away from the median of A, where A is the larger of the two arrays.
The order of the resulting algorithm is O(2*log(min(|A|,|B|))), since both arrays still need to be searched, but now reduced to the range of the smallest array, both times.
The algorithms presented in Parallel Merge and Parallel In-Place Merge are both O(log(min(|A|,|B|))), where |A| is the size of A, since the binary search is performed within the smaller array and is O(lgN). The algorithm presented in this article is O(log(|A|+|B|)), which is slightly worse. With further optimizations the order was reduced to O(log(2*min(|A|,|B|))), which is better, but is 2X more work, since both arrays may have to be searched. All algorithms are logarithmic.
Two binary searches were necessary to find an even split that produced two equal or nearly equal halves. Luckily, this part of the merge algorithm is not performance critical. So, more effort can be spent looking for a better split. Using this algorithm in the two parallel merges should balance the recursive binary tree of the divide-and-conquer and improve the worst-case performance of parallel merge sort.
I'll explore its integration into parallel merge and parallel merge sort in my next article.
Victor Duvanenko is a frequent contributor to Dr. Dobb's on algorithms for parallel programming.