Now, that the split has been found, block swap can be performed, in-place. This moves all elements less than X to the left of X, and all element greater than or equal to X to the right of X. However, elements left of X are in two segments that now need to be merged. Also the two segments of elements right of X also need to be merged. The two recursive calls take care of these two merge operations. The else portion takes care of the opposite case—when the left segment is larger than the right segment.
Note that the in-place merge algorithm used a non-optimized non-hybrid block swap algorithm, for simplicity. An optimized hybrid algorithm is definitely a performance boosting opportunity.
Performance of several in-place and not-in-place merge algorithms is shown in Table 1 and in Graph 1.
|
Time per Run |
||||||||
|---|---|---|---|---|---|---|---|---|
Number of items |
50 |
500 |
5000 |
50000 |
500000 |
5.0E+06 |
5.0E+07 |
5.0E+08 |
STL inplace_merge |
1.9E-06 |
6.6E-06 |
5.8E-05 |
9.1E-04 |
6.8E-03 |
8.2E-02 |
8.1E-01 |
8.8E+00 |
STL merge |
3.6E-07 |
1.5E-06 |
1.2E-05 |
1.8E-04 |
2.0E-03 |
2.0E-02 |
2.0E-01 |
3.1E+00 |
In-Place Merge pure |
2.9E-06 |
2.1E-05 |
2.1E-04 |
2.4E-03 |
2.4E-02 |
2.5E-01 |
2.8E+00 |
3.1E+01 |
STL inplace_merge / |
5.1 |
4.3 |
4.8 |
5.0 |
3.3 |
4.2 |
4.0 |
2.8 |
In-Place Merge pure / |
1.5 |
3.2 |
3.6 |
2.6 |
3.5 |
3.1 |
3.4 |
3.6 |
Table 1
Graph 1
STL merge is more than 2.8x faster than STL inplace_merge for all array sizes. The pure divide-and-conquer merge is more than 3X slower than STL inplace_merge for large arrays. This performance disparity may be attributed to STL inplace_merge being an adaptive algorithm, switching to STL merge when additional memory is available to allocate a temporary intermediate buffer and copying the result back to the input array. Also, nearly every algorithm that has been explored so far has performed poorly in its pure form. So, let's develop a hybrid in-place merge.
Listing Two shows such a hybrid in-place merge algorithm. STL inplace_merge is used as the recursion termination case—processing small sub-arrays.
template< class _Type >
inline void merge_in_place_1( _Type* t, int l, int m, int r )
{
int length1 = m - l + 1;
int length2 = r - m;
if ( length1 >= length2 )
{
if ( length2 <= 0 ) return;
if (( length1 + length2 ) <= 8192 ) { std::inplace_merge( t + l, t + m + 1, t + r + 1 ); return; }
int q1 = ( l + m ) / 2; // q1 is mid-point of the larger segment
int q2 = my_binary_search( t[ q1 ], t, m + 1, r ); // q2 is q1 partitioning element within the smaller sub-array (and q2 itself is part of the sub-array that does not move)
int q3 = q1 + ( q2 - m - 1 );
block_exchange_mirror( t, q1, m, q2 - 1 ); // 2X speedup
merge_in_place_1( t, l, q1 - 1, q3 - 1 ); // note that q3 is now in its final place and no longer participates in further processing
merge_in_place_1( t, q3 + 1, q2 - 1, r );
}
else {
if ( length1 <= 0 ) return;
if (( length1 + length2 ) <= 8192 ) { std::inplace_merge( t + l, t + m + 1, t + r + 1 ); return; }
int q1 = ( m + 1 + r ) / 2; // q1 is mid-point of the larger segment
int q2 = my_binary_search( t[ q1 ], t, l, m ); // q2 is q1 partitioning element within the smaller sub-array (and q2 itself is part of the sub-array that does not move)
int q3 = q2 + ( q1 - m - 1 );
block_exchange_mirror( t, q2, m, q1 ); // 2X speedup
merge_in_place_1( t, l, q2 - 1, q3 - 1 ); // note that q3 is now in its final place and no longer participates in further processing
merge_in_place_1( t, q3 + 1, q1, r );
}
}
Listing Two
Performance of the hybrid algorithm is shown in Table 2.
|
Time per Run |
||||||||
|---|---|---|---|---|---|---|---|---|
Number of items |
50 |
500 |
5000 |
50000 |
500000 |
5.0E+06 |
5.0E+07 |
5.0E+08 |
STL inplace_merge |
1.9E-06 |
6.6E-06 |
5.8E-05 |
9.1E-04 |
6.8E-03 |
8.2E-02 |
8.1E-01 |
8.8E+00 |
In-Place Merge pure |
2.9E-06 |
2.1E-05 |
2.1E-04 |
2.4E-03 |
2.4E-02 |
2.5E-01 |
2.8E+00 |
3.1E+01 |
In-Place Merge hybrid |
2.8E-06 |
7.0E-06 |
8.1E-05 |
9.9E-04 |
1.3E-02 |
1.6E-01 |
1.8E+00 |
1.9E+01 |
In-Place Merge hybrid / |
1.5 |
1.1 |
1.4 |
1.1 |
1.9 |
1.9 |
2.2 |
2.2 |
In-Place pure / |
1.0 |
3.1 |
2.6 |
2.4 |
1.9 |
1.6 |
1.5 |
1.6 |


