# Algorithm Improvement through Performance Measurement: Part 2

### Hybrid

The STL sort implementation combines three different sorting algorithms to extract the best attributes of each, resulting in superior overall performance. STL sort combines the best of QuickSort, Heap Sort, and Insertion Sort algorithms, enabling each under different conditions, where each excels over others. It starts out using QuickSort for 1.5*log2(n) divisions of recursion depth, followed by Heap Sort if the resulting divisions still hold more than 32 elements, and Insertion Sort if they hold between 32 and 2 elements. Insertion Sort provides good cache locality at the leaf nodes of the binary recursion tree, and it is adaptive to input data [4]. Ending recursion at about 32 elements avoids the function call and return overhead when the number of elements is small.

The same good ideas can be applied to Radix Sort (Binary or higher radix). Listing 2 shows such an implementation, which uses a version of Insertion Sort from [4], also included in Listing 2.

```// Copyright(c), Victor J. Duvanenko, 2009
// In-place Binary Radix Sort implementations.

#ifndef _InPlaceBinaryRadixSort2_h
#define _InPlaceBinaryRadixSort2_h

// Swap that does not check for self-assignment.
template< class _Type >
inline void _swap( _Type& a, _Type& b )
{
_Type tmp = a;
a         = b;
b         = tmp;
}

template< class _Type >
inline void insertionSortSimilarToSTLnoSelfAssignment( _Type* a, unsigned long a_size )
{
for ( unsigned long i = 1; i < a_size; i++ )
{
if ( a[ i ] < a[ i - 1 ] )		// no need to do (j > 0) compare for the first iteration
{
_Type currentElement = a[ i ];
a[ i ] = a[ i - 1 ];
unsigned long j;
for ( j = i - 1; j > 0 && currentElement < a[ j - 1 ]; j-- )
{
a[ j ] = a[ j - 1 ];
}
a[ j ] = currentElement;	// always necessary work/write
}
// Perform no work at all if the first comparison fails - i.e. never assign an element to itself!
}
}

inline void _binaryRadixSort_wInsertion( unsigned long* a, long first, long last, unsigned long bitMask )
{
if (( last - first ) > 32 )
{
// Split the provided array range into 0's and 1's sub-ranges
long _zerosEndPlusOne = first;						// index is one beyond the last  0's portion
long _onesEndMinusOne = last;						// index is one before the first 1's portion
for ( ; _zerosEndPlusOne <= _onesEndMinusOne; )
{
if ( 0 == ( bitMask & a[ _zerosEndPlusOne ] ))			// 0's portion
{
// nothing to do, as this is our 0's portion of the array
_zerosEndPlusOne++;
}
else {											// 1's portion
_swap( a[ _zerosEndPlusOne ], a[ _onesEndMinusOne ] );
_onesEndMinusOne--;
}
}
// Recursively call to sort 0's portion and 1's portion using the next bit lower
bitMask >>= 1;
if ( bitMask != 0 )
{
if ( first < ( _zerosEndPlusOne - 1 ))
_binaryRadixSort_wInsertion( a, first, _zerosEndPlusOne - 1, bitMask );
if (( _onesEndMinusOne + 1 ) < last )
_binaryRadixSort_wInsertion( a, _onesEndMinusOne + 1,  last, bitMask );
}
}
else {
insertionSortSimilarToSTLnoSelfAssignment( &a[ first ], last - first + 1 );
}
}
inline void binaryRadixSortInPlace_wInsertion( unsigned long* a, unsigned long a_size )
{
if ( a_size <  2 )	return;
unsigned long bitMask = 0x80000000;

_binaryRadixSort_wInsertion( a, 0, a_size - 1, bitMask );
}

endif	// _InPlaceBinaryRadixSort_h

```
Listing 2

Tables 2 through 5 and Graph 1 compare performance of Hybrid Binary-Radix Sort with STL sort and IPP Radix Sort.

Table 2: Random 32-bit Unsigned Elements

Graph 1

Table 3: Increasing 32-bit Unsigned Elements

Table 4: Decreasing 32-bit Unsigned Elements

These measurements and the ratios computed from them, show that the Hybrid Binary-Radix Sort consistently outperforms STL sort for random input data set, by 20%, but not for increasing and decreasing input data sets, where it lags by up to 40%. Intel's IPP Radix Sort outperformed all others by at least 3X for random input data, but lagged STL sort for other data input statistics.

### Improvements

Several inefficiencies are noticeable in the sorting example of six 3-bit numbers above:

• In the next-to-last step of the algorithm, the element a2 is swapped with itself.
• In lines 3 and 4, the swaps are performed between elements which both belong to the 1's bin.

The first improvement can be implemented using a method similar to that used in [4], where self-assignment was eliminated without performing additional work. In this case, the last loop can be extracted from the for loop by changing the comparison from &b><= to <. The tricky detail here is that:

```if (_zerosEndPlusOne == _onesEndMinusOne )

```

is not necessary and can be removed, because all recursion calls check to make sure that each bin holds at least two elements. Thus, the for loop is guaranteed to run for at least one iteration, leaving the "==" condition as the only work left after the for loop; for example, the only work left is when a single element is left (and a swap is not necessary). Note that in this case, the amount of work was actually reduced to handle this condition, since the "==" comparison was removed. This optimization resulted in 4-5% performance improvement for random input data set of 32-bit unsigned integers. The implementation is in Listing 3.

```// Copyright(c), Victor J. Duvanenko, 2009
// In-place Binary Radix Sort implementations.

#ifndef _InPlaceBinaryRadixSort3_h
#define _InPlaceBinaryRadixSort3_h

// Swap that does not check for self-assignment.
template< class _Type >
inline void _swap( _Type& a, _Type& b )
{
_Type tmp = a;
a         = b;
b         = tmp;
}

template< class _Type >
inline void insertionSortSimilarToSTLnoSelfAssignment( _Type* a, unsigned long a_size )
{
for ( unsigned long i = 1; i < a_size; i++ )
{
if ( a[ i ] < a[ i - 1 ] )		// no need to do (j > 0) compare for the first iteration
{
_Type currentElement = a[ i ];
a[ i ] = a[ i - 1 ];
unsigned long j;
for ( j = i - 1; j > 0 && currentElement < a[ j - 1 ]; j-- )
{
a[ j ] = a[ j - 1 ];
}
a[ j ] = currentElement;	// always necessary work/write
}
// Perform no work at all if the first comparison fails - i.e. never assign an element to itself!
}
}

template< class _Type >
inline void _binaryRadixSortNoSelfAssignment_unsigned( _Type* a, long first, long last, _Type bitMask )
{
// Split the provided array range into 0's and 1's portions
long _zerosEndPlusOne = first;						// index is one beyond the last  0's portion
long _onesEndMinusOne = last;						// index is one before the first 1's portion
for ( ; _zerosEndPlusOne < _onesEndMinusOne; )
{
if ( 0 == ( bitMask & a[ _zerosEndPlusOne ] ))				// 0's portion
{
// this element belongs in the 0's portion
// so just keep it in its place and grow the 0's portion size
_zerosEndPlusOne++;
}
else {														// 1's portion
_swap( a[ _zerosEndPlusOne ], a[ _onesEndMinusOne ] );	// move this element to 1's portion
_onesEndMinusOne--;
}
}
if ( 0 == ( bitMask & a[ _zerosEndPlusOne ] ))				// 0's portion
{
_zerosEndPlusOne++;
}
else {														// 1's portion
_onesEndMinusOne--;										// No swap is needed, since it's self-assignment
}
// Recursively call to sort 0's portion and 1's portion using the next bit lower
bitMask >>= 1;
if ( bitMask != 0 )						// end recursion when all the bits have been processes
{
if ( first < ( _zerosEndPlusOne - 1 ))		// call only if number of elements >= 2
_binaryRadixSortNoSelfAssignment_unsigned( a, first, _zerosEndPlusOne - 1, bitMask );
if (( _onesEndMinusOne + 1 ) < last )		// call only if number of elements >= 2
_binaryRadixSortNoSelfAssignment_unsigned( a, _onesEndMinusOne + 1,  last, bitMask );
}
}

inline void binaryRadixSortInPlaceNoSelfAssignment_unsigned( unsigned long* a, unsigned long a_size )
{
if ( a_size < 2 )	return;

unsigned long bitMask = 0x80000000;

_binaryRadixSortNoSelfAssignment_unsigned( a, 0, a_size - 1, bitMask );
}

endif	// _InPlaceBinaryRadixSort_h

```
Listing 3

The second improvement was prompted by [5] and shows up clearly in the sorting example above. Listing 4 shows an implementation which removes redundant swaps of elements that both belong to the 1's bin. Instead, the algorithm switches to growing the 1's bin while looking for a 0's bin element to swap with.

```// Copyright(c), Victor J. Duvanenko, 2009
// In-place Binary Radix Sort implementations.

#ifndef _InPlaceBinaryRadixSort4_h
#define _InPlaceBinaryRadixSort4_h

// Swap that does not check for self-assignment.
template< class _Type >
inline void _swap( _Type& a, _Type& b )
{
_Type tmp = a;
a         = b;
b         = tmp;
}

template< class _Type >
inline void insertionSortSimilarToSTLnoSelfAssignment( _Type* a, unsigned long a_size )
{
for ( unsigned long i = 1; i < a_size; i++ )
{
if ( a[ i ] < a[ i - 1 ] )		// no need to do (j > 0) compare for the first iteration
{
_Type currentElement = a[ i ];
a[ i ] = a[ i - 1 ];
unsigned long j;
for ( j = i - 1; j > 0 && currentElement < a[ j - 1 ]; j-- )
{
a[ j ] = a[ j - 1 ];
}
a[ j ] = currentElement;	// always necessary work/write
}
// Perform no work at all if the first comparison fails - i.e. never assign an element to itself!
}
}

template< class _Type >
inline void _binaryRadixSort_wInsertionMinSwapsSymmetric_unsignedOnly( _Type* a, long first, long last, _Type bitMask )
{
if (( last - first ) > 32 )
{
// Split the provided array range into 0's and 1's bins
long _zerosEndPlusOne = first;						// index is one beyond the last  0's portion
long _onesEndMinusOne = last;						// index is one before the first 1's portion
for ( ; _zerosEndPlusOne <= _onesEndMinusOne; )
{
if ( 0 == ( bitMask & a[ _zerosEndPlusOne ] ))	// 0's bin
{
_zerosEndPlusOne++;							// grow 0's bin
}
else {											// 1's portion
do	// Locate an element that belongs in 0's portion, to eliminate unnecessary swaps
{
if ( 0 != ( bitMask & a[ _onesEndMinusOne ]))
{
_onesEndMinusOne--;					// grow 1's bin of the array
}
else
{
_swap( a[ _zerosEndPlusOne ], a[ _onesEndMinusOne ] );
_onesEndMinusOne--;
_zerosEndPlusOne++;			// grow 0's and 1's - found a perfect swap match
break;						// switch back to the 0's bin
}
} while( _zerosEndPlusOne <= _onesEndMinusOne );
}
}
// Recursively call to sort 0's portion and 1's portion using the next bit lower
bitMask >>= 1;
if ( bitMask != 0 )
{
if ( first < ( _zerosEndPlusOne - 1 ))
_binaryRadixSort_wInsertionMinSwapsSymmetric_unsignedOnly( a, first, _zerosEndPlusOne - 1, bitMask );
if (( _onesEndMinusOne + 1 ) < last )
_binaryRadixSort_wInsertionMinSwapsSymmetric_unsignedOnly( a, _onesEndMinusOne + 1,  last, bitMask );
}
}
else {
insertionSortSimilarToSTLnoSelfAssignment( &a[ first ], last - first + 1 );
}
}
inline void binaryRadixSortInPlace( unsigned char* a, unsigned long a_size )
{
if ( a_size <  2 )	return;
unsigned char bitMask = 0x80;

_binaryRadixSort_wInsertionMinSwapsSymmetric_unsignedOnly( a, 0, a_size - 1, bitMask );
}
inline void binaryRadixSortInPlace( unsigned short* a, unsigned long a_size )
{
if ( a_size <  2 )	return;
unsigned short bitMask = 0x8000;

_binaryRadixSort_wInsertionMinSwapsSymmetric_unsignedOnly( a, 0, a_size - 1, bitMask );
}
inline void binaryRadixSortInPlace( unsigned long* a, unsigned long a_size )
{
if ( a_size <  2 )	return;
unsigned long bitMask = 0x80000000;

_binaryRadixSort_wInsertionMinSwapsSymmetric_unsignedOnly( a, 0, a_size - 1, bitMask );
}
inline void binaryRadixSortInPlace( unsigned __int64* a, unsigned long a_size )
{
if ( a_size <  2 )	return;
unsigned __int64 bitMask = 0x8000000000000000ui64;

_binaryRadixSort_wInsertionMinSwapsSymmetric_unsignedOnly( a, 0, a_size - 1, bitMask );
}

endif	// _InPlaceBinaryRadixSort_h

```
Listing 4

When it finds the 0's element, it swaps knowing that both elements involved are going to their final destinations -- into their respective bins. These two elements will not be swapped again (growing 0's and 1's bin after the swap). Modified algorithm steps are outlined as follows:

This improvement eliminated redundant swaps between elements that both belong to the 1's bin, in the initial implementation. The number of steps to complete the sort has been reduced dues to the swaps becoming "perfect" (moving both elements of the swap to their respective bins), which allowed for both bins to be grown; see line 5 of example above. Also, self-assignment has been removed in both cases: when the last element is in the 0's bin or in 1's bin, that bin is grown without self-assignment (which can be seen in the implementation).

This improvement also fixed an asymmetry in the algorithm where in the 0's bin the current pointer was advanced without swapping if the element belonged in the 0's bin, but was not the case for 1's bin. With the modification, if the element belonged in the 1's bin and was already in the 1's bin location, it was left there and the pointer was advanced to keep searching for an element that did not belong in the 1's bin; for example, looking for a "perfect" swap candidate (where both elements were in the wrong bin before the swap).

Another ending condition is possible and is shown below:

This ending condition also ends well, with a single perfect swap and both bins being grown.

Performance measurements of the final version of Hybrid Binary-Radix Sort along with other algorithms are shown in Tables 5, 6, and 7. The Intel IPP library does not have an implementation of 32-bit unsigned non-Radix sorting, and for 64-bit unsigned sorting.

Table 5: Random Unsigned Elements

Table 5 provides a lot of information, with several notables:

• IPP Radix Sort is significantly faster than all other algorithms for 8-, 16-, and 32-bit unsigned -- 20X faster, 8X and about 3X faster respectively than the closest competitor.
• IPP sort performance for 8-bit seems to be the same as IPP Radix Sort, but with the benefit of being in-place.
• IPP sort for 16-bit unsigned is significantly worse in performance than the 8-bit version -- 65X.
• Hybrid Binary-Radix Sort outperforms STL sort for all data sizes, and IPP sort for 16-bit by at least 15%.
• Tables 6 and 7 show that Hybrid Binary-Radix outperforms STL sort and IPP Radix Sort by at least 30% for increasing and decreasing input data sets.
• Hybrid Binary-Radix Sort increased in performance over 40% from the initial implementation.

Table 6: Increasing 32-bit Unsigned Elements

Table 7: Decreasing 32-bit Unsigned Elements

Since the generic implementation, through overloaded functions, is data-type-aware it is possible to provide a type-dependent threshold value that controls the switch point for using Insertion Sort. Brief experiments did not indicate this to be promising, but more detailed investigation may be fruitful.

### Signed

Signed integers use 2's complement format shown below, with the binary representation on the left and its decimal value on the right:

```01111111b	127
01111110b	126
01111101b	125
….
00000001b	1
00000000b	0
11111111b	-1
11111110b	-2
11111101b	-3
…
10000011b	-125
10000010b	-126
10000001b	-127
10000000b	-128

```

Note that the most-significant-bit (MSB) serves as the sign bit. However, its meaning is opposite of what is needed for sorting. If unsigned Binary-Radix Sort algorithm is used, then negative numbers would be interpreted as larger than positive numbers, since negatives have the MSB set to a 1. However, the rest of the bits have the same meaning as unsigned numbers -- a 1 is bigger than a 0. Thus, for signed numbers only the MSB must be treaded specially, with the opposite meaning -- a 0 is larger than a 1. Listing 5 shows an implementation, which treats the sign bit (the MSB) of signed numbers in the opposite manner, followed by recursive calls to the unsigned core routine for the rest of the bits.

```// Copyright(c), Victor J. Duvanenko, 2009
// In-place Binary Radix Sort implementations.

#ifndef _InPlaceBinaryRadixSort5_h
#define _InPlaceBinaryRadixSort5_h

// Swap that does not check for self-assignment.
template< class _Type >
inline void _swap( _Type& a, _Type& b )
{
_Type tmp = a;
a         = b;
b         = tmp;
}
// A set of logical right shift functions to work-around the C++ issue of performing an arithmetic right shift
// for >>= operation on signed types.
inline char logicalRightShift( char a, unsigned long shiftAmount )
{
return (char)(((unsigned char)a ) >> shiftAmount );
}
inline short logicalRightShift( short a, unsigned long shiftAmount )
{
return (short)(((unsigned short)a ) >> shiftAmount );
}
inline long logicalRightShift( long a, unsigned long shiftAmount )
{
return (long)(((unsigned long)a ) >> shiftAmount );
}
inline __int64 logicalRightShift( __int64 a, unsigned long shiftAmount )
{
return (__int64)(((unsigned __int64)a ) >> shiftAmount );
}

template< class _Type >
inline void insertionSortSimilarToSTLnoSelfAssignment( _Type* a, unsigned long a_size )
{
for ( unsigned long i = 1; i < a_size; i++ )
{
if ( a[ i ] < a[ i - 1 ] )		// no need to do (j > 0) compare for the first iteration
{
_Type currentElement = a[ i ];
a[ i ] = a[ i - 1 ];
unsigned long j;
for ( j = i - 1; j > 0 && currentElement < a[ j - 1 ]; j-- )
{
a[ j ] = a[ j - 1 ];
}
a[ j ] = currentElement;	// always necessary work/write
}
// Perform no work at all if the first comparison fails - i.e. never assign an element to itself!
}
}

template< class _Type >
inline void _binaryRadixSort_wInsertionMinSwapsSymmetric_signedOnly( _Type* a, long first, long last, _Type bitMask )
{
if (( last - first ) > 32 )
{
// Split the provided array range into 0's and 1's bins
long _zerosEndPlusOne = first;						// index is one beyond the last  0's portion
long _onesEndMinusOne = last;						// index is one before the first 1's portion
for ( ; _zerosEndPlusOne <= _onesEndMinusOne; )
{
if ( 0 != ( bitMask & a[ _zerosEndPlusOne ] ))	// 0's bin
{
_zerosEndPlusOne++;							// grow 0's bin
}
else {											// 1's portion
do	// Locate an element that belongs in 0's portion, to eliminate unnecessary swaps
{
if ( 0 == ( bitMask & a[ _onesEndMinusOne ]))
{
_onesEndMinusOne--;					// grow 1's bin of the array
}
else
{
_swap( a[ _zerosEndPlusOne ], a[ _onesEndMinusOne ] );
_onesEndMinusOne--;
_zerosEndPlusOne++;			// grow 0's and 1's - found a perfect swap match
break;						// switch back to the 0's bin
}
} while( _zerosEndPlusOne <= _onesEndMinusOne );
}
}
// Recursively call to sort 0's portion and 1's portion using the next bit lower
bitMask = logicalRightShift( bitMask, 1 );
if ( bitMask != 0 )
{
if ( first < ( _zerosEndPlusOne - 1 ))
_binaryRadixSort_wInsertionMinSwapsSymmetric_unsignedOnly( a, first, _zerosEndPlusOne - 1, bitMask );
if (( _onesEndMinusOne + 1 ) < last )
_binaryRadixSort_wInsertionMinSwapsSymmetric_unsignedOnly( a, _onesEndMinusOne + 1,  last, bitMask );
}
}
else {
insertionSortSimilarToSTLnoSelfAssignment( &a[ first ], last - first + 1 );
}
}
inline void binaryRadixSortInPlace( char* a, unsigned long a_size )
{
if ( a_size <  2 )	return;
char bitMask = (char)0x80;

_binaryRadixSort_wInsertionMinSwapsSymmetric_signedOnly( a, 0, a_size - 1, bitMask );
}
inline void binaryRadixSortInPlace( short* a, unsigned long a_size )
{
if ( a_size <  2 )	return;
short bitMask = (short)0x8000;

_binaryRadixSort_wInsertionMinSwapsSymmetric_signedOnly( a, 0, a_size - 1, bitMask );
}
inline void binaryRadixSortInPlace( long* a, unsigned long a_size )
{
if ( a_size <  2 )	return;
long bitMask = 0x80000000;

_binaryRadixSort_wInsertionMinSwapsSymmetric_signedOnly( a, 0, a_size - 1, bitMask );
}
inline void binaryRadixSortInPlace( __int64* a, unsigned long a_size )
{
if ( a_size <  2 )	return;
__int64 bitMask = 0x8000000000000000i64;

_binaryRadixSort_wInsertionMinSwapsSymmetric_signedOnly( a, 0, a_size - 1, bitMask );
}

endif	// _InPlaceBinaryRadixSort_h

```
Listing 5

This implementation uses overloaded functions to handle all signed types from 8-bit to 64-bits. This fits nicely with the unsigned functions to provide a consistent and fairly generic implementation that handles unsigned and signed types using the same interface, as well as the same function name. Correctness of the signed implementation was tested using the same test procedure as for unsigned versions of the algorithm.

Signed implementation has another slight difference -- it uses an overloaded function logicalRightShift(). This function was necessary because C++ defines right shift operation ">>" for signed number as an arithmetic shift and not a logical shift, which is not the behavior that is needed. Interestingly, Java defines a ">>>" operator, which implements a logical shift on signed numbers [6] and ">>" performs an arithmetic shift. It's unfortunate that C++ is stuck with this behavior, most likely for backwards compatibility with C, since it would be clearer to have ">>" mean logical shift and "/2" mean arithmetic shift for both unsigned and signed integers. Capable compilers already optimize "integer divide by power of 2 constant" as an arithmetic shift.

### More Insights

 To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.