# Algorithm Improvement through Performance Measurement: Part 4

### Stability

The STL stable_sort() function uses a buffered Merge Sort algorithm when it can allocate sufficient memory for a temporary buffer, and falls back to Insertion Sort when it cannot [2]. Thus, STL stable_sort() is not in-place when it uses the buffered Merge Sort. The stability of Intel's IPP Radix Sort is unknown. It may not be possible to determine the stability of an algorithm that sorts only numbers, since stability is not observable when sorting arrays of plain numbers [1]. For instance, when the resulting sorted array has two adjacent elements of the same value (e.g. a[8] = 5 and a[9] = 5), there is no way to tell which of the 5's originally preceded the other.
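Stability only becomes observable once elements carry a secondary field, such as their original index. The following sketch (the function name and the key/index pairing are hypothetical, introduced here for illustration) tags each key with its position, sorts by key only, and checks whether equal keys kept their original order:

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Pair each key with its original index, sort by key only, then check
// that equal keys still appear in index order. Numbers alone cannot
// reveal stability -- the index tag is what makes a violation observable.
bool isStableSortOnKeys( std::vector< std::pair<unsigned, unsigned> > v )
{
    std::stable_sort( v.begin(), v.end(),
        []( const std::pair<unsigned, unsigned>& x,
            const std::pair<unsigned, unsigned>& y ) { return x.first < y.first; } );
    for ( size_t i = 1; i < v.size(); i++ )
        if ( v[ i - 1 ].first == v[ i ].first && v[ i - 1 ].second > v[ i ].second )
            return false;   // equal keys swapped: not stable
    return true;
}
```

The same tagging trick could, in principle, be applied to a black-box routine such as IPP Radix Sort, but only for element types it accepts.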

The algorithm in Listing 1 is stable. It processes the input array in order, from the smallest index to the largest. At each recursion level, the algorithm reads the elements of the source array in order and places them in order into the destination array. The copy at the end is also stable, and Insertion Sort, used at the last level of recursion, is known to be stable as well. Thus, the entire process is stable.
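The order-preserving scatter can be seen in isolation. Below is a sketch (the function name is hypothetical; a single 8-bit digit is used) that distributes (digit, tag) pairs into bins in input order, so the effect on equal digits is observable:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// One stable scatter pass in isolation, on (digit, tag) pairs: elements
// are read in input order and each is appended at its bin's next free
// slot, so pairs with equal digits keep their relative order -- the core
// of the stability argument above.
std::vector< std::pair<uint8_t, int> >
stableScatterByDigit( const std::vector< std::pair<uint8_t, int> >& a )
{
    unsigned long count[ 256 ] = { 0 };
    for ( const auto& e : a ) count[ e.first ]++;        // histogram of digit values
    unsigned long start[ 256 ];
    start[ 0 ] = 0;
    for ( int i = 1; i < 256; i++ )
        start[ i ] = start[ i - 1 ] + count[ i - 1 ];    // bin start offsets
    std::vector< std::pair<uint8_t, int> > b( a.size() );
    for ( const auto& e : a )                            // scatter in input order
        b[ start[ e.first ]++ ] = e;
    return b;
}
```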

### Setup and Random Input

The same setup and test procedure were used as described in [3]. Visual Studio 2008 project settings were set to optimize for speed and to inline any suitable function. However, a different random number generator was used for all performance measurements: the ran2() function from [6] was used to generate 16 bits of random data at a time for 16-bit, 32-bit and 64-bit data types, and 8 bits at a time for 8-bit data types. The STL unique() function, along with visual inspection, provided rudimentary verification that a high percentage of the generated numbers were unique and showed no obvious pattern. Only insignificant performance differences were observed between ran2() and the Windows rand() function used as described in [1] and [3].
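Composing wider random values from 16-bit draws can be sketched as follows. Since ran2() from [6] is not reproduced here, std::mt19937 stands in as the 16-bit source; the function names are hypothetical:

```cpp
#include <cstdint>
#include <random>

// Stand-in for ran2() from [6]: any source of uniform 16-bit values works.
// std::mt19937 is used here purely for illustration.
static std::mt19937 gen( 42 );

inline uint16_t random16() { return (uint16_t)( gen() & 0xFFFF ); }

// Compose wider random values from 16-bit draws, as the test setup does
// for the 32-bit and 64-bit data types.
inline uint32_t random32()
{
    uint32_t hi = random16();
    uint32_t lo = random16();
    return ( hi << 16 ) | lo;
}
inline uint64_t random64()
{
    uint64_t hi = random32();
    uint64_t lo = random32();
    return ( hi << 32 ) | lo;
}
```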

### Hybrid

In-place Binary-Radix Sort in [3], in-place N-bit-Radix Sort in [1], and STL sort() use a hybrid sorting algorithm approach. They use several different sorting algorithms under various circumstances, resulting in superior performance versus a single, pure-algorithm approach. Listing 2 shows a hybrid stable MSD-based N-bit-Radix Sort implementation, which switches to Insertion Sort when bins get small enough. Note that Insertion Sort is only used as the last sorting operation -- as a leaf of the recursion tree.

```cpp
// Copyright(c), Victor J. Duvanenko, 2009
// Listing 2
template< class _Type >
inline unsigned long extractDigit( _Type a, _Type bitMask, unsigned long shiftRightAmount )
{
	// extract the digit we are sorting based on
	unsigned long digit = (unsigned long)(( a & bitMask ) >> shiftRightAmount );
	return digit;
}
template< class _Type >
inline void _RadixSort_StableHybridUnsigned_Radix256( _Type* a, _Type* b, long last, _Type bitMask,
                                                      unsigned long shiftRightAmount, bool inputArrayIsDestination )
{
	const unsigned long numberOfBins = 256;		// 256-radix (Bins 0 - 255)
	unsigned long count[ numberOfBins ];

	for( unsigned long i = 0; i < numberOfBins; i++ )
		count[ i ] = 0;
	// Scan the array and count the number of times each digit value appears
	for( long _current = 0; _current <= last; _current++ )
		count[ extractDigit( a[ _current ], bitMask, shiftRightAmount ) ]++;

	long startOfBin[ numberOfBins ], endOfBin[ numberOfBins ];
	startOfBin[ 0 ] = endOfBin[ 0 ] = 0;
	for( unsigned long i = 1; i < numberOfBins; i++ )
		startOfBin[ i ] = endOfBin[ i ] = startOfBin[ i - 1 ] + count[ i - 1 ];
	// Move elements from the input array into the bins in the destination array, based on digit value
	for( long _current = 0; _current <= last; _current++ )
		b[ endOfBin[ extractDigit( a[ _current ], bitMask, shiftRightAmount ) ]++ ] = a[ _current ];

	bitMask          >>= 8;
	shiftRightAmount  -= 8;
	if ( bitMask != 0 )		// end recursion when all of the bits have been processed
	{
		inputArrayIsDestination = !inputArrayIsDestination;
		for( unsigned long i = 0; i < numberOfBins; i++ )
		{
			long numOfElements = endOfBin[ i ] - startOfBin[ i ];
			if ( numOfElements >= 100 )
				_RadixSort_StableHybridUnsigned_Radix256( &b[ startOfBin[ i ]], &a[ startOfBin[ i ]],
					numOfElements - 1, bitMask, shiftRightAmount, inputArrayIsDestination );
			else {
				insertionSortSimilarToSTLnoSelfAssignment( &b[ startOfBin[ i ]], numOfElements );
				if ( inputArrayIsDestination )
					for( long j = startOfBin[ i ]; j < endOfBin[ i ]; j++ )
						a[ j ] = b[ j ];	// copy from the external array back into the input array
			}
		}
	}
	else {	// Done with recursion: copy all of the bins
		if ( !inputArrayIsDestination )
			for( long _current = 0; _current <= last; _current++ )
				a[ _current ] = b[ _current ];	// copy from the external array back into the input array
	}
}
inline void RadixSort_StableHybridUnsigned_Radix256( unsigned long* a, unsigned long* b, unsigned long a_size )
{
	if ( a_size < 2 )	return;

	unsigned long bitMask = 0xFF000000;		// bitMask controls how many bits we process at a time
	unsigned long shiftRightAmount = 24;

	if ( a_size >= 100 )
		_RadixSort_StableHybridUnsigned_Radix256( a, b, a_size - 1, bitMask, shiftRightAmount, false );
	else
		insertionSortSimilarToSTLnoSelfAssignment( a, a_size );
}
```

Listing 2

Insertion Sort, however, is in-place and operates directly on the array it is given. If that array is not the user's input array, a copy back is necessary after Insertion Sort completes, so that the result ends up in the desired array.
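Listing 2 calls insertionSortSimilarToSTLnoSelfAssignment(), whose implementation appears in [1]. The following is a sketch of what such a routine might look like: it shifts larger elements right and writes the saved value once, avoiding the self-assignments a naive swap loop performs.

```cpp
// Sketch of the Insertion Sort helper referenced by Listing 2 (the actual
// routine appears in [1]). Elements strictly greater than the current one
// are shifted right, then the saved value is written once into its slot.
// Only strictly greater elements move past the current element, so equal
// elements keep their relative order -- the routine is stable.
template< class _Type >
inline void insertionSortSimilarToSTLnoSelfAssignment( _Type* a, unsigned long a_size )
{
	for ( unsigned long i = 1; i < a_size; i++ )
	{
		if ( a[ i ] < a[ i - 1 ] )		// no work if already in order
		{
			_Type currentElement = a[ i ];
			unsigned long j = i;
			while ( j > 0 && currentElement < a[ j - 1 ] )
			{
				a[ j ] = a[ j - 1 ];	// shift larger element right
				j--;
			}
			a[ j ] = currentElement;	// single write of the saved value
		}
	}
}
```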

Table 2 and Graph 2 compare performance of the initial pure Stable N-bit-Radix Sort implementation versus the hybrid implementation for 8, 16, 32, and 64-bit unsigned integer data types.

Table 2: Random unsigned 32-bit elements

Graph 2: Random unsigned 32-bit elements

These performance measurements show a significant benefit from using a hybrid algorithm (Radix Sort plus Insertion Sort) over a pure Radix Sort. The performance gain is in the range of 2.5-18X for 32-bit elements, and 4-27X for 64-bit elements. No improvement is seen for the 8-bit data type, since an 8-bit Radix was used and Insertion Sort is never executed. For the 16-bit data type, array sizes of 100K and larger are not accelerated, because the bins after the first pass are too large for Insertion Sort to be invoked (100K array / 256 bins = 390 elements per bin on average). These results correlate well with similar measurements in [1] for the In-Place N-bit-Radix Sort implementation, with similar magnitudes of acceleration for each data type.

Threshold values, which determine the switch point to Insertion Sort, were benchmarked between 8 and 512, as was done in [1]. Values of 32 and below, as well as 512, performed worse, with a value of about 100 being a good choice. Table 3 and Graph 3 show the performance measurements of this implementation alongside other algorithms, for 32-bit unsigned integers.

Table 3: Random unsigned 32-bit elements

Graph 3: Random unsigned 32-bit elements

Several trends are visible in these measurement results:

• Adjusting the threshold from 32 to 100 improved performance for only a single array size, 10K.
• Stable Hybrid N-bit-Radix Sort varies between 0.6X and 1.3X of the performance of Intel's IPP Radix Sort, with the worst performance at 10K elements and the best at the 100M element array.
• Stable Hybrid N-bit-Radix Sort has the same performance as In-place Hybrid N-bit-Radix Sort averaged across array sizes, but outperforms it for larger arrays.
• Stable Hybrid N-bit-Radix Sort consistently and significantly outperforms STL stable_sort() -- up to about 5X faster -- especially for larger array sizes.

The following table shows the relationship between the array size, the number of digits used for sorting, and the average size of the resulting bins.

For example, for 10K elements and a Radix of 256 (8 bits), the first pass over the array splits the 10K elements into 256 bins, each containing 10K/256 = 39 elements on average. With the threshold set at 32, Insertion Sort would not be called, and Radix Sort would be used on the second pass instead. This is why the 10K array size accelerated by 2.2X when the threshold was raised from 32 to 100: Insertion Sort was then called after the first Radix Sort pass, instead of Radix Sort recursing on the second digit.
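The arithmetic above can be captured in a small sketch (the function name is hypothetical): each 8-bit Radix pass divides the average bin size by 256, and a bin below the threshold (100 in Listing 2) is handed to Insertion Sort rather than another Radix pass.

```cpp
// Number of 8-bit Radix passes taken before the average bin size falls
// below the Insertion Sort threshold. Each pass splits a bin into 256
// sub-bins, so the average bin size shrinks by a factor of 256 per pass.
int passesBeforeInsertionSort( unsigned long arraySize, unsigned long threshold )
{
    int passes = 0;
    unsigned long avgBinSize = arraySize;
    while ( avgBinSize >= threshold )
    {
        avgBinSize /= 256;      // each pass splits into 256 bins
        passes++;
    }
    return passes;
}
```

With a 10K array, a threshold of 100 stops after one Radix pass (average bin of 39, below 100), while a threshold of 32 forces a second Radix pass on those 39-element bins, matching the behavior described above.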

From this table it becomes evident why the 1K and 100K array sizes accelerate well with the hybrid approach (as shown in Table 2, with an 18X speed-up for these array sizes). For these sizes, the last bin size is very small (4 and 2 elements on average, respectively), and for such tiny bins Radix Sort is much less efficient than Insertion Sort. It is surprising that the 100M array size does not accelerate to the same degree, since its last bin size is also very small -- 6 elements on average. However, while a 100K array fits into the CPU cache, a 100M array does not, which may account for the difference in performance increase.
