# Algorithm Improvement through Performance Measurement: Part 3

### Improvements

Listing 3 shows an improved Hybrid Radix Sort implementation. This implementation improves not only performance, but extends the algorithm to handle unsigned numbers from 8-bit to 64-bit. Also, the radix is no longer fixed at 256 (8-bits), but can be any power-of-two; i.e., any number of bits.

```
// Listing 3
template< class _Type, unsigned long PowerOfTwoRadix, unsigned long Log2ofPowerOfTwoRadix, long Threshold >
inline void _RadixSort_Unsigned_PowerOf2Radix_1( _Type* a, long last, _Type bitMask, unsigned long shiftRightAmount )
{
const unsigned long numberOfBins = PowerOfTwoRadix;
unsigned long count[ numberOfBins ];
for( unsigned long i = 0; i < numberOfBins; i++ )
count[ i ] = 0;
for ( long _current = 0; _current <= last; _current++ ) // Scan the array and count the number of times each value appears
{
unsigned long digit = (unsigned long)(( a[ _current ] & bitMask ) >> shiftRightAmount ); // extract the digit we are sorting based on
count[ digit ]++;
}
long startOfBin[ numberOfBins ], endOfBin[ numberOfBins ], nextBin;
startOfBin[ 0 ] = endOfBin[ 0 ] = nextBin = 0;
for( unsigned long i = 1; i < numberOfBins; i++ )
startOfBin[ i ] = endOfBin[ i ] = startOfBin[ i - 1 ] + count[ i - 1 ];
for ( long _current = 0; _current <= last; )
{
unsigned long digit;
_Type tmp = a[ _current ];  // get the compiler to recognize that a register can be used for the loop instead of a[_current] memory location
while ( true ) {
digit = (unsigned long)(( tmp & bitMask ) >> shiftRightAmount );   // extract the digit we are sorting based on
if ( endOfBin[ digit ] == _current )
break;
_swap( tmp, a[ endOfBin[ digit ] ] );
endOfBin[ digit ]++;
}
a[ _current ] = tmp;
endOfBin[ digit ]++;   // leave the element at its location and grow the bin
_current++;  // advance the current pointer to the next element
while( _current >= startOfBin[ nextBin ] && nextBin < numberOfBins )
nextBin++;
while( endOfBin[ nextBin - 1 ] == startOfBin[ nextBin ] && nextBin < numberOfBins )
nextBin++;
if ( _current < endOfBin[ nextBin - 1 ] )
_current = endOfBin[ nextBin - 1 ];
}
bitMask >>= Log2ofPowerOfTwoRadix;
if ( bitMask != 0 )   // end recursion when all the bits have been processes
{
if ( shiftRightAmount >= Log2ofPowerOfTwoRadix ) shiftRightAmount -= Log2ofPowerOfTwoRadix;
else shiftRightAmount  = 0;
for( unsigned long i = 0; i < numberOfBins; i++ )
{
long numberOfElements = endOfBin[ i ] - startOfBin[ i ];
if ( numberOfElements >= Threshold )  // endOfBin actually points to one beyond the bin
_RadixSort_Unsigned_PowerOf2Radix_1< _Type, PowerOfTwoRadix, Log2ofPowerOfTwoRadix, Threshold >( &a[ startOfBin[ i ]], numberOfElements - 1, bitMask, shiftRightAmount );
else if ( numberOfElements >= 2 )
insertionSortSimilarToSTLnoSelfAssignment( &a[ startOfBin[ i ]], numberOfElements );
}
}
}

```
Listing 3

The main performance improvement is based on the realization that swapping of elements happens with the highest probability. In the case of 256-Radix, the bin where the _current pointer is located is only one out of 256 bins, and thus the chance of finding a number that goes into that bin are one of 256. The first while loop inside the main for sorting loop takes care of this condition (elements not belonging to the _current bin) and stays in it while this is true (probability of 255/256 = 99.6%). The use of tmp variable gives the compiler a chance to use a CPU register for swapping instead of the _current array memory location.

The threshold value that determines if the array size is small enough to use Insertion Sort was set to 32. This value was borrowed from the Hybrid Binary-Radix Sort where it was experimentally determined to be best. Table 3 shows results of experiments for Radix-Sort of arrays of 32-bit elements, to determine the best setting for the threshold.

Table 3

In the top portion of the table the threshold is varied from 8 elements to 512 elements. In the bottom portion of the table the ratio of Radix-Sort to IPP Radix-Sort is computed. From these measurements it is evident that the threshold values of 32 and smaller perform inferior, as well as values of 256 and larger. Value of about 100 seems a good choice.

Table 4 and Graph 4 compare In-place Hybrid Radix-Sort (using 256-radix) with Intel's IPP Radix-Sort (not in-place) and STL sort, for 32-bit unsigned numbers.

Table 4: Random 32-bit Unsigned Elements

Graph 4

They show the two Radix algorithms have comparable performance for large arrays, for mid-size arrays IPP outperforms, and for small arrays Hybrid Radix outperforms. These measurements also show STL sort lagging in performance for all array sizes. Table 4 also quantifies, in the bottom row, the impact of the speed-up improvements, which improved performance for all array sizes except the smallest.

Radix support has been extended to an arbitrary power-of-two-radix; i.e., an arbitrary number of bits at a time (up to 32-bits). This radix value needed to be passed to the function, as a true constant, since the count[] array size is allocated based on the radix. Template functions allow constants to be passed to the function, but do not support default values for these constants [2]. This fact about C++ is disappointing, but can be hidden from the user inside a library. Listing 3 shows the use of overloaded sorting functions for each data type with a core template function that is type independent. Also, note that not only the radix and data type are passed to the template function, but also the threshold value of when to switch to Insertion Sort.

Handling arbitrary power-of-two-radix (i.e., any number of bits) also includes support for the case where the number of bits of the array elements does not evenly divide by the number of bits in the radix; e.g., 32-bit elements and 11-bit-radix. In this case, three passes would be performed: 11-bit, 11-bit and 10-bit. Actually, the last pass would still be an 11-bit pass, but the mask would be only 10-bits. Thus the count[] array would be allocated with 2048 elements, but there would only be 1024 possible "digits", leaving half of the bins empty, and still functioning properly.

From Table 5, which compares non-Hybrid N-bit-Radix Sort variations, it seems hopeful that 16-Radix Sort (4-bit) or 64-Radix (6-bit) could outperform 256-Radix (8-bit) as a Hybrid algorithm. However, this did not prove fruitful even when varying the threshold between 8 and 512, most likely because 16-Radix required eight passes and 64-Radix required six passed, compared to only four passed for 256-Radix, for 32-bit unsigned array elements.

Table 5: Non-Hybrid Algorithm, Random 32-bit Unsigned Elements

2048-Radix (11-bit) did not succeed, even though it used only 3 passes, most likely because of its poor initial performance (as seen in Table 5). Using the Hybrid algorithm in this case improved performance, but not enough to be superior to 256-Radix (8-bit) version. The conclusion is 256-Radix (8-bit) Hybrid algorithm is the best performing combination over the entire range of array sizes for 32-bit unsigned array elements.

Table 6 and Graph 6 show measurements of various algorithms when sorting 64-bit unsigned integers. Intel did not implement sorting for 64-bit unsigned integers.

Table 6: Random 64-bit Unsigned Elements

Graph 6 Editor's Note: In an effort to keep the Tables and Graphs in sync and more easily follow the article, we've intentionally jumped from Graph 4 to Graph 6.

These measurements show that Hybrid 256-Radix (8-bit) Sort outperforms STL sort and Hybrid Binary-Radix Sort from [1]. All three algorithms being compared are in-place. When comparing a pure Radix implementation with the Hybrid Radix, performance gains are measured for all array sizes, and of similar magnitude as gains for 32-bit unsigned (in Table 2).

Table 7 and Graph 7 show performance measurements of several sorting algorithms, some in-place and other not, for arrays of 16-bit unsigned integers. Intel IPP Radix Sort outperforms all others in all but the smallest array size (where Insertion Sort is used by some). The best performing threshold value for the Hybrid 256-Radix Sort was found to be 64, which is used in Table 7. Hybrid 256-Radix Sort lags IPP Radix Sort in performance by nearly 2X, and provides a trade-off between memory usage (being in-place and using half the memory). Intel Sort, STL sort and Hybrid-Binary Sort from [1], all performed nearly equally, lagging in performance.

Table 7: Random 16-bit Unsigned Elements

Graph 7

Other potential radix values were tested, such as 2-bit, 4-bit, and 6-bit, with various threshold settings, but none were faster than 256-Radix (8-bit). Table 7 and Graph 7 also compare a pure 256-Radix implementation to a Hybrid 256-Radix, where at 100K size array and larger the hybrid version offers no performance improvement. The reason is that at 10K elements, 256-Radix divides the array into 256 bins in the first level of recursion, each array being about 39 elements. This array size is small enough to trigger the use of Insertion Sort, and results in faster performance for the Hybrid 256-Radix Sort for array sizes of 10K elements and smaller. However, at 100K elements division into 256 bins results in each bin holding 390 elements, which is above the threshold, resulting in not using Insertion Sort in the second/last recursion level, leading to no performance gain for the Hybrid implementation. In other words, at 10K elements and below the algorithm is a Hybrid 256-Radix Sort (leveraging Insertion Sort), but at 100K and above the algorithm is a pure 256-Radix Sort (not leveraging Insertion Sort). This behavior was verified by instrumenting the algorithm to check whether Insertion Sort was called for arrays larger than 100K elements in size, and it was not (but was called for smaller array sizes).

Table 8 and Graph 8 show measurements of variety of algorithms.

Table 8: Random 8-bit Unsigned Elements

Graph 8

Intel IPP Radix Sort and Counting Sort substantially outperform all others except for the smallest array size, and track in performance closely, with Counting Sort slightly faster for larger array sizes. Counting Sort will be discussed in more detail in a later section. STL and Hybrid Binary-Radix Sort from Algorithm Improvement through Performance Measurement: Part 2 are close in performance and lag all others. Hybrid 256-Radix Sort and Pure 256-Radix Sort track in performance, because for 8-bit elements only the first level of recursion is executed and Insertion Sort is never called. Trial of Hybrid 16-Radix Sort (4-bit) resulted in lower performance. Hybrid 256-Radix and Pure 256-Radix Sort outperform STL sort and Binary-Radix Sort, but lag IPP Radix Sort and Counting Sort.

An optimization that did not yield performance improvement is to combine count[] array and startOfBin[] array into a single array. This optimization reduced stack memory usage of the algorithm, but decreased performance by a few percent due to more complex computation of startOfBin[] values.

### More Insights

 To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.

# First C Compiler Now on Github

The earliest known C compiler by the legendary Dennis Ritchie has been published on the repository.

Quick Read

# HTML5 Mobile Development: Seven Good Ideas (and Three Bad Ones)

HTML5 Mobile Development: Seven Good Ideas (and Three Bad Ones)

Quick Read

# Building Bare Metal ARM Systems with GNU

All you need to know to get up and running... and programming on ARM

Quick Read

# Amazon's Vogels Challenges IT: Rethink App Dev

Amazon Web Services CTO says promised land of cloud computing requires a new generation of applications that follow different principles.

Quick Read

# How to Select a PaaS Partner

Eventually, the vast majority of Web applications will run on a platform-as-a-service, or PaaS, vendor's infrastructure. To help sort out the options, we sent out a matrix with more than 70 decision points to a variety of PaaS providers.

Quick Read

More "Best of the Web" >>