### Improvements

Listing 3 shows an improved Hybrid Radix Sort implementation. This implementation improves not only performance, but extends the algorithm to handle unsigned numbers from 8-bit to 64-bit. Also, the radix is no longer fixed at 256 (8-bits), but can be any power-of-two; i.e., any number of bits.

// Listing 3 template< class _Type, unsigned long PowerOfTwoRadix, unsigned long Log2ofPowerOfTwoRadix, long Threshold > inline void _RadixSort_Unsigned_PowerOf2Radix_1( _Type* a, long last, _Type bitMask, unsigned long shiftRightAmount ) { const unsigned long numberOfBins = PowerOfTwoRadix; unsigned long count[ numberOfBins ]; for( unsigned long i = 0; i < numberOfBins; i++ ) count[ i ] = 0; for ( long _current = 0; _current <= last; _current++ ) // Scan the array and count the number of times each value appears { unsigned long digit = (unsigned long)(( a[ _current ] & bitMask ) >> shiftRightAmount ); // extract the digit we are sorting based on count[ digit ]++; } long startOfBin[ numberOfBins ], endOfBin[ numberOfBins ], nextBin; startOfBin[ 0 ] = endOfBin[ 0 ] = nextBin = 0; for( unsigned long i = 1; i < numberOfBins; i++ ) startOfBin[ i ] = endOfBin[ i ] = startOfBin[ i - 1 ] + count[ i - 1 ]; for ( long _current = 0; _current <= last; ) { unsigned long digit; _Type tmp = a[ _current ]; // get the compiler to recognize that a register can be used for the loop instead of a[_current] memory location while ( true ) { digit = (unsigned long)(( tmp & bitMask ) >> shiftRightAmount ); // extract the digit we are sorting based on if ( endOfBin[ digit ] == _current ) break; _swap( tmp, a[ endOfBin[ digit ] ] ); endOfBin[ digit ]++; } a[ _current ] = tmp; endOfBin[ digit ]++; // leave the element at its location and grow the bin _current++; // advance the current pointer to the next element while( _current >= startOfBin[ nextBin ] && nextBin < numberOfBins ) nextBin++; while( endOfBin[ nextBin - 1 ] == startOfBin[ nextBin ] && nextBin < numberOfBins ) nextBin++; if ( _current < endOfBin[ nextBin - 1 ] ) _current = endOfBin[ nextBin - 1 ]; } bitMask >>= Log2ofPowerOfTwoRadix; if ( bitMask != 0 ) // end recursion when all the bits have been processes { if ( shiftRightAmount >= Log2ofPowerOfTwoRadix ) shiftRightAmount -= Log2ofPowerOfTwoRadix; else shiftRightAmount = 0; for( unsigned long i = 0; i < numberOfBins; i++ ) { long numberOfElements = endOfBin[ i ] - startOfBin[ i ]; if ( numberOfElements >= Threshold ) // endOfBin actually points to one beyond the bin _RadixSort_Unsigned_PowerOf2Radix_1< _Type, PowerOfTwoRadix, Log2ofPowerOfTwoRadix, Threshold >( &a[ startOfBin[ i ]], numberOfElements - 1, bitMask, shiftRightAmount ); else if ( numberOfElements >= 2 ) insertionSortSimilarToSTLnoSelfAssignment( &a[ startOfBin[ i ]], numberOfElements ); } } }

The main performance improvement is based on the realization that swapping of elements happens with the highest probability. In the case of 256-Radix, the bin where the **_current** pointer is located is only one out of 256 bins, and thus the chance of finding a number that goes into that bin are one of 256. The first **while** loop inside the main for sorting loop takes care of this condition (elements not belonging to the **_current** bin) and stays in it while this is true (probability of 255/256 = 99.6%). The use of **tmp** variable gives the compiler a chance to use a CPU register for swapping instead of the **_current** array memory location.

The threshold value that determines if the array size is small enough to use Insertion Sort was set to 32. This value was borrowed from the Hybrid Binary-Radix Sort where it was experimentally determined to be best. Table 3 shows results of experiments for Radix-Sort of arrays of 32-bit elements, to determine the best setting for the threshold.

In the top portion of the table the threshold is varied from 8 elements to 512 elements. In the bottom portion of the table the ratio of Radix-Sort to IPP Radix-Sort is computed. From these measurements it is evident that the threshold values of 32 and smaller perform inferior, as well as values of 256 and larger. Value of about 100 seems a good choice.

Table 4 and Graph 4 compare In-place Hybrid Radix-Sort (using 256-radix) with Intel's IPP Radix-Sort (not in-place) and STL sort, for 32-bit unsigned numbers.

They show the two Radix algorithms have comparable performance for large arrays, for mid-size arrays IPP outperforms, and for small arrays Hybrid Radix outperforms. These measurements also show STL sort lagging in performance for all array sizes. Table 4 also quantifies, in the bottom row, the impact of the speed-up improvements, which improved performance for all array sizes except the smallest.

Radix support has been extended to an arbitrary power-of-two-radix; i.e., an arbitrary number of bits at a time (up to 32-bits). This radix value needed to be passed to the function, as a true constant, since the **count[]** array size is allocated based on the radix. Template functions allow constants to be passed to the function, but do not support default values for these constants [2]. This fact about C++ is disappointing, but can be hidden from the user inside a library. Listing 3 shows the use of overloaded sorting functions for each data type with a core template function that is type independent. Also, note that not only the radix and data type are passed to the template function, but also the threshold value of when to switch to Insertion Sort.

Handling arbitrary power-of-two-radix (i.e., any number of bits) also includes support for the case where the number of bits of the array elements does not evenly divide by the number of bits in the radix; e.g., 32-bit elements and 11-bit-radix. In this case, three passes would be performed: 11-bit, 11-bit and 10-bit. Actually, the last pass would still be an 11-bit pass, but the mask would be only 10-bits. Thus the **count[]** array would be allocated with 2048 elements, but there would only be 1024 possible "digits", leaving half of the bins empty, and still functioning properly.

From Table 5, which compares non-Hybrid N-bit-Radix Sort variations, it seems hopeful that 16-Radix Sort (4-bit) or 64-Radix (6-bit) could outperform 256-Radix (8-bit) as a Hybrid algorithm. However, this did not prove fruitful even when varying the threshold between 8 and 512, most likely because 16-Radix required eight passes and 64-Radix required six passed, compared to only four passed for 256-Radix, for 32-bit unsigned array elements.

2048-Radix (11-bit) did not succeed, even though it used only 3 passes, most likely because of its poor initial performance (as seen in Table 5). Using the Hybrid algorithm in this case improved performance, but not enough to be superior to 256-Radix (8-bit) version. The conclusion is 256-Radix (8-bit) Hybrid algorithm is the best performing combination over the entire range of array sizes for 32-bit unsigned array elements.

Table 6 and Graph 6 show measurements of various algorithms when sorting 64-bit unsigned integers. Intel did not implement sorting for 64-bit unsigned integers.

These measurements show that Hybrid 256-Radix (8-bit) Sort outperforms STL sort and Hybrid Binary-Radix Sort from [1]. All three algorithms being compared are in-place. When comparing a pure Radix implementation with the Hybrid Radix, performance gains are measured for all array sizes, and of similar magnitude as gains for 32-bit unsigned (in Table 2).

Table 7 and Graph 7 show performance measurements of several sorting algorithms, some in-place and other not, for arrays of 16-bit unsigned integers. Intel IPP Radix Sort outperforms all others in all but the smallest array size (where Insertion Sort is used by some). The best performing threshold value for the Hybrid 256-Radix Sort was found to be 64, which is used in Table 7. Hybrid 256-Radix Sort lags IPP Radix Sort in performance by nearly 2X, and provides a trade-off between memory usage (being in-place and using half the memory). Intel Sort, STL sort and Hybrid-Binary Sort from [1], all performed nearly equally, lagging in performance.

Other potential radix values were tested, such as 2-bit, 4-bit, and 6-bit, with various threshold settings, but none were faster than 256-Radix (8-bit). Table 7 and Graph 7 also compare a pure 256-Radix implementation to a Hybrid 256-Radix, where at 100K size array and larger the hybrid version offers no performance improvement. The reason is that at 10K elements, 256-Radix divides the array into 256 bins in the first level of recursion, each array being about 39 elements. This array size is small enough to trigger the use of Insertion Sort, and results in faster performance for the Hybrid 256-Radix Sort for array sizes of 10K elements and smaller. However, at 100K elements division into 256 bins results in each bin holding 390 elements, which is above the threshold, resulting in not using Insertion Sort in the second/last recursion level, leading to no performance gain for the Hybrid implementation. In other words, at 10K elements and below the algorithm is a Hybrid 256-Radix Sort (leveraging Insertion Sort), but at 100K and above the algorithm is a pure 256-Radix Sort (not leveraging Insertion Sort). This behavior was verified by instrumenting the algorithm to check whether Insertion Sort was called for arrays larger than 100K elements in size, and it was not (but was called for smaller array sizes).

Table 8 and Graph 8 show measurements of variety of algorithms.

Intel IPP Radix Sort and Counting Sort substantially outperform all others except for the smallest array size, and track in performance closely, with Counting Sort slightly faster for larger array sizes. Counting Sort will be discussed in more detail in a later section. STL and Hybrid Binary-Radix Sort from Algorithm Improvement through Performance Measurement: Part 2 are close in performance and lag all others. Hybrid 256-Radix Sort and Pure 256-Radix Sort track in performance, because for 8-bit elements only the first level of recursion is executed and Insertion Sort is never called. Trial of Hybrid 16-Radix Sort (4-bit) resulted in lower performance. Hybrid 256-Radix and Pure 256-Radix Sort outperform STL sort and Binary-Radix Sort, but lag IPP Radix Sort and Counting Sort.

An optimization that did not yield performance improvement is to combine **count[]** array and **startOfBin[]** array into a single array. This optimization reduced stack memory usage of the algorithm, but decreased performance by a few percent due to more complex computation of **startOfBin[]** values.