### Hybrid

It is simple to convert the initial 256-Radix Sort implementation to a hybrid implementation by letting Insertion Sort handle the recursion for small numbers of elements. This strategy was used successfully by STL sort and by Hybrid Binary-Radix Sort (see Algorithm Improvement through Performance Measurement: Part 2), where it improved the performance of Hybrid Binary-Radix Sort by about 30%. However, 256-Radix Sort has substantially larger overhead than Hybrid Binary-Radix Sort or Insertion Sort, which shows up in the measurements in Table 1 and Graph 1: Insertion Sort is significantly faster for input data sets under 1K elements. Table 2 and Graph 2 compare the non-Hybrid and Hybrid implementations of 256-Radix Sort.

Listing 2 shows the Hybrid 256-Radix Sort implementation. Note that the same threshold value of 32 that was used in [1] for Hybrid Binary-Radix Sort decides whether to call Insertion Sort. Also, the decision is made immediately before calling Hybrid 256-Radix Sort, so that when the array size is smaller than the threshold the overhead of entering Hybrid 256-Radix Sort even once is avoided.

```cpp
// Listing 2
template< class _Type >
inline void insertionSortSimilarToSTLnoSelfAssignment( _Type* a, unsigned long a_size )
{
    for ( unsigned long i = 1; i < a_size; i++ )
    {
        if ( a[ i ] < a[ i - 1 ] )      // no need to do (j > 0) compare for the first iteration
        {
            _Type currentElement = a[ i ];
            a[ i ] = a[ i - 1 ];
            unsigned long j;
            for ( j = i - 1; j > 0 && currentElement < a[ j - 1 ]; j-- )
            {
                a[ j ] = a[ j - 1 ];
            }
            a[ j ] = currentElement;    // always necessary work/write
        }
        // Perform no work at all if the first comparison fails - i.e. never assign an element to itself!
    }
}

template< class _Type >
inline void _RadixSort_HybridUnsigned_Radix256( _Type* a, long last, _Type bitMask, unsigned long shiftRightAmount )
{
    const unsigned long numberOfBins = 256;     // 256-radix (Bins 0 - 255)
    unsigned long count[ numberOfBins ];
    for( unsigned long i = 0; i < numberOfBins; i++ )
        count[ i ] = 0;
    for ( long _current = 0; _current <= last; _current++ )     // Scan the array and count the number of times each value appears
    {
        _Type digit = ( a[ _current ] & bitMask ) >> shiftRightAmount;  // extract the digit we are sorting based on
        count[ digit ]++;
    }

    long startOfBin[ numberOfBins ], endOfBin[ numberOfBins ], nextBin;
    startOfBin[ 0 ] = endOfBin[ 0 ] = nextBin = 0;
    for( unsigned long i = 1; i < numberOfBins; i++ )
        startOfBin[ i ] = endOfBin[ i ] = startOfBin[ i - 1 ] + count[ i - 1 ];

    for ( long _current = 0; _current <= last; )    // Move elements from the original array into the bins based on digit value
    {
        _Type digit = ( a[ _current ] & bitMask ) >> shiftRightAmount;  // extract the digit we are sorting based on
        if ( endOfBin[ digit ] != _current )
        {
            _swap( a[ _current ], a[ endOfBin[ digit ]] );
            endOfBin[ digit ]++;
        }
        else
        {
            endOfBin[ digit ]++;    // leave the element at its location and grow the bin
            _current++;             // advance the current pointer to the next element
            while( nextBin < numberOfBins && _current >= startOfBin[ nextBin ] )
                nextBin++;
            while( nextBin < numberOfBins && endOfBin[ nextBin - 1 ] == startOfBin[ nextBin ] )
                nextBin++;
            if ( _current < endOfBin[ nextBin - 1 ] )
                _current = endOfBin[ nextBin - 1 ];
        }
    }
    bitMask        >>= 8;
    shiftRightAmount -= 8;
    if ( bitMask != 0 )     // end recursion when all the bits have been processed
    {
        for( unsigned long i = 0; i < numberOfBins; i++ )
        {
            if (( endOfBin[ i ] - startOfBin[ i ] ) >= 32 )
                _RadixSort_HybridUnsigned_Radix256( &a[ startOfBin[ i ]], endOfBin[ i ] - 1 - startOfBin[ i ], bitMask, shiftRightAmount );
            else if (( endOfBin[ i ] - startOfBin[ i ] ) >= 2 )
                insertionSortSimilarToSTLnoSelfAssignment( &a[ startOfBin[ i ]], endOfBin[ i ] - startOfBin[ i ]);
        }
    }
}

inline void RadixSortInPlace_HybridUnsigned_Radix256( unsigned long* a, unsigned long a_size )
{
    if ( a_size < 2 ) return;
    unsigned long bitMask          = 0xFF000000;    // bitMask controls how many bits we process at a time
    unsigned long shiftRightAmount = 24;
    if ( a_size >= 32 )
        _RadixSort_HybridUnsigned_Radix256( a, a_size - 1, bitMask, shiftRightAmount );
    else
        insertionSortSimilarToSTLnoSelfAssignment( a, a_size );
}
```

Inside the Hybrid 256-Radix Sort template function (which recurses), the recursion threshold check is done at the end of the function rather than at the beginning, which is faster by several percent. If the check were done at the top of the template function instead, 256-Radix Sort would be entered only to decide that Insertion Sort should have been called, adding an unnecessary function-call overhead before every call to Insertion Sort.
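The structure can be seen in a condensed, self-contained sketch. This is not Listing 2's in-place algorithm: it simplifies by permuting through a temporary buffer, and `CUTOFF`, `insertionSort`, and `hybridRadix` are illustrative names. What it preserves is the hybrid shape under discussion: the small-subarray check is made per bin at the point of recursion, so small bins go straight to Insertion Sort without ever entering the radix function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative sketch only (not Listing 2's in-place version): a simplified
// hybrid MSD radix sort on 32-bit unsigned values, one byte per pass,
// permuting through a temporary buffer. The point is where the cutoff test
// lives: at the bottom of the function, before each recursive call.
static const std::size_t CUTOFF = 32;   // same threshold as Listing 2

static void insertionSort( unsigned* a, std::size_t n )
{
    for ( std::size_t i = 1; i < n; i++ )
    {
        unsigned v = a[ i ];
        std::size_t j = i;
        for ( ; j > 0 && v < a[ j - 1 ]; j-- )
            a[ j ] = a[ j - 1 ];
        a[ j ] = v;
    }
}

static void hybridRadix( unsigned* a, std::size_t n, int shift )
{
    std::size_t count[ 256 ] = { 0 };       // histogram of the current byte
    for ( std::size_t i = 0; i < n; i++ )
        count[ ( a[ i ] >> shift ) & 0xFF ]++;

    std::size_t start[ 256 ];               // starting offset of each bin
    start[ 0 ] = 0;
    for ( int b = 1; b < 256; b++ )
        start[ b ] = start[ b - 1 ] + count[ b - 1 ];

    std::vector<unsigned> tmp( n );         // out-of-place permutation pass
    {
        std::size_t next[ 256 ];
        for ( int b = 0; b < 256; b++ ) next[ b ] = start[ b ];
        for ( std::size_t i = 0; i < n; i++ )
            tmp[ next[ ( a[ i ] >> shift ) & 0xFF ]++ ] = a[ i ];
    }
    for ( std::size_t i = 0; i < n; i++ ) a[ i ] = tmp[ i ];

    if ( shift == 0 ) return;               // all bytes processed
    for ( int b = 0; b < 256; b++ )         // decide per bin, before the call
    {
        std::size_t len = count[ b ];
        if ( len >= CUTOFF )
            hybridRadix( &a[ start[ b ] ], len, shift - 8 );
        else if ( len >= 2 )
            insertionSort( &a[ start[ b ] ], len );
    }
}
```

Moving the `len >= CUTOFF` test to the top of `hybridRadix` would produce the same result, but every small bin would then pay for a full call into `hybridRadix` only to immediately delegate to `insertionSort`.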

These measurements show that a Hybrid 256-Radix Sort using Insertion Sort substantially improves the performance of the algorithm, because Insertion Sort has much lower overhead for small input data sets (see Table 1 and Graph 1).
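The small-array overhead gap can be checked with a minimal micro-benchmark sketch. This is not the article's measurement harness, and `insertionSortSmall` and `compareSmallSortTimes` are illustrative names; absolute numbers depend on compiler and CPU, but the relative behavior at small sizes is what motivates the hybrid cutoff.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

// Illustrative micro-benchmark sketch, not the article's measurement code:
// times a simple insertion sort against std::sort over many small
// sub-arrays of length n, printing both durations.
static void insertionSortSmall( unsigned* a, std::size_t n )
{
    for ( std::size_t i = 1; i < n; i++ )
    {
        unsigned v = a[ i ];
        std::size_t j = i;
        for ( ; j > 0 && v < a[ j - 1 ]; j-- )
            a[ j ] = a[ j - 1 ];
        a[ j ] = v;
    }
}

// Returns true when both sorts produced the same (sorted) result.
static bool compareSmallSortTimes( std::size_t n, std::size_t runs )
{
    std::vector<unsigned> data( n * runs );
    unsigned seed = 12345;                  // deterministic pseudo-random fill
    for ( auto& x : data ) { seed = seed * 1664525u + 1013904223u; x = seed; }
    std::vector<unsigned> byInsertion = data, bySTL = data;

    auto t0 = std::chrono::steady_clock::now();
    for ( std::size_t r = 0; r < runs; r++ )
        insertionSortSmall( &byInsertion[ r * n ], n );
    auto t1 = std::chrono::steady_clock::now();
    for ( std::size_t r = 0; r < runs; r++ )
        std::sort( bySTL.begin() + r * n, bySTL.begin() + ( r + 1 ) * n );
    auto t2 = std::chrono::steady_clock::now();

    std::printf( "n=%zu: insertion %lld us, std::sort %lld us\n", n,
        (long long)std::chrono::duration_cast<std::chrono::microseconds>( t1 - t0 ).count(),
        (long long)std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count() );
    return byInsertion == bySTL;
}
```

Calling `compareSmallSortTimes( 24, 100000 )` and then, say, `compareSmallSortTimes( 4096, 1000 )` shows the crossover: insertion sort's advantage at small n disappears as n grows, mirroring the shape of Table 1 and Graph 1.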