### Initial Implementation

Listing 1 shows the initial implementation of the Stable MSD 256-Radix Sort, which is similar in many aspects to the In-Place Non-Stable MSD N-bit-Radix Sort in [1]. Let’s explore the details of this implementation.

// Copyright(c), Victor J. Duvanenko, 2009 // Listing 1 template< class _Type > inline unsigned long extractDigit( _Type a, _Type bitMask, unsigned long shiftRightAmount ) { unsigned long digit = (unsigned long)(( a & bitMask ) >> shiftRightAmount ); // extract the digit we are sorting based on return digit; } template< class _Type > inline void _RadixSort_StableInitialUnsigned_Radix256( _Type* a, _Type* b, long last, _Type bitMask, unsigned long shiftRightAmount, bool inputArrayIsDestination ) { const unsigned long numberOfBins = 256; // 256-radix (Bins 0 - 255) unsigned long count[ numberOfBins ]; for( unsigned long i = 0; i < numberOfBins; i++ ) count[ i ] = 0; for ( long _current = 0; _current <= last; _current++ ) // Scan the array and count the number of times each value appears count[ extractDigit( a[ _current ], bitMask, shiftRightAmount ) ]++; long startOfBin[ numberOfBins ], endOfBin[ numberOfBins ]; startOfBin[ 0 ] = endOfBin[ 0 ] = 0; for( unsigned long i = 1; i < numberOfBins; i++ ) startOfBin[ i ] = endOfBin[ i ] = startOfBin[ i - 1 ] + count[ i - 1 ]; for ( long _current = 0; _current <= last; _current++ ) // Move elements from input array into the bins in destination array based on digit value b[ endOfBin[ extractDigit( a[ _current ], bitMask, shiftRightAmount ) ]++ ] = a[ _current ]; bitMask >>= 8; shiftRightAmount -= 8; if ( bitMask != 0 ) // end recursion when all the bits have been processes { inputArrayIsDestination = !inputArrayIsDestination; for( unsigned long i = 0; i < numberOfBins; i++ ) if (( endOfBin[ i ] - startOfBin[ i ] ) >= 2 ) _RadixSort_StableInitialUnsigned_Radix256( &b[ startOfBin[ i ]], &a[startOfBin[ i ]], endOfBin[ i ] - 1 - startOfBin[ i ], bitMask, shiftRightAmount, inputArrayIsDestination ); else if (( endOfBin[ i ] - startOfBin[ i ] ) == 1 ) { if ( inputArrayIsDestination ) a[ startOfBin[ i ]] = b[ startOfBin[ i ]]; // copy the single element in the bin back to the input array } } else { // Done with recursion copy all of the bins if ( !inputArrayIsDestination ) for ( long _current = 0; _current <= last; _current++ ) // copy from external array back into the input array a[ _current ] = b[ _current ]; } } // Initial implementation inline void RadixSort_StableInitialUnsigned_Radix256( unsigned long* a, unsigned long* b, unsigned long a_size ) { if ( a_size < 2 ) return; unsigned long bitMask = 0xFF000000; // bitMask controls how many bits we process at a time unsigned long shiftRightAmount = 24; _RadixSort_StableInitialUnsigned_Radix256( a, b, a_size - 1, bitMask, shiftRightAmount, false ); }

A **count[]** array as big as the radix. **Count[0]** will hold the count of how many elements were encountered with the most-significant-digit (MSD) of zero. **Count[1]** will hold the number of elements with the MSD of one, and so on. This array is initialized to zero in the first for loop. In the second for loop the appropriate digit (that’s being used to sort with) is extracted from each input element and **Count[digit]** is incremented. This loop scans through the input array portion that it was given, and determines how big each bin is (as discussed in the previous section).

Once the bin sizes are known, then the third for loop determines the index/offset for the starting point of each of the bins. The array **a** is used as an input and array **b** is used as an output. The fourth for loop is the core of the algorithm. It processes all of the elements in the input array from the first element to the last, in order. It extracts the appropriate digit (that's being used to sort with) and moves it to its bin. **EndOfBin[]** keeps track of the current location/index within each bin, and this index is advanced once an element has been moved to that bin. In other words, **EndOfBin[ digit ]** keeps track of where the end of each bin currently is -- where an element would be placed if it were to be added to that bin. **a** The digit is extracted from the current **a** input array element **a[ _current ]** by using **extractDigit()** template function. This function performs two operations: masking (bitwise AND) to extract the digit being used for sorting, and shift to the right to place the digit in the least significant bits of the word so that it can be used as a bin index. After this loop is done all of the input **a** array elements have been moved to their respective bins within the output **b** array.

At this point sorting based on the current digit is done. The bitmask is shifted to the right to select the next digit, and **shiftRightAmount** is adjusted by the same amount. Since the initial algorithm implementation is a 256-Radix (8-bit digits), these adjustments determine the size of each digit, which is 8-bits in this case. If the resulting bitMask is zero, then the algorithm is done sorting as all digits have been used to sort by. If however, there are digits left to process, it is done recursively by processing each of the bins separately. The next for loop processes each bin and calls the Radix Sort function (itself) recursively if the bin has 2 or more elements in it. In that case, the **b** output array and **a** input arrays are swapped, where **b** is used as input and **a** is used as output of this next recursion level. In other words, the output array of the current recursion level becomes the input array for the next recursion level. The bin size is also passed down to the next recursion level as well as the new **bitMask** and **shiftRigthAmount**.

As the algorithm progresses recursively, it swaps **a** and **b** arrays, using one of them as input and the other as output at each level of recursion. Once the last level of recursion has been reached, it is possible for the **b** array to be the output, whenever the number of recursions is an odd number. This is the purpose of the last else statement, where the for loop copies the bin from the **b** output array back to the original **a** input array, so that the sorted bin ends up in the **a** array (where the user expects it). If the number of **a** and **b** array swaps (recursion levels) is an even number, then this last else case will not be executed. For example, if an 8-bit-Radix is used to sort 32-bit numbers, then 4 recursion levels will always be used (since all 32-bits need to be processed, 8-bits at a time), and the last else statement is not necessary and will never be executed. However, if an 11-bit-Radix is used, then 3 recursion levels will always be used and the last else statement (copy from **b** to **a**) will be required for proper operation. Also, 8-bit array elements with 8-bit-Radix will go through a single level of recursion, where the results will be in the **b** array.

Table 1 and Graph 1 show measurements of performance as compared to STL **sort()**, STL **stable_sort()**, in-place Hybrid N-bit-Radix Sort of [1] (not stable), Intel's IPP Radix Sort (of unknown stability), and Insertion Sort.

Several trends are visible in these measurements:

- STL stable sort is consistently slightly slower than STL sort.
- STL stable sort is faster than the initial implementation of the Stable N-bit-Radix Sort for smaller array sizes, but is slower at 100M array size.
- Intel's IPP Radix Sort is many times faster than the Stable N-bit-Radix Sort for all but the smallest array size.
- In-Place Hybrid N-bit-Radix Sort from [1] is many times faster for all array sizes.
- In-Place Hybrid N-bit-Radix Sort from [1] is about the same performance as Intel’s IPP Radix Sort (which is not in-place).
- Insertion Sort is faster for small array sizes up to about 1K elements.