### Counting Sort

Table 8 and Graph 8 introduced the Counting Sort algorithm by comparing its superior performance to others. Listing 5 is an implementation for sorting unsigned 8-bit integers. The algorithm works by creating 256 counters, one for each possible value that an array element can be. These counters are initialized to a count of zero. Then the array is scanned the first time to count how many times each value appears. For example, for the following array:

0, 2, 15, 200, 0, 3, 12, 203, 181, 181, 2, 0, 2, 12, 0, 3, 15

the counts would be count[0]=4, count[1]=0, count[2]=3, count[3]=2, count[12]=2, count[15]=2, count[181]=2, count[200]=1, count[203]=1, and the rest of the count[] array would be zero. The next for loop starts and count[0] and stores the number of zeros into the array as the value that count[0] hold. In the above example, array[0]=0, array[1]=0, array[2]=0, and array[3]=3, since count[0] was equal 4. Since count[1] is zero, then zero values of 1 are stored in the array. Then array[4]=2, array[5]=2, and array[6]=2, since three 2's were found in the input array. This process continues until the entire count array has been processed. As Table 8 and Graph 8 measurements show this process is very efficient for 8-bit unsigned numbers and outperforms all other algorithms by a wide margin.

Listing 5 also shows a Counting Sort implementation for 16-bit unsigned numbers as well as a Hybrid Counting Sort, which combines the Counting Sort with Insertion Sort. Table 13 and Graph 13 show measurement results of these three implementations along with Insertion Sort.

//Listing 5 inline void CountSortInPlace( unsigned char* a, unsigned long a_size ) { if ( a_size < 2 ) return; const unsigned long numberOfCounts = 256; // one count for each possible value of an 8-bit element (0-255) unsigned long count[ numberOfCounts ] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; //for( unsigned long i = 0; i < numberOfCounts; i++ ) // initialized all counts to zero, since the array may not contain all values // count[ i ] = 0; // !!! It should be possible to use array initialization and remove this overhead!!! // Scan the array and count the number of times each value appears for( unsigned long i = 0; i < a_size; i++ ) count[ a[ i ] ]++; // Fill the array with the number of 0's that were counted, followed by the number of 1's, and then 2's and so on unsigned long n = 0; for( unsigned long i = 0; i < numberOfCounts; i++ ) for( unsigned long j = 0; j < count[ i ]; j++, n++ ) a[ n ] = (unsigned char)i; } inline void CountSortInPlace( unsigned short* a, unsigned long a_size ) { if ( a_size < 2 ) return; const unsigned long numberOfCounts = 65536; // one count for each possible value of an 16-bit element (0-65535/0xffff) unsigned long count[ numberOfCounts ]; for( unsigned long i = 0; i < numberOfCounts; i++ ) // initialized all counts to zero, since the array may not contain all values count[ i ] = 0; // Scan the array and count the number of times each value appears for( unsigned long i = 0; i < a_size; i++ ) count[ a[ i ] ]++; // Fill array with the number of 0's counted, followed by the number of 1's, and then 2's and so on unsigned long n = 0; for( unsigned long i = 0; i < numberOfCounts; i++ ) for( unsigned long j = 0; j < count[ i ]; j++, n++ ) a[ n ] = (unsigned short)i; }

These measurements show that Intel IPP Radix Sort outperforms the Hybrid 256-Radix Sort by about 2X, but lags 16-bit Count Sort by about 2X for large array sizes. For mid-size arrays Intel IPP Radix Sort outperforms all others. It seems that a 3-way Hybrid would produce a superior combination, with Insertion Sort for up to 30-50 element arrays, Intel IPP Radix for 50 to 500K arrays, and 16-bit Counting Sort for all largest array sizes. However, Intel IPP Radix Sort is not in place, forcing the combination to be not in-place. This may still be an effective combination, since only the mid-section would be required to be not-in-place, where the requirement of doubling memory size may not be a concern, as it would be at larger array sizes.

Counting Sort will be explored in more detail in future articles, since its performance needs to be examined closer due to concerns about sustainable superior performance depending of input data statistics. The above measurements show potential.

### Aggregate

Listing 6 is a list of overloaded functions that combine to produce a data-type aware slightly generic sorting.

// Copyright(c), Victor J. Duvanenko, 2009 // Hybrid In-place Radix Sort implementations. #ifndef _HybridInPlaceRadixSort_h #define _HybridInPlaceRadixSort_h #include "InsertionSort.h" #include "RadixInPlaceSort.h" #include "CountSort.h" inline void HybridSort( char* a, unsigned long a_size ) { if ( a_size < 2 ) return; const long PowerOfTwoRadix = 256; const long Log2ofPowerOfTwoRadix = 8; const long Threshold = 64; char bitMask = (char)0x80; // bitMask controls how many bits are processed at a time unsigned long shiftRightAmount = 7; for( unsigned long i = 2; i < PowerOfTwoRadix; ) // if not power-of-two value then it will do up to the largest power-of-two value { // that's smaller than the value provided (e.g. radix-10 will do radix-8) bitMask |= logicalRightShift( bitMask, 1 ); shiftRightAmount -= 1; i <<= 1; } if ( a_size >= Threshold ) _RadixSort_Signed_PowerOf2Radix_1< char, PowerOfTwoRadix, Log2ofPowerOfTwoRadix, Threshold >( a, a_size - 1, bitMask, shiftRightAmount ); else insertionSortSimilarToSTLnoSelfAssignment( a, a_size ); } inline void HybridSort( unsigned char* a, unsigned long a_size ) { if ( a_size > 32 ) CountSortInPlace( a, a_size ); else insertionSortSimilarToSTLnoSelfAssignment( a, a_size ); } inline void HybridSort( short* a, unsigned long a_size ) { if ( a_size < 2 ) return; const long PowerOfTwoRadix = 256; const long Log2ofPowerOfTwoRadix = 8; const long Threshold = 64; short bitMask = (short)0x8000; // bitMask controls how many bits we process at a time unsigned long shiftRightAmount = 15; for( unsigned long i = 2; i < PowerOfTwoRadix; ) // if not power-of-two value then it will do up to the largest power-of-two value { // that's smaller than the value provided (e.g. radix-10 will do radix-8) bitMask |= logicalRightShift( bitMask, 1 ); shiftRightAmount -= 1; i <<= 1; } if ( a_size >= Threshold ) _RadixSort_Signed_PowerOf2Radix_1< short, PowerOfTwoRadix, Log2ofPowerOfTwoRadix, Threshold >( a, a_size - 1, bitMask, shiftRightAmount ); else insertionSortSimilarToSTLnoSelfAssignment( a, a_size ); } inline void HybridSort( unsigned short* a, unsigned long a_size, unsigned short* b = NULL ) { if ( a_size >= 1000000 ) CountSortInPlace( a, a_size ); else { if ( b == NULL ) // Can't use IPP Radix without the extra array b { if ( a_size > 100 ) CountSortInPlace( a, a_size ); else insertionSortSimilarToSTLnoSelfAssignment( a, a_size ); } else { if ( a_size > 100 ) ippsSortRadixAscend_16u_I( reinterpret_cast< Ipp16u * > ( a ), reinterpret_cast< Ipp16u * > ( b ), a_size ); else insertionSortSimilarToSTLnoSelfAssignment( a, a_size ); } } } inline void HybridSort( long* a, unsigned long a_size ) { if ( a_size < 2 ) return; const long PowerOfTwoRadix = 256; const long Log2ofPowerOfTwoRadix = 8; const long Threshold = 100; long bitMask = 0x80000000; // bitMask controls how many bits we process at a time unsigned long shiftRightAmount = 31; for( unsigned long i = 2; i < PowerOfTwoRadix; ) // if not power-of-two value then it will do up to the largest power-of-two value { // that's smaller than the value provided (e.g. radix-10 will do radix-8) bitMask |= logicalRightShift( bitMask, 1 ); shiftRightAmount -= 1; i <<= 1; } if ( a_size >= Threshold ) _RadixSort_Signed_PowerOf2Radix_1< long, PowerOfTwoRadix, Log2ofPowerOfTwoRadix, Threshold >( a, a_size - 1, bitMask, shiftRightAmount ); else insertionSortSimilarToSTLnoSelfAssignment( a, a_size ); } inline void HybridSort( unsigned long* a, unsigned long a_size ) { if ( a_size < 2 ) return; const long PowerOfTwoRadix = 256; const long Log2ofPowerOfTwoRadix = 8; const long Threshold = 100; unsigned long bitMask = 0x80000000; // bitMask controls how many bits we process at a time unsigned long shiftRightAmount = 31; for( unsigned long i = 2; i < PowerOfTwoRadix; ) // if not power-of-two value then it will do up to the largest power-of-two value { // that's smaller than the value provided (e.g. radix-10 will do radix-8) bitMask |= ( bitMask >> 1 ); shiftRightAmount -= 1; i <<= 1; } if ( a_size >= Threshold ) _RadixSort_Unsigned_PowerOf2Radix_1< unsigned long, PowerOfTwoRadix, Log2ofPowerOfTwoRadix, Threshold >( a, a_size - 1, bitMask, shiftRightAmount ); else insertionSortSimilarToSTLnoSelfAssignment( a, a_size ); } inline void HybridSort( __int64* a, unsigned long a_size ) { if ( a_size < 2 ) return; const long PowerOfTwoRadix = 256; const long Log2ofPowerOfTwoRadix = 8; const long Threshold = 64; __int64 bitMask = 0x8000000000000000i64; // bitMask controls how many bits process at a time unsigned long shiftRightAmount = 63; for( unsigned long i = 2; i < PowerOfTwoRadix; ) // if not power-of-two value then it will do up to the largest power-of-two value { // that's smaller than the value provided (e.g. radix-10 will do radix-8) bitMask |= logicalRightShift( bitMask, 1 ); shiftRightAmount -= 1; i <<= 1; } if ( a_size >= Threshold ) _RadixSort_Signed_PowerOf2Radix_1< __int64, PowerOfTwoRadix, Log2ofPowerOfTwoRadix, Threshold >( a, a_size - 1, bitMask, shiftRightAmount ); else insertionSortSimilarToSTLnoSelfAssignment( a, a_size ); } inline void HybridSort( unsigned __int64* a, unsigned long a_size ) { if ( a_size < 2 ) return; const long PowerOfTwoRadix = 256; const long Log2ofPowerOfTwoRadix = 8; const long Threshold = 64; unsigned __int64 bitMask = 0x8000000000000000i64 // bitMask controls how many bits we process at a time unsigned long shiftRightAmount = 63; for( unsigned long i = 2; i < PowerOfTwoRadix; ) // if not power-of-two value then it will do up to the largest power-of-two value { // that's smaller than the value provided (e.g. radix-10 will do radix-8) bitMask |= bitMask >> 1; shiftRightAmount -= 1; i <<= 1; } if ( a_size >= Threshold ) _RadixSort_Unsigned_PowerOf2Radix_1< unsigned __int64, PowerOfTwoRadix, Log2ofPowerOfTwoRadix, Threshold >( a, a_size - 1, bitMask, shiftRightAmount ); else insertionSortSimilarToSTLnoSelfAssignment( a, a_size ); } #endif // _HybridInPlaceRadixSort_h

The user calls the **sort()** function passing it an array of 8-bit through 64-bit unsigned or signed integers. The appropriate function for the data-type gets invoked and sorting algorithms optimized for that data-type and array size are executed, resulting in the optimized overall performance. The sorting algorithms used underneath when appropriate are Counting Sort, Insertion Sort, Intel IPP Radix Sort, and 256-Radix Sort. Note the optional additional optional function argument of a pointer to a memory buffer. If this parameter is NULL then Intel IPP Radix Sort is not used for 16-bit integer sorting of mid-size arrays.

### Conclusion

In-place Hybrid Binary-Radix Sort algorithm was developed in [1] for arrays of 8-bit to 64-bit unsigned and signed integers, competitive in performance to STL sort, but inferior to Intel's IPP Radix Sort. That effort was extended in this paper by first developing an in-place MSD pure N-bit-Radix Sorting algorithm, which was slightly slower than STL sort and much slower than Intel's IPP Radix Sort.

Binary-Radix Sort was shown in Algorithm Improvement through Performance Measurement: Part 2 to benefit from being combined with Insertion Sort to form a Hybrid algorithm, similar to STL sort hybrid approach, resulting in a 30% performance gain due to hybrid. In this article the hybrid approach was applied to the N-bit-Radix algorithm, which resulted in higher performance gains. For arrays of 32-bit unsigned and signed integers the gain in performance was over 4X, over 5X for 64-bit, about 6X for 16-bit (but only for arrays smaller than 10K elements), and no gain for 8-bit. This In-place Hybrid N-bit-Radix Sort was further performance optimized by about 10%, by improving the inner loop performance and selecting an optimal threshold of when to switch to Insertion Sort for small sub-arrays. It was then extended to support arrays of 8-bit through 64-bit unsigned and signed integers.

The resulting algorithm was compared with STL sort and Intel's IPP Radix Sort. For 32-bit integers it was 3-4X faster than STL sort, 2-3X faster for 64-bit, 3-7X faster for 16-bit, and 5-7X faster for 8-bit.

Comparing to Intel's IPP Radix Sort for 32-bit integers it was comparable in performance for larger array sizes and smaller sizes, but lagged in the mid-size (1K to 100K). For 16-bit and 8-bit it lagged IPP Radix Sort by about 2X and about 4X respectively. 64-bit performance could not be compared since IPP did not implement it.

A trend emerged where arrays of 64-bit elements benefited from the hybrid approach, as well as 32-bit, but 16-bit only up to about 10K elements, and 8-bit not at all. The hybrid combination was that of 256-Radix Sort and Insertion Sort, where the Insertion Sort was invoked when the array size was below a threshold (64 for 64-bit, 100 for 32-bit, and 64 for 16-bit). 256-Radix on average splits the array into 256 bins during each level of recursion (8 levels of recursion for 64-bit, 4 levels for 32-bit, 2 levels for 16-bit, 1 level for 8-bit). Thus, for arrays of 16-bit elements, when the array size is smaller than 64*256=16K then Insertion Sort will be invoked, as this will produce sub-arrays that are smaller than the threshold value of 64 on the second/last recursion. However, if the array is larger than 16K, the sub-arrays will be larger than the threshold, resulting in no invocation of Insertion Sort, but the use of 256-Radix Sort instead for the second level of recursion. For arrays of 32-bit elements, when the array size is smaller than 100*256*256*256=1.68 billion elements then Insertion Sort will be invoked, but not if larger. For arrays of 64-bit elements, this limit point is at 64*2567="a very large number". This trend exposes a limit to the benefit of the hybrid approach, which increases as the size of array element increases and depends on the radix as well as the threshold. It also illustrates that N-bit-Radix and Insertion Sort make a good hybrid combination because N-bit-Radix breaks the array into 2N bins at every stage of recursion, getting to small sub-arrays quickly (comparing to STL sort and Binary-Radix Sort, which break the array into 2 bins).

The Counting Sort algorithm, in-place, was briefly described here with implementations shown for 8-bit and 16-bit unsigned integers. It was then augmented by Insertion Sort for smaller array sizes to create a Hybrid Counting Sort. This algorithm was shown to have slightly higher performance than IPP Radix Sort for mid-size and large arrays, but slower for smaller arrays for 8-bit arrays. For 16-bit arrays it was shown to outperform by about 2X for larger arrays, outperform for small arrays, but lagged in performance by as much as 16X for mid-size arrays. Counting Sort was shown to outperform STL sort by 22-30X for 8-bit arrays for mid-size and large arrays, and in the range of 0.5-25X for 16-bit arrays with consistent 22-25X for large arrays. Counting Sort was combined with the Hybrid N-bit-Radix algorithm, and invoked for 8-bit and 16-bit unsigned integers.

The approach of making a hybrid non-recursive algorithm, such as the Counting Sort, was not as effective since it only improved performance of the smallest array sizes, whereas when applied to a recursive algorithm such as QuickSort (by STL sort), N-bit-Radix Sort and Binary-Radix Sort performance improvements were experienced for all array sizes (whenever Insertion Sort was invoked).

In-place Hybrid N-bit-Radix Sort performs sorting in-place; i.e., does not require additional memory. STL sort and Counting Sort are also in-place. However, Intel's IPP Radix Sort is not in-place and requires an additional memory array of equivalent size to the input array.

Stability was not considered for this algorithm, since the current implementation is capable of sorting only arrays of numbers, and stability is not relevant. Intel's IPP implementations are also only capable of sorting numbers. However, STL sort generic ability is much more advanced than the algorithms presented in this paper and can sort not only numbers, but also other types as well as user-created classes (as long as the comparison operation has been defined). STL sort offers stable and non-stable versions of sorting algorithms.

This article presented a manual exploration of performance optimization space for a sorting algorithm; e.g., found that 256-Radix performed best, which may not hold on other current or future computers. It may be beneficial to automate this process and execute it when software first encounters a particular computer system at which point the algorithm would be adapted to this system to execute most efficiently.

### References

[1] V. J. Duvanenko, Algorithm Improvement through Performance Measurement: Part 2.

[2] D. Vandevoorde, N.M. Josuttis, C++ Templates. The Complete Guide, pp. 9-19.