### Limitations

The Counting Sort algorithm uses an array of counts, which is reasonable in size for 8-bit and 16-bit numbers (256 counts and 64K counts, respectively). Each count is a 32-bit value, allowing the algorithm to handle arrays of up to 4 billion elements. Thus, for 8-bit numbers the count array uses 1K bytes (256 entries at 32 bits each), and for 16-bit numbers the count array uses 256K bytes. For the 8-bit algorithm, the count array fits inside the L1 cache of modern processors. For the 16-bit algorithm, the count array fits inside the L2 or L3 cache. Arrays of 4 billion elements are beyond the limits of 32-bit operating systems and 32-bit processors, and thus unsigned 32-bit values for counts are safe to use.

However, when sorting arrays of 32-bit numbers, the required count array grows to 4 billion counts, because a 32-bit number has 4 billion possible values. At 32 bits per count, 16 gigabytes of memory would be required for the count array. This size is not possible on a 32-bit operating system, but is on the verge of practical for 64-bit operating systems and processors. When a 64-bit operating system is used, array sizes can exceed 4 billion elements, which requires 64-bit counts and doubles the memory requirement for the count array.
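The memory figures above follow from one multiplication: the number of possible element values times the size of each count. A minimal sketch of that arithmetic is shown here; the function and parameter names are illustrative and do not come from the original listings.

```cpp
#include <cstdint>

// Bytes needed for a Counting Sort count array: one counter for each
// possible value of an element that is elementBits wide.
// Illustrative helper; elementBits must be less than 64.
constexpr std::uint64_t countArrayBytes(unsigned elementBits, unsigned counterBytes)
{
    return (std::uint64_t{1} << elementBits) * counterBytes;
}
```

With 32-bit counts this gives 1K bytes for 8-bit elements, 256K bytes for 16-bit elements, and 16 gigabytes for 32-bit elements; switching to 64-bit counts doubles each figure.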

Thus today, 8-bit and 16-bit Counting Sort is a practical algorithm and performs very well, outperforming other sorting algorithms by a wide margin. When sorting 32-bit and larger integers, as well as single and higher precision floating-point arrays, N-bit Radix Sort is a good choice, as was shown in Part 3 and Part 4. N-bit Radix Sort, which uses Counting Sort internally, sorts one digit at a time in **O(dn)** time, where **d** is the number of digits. For example, sorting 32-bit numbers takes four passes at 8 bits per pass. Lastly, Counting Sort and N-bit Radix Sort can be combined to form a hybrid sorting algorithm that outperforms either algorithm used alone. This method was shown to be effective in Part 3, where Counting Sort was used for sorting arrays of 8-bit and 16-bit elements, and in-place N-bit Radix Sort was used to sort arrays of 32-bit and 64-bit unsigned and signed integers.
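To make the "four passes at 8 bits" idea concrete, here is a minimal sketch of a least-significant-digit Radix Sort that runs one counting pass per 8-bit digit. It is not the in-place N-bit Radix Sort from Part 3: this variant uses a second buffer of the same size, and the function name is chosen here for illustration only.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// LSD Radix Sort of 32-bit unsigned integers, one 8-bit digit per pass
// (d = 4 passes). Each pass is a stable counting pass keyed on one digit.
inline void radixSortLsd8(std::vector<std::uint32_t>& a)
{
    std::vector<std::uint32_t> buffer(a.size());
    for (unsigned shift = 0; shift < 32; shift += 8)
    {
        // Count how many elements have each value of the current digit
        std::array<std::size_t, 256> count{};
        for (std::uint32_t v : a)
            count[(v >> shift) & 0xFF]++;

        // Turn the counts into starting offsets (exclusive prefix sum)
        std::size_t offset = 0;
        for (std::size_t& c : count)
        {
            const std::size_t n = c;
            c = offset;
            offset += n;
        }

        // Move each element to its slot, preserving the order within a digit
        for (std::uint32_t v : a)
            buffer[count[(v >> shift) & 0xFF]++] = v;

        a.swap(buffer);   // after an even number of passes the result is in a
    }
}
```

Each pass touches every element a constant number of times, so the total work is proportional to d times n, which is where the O(dn) bound comes from.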

### Acceleration

Intel architecture processors support SIMD (single instruction, multiple data) instructions such as SSE, which perform certain kinds of operations in parallel, such as adding eight 16-bit numbers in a single clock cycle. These instructions operate on up to 128 bits of data at a time and achieve speedup by processing several data items simultaneously. Intel has developed the Intel Performance Primitives (IPP) library of common routines that use these SSE instructions for acceleration, and has spent numerous man-years optimizing their performance. This library is simple to use and adapts to the processor type and the subset of instructions it supports. Using the library is simpler and quicker than developing the SSE code yourself, especially when you take into account supporting several generations of processors, each with a different subset of the SSE instructions.

Two functions from the IPP library are useful for the Counting Sort algorithm: zero and set. The zero function initializes every value within an array to zero. The set function sets every value within an array to a specified value. Each function supports a variety of data types, such as integer, floating-point, and complex.

Listing 4 shows 8-bit and 16-bit (unsigned and signed) implementations of the Counting Sort algorithm using the IPP library functions.

```cpp
// Copyright(c), Victor J. Duvanenko, 2010
#include <ipps.h>   // Intel IPP signal processing: ippsZero_32s, ippsSet_8u, ippsSet_16s

inline void CountSortInPlaceIPP( unsigned char* a, unsigned long a_size )
{
    if ( a_size < 2 ) return;

    const unsigned long numberOfCounts = 256;
    // one count for each possible value of an 8-bit element (0-255);
    // the cast below relies on unsigned long being 32 bits, as it is with the Microsoft compiler
    __declspec( align(32)) unsigned long count[ numberOfCounts ];
    ippsZero_32s( reinterpret_cast< Ipp32s * > ( count ), numberOfCounts );

    // Scan the array and count the number of times each value appears
    for( unsigned long i = 0; i < a_size; i++ )
        count[ a[ i ] ]++;

    // Fill the array with the number of 0's that were counted, followed by the number of 1's, and then 2's and so on
    unsigned long n = 0;
    for( unsigned long i = 0; i < numberOfCounts; i++ )
    {
        ippsSet_8u( (unsigned char)i, reinterpret_cast< Ipp8u * > ( &a[ n ] ), count[ i ] );
        n += count[ i ];
    }
}

inline void CountSortInPlaceIPP( unsigned short* a, unsigned long a_size )
{
    if ( a_size < 2 ) return;

    const unsigned long numberOfCounts = 65536;
#if 1
    // one count for each possible value of a 16-bit element (0-65535)
    __declspec( align(32)) unsigned long count[ numberOfCounts ];
    // unsigned long count[ numberOfCounts ];
    ippsZero_32s( reinterpret_cast< Ipp32s * > ( count ), numberOfCounts );
#else
    //__declspec( align(32)) unsigned long count[ numberOfCounts ] = { 0 };  // pre-initializing to zero should be faster/free
    //__declspec( align(32)) unsigned long count[ numberOfCounts ];
    unsigned long count[ numberOfCounts ];
    for( unsigned long i = 0; i < numberOfCounts; i++ )  // initialize all counts to zero, since the array may not contain all values
        count[ i ] = 0;
#endif

    // Scan the array and count the number of times each value appears
    for( unsigned long i = 0; i < a_size; i++ )
        count[ a[ i ] ]++;

    // Fill the array with the number of 0's that were counted, followed by the number of 1's, and then 2's and so on
    unsigned long n = 0;
    for( unsigned long i = 0; i < numberOfCounts; i++ )
    {
        // the signed 16-bit set function writes the same bit pattern as the unsigned value
        ippsSet_16s( (short)i, reinterpret_cast< Ipp16s * > ( &a[ n ] ), count[ i ] );
        n += count[ i ];
    }
}
```

The 32-bit version of the zero function, **ippsZero_32s()**, is used to initialize the count arrays, as a replacement for the pre-initialized arrays. The set function **ippsSet_8u()** replaces the last inner for loop in the 8-bit implementation, and **ippsSet_16s()** does the same in the 16-bit implementation.
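For readers without the IPP library, the same structure can be sketched with the C++ standard library, where `std::fill_n` plays the role of both the zero and the set function. This is a portable stand-in written for illustration, not one of the implementations that was measured, and the function name is an assumption.

```cpp
#include <algorithm>
#include <cstddef>

// 8-bit Counting Sort with std::fill_n standing in for the IPP calls.
inline void countSortInPlaceStd(unsigned char* a, std::size_t a_size)
{
    if (a_size < 2) return;

    const std::size_t numberOfCounts = 256;
    std::size_t count[numberOfCounts];
    std::fill_n(count, numberOfCounts, std::size_t{0});   // role of ippsZero_32s()

    // Scan the array and count the number of times each value appears
    for (std::size_t i = 0; i < a_size; i++)
        count[a[i]]++;

    // Write count[i] copies of value i, in increasing order of i
    std::size_t n = 0;
    for (std::size_t i = 0; i < numberOfCounts; i++)
    {
        std::fill_n(a + n, count[i], static_cast<unsigned char>(i));   // role of ippsSet_8u()
        n += count[i];
    }
}
```

Whether a compiler vectorizes these fills as well as a hand-tuned library routine depends on the compiler and its settings, so any comparison should be based on measurement.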

Sadly, the Intel SSE instruction set has no support for parallel indexed (lookup table) operations, which would have been useful for accelerating the counting portion of the algorithm.

Tables 5 and 6 show performance measurements of the unsigned 8-bit and 16-bit Counting Sort algorithm augmented with the Intel IPP library functions.

Measurement results show that using the IPP library does not accelerate Counting Sort. For small array sizes (100 elements or fewer for 8-bit, and 10K or fewer for 16-bit), the IPP-based implementations are slower than the scalar (non-IPP) C++ implementations. This is most likely due to the overhead of calling IPP library functions. Measurements demonstrate that when using the IPP library, the **__declspec( align(32) )** attribute is critical: it aligns the local stack-based count array on a 32-byte boundary, which improves the performance of SSE instructions.
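The alignment attribute used in the listing is specific to the Microsoft compiler. As a sketch of a portable alternative, C++11 provides `alignas`, which can request the same kind of alignment for a stack array. The 64-byte value below is an assumed cache-line size, and the helper names are illustrative.

```cpp
#include <cstdint>

// Returns true when pointer p is a multiple of `alignment` bytes.
inline bool isAligned(const void* p, std::uintptr_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Declares a count array on the stack with an explicit alignment request
// and reports whether the request was honored.
inline bool countArrayIsCacheLineAligned()
{
    alignas(64) unsigned long count[256] = {};   // 64 bytes: assumed cache-line size
    return isAligned(count, 64);
}
```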

### Hybrid

A hybrid algorithm approach combines multiple algorithms to perform better than any single algorithm could. For example, STL **sort()** uses QuickSort, Heap Sort, and Insertion Sort to produce a generic high-performance sorting algorithm. STL **stable_sort()** uses a buffered Merge Sort and Insertion Sort.

Counting Sort processes the array in two passes and does not break the array down into smaller pieces as other algorithms do. For this reason, it is difficult for Counting Sort to benefit from a hybrid approach, except for smaller array sizes. For arrays of 8-bit numbers, Insertion Sort could be used to accelerate smaller array sizes, as was done in Part 3, since Insertion Sort is about 4X faster for arrays of 10 elements. For arrays of 16-bit numbers, Insertion Sort could also be used for the smallest array sizes, followed by Intel's IPP Radix Sort for array sizes up to 0.5 million elements, and 16-bit Counting Sort for the largest array sizes.
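The 8-bit suggestion can be sketched as a size-based dispatch: Insertion Sort below a threshold, Counting Sort above it. The threshold value used here is a hypothetical placeholder that would need to be tuned by measurement, and the function names are chosen for illustration.

```cpp
#include <cstddef>

// Insertion Sort: little setup cost, so it suits very small arrays.
inline void insertionSort(unsigned char* a, std::size_t a_size)
{
    for (std::size_t i = 1; i < a_size; i++)
    {
        const unsigned char v = a[i];
        std::size_t j = i;
        while (j > 0 && a[j - 1] > v)
        {
            a[j] = a[j - 1];
            j--;
        }
        a[j] = v;
    }
}

// Plain 8-bit Counting Sort: count each value, then rewrite the array.
inline void countingSort(unsigned char* a, std::size_t a_size)
{
    std::size_t count[256] = {};
    for (std::size_t i = 0; i < a_size; i++)
        count[a[i]]++;

    std::size_t n = 0;
    for (std::size_t v = 0; v < 256; v++)
        for (std::size_t c = count[v]; c > 0; c--)
            a[n++] = static_cast<unsigned char>(v);
}

// Hypothetical cut-over point; the right value depends on the machine.
const std::size_t kInsertionSortThreshold = 32;

inline void hybridSort(unsigned char* a, std::size_t a_size)
{
    if (a_size < kInsertionSortThreshold)
        insertionSort(a, a_size);
    else
        countingSort(a, a_size);
}
```

Counting Sort pays a fixed cost for clearing and scanning 256 counts regardless of the array size, which is why a simpler algorithm can win on very small inputs.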

### Conclusion

Counting Sort is a very efficient, high-performance, linear-time **O(n)**, in-place sorting algorithm. Implementations for sorting arrays of unsigned 8-bit and 16-bit numbers were developed. These implementations were extended to support signed numbers, since signed numbers require different treatment from unsigned numbers. The signed implementations were crafted so as not to sacrifice performance.
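One common way to give signed numbers the different treatment they need is to shift each value by a fixed offset so that it becomes a valid, non-negative array index. The sketch below shows that approach for signed 8-bit elements; it is an assumption about how such a version can be written, not a reproduction of the signed implementation from this series.

```cpp
#include <cstddef>

// Counting Sort for signed 8-bit elements: values -128..127 are mapped
// to count indices 0..255 by adding an offset of 128.
inline void countSortSigned8(signed char* a, std::size_t a_size)
{
    if (a_size < 2) return;

    const int numberOfCounts = 256;
    const int offset = 128;
    std::size_t count[numberOfCounts] = {};

    // a[i] is promoted to int before the addition, so the index is 0..255
    for (std::size_t i = 0; i < a_size; i++)
        count[a[i] + offset]++;

    // Rewrite the array from the most negative value to the most positive
    std::size_t n = 0;
    for (int i = 0; i < numberOfCounts; i++)
        for (std::size_t c = count[i]; c > 0; c--)
            a[n++] = static_cast<signed char>(i - offset);
}
```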

For arrays of 8-bit unsigned and signed numbers, Counting Sort outperformed STL **sort()** by over 20X for array sizes of 100K and larger, and outperformed Intel's IPP sort by 20-30% for array sizes of 10K and larger. Counting Sort also outperformed N-bit-Radix Stable Sort by 1.6X to 5.9X for array sizes of 1K and larger. For arrays of 16-bit unsigned and signed numbers, Counting Sort outperformed STL **sort()** by up to 30X, IPP Radix Sort by up to 4X, and N-bit-Radix Stable Sort by up to 6X.

The Counting Sort algorithm was shown to be practical for 8-bit and 16-bit numbers, but not yet practical for 32-bit and larger numbers on 32-bit operating systems. However, for 64-bit processors and operating systems, sorting 32-bit numbers should become practical within the next few years. For now, N-bit Radix Sort (Part 3) is a good alternative high-performance sorting algorithm with **O(dn)** running time, where **d** is the number of digits within each array element.

Counting Sort illustrates that for purely numeric arrays the concept of stability does not apply. In the implementations above the original numbers are not kept to produce the resulting sorted array -- they are counted, discarded, and then recreated. These implementations gain their performance from not moving any of the array elements. However, the Counting Sort algorithm can be implemented using numeric keys with associated data items. In this case, the concept of stability applies and the algorithm can be made stable.
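When keys carry associated data, the elements can no longer be recreated from counts and must be moved. A minimal sketch of a stable variant follows, using a hypothetical `Record` type with an 8-bit key; the counts are converted to starting offsets and the input is scanned in order, so records with equal keys keep their original relative order.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record type: an 8-bit sort key plus associated data.
struct Record
{
    unsigned char key;
    std::string   data;
};

// Stable Counting Sort by key. Unlike the in-place numeric versions,
// this one needs an output array of the same size as the input.
inline std::vector<Record> stableCountingSort(const std::vector<Record>& in)
{
    // Count how many records have each key
    std::array<std::size_t, 256> start{};
    for (const Record& r : in)
        start[r.key]++;

    // Convert counts to the starting output index of each key
    std::size_t offset = 0;
    for (std::size_t& s : start)
    {
        const std::size_t n = s;
        s = offset;
        offset += n;
    }

    // Copy records in input order; equal keys land in arrival order
    std::vector<Record> out(in.size());
    for (const Record& r : in)
        out[start[r.key]++] = r;
    return out;
}
```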

Performance measurements drove the optimization of the implementations, as was illustrated by the performance differences between array initialization and a for loop. Unfortunately, using Intel IPP functions (which utilize SSE parallel instructions) to optimize Counting Sort did not yield a faster implementation. However, these implementations may still be useful, since they use different computational units for portions of the algorithm. Lastly, a hybrid algorithm approach should produce a superior sorting algorithm, and several suggestions based on measurements were provided.

The astonishing performance gains provided by the Counting Sort algorithm warrant consideration of data-type-dependent sorting, where different algorithms are used depending on the data type being sorted; e.g., Counting Sort for 8-bit and 16-bit numeric data types, Radix Sort for larger numeric data types, and STL sort for other types.
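In C++, data-type-dependent sorting can be sketched with function overloading, so that the compiler selects the algorithm from the element type. The names below are illustrative, the returned enum exists only to make the selection visible, and the generic fallback uses `std::sort` for every other type; a Radix Sort overload for larger integers would be added in the same way.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Algorithm { CountingSort, StdSort };

// Counting Sort for small unsigned element types (8-bit or 16-bit).
template <typename T>
void countingSortUnsigned(T* a, std::size_t a_size)
{
    // One zero-initialized count per possible value of T
    std::vector<std::size_t> count(std::size_t{1} << (8 * sizeof(T)));
    for (std::size_t i = 0; i < a_size; i++)
        count[a[i]]++;

    std::size_t n = 0;
    for (std::size_t v = 0; v < count.size(); v++)
    {
        std::fill_n(a + n, count[v], static_cast<T>(v));
        n += count[v];
    }
}

// Generic fallback for all other element types.
template <typename T>
Algorithm sortByType(T* a, std::size_t a_size)
{
    std::sort(a, a + a_size);
    return Algorithm::StdSort;
}

// Non-template overloads are preferred for these exact element types.
inline Algorithm sortByType(std::uint8_t* a, std::size_t a_size)
{
    countingSortUnsigned(a, a_size);
    return Algorithm::CountingSort;
}

inline Algorithm sortByType(std::uint16_t* a, std::size_t a_size)
{
    countingSortUnsigned(a, a_size);
    return Algorithm::CountingSort;
}
```

Callers write `sortByType(array, size)` for any element type, and the choice of algorithm is made at compile time with no run-time dispatch cost.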