The two implementations of the Counting Sort differed in the way they initialized the count array. The first implementation used a for loop to set each element within the count array to zero, whereas the second implementation used the compact form of array initialization. Array initialization in C sets the unspecified array elements to zero, which is exactly what is needed in this case. It's unclear why array initialization outperforms the for loop implementation, since the count array is a local array (i.e. not a global array) and the compiler has to initialize the count array at run time.

Listing 2 shows a Counting Sort implementation for 16-bit unsigned numbers. Table 2 and Graph 2 show measurement results of these three implementations along with Insertion Sort.

// Copyright(c), Victor J. Duvanenko, 2010 // Listing 2 inline void CountSortInPlace( unsigned short* a, unsigned long a_size ) { if ( a_size < 2 ) return; const unsigned long numberOfCounts = 65536; // one count for each possible value of an 16-bit element (0-65535/0xffff) unsigned long count[ numberOfCounts ] = { 0 }; // pre-initializing to zero // Scan the array and count the number of times each value appears for( unsigned long i = 0; i < a_size; i++ ) count[ a[ i ] ]++; // Fill the array with the number of 0's that were counted, followed by the number of 1's, and then 2's and so on unsigned long n = 0; for( unsigned long i = 0; i < numberOfCounts; i++ ) for( unsigned long j = 0; j < count[ i ]; j++, n++ ) a[ n ] = (unsigned short)i; }

Counting Sort of arrays of 16-bit elements significantly outperforms other algorithms tested for large array sizes (1 million elements and larger). It outperforms STL **sort()** from 10K element array sizes from 2-30X, outperforms Hybrid 256-Radix Stable Sort from 100K elements by 1.6-6.7X, and IPP radix sort from 1 million elements by 2.9-4X. However, it lags in performance for small and mid-size arrays by up to several orders of magnitude. Most likely this is due to the L2 cache size, since 100K element array takes 200 Kbytes of memory space and fits within the L2 cache, whereas larger arrays do not. These measurements demonstrate that a hybrid algorithm approach (combining several different sorting algorithms) should be beneficial.