Algorithm Improvement through Performance Measurement: Part 1

Algorithm Improvement through Performance Measurement: Part 2

Algorithm Improvement through Performance Measurement: Part 3

Algorithm Improvement through Performance Measurement: Part 4

Algorithm Improvement through Performance Measurement: Part 5

Counting Sort is a simple and very fast linear time ** O(n)** sorting algorithm, with practical limitations which bound its applicability. When it is applicable, such as sorting arrays of 8-bit or 16-bit integers, it has been shown to significantly outperform other sorting algorithms, such as STL

**sort()**by 22-30X for arrays of 8-bit numbers, and 22-25X for large arrays of 16-bit numbers, as described in Part 3 of this series. (STL

**sort()**is a generic hybrid algorithm of quicksort, Heap Sort, and Insertion Sort.) Counting Sort also outperforms Binary-Radix Sort, N-bit-Radix Sort, Insertion Sort, and Selection Sort for 8-bit and 16-bit numbers. It was used in previous installments of this series to create a high-performance hybrid sorting algorithm combinations that adapted to the data type being sorted, performing better than sorting methods based on a single algorithm.

In this article, I examine Counting Sort in detail, extended to support signed 8-bit and 16-bit numbers, followed by performance optimizations, resulting in superior sorting performance for unsigned and signed 8-bit and 16-bit arrays.

### Algorithm

When sorting an array of numbers the Counting Sort algorithm consist of two steps: counting and writing. To sort an array of 8-bit numbers, where each array element can have a value between 0 and 255, for example, Counting Sort uses 256 counters. Each counter stores the number of times that a particular value has been encountered within the array, as the array is scanned from the first element to the last. Thus, **counter[0]** contains the count of how many zeros are in the array, **count[1]** holds the count of how many ones are in the array, **count[2]** has the count of how many twos are in the array, and so on, until **count[255]** which holds the count of how many 255's are in the array. Once the counts have been collected, by scanning through the array in a single pass, the second stage of the algorithm can proceed. This writing stage starts with the **count[0]**, takes its count value, and stores that many zeros back into the array, starting at the first element of the array. Then it looks at the **count[1]** value, and stores that many ones into the array, and so on. The result is a sorted array of 8-bit numbers.

Listing 1 shows an implementation of the Counting Sort for arrays of unsigned 8-bit numbers from the In-place Hybrid N-bit-Radix Sort in Part 3 of this series.

// Copyright(c), Victor J. Duvanenko, 2010 // Listing 1 inline void CountSortInPlace( unsigned char* a, unsigned long a_size ) { if ( a_size < 2 ) return; const unsigned long numberOfCounts = 256; unsigned long count[ numberOfCounts ]; for( unsigned long i = 0; i < numberOfCounts; i++ ) // initialized all counts to zero, since the array may not contain all values count[ i ] = 0; // !!! It should be possible to use array initialization and remove this overhead!!! // Scan the array and count the number of times each value appears for( unsigned long i = 0; i < a_size; i++ ) count[ a[ i ] ]++; // Fill the array with the number of 0's that were counted, followed by the number of 1's, and then 2's and so on unsigned long n = 0; for( unsigned long i = 0; i < numberOfCounts; i++ ) for( unsigned long j = 0; j < count[ i ]; j++ ) a[ n++ ] = (unsigned char)i; } inline void CountSortInPlace_2( unsigned char* a, unsigned long a_size ) { if ( a_size < 2 ) return; const unsigned long numberOfCounts = 256; // one count for each possible value of an 8-bit element (0-255) unsigned long count[ numberOfCounts ] = { 0 }; // count array is initialized to zero by the compiler // Scan the array and count the number of times each value appears for( unsigned long i = 0; i < a_size; i++ ) count[ a[ i ] ]++; // Fill the array with the number of 0's that were counted, followed by the number of 1's, and then 2's and so on unsigned long n = 0; for( unsigned long i = 0; i < numberOfCounts; i++ ) for( unsigned long j = 0; j < count[ i ]; j++ ) a[ n++ ] = (unsigned char)i; }

The first portion of the algorithm creates an array of 256 counters and initializes them to zero, using the first for loop. The second for loop scans the array from the first element to the last, taking each array element and using its value as an index into the **count[]** array, incrementing that particular count. For example, an element of value **3** will increment **count[3]**. Once the counts have been gathered, the writing step begins. In this step, the outer loop walks through every element of the count array, starting with **count[0] **. It uses the value of **count[0]**, and writes that many zeros to the array. Then it looks at the value of **count[1]**, and writes that many ones to the array, and so on.

As a concrete example, the following array of unsigned 8-bit values is to be sorted by using the Counting Sort:

0, 1, 1, 3, 1, 3, 3, 0, 0, 0, 1, 3, 3, 1, 0, 3, 1

The algorithm scans the array from the first element on the left to the last element on the right. In this example, counters 0 through 3 are used, and all counters are initialized to zeroes. The first element is a 0, and thus **count[0]** is incremented. The second element is a 2, and thus **count[2]** is incremented. This method is continued until all 17 array elements have been processed. At the end of this counting step (the second for loop in Listing 1) the counts would be:

count[0] = 5 count[1] = 6 count[2] = 0 count[3] = 6 ... count[255] = 0

The third for loop (the writing step) starts at **count[0]** and writes the value of zero five times -- into **a[0]**, **a[1]**, **a[2]**, **a[3]**, and **a[4]**. Then writes the value of one six times (into **a[5]** through **a[10]**), followed by no values of 2, followed by writing the value of three six times (into **a[11]** through **a[16]**). The resultant array is:

0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3

which is exactly what was counted: 5 zeros, 6 ones, no twos, and 6 threes. The result is a sorted numerical array. The input array was scanned two times from beginning to end: reading the array during the counting step, and writing the array during the writing step. Thus, the algorithm is ** O(n)**, with the amount of processing equal to

**2n**.

Table 1 and Figure 1 show performance measurements for the Counting Sort along with several other sorting algorithms, such as STL **sort()**, Insertion Sort, Binary-Radix Sort, and 256-Radix Stable Sort from Part 3. The Counting Sort #2 is the second implementation in Listing 1, which initializes the array to zero (where the compiler inserts code to initialize the local count array).

The above measurements show that the Count Sort is very efficient algorithm for 8-bit unsigned numbers and outperforms other algorithms by a wide margin. It outperforms STL sort() for all but the smallest array size, by over 20X in performance for 100K array size and above. At 10K array sizes and above it outperforms Intel's IPP Radix Sort by 20-30%, when count array initialization is used, but lags in performance when the for loop is used for clearing the count array. Insertion Sort is about 4X faster for 10 element arrays, but substantially lags in performance for larger array sizes (by up to 139K times), since it is ** O(n^{2})** algorithm and the Counting Sort is

**.**

*O*(n)

The setup for performance comparison setup was as follows: single-threaded 32-bit command prompt application, compiled using Visual Studio 2008, optimization project setting is set to optimize for speed and inline any suitable function. The application ran on Intel Core i7 860 at 2.8 GHz Quad-Core CPU with hyper-threading (1 Mbyte L2 cache, and 8 MBytes of shared L3 cache) -- 8 GBytes of system memory (dual-channel 1333 MHz DDR3 attached directly to the CPU), running 64-bit Windows 7 operating system.