### First Attempt at Parallel Counting Sort

The performance of the initial Parallel Counting Sort implementation is in Table 1 and Figure 1 for arrays of 8-bit unsigned numbers:

This initial version produced a speedup of up to 3X for large arrays, when running on a quad-core i7 860 CPU with hyper-threading. Parallel Counting Sort is up to 67X faster than STL **sort()** and 3.3X faster than Intel IPP Radix Sort for medium and large arrays, but is slower for smaller array sizes.

Listing 2 shows the top-level function of the initial Parallel Count Sort implementation. This code looks very similar to the serial implementation shown in Listing 1, except for the initialization of the count array, and the counting for loop.

// Copyright(c), Victor J. Duvanenko, 2010 #include "tbb/tbb.h" using namespace tbb; // Listing 2 inline void CountingSort_TBB( unsigned char* a, unsigned long a_size ) { if ( a_size < 2 ) return; const unsigned long numberOfCounts = 256; CountingType count( a ); // contains the count array, which is initialized to all zeros // Scan the array and count the number of times each value appears // for( unsigned long i = 0; i < a_size; i++ ) // count[ a[ i ] ]++; parallel_reduce( blocked_range< unsigned long >( 0, a_size ), count ); // Fill the array with the number of 0's that were counted, followed by the number of 1's, and then 2's and so on unsigned long n = 0; for( unsigned long i = 0; i < numberOfCounts; i++ ) for( unsigned long j = 0; j < count.my_count[ i ]; j++, n++ ) a[ n ] = (unsigned char)i; }

The initialization of the count array has been replaced by a constructor statement **CountingType count( a )**, which creates a variable of **CountingType**, and passes the pointer to the input array to its constructor.

The counting **for** loop has been replaced by the TBB **parallel_reduce()** function. This function is directed to process the array elements in range of 0 to **a_size** (i.e., the entire input array) and to use the **count** variable (which holds the pointer to the input array). The **blocked_range** is a TBB template class that supports a range capable of recursive splitting. This range includes the beginning elements (0 in our case), but excludes the ending point (**a_size**). The first argument to the **parallel_reduce()** function is a constructor for a **blocked_range** variable.

Listing 3 shows the **CountingType** class definition, which contains several methods used by the TBB **parallel_reduce()** function.

// Copyright(c), Victor J. Duvanenko, 2010 #include "tbb/tbb.h" using namespace tbb; class CountingType { unsigned char* my_input_array; // a local copy to the array being counted // to provide a pointer to each parallel task public: static const unsigned long numberOfCounts = 256; unsigned long my_count[ numberOfCounts ]; // the count for this task CountingType( unsigned char a[] ) : my_input_array( a ) // constructor, which copies the pointer to the array being counted { for( unsigned long i = 0; i < numberOfCounts; i++ ) // initialized all counts to zero, since the array may not contain all values my_count[ i ] = 0; } // Method that performs the core work of counting void operator()( const blocked_range< unsigned long >& r ) { unsigned char* a = my_input_array; // these local variables are used to help the compiler optimize the code better size_t end = r.end(); unsigned long* count = my_count; for( unsigned long i = r.begin(); i != end; ++i ) count[ a[ i ] ]++; } // Splitter (splitting constructor) required by the parallel_reduce // Takes a reference to the original object, and a dummy argument to distinguish this method from a copy constructor CountingType( CountingType& x, split ) : my_input_array( x.my_input_array ) { for( unsigned long i = 0; i < numberOfCounts; i++ ) // initialized all counts to zero, since the array may not contain all values my_count[ i ] = 0; } // Join method required by parallel_reduce // Combines a result from another task into this result void join( const CountingType& y ) { for( unsigned long i = 0; i < numberOfCounts; i++ ) my_count[ i ] += y.my_count[ i ]; } };

The class holds the pointer to the input array and the count array (which is 256 elements -- one element for each possible value of an unsigned 8-bit number). The interface consists of the constructor, which receives the pointer to the input array to be sorted (this pointer is remembered for later use). The constructor also initializes the count array to all zeros. In other words, the **CountingType** class holds the **count[]** array and several supporting methods -- it wraps the algorithm in a class that can be passed to a function.

The interface defines several methods that are required by the TBB **parallel_reduce()** function: the **operator()**, the splitting constructor, and the **join()** method. The **operator()** method implements what the **CountingType** class does -- the core algorithm. However, a partial range of the input array needs to be handled by this method (from **r.begin()** to **r.end()**, but not including **r.end()**). The body of **operator()** has a for loop that is very similar to the counting loop of Listing 1.

**Parallel_reduce()** splits the input array into sub-arrays, which are then processed by each computational core in parallel. To do this efficiently, however, each sub-array needs to have its own independent **count[]** array that is initialized to zeros and then counts the sub-array elements. In other words, instead of sharing the **count[]** array among multiple threads, the **count[]** array is duplicated so that each thread has and operates on its own copy (as illustrated below).

TBB **parallel_reduce()** recursively splits the input array into P sub-arrays, where P is the number of processors available. During each recursive split, the splitter method is called. After the algorithm completes processing each sub-array, the join method is called, to combine the results from two sub-arrays. This process is illustrated by the diagram below.

The **splitter** method, which looks like a copy constructor with a dummy split second argument, creates a copy of **CountingType** variable by copying the pointer to the input array and initializing the **count[]** array to zeros. This work is necessary to process a new sub-array independently (as shown in the upper half of the graph above -- note **count[]** and **count_1[]** are separate arrays).

Once the sub-arrays have been processed separately, each with their own **count[]** arrays, the results must be combined into a single **count[]** array. This combining operation is handled by the **join()** method, which the TBB **parallel_reduce()** calls. This is illustrated in the lower half of the diagram above. The **join()** method combines the **count[]** array from the **CountingType** argument passed to it, with its own internal **count[]** array. At the end, a single **count[]** array remains, just like the serial version of the algorithm.

Listing 4 shows the initial Parallel Counting Sort implementation for arrays of 16-bit unsigned numbers. The performance is shown in Table 2 and Figure 2.

// Copyright(c), Victor J. Duvanenko, 2010 #include "tbb/tbb.h" using namespace tbb; class Counting16Type { unsigned short* my_input_array; // a local copy to the array being counted // to provide a pointer to each parallel task public: static const unsigned long numberOfCounts = 256 * 256; unsigned long my_count[ numberOfCounts ]; // the count for this task Counting16Type( unsigned short a[] ) : my_input_array( a ) // constructor, which copies the pointer to the array being counted { for( unsigned long i = 0; i < numberOfCounts; i++ ) // initialized all counts to zero, since the array may not contain all values my_count[ i ] = 0; } // Method that performs the core work of counting void operator()( const blocked_range< unsigned long >& r ) { unsigned short* a = my_input_array; // these local variables are used to help the compiler optimize the code better size_t end = r.end(); unsigned long* count = my_count; for( unsigned long i = r.begin(); i != end; ++i ) count[ a[ i ] ]++; } // Splitter (splitting constructor) required by the parallel_reduce // Takes a reference to the original object, and a dummy argument to distinguish this method from a copy constructor Counting16Type( Counting16Type& x, split ) : my_input_array( x.my_input_array ) { for( unsigned long i = 0; i < numberOfCounts; i++ ) // initialized all counts to zero, since the array may not contain all values my_count[ i ] = 0; } // Join method required by parallel_reduce // Combines a result from another task into this result void join( const Counting16Type& y ) { for( unsigned long i = 0; i < numberOfCounts; i++ ) my_count[ i ] += y.my_count[ i ]; } }; inline void CountingSort_TBB( unsigned short* a, unsigned long a_size ) { if ( a_size < 2 ) return; const unsigned long numberOfCounts = 256 * 256; Counting16Type count( a ); // contains the count array, which is initialized to all zeros // Scan the array and count the number of times each value appears // for( unsigned long i = 0; i < a_size; i++ ) // count[ a[ i ] ]++; parallel_reduce( blocked_range< unsigned long >( 0, a_size ), count ); // Fill the array with the number of 0's that were counted, followed by the number of 1's, and then 2's and so on unsigned long n = 0; for( unsigned long i = 0; i < numberOfCounts; i++ ) for( unsigned long j = 0; j < count.my_count[ i ]; j++, n++ ) a[ n ] = (unsigned short)i; }

The initial 16-bit unsigned Parallel Counting Sort implementation produced an algorithm that is up to 77X faster than STL **sort()** for large array sizes, is up to 10X faster than Intel's IPP Radix Sort for the larger array sizes, and is up to 2.5X faster than serial Counting Sort but only for the largest array sizes. This implementation performs inferior to all other algorithms for the smaller array sizes and to the serial Counting Sort for mid-size arrays. A hybrid algorithm implementation could be easily created to produce a high performing algorithm for all array sizes, as was done in previous articles of this series.