First Attempt at Parallel Counting Sort
The performance of the initial Parallel Counting Sort implementation is in Table 1 and Figure 1 for arrays of 8-bit unsigned numbers:
This initial version produced a speedup of up to 3X for large arrays, when running on a quad-core i7 860 CPU with hyper-threading. Parallel Counting Sort is up to 67X faster than STL sort() and 3.3X faster than Intel IPP Radix Sort for medium and large arrays, but is slower for smaller array sizes.
Listing 2 shows the top-level function of the initial Parallel Count Sort implementation. This code looks very similar to the serial implementation shown in Listing 1, except for the initialization of the count array, and the counting for loop.
// Copyright(c), Victor J. Duvanenko, 2010
#include "tbb/tbb.h"
using namespace tbb;
// Listing 2
inline void CountingSort_TBB( unsigned char* a, unsigned long a_size )
{
if ( a_size < 2 ) return;
const unsigned long numberOfCounts = 256;
CountingType count( a ); // contains the count array, which is initialized to all zeros
// Scan the array and count the number of times each value appears
// for( unsigned long i = 0; i < a_size; i++ )
// count[ a[ i ] ]++;
parallel_reduce( blocked_range< unsigned long >( 0, a_size ), count );
// Fill the array with the number of 0's that were counted, followed by the number of 1's, and then 2's and so on
unsigned long n = 0;
for( unsigned long i = 0; i < numberOfCounts; i++ )
for( unsigned long j = 0; j < count.my_count[ i ]; j++, n++ )
a[ n ] = (unsigned char)i;
}
The initialization of the count array has been replaced by a constructor statement CountingType count( a ), which creates a variable of CountingType, and passes the pointer to the input array to its constructor.
The counting for loop has been replaced by the TBB parallel_reduce() function. This function is directed to process the array elements in range of 0 to a_size (i.e., the entire input array) and to use the count variable (which holds the pointer to the input array). The blocked_range is a TBB template class that supports a range capable of recursive splitting. This range includes the beginning elements (0 in our case), but excludes the ending point (a_size). The first argument to the parallel_reduce() function is a constructor for a blocked_range variable.
'Listing 3 shows the CountingType class definition, which contains several methods used by the TBB parallel_reduce() function.
// Copyright(c), Victor J. Duvanenko, 2010
#include "tbb/tbb.h"
using namespace tbb;
class CountingType
{
unsigned char* my_input_array; // a local copy to the array being counted
// to provide a pointer to each parallel task
public:
static const unsigned long numberOfCounts = 256;
unsigned long my_count[ numberOfCounts ]; // the count for this task
CountingType( unsigned char a[] ) : my_input_array( a ) // constructor, which copies the pointer to the array being counted
{
for( unsigned long i = 0; i < numberOfCounts; i++ ) // initialized all counts to zero, since the array may not contain all values
my_count[ i ] = 0;
}
// Method that performs the core work of counting
void operator()( const blocked_range< unsigned long >& r )
{
unsigned char* a = my_input_array; // these local variables are used to help the compiler optimize the code better
size_t end = r.end();
unsigned long* count = my_count;
for( unsigned long i = r.begin(); i != end; ++i )
count[ a[ i ] ]++;
}
// Splitter (splitting constructor) required by the parallel_reduce
// Takes a reference to the original object, and a dummy argument to distinguish this method from a copy constructor
CountingType( CountingType& x, split ) : my_input_array( x.my_input_array )
{
for( unsigned long i = 0; i < numberOfCounts; i++ ) // initialized all counts to zero, since the array may not contain all values
my_count[ i ] = 0;
}
// Join method required by parallel_reduce
// Combines a result from another task into this result
void join( const CountingType& y )
{
for( unsigned long i = 0; i < numberOfCounts; i++ )
my_count[ i ] += y.my_count[ i ];
}
};
The class holds the pointer to the input array and the count array (which is 256 elements -- one element for each possible value of an unsigned 8-bit number). The interface consists of the constructor, which receives the pointer to the input array to be sorted (this pointer is remembered for later use). The constructor also initializes the count array to all zeros. In other words, the CountingType class holds the count[] array and several supporting methods -- it wraps the algorithm in a class that can be passed to a function.
The interface defines several methods that are required by the TBB parallel_reduce() function: the operator(), the splitting constructor, and the join() method. The operator() method implements what the CountingType class does -- the core algorithm. However, a partial range of the input array needs to be handled by this method (from r.begin() to r.end(), but not including r.end()). The body of operator() has a for loop that is very similar to the counting loop of Listing 1.
Parallel_reduce() splits the input array into sub-arrays, which are then processed by each computational core in parallel. To do this efficiently, however, each sub-array needs to have its own independent count[] array that is initialized to zeros and then counts the sub-array elements. In other words, instead of sharing the count[] array among multiple threads, the count[] array is duplicated so that each thread has and operates on its own copy (as illustrated below).
TBB parallel_reduce() recursively splits the input array into P sub-arrays, where P is the number of processors available. During each recursive split, the splitter method is called. After the algorithm completes processing each sub-array, the join method is called, to combine the results from two sub-arrays. This process is illustrated by the diagram below.
The splitter method, which looks like a copy constructor with a dummy split second argument, creates a copy of CountingType variable by copying the pointer to the input array and initializing the count[] array to zeros. This work is necessary to process a new sub-array independently (as shown in the upper half of the graph above -- note count[] and count_1[] are separate arrays).
Once the sub-arrays have been processed separately, each with their own count[] arrays, the results must be combined into a single count[] array. This combining operation is handled by the join() method, which the TBB parallel_reduce() calls. This is illustrated in the lower half of the diagram above. The join() method combines the count[] array from the CountingType argument passed to it, with its own internal count[] array. At the end, a single count[] array remains, just like the serial version of the algorithm.
Listing 4 shows the initial Parallel Counting Sort implementation for arrays of 16-bit unsigned numbers. The performance is shown in Table 2 and Figure 2.
// Copyright(c), Victor J. Duvanenko, 2010
#include "tbb/tbb.h"
using namespace tbb;
class Counting16Type
{
unsigned short* my_input_array; // a local copy to the array being counted
// to provide a pointer to each parallel task
public:
static const unsigned long numberOfCounts = 256 * 256;
unsigned long my_count[ numberOfCounts ]; // the count for this task
Counting16Type( unsigned short a[] ) : my_input_array( a ) // constructor, which copies the pointer to the array being counted
{
for( unsigned long i = 0; i < numberOfCounts; i++ ) // initialized all counts to zero, since the array may not contain all values
my_count[ i ] = 0;
}
// Method that performs the core work of counting
void operator()( const blocked_range< unsigned long >& r )
{
unsigned short* a = my_input_array; // these local variables are used to help the compiler optimize the code better
size_t end = r.end();
unsigned long* count = my_count;
for( unsigned long i = r.begin(); i != end; ++i )
count[ a[ i ] ]++;
}
// Splitter (splitting constructor) required by the parallel_reduce
// Takes a reference to the original object, and a dummy argument to distinguish this method from a copy constructor
Counting16Type( Counting16Type& x, split ) : my_input_array( x.my_input_array )
{
for( unsigned long i = 0; i < numberOfCounts; i++ ) // initialized all counts to zero, since the array may not contain all values
my_count[ i ] = 0;
}
// Join method required by parallel_reduce
// Combines a result from another task into this result
void join( const Counting16Type& y )
{
for( unsigned long i = 0; i < numberOfCounts; i++ )
my_count[ i ] += y.my_count[ i ];
}
};
inline void CountingSort_TBB( unsigned short* a, unsigned long a_size )
{
if ( a_size < 2 ) return;
const unsigned long numberOfCounts = 256 * 256;
Counting16Type count( a ); // contains the count array, which is initialized to all zeros
// Scan the array and count the number of times each value appears
// for( unsigned long i = 0; i < a_size; i++ )
// count[ a[ i ] ]++;
parallel_reduce( blocked_range< unsigned long >( 0, a_size ), count );
// Fill the array with the number of 0's that were counted, followed by the number of 1's, and then 2's and so on
unsigned long n = 0;
for( unsigned long i = 0; i < numberOfCounts; i++ )
for( unsigned long j = 0; j < count.my_count[ i ]; j++, n++ )
a[ n ] = (unsigned short)i;
}
The initial 16-bit unsigned Parallel Counting Sort implementation produced an algorithm that is up to 77X faster than STL sort() for large array sizes, is up to 10X faster than Intel's IPP Radix Sort for the larger array sizes, and is up to 2.5X faster than serial Counting Sort but only for the largest array sizes. This implementation performs inferior to all other algorithms for the smaller array sizes and to the serial Counting Sort for mid-size arrays. A hybrid algorithm implementation could be easily created to produce a high performing algorithm for all array sizes, as was done in previous articles of this series.


