Algorithm Improvement through Performance Measurement: Part 2

In Parallel Counting Sort: Part 1, I transformed the Counting Sort algorithm into a parallel implementation, achieving a 3X speedup when sorting arrays of 8-bit numbers and 2.5X when sorting arrays of 16-bit numbers, compared to the sequential/scalar implementation running on a quad-core hyper-threaded processor. To simplify parallel programming, I used Intel Threading Building Blocks (TBB). Counting Sort is an **O**(n) algorithm with a small constant term, performing **n** reads and **n** writes when sorting arrays of 8-bit or 16-bit numbers. The parallel implementation is up to 65X faster than STL **sort()** for large arrays of 8-bit numbers, and up to 77X faster for large arrays of 16-bit numbers.
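To make the O(n) claim concrete, here is a minimal sequential Counting Sort sketch for unsigned 8-bit values: one pass of n reads to build the counts, followed by n writes to regenerate the array in sorted order. This is a simplified illustration for reference, not the article's TBB implementation.

```cpp
#include <cstddef>

// Minimal sequential Counting Sort for unsigned 8-bit values.
// One counter per possible value (256 of them); n reads + n writes overall.
inline void countingSort8( unsigned char* a, size_t a_size )
{
    if ( a_size < 2 ) return;
    unsigned long count[ 256 ] = { 0 };
    for ( size_t i = 0; i < a_size; i++ )       // n reads: tally each value
        count[ a[ i ]]++;
    size_t n = 0;
    for ( unsigned v = 0; v < 256; v++ )        // n writes in total: emit each
        for ( unsigned long j = 0; j < count[ v ]; j++ )    // value count[v] times
            a[ n++ ] = (unsigned char)v;
}
```

The work is n reads plus n writes plus a fixed 256-entry sweep, which is why the constant term stays small for 8-bit (and 16-bit, with 65,536 counters) element types.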

In Part 1, a portion of the algorithm was not converted to run in parallel on multiple processor cores. In this installment, I parallelize the rest of the algorithm, pushing the limits of the processor and parallel programming capabilities.

### Templatized Starting Point

Listing 1 shows a portion of the Parallel Counting Sort implementation for arrays of unsigned 8-bit and 16-bit numbers, which will be the starting point. The core of this implementation was developed in Part 1, except for the two templatizations.

```cpp
// Copyright(c), Victor J. Duvanenko, 2010
// Listing 1 of Part 2
template< class _Type >
inline void _internal_CountSortInPlace_TBB_L1( _Type* a, unsigned long a_size )
{
    if ( a_size < 2 ) return;
    CountTemplateType< _Type > count( a );  // contains the count array, which
                                            // is initialized to all zeros
    // Scan the array and count the number of times each value appears
    parallel_reduce( blocked_range< unsigned long >( 0, a_size ), count );
    // add grain size of 10K as the third argument to blocked_range constructor
    for( unsigned long n = 0, i = 0; i < count.numberOfCounts; i++ )
        for( unsigned long j = 0; j < count.my_count[ i ]; j++ )
            a[ n++ ] = (_Type)i;
}
inline void CountSortInPlace_TBB_L1( unsigned char* a, unsigned long a_size )
{
    _internal_CountSortInPlace_TBB_L1< unsigned char >( a, a_size );
}
inline void CountSortInPlace_TBB_L1( unsigned short* a, unsigned long a_size )
{
    _internal_CountSortInPlace_TBB_L1< unsigned short >( a, a_size );
}
```

At the top level, the two **CountSortInPlace_TBB_L1()** functions are the ones users call: one for sorting an array of unsigned 8-bit numbers, and the other for unsigned 16-bit numbers. These two overloaded functions are wrappers over the **_internal_CountSortInPlace_TBB_L1()** template function, restricting usage to arrays of unsigned 8-bit and 16-bit numbers. **_internal_CountSortInPlace_TBB_L1()** is the first templatization, handling both the unsigned 8-bit and 16-bit implementations.

**CountTemplateType** is a templatization of the **CountType** class developed in Part 1, handling both unsigned 8-bit and 16-bit data types. Listing 2 shows the **CountTemplateType** implementation, where the main difference is the ability to set the count array size based on the data type: the **numberOfCounts** constant is derived from the **_Type** passed to the template.

```cpp
// Copyright(c), Victor J. Duvanenko, 2010
// Listing 2 of Part 2
template< class _Type >
class CountTemplateType
{
    _Type* my_input_array;      // a local copy of the pointer to the array being
                                // counted, to provide a pointer to each parallel task
public:
    static const unsigned long numberOfCounts = 1 << ( sizeof( _Type ) * 8 );
    __declspec( align( 32 )) unsigned long my_count[ numberOfCounts ];  // the count for this task

    CountTemplateType( _Type a[] ) : my_input_array( a )    // constructor, which copies the
                                                            // pointer to the array being counted
    {
        for( unsigned long i = 0; i < numberOfCounts; i++ ) // initialize all counts to zero,
            my_count[ i ] = 0;                              // since the array may not contain all values
    }
    // Performs the counting portion of the algorithm
    void operator()( const blocked_range< unsigned long >& r )
    {
        _Type* a = my_input_array;          // these local variables are used to help
        size_t end = r.end();               // the compiler optimize the code better
        unsigned long* count = my_count;
        for( unsigned long i = r.begin(); i != end; ++i )
            count[ a[ i ]]++;
    }
    // Splitter (splitting constructor) required by parallel_reduce
    // Takes a reference to the original object, and a dummy argument to
    // distinguish this method from a copy constructor
    CountTemplateType( CountTemplateType& x, split ) : my_input_array( x.my_input_array )
    {
        for( unsigned long i = 0; i < numberOfCounts; i++ ) // initialize all counts to zero,
            my_count[ i ] = 0;                              // since the array may not contain all values
    }
    // Join method required by parallel_reduce
    // Combines a result from another task into this result
    void join( const CountTemplateType& y )
    {
        for( unsigned long i = 0; i < numberOfCounts; i++ )
            my_count[ i ] += y.my_count[ i ];
    }
};
```

The rest of the Parallel Counting Sort implementation has a few slight improvements over Part 1. The overall implementation is more compact due to its use of templates, while being as simple as the one in Part 1.

The Parallel Counting Sort uses Intel TBB **parallel_reduce()** to split the counting portion of the algorithm among multiple cores. Each core keeps track of its own counts independently, processing its portion of the input array, and the results are merged/joined when the cores are done processing. **parallel_reduce()** enables the Parallel Counting Sort to scale with the number of processors available, while remaining oblivious to the actual number of processors it is running on.
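The split/count/join pattern that **parallel_reduce()** manages can be sketched without TBB, using plain **std::thread**: each worker counts its own sub-range into a private count array (the "split"), and the partial counts are summed afterwards (the "join"), mirroring what the splitting constructor and **join()** in Listing 2 do. The hand-rolled chunking below is a stand-in for TBB's scheduler, for illustration only.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Sketch of the parallel counting step for 8-bit values: each worker tallies
// its own chunk into a private count array, then the partials are joined.
// TBB's parallel_reduce does this splitting and merging automatically.
inline void parallelCount8( const unsigned char* a, size_t a_size,
                            unsigned long (&total)[ 256 ], unsigned numTasks = 4 )
{
    std::vector< std::vector< unsigned long > > partial(
        numTasks, std::vector< unsigned long >( 256, 0 ));
    std::vector< std::thread > workers;
    size_t chunk = ( a_size + numTasks - 1 ) / numTasks;
    for ( unsigned t = 0; t < numTasks; t++ )
    {
        size_t begin = t * chunk;
        size_t end   = ( begin + chunk < a_size ) ? begin + chunk : a_size;
        workers.emplace_back( [=, &partial] {
            for ( size_t i = begin; i < end; i++ )  // "split": count own chunk
                partial[ t ][ a[ i ]]++;
        });
    }
    for ( auto& w : workers ) w.join();
    for ( unsigned v = 0; v < 256; v++ )            // "join": sum partial counts
    {
        total[ v ] = 0;
        for ( unsigned t = 0; t < numTasks; t++ )
            total[ v ] += partial[ t ][ v ];
    }
}
```

Because each worker writes only to its private count array, no locking is needed during the counting pass; the only serial work is the fixed-size merge at the end.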

The final step of the algorithm writes the sorted values back into the original input array (**a[]**), since Counting Sort is in-place; this final writing step generates the sorted result. It is accomplished by the two nested for loops in Listing 1, running on a single processor. Accelerating this final writing step of the algorithm is the focus of this article.
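One standard way to open this writing step up to parallelism (not necessarily the route this series takes) is to first compute each value's starting offset in the output via an exclusive prefix sum over the counts; each value's run then has a known, non-overlapping destination range and could be written by an independent task. A sequential sketch of the offset computation, using a hypothetical helper:

```cpp
// Sketch: an exclusive prefix sum over the counts gives each value's starting
// index in the sorted output. Value v occupies a[offset[v] .. offset[v] + count[v] - 1],
// so the 256 runs could in principle be written concurrently.
// (computeOffsets is an illustrative helper, not code from the article.)
inline void computeOffsets( const unsigned long count[ 256 ], unsigned long offset[ 256 ] )
{
    unsigned long sum = 0;
    for ( unsigned v = 0; v < 256; v++ )
    {
        offset[ v ] = sum;      // where value v's run begins in the output
        sum += count[ v ];
    }
}
```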