Channels ▼
RSS

C/C++

Parallel Counting Sort (Part 2)



Algorithm Improvement through Performance Measurement: Part 1

Algorithm Improvement through Performance Measurement: Part 2

Algorithm Improvement through Performance Measurement: Part 3

Algorithm Improvement through Performance Measurement: Part 4

Algorithm Improvement through Performance Measurement: Part 5

Algorithm Improvement through Performance Measurement: Part 6

Algorithm Improvement through Performance Measurement: Part 7


In Parallel Counting Sort: Part 1 I transformed the Counting Sort algorithm into a parallel implementation, resulting in a 3X speedup when sorting arrays of 8-bit numbers, and 2.5X when sorting arrays of 16-bit numbers, comparing to the sequential/scalar implementation running on a quad-core hyper-threaded processor. To simplify parallel programming, I used the Intel Threaded Building Blocks (TBB). Counting Sort is a O(n) algorithm, with a small constant term, performing n reads and n writes, when sorting arrays of 8-bit or 16-bit numbers. The parallel implementation is up to 65X faster than STL sort() for large arrays of 8-bit numbers, and up to 77X faster for large arrays of 16-bit numbers.

In Part 1, a portion of the algorithm was not converted to run in parallel on multiple processor cores. In this installment, I parallelize the rest of the algorithm, pushing the limits of the processor and parallel programming capabilities.

Templatized Starting Point

Listing 1 shows a portion of the Parallel Counting Sort implementation for arrays of unsigned 8-bit and 16-bit numbers, which will be the starting point. The core of this implementation was developed in Part 1, except for the two templatizations.


// Copyright(c), Victor J. Duvanenko, 2010

// Listing 1 of Part 2
template< class _Type >
inline void _internal_CountSortInPlace_TBB_L1( _Type* a, unsigned long a_size )
{
	if ( a_size <  2 )	return;

	CountTemplateType< _Type > count( a );   // contains the count array, which 
                                                                                             // is initialized to all zeros

	// Scan the array and count the number of times each value appears
	parallel_reduce( blocked_range< unsigned long >( 0, a_size ), count );  
                   // add grain size of  10K as the third argument to blocked_range constructor

	for( unsigned long n = 0, i = 0; i < count.numberOfCounts; i++ )
		for( unsigned long j = 0; j < count.my_count[ i ]; j++ )
			a[ n++ ] = (_Type)i;
}
inline void CountSortInPlace_TBB_L1( unsigned char* a, unsigned long a_size )
{
	_internal_CountSortInPlace_TBB_L1< unsigned char >( a, a_size );
}
inline void CountSortInPlace_TBB_L1( unsigned short* a, unsigned long a_size )
{
	_internal_CountSortInPlace_TBB_L1< unsigned short >( a, a_size );
}

Listing 1

At the top-level, the two CountSortInPlace_TBB_L1() functions are the ones the users will be calling: one for sorting an array of unsigned 8-bit numbers, and the other for unsigned 16-bit numbers. These two overloaded functions are wrappers over the _internal_CountSortInPlace_TBB_L1() template function, to restrict usage only to arrays of unsigned 8-bit and 16-bit numbers. _internal_CountSortInPlace_TBB_L1() is the first templatization, handling both unsigned 8-bit and 16-bit implementations.

CountTemplateType is a templatization of the CountType developed in Part 1, to handle both unsigned 8-bit and 16-bit data types. Listing 2 shows CountTemplateType implementation, where the main difference is the ability to set the count array size based on the data type. The numberOfCounts constant is derived from the _Type passed to the template.


// Copyright(c), Victor J. Duvanenko, 2010

// Listing 2 of Part 2
template< class _Type >
class CountTemplateType {
	_Type* my_input_array;		// a local copy to the array being counted
													// to provide a pointer to each parallel task
public:
	static const unsigned long numberOfCounts = 1 << ( sizeof( _Type ) * 8 );
	__declspec( align(32)) unsigned long my_count[ numberOfCounts ];				// the count for this task

	CountTemplateType( _Type a[] ) : my_input_array( a )  // constructor, which copies the 
                                                                                                       // pointer to the array being counted
	{
		for( unsigned long i = 0; i < numberOfCounts; i++ ) // initialized all counts to zero, since the array may not contain all values
			my_count[ i ] = 0;
	}
	// Performs the counting portion of the algorithm
	void operator()( const blocked_range< unsigned long >& r )
	{
		_Type*          a    = my_input_array;  // these local variables are used to help 
                                                                                        // the compiler optimize the code better
		size_t         end   = r.end();
		unsigned long* count = my_count;
		for( unsigned long i = r.begin(); i != end; ++i )
			count[ a[ i ] ]++;
	}
	// Splitter (splitting constructor) required by the parallel_reduce
	// Takes a reference to the original object, and a dummy argument to 
              //    distinguish this method from a copy constructor
	CountTemplateType( CountTemplateType& x, split ) : my_input_array( x.my_input_array )
	{
		for( unsigned long i = 0; i < numberOfCounts; i++ )   // initialized all counts to zero,
                                                                                                // since the array may not contain all values
			my_count[ i ] = 0;
	}
	// Join method required by parallel_reduce
	// Combines a result from another task into this result
	void join( const CountTemplateType& y )
	{
		for( unsigned long i = 0; i < numberOfCounts; i++ )
			my_count[ i ] += y.my_count[ i ];
	}
};

Listing 2

The rest of the Parallel Counting Sort implementation has a few slight improvements over Part 1. The overall implementation is more compact due to its use of templates, while being as simple as the one in Part 1.

The Parallel Counting Sort uses Intel TBB parallel_reduce() to split the counting portion of the algorithm among multiple cores. Each core keeps track of its own counts independently, processing its portion of the input array. The results are merged/joined when the cores are done processing. The parallel_reduce() enables the Parallel Counting Sort to scale with the number of processors available, while being oblivious to the actual number of processors that its running on.

The final step of the algorithm writes the sorted values back into the original input array (a[]), since Counting Sort is in-place. In this final writing step, the algorithm generates the sorted result array. This operation is accomplished by the two nested for loops, running on a single processor. Accelerating this final writing step of the algorithm, will be the focus of this article.


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video