Parallel Counting Sort (Part 2)


Multiple Possibilities

Intel TBB provides the parallel_for construct, which is logically equivalent to executing every iteration of a for loop in parallel. The prototype is:


template< typename Range, typename Body >
void parallel_for( const Range& range, const Body& body, [partitioner] );

The first argument is the range over which the parallel_for will execute. The range must be capable of recursively splitting itself, and parallel_for uses this capability to divide the work among tasks. The second argument is the body class, whose "()" operator is invoked for each chunk of the range. parallel_for also supports several variations, such as specifying the first and last index, a step size, and several different partitioners that control how the work is split among cores, as sketched below.
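
As a point of reference, here is a minimal, hypothetical sketch (not part of the article's listings) of the two parallel_for forms described above, assuming the Intel TBB headers are available; the FillRangeBody and FillIndexBody names are made up for illustration.

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/partitioner.h"

// Body for the range/body form: operator() is invoked once per sub-range.
struct FillRangeBody
{
	unsigned long* my_a;
	FillRangeBody( unsigned long* a ) : my_a( a ) {}
	void operator()( const tbb::blocked_range< unsigned long >& r ) const
	{
		for( unsigned long i = r.begin(); i != r.end(); ++i )
			my_a[ i ] = i;
	}
};

// Functor for the compact index form: invoked once per index.
struct FillIndexBody
{
	unsigned long* my_a;
	FillIndexBody( unsigned long* a ) : my_a( a ) {}
	void operator()( unsigned long i ) const { my_a[ i ] = i; }
};

void fillExamples( unsigned long* a, unsigned long n )
{
	// Range/body form: the range splits recursively; a grain size of 10000 and
	// simple_partitioner control how finely the work is subdivided.
	tbb::parallel_for( tbb::blocked_range< unsigned long >( 0, n, 10000 ),
	                   FillRangeBody( a ), tbb::simple_partitioner() );

	// Compact form: first index, last index (exclusive), per-index functor.
	tbb::parallel_for( 0UL, n, FillIndexBody( a ) );
}

The range/body form with blocked_range and a grain size is the one used by the listings below.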

The last two nested for loops of the Parallel Counting Sort in Listing 1 are responsible for writing the sorted result array and have not yet been parallelized. Several options are available when converting these sequential for loops to parallel_for: do not parallelize the for loops, parallelize only the outer for loop, parallelize only the inner for loop, or parallelize both for loops. Listing 1 implements the first option, while Listings 3, 4, and 5 show the three parallel implementations.


// Copyright(c), Victor J. Duvanenko, 2010

// Listing 3 of Part 2
template< class _Type >
class WriteOperation_for
{
	_Type*          my_a;  // a local copy of the pointer to the input array, to provide a pointer to each parallel task
	unsigned long*  my_count;
public:
	static const unsigned long numberOfBins = 1 << ( sizeof( _Type ) * 8 );

	WriteOperation_for( _Type a[], unsigned long* count ) : my_a( a ), my_count( count )
	{}
	// Operates over the range of counts
	void operator()( const blocked_range< unsigned long >& r ) const
	{
		unsigned long n = 0;
		for( unsigned long i = 0; i != r.begin(); ++i ) 	// compute the starting point of the r.begin() bin
			n += my_count[ i ];
		for( unsigned long i = r.begin(); i != r.end(); ++i )
		{
			unsigned long   numValues = my_count[ i ];	// helps the compiler optimize
			_Type*          a         = my_a;
			for( unsigned long j = 0; j < numValues; j++ )
				a[ n++ ] = (_Type)i;
		}
	}
};

template< class _Type >
inline void CountSortInPlace_TBB_L2( _Type* a, unsigned long a_size )
{
	if ( a_size <  2 )	return;

	CountTemplateType< _Type > count( a ); // contains count array; initialized to all zeros

	// Scan the array and count the number of times each value appears
	parallel_reduce( blocked_range< unsigned long >( 0, a_size ), count ); // grain size of 10K works well

	WriteOperation_for< _Type >  writeOp( a, count.my_count );
	parallel_for( blocked_range< unsigned long >( 0, count.numberOfCounts ), writeOp );
}

Listing 3
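
For context, a hypothetical call site for Listing 3 might look like the following; the sortBytes wrapper and the explicit task_scheduler_init object are assumptions for illustration and are not part of the article's code.

#include "tbb/task_scheduler_init.h"

void sortBytes( unsigned char* a, unsigned long a_size )
{
	tbb::task_scheduler_init init;        // create the TBB worker thread pool
	CountSortInPlace_TBB_L2( a, a_size ); // parallel counting phase, parallel outer write loop
}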


// Copyright(c), Victor J. Duvanenko, 2010

//Listing 4 Part 2
template< class _Type >
class SetOperation_for
{
	_Type* my_a;		// a local copy of the pointer to the input array,
				// to provide a pointer to each parallel task
	_Type  my_setValue;	// the value to set the array to
public:
	SetOperation_for( _Type a[], _Type setValue ) : my_a( a ), my_setValue( setValue )
	{}
	void operator()( const blocked_range< unsigned long >& r ) const
	{
		for( unsigned long n = r.begin(); n != r.end(); ++n )
			my_a[ n ] = my_setValue;
	}
};

template< class _Type >
inline void CountSortInPlace_TBB_L3( _Type* a, unsigned long a_size )
{
	if ( a_size <  2 )	return;

	CountTemplateType< _Type > count( a ); // contains count array;  initialized to all zeros

	// Scan the array and count the number of times each value appears
	parallel_reduce( blocked_range< unsigned long >( 0, a_size ), count );// grain size of 10K works well

	for( unsigned long n = 0, i = 0; i < count.numberOfCounts; i++ )
	{
		unsigned long     numValues = count.my_count[ i ];
		SetOperation_for< _Type>  setOp( &( a[ n ]), (_Type)i );
		parallel_for( blocked_range< unsigned long >( 0, numValues ), setOp );
		n += numValues;
	}
}

Listing 4


// Copyright(c), Victor J. Duvanenko, 2010

// Listing 5 Part 2
template< class _Type >
class WriteOperation_nested_for
{
	_Type*          my_a;  // a local copy of the pointer to the input array, to provide a pointer to each parallel task
	unsigned long*  my_count;
public:
	static const unsigned long numberOfBins = 1 << ( sizeof( _Type ) * 8 );

	WriteOperation_nested_for( _Type a[], unsigned long* count ) : my_a( a ), my_count( count )
	{}
	// Operates over the range of counts
	void operator()( const blocked_range< unsigned long >& r ) const
	{
		unsigned long n = 0;
		for( unsigned long i = 0; i != r.begin(); ++i )	// compute the starting point of the r.begin() bin
			n += my_count[ i ];
		for( unsigned long i = r.begin(); i != r.end(); ++i )
		{
			unsigned long   numValues = my_count[ i ];	// helps the compiler optimize
			_Type*          a         = my_a;
			SetOperation_for< _Type >  setOp( &( a[ n ]), (_Type)i );
			parallel_for( blocked_range< unsigned long >( 0, numValues, 10000 ), setOp );
			n += numValues;
		}
	}
};

template< class _Type >
inline void CountSortInPlace_TBB_L4( _Type* a, unsigned long a_size )
{
	if ( a_size <  2 )	return;

	CountTemplateType< _Type > count( a ); // contains count array; initialized to all zeros

	// Scan the array and count the number of times each value appears
	parallel_reduce( blocked_range< unsigned long >( 0, a_size ), count ); // grain size of 10K works well

	WriteOperation_nested_for< _Type >  writeOp( a, count.my_count );

	parallel_for( blocked_range< unsigned long >( 0, count.numberOfCounts ), writeOp );
}

Listing 5

Table 1 and Figure 1 show performance measurements for sorting arrays of unsigned 8-bit numbers, comparing the scalar Counting Sort, several Parallel Counting Sort implementations, STL sort(), and the Intel Integrated Performance Primitives (IPP) Radix Sort.

Table 1: unsigned 8-bit, i7 860.

Figure 1: unsigned 8-bit, i7 860.

Parallel Counting Sort is significantly faster than STL sort(), Intel IPP Radix Sort, and the scalar Counting Sort. Parallelizing only the outer for loop or only the inner for loop shows no speedup. Parallelizing both for loops shows a performance decline and is consistently slower than the parallel-outer-for-loop version. Thus, adding parallelism to this section of the Counting Sort algorithm did not improve performance. This demonstrates that adding parallelism does not guarantee a performance gain, and that extracting more parallelism than the overall system can handle efficiently may lead to performance degradation.

Table 2 and Figure 2 show performance measurements for sorting arrays of unsigned 16-bit numbers.

Table 2: unsigned 16-bit, i7 860 CPU.

Figure 2: unsigned 16-bit, i7 860.

Parallel implementations of the 16-bit Counting Sort show performance issues. The non-parallel Counting Sort already showed a bit of performance trouble at smaller array sizes, and parallelism aggravates the problem. The "no parallel loops" and "parallel outer" versions outperform the non-parallel Counting Sort only at the largest array sizes, and severely underperform for small and medium array sizes. The "parallel both" implementation outperforms only at the largest tested array size and underperforms in all other cases. The "parallel inner" version underperforms the non-parallel version in all cases.

Also, the "parallel outer" implementation has a data-dependence issue: the amount of parallelism obtained depends on the statistical distribution of the input data. For example, if the data is evenly distributed among all of the counts/bins, then the available parallelism is 256 for 8-bit and 64K for 16-bit. However, if all values in the array are the same (e.g., all zero), then the counting portion still parallelizes well, but the writing portion does not parallelize at all and runs on a single core, because only the task responsible for writing that particular count/bin has any work to do. Data-dependent parallelism should be avoided in parallel implementations, if possible, to ensure consistent performance across a variety of data distributions.
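
A hypothetical worst-case test input illustrating this data dependence can be generated as follows; the makeWorstCaseInput helper is an assumption for illustration, not code from the article.

#include <cstring>

// With every element equal, count[0] == a_size and all other counts are zero,
// so the outer-loop parallel_for of Listing 3 finds work in only one of the
// 256 (8-bit) or 64K (16-bit) bins and effectively runs on a single core.
void makeWorstCaseInput( unsigned char* a, unsigned long a_size )
{
	std::memset( a, 0, a_size );	// all values identical (all zero)
}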

So far, adding parallelism to the last portion of the Counting Sort has only degraded performance for the 8-bit and 16-bit implementations, with the 16-bit version being significantly worse. Since these algorithms are implemented as templates, differentiated only by the data type, their source code is nearly identical. However, the count array size differs significantly: 256 entries (1K bytes) for 8-bit versus 64K entries (256K bytes) for 16-bit. The quad-core i7 860 processor used for these measurements has a 256K L2 cache per core, so performance could suffer when the count array is the same size as the entire L2 cache. To investigate whether L2 cache thrashing is the cause of the poor performance, measurements could be taken on a CPU with a different cache configuration.
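
A back-of-the-envelope sketch of the count-array footprint, assuming 4-byte unsigned long counts as on the measurement platform; this arithmetic is an illustration, not code from the article.

#include <cstdio>

int main()
{
	unsigned long bins8  = 1UL << ( sizeof( unsigned char  ) * 8 );	// 256 bins
	unsigned long bins16 = 1UL << ( sizeof( unsigned short ) * 8 );	// 65,536 bins
	// Where unsigned long is 4 bytes, this prints roughly 1K bytes for 8-bit and
	// 256K bytes for 16-bit -- the size of one core's entire L2 cache on the i7 860.
	printf( "8-bit  count array: %lu bytes\n", (unsigned long)( bins8  * sizeof( unsigned long ) ) );
	printf( "16-bit count array: %lu bytes\n", (unsigned long)( bins16 * sizeof( unsigned long ) ) );
	return 0;
}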

