Dr. Dobb's is part of the Informa Tech Division of Informa PLC




Parallel In-Place N-bit-Radix Sort

Parallel Memory Access

Listing 2 shows an implementation in which the recursion for loop has been replaced with the parallel_for construct. The input array is first split into bins (as in Figure 1) using the non-parallel implementation, followed by parallel processing of each bin. The parallel_for processes the bins concurrently (on as many cores as are available), breaking each bin down further into sub-bins based on the second digit of each array element. Thus, up to PowerOfTwoRadix bins (256 in this case) are sorted in parallel, each moving its own array elements concurrently. Because the memory range of each bin does not overlap that of any other bin, bins within the same recursion level can be processed concurrently without mutual-exclusion primitives; i.e., data-parallel processing.

// Copyright(c), Victor J. Duvanenko, 2010
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
using namespace tbb;

template< class _Type >
inline void _swap( _Type& a, _Type& b )		// simple swap helper, defined in an earlier listing
{
	_Type tmp = a;  a = b;  b = tmp;
}
template< class _Type >
inline unsigned long extractDigit( _Type a, _Type bitMask, unsigned long shiftRightAmount )
{
	unsigned long digit = (unsigned long)(( a & bitMask ) >> shiftRightAmount );	// extract the digit we are sorting based on
	return digit;
}
// forward declaration: the parallel function below and this class call each other
template< unsigned long PowerOfTwoRadix, unsigned long Log2ofPowerOfTwoRadix, long Threshold, class _Type >
inline void _RadixSort_Unsigned_PowerOf2Radix_2( _Type* a, long last, _Type bitMask, unsigned long shiftRightAmount );

// Listing 2
template< unsigned long PowerOfTwoRadix, unsigned long Log2ofPowerOfTwoRadix, long Threshold, class _Type >
class RadixInPlaceOperation_2
{
	_Type* my_a;										// a local copy of the input array pointer, to provide a pointer to each parallel task
	long*  my_startOfBin;
	long*  my_endOfBin;
	_Type  my_bitMask;									// a local copy of the bitMask
	unsigned long my_shiftRightAmount;
	static const long Threshold_P = 10000;				// threshold at which to switch between parallel and non-parallel implementations
	static const unsigned long numberOfBins = PowerOfTwoRadix;
public:
	RadixInPlaceOperation_2( _Type a[], long* startOfBin, long* endOfBin, _Type bitMask, unsigned long shiftRightAmount ) :
		my_a( a ), my_startOfBin( startOfBin ), my_endOfBin( endOfBin ),
		my_bitMask( bitMask ), my_shiftRightAmount( shiftRightAmount )
	{}
	void operator()( const blocked_range< long >& r ) const
	{
		for( long i = r.begin(); i != r.end(); ++i )
		{
			long numOfElements = my_endOfBin[ i ] - my_startOfBin[ i ];
			if ( numOfElements >= Threshold_P )			// large bin: recurse using the parallel implementation
				_RadixSort_Unsigned_PowerOf2Radix_2< PowerOfTwoRadix, Log2ofPowerOfTwoRadix, Threshold >( &my_a[ my_startOfBin[ i ]], numOfElements - 1, my_bitMask, my_shiftRightAmount );
			else if ( numOfElements >= Threshold )		// medium bin: recurse using the non-parallel implementation
				_RadixSort_Unsigned_PowerOf2Radix_1< PowerOfTwoRadix, Log2ofPowerOfTwoRadix, Threshold >( &my_a[ my_startOfBin[ i ]], numOfElements - 1, my_bitMask, my_shiftRightAmount );
			else if ( numOfElements >= 2 )				// small bin: insertion sort
				insertionSortSimilarToSTLnoSelfAssignment( &my_a[ my_startOfBin[ i ]], numOfElements );
		}
	}
};
template< unsigned long PowerOfTwoRadix, unsigned long Log2ofPowerOfTwoRadix, long Threshold, class _Type >
inline void _RadixSort_Unsigned_PowerOf2Radix_2( _Type* a, long last, _Type bitMask, unsigned long shiftRightAmount )
{
	const unsigned long numberOfBins = PowerOfTwoRadix;
	unsigned long count[ numberOfBins ];

	for( unsigned long i = 0; i < numberOfBins; i++ )
		count[ i ] = 0;
	for ( long _current = 0; _current <= last; _current++ )	// Scan the array and count the number of times each digit value appears
		count[ extractDigit( a[ _current ], bitMask, shiftRightAmount ) ]++;

	long startOfBin[ numberOfBins ], endOfBin[ numberOfBins ], nextBin;
	startOfBin[ 0 ] = endOfBin[ 0 ] = nextBin = 0;
	for( unsigned long i = 1; i < numberOfBins; i++ )
		startOfBin[ i ] = endOfBin[ i ] = startOfBin[ i - 1 ] + count[ i - 1 ];

	for ( long _current = 0; _current <= last; )
	{
		unsigned long digit;
		_Type tmp = a[ _current ];	// get the compiler to recognize that a register can be used for the loop instead of the a[_current] memory location
		while ( true ) {
			digit = (unsigned long)(( tmp & bitMask ) >> shiftRightAmount );	// extract the digit we are sorting based on
			if ( endOfBin[ digit ] == _current )
				break;								// the element in hand belongs at the current location
			_swap( tmp, a[ endOfBin[ digit ] ] );
			endOfBin[ digit ]++;
		}
		a[ _current ] = tmp;

		endOfBin[ digit ]++;						// leave the element at its location and grow the bin
		_current++;									// advance the current pointer to the next element
		while( nextBin < (long)numberOfBins && _current >= startOfBin[ nextBin ] )	// advance to the bin that _current falls into
			nextBin++;
		while( nextBin < (long)numberOfBins && endOfBin[ nextBin - 1 ] == startOfBin[ nextBin ] )	// skip over bins that are already full
			nextBin++;
		if ( _current < endOfBin[ nextBin - 1 ] )
			 _current = endOfBin[ nextBin - 1 ];	// jump over the already-placed portion of the current bin
	}
	bitMask >>= Log2ofPowerOfTwoRadix;
	if ( bitMask != 0 )								// end recursion when all the bits have been processed
	{
		if ( shiftRightAmount >= Log2ofPowerOfTwoRadix )	shiftRightAmount -= Log2ofPowerOfTwoRadix;
		else												shiftRightAmount  = 0;

		RadixInPlaceOperation_2< PowerOfTwoRadix, Log2ofPowerOfTwoRadix, Threshold, _Type >  radixOp( a, startOfBin, endOfBin, bitMask, shiftRightAmount );
		parallel_for( blocked_range< long >( 0, (long)numberOfBins ), radixOp );
	}
}

Listing 2
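The cycle-following permutation at the heart of Listing 2 can be hard to see amid the template machinery. Below is a minimal, self-contained sketch of the same in-place binning idea, specialized to 32-bit keys and the most significant 8-bit digit, written in the closely related fill-each-bin-in-turn ("American flag sort") style; the function name and fixed constants are illustrative, not from the article.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// In-place partition of 32-bit keys into 256 bins by their top 8-bit digit.
// Illustrative sketch only; Listing 2 generalizes this with templates,
// recursion, and parallel_for over the resulting bins.
void permuteByTopDigit( std::vector<unsigned>& a )
{
    const unsigned long numberOfBins = 256;
    const unsigned shiftRightAmount  = 24;              // top 8 bits of a 32-bit key
    long count[ numberOfBins ] = { 0 };
    for ( unsigned x : a )                              // histogram of top digits
        count[ x >> shiftRightAmount ]++;

    long startOfBin[ numberOfBins ], endOfBin[ numberOfBins ];
    startOfBin[ 0 ] = endOfBin[ 0 ] = 0;
    for ( unsigned long i = 1; i < numberOfBins; i++ )
        startOfBin[ i ] = endOfBin[ i ] = startOfBin[ i - 1 ] + count[ i - 1 ];

    for ( unsigned long b = 0; b < numberOfBins; b++ )  // fill each bin in turn
    {
        long binEnd = startOfBin[ b ] + count[ b ];
        while ( endOfBin[ b ] < binEnd )
        {
            unsigned tmp = a[ endOfBin[ b ] ];          // follow the displacement cycle
            unsigned long digit;
            while (( digit = tmp >> shiftRightAmount ) != b )
                std::swap( tmp, a[ endOfBin[ digit ]++ ] );
            a[ endOfBin[ b ]++ ] = tmp;                 // element belongs in the current bin
        }
    }
}
```

After the call, the array is grouped by top digit with no extra array, which is exactly the property that lets each bin then be handed to an independent parallel task.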

Note that in Listing 2, in the RadixInPlaceOperation_2 class, the recursive template function call is to the parallel function (_RadixSort_Unsigned_PowerOf2Radix_2). However, if the non-parallel version (_RadixSort_Unsigned_PowerOf2Radix_1) is called instead, performance improves by about 5%. This experiment shows that the acceleration from parallelism came from a single recursion level; the deeper recursion levels added no speedup and in fact reduced performance.

The reason for the performance drop becomes clear from an analysis of bin size at each recursion level. An array of 100 million random elements is split into 256 bins at the top level (before the first recursion) when using 8-bit digits (256-radix), so each bin holds about 390K elements on average. After the first recursion, each of these bins is split into 256 bins again, leaving about 1.5K elements per bin. The optimal Threshold_P setting, determined experimentally, was about 10K: bins with more than 10K elements are sorted by the parallel function, and smaller bins by the non-parallel version. With an average of 1.5K elements per bin, the non-parallel version is usually called. But this is the average case for a random array; some bins will exceed 10K elements and some will be much smaller. Thus, the parallel function will be called at times, slowing things down due to its larger overhead.
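The bin-size arithmetic above can be sketched as a small helper (the function name is illustrative, not from the article):

```cpp
#include <cassert>

// Average bin size after a given number of recursive splits, assuming a
// uniformly random input. depth 0 is the top-level split into 'radix' bins.
long averageBinSize( long n, long radix, int depth )
{
    long size = n;
    for ( int d = 0; d <= depth; d++ )  // one division per split level
        size /= radix;
    return size;
}
// For 100M elements and 256-radix: ~390K per bin at the top level (well above
// the 10K Threshold_P, so the parallel function runs) and ~1.5K after the
// first recursion (below Threshold_P, so the serial version usually runs).
```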

When bins get small, the parallel overhead most likely outweighs the acceleration obtained from parallel execution. Intel recommends that each parallel chunk of work amount to at least about 10K CPU clock cycles. As array sizes grow beyond 100M elements, such as with the transition to 64-bit CPUs and operating systems, deeper recursive parallel processing will pay off, because bins after the second recursion will then be large enough to be worth processing in parallel. The implementation needs to keep this future growth in mind, which is one reason to keep the parallel recursion structure.

Another reason is that inputs are not necessarily random. If all elements of the input array share the same most significant digit, with the remaining digits random, then the single-recursion parallel implementation gains nothing from its one parallel level, whereas the fully parallel version still performs well. In the rest of this article, the fully parallel implementation is used.
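Such an input is easy to construct for testing. A hypothetical generator (the fixed top byte 0xAB and the seed are arbitrary choices, not from the article) might look like:

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Generates n keys that all share the same most significant 8-bit digit, so a
// 256-radix top-level split puts every element into one bin; only recursive
// parallelism inside that bin can then engage multiple cores.
std::vector<uint32_t> sameTopDigitInput( size_t n )
{
    std::vector<uint32_t> v( n );
    std::mt19937 gen( 42 );                           // fixed seed for repeatability
    for ( auto& x : v )
        x = 0xAB000000u | ( gen() & 0x00FFFFFFu );    // fixed top byte, random low 24 bits
    return v;
}
```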

Also, using a smaller radix leads to larger bins at each recursion level, and may result in higher performance. However, it also leads to more levels of recursion, which may be slower. The implementation in Listing 2 allows an arbitrary number of bits to be used as the radix.
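The trade-off can be quantified: for 32-bit keys, the number of recursion levels is ceil(32 / bits-per-digit). A small helper (name is illustrative) makes the comparison concrete:

```cpp
#include <cassert>

// Recursion levels needed to consume all 32 bits of a key when each level
// sorts on bitsPerDigit bits: ceil(32 / bitsPerDigit).
int recursionLevels( int bitsPerDigit )
{
    return ( 32 + bitsPerDigit - 1 ) / bitsPerDigit;
}
// A smaller radix (fewer bits per digit) yields larger bins per level but more
// levels overall: e.g., 16-radix needs 8 passes where 256-radix needs only 4.
```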
