Parallel In-Place N-bit-Radix Sort


Victor Duvanenko is a long-time contributor to Dr. Dobb's. He can be contacted at vduvanenko1@gmail.com


In In-Place Hybrid N-bit-Radix Sort, I developed a high-performance In-Place N-bit-Radix Sort. This linear-time O(dn) algorithm followed the hybrid approach of STL sort(), which combines quicksort, heap sort, and insertion sort into a combination that performs better than each algorithm individually. In-Place Radix Sort combined N-bit-Radix Sort with Insertion Sort to significantly boost performance, and sorted arrays of unsigned and signed 8-, 16-, 32-, and 64-bit numbers many times faster than STL sort().

Counting Sort is used inside the Radix Sort. That algorithm was developed and discussed in [1], along with a multicore parallel implementation in [3] and [4] using Intel's Threading Building Blocks (TBB) [5]. This high-performance linear-time O(n) algorithm outperforms STL sort() by more than 60X. Parallel Counting Sort gained a further 2-3X when running on a quad-core processor, the speedup coming from parallelizing the counting portion of the algorithm.

In this article, I combine In-Place N-bit-Radix Sort and Parallel Counting Sort to develop a Parallel In-Place N-bit-Radix Sort algorithm, using Intel's Threading Building Blocks library. Listing 1 shows a slightly improved version of the In-Place N-bit-Radix Sort of [1]. Table 1 and Figure 1 show performance measurements for several sorting algorithms, running on an Intel Core i7 860 quad-core processor.


// Copyright(c), Victor J. Duvanenko, 2010
// Listing 1
template< unsigned long PowerOfTwoRadix, unsigned long Log2ofPowerOfTwoRadix, long Threshold, class _Type >
inline void _RadixSort_Unsigned_PowerOf2Radix_1( _Type* a, long last, _Type bitMask, unsigned long shiftRightAmount )
{
	const unsigned long numberOfBins = PowerOfTwoRadix;
	unsigned long count[ numberOfBins ];
// Counting occurrence of digits within the array (related to Counting Sort)
	for( unsigned long i = 0; i < numberOfBins; i++ )
		count[ i ] = 0;

	for ( long _current = 0; _current <= last; _current++ )	// Scan the array and count the number of times each value appears
	{
		unsigned long digit = (unsigned long)(( a[ _current ] & bitMask ) >> shiftRightAmount );	// extract the digit we are sorting based on
		count[ digit ]++;
	}
// Moving array elements into their bins
	long startOfBin[ numberOfBins ], endOfBin[ numberOfBins ], nextBin;
	startOfBin[ 0 ] = endOfBin[ 0 ] = nextBin = 0;
	for( unsigned long i = 1; i < numberOfBins; i++ )
		startOfBin[ i ] = endOfBin[ i ] = startOfBin[ i - 1 ] + count[ i - 1 ];

	for ( long _current = 0; _current <= last; )
	{
		unsigned long digit;
		_Type tmp = a[ _current ];	// get the compiler to recognize that a register can be used for the loop instead of a[_current] memory location
		while ( true ) {
			digit = (unsigned long)(( tmp & bitMask ) >> shiftRightAmount );	// extract the digit we are sorting based on
			if ( endOfBin[ digit ] == _current )
				break;
			_swap( tmp, a[ endOfBin[ digit ] ] );
			endOfBin[ digit ]++;
		}
		a[ _current ] = tmp;

		endOfBin[ digit ]++;					// leave the element at its location and grow the bin
		_current++;								// advance the current pointer to the next element
		while( nextBin < (long)numberOfBins && _current >= startOfBin[ nextBin ] )	// bounds check first, to avoid reading past the end of startOfBin[]
			nextBin++;
		while( nextBin < (long)numberOfBins && endOfBin[ nextBin - 1 ] == startOfBin[ nextBin ] )
			nextBin++;
		if ( _current < endOfBin[ nextBin - 1 ] )
			 _current = endOfBin[ nextBin - 1 ];
	}
// Recursion for each bin
	bitMask >>= Log2ofPowerOfTwoRadix;
	if ( bitMask != 0 )						// end recursion when all the bits have been processed
	{
		if ( shiftRightAmount >= Log2ofPowerOfTwoRadix )	shiftRightAmount -= Log2ofPowerOfTwoRadix;
		else												shiftRightAmount  = 0;

		for( unsigned long i = 0; i < numberOfBins; i++ )
		{
			long numberOfElements = endOfBin[ i ] - startOfBin[ i ];
			if ( numberOfElements >= Threshold )		// endOfBin actually points to one beyond the bin
				_RadixSort_Unsigned_PowerOf2Radix_1< PowerOfTwoRadix, Log2ofPowerOfTwoRadix, Threshold >( &a[ startOfBin[ i ]], numberOfElements - 1, bitMask, shiftRightAmount );
			else if ( numberOfElements >= 2 )
				insertionSortSimilarToSTLnoSelfAssignment( &a[ startOfBin[ i ]], numberOfElements );
		}
	}
}
// Top-level template function, shared by Listings 1, 2 and 3
template< class _Type >
inline void RadixSortInPlace_PowerOf2Radix_unsigned_TBB( _Type* a, unsigned long a_size )
{
	if ( a_size < 2 )	return;

	const unsigned long Threshold             =  32;	// Threshold of when to switch to using Insertion Sort
	const unsigned long PowerOfTwoRadix       = 256;	// Radix - must be a power of 2
	const unsigned long Log2ofPowerOfTwoRadix =   8;
	unsigned long shiftRightAmount = sizeof( _Type ) * 8 - Log2ofPowerOfTwoRadix;		// Create bit-mask and shift right amount
	_Type bitMask = (_Type)( ((_Type)( PowerOfTwoRadix - 1 )) << shiftRightAmount );	// bitMask controls/selects how many and which bits we process at a time

	if ( a_size >= Threshold )
		_RadixSort_Unsigned_PowerOf2Radix_3< PowerOfTwoRadix, Log2ofPowerOfTwoRadix, Threshold >( a, a_size - 1, bitMask, shiftRightAmount );
	else
		insertionSortSimilarToSTLnoSelfAssignment( a, a_size );
}

Listing 1

Table 1: Various sorting algorithms, arrays of random unsigned 32-bit numbers.

Figure 1: Various sorting algorithms, arrays of random unsigned 32-bit numbers.

TBB parallel_sort is the fastest algorithm for the largest array sizes tested, beating Intel IPP Radix Sort by 10%. IPP Radix Sort uses a single processing core, while TBB parallel_sort uses all four cores. For mid-sized arrays, IPP Radix Sort is the fastest algorithm. TBB parallel_sort beats In-Place Radix Sort for mid and large array sizes, by as much as 1.5X, while lagging for the smallest array sizes. TBB parallel_sort beats STL sort() by more than 4X for the largest array sizes, and never lags it.

Algorithm Analysis

The non-parallel In-Place Radix Sort implementation is shown in Listing 1, with comments added to mark its three major steps:

  1. Counting occurrences of digits within the array
  2. Moving array elements into their bins
  3. Recursing for each bin

Each of these steps is an opportunity for parallelism. Counting is based on the Counting Sort algorithm, with the slight variation that a digit is first extracted from each element, and that digit is then counted.

Figure 2: MSD In-Place Radix Sort.

Figure 2 shows graphically the steps taken in the MSD In-Place Radix Sort, where the original source array is split into N bins, where N is the radix, based on the most-significant digit of each array element. Each bin is then split into sub-bins based on the next-most-significant digit, and so on. For example, an array of 32-bit elements sorted with a 256-radix sort (which uses 8-bit digits) requires four levels of recursion -- one level for each of the four 8-bit digits within the 32 bits of each element.

One of the lessons in performance optimization learned in [4] was to measure before optimizing, to discover the largest performance bottleneck and to apply effort to that area. When the performance of sorting arrays of unsigned 32-bit elements was measured, the "Move array elements" Step #2 took 10X longer than the "Counting" Step #1. The counting step accesses memory sequentially, from the beginning of the array to the end, which is efficient. The moving Step #2, however, swaps the array element at the _current location with an element in its destination bin, and this swapping exhibits random memory access behavior when the input array distribution is random. Random memory accesses are significantly slower than sequential memory accesses in the modern computer memory hierarchy, and are likely responsible for the slower performance of Step #2.

Thus, accelerating the counting portion of the algorithm by running it on multiple cores, as was done in [3] and [4], will yield only a small performance gain here. A method that parallelizes the memory accesses (the moving of array elements to their bins) should produce a more significant gain.

