*Victor Duvanenko is a long-time contributor to Dr. Dobb's. He can be contacted at vduvanenko1@gmail.com*

In "In-Place Hybrid N-bit-Radix Sort," I developed a high-performance In-Place N-bit-Radix Sort. This linear-time **O**(dn) algorithm followed the hybrid approach of STL **sort()**, which combines quicksort, heap sort, and insertion sort into a combination that outperforms each algorithm individually. In-Place Radix Sort combined N-bit-Radix Sort with Insertion Sort to significantly boost performance, and sorted arrays of unsigned and signed 8-, 16-, 32-, and 64-bit numbers many times faster than STL **sort()**.

Counting Sort is used inside the Radix Sort. This algorithm was developed and discussed in [1], along with multicore parallel implementations in [3] and [4] using Intel's Threading Building Blocks (TBB) [5]. This high-performance linear-time **O**(n) algorithm outperforms STL **sort()** by more than 60X. Parallel Counting Sort gained a further 2-3X when running on a quad-core processor, the speedup coming from parallelization of the counting portion of the algorithm.

In this article, I combine In-Place N-bit-Radix Sort and Parallel Counting Sort to develop a Parallel In-Place N-bit-Radix Sort algorithm, using Intel's Threading Building Blocks library. Listing 1 shows a slightly improved version of the In-Place N-bit-Radix Sort of [1]. Table 1 and Figure 1 show performance measurements for several sorting algorithms, running on an Intel Core i7 860 quad-core processor.

```cpp
// Copyright(c), Victor J. Duvanenko, 2010
// Listing 1
template< unsigned long PowerOfTwoRadix, unsigned long Log2ofPowerOfTwoRadix, long Threshold, class _Type >
inline void _RadixSort_Unsigned_PowerOf2Radix_1( _Type* a, long last, _Type bitMask, unsigned long shiftRightAmount )
{
    const unsigned long numberOfBins = PowerOfTwoRadix;
    unsigned long count[ numberOfBins ];

    // Counting occurrence of digits within the array (related to Counting Sort)
    for( unsigned long i = 0; i < numberOfBins; i++ )
        count[ i ] = 0;
    for ( long _current = 0; _current <= last; _current++ )    // Scan the array and count the number of times each value appears
    {
        unsigned long digit = (unsigned long)(( a[ _current ] & bitMask ) >> shiftRightAmount );  // extract the digit we are sorting based on
        count[ digit ]++;
    }

    // Moving array elements into their bins
    long startOfBin[ numberOfBins ], endOfBin[ numberOfBins ], nextBin;
    startOfBin[ 0 ] = endOfBin[ 0 ] = nextBin = 0;
    for( unsigned long i = 1; i < numberOfBins; i++ )
        startOfBin[ i ] = endOfBin[ i ] = startOfBin[ i - 1 ] + count[ i - 1 ];

    for ( long _current = 0; _current <= last; )
    {
        unsigned long digit;
        _Type tmp = a[ _current ];  // get the compiler to recognize that a register can be used for the loop instead of a[_current] memory location
        while ( true )
        {
            digit = (unsigned long)(( tmp & bitMask ) >> shiftRightAmount );  // extract the digit we are sorting based on
            if ( endOfBin[ digit ] == _current )
                break;
            _swap( tmp, a[ endOfBin[ digit ] ] );
            endOfBin[ digit ]++;
        }
        a[ _current ] = tmp;

        endOfBin[ digit ]++;    // leave the element at its location and grow the bin
        _current++;             // advance the current pointer to the next element
        while( _current >= startOfBin[ nextBin ] && nextBin < numberOfBins )
            nextBin++;
        while( endOfBin[ nextBin - 1 ] == startOfBin[ nextBin ] && nextBin < numberOfBins )
            nextBin++;
        if ( _current < endOfBin[ nextBin - 1 ] )
            _current = endOfBin[ nextBin - 1 ];
    }

    // Recursion for each bin
    bitMask >>= Log2ofPowerOfTwoRadix;
    if ( bitMask != 0 )     // end recursion when all the bits have been processed
    {
        if ( shiftRightAmount >= Log2ofPowerOfTwoRadix )
            shiftRightAmount -= Log2ofPowerOfTwoRadix;
        else
            shiftRightAmount  = 0;

        for( unsigned long i = 0; i < numberOfBins; i++ )
        {
            long numberOfElements = endOfBin[ i ] - startOfBin[ i ];    // endOfBin actually points to one beyond the bin
            if ( numberOfElements >= Threshold )
                _RadixSort_Unsigned_PowerOf2Radix_1< PowerOfTwoRadix, Log2ofPowerOfTwoRadix, Threshold >(
                    &a[ startOfBin[ i ]], numberOfElements - 1, bitMask, shiftRightAmount );
            else if ( numberOfElements >= 2 )
                insertionSortSimilarToSTLnoSelfAssignment( &a[ startOfBin[ i ]], numberOfElements );
        }
    }
}

// Listings 1, 2 and 3 (top-level template function)
template< class _Type >
inline void RadixSortInPlace_PowerOf2Radix_unsigned_TBB( _Type* a, unsigned long a_size )
{
    if ( a_size < 2 )   return;

    const unsigned long Threshold             =  32;    // Threshold of when to switch to using Insertion Sort
    const unsigned long PowerOfTwoRadix       = 256;    // Radix - must be a power of 2
    const unsigned long Log2ofPowerOfTwoRadix =   8;
    // Create bit-mask and shift right amount
    unsigned long shiftRightAmount = sizeof( _Type ) * 8 - Log2ofPowerOfTwoRadix;
    _Type bitMask = (_Type)( ((_Type)( PowerOfTwoRadix - 1 )) << shiftRightAmount );    // bitMask controls/selects how many and which bits we process at a time

    if ( a_size >= Threshold )
        _RadixSort_Unsigned_PowerOf2Radix_3< PowerOfTwoRadix, Log2ofPowerOfTwoRadix, Threshold >( a, a_size - 1, bitMask, shiftRightAmount );
    else
        insertionSortSimilarToSTLnoSelfAssignment( a, a_size );
}
```

TBB **parallel_sort** is the fastest algorithm at the largest array size tested, beating Intel IPP Radix Sort by 10%. IPP Radix Sort uses a single processing core, while TBB **parallel_sort** uses all four cores. For mid-range array sizes, IPP Radix Sort is the fastest algorithm. TBB **parallel_sort** beats In-Place Radix Sort by as much as 1.5X for mid-range and large array sizes, while lagging for the smallest array sizes. TBB **parallel_sort** beats STL **sort()** by more than 4X for the largest array sizes, and never lags it.

### Algorithm Analysis

The non-parallel In-Place Radix Sort implementation is shown in Listing 1, with comments added to mark the three major steps:

- Counting occurrences of digits within the array
- Moving array elements into their bins
- Recursing for each bin

Each of these steps is an opportunity for parallelism. Counting is based on the Counting Sort algorithm, with the slight variation that the digit being sorted on is first extracted from each element and then counted.

Figure 2 graphically shows the steps taken in the MSD In-Place Radix Sort: the original source array is split into N bins, where N is the radix, based on the most significant digit of each array element. Each bin is then split into sub-bins based on the next most significant digit, and so on. For example, for an array of 32-bit elements sorted with a 256-radix sort (which uses 8-bit digits), four recursion levels are needed -- one level for each of the four 8-bit digits within the 32 bits of each element.

One of the performance-optimization lessons learned in [4] was to measure before optimizing, to discover the largest performance bottleneck and apply effort there. When sorting of arrays of unsigned 32-bit elements was measured, the "Move array elements" Step #2 took 10X longer than the "Counting" Step #1. The counting step accesses memory sequentially, from the beginning of the array to the end, which is efficient. The moving Step #2, however, moves the array element at the **_current** location to its destination bin, and moves the element from the destination bin to the **_current** location (a swap). This swapping process exhibits random memory access behavior when the input array distribution is random. Random memory accesses are significantly slower than sequential memory accesses in the modern computer memory hierarchy, and are the likely cause of Step #2's slower performance.

Thus, accelerating the counting portion of the algorithm by running it on multiple cores, as was done in [3] and [4], would yield only a small performance gain here. A method that parallelizes the memory accesses (the moving of array elements into their bins) should produce a more significant gain.