### Parallel Memory Access

Listing 2 shows an implementation in which the recursive for loop has been replaced with the **parallel_for** construct. The input array is first split into bins (as in Figure 1) using a non-parallel implementation, followed by parallel processing of each bin. The **parallel_for** processes the bins concurrently (on as many cores as are available), further breaking each bin down into sub-bins based on the second digit of each array element. Thus, multiple bins — up to the radix value, which is 256 in this case — are sorted in parallel, each moving its own array elements concurrently. The memory range of each bin does not overlap that of the other bins, so bins within the same recursion level can be processed concurrently without mutual exclusion primitives; i.e., data-parallel processing.

```cpp
// Copyright(c), Victor J. Duvanenko, 2010
// Listing 2

template< class _Type >
inline unsigned long extractDigit( _Type a, _Type bitMask, unsigned long shiftRightAmount )
{
    unsigned long digit = (unsigned long)(( a & bitMask ) >> shiftRightAmount );    // extract the digit we are sorting based on
    return digit;
}

template< unsigned long PowerOfTwoRadix, unsigned long Log2ofPowerOfTwoRadix, long Threshold, class _Type >
class RadixInPlaceOperation_2
{
    _Type* my_a;                        // a local copy of the input array pointer, to provide a pointer to each parallel task
    long*  my_startOfBin;
    long*  my_endOfBin;
    _Type  my_bitMask;                  // a local copy of the bitMask
    unsigned long my_shiftRightAmount;
    static const unsigned long Threshold_P = 10000;     // threshold for switching between parallel and non-parallel implementations
public:
    static const unsigned long numberOfBins = PowerOfTwoRadix;
    unsigned long my_count[ numberOfBins ];             // the count for this task

    RadixInPlaceOperation_2( _Type a[], long* startOfBin, long* endOfBin, _Type bitMask, unsigned long shiftRightAmount ) :
        my_a( a ), my_startOfBin( startOfBin ), my_endOfBin( endOfBin ),
        my_bitMask( bitMask ), my_shiftRightAmount( shiftRightAmount )
    {}
    void operator()( const blocked_range< long >& r ) const
    {
        for( long i = r.begin(); i != r.end(); ++i )
        {
            long numOfElements = my_endOfBin[ i ] - my_startOfBin[ i ];
            if ( numOfElements >= Threshold_P )         // large bin: recurse with the parallel version
                _RadixSort_Unsigned_PowerOf2Radix_2< PowerOfTwoRadix, Log2ofPowerOfTwoRadix, Threshold >(
                    &my_a[ my_startOfBin[ i ]], numOfElements - 1, my_bitMask, my_shiftRightAmount );
            else if ( numOfElements >= Threshold )      // medium bin: recurse with the non-parallel version
                _RadixSort_Unsigned_PowerOf2Radix_1< PowerOfTwoRadix, Log2ofPowerOfTwoRadix, Threshold >(
                    &my_a[ my_startOfBin[ i ]], numOfElements - 1, my_bitMask, my_shiftRightAmount );
            else if ( numOfElements >= 2 )              // small bin: insertion sort
                insertionSortSimilarToSTLnoSelfAssignment( &my_a[ my_startOfBin[ i ]], numOfElements );
        }
    }
};

template< unsigned long PowerOfTwoRadix, unsigned long Log2ofPowerOfTwoRadix, long Threshold, class _Type >
inline void _RadixSort_Unsigned_PowerOf2Radix_2( _Type* a, long last, _Type bitMask, unsigned long shiftRightAmount )
{
    const unsigned long numberOfBins = PowerOfTwoRadix;
    unsigned long count[ numberOfBins ];

    for( unsigned long i = 0; i < numberOfBins; i++ )
        count[ i ] = 0;
    for ( long _current = 0; _current <= last; _current++ )     // scan the array and count how many times each digit value appears
        count[ extractDigit( a[ _current ], bitMask, shiftRightAmount ) ]++;

    long startOfBin[ numberOfBins ], endOfBin[ numberOfBins ], nextBin;
    startOfBin[ 0 ] = endOfBin[ 0 ] = nextBin = 0;
    for( unsigned long i = 1; i < numberOfBins; i++ )
        startOfBin[ i ] = endOfBin[ i ] = startOfBin[ i - 1 ] + count[ i - 1 ];

    for ( long _current = 0; _current <= last; )
    {
        unsigned long digit;
        _Type tmp = a[ _current ];      // get the compiler to use a register for the loop instead of the a[ _current ] memory location
        while ( true )
        {
            digit = (unsigned long)(( tmp & bitMask ) >> shiftRightAmount );    // extract the digit we are sorting based on
            if ( endOfBin[ digit ] == _current )
                break;
            _swap( tmp, a[ endOfBin[ digit ] ] );
            endOfBin[ digit ]++;
        }
        a[ _current ] = tmp;
        endOfBin[ digit ]++;            // leave the element at its location and grow the bin
        _current++;                     // advance the current pointer to the next element
        // bounds check first, so startOfBin/endOfBin are never indexed out of range
        while( nextBin < (long)numberOfBins && _current >= startOfBin[ nextBin ] )
            nextBin++;
        while( nextBin < (long)numberOfBins && endOfBin[ nextBin - 1 ] == startOfBin[ nextBin ] )
            nextBin++;
        if ( _current < endOfBin[ nextBin - 1 ] )
            _current = endOfBin[ nextBin - 1 ];
    }
    bitMask >>= Log2ofPowerOfTwoRadix;
    if ( bitMask != 0 )                 // end recursion when all the bits have been processed
    {
        if ( shiftRightAmount >= Log2ofPowerOfTwoRadix )
            shiftRightAmount -= Log2ofPowerOfTwoRadix;
        else
            shiftRightAmount  = 0;
        RadixInPlaceOperation_2< PowerOfTwoRadix, Log2ofPowerOfTwoRadix, Threshold, _Type >
            radixOp( a, startOfBin, endOfBin, bitMask, shiftRightAmount );
        parallel_for( blocked_range< long >( 0, (long)numberOfBins ), radixOp );
    }
}
```

Note that in Listing 2, in the **RadixInPlaceOperation_2** class, the recursive template function call is to the parallel function (**_RadixSort_Unsigned_PowerOf2Radix_2**). However, if the non-parallel version (**_RadixSort_Unsigned_PowerOf2Radix_1**) is called instead, performance improves by about 5%. This experiment shows that the performance gain from parallelism came from a single recursion level; the deeper recursion levels did not help and, in fact, reduced performance.

The reason for the performance drop becomes clear from an analysis of bin size at each recursion level. An array of 100 million random elements is split into 256 bins at the top level (before the first recursion) when using 8-bit digits (256-radix), leaving each bin with about 390K elements. After the first recursion, each of these bins is split further into 256 bins of about 1.5K elements each. The optimal Threshold_P setting, determined experimentally, was about 10K: for bins with more than 10K elements the parallel function is called, and for smaller bins the non-parallel version is called. With about 1.5K elements per bin, the non-parallel version will be called on average. But this is the average case for a random array; some bins will be larger than 10K and some much smaller. Thus, the parallel function will be called at times, slowing things down due to its larger overhead.

When bins get small, the parallel overhead most likely exceeds the acceleration obtained from parallel execution. Intel recommends that each parallel chunk of work be at least about 10K CPU clock cycles. As array sizes grow beyond 100M elements, such as with the transition to 64-bit CPUs and operating systems, further recursive parallel processing will pay off, as bins after the second recursion will be large enough to be worth processing in parallel. The implementation needs to keep this future in mind, which is one reason to keep the parallel recursion structure.

Another reason is that inputs are not necessarily random. If every element of the input array has the same most significant digit, with the remaining digits random, then the single-recursion-level parallel implementation would gain no performance benefit, whereas the fully parallel version would still perform well. In the rest of this paper, the fully parallel implementation is used.

Also, using a smaller radix leads to larger bins at each recursion level and may result in higher performance. However, it also leads to more levels of recursion, which may be slower. The implementation in Listing 2 allows an arbitrary number of bits to be used as the radix.