### Multiple Possibilities

Intel TBB supports the **parallel_for** construct, which is logically equivalent to executing every iteration of the for loop in parallel. Its prototype is:

```cpp
template< typename Range, typename Body >
void parallel_for( const Range& range, const Body& body, [partitioner] );
```

The first argument is the range over which **parallel_for** will execute. The range must be capable of splitting itself recursively, and **parallel_for** exploits this capability to divide the work. The second argument is a class whose "()" operator is invoked on each subrange of loop iterations. **parallel_for** supports several variations, such as specifying the first and last index, the step size, and several different partitioners that control how the work is split among cores.
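For readers unfamiliar with the Body pattern, the following sketch illustrates its shape without requiring TBB: a hypothetical `SumBody` functor whose "()" operator processes a subrange, driven by a crude stand-in that splits the range only once across two plain C++11 threads (TBB instead splits recursively and schedules the pieces on a thread pool).

```cpp
#include <thread>
#include <vector>

// Hypothetical Body functor: sums a subrange of an array, mimicking the
// "()" operator that TBB's parallel_for invokes on each split subrange.
struct SumBody {
    const int* data;
    long long* result;  // one result slot per task, so tasks share nothing
    void operator()(size_t begin, size_t end, size_t slot) const {
        long long s = 0;
        for (size_t i = begin; i < end; ++i) s += data[i];
        result[slot] = s;
    }
};

// Crude stand-in for parallel_for: split the range once at the midpoint
// and run each half concurrently. TBB generalizes this to recursive
// splitting controlled by the range and the partitioner.
long long parallel_sum(const std::vector<int>& v) {
    long long partial[2] = {0, 0};
    SumBody body{v.data(), partial};
    size_t mid = v.size() / 2;
    std::thread t(body, 0, mid, 0);  // first half on a worker thread
    body(mid, v.size(), 1);          // second half on the calling thread
    t.join();
    return partial[0] + partial[1];
}
```

The key property, which TBB relies on, is that `operator()` touches only its own subrange, so concurrent invocations never conflict.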

The last two nested for loops in the Parallel Counting Sort in Listing 1 are responsible for writing the sorted result array and have not yet been parallelized. Several options are available when converting these sequential **for** loops to **parallel_for**: leave both loops sequential, parallelize only the outer loop, parallelize only the inner loop, or parallelize both loops. Listing 1 implements the first option, while Listings 3, 4, and 5 below show the three parallel implementations.
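As a serial baseline for comparing the options, a minimal sketch of those two nested loops (assuming an already-filled count array, and using `unsigned char` elements for the 8-bit case) looks like:

```cpp
#include <vector>

// Serial write phase of Counting Sort: for each bin value i, write
// count[i] copies of i into the output. These are the two nested loops
// that the listings below parallelize in different ways: the outer loop
// walks the bins, the inner loop emits one bin's run of values.
std::vector<unsigned char> write_sorted(const std::vector<unsigned long>& count) {
    std::vector<unsigned char> out;
    for (unsigned long i = 0; i < count.size(); ++i)      // outer: bins
        for (unsigned long j = 0; j < count[i]; ++j)      // inner: copies of i
            out.push_back(static_cast<unsigned char>(i));
    return out;
}
```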

```cpp
// Copyright(c), Victor J. Duvanenko, 2010
// Listing 3 of Part 2
template< class _Type >
class WriteOperation_for {
    _Type*         my_a;     // a local copy of the input array pointer, to provide a pointer to each parallel task
    unsigned long* my_count;
public:
    static const unsigned long numberOfBins = 1 << ( sizeof( _Type ) * 8 );
    WriteOperation_for( _Type a[], unsigned long* count ) : my_a( a ), my_count( count ) {}
    // Operates over the range of counts
    void operator()( const blocked_range< unsigned long >& r ) const
    {
        unsigned long n = 0;
        for( unsigned long i = 0; i != r.begin(); ++i )  // compute the starting point of the r.begin() bin
            n += my_count[ i ];
        for( unsigned long i = r.begin(); i != r.end(); ++i )
        {
            unsigned long numValues = my_count[ i ];     // helps the compiler optimize
            _Type* a = my_a;
            for( unsigned long j = 0; j < numValues; j++ )
                a[ n++ ] = (_Type)i;
        }
    }
};

template< class _Type >
inline void CountSortInPlace_TBB_L2( _Type* a, unsigned long a_size )
{
    if ( a_size < 2 )  return;
    CountTemplateType< _Type > count( a );  // contains count array; initialized to all zeros
    // Scan the array and count the number of times each value appears
    parallel_reduce( blocked_range< unsigned long >( 0, a_size ), count );  // grain size of 10K works well
    WriteOperation_for< _Type > writeOp( a, count.my_count );
    parallel_for( blocked_range< unsigned long >( 0, count.numberOfCounts ), writeOp );
}
```

```cpp
// Copyright(c), Victor J. Duvanenko, 2010
// Listing 4 of Part 2
template< class _Type >
class SetOperation_for {
    _Type* my_a;        // a local copy of the input array pointer, to provide a pointer to each parallel task
    _Type  my_setValue; // the value to set the array to
public:
    SetOperation_for( _Type a[], _Type setValue ) : my_a( a ), my_setValue( setValue ) {}
    void operator()( const blocked_range< unsigned long >& r ) const
    {
        for( unsigned long n = r.begin(); n != r.end(); )
            my_a[ n++ ] = my_setValue;
    }
};

template< class _Type >
inline void CountSortInPlace_TBB_L3( _Type* a, unsigned long a_size )
{
    if ( a_size < 2 )  return;
    CountTemplateType< _Type > count( a );  // contains count array; initialized to all zeros
    // Scan the array and count the number of times each value appears
    parallel_reduce( blocked_range< unsigned long >( 0, a_size ), count );  // grain size of 10K works well
    for( unsigned long n = 0, i = 0; i < count.numberOfCounts; i++ )
    {
        unsigned long numValues = count.my_count[ i ];
        SetOperation_for< _Type > setOp( &( a[ n ] ), (_Type)i );
        parallel_for( blocked_range< unsigned long >( 0, numValues ), setOp );
        n += numValues;
    }
}
```
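Each per-bin `parallel_for` in Listing 4 is doing nothing more than filling a run of the output with one value. A serial sketch of the same step (using `std::fill_n`, for the 8-bit case) makes this plain, and suggests why the "parallel inner" approach struggles: a parallel launch is paid per bin, even when a bin holds only a handful of elements.

```cpp
#include <algorithm>
#include <vector>

// Serial equivalent of Listing 4's write phase: each bin's run is just
// a fill of count[i] copies of the bin value i. Listing 4 hands each of
// these fills to its own parallel_for, so small bins pay the parallel
// launch overhead for very little work.
std::vector<unsigned char> fill_bins(const std::vector<unsigned long>& count) {
    std::vector<unsigned char> out;
    for (unsigned long i = 0; i < count.size(); ++i) {
        size_t n = out.size();
        out.resize(n + count[i]);
        std::fill_n(out.begin() + n, count[i], static_cast<unsigned char>(i));
    }
    return out;
}
```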

```cpp
// Copyright(c), Victor J. Duvanenko, 2010
// Listing 5 of Part 2
template< class _Type >
class WriteOperation_nested_for {
    _Type*         my_a;     // a local copy of the input array pointer, to provide a pointer to each parallel task
    unsigned long* my_count;
public:
    static const unsigned long numberOfBins = 1 << ( sizeof( _Type ) * 8 );
    WriteOperation_nested_for( _Type a[], unsigned long* count ) : my_a( a ), my_count( count ) {}
    // Operates over the range of counts
    void operator()( const blocked_range< unsigned long >& r ) const
    {
        unsigned long n = 0;
        for( unsigned long i = 0; i != r.begin(); ++i )  // compute the starting point of the r.begin() bin
            n += my_count[ i ];
        for( unsigned long i = r.begin(); i != r.end(); ++i )
        {
            unsigned long numValues = my_count[ i ];     // helps the compiler optimize
            _Type* a = my_a;
            SetOperation_for< _Type > setOp( &( a[ n ] ), (_Type)i );
            parallel_for( blocked_range< unsigned long >( 0, numValues, 10000 ), setOp );
            n += numValues;
        }
    }
};

template< class _Type >
inline void CountSortInPlace_TBB_L4( _Type* a, unsigned long a_size )
{
    if ( a_size < 2 )  return;
    CountTemplateType< _Type > count( a );  // contains count array; initialized to all zeros
    // Scan the array and count the number of times each value appears
    parallel_reduce( blocked_range< unsigned long >( 0, a_size ), count );  // grain size of 10K works well
    WriteOperation_nested_for< _Type > writeOp( a, count.my_count );
    parallel_for( blocked_range< unsigned long >( 0, count.numberOfCounts ), writeOp );
}
```

Table 1 and Figure 1 show performance measurements for sorting arrays of unsigned 8-bit numbers, comparing the scalar version, several Parallel Counting Sort implementations, STL **sort()**, and Intel Integrated Performance Primitives (IPP) Radix Sort.

Parallel Counting Sort is significantly faster than STL **sort()**, Intel IPP Radix Sort, and the sequential Counting Sort. Parallelizing either the outer for loop or the inner for loop alone shows no speedup. Parallelizing both for loops degrades performance and is consistently slower than the parallel-outer version. Thus, adding parallelism in this section of the Counting Sort algorithm did not improve performance. This demonstrates that implementing parallelism does not guarantee a performance gain, and that extracting more parallelism than the overall system can handle efficiently may lead to performance degradation.

Table 2 and Figure 2 show performance measurements for sorting arrays of unsigned 16-bit numbers.

Parallel implementations of 16-bit Counting Sort show performance problems. The non-parallel Counting Sort exhibited some performance trouble at smaller array sizes, and parallelism aggravates the problem. The "no parallel loops" and "parallel outer" versions outperform the non-parallel Counting Sort only at the largest array sizes, and severely underperform at small and medium array sizes. The "parallel both" implementation outperforms only at the largest tested array size and underperforms in all other cases. "Parallel inner" underperforms the non-parallel version in all cases.

The "parallel outer" implementation also has a data-dependence issue: the parallelism obtained depends on the statistical distribution of the input data. For example, if the data is evenly distributed across all of the counts/bins, then the available parallelism is 256-way for 8-bit and 64K-way for 16-bit elements. However, if all values in the array are the same (e.g., all zero), then the counting portion still parallelizes well, but the writing portion uses only a single core and does not parallelize at all: only the task responsible for writing that particular count/bin does any work. Data-dependent parallelism should be avoided in parallel implementations when possible, to ensure consistent performance across a variety of data distributions.
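This data dependence is easy to quantify. Since "parallel outer" assigns work per bin, the largest bin bounds the write phase's serial fraction; the sketch below (a hypothetical helper, not article code) measures it for the 8-bit case.

```cpp
#include <algorithm>
#include <vector>

// Returns the size of the largest bin for 8-bit input. With "parallel
// outer", one task writes each bin, so the write phase takes at least
// as long as the largest bin: uniform data keeps all 256 bins small,
// while constant data puts everything in one bin and serializes the
// write phase on a single core.
unsigned long largest_bin(const std::vector<unsigned char>& a) {
    std::vector<unsigned long> count(256, 0);
    for (unsigned char v : a) ++count[v];  // same counting pass as the sort
    return *std::max_element(count.begin(), count.end());
}
```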

So far, adding parallelism to the last portion of Counting Sort has only degraded performance for the 8-bit and 16-bit implementations, with the 16-bit version being significantly worse. Since these algorithms are implemented as templates, differentiated only by the data type, their source code is nearly identical. However, the count array size differs significantly -- 256 entries (1 KB) versus 64K entries (256 KB). The i7 860 processor (quad-core) being used integrates a 256 KB L2 cache per core, so performance could suffer when the count array is the same size as the entire L2 cache. To investigate whether L2 cache thrashing is the cause of the poor performance, measurements could be taken on a CPU with a different cache configuration.