### Other Input Data Statistics

Table 8 shows performance measurements for sorting arrays in which every element is the maximum unsigned value of its data type (0xff for 8-bit, 0xffff for 16-bit, and so on up to 64-bit).

As the data type size increases from 8-bit to 64-bit, the performance of the Parallel In-Place N-bit-Radix Sort algorithm is affected more by constant element values. Mid-size arrays are affected more than larger arrays. For large arrays, the non-parallel implementations for 32-bit and 64-bit are affected less than the parallel implementations, but the opposite holds for 8-bit and 16-bit arrays.

Slower performance for constant arrays in the parallel implementation is not surprising: all elements fall within a single bin at every recursion level, so multiple bins are never processed in parallel. The only parallelism left is that of counting elements within the single bin. What is surprising is how poor the performance is, given that cache accesses in this case should not be random. This anomaly will require further investigation and is an opportunity for optimization.

Once again, a random input distribution may not lead to the worst-case performance of a sorting algorithm. The effect may be more pronounced for parallel implementations and may depend on the data type size.

### Inherent Parallelism

The counting step (Step #1 of the three steps shown above) is similar to that used in the Counting Sort algorithm, described in [3, 4]. This step has inherent parallelism, since the array can be broken into sub-arrays, which are counted independently, with the counts combined afterwards. A parallel implementation of this step, which fits the **parallel_reduce** pattern, was shown in [4]. Parallelism of this step is data-independent.
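The split-count-combine idea can be sketched with plain standard-library threads. This is a minimal illustration under stated assumptions, not the TBB-based implementation from [4]: the names `Histogram`, `count_chunk`, and `parallel_count` are hypothetical, elements are assumed to be 8-bit, and one thread is created per sub-array rather than relying on a task scheduler.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// One counter per possible 8-bit value (hypothetical type name).
using Histogram = std::array<std::size_t, 256>;

// Count occurrences of each value in the sub-array [first, last).
static void count_chunk(const std::uint8_t* first, const std::uint8_t* last,
                        Histogram& counts) {
    counts.fill(0);
    for (; first != last; ++first) ++counts[*first];
}

// Hypothetical helper: split the array into num_tasks sub-arrays, count each
// one on its own thread into a private histogram, then sum the histograms.
Histogram parallel_count(const std::vector<std::uint8_t>& a, std::size_t num_tasks) {
    assert(num_tasks > 0);
    std::vector<Histogram> partial(num_tasks);
    std::vector<std::thread> workers;
    const std::size_t chunk = (a.size() + num_tasks - 1) / num_tasks;
    for (std::size_t t = 0; t < num_tasks; ++t) {
        const std::size_t lo = std::min(a.size(), t * chunk);
        const std::size_t hi = std::min(a.size(), lo + chunk);
        workers.emplace_back([&a, &partial, lo, hi, t] {
            count_chunk(a.data() + lo, a.data() + hi, partial[t]);
        });
    }
    for (auto& w : workers) w.join();

    // Combine step: the "reduce" part of the pattern.
    Histogram total{};
    for (const auto& p : partial)
        for (std::size_t v = 0; v < total.size(); ++v) total[v] += p[v];
    return total;
}
```

Each task writes only to its own private histogram, so no locking is needed, and the amount of work per task depends only on the array length, not on the element values.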

The splitting into bins step (Step #2) consists of figuring out the start index of each bin, followed by moving each element of the array into its appropriate bin. The start of each bin could be computed independently. However, this would result in redundant computations being performed. Also, if the number of bins is relatively small, the parallel implementation may not result in a performance gain.
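The start-index computation can be sketched as a serial running sum, which is one way to see why computing each start independently would repeat work: every start index reuses the sum already accumulated for the previous bin. The function name `bin_starts` and the 256-bin (8-bit digit) layout are assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper: given per-bin counts for an array of n elements,
// return the index at which each bin starts (an exclusive prefix sum).
std::array<std::size_t, 256> bin_starts(const std::array<std::size_t, 256>& counts,
                                        std::size_t n) {
    std::array<std::size_t, 256> start{};
    std::size_t running = 0;
    for (std::size_t b = 0; b < counts.size(); ++b) {
        start[b] = running;    // bin b begins where bins 0..b-1 end
        running += counts[b];  // reused by the next bin instead of re-summing
    }
    assert(running == n);      // sanity check: the counts must cover the array
    return start;
}
```

With only 256 bins this loop is a few hundred additions, which suggests why parallelizing it may not pay off.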

Moving each element into its appropriate bin could be performed in parallel by separating the work performed for each destination bin. In other words, each parallel task would be responsible for a single bin (or a group of bins) and would scan the input array from the beginning, looking for elements that belong to the destination bin(s) it is responsible for. However, this requires scalable read memory bandwidth. Writes to memory would not overlap between parallel tasks and thus would not need mutual exclusion mechanisms. Parallelism of this step is data-dependent; e.g., when all elements of the current bin belong in a single sub-bin, only the task responsible for that sub-bin has elements to move.
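The per-bin division of work described above can be sketched as follows. This is an illustration only, with several assumptions that differ from the article's algorithm: it writes into a separate output buffer rather than working in place, it handles 8-bit elements with a single 8-bit digit, it uses one standard-library thread per group of bins, and the name `scatter_by_bin_groups` is hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: every task owns a contiguous group of bins, scans the
// whole input, and copies only the elements belonging to its own bins.
std::vector<std::uint8_t> scatter_by_bin_groups(const std::vector<std::uint8_t>& in,
                                                std::size_t num_tasks) {
    assert(num_tasks > 0);
    constexpr std::size_t kBins = 256;

    // Serial count followed by a prefix sum: cursor[b] is the next free
    // output slot of bin b.
    std::array<std::size_t, kBins> cursor{};
    for (std::uint8_t v : in) ++cursor[v];
    std::size_t running = 0;
    for (auto& c : cursor) {
        const std::size_t count = c;
        c = running;
        running += count;
    }

    std::vector<std::uint8_t> out(in.size());
    std::vector<std::thread> workers;
    const std::size_t group = (kBins + num_tasks - 1) / num_tasks;
    for (std::size_t t = 0; t < num_tasks; ++t) {
        const std::size_t lo = std::min(kBins, t * group);
        const std::size_t hi = std::min(kBins, lo + group);
        workers.emplace_back([&in, &out, &cursor, lo, hi] {
            // Full read scan per task; writes and cursor updates touch only
            // bins in [lo, hi), so tasks never write to the same location.
            for (std::uint8_t v : in)
                if (v >= lo && v < hi) out[cursor[v]++] = v;
        });
    }
    for (auto& w : workers) w.join();
    return out;
}
```

Every task reads the entire input, which is where the need for scalable read bandwidth comes from. If the input is constant, only one task ever writes, which illustrates the data dependence of this step.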

The recursion step (Step #3) for each bin has inherent parallelism. Once the array has been split into bins, each bin can be processed independently. Each bin is separate in memory space, so no mutual exclusion mechanisms are necessary, and data-parallel processing can be applied after the first split into bins. Parallelism of this step is data-dependent. In other words, some input data statistics will create many bins, while others will create only a few, thus influencing the amount of available parallelism. However, if fewer bins are created, these bins will be larger, creating more opportunity for parallelism in counting within each bin (but with less performance gain).

### Acceleration Bounds

Figure 2 showed the recursion tree of the MSD N-bit-Radix Sort algorithm. If the input data is random, then the tree will be nearly balanced. At the very top of the tree, when the input array is being counted, parallelism was exploited during the counting step. No parallelism was used for moving array elements into their respective sub-bins. Parallelism was used to recursively process each of the sub-bins.

At the second recursion level, these bins were processed in parallel, on as many cores as were available, for as many bins as were available; i.e., the amount of parallelism was

P<sub>1</sub> = min( cores, bins ) (1)

which shows data-dependence since the number of bins is data-dependent. Within each bin counting was performed in parallel with the amount of parallelism equal to

P<sub>2</sub> = min( cores, floor( bin_size / min_parallel_work )) (2)

which shows the limiting effect of the minimum quantum of work that is worthwhile to perform in parallel, and also shows data-independence.
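Equations (1) and (2) translate directly into two small functions. This is a sketch for experimenting with the bounds, not code from the implementation: the function names are hypothetical, and `min_parallel_work` is assumed here to be expressed in elements, in the same unit as `bin_size`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Equation (1): parallelism available from processing bins concurrently.
std::size_t bin_parallelism(std::size_t cores, std::size_t bins) {
    return std::min(cores, bins);
}

// Equation (2): parallelism available from counting within a single bin.
// Integer division implements floor( bin_size / min_parallel_work ).
std::size_t counting_parallelism(std::size_t cores, std::size_t bin_size,
                                 std::size_t min_parallel_work) {
    assert(min_parallel_work > 0);
    return std::min(cores, bin_size / min_parallel_work);
}
```

For example, with four cores a constant array gives `bin_parallelism(4, 1) == 1`, while a random array with 256 bins gives `bin_parallelism(4, 256) == 4`. A result of 0 from `counting_parallelism` can be read as "the bin is too small to split", in which case counting would run serially.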

On the second level of recursion, for a random input array, both P<sub>1</sub> and P<sub>2</sub> parallelisms were being utilized, with the number of cores limiting the overall parallelism and the level of performance acceleration. For 8-bit-Radix (256-Radix), 256 bins were available to be processed in parallel, with each bin being big enough to be split up to be counted in parallel.

For subsequent recursion levels the algorithm attempted to extract parallelism in the two cases tested: random array and constant array. In the case of a random array, Equation (2) above limited the amount of parallelism even for arrays with 100M elements, as the bins became too small to be executed in parallel efficiently (limiting the number of parallel recursions). In the case of a constant array, Equation (1) limited the amount of parallelism on all levels of recursion, due to the data-dependence of the algorithm implementation. Thus, in either case, no additional acceleration could be obtained for the rest of the recursion levels.

The bound on the number of parallel recursion levels results from the combination of the array size, the size of array elements, and the size of the digit used. The number of recursion levels is data-dependent. For example, sorting a 100-million-element array of random 64-bit elements with 8-bit digits has two levels of parallel recursion, because the bins hold about 1.5K elements after two levels of recursion (100M / 256 / 256), which is too small to be executed efficiently in parallel (less than 10K clock cycles of work). Thus, for 64-bit elements and 8-bit digits, eight recursion levels will be executed, but only two of them will be capable of parallelism. If the minimum quantum of parallel work were smaller, more levels of recursion would execute in parallel.
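The arithmetic in this example can be reproduced with a short estimate. The sketch below assumes uniformly random input, so that each bin splits evenly into 2^digit_bits sub-bins at every level. The names `RecursionEstimate` and `estimate_levels` are hypothetical, and so is the threshold `min_parallel_elements`: the article gives the threshold in clock cycles, not elements, so any element count passed in is an illustrative guess.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type: how many recursion levels exist in total and how
// many of them have bins large enough to be worth processing in parallel.
struct RecursionEstimate {
    unsigned total_levels;
    unsigned parallel_levels;
};

// Hypothetical helper, assuming uniformly random keys.
RecursionEstimate estimate_levels(std::uint64_t n, unsigned element_bits,
                                  unsigned digit_bits,
                                  std::uint64_t min_parallel_elements) {
    assert(digit_bits > 0 && digit_bits < 64 && element_bits % digit_bits == 0);
    RecursionEstimate r{element_bits / digit_bits, 0};
    std::uint64_t bin_size = n;  // at the top level the whole array is one bin
    for (unsigned level = 0; level < r.total_levels; ++level) {
        if (bin_size < min_parallel_elements) break;  // too small from here on
        ++r.parallel_levels;
        bin_size >>= digit_bits;  // expected sub-bin size at the next level
    }
    return r;
}
```

With 100 million 64-bit elements, 8-bit digits, and an assumed threshold of 10,000 elements, the bin sizes are 100,000,000, then 390,625, then 1,525, so the estimate is eight levels in total, two of them parallel, matching the example above.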

### Conclusion

In-Place N-bit-Radix Sort, a fast **O**(dn) algorithm, was transformed to a parallel implementation using Intel's TBB, enabling it to take advantage of multi-core processors. For random input arrays of unsigned 8-bit numbers it achieved up to 20% performance acceleration, up to 1.8X for 16-bit, up to 2.5X for 32-bit, and up to 2.2X for 64-bit, on a quad-core i7 860 processor.

However, the current implementation faltered in performance when presented with constant data elements, especially in its parallel form. Performance was several times slower than for random input data. Thus, random input distribution may not lead to the worst-case performance for sorting algorithms. Algorithm performance was shown to vary with input data distribution, with some algorithms showing higher variation than others.

Some of the inherent parallelism within the Radix Sort algorithm is data-dependent, such as processing multiple bins, where parallel memory access works great when there are many bins, but is not possible if all array elements end up in a single bin. Other inherent parallelism, such as the counting part, is data-independent. However, it contributed an order of magnitude less to performance acceleration. Parallel memory access provided the highest performance gain, since memory access (moving elements into their respective bins) was determined to be the bottleneck.

Two equations were developed to describe parallelism bounds for Parallel Radix Sort. They showed that in one case the performance is limited by either the number of cores or the number of bins (which is data-dependent). In the other case, parallel acceleration is not data-dependent, but is limited by the smallest amount of work that is worthwhile to run in parallel (which is about 10K clock cycles for TBB). The number of parallel recursion levels also bounds parallelism, and is itself partially bounded by the smallest quantum of parallel work.

Thus, the benefit of multicore parallelism is bounded by the size of the smallest worthwhile quantum of parallel work. 10K clock cycles is a fairly large chunk of work (roughly 3 microseconds on a 3-GHz core). Hopefully, multicore processor, system, and library designers will continue reducing this critical quantity, thereby increasing the applicability of parallel systems.

This was the first step in the effort to parallelize In-Place N-bit-Radix Sort. Several opportunities for further performance optimization and improvement were also identified.

### References

[1] V. J. Duvanenko, In-Place Hybrid N-bit-Radix Sort

[2] V. J. Duvanenko, Counting Sort Performance

[3] V. J. Duvanenko, Parallel Counting Sort

[4] V. J. Duvanenko, Parallel Counting Sort, Part 2

[5] Intel Threaded Building Blocks

[6] V. J. Duvanenko, Stable Hybrid MSD N-bit-Radix Sort

[7] Introsort