Data-Dependent Parallelism
Some algorithms exhibit performance that varies dramatically across input data variations. This may or may not be a desirable behavior. Table 4 and Figure 4 show performance measurements for various algorithms with all array elements being MAX_ULONG=0xffffffff. Performance of all elements of zeros is slightly better.
STL sort and TBB parallel_sort are significantly faster than the Radix family of algorithms for all but the smallest array size. STL sort varies as much as 137X with data input variation, with random input being slower. TBB parallel_sort varies as much as 55X.
Non-parallel Radix Sort algorithms (IPP Radix Sort and In-Place Radix Sort) improved in performance with non-random input, by up to 2X and 1.2X respectively. However, Parallel In-Place Radix Sort slowed down by 1.3X to 2.4X. Parallel In-Place Radix Sort is faster than the non-parallel version by up to 1.4X for larger array sizes (compared to 2.5X for random inputs); i.e., parallel speedup reduced with non-random inputs. Also, Intel IPP Radix Sort is faster than Parallel In-Place Radix Sort for non-random inputs, plus it uses a single processing core.
Thus, random input distribution may not lead to the worst case performance for sorting algorithms.
Other Data Types
So far, performance of sorting unsigned 32-bit arrays was studied. Non-parallel N-bit-Radix Sort performs well for other data type sizes as well, such as 8-bit, 16-bit and 64-bit (unsigned and signed), as in [1]. At 8-bit and 16-bit data types comparison to Parallel Counting Sort developed in [2 and 3] is possible. Listings 2 and 3 support various byte-divisible unsigned data type sizes. Tables 5-7 and Figures 5-7 show performance measurements for unsigned 8-bit, 16-bit and 64-bit data types, respectively.
For arrays of unsigned 8-bit elements, Parallel In-Place N-bit-Radix Sort provides at most about 20% in performance gain for the largest of array sizes, and 10% performance loss for mid-size array. It consistently outperforms STL sort() for all array sizes, up to 5X for mid-size arrays and 4.6X for the largest array sizes. It lags in performance consistently for all but the smallest array size against Intel's IPP Radix Sort. It consistently outperforms TBB parallel_sort across all array sizes, from 1.4X to 3.4X, with 1.8X for the largest array sizes. Parallel Counting Sort is substantially faster than all other algorithms, especially for the largest array sizes, outperforming Parallel Radix Sort by up to 14X and 3X faster than Intel's IPP Radix Sort for the largest array sizes.
For 16-bit arrays, Parallel In-Place Radix Sort provides up to 1.8X speed-up over the non-parallel implementation on a quad-core processor. It is consistently faster that STL sort, by up to 8.7X. It lags in performance to Intel's IPP Radix Sort, but outperforms TBB parallel_sort in all but one array size, by up to 2.5X. Parallel Counting Sort is nearly 10X faster than all other algorithms for the largest array sizes, but lags in performance for mid and small array sizes.
For 64-bit arrays, Parallel In-Place Radix Sort provides nearly 2X speedup over the non-parallel version, and is consistently faster except for two smaller array sizes. It is up to 6X faster than STL sort, and is faster except for one smaller array size. It is also consistently faster than TBB parallel_sort, up to 1.6X. Thus, Parallel In-Place Radix Sort consistently provides speedup over the non-parallel implementation across all tested data type sizes of unsigned 8-, 16-, 32-, and 64-bit.
Parallel Radix Sort implementation makes it possible for an input array of 8-bit elements to run on multiple cores by using 4-bit-Radix Sort (or 2-bit-Radix, or 1-bit-Radix). When performance was measured, comparing to 8-bit-Radix Sort, 4-bit-Radix Sort was about 12% slower, and 2-bit-Radix Sort was nearly 2X slower. Most likely multiple passes over the input data (bandwidth) is the limiting factor and warrants future investigation (as it could be due to parallel overhead).


