Comparing algorithms of dissimilar performance order, such as O(nlogn) for STL sort and O(kn) for Radix-Sort, should provide insight about when one algorithm is expected to be better than another. The comparison is easier to see if the orders are put in a similar form, such as O(log2(n)*n) and O(k*n). Both formulas have an n term in common, and thus the comparison is between log2(n) and k . Another way to see this is to divide the two formulas. In this case, n cancels out and log2(n)/k ratio remains. If log2(n) is bigger than k, the ratio will be bigger than 1. If k is bigger, the ratio will be less than 1.
Table 8 demonstrates this relationship. It shows that log2(n) increases by one when n doubles -- which follows from one of the rules of logarithms, where log( m*n ) = log(m) + log(n). In this case, log2( 2*n ) = log2(2) + log2(n) = 1 + log2(n). The second portion of the table shows that for log2(n)to double, n has to square, which follows from another rule of logarithms, where m*log(n) = log(nm), where m=2 in this case.
In Table 8 one of the entries is log2(4.3*109)=32 -- for example, n = 4.3E+9, or 4.3 billion. For this case, when the array contains 4.3 billion elements, then O(log(n)*n ) = 32*4.3*109, which is the estimate of the number of comparisons for the STL sort algorithm.
For Binary-Radix Sort, the order is O(k*n), where k is the number of bits in each element. If the elements are 32-bits each, then O(k*n) = 32*4.3*109 operations for an array with 4.3 billion elements. In this case, the estimates for STL sort and Binary-Radix Sort are equal. For the case of 8-bit elements, Binary-Radix Sort estimate is 8*4.3*109, whereas STL estimate stays at 32*4.3*109, predicting Binary-Radix Sort to be faster. For the case of 64-bit elements, the Binary-Radix-Sort estimate is 64*4.3*109, whereas STL estimate stays at 32*4.3*109, predicting STL sort to be faster.
These examples demonstrate that STL sort performance is estimated only from the number of elements in the array and not their size, whereas Binary-Radix Sort is estimated from the number of bits of each element and the number of elements. The order formulas predict that Binary-Radix Sort should be faster than STL sort (actually any optimal comparison sort) when the elements are made of a few bits, but that STL sort should be faster when the element size is large.
Tables 9 show measured performance comparisons. Intel IPP does not implement 8-bit and 64-bit signed integer sorting in non-Radix and Radix sorting.
Table 9 provides a lot of information, with several notables:
- All algorithms performed consistently with their unsigned counterparts (when existed), where signed and unsigned performance was nearly equal.
- IPP Radix Sort is significantly faster than all other algorithms for 16 and 32-bit signed -- 8X and about 3X faster respectively than the closest competitor.
- Order predictions hold accurately for Hybrid Binary-Radix Sort for 8, 16 and 32-bit cases, where the performance decreases by 2X as the number of bits (k) increases by 2X. However, for 64-bit case the performance does not decrease by 2X as predicted, but increases by about 16%.
- Order prediction is not accurate for STL sort, expecting the performance not to depend on the size of array elements, but only on the number of elements. However, STL sort performance decreases by 2X as the number of bits increases by 2X, for 8, 16 and 32-bit cases, but not for 64-bit.
- Order prediction for STL sort to beat Binary-Radix Sort did not hold for up to 64-bits.
At first it may seem that Binary-Radix sort is a good candidate for multicore and multithreading. It splits the data set into two portions and then splits those further recursively. However, the split of the array is data dependent. In other words, the split will not always be even, which leads to uneven load balancing. This may be one of the reasons that Intel did not multithread its Radix-Sort implementation.
In-place Hybrid Binary-Radix Sort (MSD-style) algorithm was developed and improved through several performance optimizations. Insertion Sort was used at the bottom of the recursion tree. Over 40% performance improvement was achieved from the initial implementation, by removing redundant operations. The implementation was extended to handle unsigned and signed integers from 8-bits to 64-bits. A data-type-aware generic interface, overloaded functions, was used to encapsulate unique unsigned and signed implementations under a common interface. The resulting algorithm compared favorably with STL sort implementation, outperforming it by at least 15% for random input data and 32 & 64-bit increasing and decreasing data sets.
Comparison to Intel's IPP sort (in-place) and Radix sort (not in-place) was also performed, where IPP's Radix Sort was found to perform 20X, 8X and about 3X faster for 8-bit, 16-bit and 32-bit unsigned respectively. Hybrid Binary-Radix Sort outperformed IPP's sort for 16 and 32-bit data types, but lagged by 20X for 8-bit unsigned data type.
Inconsistencies in using algorithm order for performance prediction were found for STL sort for most data sizes, whereas Hybrid Binary-Radix Sort was predicted consistently except for 64-bit data size.
A further hybrid in-place algorithm can be evolved by combining the best attributes of Intel's IPP algorithms with in-place Hybrid Binary-Radix Sort. For instance, the high-performance IPP 8-bit in-place sort can be integrated under the generic interface developed. For the 8-bit unsigned overloaded function implementation, Intel's IPP sort function would be called. The combined algorithm would retain its generic interface, but yet would have data type specific optimized implementations. This leads to "generic data type adaptive" algorithms, which would retain a generic interface and adapt to perform optimally for each data type. Purely generic algorithms miss this opportunity, as well as the 20X performance improvement shown for 8-bit data types.
Floating-point support would be a nice extension. Intel IPP sorting routines support floating-point: single and double-precision. Creating a more sophisticated generic implementation, such as STL's, would allow custom classes to be sorted. The use of iterators would reduce the number of items passed to each level of recursion from currently 4 down to 3 (first and last iterator, and bitMask), possibly improving performance.
 Intel Integrated Performance Primitives for Intel Architecture, Reference Manual, Volume 1: Signal Processing, August 2008, pp. 5-57 - 5-61.
Jim Vaught of Arxan Defense Systems -- personal discussion.
 V. J. Duvanenko, Algorithm Improvement through Performance Measurement: Part 1, Dr. Dobb's
 Scott Miller of Arxan Defense Systems -- personal discussion.