### Improving Hybrid Abstraction

Several overloaded functions were used in [1] and [3], one for each supported data type (such as char, unsigned char, short, unsigned short, unsigned long, long, etc.). This method provided a single consistent interface that handled sorting all of the supported data types. This method enabled the selection of the best suited sorting algorithm for each data type. For instance, Counting Sort was shown to be significantly faster than other sorting algorithms for 8-bit and 16-bit numbers, and is used to sort unsigned char data type and unsigned short [1].

Listing 5 shows an implementation for this hybrid abstraction, which is an improvement on [1]. Note these functions have very little in them, as most of the initialization details have been integrated into a generic top-level template function (unsigned and signed versions). Note the use of the sizeof() to determine the number of bits in a type, to construct the mask of appropriate matching size.

// Copyright(c), Victor J. Duvanenko, 2009 // Hybrid Stable Radix Sort implementations. #ifndef _HybridStableRadixSort_h #define _HybridStableRadixSort_h #include "StableNotInPlaceMSDRadixSort.h" #include "CountSort.h" // Since 8-bit sorting uses Counting Sort, the extra array "b" is not needed. inline void HybridSort_stable( char* a, unsigned long a_size, char* b = NULL ) { CountSortInPlace( a, a_size ); //RadixSortMSDStablePowerOf2Radix_signed( a, b, a_size ); // works for 8-bit types, but Count Sort is much faster } // Since 8-bit sorting uses Counting Sort, the extra array "b" is not needed. inline void HybridSort_stable( unsigned char* a, unsigned long a_size, unsigned char* b = NULL ) { CountSortInPlaceHybrid( a, a_size ); //RadixSortMSDStablePowerOf2Radix_unsigned( a, b, a_size ); // works for 8-bit types, but Count Sort is much faster } inline void HybridSort_stable( short* a, unsigned long a_size, short* b = NULL ) { CountSortInPlace( a, a_size ); //RadixSortMSDStablePowerOf2Radix_signed( a, b, a_size ); // works for 16-bit types, but Count Sort is much faster } // Since 8-bit sorting uses Counting Sort, the extra array "b" is not needed. inline void HybridSort_stable( unsigned short* a, unsigned long a_size, unsigned short* b = NULL ) { CountSortInPlaceHybrid( a, a_size ); //RadixSortMSDStablePowerOf2Radix_unsigned( a, b, a_size ); // works for 16-bit types, but Count Sort is much faster } inline void HybridSort_stable( long* a, unsigned long a_size, long* b ) { RadixSortMSDStablePowerOf2Radix_signed( a, b, a_size ); } inline void HybridSort_stable( unsigned long* a, unsigned long a_size, unsigned long* b ) { RadixSortMSDStablePowerOf2Radix_unsigned( a, b, a_size ); } inline void HybridSort_stable( int* a, unsigned long a_size, int* b ) { RadixSortMSDStablePowerOf2Radix_signed( a, b, a_size ); } inline void HybridSort_stable( unsigned int* a, unsigned long a_size, unsigned int* b ) { RadixSortMSDStablePowerOf2Radix_unsigned( a, b, a_size ); } inline void HybridSort_stable( __int64* a, unsigned long a_size, __int64* b ) { RadixSortMSDStablePowerOf2Radix_signed( a, b, a_size ); } inline void HybridSort_stable( unsigned __int64* a, unsigned long a_size, unsigned __int64* b ) { RadixSortMSDStablePowerOf2Radix_unsigned( a, b, a_size ); } #endif // _HybridStableRadixSort_h

Tables 4-7 and Graphs 4-7 show performance measurements for signed and unsigned data types, along with other algorithms, including in-place N-bit-Radix Sort.

Note that STL **stable_sort()** is significantly slower than STL **sort()** for 8-bit data types (signed or unsigned). Stable Hybrid N-bit-Radix Sort significantly outperforms STL **sort()**, by up to 65X for large array sizes, but outperforms Intel's IPP Radix Sort by 10-30% for larger array sizes. This is due to the user of Count Sort for 8-bit data types. In-Place Hybrid N-Bit-Radix Sort in [1] did not use Count Sort for signed 8-bit data type and is significantly slower because of the use of 8-bit-Radix Sort, but is still more than 10X faster than STL **stable_sort()**.

Note that STL **stable_sort()** is slower than STL **sort()** for 16-bit data types (signed or unsigned), but by a smaller amount than for 8-bit data types. Stable Hybrid N-bit-Radix Sort significantly outperforms STL sort(), by up to 38X for large array sizes, and outperforms Intel's IPP Radix Sort by up to 2.5X for larger array sizes. This is due to the user of Count Sort for 16-bit data types, and is consistent with results in [1]. In-Place Hybrid N-Bit-Radix Sort in [1] did not use Count Sort for signed 16-bit data type and is significantly slower because of the use of 8-bit-Radix Sort, but is still many times faster than STL **stable_sort()**and STL **sort()**. Count Sort does not perform well between 100 and 10K array sizes, and could be substituted with Radix Sort to improve performance for this region of array sizes.

Note that STL **stable_sort()** is slower than STL sort() for 32-bit data types (signed or unsigned), by less than 10% (smaller difference than for 16-bit and 8-bit). Stable Hybrid N-bit-Radix Sort significantly and consistently outperforms STL **sort()**, by up to 5X for large array sizes. It also outperforms Intel's IPP Radix Sort by up to 30% for largest array size, but not consistently, where for 10K element array size it underperforms by 40%. Count Sort is not used for 32-bit data types.

Note that STL **stable_sort()** is slower than STL **sort()** for 64-bit data types (signed or unsigned), by less than 12% (similar difference as for 32-bit). Stable Hybrid N-bit-Radix Sort significantly and consistently outperforms STL **sort()**, by up to 3.7X for large array sizes.

All algorithms did not show significant deviation between processing signed and unsigned integer data types. The above results of comparing performance between all algorithms considered are consistent with results obtained in [1].

### Conclusion

An implementation of a Stable Hybrid N-bit-Radix Sort was developed and its performance measured. Starting from an implementation similar to the In-Place N-bit Radix Sort of [1], but transformed to be stable, it was combined with Insertion Sort to produce a high performance hybrid sorting algorithm. This hybrid combination was extended further in capability by adding support for signed integers and in performance by combining with Counting Sort for 8-bit and 16-bit data types, as was done in [1]. The resulting data type adaptive stable sorting algorithm produced superior performance.

Performance compared to STL **stable_sort()** was over 60X faster for large arrays of 8-bit elements, over 25X faster for large arrays of 16-bit elements, over 4X faster for large arrays of 32-bit, and about 3X faster for 64-bit unsigned and signed integers. Performance lead was consistent across array sizes and data element sizes.

Performance compared to Intel's IPP Radix Sort was not as consistently dominant. It was 10-30% faster for larger arrays of 8-bit elements, 2-2.5X faster for 16-bit for large array sizes (but lagged in performance as high as 2X for 1K to 10K array sizes), up to 30% faster for large arrays of 32-bit elements ( but lagged by up to 40% for 1K to 100K array sizes).

Performance compared to the In-Place N-Bit-Radix Sort of [1] was consistently higher for larger array of 32-bit and 64-bit data sizes.

Keep in mind that the resulting algorithm is not in-place, requiring an additional array of equivalent size to the input array. This requirement is the same for Intel’s IPP Radix Sort. STL stable_sort() also allocates extra memory automatically if memory is available [2].

Stable N-bit-Radix Sort benefited from a hybrid approach to similar levels as the In-Place N-bit-Radix Sort algorithm -- no benefit for 8-bit (due to the use of 8-bit Radix), some acceleration for 16-bit elements for arrays of less than 100K, 2.5-18X gain for 32-bit, and 4-27X for 64-bit. Without the hybrid approach this algorithm outperforms STL for 8, 16-bit and 32-bit data types, but not for 64-bit data type. Most likely this is because for 8-bit performs a single pass over the data, 16-bit performs two passes, 32-bit performs four, and 64-bit performs 8 passes, whereas comparison based algorithms such as STL **stable_sort()** do not perform more work as data type size increases (for natively supported types by the CPU). This demonstrates that a hybrid approach is significantly beneficial for Stable and In-Place Radix Sort implementations.

The power of generic functions in C++ was utilized to pass the Radix, Threshold and the data type to the core sorting function. This allowed the implementation to be efficient and yet flexible to be able to choose the Radix and Threshold, but only statically. It should be possible to vary the radix dynamically -- data adaptively, type adaptively and/or array size adaptively.

STL **stable_sort()** cost performance and used more memory comparing with STL **sort()** (in-place and non-stable). Stable Hybrid MSD N-bit-Radix Sort cost memory, but improved performance over In-Place Hybrid MSD N-bit-Radix Sort.

The validity of random distribution input data set as the worst case is questionable. Since we've shown that for 100M elements only 3 digits out of 8 are needed to sort the array, there should be distributions that are bad for the first 5 digits and then sort based on the last 3 digits, or possibly where the whole array fits into a single bin always. Input data distribution effects on performance will be explored in future articles.

### References

[1] In-place Hybrid N-bit-Radix Sort, by V. J. Duvanenko.

[2] The C++ Standard Template Library, by P.J. Plauger, A.A. Stepanov, M. Lee, D.R. Musser, p. 199.

[3] In-place Hybrid Binary-Radix Sort by V. J. Duvanenko

[4] C++ Templates: The Complete Guide, by D. Vandevoorde, N.M. Josuttis, pp. 9-19.

[5] Numerical Recipes in C: The Art of Scientific Computing Second Edition, by W. H. Press, S.A. Teukolsky, W.T. Vetterling, B. P. Flannery, pp. 274-283.