# Algorithm Improvement through Performance Measurement: Part 2

In-place Hybrid Binary-Radix Sort

Radix sorting algorithms do not compare elements of the array to each other and thus are not bound by O(nlogn) performance limit of comparison-based sorting algorithms. In fact, they have O(kn) linear performance. Radix Sort processes array elements one digit at a time, starting ether from the most significant digit (MSD) or the least significant digit (LSD). For example, to sort the following array of 2-digit numbers:

One way to implement Radix Sort is to take the input array and look at the highest digit of each element. Create a bin (a separate array) for 0's, a bin for 1's, a bin for 2's,…, a bin for 9's. Put all array elements into the bin where the most-significant-digit matches the bin number. Then continue doing this recursively for each bin, based on the next digit. Once all numbers are split into separate bins based on all of their digits, collect them starting with the smallest most significant digit and smallest next digit and so on; for example, collect all digits in the 00 bin, then 01 bin, then 02 bin, …, 09 bin, 10 bin, 11 bin and so on until 99 bin. The result is a sorted array.

One weakness of Radix sort algorithms is the requirement of extra storage (for bins) besides the input array; that is, not being in-place. For instance, Intel's Integrated Performance Primitives (IPP) library implements a Radix-sort algorithm with the function call prototype of:

```
IppStatus ippsSortRadixAscend_32u_I(Ipp32u* pSrcDst, Ipp32u* pTmp, Ipp32s len);

```

The temporary vector pTmp is required by the algorithm, with size equal to that of the source vector [1]. Note, that the array is labeled pSrcDst, indicating that the array serves as both the source and the destination for the sorting operation. Intel's IPP library also implements a non-Radix sort with the function call prototype of:

```
IppStatus ippsSortAscend_32s_I(Ipp32s* pSrcDst, int len);

```

which requires no additional storage -- that is, it is in-place.

The radix does not have to be decimal, but can be any integer. For example, decimal-Radix is base-10 (with values 0 through 9), hex-radix is base-16 (values 0 through 15), octal-radix is base-8 (values 0 through 7), and binary-radix is base-2 (values 0 and 1).

Decimal-Radix is familiar for humans, but binary-radix (base-2) is the smallest useful radix. With binary-radix there are only two possible bins: 0's bin and 1's bin. For example, the array is scanned from the beginning to the end while looking at the most-significant-digit, which is the most-significant-bit (MSB) for a binary-radix. If the MSB is a zero, then the array element goes into 0's bin, or if the MSB is a 1, then the array element goes into 1's bin.

After splitting the original array into these two bins, the number of elements in 0's bin plus the number of elements in 1's bin add up to the number of elements in the original array. The 0's bin can now be split into 0's bin and 1's bin, using the next bit as the sorting decision. The same is done with the 1's bin. This process continues recursively until all bits have been exhausted. Figure 1 illustrates this procedure.

Figure 1

For instance, if the array elements are 32-bit, then 32 passes over each array element will be performed, sorting it into 0's and 1's bin for each of the 32-bits. Thus, Binary-Radix Sort is O(kn), where k is the number of bits in each array element and n is the number of array elements. In other words, each array element will be processed 32 times, once for each of its bits.

### In-place

In-place is a useful property for a sorting algorithm, where the input array is processed without requiring additional memory, and serves as the output array. If an algorithm is in-place, then it can process larger arrays than an algorithm that is not in-place. This difference was illustrated above with Intel's IPP functions for Radix Sort, which requires additional memory of equal size to the original array and is not in-place. However, the IPP non-Radix Sort requires no additional memory and is in-place.

Binary Radix-Sort can be made in-place by the following method. Instead of allocating additional memory for 0's bin and 1's bin, the original array can be used to store elements of both bins. The 0's bin is grown from the beginning of the array, and the 1's bin is grown from the end of the array. When the binning process is done, these two bins will meet somewhere within the array, and will not overlap.

In the next iteration of the algorithm, the bins are created based on the next most-significant-bit, where the 0's bin from the first iteration is split into two bins: 0's bin and 1's bin. The same is done for the 1's bin from the first iteration. This process continues recursively until all bits of the elements have been processed.

The first iteration of this algorithm, which sorts based on the most-significant-bit (MSB), is demonstrated below. The pointer on the left keeps track of 0's bin current location. The pointer on the right keeps track of 1's bin current location.

Note that all array elements with MSB of zero are moved to the left side of the array, and the elements with MSB of one are moved to the right side of the array. The array is not completely sorted yet. The algorithm calls itself recursively with the 0's bin (the left side of the array), and the 1's bin (the right side of the array), and sort both based on the next bit. Then the algorithm calls itself recursively one more time to sort based on the last bit.

Listing 1 shows an implementation for in-place Binary-Radix Sort algorithm.

```
// Copyright(c), Victor J. Duvanenko, 2009
// In-place Binary Radix Sort implementations.

// Swap that does not check for self-assignment.
template< class _Type >
inline void _swap( _Type& a, _Type& b )
{
_Type tmp = a;
a         = b;
b         = tmp;
}

template< class _Type >
inline void _binaryRadixSort_initialUnsigned( _Type* a, long first, long last, _Type bitMask )
{
// Split the provided array range into 0's and 1's bins
long _zerosEndPlusOne = first;						// index is one beyond the last  0's bin
long _onesEndMinusOne = last;						// index is one before the first 1's bin
for ( ; _zerosEndPlusOne <= _onesEndMinusOne; )
{
if ( 0 == ( bitMask & a[ _zerosEndPlusOne ] ))				// 0's bin
{
// this element belongs in the 0's bin
// it stays at its current location and the 0's bin size is increased
_zerosEndPlusOne++;
}
else {														// 1's bin
_swap( a[ _zerosEndPlusOne ], a[ _onesEndMinusOne ] );	// move this element to 1's bin
_onesEndMinusOne--;										// increase the 1's bin size
}
}
// At this point the provided array portion has been split into 0's bin and 1's bin
// Recursively call to sort this 0's bin and this 1's bin using the next bit lower
if ( bitMask != 0 )						// end recursion when all the bits have been processes
{
if ( first < ( _zerosEndPlusOne - 1 ))
_binaryRadixSort_initialUnsigned( a, first, _zerosEndPlusOne - 1, bitMask );
if (( _onesEndMinusOne + 1 ) < last )
_binaryRadixSort_initialUnsigned( a, _onesEndMinusOne + 1,  last, bitMask );
}
}
inline void binaryRadixSortInPlace_initialUnsigned( unsigned long* a, unsigned long a_size )
{
if ( a_size < 2 )	return;

unsigned long bitMask = 0x80000000;

_binaryRadixSort_initialUnsigned( a, 0, a_size - 1, bitMask );
}

```
Listing 1

The outermost routine binaryRadixSortInPlace_initialUnsigned() sets the initial bit-mask to select the most significant bit and calls a template function _binaryRadixSort_initialUnsigned(). This core function splits the provided array range, between first and last elements, into 0's bin and 1's bin, in-place. Then it shifts the bit-mask to the right to select the next bit down, and calls itself recursively twice: once for the 0's bin, and the second time for the 1's bin. During the splitting process the algorithm uses the _swap() generic function template to move the element to 1's bin.

Table 1 shows performance benchmarks of in-place Binary-Radix Sort, STL sort, and Intel's IPP Radix-sort functions for an array of random unsigned 32-bit elements. Intel IPP library also has a non-Radix sort function, which handles a variety of input data types, but not 32-bit unsigned integers [1].

Table 1: Random 32-bit Unsigned Elements

These measurements and the ratios computed from them show that Binary-Radix Sort approaches STL sort in performance for random input data sets. Intel's IPP Radix Sort outperformed Binary-Radix Sort by about 4-6X, for 32-bit unsigned numbers.

 To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.

### Features

• #### How Data Dependence Affects Performance

Data dependence between statements is a straightjacket on the compiler's ability to optimize code for parallelism. So to get the maximum benefit from parallel code, data dependence must be carefully managed

• #### The Intel Threading Building Blocks Flow Graph

User feedback inspired the flow graph feature in Intel Threading Building Blocks, which allows programmers to express static and dynamic dependency graphs, as well as reactive or event-based graphs.

• #### Dataflow Programming: Handling Huge Data Loads Without Adding Complexity

Manage the data pipelines rather than webs of threads and processes

• #### Language of the Month: Intel's Cilk Plus

The latest C/C++ language extension to facilitate parallel programming

More Features >>