Algorithm Improvement through Performance Measurement: Part 1

Algorithm Improvement through Performance Measurement: Part 2

Algorithm Improvement through Performance Measurement: Part 3

Algorithm Improvement through Performance Measurement: Part 4

Algorithm Improvement through Performance Measurement: Part 5

Algorithm Improvement through Performance Measurement: Part 6

To date, the focus for this series has been on measurement-based performance optimization of various sorting algorithms, such as Counting Sort, several Radix Sorts, Insertion and Selections Sorts, on a single processor. These implementations, while high-performance through careful use of the hybrid algorithm and data type adaptable approach, use only a portion of the available computational resources. On a quad-core hyper-threaded processor such as the Intel i7 860, these single-threaded algorithms utilize somewhere between 12% and 25% of the processor, leaving 75% to 88% of the compute power idle. As the number of cores increases, following Moore's self-fulfilling prophecy, a higher percentage of the CPU will be unutilized by single-threaded algorithm implementations. In this article and ones to come, the previously developed high-performance serial sorting algorithms will be refactored into parallel algorithms to raise the performance even higher by utilizing multicore computational resources.

### Tools

Several companies have worked tirelessly to create tools to enable parallel programming. Intel has a vested interest in enabling multicore programming, due to silicon shrinking continuing its path while frequency and power density growth is unable to scale at the same rate. Intel, for instance, has developed Threading Building Blocks and Integrated Performance Primitives (IPP) libraries, along with a powerful suite of compiler, multicore/multi-thread development and performance analysis tools. The Intel C++ compiler supports the latest version of the OpenMP standard, which is one of the parallel programming methods. Intel has also acquired Cilk Arts and the Cilk++ parallel compiler, and is developing Ct Technology.

nVidia has developed CUDA programming environment to enable programs to runs on hundreds of cores in graphics processors (GPUs) across many GPUs, enabling supercomputer level of performance. OpenCL is an industry effort at portable parallel programming across CPUs and graphics processors (GPUs). Sieve C++ Parallel programming system offers a portable environment for many CPUs and game systems. Microsoft Visual Studio 2010 has a suite of tools to enable visualization and debug of parallel applications, as well a Parallel Pattern Library (PPL). These are wonderful developments at simplifying the laborious and complex task of parallel programming. I'll explore some of these exciting tools in this series by applying them to the problem of transforming sorting algorithms from serial to parallel.

The first tool to be explored is Intel's Threading Building Blocks (TBB), a thread-pool management library along with sophisticated templates for C++ parallel programming development. With many parallel constructs and a work-stealing thread scheduler, TBB enables the programmer to focus on the parallel problem instead of the mechanics of thread management. TBB works on several operating systems and has support for a variety of parallel programming patterns. Let's dive right in and apply TBB to the fastest sorting algorithm developed so far -- the Counting Sort.

### Counting Sort

Counting Sort is a very efficient, in-place, high-performance sorting algorithm that is more than 20X faster than STL **sort()**. It handles arrays of 8-bit and 16-bit signed and unsigned numbers very well [2]. For arrays of numbers it does not move the array elements, but instead counts the occurrence of each possible value, and then constructs the sorted array from these counts. Counting Sort is a linear-time O(n) algorithm, which performs 2n amount of work -- n for reading/counting, n for writing/constructing.

Listing 1 is an implementation of the Counting Sort algorithm for sorting an array of 8-bit unsigned numbers.

// Copyright(c), Victor J. Duvanenko, 2010 inline void CountingSort( unsigned char* a, unsigned long a_size ) { if ( a_size < 2 ) return; const unsigned long numberOfCounts = 256; // one count for each possible value of an 8-bit element (0-255) unsigned long count[ numberOfCounts ] = { 0 }; // count array is initialized to zero by the compiler // Scan the array and count the number of times each value appears for( unsigned long i = 0; i < a_size; i++ ) count[ a[ i ] ]++; // Fill the array with the number of 0's that were counted, followed by the number of 1's, and then 2's and so on unsigned long n = 0; for( unsigned long i = 0; i < numberOfCounts; i++ ) for( unsigned long j = 0; j < count[ i ]; j++ ) a[ n++ ] = (unsigned char)i; }

The algorithm creates a count array of 256 elements (32-bits each) -- one for each possible value for an 8-bit number -- and initializes the count array to zero. The first for loop counts the occurrence of each 8-bit value by scanning the array from beginning to end. The second for loop creates the sorted array by storing/writing the counted number of each element. For a detailed explanation of the Count Sort algorithm for arrays of unsigned and signed numbers see Part 6 of this series.