### Parallel Algorithms

As the number of cores increases, single-threaded purely sequential algorithms will utilize a shrinking percentage of the overall processor — 25% on a quad-core CPU, 12% on an octal-core, and so on. Parallel algorithms scale in performance along with the growing number of cores (ideally), while presenting the same interfaces. So far in this series, two parallel algorithms were developed and studied: Parallel Counting Sort and Parallel In-Place MSD N-bit-Radix Sort. These parallel implementations used Intel's Threading Building Blocks (TBB), which provides a portable infrastructure enabling usage of tasks that have much lower overhead than threads. Tasks are scheduled by TBB onto thread pools using work-stealing scheduling.

We encountered two types of parallelism: data-independent and data-dependent. In data-dependent parallelism, the amount of parallelism varies with input data values, whereas data-independent parallelism is oblivious to input data values. Parallel Counting Sort has data-independent parallelism, obliviously splitting the array into separate sections, scanning each section independently in parallel, followed by combining parallel intermediate results into a single final result. MSD Radix Sort has some data-dependent parallelism, as it processes each of the bins in parallel, where the level of parallelism is the same as the number of bins that contain at least one element. If all elements are of the same value, they end up in a single bin, resulting in no parallelism in that portion of the algorithm. Data-dependent parallelism is weaker than data-independent parallelism, since the amount of parallelism is less consistent — leading to less consistent performance across input data distributions. For instance, in the case of Parallel In-Place Radix Sort, performance decreased by over two times with data input that is constant.

Converting an algorithm to a parallel implementation does not guarantee performance improvement. When parallelism was applied to the writing portion of Parallel Counting Sort, performance didn't improve as memory bandwidth bottleneck was reached. Memory bandwidth saturation was reached even by a single core. Parallel execution of this portion of the algorithm wasted CPU/core resources. This presents a dilemma to parallel algorithm developers: "How can parallel algorithms be designed to be adaptive to run efficiently on today's machines as well as future ones?" In the case of Counting Sort, on today's machine, the algorithm should use only a single core for its writing portion. However, this choice will certainly be wrong for a future machine where memory bandwidth is beyond the ability of a single core to use it all.

Sequential algorithms are typically optimized for time and space, such as order. Parallel algorithms add more dimensions for optimization: e.g., sorting at elements/second/core instead of elements/second. Optimization in multiple dimensions will be useful.

Several parallel patterns were explored by usage, in the implementation of these parallel algorithms, and are supported by TBB. Parallel_reduce is a split-join pattern, where the input array is split into some number of sections, processing each section independently, and then joining the results from each section into a single result. Parallel_reduce was used in Parallel Counting Sort and Parallel MSD Radix Sort. Parallel_for is a pattern that executes several things in parallel. Parallel_for was used in Parallel MSD Radix Sort to process up to the radix number of bins in parallel (independently).

Grain size is one of the critical parameters for parallel algorithms. It's the minimum amount of work that is worth doing in parallel. If the amount of work is smaller than the grain size, then it should not be done in parallel, since the overhead of scheduling to do this work (moving it to another core) is higher than just doing the work in sequential fashion. Currently, TBB recommends grain size to be about 10K CPU cycles, which is about 3 microseconds on a 3 GHz CPU. "Automagically" determining grain size for today's CPU and future ones will be a struggle point for parallel algorithm developers. Self-tuning methods that determine optimal grain size dynamically will need to be developed.

The data-parallel method was used in Parallel N-bit-Radix Sort and Parallel Counting Sort, where the input data is split into **P** non-overlapping parts, which are processed independently in parallel by up to **P** cores, oblivious to the actual number of cores in the system.

Parallel N-bit-Radix Sort was limited by the grain size in the amount of parallelism, because bins got smaller than the grain size after two levels of recursion. It was also limited by not having enough parallelism within the top level of recursion. Parallel counting was used inside Parallel Radix Sort, resulting in about a 13% performance gain. Parallel Radix Sort is about eight times faster than STL sort for arrays of 32-bit integers and about two times faster than Intel's IPP Radix Sort (which uses a single CPU), as well as two times faster than TBB's parallel sort. Plus, Radix Sort is in-place, whereas Intel's IPP Radix Sort is not and requires an external array of the same size as the input array size. However, Intel IPP tops elements/second/core metric.

Parallel Counting Sort experienced 3X speedup for 8-bit and 2X speedup for 16-bit on quad-core CPU, for an overall 67X faster than STL sort. Parallel Counting Sort reached 1.7 GigaElements/second, and 16-bit peaked at just over 1 GigaElements/second, and is the only sorting algorithm explored to break GigaElements/second. The 16-bit version exhibited performance problems with smaller arrays. Attempts at parallelizing the writing portion of the algorithm using Parallel_for failed due to memory bandwidth limitations of the system. Sorting small and medium size arrays of 16-bit numbers with Parallel Counting Sort aggravated performance issues, most likely due to the counting array not fitting into L1 cache and being the same size as L2 cache of each core. Adjusting grain size to 10K elements significantly improved performance for small and medium size arrays of 8-bit elements and arrays of 16-bit elements, nearly eliminating performance issues and indicating that task switching overhead was potentially the cause of the issue. Performance measurements showed that the counting portion of the algorithms took over 90% of the time, and optimizing this part of the algorithm would provide the highest gains, whereas memory bandwidth was not the performance limiter. Counting Sort was predicted to continue scaling by at least 10X if the number of cores increases even without an increase in memory bandwidth.

Performance measurement is a pillar of performance optimization. By measuring before optimizing, the true performance bottlenecks are discovered, enabling developers to focus efforts on areas that will maximize return on effort. Then do it again, and again, and again …

Parallel tool developers have focused on reduction of task switching overhead. However, it needs to be reduced further to increase performance and extract more of the available parallelism and applicability. As we've seen with grain size, at times the minimal parallel quanta is in tens of thousands of items, which limits parallelism to be coarse, limiting performance gains. Maybe a mechanism similar to hyperthreading could be developed to accelerate task switching — call it hypertasking.

Dynamic allocation of resources, by virtualization and cloud computing, forces the parallel algorithm and infrastructure developer to create adaptable algorithms and infrastructures that share possibly heterogeneous (e.g. CPU and GPU) computational resources.

Parallel algorithm developers have not only correctness to deal with, but also more dimensions within exploration space, along with more parameters to explore.

Parallel algorithm developers have numerous available tools, such as Intel's TBB and ArBB and Cilk++, Microsoft's PPL and DirectCompute, nVidia's CUDA, OpenMP, OpenCL, Sieve, as well as various parallel algorithm libraries. These tools help developers focus on algorithms and not on infrastructure. Some of these tools cater only to CPU or only to GPU developers, while agnostic tools enable developers to be oblivious to the type of computing resources. Current CPUs contain a several cores, while GPUs contain several hundred — providing plenty of exploration space. Microsoft and Apple have parallel runtimes that allow multiple applications to schedule and use multicore processing effectively. Universities are not only developing cutting-edge parallel tools, but are also starting to teach parallel programming on a larger scale.