### Tuning

Intel TBB provides several tuning capabilities for its parallel infrastructure. One of the tuning "knobs" is grain size, which specifies when to stop subdividing the array into smaller chunks. Intel suggests splitting arrays into chunks of work that each take more than 10K clock cycles to process, to keep the parallelization overhead negligible. Grain size is specified as the third parameter to the **blocked_range** constructor, as shown in:

parallel_reduce( blocked_range<unsigned long>(0, a_size, 10000), count );

where the grain size is set to 10K array elements, as a guess for a chunk of the array that will take at least 10K clock cycles to process. Tables 8 and 9, along with Figures 8 and 9, show performance measurements of 8-bit and 16-bit Parallel Counting Sort algorithms with grain size set to 10K.
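To make the effect of grain size concrete, the following is a minimal sequential sketch of the split-and-join pattern that **parallel_reduce** applies to a **blocked_range**. It is not TBB code and runs on one thread: `Histogram`, `count_range` and the `leaves` counter are hypothetical names introduced here for illustration. It assumes the simple rule that a range is split in half while it holds more elements than the grain size; a real TBB partitioner may stop splitting sooner.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical type: one counter per possible 8-bit value.
using Histogram = std::array<std::size_t, 256>;

// Sequential stand-in (an assumption, not TBB itself) for the recursive
// splitting of a blocked_range: split [begin, end) in half while it holds
// more than `grain` elements, count each leaf chunk into its own histogram,
// then join the two halves by adding their counts.
// `leaves` records how many leaf chunks were produced.
inline Histogram count_range(const std::vector<std::uint8_t>& a,
                             std::size_t begin, std::size_t end,
                             std::size_t grain, std::size_t& leaves) {
    if (end - begin > grain) {
        const std::size_t mid = begin + (end - begin) / 2;
        Histogram left = count_range(a, begin, mid, grain, leaves);
        const Histogram right = count_range(a, mid, end, grain, leaves);
        for (std::size_t i = 0; i < left.size(); ++i) {
            left[i] += right[i];  // join step
        }
        return left;
    }
    ++leaves;                     // this chunk is small enough to process whole
    Histogram h{};                // zero-initialized
    for (std::size_t i = begin; i < end; ++i) {
        ++h[a[i]];
    }
    return h;
}
```

Under this halving rule, a 100,000-element array with a grain size of 10,000 ends up as 16 leaf chunks of 6,250 elements each, while a grain size equal to the array length leaves a single chunk and therefore no parallelism to exploit.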

The guess of 10K for grain size turned out to be good: it improved parallel algorithm performance for small input arrays of 8-bit and 16-bit numbers, while affecting performance for large arrays only slightly. A small overhead relative to the non-parallel implementation is still apparent for small arrays. Hyperthreading does not appear to improve performance, and at times degrades it slightly. This is not consistent with earlier results, where hyperthreading improved performance of the 8-bit algorithm by as much as 10%. Increasing the number of physical cores improves performance more than hyperthreading does.

Grain size is one of the adjustable parameters that TBB makes available to developers. Also, several task schedulers can be chosen, which can dramatically affect performance. Automatically setting grain sizes in algorithms, so that they adapt to future runtime optimizations as well as future processor architectures, would be a beneficial future development.

### Performance Optimization

Table 9 shows the time spent counting and time spent writing within the non-parallel Counting Sort algorithm on an array of 100 million elements.

Most of the time (over 90%) is spent counting, whereas writing the sorted array takes 10-19X less time. Thus, effort spent optimizing the counting portion of the algorithm is more likely to yield larger gains.
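A measurement of this kind can be sketched as follows. This is an illustrative instrumented 8-bit Counting Sort, not the implementation that was measured for the table; `PhaseTimes` and `timed_counting_sort` are hypothetical names, and the timings it reports depend on the compiler, optimization level and machine.

```cpp
#include <algorithm>
#include <array>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: elapsed time of each phase, in seconds.
struct PhaseTimes {
    double counting;
    double writing;
};

// Sequential 8-bit Counting Sort, sorting `a` in place and timing the
// counting and writing phases separately.
inline PhaseTimes timed_counting_sort(std::vector<std::uint8_t>& a) {
    using clock = std::chrono::steady_clock;
    std::array<std::size_t, 256> count{};  // zero-initialized histogram

    const auto t0 = clock::now();
    for (std::uint8_t v : a) {
        ++count[v];                        // counting phase: one read per element
    }
    const auto t1 = clock::now();

    std::size_t out = 0;
    for (std::size_t value = 0; value < count.size(); ++value) {
        // writing phase: emit `value` exactly count[value] times
        std::fill_n(a.begin() + static_cast<std::ptrdiff_t>(out), count[value],
                    static_cast<std::uint8_t>(value));
        out += count[value];
    }
    const auto t2 = clock::now();

    return { std::chrono::duration<double>(t1 - t0).count(),
             std::chrono::duration<double>(t2 - t1).count() };
}
```

Calling `timed_counting_sort` on a large random array and comparing the two fields of the result shows which phase dominates on a given machine.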

**In general, measuring performance to determine where the majority of time is spent, and then optimizing the slowest portion, is one of the pillars of performance optimization. Then do it again, and again, and again...**

In the case of the Counting Sort of Table 9, doubling the performance of the counting portion would nearly double the overall performance. Doubling the performance of the writing portion, however, would improve the overall performance by only 4.6% (16-bit algorithm) or 2.6% (8-bit algorithm).

Note that the counting portion of the algorithm could be sped up by 10X (for the 16-bit algorithm) or by 19X (for the 8-bit algorithm) before its execution time becomes equal to that of the writing portion. Thus, Parallel Counting Sort will continue scaling with more processing cores (beyond the quad-core explored), until performance of the memory hierarchy (reading or writing) becomes the limiting factor.
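The arithmetic behind these estimates can be checked with a small helper. This is a sketch in the spirit of Amdahl's law, not code from the measured implementation: `overall_speedup` is a hypothetical function, and the ratios of 10 and 19 used with it are the rounded counting-to-writing ratios quoted above, so its results only approximate the percentages derived from the measured times.

```cpp
// Hypothetical helper: overall speedup of a two-phase algorithm when each
// phase is accelerated independently. `count_to_write` is the original
// ratio of counting time to writing time (writing time is normalized to 1).
inline double overall_speedup(double count_to_write,
                              double count_speedup,
                              double write_speedup) {
    const double before = count_to_write + 1.0;
    const double after = count_to_write / count_speedup + 1.0 / write_speedup;
    return before / after;
}
```

With a ratio of 10, `overall_speedup(10.0, 2.0, 1.0)` is about 1.83 and `overall_speedup(10.0, 1.0, 2.0)` is about 1.05. With a ratio of 19, the same two calls give about 1.90 and 1.03. Speeding up counting by the full ratio, as in `overall_speedup(10.0, 10.0, 1.0)`, makes the two phases take equal time and gives 5.5 overall.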

### Conclusion

Sorting algorithms, especially Counting Sort, perform little computation per array element, yet surprisingly benefit from a parallel multi-core implementation. This is due to the substantial inherent parallelism within the algorithms. On an i7 860 quad-core processor, Parallel Counting Sort is over 3X faster than the non-parallel implementation when sorting arrays of unsigned 8-bit numbers, and over 2X faster for arrays of 16-bit numbers. It is also up to 70X faster than STL **sort()** for 8-bit, and up to 77X faster for 16-bit. The Parallel Counting Sort algorithm sorts at the rate of 1.7 billion 8-bit items per second, and 1 billion 16-bit items per second.

Converting non-parallel algorithms into parallel ones does not guarantee a performance gain, and can lead to degraded performance along with inefficient use of processing capacity. Parallel implementations enlarge the exploration space by increasing the number of design and test permutations, not only for correctness but also for performance, as some parallel implementations will be slower and others may possess data-dependent performance characteristics (which should be avoided).

Measuring to determine performance bottlenecks before optimizing is one of the pillars of performance optimization methodology, as demonstrated by measuring the counting and writing portions of the Counting Sort. Focusing efforts on optimizing the counting portion of the algorithm, which took over 90% of execution time, is 10-19X more beneficial than optimizing the writing portion. The Parallel Counting Sort was projected to continue scaling in performance beyond quad-core processors, since it is surprisingly compute limited.

Processor cache architecture influences parallel performance, with large L1 and L2 caches dedicated to each computational core providing higher performance and scalability for large problems. Higher memory bandwidth is also beneficial. The i7 860 processor has 2X the cores of the E8300, which allowed the algorithm to scale higher in performance. Tuning parallel infrastructure parameters, such as the grain size of processing quanta, for each algorithm helps improve performance for small array sizes, where parallelism overhead degrades performance and wastes processing resources.

The abstraction offered by the **parallel_for** and **parallel_reduce** constructs is powerful, freeing the developer from having to care about the number of computational cores in the system. This power is dangerous, as it brings with productivity the possibility of poor resource utilization and efficiency. Creating efficient parallel programs where multiple programs share computational resources in a dynamic virtualized environment will be critical. Creating algorithms that scale well on future processor architectures will be a challenge; for example, tuning parameters such as grain size will most likely be set to a less than optimal value and may need to be exposed.