### Inherent Parallelism

The Counting Sort algorithm has substantial parallelism inherently within several areas. Initialization of the count array (256 elements for 8-bit and 64K elements for 16-bit) could be performed entirely in parallel, if this was possible. This operation is similar to a reset signal assertion in hardware, where all flip-flops are cleared simultaneously. Thus, for an 8-bit algorithm up to 256 initializations can be performed in parallel, whereas for 16-bit algorithm up to 64K initializations can be performed in parallel.

The counting portion of the algorithm has substantial inherent parallelism, where the input array can be split into many small portions, which are counted independently of others. These counts can then be combined. Of course, the overhead of splitting and joining limits the performance gains.

The writing portion of the algorithm also has substantial inherent parallelism. In the limit each output array element could be written in parallel, independent of all other array elements. This would require independent/parallel access to the **count[]** array and to the output array, which is currently not practical.

### Conclusion

The first venture into parallel programming through the implementation of the Parallel Counting Sort algorithm proved to be rewarding. Leveraging the efforts of Intel's TBB team, the move into parallel computing is significantly simplified and accelerated, along with impressive performance results. However, using TBB requires C++ expertise along with many examples and thorough documentation to demonstrate and explain concepts and usage. Intel TBB accomplishes both of these things well, along with implementing a variety of core parallel computing patterns. By using templates, TBB fits easily and seamlessly into the C++ development environments on many platforms.

The speedups obtained for arrays of 8-bit unsigned numbers were about 3X, and over 2X for arrays of unsigned 16-bit numbers, using a quad-core processor. This demonstrated that Counting Sort, while performing very little work per array element, has inherent parallelism which can be harnessed through parallel processing. Only a portion of the overall algorithm was refactored to a parallel implementation, the initialization and the counting portions, while writing of the results into the array remained non-parallel for now.

16-bit Counting Sort showed performance issues at small and mid-size arrays, which will be investigated and improved in future articles. This article is just the beginning of the journey into parallel computing. Intel TBB provides many other mechanisms and patterns necessary to support high-performance parallel computation, which will be explored in future articles. Other tools briefly described provide similar functionality with the goal of enabling software migration to take advantage of expanding hardware parallel processing.