"Task Parallelism Does Not Exist"
I went to the Intel reception during the 2011 IDF event. The DJ was spinning dance or house or techno music to the eager crowd. It was rumored that will.i.am was in the house and might make an appearance soon. I was standing outside admiring the San Francisco skyline at night with a drink in hand when a noted parallel programming expert and evangelist dropped the little gem I've taken for the title of this post. Luckily I wasn't taking a sip of my bourbon at that time or I might have done a spit-take all over him. Instead, I gave him my most significant look of incredulity.
White PapersMore >>
- Step Up Your Game in Loan Operations in 2014
- IT and LOB Win When Your Business Adopts Flexible Social Cloud Collaboration Tools
My mind quickly took up the opposite position to the statement. I knew I had implemented many different task parallel algorithms (e.g., my iterative Quicksort codes that have been featured in this forum). These had been effectively parallelized and achieved some speedup over the serial equivalent.
Having achieved the desired consternation in my mind and before I could voice any counter-examples to his controversial position statement, the expert explained. If you want to have any chance to have a scalable parallel algorithm — something suitable for many-core processors — you wouldn't implement a task parallel solution. What you need would be to devise data parallel code. As the size of the workload increases, you would be assured that more and more cores would also be able to be utilized effectively.
I had to admit it was a fair assessment; pretty obvious once you stop to think about it. There really is no other way to program the hundreds of cores becoming more and more prevalent in GPGPUs and other coprocessor and accelerator add-ons. Few truly task parallel solutions will contain enough work to use more than a handful of threads. Plus, the SIMD execution model was designed with data parallel and vector computations in mind.
Going over in my mind the types of parallel computations that I have been exposed to over the years, I realized that most of the solutions were data parallel. The whole panorama of matrix-based and grid-based scientific algorithms has the most obvious examples. Whether it was computing the probable path of a hurricane, the flow of air over a helicopter rotor, or the temperature changes wrought by ocean currents in the Strait of Juan de Fuca, dividing up large data sets into smaller parts and executing the same computations on each part was the standard methodology.
Even with all of this, my expert friend finally did admit that task parallelism was still a possible solution. His statement had been meant to elicit shock and awe and to persuade programmers to first think about a data parallel solution. If you only have four or eight cores available, a task parallel solution might be sufficient and could fully utilize the resources of the platform.
Since that September night, as I've reflected more and more on this conversation, I thank my lucky stars that I wasn't in the middle of taking a swig of my drink when I first heard that controversial statement. Being in a city that I wasn't all that familiar with, I'm not sure if I could have found a reputable dry cleaner to get alcohol stains out of clothes. I've also come to realize that my favorite set of algorithms, sorting, are going to be task parallel in their implementation. While the number of tasks may (eventually) be relatively large, that number is more dependent on the order of elements to be sorted than it is on the size of the data.
Consider a parallel Quicksort algorithm. Whether you implement an iterative or recursive (creating tasks for recursive calls), there will be a marked difference in the number tasks and amount of computation per task between data sets whose keys are randomly distributed versus a data set where the keys begin in nearly sorted order. In the first case you would expect the partitioning operation to divide the subarrays into roughly equal-sized sets while the second data set will more likely generate empty partitions on one side of the pivot element. In either case, it is easy to see that the amount of work for tasks generated from the partition step can vary widely, which is not a situation that is handled well by SIMD or data parallel computations.
So, what was the point of this post? Like the expert, I wanted to reinforce the idea that data parallel solutions are going to be more desired and scalable. Even though you might initially come up with a task parallel algorithm, you should take some time to see if a reworking of the algorithm or the data representation can better facilitate a more data-parallel version. For one example of this idea, look back to the prefix scan version of the
Partition operation within Quicksort that utilized a data parallel computation within parallel sorting tasks.