C/C++

Understanding Parallel Performance

By Herb Sutter, October 31, 2008

Understanding parallel performance. How do you know when good is good enough?

The Double-Edged Sword of Fine-Grainedness

Even in Example 4, the code only scales to at most three cores. Can we do better? Yes, if we can make the work more fine-grained, slicing the work to be done more finely into smaller chunks. One approach in Example 4 would be to apply similar techniques within CalcWholesale, CalcRetail, and TotalReturns to further decompose their work into concurrent subtasks. Even better is when we can exploit natural parallelism in algorithms (e.g., divide-and-conquer algorithms like quicksort) and data structures (e.g., trees and graphs) to subdivide our work in a way that scales with the amount of data.

But now we encounter a fundamental tension between scalability and overhead: The smaller the chunks of work, the more chunks we have and the more easily we can distribute them to utilize larger number of cores to get an answer faster—but the more the overhead per chunk starts to dominate in our performance costs.

Let's consider quicksort as a common example. Can you spot the performance flaws in this code?


// Example 5 (flawed, today): Naïve parallel quicksort,
// sorts left and right subranges in parallel
//
void ParallelQuicksort( Iterator first, Iterator last ) {
  Iterator pivot = Partition( first, last );
  future f1 = pool.run( [&]{ ParallelQuicksort( first, pivot ); } );
  future f2 = pool.run( [&]{ ParallelQuicksort( pivot, last ); } );
  wait_all( f1, f2 );
}

We can improve this in at least two ways. First, as noted earlier, we should take the tail chunk of work and do it ourselves to mitigate the overhead of shipping work to a pool thread in today's environments. Second, we can notice that most of the spun-off work will be at the leaves of the computation containing subranges of a few elements or even just one. Even in sequential code, it's typical to switch to something like bubble sort when the subrange's size falls below a threshold; similarly in parallel code, for small ranges we should switch to a sequential sort. Example 6 incorporates both of these changes:


// Example 6: An improved parallel quicksort, still
// sorts subranges in parallel but more efficiently
//
void ParallelQuicksort( Iterator first, Iterator last ) {
  if( distance(first,last) <= threshold ) { 
    SequentialSort( first, last );	
  } else {
   Iterator pivot = Partition( first, last );
    future f1 = pool.run( [&]{ ParallelQuicksort( first, pivot ); } );
    ParallelQuicksort( pivot, last );
    f1.wait();
  }
 }

In general, we want to slice the work as finely as possible, but not to the point where the work is comparable in size to the overhead.

Forward-Looking Note: Work Stealing

Future runtime systems will significantly drive down all of these costs, including the overhead per chunk and the cost of unrealized concurrency, to the point where we will often be able to ignore it and blithely write code like Example 5 without worrying about its performance most of the time.

The basic idea is to use work stealing whereby default "potentially asynchronous work" is actually not shipped elsewhere, but rather queued up to be executed on the original thread. Only if another core runs out of work and sees that our thread has waiting queued work to be performed, will the work be "stolen" and efficiently shipped to the other thread. The idea is to drive down the cost of unrealized concurrency by only actually incurring the overhead of running on another core if it's worth doing at that particular instant—on this hardware and with this amount of other work currently occupying the machine; and the very next execution of the very same function on the very same hardware might not steal, if all the cores are already busy. Sample current and upcoming technologies that feature work stealing runtimes include: Intel Threading Building Blocks; Microsoft's Parallel Patterns Library, Task Parallel Library, and PLINQ; Java 7's Fork/Join framework; and the granddaddy of them all, Cilk, which popularized the technique among implementers.

On Deck

To understand your code's scalability, first identify the key variables that affect the workload, and then measure throughput for workloads with different combinations of those variables. Look for scalability, and how it hits the contention and oversubscription barriers. Prefer thread pools (today) and work stealing (tomorrow).

Next month, we're going to apply the tools we discussed this month to analyze the performance impact of specific optimizations on concrete code. Fasten your seat belts.

Herb is a bestselling author and consultant on software development topics, and a software architect at Microsoft. He can be contacted at www.gotw.ca.

Previous 1 2 3

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

C/C++

Understanding Parallel Performance