The Long and Short of Parallelism
In terms of execution time, I can define T(P) as the execution time of the algorithm on P cores. Work is then T(1), the execution time of the (parallel) algorithm on a single core. Span is T(∞), the execution time of the algorithm when it has access to as many threads/cores as it can use. If this characterization of span reminds you of the PRAM (Parallel Random Access Machine) model of computation, with an unlimited number of processors available to the algorithm developer, then I'm right there with you. I hope you're beginning to see a glimmer of how this is related to manycore processing, which I will now illustrate with an example.
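To make the definitions concrete, here is a minimal sketch (in Python, my choice for illustration) that computes work and span for a familiar case: summing n numbers with a binary reduction tree. The work is the total number of additions, and the span is the depth of the tree. The function names and the "one addition per time step" cost model are my own assumptions, not anything from a particular library; the bound work/P + span is the standard greedy-scheduling (Brent) bound that follows from these definitions.

```python
import math

def work_and_span_of_tree_sum(n):
    """Work and span, counted in additions, of summing n numbers
    with a binary reduction tree (hypothetical cost model: one
    addition per time step)."""
    work = n - 1                     # T(1): every pairwise addition must happen
    span = math.ceil(math.log2(n))   # T(infinity): depth of the reduction tree
    return work, span

def greedy_bound(work, span, p):
    """Brent's bound for a greedy scheduler: T(P) <= work/P + span."""
    return work / p + span

w, s = work_and_span_of_tree_sum(1024)
print(w, s)                     # 1023 additions of work, span of 10
print(greedy_bound(w, s, 64))   # at most ~26 time steps on 64 cores
```

Notice that once P is large, the span term dominates: piling on more cores cannot push the time below T(∞), which is why shortening the span is the whole game.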
Imagine you are a lumberjack (and you're OK) in the wilds of British Columbia, Canada. You have been assigned the task of putting boot prints along the 10-meter length of a tree log that had fallen silently in the woods. How would you do this? The obvious way would be to start at one end of the log and walk down the 10-meter length in your special marking boots. Actually, if it's just you alone, this is about the only way I can think of. It's a serial solution, since there is only you to do the work.
What if you had to mark 10 such logs, but also had nine lumberjack friends with comparable footwear? I hope you are thinking that the obvious solution is to assign one person to each log and have them walk down the assigned tree in parallel. The work for this longer task is marking 100 meters of wood and the span is 10 meters, or however long it takes for the slowest walker to reach the end of his assigned log.
What if you brought all 100 lumberjacks from camp and had to traverse the same 10 logs? You might still assign only the first 10 workers to mark up the fallen trees while the others eat their lunch or press wild flowers or have buttered scones with their tea. That seems like a waste of boots, though. Why not turn your perspective on the fallen logs around 90 degrees? Rather than approaching along the 10-meter-long axis, consider the problem of marking the logs by walking across the width of the trunk. Now, each of your 100 lumberjacks would only have to mark 1 meter of an assigned tree. If you had 500 lumberjacks, each would only need to place a boot imprint on 20 centimeters, about the width of two boots.
The work of log marking along the width of a tree is still 100 meters, but the span, with 500 assistants working together, is practically constant, since each worker needs simply to place both feet on a log. It may take some deft Radio City Rockette choreography to get 50 burly lumberjacks to step onto a 10-meter log all at the same time, but the whole marking task is accomplished within the time it takes to jump up onto a log and jump off.
To tie together lumberjacks jumping on logs and the utilization of manycore processors, I will tell you what I'm sure you already guessed: The 500 lumberjacks are a metaphor for a collection of many cores. One of the points I got from Blelloch's keynote presentation, and one I hope to illustrate and demonstrate in my next few posts, is that we need to reexamine the design of parallel algorithms to find methods that could take advantage of hundreds of cores. We can divide up a data set into two or four or eight independent parts and assign those parts to separate cores that then execute a serial algorithm on the given data items. Or, if we can adjust our perspective on these "long" serial computations, we might be able to make "short" work of them by harnessing alternative algorithms using a larger number of cores. One of the key algorithmic building blocks that will help with such a change in perspective is the prefix scan operation. I'll review this in my next post in preparation for demonstrating a way to make shorter work of the Quicksort computation.
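The lumberjack arithmetic can be written down in a few lines. This is just a sketch of the numbers from the story above (assuming every worker marks wood at the same rate, one meter per unit time): the work of 100 meters never changes, but the span, the distance the slowest worker must cover, shrinks as the task is cut into finer pieces.

```python
TOTAL_METERS = 10 * 10   # 10 logs, 10 meters each: the work is fixed

def span(workers):
    """Meters each worker must cover when the total is split evenly
    (assumed model: identical workers, perfectly even split)."""
    return TOTAL_METERS / workers

print(span(10))    # lengthwise: 10 workers, one log each -> span of 10.0 meters
print(span(100))   # crosswise: 100 workers -> span of 1.0 meter
print(span(500))   # 500 workers -> span of 0.2 meters, about two boot-widths
```

The same shift in perspective, from "walk the length" to "step across the width", is what reexamined parallel algorithms must find in their data to put hundreds of cores to use.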