The Long and Short of Parallelism
Ten years ago, multi-core processors were just on the horizon. Today they are mainstream, and they have become the impetus for a revolution in computer programming to make the best use of two, four, six, or eight cores. Along with multi-core processors, we have seen the rise of manycore processors (having 32 or more cores) in the consumer market. GPUs from NVIDIA and AMD, plus the forthcoming MIC line from Intel, all fit this category.
So, what can we do with all those cores? Graphics and visual computing are the obvious first answers. Scientific problems that model some large domain or employ hundreds of thousands of data points, all processed in the same way, can decompose the data set across a large collection of cores. Many of these applications are well outside the experience of the average computer user, though. I like to think of my 70-plus-year-old mother as an average user. The last time I checked, she reads her email, surfs the web, views pictures from friends, and plays some solitaire games. No heavy-duty graphics processing, no modeling of fluid flow around the latest submarine hull design, and no hurricane path forecasting that I'm aware of.
There is obviously enough of a market to drive the development of manycore products, but what if we could use them more frequently in less specialized applications? Doing so will require a reexamination of how parallel algorithms are constructed. That is the topic of this post (and the next two or three afterwards).
I got the ideas and the example I'll be using from the keynote address Guy Blelloch gave at the 2009 Principles and Practice of Parallel Programming (PPoPP) conference, entitled "Parallel Thinking." Since I only have access to the slides, I have probably missed the main message Dr. Blelloch was trying to put forth, but the slides presented an interesting approach to parallel algorithms that I wanted to play with, and that's what I'm going to look at here. Before we get to it, let me define a couple of measures of (parallel) algorithms: work and span.
Work, in terms of computation, is the total amount of time required to get all processing done. Think of it as the serial execution time, or the serial complexity of the algorithm if you want something a tad more abstract. Even if you're running things in parallel, you sum up the total time spent in each of the threads, or the time spent on each core that was executing part of the application. This metric is directly affected by the overheads added to make the application run in parallel (dividing the work, spawning threads, synchronizing access to shared data). Keeping the overhead time to a minimum should be a goal for any parallel code you write, but you already knew that.
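To make that concrete, here is a toy sketch in Python. The task costs and the per-task overhead figure are invented for illustration; the point is only that work is summed across all workers, so parallelizing can increase work even while it shortens wall-clock time:

```python
# Toy model of "work": total time summed across all workers,
# NOT wall-clock time. All costs below are made-up numbers.
serial_task_costs = [4, 7, 2, 5]        # cost of each chunk of computation
work_serial = sum(serial_task_costs)    # serial work: 18

# Running the same chunks in parallel adds per-task overhead
# (spawning threads, synchronizing shared access) to the work.
overhead_per_task = 1
work_parallel = sum(c + overhead_per_task for c in serial_task_costs)

print(work_serial, work_parallel)  # 18 22
```

Even though two workers might finish these chunks in roughly half the wall-clock time, the work, the metric we are tracking, went up from 18 to 22 because of the overhead.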
To define span, we need to imagine a computation laid out as a directed acyclic graph (DAG). Execution proceeds along the directed edges of the graph from node to node, executing the task represented by each node, until complete. A serial code has a single path through the graph. Parallelism appears as multiple exit arcs from a node (splitting execution into two or more task paths), which are "joined" once the independent computations finish. With the DAG in mind, the span of an algorithm is its critical path length: the longest (in execution time) path running from start to finish. The shorter the span, the shorter the execution time of the algorithm (assuming your execution platform has enough resources to cover all parallel computations within the graph). You should always balance the load of independent tasks to achieve the shortest span. If one task runs much longer than the others, it sits on the critical path and stretches the span; if all tasks take the same amount of time, the computation finishes in the shortest possible time.
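A minimal sketch of both measures on a hypothetical fork-join DAG (the node names and costs are invented for illustration): work is the sum of every node's cost, while span is the cost of the most expensive path from start to finish.

```python
# Toy DAG: node -> (cost, list of successor nodes).
# Models a fork at "split" into two independent tasks, joined at "join".
dag = {
    "start": (1, ["split"]),
    "split": (0, ["taskA", "taskB"]),
    "taskA": (5, ["join"]),
    "taskB": (3, ["join"]),
    "join":  (0, ["end"]),
    "end":   (1, []),
}

def work(dag):
    """Work = total cost of every node (the serial execution time)."""
    return sum(cost for cost, _ in dag.values())

def span(dag, node="start"):
    """Span = cost of the most expensive path from node to a sink
    (the critical path length). Plain recursion is fine for a toy DAG."""
    cost, succs = dag[node]
    return cost + (max(span(dag, s) for s in succs) if succs else 0)

print(work(dag))  # 10  (1 + 0 + 5 + 3 + 0 + 1)
print(span(dag))  # 7   (start -> split -> taskA -> join -> end)
```

Here taskB (cost 3) sits off the critical path: shaving time from taskA would shorten the span, while speeding up taskB would not, which is exactly the load-balancing point above.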