There are two main ways to achieve superlinear scalability, or to use P processors to compute an answer more than P times faster (see Figure 1):
- Do disproportionately less work.
- Harness disproportionately more resources.
Last month , we focused on the first point by illustrating parallel search and how it naturally achieves superlinear speedups when matches are not distributed evenly because some workers get "rich" subranges and will find a match faster, which benefits the whole search because we can stop as soon as any worker finds a match.
This month, we'll conclude examining the first point with a few more examples, and then consider how to achieve superlinear speedups by harnessing more resourcesquite literally, running on a bigger machine without any change in the hardware.
What Have We Learned?
As we saw last month, a parallel search often doesn't need to do anything special to exploit a nonuniform distribution of matches to find a match that much faster; a simple search algorithm will do, such as a linear search for arrays or a depth-first search for trees. Parallel search naturally exploits nonuniform distributions of matches:
- Without knowing where the nonuniformity is. It doesn't need any prior locality information, such as location hints for high-probability areas to try first.
- Without knowing whether there is any nonuniformity at all. Most interestingly, parallel search naturally benefits from nonuniformity without any extra knowledge at all, not even whether there is any! In this respect, we might say that parallel search is deliciously opportunistic.
Sequential search, on the other hand, cannot exploit nonuniformity at all without advance information about where the high-probability region are. Having the sequential algorithm visit nodes in a different or randomized order doesn't help by itself without additional information about where to start . Even armed with a map of the high-probability match areas, the sequential algorithm would need to be made more complex to exploit the nonuniformity.
Finally, getting superlinear parallel algorithm performance typically relies on the ability to interrupt or cancel work. There are two aspects to this, one must-have and one good-to-have:
- Immediate return (necessary). When a worker finds a match, we must be able to return the result to the caller, and let the caller complete immediately without waiting for other workers to complete normally. Otherwise, parallel performance will still scale, but the scaling will usually be sublinear; the time to complete is the maximum, not the minimum, of all workers, and in our search example, we'd typically have to wait for some unsuccessful workers that take N/P time, which puts the overall search's performance in the same category.
- Stop abandoned work (highly desirable). When one worker finds a match, we want to be able to efficiently stop the other workers in order to reclaim their resources, including access to any data they may have locked and the OS and hardware threads they are running on, so that we can use those resources for other work. This is sometimes called "interruption" or "cancellation" and will be the subject of next month's column.
Other Sources of Superlinearity
Another way to achieve superlinear scalability is to exploit partial information. In some parallel algorithms, one worker can discover partial information that it can communicate to other workers to let those others do their work faster than they could do otherwise. For example, in a tree search, one worker might discover that "there are never matches under a node whose weight is over 100." Or, in computing solutions to a set of equations, one worker might discover that in any solution the value of variable x has to be between y-1 and y+4. Other workers can then use such partial information to run faster by excluding entire sections of their search spacesubtrees under heavy nodes, or regions, where x and y are too far apartwithout having to traverse those areas at all to inspect their contents for possible matches.
Still another potential path to superlinearity is to perform speculative execution. Let's say that you have two alternative algorithms available to compute a result, such as two search algorithms where one has better average-case performance and the other has better worst-case performance. Now, most of the time you're better off just picking one of the algorithms and using available hardware to run it in parallel. But sometimes you can do better by running both of them, and seeing which gets the answer first. The cost is that you're doing more work, but you can get the answer faster (including avoiding the bad worst case of one algorithm). Sometimes the extra work can be relatively "free," at least from your selfish point of view; for example, when the two operations are two database or web service queries against different servers, and one server might get the answer in a different way than the other because the servers may be experiencing different loads or may use different methods to compute the answer.
The bad news about all of the effects we've covered so far is that none of them are going to help you go superlinear with algorithms that have to visit every element in a range anyway, such as when you're calculating the sum of all nodes. The good news, however, is that there is another effect that can help: It's when using more threads lets you use a bigger machine, literally.