Another way to save power is to "race to sleep." In this strategy, we try to get the computation done as fast as possible (with the lowest latency) so that all the cores can be put in a sleep state that draws very little power. This approach is attractive if a significant fraction of the wakeful power is fixed regardless of how many cores are running.
Especially in mobile devices, parallelism can be used to reduce latency. This reduces the time the device, including its display and other components, is powered up. This not only improves the user experience but also reduces the overall power consumption for performing a user's task: a win-win.
…the effort expended on achieving high parallel processing rates is wasted unless it is accompanied by achievements in sequential processing rates of very nearly the same magnitude.
Amdahl argued that the execution time T1 of a program falls into two categories:
- Time spent doing non-parallelizable serial work
- Time spent doing parallelizable work
Call these Wser and Wpar, respectively. Given P workers available to do the parallelizable work, the times for sequential execution and parallel execution are:
The bound on TP assumes no superlinear speedup, and is an exact equality only if the parallelizable work can be perfectly parallelized. Plugging these relations into the definition of speedup yields Amdahl's Law:
Figure 1 visualizes this bound. Amdahl's Law has an important corollary. Let f be the non-parallelizable serial fraction of the total work. Then the following equalities hold:
Substitute these into Equation 7 and simplify to get:
Figure 1: In Amdahl's Law, speedup is limited by the non-parallelizable serial portion of the work.
Now consider what happens when P tends to infinity:
Speedup is limited by the fraction of the work that is not parallelizable, even using an infinite number of processors. If 10% of the application cannot be parallelized, then the maximum speedup is 10X. If 1% of the application cannot be parallelized, then the maximum speedup is 100X. In practice, an infinite number of processors is not available. With fewer processors, the speedup may be reduced, which gives an upper bound on the speedup. Amdahl's Law is graphed in Figure 2, which shows the bound for various values of f and P. For example, observe that even with f = 0.001 (that is, only 0.1% of the application is serial) and P = 2048, a program's speedup is limited to 672X. This limitation on speedup can also be viewed as inefficient use of parallel hardware resources for large serial fractions, as shown in Figure 3.
…speedup should be measured by scaling the problem to the number of processors, not by fixing the problem size.
Amdahl's Law views programs as fixed and the computer as changeable, but experience indicates that as computers get new capabilities, applications change to exploit these features. Most of today's applications would not run on computers from 10 years ago, and many would run poorly on machines that are just 5 years old. This observation is not limited to obvious applications such as games; it applies also to office applications, web browsers, photography software, DVD production and editing software, and Google Earth.
More than two decades after the appearance of Amdahl's Law, John Gustafson noted that several programs at Sandia National Labs were speeding up by over 1000X. Clearly, Amdahl's Law could be evaded.
Figure 2: The scalability of parallelization is limited by the non-parallelizable (serial) portion of the workload. The serial fraction is the percentage of code that is not parallelized.
Figure 3: Even when speedups are possible, the efficiency can easily become poor; the serial fraction is the percentage of code that is not parallelized.
Gustafson noted that problem sizes grow as computers become more powerful. As the problem size grows, the work required for the parallel part of the problem frequently grows much faster than the serial part. If this is true for a given application, then as the problem size grows the serial fraction decreases and speedup improves.
Figure 4 visualizes this using the assumption that the serial portion is constant while the parallel portion grows linearly with the problem size. On the left is the application running with one worker. As workers are added, the application solves bigger problems in the same time, not the same problem in less time. The serial portion still takes the same amount of time to perform, but diminishes as a fraction of the whole. Once the serial portion becomes insignificant, speedup grows practically at the same rate as the number of processors, thus achieving linear speedup.
Figure 4: Gustafson-Barsis' Law; if the problem size increases with P while the serial portion grows slowly or remains fixed, speedup grows as workers are added.
Both Amdahl's and Gustafson-Barsis' Laws are correct. It is a matter of "glass half empty" or "glass half full." The difference lies in whether you want to make a program run faster with the same workload or run in the same time with a larger workload. History clearly favors programs attacking and solving larger, more complex problems, so Gustafson's observations fit the historical trend. Nevertheless, Amdahl's Law still haunts you when you need to make an application run faster on the same workload to meet some latency target.
Gustafson-Barsis' observation is not a license for carelessness. For the law to hold, it is critical to ensure that serial work grows much more slowly than parallel work, and that synchronization and other forms of overhead are scalable.
This article introduced some of the key definitions and theories related to increasing computing performance through parallelization. Parallelization can reduce latency (the total time needed to compute a single result), increase throughput (the rate at which a series of results can be computed), and reduce the power consumption of a computation. When optimizing performance, an improvement in one factor (such as throughput) may lead to the worsening in another factor (such as latency). The goal is to achieve maximum efficiency: a calculation should run at a rate corresponding to the number of processors involved (e.g., an algorithm should run twice as fast on two processors). However, maximum efficiency (also known as linear speedup) is rare, since there are extra considerations involved in distributing work to processors and coordinating the processes. Two theories guide optimization via parallelization: Amdahl's Law (which posits that speedup is limited by the non-parallelizable serial portion of the work) and Gustafson-Barsis' Law (which posits that adding processors allows an application to solve bigger problems in the same time).
This article is adapted from material found in the book Structured Parallel Programming: Patterns for Efficient Computation by Michael McCool, Arch D. Robison, and James Reinders published by Morgan Kaufmann, an imprint of Elsevier. Portions are Copyright 2012 Elsevier Inc.
Michael McCool is a Software Architect with Intel working on Array Building Blocks and an Adjunct Associate Professor with the University of Waterloo. Arch Robison is the architect of Threading Building Blocks. He was the lead developer for KAI C++, which established the state of the art for C++ optimization in the '90s. James Reinders is a senior engineer who joined Intel Corporation in 1989 and has contributed to projects including the world's first TeraFLOP supercomputer (ASCI Red). He has written several times for Dr. Dobb's.