

Amdahl's Law vs. Gustafson-Barsis' Law

The advantage of parallel programming over serial computing is increased computing performance. Parallel performance improvements can be achieved by way of reducing latency, increasing throughput, and reducing CPU power consumption. Because these three factors are often interrelated, a developer must balance all three to ensure that the efficiency of the whole is maximized. When optimizing performance, the measurement known as "speedup" enables a developer to track changes in the latency of a specific computational problem as the number of processors is increased. The goal for optimizing may be to make a program run faster with the same workload (reflected in Amdahl's Law) or to run a program in the same time with a larger workload (Gustafson-Barsis' Law). This article explores the basic concepts of performance theory in parallel programming and how these elements can guide software optimization.

Latency and Throughput

The time it takes to complete a task is called "latency." It has units of time. The scale can be anywhere from nanoseconds to days. Lower latency is better.

The rate at which a series of tasks can be completed is called "throughput." This has units of work per unit time. Larger throughput is better. A related term is "bandwidth," which refers to throughput rates that have a frequency-domain interpretation, particularly when referring to memory or communication transactions.

Some optimizations that improve throughput may increase the latency. For example, processing of a series of tasks can be parallelized by pipelining, which overlaps different stages of processing. However, pipelining adds overhead since the stages must now synchronize and communicate, so the time it takes to get one complete task through the whole pipeline may be longer than with a simple serial implementation.
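
The trade-off between throughput and latency can be made concrete with a toy timing model. The sketch below is illustrative only: the stage times and the per-stage synchronization overhead are made-up numbers, the function names are hypothetical, and the pipeline is idealized (one worker per stage, no waiting between stages, the slowest stage sets the pace).

```python
def serial_metrics(stage_times):
    """Latency and throughput when a single worker runs every stage in order."""
    latency = sum(stage_times)
    throughput = 1.0 / latency  # tasks completed per unit time
    return latency, throughput


def pipelined_metrics(stage_times, sync_overhead):
    """Idealized pipeline with one worker per stage.

    Each stage pays an assumed fixed synchronization/communication cost.
    A task passes through every padded stage, so its latency grows, while
    in steady state one task completes per slowest-stage interval.
    """
    padded = [t + sync_overhead for t in stage_times]
    latency = sum(padded)
    throughput = 1.0 / max(padded)
    return latency, throughput


# Hypothetical stage times in milliseconds.
stages = [2.0, 3.0, 1.0]
serial_latency, serial_throughput = serial_metrics(stages)
pipe_latency, pipe_throughput = pipelined_metrics(stages, sync_overhead=0.5)

print(f"serial:    latency={serial_latency} ms, throughput={serial_throughput:.3f} tasks/ms")
print(f"pipelined: latency={pipe_latency} ms, throughput={pipe_throughput:.3f} tasks/ms")
```

With these assumed numbers, the pipelined version completes tasks at a higher rate even though each individual task takes longer to get through.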

Related to latency is response time. This measure is often used in transaction-processing systems, such as web servers, where many transactions from different sources need to be processed. To maintain a given quality of service, each transaction should be processed in a given amount of time. However, some latency may be sacrificed even in this case in order to improve throughput. In particular, tasks may be queued up, and time spent waiting in the queue increases each task's latency. However, queuing tasks improves the overall utilization of the computing resources and so improves throughput and reduces costs.

"Extra" parallelism can also be used for latency hiding. Latency hiding does not actually reduce latency; instead, it improves utilization and throughput by quickly switching to another task whenever one task needs to wait for a high-latency activity.

Speedup, Efficiency, and Scalability

Two important metrics related to performance and parallelism are speedup and efficiency. Speedup (Equation 1) compares the latency for solving the identical computational problem on one hardware unit ("worker") versus on P hardware units:

Equation 1.

$$ S_P = \frac{T_1}{T_P} $$

where T1 is the latency of the program with one worker and TP is the latency on P workers.

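Efficiency is speedup divided by the number of workers:

Equation 2.

$$ \text{efficiency} = \frac{S_P}{P} = \frac{T_1}{P\,T_P} $$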

Efficiency measures return on hardware investment. Ideal efficiency is 1 (often reported as 100%), which corresponds to a linear speedup, but many factors can reduce efficiency below this ideal.
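
As a small worked example, speedup and efficiency can be computed directly from two latency measurements, taking efficiency as speedup divided by the number of workers. The timings below are hypothetical, chosen only to make the arithmetic easy to follow, and the function names are illustrative.

```python
def speedup(t1, tp):
    """Speedup: latency with one worker divided by latency with P workers."""
    return t1 / tp


def efficiency(t1, tp, p):
    """Efficiency: speedup divided by the number of workers P."""
    return speedup(t1, tp) / p


# Hypothetical measurements in seconds for the same computational problem.
t1 = 120.0   # one worker
tp = 20.0    # eight workers
p = 8

print(speedup(t1, tp))        # 6.0
print(efficiency(t1, tp, p))  # 0.75
```

Here eight workers give a speedup of 6, so only 75% of the added hardware is translated into reduced latency.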

If T1 is the latency of the parallel program running with a single worker, then Equation 1 is sometimes called "relative speedup" because it shows relative improvement from using P workers. This uses a serialization of the parallel algorithm as the baseline. However, sometimes there is a better serial algorithm that does not parallelize well. If so, it is fairer to use that algorithm for T1, and report absolute speedup, as long as both algorithms are solving an identical computational problem. Otherwise, using an unnecessarily poor baseline artificially inflates speedup and efficiency.

In some cases, it is also fair to use algorithms that produce numerically different answers, as long as they solve the same problem according to the problem definition. In particular, reordering floating-point computations is sometimes unavoidable. Since floating-point operations are not truly associative, reordering can lead to differences in output, and the differences can be radical if a floating-point comparison leads to a divergence in control flow. Whether the serial or parallel result is actually more accurate depends on the circumstances.
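
The effect of reordering is easy to reproduce. The sketch below, which assumes IEEE 754 double precision (what Python's `float` uses on common platforms), first regroups a three-term sum and then compares a left-to-right sum with a two-chunk sum of the kind a parallel reduction might perform. The data values and the helper name `chunked_sum` are chosen for illustration only.

```python
# Regrouping the same three terms changes the rounded result.
left_to_right = (0.1 + 0.2) + 0.3
regrouped = 0.1 + (0.2 + 0.3)
print(left_to_right == regrouped)  # False


def chunked_sum(values, chunks):
    """Sum each contiguous chunk separately, then add the partial sums.

    This mimics the grouping of a simple parallel reduction, but runs serially.
    """
    size = (len(values) + chunks - 1) // chunks
    partials = [sum(values[i:i + size]) for i in range(0, len(values), size)]
    return sum(partials)


# Illustrative data mixing very large and small magnitudes.
data = [1e16, 1.0, -1e16, 1.0]
serial_result = sum(data)
chunked_result = chunked_sum(data, chunks=2)
print(serial_result, chunked_result)
```

Neither grouping of `data` yields the exact mathematical sum of 2, which illustrates why neither the serial nor the parallel answer can be assumed to be the more accurate one.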

Speedup, not efficiency, is what you see in advertisements for parallel computers, because speedups can be large impressive numbers. Efficiencies, except in unusual circumstances, do not exceed 100% and often sound depressingly low. A speedup of 100 sounds better than an efficiency of 10%, even if both are for the same program and same machine with 1000 cores.

An algorithm that runs P times faster on P processors is said to exhibit linear speedup. Linear speedup is rare in practice, since there is extra work involved in distributing work to processors and coordinating them. In addition, an optimal serial algorithm may be able to do less work overall than an optimal parallel algorithm for certain problems, so the achievable speedup may be sublinear in P, even on theoretical ideal machines. Linear speedup is usually considered optimal since we can serialize the parallel algorithm, as noted above, and run it on a serial machine with a linear slowdown as a worst-case baseline.

However, as exceptions that prove the rule, an occasional program will exhibit superlinear speedup — an efficiency greater than 100%. Some common causes of superlinear speedup include:

  • Restructuring a program for parallel execution can cause it to use cache memory better, even when run with a single worker! But if T1 from the old program is still used for the speedup calculation, the speedup can appear to be superlinear. See Section 10.5 for an example of restructuring that often reduces T1 significantly.
  • The program's performance is strongly dependent on having a sufficient amount of cache memory, and no single worker has access to that amount. If multiple workers bring that amount to bear, because they do not all share the same cache, absolute speedup really can be superlinear.
  • The parallel algorithm may be more efficient than the equivalent serial algorithm, since it may be able to avoid work that its serialization would be forced to do. For example, in search tree problems, searching multiple branches in parallel sometimes permits chopping off branches (by using results computed in sibling branches) sooner than would occur in the serial code.

However, for the most part, sublinear speedup is the norm.

Later, we discuss an important limit on speedup: Amdahl's Law. It considers speedup as P varies and the problem size remains fixed. This is sometimes called "strong scalability." Another section discusses an alternative, Gustafson-Barsis' Law, which assumes the problem size grows with P. This is sometimes called "weak scalability." But before discussing speedup further, we discuss another motivation for parallelism: power.


Power

Parallelization can reduce power consumption. CMOS (complementary metal–oxide–semiconductor) is the dominant circuit technology in modern computer hardware. CMOS power consumption is the sum of dynamic power consumption and static power consumption. For a circuit supply voltage V and operating frequency f, CMOS dynamic power dissipation is governed by the proportion in Equation 3:

Equation 3.

$$ P_{\text{dynamic}} \propto V^2 f $$

The frequency dependence is actually more severe than the equation suggests because the highest frequency at which a CMOS circuit can operate is roughly proportional to the voltage. Thus dynamic power varies as the cube of the maximum frequency. Static power consumption is nominally independent of frequency but is dependent on voltage. The relation is more complex than for dynamic power, but, for the sake of argument, assume it varies cubically with voltage. Because the necessary voltage is proportional to the maximum frequency, the static power consumption varies as the cube of the maximum frequency, too. Under this assumption, we can use a simple overall model where the total power consumption varies as the cube of the frequency.

Table 1: The maximum core frequency for an Intel Core i5-2500T chip depends on the number of active cores. The right column shows the parallel efficiency over all four cores required to match the speed of using only one active core.

Suppose that parallelization speeds up an application by 1.5X on two cores. You can use this speedup either to reduce latency or reduce power. If your latency requirement is already met, then reducing the clock rate of the cores by 1.5X will save a significant amount of power. Let P1 be the power consumed by one core running the serial version of the application. Then the power consumed by two cores running the parallel version of the application will be given by:

Equation 4.

$$ P_2 = 2\left(\frac{1}{1.5}\right)^3 P_1 \approx 0.6\,P_1 $$

where the factor of 2 arises from having two cores. Using two cores running the parallelized version of the application at the lower clock rate has the same latency but uses (in this case) 40% less power. Unfortunately, reality is not so simple. Current chips have so many transistors that frequency and voltage are already scaled down to near the lower limit just to avoid overheating, so there is not much leeway for raising the frequency. For example, Intel Turbo Boost Technology enables cores to be put to sleep so that the power can be devoted to the remaining cores while keeping the chip within its thermal design power limits. Table 1 shows an example. Still, the table shows that even low parallel efficiencies offer more performance on this chip than serial execution.
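
The arithmetic behind the two-core example can be checked with a few lines of code. This is a sketch of the simplified cubic model described above, not a model of any real chip; the function name `relative_power` is hypothetical.

```python
def relative_power(cores, speedup):
    """Power of `cores` cores, each clocked down by a factor of `speedup`,
    relative to one core at full clock.

    Assumes the simplified model in which per-core power varies as the
    cube of the clock frequency.
    """
    frequency_scale = 1.0 / speedup
    return cores * frequency_scale ** 3


ratio = relative_power(cores=2, speedup=1.5)
print(f"{ratio:.3f}")            # 0.593
print(f"{(1 - ratio):.0%} less")  # 41% less
```

The result, roughly 0.59 of the single-core power, matches the approximately 40% saving quoted in the text.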




These are good questions. My experience with work-stealing schedulers has been that they behave fairly well, albeit not perfectly, when there are multiple applications on the machine. The intuition is that if a software thread gets less real CPU time, it ends up stealing less work. That is, the inherent load-balancing qualities of work stealing counter the imbalances brought on when multiple applications are running.

On the theory side, proposes "A-Steal" for scheduling multiple work-stealing jobs efficiently. That would seem like a good approach for a cloud.

In practice, there have been some "resource managers" (including in TBB and PPL), but these have been limited in scope to a single process so far. Apple's "Grand Central Dispatch" may be doing system-wide distribution of work, but I have not studied it closely enough to be sure.

There have been proposals for reducing the overhead of distributing work and coordination, notably various schemes for implementing synchronization barriers in hardware. Invariably the difficult part becomes how to virtualize that hardware so that more than one application can use it in a time-sliced environment. If time slicing is replaced by space slicing (partitioning the hardware), the problem becomes how to dynamically partition the synchronization hardware.

For very large machines ("petascale" and beyond) there has been work on addressing the problem not by speeding up communication/synchronization, but by avoiding it. Look for "Communication-Avoiding Algorithms" on for some links.


Thanks for spotting the bad cross-reference. The original text in the book has the correct cross-reference. The mistake occurred when the excerpting renumbered the equations.


From Conclusion: "However, maximum efficiency (also known as linear speedup) is rare, since there are extra considerations involved in distributing work to processors and coordinating the processes."

And the examples used seem to assume that there are dedicated computing resources (processing cores etc.) per the considered application program.

How about the impact of those "extra considerations involved in distributing work to processors and coordinating the processes" in case the resources are dynamically allocated and assigned between work units of multiple applications while the programs are running?

Are those efficiency/performance scalability limiting impacts magnified when we have to move from per-app dedicated to dynamically shared (cloud) computing resources? And isn't cloud hosting becoming the norm?

Is anything being thus done to fight the scalability limiting impacts of the overhead involved in distributing work to processors and coordinating the processes?


Eq8 is the result of substituting Eq7 into Eq6. In your article you write that "Substitute these into Equation 7 and simplify to get..." Other than that, a very interesting read.