Parallel

Super Linearity and the Bigger Machine

By Herb Sutter, March 12, 2008

Herb considers how to set superlinear speedups by harnessing more resources.

Running on a Bigger Machine

Besides leveraging the superlinearity that can crop up naturally in parallel algorithms, how else could we ever achieve a superlinear speedup by using more parallelism on the same machine? After all, more parallelism means we can use more cores to speed up compute-bound work, but just using more cores in itself gives us only a linear speedup. Right?

The fallacy is in the question's final words: "...on the same machine." When you run a program with less parallelism and another with more parallelism on the same modern desktop or server hardware, the one with more parallelism literally runs on a bigger machine—a disproportionately larger share of the available hardware. This happens because the one with more parallelism can use not only additional cores, but additional hardware resources attached to those cores that would not otherwise be available to the program. In particular, using more cores typically also means getting access to more cache and/or more memory.

To see why this is so, consider Figure 2 and Figure 3. Each of these shows a simplified block diagram of the cores and caches on two modern commodity CPUs: the current Intel "Kentsfield" processor, and the upcoming AMD "Barcelona" processor, respectively. The interesting feature in both chips is that each core has access to cache memory that is not available to some or all of the other cores. In the Kentsfield processor, each pair of cores shares a private L2 cache; in the Barcelona chip, each core has its own private L2 cache. In both cases, no core by itself has access to all the available L2 cache...and that means that code running on just one core is limited, not only to the one core, but also to just a fraction of the available cache. For code whose performance is memory bound, the amount of available cache can make a huge difference.

[Click image to view at full size]

Figure 2: Intel Kentsfield core and cache utilization: 1 thread versus 3 threads.

[Click image to view at full size]

Figure 3: AMD Barcelona core and cache utilization: 1 thread versus 3 threads.

In Figure 2, with only one thread running, as expected we see three of the Kentsfield's four cores are unused. But by running a sequential algorithm we don't only lose three cores; we also lose half the system's available L2 cache. When we run a parallel algorithm using two workers (for example, on cores 1 and 3) or more, not only do we get to use more processors, but we also double the amount of L2 cache available—that is, we literally run on a bigger machine with more cache. Similarly on the Barcelona chip shown in Figure 3, each additional worker adds to the total cache size until we saturate the machine with four workers.

A similar effect occurs on NUMA (nonuniform memory access) machines. Figure 4 shows a sample NUMA architecture where the machine is organized into nodes that each has its own processors and its own RAM. All the processors can use all the RAM in the machine, but the fly in the ointment is that not all RAM is created equal: RAM that's attached to other processors is "further away" and more expensive to use, because it requires a trip through the inter-node interconnect. So this time the question isn't one of making more RAM available, but rather of making faster RAM available.

[Click image to view at full size]

Figure 4: Sample NUMA processor and cache utilization: 1 thread versus 4 threads.

To illustrate, imagine that in Figure 4, the average time to access memory (in the case of a cache miss) is 200 clock cycles if the request goes to directly connected RAM, and 1000 cycles if it's to RAM on another node. A program with only one thread running might experience an average memory access time of around 600 cycles, modulo other cache and locality effects. But a program that runs two or more threads that are spread fairly evenly across the two nodes might experience an average cache miss access time of only 200 cycles. Given that waiting for memory is already a common bottleneck for modern high-performance software, and that we could do an awful lot of computation in those 400 clock cycles if we could use them for something better than sitting stalled while waiting for RAM to respond, getting those cycles back can be a big deal.

Well-written parallel code that is memory-bound and cache-bound can get faster superlinearly to a point because more data fits into memory and cache, and so total computation time goes down because we get more page and cache hits. The more the code makes multiple passes over the data, the more this superlinear effect is amplified.

On Deck

There are two main ways into the superlinear stratosphere:

Do disproportionately less work.
Harness disproportionately more resources.

To accomplish the first point, we saw that interruption was essential. But how should we interrupt (or, to use the C-word, "cancel") a thread or a unit of work? More on that when we return...

Notes

[1] H. Sutter. "Going Superlinear" (Dr. Dobb's Journal, March 2008; www.ddj.com/cpp/206100542.

[2] Randomized algorithms do have their place in mitigating worst-case behaviors for sequential code, as noted in this reference: Si. Pi Ravikumar. Parallel Methods for VLSI Layout Design, 4.3.2 (Ablex/Greenwood, 1996).

Previous 1 2

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Parallel

Super Linearity and the Bigger Machine

Running on a Bigger Machine

On Deck

Notes

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Parallel Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

Parallel

Super Linearity and the Bigger Machine

Running on a Bigger Machine

On Deck

Notes

Related Reading

News

Commentary

Slideshow

Video

Most Popular

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Parallel Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content