
Super Linearity and the Bigger Machine


Running on a Bigger Machine

Besides leveraging the superlinearity that can crop up naturally in parallel algorithms, how else could we ever achieve a superlinear speedup by using more parallelism on the same machine? After all, more parallelism means we can use more cores to speed up compute-bound work, but just using more cores in itself gives us only a linear speedup. Right?

The fallacy is in the question's final words: "...on the same machine." When you run a program with less parallelism and another with more parallelism on the same modern desktop or server hardware, the one with more parallelism literally runs on a bigger machine—a disproportionately larger share of the available hardware. This happens because the one with more parallelism can use not only additional cores, but additional hardware resources attached to those cores that would not otherwise be available to the program. In particular, using more cores typically also means getting access to more cache and/or more memory.

To see why this is so, consider Figures 2 and 3, which show simplified block diagrams of the cores and caches on two modern commodity CPUs: the current Intel "Kentsfield" processor and the upcoming AMD "Barcelona" processor, respectively. The interesting feature of both chips is that each core has access to cache memory that is not available to some or all of the other cores. In the Kentsfield processor, each pair of cores shares a private L2 cache; in the Barcelona chip, each core has its own private L2 cache. In both cases, no single core has access to all the available L2 cache...and that means code running on just one core is limited, not only to that one core, but also to just a fraction of the available cache. For code whose performance is memory bound, the amount of available cache can make a huge difference.


Figure 2: Intel Kentsfield core and cache utilization: 1 thread versus 3 threads.


Figure 3: AMD Barcelona core and cache utilization: 1 thread versus 3 threads.

In Figure 2, with only one thread running, as expected three of the Kentsfield's four cores sit unused. But by running a sequential algorithm we don't lose only three cores; we also lose half of the system's available L2 cache. When we run a parallel algorithm with two or more workers (for example, on cores 1 and 3), not only do we get to use more processors, but we also double the amount of L2 cache available—that is, we literally run on a bigger machine with more cache. Similarly, on the Barcelona chip shown in Figure 3, each additional worker adds to the total cache size until we saturate the machine with four workers.
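To make the figure concrete, here is a minimal sketch of how a program might pin its workers to cores that don't share an L2 cache, so that together they see the full cache. It assumes Linux and the GNU toolchain (pthread_setaffinity_np is a GNU extension), and the core numbers 1 and 3 are placeholders taken from Figure 2's layout; the right numbers vary by machine.

```cpp
// Minimal sketch: pin two workers to cores that do not share an L2,
// so that together they can use both of Kentsfield's L2 caches.
// Linux/GNU-specific; build with: g++ -O2 -pthread affinity.cpp
#include <pthread.h>   // pthread_setaffinity_np (GNU extension)
#include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

// Restrict the calling thread to a single core.
static void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
}

// Hypothetical memory-bound work: repeatedly sum a slice of the array.
// Multiple passes reward a slice that fits in the local L2.
static long long sum_slice(const std::vector<int>& data,
                           size_t begin, size_t end) {
    long long total = 0;
    for (int pass = 0; pass < 10; ++pass)
        total += std::accumulate(data.begin() + begin,
                                 data.begin() + end, 0LL);
    return total;
}

int main() {
    std::vector<int> data(8u * 1024 * 1024, 1);   // ~32 MB of ints
    long long r0 = 0, r1 = 0;

    // Cores 1 and 3 stand in for a pair that does not share an L2 on
    // the Figure 2 layout; adjust the numbers for your machine.
    std::thread t0([&] { pin_to_core(1);
                         r0 = sum_slice(data, 0, data.size() / 2); });
    std::thread t1([&] { pin_to_core(3);
                         r1 = sum_slice(data, data.size() / 2, data.size()); });
    t0.join();
    t1.join();

    std::printf("total = %lld\n", r0 + r1);
    return 0;
}
```

On a real system you would discover which cores share a cache (on Linux, for example, from /sys/devices/system/cpu/cpu*/cache) rather than hard-coding the numbers.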

A similar effect occurs on NUMA (nonuniform memory access) machines. Figure 4 shows a sample NUMA architecture where the machine is organized into nodes, each of which has its own processors and its own RAM. All the processors can use all the RAM in the machine, but the fly in the ointment is that not all RAM is created equal: RAM that's attached to other processors is "further away" and more expensive to use, because it requires a trip through the inter-node interconnect. So this time the question isn't one of making more RAM available, but rather of making faster RAM available.


Figure 4: Sample NUMA processor and cache utilization: 1 thread versus 4 threads.

To illustrate, imagine that in Figure 4 the average time to access memory (in the case of a cache miss) is 200 clock cycles if the request goes to directly connected RAM, and 1000 cycles if it goes to RAM on another node. A program with only one thread running might find its data spread evenly across both nodes, so that about half of its misses cost 200 cycles and half cost 1000 cycles, for an average memory access time of around 600 cycles, modulo other cache and locality effects. But a program that runs two or more threads spread fairly evenly across the two nodes, each working mainly on data in its own node's RAM, might experience an average cache-miss cost of only 200 cycles. Given that waiting for memory is already a common bottleneck for modern high-performance software, and that we could do an awful lot of computation in those 400 clock cycles if we could use them for something better than sitting stalled while waiting for RAM to respond, getting those cycles back can be a big deal.
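Here is a sketch of how a program might keep each worker's memory on its own NUMA node. It uses libnuma, so it assumes Linux with libnuma installed (link with -lnuma); the node numbers mirror the two-node machine of Figure 4, and the array size is an arbitrary illustrative choice.

```cpp
// Minimal sketch: give each worker memory on its own NUMA node, so
// its cache misses cost the "local" ~200 cycles of Figure 4 rather
// than the "remote" ~1000 cycles. Assumes Linux with libnuma;
// build with: g++ -O2 -pthread numa_local.cpp -lnuma
#include <numa.h>   // numa_available, numa_run_on_node, numa_alloc_onnode
#include <cstdio>
#include <thread>

static long long work_on_node(int node, size_t n) {
    numa_run_on_node(node);   // keep this thread on the node's CPUs

    // Allocate from the node's own RAM. (With Linux's default
    // first-touch policy, memory a pinned thread touches first
    // usually lands locally anyway; this just makes it explicit.)
    int* data = static_cast<int*>(numa_alloc_onnode(n * sizeof(int), node));
    if (!data) return 0;
    for (size_t i = 0; i < n; ++i) data[i] = 1;

    long long sum = 0;
    for (int pass = 0; pass < 10; ++pass)   // memory-bound, multi-pass
        for (size_t i = 0; i < n; ++i) sum += data[i];

    numa_free(data, n * sizeof(int));
    return sum;
}

int main() {
    if (numa_available() < 0) {
        std::puts("no NUMA support on this machine");
        return 1;
    }
    const size_t n = 16u * 1024 * 1024;     // ~64 MB per worker
    long long r0 = 0, r1 = 0;

    // One worker per node; nodes 0 and 1 mirror Figure 4's two nodes.
    std::thread t0([&] { r0 = work_on_node(0, n); });
    std::thread t1([&] { r1 = work_on_node(1, n); });
    t0.join();
    t1.join();

    std::printf("total = %lld\n", r0 + r1);
    return 0;
}
```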

Well-written parallel code that is memory-bound and cache-bound can get faster superlinearly, up to a point: as more of its data fits into memory and cache, it gets more page and cache hits, and total computation time goes down. The more passes the code makes over the data, the more this superlinear effect is amplified.
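This is easy to check empirically. The following is a small, self-contained harness (standard C++; the sizes and pass counts are illustrative choices, not from the article) that times the same multi-pass scan with one worker and then with several. A measured speedup greater than k on k workers is superlinear, and it tends to show up when each worker's slice fits in a cache that the whole array does not.

```cpp
// Minimal harness: time the same multi-pass scan with 1..k workers.
// A speedup greater than k on k workers is superlinear. Standard
// C++; build with: g++ -O2 -pthread speedup.cpp
#include <chrono>
#include <cstdio>
#include <functional>  // std::cref
#include <numeric>
#include <thread>
#include <vector>

// Each worker scans its slice many times; once the slice fits in a
// cache that the whole array does not, the passes get much cheaper.
static void scan_slice(const std::vector<int>& d, size_t b, size_t e,
                       long long* out) {
    long long s = 0;
    for (int pass = 0; pass < 20; ++pass)
        s += std::accumulate(d.begin() + b, d.begin() + e, 0LL);
    *out = s;
}

static double seconds_with_workers(const std::vector<int>& d, unsigned k) {
    std::vector<std::thread> ts;
    std::vector<long long> sums(k, 0);
    const auto t0 = std::chrono::steady_clock::now();
    for (unsigned i = 0; i < k; ++i) {
        const size_t b = d.size() * i / k;
        const size_t e = d.size() * (i + 1) / k;
        ts.emplace_back(scan_slice, std::cref(d), b, e, &sums[i]);
    }
    for (auto& t : ts) t.join();
    return std::chrono::duration<double>(
               std::chrono::steady_clock::now() - t0).count();
}

int main() {
    // Pick a working set larger than one core's share of cache but
    // small enough that k slices fit; tune for your machine.
    std::vector<int> data(4u * 1024 * 1024, 1);   // ~16 MB
    const double t1 = seconds_with_workers(data, 1);
    for (unsigned k = 2; k <= std::thread::hardware_concurrency(); ++k)
        std::printf("%u workers: %.2fx speedup (linear would be %ux)\n",
                    k, t1 / seconds_with_workers(data, k), k);
    return 0;
}
```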

On Deck

There are two main ways into the superlinear stratosphere:

  • Do disproportionately less work.
  • Harness disproportionately more resources.

To accomplish the first point, we saw that interruption was essential. But how should we interrupt (or, to use the C-word, "cancel") a thread or a unit of work? More on that when we return...



