Cache-Friendly Code: Solving Manycore's Need for Faster Data Access


Manycore processors such as the coming Xeon Phi are many-headed beasts that can chew through a tremendous amount of data, but only if they are fed properly. If data can't be brought in quickly enough, some of the cores will starve and sit idle, waiting for data to arrive.

The multicore opportunity leads developers to focus on software scalability in unprecedented ways. Before, scalability was primarily of interest to those with access to unusual hardware. Now, it is a critical metric for assessing the long-term performance potential of any code or algorithm.

There are two ways to take advantage of multiple cores: Developers can run a single program and make it multithreaded, or they can run multiple processes at the same time. Scalability is an important characteristic in either case.

The case of a single program needing to be spread across multiple threads has been discussed numerous times in Dr. Dobb's, with extensive coverage of the various techniques.
The scaling of threaded programs is frequently limited by contention for resources. Resources that can limit scalability include processor cache space, memory space, and the bandwidth of the links between components (processor, memory, interconnect, and storage). With multi- and manycore processors, the key bandwidth limitation is usually the link between main memory and the processor, which is precisely the bottleneck the processor caches exist to relieve. These caches are critical for ensuring that the many cores of our beast are adequately fed.

Homogeneous Workloads

Other significant workloads are composed of numerous distinct subproblems that are solved independently. In these cases, processor occupancy (the proportion of busy cores) can be raised by running many copies of the program at the same time. Each individual instance may not run terribly fast, but the workload as a whole should achieve a net throughput that wouldn't be possible on older processors. Examples of these workloads, often described as "embarrassingly parallel," include serving static Web content, calculating Monte Carlo financial models, and bulk processing of image or audio data.

However, when multicore processors are used to run many instances of simple programs, those instances still share system resources such as bandwidth, cache, and memory. These shared resources can become bottlenecks even when the computations are completely distinct and embarrassingly parallel. When contention occurs, cores must wait for data and so sit idle part of the time. This means that even with embarrassingly parallel workloads, adding tasks (even when cores are available to run them) will not necessarily increase throughput.

Heterogeneous Workloads

Because a large fraction of front- and back-office software is single threaded, virtualization is a way to achieve higher core occupancy at the system-administration level. It allows multi- and manycore processors to support a broader mix of workloads with a high degree of transparency to both the application and the organization. But the scalability limits mentioned previously still apply: when two or more virtual machines are placed on a multi- or manycore processor, all of their processes share the same memory bandwidth and contend for the same cache space.

Cache and memory bandwidth are limiting factors in all these scenarios, and they become progressively more challenging as the number of cores increases.

Cache Memory and a Serial Example

Allocation of space in the processor's data cache is managed by the hardware and is based on the processor's access pattern. When the processor accesses a part of memory that is not already in the cache, the hardware loads a chunk of memory surrounding the accessed address into the cache, on the assumption that the same data or adjacent data will be needed soon. If it is, that data will already be in the cache and quickly accessible.

The chunks of memory handled by the cache are called "cache lines," and the size of these chunks is the "cache line size." Common cache line sizes are 64 or 128 bytes. The number of lines a cache can hold is determined by the cache size. For example, a 64 KB cache with 64-byte lines has 1024 cache lines.
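
On Linux with glibc, the cache geometry can be queried at run time through sysconf. The following is a minimal sketch; it assumes the _SC_LEVEL* names are available, which are glibc extensions rather than standard POSIX:

#include <stdio.h>
#include <unistd.h>

/* Report the L1 data-cache line size and the L1/L2 cache sizes.
   A result of -1 or 0 means the value is not reported on this system. */
int main(void)
{
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    long l1   = sysconf(_SC_LEVEL1_DCACHE_SIZE);
    long l2   = sysconf(_SC_LEVEL2_CACHE_SIZE);

    printf("L1 data cache line size: %ld bytes\n", line);
    if (line > 0 && l1 > 0)
        printf("L1 data cache: %ld KB (%ld lines)\n", l1 / 1024, l1 / line);
    if (l2 > 0)
        printf("L2 cache: %ld KB\n", l2 / 1024);
    return 0;
}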

Prefetching in Hardware and Software

It is clear that when data is going to be used multiple times, keeping it in the cache will eliminate the latency problem for the second and subsequent accesses, but nothing discussed so far reduces latency the first time a program needs to access a given cache line. Ideally, the data would already be waiting in the cache for the processor to access the first time. This can be achieved through prefetching lines of data into the cache. Most processors have both hardware and software support for prefetching.

Software prefetching takes the form of instructions that don't operate directly on data, but tell the memory system to preload an address or range of addresses into the cache. Compilers can weave these instructions into computations as an optimization if it is likely that they will help.
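
With GCC and Clang, for example, a software prefetch can be issued from C through the __builtin_prefetch intrinsic. The sketch below assumes an index-driven access pattern that a hardware prefetcher could not predict; the prefetch distance is illustrative rather than a tuned value:

#include <stddef.h>

#define PREFETCH_DIST 16   /* illustrative look-ahead distance, in elements */

/* Sum data[] gathered through an index array. Because the addresses are
   irregular, we explicitly ask the memory system to start loading the
   element we will need PREFETCH_DIST iterations from now. */
double gather_sum(const double *data, const int *idx, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n)
            /* GCC/Clang builtin: address, 0 = read, 3 = high temporal locality */
            __builtin_prefetch(&data[idx[i + PREFETCH_DIST]], 0, 3);
        sum += data[idx[i]];
    }
    return sum;
}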

Hardware prefetching watches where the program is reading from memory and, if it detects a pattern, it will start preloading what will be needed before the program asks for it. Hardware prefetchers work best with very regular access patterns.

Cache Coherency

Modern processor caches also maintain a property called "coherency." A coherent cache places a copy of data close to the core, then takes steps to ensure that the whole system behaves as if there were only one copy of that data, thereby avoiding situations where separate cores see different values for what should be the same data. While vital for correctness, the mechanisms used to keep the caches of multiple cores coherent can add substantial overhead for certain access patterns. Coherency mechanisms are also needed to avoid errors when direct memory access (DMA) is used to change data in memory without the CPU being involved.
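
A common way this overhead shows up is "false sharing": two threads update different variables that happen to occupy the same cache line, so the line bounces between cores even though no data is logically shared. The following is a minimal sketch using POSIX threads; the 64-byte line size is an assumption, and exact timings will vary by machine:

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000L
#define LINE  64   /* assumed cache line size in bytes */

/* Two counters packed into the same cache line: each increment by one
   thread invalidates the line in the other core's cache. */
static struct { volatile long a, b; } together;

/* The same counters separated by a full (assumed) cache line, so each
   core keeps its own line and the coherency traffic disappears. */
static struct { volatile long a; char pad[LINE]; volatile long b; } apart;

static void *bump(void *p)
{
    volatile long *c = p;
    for (long i = 0; i < ITERS; i++)
        (*c)++;
    return NULL;
}

static void run(volatile long *a, volatile long *b, const char *label)
{
    struct timespec t0, t1;
    pthread_t t_a, t_b;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&t_a, NULL, bump, (void *)a);
    pthread_create(&t_b, NULL, bump, (void *)b);
    pthread_join(t_a, NULL);
    pthread_join(t_b, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("%-20s %.2f s\n", label,
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
}

int main(void)
{
    run(&together.a, &together.b, "same cache line:");
    run(&apart.a, &apart.b, "separate lines:");
    return 0;
}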

Some Cache Examples

The amount of cache per processor core is limited, but the cache is much faster than main memory. For example, the quad-core Intel Sandy Bridge (Intel Core i7) has 32 KB of L1 (innermost) cache per core, 256 KB of L2 cache per core, and 2 MB of L3 cache per core. Accessing L1 takes about 1.2 ns, L2 about 3.6 ns, and L3 about 11.7 ns. AMD's 8-core Bulldozer (FX-8150) has 16 KB/1024 KB/1024 KB of L1/L2/L3 cache per core, with access times of 1.1 ns/8.4 ns/24.6 ns, respectively. Accessing main memory takes about 60-70 ns with either processor [1].

Consider the following simple routine:

static volatile int array[Size];
static void test_function(void)
{
    for (int i = 0; i < Iterations; i++)
        for (int x = 0; x < Size; x++)
            array[x]++;
}

This routine was run on a system with a 512 KB L2 cache. For comparison purposes, the size of the array was varied while the number of iterations was adjusted to keep the total number of increments constant. When the array was larger than 512 KB, the run took about 50 seconds. When the array was small enough to fit completely within the cache, the calculation ran in about 5 seconds. The magnitude of this effect, roughly a 10x speedup, is consistent with the ratio of main-memory latency to cache latency.
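
A minimal harness for reproducing this kind of measurement might look like the following sketch. The Size and Iterations values are illustrative; the idea is to rebuild with different Size values while scaling Iterations so that Size * Iterations stays constant, and POSIX clock_gettime is assumed to be available:

#include <stdio.h>
#include <time.h>

/* Illustrative parameters: vary Size (the array footprint) between builds
   while scaling Iterations so that Size * Iterations stays constant. */
#define Size       (1024 * 1024)   /* 4 MB with 4-byte ints: larger than L2 */
#define Iterations 1000

static volatile int array[Size];

static void test_function(void)
{
    for (int i = 0; i < Iterations; i++)
        for (int x = 0; x < Size; x++)
            array[x]++;
}

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    test_function();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("array: %lu KB, time: %.2f s\n",
           (unsigned long)(Size * sizeof(int) / 1024), elapsed);
    return 0;
}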

Multiple Processes, Cache Friendly Programs, and Scalability

So what can be done to fully utilize the performance of a multicore system when many separate processes are running on it? An important first step is to examine programs to see whether they use the cache efficiently. Cache-friendly algorithms, even in single-threaded applications, lay the foundation for efficient virtualization; otherwise, a single greedy process can gobble up the cache and leave other, innocent processes to starve as they wait for every data element to be fetched from main memory. Let's look at the kinds of problems that can inadvertently make caches inefficient, then examine optimizations that make simple programs more cache friendly.

Inefficient Loop Nesting and Blocking

Dense, multidimensional arrays are common datasets in numerical applications. When a program is processing a multidimensional array, such as pixel data in image processing or elements in a finite-difference program, it is important to pay attention to which dimension is assigned to the inner loop and which dimension is assigned to the outer loop.

If an array is traversed along the wrong axis, only one element will be used from each cache line before the program continues to the next cache line. For example, consider this small snippet of C code iterating over a two-dimensional matrix:

double array[SIZE][SIZE];
for (int col = 0; col < SIZE; col++)
    for (int row = 0; row < SIZE; row++)
        array[row][col] = f(row, col);

In one iteration of the outer loop, the inner loop will access one element in each row of a specific column. For each element touched by the loop, a new cache line is accessed. If the array is larger than the cache size, the rows from the top of the matrix may already have been evicted from the cache when the bottom of the matrix is reached. The next iteration of the outer loop then has to reload all of the cache lines again, each time paying the full main-memory latency cost.

If the nesting of the loops is changed so that they traverse the matrix along the rows instead of along the columns, the program receives a lot more benefit from the cache:

double array[SIZE][SIZE];
for (int row = 0; row < SIZE; row++)
    for (int col = 0; col < SIZE; col++)
        array[row][col] = f(row, col);

As the program steps through the elements of a row, data is moved into the cache and every element in each cache line is used before that line is discarded. Accessing data sequentially also allows the hardware prefetch logic to engage, placing data in the cache before the processor even asks for it, which maximizes the use of the available memory bandwidth. A code transformation like this can yield performance improvements of up to 15x for large array sizes. For example, the LBM code in the SPEC CPU2006 benchmark suite contains such a problem and ran 2.7 times faster on a quad-core processor when this simple transformation was applied to one of its loops.
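
Blocking (tiling) addresses the harder case where no single loop order is cache friendly, such as a matrix transpose, in which one of the two arrays is necessarily accessed with a large stride. Processing the matrix in small tiles that fit in the cache restores the reuse. A minimal sketch, with an illustrative tile size:

#define BLOCK 64   /* illustrative tile size; tune so a tile of src and dst fits in cache */

/* Transpose src into dst one BLOCK x BLOCK tile at a time, so the cache
   lines brought in for a tile are fully used before they are evicted. */
void transpose_blocked(int n, const double *src, double *dst)
{
    for (int ii = 0; ii < n; ii += BLOCK)
        for (int jj = 0; jj < n; jj += BLOCK)
            for (int i = ii; i < ii + BLOCK && i < n; i++)
                for (int j = jj; j < jj + BLOCK && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}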


