Eliminate False Sharing

Tools
  • Print
  • Email
Stop your CPU power from invisibly going down the drain

False Sharing: What To Look For

Note that Example 1 shows only one common case where data can end up being close together in memory, namely the case of elements of contiguous arrays. But the same thing can occur when we have two independently used fields within the same popular object, or objects are close to each other on the heap, or other situations.

The general case to watch out for is when you have two objects or fields that are frequently accessed (either read or written) by different threads, at least one of the threads is doing writes, and the objects are so close in memory that they're on the same cache line because they are:

  • objects nearby in the same array, as in Example 1 above;
  • fields nearby in the same object, as in Example 4 of [3] where the head and tail pointers into the message queue had to be kept apart;
  • objects allocated close together in time (C++, Java) or by the same thread (C#, Java), as in Example 4 of [3] where the underlying list nodes had to be kept apart to eliminate contention when threads used adjacent or head/tail nodes;
  • static or global objects that the linker decided to lay out close together in memory;
  • objects that become close in memory dynamically, as when during compacting garbage collection two objects can become adjacent in memory because intervening objects became garbage and were collected; or
  • objects that for some other reason accidentally end up close together in memory.

What To Do

When two frequently-used objects are sources of false sharing because they're in the same far-too-popular cache line, there are two general ways to remove the false sharing.

First, we can reduce the number of writes to the cache line. For example, writer threads can write intermediate results to a scratch variable most of the time, then update the variable in the popular cache line only occasionally as needed. This is the approach we took in Example 2, where we changed the code to update a local variable frequently and write into the popular result array only once per worker to store its final count.

Second, we can separate the variables so that they aren't on the same cache line. Typically the easiest way to do this is to ensure an object has a cache line to itself that it doesn't share with any other data. To achieve that, you need to do two things:

  • Ensure that no other object can precede your data in the same cache line by aligning it o begin at the start of the cache line or adding sufficient padding bytes before the object.
  • Ensure that no other object can follow your data in the same cache line by adding sufficient padding bytes after the object to fill up the line.

Here's how you can write this as a reusable library component in C++:

// C++ (using C++0x alignment syntax) template<typename T> struct cache_line_storage { [[ align(CACHE_LINE_SIZE) ]] T data; char pad[ CACHE_LINE_SIZE > sizeof(T) ? CACHE_LINE_SIZE - sizeof(T) : 1 ]; };

To get an object of type MyType that is stored on its own cache line, we would write cache_line_storage<MyType>. Note that this code assumes you've defined CACHE_LINE_SIZE to a suitable value for your target processor, commonly a power of two from 16 to 512. It also uses the standardized C++0x alignment syntax; if you don't have that yet, you can use a compiler-specific extension like Gnu's __attribute__(( aligned(x) )) or Microsoft's __declspec( align(x) ).

If you're on .NET, you can write something similar but for value types only, which in their unboxed form are always laid out "inline" rather than as a separate heap object:

// C#: Note works for value types only // [StructLayout(LayoutKind.Explicit, Size=2*CACHE_LINE_SIZE)] public struct CacheLineStorage<T> where T : struct { [FieldOffset(CACHE_LINE_SIZE)] public T data; }

It may seem strange that this code actually allocates enough space for two cache lines' worth of data instead of just one. That's because, on .NET, you can't specify the alignment of data beyond some inherent 4-byte and 8-byte alignment guarantees, which aren't big enough for our purposes. Even if you could specify a starting alignment, the compacting garbage collector is likely to move your object and thus change its alignment dynamically. Without alignment to guarantee the starting address of the data, the only way to deal with this is to allocate enough space both before and after data to ensure that no other objects can share the cache line.

For Java and .NET full-fledged objects (reference types), the solution is basically the same as for .NET value types, but more intrusive: You need to add the before-and-after padding internally inside the object itself because there is no portable way to add external padding directly adjacent to an object.

Applying this second approach to Example 1, we could change just the definition of the result array to space the array elements far enough apart. For example:

// Example 3: Simple parallel version (de-flawed using padding) // cache_line_storage<int> result[P]; // etc. as before, just replacing result[p] with result[p].data

Running performance tests confirms that this results in the same scalability curve as Example 2.

Parallel Pattern 5: Stencil
All memory addresses used for reads are expressed as offsets
Distributing Work Across Cores Using .NET
A roll-your-own ThreadPool implementation
Looking For The Lost Packets: Part 2
Looking For The Lost Packets: Part 1

Real World Parallelism Webinar Series
  • February 18, 2010
    Lock Contention, Using Intel Parallel Studio to Improve Performance
    Speaker: Vasanth Tovinkere, Software Engineer, Intel Corporation (Bio)

    Vasanth Tovinkere is a software engineer in the Developer Products Division (DPD) at Intel. His current role involves defining novel approaches to understanding and visualizing parallel performance and consulting with strategic customers to help them prepare and deliver code for the multicore world. Vasanth has been involved in the development of automatic semantic event detectors for digital sports technologies in Intel Labs. He also has been awarded three patents and has two patents pending.

    Abstract:
    Discover how easy it is to use the power of Microsoft Visual Studio and Intel Parallel Studio to find performance issues due to lock contention in threaded applications. This ensures that shipped applications can take better advantage of multicore processors. In this webcast, we provide live demonstrations that show how to identify lock contentions issues with Visual Studio and Intel Parallel Studio, an add-in to Visual Studio that helps developers create fast, reliable code on multicore processors.t.