Channels ▼
RSS

C/C++

Eliminate False Sharing


Finally, Don't Forget This Affects Readers, Too

In Example 1 we've been considering the case where all workers are writers, but readers are affected too. Consider the following variant of Example 1 where we arbitrarily force half the workers to only perform reads from their result[p], and measure the program's execution to see what happens:


// Example 3: Simple parallel version with half the accesses as
// reads (still flawed)
//
int result[P];
// Each of P parallel workers processes 1/P-th of the data;
// the p-th worker records its partial count in result[p]
for( int p = 0; p < P; ++p )
  pool.run( [&,p] {
    int local = 0;
    result[p] = 0;
    int chunkSize = DIM/P + 1;
    int myStart = p * chunkSize;
    int myEnd = min( myStart+chunkSize, DIM );
    for( int i = myStart; i < myEnd; ++i )
      for( int j = 0; j < DIM; ++j )
        if( matrix[i*DIM + j] % 2 != 0 )
          if( p % 2 != 0 ) // abitrarily have every second
                           // worker
            local += result[p]; // only read from its
                                // unshared result[p]
       else
         ++result[p];
  } );
// … etc. as before …

How's the performance? To paraphrase Family Feud, in Figure 6 our "survey saaaays…"

Alas, Example 3 is roughly just as bad as Example 1, and with more erratic performance behavior to boot. Even with half the workers performing only reads of their result[p], even at P = 23 we get about the same performance as when P = 1.

Figure 6: Example 3, with half the threads doing reads, is just as awful as Example 1.

Figure 7 shows the CPU monitor screen shots for P = 1 and P = 23, and confirms that for P = 23 we're lighting up 23 cores without any useful effect on total execution time.

Figure 7: Running Example 3 with P = 1 and P = 23 (ouch, approximately equal execution time to perform the same work, as we saw in Figure 6).

Finally, Figure 8 summarizes the relative scalability of Examples 1 to 3 side by side, where ideal linear scalability would be a horizontalline at scaling = 1. As we've already seen, Example 2 scales perfectly while Examples 1 and 3 don't scale at all -- and for no other reason than false sharing, even though in Example 3 half the workers are merely performing reads.

Figure 8: Examples 1 to 3 side by side (1 = perfect scaling).

Summary

Watch out for false sharing; it's an invisible scalability buster. The general case to watch out for is when you have two objects or fields that are frequently accessed (either read or written) by different threads, at least one of the threads is doing writes, and the objects are so close in memory that they're on the same cache line.

Detecting the problem isn't always easy. Typical CPU monitors completely mask memory waiting by counting it as busy time, which doesn't help us here, although the irregular lengths of the individual cores' busy times gives us a clue. Look for code performance analysis tools that let you measure, for each line of your source code, the cycles per instruction (CPI) and/or cache miss rates those source statements actually experience at execution time, so that you can find out which innocuous statements are taking extremely disproportionate amounts of cycles to run and/or spending a lot of time waiting for memory. You should never see high cache miss rates on a variable being updated by one thread in a tight inner loop, because it should just be loaded into cache once and then stay hot; lots of misses mean lots of contention on that variable or on a nearby one.

Resolve false sharing by reducing the frequency of updates to the falsely shared variables, such as by updating local data instead most of the time. Alternatively, you can ensure a variable is completely unshared by by using padding, and alignment if available, to ensure that no other data precedes or follows a key object in the same cache line.

Acknowledgments

Thanks to Joe Duffy and Tim Harris for their observations and comments on this article.

Notes

[1] H. Sutter. "Sharing Is the Root of All Contention" (Dr. Dobb's Digest, March 2009). http://www.ddj.com/go-parallel/article/showArticle.jhtml?articleID=214100002.

[2] H. Sutter. "Maximize Locality, Minimize Contention." (Dr. Dobb's Journal, 33(6), June 2008.) http://www.ddj.com/architect/208200273.

[3] H. Sutter "Measuring Parallel Performance: Optimizing a Concurrent Queue" (Dr. Dobb's Journal, 34(1), January 2009). http://www.ddj.com/cpp/211800538.

[4] If you run this test yourself, you might instead see one core light up for a while, then go dark and a different core light up for a while, and so on. That's just an artifact of operating system thread scheduling, as the OS moves the thread to a different core for its own reasons, for example to keep heat distributed more evenly across the chip. The result is still logically equivalent to that in Figure 2 because only one core is running at a time, and the total execution time should not be materially affected by just occasional migration to a different core.


Herb Sutter is a bestselling author and consultant on software development topics, and a software architect at Microsoft. He can be contacted at www.gotw.ca.



Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video