Finally, Don't Forget This Affects Readers, Too
In Example 1 we've been considering the case where all workers are writers, but readers are affected too. Consider the following variant of Example 1 where we arbitrarily force half the workers to only perform reads from their result[p], and measure the program's execution to see what happens:
// Example 3: Simple parallel version with half the accesses as
// reads (still flawed)
//
int result[P];
// Each of P parallel workers processes 1/P-th of the data;
// the p-th worker records its partial count in result[p]
for( int p = 0; p < P; ++p )
pool.run( [&,p] {
int local = 0;
result[p] = 0;
int chunkSize = DIM/P + 1;
int myStart = p * chunkSize;
int myEnd = min( myStart+chunkSize, DIM );
for( int i = myStart; i < myEnd; ++i )
for( int j = 0; j < DIM; ++j )
if( matrix[i*DIM + j] % 2 != 0 )
if( p % 2 != 0 ) // abitrarily have every second
// worker
local += result[p]; // only read from its
// unshared result[p]
else
++result[p];
} );
// … etc. as before …
How's the performance? To paraphrase Family Feud, in Figure 6 our "survey saaaays…"
Alas, Example 3 is roughly just as bad as Example 1, and with more erratic performance behavior to boot. Even with half the workers performing only reads of their result[p], even at P = 23 we get about the same performance as when P = 1.
Figure 7 shows the CPU monitor screen shots for P = 1 and P = 23, and confirms that for P = 23 we're lighting up 23 cores without any useful effect on total execution time.
Finally, Figure 8 summarizes the relative scalability of Examples 1 to 3 side by side, where ideal linear scalability would be a horizontalline at scaling = 1. As we've already seen, Example 2 scales perfectly while Examples 1 and 3 don't scale at all -- and for no other reason than false sharing, even though in Example 3 half the workers are merely performing reads.
Summary
Watch out for false sharing; it's an invisible scalability buster. The general case to watch out for is when you have two objects or fields that are frequently accessed (either read or written) by different threads, at least one of the threads is doing writes, and the objects are so close in memory that they're on the same cache line.
Detecting the problem isn't always easy. Typical CPU monitors completely mask memory waiting by counting it as busy time, which doesn't help us here, although the irregular lengths of the individual cores' busy times gives us a clue. Look for code performance analysis tools that let you measure, for each line of your source code, the cycles per instruction (CPI) and/or cache miss rates those source statements actually experience at execution time, so that you can find out which innocuous statements are taking extremely disproportionate amounts of cycles to run and/or spending a lot of time waiting for memory. You should never see high cache miss rates on a variable being updated by one thread in a tight inner loop, because it should just be loaded into cache once and then stay hot; lots of misses mean lots of contention on that variable or on a nearby one.
Resolve false sharing by reducing the frequency of updates to the falsely shared variables, such as by updating local data instead most of the time. Alternatively, you can ensure a variable is completely unshared by by using padding, and alignment if available, to ensure that no other data precedes or follows a key object in the same cache line.
Acknowledgments
Thanks to Joe Duffy and Tim Harris for their observations and comments on this article.
Notes
[1] H. Sutter. "Sharing Is the Root of All Contention" (Dr. Dobb's Digest, March 2009). http://www.ddj.com/go-parallel/article/showArticle.jhtml?articleID=214100002.
[2] H. Sutter. "Maximize Locality, Minimize Contention." (Dr. Dobb's Journal, 33(6), June 2008.) http://www.ddj.com/architect/208200273.
[3] H. Sutter "Measuring Parallel Performance: Optimizing a Concurrent Queue" (Dr. Dobb's Journal, 34(1), January 2009). http://www.ddj.com/cpp/211800538.
[4] If you run this test yourself, you might instead see one core light up for a while, then go dark and a different core light up for a while, and so on. That's just an artifact of operating system thread scheduling, as the OS moves the thread to a different core for its own reasons, for example to keep heat distributed more evenly across the chip. The result is still logically equivalent to that in Figure 2 because only one core is running at a time, and the total execution time should not be materially affected by just occasional migration to a different core.
Herb Sutter is a bestselling author and consultant on software development topics, and a software architect at Microsoft. He can be contacted at www.gotw.ca.


