Herb Sutter is a bestselling author and consultant on software development topics, and a software architect at Microsoft. He can be contacted at www.gotw.ca.
Race conditions are one of the worst plagues of concurrent code: They can cause disastrous effects all the way up to undefined behavior and random code execution, yet they're hard to discover reliably during testing, hard to reproduce when they do occur, and the icing on the cake is that we have immature and inadequate race detection and prevention tool support available today.
The holy grail of the Consistency pillar is to make a concurrent program race-free and deadlock-free by construction.  "By construction" means to write our code in such a way that we can verify with confidence during development and testing that it does not contain races or deadlocks -- not just to rely on doing enough stress testing to get a probabilistic level of confidence, but to really know that there are no or deadlocks at all, at least for the shared data and locks that we control. I've written about some deadlock prevention techniques in "Use Lock Hierarchies to Avoid Deadlock."  This time, let's consider race prevention.
This article shows how to achieve the "race-free by construction" grail via sound engineering, with automatic and deterministic race identification at test time based on code coverage alone. Notes: For convenience, from now on I'm going to talk about just locks, but the issues and techniques apply to any kind of mutual exclusion. And although the example code I'll show will happen to be in C++, the techniques we'll discuss apply in any mainstream language, including C/Pthreads, C#, and Java.
Let's say you want to use some shared data protected using mutexes:
// use table1 and table2 (both are shared) table1.erase( x ); table2.insert( x );
Quick: What locks should you take to make this code correct?
The answer is usually to look in some documentation or comment, because the connection between a mutex and the data it protects usually exists primarily in the mind of the programmer, and we rely on discipline to remember which lock to take.
Let's say that table1 and table2 are protected by two mutexes mutTable1 and mutTable2, respectively. Then we might write something like this:
lock mutTable1 and mutTable2 table1.erase( x ); table2.insert( x ); release mutTable1 and mutTable2 appropriately (e.g., in finally blocks, or using automatic dispose/destructor methods)
The trouble is that the programmer has to remember that mutTable1 and mutTable2 are the "right" mutexes to take to use table1 and table2. This has two main problems.
The minor problem is productivity: We're creating work for the developer by forcing him to look up which mutex to use, because in general he won't get automatic IDE completion/suggestion support to answer questions like, "Which mutex do I need to take to use table2?"
The major problem is correctness: We're relying on programmer discipline to make the code correct, and discipline can fail us because we will sometimes make mistakes. What can go wrong? We can fail to take the correct lock, in one of two ways:
- "Oops, forgot to lock." We can forget to take a lock at all, such as that in this example we might forget to take the lock on mutTable2.
- "Oops, took the wrong lock." We can inadvertently take a lock on the wrong mutex, either because of confusion about which mutex is the right one or because of a simple accident like a typo that misspells mutTable2 as mutTable3.
The trouble is, either way we'll end up writing a race condition with a horrible failure mode: The code will silently compile and appear to work during testing, only to fail nondeterministically in production -- typically in the form of intermittent, timing-dependent, hard-to-reproduce bug reports from customer sites.
Instead of relying on comments and discipline, what we'd like to do is state in code which mutex protects which data, so that we can build our own guardrails to ensure we did it right. It turns out that this is a powerful approach, because it lets us turn a class of nondeterministic run-time failures into ones we can detect:
- at test time, and
- based on code coverage alone.
Let's see how.