Kirk Krauss is a software developer working on IBM Rational PurifyPlus.
"People with ropes around their necks don't always hang."
-- Lee van Cleef in The Good, the Bad, and the Ugly
In today's world of multicore computing, it's important to make sure that your multithreaded applications scale. A longstanding approach for making the most of multiple processors is to arrange for different resources to be protected by different synchronization objects, or locks. A drawback of this approach is that a bug in your threading model can lead to a deadlock.
Given the complexity of modern applications, particularly where routines can call back and forth between one another, you have to be very careful to avoid the situation where two or more threads can get stuck waiting for locks held by each other. If two threads take two locks, and each then waits for the lock held by the other, a classic deadlock begins. As an example, consider a program that has two threads, T1 and T2, and two locks, L1 and L2. A deadlock occurs in the following scenario:
Once the threads have deadlocked, they won't execute further code, other than such code as may be involved in waiting on the two locks. Even with careful programming, complex software can be subject to deadlocks, which typically are experienced by users as hangs.
Existing strategies for deadlock prevention, deadlock avoidance, and deadlock detection can be used to either head off deadlocks by not blocking at all, or by automatically exiting the process when a deadlock occurs. (Go here for more information.) In this two-part article, I explore a way to solve the problem by extending the locks themselves, providing them with both self-awareness and a safe way to head off deadlocks and continue executing. Although some studies in fault-tolerant computing consider automatic introspective analysis (for example, see Runtime Verification and Validation for Multi-Core Based On-Board Computingby Hans Zima and Mark James), no detailed methods involving self-aware locks are spelled out amongst any prior materials that I'm aware of.
Consider just what's happening in the classic deadlock situation outlined above: A first thread acquires a first lock, a second thread acquires a second lock, the first thread then waits for the second lock, and the second thread waits for the first lock. You can write code to detect this condition. If it's detected, your code can safely handle the problem via substitution of a surrogate lock in lieu of the first lock. To detect this condition, you can set up watchdog methods to perform the following tracking as your program works with its threads and locks:
- Track all threads and their states (ie. running or waiting, and if waiting whether waiting on a lock, and if so, which lock);
- For each thread, also track information as to which locks that thread is holding at any given time, if any; and
- Track all lock creation, acquisition (by which thread), and release.
Your watchdog can include methods that act as wrappers for the system's synchronization API functions. When a thread attempts to acquire a lock, prior to actually acquiring it, the thread can enter one of your watchdog's methods that wraps lock acquisition. Let's call this wrapper your watchdog's lock acquisition method. This method can track the fact that an attempt is being made to acquire the lock. It also can check the lock status to determine whether another thread is currently holding the lock. If so, your lock acquisition method can check the state of the thread that holds the lock to determine whether it's waiting on another lock that the current thread is currently holding. In that case, to prevent a deadlock, your lock acquisition method can prevent the current thread from acquiring the lock.
Your watchdog can prevent a deadlock as described above, entirely within its lock acquisition method. It can go on to maintain thread-safe resource access by deploying a surrogate lock. This could be implemented in more than one way. Here's an example.
Two threads were involved in the near-deadlock: T1 (which attempted to acquire a first lock) and T2 (which already held that first lock and was waiting for another lock still held by T1). To keep thread T1 going without a deadlock, your watchdog's lock acquisition method can arrange for T1 to bypass lock acquisition and to instead return control to the routine that attempted to acquire the lock. Prior to returning, your lock acquisition method can set a flag in the tracking structure for the thread T2 that currently holds that first lock. The flag will indicate that both threads are now serialized by a new lock, which your lock acquisition method can create before returning.
Once your watchdog's lock acquisition method has created the new lock, it can arrange for the current thread T1 to acquire it as a surrogate in lieu of the first lock before continuing. The lock acquisition method also can arrange for thread T2 to wait on the new surrogate lock. Thread T1 can then execute, as it would have without the deadlock, while thread T2 waits. When thread T1 attempts to release the lock, your method can intercept this attempt and instead arrange for thread T1 to release the surrogate lock.
Thread T2 can then run. The point will eventually come at which thread T2 was programmed to release the lock that was about to trigger the deadlock situation. Upon intercepting this release attempt, your watchdog's lock release method can arrange for thread T2 to release both that lock and the surrogate lock that was created at the point where the near-deadlock experience occurred.
Once the surrogate lock is in play, other threads may have to use it to protect access to the same resources that the original lock was protecting. That is, to prevent race conditions involving threads other than those that were about to deadlock, the watchdog's methods can continue to substitute the surrogate lock in lieu of the original lock, for as long as any thread is holding that surrogate lock. In other words, any thread that attempts to acquire the original lock can instead acquire the surrogate lock, for thread safety. As soon as no thread holds the surrogate lock, your watchdog's methods can of course dispense with the surrogate lock. Your program can once again acquire the original lock as designed.
If you're sure that no other threads will acquire the locks involved in the deadlock situation, there's a simpler alternative that requires no surrogate lock. Because thread T2 is waiting on a lock held by thread T1, the threads are already serialized for purposes of the attempted lock acquisition that was about to cause a deadlock. Since the threads are serialized, your watchdog's methods can arrange thread-safe resource access without creating a new lock.
To do that, your watchdog's methods just have to ensure that the threads remain serialized as needed. In this simplified scenario, your program can both avoid the deadlock and set up the needed serialization like this: Bypass the lock acquisition by thread T1, allowing it to continue without actually acquiring the lock held by T2. When T1 makes the call that would have released that lock, bypass that call as well. If T1 attempts to release the lock that T2 is waiting for, bypass the release of that lock by T1, and instead perform it along with the release of the first lock, once T1 releases that lock. The idea is that at just one of the locks serializes the threads until the point where the program, as designed, would have released both locks. Again, this simplified approach should work safely only if other threads can't acquire the locks involved.