Debugging multithreaded applications can be a challenging task. The added complexity of a multithreaded program greatly increases the number of states the program may be in at any given time. Determining the state of the program at the time of failure can be difficult; understanding why a particular state is troublesome can be even more difficult. Multithreaded programs often fail in unexpected ways, and often nondeterministically. Bugs may appear only sporadically, frustrating developers who are accustomed to troubleshooting issues that are consistently reproducible and predictable. Finally, multithreaded applications can fail in a drastic fashion: a deadlock can cause an application or, worse yet, the entire system to hang. Users tend to find these types of failures unacceptable.
General Debug Techniques
Regardless of the library or platform you are developing on, several general principles can be applied to debugging multithreaded software applications.
The first technique for eliminating bugs in multithreaded code is to avoid introducing the bug in the first place. Many software defects can be prevented by using proper software development practices. The later a problem is found in the product development lifecycle, the more expensive it is to fix. Given the complexity of multithreaded programs, it is critical that multithreaded applications are properly designed up front.
How often have you, as a software developer, experienced the following situation? Someone on the team that you're working on gets a great idea for a new product or feature. A quick prototype that illustrates the idea is implemented and a quick demo, using a trivial use-case, is presented to management. Management loves the idea and immediately informs sales and marketing of the new product or feature. Marketing then informs the customer of the feature, and in order to make a sale, promises the customer the feature in the next release. Meanwhile, the engineering team, whose original intent of presenting the idea was to get resources to properly implement the product or feature sometime in the future, is now faced with the task of delivering on a customer commitment immediately. As a result of time constraints, it is often the case that the only option is to take the prototype, and try to turn it into production code.
While this example illustrates a case where marketing and management may be to blame for failing to follow an appropriate process, software developers are often at fault in this regard as well. For many developers, writing software is the most interesting part of the job. There's a sense of instant gratification when you finish writing your application and press the run button. The results of all the effort and hard work appear instantly. In addition, modern debuggers provide a wide range of tools that allow developers to quickly identify and fix simple bugs. As a result, many programmers fall into the trap of coding now and deferring design and testing work to a later time. Taking this approach on a multithreaded application is a recipe for disaster for several reasons:
- Multithreaded applications are inherently more complicated than single-threaded applications. Hacking out a reliable, scalable implementation of a multithreaded application is hard, even for experienced parallel programmers. The primary reason for this is the large number of corner cases that can occur and the wide range of possible execution paths through the application. Another consideration is the type of run-time environment the application is running on. The access patterns may vary wildly depending on whether the application is running on a single-core or multicore platform, and whether the platform supports simultaneous multithreading in hardware. These different run-time scenarios need to be thoroughly thought out and handled to guarantee reliability in a wide range of environments and use cases.
- Multithreaded bugs may not surface when running under the debugger. Multithreading bugs are very sensitive to the timing of events in an application. Running the application under the debugger changes the timing and, as a result, may mask problems. When your application fails in a test or, worse, in the customer environment, but runs reliably under the debugger, it is almost certainly a timing issue in the code.
While following a software process can feel like a nuisance at times, taking the wrong approach and not following any process at all is a perilous path when writing all but the most trivial applications. This holds true for parallel programs. While designing your multithreaded applications, you should keep the following points in mind.
- Design the application so that it can run sequentially. An application should always have a valid means of sequential execution. The application should be validated in this run mode first. This allows developers to eliminate bugs in the code that are not related to threading. If a problem is still present in this mode of execution, then the task of debugging reduces to single-threaded debugging. In many circumstances, it is very easy to generate a sequential version of an application. For example, an OpenMP application compiled with one of the Intel compilers can use the openmp-stubs option to tell the compiler to generate sequential OpenMP code.
- Use established parallel programming patterns. The best defense against defects is to use parallel patterns that are known to be safe. Established patterns solve many of the common parallel programming problems in a robust manner. Reinventing the wheel is not necessary in many cases.
- Include built-in debug support in the application. When trying to find the root cause of an application fault, it is often useful for programmers to be able to examine the state of the system at any arbitrary point in time. Consider adding functions that display the state of a thread, or of all active threads. Trace buffers, described in the next section, may be used to record the sequence of accesses to a shared resource. Many modern debuggers support the capability of calling a function while stopped at a breakpoint. This mechanism allows developers to extend the capabilities of the debugger to suit their particular application's needs.
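The first point above, keeping a valid sequential mode of execution, can be illustrated with a small OpenMP sketch in C. The function names sum_to and max_threads are hypothetical and chosen only for illustration. The sketch relies on two properties of OpenMP: a compiler that is not asked to enable OpenMP ignores the pragma, and the _OPENMP macro is defined only when OpenMP is enabled. The same source therefore builds as either a parallel or a sequential program.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>   /* only needed, and only available, in an OpenMP build */
#endif

/* Hypothetical example: sums the integers 1..n.
 * In an OpenMP build the loop is split across threads and the partial
 * sums are combined by the reduction clause. In a sequential build the
 * pragma is ignored and this is an ordinary loop. */
long sum_to(long n)
{
    assert(n >= 0);
    long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 1; i <= n; i++)
        sum += i;
    return sum;
}

/* Hypothetical helper: reports how many threads a parallel region may
 * use. A sequential build always reports 1. */
int max_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

Validating sum_to in the sequential build first separates ordinary logic errors from threading errors. If a result differs only in the parallel build, the problem is likely in the threading, for example a missing reduction clause.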
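The built-in debug support described in the last point might look something like the following minimal sketch in C. All of the names here (the thread_state values, set_thread_state, dump_thread_states, debug_dump, and the MAX_THREADS limit) are assumptions made for illustration, not part of any library. Each thread records what it is doing in its own slot, and a dump routine prints every active slot. The slots are C11 atomics so that the dump routine can read them without taking a lock. This matters when the routine is called from a debugger, because a thread stopped at a breakpoint could be holding that lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

#define MAX_THREADS 8   /* hypothetical upper bound on worker threads */

/* Hypothetical per-thread states; a real application defines its own. */
enum thread_state {
    T_UNUSED = 0, T_IDLE, T_WAITING_FOR_LOCK, T_WORKING, T_DONE
};

static const char *const state_names[] = {
    "unused", "idle", "waiting-for-lock", "working", "done"
};

/* One slot per thread; static storage starts out as T_UNUSED (0). */
static atomic_int g_thread_state[MAX_THREADS];

/* Called by a thread whenever it changes what it is doing. */
void set_thread_state(int slot, enum thread_state s)
{
    assert(slot >= 0 && slot < MAX_THREADS);
    atomic_store(&g_thread_state[slot], (int)s);
}

/* Writes one line per active thread to `out` and returns the number of
 * active threads. Takes no locks, so it cannot block on a lock held by
 * a stopped thread. */
int dump_thread_states(FILE *out)
{
    int active = 0;
    for (int i = 0; i < MAX_THREADS; i++) {
        int s = atomic_load(&g_thread_state[i]);
        if (s == T_UNUSED)
            continue;
        fprintf(out, "thread slot %d: %s\n", i, state_names[s]);
        active++;
    }
    return active;
}

/* Argument-free wrapper that is convenient to invoke from a debugger. */
void debug_dump(void)
{
    dump_thread_states(stderr);
}
```

With the program stopped at a breakpoint, a debugger that supports calling functions can invoke the wrapper directly; in GDB, for example, with `call debug_dump()`. The output is a best-effort snapshot: each slot is read atomically, but the slots are not read at a single instant.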