Dr. Dobb's | Debugging Production Software

Debugging Production Software

The Production Software Debug library includes utilities designed to identify and diagnose bugs in production software.

June 01, 2005
URL:http://www.drdobbs.com/tools/debugging-production-software/184406106

John is a programmer from Chicago, Illinois. He can be reached at [email protected].

It's crucial that the software we write be as close to bulletproof as possible. But production environments are hostile, and it sometimes seems like they were designed to chew up software and spit out a smoking, mutilated mass of worthless bytes. Users often do things we do not expect them to do or, worse, things we told them not to do. When our software fails to do something it was never designed to do, and do it perfectly, users cancel units, never to be seen or heard from again.

Nobody ever said programming was easy. Just for the record, I'll say it now—programming is hard. No matter how good you are, how much experience you have, how many books you read and classes you take, you can't escape inevitability. You will write bugs. I do it every day. I often tell my coworkers that if they aren't writing at least two new bugs each day, they probably aren't working hard enough. When I was a rookie C++ programmer, I thought that the key to writing code that was defect free was to know more C++, more techniques, more tricks. I don't think this anymore, and I'm much happier.

Debugging sessions are fine for detecting the major design flaws and little syntactic errors that crop up during development—buffer overruns, sending the wrong kind of message to some server somewhere, and the like. But what happens when bugs are detected by users in production software? Usually, angry users or administrators phone the help desk, but with little information that helps you debug the problem. You typically know what the symptoms were, because these are what told users there was a bug in the first place. But users are usually unreliable sources of objective or accurate information, and they generally cannot tell you what was happening before the bug occurred. Of course, this is the information you really need. So unless you are lucky and just happen to stumble across an obviously stupid piece of code such as this:

QUOTE* pQuote = new QUOTE;
delete pQuote;
pQuote->SetPrice(1.234f);

you will probably spend days looking for the bug. Once you get close enough to reproduce it, fixing the defect is usually comparatively simple, and often limited to around one line of code.

The problem is a lack of information. A bug is, by definition, an unknown quantity. Most language-level features that are designed to help diagnose problems are not intended for use in production software. Things like assert() become useless or worse in release builds, where they are typically compiled out entirely. When you get a call about a bug in production software, it can take a long time to identify and fix the problem, and most of that time is spent just trying to reproduce it. If you could identify the state of the universe more quickly, the effort needed to resolve bugs would drop considerably.

The Production Software Debug (PSD) library I present here is a library of utilities designed to identify and diagnose bugs in production software. There are only three main features in the library, but they pack a wallop. Used liberally in production code, the PSD library (available electronically; see "Resource Center" page 3) has helped to significantly reduce the amount of time it takes to fix bugs. Its three main features are:

verify(), a better assert().
static_verify(), a compile-time version of verify().
OutputMessage(), a generic logging mechanism that is easy to use and extend.

verify(): A Better assert()

There are few C++ language-level features to help identify bugs, and what precious few do exist are not suitable for production software. One such feature, inherited from C, is the assert() macro. The idea was simple. When a function is executed, you expect the software to be in a sane state. Pointers point to the right thing. Sockets are open. The planets are aligned. assert() makes it easy to check these things at runtime by adding precondition and postcondition checks to blocks of code.

But assert() is contrived. If the expression sent to assert() is false, it kills your program. Back in the '70s, when software was written by the people who ran it, maybe this kind of behavior was okay. But today, if a wild pointer results in the application going poof—well, that's just not going to do at all. Pointers shouldn't be wild in the first place, but the main point is that no matter how much code you write to keep your pointers from being wild, it's not going to be enough. Sometimes they will go wild, anyway. You must come to terms with this fact. It turns out that assert() isn't really useful at all for dealing with wild pointers in code that was written and tested. It's only useful in testing code that's still in development.

There are three major problems with assert that make it unsuitable for production code. The first one I already mentioned—it rips the bones from the back of your running program if a check fails. Second, it has no return value so that you can handle a failed check. Third, it makes no attempt to report the fact that an error occurred. The PSD library's runtime testing utilities address these problems. They are template functions that accept any parameter type that is compatible with operator!, and return bools—true for a successful check, false for a failed check. If the check fails, the test utilities simply return false and do not terminate the program or do anything similarly brutal. Like assert(), in a debug build, a failed verify() will also break into the debugger. But the most significant features of the verify() utilities are the tracing mechanisms.

The PSD library includes tracing utilities, and verify() uses these tracing utilities to dump a rich diagnostic message when a check fails. The message is automatically generated and output to a place where you can get it.

The diagnostic message is rich, meaning it includes a lot of detailed information. Actually, there isn't much information to include, but all of it is included. Specifically, the message says the exact location of the failed check, including source filename and line number, and the expression that failed, including the actual value of the expression. For example, suppose your program is logging a user into a server, and you have a runtime check to assert that the login succeeded:

if( !verify( LOGIN_OK ==
        pServer->Login("jdibling","password") ) )
{
    // handle a failed login attempt here
}

If the login did not succeed, this diagnostic message is generated and logged:

*WARNING* DEBUG ASSERTION FAILED: Application:
'MyApp.EXE', File: 'login.cpp', Line: 120,
Failed Expression:
'LOGIN_OK ==
pServer->Login("jdibling","password")'

This diagnostic message is sent to one or more destinations that you configure, which can include your own proprietary logging utility. By default, the PSD library sends all such messages to three places: std::cerr, the MSVC 6.0 debugging window (which is visible in production runs using DbgView.EXE, a freely available utility at http://www.sysinternals.com/), and a cleartext log file. (The name and location of this file are configurable, but by default it is named "log.txt" and saved in the current working directory.) The diagnostic message is extremely helpful in diagnosing problems that occur on customers' machines. It is generally much simpler to acquire a log file from a customer than to try to reproduce the error condition. In addition, it frequently is not enough to know just the failed expression and the location of the failed check in the source code. Usually, you need to know how the universe got into such a state, and the earlier output messages that accumulate when the PSD library is used liberally are of extraordinary significance. For example, I usually want to know the exact version of my software in which the error occurred, so I output that information to the log file using the tracing utilities in the PSD library. The failed check and the version information taken together are often enough to know just what happened and why.

In the aforementioned code, verify is actually a preprocessor macro for the template function Verify (note the change in case). Generally, I don't like macros, but in this case, the benefits outweighed any detraction. You could call the Verify() template function directly, as it is included in the library interface, but there is little point and I have never seen a reason to do so. Also, if you call verify() (the macro version) and the check fails, the diagnostic message includes the filename and line number of the failed check. This is accomplished through macro black magic. If you call Verify() (the low-level template function) directly, you lose this benefit and are on your own in trying to figure out which Verify() check failed.
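The macro-plus-template arrangement described above can be sketched roughly as follows. This is an illustration only, not the PSD library's source: the psd_sketch namespace, the in-memory Log() sink, the message wording, and the exact parameter list of Verify() are all assumptions made for the example, and the debugger break that the library performs in debug builds is left out.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

namespace psd_sketch {

// Hypothetical in-memory sink standing in for the library's
// configurable message destinations.
inline std::vector<std::string>& Log() {
    static std::vector<std::string> log;
    return log;
}

// Low-level check (assumed signature). Accepts any type usable with
// operator!, returns true on success, and on failure records a
// diagnostic and returns false. It never terminates the program.
template <typename T>
bool Verify(const T& value, const char* expr, const char* file, int line) {
    if (!value) {
        std::ostringstream msg;
        msg << "*WARNING* ASSERTION FAILED: File: '" << file
            << "', Line: " << line
            << ", Failed Expression: '" << expr << "'";
        Log().push_back(msg.str());
        return false;
    }
    return true;
}

}  // namespace psd_sketch

// The macro stringizes the expression and captures the call site,
// which a plain function call cannot do for itself.
#define verify(expr) \
    (psd_sketch::Verify((expr), #expr, __FILE__, __LINE__))
```

Because the macro expands at the call site, __FILE__ and __LINE__ refer to the line containing the check rather than to a line inside the library, which is the benefit that is lost when Verify() is called directly.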

There are several other flavors of verify() as well, good for common special cases and taking more control over its behavior. One flavor is noimpl(), a default handling placeholder for the black holes in your code. The most common example of this is the default handler in a switch statement. In the case where your intent in a switch is to handle every possibility, you often have a default handler to do some default handling when things go wrong. Adding a noimpl() call to these blocks triggers a call to verify(false). Many otherwise very-hard-to-detect bugs are simply flagged using this feature.
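As a rough illustration of the switch case just described, here is a minimal sketch. VerifySketch(), the g_failedChecks counter, and the OrderType enumeration are hypothetical stand-ins invented for the example; only the name noimpl() and its behavior of triggering a failed check come from the article.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Hypothetical counter standing in for the library's diagnostic output.
static int g_failedChecks = 0;

// Minimal stand-in for verify(): report the failure and return false,
// without aborting the program.
inline bool VerifySketch(bool ok, const char* expr,
                         const char* file, int line) {
    if (!ok) {
        ++g_failedChecks;
        std::fprintf(stderr, "check failed: '%s' at %s:%d\n",
                     expr, file, line);
    }
    return ok;
}

// noimpl() marks a path that should never run; it always fails the check.
#define noimpl() VerifySketch(false, "noimpl()", __FILE__, __LINE__)

// ORDER_AMEND plays the part of a value added later that the switch
// below was never updated to handle.
enum OrderType { ORDER_BUY, ORDER_SELL, ORDER_CANCEL, ORDER_AMEND };

const char* OrderName(OrderType type) {
    switch (type) {
        case ORDER_BUY:    return "buy";
        case ORDER_SELL:   return "sell";
        case ORDER_CANCEL: return "cancel";
        default:
            noimpl();          // the unhandled value is reported, not ignored
            return "unknown";
    }
}
```

Calling OrderName(ORDER_AMEND) still returns a usable default, but the failed check leaves a trace pointing at the exact default handler that was reached.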

Another flavor of verify() is testme(), which is kind of like a bookmark. When writing new blocks of code that you intend to test by stepping through manually, just add a call to testme() at the beginning of the block. I have found that when I'm writing code that I plan to step through, it is usually in lots of different places and I tend to lose track of them. testme() breaks into the debugger when it is run (just like verify) and reminds you where to test.

static_verify()

static_verify() is a version of verify() that is "run" at compile time, rather than runtime. The motivation for this device has existed for many years, but the design for it was derived from one presented in Andrei Alexandrescu's book Modern C++ Design.

static_verify() is especially useful at detecting when some critical implementation details have changed without your realizing it. Relying on the implementation details of some data structure or object is almost always a bad idea. But in the real world, it happens all the time. Older code, newer programmers, and plain bad designs are everywhere, and our job is to get all of this code to work first, and pontificate about how it isn't pristine later.

This code is guaranteed to work so long as the two user id fields are the same size:

struct USER
{
    char m_cUID[10];
    char m_cPwd[10];
};
struct LOGIN_MESSAGE
{
    char m_cUID[10];
    char m_cPwd[10];
};
: :
static_verify( sizeof(USER::m_cUID) ==
    sizeof(LOGIN_MESSAGE::m_cUID) );
memcpy(user.m_cUID,login.m_cUID,
    sizeof(user.m_cUID));

Because it is doing a memcpy(), it's going to be fast. It does not matter what the size actually is, and it does not matter what the format of the char buffers are (for example, whether they are null terminated, space padded, or whatever). But if one of the char buffers is changed in size, the static_verify() halts the compiler with an error message, and you can adjust your algorithm to work with the new disparity.
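One way a compile-time check of this kind can be built without compiler support is sketched below, in the spirit of the incomplete-type technique from Modern C++ Design. It is not the PSD library's actual implementation: the StaticVerifyFailed template, the helper macros, and the CopyUserId() function are names invented for the example. (C++11 and later provide static_assert for the same purpose.)

```cpp
#include <cassert>
#include <cstring>

// Only the 'true' specialization is defined. Asking for the size of
// StaticVerifyFailed<false> names an incomplete type, so the compiler
// rejects the code.
template <bool> struct StaticVerifyFailed;
template <> struct StaticVerifyFailed<true> {};

#define STATIC_VERIFY_JOIN2(a, b) a##b
#define STATIC_VERIFY_JOIN(a, b) STATIC_VERIFY_JOIN2(a, b)

// Declares a uniquely named enumerator whose value can only be
// computed when the condition is true.
#define static_verify(expr) \
    enum { STATIC_VERIFY_JOIN(static_verify_line_, __LINE__) = \
               sizeof(StaticVerifyFailed<(expr)>) }

struct User         { char m_cUID[10]; char m_cPwd[10]; };
struct LoginMessage { char m_cUID[10]; char m_cPwd[10]; };

// Copies the user ID field; compilation fails if the two fields
// ever stop being the same size.
void CopyUserId(User& user, const LoginMessage& login) {
    static_verify(sizeof(user.m_cUID) == sizeof(login.m_cUID));
    std::memcpy(user.m_cUID, login.m_cUID, sizeof(user.m_cUID));
}
```

If someone later changes one of the m_cUID arrays to a different length, CopyUserId() stops compiling, which flags the memcpy() for review before it can overrun or truncate anything at runtime.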

OutputMessage()

OutputMessage() and the other tracing utilities make it easy to generate messages that are sent wherever you want. Use OutputMessage() liberally to log the values of variables and parameters, trace the execution path of a function, and so on. Again, the runtime testing utilities also generate calls to OutputMessage().

OutputMessage() works like sprintf, so it is easy to use, and chances are pretty good that you already know how to use it. There are flavors of OutputMessage() that take additional parameters specifying the destination of the message, options flags, and so on. But the general-purpose OutputMessage() takes just a format string and a variable parameter list, just like sprintf().

OutputMessage() can send messages to wherever you want, and it is easy to get it to send a message to somewhere new. Simply define a callback function and register it with the PSD library as such, set a global PSD library option to always send messages there, and you're done. From then on, every time OutputMessage() is called, messages will be sent to your routine. You can define numerous destinations and have messages sent to all, one, or none of them. You can also call OutputMessageEx() to send a specific message to a specific location.
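The callback arrangement described above might look something like the following sketch. The article does not give the library's registration function or callback signature, so MessageSink, RegisterSink(), and the fixed 1024-byte buffer are assumptions made purely for illustration; only the sprintf-like calling convention of OutputMessage() comes from the text.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

namespace psd_sketch {

// A destination is any function that accepts a fully formatted message
// (assumed signature).
typedef void (*MessageSink)(const char* message);

inline std::vector<MessageSink>& Sinks() {
    static std::vector<MessageSink> sinks;
    return sinks;
}

// Hypothetical registration call; the real library's name may differ.
inline void RegisterSink(MessageSink sink) { Sinks().push_back(sink); }

// Formats like sprintf, then hands the result to every registered sink.
inline void OutputMessage(const char* format, ...) {
    char buffer[1024];
    va_list args;
    va_start(args, format);
    // vsnprintf truncates messages that do not fit in the buffer.
    std::vsnprintf(buffer, sizeof(buffer), format, args);
    va_end(args);
    for (std::size_t i = 0; i < Sinks().size(); ++i) {
        Sinks()[i](buffer);
    }
}

}  // namespace psd_sketch

// Example destination: collect messages in memory. A real destination
// might append to a file or forward to an existing logging system.
static std::vector<std::string> g_captured;
static void CaptureSink(const char* message) {
    g_captured.push_back(message);
}
```

With CaptureSink registered through RegisterSink(), a call such as OutputMessage("MyApp version %d.%d", 2, 1) formats the text once and delivers it to every registered destination.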

Conclusion

The PSD library was written in C++ using only standard-compliant features in its interface. It was originally intended for use on Windows platforms with the Microsoft Visual C++ 6.0 compiler, but there are no platform-specific features in the interface. The implementation does in many cases make use of Windows-specific functions and primitives, as you might expect. But on the whole, it should be easily adaptable to other platforms and compilers.

In production code, where the PSD library was used extensively, the time needed to diagnose, debug, and fix bugs was reduced drastically. There are two keys to reducing debug time:

Using verify to detect errors at runtime. It is possible to simply do a global search and replace in code that currently uses assert, changing all instances of assert to verify. Adding more calls to verify also helps. In the normal case, the expression being checked is true, and this common case executes fast: verify then consists of one function call, an if statement, and a return statement. Because of this, verify is appropriate to use in time-critical code.
Logging the state of the running program before problems occur. To debug faulty code, in addition to knowing the failed expression, it is important to know the version of the software, the values of internal variables and function parameters, whether pointers are valid, and so on. Using OutputMessage() adds this information to the log and helps reduce debug time.

DDJ