Bil has written three books on multithreaded programming along with the GNU Emacs Lisp manual and numerous articles. He is currently a professor at Tufts University and can be contacted at firstname.lastname@example.org.
The term "omniscient debugging" describes the concept that debuggers should know everything about the run of a program, that they should remember every state change, and be able to present to you the value of any variable at any point in time. Essentially, omniscient debugging means that you can go backwards in time.
Omniscient debugging eliminates the worst problems with breakpoint debuggersno "guessing" where to put breakpoints, no "extra steps" to debugging, no "Whoops, I went too far," no nondeterministic problems. And when the bug is found, there is no doubt that it is indeed the bug. Once the bug occurs, you will never have to run the program a second time.
Breakpoint debugging is deduction based. You place breakpoints according to a heuristic analysis of the code, then look at the state of the program each time it breaks. With omniscient debugging, there is no deduction. You simply follow a trail of "bad" values back to their source.
In 1969, Bob Balzer implemented a version of omniscient debugging for Fortran that ran on a mainframe and had a TTY interface (see "EXDAMSExtendible Debugging and Monitoring System," ACM Spring Joint Computer Conference, 1969). Since then, there have been a half dozen other related projects that were also short-lived. Zstep (Lieberman) was a Lisp debugger that came closest to commercial production and shared a similar GUI philosophy (see "Debugging and the Experience of Immediacy," by H. Lieberman, Communications of the ACM, 4, 1997). With the advent of Java (and its beautiful, simple, bytecode machine!), there have been at least five related projects, including two commercial productsCodeGuide from OmniCore (http://www.omnicore.com/) and RetroVue from VisiComp (http://www.visicomp.com/).
The ODB I present here implements this concept in Java. The ODB is GPL'd and available (with source code) at http://www.LambdaCS.com/. It comes as a single jar file that includes a manual, some aliases (or .bat files), and three demo programs. The current implementation is fairly stable, though not perfect. It is pure Java and has been tested (more or less) on Solaris, Mac OS, and Windows 2000. I am working on integration with Eclipse and IntelliJ. You are welcome to try it out and encouraged to let me know what you think.
The basic implementation of the ODB is straightforward. In Java, it is sufficient to add a method call before every putfield bytecode that records the variable being changed, its new value, the current thread, and the line of code where it occurred. (The ODB records more than this, but this is the fundamental idea.)
With instrumentation taken care of, you can turn your attention to the real issues: How this information is going to be displayed and how you are going to navigate through time to find important events. All of this should help answer the question: "Does omniscient debugging work?"
Display and Navigation
The ODB GUI is based on a few fundamental principles:
- It displays all of the most relevant state information simultaneously. (You should not have to toggle between popups.)
- It provides easy-to-read, unambiguous print strings for objects. (<Thing_14> is a reasonable choice, corp.Thing@0a12ae0c isn't.)
- It makes the most frequent operations simple. (You shouldn't have to do text selections, use menus, and the like.)
- It doesn't surprise you. (Displays are consistent and you can always go back to where your were.)
The ODB records events in a strict order and assigns them timestamps. Every variable has a known value (possibly "not-yet-set") at every timestamp and everything in the debugger window is always updated to show the proper data every time you revert the debugger to a different time. There is never any ambiguity.
You mainly want to know what the program state available to a chosen stack frame islocal variables, instance variables of the this object, and perhaps some static variables and details about other selected objects. You also need to know which line of code in which stack frame you are looking at. This is what the ODB displays. Such things as the build path, variable types, inheritance structures, and the like, are rarely of interest during debugging and shouldn't take up valuable screen space.
For print strings, the ODB uses the format <Thing_34 Cat>, where the type (without the package) is shown with an instance counter and an optional, programmer-chosen instance variable. An important detail is that the print string must be immutableit has to be the same at time 0 as it is at time 10,000. Print strings have a maximum width of 20 characters, so that wrap-around isn't an issue. This format also lends itself well to method traces:
<Person_3 John>.groom(<Pet_44 Dog>, 44.95) -> true
In Figure 1, method calls are shown per-thread and indented, making clear the calling structure of the program. Traces are wonderful for orienting you to where/when in the program you are.
The primary operations I usually want are "show the previous/next value of a variable," and sometimes the first/last value. This is what the arrow buttons do for the selected line (for instance, next context switch, next variable value, next line of code, next line of output, and so on). Threads, traces, output lines, and lines of code may be selected directly, in which case the current time reverts to when that line was executed/printed. On occasion, it is useful to see the entire set of values a variable ever had (Figure 2). Being a rarer operation, it is on a menu.
At the bottom of the window is a minibuffer where messages are printed and where typed interactions occur. When you change the current time, the minibuffer displays the type of the newly selected timestamp (for example, the local assignment as in Figure 2) and how far it is from the previously selected time. Incremental text searches through the trace pane entries (based on EMACS's I-search) are run here, as are expression evaluation and more elaborate event searches.
In addition to the variables in the current stack frame, it is common to keep an eye on specific objects. Any object displayed in any pane may be copied to the Objects Pane where it remains as long as desired. As with all variables, the contents of those objects are updated as you revert the ODB to different times. Variables whose values have changed between the previous and currently selected time are marked with a "*" (sum in Figure 1). Those not yet set are displayed with a value of "-" (i). Similarly, thread <Sorter_4> has not yet executed its first instruction.
More Elaborate Interactions and Displays
There are times when more elaborate displays or interactions are useful. For example, when there are thousands (or millions!) of traces, it is useful to filter out uninteresting ones. You can filter the trace pane to show only those method calls in a particular package, or those not in that package, or only those at a depth less than a limit.
You can revert the debugger to any time desired, then evaluate a method call using the then current values of the recorded objects. These methods will be executed and their traces will be collected in a different timeline, which won't affect the original data. In the secondary timeline, it is possible to change the value of instance variables before executing a method. (The primary timeline is inviolate.) The input format is identical to the trace format and even does name completion.
There are some cases where a different print format is more intuitive for you. Thus, a blocked thread acquires a pseudovariable to show the object it's waiting on (see _blockedOn in Figure 1), as will the objects whose locks are used. Notice that this information is duplicated in the threads pane.
The collection classes constitute another special case. In reality, a Vector comprises an array of type Object, an internal index, and some additional bookkeeping variables that are not of any interest to anyone other than the implementer. So the ODB displays collections in formats more consistent with the way they are used.
In addition to the simple text search, there is an event search command that lets you look for specific assignments, calls, and the like (based on the work of M. Ducassi; see "Coca: An Automated Debugger for C" in the Proceedings of the 21st International Conference on Software Engineering, 1999). A good example of its use is the quick sort demo that is part of the ODB distribution. The program sorts everything perfectly, except for elements 0 and 1. To search for this problem, you can issue an event search to find all calls to sort() that include elements 0 and 1.
port = call & callMethodName = "sort" & arg0 = 0 & arg1 >= 1
There are four matches, and sort(0, 2) is the obvious call to look at. Stepping through that method, it's easy to find the bug. Event searching has a long and honorable history of providing a "super intelligent" breakpoint facility. I find it a nice surprise that it's so useful here, too.
You can write ad hoc display methods for specific objects. For example, when debugging the ODB itself, I have a method that converts the bit fields of the timestamps (which are integers) into readable strings. Instead of seeing "5235365," I see "Thread-1 | a=b | Demo.java:44."
The Snake In the Grass
If you see a snake in the grass and you pull its tail, sooner or later you get to its head.
Under the ODB, a great number of bugs exhibit this kind of behavior. (By contrast, breakpoint debuggers suffer from the "lizard in the grass" problem. Even when you can see the lizard and grab its tail, the lizard breaks off its tail and gets away.) If a program prints something wrong ("The answer is 41" instead of 42), then you have a handle on the bug (you can grab the tail).
"Pulling the tail" consists of finding the improper value at each point, and following that value back to its source. It is not unusual to start the debugger, select the faulty output, and navigate to its source in 60 seconds. And you can do this with perfect confidence for bugs that take hours with conventional debuggers.
A good example of this is when I downloaded the Justice Java class verifier (http://jakarta.apache.org/bcel/). I found that it disallowed a legal instruction in a place where I needed it. Starting with the output "Illegal register store," it was possible to navigate through a dozen levels to the code that disallowed my instruction. I changed that code, recompiled, and confirmed success. This took 15 minutes (most of which was spent confirming the lack of side effects) in a complex code base of 100 files I had never seen before.
A happy side benefit of the ODB is how easy it makes it to see what a program is doing. I teach a class on multithreaded programming and I use the ODB regularly to explain what a program is doing. Students get to see which threads ran when and if there's any confusion, they can just back up and look again. Some problems that used to be difficult suddenly become trivial.
The ODB keeps a single array of timestamps. Each timestamp is a 32-bit int containing a thread index (8 bits), a source-line index (20 bits), and a type index (4 bits). An event for changing an instance variable value requires three words: a timestamp, the variable being changed, and the new value. An event for a method call requires a timestamp and a TraceLine object containing: the object, method name, arguments, and return value (or exception), along with housekeeping variables. This adds up to about 20 words. A matching ReturnLine is generated upon return, costing another 10 words.
Every variable has a HistoryList object associated with it that is just a list of timestamp/value pairs. When the ODB wants to know what the value of a variable was at time 102, it just grabs the value at the closest previous time. The HistoryList for local variables and arguments hang off the TraceLine; those for instance variables hang off a "shadow" object that resides in a hashtable.
Every time an event occurs, the recording method locks the entire debugger, a new timestamp is generated and inserted into the list, and associated structures are built. Return values and exceptions are back-patched into the appropriate TraceLine as they are generated.
Code insertion is simple. The source bytecode is scanned, and instrumentation code is inserted before every assignment and around every method call. A typical insertion looks like this:
289 aload 4 // new value
291 astore_1 // Local var X
292 ldc_w #404 <String"Micro4:Micro4.java:64">
295 ldc_w #416 <String "X">
298 aload_1 // X
299 aload_2 // parent TraceLine
300 invokestatic #14 <change(String, String, Object, Trace)>
where the original code was lines 289 and 291, assigning a value to the local variable X. The instrumentation creates an event that records that on source line 64 of Micro.java (line 292), the local variable X (line 295), whose HistoryList can be found on the TraceLine in register 2 (line 299), was assigned the value in register 1 (line 298). The other kinds of events are similar.
The most common remarks I hear when introducing omniscient debugging are that it takes too much memory and it is much too slow. These are legitimate concerns, but miss the main pointdoes it work? If it is effective, there are all sorts of clever techniques for reducing the cost of the monitoring. If it isn't, who cares?
In experience with the ODB, neither CPU overhead nor memory requirements have been a stumbling block. On a 700-MHz iBook, it takes the ODB 1ms to record an assignment. I have recorded 100 million events in a 31-bit address space. A tight loop may run 300 times slower in a naïve ODB, but such a loop can also be optimized to do no recording at all (it's recomputable). Ant recompiling itself runs 7 times slower and generates about 10 million events. (We're recording Ant, not the compiler!)
There are a number of possible optimizations:
- You can start/stop recording either by hand or by writing an event pattern. Most button-push bugs are amenable to this. (Let the program start up with recording off, turn it on, push the button, turn it off, find the bug.)
- The ODB has a garbage collector that throws out old events as memory gets filled up. Of course, these are not really garbage, but it does make it possible to maintain a sliding window of events around the bug. (I find this is a grand idea, but only seldomly useful.)
- It is quite normal for a large percentage of a program to consist of well-known, "safe" code. (These are either recomputable methods, or methods we just don't want to see the interiors of.) By requesting that these methods not be instrumented, a great number of uninteresting events can be eliminated. The ODB lets you select arbitrary sets of methods, classes, or packages, which can then be instrumented or not. A good optimizer can do this automatically.
My estimate is that a well-optimized omniscient debugger would exhibit an absolute worst-case of 10 times slow-down, with 2 times being far more common. True, there are programs for which this is still too long, but not many. I've yet to encountered a bug that I couldn't find with the current ODB.
With the advent of omniscient debugging, it is becoming easier to find bugs. Bugs that were completely intractable with traditional techniques have become child's play. As optimization techniques improve and omniscient debuggers come to handle larger and larger problems, bugs will find it harder and harder to hide.