Glen Matthews is manager of R&D at Hummingbird Connectivity, a division of Open Text Corp. He can be contacted at Glen.Matthews@hummingbird.com/
Memory dumps have always been a tried and true method of debugging errant programs. The image of a programmer toiling over a stack of paper to track down that elusive bug is very real to me -- I remember doing it. I also remember entering the wired age when a systems programmer from Ireland e-mailed his dump to us for our review -- and consequently bogged down Bitnet (or NetNorth in Canada, a store-and-forward e-mail network dating circa 1981). Still, it was faster than sending a tape!
Nowadays, memory dumps can be taken in a flash, and can easily be transmitted across the Internet. They are a useful tool, in a variety of settings, particularly if we regard the debugging process as extending past RTM to the support lifecycle. If we can add memory dumps to the set of tools available for debugging -- such as the debugger, simple output of information through instrumentation, observing the program behavior -- then we become more effective and make life easier for ourselves.
Microsoft's Debug Diagnostics 1.1 ("DebugDiag" for short) is designed to facilitate the generation and analysis of memory dumps. There are three components to DebugDiag:
DebugDiag's UI is designed to assist in handling dumps: creation of rules defining when and how to take a dump, and analyzing the resulting crash dumps. It is also extensible via scripting. In fact, the built-in analytics are scripts. They provide an easy way of analyzing dumps -- either "crash/hang" dumps, or "memory pressure" snapshots of a process. Moreover, DebugDiag is tailored for use with IIS, and designed to be able to take multiple dumps to reflect changing conditions (memory, for instance) or failing threads in IIS. Finally, in the case of IIS, DebugDiag has the ability to bundle together a number of data sources for debugging (such as the event logs, the IIS Metabase, the W3SVC logs, and the HTTP error logs) into a cab file.
Figure 1 is DebugDiag's interface. The first tab lets you create a crash rule that determines when and how a crash dump is taken. In the second tab, you can analyze a dump. There is also a default wizard that appears at startup to guide you through the dump rule creation process. The third tab displays the currently running processes, and lets you take manual dump of a particular process.
DebugDiag can simply be pointed to a crash dump and told to analyze it. Alternatively, you can prepare for a crash by creating "rules" which trigger a dump by using the wizard to walk you through the rule creation process (either a crash/hang rule, or a memory and handle leak rule).
In Figure 2, you see the four panels of the wizard when setting up an crash rule. Here I'm setting up a rule for demodump, a demonstration program that exercises some of DebugDiag's features.
The wizard lets you choose the type of process (for instance, all IIS/COM+ processes, or a specific generic process, MTS/COM+ process, IIS web application pool, or NT service). This choice leads to a panel displaying either executing processes or applications of the type you selected. If you select the MTS/COM+ application radio button, you are shown a list of packages. If you choose the NT service radio button, a list of services is displayed. Finally, if you select a specific IIS web application pool, you are shown a list of pools from which to select.
The fourth panel in Figure 2 is the Advanced Configuration panel, where you specify the type of dump action that you want (a log trace, or minidump, or full userdump, or even custom action). Limits and further dump configuration can be done here. Also, PageHeap can be configured for the process using a simple dialog (rather than using the Microsoft gflags tool). PageHeap is a tool that can be used to diagnose heap corruption problems, which can be really nasty! For more information, see How to use Pageheap.exe in Windows XP, Windows 2000, and Windows Server 2003.
Memory Pressure Analysis
The configuration behind a memory or handle leak rule is less obvious. Tools such as Process Explorer from Windows Sysinternals show the changes of state in a process. However, to completely analyze memory utilization you need a dump. Moreover, when tracking down a problem, multiple dumps may be necessary. Figure 3 illustrates the configuration of a leak rule. DebugDiag injects LeakTrack.dll into the process under study. (You can also inject this dll manually into a process.) The leak rule lets you specify points at which a dump will be taken, as well as a final dump.
demodump: Illustrating DebugDiag
To illustrate some of these scenarios, I've put together a simple program called demodump that attempts to replicate certain crash situations. This application (demodump) takes one of four parameters: stack, address, heap, and hang.
- address. Two threads are started. In each thread, one of two routines (Recur1(), Recur2() is called, based on a pseudo-random number that is generated for each iteration. Depending on a threshold constant, a function that will generate an intentional addressing exception will be randomly called.
- stack. The same two threads are started, but no intentional addressing exception is generated. The calls are continued until we get a stack overflow, thus generating an exception.
- heap. As above, two threads are started. Instead of generating an exception, the application makes memory allocations and then pauses. The intention here is to simulate a leak, but not necessarily to cause a crash.
- hang. This scenario generates a deadlock using three functions, three critical sections, and three threads.
Symbols and a Symbol Store
To take full advantage of the dump analysis in DebugDiag, you need make program symbols available. This can be either the PDB file associated with the binary in question, or a symbol store derived from this. When many versions of a binary have been released to customers, the latter is much more convenient.
Microsoft maintains a public symbol server which is available over the Internet. The path is:
It's fairly easy to create your own symbol store using the symstore utility (available in the Debugging Tools for Windows download from Microsoft). The following command line creates a symbol store for our example application:
symstore add /f debug /s store /t demodump
In this example, add indicates that files are to be added to this symbol store. The files to be added are specified by the /f switch; here they are the files in the debug output directory for demodump. /s points to the symbol store itself. Once the symbol store has been built, either of the two following symbol paths can be specified to DebugDiag, and it will resolve the symbols for demodump.
Example of a Crash
Figure 4 shows the demodump application in the process of being dumped. At the point when the screenshot was taken, Windows was dumping the process to disk.[Click image to view at full size]
To analyze this dump, select the crash dump file by browsing for it. Since this is a crash, choose the Crash/Hang Analyzers and click on the Start Analysis button. Once the analysis is complete, DebugDiag presents the analysis in a browser in an MHTML (Mime HTML) page, which can be saved as an MHT file and viewed later. Figure 5 shows this resulting crash dump analysis.[Click image to view at full size]
At the top of the dump report (Figure 5), there is a summary of the dump analysis. In a real world issue, this can point to the crash problem and potentially (if in a third-party module) the vendor.
In the next section of the dump, there is a table of contents, in case multiple dumps are analyzed simultaneously. This can be very useful for comparing related dumps. Following this, at the beginning of the dump, we see some general information about the process that crashed.
After this, we find the meat of the crash -- the call stack. Each line shows the calling function and offset where the call is being made, as well as the source file and line number. Thus we can often pin down the exception to the actual statement that failed.
demodump can produce a stack overflow dump, but this is essentially the same as the addressing exception shown above. Consequently I won't describe that here, although a sample dump report that contains four output MHTML pages, each showing the output of debugdiag for a particular exception condition is included with the source code for a demonstration Visual Studio project for generating exception conditions -- bad pointer, too much memory used, and so on. The resulting dumps can be analyzed with the Microsoft tool.
Deadlock or Hang
It's normal for programs to be waiting. A process contains one or more threads that are executing, and these threads may need to synchronize with something else -- perhaps each other. If synchronization is incorrect, threads may block waiting for each other, and thus deadlock. Other threads may continue running, but the blocked threads consume resources and are a problem. If no thread can run within a process, the process is said to by hung.
In this case, Windows won't automatically dump your process. After all, if your process is simply waiting for something, it would be rather rude to terminate it without asking for your permission. In DebugDiag, you can manually choose which type of dump, if any, to take (Figure 6). In the demodump application, it is not really dead-locked -- you time-out after a waiting for a bit, so that the application will terminate normally.[Click image to view at full size]
The analysis carried out by the hang analysis script not only discovers that there is a deadlock in the application, but also pinpoints the threads involved in the hang -- Thread 2 is deadlocked with Thread 1 (Figure 7). Other advice is given in the Recommendation cell, in case this does not enable you to resolve the problem.[Click image to view at full size]
Other elements of the hang report show the Top 5 Threads (by CPU time), and the call stacks of the deadlocked threads (see Figure 8). There is also a report of the critical sections involved. Because of the symbols available, the analyzer is able to identify g_cs1 and g_cs2 as the two critical sections involved. We can see that Thread 2 (thread id 5780) owns g_cs2 while Thread 1 (thread id 800) owns g_cs1.[Click image to view at full size]
Example of Heap Analysis (Memory Leaks)
A memory or handle leak can be a difficult thing to track down. DebugDiag can make this simpler. In tracking a leak, DebugDiag injects the LeakTrack dll into the monitored process. Tools such as Process Explorer can show this (or you can see it in the dump report). The LeakTrack dll saves things such as sample call stacks that are later shown in the dump report.
In your memory rule, you can specify that dumps should be taken periodically, with a final dump at the end of the monitoring period. DebugDiag forces you to monitor the process under study for at least 15 minutes before taking the final dump. However, you can force DebugDiag to start recording sample call stacks immediately -- and then take a manual dump.
In the analysis summary (Figure 9), you can see that the analyzer has identified the two functions that are doing the memory allocations -- Recur1() and Recur2(), and has reported the allocations attributed to each. Next, the memory manager for these allocations is reported (ntdll.dll) and you are directed to the callstack samples to see where the allocations occur.[Click image to view at full size]
Since this program ran for a very short time, the report of high memory fragmentation should be taken with a grain of salt. Generally, there is a settling period for programs, as shown by the recommendation that you monitor a process for at least 15 minutes to get reliable information.
Figure 10 shows the virtual memory summary and heap analysis. In this particular example, this is not too meaningful, but you can see the analytical results available. The Leak Analysis (Figure 11) and the call stacks (Figure 12) are more interesting. demodump has 26 outstanding allocations, and we can see sample call stacks for these allocations. In this dump, there are 10 call stacks (only three are shown). This can be invaluable in tracking down leaks, because it shows the program logic in terms of the functions called when potential leaks occur. Note as well in the details for Recur2() that there is a statistic emitted called the Leak Probability -- in this case, it was 78%.[Click image to view at full size]
Memory allocations can fall into one of three categories:
- Cached memory that will be held for a very long time.
- Short-term allocations that are freed quickly.
Clearly you need to distinguish between the first and the last. The only reliable way to determine if a leak exists is to watch the memory behavior of the program over a long period of time. The leak probability is calculated according to a formula that is based on the observed allocation patterns of your program over a period of time -- it's only a suggestion. Often knowing that there is a leak, and seeing the sample call stacks, is enough to track it down and fix it.[Click image to view at full size][Click image to view at full size]
DebugDiag is simple to use, and provides a wealth of information. I don't think it's a silver bullet for every issue (it's perhaps not as robust as you'd want it to be), but if you aren't aware of its capabilities, then you are missing out on a very useful tool. The best way to learn about it is to try it, and with the simple demonstration application demodump you can investigate how this tool works.