Dr. Dobb's | Postmortem Debugging

Postmortem Debugging

Postmortem debugging helps you isolate program problems by taking a snapshot of the machine at the time a crash occurs. As Matt points out, if you have the tools to support the technique, postmodern debugging is particularly useful in protected-mode Windows.

September 01, 1992
URL:http://www.drdobbs.com/tools/postmortem-debugging/184408832

SEP92: POSTMORTEM DEBUGGING

This article contains the following executables: CORONER.ARC

Matt works for a California programming-tools vendor, specializing in debuggers and file-format programming. He is coauthor of Undocumented Windows (Addison-Wesley, 1992) and can be reached at CIS 767117,1720.

Although relatively new to the world of PC programming, postmortem debugging has been around for a long time. On some systems, for example, the technique is known as "core dumping." Whatever name it goes by, the fundamental idea is that of simply taking a snapshot of the state of the machine at the time of the crash.

This article focuses on postmortem debugging under Windows. It's also possible to implement a postmortem debugger for DOS and OS/2. However, it is more difficult to do so because of the way protected mode works (or doesn't work) with these operating systems. For instance, because DOS applications run in real mode, you can't generate a general protection (GP) fault when you use a NULL pointer or access beyond the limits of a segment--these actions silently destroy the integrity of your DOS program. Consequently, the crash may not occur until much later, and the code that takes the snapshot might never be invoked. Windows programs, on the other hand, run in protected mode, so a NULL pointer reference or an access beyond a segment limit will cause a GP fault. The postmortem facility can be set to kick in at that time.

What some programmers view as a weakness of Windows actually becomes a strength when postmortem debugging enters the picture. For instance, Windows has a single address space for all programs (not including virtual DOS machines). Because of this shared address space, it's possible for one task to handle the exceptions of another, as well as access the memory in its address space. This makes it significantly easier to implement postmortem facilities under Windows than under OS/2.

Although OS/2 runs in protected mode, it's not possible to trap GP faults of another process. To allow one process to handle the exceptions of another would seriously weaken the stability and security that OS/2 offers. In OS/2 2.0, a process can register an exception handler and handle the exception itself. This "intrusive" approach, however, requires you to add extra code to the application.

What You Can Get from Postmortem Debugging

The fundamental requirement for successful postmortem analysis is to make sure there is sufficient debug information available to the analysis tools. If debug information is not available, postmortem analysis can be tedious at best and a pile of meaningless numbers at worst. (This is one reason why Microsoft ships debug information with the debugging version of Windows.)

The type of debug information you can expect from a postmortem debugger includes that for registers, stack traces, memory, as well as miscellaneous details about the system.

Registers. The most fundamental question you can ask about a crash is, "Where did it occur?". That question can be answered by looking at what the CS:IP was at the time of the exception. Of course, seeing a number like "012F:2521" isn't much help. By adding debug information and undertaking some mechanical processing, it's possible to convert that address into something more useful, like: "six bytes past the start of the PostMessage( ) function." Given more detailed information, it's possible to get an even better description, like: "Line 17, inside of FOO.C".

Other registers are often useful too. For instance, assume you know that the faulting instruction was MOV ES:[BX],1. Good postmortem information will tell you that the segment limit of the selector in the ES register is 01FFh. You can then look at BX and see that it contained the value 0200h. This tells you that the problem was due to accessing memory beyond the segment's limit.

Stack Trace. After "Where did the program crash?", the next question is usually "How did the program get to that point?" The answer is in the stack trace. Whenever you call a function, the return address is saved on the stack. By exploiting this knowledge, it's possible for a postmortem debugger to walk (or "crawl") each frame on the stack and obtain the return address. As described earlier, these return addresses are usually meaningless as raw hex numbers, but by applying debug information (and a little elbow grease), the numbers can be converted into meaningful information such as function names and line numbers.

Stack tracing is one of the reasons postmortem analysis is only as good as the debug information. Unfortunately, many a programmer will blindly start analysis without having produced debug information for the tools to work with. The result is a frustrated programmer who sees nothing but a bunch of meaningless numbers.

A similar situation can occur even with a moderate amount of debug information. The classic example is the programmer who tries to use as debug information the Windows SDK SYM files, which are devoid of addresses for the undocumented internal functions in Windows. If a stack trace went through one of these undocumented functions, the symbolic name shown for that frame will not be the real function--instead, it will be the closest documented function found earlier in the code segment.

Memory. By itself, a raw dump of the faulting program's memory probably isn't going to be much help. Still, it's possible to get useful information if the memory is shown in the proper context. For instance, a postmortem report could show some memory around each stack frame, along with the name of the function that it corresponds to. If you know the size and order that the passed parameters were passed in, you can determine their values. Doing so can be tedious, but with some coding and sophisticated debug information (like that used by Turbo Debugger for Windows and Codeview), it's possible to get the names, types, and values for all of the parameters and local variables in each stack frame. This same idea applies to global data in data segments.

System Information. Oftentimes, the cause of a crash is related to the state of the system at the time of the crash and is not reproducible under other circumstances. In these cases, it's important to have a clear picture of the state of the operating system as a whole. Under Windows, this includes a list of all the modules (programs and DLLs) currently loaded (including file dates/times), all the tasks in the system, the state of the memory manager, the version of Windows, the sequence of Windows messages that the faulting task processed, and so on.

A major cause of the "system-configuration dependent" bugs is incompatible versions of programs and DLLs that don't interact properly. For instance, a bug that manifests itself only when an old DLL is used with a newer copy of an application can be found by looking at the file dates of the EXE/DLL involved. A software quality assurance department can take this information and use it to filter out the known bugs, leaving the development team free to concentrate on new problems. Additionally, an analysis of these duplicate bugs can lead to a prioritized list of the most common problems for the developers to tackle first.

Postmortem Tools

In this section, I'll examine a few of the Windows postmortem tools available to programmers. This list is by no means comprehensive, but it does include the tools I have the most experience with.

Multiscope Debugger. At this writing, the Multiscope debugger is the most ambitious postmortem debugging tool I've run across. At the heart of the debugger is MED.EXE, a Windows program that's run when you start up Windows (or at least before you know a program is going to UAE). MED lies dormant until a UAE occurs. At that time, it writes out to disk a binary data file containing information such as register values, contents of the memory in use by all the running programs, and so on.

Analysis of the data is performed by a separate program Multiscope calls the "crash analyzer" (a variation of the Multiscope runtime debugger) that cannot run the program and has a restricted set of options. For instance, you can't set breakpoints because it's meaningless to set breakpoints in a program that can't be executed.

In all other respects, the crash analyzer looks and acts like the Multiscope runtime debugger. Register contents are viewed by opening a register window, the call stack is viewed by opening the call window, and so on. You can examine the values of local and global variables just as if you were running the runtime debugger at the time of the exception. The benefit is that you don't have to learn different programs for postmortem and runtime debugging.

The downside is that the postmortem file can get quite large. Additionally, since the postmortem information is kept in a binary file, it's difficult to post snippets of the information when asking for help via a BBS. Instead, the entire postmortem file must be oaded into the crash analyzer.

Dr. Watson. Dr. Watson comes with Windows 3.1 (but will work equally well in Windows 3.0). It was originally intended as a way for Microsoft to obtain information about where the common UAEs were occurring and who was responsible.

Dr. Watson, like MED.EXE, is run when you start up Windows. Later, if a GP fault occurs, Dr. Watson writes out a file (DRWATSON.LOG) to the disk. Unlike the postmortem file from the Multiscope MED.EXE, the DRWATSON.LOG file is ASCII text. Reports on subsequent faults are appended to the DRWATSON.LOG file, so previous GP fault information isn't lost.

The symbolic stack-trace capabilities of Dr. Watson are provided via SYM files. As the stack of the faulting application is walked, Dr. Watson looks for a SYM file for the appropriate module. If a SYM file is found, Dr. Watson reads it in, and the resulting entry in the DRWATSON.LOG file contains the name of function that the stack frame is for. The Windows SDK comes with SYM files for the core Windows DLLS. You can also make your own SYM files from Microsoft-style MAP files with the MAPSYM utility, which is also in the SDK.

Dr. Watson can hook into the "notification stream" and receive the parameter-validation notifications. By setting an option in the DRWATSON.INI file, you can tell it to treat parameter-validation errors as if they were UAEs and have Dr. Watson dump out a miniature version of a UAE report, including the all-important stack trace.

WinSpector. WinSpector comes with Borland C++ 3.1. Conceptually, it is a cross between Dr. Watson and the Multiscope debugger. While there isn't a full-fledged UI for postmortem analysis, you can obtain much of the information available in the Multiscope debugger.

Like the other two programs, WinSpector lies dormant until a GP fault occurs. When it springs into action, it writes out an ASCII text file (WINSPCTR.LOG). It also writes out a second file, WINSPCTR.BIN, that contains the contents of all the data segments in use by the faulting task. A post-processing utility (DFA.EXE) can then take the WINSPCTR.LOG and WINSPCTR.BIN file and combine them with the Turbo Debugger debug information to produce a text file that contains source-file and line-number information for each frame on the stack (where possible). Additionally, it outputs the names, types, and values of all the local and global variables. If you choose not to run DFA, you may still have symbolic stack traces , as WinSpector works with SYM files in a manner similar to Dr. Watson.

Along with WINSPCTR.EXE and DFA.EXE, there's also a trio of programs that allow you to make SYM files for your own programs. Additionally, it's possible to make SYM files for any Windows EXE/DLL that has exported functions. For instance, you can create your own SYM files for USER, KRNL286, KRNL386, and GDI that include all the exported functions, rather than just the documented ones. These improved SYM files are often the difference between a confusing stack trace and one that pinpoints the problem.

More Details.

Coroner. CORONER.EXE is a sample application from the TOOLHELP chapter in Undocumented Windows (Addison-Wesley, 1992). Although nowhere near as comprehensive as the previous programs, it does provide a demonstration of much of the TOOLHELP API. You can take the code and customize it, adding your own routines to provide whatever information you'd like to see in a postmortem report. While I won't discuss the program here, it is available electronically through Dr. Dobb's; see the book for a complete description of the program.

A Postmortem Example

Example 1 shows BAD.C, a program that intentionally causes a GP fault. Although it doesn't do anything useful, BAD.C does illustrate some of the concepts described above. For more general information about the types of problems postmortem debuggers can help you track down, see the textbox entitled, "Common Problems and How to Spot Them."

Example 1: BAD.C intentionally causes a GP fault.

  #include <windows .h>
  #include <string .h>
  #include <dos .h>

  int MeaningOfLife = 0x42; // A meaningless global,
                            // for DFA demonstration

  void Foo (void far *ptr)
  {

      _fstrlen(ptr); // This call will GP fault
  }

  static void Bar(void far *ptr)
  {

      void far *b;   // Local var for DFA demonstration

      b = ptr;

      Foo(b);
  }

  int PASCAL WinMain(
      HANDLE hInstance,
      HANDLE hPrevInstance,
      LPSTR lpszCmdLine,
      int nCmdShow

  )
  {

      Bar (MK_FP(1.0));     // pass bad pointer
      return 0;
}

In keeping with the rule that your analysis is only as good as your debug information, the first step in preparing this example for postmortem debugging is to provide as much debug information as possible. To that end, I compiled the program with Turbo Debugger debug information, and told the linker to generate a detailed MAP file. TMAPSYM was run on BAD.MAP to produce a BAD.SYM in the same directory as BAD.EXE. By doing this, I provided two sources of debug information (the Turbo Debugger information and the SYM file), neither of which is dependent on the other.

More Details.

Loading WinSpector and running BAD.EXE results in file WINSPCTR.LOG; see Example 2. (Portions of the file are not shown, for brevity's sake.) Near the top of the listing, you see that the module name of the task that UAEed was "BAD". A few lines down, note that the faulting instruction was "REPNE SCASB." SCASB is one of the instructions that implicitly references memory through the ES:DI register combination. A quick glance at the registers section of Example 2 shows that ES contains 0. That explains why the GP fault occurred. You may be thinking that the segment portion of the far pointer contained 1, not 0, but it turns out that the CPU automatically converts values between 0 and 3 into a 0 when loading into DS or ES.

Example 2: The file WINSPCTR.LOG generated by loading WinSpector and running BAD.EXE.

  WinSpector failure report - 5/17/1992  08:59:57
  Exception 13 at BAD 0001:082D (1117:082D)  (TASK=BAD)
  Disassembly:
  1117:082D  REPNE   SCASB
  1117:082F  XCHG    AX, CX
  1117:0830  NOT     AX
  1117:0832  DEC     AX
  1117:0833  POP     DI
  Stack Trace:
  0  BAD       __fstrlen + 0017
     CS:IP 0001:082D (1117:082D)  SS:BP 10DF:164A
     C:\ART4\BAD.EXE
  1  BAD       _Foo + 000E
     CS:IP 0001:017E (1117:017E)   SS:BP 10DF:1654
     C:\ART4\BAD.EXE
  2  BAD       _Foo + 002E
     CS:IP 0001:019E (1117:019E)   SS:BP 10DF:1660
     C:\ART4\BAD.EXE
  3  BAD       WINMAIN + 000A
     CS:IP 0001:01AD (1117:01AD)  SS:BP 10DF:1668
     C:\ART4\BAD.EXE
  4  BAD       <no info>
     CS:IP 0001:00B3 (1117:00B3)  SS:BP 10DF:1676
     C:\ART4\BAD.EXE
  Registers:
  AX  0000
  BX  0238
  CX  FFFF
  DX  0000
  SI  0232
  DI  0000
  SP  1648
  BP  164A
  IP  082D
  FL  0297
  CS  1117    Limit: 089F  execute/read
  DS  10DF    Limit: 267F  read/write
  ES  0000    Limit: 0000  NULL
  SS  10DF    Limit: 267F  read/write

In the stack-trace portion of Example 2, each stack frame contains information about a particular function call that was executed to get to the UAE. Stack frame 0 indicates the exact CS:IP at the time of the exception. Each subsequent frame is one function call removed from the GP faulting function. If the name of the function can be determined for a particular frame, it appears at the end of the first line in each frame entry.

The CS:IP for each stack frame can be seen by examining the second line of each stack frame. It's important to note that the CS:IP is given in terms of a logical address, as well as the physical address. For more information, see the text box entitled, "Logical and Physical Addresses."

Some programmers find it easier to read stack traces in reverse order, starting at the bottom and working to the top. Here, starting from entry 4 and working up, you read the stack trace as: "Some unknown function called WinMain(). WinMain() in turn called Foo(), which called Foo(). Foo() called _fstrlen(), which is where the GP fault occurred."

But wait a minute! WinMain() doesn't call Foo(). We can clearly see from BAD.C that WinMain() calls Bar(). Bar() then calls Foo(). What's going on here?

Notice in BAD.C that Bar() is declared as a static function. This makes the function nonpublic; hence, Bar() does not appear in the BAD.MAP file or the BAD.SYM file. Because Foo() was the closest preceding symbol that did appear in the MAP/SYM files, it is displayed as the function name in stack frame 2.

This illustrates why postmortem analysis is only as good as your debug information. This same problem occurs in stack traces that thread through the Windows DLLs. The SYM files that Microsoft provides with the SDK contain only symbol information for the documented APIs. Thus, you can see function names in a stack trace that you know your program isn't calling. The key to determining the stack trace's trustworthiness involves using the offset that appears after the function name in each stack trace. For instance: "POSTMESSAGE+002E" means that the CS:IP for the frame was 2Eh bytes past the start of the POSTMESSAGE function. If you look closely at the two frames that claim to be Foo(), you'll see that the offset for frame 1 is only 000Eh past Foo(), whereas for frame 2 it is 002Eh. The point is that the larger the offset, the less likely you are to actually be in the named function. You have to judge how truthful the stack trace is.

A more reliable way of determining which function your code is really in involves using debug information more sophisticated than that provided by SYM files. For instance DFA.OUT, the more accurate report shown in Example 3, was produced by running DFA WINSPCTR.LOG WINSPCTR.BIN. This report was generated by crunching the information obtained from the two WINSPCTR files and adding in the Turbo Debugger information in the BAD.EXE file. You can see that the stack trace correctly shows WinMain() calling Bar(), which then calls Foo(). Since the Turbo Debugger information contains more information than just the public symbols, it's much more accurate. Additionally, frames 1, 2, and 3 also contain a source file and a line number. In those lines in BAD.C, you'll see that the report is correct in terms of where execution was inside each function.

Example 3: The file DFA.OUT, a report generated by crunching the information obtained from two WINSPCTR files and adding in the Turbo Debugger information in the BAD.EXE file.

  0  BAD       __fstrlen  +0017
     CS:IP 0001:082D  (1117:082D)  SS:BP  10DF:164A
     C:\ART4\BAD.EXE

  1  BAD       _Foo  +000E
     CS:IP 0001:017E  (1117:017E)  SS:BP  10DF:1654
     C:\ART4\BAD.EXE
     BAD.C line: 10

  SS:1658 ptr
  void far * 0001:0000

  2  BAD       Bar +001B
     CS:IP 0001:019E  (1117:019E)  SS:BP  10DF:1660
     C:\ART4\BAD.EXE
     BAD.C line: 19

  SS:165C b
  void far *  0001:0000

  SS:1664 ptr
  void far *  0001:0000

  3  BAD       WINMAIN +000A
     CS:IP 0001:01AD  (1117:01AD)  SS:BP  10DF:1668
     C:\ART4\BAD.EXE
     BAD.C line: 29

  SS:1674 hInstance
  unsigned int  10DE

  SS:1672 hPrevInstance
  unsigned int  0000

  SS:166E lpszCmdLine
  char far *    10D7:0080

  SS:166C nCmdShow
  int 0001

  4  BAD        <no info>
     CS:IP 0001:00B3  (1117:00B3)  SS:BP  10DF:1676
     C:\ART4\BAD.EXE

  Module:   BAD
  Filename: C:\ART4\BAD.EXE

  Segments:
    Segment 01  Selector: 1117  Length 08A0  CODE
    Segment 02  Selector: 10DF  Length 2680  DATA

  Data Dumps:

  Segment: 02  Selector: 10DF  Length 2680  Offset: 0176

  0002:0056 _MeaningOfLife
  int  0042

As well as providing more accurate information about where the program was executing, the addition of more complete debug information allows DFA to show the names, types, and values of each parameter and local variable in every frame that has debug information. Since frames 0 and 4 are in the Borland C++ runtime library, there is no source file or variable information for those frames.

At the end of the file, you can see that the global variable MeaningOfLife contains 0x42, which is what we initialized it to in BAD.C. While there are many other public global variables in the runtime library, DFA does not display them because there's no type information.

Conclusion

As with runtime debuggers like Turbo Debugger for Windows, Codeview for Windows, or Multiscope, postmortem debugging does not free you from having to think. It's still critical that you understand how your program interacts with the operating system and other programs. Although you may end up having to spend time with a runtime debugger, going through postmortem analysis can dramatically narrow the problem to a manageable scope.

Common Problems and How to Spot Them

In looking at many postmortem files, certain problems show up over and over again. Here's a short list of the most common UAEs and how they'll show up in the postmortem analysis.

NULL Pointers. The code generators of most PC compilers use the ES register to access memory when a far pointer is used. Typically, the ES register is set up via the "LES" instruction.

For instance, the code LES BX,[BP+6] loads ES:BX with the far pointer passed to the function as a parameter. If a NULL far pointer was inadvertently passed, then [BP+06] contains 0. Unfortunately, a GP fault does not occur upon execution of this instruction. It's perfectly legal to have a value of 0 in a segment register. You cannot use the segment register to access memory, however. Thus, the GP fault doesn't occur on the example instruction, but on a subsequent instruction. The key to finding this bug is to look for a value of 0 in the ES register. You can look at the disassembly to see why ES was loaded with a 0 value. In Windows 3.1, the parameter-validation mechanism will check for NULL pointers in many cases, and attempt to prevent the application from UAEing.

Invalid Pointers. This is really a variation on the previous bug. Mercifully though, if you try to load a segment register with a value that's not a legal LDT/GDT selector, the GP fault will occur on that instruction, rather than a subsequent instruction.

A subtle variation on this bug occurs when the value you attempt to load looks perfectly normal, and in fact has been successfully loaded before. The prime example of this is inside a WEP routine, when Windows has already discarded the data segment of the DLL. I spent several hours tracking this problem down once, so I now am always on the lookout for this situation.

Accessing Memory Beyond the Segment Limit. This bug comes in two different incarnations. The first version typically manifests itself to beginning Windows programmers, who have dutifully called MakeProcInstance(), but forgotten to export their window procedure. When this happens, the DS register is not set up to point at the data segment of the application. The faulting instruction will look something like: MOV AX, [7452].

In this case, the use of the DS register is implicit. By checking the limit of the segment pointed to by the DS register, you can determine whether offset 7452h is within the limits of the segment. A dead giveaway of this bug is when the DS register contains a different value than the hInstance of the program. This is almost always the result of forgetting to export the window procedure properly.

The other common variation of the "segment overrun" bug is caused by accessing memory beyond the allocated size of an array. Once again, the key is to look at the instruction and figure out at what offset in the segment the memory access will occur. You can then check the segment limit to see if the intended access really was past the limit of the segment.

Also, don't forget that string instructions such as LODSB, MOVSB, and the like reference memory implicitly, via the SI and DI registers. If the faulting instruction was MOVSB, for instance, you'll have to look at the DS and ES limits, as well as the SI and DI registers, to figure out exactly what went wrong.

--M.P.

Logical and Physical Addresses

Windows EXE and DLL files are called New EXE files because they have a different format then traditional MS-DOS files. When New EXE files are linked, each segment is placed in a different section of the file. (The linker can combine multiple different segments in the .OBJ file into 1 segment in the EXE file. This is what the "pack" option in the linker refers to.) To quickly find each of these segments, a segment table is created in the EXE file that allows Windows and other programs to locate the file offset of a given segment's code/data.

When referring to a particular segment, we use its order in the segment table. Thus, the first segment in the table is logical segment 1, the second is logical segment 2, and so on. You can determine the number of segments, their size, and other information by running Borland's TDUMP or Microsoft's EXEHDR. Alternatively, the MAP file for your program contains this same information. The addresses given in a Windows MAP file are examples of "logical addresses."

When a EXE or DLL is loaded, Windows allocates a selector for each segment in the module. When the segment is needed, the base address of the allocated selector is pointed at a free block of memory, and the data from the segment on disk is read in. This process happens at module loadtime for PRELOAD segments. LOADONCALL segments have selectors allocated for them, but memory is not actually "committed" until the selector is loaded into a segment register, causing a "segment not present" fault.

For any given EXE or DLL file, the logical segment that a function or variable is in will never change. The selector value, on the other hand, is whatever selector was allocated to store the particular logical segment. In other words, the selector Windows uses for a particular logical segment can change between different invocations of the EXE or DLL.

A logical address is comprised of a module name, a logical segment, and an offset. For instance "USER 0001:65EA" means offset 65EA in the first segment of USER.EXE. In a WINSPCTR.LOG file, where code addresses are given, the logical address appears first.

A physical address is comprised of a selector and an offset. The selector value is what the CPU loads into a segment register to access an "instance" of the logical segment that's been loaded into memory. Thus, a typical physical address that Windows might use would be 09CF:65EA. If a logical address is given in a WINSPCTR.LOG file, its corresponding physical address appears afterwards in parentheses.

The "mapping" from a logical segment to its allocated selector (and hence, its physical address) can be found in the segment table inside a module table. This segment table is a variation of the segment table found in the New EXE file. The information to go from a selector to its logical segment is stored in the "Arena header" of the global-memory memory block. The details of these mappings are discussed in Undocumented Windows.

Every logical address has a corresponding physical address, but every physical address does not necessarily have a logical address. Blocks allocated by programs from the global heap do not. To have a logical address, the segment has to have originated in the segment table of the New EXE file.

--M.P.