Design

Crash Tracebacks in Unix

By Alan Dunham, September 01, 1992

This UNIX-based VAX/VMS-like crash traceback utility reports a list of subroutines being executed when the crash occurred, along with line numbers, parameter values, and local symbol values.

SEP92: CRASH TRACEBACKS IN UNIX

This article contains the following executables: TRACE.ARC

Alan is manager of graphics development at Landmark/ITA, a Calgary, Alberta-based division of Landmark Graphics. He can be reached at adunham@ita.lgc.com.

Some time ago, when I developed code on VAX/VMS systems, programs that crashed would give a stack traceback, a list of subroutines that told me what part of the program was currently being executed (see Figure 1). Subroutines compiled with debug information had a line number associated with them. This traceback often enabled me to find the cause of the crash without using a debugger.

Figure 1: List of subroutines identifying parts of a program being executed (VAX/VMS).

      %FOR-E-OUTCONERR, output conversion error
        unit -5 file
        user PC 00AC9E6F
      %TRACE-E-TRACEBACK, symbolic stack dump follows

  module name   routine name  line    rel PC    abs PC
  SIO_OPEN      SIO_OPEN       90   0000018A  00AC023E
  RFE_INFO      RFE_INFO       87         11  00AB7AE1
  REFED         REFED         210         6F  00AB766F

When we ported our code from VAX to UNIX, program crashes no longer gave a traceback. Instead, we got informative error messages like "Segmentation Violation." Since our Fortran programs were huge and didn't have dynamic memory allocation, we had to disable the generation of the resulting huge core files. To make matters worse, we had a couple of numerical programmers who had never used a debugger! It was clear we needed some kind of crash traceback--so we wrote one.

In addition to subroutine names and line numbers, our traceback gives parameter values and local symbol values. This is quite an asset, even though the current implementation doesn't dump structures. Figure 2 is an example of the traceback format.

Figure 2: An example of the traceback format.

  user=alan host=vader date=Thu Apr 16 12:08:25 1992
  program=./sparc
  SIGSEGV: segmentation violation signal=11(3)
  no mapping at the fault address
  ---------- traceback ----------
  file test.c line 21 function tst_segv()
  file test.c line 104 function tst_lv1()
  file test.c line 166 function main()
  -------------------------------
  tst_segv()
    -- local symbols for tst_segv --
    laa = -19088744 0xfedcba98 (int)
    prt = 0x0 (pointer to char)
  tst_lv1(paramtext)
    paramtext = "call 1"
    -- local symbols for tst_lv1 --
    l1 = -554829090 (0xdeedfade) (int)
    ttext = "level1"
  main(argc,argv)
    argc = 1 (0x1) (int)
    argv = 0xf7fffb04 (pointer to *char)
    *argv = "sparc"
    -- local symbols for main --
    b = 32 (0x20) (int)
    pi = 3.14159 (float)
    ss = 0 (0x0) (short)
    random = 5 (0x5) (int)

The traceback has been implemented for the SunOS and IBM AIX versions of UNIX. Some of the code is common to both, while some is system dependent. Since I developed most of the code by exploring, I'll pass on enough information for you to extend the method to other hardware. Because of space constraints, the complete system for the SPARC and the IBM RS/6000 is available electronically. It includes: stackdump.c, a program that produces a stack dump; a side-by-side listing of two SPARC stack dumps showing what changes when a different function is called; a similar side-by-side listing of two IBM RS/6000 stack dumps (I've added comments and frame boundaries to both listings); a sample frame dump; a set of tracebacks for the seven different values of the variable "random" in file test.c; test.c, a program to demonstrate and test the traceback; sparc.c, the traceback code for SPARC systems; and ibm.c, the traceback code for the RS/6000.

Overview

A simple traceback requires code that installs a signal handler, traces back each stack frame, and converts addresses to function-line numbers. A more informative traceback also has code that prints subroutine parameters and local symbols.

When a program crashes, the UNIX signal-handling mechanism transfers control to our traceback subroutine. Near the start of a program, we call the system function SIGNAL once for each signal we wish to catch, passing it the traceback subroutine.

Tracing back the stack is done by finding all the stack frames, each of which corresponds to one function call. Each stack frame contains a return address and a stack pointer. The stack pointer is used to find the next stack frame. "Walking the stack" is continued until there are no more stack frames. We store the return addresses for the stack frames and then translate them to function-line numbers.

When an executable file is generated by compiling and linking with -g, that executable contains debugging information. Part of this information is a memory address for each line of each function. Each function is also identified by name. We scan the relevant portion of the executable file to see if each line-number address matches any of our return addresses.

Catching Signals

The SIGNAL function takes two parameters: the number of the signal to be caught (such as SIGSEGV) and a pointer to the subroutine that does the catching. Example 1 is a typical call.

Example 1: A typical call to the SIGNAL function.

  signal(SIGSEGV, signal_handler_routine);

The important signals we wish to catch are:

  SIGSEGV  segmentation violation
  SIGABRT  sent by system abort
  SIGFPE   floating-point exception
  SIGBUS   bus error

Other signals are defined in signal.h (or sys/signal.h). The subroutine trb_signal in the signal-handling and stack traceback code (sparc.c and ibm.c, available electronically) catches the signals behind most crashes. On the SPARC, I couldn't catch an integer divide in the correct function without SIGABRT; without it, the traceback started at the function that called the crashing function. The SPARC also needs a call to ieee_handler to catch floating-point errors. On the RS/6000, I'm currently catching only SIGSEGV. Floating-point error trapping is usually turned off there to speed up pipelining. The function fp_enable_all is supposed to turn error trapping on, but I haven't had any success with it so far.
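Installing one handler for each signal in the list above can be sketched with the standard C signal() call. The names crash_handler, install_crash_handlers, and last_signal are hypothetical; this handler only records which signal arrived, where the real trb_signal code goes on to trace the stack. SIGBUS is a POSIX signal rather than an ISO C one, hence the feature-test macro.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes SIGBUS alongside the ISO C signals */
#include <signal.h>

/* sig_atomic_t is the object type C guarantees can be written
   safely from inside a signal handler. */
static volatile sig_atomic_t last_signal = 0;

/* Hypothetical handler: a real traceback handler would walk the
   stack here; this sketch only records the signal number. */
static void crash_handler(int sig)
{
    last_signal = sig;
}

/* Install crash_handler for the four crash signals.
   Returns 0 on success, -1 if any signal() call fails. */
static int install_crash_handlers(void)
{
    static const int sigs[] = { SIGSEGV, SIGABRT, SIGFPE, SIGBUS };
    for (unsigned i = 0; i < sizeof sigs / sizeof sigs[0]; i++)
        if (signal(sigs[i], crash_handler) == SIG_ERR)
            return -1;
    return 0;
}
```

On some systems signal() resets the disposition to the default after one delivery; for a handler that reports once and then stops the program, that does not matter.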

The Signal Handler

When a program crashes and issues a signal we're interested in, control is transferred to the signal-handler function. This function walks the stack, storing a return address for each function. One of the arguments passed to the signal handler is a structure which contains an initial stack pointer and an initial program counter (return address). This initial stack pointer does two things: It tells us where to find the stack, and it tells us what stack pointers look like in terms of a string of hex digits. When we examine the stack to find where in a stack frame the stack pointer is, we will be looking for a string of similar hex digits. Similarly, the initial return address tells us what other return addresses look like.

The signal handler has several tasks to do. It should describe the cause of the crash, trace back the stack, get information from the executable, and print out the traceback. It should also decide whether to stop program execution or to continue. It can also help us to build itself by printing out a stack dump and frame dumps.
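The first task, describing the cause of the crash, can be sketched as a lookup on the signal number and, where the system supplies it, a code that refines it. This sketch assumes the POSIX si_code values delivered by sigaction() with SA_SIGINFO, a newer interface than the one in Example 2; describe_signal and describe_code are hypothetical names, and the wording imitates Figure 2.

```c
#define _POSIX_C_SOURCE 200809L  /* SIGBUS and the SEGV_/FPE_/BUS_ codes */
#include <signal.h>
#include <stddef.h>

/* One-line description of the signal itself. */
static const char *describe_signal(int sig)
{
    switch (sig) {
    case SIGSEGV: return "SIGSEGV: segmentation violation";
    case SIGBUS:  return "SIGBUS: bus error";
    case SIGFPE:  return "SIGFPE: floating point exception";
    case SIGABRT: return "SIGABRT: abort";
    default:      return "unexpected signal";
    }
}

/* More specific reason from the POSIX si_code value. Code values
   overlap between signals, so each signal is checked separately.
   Returns NULL when there is nothing more specific to say. */
static const char *describe_code(int sig, int code)
{
    if (sig == SIGSEGV) {
        if (code == SEGV_MAPERR) return "no mapping at the fault address";
        if (code == SEGV_ACCERR) return "invalid permissions for mapped object";
    } else if (sig == SIGFPE) {
        if (code == FPE_INTDIV) return "integer divide by zero";
        if (code == FPE_FLTDIV) return "floating-point divide by zero";
        if (code == FPE_FLTOVF) return "floating-point overflow";
    } else if (sig == SIGBUS) {
        if (code == BUS_ADRALN) return "invalid address alignment";
    }
    return NULL;
}
```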

To find the arguments to the signal handler, type man signal at the UNIX prompt; see Example 2. To find the fields in the scp structure, look for a definition in signal.h or look at it via dbx. Hopefully, these fields correspond to the stack pointer and the return address.

Example 2: Signal-handler arguments.

  void trb_handle (sig, code, scp)
  int sig, code;
  struct sigcontext *scp;
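On systems that provide sigaction(), the three-argument handler of Example 2 has a POSIX counterpart: with SA_SIGINFO set, the handler receives a siginfo_t plus a pointer to the machine context. A minimal sketch follows; trb_handle_info and trb_install are hypothetical names. The context argument is where the crash-time stack pointer and program counter live, but its field names differ between systems and CPUs, so this sketch leaves it unused.

```c
#define _POSIX_C_SOURCE 200809L  /* sigaction, siginfo_t, SA_SIGINFO */
#include <signal.h>
#include <string.h>

/* Values captured by the handler for later reporting. */
static volatile sig_atomic_t caught_sig = 0;
static volatile sig_atomic_t caught_code = 0;

/* Three-argument handler: sig, details of the signal, and the
   machine-dependent register state (a ucontext_t in practice). */
static void trb_handle_info(int sig, siginfo_t *info, void *ctx)
{
    (void)ctx;                 /* SP/PC extraction is system specific */
    caught_sig = sig;
    caught_code = info->si_code;
}

/* Install trb_handle_info for one signal; returns sigaction()'s result. */
static int trb_install(int sig)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = trb_handle_info;
    sa.sa_flags = SA_SIGINFO;  /* request the three-argument form */
    sigemptyset(&sa.sa_mask);
    return sigaction(sig, &sa, NULL);
}
```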

More Detail on the Stack

The skeleton program in Figure 3 illustrates the stack for a simple calling sequence. Each function that has had its execution suspended as a result of calling another function will have a stack frame on the stack. The stack grows towards low addresses. Since we are starting at the bottom function, we start at the low-address end of the stack and walk back to the high-address end.

Figure 3: Skeleton program that illustrates the stack for a simple calling sequence.

            main()
            { fun1 ();
            }
            void fun1 ()
            { fun2();
            }
            void fun2()
            {float a,b,c;
            c=0.0;
            a=b/c;
            }
           __________
  Low     |          | Stack frame
  address |          | for fun2
          |__________|
          |          | Stack frame
          |          | for fun1
          |__________|
  High    |          | Stack frame
  address |          | for main
          |__________|

The stack pointer in one stack frame points to the stack frame of the calling function. In Figure 3, the stack frame for fun2 contains a stack pointer that points to the stack frame for fun1. Similarly, the stack frame for fun1 contains a stack pointer that points to the stack frame for main. The stack frame for main contains a stack pointer that points to the end of main's stack frame. The contents of this next frame are 0, indicating that main is the top-level function.
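The walk just described can be sketched over a simulated chain of frames. The struct frame layout and the walk_frames function are hypothetical: on real hardware the saved stack pointer and return address sit at machine-specific offsets inside each frame, which have to be discovered from a stack dump. The walk stops at the frame above main, whose saved stack pointer is 0.

```c
#include <stddef.h>
#include <stdint.h>

/* Simulated stack frame holding only the two words the traceback
   needs (hypothetical layout). */
struct frame {
    const struct frame *next_sp;   /* stack pointer saved in this frame */
    uintptr_t           return_pc; /* return address into the caller */
};

/* Walk from the innermost frame outward, storing up to max return
   addresses in pcs. Stops on reaching a frame whose saved stack
   pointer is 0. Returns the number of addresses stored. */
static size_t walk_frames(const struct frame *sp, uintptr_t *pcs, size_t max)
{
    size_t n = 0;
    while (sp != NULL && sp->next_sp != NULL && n < max) {
        pcs[n++] = sp->return_pc;
        sp = sp->next_sp;
    }
    return n;
}
```

Fed the return addresses from the SPARC dump in Example 4 (22e0, 22b0, and 2064), the walk yields them innermost first, which is the order the traceback prints.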

The return address provides us with the address inside the calling function where the function is called. To continue our example, the return address in the stack frame of fun2 gives the address in fun1 where fun2 was called. When we translate the address to a source-code file-line number, we get the line number in fun1 in which fun2 was called.

If we are looking at a new hardware architecture, we may know only the initial stack pointer and the initial return address. We need to know where in a stack frame the stack pointer and the return address are.

Using a Stack Dump

The locations of the return address and the stack pointer within a stack frame are found by examining a dump of the stack. To maximize the information contained in the dump, it's best to nest several subroutine calls, the bottom of which prints out the dump in hex and ASCII. Finding the middle of each subroutine's stack frame is simplified if each subroutine contains a unique text string. Stackdump.c (Listing One) generates a stack dump. Example 3 shows an abbreviated version of the sample output generated by stackdump.c. (Complete versions of both SPARC and RS/6000 stack dumps are available electronically.) The first column gives an address in the stack, while the second column gives the value at that address. If you circle all values in the second column that start with 0xf7ff, you will have circled all potential stack pointers. Where you can find these values in the first column, circle them there as well. This gives all possible starting positions for stack frames. Now eyeball the stack dump and look for circles on the right that are a constant distance from a circle on the left. For the SPARC, this difference is 14 longwords (56 bytes).

Example 3: Sample output generated by stackdump.c (Listing One).

  f7fff9c8  7efefeff         #start of frame
     ...
  f7fffa00  f7fffa40         #stack pointer
     ...
  f7fffa2c  66756e32 fun2    #local variable
  f7fffa30  20746578  tex
     ...
  f7fffa40  7efefeff         #start of next frame
     ...
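The circling procedure can be mechanized. In the sketch below, a word counts as a candidate stack pointer when its upper 16 bits match those of the initial stack pointer (the 0xf7ff prefix on the SPARC) and it points to a higher address. If candidate k points to the frame that holds candidate k+1, then the address of candidate k+1 minus the value of candidate k is the offset of the saved stack pointer within a frame. The dump_word type and find_sp_offset are hypothetical names, and the 16-bit prefix test is an assumption that suits these dumps rather than a general rule.

```c
#include <stddef.h>
#include <stdint.h>

/* One line of a stack dump: an address and the longword stored there. */
struct dump_word {
    uint32_t addr;
    uint32_t value;
};

/* Return the byte offset of the saved stack pointer within a frame,
   or -1 if fewer than two candidates exist or consecutive candidates
   disagree about the offset. Entries must be in address order. */
static int64_t find_sp_offset(const struct dump_word *d, size_t n, uint32_t initial_sp)
{
    uint32_t prefix = initial_sp & 0xffff0000u;
    int64_t offset = -1;
    const struct dump_word *prev = NULL;

    for (size_t i = 0; i < n; i++) {
        const struct dump_word *w = &d[i];
        if ((w->value & 0xffff0000u) != prefix || w->value <= w->addr)
            continue;                      /* not a plausible stack pointer */
        if (prev != NULL) {
            int64_t diff = (int64_t)w->addr - (int64_t)prev->value;
            if (offset == -1)
                offset = diff;             /* first pair fixes the offset */
            else if (diff != offset)
                return -1;                 /* no constant offset */
        }
        prev = w;
    }
    return offset;
}
```

Applied to the words of the SPARC dump in Example 4 with an initial stack pointer of f7fff9c8, it returns 56 bytes, the 14 longwords found by eye.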

Dissection of a Stack Frame

Assuming that we've succeeded in finding the offset of the stack pointer from the start of a stack frame, we can now explore the stack frame. If the value of FRAMEDUMP is changed from 0 to 1 in sparc.c, we'll get a stack dump broken up into frames. (An example dump of the stack when FRAMEDUMP = 1 is available electronically.) Stackdump.c (Listing One) prints the starting address of every function. Hopefully, we'll find similar addresses in each stack frame at a fixed offset from the start of the frame; these are the return addresses. If we still have difficulty finding them, we can replace the call to fun1a() in main with a call to fun1b(). The stack frame for fun1b should be very similar to the stack frame for fun1a, the major difference being the return address.

The output of the stack dump, summarized in Example 4, is in two pairs of columns. The left pair is the output when main calls function fun1a, while the right pair is the output when main calls function fun1b. The important difference is at address f7fffa04, the return address in the stack frame for fun2. The return address 22e0 lies in function fun1a, while 2310 lies in fun1b: 22e0 falls between the starting addresses of fun1a (22c0) and fun1b (22f0), and 2310 falls between those of fun1b (22f0) and fun2 (2320).

Example 4: SPARC stackdumps showing the difference between calling function fun1a vs. function fun1b from main. Note that the return address in function fun2 is different. This file has been edited to remove lines that are the same for both runs; this saves space and emphasizes the difference.

  main address=2290       main address=2290
  fun1a address=22c0      fun1a address=22c0
  fun1b address=22f0      fun1b address=22f0
  fun2 address=2320       fun2 address=2320
  /*_______________________________________________________*/
  /* stack frame for function fun2 */
  /*_______________________________________________________*/
  f7fff9c8 7efefeff ~     f7fff9c8 7efefeff ~
     ...
  f7fffa00 f7fffa40 @     f7fffa00 f7fffa40   @  /* stack pointer */
  f7fffa04 000022e0 "     f7fffa04 00002310  # /* return address DIFFERS */
     ...
  f7fffa2c 66756e32 fun2  f7fffa2c 66756e32 fun2
  f7fffa30 20746578 tex   f7fffa30 20746578 tex
  f7fffa34 74000000 t     f7fffa34 74000000 t
  /*_______________________________________________________*/
  /* stack frame for functions fun1a(left) & fun1b (right) */
  /*_______________________________________________________*/
  f7fffa40 7efefeff ~     f7fffa40 7efefeff ~
     ...
  f7fffa78 f7fffab0       f7fffa78 f7fffab0   /* stack pointer */
  f7fffa7c 000022b0   "   f7fffa7c 000022b0 " /* return address */
     ...
  f7fffaa0 66756e31 fun1  f7fffaa0 66756e31 fun1
  f7fffaa4 61207465 a te  f7fffaa4 62207465 b te
  f7fffaa8 787400b0 xt    f7fffaa8 787400b0 xt
  /*_______________________________________________________*/
  /* stack frame for function main */
  /*_______________________________________________________*/
  f7fffab0 11400086  @    f7fffab0 11400081  @
     ...
  f7fffae8 f7fffb20       f7fffae8 f7fffb20    /* stack pointer */
  f7fffaec 00002064   d   f7fffaec 00002064   d /* return address */
     ...
  f7fffb10 6d61696e main  f7fffb10 6d61696e main
  f7fffb14 20746578 tex   f7fffb14 20746578 tex
  f7fffb18 74000020 t     f7fffb18 74000020 t
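The observation that each return address lies between its own function's starting address and the next one gives a quick way to name the function that owns a return address: pick the function with the highest starting address not above it. The sketch below is a hypothetical helper (func_for_pc) using the SPARC addresses from Example 4; the traceback itself resolves addresses through the line-number symbols in the executable.

```c
#include <stddef.h>
#include <stdint.h>

/* A function name and its entry address, as printed by stackdump.c. */
struct func_start {
    const char *name;
    uint32_t    addr;
};

/* Return the name of the function containing pc: the entry with the
   highest start address not above pc. Entries must be sorted by addr.
   Returns NULL if pc lies below the first listed function (for
   example, in startup code). The last function is open-ended here. */
static const char *func_for_pc(const struct func_start *f, size_t n, uint32_t pc)
{
    const char *best = NULL;
    for (size_t i = 0; i < n && f[i].addr <= pc; i++)
        best = f[i].name;
    return best;
}
```

With the table from Example 4, 22e0 maps to fun1a, 2310 to fun1b, and 22b0 to main, while 2064 (the return address in main's frame) falls below every listed function.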

In the IBM output, too, the only difference between the fun1a and fun1b runs, apart from the text string, is the return address. On the RS/6000, however, the function starting addresses printed by stackdump.c turn out to be pointers to the real addresses, which is why its IBM branch dereferences them.

Figure 4 illustrates the concept of a stack frame. I'll refer to the low-address end of a stack frame as the "stack pointer," and the high-address end of a stack frame as the "frame end." On the SPARC, local variables are referenced via a negative offset relative to the frame end. This means that we must find the next stack pointer to find the locals for a given stack pointer. Parameters coming into a function are referenced via a positive offset from the calling function's stack pointer, because the space for the parameters was allocated in the calling function.

Figure 4: A stack frame.

           _________
  Low     |         | <--Frame start
  address |_________| (Stack pointer)
          |         |
          |_________|
          |         |
          |_________|
          |         |
          |_________|
  SP +n   | Next SP | Next stack
  bytes   |_________| pointer
          |   PC    | Program
          |_________| counter
          |         |
          |_________|
          |         |
          |_________|
          |         | Last declared
          |_________| local variable
          |         |
          |_________|
          |         | First declared
          |_________| local variable
          |         |
          |_________| <--Frame end
          |         |
          |_________| Parameters
  High    |         |
  address |_________|
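Because a frame ends where the caller's frame begins, the next stack pointer serves as the base for both kinds of value: locals sit at negative offsets from it, and incoming parameters at positive offsets. A minimal sketch on a simulated stack held in a byte array; frame_int is a hypothetical helper, and in the traceback the offsets would come from the symbol table.

```c
#include <stddef.h>
#include <string.h>

/* Read an int at next_sp + offset in a simulated stack. next_sp is the
   stack pointer saved in this function's frame (the frame end). Locals
   use negative offsets; parameters, allocated by the caller, use
   positive ones. */
static int frame_int(const unsigned char *stack, size_t next_sp, long offset)
{
    int v;
    memcpy(&v, stack + next_sp + offset, sizeof v); /* avoids misaligned reads */
    return v;
}
```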

The Executable File

The executable files on a SPARC are similar to the System V COFF executables. An executable file contains a header structure whose fields contain the location of the symbol table, the location of the string table, and the number of symbols. An executable that is linked with the -g flag contains a symbol table and a string table. The symbol table contains information about all symbols in the source file, including source-file names, function names, source-line numbers, subroutine parameters, local symbols, and more. The symbol table does not contain text information; instead it has a pointer into the string table. Each symbol in the symbol table is read into a structure which contains the symbol's type (and other fields).

The symbol type determines the meaning of the rest of the structure. For a source-line symbol the structure contains the line number and the address. For a source filename, the structure contains an offset into the string table, which contains the name of the source file (ditto for subroutine names). Parameter symbols contain a parameter type, a stack offset value, and a pointer to the string table. Local symbols are the same as parameters.

A traceback conversion consists of reading the symbols sequentially and storing the last file and subroutine names found until an address is found that matches one of the addresses in the traceback list. At this point we store the filename, subroutine name, and line number and continue reading symbols in order to match the remaining addresses.
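The single-pass conversion can be sketched over an in-memory array of symbols. The record layout and type names (struct sym, SYM_FILE, SYM_LINE, and so on) are hypothetical stand-ins for the system-specific symbol table, with names already resolved from the string table. The match rule is slightly looser than an exact match: each return address takes the last line symbol at or below it, which assumes line symbols appear in increasing address order. The function's symbol index is kept so that its parameters can be found later.

```c
#include <stddef.h>
#include <stdint.h>

enum sym_type { SYM_FILE, SYM_FUNC, SYM_LINE, SYM_PARAM, SYM_LOCAL };

/* Hypothetical symbol record. */
struct sym {
    enum sym_type type;
    const char   *name;   /* file or function name, from the string table */
    int           line;   /* SYM_LINE only */
    uint32_t      addr;   /* SYM_LINE only */
};

/* One traceback level; file and func should start out NULL. */
struct trace_entry {
    uint32_t    pc;          /* return address from the stack walk */
    const char *file;
    const char *func;
    int         line;
    size_t      func_index;  /* symbol index of the owning function */
};

/* Read the symbols once, remembering the latest file and function.
   For each line symbol, update every traceback entry whose pc is at
   or above the line's address. */
static void resolve_traceback(const struct sym *s, size_t nsym,
                              struct trace_entry *t, size_t nt)
{
    const char *file = NULL, *func = NULL;
    size_t func_index = 0;

    for (size_t i = 0; i < nsym; i++) {
        if (s[i].type == SYM_FILE) {
            file = s[i].name;
        } else if (s[i].type == SYM_FUNC) {
            func = s[i].name;
            func_index = i;
        } else if (s[i].type == SYM_LINE) {
            for (size_t k = 0; k < nt; k++) {
                if (s[i].addr <= t[k].pc) {
                    t[k].file = file;
                    t[k].func = func;
                    t[k].line = s[i].line;
                    t[k].func_index = func_index;
                }
            }
        }
    }
}
```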

Parameters

To print out subroutine parameter values, each level of the traceback needs to know its associated filename and function name. If we also know the position in the symbol table for the start of the function, we can quickly find the parameters for the function by seeking to the beginning of the symbols for the function, then reading symbols. Any symbols of type parameter are printed, until a new filename or subroutine name is found. The string table contains the name of the parameter (as it is named in the source file). The stack offset value lets us find the parameter's location in the stack frame. The parameter type tells us how the parameter was declared, enabling us to print its value as an int, float, and so on. Local symbols are found using the same steps used for parameters.
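Printing the parameters can then be sketched as a loop that starts just after the function's symbol and stops at the next file or function symbol. Again the symbol layout, the two-type enum, and the print_params helper are hypothetical; values are read from a simulated frame with memcpy, and the output format imitates Figure 2. Locals would be handled the same way by testing for SYM_LOCAL instead.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

enum sym_type { SYM_FILE, SYM_FUNC, SYM_PARAM, SYM_LOCAL };
enum val_type { T_INT, T_FLOAT };

/* Hypothetical symbol record: name from the string table, declared
   type, and offset of the value from the frame base. */
struct sym {
    enum sym_type type;
    const char   *name;
    enum val_type vtype;
    long          offset;
};

/* Print the parameters of the function whose symbol is at func_index
   into buf (buflen must be > 0). Returns the number printed. */
static int print_params(const struct sym *s, size_t nsym, size_t func_index,
                        const unsigned char *frame, char *buf, size_t buflen)
{
    int count = 0;
    size_t used = 0;
    buf[0] = '\0';
    for (size_t i = func_index + 1; i < nsym; i++) {
        if (s[i].type == SYM_FILE || s[i].type == SYM_FUNC)
            break;                      /* end of this function's symbols */
        if (s[i].type != SYM_PARAM)
            continue;                   /* skip locals and others */
        int n;
        if (s[i].vtype == T_INT) {
            int v;
            memcpy(&v, frame + s[i].offset, sizeof v);
            n = snprintf(buf + used, buflen - used, "%s = %d (0x%x) (int)\n",
                         s[i].name, v, (unsigned)v);
        } else {
            float f;
            memcpy(&f, frame + s[i].offset, sizeof f);
            n = snprintf(buf + used, buflen - used, "%s = %g (float)\n",
                         s[i].name, f);
        }
        if (n < 0 || (size_t)n >= buflen - used)
            break;                      /* out of buffer space */
        used += (size_t)n;
        count++;
    }
    return count;
}
```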

Speed Considerations and Code Limitations

Because you must compile and link the system with -g, compiler optimization is removed. As such, tracebacks may only be appropriate for in-house usage. This is still quite helpful, as it gives a user something concrete to report, especially if the traceback is written to a file. Speed is not a major factor after the crash. For a large executable, there could be 100K symbols, but we only need to scan the file once. Example 5 shows how to call the traceback start-up routine from a user program.

Example 5: Calling the traceback startup routine from a user program.

  main(argc,argv)
  int argc;
  char *argv[];
  {
  /* declarations */

  /* enable crash tracebacks */
    trb_signalinit(argc,argv);

  /* the program
     ...
  */
  }

Note that neither the SPARC nor the IBM version prints structures yet, and the IBM version does not yet catch floating-point errors or have a data dictionary. (The SPARC data-dictionary code has yet to be optimized.) Also, the RS/6000 version has a hardwired define called PCADJUST that converts stack addresses to COFF file addresses. There should be a function to make this conversion, as the adjustment may not be constant. Finally, the code does not handle executables that were not linked with the -g flag, and not all kinds of arrays are printed out.

Conclusion

Not only is the exploration of program stacks and COFF files interesting, it is also very practical. Stack traceback can often be used to find the cause of a program crash very quickly. A stack traceback that prints values of parameters and local symbols is often as informative as dbx and is quicker.

[LISTING ONE]


/* stackdump.c -- a program to dump the stack */

#include <stdio.h>
#include <string.h>
#include <ctype.h>

#define SPARC 1
#define IBM 0

void fun1a(void);
void fun1b(void);
void fun2(void);
void stackdump(unsigned long start);

int main(void)      /* call function fun1a or function fun1b */
{
  char text[16];
  strcpy(text,"main text");
  fun1a();
  return 0;
}
void fun1a(void)
{
  char text[16];
  strcpy(text,"fun1a text");
  fun2();
}
void fun1b(void)
{
  char text[16];
  strcpy(text,"fun1b text");
  fun2();
}
void fun2(void)
{ int jj;
  char text[16];
  strcpy(text,"fun2 text");
#if SPARC
  printf("main address=%lx\n", (unsigned long)main);
  printf("fun1a address=%lx\n", (unsigned long)fun1a);
  printf("fun1b address=%lx\n", (unsigned long)fun1b);
  printf("fun2 address=%lx\n\n", (unsigned long)fun2);
#elif IBM
  /* on the RS/6000 a function address points to the real address */
  printf("main address=%lx -> %lx\n", (unsigned long)main, *(unsigned long *)main);
  printf("fun1a address=%lx -> %lx\n", (unsigned long)fun1a, *(unsigned long *)fun1a);
  printf("fun1b address=%lx -> %lx\n", (unsigned long)fun1b, *(unsigned long *)fun1b);
  printf("fun2 address=%lx -> %lx\n", (unsigned long)fun2, *(unsigned long *)fun2);
#endif
  /* start 32 ints (128 bytes) below jj to include the stack before it */
  stackdump((unsigned long)&jj - 32 * sizeof(int));
}

/* Print 128 rows: an address, the 4-byte word stored there, and its
   bytes as ASCII (non-printing bytes shown as blanks). This reads raw
   stack memory on purpose, so it is only a tool for exploration. */
void stackdump(unsigned long start)
{
  int i,j;
  for (i=0;i<128;i++)
    {
      printf("%08lx ", start);
      printf("%08x ", *(unsigned int *)start);
      for (j=0;j<4;j++,start++)
        printf("%c", isprint(*(unsigned char *)start) ?
                          *(unsigned char *)start : ' ');
      printf("\n");
    }
}
