This article contains the following executables: TRACE.ARC
Alan is manager of graphics development at Landmark/ITA, a Calgary, Alberta-based division of Landmark Graphics. He can be reached at adunham@ita .lgc.com.
Some time ago, when I developed code on VAX/VMS systems, programs that crashed would give a stack traceback, a list of subroutines that told me what part of the program was currently being executed (see Figure 1). Subroutines compiled with debug information had a line number associated with them. This traceback often enabled me to find the cause of the crash without using a debugger.
Figure 1: List of subroutines identifying parts of a program being executed (VAX/VMS).
%FOR-E-OUTCONERR, output conversion error unit -5 file user PC 00AC9E6F %TRACE-E-TRACEBACK, symbolic stack dump follows module name routine name line rel PC abs PC SIO_OPEN SIO_OPEN 90 0000018A 00AC023E RFE_INFO RFE_INFO 87 11 00AB7AE1 REFED REFED 210 6F 00AB766F
When we ported our code from VAX to UNIX, program crashes no longer gave a traceback. Instead, we got informative error messages like "Segmentation Violation." Since our Fortran programs were huge and didn't have dynamic memory allocation, we had to disable the generation of the resulting huge core files. To make matters worse, we had a couple of numerical programmers who had never used a debugger! It was clear we needed some kind of crash traceback--so we wrote one.
In addition to subroutine names and line numbers, our traceback gives parameter values and local symbol values. This is quite an asset, even though the current implementation doesn't dump structures. Figure 2 is an example of the traceback format.
Figure 2: An example of the traceback format.
user=alan host=vader date=Thu Apr 16 12:08:25 1992 program=./sparc SIGSEGV: segmentation violation signal=11(3) no mapping at the fault address ---------- traceback ---------- file test.c line 21 function tst_segv() file test.c line 104 function tst_lv1() file test.c line 166 function main() ------------------------------- tst_segv() -- local symbols for tst_segv -- laa = -19088744 0xfedcba98 (int) prt = 0x0 (pointer to char) tst_lv1(paramtext) paramtext = "call 1" -- local symbols for tst_lv1 -- l1 = -554829090 (0xdeedfade) (int) ttext = "level1" main(argc,argv) argc = 1 (0x1) (int) argv = 0xf7fffb04 (pointer to *char) *argv = "sparc" -- local symbols for main -- b = 32 (0x20) (int) pi = 3.14159 (float) ss = 0 (0x0) (short) random = 5 (0x5) (int)
The traceback has been implemented for the SUN OS and IBM AIX versions of UNIX. Some of the code is similar, while some is system dependent. Since I developed most of the code by exploring, I'll pass on information that will enable you to extend the method to other hardware. Because of space constraints, the entire system, which I've implemented for SPARC and the IBM RS/6000, is available electronically. It includes: stackdump.c, a program to give a stack dump; a side-by-side listing of two SPARC stack dumps showing changes when a different function is called (comments and frame boundaries have been added to the listing); a side-by-side listing of two IBM RS/6000 stack dumps showing changes when a different function is called (again, I've added comments and frame boundaries); a sample frame dump; a set of tracebacks for the seven different values of the variable "random" in file test.c; test.c, a program to demonstrate and test the traceback; sparc.c, the traceback code for the SPARC systems; and ibm.c, the traceback code for RS/6000.
Overview
A simple traceback requires code that installs a signal handler, traces back each stack frame, and converts addresses to function-line numbers. A more informative traceback also has code that prints subroutine parameters and local symbols.
When a program crashes, control is transferred to our traceback subroutine by the UNIX signal-handling mechanism. Near the start of a program, we call the system function SIGNAL with the signals we wish to catch and the name of the traceback subroutine.
Tracing back the stack is done by finding all the stack frames, each of which corresponds to one function call. Each stack frame contains a return address and a stack pointer. The stack pointer is used to find the next stack frame. "Walking the stack" is continued until there are no more stack frames. We store the return addresses for the stack frames and then translate them to function-line numbers.
When an executable file is generated by compiling and linking with -g, that executable contains debugging information. Part of this information is a memory address for each line of each function. Each function is also identified by name. We scan the relevant portion of the executable file to see if each line-number address matches any of our return addresses.
Catching Signals
The SIGNAL function takes two parameters: the name of the signal to be caught and the name of the subroutine to do the catching. Example 1is a typical call.
Example 1: A typical call to the SIGNAL function.
signal ( SIGSEGV. signal_handler_routine ): The important signals we wish to catch are: SIGSEGV segmentation violations: SIGABRT sent by system abort SIGFPE floating point exception SIGBUS bus error
Other signals are defined in signal.h (or sys/signal.h). Most crashes are caught by the signals in subroutine trb_signal in the signal handling and stack traceback code; see sparc.c and ibm.c, available electronically. On the SPARC, I couldn't catch integer divide at the correct function without SIGABRT. Instead, the traceback would start at the function that called the crashing function. The SPARC also needs a call to ieee_handler to catch floating-point errors. On the RS/6000, I'm currently only catching SIGSEGV. Floating-point errors are usually turned off to speed up the pipelining. The function fp_enable_all is supposed to turn on error catching, but I haven't had any success with it so far.
The Signal Handler
When a program crashes and issues a signal we're interested in, control is transferred to the signal-handler function. This function walks the stack, storing a return address for each function. One of the arguments passed to the signal handler is a structure which contains an initial stack pointer and an initial program counter (return address). This initial stack pointer does two things: It tells us where to find the stack, and it tells us what stack pointers look like in terms of a string of hex digits. When we examine the stack to find where in a stack frame the stack pointer is, we will be looking for a string of similar hex digits. Similarly, the initial return address tells us what other return addresses look like.
The signal handler has several tasks to do. It should describe the cause of the crash, trace back the stack, get information from the executable, and print out the traceback. It should also decide whether to stop program execution or to continue. It can also help us to build itself by printing out a stack dump and frame dumps.
To find the arguments to the signal handler, type man signal at the UNIX prompt; see Example 2. To find the fields in the scp structure, look for a definition in signal.h or look at it via dbx. Hopefully, these fields correspond to the stack pointer and the return address.
Example 2: Signal-handler arguments.
void trb_handle (sig, code, scp) int sig, code; struct sigcontext *scp;
More Detail on the Stack
The skeleton program in Figure 3 illustrates the stack for a simple calling sequence. Each function that has had its execution suspended as a result of calling another function will have a stack frame on the stack. The stack grows towards low addresses. Since we are starting at the bottom function, we start at the low-address end of the stack and walk back to the high-address end.
Figure 3: Skeleton program that illustrates the stack for a simple calling sequence.
main() { fun1 (); } void fun1 () { fun2(); } void fun2() {float a,b,c; c=0.0; a=b/c; __________ Low | | Stack frame address | | for fun2 |__________| | | Stack frame | | for fun1 |__________| High | | Stack frame address | | for main |__________|
The stack pointer in one stack frame points to the stack frame of the calling function. I Figure 3, the stack frame for fun2 contains a stack pointer that points to the stack frame for fun1. Similarly, the stack frame for fun1 contains a stack pointer that points to the stack frame for main. The stack frame for main contains a stack pointer that points to the end of main's stack frame. The contents of this next frame are 0, indicating that main is the top level function.
The return address provides us with the address inside the calling function where the function is called. To continue our example, the return address in the stack frame of fun2 gives the address in fun1 where fun2 was called. When we translate the address to a source-code file-line number, we get the line number in fun1 in which fun2 was called.
If we are looking at a new hardware architecture, we may know only the initial stack pointer and the initial return address. We need to know where in a stack frame the stack pointer and the return address are.
Using a Stack Dump
The locations of the return address and the stack pointer within a stack frame and found by examining a dump of the stack. To maximize the information contained in the dump, it's best to nest several subroutine calls, the bottom of which prints out the dump in hex and ASCII. Finding the middle of each subroutine's stack frame is simplified if each subroutine contains a unique text string. Stackdump.c (Listing One, page 113) generates a stack dump. Example 3shows an abbreviated version of the sample output generated by stackdump.c. (Complete versions of both SPARC and RS/6000 stack dumps are available electronically.) The first column gives an address in the stack, while the second column gives the value at that address. If you circle all values in the second column that start with Ohf7ff, you will have circled all potential stack pointers. If you can find these values in the first column, you should circle them there as well. This will give all possible starting positions for stack frames. Now you must eyeball the stack dump and look for circles on the right that are a constant position from a circle on the left. For the SPARC, this difference is 14 longwords.
Example 3: Sample output generated by stackdump.c (Listing One).
f7fff9c8 7efefeff #start of frame ... f7fff9ec f7fffa40 #stack pointer ... f7fffa2c 66756e32 fun2 #local variable f7fff9ec 20746578 text ... f7fffa40 7efefeff #start of next frame ...
Dissection of a Stack Frame
Assuming that we've been successful in finding the offset of the stack pointer from the start of a stack frame, we can now explore the stack frame. If the value of FRAMEDUMP is changed from 0 to 1 in sparc.c, we'll get a stackdump broken up into frames. (An example dump of the stack when FRAMEDUMP = 1 is available electronically.) Stackdump.c ( Listing One) prints addresses for the start of all functions. Hopefully, we'll find similar addresses in each stack frame which are a fixed offset away from the start of each frame. If we're still having difficulty in finding them, we should substitute the call to fun1a() in main with fun1b(); the stack frame for fun1b should be very similar to the stack frame for fun1a, with the major difference being the return address for fun1b being different than the return address for fun1a.
The output of the stackdump, summarized inExample 4, is in two pairs of columns. The left pair is the output when main calls function fun1a while the right pair is the output when main calls function fun1b. The important difference is at address 7ffffa04, the return address from stack frame fun2. The return address 22e0 corresponds to function fun1a while address 2310 corresponds to fun1b. These addresses are between the appropriate function starting addresses.
Example 4: SPARC stackdumps showing the difference between calling function fun1a vs. function fun1b from main. Note that the return address in function fun2 is different. This file has been edited to remove lines that are the same for both runs; this saves space and emphasizes the difference.
main address=2290 main address=2290 fun1a address=22c0 fun1a address=22c0 fun1b address=22f0 fun1b address=22f0 fun2 address=2320 fun2 address=2320 /*_______________________________________________________*/ /* stack frame for function fun2 */ /*_______________________________________________________*/ f7fff9c8 7efefeff ~ f7fff9c8 7efefeff ~ ... f7fffa00 f7fffa40 @ f7fffa00 f7fffa40 @ /* stack pointer */ f7fffa04 000022e0 " f7fffa04 00002310 # /* return address DIFFERS */ ... f7fffa2c 66756e32 fun2 f7fffa2c 66756e32 fun2 f7fffa30 20746578 tex f7fffa30 20746578 tex f7fffa34 74000000 t f7fffa34 74000000 t /*_______________________________________________________*/ /* stack frame for functions fun1a(left) & fun1b (right) */ /*_______________________________________________________*/ f7fffa40 7efefeff ~ f7fffa40 7efefeff ~ ... f7fffa78 f7fffab0 f7fffa78 f7fffab0 /* stack pointer */ f7fffa7c 000022b0 " f7fffa7c 000022b0 " /* return address */ ... f7fffaa0 66756e31 fun1 f7fffaa0 66756e31 fun1 f7fffaa4 61207465 a te f7fffaa4 62207465 b te f7fffaa8 787400b0 xt f7fffaa8 787400b0 xt /*_______________________________________________________*/ /* stack frame for function main */ /*_______________________________________________________*/ f7fffab0 11400086 @ f7fffab0 11400081 @ ... f7fffae8 f7fffb20 f7fffae8 f7fffb20 /* stack pointer */ f7fffaec 00002064 d f7fffaec 00002064 d /* return address */ ... f7fffb10 6d61696e main f7fffb10 6d61696e main f7fffb14 20746578 tex f7fffb14 20746578 tex f7fffb18 74000020 t f7fffb18 74000020 t
Except for the change of text string, the only difference between fun1a and fun1b for the IBM output is the return address. The function starting addresses turn out to be pointers to the real addresses.
Figure 4 illustrates the concept of a stack frame. I'll refer to the low address end of a stack frame as the "stack pointer," and the high address end of a stack frame as the "frame end." On the SPARC, local variables are referenced via a negative offset relative to the frame end. This means that we must find the next stack pointer to find the locals for a given stack pointer. Parameters coming into a function are referenced via a positive offset from the calling function's stack pointer as the space for the parameters was allocated in the calling function.
Figure 4: A stack frame.
_________ Low | | <--Frame start address |_________| (Stack pointer) | | |_________| | | |_________| | | |_________| SP +n | Next SP | Next stack bytes |_________| pointer | PC | Program |_________| counter | | |_________| | | |_________| | | Last declared |_________| local variable | | |_________| | | First declared |_________| local variable | | |_________| <--Frame end | | |_________| Parameters High | | address |_________|
The Executable File
The executable files on a SPARC are similar to the System V COFF executables. An executable file contains a header structure whose fields contain the location of the symbol table, the location of the string table, and the number of symbols. An executable that is linked with the -g flag contains a symbol table and a string table. The symbol table contains information about all symbols in the source file, including source-file names, function names, source-line numbers, subroutine parameters, local symbols, and more. The symbol table does not contain text information; instead it has a pointer into the string table. Each symbol in the symbol table is read into a structure which contains the symbol's type (and other fields).
The symbol type determines the meaning of the rest of the structure. For a source-line symbol the structure contains the line number and the address. For a source filename, the structure contains an offset into the string table, which contains the name of the source file (ditto for subroutine names). Parameter symbols contain a parameter type, a stack offset value, and a pointer to the string table. Local symbols are the same as parameters.
A traceback conversion consists of reading the symbols sequentially and storing the last file and subroutine names found until an address is found that matches one of the addresses in the traceback list. At this point we store the filename, subroutine name, and line number and continue reading symbols in order to match the remaining addresses.
Parameters
To print out subroutine parameter values, each level of the traceback needs to know its associated filename and function name. If we also know the position in the symbol table for the start of the function, we can quickly find the parameters for the function by seeking to the beginning of the symbols for the function, then reading symbols. Any symbols of type parameter are printed, until a new filename or subroutine name is found. The string table contains the name of the parameter (as it is named in the source file). The stack offset value lets us find the parameter's location in the stack frame. The parameter type tells us how the parameter was declared, enabling us to print its value as an int, float, and so on. Local symbols are found using the same steps used for parameters.
Speed Considerations and Code Limitations
Because you must compile and link the system with -g, compiler optimization is removed. As such, tracebacks may only be appropriate for in-house usage. This is still quite helpful, as it gives a user something concrete to report, especially if the traceback is written to a file. Speed is not a major factor after the crash. For a large executable, there could be 100K symbols, but we only need to scan the file once. Example 5 shows how to call the traceback start-up routine from a user program.
Example 5: Calling the traceback startup routine from a user program.
main(argc,argv) int argc; char *argv[]; { /* declarations */ /* enable crash tracebacks */ trb_signalinit(argc,argv); /* the program ... */ }
Note that neither the SPARC or IBM system prints structures yet, nor does the IBM yet catch floating-point errors or have a data dictionary. (The SPARC data-dictionary code has yet to be optimized.) Also, the RS/6000 version has a hardwired define called PCADJUST that converts stack addresses to COFF file addresses. There should be a function to make this conversion, as it may not be constant. The code does not handle executables not linked with the -g flag, and not all kinds of arrays are printed out:
Conclusion
Not only is the exploration of program stacks and COFF files interesting, it is also very practical. Stack traceback can often be used to find the cause of a program crash very quickly. A stack traceback that prints values of parameters and local symbols is often as informative as dbx and is quicker.
[LISTING ONE]
Copyright © 1992, Dr. Dobb's Journal<a name="01fe_0018">
/* stackdump.c -- a program to dump the stack */
#define SPARC 1
#define IBM 0
void fun1a();
void fun1b();
void fun2();
void stackdump();
main() /* call function fun1a or function fun1b */
{
char text[16];
strcpy(text,"main text");
fun1a();
}
void fun1a()
{
char text[16];
strcpy(text,"fun1a text");
fun2();
}
void fun1b()
{
char text[16];
strcpy(text,"fun1b text");
fun2();
}
void fun2()
{ int jj;
char text[16];
strcpy(text,"fun2 text");
#if SPARC
printf("main address=%x\n",main);
printf("fun1a address=%x\n",fun1a);
printf("fun1b address=%x\n",fun1b);
printf("fun2 address=%x\n\n",fun2);
#else if IBM
printf("main address=%x -> %x\n",main, *(unsigned long *)main);
printf("fun1a address=%x -> %x\n",fun1a, *(unsigned long *)fun1a);
printf("fun1b address=%x -> %x\n",fun1b, *(unsigned long *)fun1b);
printf("fun2 address=%x -> %x\n",fun2, *(unsigned long *)fun2);
#endif
stackdump(&jj-32); /* the 32 gives us the stack before variable jj */
}
void stackdump(start)
unsigned long start;
{
int i,j;
for (i=0;i<128;i++)
{
printf("%08x ", (long)start);
printf("%08x ", *(unsigned long *)(start));
for (j=0;j<4;j++,start++)
printf("%c", isprint( *(unsigned char *)(start)) ?
*(unsigned char *)(start) : ' ');
printf("\n");
}
}