STATISTICAL PERFORMANCE ANALYSIS
Looking for quality time
This article contains the following executables: STAT.ARC
Fred Motteler
Fred has been a software engineer at Applied Microsystems Corporation since receiving a Ph.D. in physics from the University of Washington in 1986. In addition to his work at Applied Microsystems, he has worked on a number of embedded real-time controller applications. He can be contacted at Applied Microsystems Corporation, 5020 148th Ave. N.E., P.O. Box 97002, Redmond, WA 98073-9702.
Statistical performance analysis is a practical technique for determining which modules in a program require the most execution time. By knowing this, you can improve performance of time-critical applications. Consecutive trials can then help zero in on effective coding changes for execution time reduction. While the techniques presented here are ideally suited for embedded applications--and were in fact developed in part to squeeze more performance out of embedded systems development tools, as described later--the implementation is general-purpose enough to apply to just about any time-critical application.
Where I work (Applied Microsystems), statistical performance analysis has been used to improve time-critical application performance on several occasions. The most notable was the development of the SCSI communication option for our ES 1800 emulator for 680x0, 80x86, and Z800x processors. Use of statistical performance analysis allowed both host and emulator software to be optimized for speed. This resulted in data transfer rates two to five times faster than would otherwise have been possible.
Another example of statistical performance analysis involved development of an arbitrary-precision, portable-C, IEEE-P754 format-compatible floating-point arithmetic package. Again, performance analysis was used to determine which routines to optimize for speed and how effective the modifications were.
Performance Analysis Steps
The basic idea of statistical performance analysis is to periodically sample the program counter while a program is executing. Once the program or the sampling is finished, the program counter samples are sorted according to the address range of each module in the program. The number of samples that lie within each module is tallied. From these tallies, an approximate value for the relative amount of execution time spent in each module can be determined and displayed. (See Figure 1.)
Determining the module to which each program counter sample belongs requires the program's link map or memory map to be examined. Most linkers produce memory maps that give information about modules' physical locations in memory.
To good approximation, the number of samples falling within each module is proportional to the amount of time spent executing the code within the module. The final tallies for each module can be displayed as a percentage of the total number of samples recorded for the entire program.
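The sampling-and-tallying step described above can be sketched in a few lines of C. This is a minimal illustration, not code from the article's own utilities; the structure and function names (module_bin, tally_sample, bin_percent) are invented for the example.

```c
#include <stddef.h>

/* A module's address range and its running sample tally. */
struct module_bin {
    const char   *name;
    unsigned long start;    /* first address of the module    */
    unsigned long end;      /* first address past the module  */
    unsigned long samples;  /* PC samples that fell inside it */
};

/* Tally one program counter sample into the matching module bin.
   Returns 1 if the sample landed in a known module, 0 otherwise. */
int tally_sample(struct module_bin *binsA, size_t nbins, unsigned long pc)
{
    size_t i;

    for (i = 0; i < nbins; i++) {
        if (pc >= binsA[i].start && pc < binsA[i].end) {
            binsA[i].samples++;
            return 1;
        }
    }
    return 0;
}

/* Percentage of the total sample count attributed to one bin. */
unsigned long bin_percent(const struct module_bin *binP, unsigned long total)
{
    return total ? (binP->samples * 100UL) / total : 0;
}
```

Repeating tally_sample() over every recorded program counter value and then printing bin_percent() for each bin gives exactly the kind of table shown in Figures 2 and 3.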
Figure 2 contains examples of statistical performance analysis applied to a simple test program. The test program consists of a main( ) function and 15 delay_loop( ) functions. The delay loop functions are configured so that the second function's delay loop takes twice as much time as the first, the third function's delay loop takes three times as much time as the first, and so on. The main function contains a loop that contains one call to each of the 15 subfunctions. Good statistics should show the second function taking twice as much time as the first, the third function taking three times as much time as the first, and so on. Figure 2(b) shows each program counter sample sorted according to the function within which it lies. Trial program configuration (patest.c): looptoloopN = 5120, scaleN = 2, delay_one = 1, delay_two = 2, delay_three = 3, and so on. This was run using an ES 1800 68020 emulator with a 12.5-MHz "Demon" target for two seconds, with 4000 bus cycles between samples.
For a complete listing of the test program patest.c, see Listing One (page 100).
In contrast, Figure 3 presents statistical performance analysis results done on a real PC program. The program being tested is the regression tester for the floating-point arithmetic package mentioned earlier. Only the results for the first 20 functions are included here.
Figure 3. Floating-point arithmetic package regression tester. Trial program configuration: fmtest.exe running on an 8-MHz AT clone. Command-line invocation: pamsdos fmtest.map lattice.cfg fmtest.
Sample output:

Module table sorted by sample counts: 4096 samples collected
module        samples   %   0    5    10   15   20   25   30
ushftlm          919   22   **************************************
ushftrm          846   21   ************************************
PA_BOT_MEM       670   16   ***************************
ucheckm          463   11   *******************
_eufbs           266    6   **********
ucmpm            169    4   ******
isubm            138    3   *****
iaddm             94    2   ***
ffbitext          74    2   ***
umultm            62    2   ***
_nmalloc2         42    1   *
_nfreex           42    1   *
udivm             37    1   *
itestm            33    1   *
_nchkb            23    1   *
ffbitins          21    1   *
_nnohold          20    0
_CO_pfmt          18    0
prints            18    0
calloc            14    0
From this example, it is clear that the routines ushftlm and ushftrm together take up more than 40 percent of the total program execution time. These routines do bitwise shifts left and right. As expected, they are used extensively and are prime candidates for execution-time optimization. The routine PA_BOT_MEM actually represents program counter samples that were below the program being tested. Typically, this represents the fraction of time spent in DOS support functions. The regression tester outputs a significant amount of results to the display, so time spent scrolling the PC's screen, for example, is significant.
Other routines that show up significantly--ucheckm, ucmpm, isubm, iaddm, umultm, udivm, and itestm--are all integer math routines used to support the floating-point package. Some Lattice library routines also show up (_eufbs, _nmalloc2, _nfreex, and _nchkb) due to the extensive use of dynamic memory allocation by the floating-point routines.
Statistical Performance Analysis Techniques
The examples in Figures 2 and 3 demonstrate that the relative time required by different modules can be determined from periodic program counter sampling. A logical question is, "So why isn't statistical performance analysis more widely accepted?"
The main argument against statistical performance analysis is that the entire execution path of a program is not sampled. Only isolated snapshots of where the program is executing are sampled, so the results of statistical performance analysis may not be accurate.
The nature of possible inaccuracies is difficult to characterize and depends on the program being analyzed, the sampling period, and total number of samples taken. Most statistical methodology assumes the sampling process is a series of independent events (such as repeated flipping of a coin). The results of such random sampling are well characterized by the binomial distribution, as is the range of variation in possible outcomes.
In contrast, programs are not random: There is a well-defined sequence of events. Starting the same program twice from the same initial conditions produces the same results. Sampling the program counter at periodic intervals is not a series of independent events.
The assumption for statistical performance analysis is that for sufficiently complex programs, appropriate sampling periods, and appropriate sampling durations, the program counter sampling will approximate independent sampling.
The nature of problems encountered with statistical performance analysis samples is somewhat similar to that of problems associated with pseudorandom number generation programs. Fortunately, the general complexity of most programs makes selection of reasonable parameters for statistical performance analysis much easier than selection of reasonable pseudorandom number generation parameters.
Several factors affect the variation of results between one statistical performance analysis trial and the next:
- Change in the sampling period
- Change in the total sampling time
- Nonidentical initial conditions: The program sampling may not be identically synchronized with the program being sampled.
- Identical initial conditions, but varying synchronization relative to the program being sampled. This is true when the program sampling is driven off a separate clock circuit from the processor clock circuit running the program.
- External events that alter the execution path of the program being sampled: unsynchronized inputs, interrupts, DMA, and so on.
Table 1: Conditions for good statistics
Where:
  t = program counter sampling interval
  T = total sampling time
  N = number of program modules
  n = total number of program samples (n = T/t)

Then, for good sampling:
  t << T (take a lot of samples)
  N << n (there are many samples in each module)
  t != some basic periodicity of the program
  T >> basic periodicity of the program, or
  T == some multiple of the basic periodicity of the program, or of the part of the program being tested.
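The conditions in Table 1 can be captured in a small helper for quick sanity checks before a trial. This is purely illustrative: the factor of 100 standing in for "much greater than" is an arbitrary threshold of my choosing, not something the article prescribes.

```c
/* Rough numeric check of the Table 1 conditions t << T and N << n,
   where t is the sampling interval, T the total sampling time, and
   N the number of program modules. The factor of 100 is an arbitrary
   illustrative threshold for "much greater than". */
int good_sampling(double t, double T, unsigned long N)
{
    double n = T / t;                 /* total number of samples, n = T/t */

    return (n >= 100.0)              /* t << T: take a lot of samples     */
        && (n >= 100.0 * (double)N); /* N << n: many samples per module   */
}
```

For example, sampling every millisecond for ten seconds over 16 modules gives n = 10,000 samples, comfortably satisfying both conditions.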
The examples in Figures 4 and 5 illustrate the difference between good and bad statistics. In both cases, a simple test program is analyzed. The program consists of a main function and 15 identical delay loop subfunctions. The main function contains a loop that contains one call to each of the 15 subfunctions. Good statistics should show all 15 functions requiring the same fraction of the total execution time. (See Listing One for the test program patest.c.)
Figure 4(a) illustrates a "good" sampling period, in which all of the delay loop functions should have about the same number of samples. Trial program configuration (patest.c): looptoloopN = 5120, scaleN = 2, delay_one = 8, delay_two = 8, delay_three = 8, and so on. This was run using an ES 1800 68020 emulator with a 12.5-MHz "Demon" target for two seconds, with 3850 bus cycles between samples.
Figure 4(b) illustrates a "bad" sampling period, in which the sampling period matches a fundamental periodicity of the program. The delay loop functions are short delays, and the number of main loop iterations is large. Delay functions delay_eight and delay_fifteen have a disproportionately large share of samples. Trial program configuration (patest.c): looptoloopN = 5120, scaleN = 2, delay_one = 8, delay_two = 8, delay_three = 8, and so on. This was run using an ES 1800 68020 emulator with a 12.5-MHz "Demon" target for two seconds, with 3900 bus cycles between samples.
Figure 4: Sampling periods: (a) good (b) bad
(a) Sample output:

Module table sorted by sample counts: 1377 samples collected
module            samples   %   0  1  2  3  4  5  6  7  8  9  10
.main                  91   7   *****************************
.delay_seven           87   6   *************************
.delay_thirteen        87   6   *************************
.delay_one             86   6   *************************
.delay_eleven          86   6   *************************
.delay_fourteen        86   6   *************************
.delay_twelve          86   6   *************************
.delay_ten             86   6   *************************
.delay_five            86   6   *************************
.delay_six             86   6   *************************
.delay_nine            85   6   *************************
.delay_fifteen         85   6   *************************
.delay_three           85   6   *************************
.delay_four            85   6   *************************
.delay_two             84   6   *************************
.delay_eight           84   6   *************************

(b) Sample output:

Module table sorted by sample counts: 1359 samples collected
module            samples   %   0  1  2  3  4  5  6  7  8  9  10
.delay_eight          160  12   ************************
.delay_fifteen        158  12   ************************
.delay_four            81   6   *************
.delay_nine            80   6   *************
.delay_eleven          80   6   *************
.delay_five            80   6   *************
.delay_twelve          80   6   *************
.delay_seven           80   6   *************
.delay_thirteen        80   6   *************
.delay_one             80   6   *************
.delay_two             80   6   *************
.delay_fourteen        80   6   *************
.delay_ten             80   6   *************
.delay_three           79   6   *************
.delay_six             79   6   *************
Figure 5(a) illustrates "good" total sampling time, in which all of the delay loop functions should have about the same number of samples. Trial program configuration (patest.c): looptoloopN = 5, scaleN = 10, delay_one = 8, delay_two = 8, delay_three = 8, and so on. This was run using an ES 1800 68020 emulator with a 12.5-MHz "Demon" target for two seconds, with 4000 bus cycles between samples.
Figure 5(b) shows "bad" total sampling time. Half of the delay loop functions have significantly more samples than expected and the other half have fewer. In this case, the total sampling time was not a multiple of a fundamental periodicity of the program, the delay loop functions are long delays, and the number of main loop iterations is small (delay = 8192, loop iterations = 5). Trial program configuration (patest.c): looptoloopN = 5, scaleN = 10, delay_one = 8, delay_two = 8, delay_three = 8, and so on. This was run using an ES 1800 68020 emulator with a 12.5-MHz "Demon" target for two seconds, with 2500 bus cycles between samples.
Figure 5: Sampling time: (a) good (b) bad
(a) Sample output:

Module table sorted by sample counts: 1343 samples collected
module            samples   %   0  1  2  3  4  5  6  7  8  9  10
.delay_one             85   6   *************************
.delay_eleven          85   6   *************************
.delay_six             85   6   *************************
.delay_eight           83   6   *************************
.delay_three           83   6   *************************
.delay_fourteen        83   6   *************************
.delay_four            82   6   *************************
.delay_nine            82   6   *************************
.delay_thirteen        82   6   *************************
.delay_ten             80   6   *************************
.delay_twelve          80   6   *************************
.delay_two             80   6   *************************
.delay_seven           80   6   *************************
.delay_fifteen         80   6   *************************
.delay_five            80   6   *************************

(b) Sample output:

Module table sorted by sample counts: 2000 samples collected
module            samples   %   0  1  2  3  4  5  6  7  8  9  10
.delay_fourteen       132   7   ******************************
.delay_nine           131   7   ******************************
.delay_ten            131   7   ******************************
.delay_one            131   7   ******************************
.delay_eleven         131   7   ******************************
.delay_fifteen        131   7   ******************************
.delay_twelve         131   7   ******************************
.delay_thirteen       130   7   ******************************
.delay_eight          106   5   **********************
.delay_six            106   5   **********************
.delay_four           105   5   **********************
.delay_two            105   5   **********************
.delay_three          105   5   **********************
.delay_seven          105   5   **********************
.delay_five           104   5   **********************
In actual practice, selection of the total sampling time and the sampling frequency is rather subjective. Typically the total sampling time is either:
A. The total execution time of the program. For a program that has a definite, finite start and end point (such as the floating-point package regression tester presented earlier) and runs for seconds or minutes, this is a reasonable choice.

B. The execution time of a particular operation within a much larger program. Here the goal is to isolate the operation being tested from the rest of the program. A good example of this is the actual download of program data from a host system to the target system via an emulator.
The sampling frequency is typically either fixed in hardware or variable. On the PC, the clock-tick frequency is fixed, which makes checking for the effects of natural program periodicity difficult. Using a PC with a different CPU clock frequency or processor has the same effect as changing the sampling rate. If either the interrupt frequency or the rate of external hardware sampling can be varied, then checking for the effects of natural program periodicity is relatively easy.
Unless there are periodic interrupts in the system being tested, natural periodicities in the program are difficult to determine. The best way to check for natural periodicity is to vary the sampling frequency and see how it affects the results. If periodic interrupts are present, or an obvious natural periodicity exists, then the sampling period should be changed such that t' = (irrational #)*t, where t is the natural period (or the previous sampling period) and t' is the sampling period to try.
Irrational numbers such as pi, pi/2, pi/3, e, and e/2 are good initial choices. This assumes that the natural periodicity is approximately equal to the sampling period and is much less than the total sampling time, where t = natural period << T.
Using an irrational number minimizes the probability that the new sampling period is also some multiple of the program's natural period. However, due to the finite nature of the sampling, even choosing an irrational period ratio does not guarantee good statistics.
The best way to guarantee reasonable results is to try several different sampling periods.
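The t' = (irrational #)*t rule lends itself to a small helper that generates a set of candidate sampling periods to try. This is a sketch of my own, not code from the article's package; it simply applies the article's suggested scale factors (pi, pi/2, pi/3, e, e/2) to a starting period.

```c
#include <stddef.h>

/* Irrational scale factors suggested in the text. */
#define PA_PI 3.14159265358979323846
#define PA_E  2.71828182845904523536

/* Given a suspected natural period t (or a previous sampling period),
   fill outA with up to maxN candidate periods t' = (irrational #) * t.
   Returns the number of candidates written. */
size_t propose_periods(double t, double *outA, size_t maxN)
{
    const double scaleA[] = { PA_PI, PA_PI / 2.0, PA_PI / 3.0,
                              PA_E,  PA_E  / 2.0 };
    size_t n = sizeof scaleA / sizeof scaleA[0];
    size_t i;

    if (n > maxN)
        n = maxN;
    for (i = 0; i < n; i++)
        outA[i] = scaleA[i] * t;
    return n;
}
```

Running each proposed period as a separate trial and comparing the module tables is exactly the "try several different sampling periods" strategy recommended above.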
Program Counter Sampling Methods
The most critical step in statistical performance analysis is collecting the sample program counter values. The most common ways to do this are:
- Native interrupt-driven sampling. A periodic, hardware-generated interrupt is used to trigger an interrupt routine which reads the program counter value pushed onto the system stack when the interrupt occurred. The program counter value is written to a part of the system's memory that the program being tested does not affect.
- External hardware sampling. External hardware (such as an emulator with trace memory) periodically samples the processor's address and status lines. Sampling the address lines during instruction fetch cycles gives a fairly accurate picture of the current program counter. The instruction fetch cycles are recorded into trace memory contained in the external hardware.
Native Interrupt Sampling on PCs
The advantage of using a PC is that it is a common software development platform. Good quality compilers, editors, and other development tools are readily available and inexpensive.
The PC has a clock-tick hardware interrupt (INT 8) that occurs every 54.9 milliseconds (about 18.2 times per second). This interrupt is used for timekeeping by the PC's BIOS routines and by MS-DOS. (See the Microsoft MS-DOS Programmer's Reference, the XT Technical Reference, or the AT Technical Reference.) The rate of 18.2 ticks per second is used so that there are approximately 2^16 (65,536) clock ticks per hour.
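These figures follow from the PC's timer hardware: the 8253/8254 timer chip is clocked at 1,193,182 Hz, and the BIOS leaves its channel 0 divisor at the full 16-bit count of 65,536 before each INT 8. A quick computation confirms the rates quoted above:

```c
/* The PC's 8253/8254 timer channel 0 is driven at 1,193,182 Hz and,
   by default, counts through a full 16-bit divisor (65,536) before
   firing INT 8 -- the familiar PC clock tick. */
double tick_frequency_hz(void)
{
    return 1193182.0 / 65536.0;   /* about 18.2 interrupts per second */
}

double tick_period_ms(void)
{
    return 1000.0 / tick_frequency_hz();   /* about 54.9 ms per tick */
}
```

Note that 65,536 ticks at this rate take very slightly more than an hour, which is why "2^16 ticks per hour" is an approximation.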
The clock-tick interrupt vector normally points to a timekeeper service routine in either the BIOS or MS-DOS. Prior to running the program to test, the vector is modified to point to a local program counter sampler interrupt routine. This routine samples the pre-interrupt program counter value on the stack and saves it into a local data buffer not affected by the program being tested. After sampling the PC value, control is transferred to the original timekeeper interrupt service routine. The interrupt vector is restored after the program being tested completes execution. An example is the source code file patick.asm (Listing Two, page 101).
The most significant limitation of doing native, interrupt-driven sampling on a PC is that the interrupt rate is fixed at a relatively slow rate. A fixed sampling rate does not allow determination of possible effects of natural periodicity in the program being tested. The only way to investigate this is to use several PCs with different processor clock rates. The sampling rate stays the same, but the execution speed of the program changes, depending on the PC used.
Another problem with using a PC for performance analysis is that programs are dynamically relocated at runtime. Memory maps generally give module addresses relative to the origin of the program. The origin of the program is determined at runtime. This requires either a modified loader that returns the program origin address, or use of a trial program that returns where it has been loaded into memory. The latter approach is presented in the source code files pawhere.c (Listing Three, page 103) and pamsdos.c (Listing Four, page 103).
A more general limitation of native, interrupt-driven sampling is that it requires system resources. Memory is required for the sampling program and for the program counter sample buffer, and this memory competes with the program being tested. A periodic interrupt must be provided to drive the sampling, and the sampling interrupt service routine requires some execution time, which also competes with the program being tested. In most cases, using these system resources does not significantly affect the program being tested, but in some cases it will.
Analysis Methods
Once the program counter samples have been collected by either native interrupt sampling or external hardware, the samples can be processed on either the native system or an alternate host system.
Reading the memory map for the program being tested is an easy way to determine the beginning and ending address of each module within it. The file pardmap.c (accessible electronically; see "Availability" at the end of this article) is an example of a generic map file reader program. The map file reader reads the map file and creates a data table of module names, address ranges, and number of samples. It then sorts the table according to address values.
For each program counter sample, a binary search is done to determine which module's sample count to increment. The file pautil.c does the binary search and increments the appropriate sample count. This file, along with the make file, the required header file padef.h, and sample configuration files for the Microtec and Intermetrics cross linkers and for the Lattice, Zortech, and Microsoft PC native linkers, is available electronically.
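The binary search step can be sketched as follows, assuming a module table sorted by starting address. The structure and names here are illustrative stand-ins for the actual tables built by pardmap.c and searched by pautil.c, whose field names may differ.

```c
#include <stddef.h>

/* One entry of a module table sorted by start address. */
struct pa_module {
    unsigned long start;    /* module base address           */
    unsigned long end;      /* one past the module's last address */
    unsigned long samples;  /* tally of PC samples seen here */
};

/* Binary-search the sorted table for the module containing pc and
   increment its sample count. Returns the module's index, or -1 if
   pc lies outside every module (e.g. time spent in DOS support code
   below the program, as with PA_BOT_MEM in Figure 3). */
long pa_count_sample(struct pa_module *tabA, size_t n, unsigned long pc)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (pc < tabA[mid].start)
            hi = mid;
        else if (pc >= tabA[mid].end)
            lo = mid + 1;
        else {
            tabA[mid].samples++;
            return (long)mid;
        }
    }
    return -1;
}
```

Because the table is sorted once up front, each of the thousands of samples is classified in O(log N) comparisons rather than a linear scan over every module.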
So, Why Aren't You Using Statistical Performance Analysis?
For execution time critical programs, the benefits of using statistical performance analysis far outweigh the potential problems due to questionable accuracy of results. With proper understanding of the statistical performance analysis technique, most potential problems can be eliminated, making statistical performance analysis a useful, practical tool.
Availability
In addition to the code availability sources listed on page 3, complete source code and PC executable versions of pamsdos.exe and paes1800.exe are available on a 5 1/4-inch, 360K floppy disk from Applied Microsystems. The executable versions were compiled from the given source code using the Lattice C compiler for the PC, version 6.01. Both programs are fully commented and include fairly complete usage messages.
[LISTING ONE]
/* patest.c -- A collection of simple routines to test the accuracy of
** statistical performance analysis programs for the PC and ES 1800 emulator.
*/

/* Default delay timing parameters */
#define DELAY_ONE 1
#define DELAY_TWO 2
#define DELAY_THREE 3
#define DELAY_FOUR 4
#define DELAY_FIVE 5
#define DELAY_SIX 6
#define DELAY_SEVEN 7
#define DELAY_EIGHT 8
#define DELAY_NINE 9
#define DELAY_TEN 10
#define DELAY_ELEVEN 11
#define DELAY_TWELVE 12
#define DELAY_THIRTEEN 13
#define DELAY_FOURTEEN 14
#define DELAY_FIFTEEN 15
#define DELAY_SCALE 10      /* Effectively multiplies values by 1000 */
#define DELAY_LOOPS 5       /* Default number of times thru main loop */

/* Loop delay parameters. These are done as globals to allow easy access to
** the timing parameters via the ES 1800 emulator. This allows different
** timing configurations to be tested without having to recompile and link
** this code. Kludgie, but it encourages easy experimentation.
*/
long final_sumL = 0;
int time_oneN = DELAY_ONE;
int time_twoN = DELAY_TWO;
int time_threeN = DELAY_THREE;
int time_fourN = DELAY_FOUR;
int time_fiveN = DELAY_FIVE;
int time_sixN = DELAY_SIX;
int time_sevenN = DELAY_SEVEN;
int time_eightN = DELAY_EIGHT;
int time_nineN = DELAY_NINE;
int time_tenN = DELAY_TEN;
int time_elevenN = DELAY_ELEVEN;
int time_twelveN = DELAY_TWELVE;
int time_thirteenN = DELAY_THIRTEEN;
int time_fourteenN = DELAY_FOURTEEN;
int time_fifteenN = DELAY_FIFTEEN;
int scaleN = DELAY_SCALE;
int looptoloopN = DELAY_LOOPS;

/* Function: long delay_xxxx(int delayN)
** Description: These are simple functions designed to allow varied delays.
** The code in each delay function is identical to the code in all of the
** other delay functions. This allows accurate comparison of the relative
** execution time of each function. Fifteen of these functions should be a
** reasonable number to represent a simple "real" program.
*/
long delay_one(delayN)
int delayN;
{
    int i;
    long sumL;

    sumL = 0L;
    delayN <<= scaleN;
    for (i = 0; i < delayN; i++)
        sumL += (long) i;
    return(sumL);
}

long delay_two(delayN)
int delayN;
{
    int i;
    long sumL;

    sumL = 0L;
    delayN <<= scaleN;
    for (i = 0; i < delayN; i++)
        sumL += (long) i;
    return(sumL);
}

. . . .

long delay_fifteen(delayN)
int delayN;
{
    int i;
    long sumL;

    sumL = 0L;
    delayN <<= scaleN;
    for (i = 0; i < delayN; i++)
        sumL += (long) i;
    return(sumL);
}

/* Function: void main()
** Description: This is a simple routine to run the various delay routines.
** The delay time variables are all globals to allow experimentation with
** the timing parameters using the ES 1800 emulator.
*/
void main()
{
    int i;

    final_sumL = 0L;
    for (i = 0; i < looptoloopN; i++) {
        final_sumL += delay_one(time_oneN);
        final_sumL += delay_two(time_twoN);
        final_sumL += delay_three(time_threeN);
        final_sumL += delay_four(time_fourN);
        final_sumL += delay_five(time_fiveN);
        final_sumL += delay_six(time_sixN);
        final_sumL += delay_seven(time_sevenN);
        final_sumL += delay_eight(time_eightN);
        final_sumL += delay_nine(time_nineN);
        final_sumL += delay_ten(time_tenN);
        final_sumL += delay_eleven(time_elevenN);
        final_sumL += delay_twelve(time_twelveN);
        final_sumL += delay_thirteen(time_thirteenN);
        final_sumL += delay_fourteen(time_fourteenN);
        final_sumL += delay_fifteen(time_fifteenN);
    }
}

[LISTING TWO]
        TITLE patick - IBM PC / Clone Clock Tick CS:IP Grabber
; File: patick.asm -- Fred Motteler and Applied Microsystems Corporation
; Copyright 1990. All Rights Reserved
; Description:
;   This file contains three functions:
;   C callable:
;     void painit(bufferLP, lengthN)  Initialize grabber interrupt vector
;     int paclose()                   Close grabber interrupt vector
;   Interrupt routine, this is treated like part of painit():
;     patick                          Grab CS:IP value
;   These functions are configured for small model.
;
; Stack frame structure for painit():
stkfr   STRUC
OLD_FR  DW ?            ; Previous stack frame pointer
RETADDR DW ?            ; Return address to caller
BUFFERP DW ?            ; Pointer to buffer to use
BUFLEN  DW ?            ; Length of buffer (in longwords)
stkfr   ENDS
;
; Stack frame structure for clock tick timer routine.
intfr   STRUC
INT_FR  DW ?            ; Pre-interrupt stack frame pointer
IP_VAL  DW ?            ; Pre-interrupt IP value
CS_VAL  DW ?            ; Pre-interrupt CS value
intfr   ENDS

TIMER   EQU 8h          ; Timer interrupt vector number

DGROUP  GROUP _DATA
_DATA   SEGMENT WORD PUBLIC 'DATA'
        ASSUME DS:DGROUP
bufptr  DW 0            ; Starting point of buffer
bufsiz  DW 0            ; Number of longwords in the buffer
bufindx DW 0            ; Next location of buffer to use
bufwrap DB 0            ; Flag if buffer has wrapped...
_DATA   ENDS

_TEXT   SEGMENT BYTE PUBLIC 'CODE'
        ASSUME CS:_TEXT
;
; void paopen (unsigned long *bufferLP, int lengthN)
; This is a C callable function to initialize the CS:IP grabber and
; start it up. bufferLP points to the buffer where CS:IP values are
; written. lengthN is the length of the buffer in longwords.
        PUBLIC paopen
paopen  PROC NEAR
        push bp
        mov bp,sp
        push si
        push di
        push es
;
; Set up the local buffer pointer values from those passed on the stack.
        mov ax,[bp].BUFFERP     ; Get pointer to start of buffer
        mov bufptr,ax
        mov ax,[bp].BUFLEN      ; Get length of the buffer
        shl ax,1                ; convert longword length to byte length
        shl ax,1
        mov bufsiz,ax
        xor ax,ax               ; Start at the beginning of the buffer
        mov bufindx,ax
        mov bufwrap,al          ; Reset buffer wrap flag
;
; Save the original clock tick interrupt vector.
        mov al,TIMER            ; interrupt number into al
        mov ah,35h              ; DOS function = get vector
        int 21h                 ; DOS returns old vector in es:bx
        mov cs:oldseg,es        ; save old segment
        mov cs:oldoff,bx        ; save old offset
;
; Disable interrupts while changing the interrupt vector.
        cli
;
; Change clock tick interrupt routine to point at local interrupt routine.
        mov al,TIMER            ; vector number
        mov ah,25h              ; DOS function = set vector
        mov dx,OFFSET patick    ; point to our interrupt handler
        push ds                 ; don't lose ds, we need to get to local data
        push cs                 ; move this cs to ds
        pop ds                  ;
        int 21h                 ; set the new vector
        pop ds                  ; restore ds
;
; Enable interrupts and return.
        pop es
        pop di
        pop si
        pop bp
        sti
        ret
;
; Clock tick grabber routine. This routine samples the CS:IP values that
; were pushed onto the stack when the interrupt occurred.
patick: push bp
        mov bp,sp               ; Treat CS:IP values like stack frame
        push ax
        push bx
        push ds
;
; Get the local ds to allow access to local variables.
        mov ax,DGROUP
        mov ds,ax
;
; Use bx as a pointer to the recording buffer
        mov bx,bufptr
        add bx,bufindx
;
; Grab the pre-interrupt CS:IP values off the stack
        mov ax,[bp].IP_VAL      ; grab the IP
        mov [bx],ax             ; save the IP in the recording buffer
        inc bx
        inc bx
        mov ax,[bp].CS_VAL      ; grab the CS
        mov [bx],ax             ; save the CS in the recording buffer
        inc bx
        inc bx
;
; Check if we are at the end of the buffer
        sub bx,bufptr           ; get the byte offset index back again
        mov ax,bufsiz           ; get the buffer byte length
        cmp ax,bx
        jne notend              ; jump if not at the end of the buffer
;
; At the end of the buffer
        mov bx,0                ; reset the buffer index
        mov al,0ffh             ; set flag to indicate buffer wrap
        mov bufwrap,al
;
; Write out modified buffer index
notend: mov bufindx,bx
;
; Clean up
        pop ds
        pop bx
        pop ax
        pop bp
;
; Jump to the original interrupt service routine. An immediate jump
; is used so no segment registers are required.
        DB 0eah                 ; jmp immediate, to the offset:segment
                                ; selected below (brute force approach).
; Original interrupt handler's offset and segment values. These are
; in the current code segment to allow the interrupt routine given
; here to directly jump to the original interrupt routine.
oldoff  DW 0                    ; Room for original timer interrupt offset
oldseg  DW 0                    ; Room for original timer interrupt segment
paopen  ENDP
;
; int paclose()  This is a C callable function to close the CS:IP grabber
; and return the number of CS:IP values grabbed.
        PUBLIC paclose
paclose PROC NEAR
        push bp
        mov bp,sp
        push si
        push di
        push es
;
; Disable interrupts while the original interrupt vector is restored.
        cli
        mov al,TIMER            ; get interrupt number
        mov ah,25h              ; DOS function = set vector
        push ds                 ;
        mov dx,cs:oldoff        ; old timer offset
        mov ds,cs:oldseg        ; old timer segment
        int 21h                 ; restore old vector
        pop ds                  ;
;
; Enable interrupts.
        sti
;
; Calculate the number of CS:IP values
        cmp bufwrap,0           ; check if the buffer has wrapped
        jne wrapped             ; jump if it has wrapped
        mov ax,bufindx          ; no wrap, return buffer index as count
        jmp done
wrapped: mov ax,bufsiz          ; wrapped, return buffer size as count
;
; Clean up stack and return
done:   shr ax,1                ; Return count in number of CS:IP pairs
        shr ax,1
        pop es
        pop di
        pop si
        pop bp
        ret
paclose ENDP
_TEXT   ENDS
        END

[LISTING THREE]
/* pawhere.c -- contains a very simple program that returns its segment base
** address. Note that this program is Lattice version 6.01 specific in that
** the Lattice small model has "main" at the beginning of the executable
** portion of the program. Other compiler/linker packages may require that
** the program map be examined for the module that starts the program.
** Copyright 1990 Fred Motteler and Applied Microsystems Corporation
*/
#include <dos.h>
#include <stdio.h>

unsigned int main()
{
    FILE *fp;

    fp = fopen("pawhere.tmp", "w");
    fprintf(fp, "%x %x\n", (FP_SEG((char far *) main)),
                           (FP_OFF((char far *) main)));
    fclose(fp);
    exit(0);
}

[LISTING FOUR]
/* pamsdos.c -- Utility functions used by MS-DOS version of the statistical
** performance analysis package.
** Copyright 1990 Fred Motteler and Applied Microsystems Corporation */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <string.h>
#include "padef.h"

/* Function: int main( argcN, argvAS )
** Description: MS-DOS based statistical performance analysis program.
** Command line arguments: pamsdos prog.map prog.cfg prog.exe options
** Where: prog.map = memory map for program; prog.cfg = memory map
** configuration; prog.exe = program to run; options = command options
** for the program to run */
int main( argcN, argvAS )
int argcN;
char *argvAS[];
{
    int errorN;                  /* Error code */
    unsigned int segmentW;       /* Starting load address of program to run */
    unsigned int offsetW;
    unsigned long originL;
    int processedN;              /* Number of map globals processed */
    int i;                       /* General index */
    FILE *mapFP;                 /* Map file to read */
    FILE *formatFP;              /* File with map file format information */
    char commandAC[PA_LINE_LEN]; /* Complete command line for program */
    int pagelinesN;              /* Number of lines on output page, 0 if
                                  * continuous, -1 if no display output, else
                                  * n if n lines per page. */
    FILE *listFP;                /* Results output file */
    char listAB[80];             /* Optional results listing file path/name */
    char pagelinesAB[8];         /* String for number of lines/page */

    printf("pamsdos - Statistical performance analysis tool for MS-DOS\n");
    printf("Version %s\n", PA_VERSION);
    printf("Copyright (C) 1990 Fred Motteler and Applied Microsystems Corp\n");
    if (argcN < 4) {
        printf("Usage: pamsdos prog.map prog.cfg prog.exe [options]\n");
        printf("  Where: prog.map   memory map for program\n");
        printf("         prog.cfg   memory map configuration file\n");
        printf("         prog.exe   program to run\n");
        printf("         [options]  command line options for program to run\n");
        exit(-100);
    }
    /* Determine where the program to run is to be located. */
    if ((errorN = pa_locate(&segmentW, &offsetW)) != 0) {
        pa_error(errorN);
        exit(errorN);
    }
    /* Calculate origin of program. Room must be allowed for memory
     * malloc()'d off the heap. */
    originL = (unsigned long) (segmentW + 1);
    originL <<= 4;
    originL += (unsigned long) (offsetW - 2);
    originL += (unsigned long) (PA_BUFLEN << 2);
    if ((pa_debugN & PA_GENERAL) != 0) {
        printf("program start segment:offset %x:%x\n", segmentW, offsetW);
        printf("              linear address %lx\n", originL);
    }
    /* Get the complete command line to invoke the program. */
    strcpy(commandAC, argvAS[3]);
    if (argcN > 4) {
        for (i = 4; i < argcN; i++) {
            strcat(commandAC, " ");
            strcat(commandAC, argvAS[i]);
        }
    }
    /* Run the program and collect samples. */
    printf("Starting %s\n", argvAS[3]);
    if ((errorN = pa_pcsample(commandAC, PA_SAMPLE, PA_BUFLEN)) != 0) {
        pa_error(errorN);
        exit(errorN);
    }
    /* Read in the configuration file to get map format information and
     * to get number of lines / display page and option listing file. */
    if ((formatFP = fopen(argvAS[2], "r")) == (FILE *) NULL) {
        pa_error(PA_NO_CFG_E);
        exit(PA_NO_CFG_E);
    }
    /* Read in display lines, and optional output file configuration data
     * from the configuration file. */
    if (((errorN = paconfig(formatFP, PA_PAGELINES, pagelinesAB)) != 0) ||
        ((errorN = paconfig(formatFP, PA_LISTFILE, listAB)) != 0)) {
        pa_error(errorN);
        fclose(formatFP);
        exit(errorN);
    }
    /* Determine the number of lines/page to display */
    if (sscanf(pagelinesAB, "%d", &pagelinesN) != 1) {
        pa_error(PA_BAD_ARG_E);
        fclose(formatFP);
        exit(PA_BAD_ARG_E);
    }
    /* Open the optional listing file */
    if (listAB[0] == '\0')
        listFP = (FILE *) NULL;
    else if ((listFP = fopen(listAB, "w")) == (FILE *) NULL) {
        pa_error(PA_NO_LST_E);
        fclose(formatFP);
        exit(PA_NO_LST_E);
    }
    /* Read program's memory map and create "bins" for program counter
     * samples. */
    if ((mapFP = fopen(argvAS[1], "r")) == (FILE *) NULL) {
        pa_error(PA_NO_MAP_E);
        fclose(formatFP);
        exit(PA_NO_MAP_E);
    }
    if ((errorN = pardmap(mapFP, formatFP, originL, &processedN)) != 0) {
        pa_error(errorN);
        fclose(mapFP);
        fclose(formatFP);
        exit(errorN);
    }
    /* Process the samples and sort the bins according to the PC hits in
     * each bin. */
    printf("Processing samples\n");
    if ((errorN = pa_bstuff(PA_SAMPLE, patableAHP, &processedN)) != 0) {
        pa_error(errorN);
        fclose(mapFP);
        fclose(formatFP);
        exit(errorN);
    }
    /* Display the results */
    padisply(patableAHP, processedN, pagelinesN, listFP);
    fclose(mapFP);
    fclose(formatFP);
    exit(0);
}

/* Function: int pa_locate(unsigned int *segmentPW, unsigned int *offsetPW)
** Description: This function figures out where in memory the program to be
** analyzed is to be run. MS-DOS executables are dynamically located at
** runtime. In order to avoid the complexity of writing a DOS ".exe" loader
** program, a simpler approach is used here. This function uses the ANSI
** system() library function to execute a trial program, "pawhere.exe", that
** writes its starting code segment and offset to a temporary file,
** "pawhere.tmp". After "pawhere.exe" has finished, this function opens the
** temporary file and reads the starting segment and offset value. It is
** assumed that the desired program to be tested will have the same starting
** code segment and offset. If all operations were successful, then 0 is
** returned. Otherwise a non-zero error code will be returned. */
int pa_locate(segmentPW, offsetPW)
unsigned int *segmentPW;
unsigned int *offsetPW;
{
    FILE *fp;

    /* First figure out where the program will be loaded. Run "pawhere.exe"
     * via a system() function call. */
    if ((system("pawhere")) != 0)
        return(PA_NO_WHERE_E);
    /* Read in the result from pawhere.tmp. */
    if ((fp = fopen("pawhere.tmp", "r")) == (FILE *) NULL)
        return(PA_NO_TMP_E);
    if ((fscanf(fp, "%x %x", segmentPW, offsetPW)) != 2) {
        fclose(fp);
        return(PA_BAD_TMP_E);
    }
    fclose(fp);
    if (remove("pawhere.tmp") != 0)
        return(PA_TMP_RM_E);
    return(0);
}

/* Function: int pa_pcsample(char *programS, char *sampfileS, int samplesN)
** Description: This function runs the program (entire command line) pointed
** to by programS, while sampling its program counter every PC clock tick.
** Up to samplesN program counter samples are collected, and then written
** out in binary format to the file sampfileS. */
int pa_pcsample(programS, sampfileS, samplesN)
char *programS;     /* Command line of program to run */
char *sampfileS;    /* File to use to write out pc samples */
int samplesN;       /* Maximum number of samples to collect */
{
    unsigned long *pcbufferPL;  /* Long pointer to local pc sample buffer */
    unsigned int *pcbufferPW;   /* Word pointer to local pc sample buffer */
    unsigned long *pcorgPL;     /* Original copy of pointer to pc sample buf */
    unsigned int segmentW;      /* Starting segment of program to run */
    unsigned int offsetW;       /* Starting offset of program to run */
    int handleN;                /* pc sample file handle */
    unsigned long sampleL;      /* segment:offset sample converted to linear */
    int i;                      /* general index */

    /* Grab memory for the sample buffer */
    if ((pcbufferPL = (unsigned long *) malloc((4*samplesN))) ==
        (unsigned long *) NULL)
        return(PA_NO_MEM_E);
    /* Copy buffer pointer to allow word (int) access as well as long
     * access. */
    pcbufferPW = (unsigned int *) pcbufferPL;
    pcorgPL = pcbufferPL;
    /* Start CS:IP sampling */
    paopen(pcbufferPW, samplesN);
    /* Run the desired program. */
    if (system(programS) != 0) {
        paclose();
        return(PA_NO_EXEC_E);
    }
    /* Stop sampling */
    samplesN = paclose();
    /* Convert the samples from segment:offset to linear addresses relative
     * to the origin of the loaded program. */
    if ((pa_debugN & PA_GENERAL) != 0)
        printf("pa_pcsample: number of samples: %d\n", samplesN);
    for (i = 0; i < samplesN; i++) {
        /* Read segment:offset value from the table. */
        offsetW = *pcbufferPW++;
        segmentW = *pcbufferPW++;
        if ((pa_debugN & PA_GENERAL) != 0)
            printf("pa_pcsample: sample segment:offset %x:%x\n",
                segmentW, offsetW);
        /* Convert it to a linear address. */
        sampleL = ((unsigned long) offsetW) +
            (((unsigned long) segmentW) << 4);
        /* Write the linear address back to the table. */
        *pcbufferPL++ = sampleL;
        if ((pa_debugN & PA_GENERAL) != 0)
            printf("pa_pcsample: linear sample %lx\n", sampleL);
    }
    /* Write the samples to a binary file. */
    if ((handleN = open(sampfileS, (O_CREAT | O_WRONLY | O_RAW), 0)) == (-1)) {
        free(pcorgPL);
        return(PA_NO_PC_FILE_E);
    }
    if ((write(handleN, ((char *) pcorgPL), (samplesN << 2))) !=
        (samplesN << 2)) {
        close(handleN);
        free(pcorgPL);
        return(PA_NO_PC_WR_E);
    }
    close(handleN);
    free(pcorgPL);
    return(0);
}
Copyright © 1991, Dr. Dobb's Journal