Statistical Performance Analysis

Understanding statistical performance analysis helps you improve time-critical application execution and avoid potential problems.


December 01, 1991
URL:http://www.drdobbs.com/parallel/statistical-performance-analysis/184408672


Looking for quality time

This article contains the following executables: STAT.ARC

Fred Motteler

Fred has been a software engineer at Applied Microsystems Corporation since receiving a Ph.D. in physics from the University of Washington in 1986. In addition to his work at Applied Microsystems, he has worked on a number of embedded real-time controller applications. He can be contacted at Applied Microsystems Corporation, 5020 148th Ave. N.E., P.O. Box 97002, Redmond, WA 98073-9702.


Statistical performance analysis is a practical technique for determining which modules in a program require the most execution time. By knowing this, you can improve performance of time-critical applications. Consecutive trials can then help zero in on effective coding changes for execution time reduction. While the techniques presented here are ideally suited for embedded applications--and were in fact developed in part to squeeze more performance out of embedded systems development tools, as described later--the implementation is general-purpose enough to apply to just about any time-critical application.

Where I work (Applied Microsystems), statistical performance analysis has been used to improve time-critical application performance on several occasions. The most notable was the development of the SCSI communication option for our ES 1800 emulator for 680x0, 80x86, and Z800x processors. Use of statistical performance analysis allowed both host and emulator software to be optimized for speed. This resulted in data transfer rates two to five times faster than would otherwise have been possible.

Another example of statistical performance analysis use involved development of an arbitrary-precision, portable-C, IEEE P754 format-compatible floating-point arithmetic package. Again, performance analysis was used to determine which routines to optimize for speed and how effective the modifications were.

Performance Analysis Steps

The basic idea of statistical performance analysis is to periodically sample the program counter while a program is executing. Once the program or the sampling is finished, the program counter samples are sorted according to the address range of each module in the program. The number of samples that lie within each module is tallied. From these tallies, an approximate value for the relative amount of execution time spent in each module can be determined and displayed. (See Figure 1.)

Determining the module to which each program counter sample belongs requires the program's link map or memory map to be examined. Most linkers produce memory maps that give information about modules' physical locations in memory.

To good approximation, the number of samples falling within each module is proportional to the amount of time spent executing the code within the module. The final tallies for each module can be displayed as a percentage of the total number of samples recorded for the entire program.
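
As a concrete sketch of the tally-and-display step, the fragment below prints each module's share of the total sample count as a percentage. It is illustrative only: the structure and names (tallyS, pa_shares) are assumptions, not taken from the article's own utilities, which read module address ranges from the program's link map as described above.

/* Illustrative only: given per-module sample tallies, display each
** module's approximate share of execution time as a percentage of the
** total number of program counter samples collected. */
#include <stdio.h>

struct tallyS {
    char *nameS;            /* module name, taken from the link map */
    long samplesN;          /* program counter samples within the module */
};

void
pa_shares(tableAP, modulesN)
struct tallyS *tableAP;
int modulesN;
{
    long totalN = 0;
    int i;

    for (i = 0; i < modulesN; i++)
        totalN += tableAP[i].samplesN;

    printf("%ld samples collected\n", totalN);
    for (i = 0; i < modulesN; i++)
        printf("%-16s %6ld  %3ld%%\n", tableAP[i].nameS,
            tableAP[i].samplesN,
            totalN != 0 ? (100L * tableAP[i].samplesN) / totalN : 0L);
}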

Figure 2 contains examples of statistical performance analysis applied to a simple test program. The test program consists of a main( ) function and 15 delay_loop( ) functions. The delay loop functions are configured so that the second function's delay loop takes twice as much time as the first, the third function's delay loop takes three times as much time as the first, and so on. The main function contains a loop that makes one call to each of the 15 subfunctions. Good statistics should show the second function taking twice as much time as the first, the third function taking three times as much time as the first, and so on. Figure 2(b) shows each program counter sample sorted according to the function within which it lies. Trial program configuration (patest.c): looptoloopN = 5120, scaleN = 2, delay_one = 1, delay_two = 2, delay_three = 3, and so on. This was run using an ES 1800 68020 emulator with a 12.5-MHz "Demon" target for two seconds, with 4000 bus cycles between samples.

For a complete listing of the test program patest.c, see Listing One (page 100).

In contrast, Figure 3 presents statistical performance analysis results done on a real PC program. The program being tested is the regression tester for the floating-point arithmetic package mentioned earlier. Only the results for the first 20 functions are included here.

Figure 3. Floating-point arithmetic package regression tester. Trial program configuration: fmtest.exe running on an 8-MHz AT clone. Command-line invocation: pamsdos fmtest.map lattice.cfg fmtest.

  Sample output:
  Module table sorted by sample counts:
  4096 samples collected

  module        samples  %   0     5     10     15     20     25     30
  ushftlm       919      22  **************************************
  ushftrm       846      21  ************************************
  PA_BOT_MEM    670      16  ***************************
  ucheckm       463      11  *******************
  _eufbs        266       6  **********
  ucmpm         169       4  ******
  isubm         138       3  *****
  iaddm          94       2  ***
  ffbitext       74       2  ***
  umultm         62       2  ***
  _nmalloc2      42       1  *
  _nfreex        42       1  *
  udivm          37       1  *
  itestm         33       1  *
  _nchkb         23       1  *
  ffbitins       21       1  *
  _nnohold       20       0
  _CO_pfmt       18       0
  prints         18       0
  calloc         14       0

From this example, it is clear that the routines ushftlm and ushftrm together account for more than 40 percent of the total program execution time. These routines perform bitwise left and right shifts. As expected, they are used extensively and are prime candidates for execution-time optimization. The routine PA_BOT_MEM actually represents program counter samples that fell below the program being tested; typically, this represents the fraction of time spent in DOS support functions. The regression tester writes a significant amount of output to the display, so time spent scrolling the PC's screen, for example, is significant.

Other routines that show up significantly (ucheckm, ucmpm, isubm, iaddm, umultm, udivm, and itestm) are all integer math routines used to support the floating-point package. Some Lattice library routines also show up (_eufbs, _nmalloc2, _nfreex, and _nchkb), due to the extensive use of dynamic memory allocation by the floating-point routines.

Statistical Performance Analysis Techniques

The examples in Figures 2 and 3 demonstrate that the relative time required by different modules can be determined from periodic program counter sampling. A logical question is, "So why isn't statistical performance analysis more widely accepted?"

The main argument against statistical performance analysis is that the entire execution path of a program is not sampled. Only isolated snapshots of where the program is executing are sampled, so the results of statistical performance analysis may not be accurate.

The nature of possible inaccuracies is difficult to characterize and depends on the program being analyzed, the sampling period, and the total number of samples taken. Most statistical methodology assumes the sampling process is a series of independent events (such as repeated flipping of a coin). The results of such random sampling are well characterized by the binomial distribution, as is the range of variation in possible outcomes.
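
A rough idea of the variation to expect under that independence assumption comes directly from the binomial distribution. If a module accounts for a fraction p of the execution time and n independent samples are taken, the expected tally for that module is n*p with a standard deviation of about sqrt(n*p*(1-p)). Using the Figure 3 trial as an example (n = 4096 samples, ushftlm at roughly p = 0.22), the standard deviation is sqrt(4096 * 0.22 * 0.78), or about 27 samples, so the 919-sample tally is uncertain by well under one percentage point.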

In contrast, programs are not random: There is a well-defined sequence of events. Starting the same program twice from the same initial conditions produces the same results. Sampling the program counter at periodic intervals is not a series of independent events.

The assumption for statistical performance analysis is that for sufficiently complex programs, appropriate sampling periods, and appropriate sampling durations, the program counter sampling will approximate independent sampling.

The nature of problems encountered with statistical performance analysis samples is somewhat similar to that of problems associated with pseudorandom number generation programs. Fortunately, the general complexity of most programs makes selection of reasonable parameters for statistical performance analysis much easier than selection of reasonable pseudorandom number generation parameters.

Several factors affect the variation of results from one statistical performance analysis trial to the next. A poor choice of either the sampling frequency or the total sampling time can bias the results. The conditions in Table 1 are required for good statistics.

Table 1: Conditions for good statistics

  Where

  t   =    program counter sampling interval
  T   =    total sampling time
  N   =    number of program modules
  n   =    total number of program samples (n = T/t)

  then for good sampling:

  t << T   (take a lot of samples)
  N << n   (there are many samples in each module)
  t  ≠     any basic periodicity of the program
  T >>     the basic periodicity of the program, or
  T ==     some multiple of the basic periodicity of the program
           (or of the part of the program being tested)
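
As a quick arithmetic check of these conditions against one of the trials shown below: in the Figure 4(a) run, the total sampling time is T = 2 seconds and n = 1377 samples were collected, so the effective sampling interval is t = T/n, or roughly 1.5 milliseconds, and t << T by a factor of about 1400. With N = 16 modules (main plus 15 delay functions), n/N is about 86 samples per module, so N << n holds as well.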

The examples in Figures 4 and 5 illustrate the difference between good and bad statistics. In both cases, a simple test program is analyzed. The program consists of a main function and 15 identical delay loop subfunctions. The main function contains a loop that contains one call to each of the 15 subfunctions. Good statistics should show all 15 functions requiring the same fraction of the total execution time. (See Listing One for the test program patest.c.)

Figure 4(a) illustrates a "good" sampling period, in which all of the delay loop functions should have about the same number of samples. Trial program configuration (patest.c): looptoloopN = 5120, scaleN = 2, delay_one = 8, delay_two = 8, delay_three = 8, and so on. This was run using an ES 1800 68020 emulator with a 12.5-MHz "Demon" target for two seconds, with 3850 bus cycles between samples.

Figure 4(b) illustrates a "bad" sampling period, in which the sampling period matches a fundamental periodicity of the program. The delay loop functions are short delays, and the number of main loop iterations is large. Delay functions delay_eight and delay_fifteen have a disproportionately large share of samples. Trial program configuration (patest.c): looptoloopN = 5120, scaleN = 2, delay_one = 8, delay_two = 8, delay_three = 8, and so on. This was run using an ES 1800 68020 emulator with a 12.5-MHz "Demon" target for two seconds, with 3900 bus cycles between samples.

Figure 4: Sampling periods: (a) good (b) bad

  (a)

  Sample output:
  Module table sorted by sample counts:
  1377 samples collected

  module           samples  %  0   1   2   3   4   5   6   7   8   9   10
  .main            91       7  *****************************
  .delay_seven     87       6  *************************
  .delay_thirteen  87       6  *************************
  .delay_one       86       6  *************************
  .delay_eleven    86       6  *************************
  .delay_fourteen  86       6  *************************
  .delay_twelve    86       6  *************************
  .delay_ten       86       6  *************************
  .delay_five      86       6  *************************
  .delay_six       86       6  *************************
  .delay_nine      85       6  *************************
  .delay_fifteen   85       6  *************************
  .delay_three     85       6  *************************
  .delay_four      85       6  *************************
  .delay_two       84       6  *************************
  .delay_eight     84       6  *************************

  (b)

  Sample output:
  Module table sorted by sample counts:
  1359 samples collected

  module          samples  %   0   1   2   3   4   5   6   7   8   9   10
  .delay_eight    160      12  ************************
  .delay_fifteen  158      12  ************************
  .delay_four      81       6  *************
  .delay_nine      80       6  *************
  .delay_eleven    80       6  *************
  .delay_five      80       6  *************
  .delay_twelve    80       6  *************
  .delay_seven     80       6  *************
  .delay_thirteen  80       6  *************
  .delay_one       80       6  *************
  .delay_two       80       6  *************
  .delay_fourteen  80       6  *************
  .delay_ten       80       6  *************
  .delay_three     79       6  *************
  .delay_six       79       6  *************

Figure 5(a) illustrates a "good" total sampling time, in which all of the delay loop functions should have about the same number of samples. Trial program configuration (patest.c): looptoloopN = 5, scaleN = 10, delay_one = 8, delay_two = 8, delay_three = 8, and so on. This was run using an ES 1800 68020 emulator with a 12.5-MHz "Demon" target for two seconds, with 4000 bus cycles between samples.

Figure 5(b) shows a "bad" total sampling time. Half of the delay loop functions have significantly more samples than expected and the other half have fewer. In this case, the total sampling time was not a multiple of a fundamental periodicity of the program; the delay loop functions are long delays, and the number of main loop iterations is small (delay = 8192, loop iterations = 5). Trial program configuration (patest.c): looptoloopN = 5, scaleN = 10, delay_one = 8, delay_two = 8, delay_three = 8, and so on. This was run using an ES 1800 68020 emulator with a 12.5-MHz "Demon" target for two seconds, with 2500 bus cycles between samples.

Figure 5: Sampling time: (a) good (b) bad

  (a)

  Sample output:
  Module table sorted by sample counts:
  1343 samples collected

  module           samples  %  0   1   2   3   4   5   6   7   8   9   10
  .delay_one       85       6  *************************
  .delay_eleven    85       6  *************************
  .delay_six       85       6  *************************
  .delay_eight     83       6  *************************
  .delay_three     83       6  *************************
  .delay_fourteen  83       6  *************************
  .delay_four      82       6  *************************
  .delay_nine      82       6  *************************
  .delay_thirteen  82       6  *************************
  .delay_ten       80       6  *************************
  .delay_twelve    80       6  *************************
  .delay_two       80       6  *************************
  .delay_seven     80       6  *************************
  .delay_fifteen   80       6  *************************
  .delay_five      80       6  *************************

  (b)

  Sample output:
  Module table sorted by sample counts:
  2000 samples collected

  module           samples  %  0   1   2   3   4   5   6   7   8   9   10
  .delay_fourteen  132      7  ******************************
  .delay_nine      131      7  ******************************
  .delay_ten       131      7  ******************************
  .delay_one       131      7  ******************************
  .delay_eleven    131      7  ******************************
  .delay_fifteen   131      7  ******************************
  .delay_twelve    131      7  ******************************
  .delay_thirteen  130      7  ******************************
  .delay_eight     106      5  **********************
  .delay_six       106      5  **********************
  .delay_four      105      5  **********************
  .delay_two       105      5  **********************
  .delay_three     105      5  **********************
  .delay_seven     105      5  **********************
  .delay_five      104      5  **********************

In actual practice, selection of the total sampling time and the sampling frequency is rather subjective. Typically the total sampling time is either:

A. The total execution time of the program. For a program that has definite, finite start and end points (such as the floating-point package regression tester presented earlier) and runs for seconds or minutes, this is a reasonable choice.

B. The execution time of a particular operation within a much larger program. Here the goal is to isolate the operation being tested from the rest of the program. A good example of this is the actual download of program data from a host system to the target system via an emulator.

Typically the sampling frequency is either fixed in hardware or variable. On the PC, the clock-tick frequency is fixed, which makes checking for the effects of natural program periodicity difficult. Using a PC with a different CPU clock frequency or processor has the same effect as changing the sampling rate. If either the interrupt frequency or the rate of external hardware sampling can be varied, then checking for the effects of natural program periodicity is relatively easy.

Unless there are periodic interrupts in the system being tested, natural periodicities in the program are difficult to determine. The best way to check for natural periodicity is to vary the sampling frequency and see how it affects the results. If periodic interrupts are present, and/or an obvious natural periodicity is present, then the sampling period should be changed such that t' = (irrational #)*t where t = natural period (or previous sampling period) and t' = sampling period to try.

Irrational numbers such as pi, pi/2, pi/3, e, and e/2 are good initial choices. This assumes that the natural periodicity is approximately equal to the sampling period and is much less than the total sampling time, where t = natural period << T.

Using an irrational number minimizes the probability that the new sampling period is also some multiple of the program's natural period. However, due to the finite nature of the sampling, even choosing an irrational period ratio does not guarantee good statistics.

The best way to guarantee reasonable results is to try several different sampling periods.
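
A trivial way to generate such candidate periods is to multiply the suspected natural period by a handful of irrational constants and try each in turn. The fragment below is purely illustrative; the 1.0-millisecond natural period is an arbitrary example, not a value from the article.

/* papick.c -- illustrative only: print candidate sampling periods derived
** from a suspected natural period, using the irrational multipliers
** suggested above.  The natural period used here is an arbitrary example. */
#include <stdio.h>

int main(void)
{
    static const double multipliersAD[] = {
        3.14159265,         /* pi   */
        1.57079633,         /* pi/2 */
        1.04719755,         /* pi/3 */
        2.71828183,         /* e    */
        1.35914091          /* e/2  */
    };
    double naturalT = 1.0e-3;   /* suspected natural period, in seconds */
    int i;

    for (i = 0; i < 5; i++)
        printf("candidate sampling period: %.6f seconds\n",
            multipliersAD[i] * naturalT);
    return 0;
}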

Program Counter Sampling Methods

The most critical step in statistical performance analysis is collecting the sample program counter values. The most common ways to do this are native interrupt-driven sampling on the host system itself and sampling with external hardware, such as an emulator.

Native Interrupt Sampling on PCs

The advantage of using a PC is that it is a common software development platform. Good quality compilers, editors, and other development tools are readily available and inexpensive.

The PC has a clock-tick hardware interrupt (INT 8) that occurs every 54.92 milliseconds (18.21 times per second). This interrupt is used for timekeeping by the PC's BIOS routines and by MS-DOS. (See either the Microsoft MS-DOS Programmer's Reference, the XT Technical Reference, or the AT Technical Reference.) The value of 18.21 times per second is used so that there are 2^16 (65,536) clock ticks per hour.
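
The arithmetic behind those numbers: 65,536 ticks per hour divided by 3600 seconds is about 18.2 ticks per second, which corresponds to one tick roughly every 54.9 milliseconds.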

The clock-tick interrupt vector normally points to a timekeeper service routine in either the BIOS or MS-DOS. Prior to running the program to test, the vector is modified to point to a local program counter sampler interrupt routine. This routine samples the pre-interrupt program counter value on the stack and saves it into a local data buffer not affected by the program being tested. After sampling the PC value, control is transferred to the original timekeeper interrupt service routine. The interrupt vector is restored after the program being tested completes execution. An example is the source code file patick.asm (Listing Two, page 101).
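
Listing Two implements this hooking scheme in 8086 assembly for MS-DOS. The same sampling idea can also be sketched at the C level on other platforms; the fragment below is a rough modern analogue for POSIX systems (the REG_RIP register name is Linux/x86-64 specific) and is offered only as an illustration, not as part of the article's tool set. The buffer size and function names mirror the article's conventions but are otherwise arbitrary.

/* pasig.c -- illustrative POSIX analogue of Listing Two: record the
** interrupted instruction pointer on every tick of a periodic SIGPROF
** timer.  Linux/x86-64 specific; not part of the article's tools. */
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/time.h>
#include <ucontext.h>

#define PA_BUFLEN 4096

static unsigned long pa_bufAL[PA_BUFLEN];   /* program counter samples */
static volatile int pa_countN = 0;

/* SIGPROF handler: grab the pre-interrupt instruction pointer. */
static void pa_tick(int signoN, siginfo_t *siP, void *contextP)
{
    ucontext_t *ucP = (ucontext_t *) contextP;

    (void) signoN;
    (void) siP;
    if (pa_countN < PA_BUFLEN)
        pa_bufAL[pa_countN++] =
            (unsigned long) ucP->uc_mcontext.gregs[REG_RIP];
}

/* Start sampling every usecN microseconds (the analogue of paopen()). */
void pa_open(long usecN)
{
    struct sigaction saS;
    struct itimerval itS;

    memset(&saS, 0, sizeof(saS));
    saS.sa_sigaction = pa_tick;
    saS.sa_flags = SA_SIGINFO | SA_RESTART;
    sigaction(SIGPROF, &saS, NULL);

    memset(&itS, 0, sizeof(itS));
    itS.it_interval.tv_usec = usecN;
    itS.it_value = itS.it_interval;
    setitimer(ITIMER_PROF, &itS, NULL);
}

/* Stop sampling and return the number of samples (the analogue of paclose()). */
int pa_close(void)
{
    struct itimerval offS;

    memset(&offS, 0, sizeof(offS));
    setitimer(ITIMER_PROF, &offS, NULL);
    return pa_countN;
}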

The most significant limitation of doing native, interrupt-driven sampling on a PC is that the interrupt rate is fixed at a relatively slow rate. A fixed sampling rate does not allow determination of possible effects of natural periodicity in the program being tested. The only way to investigate this is to use several PCs with different processor clock rates. The sampling rate stays the same, but the execution speed of the program changes, depending on the PC used.

Another problem with using a PC for performance analysis is that programs are dynamically relocated at runtime. Memory maps generally give module addresses relative to the origin of the program. The origin of the program is determined at runtime. This requires either a modified loader that returns the program origin address, or use of a trial program that returns where it has been loaded into memory. The latter approach is presented in the source code files pawhere.c (Listing Three, page 103) and pamsdos.c (Listing Four, page 103).
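
The address conversion involved is the usual real-mode one, and it is the same computation Listing Four performs on each sample in pa_pcsample(): the linear address is 16 times the segment value plus the offset, that is, linear = (segment << 4) + offset. A sample recorded at 1234:0010, for example, corresponds to linear address 0x12350.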

A more general limitation of native, interrupt-driven sampling is that it requires system resources. Memory is required for the sampling program and for the program counter sample buffer; this memory competes with the memory needs of the program being tested. A periodic interrupt must be provided to drive the sampling, and the sampling interrupt service routine requires some execution time, which likewise competes with the program being tested. In most cases, this use of system resources does not significantly affect the program being tested, but in some cases it will.

Analysis Methods

Once the program counter samples have been collected by either native interrupt sampling or external hardware, the samples can be processed on either the native system or an alternate host system.

Reading the memory map for the program being tested is an easy way to determine the beginning and ending address of each module within it. The file pardmap.c (accessible electronically; see "Availability" at the end of this article) is an example of a generic map file reader program. The map file reader reads the map file and creates a data table of module names, address ranges, and number of samples. It then sorts the table according to address values.

For each program counter sample, a binary search is done to determine which module's sample count to increment. The file pautil.c does the binary search and increments the appropriate sample count. This file, along with the make file, the required header file padef.h (discussed in this article), and sample configuration files for the Microtec and Intermetrics cross linkers and for the Lattice, Zortech, and Microsoft PC native linkers, is available electronically.
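
Since pautil.c itself is available only electronically, the fragment below is a minimal sketch of the kind of lookup it is described as performing. The structure layout and names are assumptions made for illustration (the real definitions live in padef.h); only the binary-search idea is taken from the text.

/* Illustrative only: credit one program counter sample to the module whose
** address range contains it, using a binary search over a table that has
** been sorted by address.  This structure is a stand-in for the real
** definitions in padef.h. */
struct pa_moduleS {
    char nameAC[32];        /* module (function) name from the map file */
    unsigned long startL;   /* first address belonging to the module */
    unsigned long endL;     /* last address belonging to the module */
    long samplesN;          /* tally of program counter hits */
};

/* Returns 0 after bumping the matching module's tally, or -1 if the sample
** lies outside every table entry.  (The real tool keeps catch-all bins such
** as PA_BOT_MEM for samples that fall below the program.) */
int
pa_tally(tableAP, modulesN, pcL)
struct pa_moduleS *tableAP;
int modulesN;
unsigned long pcL;
{
    int lowN = 0;
    int highN = modulesN - 1;

    while (lowN <= highN)
    {
        int midN = (lowN + highN) / 2;

        if (pcL < tableAP[midN].startL)
            highN = midN - 1;
        else if (pcL > tableAP[midN].endL)
            lowN = midN + 1;
        else
        {
            tableAP[midN].samplesN++;
            return(0);
        }
    }
    return(-1);
}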

So, Why Aren't You Using Statistical Performance Analysis?

For execution-time-critical programs, the benefits of using statistical performance analysis far outweigh the potential problems caused by the questionable accuracy of its results. With a proper understanding of the technique, most potential problems can be eliminated, making statistical performance analysis a useful, practical tool.

Availability

In addition to the code availability sources listed on page 3, complete source code and PC executable versions of pamsdos.exe and paes1800.exe are available on a 5 1/4-inch, 360K floppy disk from Applied Microsystems. The executable version was compiled from the given source code using the PC Lattice C compiler, Version 6.01. Both programs are fully commented and include fairly complete usage messages.


_STATISTICAL PERFORMANCE ANALYSIS_
by Fred Motteler


[LISTING ONE]


/* patest.c -- A collection of simple routines to test the accuracy of
** statistical performance analysis programs for the PC and ES 1800 emulator.
*/

/* Default delay timing parameters */
#define DELAY_ONE 1
#define DELAY_TWO 2
#define DELAY_THREE 3
#define DELAY_FOUR 4
#define DELAY_FIVE 5
#define DELAY_SIX 6
#define DELAY_SEVEN 7
#define DELAY_EIGHT 8
#define DELAY_NINE 9
#define DELAY_TEN 10
#define DELAY_ELEVEN 11
#define DELAY_TWELVE 12
#define DELAY_THIRTEEN 13
#define DELAY_FOURTEEN 14
#define DELAY_FIFTEEN 15

#define DELAY_SCALE 10      /* Effectively multiplies values by 1024 (2^10) */
#define DELAY_LOOPS 5       /* Default number of times thru main loop */

/* Loop delay parameters.  These are done as globals to allow easy access to
** the timing parameters via the ES 1800 emulator.  This allows different
** timing configurations to be tested without having to recompile and link
** this code.  Kludgy, but it encourages easy experimentation. */
long final_sumL = 0;
int time_oneN = DELAY_ONE;
int time_twoN = DELAY_TWO;
int time_threeN = DELAY_THREE;
int time_fourN = DELAY_FOUR;
int time_fiveN = DELAY_FIVE;
int time_sixN = DELAY_SIX;
int time_sevenN = DELAY_SEVEN;
int time_eightN = DELAY_EIGHT;
int time_nineN = DELAY_NINE;
int time_tenN = DELAY_TEN;
int time_elevenN = DELAY_ELEVEN;
int time_twelveN = DELAY_TWELVE;
int time_thirteenN = DELAY_THIRTEEN;
int time_fourteenN = DELAY_FOURTEEN;
int time_fifteenN = DELAY_FIFTEEN;
int scaleN = DELAY_SCALE;
int looptoloopN = DELAY_LOOPS;

/* Function: long delay_xxxx(int delayN)
** Description: These are simple functions designed to allow varied delays.
** The code in each delay function is identical to the code in all of
** the other delay functions. This allows accurate comparison of the
** relative execution time of each function. Fifteen of these functions should
** be a reasonable number to represent a simple "real" program.
*/
long
delay_one(delayN)
int delayN;
{
    int i;
    long sumL;

    sumL = 0L;
    delayN <<= scaleN;
    for (i = 0; i < delayN; i++)
    sumL += (long) i;

    return(sumL);
}
long
delay_two(delayN)
int delayN;
{
    int i;
    long sumL;
    sumL = 0L;
    delayN <<= scaleN;
    for (i = 0; i < delayN; i++)
    sumL += (long) i;
    return(sumL);
}

    .
    .
    .
    .

long
delay_fifteen(delayN)
int delayN;
{
    int i;
    long sumL;
    sumL = 0L;
    delayN <<= scaleN;
    for (i = 0; i < delayN; i++)
    sumL += (long) i;
    return(sumL);
}

/* Function: void main()
** Description: This is a simple routine to run the various delay routines.
** The delay time variables are all globals to allow experimentation with
** the timing parameters using the ES 1800 emulator.
*/
void
main()
{
    int i;
    final_sumL = 0L;
    for (i = 0; i < looptoloopN; i++)
    {
    final_sumL += delay_one(time_oneN);
    final_sumL += delay_two(time_twoN);
    final_sumL += delay_three(time_threeN);
    final_sumL += delay_four(time_fourN);
    final_sumL += delay_five(time_fiveN);
    final_sumL += delay_six(time_sixN);
    final_sumL += delay_seven(time_sevenN);
    final_sumL += delay_eight(time_eightN);
    final_sumL += delay_nine(time_nineN);
    final_sumL += delay_ten(time_tenN);
    final_sumL += delay_eleven(time_elevenN);
    final_sumL += delay_twelve(time_twelveN);
    final_sumL += delay_thirteen(time_thirteenN);
    final_sumL += delay_fourteen(time_fourteenN);
    final_sumL += delay_fifteen(time_fifteenN);
    }
}




[LISTING TWO]


TITLE   patick - IBM PC / Clone Clock Tick CS:IP Grabber
; File: patick.asm--Fred Motteler and Applied Microsystems Corporation
;   Copyright 1990. All Rights Reserved
; Description:
;   This file contains three functions:
;   C callable:
;   void paopen(bufferLP, lengthN)  Initialize grabber interrupt vector
;   int paclose()           Close grabber interrupt vector
;   Interrupt routine, this is treated like part of paopen():
;   patick              Grab CS:IP value
;   These functions are configured for small model.
;   Stack frame structure for paopen():
;
stkfr   STRUC
    OLD_FR  DW  ?   ; Previous stack frame pointer
    RETADDR DW  ?   ; Return address to caller
    BUFFERP DW  ?   ; Pointer to buffer to use
    BUFLEN  DW  ?   ; Length of buffer (in longwords)
stkfr   ENDS
;
;   Stack frame structure for clock tick timer routine.
intfr   STRUC
    INT_FR  DW  ?   ; Pre-interrupt stack frame pointer
    IP_VAL  DW  ?   ; Pre-interrupt IP value
    CS_VAL  DW  ?   ; Pre-interrupt CS value
intfr   ENDS
TIMER   EQU 8h      ; Timer interrupt vector number
DGROUP  GROUP   _DATA
_DATA   SEGMENT WORD PUBLIC 'DATA'
        ASSUME  DS:DGROUP
bufptr  DW  0       ; Starting point of buffer
bufsiz  DW  0       ; Number of longwords in the buffer
bufindx DW  0       ; Next location of buffer to use
bufwrap DB  0       ; Flag if buffer has wrapped...
_DATA   ENDS
_TEXT   SEGMENT BYTE PUBLIC 'CODE'
    ASSUME  CS:_TEXT
;
;   void paopen (unsigned long *bufferLP, int lengthN)
;   This is a C callable function to initialize the CS:IP grabber and
;   start it up.  bufferLP points to the buffer where CS:IP values are
;   written.  lengthN is the length of the buffer in longwords.
    PUBLIC  paopen
paopen  PROC    NEAR
    push    bp
    mov bp,sp
    push    si
    push    di
    push    es
;
;   Set up the local buffer pointer values from those passed on the stack.
    mov ax,[bp].BUFFERP ; Get pointer to start of buffer
    mov bufptr,ax
    mov ax,[bp].BUFLEN  ; Get length of the buffer
    shl ax,1        ; convert longword length to byte length
    shl ax,1
    mov bufsiz,ax
    xor ax,ax       ; Start at the beginning of the buffer
    mov bufindx,ax
    mov bufwrap,al  ; Reset buffer wrap flag
;
;   Save the original clock tick interrupt vector.
    mov al,TIMER    ; interrupt number into al
    mov ah,35h      ; DOS function = get vector
    int 21h     ; DOS returns old vector in es:bx
    mov cs:oldseg,es    ; save old segment
    mov cs:oldoff,bx    ; save old offset
;
;   Disable interrupts while changing the interrupt vector.
    cli
;
;      Change clock tick interrupt routine to point at local interrupt routine.
    mov al,TIMER    ; vector number
    mov ah,25h      ; DOS function = set vector
    mov dx,OFFSET patick ; point to our interrupt handler
    push    ds      ; don't lose ds, we need to get to local data
    push    cs      ; move this cs to ds
    pop ds      ;
    int 21h     ; set the new vector
    pop ds      ; restore ds
;
;   Enable interrupts and return;
    pop es
    pop di
    pop si
    pop bp
    sti
    ret
;
;   Clock tick grabber routine.  This routine samples the CS:IP values that
;   were pushed onto the stack when the interrupt occurred.
patick: push    bp
    mov bp,sp       ; Treat CS:IP values like stack frame
    push    ax
    push    bx
    push    ds
;
;   Get the local ds to allow access to local variables.
    mov ax,DGROUP
    mov ds,ax
;
;   Use bx as a pointer to the recording buffer
    mov bx,bufptr
    add bx,bufindx
;
;   Grab the pre-interrupt CS:IP values off the stack
    mov ax,[bp].IP_VAL  ; grab the IP
    mov [bx],ax     ; save the IP in the recording buffer
    inc bx
    inc bx
    mov ax,[bp].CS_VAL  ; grab the CS
    mov [bx],ax     ; save the CS in the recording buffer
    inc bx
    inc bx
;
;   Check if we are at the end of the buffer
    sub bx,bufptr   ; get the byte offset index back again
    mov ax,bufsiz   ; get the buffer byte length
    cmp ax,bx
    jne notend      ; jump if not at the end of the buffer
;
;   At the end of the buffer
    mov bx,0        ; reset the buffer index
    mov al,0ffh     ; set flag to indicate buffer wrap
    mov bufwrap,al
;
;   Write out modified buffer index
notend: mov bufindx,bx
;
;   Clean up
    pop ds
    pop bx
    pop ax
    pop bp
;
;   Jump to the original interrupt service routine.  An immediate jump
;   is used so no segment registers are required.
    DB  0eah        ; jmp immediate, to the offset:segment
;                 selected below (brute force approach).
;   Original interrupt handler's offset and segment values.  These are
;   in the current code segment to allow the interrupt routine given
;   here to directly jump to the original interrupt routine.
oldoff  DW  0       ; Room for original timer interrupt offset
oldseg  DW  0       ; Room for original timer interrupt segment

paopen  ENDP
;
;   int paclose()  This is a C callable function to close CS:IP grabber and
;   return the number of CS:IP values grabbed.
    PUBLIC  paclose
paclose PROC    NEAR
    push    bp
    mov bp,sp
    push    si
    push    di
    push    es
;
;   Disable interrupts while the original interrupt vector is restored.
    cli
    mov al,TIMER    ; get interrupt number
    mov ah,25h      ; DOS function = set vector
    push    ds      ;
    mov dx,cs:oldoff    ; old timer offset
    mov ds,cs:oldseg    ; old timer segment
    int 21h     ; restore old vector
    pop ds      ;
;
;   Enable interrupts.
    sti
;
;   Calculate the number of CS:IP values
    cmp bufwrap,0   ; check if the buffer has wrapped
    jne wrapped     ; jump if it has wrapped
    mov ax,bufindx  ; no wrap, return buffer index as count
    jmp done
wrapped: mov    ax,bufsiz   ; wrapped, return buffer size as count
;
;   Clean up stack and return
done:   shr ax,1        ; Return count in number of CS:IP pairs
    shr ax,1
    pop es
    pop di
    pop si
    pop bp
    ret

paclose ENDP
_TEXT   ENDS
    END






[LISTING THREE]


/* pawhere.c -- contains a very simple program that returns its segment base
** address.  Note that this program is Lattice version 6.01 specific in that
** the Lattice small model has "main" at the beginning of the executable
** portion of the program. Other compiler/linker packages may require that the
** program map be examined for the module that starts the program.
** Copyright 1990 Fred Motteler and Applied Microsystems Corporation
*/
#include <dos.h>
#include <stdio.h>

unsigned int
main()
{
    FILE *fp;

    fp = fopen("pawhere.tmp", "w");
    fprintf(fp, "%x %x\n",
        (FP_SEG((char far *) main)), (FP_OFF((char far *) main)));
    fclose(fp);
    exit(0);
}






[LISTING FOUR]


/* pamsdos.c -- Utility functions used by MS-DOS version of the statistical
** performance analysis package.
** Copyright 1990 Fred Motteler and Applied Microsystems Corporation
*/

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <string.h>
#include "padef.h"

/* Function: int main( argcN, argvAS )
** Description: MS-DOS based statistical performance analysis program.
**   Command line arguments: pamsdos prog.map prog.cfg prog.exe options
**   Where: prog.map = memory map for program; prog.cfg = memory map
**    configuration; prog.exe = program to run; options = command options
**    for the program to run
*/
int
main( argcN, argvAS )
int argcN;
char *argvAS[];
{
    int errorN;             /* Error code */
    unsigned int segmentW;  /* Starting load address of program to run */
    unsigned int offsetW;
    unsigned long originL;
    int processedN;         /* Number of map globals processed */
    int i;                  /* General index */
    FILE *mapFP;            /* Map file to read */
    FILE *formatFP;         /* File with map file format information */
    char commandAC[PA_LINE_LEN]; /* Complete command line for program */
    int pagelinesN;         /* Number of lines on output page, 0 if
                 * continuous, -1 if no display output, else
                 * n if n lines per page. */
    FILE *listFP;           /* Results output file */
    char listAB[80];        /* Optional results listing file path/name */
    char pagelinesAB[8];    /* String for number of lines/page */

    printf("pamsdos - Statistical performance analysis tool for MS-DOS\n");
    printf("Version %s\n", PA_VERSION);
    printf("Copyright (C) 1990 Fred Motteler and Applied Microsystems Corp\n");
    if (argcN < 4)
    {
    printf("Usage: pamsdos prog.map prog.cfg prog.exe [options]\n");
    printf(" Where: prog.map  memory map for program\n");
    printf("        prog.cfg  memory map configuration file\n");
    printf("        prog.exe  program to run\n");
    printf("        [options] command line options for program to run\n");
    exit(-100);
    }

    /* Determine where the program to run is to be located. */
    if ((errorN = pa_locate(&segmentW, &offsetW)) != 0)
    {
    pa_error(errorN);
    exit(errorN);
    }
    /* Calculate origin of program.  Room must be allowed for memory
     * malloc()'d off the heap. */
    originL = (unsigned long) (segmentW + 1);
    originL <<= 4;
    originL += (unsigned long) (offsetW - 2);
    originL += (unsigned long) (PA_BUFLEN << 2);

    if ((pa_debugN & PA_GENERAL) != 0)
    {
    printf("program start segment:offset %x:%x\n", segmentW, offsetW);
    printf("              linear address %lx\n",originL);
    }

    /* Get the complete command line to invoke the program. */
    strcpy(commandAC, argvAS[3]);
    if (argcN > 4)
    {
    for (i = 4; i <  argcN; i++)
    {
        strcat(commandAC," ");
        strcat(commandAC,argvAS[i]);
    }
    }

    /* Run the program and collect samples. */
    printf("Starting %s\n", argvAS[3]);
    if ((errorN = pa_pcsample(commandAC, PA_SAMPLE, PA_BUFLEN)) != 0)
    {
    pa_error(errorN);
    exit(errorN);
    }

    /* Read in the configuration file to get map format information and
     * to get number of lines / display page and option listing file. */
    if ((formatFP = fopen(argvAS[2], "r")) == (FILE *) NULL)
    {
    pa_error(PA_NO_CFG_E);
    exit(PA_NO_CFG_E);
    }

    /* Read in display lines, and optional output file configuration data
     * from the configuration file. */
    if (((errorN = paconfig(formatFP, PA_PAGELINES, pagelinesAB)) != 0) ||
    ((errorN = paconfig(formatFP, PA_LISTFILE, listAB)) != 0))
    {
    pa_error(errorN);
    fclose(formatFP);
    exit(errorN);
    }

    /* Determine the number of lines/page to display */
    if (sscanf(pagelinesAB, "%d", &pagelinesN) != 1)
    {
    pa_error(PA_BAD_ARG_E);
    fclose(formatFP);
    exit(PA_BAD_ARG_E);
    }

    /* Open the optional listing file */
    if (listAB[0] == '\0')
    listFP = (FILE *) NULL;
    else if ((listFP = fopen(listAB, "w")) == (FILE *) NULL)
    {
    pa_error(PA_NO_LST_E);
    fclose(formatFP);
    exit(PA_NO_LST_E);
    }

 /* Read program's memory map and create "bins" for program counter samples. */
    if ((mapFP = fopen(argvAS[1], "r")) == (FILE *) NULL)
    {
    pa_error(PA_NO_MAP_E);
    fclose(formatFP);   /* mapFP is NULL here; close the open config file */
    exit(PA_NO_MAP_E);
    }
    if ((errorN = pardmap(mapFP, formatFP, originL, &processedN)) != 0)
    {
    pa_error(errorN);
    fclose(mapFP);
    fclose(formatFP);
    exit(errorN);
    }

    /* Process the samples and sort the bins according to the PC hits in
     * each bin. */
    printf("Processing samples\n");
    if ((errorN = pa_bstuff(PA_SAMPLE, patableAHP, &processedN)) != 0)
    {
    pa_error(errorN);
    fclose(mapFP);
    fclose(formatFP);
    exit(errorN);
    }

    /* Display the results */
    padisply(patableAHP, processedN, pagelinesN, listFP);
    fclose(mapFP);
    fclose(formatFP);
    exit(0);
}

/* Function: int pa_locate(unsigned int *segmentPW, unsigned int *offsetPW)
** Description: This function figures out where in memory the program to be
**  analyzed is to be run. MS-DOS executables are dynamically located at
**  runtime. In order to avoid the complexity of writing a DOS ".exe" loader
**  program, a simpler approach is used here. This function uses the ANSI
**  system() library function to execute a trial program, "pawhere.exe" that
**  writes its starting code segment and offset to a temporary file
**  "pawhere.tmp". After "pawhere.exe" has finished, this function opens the
**  temporary file and reads the starting segment and offset value. It is
**  assumed that the desired program to be tested will have the same starting
**  code segment and offset. If all operations were successful, then 0 is
**  returned. Otherwise a non-zero error code will be returned.*/
int
pa_locate(segmentPW, offsetPW)
unsigned int *segmentPW;

unsigned int *offsetPW;
{
    FILE *fp;

    /* First figure out where the program will be loaded.  Run "pawhere.exe"
     * via a system() function call. */
    if ((system("pawhere")) != 0)
    return(PA_NO_WHERE_E);

    /* Read in the result from pawhere.tmp. */
    if ((fp = fopen("pawhere.tmp", "r")) == (FILE *) NULL)
    return(PA_NO_TMP_E);
    if ((fscanf(fp, "%x %x", segmentPW, offsetPW)) != 2)
    return(PA_BAD_TMP_E);
    fclose(fp);
    if (remove("pawhere.tmp") != 0)
    return(PA_TMP_RM_E);

    return(0);
}

/* Function: int pa_pcsample(char *programS, char *sampfileS, int samplesN)
** Description: This function runs the program (entire command line) pointed
**   to by programS, while sampling its program counter every PC clock tick.
**   Up to samplesN program counter samples are collected, and then written
**   out in binary format to the file sampfiles.*/
int
pa_pcsample(programS, sampfileS, samplesN)
char *programS;         /* Command line of program to run */
char *sampfileS;        /* File to use to write out pc samples */
int samplesN;           /* Maximum number of samples to collect */
{
    unsigned long *pcbufferPL;  /* Long pointer to local pc sample buffer */
    unsigned int *pcbufferPW;   /* Word pointer to local pc sample buffer */
    unsigned long *pcorgPL; /* Original copy of pointer to pc sample buf */
    unsigned int segmentW;  /* Starting segment of program to run */
    unsigned int offsetW;   /* Starting offset of program to run */
    int handleN;        /* pc sample file handle */
    unsigned long sampleL;  /* segment:offset sample converted to linear */
    int i;          /* general index */

    /* Grab memory for the sample buffer */
    if ((pcbufferPL = (unsigned long *) malloc((4*samplesN)))
    == (unsigned long *) NULL)
    return(PA_NO_MEM_E);
    /* Copy buffer pointer to allow word (int) access as well as long access.*/
    pcbufferPW = (unsigned int *) pcbufferPL;
    pcorgPL = pcbufferPL;

    /* Start CS:IP sampling */
    paopen(pcbufferPW, samplesN);

    /* Run the desired program. */
    if (system(programS) != 0)
    {
    paclose();
    return(PA_NO_EXEC_E);
    }

    /* Stop sampling */
    samplesN = paclose();

    /* Convert the samples from segment:offset to linear addresses relative
     * to the origin of the loaded program. */
    if ((pa_debugN & PA_GENERAL) != 0)
    printf("pa_pcsample: number of samples: %d\n", samplesN);
    for (i = 0; i < samplesN; i++)
    {
    /* Read segment:offset value from the table. */
    offsetW = *pcbufferPW++;
    segmentW = *pcbufferPW++;

    if ((pa_debugN & PA_GENERAL) != 0)
        printf("pa_pcsample: sample segment:offset %x:%x\n",
            segmentW,offsetW);

    /* Convert it to a linear address. */
    sampleL = ((unsigned long) offsetW)
        + (((unsigned long) segmentW) << 4);
    /* Write the linear address back to the table. */
    *pcbufferPL++ = sampleL;
    if ((pa_debugN & PA_GENERAL) != 0)
        printf("pa_pcsample: linear sample %lx\n",sampleL);
    }

    /* Write the samples to a binary file. */
    if ((handleN = open (sampfileS, (O_CREAT | O_WRONLY | O_RAW), 0))
    == (-1))
    {
    free(pcorgPL);
    return(PA_NO_PC_FILE_E);
    }
    if ((write( handleN, ((char *) pcorgPL), (samplesN << 2)))
    != (samplesN << 2))
    {
    close(handleN);
    free(pcorgPL);
    return(PA_NO_PC_WR_E);
    }

    close(handleN);
    free(pcorgPL);
    return(0);
}


Copyright © 1991, Dr. Dobb's Journal
