Dr. Dobb's | Automatic Code Instrumentation

Automatic Code Instrumentation

You can profile function execution times easily with just a little help from a VC++ hook and any web browser.

January 01, 1999
URL:http://www.drdobbs.com/automatic-code-instrumentation/184403601

January 1999/Automatic Code Instrumentation/Figure 1

Figure 1: CallMonitor hook implementation

// Copyright (c) 1998 John Panzer.  Permission is granted to
// use, copy, modify, distribute, and sell this source code as 
// long as this copyright notice appears in all source files.
#include <windows.h>
#include <imagehlp.h>
#include <stdio.h>
#include "CallMon.h"

using namespace std;

typedef CallMonitor::ADDR ADDR;

// Processor-specific offset from
// _penter return address to start of
// caller.
static const unsigned OFFSET_CALL_BYTES=5;

// Start of MSVC-specific code

// _pexit is called upon return from
// an instrumented function.
static void _pexit()
{
    CallMonitor::TICKS endTime;
    CallMonitor::queryTicks(&endTime);
    ADDR framePtr,parentFramePtr;

    // Retrieve parent stack frame to pass
    // to exitProcedure
    __asm mov DWORD PTR [framePtr], ebp
    parentFramePtr = ((ADDR *)framePtr)[0];

    CallMonitor::threadObj().exitProcedure(
            parentFramePtr,
            &((ADDR*)framePtr)[3],endTime);
}

// An entry point to which all instrumented
// function returns are redirected.  
static void __declspec(naked) _pexitThunk()
{
    // Push placeholder return address
    __asm push 0     
    // Protect original return value
    __asm push eax   
    _pexit();
    // Restore original return value
    __asm pop eax    
    // Return using new address set by _pexit
    __asm ret        
}

// _penter is called on entry to each client function
extern "C" __declspec(dllexport)
void _penter()
{
    CallMonitor::TICKS entryTime;
    CallMonitor::queryTicks(&entryTime); // Track entry time

    ADDR framePtr;
    __asm mov DWORD PTR [framePtr], ebp

    CallMonitor::threadObj().enterProcedure(
        (ADDR)((unsigned *)framePtr)[0],
        (ADDR)((unsigned *)framePtr)[1]-OFFSET_CALL_BYTES,
        (ADDR*)&((unsigned *)framePtr)[2],
        entryTime);
}
// End of MSVC-specific code

//End of File

January 1999/Automatic Code Instrumentation/Figure 2

Figure 2: CallMonitor class implementation

// Copyright (c) 1998 John Panzer.  Permission is granted to
// use, copy, modify, distribute, and sell this source code as
// long as this copyright notice appears in all source files.

// Utility functions
void indent(int level)
{
    for(int i=0;i<level;i++) putchar('\t');
}

//
// class CallMonitor
//
DWORD CallMonitor::tlsSlot=0xFFFFFFFF;

CallMonitor::CallMonitor() {queryTicks(&threadStartTime);}

CallMonitor::~CallMonitor() {}

void CallMonitor::threadAttach(CallMonitor *newObj)
{
    if (tlsSlot==0xFFFFFFFF) tlsSlot = TlsAlloc();
    TlsSetValue(tlsSlot,newObj);
}

void CallMonitor::threadDetach()
{
    delete &threadObj();
}


CallMonitor &CallMonitor::threadObj()
{
    CallMonitor *self = (CallMonitor *)
        TlsGetValue(tlsSlot);
    return *self;
}

// Performs standard entry processing
void CallMonitor::enterProcedure(ADDR parentFramePtr,
                                 ADDR funcAddr,
                                 ADDR *retAddrPtr,
                                 const TICKS &entryTime)
{
    // Record procedure entry on shadow stack
    callInfoStack.push_back(CallInfo());

    CallInfo &ci = callInfoStack.back();
    ci.funcAddr = funcAddr;
    ci.parentFrame = parentFramePtr;
    ci.origRetAddr = *retAddrPtr,
    ci.entryTime = entryTime;

    logEntry(ci);  // Log procedure entry event

    // Redirect eventual return to local thunk
    *retAddrPtr = (ADDR)_pexitThunk;

    queryTicks(&ci.startTime); // Track approx. start time
}

// Performs standard exit processing
void CallMonitor::exitProcedure(ADDR parentFramePtr,
                                ADDR *retAddrPtr,
                                const TICKS &endTime)
{
    // Pops shadow stack until finding a call record
    // that matches the current stack layout.
    bool inSync=false;
    while(1)
    {
        // Retrieve original call record
        CallInfo &ci = callInfoStack.back();
        ci.endTime = endTime;
        *retAddrPtr = ci.origRetAddr;
        if (ci.parentFrame==parentFramePtr)
        {
            logExit(ci,true); // Record normal exit
            callInfoStack.pop_back();
            return;
        }
        logExit(ci,false);    // Record exceptional exit
        callInfoStack.pop_back();
    }
}

// Default entry logging procedure
void CallMonitor::logEntry(CallInfo &ci)
{
    indent(callInfoStack.size()-1);
    string module,name;
    getFuncInfo(ci.funcAddr,module,name);
    printf("%s!%s (%08X)\n",module.c_str(),
            name.c_str(),ci.funcAddr);
}

// Default exit logging procedure
void CallMonitor::logExit(CallInfo &ci,bool normalRet)
{
    indent(callInfoStack.size()-1);
    if (!normalRet) printf("exception ");
    printf("exit %08X, elapsed time=%I64d\n",ci.funcAddr,
           ci.endTime-ci.startTime);
}

void CallMonitor::getFuncInfo(ADDR addr,
                              string &module,
                              string &funcName)
{
    SymInitialize(GetCurrentProcess(),NULL,FALSE);
    TCHAR moduleName[MAX_PATH];
    TCHAR modShortNameBuf[MAX_PATH];
    MEMORY_BASIC_INFORMATION mbi;

    VirtualQuery((void*)addr,&mbi,sizeof(mbi));
    GetModuleFileName((HMODULE)mbi.AllocationBase,
                      moduleName, MAX_PATH );

    _splitpath(moduleName,NULL,NULL,modShortNameBuf,NULL);

    BYTE symbolBuffer[ sizeof(IMAGEHLP_SYMBOL) + 1024 ];
    PIMAGEHLP_SYMBOL pSymbol =
            (PIMAGEHLP_SYMBOL)&symbolBuffer[0];
    // Following not per docs, but per example...
    pSymbol->SizeOfStruct = sizeof(symbolBuffer);
    pSymbol->MaxNameLength = 1023;

    DWORD symDisplacement = 0;
    SymLoadModule(GetCurrentProcess(),
                  NULL,
                  moduleName,
                  NULL,
                  (DWORD)mbi.AllocationBase,
                  0);

    SymSetOptions( SymGetOptions() & ~SYMOPT_UNDNAME );
    char undName[1024];
    if (! SymGetSymFromAddr(GetCurrentProcess(), addr,
                            &symDisplacement, pSymbol) )
    {
        // Couldn't retrieve symbol (no debug info?)
        strcpy(undName,"<unknown symbol>");
    }
    else
    {
        // Unmangle name, throwing away decorations
        // that don't affect uniqueness:
        if ( 0 == UnDecorateSymbolName(
                pSymbol->Name, undName,
                sizeof(undName),
                UNDNAME_NO_MS_KEYWORDS |
                UNDNAME_NO_ACCESS_SPECIFIERS |
                UNDNAME_NO_FUNCTION_RETURNS |
                UNDNAME_NO_ALLOCATION_MODEL |
                UNDNAME_NO_ALLOCATION_LANGUAGE |
                UNDNAME_NO_MEMBER_TYPE))
            strcpy(undName,pSymbol->Name);
    }
    SymUnloadModule(GetCurrentProcess(),
                    (DWORD)mbi.AllocationBase);
    SymCleanup(GetCurrentProcess());
    module = modShortNameBuf;
    funcName = undName;
}


//End of File

January 1999/Automatic Code Instrumentation/Figure 3

Figure 3: Hierarchical report sample

TestFunction(void)
Calls: 1
Total time: 0.8922 seconds (73.1% of total),
            0.8922 seconds per call
Time excluding children: 0.5414 seconds (44.4% of total),
                         0.5414 seconds per call

Called by: TestMain(void)

January 1999/Automatic Code Instrumentation/Listing 1

Listing 1: The CallMonitor class definition

// Copyright (c) 1998 John Panzer.  Permission is granted to
// use, copy, modify, distribute, and sell this source code as 
// long as this copyright notice appears in all source files.
#ifndef CALLMON_H_
#define CALLMON_H_

#include <string>
#include <vector>

class CallMonitor
{
public:
    typedef unsigned int ADDR; 
    typedef __int64 TICKS;     // Timing information

    CallMonitor();

    // Thread-local singleton accessor
    static CallMonitor &threadObj();

    static void threadAttach(CallMonitor *newObj);
    static void threadDetach();

    // Called when a client function starts
    void enterProcedure(ADDR parentFramePtr,ADDR funcAddr,
                        ADDR *retAddrPtr,
                        const TICKS &entryTime);

    // Called when a client function exits
    void exitProcedure(ADDR parentFramePtr,ADDR *retAddrPtr,
                       const TICKS &endTime);

    // Retrieves module and function name from address
    static void getFuncInfo(ADDR addr,std::string &moduleName,
                            std::string &funcName);

    // Wrappers for system-dependent timing functions
    static inline void queryTicks(TICKS *t) {
        QueryPerformanceCounter((LARGE_INTEGER*)t);
    }

    static inline void queryTickFreq(TICKS *t) {
        QueryPerformanceFrequency((LARGE_INTEGER*)t);
    }
protected:
    // Record for a single call
    struct CallInfo
    {
        ADDR funcAddr;    // Function address
        ADDR parentFrame; // Parent stack frame address
        ADDR origRetAddr; // Function return address
        TICKS entryTime;  // Time function was called
        TICKS startTime;  // Time function started
        TICKS endTime;    // Time function returned
    };
    typedef std::vector<CallInfo> CallInfoStack;

    // Override to log entry events.  
    virtual void logEntry(CallInfo &ci);

    // Override to log exit events.
    virtual void logExit(CallInfo &ci,bool normalRet);

    virtual ~CallMonitor();

    // "Shadow" stack of active functions
    CallInfoStack callInfoStack;
    TICKS threadStartTime;  // Time thread started

private:
    static DWORD tlsSlot;
};

#endif // CALLMON_H_


//End of File

January 1999/Automatic Code Instrumentation/Listing 2

Listing 2: Sample DllMain

#include <windows.h>
#include "CallMon.h"
// Include derived class header here

BOOL WINAPI DllMain(
    HINSTANCE hinstDLL,  
    DWORD fdwReason,     
    LPVOID lpvReserved   
    )
{
    switch (fdwReason)
    {
    case DLL_PROCESS_ATTACH: // fall through
    case DLL_THREAD_ATTACH:
        // Create instance of derived class below
        CallMonitor::threadAttach(new CallMonitor);
        break;
    case DLL_THREAD_DETACH:  // fall through
    case DLL_PROCESS_DETACH:
        CallMonitor::threadDetach();
        break;
    }
    return TRUE;
}
//End of File

January 1999/Automatic Code Instrumentation

Automatic Code Instrumentation

John Panzer

You can profile function execution times easily with just a little help from a VC++ hook and any web browser.

Profilers and coverage analyzers are useful tools that can help provide information about bottlenecks and control flow. Occasionally they can be critically important to a successful project. Although many development environments provide these tools, they lack the functionality necessary to solve all control-flow and timing problems. Commercial products can be costly and may not be customizable. Faced with problems like these, I created the CallMonitor utility framework. CallMonitor traces a client program's function calls and provides simple hooks for building more sophisticated tools.

The CallMonitor utility is implemented as a support DLL. It requires no client-code changes, just a recompile and relink. It can instrument only portions of a program if desired. It is thread safe and traces exceptions cleanly.

I used CallMonitor to implement a hierarchical call profiler, which pinpointed several bottlenecks in a large project. Other possible applications include a trace utility, a coverage analyzer, and an error test-case exerciser. Tools can extend CallMonitor by inheriting from class CallMonitor (Listing 1, callmon.h).

CallMonitor does have a couple of limitations. The mechanism it uses is limited to function-call tracing and cannot be used to trace the execution of individual statements. It makes use of a Microsoft-specific compiler switch to help insert instrumentation calls, so it is currently limited to Visual C++. CallMonitor has been tested under Visual C++ versions 5.0 and 6.0, Professional and Enterprise editions.

How to Instrument Procedure Calls

A function profiler needs to know when each function call begins and ends. The Microsoft VC++ compiler provides half of the solution: the semi-documented /Gh compile flag makes the compiler insert a hidden call to a _penter function at the start of each client function. (The Win32 SDK Call Attributed Profiler uses this method as well.) Although the source code remains unchanged, the object code includes the hidden call as the first instruction in the function. For example, the function

void client() {}

generates the following assembly code on the Intel platform:

_client   PROC NEAR
call __penter
push ebp
mov  ebp, esp
pop  ebp
ret  0

The next step is to provide a _penter function in a support library and link it in. The prototype for _penter is:

extern "C" void _penter();

(see Figure 1, CallMonitor.cpp). Once inside the _penter function, CallMonitor needs to know the address of the client function that called it. The _penter function can deduce this with a small amount of inline assembly. It loads the current frame pointer, stored in the Intel CPU's ebp register, into a local variable. This gives access to the frame record, which provides the return address for the current _penter call. Subtracting five bytes yields the address of the client function. Of course, all of this is nonportable and is processor and compiler dependent. Fortunately, the nonportable code can be isolated to a handful of small functions.

The CallMonitor maintains a separate stack, called the shadow stack, where it records information pertaining to the client function call. This stack is a member of the CallMonitor class, callInfoStack, and is implemented as a vector of CallInfo objects (described later). The shadow stack stores the client function's address and the time of entry to the function. It also stores the client function's return address, which will eventually be used to return control from CallMonitor to the client function's caller.

Given the client function address, the support library can record the calls made to each function. This is enough for a coverage analyzer, but a profiler needs to know when the function ends as well. Unfortunately, the compiler doesn't generate a call to a _pexit function parallel to _penter. But given a little stack manipulation, it's possible to trick the client function into returning to a _pexit thunk instead of to its caller. The _pexit thunk calls instrumentation code and then returns to the original address (cached by _penter).

Manipulating the Program Stack

Within the _penter function, the ebp register points at the current stack frame. _penter first moves the contents of ebp into a local variable, framePtr, for easier access (see Figure 1). Starting at framePtr, the stack contains three useful pieces of information:

framePtr[0]    Parent stack frame pointer
framePtr[1]    Return address of _penter function call
framePtr[2]    Return address of client function

The return address for the client is stored in framePtr[2]. Changing framePtr[2] to the address of a support library function (_pexitThunk, Figure 1) will make the client return to that function instead of to its caller. This gives the support library the needed hook to record the client function exit.

Since _pexitThunk gets control through a ret instruction instead of a call, it can't use a normal return sequence. The thunk first pushes a placeholder null value on the stack. The eax register contains the client function's return value, so the thunk saves this value temporarily by pushing eax onto the stack. Finally, the thunk routine calls a normal _pexit function to complete the exit processing. When _pexit returns, the thunk restores the eax register value and returns.

The _pexit function uses its frame pointer to find the address of the placeholder pushed onto the stack by _pexitThunk. It passes the address of this word to the exitProcedure method, which restores the client function's original return address from information stored in the CallMonitor object (see Figure 2, callmon.cpp). When _pexitThunk eventually returns, it pops this address from the program stack.

Tracing Exceptions

The code for exitProcedure is complicated by client exceptions. When a function exits because of an exception, it does not return normally, so _pexitThunk never gets control. Usually, the exception is eventually handled by a higher-level function. When that function returns, it will be redirected through _pexitThunk. If there is no way of synchronizing the program stack with the shadow stack maintained by the support library, the code will insert the wrong return address and most likely crash the program.

In order to deal with this, exitProcedure doesn't just pop the top of the shadow stack. It walks the shadow stack looking for a record whose frame pointer matches the frame pointer on the program stack. Only when it finds a matching frame pointer does it restore the corresponding return address. This recovery scheme may fail under some unlikely circumstances (by incorrectly matching frame pointers). This has not been a problem in practice.

The CallMonitor Class

The job of a CallMonitor object is to monitor and measure calls for a given thread. Each thread in the client process automatically gets its own CallMonitor object. This occurs when the CallMonitor::threadAttach function is executed (see Listing 2, dllmain.cpp).

Upon entry to a client function, control passes to the _penter function, which in turn calls the CallMonitor::enterProcedure function. Within this function, CallMonitor gathers information and performs housekeeping on its internal data structures and then calls function logEntry. The CallInfo structure passed into logEntry contains two useful pieces of information: funcAddr is the address of the client function, and entryTime is the approximate time the client function was called. (Times are measured in ticks, and are stored in variables of type TICKS. Both ticks and TICKS are platform-dependent entities.) Typically, not much needs to be done within method logEntry. The default implementation writes the information to stdout, which can be useful for testing.

The logExit function is called in two situations. Either a client function has returned normally, in which case normalRet is true, or the function exited via a client exception. In either case, for each logEntry, there is always a corresponding logExit. The CallInfo structure passed into logExit contains additional timing information. startTime contains the approximate time at which the body of the procedure began execution; endTime contains the approximate time at which the body finished execution. The elapsed time spent within the function and its children is (endTime-startTime).

CallMonitor's callInfoStack member contains the shadow call records, and is accessible to derived classes.

CallMonitor provides some useful static utility methods. threadObj returns a reference to the CallMonitor object for the current thread. getFuncInfo retrieves the name of a function from debugging information, given its address. This is useful for reporting statistics. queryTicks retrieves the value of a platform-dependent timer. queryTickFrequency retrieves the number of ticks per second for the current platform.

Using CallMonitor

The most common way to use CallMonitor is to extend it via inheritance. To extend CallMonitor, create a derived class and instantiate objects of that class inside DllMain. In the derived class, override the logEntry and logExit methods.

To use CallMonitor as the basis for a new tool, first create a DLL to contain a class derived from CallMonitor. Within the DllMain procedure, create instances of the derived class and register them with threadAttach (see Listing 2). Include CallMonitor.cpp (Figure 1) in the DLL and build normally.

To use the DLL with a client program, modify the client program's build procedure to include the /Gh compile flag and to link with the DLL's import library. Then rebuild the client and run it normally. The client can consist of an executable and any number of related DLLs, either implicitly or dynamically loaded. (The latter situation is one where the built-in Visual C++ profiler falls short.)

Applications

The most straightforward application of CallMonitor is a trace utility. The default implementations of logEntry and logExit already print a generic trace to the console. Additional useful features would be the ability to separate the output from different threads, display output in a graphical UI, or filter output based on depth, function address, or global flags.

A Coverage Analyzer

By keeping a map of function addresses vs. call counts, it is simple to create a logEntry method that tracks the number of times each function was called. In combination with a .MAP file or by using IMAGEHLP functions, it would not be difficult to create a table of all available functions and then a report showing which functions were called and which weren't.

An Error Simulator

Although a large percentage of production code is devoted to error handling, testing this code through black-box techniques can be difficult. Modifying code to insert special error conditions is a possibility [1], but does require modifying source code. The CallMonitor library provides a way to induce simulated errors into a program as part of a test suite.

Within logEntry and logExit methods, the exerciser can throw any client exception to simulate error conditions. By modifying the _pexitThunk code to pass the address of the client return value, the logExit method could also modify return values from selected functions to simulate errors. The utility could read a configuration file containing a table of functions and exceptions to throw when encountered, and then trigger the exceptions when necessary.

A Call Time Profiler

A profiler needs to track the time spent within the body of each client function. It also needs to be able to differentiate between time spent in a function and time spent in the function's children. The CallMonitor framework provides the timing information needed for these tasks. The basic strategy is as follows.

In the logExit method, the profiler calculates the elapsed time as (ci.endTime-ci.startTime). It adds this elapsed time to a running total for the current function, ci.funcAddr. This enables the profiler to track the total amount of time spent in each client function overall. The profiler also needs to adjust times for the CallMonitor entry and exit overhead. The overhead of the CallMonitor entry code can be estimated by (ci.startTime-ci.entryTime). The overhead of the exit code is harder to estimate accurately, since part of the overhead of the exit code is the calculation and storage of the overhead itself. This means that a significant part of the overhead can't be measured directly. The profiler can get around this by using the overhead of the previous logExit call as an estimate for the current call.

The profiler needs to keep track of how much of each function's elapsed time is spent in its children, so that it can report both the overall time and also the time spent within the function itself. It can track this with an additional counter, childTimes, associated with each function. The parent address for the current function can be found on the callInfoStack.

Finally, the profiler needs to adjust all times for the overhead that instrumentation introduces into the program times. This can be a major source of error for small, frequently called functions. If the profiler is not calibrated correctly, these functions can appear to take orders of magnitude more time than they actually do. Since elapsed times also include the overhead for all children, calculating the correct adjustments can be quite tricky.

Of course, it helps to get the most accurate clock available. The Windows API function QueryPerformanceCounter accesses a moderately high-resolution timer. However, if CallMonitor is running on a Pentium processor, there is a much better alternative available. The RDTSC assembly instruction accesses a 64-bit cycle counter built into the chip [2]. The portable version of CallMonitor uses QueryPerformanceCounter, but the Pentium-specific version uses RDTSC to implement queryTicks.

When profiling a multithreaded program, function times will include time spent in other threads. This may be a problem or an advantage, depending on what is being profiled.

Hierarchical Profiling

Although recording the total time spent within each function is useful, it may not provide all the necessary information. For example, it might not be useful to know that 30% of the program's time is spent in an optimized FastStrCpy function, but it could be helpful to see which functions call it heavily. Hierarchical profilers provide this. They display children and parents for each function, along with times and call counts in the context of that function.

Although the profiler can collect this information at run time, storing it is a problem. To keep track separately of all generations of descendants of a given call would be too slow for real-world programs. Fortunately, many of the benefits of a hierarchical profiler can be realized by storing only one additional level of information with each function record. By keeping track of the call times for each function's immediate children, the code provides a good picture of which children are taking the most time within that function. Drilling down to that child function, then, gives a picture of that child's average behavior.

Given the tables of function statistics, it is fairly simple to write out a basic report listing functions in order of total time. But it can be much more effective to see hierarchical-profiling information as a tree or graph. Some commercial products have extensive UI support for this, displaying a graph of function calls and allowing the user to simulate the effects of optimization. Lacking the time to create such a UI, I found that a simple HTML report was a great way to present this information. The report generator took only a short time to develop.

The HTML report displays each function in its own section. The children of each function appear in a table, along with the child times and call counts. Since each child function name is a hyperlink, the user can drill down into the most expensive child and see where it typically spends its time. Figure 3 shows the general layout of such a report.

When profiling, it's best to take measured times with a grain of salt. All profilers have measurement error, and the act of measurement affects performance as well. Profiling can impact processor cache hits, thread synchronization, and paging. The best approach is to profile using a modified release-mode build, with small, frequently called functions inlined. Use the profiler to identify likely candidates for optimization, and check the effects with the profiler disabled.

Conclusion

The CallMonitor class provides an extensible way to monitor and measure procedure calls within a client program. It is a customizable and extensible framework. It is currently the basis for a hierarchical call profiler and can be used for other tools as well.

References

[1] Dave Pomerantz. "Testing Error Handlers by Simulating Errors." C/C++ Users Journal, June 1998.

[2] Mark Russinovich. "Performance Instrumenting Device Drivers." http://www.sysinternals.com/sysperf.htm.

John Panzer is a developer with Computer Associates in Sarasota, FL. He has nine years of experience in software development, specializing in C++, Windows, graphical interfaces, and software tools. John has an MS in Computer and Information Sciences from UC Santa Cruz. He can be reached at [email protected].