Ron is a consultant specializing in C/C++ business simulations and technical and scientific software. He can be contacted at [email protected]
Sidebar: Visual Basic 5 Profiling
Sidebar: TracePoint Visual Coverage 1.0
Performance improvements come in many different flavors. Changes in the algorithm usually bring the greatest gains: The obvious example of this is the replacement of bubble sort by quicksort or heap sort, reducing time complexity from O(n2) to O(n log n). The next level of improvements usually comes in the actual implementation, when you address issues such as automatic versus heap-based memory, working set and virtual memory behavior, recalculating versus storing results, the structure of loops and branches, and so on. This level of improvements largely determines the constants in front of the O(n log n) bounds. The third level of improvements include processor-specific optimizations that affect caching behavior, internal parallelism, branch predictions, and the like. These may call for rearrangement of loops or use specialized instructions (MMX intrinsics, for example) or library primitives.
Clearly, what you need are different kinds of information for the different stages in the optimization process. At the algorithm level, high-level overviews of caller/callee relationships and intensity (in the form of annotated call graphs) are invaluable. At the source-code level, you want function-by-function or line-by-line timings and counts. Finally, at the processor level, you need instruction breakdowns annotated with the relevant processor behavior. Traditionally, profilers have addressed the middle (source code) stage, although new-generation tools for the other two levels are beginning to appear.
In this article, I examine profiling tools that target Win32 C/C++ development. I used all of the tools for several weeks as part of my development process, and have selected a number of test scenarios (available electronically; see "Resource Center," page 3) that represent a broad range of applications. Some of the tools support other languages or platforms as well (see the text box entitled "Visual Basic 5 Profiling"). Table 1 presents an overview of the features each tool offers. The tools I used include:
- Intel VTune 2.5.
- Microsoft Visual C++ 5.0 (profiling tools only).
- Rational Visual Quantify 4.0.
- TracePoint HiProf 2.0.
- Watcom C++ 11.0 (profiling tools only).
- Win32 SDK (profiling tools only).
Editor's Note: As we go to press, TracePoint's web site (http://www.tracepoint.com) reports that the company has decided to discontinue business operations effective immediately. TracePoint was originally formed in 1995 from within DEC's Western Research Lab, and became an independent company in 1997. Perhaps another company will pick up either TracePoint's technology and/or the HiProf and Visual Coverage pieces. We'll keep you posted.
Intel VTune 2.5
Intel makes microprocessors, and its VTune profiler reflects that fact. The basic profiler uses sampling to obtain measurements of the program under test, but also includes a variety of static and dynamic code-analysis tools that help you make the most of the Intel processors at, shall I say, a painful level of detail. If you are interested in pairing issues, instruction penalties, and processor cache misses, you're in for a treat.
The operation of the program is straightforward: You specify which program to profile, what options to use for the profiler and program, and off you go. VTune executes the program, collects samples, and (when it's done) displays the Modules Report -- the first of a number of bar charts showing activity in your program and the rest of the system. From there, you drill down into areas of interest to obtain other graphics, or annotated source-code listings.
As an alternative to sampling, the VTune Code Analyzer performs a static-code analysis of your program and provides information about the expected performance and low-level behavior of the various Intel processors. The same information can be obtained through Dynamic Assembly Analysis, which analyzes small sections of your program in great detail by actually running the entire program and simulating the performance of the area of interest instruction-by-instruction (as opposed to using the sampling method applied elsewhere). All methods can show your source code (where available) interspersed with assembly code, and annotated with remarks about processor performance issues for the Intel processors.
To help you translate this information to the source-code level, the C and Fortran Code Coaches point out source-level improvements. They require preprocessed source code to do so, and accept a makefile, command line, or ready-made preprocessed source file as input. If you select an area in this file (normally a function or possibly a nested loop) and invoke the Code Coach, you get specific advice for the optimization opportunities that the Code Coach recognizes, most of which have to do with loops and branches.
In addition to the actual profiler, the VTune package also contains the Performance Toolset with C/C++ and Fortran compilers that can be plugged into the Microsoft Developer Studio environment, several numerics and signal-processing libraries, and a wealth of reference information about the Intel processors. The annoying thing is, however, that the access program is a Win16 application that doesn't recognize long filenames and consequently couldn't start Acrobat Reader located in "C:\Program Files\Acrobat3."
So how useful was VTune during my development? Given that I don't develop high-performance numerical codes or design code generators for a compiler, the level of detail offered by VTune was well beyond my needs. As a C/C++ programmer, I don't have very fine control over the eventual instructions that are fed to the CPU, and after my initial amazement over all these processor intricacies, I had little practical use for them. Things might have been different if the Code Coach would have worked properly, but try as I might, I never got beyond VTune's message that it had encountered an error while parsing my (preprocessed) source-code files.
The fact that VTune uses sampling as its data collection method means that execution speed is excellent, but that the granularity (which is modifiable) is not always sufficient to capture all required information. In my constrained optimization test case for example, VTune completely missed the index operator that caused the initial performance hit -- presumably because the operator itself didn't take long to execute per call.
Then there are a few other matters that hamper effective use of the profiler. First, it is inconvenient that the profiler samples for a predefined amount of time. It does not stop sampling when your program terminates; you'll have to interrupt the sampling session manually in that case. Conversely, if your program runs longer, you'll have to adjust the length of the sampling session and try again. Second, there was no good way to restrict profiling to specific areas -- in fact, the samples would include all system activities, relevant or not. Intel advertises this as a feature, but I'm inclined to see it otherwise. Third, the bars in the graphics were usually awfully thin, say a pixel or so. Since they are the main means of navigating through the profile data, I had a difficult time (even with a 20-inch monitor) pointing with the mouse cursor and selecting the correct one each time. That each drill down action brings up a new top-level window with yet more bars doesn't help either. I tended to quickly lose track of all the windows and bars.
To sum up, Intel's VTune excels with processor-level performance hints, but I found it less useful for general-purpose work. It is harder to use than several of the others, the sampling makes for a fairly coarse granularity, and the information provided is too detailed for common usage. However, static and dynamic analysis give you insight into processor behavior, and the electronic documentation on low-level optimizations and Intel processors is valuable even if you don't need to wring out every cycle of performance.
Microsoft Visual C++ 5.0 Profiler
Inside the Microsoft Visual C++ 5.0 (and earlier) box, you'll find a profiler. This profiler can be used to obtain line- or function-level timings and counts, and as a simple coverage tool at the line or function level. The profiler tools consist of three console applications:
- One that modifies the executable or DLL under test by thunking function calls and inserting breakpoints to divert the flow of control to the recording part of the profiler.
- Another that is the actual recorder.
- A third that creates output lists in a variety of formats.
Normally, a batch file controls the operation of the profiler tools and the Visual/ Developer Studio contains a command that runs this batch file for you with the correct executable name filled in. In that case, the output listing appears in one of the tabs of the Output Window. You should be aware that this only works for GUI applications; to profile console applications, you revert to the command line and manually start the correct batch file.
The profiling tools are fairly versatile, allowing you to fine-tune both the instrumentation process (function level or line level, timings, counts, or coverage), and the recording process (determining start and stop points of the profile -- including or excluding specific functions or modules). It also lets you specify how the output listing should appear (sorted according to some criterion, or in tab-delimited format for use by Microsoft Excel and other tools). You can also merge data from different runs to obtain an averaged effect. However, the whole process is purely command line and batch-file-based (with some help from TOOLS.INI), and is definitely not comparable to the GUI-based competition. Moreover, if you want graphical output, you'll have to use Excel or another tool. Without them, you are looking at (sorted) lists of function and line timings.
In actual use, these tools aren't too bad. In fact, they are the sort of profiling tools that have been around for years on most platforms. If you are prepared to spend some time learning command-line options, understanding the profiling process, and working your way through the output listings, they will get you most of the way. The standard cases are all handled by a set of straightforward batch files. The things I missed most (besides ease of use) were the ability to establish caller/ callee relationships and an easy means to annotate source code with the profiling information. In addition, the profiling overhead was noticeably larger than with other instrumentation-based profilers (as a group, they are much slower than sampling profilers are anyway).
In summary, the profiling tools that come with the Microsoft C++ compiler are definitely useful, but not as easy to use or as complete in their analysis options as the best of the flock. On the other hand, they are free once you have the compiler.
Rational Visual Quantify 4.0
Building on the same instrumentation techniques used in Purify/NT, Rational has introduced a profiler for C/C++, Java, and Visual Basic 5 (I only examined C/C++ programs). The tool comes with its own environment, from which you load the program to be profiled. Visual Quantify instruments the program and all the DLLs that it uses (saving the instrumented versions under a different name), then runs it to collect profiling data. When finished, Visual Quantify displays the call graph (with the critical path highlighted), function list, and session summary windows for the run. From here, you can access further information in the form of function details (showing callers and descendants of each function) and annotated source code.
There are many ways to customize Visual Quantify's mode of operation. You can choose between instruction counting (at the line or the function level) or function timing as the measurement method. Certain Windows system modules, however, are always timed rather than instruction counted. Next, you specify the options to the program being tested (although output redirection is not supported). Finally, when the results are displayed, the Filter Manager lets you hide or delete module or function data from the views. For enhanced control, a small API is defined that lets your program take control over its own profiling -- starting and stopping data collection, clearing the buffers, and so on. Obviously, this requires modifications to your source code and recompilation, which is not necessary if you stick to the GUI environment of Visual Quantify proper.
Visual Quantify also works with multiple profiling runs in a given project. With a few commands, data from different runs can be merged to obtain an average or differed to see changes in performance. The latter facility is particularly helpful and uses the color coding in the usual call graph and function views to indicate where performance has improved or deteriorated. If you run a program with multiple threads, the profile shows the combined timings of all threads. By selecting a thread in the Call Graph and focusing on its subtree, the profile reduces to just that thread.
In day-to-day development, Visual Quantify is a pleasure to work with. Its features are well thought out, its UI is intuitive and helpful, and the various views help to analyze the profiling data in several ways -- from the high-level overviews and call patterns to detailed breakdowns per function and source-code line. With the information presented as is, you are likely to emerge with a better understanding of your program's behavior than you thought possible.
However, a few quirks arose during testing. For large profiles, Visual Quantify requires prodigious amounts of virtual memory (the README file advises to reserve 200 MB!). Furthermore, C++ filenames that had a bool parameter were not unmangled. Surprisingly (in view of all the information available), the annotated source-code view does not show line counts, only line timings. Finally, Visual Quantify is limited to Windows NT and Microsoft compilers.
Nevertheless, Visual Quantify is an excellent and professional profiler, accompanied by good documentation.
TracePoint HiProf 2.0
The introduction of TracePoint's HiProf 1.0 profiler broke new ground for Win32 profiling tools and HiProf 2.0 has several improvements over its predecessor, including support for Visual Basic 5. The product has its own GUI-based workbench from which applications are loaded, instrumented, and run. The instrumentation process prepares your executable and its modules for the data-collection run, but skips any modules it considers to be "system modules." Instead of instrumenting these modules, HiProf uses Call Site instrumentation -- basically adding timing code to the callers of noninstrumented module functions. This cleverly sidesteps problems that might arise from modifying system modules, but prevents data collection on functions only used inside those modules. After instrumentation, the program is run under HiProf's control. The results are displayed in a variety of formats -- the Function View (a list of functions), Hierarchical View (a pie-chart breakdown of function callers and descendants), Critical Edge list, and Critical Chain. Further views include the Source View and a navigator view with tabs for profiles and modules. Wherever sensible, views are linked so that navigation in one view is tracked by the other views.
HiProf's operation is subject to a number of settings, the most important being the actual measurement method: instruction counting or function timing, with the proviso that calls to functions in system modules are always timed. Run-time options determine the command-line arguments to the program under test (command-line redirection is supported) and the use of HiProf's console, which is a small control unit to pause and resume the profiling process, and to store snapshots or clear the profiling data. For finer control, HiProf offers tracepoints, which are breakpoints in your program that cause HiProf to execute some action, such as starting or stopping data collection, or storing a snapshot. Tracepoints can only be set on entries or exits of functions, but they are ideal for concentrating on specific parts of your program without changing the source code.
A given project may contain many snapshots, from different runs or different stages within a run. However, to compare two snapshots, you'll either have to start a second instance of HiProf and arrange the views side by side, or use a command-line utility to merge or diff the data from different snapshots. This is probably the weakest point in an otherwise excellent product. For multithreaded programs each thread is shown separately and merging must be done with the command-line utility.
In actual use, HiProf 2.0 pairs with Visual Quantify in features and ease of use. With the introduction of new views in Version 2.0, the data analysis views are on par with Visual Quantify. They give excellent information from many different perspectives, and the internal synchronization among the views means that you waste no time coordinating different sorts of information. In combination with its tracepoints, HiProf lets you tailor both data collection and data presentation without having to touch your source code. In fact, HiProf made profiling and optimizing an application almost addictive -- a far cry from the laborious process that profiling used to be.
Watcom C++ 11.0 Profiler
Similar to Microsoft Visual C++ 5.0, the Watcom C/C++ compiler comes with its own profiler. This one is based on sampling and is packaged as two separate programs -- one to collect the samples, the other to present results. In the Watcom IDE, the Sample command collects the data, and the Profile command lists the results as a bar chart showing images (sample run snapshots) and modules. Drilling down leads you to individual functions and finally to source code annotated with bar graphs that indicate the percentage of time spent in that function. If you want, you can export the sample data to DIF or comma-separated text files.
The whole process has few frills. Unfortunately, not much information is obtained either. The sampling process only records hits per function, and all you get (be it in graphical format or an exported data file), are the names and the hits per function. Although this does give a coarse picture of the program's behavior, I found it insufficient for any sensible optimizations. For example, my MkDep test case spends most of its time in the Windows function ReadFile(), and the Watcom profile never caused me to suspect the function or its callers. Likewise, the index-operator problem in my constrained optimization sample was missed completely. I assume that this basic approach to profiling stems from Watcom's desire to implement the profiling tools on all platforms it supports, but I'm not so happy with the end result.
On the whole, you will probably find the Watcom profiler insufficient for serious profiling needs. All is not lost, however. The Win32 SDK included with the Watcom compiler package contains some profiling tools that are much more useful, at least when it comes to profiling Win32 programs.
Win32 SDK Profilers
Often overlooked, Microsoft's Win32 SDK contains a lot of useful programs. For profiling purposes, I found no less than five tools that would give information about an application's performance, and that is without counting PVIEW or PERFMON. Some of the tools are very specialized, but at least one has features that make it a viable alternative to the commercial competition -- if you are willing to spend time learning it. Several of these tools assume specialized compiler or linker options (or compatibility), which may not be available in all compilers.
- APIMON (API Monitor) is a stand-alone GUI-based program that runs other applications, keeping track of the Win32 API functions they call (complete with parameter and return values), how much time they spend there, and a variety of other things such as heap checking and page faults. It is simple to operate and gives yet another view on your application's behavior. However, there is no way to relate the information to any specific locations inside your program, and the tool does not provide for arguments to the program under test, which severely limits serious testing of command-line applications.
- CAP (Call Attributed Profiler) most closely resembles the commercial profilers. It requires you compile your code with a special compiler option (/Gh for Microsoft C/C++) that inserts _penter() hooks at the start of each function. These hooks are resolved by linking with the CAP.LIB import library, and at run time the associated CAP.DLL module will be loaded and used to record the time spent per function. With the aid of some further programs (most notably CAPVIEW) the captured profile is then displayed as an annotated call tree or as a list of function counts and timings (with time per function and time spent in descendants separated). Colors are used to mark the most critical functions. I was really surprised by the usefulness of this program. Although it is not so versatile as the command-line profiler that comes with the Microsoft compiler, it does provide a fairly accurate picture of the most critical performance data, and it does so in a format that is very usable.
- FIOSAP (File I/O and Synchronization Win32 API Profiler) is designed to help identify I/O and synchronization bottlenecks in multithreaded programs, although you can also use it to monitor file I/O activity in single-threaded programs. It uses a small helper program to patch your executable and reroute all KERNEL32 calls to its own FERNEL32 module, which collects data and forwards the call to the appropriate KERNEL32 function. The data collection comprises file I/O functions and operations on synchronization primitives such as semaphores, events, and mutexes. The information can be used to assess the amount of time spent waiting on the various operations. However, the results are summed over all threads in an application, and it is up to you to find out what the actual causes of performance loss in this area are.
- PROFILE (Win32 Sampling Profiler) (not to be confused with the Microsoft Visual C++ profiler of the same name) operates essentially the same way as the Watcom profiler and runs your program while taking periodic samples of the instruction pointer's location. The result is a text file that indicates the number of hits per function. While it operates a lot faster than CAP, I don't find the information it gathers very useful.
- WST (Working Set Tuner) helps you reduce the working set of your program. Similar to CAP, WST requires recompilation with the insertion of _penter() hook functions and resolves these in the WST.LIB import library. At run time, the WST.DLL module takes frequent snapshots of which functions were called during the period of time preceding the snapshot. The result shows which functions are used close together in time, and the WSTUNE program applies this information to produce a packing list for the linker that places temporally near functions also physically near, thus reducing the working set of your application. The whole process is something that you want to do when your application is almost ready for shipment, because during development the constant addition and removal of functions invalidate any WST results rather quickly. Nevertheless, WST can give your applications the final touch when it comes to performance.
To examine these profilers, I selected a representative sampling of profiling tasks. All the applications are written in C or C++ and run on Win32 platforms.
MkDep. This is a console application that reads C and C++ source files and generates a list of #include dependencies suitable for use in a makefile. On the whole, the program is I/O-bound. My hunch before profiling was that the file I/O would be the problem, and I had spent time optimizing this area with special buffering and so on. The benchmark I used throughout testing was the dependency list for MFC 4.2. In the original version of the program, it took about 5:00 (mins:secs) on my system to process all those files. In the version as it currently stands, this has been reduced to 0:25.
The improvements came from reducing I/O traffic, but not in the way I expected. It turned out that the original version was spending 75 percent of its time in the _access() function I used to look for header files along the INCLUDE paths. I never did suspect that function until I profiled. Still, all that function did under Win32 was call GetFileAttributes(), and there is little room for improvement there. Fortunately, the profiles showed that it wasn't just the time spent per call, but also the number of times it was called. To cut a long story short, I implemented a caching scheme (in effect storing the entire dependency tree of a MkDep run internally) and simplified file I/O. As a result, the program now processes each file in the entire run exactly once, using one ReadFile() call per file to do so. This is, of course, optimal and it accounts for 75 percent of the current run time. The remainder is taken by _access() calls that fail while searching the INCLUDE path, and the final I/O for the actual dependency lists. There is still room for improvement but it can be at most 25 percent, which is good to know because it sets realistic expectations and helps to gauge how much effort should go into further optimizations.
Constrained Optimization. These programs tend to do a lot of internal processing and are therefore mostly CPU-bound. My constrained-optimization test program is another console application that attempts to solve an optimization problem using clever search techniques. It uses large graph-like data structures that display poor locality of reference. To really improve performance in this sort of NP-hard applications (improvements in the order of 1010 or more), you need to improve the search methods, but I used the profilers to see if I could make the operational side of things just mildly faster (say, a factor 2 or 3, which is peanuts to researchers in this field, but still worth some effort).
Before the profiling sessions, I didn't have a good idea about how time was spent in the program, although I suspected that memory allocations might play a role. To my surprise, the profiling sessions revealed that most of the time was spent inside a lowly index operator overloading in one of the array classes, and in the dynamic_cast<> operator used in several critical places. The index operator itself wasn't particularly complicated, but it did bounds checking on each call and was called very often -- well over 43 million times to process just 100,000 nodes in the search tree. The dynamic_cast<> operator was used for a downcast somewhere in the program to obtain application-specific information from a generic tree node. It too was called often (1.4 million times).
After removing these bottlenecks, the program was about 1.24 times faster than the original one, and things became more complicated because the cycle eaters were more evenly distributed. In the end, I was able to speed up the program by nearly a factor of two; further improvements would have required specialized memory allocators or changes to the way information was stored. Without the profiling information, it wouldn't have occurred to me to look at those particular functions -- even though I was the one who designed and implemented both the data structures and the searching algorithm.
Simulation Model Calculations. These test programs are part of a Win32 GUI program that uses MFC as its application framework. My test application is itself a business simulation that processes decisions taken by various "companies" staffed with management trainees. The challenge was to find the bottlenecks in the actual model calculations, since they seemed to be slower than necessary. The problem here is the interactive nature of the application: How do you make sure that the model calculation timings aren't drowned in all the message processing and surrounding activities of the program?
As it turned out, the calculations themselves were no problem, but the code contained calls to logging functions, which created an audit trail of the simulation model's decisions. These logging functions were the real time eaters, and I could only partially remedy that problem. However, only a few profilers let me isolate the calculations from the rest of the program and make this analysis obvious. HiProf's tracepoints came in very useful here; for Visual Quantify I inserted API calls to start and stop the profiling at the correct locations. All other profilers left me wading through long lists of irrelevant information, although with extra effort I could have configured the Visual C++ profiler to exclude nearly everything except for the functions I was interested in. Still, this would have meant a fair bit of work, which would have to be repeated if I had turned to other areas of the program.
Multithreaded Record Processing. My final test was a simple multithreaded program borrowed from the HiProf examples to see how the profilers dealt with multi-threaded programs. I did not attempt to optimize anything here; I was just interested in the information that would be obtained from this analysis.
I was disappointed. With the exception of HiProf and Visual Quantify, all profilers just lump together their timing or sampling information and never show how the time was distributed across threads. Visual Quantify has the most convenient way of separating and combining per-thread information through its Call Graph. HiProf shows all threads separately, but due to its somewhat involved merging procedure, the combined information is not very easy to view.
The profilers I examined use one or more of the following methods to obtain timing information from the program being tested.
Sampling. With sampling, the program is interrupted frequently and a note is made of the location of its instruction pointer. With the aid of a map file or debug information, this location is translated back to a function name or source code position. Normally, all samples within a single function are lumped together. The advantages of this method are its simplicity and the fact that it has comparatively little impact on the execution speed of the program under test; the primary drawback is the coarseness of the information. Even at the best of times the samples are only an approximation of where the time is spent in a program, and for a variety of reasons (granularity, resonance) the collected samples can be downright misleading.
Event-based sampling. Supported only by Pentium Pro processors and later, event-based sampling uses internal counters in these processors to collect information about performance-related events such as cache misses. VTune can use this sampling mode as an alternative to standard sampling, and will collect both the event data and the regular instruction pointer samples by interrupting the processor at frequent intervals.
Timing. A better method is to actually use a timer to measure the time spent in a function. This requires that the profiler be notified of function entries and exits (accomplished through instrumentation or the insertion of hook functions), and assumes an accurate timer. The latter may be provided by the operating system or by the processor. The Pentium and later CPUs contain a high-resolution counter that is used by several profilers for this purpose. The profile is basically a record of wall clock time, which means that it also takes into account things such as cache misses and background activity in the rest of the system. Depending on the purpose, this could be an advantage or a disadvantage, but in any case it is not as repeatable as instruction counting is.
Line counting. Since the overhead of timing is usually too large to make it practical at the line level, line counting is often used to obtain an indication of the program's performance at the source-code line level. It requires some form of breakpoints at the line level and introduces a large amount of overhead. In addition, unless coupled with instruction counting, it may give little information about the actual time spent in some section of code.
Instruction counting. The final method is instruction counting. In essence, the profiler counts how often a particular basic block in the program is executed, then multiplies that with the number of clock cycles required for the instructions in the block. Obviously, this requires detailed information about the underlying processor and is often dependent on the exact version of the processor. Furthermore, the actual number of clock cycles per instruction may vary with the dynamic behavior of the program and its environment (considering things such as cache and pairing behavior, branch prediction, and so on), so the profiler must make some assumptions here. Usually, it assumes optimal execution. Instruction counting does not include any ambient effects, which makes it more reproducible and in some sense "purer," but it tends to give a somewhat optimistic view of the program's speed. Even so, it is used as the most accurate method in the best profilers.
Code analysis. Intel's VTune provides code analysis, a different method that uses very detailed information about processor behavior to report penalties incurred, pairing issues, and expected cache behavior down to the instruction level. This is not profiling in a strict sense, but it does give performance information. A companion Code Coach will sometimes advise on rearrangements of the source code that reduce the performance penalties.
For More Information
2200 Mission College Blvd.
Santa Clara, CA 95052-8119
One Microsoft Way
Redmond, WA 98052
Rational Software Corp.
18880 Homestead Rd.
Cupertino, CA 95014
TracePoint Technology Inc.
One Almaden Blvd., Suite 1100
San Jose, CA 95113
561 Virginia Road
Concord, MA 01742-2732
Copyright © 1998, Dr. Dobb's Journal