The Performance Tuning Cycle
The tuning process is iterative, not sequential; stepping through the three phases once is typically not sufficient. Changes made during the power optimization phase may require a new pass at single-core and multi-core optimization to meet performance targets. A subsequent multi-core optimization pass may in turn place unacceptable demands on power and require further power optimization. Ideally, each round of changes has a smaller impact than the last, until an equilibrium is reached and the performance targets are met; at that point performance tuning can be considered complete. Even then, performance regression tests should be run to ensure that subsequent bug fixes and changes do not negatively impact performance.
Software tools for performance and power optimization aid in your analysis and tuning efforts. The specific tools detailed in this section are arranged according to the performance tuning phase. The information here provides further details on the capabilities and usages of the tools specific to performance optimization.
Single-Core Performance Tools
Tools for analyzing single-core processor performance provide insight into how an application is behaving as it executes on one processor core. These tools provide different views on the application ranging from how the application interacts with other processes on the system down to how the application affects the processor microarchitecture. Many tools are available that provide profiling capability in different ways. Typically, they fall into one of the following categories:
- System profilers. Provide a summary of execution times across processes on the system
- Application profilers. Provide a summary of execution times at the function level of the application
- Microarchitecture profilers. Provide a summary of processor events across applications and functions executing on the system
Two definitions relevant to profiling concern how the data is viewed. A flat profile correlates processes and functions with the amount of time the profiler recorded in each; it does not show the relationships between the listed processes and functions and any other processes or functions executing on the system. A call graph profile shows these relationships and the contributions to the measured times between caller functions and called functions.
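The distinction between the two views can be seen with Python's built-in cProfile and pstats modules, which produce both a flat listing and a callers view. This is only an illustrative sketch; the function names (outer, inner) are invented and are not from any tool discussed here.

```python
# Sketch: flat profile vs. call graph view using Python's stdlib profiler.
import cProfile
import io
import pstats

def inner(n):
    # Deliberately does some work so it shows up in the profile
    return sum(i * i for i in range(n))

def outer():
    return inner(50000) + inner(50000)

prof = cProfile.Profile()
prof.enable()
outer()
prof.disable()

buf = io.StringIO()
stats = pstats.Stats(prof, stream=buf)
stats.sort_stats("cumulative").print_stats()  # flat profile: time per function
stats.print_callers()                         # call graph view: who called whom
report = buf.getvalue()
```

The flat listing ranks `inner` and `outer` by time, while `print_callers` shows that `inner`'s time is attributable to `outer`, its caller.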
Profilers obtain information by sampling or tracing the system while the application is executing. Three common collection techniques are summarized as follows:
- Operating system provided API. Operating system provides capability to periodically sample and record information on executing processes.
- Software instrumentation. Application has code added to trace and record statistics.
- Hardware performance monitoring counters. Employed by microarchitecture profilers. Provide information on microarchitecture events such as branch mispredictions and cache misses.
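The first technique, periodic sampling with operating system help, can be sketched in miniature with a profiling timer and a signal handler. This assumes a Unix-like system with `signal.setitimer`; real profilers sample in the kernel at far lower overhead, and the function name here is invented.

```python
# Sketch: OS-assisted time-based sampling. Every 5 ms of CPU time, SIGPROF
# fires and we record which function the interrupted frame was executing.
import collections
import signal

samples = collections.Counter()

def record_sample(signum, frame):
    # The interrupted frame identifies the currently executing function
    samples[frame.f_code.co_name] += 1

signal.signal(signal.SIGPROF, record_sample)
signal.setitimer(signal.ITIMER_PROF, 0.005, 0.005)  # sample every 5 ms of CPU time

def hot_loop():
    total = 0
    for i in range(3_000_000):
        total += i * i
    return total

hot_loop()
signal.setitimer(signal.ITIMER_PROF, 0)  # stop sampling
```

After the run, the histogram in `samples` attributes nearly all samples to `hot_loop`, which is exactly the flat-profile data a sampling profiler reports.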
Table 1 describes several tools used in single-core performance analysis. These tools are not all equal; some provide functionality that is a superset of others. For example, sysprof can provide a call graph profile across all applications executing on a system, while GNU gprof cannot. However, gprof is available across a wide range of operating systems, whereas sysprof is a Linux-only tool. An exhaustive list of profiling tools is outside the scope of this article; we merely list a few tools representative of the profiler categories above.
System Profiling: Sysprof
Sysprof is a Linux-hosted and Linux-targeted system profiler that provides information across the kernel and user-level processes. The tool offers a very simple user interface, as depicted in Figure 2. To begin profiling, the user presses the Start button. If an application is being profiled, it must be started independently of sysprof. The application itself does not require special instrumentation; however, if detailed function-level information is desired, the application should be built with debug information. To stop profiling and show the collected results, the user clicks the Profile button.
Figure 2 displays a profile of an application viewed using sysprof. The screen is divided into three sections. The top left section, labeled Functions, lists the functions where the greatest amount of time was measured during profiling. Time for an individual function includes the time spent in any functions called as a result of it, that is, its descendents in the call chain. Time is reported in two forms, self time and total time. Self time is the execution time inside the function itself and does not include called functions. Total time is the amount of time inside the function and all of its descendents. The bottom left window, labeled Callers, lists the functions that call the highlighted function in the Functions box. Time in the Callers window is relative to the highlighted function: the self time indicates how much time is spent in the caller function, while the total time is the component of time spent calling the highlighted function. The sum of the total time column in the Callers box equals the total time of the highlighted function in the Functions box. The Descendents window shows a portion of the call graph of the highlighted function. To follow a path through the call graph further, click the right-facing triangle, which reveals another level of the call graph. At each node of the call graph that represents a function, time is reported in both self time and cumulative time. Self time has been previously described. Cumulative time is the time in the function and all of its descendents and is a fraction of the time spent by a caller higher in the call graph.
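The relationship between self time and total time reduces to a simple recurrence over the call tree: a function's total time is its self time plus the total time of each of its descendents. A minimal sketch, with an invented call tree rather than sysprof's internal format:

```python
# Sketch: total time = self time + total time of every descendent.
def total_time(node):
    return node["self"] + sum(total_time(c) for c in node["children"])

call_tree = {
    "name": "main", "self": 1.0,
    "children": [
        {"name": "parse", "self": 2.0, "children": []},
        {"name": "render", "self": 3.0,
         "children": [{"name": "draw", "self": 4.0, "children": []}]},
    ],
}
```

Here `render` has a self time of 3.0 but a total time of 7.0, because its descendent `draw` contributes 4.0; `main` accumulates everything, for a total of 10.0.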
A command-line version of the tool is also supported. It is also possible to dump the profile results to a file for offline processing.
Application Profiling: GNU gprof
GNU gprof is an application-level profiling tool that serves as the output and reporting tool for applications that have been compiled and instrumented using the -pg option. This option is supported by GNU gcc and other compilers and results in instrumentation being added to the application to collect profile information. The instrumented application generates a profile data file (gmon.out is the default profile file name) when executed. Gprof is then employed to process the profile data and generate reports such as an ordered listing of the functions that consume the largest amount of execution time.
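The effect of instrumentation can be illustrated with a loose Python analogy: CPython's `sys.setprofile` hook receives an event on every function call, much as the mcount stub that -pg compiles into each function records caller/callee pairs. This is only an analogy under that assumption; the function names below (match, scan) are invented to echo the gprof report discussed next, not taken from 179.art's source.

```python
# Sketch: call-count instrumentation, loosely analogous to what -pg inserts.
import collections
import sys

call_counts = collections.Counter()

def tracer(frame, event, arg):
    if event == "call":  # one event per Python function invocation
        call_counts[frame.f_code.co_name] += 1

def match(x):
    return x * 2

def scan(values):
    return [match(v) for v in values]

sys.setprofile(tracer)
scan(range(500))       # match is invoked 500 times
sys.setprofile(None)   # remove the instrumentation hook
```

The resulting counter reports `match` called 500 times, the same kind of call-count data gprof derives from the gmon.out file.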
Figure 3 shows sample gprof profile output obtained by profiling the SPEC CPU2000 benchmark, 179.art. The first report is a flat profile that rank orders the functions in the application by the amount of time recorded during execution. Based upon this report, the function, match, had the longest amount of time spent in it, 183.25 seconds, which was 80.64 percent of the total execution time. The profile reports that the function, match, was called 500 times. The self s/call column represents the average amount of time spent inside the function per call. The total s/call column represents the average amount of time spent inside the function and its descendents per call. For the function, match, these times are 0.37 seconds and 0.38 seconds, respectively.
GNU gprof also provides call graph information. Figure 4 shows a portion of the call graph for the function, match, which identifies the primary caller as scan_recognize. For further details on gprof, see the online documentation.
Microarchitecture Profiling: Oprofile
Oprofile is a command line-based microarchitecture profiler providing access to the performance monitoring counters. Oprofile targets Linux systems and requires a kernel driver and a collection daemon to gather the profile information. One of the positive aspects of the tool is that no instrumentation or recompilation of applications is required. In addition, oprofile can profile optimized versions of applications.
The use model for Oprofile consists of configuring the daemon for profiling and instructing the daemon to begin collecting profile data. The utility, opcontrol, is used to issue commands to the collection daemon. The activity to monitor is then started, which typically implies user invocation of the application on a relevant benchmark. After the activity or application execution is complete, the user shuts down collection. A separate command-line tool, opreport, is called with an option specifying the type of report desired. Other utilities are available that round out the functionality. The command-line utilities that comprise oprofile and a description of each follows:
- opcontrol. Configures the collector, initiates and terminates collection.
- opreport. Displays profile in human readable form, merging available symbolic information where possible.
- opannotate. Displays profile information correlated with source and assembly code.
- oparchive. Saves profile for offline viewing and analysis.
- opgprof. Translates profile into gprof-compatible file.
Table 2 summarizes the steps for employing oprofile to collect and output a profile of an application reporting clock cycle information. Each step is described followed by the command line to perform the action. These commands should be executed with root privileges.
Figure 5 shows the output of oprofile after collecting a profile of the 179.art application. The application was built with debug information, which enables function-level reporting, as evidenced by the symbol names provided for the a.out application. The largest percentage of time, 44.2886 percent, was spent in the kernel (no-vmlinux). Using oprofile, it is possible to turn off collection of events from the kernel. The second through fifth highest ranked functions are inside the 179.art application.
Profile information can be collected based upon other processor events as well. For a complete list of events supported by oprofile on your particular target, use the --list-events option.
Microarchitecture Profiling: Intel VTune Performance Analyzer
On desktop operating systems, the Intel VTune Performance Analyzer can create flat profiles, application call graph profiles, and microarchitecture profiles. The Intel Application Software Development Tool Suite for Intel Atom Processor includes the VTune analyzer and the VTune analyzer Sampling Collector (SEP), a target-side profile collector for the Intel Atom processor. For embedded form factors that run Linux, SEP provides microarchitecture profiling capability. Using SEP requires installation of a kernel daemon specific to the particular Linux kernel employed; the daemon's source code can be built to enable collection on specific Linux kernels. The process of using SEP is similar to oprofile: facilities for configuring, starting, and stopping collection are provided. Once collection is complete, the profile is transferred to a host environment for visualization inside the VTune analyzer GUI.
Table 3 describes the steps and command lines employed to configure and collect a profile.
The SEP data collector supports additional options to further configure collection including:
- Sampling. Specify duration, interval between samples, sample buffer size, and maximum samples to count.
- Application. Specify an application to launch and profile.
- Events. Configure events and event masks. Use -event-list for a list of supported options.
- Continuous profiling. Aggregates data by instruction pointer, reducing space and enabling monitoring and output during execution.
- Event multiplexing. Enables collection of multiple events concurrently by modulating the specific event being measured while the application is profiled.
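Event multiplexing implies a scaling step at reporting time: because each event occupies the counter during only a fraction of the sampling windows, its raw count must be scaled up by the inverse of that fraction to estimate the full-run count. A minimal sketch, with invented numbers, of that arithmetic:

```python
# Sketch: scaling a multiplexed event count to estimate the full-run total.
def scale_multiplexed(raw_count, windows_active, windows_total):
    # Estimate = raw count / fraction of windows the event was measured in
    return raw_count * windows_total / windows_active

# Two events shared one counter, each active in 5 of 10 sampling windows:
est_cycles = scale_multiplexed(1_000_000, 5, 10)  # -> 2,000,000 estimated
est_misses = scale_multiplexed(20_000, 5, 10)     # -> 40,000 estimated
```

The estimate assumes the workload behaves similarly across windows; bursty workloads can make multiplexed counts less accurate than dedicated ones.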
Figure 6 shows a flat profile of the 179.art application collected using SEP and transferred to a host system for analysis under the VTune analyzer GUI. The highlighted ratio in the top right shows the measurement for clock cycles per instruction retired.
Microarchitecture Profiling: Event-based Sampling
One issue with performance monitoring collection is that access to the performance counters requires kernel, or ring 0, privileges. Event-based sampling works by configuring a performance monitoring counter to overflow periodically and recording the instruction pointer location along with the particular event. As these samples are recorded during profiling, a correlation between the number of events and instruction pointers is created. Implementing event-based sampling requires an interrupt handler to record these performance monitoring counter overflows and a driver that writes the counts to a file after collection is complete. The VTune analyzer includes its driver source code, which can be used as a model for other operating systems. In addition, a TBRW utility is included that enables a performance monitoring driver to read and write the VTune analyzer's data format, tb5. This enables other performance monitoring utilities to take advantage of the GUI provided by the VTune analyzer.
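The aggregation step at the heart of event-based sampling can be sketched in a few lines: each counter overflow yields one sample tagged with an instruction pointer, and each sample stands for one overflow threshold's worth of events. The sample stream and threshold below are fabricated for illustration.

```python
# Sketch: turning overflow samples into an events-per-address histogram.
import collections

OVERFLOW_THRESHOLD = 10_000  # counter configured to overflow every 10,000 events

def aggregate(ip_samples):
    hist = collections.Counter(ip_samples)
    # Each recorded sample represents THRESHOLD underlying hardware events
    return {ip: n * OVERFLOW_THRESHOLD for ip, n in hist.items()}

events_by_ip = aggregate([0x401000, 0x401000, 0x401234, 0x401000])
```

Three samples at 0x401000 imply roughly 30,000 events at that address; a symbol table then maps such addresses back to functions and source lines.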
Microarchitecture Profiling: Intel Performance Tuning Utility
For more advanced microarchitecture profiling, the Intel Performance Tuning Utility (Intel PTU) leverages the same collection technology as the VTune analyzer and offers more sophisticated views of performance events. The tool is available on the whatif.intel.com site, which indicates its experimental status. Some of the capabilities of Intel PTU include:
- Basic block analysis. Creates and displays a control flow graph and hotspots corresponding to basic blocks in the graph.
- Events over IP graph. Generates a histogram of performance events distributed over application code.
- Loop analysis. Identifies loops and recursion in the application to aid optimization.
- Result difference. Compares the results of multiple runs to measure changes in performance.
- Data access profiling. Identifies memory hotspots and relates them to code hotspots.
Intel PTU is integrated into Eclipse, which places requirements on the system under test to be able to execute the Eclipse environment. Figure 7 shows a screenshot of the basic block analysis feature of Intel PTU.