Performance Monitoring with PAPI

The Performance Application Programming Interface (PAPI) is a portable library of performance tools and instrumentation with wrappers for C, C++, Fortran, Java, and Matlab.


June 01, 2005
URL: http://www.drdobbs.com/tools/performance-monitoring-with-papi/184406109

Philip is the original author and technical lead for PAPI. He is a research consultant for the Innovative Computing Laboratory at the University of Tennessee, Knoxville. Nils and Per are computer scientists at the Center for Parallel Computers (PDC) at the Royal Institute of Technology in Stockholm, Sweden, and are regular contributors to the PAPI project. They can be contacted at [email protected], [email protected], and [email protected], respectively.




Contrary to widely held misconceptions, performance is portable. Virtually every optimization effort initially involves tuning for architectural features found on most nonvector processors. The code is modified to improve data reuse in the caches, reduce address-translation misses in the TLB, and prevent register spills to memory, all of which can cause stalls in the processor's pipeline. But the questions often arise: Why should we optimize code at all? Why not just use the best available compiler options and move on? If the code runs too slowly, why not just upgrade our hardware and let Moore's Law do the work for us? There are many answers to these questions, the simplest being that the performance of most applications depends as much on the bandwidth of the memory subsystem as on the core clock speed of the processor. While processor clock speeds continue to double roughly every 18 months, DRAM bandwidth improves by perhaps 10 percent over the same period. Another reason to optimize has to do with the average performance of the user's hardware. Many applications, such as an MP3 player or a video-conferencing package, have a fixed amount of work to do in a finite amount of time. Considering that most of the user community runs computers that are at least a generation or two behind, optimizing this software guarantees that more people will be able to run your code.

In the high-performance computing (HPC) community, optimization is of critical importance because it directly translates into better science. At HPC centers, computer time is "sold" to the various parties of each institution. Take, for example, the National Energy Research Scientific Computing Center (NERSC), the computer center at Lawrence Berkeley National Laboratory (http://www.lbl.gov/). The center runs a variety of large machines and sells time to each of the different divisions of the lab, as well as to other government and academic institutions around the country. On these systems, 99 percent of the workload is scientific simulation: virtual experiments that would be impossible or too costly to perform physically. These applications range from the simulation of self-sustaining fusion reactors to climate modeling of the entire Earth down to a resolution of a few kilometers. The compute time of these models always depends on the resolution of the simulation. To fit within budget, experimenters must choose a resolution that resolves the phenomena of interest, yet still stays within their allocation. Optimizing these applications can yield a tremendous savings of compute time, especially when you consider that these simulations are parallel codes; there may be thousands of processors executing similar code. A 30 percent decrease in runtime could allow for a 30 percent increase in the resolution of the simulation, possibly exposing subtleties not present in the former simulation. Such a difference directly translates into accelerating the pace of scientific discovery, and can even (as in the case of car-crash simulations or airplane-wing design) result in saved lives.

However, the hardest part of the optimization process is understanding the interaction of the system architecture, operating system, compiler, and runtime system, and how that interaction affects the performance of the application. All of these elements must be taken into consideration when attempting to optimize an application. Wouldn't it be nice if there were a way to reduce the prerequisite knowledge and experience by having the processor tell you exactly where and why your code was losing performance? It turns out that such a method does exist: on-chip performance monitoring (PM) hardware found on almost every microprocessor in use today. The PM hardware consists of a small number of registers with connections to various other parts of the chip. Traditionally, this hardware was used exclusively for testing and verification. Now, however, the importance of the PM hardware is widely understood, especially in the HPC community.

There are two methods of using the PM hardware: aggregate (direct) and statistical (indirect). In aggregate mode, the counters are started, stopped, and read explicitly around a region of code, yielding exact event totals for that region. In statistical mode, a counter is configured to interrupt the processor each time it crosses a chosen threshold; recording the program counter at each interrupt builds a statistical profile of where in the code the events occur.

In "Optimization Techniques" (DDJ, May 2004), Tim Kientzle describes how to use the real-time cycle counter (rdtsc) found on the x86 architecture. However, there are some problems with this technique. The first is due to the nature of a system designed for multiprocessing. No system can be considered quiet when examined from a sufficiently fine granularity. For example, the laptop on which this article is being written is busy servicing interrupts from a variety of sources, as well as delivering commands to the graphics coprocessor and the PCM audio chip. The consequence of this is that it is very likely that the cycle count from rdtsc() includes a host of other unrelated events. On a desktop system, a stock Redhat 9 system with no "active" processes, the system (vmstat) reports about 120 interrupts and 50 context switches every 10 seconds. Doing cycle-by-cycle accounting of code is thus impossible—just one interrupt, or one context switch during a measurement interval, could mislead you in your quest for bottlenecks.

The next problem with timers is more severe: Timers tell you nothing about why the hardware behaves the way it does. Without counter data, the only way to understand the system is to attempt to reconstruct the processor's pipeline as it executes your code, which is a virtually impossible task on today's advanced CPUs, even for experts with detailed knowledge of the processor's microarchitecture. The last problem is one of portability and interface semantics. While the rdtsc() code segment works nicely in the Intel/Microsoft environment, it doesn't help when you migrate to another architecture, operating system, or even compiler!

These problems and more have been addressed with the development of the Performance Application Programming Interface (PAPI) library. PAPI (available at http://icl.cs.utk.edu/papi/) is intended to be completely portable from an API standpoint. That is, performance tools and instrumentation that work on one platform, seamlessly work with a recompile on another. While the interface is portable, the generated data is not. Data from the PM hardware rarely has the same semantics from vendor to vendor, and often changes from model to model. A cache miss on one platform may be measured entirely differently on another; even definitions as "simple" as an instruction can become blurred on modern architectures (consider x86 instructions versus x86 uops).

PAPI is implemented on a wide variety of architectures and operating systems. The current release, PAPI 3.0.7, is supported on the Cray T3E and X1; Sun UltraSPARC/Solaris; Alpha/Tru64; the IBM 604e, POWER2, POWER3, and POWER4 under AIX; MIPS R10k/IRIX; IA-64, IA-32, and x86_64 under Linux; and Windows/IA-32 (not Pentium 4). Wrappers for PAPI exist for C, C++, Fortran, Java, and Matlab.

To facilitate the development of portable performance tools, PAPI provides interfaces to get information about the execution environment. It also provides methods to obtain a complete listing of what PM events are available for monitoring. PAPI supports two types of events, preset and native. Preset events have a symbolic name associated with them that is the same for every processor supported by PAPI. Native events, on the other hand, provide a means to access every possible event on a particular platform, regardless of there being a predefined PAPI event name for it.
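
For example, here is a minimal sketch (assuming the PAPI 3.x calls PAPI_library_init, PAPI_get_hardware_info, and PAPI_num_counters, and the vendor_string, model_string, totalcpus, and mhz fields of PAPI_hw_info_t) that prints a few facts about the execution environment:

#include <stdio.h>
#include <stdlib.h>
#include "papi.h"

int main(void)
{
    const PAPI_hw_info_t *hw;

    /* The low-level interface must be initialized explicitly. */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI library init error\n");
        exit(1);
    }
    if ((hw = PAPI_get_hardware_info()) == NULL) {
        fprintf(stderr, "PAPI_get_hardware_info failed\n");
        exit(1);
    }
    printf("%s %s: %d CPU(s) at %.0f MHz, %d hardware counters\n",
           hw->vendor_string, hw->model_string, hw->totalcpus,
           hw->mhz, PAPI_num_counters());
    return 0;
}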

For preset events, users can issue a query to PAPI to find out if the event is present on the current platform. While the exact semantics of the event might be different from processor to processor, the name for the event is the same (for example, PAPI_TOT_CYC for elapsed CPU cycles). Native events are specific to each processor and have their own symbolic name (usually taken from the vendor's architecture manual or header file, should one exist). By their nature, native events are always present.
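
Here is a minimal sketch of querying a preset before using it (again assuming PAPI 3.x, with PAPI_query_event and PAPI_event_code_to_name):

#include <stdio.h>
#include <stdlib.h>
#include "papi.h"

int main(void)
{
    char name[PAPI_MAX_STR_LEN];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);

    /* Is the total-cycles preset implemented on this processor? */
    if (PAPI_query_event(PAPI_TOT_CYC) == PAPI_OK &&
        PAPI_event_code_to_name(PAPI_TOT_CYC, name) == PAPI_OK)
        printf("%s is available on this platform\n", name);
    else
        printf("PAPI_TOT_CYC is not available here\n");
    return 0;
}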

PAPI supports per-thread measurements; that is, each measurement contains only the counts generated by the thread performing the PAPI calls. Each thread has its own PM context and thus can use PAPI completely independently of other threads. This is achieved through the operating system, which saves and restores the performance counter registers at a context switch just like the rest of the processor state. A lazy save-and-restore scheme keeps the additional overhead to a few dozen cycles for the processes and threads that actually use the performance monitoring hardware. Most operating systems (AIX, IRIX, Tru64, Unicos, Linux/IA64, HPUX, Solaris) have officially provided this support for some time, motivated largely by the popularity of the PAPI library and associated tools. For Linux/IA32/x86_64, support exists in the form of the PerfCtr kernel patch (http://user.it.uu.se/~mikpe/linux/perfctr/), written by Mikael Pettersson of Uppsala University in Sweden. This patch has not yet been formally accepted into the main Linux 2.6 kernel tree. Users with concerns about stability and support should know that it has been installed and in use for years at a majority of the U.S. Government's Linux clusters on the Top 500 list, not to mention numerous other computationally intensive sites around the world. The notable exception to operating-system support is Microsoft Windows. Because Windows does not preserve the state of the performance counters across context switches, counting on Windows must be done system-wide, greatly reducing the usefulness of the PM hardware for application development and tuning.
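
For threaded codes, each thread registers itself with PAPI once so that its counts stay private; a minimal sketch with POSIX threads (assuming PAPI 3.x's PAPI_thread_init, which takes a function returning a unique ID for the calling thread):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include "papi.h"

/* PAPI calls this to get a unique identifier for the current thread. */
static unsigned long thread_id(void)
{
    return (unsigned long)pthread_self();
}

int main(void)
{
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);
    if (PAPI_thread_init(thread_id) != PAPI_OK)
        exit(1);
    /* ... create worker threads; each can now start, read, and stop
       its own counters independently of the others ... */
    return 0;
}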

PAPI has two different library interfaces. The first is the high-level interface, meant for direct use by application engineers. It consists of eight functions that make it easy to get started with PAPI. It provides start/read/stop functionality as well as quick and painless ways to get information such as millions of floating-point operations per second (MFLOPS) and instructions per cycle. Listing One contains code that demonstrates a canonical performance problem: traversing memory with nonunit stride. We measure this code's performance using the PAPI high-level interface. This example uses PAPI presets. They are portable in name, but might not be implemented on all platforms and may in fact mean slightly different things on different platforms. Hardware restrictions also limit which events can be measured simultaneously. This simple example verifies that the hardware supports at least two simultaneous counters; it then starts the performance counters counting the preset events for L1 data cache misses (PAPI_L1_DCM) and the number of floating-point operations executed (PAPI_FP_OPS). Compiling and running this example on an Opteron (64-byte line size) with gcc 3.3.3 (-O -g) results in this output:

Total software flops = 41943040.000000
Total hardware flops = 42001076.000000
MFlop/s = 48.258228
L1 data cache misses is 12640205

There's roughly one cache miss for every three floating-point operations; performance is bound to be poor. Switching the for loops so that memory is accessed sequentially should give us better cache behavior; the result is:

Total software flops = 41943040.000000
Total hardware flops = 42027880.000000
MFlop/s = 234.481339
L1 data cache misses is 2799387
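
For reference, the interchange simply makes the inner loop walk along each allocated row, so that consecutive iterations touch adjacent memory:

/* Unit-stride version of the kernel in Listing One */
for (iter = 0; iter < NITER; iter++) {
    for (i = 0; i < MX; i++) {
        for (j = 0; j < MX; j++) {
            ad[i][j] += ad[i][j] * 3.0;
        }
    }
}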

Performance has more than quadrupled and the cache misses have been reduced by nearly 80 percent. Note, however, that the number of hardware flops is slightly higher than the software count and differs from run to run. Why? The avail utility from the PAPI distribution (Figure 4; its output has been edited for brevity) helps answer this question. The PAPI_FP_INS preset is implemented as the vendor FP_MULT_AND_ADD_PIPE metric. This metric is speculative, meaning the hardware counter includes floating-point instructions that were issued but never retired.
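
The high-level interface also offers one-call convenience routines for the rate measurements mentioned earlier. A minimal sketch (assuming PAPI 3.x's PAPI_ipc, whose first call starts the counters and whose subsequent calls report real time, process time, total instructions, and instructions per cycle):

#include <stdio.h>
#include <stdlib.h>
#include "papi.h"

int main(void)
{
    float rtime, ptime, ipc;
    long_long ins;

    /* First call sets up and starts the counters. */
    if (PAPI_ipc(&rtime, &ptime, &ins, &ipc) != PAPI_OK)
        exit(1);

    /* ... code under measurement ... */

    /* Subsequent calls report values accumulated since counting began. */
    if (PAPI_ipc(&rtime, &ptime, &ins, &ipc) != PAPI_OK)
        exit(1);
    printf("instructions = %lld, IPC = %f\n", ins, ipc);
    return 0;
}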

The low-level interface is designed for power users and tool designers. It consists of 53 functions that range from providing information about the processor and executable to advanced features such as counter multiplexing, callbacks on counter overflow, and advanced statistical profiling modes. Counter multiplexing is useful where the PM hardware has a very limited number of registers. That limitation can be overcome by trading measurement accuracy and granularity for an increase in the number of events measured: By rapidly switching the contents of the PM control registers during the course of a run, PAPI gives the user the appearance of many more counter registers than physically exist. For more on advanced features, see the accompanying text box entitled "Advanced PM Features."
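
A minimal sketch of the low-level event-set workflow (assuming the PAPI 3.x calls shown; error handling abbreviated). The commented-out lines mark where multiplexing would be enabled when more events are requested than the hardware has counters for:

#include <stdio.h>
#include <stdlib.h>
#include "papi.h"

int main(void)
{
    int event_set = PAPI_NULL;
    long_long counts[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);

    /* An event set groups the events to be measured together. */
    if (PAPI_create_eventset(&event_set) != PAPI_OK)
        exit(1);

    /* Optional: trade accuracy for breadth by multiplexing the counters.
       PAPI_multiplex_init();
       PAPI_set_multiplex(event_set); */

    if (PAPI_add_event(event_set, PAPI_TOT_CYC) != PAPI_OK ||
        PAPI_add_event(event_set, PAPI_TOT_INS) != PAPI_OK)
        exit(1);

    if (PAPI_start(event_set) != PAPI_OK)
        exit(1);

    /* ... code under measurement ... */

    if (PAPI_stop(event_set, counts) != PAPI_OK)
        exit(1);

    printf("cycles = %lld, instructions = %lld\n", counts[0], counts[1]);
    return 0;
}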

While PAPI provides the necessary infrastructure, the true power of PAPI lies in the tools that use it. There are a number of excellent open-source performance analysis tools available that have been implemented using PAPI. While a complete discussion of all these tools is beyond the scope of this article, we mention a few in the accompanying text box entitled "Open-Source Performance Analysis Tools." For information about these and other tools, please refer to the PAPI web page.

PAPI's goal is to expose real hardware performance information to users. By doing so, most of the guesswork regarding the root cause of a code's performance problem can be eliminated. PAPI does not solve algorithmic design issues or diagnose inefficient parallelization. It can, however, diagnose poor usage of the available processor resources—a problem that, before now, was largely intractable. As of now, PAPI is an ad hoc standard, taking into account the input of the tool development and performance engineering community. In the future, we hope to see PAPI evolve into a true open standard and to see the vendors ship native versions of PAPI with each new release of processor and operating system. In the meantime, with continued cooperation from industry, academia, and the research community, PAPI will continue to evolve and further drive the science of performance engineering.

DDJ



Listing One

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/time.h>
#include "papi.h"

#define MX 1024
#define NITER 20
#define MEGA 1000000
#define TOT_FLOPS (2*MX*MX*NITER)

double *ad[MX];

/* Get virtual (process) CPU time in seconds */
float gettime() 
{
return((float)PAPI_get_virt_usec()/1000000.0);
}
int main () 
{
float t0, t1;
int iter, i, j;
int events[2] = {PAPI_L1_DCM, PAPI_FP_OPS }, ret;
long_long values[2];

if (PAPI_num_counters() < 2) {
   fprintf(stderr, "No hardware counters here, or PAPI not supported.\n");
   exit(1);
}
for (i = 0; i < MX; i++) {
    if ((ad[i] = malloc(sizeof(double)*MX)) == NULL) {
        fprintf(stderr, "malloc failed\n");
        exit(1);
    }
}
for (j = 0; j < MX; j++) { 
    for (i = 0; i < MX; i++) {
        ad[i][j] = 1.0/3.0; /* Initialize the data */
    }
}
t0 = gettime();
if ((ret = PAPI_start_counters(events, 2)) != PAPI_OK) {
   fprintf(stderr, "PAPI failed to start counters: %s\n", PAPI_strerror(ret));
   exit(1);
}
for (iter = 0; iter < NITER; iter++) {
    for (j = 0; j < MX; j++) {
        for (i = 0; i < MX; i++) {
            ad[i][j] += ad[i][j] * 3.0;
        }
    }
}
if ((ret = PAPI_read_counters(values, 2)) != PAPI_OK) {
  fprintf(stderr, "PAPI failed to read counters: %s\n", PAPI_strerror(ret));
  exit(1);
}
t1 = gettime();

printf("Total software flops = %f\n",(float)TOT_FLOPS);
printf("Total hardware flops = %lld\n",(float)values[1]);
printf("MFlop/s = %f\n", (float)(TOT_FLOPS/MEGA)/(t1-t0));
printf("L2 data cache misses is %lld\n", values[0]);
}

PerfSuite 1.0 summary for execution of gen.ver2.3.inte (PID=9867, domain=user)

                          Based on 800 MHz GenuineIntel CPU
                          CPU revision 6.000

Event Counter Name                                                  Counter Value
===================================================================

    0 Conditional branch instructions mispredicted.............     3006093956
    1 Conditional branch instructions correctly predicted......     32974709880
    2 Conditional branch instructions taken....................     26952022279
    3 Floating point instructions..............................     44525980237
    4 Total cycles.............................................     353262206234
    5 Instructions completed...................................     489764680025
    6 Level 1 data cache accesses..............................     56390921533
    7 Level 1 data cache hits..................................     41911206947
    8 Level 1 data cache misses................................     14615753570
    9 Level 1 load misses......................................     17611912424
   10 Level 1 cache misses.....................................     17597248300
   11 Level 2 data cache accesses..............................     53158617899
   12 Level 2 data cache misses................................     8440205387
   13 Level 2 data cache reads.................................     43528651785
   14 Level 2 data cache writes................................     10240563775
   15 Level 2 load misses......................................     3615923337
   16 Level 2 store misses.....................................     667575973
   17 Level 2 cache misses.....................................     8529931717
   18 Level 3 data cache accesses..............................     3826843278
   19 Level 3 data cache hits..................................     2799591986
   20 Level 3 data cache misses................................     999714206
   21 Level 3 data cache reads.................................     3573882130
   22 Level 3 data cache writes................................     171800425
   23 Level 3 load misses......................................     944624814
   24 Level 3 store misses.....................................     49427000
   25 Level 3 cache misses.....................................     1024569375
   26 Load instructions........................................     84907675686
   27 Load/store instructions completed........................     95346092870
   28 Cycles Stalled Waiting for memory accesses...............     140032176122
   29 Store instructions.......................................     10267472354
   30 Cycles with no instruction issue.........................     67247126931
   31 Data translation lookaside buffer misses.................     8365029

Statistics
===================================================================
Graduated instructions/cycle...................................             1.386406
Graduated floating point instructions/cycle....................             0.126042
Graduated loads & stores/cycle.................................             0.269902
Graduated loads & stores/floating point instruction............             2.141359
L1 Cache Line Reuse............................................             5.523515
L2 Cache Line Reuse............................................             0.731682
L3 Cache Line Reuse............................................             7.442618
L1 Data Cache Hit Rate.........................................             0.846708
L2 Data Cache Hit Rate.........................................             0.422527
L3 Data Cache Hit Rate.........................................             0.881553
% cycles w/no instruction issue................................             19.036037
% cycles waiting for memory access.............................             39.639729
Correct branch predictions/branches taken......................             1.000000
MFLOPS.........................................................             100.833839

Figure 1: Output from the PSRUN tool of PerfSuite showing hardware events and derived metrics gathered via PAPI.


Figure 2: HPCview showing PAPI events gathered from hpcrun for a triply nested loop.


Figure 3: TAU display of the performance of a 16-process MPI program with multiple PAPI metrics, subroutine breakdown, and call graph display.


./avail -e PAPI_FP_INS
Test case avail.c: Available events and hardware information.
----------------------------------------------------------------
Vendor string and code:      AuthenticAMD (2)
Model string and code:       AMD K8 Revision C (15)
CPU Revision:                8.000000
CPU Megahertz:               1993.626953
CPU's in this Node:          4
Nodes in this System:        1
Total CPU's:                 4
Number Hardware Counters:    4
Max Multiplex Counters:      32
----------------------------------------------------------------
Event name:                       PAPI_FP_INS
Event Code:                       0x80000034
Number of Native Events:          1
Short Description:                |FP instructions|
Long Description:                 |Floating point instructions|
 |Native Code[0]:                   0x40000002  FP_MULT_AND_ADD_PIPE|
 |Number of Register Values:        2|
 |Register[0]: 0xf                  P3 Ctr Mask|
 |Register[1]: 0x300                P3 Ctr Code|
 |Native Event Description:         |Dispatched FPU ops - Revision B and later revisions - 
Speculative multiply and add pipe ops excluding junk ops|

Figure 4: Output from the avail tool from the PAPI distribution.


Advanced PM Features

Traditionally, performance monitoring hardware simply counts the occurrence or duration of events. However, recent CPUs, such as the Intel Itanium2 and IBM POWER4/5, include more advanced features. The Itanium2, for example, provides event address registers (EARs) that capture the instruction and data addresses (and latencies) associated with cache and TLB misses, a branch trace buffer that records the most recently executed branches, opcode matching, and instruction and data address-range constraints that restrict counting to selected regions of code or data.

—P.M., N.S., and P.E.


Open-Source Performance Analysis Tools

The PerfSuite tool is from the National Center for Supercomputing Applications (http://perfsuite.ncsa.uiuc.edu/). PerfSuite includes a number of tools and libraries for performance analysis of serial, MPI, and OpenMP applications. Of initial interest to most users is the PSRUN tool that, when run on an application, generates textual output of raw performance counter data as well as a wealth of derived statistics. Figure 1 is an example of a run of the genIDLEST application. PSRUN can also be used to generate flat statistical profiles by file, function, and source line of an application.

The HPCToolkit, from the Center for High Performance Software Research at Rice University (http://www.hipersoft.rice.edu/hpctoolkit/), includes hpcrun—a command-line tool to generate flat statistical profiles to be visualized with a Java-based GUI called "hpcview" as shown in Figure 2. HPCToolkit is capable of generating and aggregating multiple simultaneous profiles such that for a postanalysis, line-by-line profiles can be generated from multiple hardware metrics. Like PerfSuite, HPCToolkit supports serial and parallel execution models.

Often after an initial performance analysis, users wish to narrow the focus of their examination. This may involve making detailed measurements of specific regions of code and associating them with an application event, such as the update of a cell in a grid. TAU, the "Tuning and Analysis Utilities" from the University of Oregon, is such a tool (http://www.cs.uoregon.edu/research/paracomp/tau/tautools/). TAU is a comprehensive environment for performance analysis that includes modules for both source and binary instrumentation, experiment management of performance data, and scalable GUIs suitable for visualizing highly parallel programs. TAU runs just about everywhere and can even function without PAPI (although only time-based metrics are then available). Figure 3 is a TAU display of the performance of a 16-process MPI program.

—P.M., N.S., and P.E.
