Best Practices for Developing and Optimizing Threaded Applications: Part 2
The performance analysis phase is used to optimize the serial application before introducing threads
The first article in this series, Best Practices for Developing and Optimizing Threaded Applications: Part 1, introduced a methodology that has been used to successfully thread many applications. That paper gave an overview of software development tools that can assist in developing multi-threaded applications, and presented a case study that applied the methodology with recent software development tools.
This article discusses the Performance Analysis phase in detail. This phase characterizes the application's performance and optimizes the application independently of introducing threads.
The threading methodology consists of four phases. The first phase focuses on performance analysis. The second phase involves effectively converting your serial application to a parallel one. The third phase identifies possible correctness issues in your threaded application, such as data races and deadlocks. Finally, the fourth phase looks for performance improvements in your threaded application.
Performance Analysis Tools
2.1 Serial Optimization
We recommend optimizing the serial application before introducing threads. Serial code often contains inefficiencies that a threaded implementation can exploit: the threads fill the gaps left by the inefficient serial code and give a false sense of thread scaling. Subsequent optimization, which may amount to nothing more than recompiling with a better compiler or more aggressive optimization options, may then appear to hurt thread scaling because the serial version of the application improves disproportionately. Optimizing the serial code first reduces this problem, and in some cases it reduces a hotspot's contribution enough to remove it from consideration for parallel coding. This lets one concentrate on introducing threads in the areas of the application with the biggest payoff, which leads to the highest performing application.
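The effect can be illustrated with a small calculation. The sketch below is not taken from the study: the function name `speedup` and the timings mentioned afterward are hypothetical, chosen only to show how the same threaded run looks better against a slow serial baseline than against an optimized one.

```cpp
#include <cassert>

// Speedup of a threaded run relative to a serial baseline.
// Both arguments are elapsed times in seconds (hypothetical inputs).
double speedup(double serial_seconds, double threaded_seconds) {
    assert(serial_seconds > 0.0 && threaded_seconds > 0.0);
    return serial_seconds / threaded_seconds;
}
```

With a hypothetical threaded run of 25 seconds, an unoptimized serial baseline of 100 seconds gives `speedup(100.0, 25.0)`, or 4.0, while an optimized serial baseline of 60 seconds gives 2.4 for the same threaded time. The threaded code did not get worse; the baseline got better.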
After determining which workload or workloads will be used to characterize the application performance, it is useful to establish the initial performance by collecting profiling data with a tool such as VTune Performance Analyzer or Oprofile. A basic analysis that measures CPU time is straightforward to perform and shows how much CPU time is spent in each function. After changing compiler optimizations or making source code changes, a second analysis measures the change in CPU time spent in each function.
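Where a profiler is not available, a rough per-function CPU time can be taken by hand. The following is a minimal sketch using the standard `std::clock` call; the helper `cpu_seconds` and the stand-in hotspot `sum_of_squares` are hypothetical names and are not part of the application studied here. It is far coarser than the sampling data a profiler collects.

```cpp
#include <cassert>
#include <ctime>

// Returns the processor time, in seconds, consumed by calling `work` once.
template <typename Work>
double cpu_seconds(Work work) {
    const std::clock_t start = std::clock();
    work();
    const std::clock_t stop = std::clock();
    return static_cast<double>(stop - start) / CLOCKS_PER_SEC;
}

// Hypothetical stand-in for an application hotspot.
double sum_of_squares(int n) {
    assert(n >= 0);
    double total = 0.0;
    for (int i = 0; i < n; ++i) {
        total += static_cast<double>(i) * i;
    }
    return total;
}
```

Timing the same call before and after a compiler or source change, for example `cpu_seconds([&] { result = sum_of_squares(n); })`, gives a simple before-and-after comparison for a single function.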
2.1.1 Optimizing GNU Compiler build
As the application bottlenecks were in functions that manipulate floating-point numbers, we investigated compiler optimization options that can improve floating-point performance. In this study, the GNU g++ compiler version 4.1.0 and the Intel C++ Compiler for Linux, version 9.1, were used. Table 1 shows the results of using different compilers and compiler optimizations to improve the application performance. For this application, more aggressive optimizations that target floating-point calculations improved g++ performance by 24% compared to g++ -O2. Applications that spend a considerable amount of time in floating-point calculations can often be improved by aggressive compiler optimizations.
2.1.2 Optimizing Intel Compiler build
After exploring different g++ optimizations, we next built the application with the Intel C++ compiler. We found that for this application, the Intel C++ compiler generates faster code than g++. The best performance was obtained using advanced loop transformations (-O3), profile guided optimization (-prof-use), interprocedural optimization, and automatic vectorization (-xW).
Automatic vectorization and advanced loop transformations typically benefit applications that manipulate floating-point data. Interprocedural optimization enables numerous compiler optimizations, including inlining function calls across different source files, which eliminates function call overhead and provides opportunities for additional optimization. Profile guided optimization is a multiple-step process: first build an executable that generates compiler profile data, then run the application on one or more workloads, and finally recompile to take advantage of the profile data. The profile gives the compiler additional information that can improve optimization. One example is the trip count of a given loop. If a loop executes only a small number of times, it may be beneficial to unroll it completely and avoid the loop overhead. If a loop executes many times, it may be beneficial to apply loop transformations such as loop unrolling or automatic vectorization. Profile guided optimization allows the compiler to make more intelligent optimization decisions, but requires additional work by the developers.
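The trip-count trade-off can be pictured with a hand-written example. This is only an illustration of what complete unrolling means, not output from either compiler; the function names `dot` and `dot4_unrolled` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Generic loop: the trip count n is known only at run time, so each
// iteration pays for the counter update and the loop branch.
double dot(const double* a, const double* b, std::size_t n) {
    assert(a != nullptr && b != nullptr);
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Completely unrolled form for a fixed trip count of 4: no counter and
// no branch. A compiler may produce something similar when profile data
// shows that a loop always runs a small number of times.
double dot4_unrolled(const double* a, const double* b) {
    assert(a != nullptr && b != nullptr);
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}
```

For four-element inputs both functions compute the same dot product. The unrolled form trades generality for the removal of loop overhead, which is the decision that profile data helps the compiler make.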
The purpose of the performance analysis phase is to generate the fastest serial version of the application which was accomplished using the Intel C++ Compiler and aggressive optimizations.
2.2 VTune Sampling Analysis
The Intel VTune Performance Analyzer allows developers to analyze applications quickly, without recompiling or relinking. Through the GUI, developers can find bottlenecks by drilling down from system-wide data to the executable of interest, to the shared libraries that make up the application, and then to the function level and the source level. Being able to drill down quickly from system-wide data to source level is very powerful, and it focuses attention on the parts of the application that matter for optimization. VTune fully supports the many performance monitoring events available on the latest Intel processors and provides both sampling and call graph technologies to analyze applications.
Figure 1 shows sampling data for the initial gcc build of the Black Scholes application, and displays system wide performance data. In addition to spending time in the user code, the gcc built application spends significant time in the system libm and libc libraries.
Figure 2 shows the amount of time spent in libm.so for the g++ built executable. A considerable amount of CPU time is spent in these functions when using the g++ compiler. The Intel Compilers provide highly optimized versions of the functions in the system math library (libm), discussed below in section 3.1.
Figure 3 shows the amount of time this application spent in libc.so, which is dominated by the time spent generating random numbers.
Figure 4 indicates the functions that take the most CPU time from the VTune sampling data. When determining which functions will benefit from the introduction of threads, it is useful to look at sampling performance data and call graph data, discussed in Section 2.4.
2.3 Oprofile Analysis
Oprofile is a low-overhead, system-wide profiling tool that runs on Linux systems. It uses the performance monitoring hardware on the processor to retrieve information about the kernel and the executables on the system. Oprofile can be used to profile interrupt handlers, profile an application and its shared libraries, capture the performance behavior of the entire system, and examine hardware effects such as cache misses.
Our test system was running Fedora Core 5, which ships with Oprofile 0.9.1. That version does not support profiling on the Intel Core Duo and Intel Core 2 processor families. The latest version of Oprofile, 0.9.2, adds this support, although VTune provides access to many more of the CPU counters than Oprofile does.
2.4 VTune Call Graph Analysis
Call graph data shows the calling sequence and critical path of the application, and can help identify functions that may be good candidates for threading. The VTune Performance Analyzer call graph viewer is very flexible and can highlight the calling tree in numerous ways. It can show the critical path taken, and it provides filters to display the functions that make the largest number of calls, get called the most, consume the top percentage of CPU time, and so on. These different views are useful for highlighting potential functions to thread.
2.4.1 Disable inline function -- compiler
When generating call graph data, it can be useful to disable compiler inlining. This makes it clear which functions are consuming CPU time and helps identify opportunities to introduce threads. The drawback is that disabling inlining perturbs the system being measured, and the added function calls can introduce overhead that the optimized build would not have. If time allows, one could do the analysis with and without compiler inlining and compare the results. In practice, this is seldom necessary. One can instead compare the call graph data collected with inlining disabled against the sampling data. Use both of these compiler options to disable inlining with both the GNU and Intel compilers:
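Separately from any compiler options, inlining can also be suppressed for an individual function in the source. The sketch below assumes a compiler that accepts the GCC-style `noinline` function attribute, which g++ does; it is an alternative illustration rather than the approach used in this study, and the function names are hypothetical.

```cpp
#include <cassert>

// GCC-style attribute (an assumption about the toolchain): ask the
// compiler not to inline this function, so it remains a separate node
// in call graph data.
__attribute__((noinline)) double square(double x) {
    return x * x;
}

// Caller whose time would otherwise absorb the cost of square()
// if the compiler inlined it.
double sum_squares(const double* values, int count) {
    assert(values != nullptr && count >= 0);
    double total = 0.0;
    for (int i = 0; i < count; ++i) {
        total += square(values[i]);
    }
    return total;
}
```

Marking only the functions of interest keeps the rest of the build optimized as usual, which limits how much the measurement is perturbed.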
Performance Data Interpretation
3.1 Selecting functions to thread
VTune Performance Analyzer call graph data gives the critical path information of all the functions in the application. Critical path data indicates the most time consuming path in your application. It is displayed as a thick red edge and starts from the most time consuming thread.
When threading the application, functions on the critical path are the most promising candidates, because they account for the largest share of the run time.
In this scenario, we decided to thread the generateBSdataRand function, since it is on the critical path. The next paper in this series will discuss techniques used to thread this particular function.
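One way to see why time-consuming functions are the better targets is Amdahl's law, which the original study does not invoke; it is used here only as a hedged back-of-the-envelope model. The function name `amdahl_speedup` and its parameters are hypothetical.

```cpp
#include <cassert>

// Amdahl's law: ideal overall speedup when a fraction p of the serial
// run time is threaded perfectly across n threads and the remainder
// (1 - p) stays serial.
double amdahl_speedup(double p, int n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / n);
}
```

Under this model, threading a function that accounts for half of the run time on two threads gives `amdahl_speedup(0.5, 2)`, about 1.33, while threading a function that accounts for a tenth of the run time cannot exceed roughly 1.11 no matter how many threads are used.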
Figure 6 shows the VTune Analyzer sampling data from the executable built with the Intel Compiler. The Intel Compiler links the log() and exp() function calls into the user application, avoiding a call into libm.so. The Intel Compilers provide an optimized version of the system math library, with math functions that are significantly faster than those in libm and that provide greater accuracy. Using the optimized versions of the log() and exp() functions is one of the reasons for the enhanced performance with the Intel Compiler.
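To show where log() and exp() enter a Black Scholes calculation, the following is a minimal sketch of the textbook formula for a European call option. It is not the source code of the application profiled here; the function names and parameter names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Standard normal cumulative distribution, written with erfc.
double norm_cdf(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// Textbook Black-Scholes price of a European call option.
// spot: current asset price, strike: exercise price, rate: risk-free
// interest rate, sigma: volatility, years: time to expiry.
double black_scholes_call(double spot, double strike, double rate,
                          double sigma, double years) {
    assert(spot > 0.0 && strike > 0.0 && sigma > 0.0 && years > 0.0);
    const double sqrt_t = std::sqrt(years);
    // log() appears once per option in d1.
    const double d1 = (std::log(spot / strike) +
                       (rate + 0.5 * sigma * sigma) * years) /
                      (sigma * sqrt_t);
    const double d2 = d1 - sigma * sqrt_t;
    // exp() appears once per option in the discount factor.
    return spot * norm_cdf(d1) -
           strike * std::exp(-rate * years) * norm_cdf(d2);
}
```

Because every priced option calls the math library several times, a workload that prices many options spends a large share of its time in these routines, which is consistent with the libm time seen in the g++ build.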
In this paper, we used the performance analysis phase of the threading methodology we presented in our paper Best Practices for Developing and Optimizing Threaded Applications. In this phase, we used performance profiling tools to analyze the application performance before introducing threads to a legacy application. We first measured the baseline performance and used advanced compiler optimizations to optimize the serial performance of the application, exploring the different optimizations available with both gcc and the Intel C++ Compiler for Linux.
The next phase of the threading methodology is to introduce threads, followed by debugging and performance tuning of the threaded application. This will be discussed in the third paper in this series.
- Intel VTune Performance Analyzer
- Intel Thread Checker
- Intel Math Kernel Library
- Intel Compilers
- GCC, the GNU Compiler Collection
- The Pricing of Options and Corporate Liabilities by Fischer Black and Myron Scholes.
- Financial Recipes contains a wealth of information on financial engineering including sample source code.