Dr. Dobb's | Optimizing Open-Source Software for Intel Architectures

Harry is a product-line marketing manager at Intel's Software Products Division.

Compiler optimization plays an important role in the performance of open-source applications. Default optimization settings are often used during compilation that, in effect, leave some application performance unrealized. Through the use of aggressive compiler optimization, many applications show appreciable increases in performance. In many cases, you can increase the performance of an application in situations where detailed performance analysis is impractical. Understanding the techniques used to apply higher optimization settings and some of the benefits and costs of doing so is essential to improve end-user application performance.

The MySQL open-source database (www.mysql.com) is widely used to manage corporate data, handle transactions, and run e-commerce and data warehousing applications. It is a key component of LAMP, a web server solution consisting of Linux, Apache, MySQL, and PHP/Python/Perl. MySQL is available on many 32- and 64-bit platforms, including Linux, UNIX, BSD, Apple, Windows, IA-32, PowerPC, and SPARC. In this article, we focus on MySQL 4.1.12 running on Linux and Intel architectures. However, the process can be applied to other platforms as well.

The GNU Compiler Collection includes a C and C++ compiler, typically referred to as "gcc" and "g++," respectively. Here we refer to these compilers collectively as "GCC." The compiler is available for numerous architectures and operating systems, and offers good performance and compatibility with C/C++ language standards. While we focus on Versions 3.4.4 and 3.2.3, the techniques we describe are valid for earlier or later versions of the compiler.

For its part, the Intel C++ Compiler is an optimizing compiler offered on operating systems such as Windows and Linux, and on architectures such as IA-32, Itanium, and systems with Intel EM64T. The Intel Compiler offers good conformance to C and C++ standards, as well as binary compatibility with GCC. The Intel Compiler is tuned to offer high performance on Intel architecture processors, taking advantage of the latest microarchitectural features. In this article, we refer to Version 9.0, although the techniques can be applied to other versions.

Optimization Process

Optimization using the compiler begins with a characterization of the application. The goal of this step is to determine properties of the code that may favor one optimization over another and to help in prioritizing the optimizations. Large applications may benefit from optimizations for cache memory. If the application contains floating-point calculations, vectorization, which parallelizes floating-point computations, may provide a benefit.

The second step is to prioritize testing of compiler optimization settings based on an understanding of which optimizations are likely to provide a beneficial performance increase. Performance runs take time and effort, so it's essential to prioritize the optimizations that are likely to increase performance and to foresee any potential challenges in applying them. For example, some advanced optimizations require changes to the build environment. If you want to measure the performance of these advanced optimizations, you must be willing to invest the time to make these changes. At the least, the effort required may lower the priority. Another example is the effect of higher optimization on debug information. Generally, higher optimization decreases the quality of debug information. So besides measuring performance during your evaluation, you should consider the effects on other software-development requirements. If the debugging information degrades to an unacceptable level, you may decide against using the advanced optimization, or you may investigate compiler options that can improve debug information.

The third step, selecting a benchmark, involves choosing a small input set for your application so that performance of the application compiled with different optimization settings can be compared. In selecting benchmarks, keep in mind that benchmark:

The next step is to build the application using the desired optimizations, run the tests, and evaluate the performance. The tests should be run at least three times apiece. Our recommendation is to discard the slowest and fastest times and use the middle time as representative. We recommend checking your results as you obtain them, seeing if the actual results match up with your expectations. If the time to do a performance run is significant, you may be able to analyze and verify your collected runs elsewhere and catch any mistakes or missed assumptions early. Finally, if the measured performance meets your performance targets, it is time to place the build changes into production. If the performance does not meet your target, the use of a performance analysis tool (such as the Intel VTune Performance Analyzer) should be considered for your application code. Figure 1 summarizes the process for applying aggressive compiler optimization to application code.
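The run-three-times rule above can be sketched in a few lines of Python. This is a minimal illustration, not the harness used for the measurements in this article; the names middle_time and measure, and the idea of passing the benchmark in as a callable, are assumptions made for the example.

```python
import statistics
import time
from typing import Callable, Sequence


def middle_time(times: Sequence[float]) -> float:
    """Drop the fastest and slowest run and return the median of the rest."""
    if len(times) < 3:
        raise ValueError("need at least three runs")
    trimmed = sorted(times)[1:-1]  # discard one fastest and one slowest run
    return statistics.median(trimmed)


def measure(run_benchmark: Callable[[], None], repeats: int = 3) -> float:
    """Time run_benchmark() `repeats` times and return the representative time."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_benchmark()  # hypothetical callable that performs one benchmark run
        times.append(time.perf_counter() - start)
    return middle_time(times)
```

With exactly three runs, middle_time simply returns the middle value, which is the recommendation given above. With more runs it still trims one run from each end before taking the median.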

Figure 1: The process for applying aggressive compiler optimization to application code.

MySQL Optimization

The first optimization to try is the baseline optimization using -O2. For the Intel Compiler on IA-32, the -O2 option enables a broad range of compiler optimizations, such as partial redundancy elimination, strength reduction, Boolean propagation, graph-coloring register allocation, and sophisticated instruction selection and scheduling. In addition, single-file inlining occurs at -O2, so we expect some of its benefits. Because inlining is important to C++ performance, we also attempt more aggressive inlining by using single-file interprocedural optimization (-ip) and multiple-file interprocedural optimization (-ipo). The -ip option enables similar inlining to what is enabled at -O2, but performs a few more analyses that should result in better performance. The -ipo option enables inlining across multiple files. Interestingly, inlining tends to increase code size; however, if the inlining results in smaller code size for the active part of the application by reducing call and return instructions, the net result is a performance gain. Profile-guided optimization (-prof_use) is a great optimization to use with inlining because it provides the number of times various functions are called and therefore guides the inlining optimization to only inline frequently executed functions, which helps reduce the code-size impact.

The -O3 option enables higher level optimizations focused on data access. We try -O3 to see what kind of performance benefit results. Finally, we expect that vectorization would not provide a performance benefit; however, vectorization is fairly easy to use, so we will attempt it. Table 1 summarizes the optimizations that will be attempted and the reasons for doing so.

Optimization	Expectation
-O2	Baseline optimization
-O3	Data access optimizations should provide benefit
Single-file interprocedural optimization (-ip)	Stronger inlining analysis over -O2
Multiple-file interprocedural optimization (-ipo)	Multiple-file inlining should bring further benefit
Profile-guided optimization (-prof_use)	Help performance through code size optimizations
Vectorization (-xN)	Don't expect an improvement

Table 1: Intel compiler optimization evaluation.

Our use of GCC on Itanium-based systems running Linux focused on a few optimizations. The -O3 option is a superset of -O2 optimization and adds simple inlining of functions. We expect -O3 will be beneficial. The -O3 optimization also includes register allocation that may benefit architectures with many registers, such as the Itanium processor family (IPF). MySQL uses a subset of C++, and GCC offers options that turn off the generation of C++ RTTI and exception handling (EH) information. The use of these options may benefit performance by reducing code size and removing unnecessary exception-handling code. Be careful when applying the options to disable generation of RTTI and EH. If linked-in libraries or application code planned for the future depends on this information, you may run into problems. One other optimization that is used is -felide-constructors, a minor C++ optimization.
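As a rough sketch, the GCC settings described above could be assembled like this. The article does not spell out the RTTI and EH options; -fno-rtti and -fno-exceptions are the usual GCC spellings and are an assumption here, as is the helper name gcc_cxxflags.

```python
def gcc_cxxflags(disable_rtti_eh: bool = True) -> str:
    """Build a CXXFLAGS string for the GCC experiment described in the text."""
    flags = ["-O3", "-felide-constructors"]
    if disable_rtti_eh:
        # Assumed spellings; only safe if no linked code relies on RTTI or exceptions.
        flags += ["-fno-rtti", "-fno-exceptions"]
    return " ".join(flags)


if __name__ == "__main__":
    print(gcc_cxxflags())
    print(gcc_cxxflags(disable_rtti_eh=False))
```

Keeping the RTTI and EH switches behind a single parameter makes it easy to fall back to the safer setting if a linked-in library turns out to need that information.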

A special benchmark called "SetQuery" (www.cs.umb.edu/~poneil/dbppp/) was developed to help in this optimization effort and was used to measure the performance of the MySQL database on Intel Architecture. SetQuery returns the time that the MySQL database takes to execute a set of SetQuery runs. The SetQuery benchmark measures database performance in a decision-support context such as data mining or management reporting. The benchmark calculates database performance in situations where querying the data is key to application performance, as opposed to reading and writing records back into the database.

Most of the optimizations were straightforward to use; however, a few required some effort to test. Multiple-file interprocedural optimization essentially delays optimization until link time so that every file and function is visible during the process. To properly use -ipo, ensure the compiler flags match the link flags and that the proper linker and archiver are used. The original build environment sets the linker to ld and the archiver to ar; we changed these definitions to xild and xiar, respectively.
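The build changes for -ipo can be sketched as an environment passed to the build. This is an illustration under several assumptions: that the build honors the conventional CC, CXX, CFLAGS, CXXFLAGS, LDFLAGS, LD, and AR variables, and that the Intel compiler drivers are invoked as icc and icpc. Only the xild and xiar substitutions and the matching compile and link flags come from the text.

```python
import os
from typing import Dict, Optional


def ipo_build_env(base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment set up for a multiple-file IPO build."""
    env = dict(os.environ if base_env is None else base_env)
    flags = "-O3 -ipo"  # the same flags must reach both compile and link steps
    env.update({
        "CC": "icc",        # assumed driver name for the Intel C compiler
        "CXX": "icpc",      # assumed driver name for the Intel C++ compiler
        "CFLAGS": flags,
        "CXXFLAGS": flags,
        "LDFLAGS": flags,
        "LD": "xild",       # replaces ld, as described in the text
        "AR": "xiar",       # replaces ar, as described in the text
    })
    return env


if __name__ == "__main__":
    for name in ("CC", "CXX", "CFLAGS", "CXXFLAGS", "LDFLAGS", "LD", "AR"):
        print(f"{name}={ipo_build_env({})[name]}")
```

A build script could then pass the result to its configure or make step, for example with subprocess.run(["./configure"], env=ipo_build_env(), check=True), assuming the project uses such a script.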

One challenge in using profile-guided optimization is determining whether the profile information is correct and is being used properly by the compiler. We were surprised by the profile-guided optimization results and questioned the validity of the profiling data. Two techniques for verifying profile information are manually inspecting the compilation output and using the profmerge facility to dump profile information. During compilation, if the compiler does not find profile information for some number of functions in the file that is being compiled, the compiler emits this diagnostic:

If the compiler is able to find profile information for all functions in a file, the compiler does not emit a diagnostic. Make sure the compiler either doesn't emit the above diagnostic or emits it while still reporting that a number of the functions in the file use profile information. If you know the most frequently executed functions in your application and the file containing those functions shows few or no routines compiled with profile information, the profile information may not be applied correctly. The second technique to verify profile information is to use the profmerge application with the -dump option. profmerge -dump dumps the contents of the profile data file (pgopti.dpi). Search for a routine that is known to execute frequently and find the number of blocks in the function (BLOCKS:) and then the section "Block Execution Count Statistics," which contains counts of the number of times the blocks in the function were executed.

Results

The performance runs using the Intel Compiler on IA-32 systems running Linux confirmed a number of our expectations; however, a few results surprised us. The use of stronger inlining with -ip resulted in higher performance over the baseline (8.68 percent) and a larger code size (1.35 percent). Code size was measured as the total size returned by the size command of all executables in the client directory. Surprisingly, the use of -ipo did not result in as great a performance improvement (6.91 percent) as -ip, but did result in larger code size (1.46 percent). The biggest surprise was the combination of -ip and -prof_use, which resulted in lower performance than the baseline (-3.31 percent). True to expectation, the combination of -ip and -prof_use resulted in a code size increase of only 0.04 percent, which is an improvement over -ip alone. Table 2 summarizes the results of several different option sets on a Pentium 4 desktop system. The best performance was obtained by using -O3 -ip, so we chose to use these two options in our production build. If greater performance were desired, analysis of why -ipo and -prof_use did not increase performance would be the first priority.

Optimization	Code Size (in bytes)	Code Size Increase (vs. baseline)	Execution Time (in seconds)	Execution Time Improvement (vs. baseline)
-O2	16488449	0.00%	526	0.00%
-O3	16488449	0.00%	520	1.15%
-O3 -ip	16710369	1.35%	484	8.68%
-O3 -ipo	16729709	1.46%	492	6.91%
-O3 -ip -prof_use	16494273	0.04%	544	-3.31%
-O3 -ip -xN	16871105	2.32%	487	8.01%

Table 2: Intel compiler optimization and SetQuery performance.
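The percentage columns in Table 2 can be recomputed from the raw sizes and times. The formulas below are inferred from the published figures, not stated in the article: the code size increase appears to be taken relative to the baseline size, while the execution time improvement appears to be taken relative to the measured time rather than the baseline time. The function names are chosen for this sketch.

```python
# (options, code size in bytes, execution time in seconds), copied from Table 2
BASELINE = ("-O2", 16488449, 526)
RUNS = [
    ("-O3", 16488449, 520),
    ("-O3 -ip", 16710369, 484),
    ("-O3 -ipo", 16729709, 492),
    ("-O3 -ip -prof_use", 16494273, 544),
    ("-O3 -ip -xN", 16871105, 487),
]


def size_increase_pct(size: int, base_size: int) -> float:
    """Code size growth as a percentage of the baseline size."""
    return round((size - base_size) / base_size * 100, 2)


def time_improvement_pct(seconds: float, base_seconds: float) -> float:
    """Time saved as a percentage of the measured (not the baseline) time."""
    return round((base_seconds - seconds) / seconds * 100, 2)


if __name__ == "__main__":
    _, base_size, base_seconds = BASELINE
    for options, size, seconds in RUNS:
        print(options,
              size_increase_pct(size, base_size),
              time_improvement_pct(seconds, base_seconds))
```

Under these formulas the -prof_use run comes out negative, which agrees with the -3.31 percent quoted in the text.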

Using both GCC and the Intel Compiler, we applied this process to a wider range of platforms, including IA-32. (The performance data has been provided by MySQL AB. The tests were performed using MySQL Version 4.1.12, Intel C++ Compiler 9.0 for Linux, and the GNU C Compiler 3.4.4. The operating system for the Pentium 4 processor-based system was SuSE Linux Enterprise Server 8.2, and hardware specifications were Pentium 4 processor 2.7 GHz, 1-GB RAM, 512-KB L3 cache, Hyper-Threading Technology switched OFF.) We also applied it to Itanium (MySQL version and compiler versions same as above. The operating system for the Itanium 2-based server was Red Hat Enterprise Linux AS 2.1, update 3, and hardware specifications were Itanium 2 processor 1.2 GHz, 1-GB RAM, 6-MB L3 cache.) Other systems included the Intel EM64T. (MySQL version and compiler versions same as above with the following exception, GNU C Compiler 3.2.3. The operating system for the 64-bit Intel Xeon processor-based server was Red Hat Enterprise Linux AS 3, Taroon Update 2, and hardware specifications were 2-way Intel Xeon processor 3.2 GHz, 4-GB RAM, 1-MB L2 cache, Hyper-Threading Technology switched ON.)

The aggressive optimization settings used in each run are summarized by architecture as follows:

Figure 2 graphs the total execution time and relative performance gains comparing the various compilers to their default optimization levels. The results show appreciable performance gains over default optimization using the Intel Compiler on IA-32 (6 percent), the Intel Compiler on Itanium (2 percent), and GCC on Itanium (3 percent). Our expectation that greater optimization will lead to a performance increase was met in these cases. We did not go to the detail of extensively verifying that each performance gain was directly caused by a specific optimization. For example, with GCC, -O3 turns on several optimizations above -O2; we didn't try each optimization individually to see which specific optimization led to greater performance. In addition, some of the aggressive optimization settings did not result in significant performance increases, and in one case resulted in lower performance than the baseline optimization. We chose to share these numbers for a very important reason—stronger compiler optimizations do not guarantee better performance in all cases. This stresses the need to understand some level of detail in applying optimization to your application.

Figure 2: Total execution time and relative performance gains comparing the various compilers to their default optimization levels.

Conclusion

MySQL was optimized by using a set of compiler optimizations above and beyond the default -O2 optimization. By applying the process of characterizing the application, prioritizing compiler optimization experiments, selecting a representative benchmark, and measuring performance results, it is possible to improve the performance of applications by taking advantage of higher levels of compiler optimization. This improvement does not require going to the depths of traditional performance analysis, but yields performance benefits nonetheless. Relying on aggressive compiler optimization is a practical first-cut technique for improving the performance of your application; if further performance gains are desired, the next step of low-level performance analysis should be considered.

Questions to Consider
Question	If the Answer is Yes
Is the application large or does the working data set exceed the size of the cache?	The application may be sensitive to cache optimizations.
Are there large amounts of numerical or floating-point calculations?	Vectorization may help performance.
Does the source code make heavy use of classes, methods, and/or templates?	C++ code usually benefits from inlining.
Is the execution spread throughout the code in many small sections?	Code size optimizations may be beneficial.
Is the data access random, localized, or streaming?	Data access optimizations may be beneficial.
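The table above can be read as a simple lookup from characterization answers to candidate optimizations. The sketch below encodes it that way; the dictionary keys and the function name are hypothetical, and the example answers describe an imaginary C++ application, not a measured one.

```python
# Keys are shorthand for the questions in the table; values are its guidance.
SUGGESTIONS = {
    "large_or_exceeds_cache": "The application may be sensitive to cache optimizations.",
    "heavy_floating_point": "Vectorization may help performance.",
    "heavy_classes_templates": "C++ code usually benefits from inlining.",
    "many_small_hot_sections": "Code size optimizations may be beneficial.",
    "data_access_pattern_known": "Data access optimizations may be beneficial.",
}


def candidate_optimizations(answers: dict) -> list:
    """Return the guidance for every question answered yes, in table order."""
    return [advice for key, advice in SUGGESTIONS.items() if answers.get(key)]


if __name__ == "__main__":
    # Hypothetical answers for a large C++ application with little numeric code.
    example = {
        "large_or_exceeds_cache": True,
        "heavy_floating_point": False,
        "heavy_classes_templates": True,
    }
    for advice in candidate_optimizations(example):
        print(advice)
```

The output is only a starting list for prioritizing experiments; each suggestion still has to be confirmed by a benchmark run.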