The authors are engineers for Intel. They can be contacted at [email protected] ccm.sc.intel.com.
Between 1992 and 1993, our compiler group at Intel was involved in developing optimizations for a new superscalar processor. We found these optimizations -- instruction scheduling and code generation to avoid processor stalls -- effective for many applications. However, we encountered some apps that didn't achieve the expected performance improvements. These programs seemed to have two common characteristics. First, they were large applications consisting of hundreds to thousands of functions, resulting in very large executable programs. Second, they spent little execution time in loops; instead, they actively exercised large portions of the executable code. Much of the application-execution time was spent in branches and function calls and returns.
One outgrowth of our experiences is our implementation of profile-guided optimizations. Profile-guided optimizations work by feeding information about how a program executes back to the compiler. This allows the compiler to focus its efforts more effectively on regions of programs that matter for execution time. In the case of the problem applications, profile information also enables the compiler to make the best use of the hardware resources available on the processor. First, it improves the performance of the instruction cache and paging mechanisms. It does this by helping the compiler to enhance program locality; that is, placing frequently executed code together, while moving infrequently executed code away. Second, profile information enables the compiler to improve the effectiveness of branch prediction hardware by rearranging branches.
The profile-guided optimization techniques we developed have been incorporated in the Intel C/C++ Compiler Plug-in (http://developer.intel.com/drg/pentiumII/ appnotes/compiler.htm), which can be integrated into Microsoft's Developer Studio. This makes it possible for you to use inline-assembly instructions that are currently not supported by Visual C++ 5.0. The Intel C/C++ Compiler Plug-in is compatible with Visual C++ 4.x or later compilers when it comes to command-line switches, inline-assembly format, object modules, library and DLL formats, debug and C++ symbol formats. Other optimizations provided by the plug-in (which are not currently available with Visual C++) include a rounding control option that optimizes floating point to integer conversions.
In this article, we discuss profile-guided optimizations -- how the compiler collects run-time information -- and present several of the optimizations performed by profile-guided compilers.
Figure 1 is a program fragment that illustrates a number of optimizations discussed here. The fragment contains a while loop enclosing an if statement. The shaded regions in the code represent basic blocks, the representation used by profiling. A basic block is a sequence of code in which either all or none of the code executes. Usually, a basic block ends with a branch, call, or return. Figure 2 illustrates the control-flow graph built from those basic blocks. Edges in the graph show how the flow of control can transfer between basic blocks, and each edge is labeled with the number of times control transfers through it.
Give the Compiler More Information
Most compilers make optimization decisions based on static heuristics. Notably, a compiler may decide whether to inline a function based on either its size or whether it is called from within a loop. Profile guidance enables the compiler to make better optimization decisions by using knowledge about how the program actually runs. With this profile information, the compiler can do a better job of inlining functions, ordering basic blocks, allocating registers, and so on.
Three-Step Build Process
In contrast to a traditional compiler that uses a single compilation to optimize an application, profile-guided compilation requires three phases:
- The first phase is the instrumented compilation. Instrumented compilation typically produces an executable program that contains probes in each of the basic blocks of the program. Each probe counts the number of times a basic block executes. If the block is a branch, the probe records the direction taken by that branch.
- The second phase is the profiled execution. When run, the instrumented program generates a data file that contains the execution counts for the specific run of the program.
- In the third phase, information from the profiled execution (or executions) of the program is fed back to the compiler. This data is added to the compiler's control flow graph.
Basic Block Ordering
Without profile feedback, the basic blocks of a function are usually ordered simply to eliminate branches to other branches. The blocks in an if-then-else construct, for example, may be ordered according to some heuristic to try to improve branch-predictor effectiveness, but, in general, blocks are placed according to simple syntactic rules. For example, Figure 3(a) shows how a conventional compiler might order the basic blocks in the executable for the example in Figure 1. The edges outside the layout show control transfers between blocks that are not adjacent. Basic block ordering rules often fail if the condition of an if statement is poorly predicted or predominantly takes the direction opposite of the heuristic. This results in three performance penalties that are addressed by profile-guided block layout.
First, many processors employ a first-time branch-prediction heuristic (for instance, "forward branches not taken, backward taken"). If a branch in a program does not conform well to this heuristic, then it is possible that the first-time prediction may be wrong most of the time. In the layout given in Figure 3(a), this prediction will be wrong because the first iteration of the loop exercises the code in block E rather than block D. This is relatively unimportant when the program is small and the branch executes a large number of times.
If the program is small enough, the processor can remember a prediction for each branch in the program, using the branch's history to predict its future behavior. However, for large applications, this may not be possible since there may be too many branches for the branch predictor to remember. If this happens, then static predictions will again be used for the branch. Using profile-guided block ordering, a compiler is able to select branch conditions to make better use of static branch prediction heuristics. Figure 3(b) shows a layout that makes better use of static prediction.
Second, compilers usually lay out blocks according to a static heuristic to reduce the number of branches along some path, say the then part of an if-then-else; path C-D-F in Figure 3(a). This choice requires an additional jump in the else part (block E). If the else part is executed more frequently, more jump instructions are executed. From the profiling information, the jump will be executed 90 times if the blocks are laid out as in Figure 3(a). A profile-guided compiler attempts to place blocks to minimize the number of jumps along the most frequently executed path. With profiling information, the compiler will note that the path from D to F executes less frequently than the path from E to F, and will place the code as in Figure 3(b). This results in an unconditional jump at the end of block D that executes only ten times.
Third, some blocks of a function -- often error handling and recovery code -- are seldom (or never) executed. Compilers often place such seldom-executed blocks in the middle of a function, filling the instruction cache and virtual memory paging system with code that is never executed. This is probably not significant for small programs, but for large applications, it's better to move infrequently executed code out of line to reduce the amount of code that is paged in or fetched into instruction cache. Profile-guided compilers can determine that such code is infrequently executed and move blocks out of the main body of the function, typically to the end of the function. This results in less instruction cache pollution and, to a lesser extent, fewer pages brought in during a run.
Profile-guided compilers may also be capable of moving infrequently executed blocks out of the code containing the function, and placing them at the end of the entire application. This function splitting can result in significant improvements in paging behavior, while retaining the functionality contained in the infrequently executed code. This optimization has a number of interesting implications in compiler design, such as how to compute address sizes.
One of the most powerful optimizations performed by a compiler is register allocation. A register allocator attempts to assign registers to variables in a way that minimizes the number of times variables are loaded from memory. Because it is common to have more variables than available processor registers, a register allocator must decide which variables to keep in registers, in which portions of a function, and which variables to reference from memory.
Typically, the register allocator uses a static heuristic, preferring to keep variables used in loops in registers, for example. This results in register allocation that is locally optimized within the loop, but it may not produce the best total performance.
For example, in Figure 1, if block D contains a computation that uses a large number of registers, then the compiler may not keep x in a register. If there were variables that were live across the loop, the compiler might move register stores to memory (spills) and reloads outside the loop. If the loop has few iterations, this decision may be worse than placing the spills and reloads inside the loop. Furthermore, in the example, the loads and stores required to perform the increment in block E may be expensive. Hence, keeping x in a register, spilling it on entry to block D, and restoring on exit from that block results in much better code and ten loads and stores of x rather than 90. Profile-guided compilers are able to use execution frequency to select points for spills and reloads that are infrequently executed. This allows the register allocator to use registers for variables in regions that are heavily executed, and avoid spilling variables around infrequently executed regions.
Function inlining is another optimization that benefits from profiling. Function inlining substitutes the code associated with a procedure call (call site) with the actual code for the function called (callee). Inlining removes the call overhead (setting up parameters, and so on) and makes it possible to optimize the callee in the context in which it is called.
Profile information provides execution counts of the call sites, thus enabling the function inliner to focus on the important call sites within the program. Function inlining without profile information may inline functions at call sites that are not important for performance, thereby increasing code size unnecessarily. It is important to control this code-size growth when inlining because it may affect instruction cache misses.
It is possible to profile before inlining, and vice versa. Each approach has its advantages. For example, the program in Example 1 illustrates a case where inlining before profiling is advantageous. In Example 1, if profiling occurs before inlining, then the profiling information for InlinedProc will reflect its behavior over all sites where it is called. Suppose a condition usually evaluates to False. Profile-guided optimization would arrange the branch so that the False block comes before the True block. However, if a condition is usually True when InlinedProc is called from A, then this results in exactly the wrong optimization for a copy of InlinedProc inlined into A.
On the other hand, Example 2 illustrates that it is difficult to do inlining before profiling, because recursive functions cannot be completely inlined. Furthermore, inlining all function calls, even in nonrecursive functions, may result in an executable program so large that it is not practical to run.
In our compiler implementation, we profile before making any inline decisions. To date, we have not observed cases where the behavior of a function differs dramatically based on its caller, which seems to support this choice.
Function layout is a form of code placement similar to basic block reordering. Function layout is not intended to reduce inaccurate branch predictions, but to improve instruction paging and instruction cache misses. This is achieved by placing functions with frequent calls between them close to each other in the executable code, making it more likely that the functions will be on the same page or on pages that can be kept in main memory simultaneously. For small functions, this also helps keep the functions in the instruction cache simultaneously.
A call graph shows the calling relationships between functions. There is one node for each function, and a line between a function A and a function B if A calls B. The edges have a weight that is the number of times A calls B in the instrumented run. Figure 4 shows a call graph that has the edges annotated with the call weights.
Function layout is a complicated problem and any reasonably fast solution to the problem is based on a heuristic. In our compiler, we have implemented three different algorithms for function layout. Two of them are based on a depth-first walk of the call graph, and the other is based on Pettis and Hansen's graph-collapse method (see "Profile Guided Code Positioning," by Karl Pettis and Robert C. Hansen, Proceedings of the ACM SIGPLAN '90).
Our descriptions of the three function layout algorithms use the call graph in Figure 4. Normally, the compiler places functions in the object file in lexical order, that is, the order in which they appear in the source file. Figure 5 shows the source order for our example.
The simplest function layout method is called depth-first ordering (DFO). Starting at the top of the call graph, the functions are ordered by doing a depth-first walk of the call graph. If profiling information is available, the outgoing edges from a function are walked from most to least frequent. If no profiling information is available, the outgoing edges are walked in random order. Even without profile information, this does a better job in most cases than just ordering the functions in source order.
Figure 6 is the DFO layout of our call graph in Figure 4. Starting at the root, function A is first placed. Following the highest call weight, function C is placed next. Continuing down the call graph, E and then D are placed. The DFO walk then returns to A, then places function B. The layout is now complete since function D has already been placed.
The second method is highest-count caller (HCC). This method is also based on a depth-first walk of the dynamic call graph, starting at the top. However, before the walk, only the incoming call edge with the highest execution count is kept for each function. This way, a function will be placed closest to the function that calls it most frequently. Figure 7 shows the resulting function layout.
The last method is closest is best (CIB). The idea is to always pick the call edge in the dynamic call graph with the highest execution count remaining in the call graph. If there are two edges with the same weight, one is chosen at random. For each edge selected, the caller and callee nodes are merged into one node, and the call edges are updated accordingly. Two merged function nodes in the call graph indicate that the functions are placed next to each other. This process continues until the call graph is reduced to one node. When a node is merged into an already merged node, the function is placed closest to the function with the most immediate call relation. Figure 8 shows the result of CIB on our call graph.
Using function layout, we have measured up to 80 percent improvement in Instruction Translation Lookaside Buffer misses (ITLB is an on-chip cache that saves paging information to increase paging performance). On small programs, however, this may not mean a large performance improvement, since small applications normally are not limited by paging and instruction cache performance. On very large applications however, these effects may result in a measurable performance impact. On some larger applications, we have seen two to four percent improvements in instruction cache misses.
Profile-guided optimizations provide numerous opportunities to improve performance, only some of which have been discussed here. For instance, to improve the performance of C++ programs, profile-guided optimizations must consider indirect function calls. C++ virtual function invocation can be optimized significantly by profiling not just the frequency of calls but also their targets. This enables the simple optimization of replacing the indirect function invocation by an if test for the predominant type, using the indirect call only when that test fails. Currently, our compiler does not profile the targets of indirect function calls, but we consider this the most significant extension yet to be added. In addition to reducing the number of indirect calls, this optimization also enables better function placement, better function inlining, and better compiler analysis.
One of the challenges for profile-guided optimizations is to achieve wide acceptance and usage by software vendors. The more-complicated build process required by application developers is often an impediment to wide acceptance. Furthermore, the profiled run of the program typically takes significantly longer than an optimized run, and the application developer may not know what constitutes a representative data set. The benefits of the additional overhead in first doing an instrumented compilation, then running the application with representative data input sets, and finally compiling with feedback of the profile information, must be understood and appreciated before profile-guided optimizations will become a part of the normal application build process. However, for large applications that are executed frequently, profile-guided optimizations may prove to be a significant source of performance.
Copyright © 1998, Dr. Dobb's Journal