Dr. Dobb's Journal, June 1989

OPTIMIZATION TECHNOLOGY

Code optimizers are one way to take the drudgery out of programming

Keith Rowe

Keith is a programmer with HCR, and he can be reached at HCR Corporation, 130 Bloor Street West, Toronto, Ontario, Canada M5S 1N5.


Ask any programmer what the three chief design goals of his next project are, and he'll probably answer, "speed, speed, and more speed." Writing faster code is something we all want to do, but it usually means a lot of extra, tedious work. It also requires hard-won knowledge of the underlying architecture, and in the modern heterogeneous computing environment it means learning the internals of a number of different machines.

Optimizers, however, let you create faster code without learning the innermost secrets of a particular configuration. This article explains how optimizers work and highlights some of the benefits you can expect from them.

What is an Optimizer?

An optimizer is a program that takes as input a code sequence and generates a second code sequence which is identical in effect but is more efficient in its use of memory space or in its execution time. In today's machines, memory space is a plentiful resource, so most optimizers concentrate on execution time. The term "optimizer," however, can be a misnomer, because there is no guarantee that the resulting version of the code is the fastest version possible.

In general, the optimizer examines the input code sequence for patterns of instructions that can be replaced with more efficient patterns. These replacements are called transformations. Valid transformations must have three properties:

1. Transformations must be correct. The resulting code sequence must have the exact same semantics as the original program.

2. Transformations must be more efficient. The average execution time of the transformed program must be faster than the original for most cases. (Note that in some cases a transformation may slow down the code rather than speed it up. The designer of the optimizer must carefully examine such risky transformations to see if they are justified.)

3. Transformations must be worth finding. Some transformations take a great amount of computing resources to uncover. If the speedups they provide are minimal, they may not be worth the effort -- especially if the code being optimized will be run infrequently or is still being tested and changed.

A simple example of a valid transformation is constant folding. Expressions that can be seen to evaluate to a constant value are calculated during optimization and replaced with the result. For instance, a=3+6*5 would be replaced with a=33. This clearly has the same effect -- a still has a value of 33 -- and assigning the value directly saves an addition and a multiplication operation (along with some register operations), so it is more efficient. Because transformations like this one are easy to spot and always speed up the executing code, they are certainly worth finding.
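
To make the idea concrete, here is a small sketch of how a folding pass might work on an expression tree. This is not PCO's code; the Expr structure and the fold routine are invented for illustration and handle only constant leaves and the + and * operators.

  #include <stdio.h>

  /* Hypothetical expression node: a constant leaf (op == 0) or a binary operator. */
  typedef struct Expr {
      char op;                      /* '+', '*', or 0 for a constant leaf */
      long value;                   /* valid only when op == 0            */
      struct Expr *left, *right;
  } Expr;

  /* Recursively fold any subexpression whose operands are all constants. */
  static void fold(Expr *e)
  {
      long l, r;

      if (e == NULL || e->op == 0)
          return;
      fold(e->left);
      fold(e->right);
      if (e->left->op != 0 || e->right->op != 0)
          return;                   /* an operand is still unknown        */

      l = e->left->value;
      r = e->right->value;
      switch (e->op) {
      case '+': e->value = l + r; break;
      case '*': e->value = l * r; break;
      default:  return;             /* operator not handled in this sketch */
      }
      e->op = 0;                    /* the node becomes a constant leaf   */
  }

  int main(void)
  {
      /* Build the tree for a = 3 + 6 * 5 and fold it down to 33. */
      Expr three = {0, 3, NULL, NULL}, six = {0, 6, NULL, NULL}, five = {0, 5, NULL, NULL};
      Expr mul   = {'*', 0, &six, &five};
      Expr add   = {'+', 0, &three, &mul};

      fold(&add);
      printf("folded value: %ld\n", add.value);    /* prints 33 */
      return 0;
  }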

Optimizers are closely associated with compilers. Optimization can be thought of as an additional pass (or in some cases passes) the compiler makes to improve the code. This pass may occur between parsing and code generation passes or after code generation, depending on the type of optimization being performed.

The most obvious reason to use an optimizer is to create faster code. The optimized executable performs the same task in fewer cycles and thus increases the throughput of the machine. In a way, this is like buying a cheap form of "upgrade" for your system -- more gets done on the same machine without spending a lot of money on new hardware. Beyond the obvious speed-up of executable code, there are other advantages to using an optimizer that make themselves felt at all points in the software development cycle.

With an optimizer, much of the often tedious work of code tweaking is done automatically and with greater consistency. Note that automatic optimization is unlikely to be more efficient than exquisitely hand-tuned code, but it requires far less work from the programmer to produce a similar result. Also, no optimizer will automatically replace a bubble sort with a quick sort -- the responsibility of selecting an appropriate algorithm still rests firmly with the programmer.

Optimizers also allow code to be more readable and more portable. Most programmers have seen code so contorted in the name of speed that its functionality is almost impossible to fathom. An optimizer can do many of these rewrites during compilation, which means the programmer can concentrate on making the algorithm clear and the logic clean, improving the code's maintainability. Similarly, many machine-specific "tricks" that improve code can be kept in the optimizer, so the original source can be ported more easily.

Finally, optimizers allow hardware designers to produce more efficient architectures. RISC machines such as the MIPS and the IBM RT rely on optimizing compilers to ensure that the machine code they execute takes full advantage of their unique architectural features and avoids their areas of weakness as much as possible.

Global and Local Optimization

Optimizers can be designed to look for transformations at two main levels: global and local. Global optimization seeks to modify the source code (or the compiler's intermediate representation of it) to find transformations that are apparent only when a function is examined as a whole, whereas local optimization usually operates on small pieces of assembly code to find improvements of a more immediate nature.

A great number of profitable transformations can be made by looking only at small pieces of assembly code without needing to understand the program as a whole. Programs that find these localized, low-level transformations are called peephole optimizers or sometimes just peepholers.

Figure 1 shows two typical code sequences (in IBM RT assembly code) and their replacements. In the first case, an unconditional branch is made to the immediately following line. This commonly occurs when a compiler is creating jump tables, and the branch can simply be eliminated. The removal of the branch instruction is correct, more efficient, and trivial to find, making it an ideal transformation.

Figure 1: Two code sequences and their respective optimized replacements

   Original Code       Replacement
   --------------------------------
      b L.16
   L.16: ...            L.16: ...
   --------------------------------
      st 0, 16(14)      st 0, 16(14)
      l 1, 16(14)       lr 1, 0

In the second case, register 0 is spilled out to memory and that memory location is immediately reloaded into register 1. On the RT, as on most machines, memory accesses are slower than register transfers, so it is appropriate to replace the load from memory with a register-to-register load.

Notice that in both cases finding the transformations did not require anything more than looking at the immediately surrounding code. Peepholers do not need to do a lot of analysis and so they can be quite quick. In general, a peepholer makes the compiler's job much easier because the compiler doesn't need to keep excessive state information.
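
To give a flavor of how simple a peepholer can be, here is a sketch of the first transformation in Figure 1 applied to a list of assembly lines held as text. The line-at-a-time text matching is a simplification for illustration; a production peepholer works on the assembler's internal representation and recognizes many more patterns than this.

  #include <stdio.h>
  #include <string.h>

  /*
   * Remove "branch to the very next line" instructions from a list of
   * assembly lines.  Returns the new line count.
   */
  static int remove_redundant_branches(char *lines[], int n)
  {
      int in, out;
      char op[8], label[64];

      for (in = 0, out = 0; in < n; in++) {
          /* Is this an unconditional branch, e.g. "b L.16"? */
          if (in + 1 < n &&
              sscanf(lines[in], " %7s %63s", op, label) == 2 &&
              strcmp(op, "b") == 0) {
              size_t len = strlen(label);
              /* Does the next line define that label, e.g. "L.16: ..."? */
              if (strncmp(lines[in + 1], label, len) == 0 &&
                  lines[in + 1][len] == ':')
                  continue;                  /* drop the useless branch */
          }
          lines[out++] = lines[in];
      }
      return out;
  }

  int main(void)
  {
      char *code[] = { " b L.16", "L.16: st 0, 16(14)", " ai 2, 2, 1" };
      int i, n = remove_redundant_branches(code, 3);

      for (i = 0; i < n; i++)
          printf("%s\n", code[i]);           /* the branch is gone */
      return 0;
  }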

There is a much larger class of transformations that can be found only when an entire function is examined at one time. Global optimizers can find these high-level transformations by analyzing the semantic content of a given function. Constant folding is an example of a global optimization.

Global optimizers can be designed to work directly with the source code, with some intermediate representation produced by the compiler, or with the generated assembly code. If you choose to work with the assembly code, a great deal of the high-level semantic content is obscured, but working directly with the source code requires a parsing and analysis phase similar to standard compilation. The optimizer I work with (HCR's Portable Code Optimizer, PCO) takes the intermediate code from the first pass of the compiler and generates an optimized intermediate code representation, which can be sent on to the second pass for code generation. Because this intermediate code is semantically similar to source code but is in a more easily analyzable format, PCO has the advantage of being able to work at a source-code level without the overhead usually associated with it.

There are a number of differences in the use of global and peephole optimizers. Peepholers execute quickly but global optimizers, which need to do a more complete analysis of the code, tend to take much longer. This makes global optimization something best left undone during the iterative test and debug cycle of program development.

A Closer Look at Global Optimization

There are quite a few transformations that can be made by a global optimizer and, depending on the architecture of the machine, some may be done and others may not. As an example of the kind of transformations a global optimizer can perform, let's look at two functions from a typical piece of C code, shown in Example 1. This code sets the first five elements of a global array a to a computed value. One interesting feature of the calculation is that the division is kept in a separate function that checks the divisor and issues an error message if it is zero. Although this example shows the results of intermediate steps as C code, remember that all the transformations occur in the optimizer, in its own internal representation.

Example 1: Sample code before optimization

  set_array (j)
  int j;
  {
          int x,n,i;
          x = 5; n = 2;
          for (i=0;i<5;i++) {
                  a[i] = testDiv (x,n) + j;
          }
  }
  testDiv (a,b)
  int a;
  int b;
  {
          if (b != 0 ) {
                 return ( a/b);
          } else {
                 fprintf(stderr, "division by zero\n");
                 return( 0 );
          }
  }

The first transformation performed is inlining. If a function is very short and is called in only a few places it may be worthwhile to write out the whole function wherever it is called, and thus save the overhead of the function call. (C++ provides inlining as a standard part of the language, but C does not, so an optimizer is needed to make this change.) Example 2 shows a version of the code with testDiv inlined.

Example 2: Same code as in Example 1 but with function testDiv inlined

  set_array (j)
  int j;
  {
          int x,n,i;
          x = 5; n = 2;
          for (i=0;i<5;i++) {
                  int temp;
                  if ( n != 0 ) {
                          temp = x/n;
                  } else {
                          fprintf(stderr, "division by zero\n");
                          temp = 0;
                  }
                  a[i] = temp + j;
          }
  }

The decision to perform inlining involves many factors. Because the function is short, inlining it will not expand the size of the executable to any great degree. The saving of a single function call would not be that important, but in this case the call is within a loop, and any saving inside a loop is magnified by the number of times the loop repeats. Here inlining is a profitable transformation, but the break-even point at which it becomes unprofitable depends to a great degree on the underlying architecture. Finding this break-even point is part of the tuning that takes place when an optimizer is designed.
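
As a rough illustration of how such a tuning decision might be encoded, consider the following sketch. The thresholds and the weighting by loop depth are invented for illustration; a real optimizer arrives at its numbers by measurement on the target machine.

  /* A hypothetical, tunable inlining heuristic.  The thresholds and the
   * weighting of loop depth are illustrative only; a real optimizer tunes
   * them empirically for each target machine.                             */
  #define MAX_INLINE_SIZE  12     /* intermediate-code statements          */
  #define MAX_CALL_SITES    4

  int worth_inlining(int body_size, int call_sites, int loop_depth)
  {
      /* A call inside a loop is executed many times, so tolerate a larger
       * body before the cost in code size outweighs the saved call overhead. */
      int allowed = MAX_INLINE_SIZE * (1 + loop_depth);

      return body_size <= allowed && call_sites <= MAX_CALL_SITES;
  }

By these made-up measures, the call to testDiv in Example 1 (a handful of statements, a single call site, and a loop around the call) qualifies easily.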

The second transformation involves constant propagation and constant folding. The use of immediate values is more efficient than accessing memory, and any expression that can be evaluated at compile time will save cycles at runtime. Thus, it is important to find all the places in the code where values can be predicted. A flow analysis of the variables in a piece of code can show where they are set and reset. In our example, x and n are set at the beginning and never changed -- they act as constants. Thus, the optimizer can propagate the constant value of these variables to every place they are used. The value of i cannot be predicted, so it must remain.

Notice that when x and n are replaced with their values we get the expression temp=5/2. Here is another opportunity for constant folding. Constant propagation and constant folding often have this kind of interdependence between them. Each transformation may provide opportunities for other transformations that did not exist before.

For example, the replacement of n in the if statement (Example 3) makes the conditional a constant, which allows us to make another valuable transformation -- dead code elimination. The code in the else branch of the statement is now unnecessary, so it and the if statement can be removed. Dead code elimination wins on two counts: it reduces the size of the executable and it saves performing an expensive test and branch operation.

Example 3: The replacement of n in the if statement makes the conditional a constant, making dead code elimination possible

  set_array(j)
  int j;
  {
          int i;
          for (i=0;i<5;i++) {
                  int temp;
                  if ( 2 != 0) {
                          temp = 2;
                  } else {
                          fprintf(stderr, "division by zero\n");
                          temp = 0;
                  }
                  a[i] = temp + j;
          }
  }

After the dead code is eliminated, the value of temp won't change inside the loop, so it can be propagated into the next statement, leaving us with the code in Example 4.

Example 4: The code in Example 3 after dead code has been eliminated

  set_array(j)
  int j;
  {
          int i;
          for (i=0;i<5;i++) {
                  a[i] = 2 + j;
          }
  }

Now we have an expression, 2+j, which can't be folded because the value of j cannot be predicted, but it is still wasteful to evaluate it on every iteration of the loop when its value will not change. An optimizer can move blocks of code like this outside the loop using a transformation called loop invariant code motion. In this instance the profit gained by the transformation may be marginal, and the replacement might not have taken place at all had the optimizer been tuned differently, but in general, code motion is an important source of increased code efficiency.

At last we have the code in Example 5. Although Examples 1 and 5 have the same net result, the number of assembler instructions the code executes on an IBM RT drops from 156 to 122, an improvement of 21 percent. Note that instruction counts do not give an exact representation of the improvement, because the timing of the individual instructions must also be considered, but on the RT most instructions are identical in execution time. This code is representative of the kind of improvement that can be expected with global optimization.

Example 5: The final code. It has the same net result as Example 1, but the number of instructions executed improves by 21 percent

  set_array(j)
  int j;
  {
          int temp2,i;
          temp2 = 2 + j;
          for (i=0;i<5;i++) {
                  a[i] = temp2;
          }
  }

Implementing a Global Optimizer

Global optimizers are large and complicated pieces of code. (The PCO, for instance, is half again as large as the rest of the compiler with which it works.) This is reasonable because the understanding of the code required for optimization is much greater than for compilation. Rather than trying to understand all of this code, let's focus on a few of the principal data structures and show how they are used to find optimizing transformations.

The first step in analyzing the code is to break it into basic blocks. A basic block is a sequence of instructions that executes without any transfer of control occurring within the block. Most of the work of the optimizer is performed on these blocks or on structures built up from them. Basic blocks are used to create a flow graph, which adds flow-of-control information to the set of basic blocks. Each block is a node in the graph, with each directed arc representing a possible transfer of control from one block to another. Figure 2 shows the translation of a typical piece of code into a flow graph. Notice how loops and if statements both affect the flow of control. Flow graphs are a much easier representation to work with because the powerful techniques of graph theory can be used to analyze the relationships between the various nodes (or in this case, basic blocks).

For example, the problem of finding a loop in the code is translated to finding a strongly connected subgraph (a subset of nodes in the graph such that each node in the set can reach all other nodes in the subset) with a unique entry point. Although this doesn't sound easier, it is well-defined mathematically, and graph theory provides algorithms to find such a subgraph.
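
To make the discussion concrete, here is a minimal sketch of how the basic block and flow graph structures might be declared in C. The field names and the two-successor limit are illustrative assumptions, not PCO's actual declarations.

  /* A basic block: a straight-line run of statements that is entered only
   * at the top and left only at the bottom.                               */
  struct stmt;                            /* an intermediate-code statement */

  struct block {
      struct stmt  *first, *last;         /* the statements in the block    */
      struct block *succ[2];              /* possible transfers of control  */
      int           nsucc;                /* 0, 1, or 2 outgoing arcs       */
  };

  /* The flow graph ties the blocks together under a single entry point.   */
  struct flowgraph {
      struct block  *entry;
      struct block **blocks;              /* every block in the function    */
      int            nblocks;
  };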

An important extension of the flow graph is the data flow graph. A data flow graph is created by taking a flow graph and adding information about variable usage to each node. The information added varies, depending on the use to which the data flow graph is put, but it is usually one or more sets of variable names attached to each of the blocks.

Dead code elimination uses a type of data flow analysis called live variable analysis. If a variable is defined (appears on the left-hand side of an assignment) and is not used subsequently, the defining statement can be eliminated. Because all the statements in a basic block occur in sequential order, it is fairly easy to detect these cases within a single block. But what if the variable is used in a later piece of code? Detecting this case requires a complete analysis of the use and definition patterns of the code, and data flow analysis becomes essential.

In live-variable analysis, each of the blocks in the flow graph has two sets of variable names: in and out. A variable is considered to be live in a block if it is a member of the out set of the block. The out set of a block is the union of all the in sets of the block's successors. The in set of a block contains all the variable names used in an expression in the block, as well as all the names in the out set that are not defined in the block. The last definition of a variable in a block can be eliminated as dead code if the variable name does not appear in the out set of the block.
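
These definitions translate directly into a small iterative algorithm. The sketch below is self-contained: each block carries a use set (names used before being defined in the block), a def set, and the computed in and out sets, all represented as bits in an unsigned long. The fixed sizes and bit-mask sets are simplifying assumptions; a real optimizer uses arbitrarily long bit vectors.

  #include <stdio.h>

  #define MAX_SUCC 2

  struct block {
      unsigned long use;          /* names used before being defined here   */
      unsigned long def;          /* names defined in this block            */
      unsigned long in, out;      /* the computed live-variable sets        */
      int succ[MAX_SUCC];         /* indices of successor blocks, -1 = none */
  };

  /* Repeat  out = union of successors' in sets,  in = use | (out & ~def)
   * until no set changes.                                                  */
  static void live_variables(struct block b[], int n)
  {
      int changed = 1, i, s;

      while (changed) {
          changed = 0;
          for (i = n - 1; i >= 0; i--) {       /* backwards converges faster */
              unsigned long out = 0, in;

              for (s = 0; s < MAX_SUCC; s++)
                  if (b[i].succ[s] >= 0)
                      out |= b[b[i].succ[s]].in;
              in = b[i].use | (out & ~b[i].def);
              if (in != b[i].in || out != b[i].out) {
                  b[i].in = in;
                  b[i].out = out;
                  changed = 1;
              }
          }
      }
  }

  int main(void)
  {
      /* Block 0 defines variables 0 and 1, then falls into block 1, which
       * uses variable 0.  Variable 1 is never used again, so any statement
       * that only defines it is dead code.                                 */
      struct block b[2] = {
          { 0,        (1UL << 0) | (1UL << 1), 0, 0, { 1, -1 } },
          { 1UL << 0, 0,                       0, 0, { -1, -1 } }
      };

      live_variables(b, 2);
      printf("live out of block 0: %#lx\n", b[0].out);  /* 0x1: only variable 0 */
      return 0;
  }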

Figure 3 shows a data flow graph created for live variable analysis. If you imagine that the variables c and d are used in a later piece of code, the in and out sets will be as shown. The bottom block has the same in set as its successor's out set but adds an a to reflect the use of a in that block. The middle two blocks both remove c from their in sets, because c is defined in those blocks. The top block is of particular interest: b is defined there, but there is no entry for b in the out set. Thus, the line b++; is dead code and can be eliminated. Other forms of data flow analysis are used to perform loop invariant code detection, copy propagation, and some types of common subexpression removal.

The structures discussed so far are used to analyze large pieces of the code, but many transformations can be found within individual basic blocks, and another data structure, the directed acyclic graph (dag), is used in this case. A dag is a data structure in which each node represents either a variable or an operator. Each operator node has pointers to other nodes representing its operands. The dag is constructed in such a way that no circular references ever appear. A dag does not model control flow, so it is a perfect way to form a more easily analyzed representation of a basic block.

A dag starts with a collection of leaf nodes, one for each variable used in the block. The statements in the block are examined one at a time and, for each operator, a node is created whose children are the nodes representing its operands. If the result of the operation is stored in a variable, the node is labelled with the variable's name; the next time that variable is used, the reference points at this operator node rather than at the original leaf node. If a node to be added would have identical children and an identical operator to an existing node, it is not added to the dag; instead, the variable it is assigned to is added to the variable list of the existing node. Figure 4 shows a dag created for a typical instruction sequence. The second use of d uses the same leaf node, but the second use of b does not, because b has been used to label the '+' operator node.
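
A minimal sketch of the heart of that construction is shown below: a routine that either finds an existing node with the same operator and the same children or creates a new one. The fixed-size table and the single label per node (rather than the variable list described above) are simplifications for illustration; the reuse of an existing node is exactly what exposes common subexpressions.

  #include <stdio.h>
  #include <string.h>

  #define MAX_NODES 64

  /* One dag node: a leaf (op == 0) holding a variable's value on entry to
   * the block, or an operator node whose children are other nodes.        */
  struct node {
      char op;                    /* 0 for a leaf, else '+', '/', ...       */
      int  left, right;           /* child indices (operator nodes only)    */
      char label[16];             /* variable that currently names this value */
  };

  static struct node dag[MAX_NODES];
  static int nnodes;

  /* A leaf node for a variable's value on entry to the block. */
  static int leaf_node(const char *name)
  {
      dag[nnodes].op = 0;
      dag[nnodes].left = dag[nnodes].right = -1;
      strcpy(dag[nnodes].label, name);
      return nnodes++;
  }

  /* Find an existing node with this operator and these children, or create
   * one.  Reusing an existing node is how common subexpressions are found. */
  static int op_node(char op, int left, int right, const char *assigned_to)
  {
      int i;

      for (i = 0; i < nnodes; i++)
          if (dag[i].op == op && dag[i].left == left && dag[i].right == right)
              break;
      if (i == nnodes) {          /* no match: create a new operator node   */
          dag[i].op = op;
          dag[i].left = left;
          dag[i].right = right;
          nnodes++;
      }
      strcpy(dag[i].label, assigned_to);  /* this variable now names the node */
      return i;
  }

  int main(void)
  {
      /* Build the dag for the block:  b = a + d;  c = a + d;
       * The second statement maps onto the existing '+' node, which is how
       * the common subexpression is detected.                              */
      int a  = leaf_node("a"), d = leaf_node("d");
      int n1 = op_node('+', a, d, "b");
      int n2 = op_node('+', a, d, "c");

      printf(n1 == n2 ? "common subexpression found\n" : "distinct nodes\n");
      return 0;
  }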

One use of the dag is to assist in dead code elimination. If a node has no parents, the variable it is labelled with holds the value of the tree under it when the block completes. If we know this variable is dead (using the data flow analysis techniques shown above), the node can be eliminated. This in turn may leave other nodes without parents and expose them to similar elimination. Looking again at Figure 4, if c is dead in the following block, the '/' node can be removed. This leaves the '+' node without a parent, and if b is also dead it can now be removed, cascading the dead code elimination. Dags are also used for most other transformations that occur within a block.

Performance

To demonstrate the kind of speedup the average user can expect from a good optimizer, I have run a few brief experiments. Like all benchmarks, these should be taken with a grain of salt, but they do give an idea of the range of performance improvement available. As I said before, optimization is machine specific, and some machines provide more latitude for improvement than others. As the car makers say, "your mileage may vary."

The benchmarks were done on an IBM RT, model 115, with an AFP card installed. This RT was running Version 2.2.1 of AIX and used the standard cc compiler. (This compiler includes a "perform optimization" option, which invokes PCO as a global optimizer and IBM's copt as a peepholer.)

The first test was to compile and execute the standard Dhrystone benchmark with and without optimization. Because many of the statements in the benchmark have no effect, the global optimizer can perform a great deal of dead code elimination and thus the Dhrystone/second rating increases by about 75 percent (see Table 1).

Table 1: Performance results with and without optimization

  Program             No. Opt.  Optimized  Speed-up
  -------------------------------------------------
  Dhrystone (dhry/s)   4838        8474       75%
  grep (s)            18.74       15.88       18%
  sort (s)            38.56        33.6       14%
  yacc (s)            14.95       12.19       22%

To get a better feel for what happens to real code, I took the source for the common Unix utilities grep, sort, and yacc, compiled them with and without optimization, and set them to large tasks. Although none of these utilities is perfect, they represent better-than-average attempts at writing fast code.

grep was asked to search a 2.5-Mbyte on-line dictionary for a particular word pattern. The optimized version ran 18 percent faster over 10 tries. sort was asked to put a smaller on-line dictionary into reverse alphabetic order. Here the speedup was 14 percent. yacc was given a C grammar and asked to generate a parser for it. In this case the speedup was 22 percent.

Conclusion

Optimizers have proven to be a valuable tool for the software developer. They help remove the drudgery of writing efficient code, allow the code to be more readable and portable, and even in tight code, they can find significant improvements. Still, they can only do so much, and the most intellectually challenging part of creating fast code, the design of good algorithms, will remain with the programmer.

Further Reading

For a more complete explanation of optimization techniques and their implementation, see Compilers: Principles, Techniques, and Tools by Aho, Sethi, and Ullman (Addison-Wesley, 1986) or Crafting a Compiler by Fischer and LeBlanc (Benjamin/Cummings, 1988).


Copyright © 1989, Dr. Dobb's Journal

