APR96: Tuning Java Performance

Paul, who is a PhD candidate in computer engineering at Syracuse University, is president of preEmptive Solutions, an Internet technologies company. He can be reached at [email protected].


Although the Java programming language has opened up new dimensions in the world of programming, it's also uncovered some new challenges. Given that Java is platform independent and interpreted, writing code that performs well is no longer cut-and-dried. Java programmers need to focus their optimization efforts on a higher level, independent of architectural idiosyncrasies.

In this article, I'll examine how compiled Java runs, then present techniques for speeding things up. The coding guidelines I'll present will perform well on any platform. As always, your first consideration in getting the best performance out of your code should be your choice of algorithms and data structures. Binary searches, quick sorts, and hash tables offer great benefits in the right situations.

The data in this article were recorded from running Java on several different machines. The optimizations I discuss do not assume any platform-specific features. For the most part, these are well-published techniques, but focused for Java.

What to Optimize?

Your greatest performance gains will come from speeding up the most-used code, such as highly iterated loops and popularly called methods. Focusing optimization on these areas will give you the best gain-for-effort ratio.

One of Java's greatest selling points is its architectural independence, which it accomplishes by compiling its code into its own intermediate representation, not any specific machine language. Though it's typical for compilers to generate intermediate code and then create actual machine code, Java doesn't go that far. Instead, it leaves its executable in this intermediate form.

Java's intermediate representation is made up of instructions termed "bytecodes." The individual bytecodes look much like the assembly language of any given machine. The intermediate representation (and, subsequently, the Java virtual machine) is designed as a stack-based architecture. Most people who know the assembly language of a modern machine are used to register-based machines. In a stack-based machine, there are no registers. There is a stack where you can push, pop, and manipulate data, but that is the extent of the design.
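
For illustration, a statement such as c = a + b (where a, b, and c are local int variables) would compile to roughly the following stack-based sequence; the exact instructions depend on which local-variable slots the compiler assigns:

   iload_1     // push the value of a onto the stack
   iload_2     // push the value of b onto the stack
   iadd        // pop both values, push their sum
   istore_3    // pop the sum into c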

Another point to consider in deciding where to optimize is that Java bytecodes are currently interpreted. The future promises dynamic ("just-in-time") compilers that jump in just as a Java program is executed to compile the bytecodes into the target machine code. The final result is that the Java program is running the same as any other compiled program (with a short pause at the start while the dynamic compiler works its magic). Unfortunately, these compilers might not be widely available for a long time. So for now, you're forced to live with interpretation.

A significant portion of interpretation time is spent dealing with overhead. Instruction execution requires the following steps:

  • Fetch the instruction.
  • Decode the instruction.
  • Fetch the arguments of the instruction.
  • Execute the instruction.
These steps are there whether it's a CPU executing its machine code or the Java interpreter executing its bytecodes. The difference is that in an interpreter, the first three steps are overhead. For a CPU executing machine code, much of the work of the first three steps is absorbed by the pipeline of a modern superscalar processor.
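
To see where that overhead lives, consider the toy stack-machine interpreter below, written in Java. This is only an illustrative sketch--the opcodes and layout are invented for the example and have nothing to do with the real Java interpreter. Only the body of each case does useful work; the loop, the fetch, and the switch are pure overhead paid on every instruction.

class tiny_interp {
    static final int PUSH = 0, ADD = 1, HALT = 2;
    /* Runs a tiny bytecode program; returns the value left on top of the stack */
    static int run(int code[]) {
        int stack[] = new int[64];
        int sp = 0;                              // stack pointer
        int pc = 0;                              // program counter
        while (true) {
            int op = code[pc++];                 // fetch the instruction
            switch (op) {                        // decode the instruction
                case PUSH:
                    stack[sp++] = code[pc++];    // fetch the argument, then execute
                    break;
                case ADD:
                    stack[sp-2] += stack[sp-1];  // execute
                    --sp;
                    break;
                default:                         // HALT
                    return stack[sp-1];
            }
        }
    }
}

Running run(new int[] {PUSH, 2, PUSH, 3, ADD, HALT}) yields 5, but per instruction, the addition itself is dwarfed by the surrounding bookkeeping.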

There is no real way of removing this overhead for each instruction, short of dynamic compilation. Because of Java's architectural independence, we can't rely on the performance tricks of any given architecture. You don't know if the target processor has an on-chip cache, what its jump-prediction algorithms are, or anything else. Even the CISC concept of expensive versus inexpensive instructions doesn't carry over very well: Interpretation tends to smooth over individual instruction costs. Given these uncertainties, the best solution is to simply reduce the total number of instructions that are executed.

This doesn't necessarily mean to reduce the size of your code. Small bits of code can certainly loop themselves into painfully slow performance, but we're more interested in decreasing the number of executed instructions wherever possible--while still achieving the desired output. All this talk has been at the level of interpreters, but don't fret--even when dynamic compilation comes around, it too will benefit from fewer instructions to compile and run.

Many general programming-language optimization rules apply to Java. The int data type is faster than long (as is float compared to double). A long is twice the size of an int, making accesses correspondingly slower. In addition, Java's int data type is already bigger than some programmers are used to (it is 32 bits, whereas 16-bit ints are common with many PC C compilers). So, wherever you can, keep things down to int and float.

In general, iterative code is faster than equivalent recursive code. This is true in any language, since method calls are expensive.
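
As a quick (hypothetical) illustration, both methods below compute the same sum, but the recursive version pays the full method-call overhead once per value:

/* Recursive: one method call per value summed */
static int sum_recursive(int n) {
    if (n <= 0) return 0;
    return n + sum_recursive(n - 1);
}
/* Iterative: same result, no calls inside the loop */
static int sum_iterative(int n) {
    int total = 0;
    for (int i = 1; i <= n; ++i) total += i;
    return total;
}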

Use the Best Algorithm, Use the Best Data Structure

The class file (bytecodes) that results from compiling Java programs is amazingly small. Even intensive graphical programs will often be only a few kilobytes, because accesses to Java's API library rely on finding the classes on the machine executing the program. For example, drawing a line actually calls code in the java/classes directory. Unfortunately, calls to system functions (drawing graphics, opening files, and the like) are largely a crapshoot. Often when optimizing, you are (transparently) leaving the beautiful, architecture-independent world of Java and implicitly executing native code. For our interests, it's important to note that you can't count on the speed of such operations for different architectures. Drawing a line may be swift on some machines but a painful operation on others. In general, performance of code that is heavily dependent upon API calls is largely out of your hands.

Squeezing performance out of object-oriented languages has never been easy. For one thing, object-oriented programming encourages reusable classes. From a grand code-development standpoint, the ability to reuse code for future projects increases long-term programmer productivity. However, generic classes usually require many conditional statements to support a broad range of inputs. This level of support increases code size and costs processing time.

Consider Java's prebuilt stack class. This class accepts any object to be pushed on the stack. Certainly, that encompasses a broad usage. However, often you only need to stack something as mundane as integers. The built-in stack class will do this--at least after you perform conversions of your integers to and from object status. To achieve better performance, you would be better off creating a stack class specifically designed to stack integers.
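
A minimal sketch of such a class might look like the following (the class name and growth policy are illustrative, not taken from any library):

/* A stack specialized for ints--no wrapping values in objects */
class int_stack {
    private int data[];
    private int top = 0;
    int_stack(int capacity) { data = new int[capacity]; }
    public void push(int value) {
        if (top == data.length) {                       // grow when full
            int bigger[] = new int[data.length * 2];
            System.arraycopy(data, 0, bigger, 0, data.length);
            data = bigger;
        }
        data[top++] = value;
    }
    public int pop() { return data[--top]; }
    public boolean empty() { return top == 0; }
}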

Inlining

Inlining is everybody's favorite optimization technique, and for good reason--performance gains can be dramatic. Good program design usually involves relatively fine-grained modularity. With increasing numbers of discrete methods comes increasing numbers of method calls. From a readability and design standpoint, that's great: Discrete tasks in a class are logically placed in their own methods. Unfortunately, from a performance standpoint, method calls are expensive. Depending upon the implementation, this involves saving the current state, performing a jump (which may entail moving to currently uncached memory), and keeping track of where to return.

Again, interpreted environments are hit with extra instructions that don't actually work towards performing the goal of the program but instead deal with overhead. Object-oriented programming often exacerbates the problem by providing countless accessor methods within a class to get (effective) read-only access to its private variables. Accessor methods are nothing more than a variable access (a theoretical single memory read) but still incur all the calling overhead. To overcome this problem, compilers often attempt to inline methods--in effect, placing the method body at the call site and ignoring any semblance of an actual subroutine call. Listing One is part of a class intended to store and manipulate matrices.

The tester class creates two matrix_work objects and multiplies them together. The sizes of the matrices are substantial enough to expose the effect of optimizations. The A.multiply(B) statement took several seconds on all tested machines. For future comparisons, I'll use the relative run time of 5000 milliseconds.

Getting the javac compiler to perform method inlining requires two things. First, the compiler will only inline static, private, or final methods, and you won't always be able to ensure inlining--a private accessor method, for instance, is only so useful. Second, you need to specify the -O option when compiling. After testing (and debugging) is done, there is rarely a good reason not to compile your classes with this option. For Listing One, I inlined the three accessor methods by making them final, as in Example 1. This brought the run time from 5000 milliseconds to approximately 3500. Across all architectures tested, the savings was at least 25 percent. Considering that adding the final modifier to these methods is a trivial act, it's certainly worth the effort. The change in the generated bytecode was evident, as a comparison between Listing Two and Listing Three shows. The only change is from a method invocation to a getfield (that is, an instance variable access).
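
For reference, the command line is just the usual compile with the optimization flag added (the file name here is only illustrative):

javac -O matrix_work.java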

Declaring methods as static or private also allows inlining where applicable. According to post-inlined bytecodes, the A object is accessing private variables of the B object. The compiler runs its checks and ensures that your code does not violate any access restrictions. After that, it takes whatever liberties it can to create faster code. To put it another way, once the compiler is sure you've followed the rules, it knows where it can safely break them.

You can also inline values by declaring variables as final. In general, this is analogous to using the C preprocessor to replace a name with a value (via a #define macro). This technique improves readability and maintainability, and saves a memory access at run time.
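
For example, a constant declared as below plays the same role as a C #define; the compiler can fold the value into the bytecode wherever the name is used (the name and value are hypothetical):

class limits {
    /* Plays the role of "#define MAX_USERS 100" in C */
    static final int MAX_USERS = 100;
}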

Synchronized Methods

Synchronization of methods is applicable when using multithreading, and it is effectively how Java utilizes the operating-system concept of monitors. Declaring a method as synchronized requires the method to obtain a lock before it is permitted to execute. If another thread currently holds that lock, the method must wait until the lock becomes available.

This level of synchronization is key in assuring the protection of critical sections of code. However, it is important to remember that every object and every class has exactly one lock. Therefore, if a thread has the lock for a given object, then no other thread may enter any of the synchronized methods within that object. If a thread must needlessly wait for a lock, you may want to redesign your class. To ensure data integrity in Example 2, it is important that not more than one thread is accessing any one account at a time. Logically, there is no reason one thread can't deposit into checking while another thread is withdrawing from savings (assuming these are separate accounts). With the aforementioned code, for a given banking object, that is not possible. The problem is that as soon as any thread enters one of the methods, it obtains a lock for the entire myMoney object, generating an unnecessary performance hit.

There are several solutions to this dilemma. First, consider whether the class is designed with logical contents. For the aforementioned example, it would likely have made more sense to have two separate classes, a savings account class and a checking account class. Another solution would be to not declare methods as synchronized. Protection of critical sections is vital, but overuse hits performance and increases the chance of deadlock.
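
One possible restructuring (a sketch along those lines) gives each kind of account its own class; each object then carries its own lock, and operations on different accounts never block one another:

class checking_account {
    synchronized public void deposit() {...}
    synchronized public void withdraw() {...}
}
class savings_account {
    synchronized public void deposit() {...}
    synchronized public void withdraw() {...}
}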

Besides the possibility of uselessly blocking a thread, the overhead generated by synchronizing a method is significant. Tests have shown that a noninlined, unsynchronized accessor method can run four times faster than a synchronized one. In addition to slowing running time, declaring a method as synchronized removes the compiler's ability to inline that method.

To bring synchronization down to a finer level, you can specify objects to be used as synchronization objects. Example 3 is a slightly modified version of the code in Example 2. The synchronizations do not lock the myMoney object; they lock the checking and savings objects. Again, it would be more desirable to split this type of class in two. However, in some situations that may not be possible, and the use of synchronization objects can be useful.

Code Motion

The JDK Beta 2.0 version of the compiler surprisingly did not support code motion; see Example 4(a).

Many contemporary compilers realize the value-2 expression has no dependencies within the loop. Therefore, a compiler could calculate value-2 prior to loop execution and use the result in the conditional with g (as a temporary variable). Unfortunately, in the javac compiler, the expression value-2 is calculated for each iteration of the loop even though the result is the same every time. In Example 4(a), the effect is not very detrimental. But it should be obvious that complex operations in that situation could needlessly waste processing time. Eliminating this type of transgression is nearly free and can be done with a temporary variable; see Example 4(b).

The practice of eliminating redundant calculations can be applied to array indexing. Every time you access an array, the system must determine the indexed memory location--that is, an implicit calculation. On top of that, Java's VM does array range checking. Although run-time array bounds checking has become a well-optimized science, its cost can still be noticeable. Wherever you can, eliminate redundant array references. Looking back to Listing One, there are a significant number of redundant array index calculations embedded in the multiply method. Listing Four shows a new multiply method exhibiting (somewhat overzealous) array index calculation movement.

The temprow vector is assigned to a row of the matrix array. For all processing past this point, the code will be interested only in one row of matrix, so pointing to the row directly will save having to follow the row pointer each time. Additionally, a temporary variable is used to accumulate the values and is assigned to the T array location after the loop. This way, calculating the index into T is done only once instead of for each iteration of the k loop (i and j do not change inside the k loop).

Recall that inlining the accessor methods brought the run time down from a relative value of 5000 milliseconds to approximately 3500. Implementation of the code change in Listing Four brought the run time down to 2400 milliseconds. Inlining and code motion (array index calculation motion, to be specific) reduced run time by over 50 percent. As a parting triviality, you'll also notice that the loop variable declarations (variables i, j, and k) were moved. The Java VM peculiarly treats the first three local variables of a method (parameters are counted first) slightly differently than all subsequent ones. Several bytecode instructions are tailored specifically for the first three locals, providing a tiny performance improvement (a 2 percent speed increase was measured for Listing Four).

Conclusion

Given Java's close relationship to C, many of the same high-level optimizations pay off. Java's architecture neutrality forbids using CPU tricks to speed up code; no assumptions are allowed. Subsequent generations of Java compilers and interpreters (and dynamic compilers) are likely to improve performance all on their own. Java code that relies heavily on system-dependent API calls is destined to be unpredictable. Good object-oriented design and performance coding have always had their conflicts. This is not to say that good design can't be fast--it just takes a smart programmer and compiler.

Example 1: Inlining accessor methods.

public final int getrows() { return rows; }
public final int getcols() { return cols; }
public final int getcoord(int r, int c) { return matrix[r][c]; }

Example 2: Ensuring data integrity.

class banking {
   synchronized public void deposit_checking() {...}
   synchronized public void withdraw_checking() {...}
   synchronized public void deposit_savings() {...}
   synchronized public void withdraw_savings() {...}
}
// instantiated elsewhere with
banking myMoney = new banking();
// multiple threads can now access myMoney

Example 3: A modified version of Example 2.

class banking {
   Object savings = new Object();
   Object checking = new Object();
   public void depositChecking() {
      synchronized(checking) {...}
   }
   public void withdrawChecking() {
      synchronized(checking) {...}
   }
   public void depositSavings() {
      synchronized(savings) {...}
   }
   public void withdrawSavings() {
      synchronized(savings) {...}
   }
}

Example 4: (a) The JDK Beta 2.0 version of the compiler does not support code motion; (b) using a temporary variable to improve performance.

(a)  for (g=0;g<value-2;++g) {
     x[g] = g*2;
     }
(b)  int temp = value-2;
    for (g=0;g<temp;++g) {
     x[g] = g*2;
    }

Listing One

/*============ Matrix Work class ==============*/
 class matrix_work {
    private int matrix[][];
    private int rows,cols;
    matrix_work(int r, int c) {
       rows = r;
       cols = c;
       matrix = new int[rows][cols];
       populate();
      }
    public int getrows() { return rows; }
    public int getcols() { return cols; }       
    public int getcoord(int r, int c) { return matrix[r][c]; }
/* Matrix multiplication - returns number of elements in new matrix */
    public int multiply(matrix_work B) {
      if (cols != B.getrows()) throw new badMultException();
      int numels = rows * B.getcols();
      int T[][] = new int[rows][B.getcols()];
      int i,j,k;            
      for (i=0;i<rows;++i) {
        for (j=0;j<B.getcols();++j) {
          T[i][j] = 0;
          for (k=0;k<cols;++k) {
            T[i][j] += matrix[i][k] * B.getcoord(k,j);
          }
        }
      }
      matrix = T;
      cols = B.getcols();
      return numels;
      }
/* Populates the matrix */
    public void populate() { /* ...enlightening population code here... */ }
    }
/* ============= Testing class =========== */
class tester {
    public static void main (String args[]) {
       matrix_work A = new matrix_work(80,40);
       matrix_work B = new matrix_work(40,65);
       int numels = A.multiply(B);
    }
}

Listing Two

Method int getcols()
   0 aload_0
   1 getfield #19 <Field matrix_work.cols I>
   4 ireturn
      . 
      .
      .
  90 iinc 5 1
  93 iload 5
  95 aload_1
  96 invokevirtual #20 <Method matrix_work.getcols()I>
  99 if_icmplt 35
      .
      .
      .

Listing Three

Method int getcols()
   0 aload_0
   1 getfield #13 <Field matrix_work.cols I>
   4 ireturn
      .
      .
      .
  92 iinc 5 1
  95 iload 5
  97 aload_1
  98 getfield #13 <Field matrix_work.cols I>
 101 if_icmplt 35
     .
     .
     .

Listing Four

public final int multiply(matrix_work B) {
  if (cols != B.getrows()) throw new badMultException();
  int j,k,i;
  int numels = rows * B.getcols();
  int T[][] = new int[rows][B.getcols()];
  int temp,temprow[];
  for (i=0;i<rows;++i) {
    temprow = matrix[i];
    for (j=0;j<B.getcols();++j) {
      temp = 0;
      for (k=0;k<cols;++k) {
        temp += temprow[k] * B.getcoord(k,j);
      }
      T[i][j] = temp;
    }
  }
  matrix = T;
  cols = B.getcols();
  return numels;
  }

