A Real-Time Performance Visualizer for Java


Dr. Dobb's Journal, March 1998

Finding out where to squeeze for performance

John Barton is a researcher at IBM's T.J. Watson Lab. He can be contacted at [email protected]. John Whaley is a student at MIT. He can be reached at [email protected].


Everyone who uses Java is interested in its performance. Why? Because Java programs running on Sun's original interpreter-driven Java virtual machine (JVM) were several orders of magnitude slower than comparable programs written in C/C++. Bottlenecks included synchronization, null-reference and array-bounds checking, and Java's exception handling. Just-in-time compilers (JITs) on more advanced JVMs have begun to close the gap, but further performance gains are possible.

To investigate the issues surrounding JVM performance, Derek Lieber (a member of our group at IBM's T.J. Watson Laboratory) wrote an experimental JVM. To improve its run-time performance, he needed a way to discover its performance bottlenecks. A traditional profiler would be of limited use here, because it would simply profile the JVM without regard to the Java program running on it. Also, traditional profilers cannot make sense of JIT-compiled code, or make a clear distinction between time spent executing Java code and time spent in the JVM. Therefore, we set out to write a performance visualizer for our JVM. We'll describe that visualizer in this article. The techniques we'll discuss are general enough to apply to other virtual machines, and the graphics client applet for our visualizer is available electronically; see "Resource Center," page 3.

Our basic design goals involved ease of use, scalability, and potential reuse of the profile data to dynamically target optimizations. For routine use, the visualizer should have minimal impact on target program performance. It should be simple so that it can be applied to arbitrary Java programs, and it should obtain the performance results dynamically, as the target program runs. We wanted the visualizer to work on long-running programs where performance is a major issue, so that ruled out schemes involving large trace files. Moreover, we only need information on the performance bottlenecks: Any seldom-run code need never be considered by the visualizer. Finally, we wanted to use the performance information to drive a dynamic JIT optimizer for our JVM, reinforcing the need for simplicity, efficiency, and scalability.

With these goals in mind, we built a program-counter sampling engine into the JVM. The engine is driven by a high-frequency Windows multimedia (WINMM) interrupt. Each time the WINMM interrupt fires, the handler stores the VM's current program counter into a ring buffer.
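As a rough sketch of what such a sampler's buffer can look like (the class and member names here are ours, not the JVM's), the handler only has to write one word per interrupt into a fixed-size ring, overwriting the oldest sample when full so that memory use stays constant over arbitrarily long runs:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of the sampler's ring buffer; names are
// illustrative, not taken from the JVM source.
class SampleRing {
public:
    explicit SampleRing(std::size_t capacity)
        : buf_(capacity, 0), head_(0), count_(0) {}

    // Called from the interrupt callback: store one PC sample.
    // No allocation -- just an array store and an index update.
    void add(std::size_t pc) {
        buf_[head_] = pc;
        head_ = (head_ + 1) % buf_.size();
        if (count_ < buf_.size()) ++count_;
    }

    std::size_t size() const { return count_; }

    // Drain the buffer oldest-first, as a "dump buffer" command would.
    std::vector<std::size_t> snapshot() const {
        std::vector<std::size_t> out;
        std::size_t start = (head_ + buf_.size() - count_) % buf_.size();
        for (std::size_t i = 0; i < count_; ++i)
            out.push_back(buf_[(start + i) % buf_.size()]);
        return out;
    }

private:
    std::vector<std::size_t> buf_;
    std::size_t head_, count_;
};
```

Because the interrupt-side work is a single store, the cost per sample is essentially fixed, which is what keeps the sampling overhead both low and predictable.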

We chose a program-counter sampling approach rather than a timed function entry/exit approach because sampling the program counter is much less intrusive. Adding stubs at entry/exit points may seem innocuous, but can substantially change the performance of the profiled program due to changes in processor cache hits, branch prediction tables, and the like. Also, no changes are necessary to the JVM or to the JIT-compiled code, simplifying things greatly. Another benefit is that the sampling overhead is roughly constant, and can easily be adjusted by changing the frequency of the interrupt. Finally, this approach gives us a finer granularity so that we can profile at the level of a single bytecode, rather than at the level of a method.

The sampling engine is controlled by a display client via a socket connection. The display client can tell the sampling engine to start or stop profiling, change the sample frequency, transmit the contents of the buffer containing the trace data, and so on. Because we use a standard TCP socket, the display client could be written in any language; we chose to implement it in Java and execute it in a separate JVM. Furthermore, the display client can be run on a remote machine so that it will interfere minimally with the program we are profiling.

As Figure 1 shows, our performance visualizer has four parts -- two in the display client and two in the sampling engine. On the display side, there's the GUI and its updater, which polls for performance data from the sampling engine across the socket. On the sampling engine side, we have the socket service code and interrupt handler.

The Graphical User Interface

The GUI is a Java applet written using lightweight components and the Java 1.1 event model (http://java.sun.com/products/jdk/1.1/docs/guide/awt/designspec/events.html). Figures 2-6 show the five possible frames users can bring up to look at the performance results. The "ConnectionFrame" in Figure 2 lets users control the connection to the JVM being profiled; we won't discuss it further.

The "OverallPieChart" frame in Figure 3 contains a pie chart showing what fraction of the overall time is spent in Java code (user or library code), in supporting run-time routines of the Java virtual machine (garbage collection, for instance), or in the underlying operating system's kernel. Below the pie chart are buttons to activate the other frames.

The other frames contain pie charts and lists. Figure 4 presents what's probably the most interesting frame for most Java programmers, the "Java Methods" frame, which is activated by clicking on the "Java" wedge in the OverallPieChart, or with the Show Java button. This frame shows how the time spent in Java methods is distributed among those methods. The pie chart on the left shows a wedge for each method, and the list on the right gives the method names. Both are sorted by total time spent. All small wedges (<1 degree) are combined into the black "Other" wedge in the lower right. The rest of the colors in the pie chart are arbitrary. From this pie-chart view, users can judge which Java methods are taking the most time.
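The small-wedge rule above can be sketched as follows. The function and structure names are invented for illustration, but the arithmetic matches the description: each method's share of total time becomes degrees of arc, and anything under 1 degree is folded into "Other":

```cpp
#include <string>
#include <utility>
#include <vector>

// Illustrative sketch: names are ours, not from the applet source.
struct Wedge { std::string label; double degrees; };

// Convert (method, time) pairs into pie wedges, combining all
// wedges smaller than 1 degree into a single "Other" wedge.
std::vector<Wedge> makeWedges(
        const std::vector<std::pair<std::string, double>>& times) {
    double total = 0;
    for (const auto& t : times) total += t.second;

    std::vector<Wedge> wedges;
    double other = 0;
    for (const auto& t : times) {
        double deg = total > 0 ? t.second / total * 360.0 : 0;
        if (deg < 1.0) other += deg;          // too small to show alone
        else wedges.push_back({t.first, deg});
    }
    if (other > 0) wedges.push_back({"Other", other});
    return wedges;
}
```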

As the mouse cursor enters a wedge corresponding to a method, the wedge lights up and its information is displayed in the status bar. In addition, the corresponding entry on the list to the right is highlighted. If users left-click on a wedge or list entry, a menu pops up with options for further information on the method. Selecting Display Window from this pop-up menu, for example, brings up a Method frame; see Figure 5.

The Method frame is analogous to the Java Methods frame, but corresponds to the breakdown by line number of the total time spent in a particular Java method. The line numbers are obtained from the debug line table information in the method's class file, so the class file must be compiled unoptimized for separation by line number to work correctly.

Returning to the OverallPieChart frame in Figure 3, users can click on the VM wedge or select Show VM to produce the VirtualMachinePieChart; see Figure 6. This chart has the same form as the Java Methods and Method frames, but the contents list the C++ functions in our JVM. The names look strange because we simply print the mangled function name created by the C++ compiler. This name includes an encoding of the function arguments, so users can discern the different overloaded functions.

Sampling with WINMM

At the opposite end of our visualizer system lies the program-counter sampling engine, which accumulates samples at frequent intervals during program execution. The C++ code in Listing One accumulates a single sample. This code is a static callback function that is called by the Microsoft Win32 WINMM interrupt handler every time the WINMM interrupt fires. It creates a Win32 CONTEXT object to hold a thread context and then fills it by calling Win32's GetThreadContext() function on the handle of the thread to be sampled. In this case, our JVM uses a single Win32 thread, called the "worker," to implement all the Java threads (TaskManager._workerThreadHandle). Finally, the program counter register (Eip) from the thread context is stored in the ring buffer by the JavaProf.add() member function call.

To ensure a low overhead, we want to minimize the amount of work done inside the callback function. The JavaProf.add() function simply writes the given program counter into a preallocated ring buffer; it does no allocation or other work that might fail in long-running profiles, and it keeps the overhead down to a bare minimum. On the negative side, the GetThreadContext call is not trivial, because it is a Win32 call and has to go through the kernel, but that is a necessary evil. Fortunately, the GetThreadContext call is fairly quick.

The method JavaProfiler::start() (see Listing Two) installs the callback function as a WINMM interrupt handler. The two WINMM functions called here are documented in Microsoft's Visual C++ documentation. The timeSetEvent call installs the interrupt handler with a period equal to the desired sampling period or the shortest period supported by the WINMM timer, whichever is longer. (On Windows 95/NT, the fastest sampling rate supported is once per millisecond, which is adequate for our purposes.)

After this call, the Win32 kernel begins calling the callback function at the specified frequency and the buffer begins to fill with program counters.

The Program-Counter Server

The program-counter sampling engine is controlled by the display client over a TCP socket connection. The server end sits inside our JVM. It, too, is written in C++: it must access the C data structure stored by the interrupt handler and call C functions for mapping the program counter, so coding all of it in C++ seemed simplest. In the usual fashion of socket programming, the server starts by listening for incoming connections on a specified TCP port; we arbitrarily picked the default to be 6789.

When a client connects, the server accepts the connection and begins waiting for a command from the client. The protocol between the client and server is simple: First, the client sends the command (a single byte) followed by its arguments. Depending on the command, the server then responds with the requested information. The client closes the socket when it is done with the service, and the server goes back to listening for incoming connections.
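A minimal sketch of this byte-oriented framing is shown below. The command codes and the single 32-bit big-endian argument are invented for illustration; the article does not give the real server's codes or argument layouts:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical command codes -- not the actual server's values.
enum : std::uint8_t {
    CMD_START = 1, CMD_STOP = 2, CMD_SET_FREQ = 3, CMD_GET_ALL = 4
};

// Client side: one command byte followed by a 32-bit big-endian argument.
std::vector<std::uint8_t> encodeCommand(std::uint8_t cmd, std::uint32_t arg) {
    return { cmd,
             static_cast<std::uint8_t>(arg >> 24),
             static_cast<std::uint8_t>(arg >> 16),
             static_cast<std::uint8_t>(arg >> 8),
             static_cast<std::uint8_t>(arg) };
}

// Server side: recover the command byte and its argument.
bool decodeCommand(const std::vector<std::uint8_t>& buf,
                   std::uint8_t& cmd, std::uint32_t& arg) {
    if (buf.size() < 5) return false;
    cmd = buf[0];
    arg = (std::uint32_t(buf[1]) << 24) | (std::uint32_t(buf[2]) << 16) |
          (std::uint32_t(buf[3]) << 8)  |  std::uint32_t(buf[4]);
    return true;
}
```

Using a fixed big-endian layout keeps the protocol language-neutral, which is what lets the Java display client talk to the C++ server.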

The server supports control commands, such as start/stop profiling and change sampling frequency, along with data acquisition commands, such as "dump buffer" and "look up point." Because the connection between the client and the server may be too slow to send thousands of points per second, the server supports a number of methods of reducing the amount of data sent. The server supports commands to not send certain fields, such as the method line-number field. It supports a command to change the frequency of the WINMM interrupt, reducing the number of points sent per second. It also can be told to ignore certain kinds of data points, such as all points in the kernel or all points not in Java methods.

The most often used command is "get all data." When the server receives this command, it decodes the program counters in its ring buffer into type, function name, and line number (if possible) and sends the decoded version over the socket connection. The decode process compares the program counter value to known address ranges for the Win32 kernel, C++ code for our Java VM functions, and compiled bytecodes from our JVM (our VM acts like a continuous Just-In-Time compiler, converting all bytecodes to Intel instructions).

For the Win32 kernel functions, we don't try to analyze the address any further; visualizing the performance of the Win32 kernel is not the purpose of this tool. For the internal VM functions we convert the address to a C++ function name using the .map file information generated by Visual C++ when we compiled the JVM. Finally, for addresses that map into regions of compiled bytecodes, we call functions in our JVM to produce the Java method name and line-number values. These functions did not have to be written explicitly for the profiler; they were already present for Java's exception callback trace generation.
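The address-range classification can be sketched like this. The table contents and names are hypothetical, standing in for ranges read from the .map file and the JVM's compiled-bytecode tables:

```cpp
#include <cstdint>
#include <map>
#include <string>

// One known code region: its end address and symbol name.
struct Region { std::uint32_t end; std::string name; };

// Map from region start address to (end, name); in the real tool
// this would be populated from the linker .map file.
using SymbolTable = std::map<std::uint32_t, Region>;

// Find the symbol whose [start, end) range contains pc.
std::string lookupSymbol(const SymbolTable& table, std::uint32_t pc) {
    auto it = table.upper_bound(pc);     // first region starting after pc
    if (it == table.begin()) return "<unknown>";
    --it;                                // candidate region containing pc
    if (pc < it->second.end) return it->second.name;
    return "<unknown>";
}
```

Because the table is sorted by start address, each lookup is a logarithmic-time search, which matters when decoding thousands of buffered samples per "get all data" request.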

Updating the User Interface

The last section of our visualizer lies behind the user interface on the client JVM side. This is the UI "Updater," which connects to the profile server, polls it, and dispatches the returned information to the appropriate windows opened by the user.

The Java 1.1 event model used in this part of the visualizer allows the number and kind of UI panels to be selected dynamically and new information to be sent to all or just some of them. The event model relies on callback methods declared in Java EventListener interfaces.

The coordinator of the event flow is the class ProfileUpdater, which has methods with names of the form addXXXListener() (where XXX is a kind of event). These methods build lists of EventListener objects passed to it as arguments. When the ProfileUpdater has new data, it creates a corresponding event object and passes it to each EventListener registered with it. Each different kind of data will result in a different kind of event object passed to a different list of EventListeners. In this way we can add an arbitrary number of panels to the UI and update them efficiently.
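The real updater is Java code built on EventListener interfaces; this C++ sketch (kept in the language of the listings, with invented names) mirrors only the register-then-multicast pattern described above:

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical event carrying one batch of decoded profile data.
struct ProfileEvent { std::string kind; int sampleCount; };

class ProfileUpdater {
public:
    using Listener = std::function<void(const ProfileEvent&)>;

    // Analogous to the addXXXListener() registration methods.
    void addListener(Listener l) { listeners_.push_back(std::move(l)); }

    // When new data arrives, multicast one event to every
    // registered panel.
    void fire(const ProfileEvent& e) {
        for (auto& l : listeners_) l(e);
    }

private:
    std::vector<Listener> listeners_;
};
```

Each kind of data gets its own listener list, so a newly opened panel simply registers itself and starts receiving updates on the next polling cycle.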

The ProfileUpdater has its own thread of control -- that is, it implements Java's Runnable interface. It acts as a client of the profiler engine, requesting data from the engine, converting the data into appropriate kinds of events, and then multicasting the events to the corresponding list of EventListeners. Then it sleeps for a fixed time interval, wakes up, and repeats the polling loop.

Where our VM Spends its Time

With all of the components of our visualizer built, we can run it by simply giving a command-line flag to our JVM to start the server during initialization. In a separate JVM, we start the client and connect to that server. Then, the polling/updating cycle starts profiling any Java program running in the first JVM.

As soon as we completed the first version of our profiler we discovered a one-line change to our Java VM that gave a 10 percent speed up. At the top of the VirtualMachinePieChart we found memset(), the libc function usually used to clear memory. Derek, our VM guy, mumbled something unprintable when he saw the profile: He quickly recognized that clearing memory for every Java object created meant many calls to memset(). Moving the memset() call to clear open memory after garbage collection eliminated this cost.

Of course, most of the revelations of a profiler are not so easy to act upon. Several of the functions near the top of our profile come from garbage collection. Our JVM uses a very simple copying collector; there are many better techniques but they all require more coding effort to implement. Other functions at the top indicate less severe changes. A couple of things we need to consider are avoiding string comparison for type checking in the VM and avoiding the creation of Java's exception traceback string when the exception is caught. All of these represent obvious performance bottlenecks that were known when the code was written. Now we know which ones really count and we can attack them. In effect, our profiler creates a continuous "to do" list for optimizing our Java VM.

Based on the VirtualMachinePieChart, we think we can increase the speed of our JVM code dramatically. At this point, the Java code would dominate, meaning that we should then put more effort into our bytecode-to-Intel instruction translator. We know that much improvement can be realized here since we currently do none of the optimizations common in JITs.

DDJ

Listing One

void CALLBACK JavaProfiler::Callback (
                UINT uID=0,     // identifier of the timer event
                UINT uMsg=0,    // reserved
                DWORD dwUser=0, // user instance data
                DWORD dw1=0,    // reserved
                DWORD dw2=0)    // reserved
{
        // create context object
        CONTEXT context;
        context.ContextFlags = CONTEXT_CONTROL;
        // get the control registers of the worker thread
        HANDLE worker = TaskManager._workerThreadHandle;
        assert(worker);
        GetThreadContext(worker, &context);
        // store the PC
        JavaProf.add((INTEL_OP *)context.Eip);
}


Listing Two

void JavaProfiler::start() {
  if (!_procID) { // if we haven't already started ...
    // find the maximum precision of the timer device
    TIMECAPS time; timeGetDevCaps(&time, sizeof(time));
    // start the quick timer tick profiler
    _procID = timeSetEvent (max(_period,time.wPeriodMin), // delay in ms
                           0,    // resolution (higher -> less overhead)
                           &Callback, // callback function
                           0,    // user-supplied callback data
                           TIME_PERIODIC);
  }
}




Copyright © 1998, Dr. Dobb's Journal


