Gregg is a technology evangelist for Sun Microsystems and Ian is project lead and architect of the NetBeans Profiler. They can be contacted at [email protected] .com and [email protected], respectively.
Profiling Java applications is often considered a black art. There are tools available to help you track down performance bottlenecks and memory-allocation problems, but they are not always easy to use, particularly when profiling large applications. Because large applications tend to be those that most need profiling, this presents a significant problem. It is therefore not surprising that Java applications have a reputation for running slowlyin many cases, this is solely because no performance-related analysis and tuning has been done.
Dynamic bytecode instrumentation is an innovative solution to these problems. It lets you control precisely which parts of an application are profiled. As a result, only relevant information is reported, and the impact on application performance is reduced so that even large applications can be profiled easily.
Obstacles to Profiling
The two biggest obstacles to profiling have been runtime overhead and interpretation of the results.
The runtime overhead imposed by a Java profiler can be a showstopper. The instrumentation added by the profiler can cause the application to run differentlywhich may change the performance problem symptoms, making it harder to find the cause of the problem. At the very least, if the application is running more slowly, it will take longer for it to get to the point where problems occur. In a worst-case scenario, the application might not even run correctly at allunexpected timeouts caused by slower performance could result in an application crash.
After wrestling with those issues, you then have to interpret the results produced by the profiler. This can be overwhelming, even when working on a small application. For large applications, it can be a serious impediment to tracking down the cause of the problem. The larger the application, the higher the likelihood it has code in it that you did not write and therefore might not understand. This is particularly true for web and enterprise applications with several layers of abstraction that are specifically designed to be hidden from developers. Additionally, larger applications tend to have larger heaps and more threads. Profilers will deliver information on all of these things, which is usually more information than you can interpret efficiently. Filters are provided, but frequently are not precise enough and can therefore end up excluding useful information.
Traditional Profiler Technologies
There are two Java APIs available for profilingthe Java Virtual Machine Profiling Interface (JVMPI) and Java Virtual Machine Tool Interface (JVMTI). Most Java profiling tools use one of these APIs for doing instrumentation of an application and for notification of Virtual Machine (VM) events.
The most widely used profiling technique is bytecode instrumentation. The profiler inserts bytecode into each class. For CPU performance profiling, these bytecodes are typically methodEntry() and methodExit() calls. For memory profiling, the bytecodes are inserted after each new or after each constructor. All of this insertion of bytecodes is done either by a post-compiler or a custom class loader.
The key limitation of this technique is that once a class has been modified, it does not change. That lack of flexibility causes problems. If you choose to profile the entire application, the overhead of all that instrumentation can cause serious performance problems. But if you use a package or class-based filter to limit the instrumentation, you might end up not profiling an important part of the application. Further, if changing the instrumentation requires restarting the application, the profiling process slows down considerably.
An alternative to bytecode instrumentation is to monitor events generated by the JVMPI or JVMTI interface. Both APIs generate events for method entry/exit, object allocation, monitor entry, and so on. But there are some drawbacks: Using these features disables VM optimizations and not much flexibility is provided. As a result, most profilers avoid using these features, although there are some VM events that must be tracked this way (garbage collection, for instance).A third approach is to use sampling instead of instrumentation or event monitoring. The profiler periodically examines the stack of each thread in the VM. The interval is typically between 10 and 100 milliseconds (ms). The stack traces are recorded and the method at the top of the stack is assumed to have been running since the last examination. So if the interval is 10 ms and method foo() is at the top of the stack four times, then the profiler will report that method foo() ran for 40 ms. This technique imposes little overhead, but suffers from some drawbacks because the results are generally less accurate. Most importantly, there is no information about the number of times foo() was called. Another drawback is that the profiling overhead grows as the number of threads grows. And finally, sampling provides no solution for memory profiling, so if your application has memory-allocation problems, a profiler that does only sampling will be of no use.
Dynamic Bytecode Instrumentation
A new approach, profiling with dynamic bytecode instrumentation (DBI), offers some important benefits. With DBI, the insertion of instrumentation bytecodes is done at runtime. After you have selected a class or method for profiling, the next time it is invoked by the application, the profiler will insert the necessary instrumentation. This is done using the JVMTI's redefineClass(Class classes, byte newBC) method.
This means you can change the profiling overhead as the application is running. You can even remove all instrumentation so that your application is no longer being profiled. These changes can be made adaptively as you learn more about the behavior of the application. Perhaps best of all, the changes can be made without restarting your application.
CPU Performance Profiling
To get the maximum benefit from DBI, the profiler has to provide additional functionality. For CPU performance profiling, this means intelligent usage of DBI so that only the methods that need profiling have bytecodes added for instrumentation. Performance problems are typically associated with certain operations or functionality in an application. As an example, in a reporting system, a specific report might take longer than others to display. Once you identify the specific method that starts the execution of the feature with the problem, the profiler should be able to instrument just that method, allowing the rest of your application to run at full speed. Limiting the profiling also reduces the output from the profiler so that it only reports on the interesting parts of your application.
The top-level methods you select for profiling are root methods. To provide a complete picture of the performance problem, the profiler must instrument not only the selected root method(s), but also all methods that it invokes. This analysis must be done iteratively so that the profiler calculates the transitive closure of the call chain. When an instrumented method is invoked by the application, the instrumentation code must efficiently determine if a root method is in the call chain. If no root method is in the call chain, then the instrumentation code exits immediately to minimize overhead. If a root method is in the call chain, then profiling information is recorded.
A profiler that lets you select root methods and then applies DBI intelligently allows you to define the problem in your terms. Filters based on package and class names are still useful, but selecting root methods is a much more precise way of controlling profiling overhead while maintaining useful results in the profiler's output.
As with CPU performance profiling, to get maximum benefit from DBI when doing memory profiling, a profiler must use additional logic. This means smarter techniques for keeping track of objects because that is the largest source of overhead when doing memory profiling. It turns out the smartest technique is one of the simplest: Track only a subset of the allocated objects. Tracking as few as 10 percent of the allocated objects usually yields the same information, but with much less overhead. This is particularly true on larger applications.
A profiler should not only track object allocations, it should also help you determine if your application is leaking memory. A memory leak occurs when your application is no longer using an object but is inadvertently still holding a reference to that object, thereby preventing the VM from freeing up the memory used by that object. The sooner you can identify the objects that are potential leaks, the sooner you can reduce profiling overhead by having the profiler use DBI to remove instrumentation from all other classes.
Identifying potential leaks can be difficult, especially in large applications. The memory heap in a large application can be so big that attempting to compare heap snapshots frequently offers little help because the leaking objects are not obvious. This is especially true for leaks that start out small but continue to grow over time; sometimes weeks of application uptime are required before the problem is large enough to be noticeable.
A statistical approach for identifying memory leaks works well and imposes very little overhead. There are two key metrics: the age of each object and the generation count for each class. The age of an object is the number of garbage collections it has survived. The generation count for a class is the number of different ages of all objects of that class. A low generation count indicates that all the objects of a class have been in memory about the same amount of time. A high generation count (or a generation count that always increases) indicates that your application is continuing to allocate new objects of that class without letting go of references to older objects of that class. These are the best candidates for being memory leaks because applications do not typically intend to periodically allocate long-lived objects. Instead, they typically have long-lived objects that are allocated at roughly the same time and short-lived objects that are periodically allocated and then removed by the VM once they are no longer needed.
CPU Performance Profiling Example
We will use the NetBeans Profiler (http:// profiler.netbeans.org/) to demonstrate DBI. For CPU performance profiling, we chose the Java2D program, which is one of the demos included with Sun's Java Development Kit (JDK). When we started the profiling session, we configured the NetBeans Profiler to perform instrumentation of the entire application, as in Figure 1. This caused a dramatic slowdown in application startup time: 10 seconds instead of 2 seconds.
Application performance was also sluggish. When run without profiling, the application was able to keep up with a rapid series of mouse clicks on its top-level display tabs. When run with full instrumentation, it was not able to keep up. An additional problem is that, with full instrumentation, it took longer to find the information we needed. We were interested in the performance of the java2d.AnimatingSurface.run() method. With full instrumentation, we had to wade through 13 threads to find the three that invoke the run() method; see Figure 2.
We then restarted the profiling session and applied a filter that removed the instrumentation from all java.* classes. Application performance improved dramatically, but the comprehensiveness of the reported results suffered quite a bit. Compare Figure 3 with Figure 4. In Figure 3, the java.* classes were excluded and as a result important details about how much time was spent executing the methods called by the run() method are not available. Figure 4 shows the output when all classes were profiled.
To get acceptable application performance and comprehensive results, we used a DBI feature of the NetBeans Profiler to change the instrumentation. We removed the filter that had eliminated the java.* classes and we chose a single root method: java2d.AnimatingSurface.run(). As a result, application performance was very responsive and it was much easier to find the information we needed; only the three threads that had invoked the run() method were displayed, as in Figure 5. And because no java.* classes were filtered out, full details of the run() method were available; see Figure 4.
This application is relatively small, but the lessons learned apply to larger applications as well. When full profiling was done, 19,579 methods were instrumented. Adding a filter to remove the java.* classes dropped the number of methods with instrumentation to 712, but also reduced the amount of information available. With a single root method selected and the java.* classes no longer filtered out, the number of methods with instrumentation dropped to 344. Additionally, we still had comprehensive profiling results for the method that was of interest.
Memory Profiling Example
The sample application we created to demonstrate memory profiling contains a common Java antipattern: Objects are periodically placed into a Map and then forgotten. This prevents the VM from removing those objects from memory. Given enough time, the application eventually runs out of memory and crashes. The objects are small, however, so the amount of time before a crash occurs could be days or weeks. Instead of waiting for it to crash, we used the NetBeans Profiler to watch the sample application's memory usage. The NetBeans Profiler can track just allocations, or allocations and garbage collection. We chose the second option and also to track 10 percent of allocations, as in Figure 6.
The NetBeans Profiler has a dynamic display of the heap contents, as in Figure 7. By default, it lists classes by the amount of space that the objects of that class are using. When we switched the display order to be by generation count and then let the application run for a while, the memory leak was easy to spot. Three classes had generation count values that were noticeably larger and were continuing to increase as the application ran (Figure 8). The three classes are java.util.HashMap$Entry, double, and float. After those were identified, we used a DBI feature within the NetBeans Profiler to turn off the instrumentation of all other classes. This reduced profiling overhead dramatically.
Knowing which classes are leaking is usually only part of the challenge because frequently there are leaking and nonleaking instances of a class. In other words, some of the allocations are okay, some are not. The NetBeans Profiler helped us find the allocations that are leaking by showing us stack traces for each allocation. The top-level list of stack traces in Figure 9 lists two methods that are allocating HashMap$Entry objects: addEntry() and createEntry(). The generation count for addEntry() is dramatically larger than for createEntry(), so it is the more likely source of the leak.
Expanding the stack trace of addEntry() shows that it is called by put(), which in turn is called by several different methods. Of all the different execution paths that result in the creation of HashMap$Entry objects, one has a generation count that is dramatically larger than the rest. That code path begins with the run() method in the aptly named demo.memoryleak.LeakThread class, as in Figure 10. The NetBeans Java editor was used to display the source code, which allowed us to spot the memory leak bug:
map.put (new float, new double );
You do not have to endure untenable amounts of overhead and an overwhelming amount of output to profile your Java applications. Dynamic Bytecode Instrumentation, when properly applied, can make profiling fast and easy.
Dmitriev, Mikhail. "Design of JFluid: A Profiling Technology and Tool Based on Dynamic Bytecode Instrumentation," http://research.sun.com/techrep/2003/ abstract-125.html.