Soot: Analyze, Transform, and Optimize Java Bytecodes


Soot is a powerful, extensible, open-source toolkit for analyzing and optimizing Java programs. It helps developers understand their Java programs, especially when used as a framework for expressing quick-and-dirty analyses of Java code. Soot also makes it easy for developers to write code that transforms Java programs. The project was originally developed at McGill University in Montreal in early 2000. New versions of Soot are released regularly.

What You Can Do with Soot

Soot was originally developed as a testbed for Java compiler technology, and one of its first uses was to apply standard compiler optimization techniques to Java programs. Soot can optimize code, and it has an advantage over a just-in-time (JIT) compiler: because it runs ahead of time, it can take as long as it needs for its analysis, whereas a JIT compiler has to make decisions on the fly.

In this article, however, I want to examine two other uses for the Soot framework:

  • As a Java disassembler — unlike javap, Soot produces easy-to-read output in an intermediate language.
  • As a semantic code analyzer.

Beyond these two uses, Soot is also used for code transformations, that is, as a more sophisticated alternative to a bytecode-engineering tool such as BCEL. Because of Soot's compiler heritage, you can express transformations in terms of Soot's easy-to-understand intermediate languages, rather than in terms of Java bytecode. Soot can also be used for heavy-duty compiler analysis: it includes a dataflow analysis engine and provides access to call graph and pointer analysis information.

Soot as a Disassembler

Because Java programs run on a virtual machine, Java compilers convert source code to bytecodes, which are then executed by the JVM. Because bytecodes use a binary format, they can be hard to understand. You may have run the javap tool to investigate some suspicious bytecodes manually. Soot provides a better way to understand your bytecode.

Compilers frequently use intermediate representations (IRs) internally. These IRs make compiler writers' lives easier by reducing the number of cases they need to consider. In addition, these IRs simplify figuring out what's going on in your code.

One important simplification that IRs implement is that, instead of allowing arbitrarily complex expressions such as x * y + z >> 3, an IR can be constrained to express at most one operation per statement. Each statement typically consists of one binary operation (so, two operands) and one destination, which makes three addresses in all. This type of IR is therefore known as three-address code.
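To make the idea concrete, here is a minimal Java sketch (hand-written, not Soot output) of how the expression x * y + z >> 3 breaks down into three-address form. The class name ThreeAddress and the temporaries t0 through t2 are illustrative names only; a real IR such as Jimple picks its own names.

```java
class ThreeAddress {
    // The compound expression as a programmer would write it.
    // Java precedence parses it as ((x * y) + z) >> 3.
    static int compound(int x, int y, int z) {
        return x * y + z >> 3;
    }

    // The same computation with at most one operation per statement:
    // two operands and one destination each time.
    static int threeAddress(int x, int y, int z) {
        int t0 = x * y;   // t0 = x * y
        int t1 = t0 + z;  // t1 = t0 + z
        int t2 = t1 >> 3; // t2 = t1 >> 3
        return t2;
    }

    public static void main(String[] args) {
        // Both forms compute the same value for any inputs.
        System.out.println(compound(2, 3, 10) == threeAddress(2, 3, 10));
    }
}
```

Each temporary holds exactly one intermediate result, which is what lets an analysis reason about one operation at a time.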

The other major simplification delivered by an intermediate language is reducing a program's control-flow constructs to simple conditionals (if/goto). Thus, intermediate representations are often organized as a control-flow graph (CFG).
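The following sketch illustrates the if/goto idea in plain Java. Java has no goto statement, so the lowered version emulates labels with a program-counter variable and a switch; this emulation, the class name GotoLowering, and the variable pc are assumptions made for illustration and are not how Soot represents control flow internally.

```java
class GotoLowering {
    // Structured version: an ordinary while loop.
    static int sumStructured(int n) {
        int sum = 0;
        int i = 1;
        while (i <= n) {
            sum += i;
            i++;
        }
        return sum;
    }

    // Lowered version: every "label" is a case, and "goto labelK"
    // is emulated by setting pc to K and re-entering the switch.
    static int sumLowered(int n) {
        int sum = 0;
        int i = 1;
        int pc = 0;
        while (true) {
            switch (pc) {
                case 0:            // label0: loop test
                    if (i > n) {   // if i > n goto label1
                        pc = 1;
                        break;
                    }
                    sum = sum + i;
                    i = i + 1;
                    pc = 0;        // goto label0
                    break;
                default:           // label1: loop exit
                    return sum;
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(sumStructured(10) == sumLowered(10));
    }
}
```

The two cases correspond to the two nodes a control-flow graph would contain for this loop: the test-and-body block and the exit block.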

Soot's intermediate representation (called Jimple) implements both of these simplifications. Soot can produce Jimple from bytecode or from Java source code. Jimple contains 13 basic statements; examples include AssignStmt and InvokeStmt. Let's look at some Jimple code for the following Java code:

    public class For {
        java.util.List<A> list;   // A can be any class available on the classpath

        public void m() {
            for (A a : list) {
                System.out.println(a);
            }
        }
    }

Assuming that Soot is on the classpath (Soot can also be run as an Eclipse plugin), we can invoke Soot on the command-line as follows:

$ java soot.Main For -soot-classpath /usr/lib/j2sdk1.5-sun/jre/lib/rt.jar:. -f j

For is the name of the compiled class. The -soot-classpath option tells Soot to look for classes in the Java 1.5 runtime library, as well as in the current directory. The -f j option specifies Jimp (abbreviated Jimple) output. (The difference between Jimp and Jimple is that Jimple includes some details needed to reconstruct the bytecode; Jimp is more legible.)

Soot produces the following For.jimp output file:

    public void m()
    {
        For r0;
        java.util.Iterator r1;
        A r2;
        java.util.List $r3;
        boolean $z0;
        java.lang.Object $r4;
        java.io.PrintStream $r5;

        r0 := @this;
        $r3 = r0.list;
        r1 = $r3.iterator();

     label0:
        $z0 = r1.hasNext();
        if $z0 == 0 goto label1;

        $r4 = r1.next();
        r2 = (A) $r4;
        $r5 = java.lang.System.out;
        $r5.println(r2);
        goto label0;

     label1:
        return;
    }

Soot has transformed the Java bytecode into three-address code, organized in a control-flow graph. We can see that the Java compiler creates an iterator on the list field of the this object. At the top of the loop (label0), the program asks the iterator whether it has another element; if it does not, control jumps to label1 and the method returns. Otherwise, the loop body requests the next object from the iterator, casts it to A, calls System.out.println on it, and jumps back to label0.
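For comparison, the Jimple listing corresponds roughly to the following hand-desugared Java. This is a sketch under stated assumptions: it uses String in place of the article's class A and collects output in a StringBuilder instead of calling System.out.println, so that the two versions can be compared directly. The class name ForDesugared is hypothetical; the local names mirror the Jimple locals without the $ prefix.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class ForDesugared {
    // The enhanced for loop as written in source code.
    static String enhanced(List<String> list) {
        StringBuilder out = new StringBuilder();
        for (String a : list) {
            out.append(a).append('\n');
        }
        return out.toString();
    }

    // The same loop written out step by step, mirroring the Jimple listing.
    static String desugared(List<String> list) {
        StringBuilder out = new StringBuilder();
        Iterator<String> r1 = list.iterator();  // r1 = $r3.iterator()
        while (true) {                           // label0:
            boolean z0 = r1.hasNext();           // $z0 = r1.hasNext()
            if (!z0) {                           // if $z0 == 0 goto label1
                break;
            }
            Object r4 = r1.next();               // $r4 = r1.next()
            String r2 = (String) r4;             // r2 = (A) $r4
            out.append(r2).append('\n');         // stands in for $r5.println(r2)
        }                                        // goto label0
        return out.toString();                   // label1: return
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("x", "y", "z");
        System.out.println(enhanced(items).equals(desugared(items)));
    }
}
```

Reading the two methods side by side shows where the hidden iterator, the hasNext test, and the cast come from in the Jimple output.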

The $ prefix indicates a stack variable that was not originally present in the code, but was instead introduced in Soot's transformation of the code from bytecode to Jimple. It is possible to tell Soot to use the original names from the source code if the class was compiled with debug information (specify -p jb use-original-names on the command line). This option would replace r2 with a, among other changes.

Other Intermediate Representations in Soot

Although Jimple is the principal Soot IR, it is not the only one. Soot also contains the Baf IR, which is a low-level bytecode-based IR, and the Grimple IR, which contains full expressions instead of three-address code. You might wish to look at these IRs to better understand your code, but you would typically not use them when writing automated analyses; Jimple is better suited to that job. Another interesting IR is Dava, which attempts to convert the control-flow graph back into structured Java code.

Another way to use Soot is via an Eclipse plugin. This plugin allows you to run Soot without having to use the command line; it is basically a user-friendly front end. You can also graphically view control-flow graphs and step through the workings of your analysis using the Eclipse plugin.

