Channels ▼
RSS

JVM Languages

Extreme Parsing

Source Code Accompanies This Article. Download It Now.


Aug03: Extreme Parsing

Kyle is a J2EE architect who works as a project leader at HBO, and also as an independent consultant and software developer. He can be contacted at kdowney@amberarcher.com.


A project I was working on required an IDL parser with extensions. However, writing a parser from scratch wasn't an option—IDL is too complex and I didn't have time. Because I had previously used SableCC (http://www.sablecc.org/), an open-source compiler-compiler for Java, to write a simple parser for the WHERE clause part of the SQL grammar, I took another look at it and realized it has a grammar for basic IDL. Upon further examination, however, its license turned out to be too restrictive, so I decided to write the parser from scratch after all—but use SableCC to generate the lexer and parser.

The Common Object Request Broker Architecture (CORBA) 3.0 specification is daunting, weighing in at over 1000 pages. Granted, a mere 74 pages define IDL syntax and semantics, but the information is still dense and detailed. Consequently, when sitting down to build the parser, I settled on a plan of attack that had a lot in common with Kent Beck's Extreme Programming (XP) core principles (http://www.xprogramming.com/).

XP Core Principles

Four of the 12 core XP principles Beck describes in Extreme Programming Explained: Embrace Change (Addison-Wesley, 1999; ISBN 0201616416) particularly apply to parser development:

  • Simple design. At the end of the process, you want a grammar that represents the full specification and nothing more. Any excess makes the parser slower, may introduce bugs, and makes it harder to understand and extend the grammar. Thus, this XP principle forces you to implement the minimum, following Einstein's advice to make your solutions "as simple as possible—but no simpler."
  • Test-driven development. Standards bodies sometimes produce suites of compliance tests. You could say that a compliant parser for a Standard is done when it passes those tests. Whether you have a suite to work from or you build one from examples in the spec (as I did), the key to test-driven development for parsers is to add only the new grammar elements needed to pass the current test. For IDL, you might start with:

    module M {

    interface foo { };

    };

    which is valid and requires only two keywords and a grammar for what makes a valid identifier. The important part is that you structure each step as an automated unit test. This lets you re-run all tests every time you add new grammar elements. If you do it right, it becomes natural to rerun the whole compliance set every time you make a change, helping you to catch changes (that introduce ambiguities or invalidate other parts of the grammar) early on.

    Parsers benefit from this even more than well-factored, object-oriented code because the grammar specification is not encapsulated. You can easily reorder the token definitions and have keywords recognized as identifiers instead. Additions can (and often do) break seemingly unrelated parts of the grammar.

  • Design improvement. Grammar elements are not encapsulated, but they can be subdivided, so more complex productions build on smaller productions and tokens. Making sure your grammar is clear and reuses simpler productions cleanly is equivalent to refactoring implementation code. Just as with object-oriented refactoring, the goal is to improve (and if possible simplify) the design without changing the functionality. Again, your suite of unit tests provides indispensable tools: After you make changes that shouldn't impact the parser, you can immediately rerun to make sure it really didn't!

  • Continuous integration. Incremental, test-driven development and redesign require frequent builds. To ease this, you'll want to ensure that parser/lexer generation and code compilation are automated. SableCC includes an Ant task, so that, with a few lines of XML, you can incorporate the compiler-compiler phase as part of the build.

Introducing SableCC

Tools such as lex and yacc have been in programmers' parser-development toolchests for a long time. These tools read grammar and generate the lexer and parser code—in C, in the case of lex and yacc. Every major language has at least one compiler-compiler. For instance, Java has ports of lex and yacc, plus more object-oriented, Java-friendly tools such as PCCTS/antlr, JavaCC (Jack, originally from Sun, now from Metamata), and SableCC, among others.

SableCC is nice because it generates Visitor-pattern-based tree-walking classes that make it easy to build code generators, pretty printers, and whatever other tools you create on top of your grammar. Your visitor class gets callbacks for entry into and exit out of each production, and it's passed a Node object with all the details you need.

SableCC's syntax is easy to learn and close to Extended Backus-Naur Format (EBNF), the standard grammar notation used in major specifications. For instance, the EBNF production for IDL interfaces from the CORBA spec in Example 1(a) is rendered in SableCC as Example 1(b).

Continuous Building and Testing

To satisfy the continuous integration and test-driven development principles, you need a one-touch build script. Since SableCC comes with an Ant task, you can integrate the creation of the lexer/parser source code as part of your build. Listing One is a complete Ant build script. The first highlighted section shows how to dynamically load a Task extension into Ant, and assign it to a new XML tag, in this case, <sablecc>. The second highlighted section shows how to use this tag to build the grammar.

In this script, the generated Java source code goes to its own directory and is compiled separately from the unit test and other supporting code. Usually, you will not want to check this code into your source-code control system, so this helps to keep it distinct and makes it easy to clean up (just remove the build directory).

The next step is to set up a suite of automated unit tests. The base class I created for all parsing unit tests lets me load a test input file and feed it to my generated parser; see Listing Two. The import package base (com.amberarcher.corba.idl) varies depending on your declared Package for the parser, but the names of the generated classes (Start, Parser, and so on) are identical. Like many compiler-compilers, there is no supporting library for SableCC.

Most of the code just sets up the input stream that is fed into the lexer. For convenience, this particular base class loads all test files relative to src/share/test-fixtures/idl, so the filename is also built up. The heart of the code is:

Parser p = new Parser(new Lexer(in));

return p.parse();

The returned node, Start, corresponds to the first production in your grammar. The derived test class calls parse and passes the name of the example file to test. I followed a convention in which the filename is mapped to the section of the CORBA spec and includes some basic information about what is being tested. Listing Three shows how to use this base class to execute a single test, illustrating a critical requirement for automated unit testing—the test must decide for itself whether it passed. You should not have to eyeball the results. In this case, the IDL in Example 2 should fail, at least according to the spec. In the test, I try to catch a ParseException (if there is one) and assert either that it failed or that it failed in the place I expected it to fail for the reason it was supposed to fail.

Using IntelliJ IDEA (http://www.intellij.com/idea/), you can run the Ant script and the unit tests inside the IDE. Figure 1 shows a successful test run with multiple test cases executing.

Extreme MiniIDL with SableCC

My goal was to write a SableCC grammar that can handle a fragment of IDL that represents just a handful of tokens and productions from the CORBA 3.0 specification; see Listing Four. CORBA is a mature distributed object technology, and IDL is its language-neutral Interface Definition Language. Given the IDL, you can take an IDL compiler and generate network stubs and skeletons that can connect to the remote object—even if your client is in Perl and the server is in C++ or is a Java EJB. IDL isn't just for network servers, though. Microsoft uses a variation on IDL to define COM components, and Mozilla's XPIDL and the Linux Gnome component object model (Bonobo) also use IDL for component interfaces. The XML Document Object Model (DOM) is also defined in terms of IDL. So having an extensible IDL parser around to generate documentation and write a stub compiler is useful.

Here, I handle a subset of IDL—just enough to declare interfaces (with inheritance), attributes, and modules (a name-scoping mechanism similar to Java packages or C++ namespaces), plus comments. Amazingly, this can be done with under 50 lines of Sablecc grammar, resulting in hundreds of lines of generated Java code.

I start by defining the package, along with a few helpers. In Example 3, the first line defines which package the generate code goes into. The all set is every possible Unicode character, because SableCC inherits Java's Unicode text-handling support. From there, I define a few useful subsets using SableCC's range operator. For instance, lowercase is defined to be all the lowercase ASCII characters from a to z. Integers can be used to represent ASCII character codes; for example, tab = 9; defines some helpful whitespace characters (tab, carriage return, line feed). The eol definition uses SableCC's or operator ("|") and concatenation to represent that an end-of-line is a carriage-return/linefeed, carriage return, or linefeed.

From here you can define derived helpers; see Example 4. Note how "+" and "-" work as you expect they would in defining character ranges; for instance, not_eol is all character excluding (minus) carriage return and line feed, just as not_star is not_eol excluding "*" or eol.

These last three helpers are for matching comment tokens. To match an interface declaration and make sure it works, you have to add tokens—atomic pieces of the grammar (see Example 5). Token definition is like helper definition: name = value. Valid right-side values are literal values (like "interface"), helpers, and combinations of the two. Globbing operators can be used, too; for instance, (' ' | tab | eol)+ indicates that whitespace can be one or more of the three possible values given. The "?" indicates 0 or 1; "*" indicates 0 or more, just as in DOS file globs and regular expressions. Because productions have to be defined in terms of tokens, you have to declare some tokens for symbols like "{" and ";."

With tokens defined, you can also specify some to ignore:

Ignored Tokens

white_space

At this point, you can string tokens together into valid sentences in the grammar. The first one is the Start node, and each one in turn can be defined in terms of the others, and even recursively; see Example 6.

Production definition translates to token definition, with one exception: If you use the same name for a production as for a token, you need to qualify it with a "P." and "T." prefix. In Example 6, interface has this ambiguity and needs to be qualified.

Productions can have branches, but you need to label them like this:

interface = {interface_dcl} interface_dcl |

{forward_dcl} forward_dcl;

The labels in the curly braces become the accessors (getXXX()) in the generated Java code. So now you can load up the test:

interface foo { };

interface B { };

interface Fee_Foo_Fum { };

and it should parse correctly. You build out from here, adding forward declarations, attribute declarations, and inheritance, and writing tests for each extension. With the base helpers and tokens, the rest is straightforward. Listing Five is the full grammar.

Conclusion

Outside of a computer-science classroom, no one gets points for writing a parser and lexer from scratch. However, using a compiler-compiler—whatever your language of choice—is a good expression of Larry Wall's "laziness" virtue of programmers. If you also have the "hubris" virtue, though, and want to tackle a big grammar by yourself, you're advised to build it piecewise using the extreme development methods I present here: Develop a production; build, test, and add; and keep doing it until you've covered the whole grammar.

DDJ

Listing One

<?xml version="1.0"?>
<project name="comc" default="compile" basedir=".">
    <!-- allow for local override of site settings -->
    <property file="${user.home}/.ant.properties"/>
    <property name="build.dir" value="${basedir}/build"/>
    <property name="build.classes.dir" value="${build.dir}/classes"/>
    <property name="build.gen-src.dir" value="${build.dir}/gen-src"/>
    <property name="src.dir" value="src/share/classes"/>
    <property name="grammar.dir" value="src/share/grammars"/>

    <path id="build_classpath">
        <pathelement path="${basedir}/lib/junit.jar"/>
    </path>

   <path id="tools_classpath">
       <pathelement path="${basedir}/build-tools/sablecc-anttask.jar"/>
       <pathelement path="${basedir}/build-tools/sablecc.jar"/>
   </path>

    <taskdef name="sablecc" classname="org.sablecc.ant.taskdef.Sablecc"
        classpathref="tools_classpath"/>
    <target name="generate_idl_parser">
        <mkdir dir="${build.classes.dir}"/>
        <mkdir dir="${build.gen-src.dir}"/>
        <sablecc src="${grammar.dir}" includes="idl.sablecc" 
            outputdirectory="${build.gen-src.dir}"/> 
        <javac srcdir="${build.gen-src.dir}"
            destdir="${build.classes.dir}">
            <include name="**/*.java"/>
        </javac>
        <copy todir="${build.classes.dir}">
            <fileset dir="${build.gen-src.dir}">
                <include name="**/lexer.dat" />
                <include name="**/parser.dat" />
            </fileset>
        </copy>
    </target>
    <target name="compile">
        <mkdir dir="${build.classes.dir}"/>
        <javac srcdir="${src.dir}" classpathref="build_classpath"
            destdir="${build.classes.dir}">
            <include name="**/*.java"/>
        </javac>
        <copy todir="${build.classes.dir}">
            <fileset dir="${build.gen-src.dir}">
                <include name="**/lexer.dat" />
                <include name="**/parser.dat" />
            </fileset>
        </copy>
    </target>
    <target name="clean">
        <delete dir="${build.dir}"/>
    </target>
</project>

Back to Article

Listing Two

package com.amberarcher.corba.idl.testing;
import java.io.FileReader;
import java.io.PushbackReader;
import java.io.File;
import junit.framework.TestCase;
import com.amberarcher.corba.idl.node.Start;
import com.amberarcher.corba.idl.parser.Parser;
import com.amberarcher.corba.idl.lexer.Lexer;

public class AbstractTestCase extends TestCase {
    public AbstractTestCase(String methodName) {
        super(methodName);
    }
    protected Start parse(String testCaseFileBaseName) throws Exception {
        FileReader fin = new FileReader("src" + File.separator +
                "share" + File.separator + "test-fixtures" + 
                File.separator + "idl" +
                File.separator + testCaseFileBaseName);
        PushbackReader in = new PushbackReader(fin);
        Parser p = new Parser(new Lexer(in));
        return p.parse();
    }
}

Back to Article

Listing Three

public class StandardIdlParsingUnitTests extends AbstractTestCase {
    public StandardIdlParsingUnitTests(String methodName) {
        super(methodName);
    }
    public void testUnescapedIdentifierFails() throws Exception {
        String msg = null;
        boolean failed = false;
        try {
            parse("example-3.2.3.1-esc-identifiers-1.idl");
        } catch (ParserException e) {
            msg = e.getMessage();
            failed = true;
        }
        assertTrue(failed);
        assertEquals("[3,27] expecting: identifier, escaped identifier", msg);
    }
}

Back to Article

Listing Four

module M {
    interface base;              /* forward declaration */
    interface base { };
    interface thing : base {
        attribute boolean foo;   // creates get_foo(), set_foo()
    };
};

Back to Article

Listing Five

Package com.amberarcher.corba.idl;

Helpers

    all = [0 .. 0xFFFF];
    lowercase = ['a' .. 'z'];
    uppercase = ['A' .. 'Z'];
    digit = ['0' .. '9'];

    tab = 9;
    cr = 13;
    lf = 10;
    eol = cr lf | cr | lf;

    identifier_prefix = lowercase | uppercase ;
    identifier_char =  identifier_prefix | digit | '_';

    not_eol = [all - [cr + lf]];
    not_star = [not_eol - '*'] | eol;
    not_star_not_slash = [not_eol - ['*' + '/']] | eol;

Tokens

    // different types of comments
    traditional_comment = '/*' not_star+ '*'+
(not_star_not_slash not_star* '*'+)* '/';
    end_of_line_comment = '//' not_eol* eol?;

    attribute = 'attribute';
    boolean = 'boolean';
    interface = 'interface';
    module = 'module';

    identifier = identifier_prefix identifier_char*;
    escaped_identifier = '_' identifier_prefix identifier_char*;

    white_space = (' ' | tab | eol)+;
    start_scope = '{';
    end_scope = '}';
    end_decl = ';';
    inherit_op = ':';
    scoping_op = '::';
    interface_sep = ',';

Ignored Tokens

    white_space,
    traditional_comment,
    end_of_line_comment;

Productions

    specification = definition+;

    definition = {interface} P.interface |
        {module} P.module;

    module = T.module P.identifier start_scope definition+ end_scope
end_decl;

    interface = {interface_dcl} interface_dcl |
        {forward_dcl} forward_dcl;

    interface_dcl =
        interface_header start_scope interface_body end_scope end_decl;

    forward_dcl = T.interface P.identifier end_decl;

    interface_header = T.interface P.identifier
interface_inheritance_spec?;

    interface_body = exports*;

    interface_inheritance_spec = inherit_op interface_name
interface_inheritance_spec_tail*;

    interface_inheritance_spec_tail = interface_sep interface_name;

    interface_name = scoped_name;

    exports = attr_dcl;

    attr_dcl = attribute type_declaration P.identifier end_decl;

    type_declaration = boolean;

    scoped_name = {identifier} P.identifier |
            {relative_identifier} scoping_op P.identifier |
            {global_identifier} scoped_name scoping_op P.identifier;

    identifier = {id} T.identifier | {esc_id} escaped_identifier;

Back to Article


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video