Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Web Development

XML, Reflective Pattern Matching, and Java


Jun00: XML, Reflective Pattern Matching, and Java

Andy is a computer consultant who lives on Guernsey in the Channel Islands. He can be contacted at [email protected].


Every so often, a small project becomes a victim of its own success -- and becomes a big project. On these occasions, the architecture may need a rethink if the original design can't cope with the extra requirements expansion brings in its wake. This happened to me last year while working on Marius, a set of programs that takes an XML source document containing Java code fragments and explanatory text (see my article "Java, XML, & Literate Programming" DDJ, February 2000). One of the programs, Comb, extracts Java and reassembles it into files that can be compiled. The other, Weave, creates an HTML version of the source document that is easy to read.

At the end of that article, I mentioned that it would be nice to be able to vary the style of the HTML output to fit in with existing web sites. It was at this point that I ran into trouble. Both Comb and Weave had been written using recursive descent, a common and simple technique for working with trees. Unfortunately, this meant that the output of Weave could only be changed by making large-scale modifications to the program. Worse than this, coping with a variety of output formats was completely infeasible. Something new was needed.

Recursive descent makes a depth-first traversal of the tree output by the XML parser, taking actions based on the next node of the tree and where you are in the program. For example, suppose you are processing a document constructed according to a DTD containing Example 1 that represents the log of a tollbooth on a bridge. It says that a RECORD tag with a DATE attribute contains zero or more CAR or VAN tags, and that VAN and CAR tags contain zero or more passengers. (I am distinguishing between passengers and drivers here.) Example 2 is a typical record file, in which a car containing two passengers was followed by a van containing one.

DOM and Recursive Descent

Most XML parsers produce trees conforming to the Document Object Model (DOM) from the W3C. This model contains three interfaces that get used a great deal when processing XML. The Node interface represents a node on a parse tree and contains methods for finding other nodes, such as the getFirstChild() method that finds the first child of a node, if there is one.

Node is extended by the Element interface that represents an XML tag; it contains methods such as getAttribute() for discovering the value of an attribute in a tag, such as the DATE attribute in the RECORD tag in Example 2. It also contains a getTagName() method for discovering the name of a tag. Finally, Node is also extended by Text, which represents character data; its getData() method can be used to get the string of data.

If you wanted to write a program to count the number of passengers in cars and you had parsed the document into a tree of DOM nodes using an XML parser, you could write something like Listing One. The processRecord() method takes an element that is the root of the tree of nodes. Each of its child nodes is checked to see whether it is a CAR tag; if it is, it is passed onto the processCar() method. This method counts each of the child nodes. Because the DTD only permits PASSENGER tags inside children, it simply counts nodes without checking their name.

In the past, I've used recursive descent without a second thought, but in fact it's a way of programming that sits uncomfortably with object-oriented languages. It's tightly coupled to the DTD. If the DTD is complicated, the code to process it will be, too. There's also a strong coupling between the various methods as methods call other methods in the class to process subparts of the DOM tree, and it makes inheritance hard to use. These factors make the code brittle and unable to easily cope with change -- which is exactly the problem I had with Weave.

Pattern Matching with Hex

An alternative to processing trees with recursive descent is to use pattern matching. This is available in some declarative languages such as Prolog or Miranda, but not directly built into Java. However, Java does have another powerful facility -- Reflection -- which can be used to add pattern matching in a simplified form. After a number of experiments, I came up with Hex. To use it, you supply another class called the "delegate" that contains a series of actions in the form of methods named according to a special convention. Hex traverses the DOM tree in depth-first order and, when it recognizes a combination of tags for which there is a matching method in the delegate, it calls the method. Listing Two is another class that counts the number of passengers in cars on the toll bridge.

The setHex() method is required to implement the Hex interface. The Hex class uses this to pass a reference to the delegate so that the delegate can call some special methods in Hex for control purposes. The other method, prePASSENGER_ CAR() is called by Hex whenever a PASSENGER tag is encountered inside a CAR tag during the depth first traversal. Hex passes the matching element as a parameter as you commonly need to do various things based on the value of the attributes of the tag.

When Hex arrives at a new node, it attempts to find a matching action in the delegate, starting with the strongest pattern that takes the current node and all of its ancestors into account. If it fails to find this, it then looks for a pattern matching the current node and all ancestors except the root, then all ancestors except the root and the root's child, and so on. Finding a match in the delegate causes the match to be executed, then Hex moves onto the next node. If it finds no matches it moves silently onto the next node. Consider the typical record file in Example 2 that's represented as the tree of DOM Elements. In Figure 1, Hex traverses from A to L. At position A, it looks for the preRECORD() method in the delegate. At position B, it looks for preCAR_RECORD() and if it does not find that, it looks for preCAR(). At C, it looks for prePASSENGER_CAR_ RECORD(), then prePASSENGER_CAR(), then prePASSENGER(). At D, the behavior changes slightly. Hex also matches against post methods that are called after a node and its children have been processed. At D, it looks for postPASSENGER_CAR_RECORD(), then postPASSENGER_CAR(), and then postPASSENGER().

Position E is treated like C, and F is treated like D. Then, at G, the post methods are tried for the CAR tag -- postCAR_RECORD(), then postCAR(). At position H, Hex tries preVAN_RECORD(); if it does not find that, it then tries preVAN(). Hex carries on in this fashion until it reaches position L, at which point it tries postRECORD() before halting.

The point of having both "pre" and "post" pattern matches is that transforming XML often involves doing something both before and after an element has been processed. For example, you could transform the file into English text using Listing Three. Running this against Example 2 produces:

On 03/03/1999

A car with 2 passengers.

A van with 1 passengers.

Hex also works with Text, the other sort of node that commonly appears in DOM trees. A piece of XML, such as <form>This is text</form>, is represented in DOM as in Figure 2. When a text node is encountered, Hex looks for a method with a name starting with prePCDATA_ with the rest of the name constructed as before. In this case, you would look for a method called prePCDATA_ form and if it was not discovered in the delegate, Hex would look for a method called prePCDATA. Similarly, matches beginning postPCDATA_form will be searched for afterwards. However, Hex hands a match a different set of parameters. A complete match is handed the contents of the Text node as a string, the Text node itself, and the Element node that is its immediate ancestor; for example, public void prePCDATA_form(String content, Text text, Element element). The reason for this is that processing such a Text node will almost certainly require the actual string contents, and may well need values of the attributes that can be obtained from the Element reference. The Text node itself is available in case its contents need to be changed.

It's often helpful to be able to process an XML document in multiple passes; for example, when converting XML to HTML. For display purposes, you may have to output a set of links representing a table of contents, prior to outputting the rest of the HTML. The easy solution is to process the XML in two passes, gathering data for the table of contents in the first pass, and outputting both the table and the rest of the HTML in the second.

Hex handles this by allowing a mode string to be set and exposing a method redoFromStart(). Once the mode string has been set, only patterns starting with "preXXX" and "postXXX" (where XXX is the mode string) will be matched. For example, suppose you want to extend the previous example to output the total number of passengers prior to printing out the details of every car and van. Listing Four is one possible solution. Running this against Example 2 produces:

On 03/03/1999

There were 3 passengers in total.

A car with 2 passengers.

A van with 1 passengers.

Hex has some obvious advantages over recursive descent -- it tends to produce smaller classes because the delegate does not need to include the loops and conditionals that navigate recursive descent through the nodes. My experiments show that this typically saves around 20 percent of the code.

The actions in the delegate are independent from each other, unlike the methods in a class implementing recursive descent. This gives two advantages. First, each action stands alone and can be understood without referring to the other methods. Second, it is possible to use inheritance effectively. Suppose you wish to extend the CarCountingDelegate in Listing Two to count cars and vans. Listing Five, for instance, counts cars and vans. However, extending the recursive descent-based Process class (Listing One) is not so simple. Not only will another processVan() method analogous to the existing processCar() be required, the processRecord() method will also have to be completely replaced -- as it needs to call both of the other two methods.

A delegate is not strongly coupled to the DTD. If the DTD is varied slightly, the delegate can often be left unchanged. This is in strong contrast to recursive descent, which is so tightly coupled to the DTD it almost invariably has to change -- even if the modification to the DTD is small. For example, suppose the tollbooth DTD is extended so that CAR tags also contain PET tags that record when the family dog came along for the ride.

<!ELEMENT CAR ((PASSENGER|PET)*)>

<!ELEMENT PET EMPTY>

The Hex delegates all still work, but the Process class will have to be rewritten.

Hex works using Java Reflection, from which I took the name "Reflective Pattern Matching" for this technique. Reflection is the ability to find out facts about objects at run time (for more information on reflection, see "Java Reflection," by Paul Tremblett, DDJ, January 1998). In particular, Hex uses it to find out the names of all the methods in a delegate and see if any of these methods can be construed as an action. This is done when an object of the Hex class is constructed:

Hexable delegate = new CarCounting Delegate();

Hex hex = new Hex(delegate);

Basically, this means it looks for methods that begin with "pre" or "post" in the delegate. It also performs a useful optimization at this stage by working out how many tags are in the longest pattern. It's obviously pointless trying to match a pattern containing three tags when the longest pattern contains only two. It also calls the setHex() method in the delegate, which is the method the delegate must have to satisfy the requirements of the Hexable interface. Then, given some tree of DOM nodes in a variable doc, you start the processing with hex.process(doc);.

Hex exposes methods such as redoFromStart() that cause the processing to start again, usually after changing the mode string. This can be called at any time during processing and since the depth first traversal is recursive this means that the stack has to be unwound back to the starting state. I did this by making the redoFromStart() method throw a special exception that is actually caught in Hex's process() method. It's declared as an extension of RunTimeException, so that the throw need not be declared in the delegate. It also avoids cluttering up the Hex code with a series of conditionals to see if the delegate wishes to restart the processing, which makes Hex more efficient. However, it's a very unusual control method, coming perilously close to an abuse of Java's exception system, and it means that any code immediately after the call of redoFromStart() will never be executed.

The Hex class occupies around 7 KB in compiled form and is available under the Gnu General public License from http://www.cedillasoft.com/.

Alternatives to Hex

Although the pattern matching available in Hex is relatively simple, it is powerful enough to perform some very sophisticated processing of XML documents. But it can't do everything; it doesn't pattern match against attribute values and it can't recognize if sets of tags that occur one after the other are significant. To do this, you need a more powerful pattern-matching system, and the W3C organization has XSL, which does all this and more. XSL uses a set of patterns described in a special format, and a program called an "XSL processor" to process a DOM tree against them. A highly recommended open-source XSL processor is SAXON, available with extensive documentation at http://users .iclway.co.uk/mhkay/saxon/index.html.

Conclusion

Hex was written in response to a specific need I had in the Marius project, which was to gain more flexibility and maintainability in the processing of XML documents. However, I now use it in a number of places in the Marius programs and it has evolved over its lifetime to become more practical. It packs a great deal of power into a small class and you will find that it handles all but the most complicated processing tasks.

DDJ

Listing One

public class Process {
    public int count = 0;
    public processRecord(Element root) {
        Element next = root.getFirstChild();
        while (next != null) {
            String name = next.getTagName();
            if (name.equals("CAR") {
                processCar(next);
            }
            next = (Element)next.getNextSibling();
        }
    }
    public processCar(Element car) {
    Element next = car.getFirstChild();
    while (next != null) {
        count++;
        next = (Element)next.getNextSibling(); 
    }
}

Back to Article

Listing Two

public class CarCountingDelegate implements Hexable {
    public int count = 0;
    private Hex hex;
    public void setHex(Hex hex) {
        this.hex = hex;
    }
    public void prePASSENGER_CAR(Element element) {
        count++;
    }
}

Back to Article

Listing Three

public class RecordToEnglishDelegate implements Hexable {
    int count = 0;
    Hex hex;
    public void setHex(Hex hex) {
        this.hex = hex;
    }
    public void preRECORD(Element element) {
        String date = element.getAttribute("DATE");
        System.out.println("On " + date);
    }
    public void preCAR(Element element) {
        System.out.print("A car with ");
    }
    public void postCAR(Element element) {
        System.out.println(count + " passengers.");
        count = 0;
    }
    public void preVAN(Element element) {
        System.out.print("A van with ");
    }
    public void postVAN(Element element) {
        System.out.println(count + " passengers.");
        count = 0;
    }
    public void prePASSENGER(Element element) {
        count++;
    }
}

Back to Article

Listing Four

public class RecordToEnglishDelegate2 implements Hexable {
    int count = 0;
    Hex hex;
    public void setHex(Hex hex) {
        this.hex = hex;
    }
    public void preRECORD(Element element) {
        String date = element.getAttribute("DATE");
        System.out.println("On " + date);
    }
    public void prePASSENGER(Element element) {
        count++;
    }
    public void postRECORD(Element element) {
        System.out.println("There were " + count + " passengers in total.");
        count = 0;
        hex.setMode("pass2");
        hex.redoFromStart();
    }
    public void prepass2CAR(Element element) {
        System.out.print("A car with ");
    }
    public void postpass2CAR(Element element) {
        System.out.println(count + " passengers.");
        count = 0;
    }
    public void prepass2VAN(Element element) {
        System.out.print("A van with ");
    }
    public void postpass2VAN(Element element) {
        System.out.println(count + " passengers.");
        count = 0;
    }
    public void prepass2PASSENGER(Element element) {
        count++;
    }
}

Back to Article

Listing Five

public class CarVanCounter extends CarCountingDelegate {
    public int vanCount = 0;
    public void prePASSENGER_VAN(Element element) {
        vanCount++;
    }
}

Back to Article


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.