Eric Bruno

Dr. Dobb's Bloggers

Easy DOM Parsing in Java

July 25, 2011

There are a few ways to parse XML in Java:

  • SAX parser: An event-based sequential access parser API that only operates on portions of the XML document at any one time.

  • DOM parser: The Document Object Model parser is a hierarchy-based parser that creates an object model of the entire XML document, then hands that model to you to work with.

  • JAXB: The Java Architecture for XML Binding maps Java classes to XML documents and allows you to operate on the XML in a more natural way.

  • String operations: I've seen some people, due to performance or memory constraints, actually perform String operations on a loaded XML document to manually find bits of information within the XML as a String; for instance, using the String class's indexOf and other built-in methods. This is not a scalable or reusable solution.
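
For contrast with the DOM approach discussed in the rest of this post, here is a minimal sketch of event-based parsing with the JDK's standard SAX API. The class name SaxSketch and the method elementNames are made up for illustration; the handler only records element names as the parser reports them, so no tree is ever built in memory.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class name; a sketch only, not a complete SAX application.
final class SaxSketch {

    // Returns the names of all elements in document order, as reported by SAX events.
    static List<String> elementNames(String xml) {
        final List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                names.add(qName); // called once for each opening tag
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Wrapped so callers of this sketch need no checked-exception handling
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<Company><Name>My Company</Name><Executive type=\"CEO\"/></Company>";
        System.out.println(elementNames(xml)); // prints [Company, Name, Executive]
    }
}
```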

In my experience, the most popular way to work with XML is to use the DOM parser. With the DOM, the XML is broken down into three main kinds of pieces:

  1. Elements (sometimes called tags)

  2. Attributes

  3. The data (also called values) that the elements and attributes describe

Conceptually, this is simple enough, and most people choose DOM parsing for this very reason. However, when parsing XML, traversing the Document Object Model (DOM) is not always easy. I typically include a set of easy-to-use methods, such as getNode and getNodeValue (shown below), to help me pull data from a parsed XML document. This saves me from rewriting all of the otherwise recursive code to traverse a nested XML hierarchy. Before we dive into the code, I'll give a quick overview of what you need to prepare for when processing XML in Java code.

Look at the XML document below as an example:

    <?xml version="1.0" encoding="UTF-8" ?>
    <Company>
        <Name>My Company</Name>
        <Executive type="CEO">
            <LastName>Smith</LastName>
            <FirstName>Jim</FirstName>
            <street>123 Main Street</street>
            <city>Mytown</city>
            <state>NY</state>
            <zip>11234</zip>
        </Executive>
    </Company>

An element's tag is always enclosed in "<" and ">" brackets, and its name can be any valid XML name, such as <Company>. Attributes are additional name/value pairs placed within an element's brackets, but after the element's tag name, such as <Executive type="CEO">. The attribute name is always followed by an equals sign (=), and then the value in quotes. An element can contain zero or more attributes, where each attribute name/value pair is separated by whitespace. Elements and attributes, themselves, make up what is called metadata, which is data that describes data.

When parsing XML via a DOM parser, each of the three important parts of the XML structure (elements, attributes, and the data) is represented by the Node interface. To process this XML in a meaningful way, you need to create a series of nested loops that start from the document's root node, and recursively navigate through the child nodes, then each child node's children, and so on. Then, when you've found the node by name, you need to check its child nodes and their types to read its value, or look in its attribute map (returned by getAttributes) to read an attribute. For instance, the node data (or value) has the type Node.TEXT_NODE, while an attribute has the type Node.ATTRIBUTE_NODE and is never returned as a child node.
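
To make the traversal described above concrete, here is a rough sketch that walks a parsed document recursively and reports each element, attribute, and text node it meets. It uses the standard JAXP DocumentBuilderFactory to parse a string; the class name DomWalkSketch and the methods describe and walk are hypothetical names chosen for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name; a sketch of recursive DOM traversal.
final class DomWalkSketch {

    // Parses the XML and returns an indented listing of its nodes.
    static String describe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            walk(doc.getDocumentElement(), 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    private static void walk(Node node, int depth, StringBuilder out) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            indent.append("  ");
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.append(indent).append("element ").append(node.getNodeName()).append('\n');
            // Attributes come from the attribute map, not from getChildNodes()
            NamedNodeMap attrs = node.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i);
                out.append(indent).append("  attribute ").append(attr.getNodeName())
                   .append('=').append(attr.getNodeValue()).append('\n');
            }
            // Recurse into the children: nested elements and text
            NodeList children = node.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                walk(children.item(i), depth + 1, out);
            }
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) { // skip whitespace-only text between tags
                out.append(indent).append("text ").append(text).append('\n');
            }
        }
    }

    public static void main(String[] args) {
        System.out.print(describe(
            "<Executive type=\"CEO\"><LastName>Smith</LastName></Executive>"));
    }
}
```

Note that whitespace between tags shows up as text nodes too, which is why the sketch trims and skips empty text.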

Here are the helper methods I use most often:


    // Note: this DOMParser is the JDK's internal copy of Xerces, not a supported public API
    import com.sun.org.apache.xerces.internal.parsers.DOMParser;
    import org.w3c.dom.Document;
    import org.w3c.dom.NamedNodeMap;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    // ...

    protected Node getNode(String tagName, NodeList nodes) {
        for ( int x = 0; x < nodes.getLength(); x++ ) {
            Node node = nodes.item(x);
            if (node.getNodeName().equalsIgnoreCase(tagName)) {
                return node;
            }
        }

        return null;
    }

    protected String getNodeValue( Node node ) {
        NodeList childNodes = node.getChildNodes();
        for (int x = 0; x < childNodes.getLength(); x++ ) {
            Node data = childNodes.item(x);
            if ( data.getNodeType() == Node.TEXT_NODE )
                return data.getNodeValue();
        }
        return "";
    }

    protected String getNodeValue(String tagName, NodeList nodes ) {
        for ( int x = 0; x < nodes.getLength(); x++ ) {
            Node node = nodes.item(x);
            if (node.getNodeName().equalsIgnoreCase(tagName)) {
                NodeList childNodes = node.getChildNodes();
                for (int y = 0; y < childNodes.getLength(); y++ ) {
                    Node data = childNodes.item(y);
                    if ( data.getNodeType() == Node.TEXT_NODE )
                        return data.getNodeValue();
                }
            }
        }
        return "";
    }

    protected String getNodeAttr(String attrName, Node node ) {
        NamedNodeMap attrs = node.getAttributes();
        // getAttributes() returns null for anything that is not an element
        if ( attrs == null )
            return "";
        for (int y = 0; y < attrs.getLength(); y++ ) {
            Node attr = attrs.item(y);
            if (attr.getNodeName().equalsIgnoreCase(attrName)) {
                return attr.getNodeValue();
            }
        }
        return "";
    }

    protected String getNodeAttr(String tagName, String attrName, NodeList nodes ) {
        for ( int x = 0; x < nodes.getLength(); x++ ) {
            Node node = nodes.item(x);
            if (node.getNodeName().equalsIgnoreCase(tagName)) {
                // Attributes are not child nodes; they live in the element's attribute map
                NamedNodeMap attrs = node.getAttributes();
                if ( attrs == null )
                    continue;
                for (int y = 0; y < attrs.getLength(); y++ ) {
                    Node attr = attrs.item(y);
                    if ( attr.getNodeName().equalsIgnoreCase(attrName) )
                        return attr.getNodeValue();
                }
            }
        }

        return "";
    }
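
The helpers above search only the one NodeList they are handed, so the caller still navigates level by level. If you would rather search a whole subtree, a recursive variant is one option. The sketch below is an assumption on my part, not one of the original helpers: findNode and FindNodeSketch are hypothetical names, and unlike the helpers above it matches names case-sensitively, as XML itself does.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name; a sketch of a depth-first search for a descendant element.
final class FindNodeSketch {

    // Returns the first descendant element of 'start' with the given name, or null.
    static Node findNode(String tagName, Node start) {
        NodeList children = start.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip text, comments, and so on
            }
            if (child.getNodeName().equals(tagName)) {
                return child;
            }
            Node found = findNode(tagName, child); // look deeper before moving on
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    // Small parsing helper so the sketch is self-contained.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<Company><Executive type=\"CEO\"><zip>11234</zip></Executive></Company>");
        Node zip = findNode("zip", doc);
        System.out.println(zip == null ? "not found" : zip.getTextContent());
    }
}
```

The standard DOM API also offers getElementsByTagName on Document and Element, which returns every matching descendant; a hand-written search like this is mainly useful when you want only the first match or need custom matching rules.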

To use these methods, simply create a DOMParser instance, provide it with the path and name of your XML document, navigate to the proper place in the XML hierarchy, and call getNodeValue (or getNodeAttr) for each data item you want to pull out, as shown in the sample code below:

        try {
            DOMParser parser = new DOMParser();
            parser.parse("mydocument.xml");
            Document doc = parser.getDocument();

            // Get the document's top-level nodes, which include the root element
            NodeList root = doc.getChildNodes();

            // Navigate down the hierarchy to get to the CEO node
            Node comp = getNode("Company", root);
            Node exec = getNode("Executive", comp.getChildNodes() );
            String execType = getNodeAttr("type", exec);

            // Load the executive's data from the XML
            NodeList nodes = exec.getChildNodes();
            String lastName = getNodeValue("LastName", nodes);
            String firstName = getNodeValue("FirstName", nodes);
            String street = getNodeValue("street", nodes);
            String city = getNodeValue("city", nodes);
            String state = getNodeValue("state", nodes);
            String zip = getNodeValue("zip", nodes);

            System.out.println("Executive Information:");
            System.out.println("Type: " + execType);
            System.out.println(lastName + ", " + firstName);
            System.out.println(street);
            System.out.println(city + ", " + state + " " + zip);
        }
        catch ( Exception e ) {
            e.printStackTrace();
        }
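
The DOMParser class used above comes from a com.sun internal package, which newer JDKs may not let application code import. As a hedged alternative, the same Document can be obtained through the standard JAXP classes in javax.xml.parsers. The sketch below parses a string so that it is self-contained; JaxpLoadSketch and loadXml are hypothetical names, and for a file on disk DocumentBuilder also has a parse overload that takes a java.io.File.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical class name; a sketch of loading a DOM Document via standard JAXP.
final class JaxpLoadSketch {

    // Parses XML text into a DOM Document using the public JAXP API.
    static Document loadXml(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            // For a file, builder.parse(new java.io.File("mydocument.xml")) works as well
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = loadXml(
            "<Company><Name>My Company</Name><Executive type=\"CEO\"/></Company>");

        // The resulting Document can be passed to the helper methods shown earlier,
        // for example via doc.getChildNodes()
        Element exec = (Element) doc.getElementsByTagName("Executive").item(0);
        System.out.println("Type: " + exec.getAttribute("type"));
    }
}
```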

I realize that many of you are probably already XML parsing veterans, but I'm sure there are some newbies as well. Whether you're experienced or not, I hope you find these helper methods, well, helpful.

Happy coding!
-EJB

More on this theme: Helper Methods for Writing XML in Java.
