
Eric Bruno

Dr. Dobb's Bloggers

Easy DOM Parsing in Java

July 25, 2011

There are a few ways to parse XML in Java:

  • SAX parser: An event-based sequential access parser API that only operates on portions of the XML document at any one time.

  • DOM parser: The Document Object Model parser is a hierarchy-based parser that creates an object model of the entire XML document, then hands that model to you to work with.

  • JAXB: The Java Architecture for XML Binding maps Java classes to XML documents and allows you to operate on the XML in a more natural way.

  • String operations: I've seen some people, due to performance or memory constraints, actually perform String operations on a loaded XML document to manually find bits of information within the XML as a String; for instance, using the String class's indexOf and other built-in methods. This is not a scalable or reusable solution.
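To make the "event-based" description of SAX concrete, here is a minimal sketch using the standard JAXP SAX API. The class name SaxSketch and the method elementNames are made up for illustration. The parser calls startElement once per opening tag as it reads, and no document tree is kept in memory:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class name, used only for this illustration
class SaxSketch {

    // Returns the element names in document order, comma-separated,
    // as reported by the parser's start-element events.
    static String elementNames(String xml) {
        final List<String> names = new ArrayList<String>();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                // Called once for each opening tag as the parser reaches it
                names.add(qName);
            }
        };

        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return String.join(",", names);
    }

    public static void main(String[] args) {
        System.out.println(elementNames("<Company><Name>My Company</Name></Company>"));
    }
}
```

With SAX you only see each piece as it streams past, so any state you need (such as "am I inside an Executive element?") has to be tracked by your own handler.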

In my experience, the most popular way to work with XML is to use the DOM parser. With the DOM, the XML is broken down into three main kinds of pieces:

  1. Elements (sometimes called tags)

  2. Attributes

  3. The data (also called values) that the elements and attributes describe

Conceptually, this is simple enough, and most people choose DOM parsing for this very reason. However, when parsing XML, traversing the Document Object Model (DOM) is not always easy. I typically include a set of easy-to-use methods, such as getNode and getNodeValue (shown below), to help me pull data from a parsed XML document. This saves me from rewriting all of the otherwise recursive code to traverse a nested XML hierarchy. Before we dive into the code, I'll give a quick overview of what you need to prepare for when processing XML in Java code.

Look at the XML document below as an example:

    <?xml version="1.0" encoding="UTF-8" ?>
    <Company>
        <Name>My Company</Name>
        <Executive type="CEO">
            <LastName>Smith</LastName>
            <FirstName>Jim</FirstName>
            <street>123 Main Street</street>
            <city>Mytown</city>
            <state>NY</state>
            <zip>11234</zip>
        </Executive>
    </Company>

An element's tag is always enclosed in "<" and ">" brackets, and its name can be any valid XML name, such as <Company>. Attributes are additional name/value pairs placed within an element's brackets, but after the element's tag name, such as <Executive type="CEO">. The attribute name is always followed by an equals sign (=), and then the value in quotes. An element can contain zero or more attributes, where each attribute name/value pair is separated by whitespace. Together, elements and attributes make up what is called metadata, which is data that describes data.

When parsing XML via a DOM parser, each of the three important parts of the XML structure (elements, attributes, and the data) is represented by the Node interface. To process this XML in a meaningful way, you need to create a series of nested loops that start from the document's root node and recursively navigate through the child nodes, then each child node's children, and so on. Then, when you've found the node by name, you need to check the type of each related node to be sure you're reading an attribute or a value. For instance, the node data (or value) is a child node with the type Node.TEXT_NODE, while an attribute has the type Node.ATTRIBUTE_NODE and is reached through the element's getAttributes method rather than through its list of child nodes.
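The recursive traversal described above can be sketched roughly as follows. This is only an illustration, not the helper code from this article: the class name DomWalk and its methods are invented for the example, and it uses the standard JAXP DocumentBuilderFactory to parse from a string. It prints an indented outline and shows where each node type turns up:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this illustration
class DomWalk {

    // Parses an XML string and returns an indented outline of its nodes.
    static String outline(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            walk(doc.getDocumentElement(), 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Records one node, then recurses into its children.
    private static void walk(Node node, int depth, StringBuilder out) {
        StringBuilder pad = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            pad.append("  ");
        }
        String indent = pad.toString();

        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.append(indent).append("element ").append(node.getNodeName()).append('\n');
            // Attributes hang off the element; they never appear in getChildNodes()
            NamedNodeMap attrs = node.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i);
                out.append(indent).append("  attribute ")
                   .append(attr.getNodeName()).append('=')
                   .append(attr.getNodeValue()).append('\n');
            }
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                // Skip the whitespace-only text nodes that sit between elements
                out.append(indent).append("text ").append(text).append('\n');
            }
        }

        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), depth + 1, out);
        }
    }

    public static void main(String[] args) {
        System.out.print(outline(
            "<Company><Name>My Company</Name>"
          + "<Executive type=\"CEO\"><LastName>Smith</LastName></Executive></Company>"));
    }
}
```

Note that indentation and line breaks in an XML file also show up as text nodes, which is why the sketch filters out blank text.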

Here are the helper methods I use most often:


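    // Note: this DOMParser is an internal Xerces class bundled with the JDK,
    // not a public API, so it may not be accessible on newer, modular JDKs
    import com.sun.org.apache.xerces.internal.parsers.DOMParser;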
    import org.w3c.dom.Document;
    import org.w3c.dom.NamedNodeMap;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    // ...

    // Returns the first node in the list with the given name, or null if none matches.
    // Note: the name comparison ignores case, although XML names are case-sensitive.
    protected Node getNode(String tagName, NodeList nodes) {
        for ( int x = 0; x < nodes.getLength(); x++ ) {
            Node node = nodes.item(x);
            if (node.getNodeName().equalsIgnoreCase(tagName)) {
                return node;
            }
        }

        return null;
    }

    // Returns the value of the node's first text child, or "" if it has none
    protected String getNodeValue( Node node ) {
        NodeList childNodes = node.getChildNodes();
        for (int x = 0; x < childNodes.getLength(); x++ ) {
            Node data = childNodes.item(x);
            if ( data.getNodeType() == Node.TEXT_NODE )
                return data.getNodeValue();
        }
        return "";
    }

    // Finds the first node named tagName in the list and returns the value
    // of its first text child, or "" if nothing is found
    protected String getNodeValue(String tagName, NodeList nodes ) {
        for ( int x = 0; x < nodes.getLength(); x++ ) {
            Node node = nodes.item(x);
            if (node.getNodeName().equalsIgnoreCase(tagName)) {
                NodeList childNodes = node.getChildNodes();
                for (int y = 0; y < childNodes.getLength(); y++ ) {
                    Node data = childNodes.item(y);
                    if ( data.getNodeType() == Node.TEXT_NODE )
                        return data.getNodeValue();
                }
            }
        }
        return "";
    }

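    // Returns the value of the named attribute on the given node, or "" if not found
    protected String getNodeAttr(String attrName, Node node ) {
        NamedNodeMap attrs = node.getAttributes();
        // getAttributes() returns null for any node that is not an element
        if ( attrs == null )
            return "";
        for (int y = 0; y < attrs.getLength(); y++ ) {
            Node attr = attrs.item(y);
            if (attr.getNodeName().equalsIgnoreCase(attrName)) {
                return attr.getNodeValue();
            }
        }
        return "";
    }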

    // Finds the first node named tagName in the list and returns the value
    // of its attrName attribute, or "" if nothing is found
    protected String getNodeAttr(String tagName, String attrName, NodeList nodes ) {
        for ( int x = 0; x < nodes.getLength(); x++ ) {
            Node node = nodes.item(x);
            if (node.getNodeName().equalsIgnoreCase(tagName)) {
                // Attributes are never child nodes; read them via getAttributes(),
                // which returns null for any node that is not an element
                NamedNodeMap attrs = node.getAttributes();
                if ( attrs == null )
                    continue;
                for (int y = 0; y < attrs.getLength(); y++ ) {
                    Node attr = attrs.item(y);
                    if ( attr.getNodeName().equalsIgnoreCase(attrName) )
                        return attr.getNodeValue();
                }
            }
        }

        return "";
    }
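One caveat worth illustrating: getNodeValue returns only the first text child, so a value that is split by a comment or that contains a CDATA section comes back truncated. The sketch below contrasts that logic with the standard DOM method Node.getTextContent. The class and method names here are invented for the example, and it assumes a default-configured JAXP parser:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this illustration
class TextContentSketch {

    private static Node rootOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Same approach as the getNodeValue helper: stop at the first text child.
    static String firstTextOfRoot(String xml) {
        NodeList children = rootOf(xml).getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                return child.getNodeValue();
            }
        }
        return "";
    }

    // Standard DOM call: concatenates all text and CDATA beneath the node,
    // skipping comments and processing instructions.
    static String fullTextOfRoot(String xml) {
        return rootOf(xml).getTextContent();
    }

    public static void main(String[] args) {
        String xml = "<Name>My <!-- legal name -->Company</Name>";
        System.out.println("[" + firstTextOfRoot(xml) + "]");
        System.out.println("[" + fullTextOfRoot(xml) + "]");
    }
}
```

For simple documents like the sample above, both approaches return the same string. The difference only shows up once comments, CDATA sections, or nested markup appear inside a value.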

To use these helper methods, simply create a DOMParser instance, provide it with the path and name of your XML document, navigate to the proper place in the XML hierarchy, and call getNodeValue (or getNodeAttr) for each data item you want to pull out, as shown in the sample code below:

        try {
            DOMParser parser = new DOMParser();
            parser.parse("mydocument.xml");
            Document doc = parser.getDocument();

            // Get the document's root XML node
            NodeList root = doc.getChildNodes();

            // Navigate down the hierarchy to get to the CEO node
            Node comp = getNode("Company", root);
            Node exec = getNode("Executive", comp.getChildNodes() );
            String execType = getNodeAttr("type", exec);

            // Load the executive's data from the XML
            NodeList nodes = exec.getChildNodes();
            String lastName = getNodeValue("LastName", nodes);
            String firstName = getNodeValue("FirstName", nodes);
            String street = getNodeValue("street", nodes);
            String city = getNodeValue("city", nodes);
            String state = getNodeValue("state", nodes);
            String zip = getNodeValue("zip", nodes);

            System.out.println("Executive Information:");
            System.out.println("Type: " + execType);
            System.out.println(lastName + ", " + firstName);
            System.out.println(street);
            System.out.println(city + ", " + state + " " + zip);
        }
        catch ( Exception e ) {
            e.printStackTrace();
        }
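If you would rather not depend on the internal Xerces DOMParser class, the same document can be loaded through the standard JAXP DocumentBuilderFactory. The sketch below is one possible variation, not part of the helper code above: the class name ExecutiveLister and its methods are made up for the example. It also loops over every Executive element, which covers documents that contain more than one:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this illustration
class ExecutiveLister {

    // Returns one "type: LastName, FirstName" line per <Executive> element.
    static String listExecutives(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            StringBuilder out = new StringBuilder();
            // Finds every Executive element anywhere in the document (case-sensitive)
            NodeList execs = doc.getElementsByTagName("Executive");
            for (int i = 0; i < execs.getLength(); i++) {
                Element exec = (Element) execs.item(i);
                out.append(exec.getAttribute("type")).append(": ")
                   .append(childText(exec, "LastName")).append(", ")
                   .append(childText(exec, "FirstName")).append('\n');
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Text of the first descendant element with the given name, or "" if absent.
    private static String childText(Element parent, String tagName) {
        NodeList matches = parent.getElementsByTagName(tagName);
        return matches.getLength() == 0 ? "" : matches.item(0).getTextContent();
    }

    public static void main(String[] args) {
        System.out.print(listExecutives(
            "<Company>"
          + "<Executive type=\"CEO\"><LastName>Smith</LastName><FirstName>Jim</FirstName></Executive>"
          + "<Executive type=\"CFO\"><LastName>Jones</LastName><FirstName>Ann</FirstName></Executive>"
          + "</Company>"));
    }
}
```

To read from a file instead of a string, DocumentBuilder also has a parse overload that takes a java.io.File. Keep in mind that getElementsByTagName searches all descendants, so if the same element name appears in different sections of the document, you need to start the search from the right parent element.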

I realize that many of you are probably already XML parsing veterans, but I'm sure there are some newbies as well. Whether you're experienced or not, I hope you find these helper methods, well, helpful.

Happy coding!
-EJB

More on this theme: Helper Methods for Writing XML in Java.


Comments:

ubm_techweb_disqus_sso_-081d93e28d6d8ba98f857aadbdaf0b00
2014-03-05T18:54:06

Hello! Can I get the complete code for generating DOM trees from any web page? I am in an urgent need of it.


ubm_techweb_disqus_sso_-34f700e6633df0e79c29b668604cddcb
2013-09-25T17:16:57

How to use these helper functions if I have more than one Executive Nodes in the XML Document?


ubm_techweb_disqus_sso_-06a6faddc430162ab6c827d900667643
2013-08-05T14:02:54

I'm not sure if it has improved. However, I often do the same as you suggest here, and walk through XML. This doesn't scale, however, but it is an option in some cases.


ubm_techweb_disqus_sso_-e19cc66ff78a2deb7d88c73f4d5adff6
2013-03-21T18:50:36

I found that early versions of Java had performance issues with the DOM XML (I was mostly using XPath). It was enough that I switched to doing my own tree walking operations to get a 2 to 3 times performance increase. Have the newer versions done anything to change this?


ubm_techweb_disqus_sso_-46c85115c2f44b77f6279276ef1f8deb
2013-03-21T18:32:50

I find this article deeply disturbing in the number of bad practices that it naively encourages:

* Completely ignores namespaces throughout.
* Uses equalsIgnoreCase when comparing names, even though XML names are supposed to be case-sensitive.
* Uses internal classes of the Xerces implementation instead of the standard JAXP methods.
* getNodeValue() stops at the first text node. Even for simple content there could be multiple text nodes, embedded comments, etc., not to mention CDATA sections, entity references, etc.

I cringe every time I see these sort of mistakes in the wild and am very disappointed to find them in a Dr. Dobbs article.


ubm_techweb_disqus_sso_-98056a731369462740fd6dff664ef640
2013-03-18T20:56:28

What happens when the xml contains the same element name in different sections of the XML?

