Transforming XML & the REXML Pull Parser

January, 2006: Transforming XML & the REXML Pull Parser

James is a software developer in Scottsdale, Arizona, and a principal in the consulting company 30 Second Rule. He can be reached at jamesgb@neurogami.com.


The canonical way to transform XML documents is to use XSLT. Both XML and XSLT are specifications issued by the W3C, and XSLT is itself expressed in XML. There is a convenient consistency in this, but while XML syntax may work well for defining document structure and content, it may not be a good fit for programming document transformation logic. So although XSLT has the official stamp of approval from the W3C, it may not always be the best choice for transforming XML. This is more likely to be true if you are a developer for whom XML is simply one more data format to be manipulated by a conventional programming language. In this article, I use Ruby and its built-in XML pull parser to present an alternative approach to XML transformations. In the process, I briefly examine three types of XML parsers, take a look at Ruby's pull-parser API, and finally present an application that reads the XML content files of OpenOffice.org documents and transforms them into an RSS 1.0 feed.

XML Parsers

Perhaps the best known XML parser is the DOM parser, so-named because it exposes content and structure through a Document Object Model (DOM) API. In this approach, an XML document is treated as a tree of nodes. Each node may be of one type or another, the most common being element and attribute nodes. Once a document has been parsed, programs may read or alter the document by selecting, adding, or modifying nodes. The DOM API is defined by a W3C specification, and many DOM parsers provide a means for validating a document against a schema or DTD.

A DOM parser is particularly good when the source document is relatively small. It is perhaps the simplest parser to work with, as many popular implementations allow the selection of specific nodes using XPath queries. Because the entire document is held in memory, it is often easier to do conditional selection than when using stream-based parsers. However, as document size increases, memory and speed requirements may become burdensome.
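As a rough illustration of the DOM approach, the sketch below uses the tree parser from REXML (the Ruby library discussed later in this article) to load a small document, select a node with an XPath query, and modify it. The sample XML and the variable names are made up for the example.

```ruby
require "rexml/document"

# A small, made-up document; the whole tree is held in memory.
xml = "<person><name type='full'>John Doe</name></person>"
doc = REXML::Document.new( xml )

# Select a node with an XPath query, then read an attribute and its text.
name = REXML::XPath.first( doc, "//name[@type='full']" )
puts name.attributes["type"]
puts name.text

# Modify the tree in place and serialize it again.
name.text = "Jane Doe"
puts doc.to_s
```

Because the full tree is available, the query can reach any node at any time, which is exactly what the stream-based parsers described next give up.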

Shortcomings of the DOM parser, such as the requirement for loading the entire document into memory, led to the creation of the Simple API for XML (SAX). SAX is not a formal specification, though the XML community treats the Java implementation as the authoritative reference. While the DOM API has a nice theoretical crispness, it can be awkward in practice. In contrast, SAX defines relatively little, but provides a clean framework on which to build application-specific logic.

Rather than see an XML document as a tree of nodes, SAX treats the XML as a series of events. For example, a SAX parser reading this document:

<person><!-- Sample document -->
<name type='full'>John Doe</name></person>

would report the events: start of document, start of element, comment, start of element, text, end of element, end of element, end of document.

With each event, a SAX parser invokes a method with a corresponding name (for example, in Java, the start of an element would trigger a call to startElement). You must specify the code for each of the possible event methods.
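The callback style can be sketched in Ruby with REXML's stream parser, which this article later describes as following the SAX API with Ruby naming conventions. The EventLogger class is a hypothetical name used only here, and the sample document is written on one line so that no whitespace-only text events appear.

```ruby
require "rexml/document"
require "rexml/streamlistener"

# A listener that records each callback the parser invokes.
class EventLogger
  include REXML::StreamListener
  attr_reader :events

  def initialize
    @events = []
  end

  def tag_start( name, attrs )
    @events << "start of element: #{name}"
  end

  def tag_end( name )
    @events << "end of element: #{name}"
  end

  def text( text )
    @events << "text: #{text}"
  end

  def comment( comment )
    @events << "comment"
  end
end

xml = "<person><!-- Sample document --><name type='full'>John Doe</name></person>"
logger = EventLogger.new

# The parser drives the process; the listener only reacts.
REXML::Document.parse_stream( xml, logger )
puts logger.events
```

Any callback the listener does not define falls back to the empty default supplied by the StreamListener module.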

A SAX parser does not hold on to much state, reading in just enough of the source XML to construct the next event. Once the next event occurs, the previous content is discarded. As a result, SAX parsers do not require much memory. Your application can happily parse multigigabyte XML files, with the primary limitation being time, not space.

The downside to this, though, is that if your application needs to track state, then you must manage it yourself. If you expect that you will often need to track much of the source document, you may be better off with a DOM parser. If, instead, your application is mainly interested in a specific subset of a document (and especially if this document is relatively large), then SAX may be ideal.

A pull parser is similar to a SAX parser—minimal memory consumption, event based, constrained access to the whole document—but it hands control to the application logic, while the invocation of each event in a SAX parser is driven by the parser. SAX code reacts to events, while pull-parser code requests events.
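The difference in control flow may be easier to see in code. In this minimal sketch, which uses the REXML pull parser introduced in the next section, the application requests events one at a time and simply stops asking once it has what it needs. The sample XML and variable names are illustrative only.

```ruby
require "rexml/parsers/pullparser"

xml = "<person><!-- Sample document --><name type='full'>John Doe</name></person>"
parser = REXML::Parsers::PullParser.new( xml )

# The application asks for events and decides when to stop.
full_name = nil
while parser.has_next?
  event = parser.pull
  if event.start_element? && event[0] == "name"
    # Request one more event: the text inside the name element.
    full_name = parser.pull[0]
    break   # the remaining events are never requested
  end
end

puts full_name
```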

The Ruby XML Pull Parser

As of Version 1.8, the Ruby standard library includes Sean Russell's REXML, a pure-Ruby XML parser (http://www.germane-software.com/software/rexml/). REXML began life as an independent library inspired by Java's Electric XML parser (hence the name, Ruby Electric XML). What made Electric XML different was that it implemented not the W3C DOM, but an API that would be natural and intuitive to any Java developer.

In the same vein, REXML implements a DOM API consistent with Ruby itself. If you are familiar with Ruby, then using REXML to parse and manipulate XML will be second nature. As is common in Ruby, REXML DOM methods typically use built-in iterators, accept blocks, and follow snake_style (rather than lowerCamelCase) naming conventions.
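A short sketch of what that Ruby flavor looks like in practice, using REXML's tree API. The document content is invented for the example.

```ruby
require "rexml/document"

xml = "<people><person id='1'>Ann</person><person id='2'>Bob</person></people>"
doc = REXML::Document.new( xml )

# Iterate over matching elements with a block rather than an index loop.
names = []
doc.elements.each( "people/person" ) do |person|
  names << "#{person.attributes['id']}: #{person.text}"
end
puts names

# Method names follow Ruby conventions, such as add_element and each_element.
doc.root.add_element( "person", { "id" => "3" } ).text = "Cy"
doc.root.each_element { |person| puts person.text }
```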

In addition to a DOM parser, REXML also includes a stream parser and a pull parser. The stream parser essentially follows the SAX API, albeit with Ruby naming conventions.

Both the DOM and stream parsers are built on top of a base pull parser. The public pull-parser API largely delegates to the base parser, with a few added convenience methods. As such, the REXML pull parser is the most stable and robust of the available REXML APIs.

Examples

The REXML pull parser has a straightforward API. There are two main classes, PullParser and PullEvent. You create a parser instance by invoking PullParser.new, passing in an XML source. This source can be a string, an I/O handle, or any object that implements the REXML Source API. Using that last object, you can implement a source wrapper around arbitrary data sources, such as a database. The examples I focus on here use strings and files. Listing One (content1.xml) is the initial sample source file, while Listing Two (listing1.rb) is a program that reads this file and emits event information.

The primary parser method is pull; it fetches enough characters from the XML source to assemble and return a PullEvent object. Parsing XML source consists of repeatedly calling the pull method until either the code reaches the end of the source or the application's needs are satisfied. A simple way to do this is to use the each method to iterate over all events. The call to each takes a "block"—a chunk of code that is executed for each item in the iteration.
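As a small sketch of the each form, the code below iterates over the events from a string source and then from an I/O source. StringIO stands in for an open file so the example needs no external file, and the XML is made up.

```ruby
require "rexml/parsers/pullparser"
require "stringio"

xml = "<doc><title>Example</title></doc>"

# A string source, iterated with each and a block.
from_string = []
REXML::Parsers::PullParser.new( xml ).each do |event|
  from_string << event.event_type
end

# An I/O source; StringIO plays the role of a file handle here.
from_io = []
REXML::Parsers::PullParser.new( StringIO.new( xml ) ).each do |event|
  from_io << event.event_type
end

p from_string
p from_io
```

The block receives one PullEvent per iteration, just as the explicit has_next? and pull loop in Listing Two does.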

Listing Two reads in an XML file, printing the event type of each event returned by the parser. A PullEvent object has only a handful of methods; it's a simple wrapper around the details of a segment of XML, such as a doctype, the start tag of an element, or a comment.

The event_type method returns a Symbol object corresponding to one of the 15 possible events. A Ruby symbol behaves much like a lightweight string constant, which makes symbols quite useful for defining fixed event names. So, in this example, when the parser encounters the beginning of the sample XML, the first call to event_type returns the symbol :xmldecl.

The code in Listing Three (listing2.rb) checks whether an event is of a particular type by using one of the PullEvent Boolean methods, such as start_element?, comment?, and text?. Because there is one for each event type, your code can selectively operate on, say, events produced from the start of elements.

This example introduces the other part of the PullEvent API—the [ index ] method for retrieving event details. Think of a PullEvent as a data array with an assigned event type. The contents of that array vary with the type.

For both start_element and end_element events, the first item in the array is the element name. The second item for a start element event holds a hash of attribute name/value pairs. If the element has no attributes, then the hash will be empty.
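The following sketch pulls the three events produced by a single element and prints the indexed details of each. The one-line XML is invented for the example.

```ruby
require "rexml/parsers/pullparser"

xml = "<name type='full'>John Doe</name>"
parser = REXML::Parsers::PullParser.new( xml )

start_event = parser.pull
puts start_event.event_type
puts start_event[0]             # element name
puts start_event[1].inspect     # hash of attribute name/value pairs

text_event = parser.pull
puts text_event[0]              # the text content

end_event = parser.pull
puts end_event[0]               # element name again
```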

With this basic information, you can write code to do conditional markup selection. Listing Four (listing3.rb) selects only those events for the start of text:p elements whose text:style-name attribute is "Standard". The is_standard_text_p? method runs a series of checks against a given event, returning false as soon as a check fails. Because Ruby methods (with rare exceptions) return the value of the last expression evaluated, the final comparison supplies the method's result.

You can build on this code to do some XML transformation. In Listing Five (listing4.rb), the source file is transformed so that all text:p elements with a "Standard" style are renamed simply to "p". As before, this code loops over every pull event, using a case expression to switch among behaviors. The example is really only concerned with elements and text; all other events are passed on as comments, containing the textual dump of the event, provided by inspect.

Another method is added here to handle some of the reconstruction. The method attrs_to_s( attrs ) takes a hash and converts it to a string of attribute name and value pairs. Calling to_a on a Hash object converts it into a set of nested arrays. Each inner array holds the key and corresponding value from the original hash. The map method then replaces each inner array with a "name='value' " string; the call to join just adds them up into one space-delimited string.
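The three steps can be tried on their own with a plain hash. The attribute names and values below are made up for illustration.

```ruby
attrs = { "text:style-name" => "Standard", "id" => "p1" }

pairs   = attrs.to_a                                       # nested [name, value] arrays
strings = pairs.map { |attr| "#{attr[0]}='#{attr[1]}'" }   # one string per pair
joined  = strings.join( " " )                              # space-delimited result

puts joined   # text:style-name='Standard' id='p1'
```

Note that this approach does no escaping, so an attribute value that itself contains a single quote would need extra handling.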

As with the previous examples, Listing Five loops over all the pull events, and builds up a new XML string in the variable results. Aside from the new helper method, the code also introduces an array to track element names. When running a conditional examination of events based on attribute values, there is a problem: Only the start tag has the information needed for the conditional logic. But the corresponding end tag must be transformed as well. A simple way to track this is to push and pop element names in and out of a stack. Ruby arrays implement these methods by default, so we get our stack object for free.
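To isolate the stack idea from the parsing, this sketch replaces real pull events with hypothetical, simplified arrays of the form [type, name, style]. Only the push and pop logic mirrors the listing.

```ruby
# Hypothetical stand-ins for pull events: [type, name, style]
events = [
  [ :start, "text:p",    "Standard" ],
  [ :start, "text:span", "Citation" ],
  [ :end,   "text:span", nil ],
  [ :end,   "text:p",    nil ]
]

tag_stack = []
output    = String.new

events.each do |type, name, style|
  if type == :start
    # Decide the output name once, while the attributes are visible.
    tag_stack.push( name == "text:p" && style == "Standard" ? "p" : name )
    output << "<#{tag_stack.last}>"
  else
    # The end tag carries no attributes; the stack remembers the decision.
    output << "</#{tag_stack.pop}>"
  end
end

puts output   # <p><text:span></text:span></p>
```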

In the interest of keeping the examples straightforward, the output XML is constructed by appending strings. This works fine for small cases, but for larger, more complex XML production, you might prefer to use the node construction methods of the REXML DOM parser, or Jim Weirich's Builder, http://jimweirich.umlcoop.net/software/.

Using this basic model, you can construct transformation rules of assorted complexity, but packing the transformation logic inside of a case expression gets messy fast. An improvement is to use Ruby's dynamic nature to invoke methods based on the event types and characteristics. Ruby defines the send method, which takes a method name, followed by any arguments to that method, and invokes it. It is a nice way to call methods when the actual name is not known until runtime.
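A minimal sketch of send in isolation. The two handler methods are hypothetical and take a bare element name rather than a PullEvent, to keep the example self-contained.

```ruby
# Hypothetical handlers named after event types.
def start_element( name )
  "<#{name}>"
end

def end_element( name )
  "</#{name}>"
end

# The method to call is chosen at runtime from a symbol.
calls   = [ [ :start_element, "p" ], [ :end_element, "p" ] ]
results = calls.map { |type, name| send( type, name ) }

puts results.join   # <p></p>
```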

Listing Six (listing5.rb) invokes a method named after each type of pull event encountered. Because calling a nonexistent method raises an exception, all NoMethodError exceptions are quietly ignored.

This is a bit cleaner, but as the examples grow, the procedural coding style gets unwieldy. And while illustrative of the PullParser and PullEvent APIs, this is essentially following SAX-style processing. But the pull parser offers a chance to interact directly with the parser and event stack, which allows for some interesting processing options.

More Robust Transformation

You can take what I've presented so far and construct a set of libraries and application files that read in one or more OpenOffice files and return an RSS feed, with a feed item for each document. To start, the general logic of fetching a pull event and invoking a method needs to be moved from a simple loop and placed into a method; see Listing Seven (listing6.rb). This then becomes a part of a general-purpose transformation class. The PullTransformer (available electronically; see "Resource Center," page 4) reworks the previous examples by providing a simple but general framework for pulling events and invoking corresponding methods.

A new instance of the PullTransformer class is created by passing in a user-defined module. This module defines the logic for executing some particular transformation. Ruby modules are like classes, but they cannot be instantiated. They provide a means of inheritance by mix-in. When a PullTransformer object receives this module reference, all methods in the module become methods of that object by virtue of the call to self.extend.
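The mix-in mechanism can be sketched independently of the real PullTransformer, whose source is not reproduced here. ShoutingRules and TinyTransformer are hypothetical names invented for this example.

```ruby
# A hypothetical transformation module.
module ShoutingRules
  def text( content )
    content.upcase
  end
end

class TinyTransformer
  def initialize( rules )
    # Every method in the module becomes a method of this one object.
    self.extend( rules )
  end
end

transformer = TinyTransformer.new( ShoutingRules )
puts transformer.text( "hello" )                          # HELLO
puts TinyTransformer.instance_methods.include?( :text )   # false
```

The second line of output shows that extend affects only the individual object, so two transformer instances can carry different rule modules.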

The transform method takes an XML source, instantiates a pull parser, and initializes instance variables to track the tags and accumulated output. It then kicks off the main process by calling dispatch.

That chunk of code at the beginning of the class definition is a bit of Ruby magic. The %w{ ...} syntax creates an array of strings from the list of all the event types; each item in the array is passed to define_method, which is a built-in Ruby method for dynamically adding methods to the current class. The block of code following the call to define_method becomes the method body.
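Since the class itself is only available electronically, here is a small sketch of the same idiom under assumed names. DefaultHandlers is hypothetical, and the event list is abbreviated.

```ruby
class DefaultHandlers
  # One do-nothing method per event name; this list is abbreviated.
  %w{ start_element end_element text comment }.each do |name|
    define_method( name ) { |event| "" }
  end

  # A later definition replaces the generated default.
  def text( event )
    event.to_s
  end
end

handlers = DefaultHandlers.new
p handlers.comment( "ignored" )   # generated default returns an empty string
p handlers.text( "kept" )         # redefined method returns its argument
```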

Having a minimal default method corresponding to each event type removed the need for trapping NoMethodError exceptions. The class later redefines some of the methods on the assumption that, by default, elements and text should be passed through.

The use of the tag stack has been changed so that, should the transformation code want to ignore an element, it can push nil onto the stack. The class also adds another helper method, skip_until, that allows code to pull and ignore events until some condition (defined by a given block of code) is true. This is useful when the transformation code wants to drop parts of the source document, but still has to read past them to get to the remainder of the XML.
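One way such a helper might look is sketched below. This is a standalone guess that takes the parser as an argument, not the PullTransformer method itself, and the sample XML is invented.

```ruby
require "rexml/parsers/pullparser"

# Pull and discard events until the block returns true for one of them.
# Returns the matching event, or nil if the source runs out first.
def skip_until( parser )
  while parser.has_next?
    event = parser.pull
    return event if yield( event )
  end
  nil
end

xml = "<body><decls><decl/><decl/></decls><p>First paragraph</p></body>"
parser = REXML::Parsers::PullParser.new( xml )

# Read past the declarations, then on to the first paragraph.
skip_until( parser ) { |e| e.end_element? && e[0] == "decls" }
skip_until( parser ) { |e| e.start_element? && e[0] == "p" }
first_text = parser.pull[0]
puts first_text
```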

The transformation class also defines execute_conditional to steer the conditional transformation logic defined in the external module.

Defining the Transformation

Earlier examples looped over the event set, running a series of conditional checks against each event to determine what code to execute. While the loop logic was generic, the transformation code was not and should be kept apart. In the current example, the conditional logic and the corresponding transformation code are defined as methods in a module. A mapping of conditions to actions is defined in the map method, which pairs them in the @transformation array.

Condition and action methods are dynamically invoked by execute_conditional in PullTransformer. Each should expect to be passed a PullEvent object. Conditional methods should return true or false, and the order of entries defined in map is important because that is the order in which execute_conditional loops over the set, looking for the first condition that returns true.
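The first-match dispatch can be sketched as follows. Everything here is an assumption made for illustration: events are plain hashes rather than PullEvent objects, the mapping is returned by a hypothetical transformation_map method instead of living in @transformation, and the condition and action names are invented.

```ruby
# Hypothetical condition methods; each returns true or false.
def heading?( event )
  event[:type] == :start_element && event[:name] == "h"
end

def any_start?( event )
  event[:type] == :start_element
end

# Hypothetical action methods.
def open_title( event )
  "<title>"
end

def pass_through( event )
  "<#{event[:name]}>"
end

# Ordered pairs of [condition, action]; the first true condition wins.
def transformation_map
  [
    [ :heading?,   :open_title   ],
    [ :any_start?, :pass_through ]
  ]
end

def execute_conditional( event )
  transformation_map.each do |condition, action|
    return send( action, event ) if send( condition, event )
  end
  nil
end

heading   = execute_conditional( { type: :start_element, name: "h" } )
paragraph = execute_conditional( { type: :start_element, name: "p" } )
puts heading
puts paragraph
```

Swapping the two entries in the map would send every start element to pass_through, which is why the order of entries matters.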

A true condition invokes the mapped action method, passing in the current event object. Because pull events are no longer automatically retrieved in a loop, the action methods must decide whether to call dispatch to fetch and act on the next pull event. Just as the methods of the transformational module become part of the transformer instance, so too all methods and instance variables of PullTransformer are available to the module code. Conditional logic and action methods can call dispatch, skip_until, act on the pull parser, and so on. They also have access to the @transformation array, leaving open the opportunity to alter the transformation logic at runtime.

With this base library, you can define some modules for converting an OpenOffice.org Writer document into RSS. OpenOffice documents consist of multiple XML files bundled into a zip file. To create an RSS item, the code grabs some metadata from the meta.xml file, as well as the first paragraph of text from the content.xml.

There are various ways to do a single transformation on multiple source documents. One way would be to aggregate all the source files into a master file with a new root element. But a nice thing about using Ruby for transformation is that it is easy to call out to other code for subprocessing. In this example, the main transformation acts on content.xml, but the transformation logic loads and transforms meta.xml using another PullTransformer instance.

The main application uses a template file for the RSS 1.0 body XML (see rss.body.xml, available electronically). The code loads this template and loops over a directory of OOo documents. Each document is unzipped into a local temp directory, and content.xml is fed to a transformation process. The results are aggregated and substituted into the template body. The transformational module for content.xml is in content.rb (available electronically). The code ignores all elements by default, pushing nil onto the tag stack. The office:document-content element is transformed into an item element, and another instance of PullTransformer is used to get select content from meta.xml; see item.rb (available electronically). When the code encounters the office:body element, it starts a description element, then skips over everything until it finds the end of the text:sequence-decls section.

The example uses yet another handy REXML pull-parser method, peek, which looks at upcoming events without actually pulling them off the event stream. If the next event is not the end of the office:body element, then the code loops and tries to get the text from the first text:p element.
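A minimal sketch of peek on a made-up document: the peeked event is still delivered by the next call to pull.

```ruby
require "rexml/parsers/pullparser"

xml = "<body><p>First paragraph</p></body>"
parser = REXML::Parsers::PullParser.new( xml )

parser.pull                # consume the start of the body element

upcoming = parser.peek     # look at the next event without consuming it
puts upcoming.event_type
puts upcoming[0]

actual = parser.pull       # the same event is still returned by pull
puts actual[0]
```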

Conclusion

Pull-parser transformations offer the opportunity to manipulate large XML documents using familiar programming constructs, and with Ruby's REXML parser, it is easy to write flexible and dynamic transformation applications. Special thanks must go to Sean Russell, for creating REXML and providing technical review for this article. Any errors or omissions are the sole fault of this author.

DDJ



Listing One

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE office:document-content PUBLIC "-//OpenOffice.org//DTD 
                                 OfficeDocument 1.0//EN" "office.dtd">
<office:document-content 
xmlns:office="http://openoffice.org/2000/office" 
xmlns:style="http://openoffice.org/2000/style" 
xmlns:text="http://openoffice.org/2000/text" 
xmlns:table="http://openoffice.org/2000/table" 
xmlns:draw="http://openoffice.org/2000/drawing" 
xmlns:fo="http://www.w3.org/1999/XSL/Format" 
xmlns:xlink="http://www.w3.org/1999/xlink" 
xmlns:number="http://openoffice.org/2000/datastyle" 
xmlns:svg="http://www.w3.org/2000/svg" 
xmlns:chart="http://openoffice.org/2000/chart"
xmlns:dr3d="http://openoffice.org/2000/dr3d" 
xmlns:math="http://www.w3.org/1998/Math/MathML" 
xmlns:form="http://openoffice.org/2000/form" 
xmlns:script="http://openoffice.org/2000/script" 
office:class="text" office:version="1.0">
  <!-- An example content.xml file from an OpenOffice.org Writer document -->
 <office:script/>
 <office:font-decls>
  <style:font-decl style:name="Arial" fo:font-family="Arial"/>
  <style:font-decl style:name="Baskerville BE Regular"
    fo:font-family="&apos;Baskerville BE Regular&apos;, &apos;Times New Roman&apos;"/>
  <style:font-decl style:name="Lucidasans1" fo:font-family="Lucidasans"/>
  <style:font-decl style:name="Bitstream Vera Sans"
    fo:font-family="&apos;Bitstream Vera Sans&apos;" style:font-pitch="variable"/>
  <style:font-decl style:name="Lucidasans"
    fo:font-family="Lucidasans" style:font-pitch="variable"/>
  <style:font-decl style:name="Mincho" fo:font-family="Mincho"
    style:font-pitch="variable"/>
 </office:font-decls>
 <office:automatic-styles/>
 <office:body>
  <text:sequence-decls>
   <text:sequence-decl text:display-outline-level="0" text:name="Illustration"/>
   <text:sequence-decl text:display-outline-level="0" text:name="Table"/>
   <text:sequence-decl text:display-outline-level="0" text:name="Text"/>
   <text:sequence-decl text:display-outline-level="0" text:name="Drawing"/>
  </text:sequence-decls>
  <text:p text:style-name="Standard">This is a test document for 
     the Ooo4R project</text:p>
  <text:p text:style-name="Standard">This is the second line. It 
     has some <text:span text:style-name="Citation">text</text:span> 
     with special formatting.</text:p>
 </office:body>
 <!-- End of sample -->
</office:document-content>


Listing Two
#!/usr/bin/env ruby
require 'rexml/parsers/pullparser'

parser = REXML::Parsers::PullParser.new( IO.read( "content1.xml" ) )

while parser.has_next?
  pull_event = parser.pull
  puts pull_event.event_type
end


Listing Three
#!/usr/bin/env ruby
require "rexml/parsers/pullparser"
parser = REXML::Parsers::PullParser.new( IO.read( "content1.xml" ) )

# Print the name of every element as its start tag is encountered.
while parser.has_next?
  pull_event = parser.pull
  puts( pull_event[0] ) if pull_event.start_element?
end


Listing Four
#!/usr/bin/env ruby
require "rexml/parsers/pullparser"
parser = REXML::Parsers::PullParser.new( IO.read( "content1.xml" ) )

def is_standard_text_p?( event )
  return false unless event.start_element?
  return false unless event[0] == "text:p"
  event[1][ 'text:style-name'] == "Standard"
end

while parser.has_next?
  pull_event = parser.pull
  puts pull_event.inspect if is_standard_text_p? pull_event   
end


Listing Five
#!/usr/bin/env ruby
require "rexml/parsers/pullparser"

parser = REXML::Parsers::PullParser.new( IO.read( "content1.xml" ) )
results  = ""
def is_standard_text_p?( event )
  return false unless event.start_element?
  return false unless event[0] == "text:p"
  event[1][ "text:style-name" ] == "Standard"
end

def attrs_to_s( attrs )
  return "" if attrs.empty?
  " " +  attrs.to_a.map{ |attr| 
      "#{attr[0]}='#{attr[1]}'"
     }.join( " " )   
end

tag_stack = []

while parser.has_next?
  pull_event = parser.pull
  case pull_event.event_type
    when :start_element
      if is_standard_text_p? pull_event   
        tag_stack.push "p"
      else
        tag_stack.push pull_event[0]
      end
      results << "<#{tag_stack.last}#{attrs_to_s(pull_event[1])}>"
    when :end_element
      results << "</#{tag_stack.pop}>"    
    when :text
      results << pull_event[0] 
    else
      results << "<!-- #{pull_event.inspect} -->"
  end
end

puts results


Listing Six
#!/usr/bin/env ruby
require "rexml/parsers/pullparser"

parser = REXML::Parsers::PullParser.new( IO.read( "content1.xml" ) )

def is_standard_text_p?( event )
  return false unless event.start_element?
  return false unless event[0]  == "text:p"
  event[1][ "text:style-name" ] == "Standard"
end

def attrs_to_s( attrs )
  return "" if attrs.empty?
  " " +  attrs.to_a.map{ |attr| 
      "#{attr[0]}='#{attr[1]}'"
      }.join( " " )   
end

def start_element( event )
  if is_standard_text_p? event   
     $tag_stack.push "p"
  else
    $tag_stack.push event[0]  
  end
  "<#{$tag_stack.last}#{attrs_to_s(event[1])}>"  
end

def end_element( event )
  "</#{$tag_stack.pop}>"         
end

def text( event )
  event[0]
end

results = ""

$tag_stack = []

while parser.has_next?
  pull_event = parser.pull
  begin
    results << send( pull_event.event_type.to_s, pull_event )
  rescue NoMethodError; end
end

puts results


Listing Seven
def dispatch
  return unless @parser.has_next?
  event = @parser.pull 
  unless event.end_document?
    send( event.event_type.to_s, event ) 
  end
end

