Channels ▼
RSS

Web Development

Fast XML Parsing in Ruby


Most programming languages have their own XML parser libraries. And many of those use the DOM (Document Object Model) API. DOM is good for general-purpose XML processing: The input is parsed into a tree structure that can be modified and written back out. Often whitespace is preserved so the output is identical to the input if it is not modified. This generality comes at a cost: large memory requirements (often more than double the input size) and slow read and write. For applications that read the XML data for only specific pieces of information, there are better alternatives.

Other libraries use SAX (Simple API for XML), and a pull parser. SAX is callback-based and I find it difficult to work with, but it allows breaking your code into reasonable size pieces. I use pull parsers, but my code ends up in one or two large functions. Pull parsers have a simple internal structure and a lot of little snippets of code for each data type of interest.

Neither method in mainstream libraries deliver the speed and simplicity I want, however. Let's see how to improve them. The examples in this article start with a simple OPML (Outline Processor Markup Language, an XML language) subset. It is suitable for the RSS export format used by most RSS readers (such as Google Reader or Thunderbird). The second example is a more robust version that handles nested feeds (for example, Google Reader's Categories). The final version is possibly overkill for OPML, but it illustrates techniques suitable for larger parsers such as RSS). The complete OPML programs with test suites and sample input files are on GitHub.

RSS (originally, RDF Site Summary, often dubbed "Really Simple Syndication") is an XML language for publishing news and blog summaries. When I first started working with RSS feeds for a Ruby on Rails application, I used FeedTools. It is a solid package that supports reading, searching, modifying, and writing RSS feeds. But, it doesn't scale to reading thousands of feeds per hour.

The developer of FeedTools lamented in his blog about choosing to use REXML (part of the Ruby standard libraries) and that he was too burned out to redo it with a faster library. Nokogiri and Hpricot are newer XML libraries and an RSS parser using them runs 10% faster than FeedTools. Both use the libxml2 library (in C, part of the GNOME project). An RSS parser written with libxml2 is an additional 10% faster. However, I wanted 10 times faster, not 10% or even 20% faster. The current benchmark results for these parsers and my lean alternative are near the end of the article.

The libxml-ruby docs are much more readable than the libxml2 docs. Digging around in them I found the two additional APIs: SAX and pull parsers. I did not have much success with the SAX API. Pull parsers I understand. An RSS pull parser is over a hundred times faster than FeedTools! The key to the speed is the API, not the language it is written in. The REXML pull parser (pure Ruby) is trivally slower (approximately 2%) than libxml (in C with libxml-ruby Ruby wrapper). Interestingly, the REXML DOM API is 20% slower than the libxml2 DOM API. The RSS parser I use in production is available here.

How To Do It

Let's look at the feed and how its speed was accelerated. A simplified OPML file with one feed looks like this:

OPML Example 1.

<?xml version="1.0" encoding="ISO-8859-1"?>
<opml version="1.1">
  <head>
    <title></title>
  </head>
  <body>
    <outline text="AmethystRSS Blog"
          xmlUrl="http://blog.amethystRSS.net/feed/"/>
  </body>
</opml>

Pull parsers loop through the input. Within the loop, an outer case statement handles the markup types. Only tag start, tag data, and tag end markup are needed for OPML and RSS. Inner case statements switch on the path — the nesting of tag names. For example, the title in Example 1 has the path "opml>head>title" (using ">" to separate tag names is my own convention).

The code to parse it looks like Listing One (in the FLAT branch on GitHub). A case could be made for declaring a class or two to store the results. In the interests of simplicity and for teaching purposes, I've chosen to store the feeds as an array (reasonable) of hashes (a class with just title and URL might be better). Using built-in types simplifies the test data (in test/opml on GitHub). Each OPML file has a similarly named YAML file (YAML Ain't Markup Language) with the expected parse result. By sticking to YAML standard types: sequences (arrays in Ruby) and mappings (hashes in Ruby), the test data is portable across implementation languages that have support for those data types (e.g., Python).

Listing One: Parsing.

      feeds = []
      opml_title = nil
      while reader.read
        case reader.node_type
	when TYPE_ELEMENT:	# begin element
	  case tag = reader.name
	  when 'outline'
	    # save relevant RSS feed attributes: title and URL
	    feeds << {:title => reader['title'], :url => reader['xmlUrl']}
	  end
	when TYPE_TEXT, TYPE_CDATA	# contents
	  case tag
	  when 'title'
	    opml_title = reader.value	# OPML file title
	  end
	end
      end

For simple OPML files with no nesting, this works. Example 1 above is represented like this by Ruby's inspect primitive:

 [{:title => "AmethystRSS Blog", :url => "http://blog.amethystRSS.net/feed/"}]

In YAML, this is:

YAML Example 1.

---
- :title: AmethystRSS Blog
  :url: "http://blog.amethystRSS.net/feed/"

Google Reader supports grouping feeds into named categories. In OPML, these are simply nested outline elements.

OPML Example 2.

<?xml version="1.0" encoding="UTF-8"?>
<opml version="1.0">
  <head>
    <title>Lars subscriptions in Google Reader</title>
  </head>
  <body>
    <outline title="@Bibliotek" text="@Bibliotek">
      <outline text="24 Ways" title="24 Ways"
          type="rss"
          xmlUrl="24ways.xml" htmlUrl="http://24ways.org/"/>
    </outline>
    <outline title="@Politics" text="@Politics">
      <outline text="Jon Udell" title="Jon Udell" type="rss"
            xmlUrl="JonsNewRadio.xml"
	    htmlUrl="http://blog.jonudell.net"/>
      <outline text="Linux Magazine Full Feed"
            title="Linux Magazine Full Feed" 
            type="rss"
            xmlUrl="LinuxMag.xml"
	    htmlUrl="http://www.linux-magazine.com"/>
    </outline>>
  </body>
</opml>

In keeping with the decision to represent records with hashes, the parse of Example 2 in YAML is the feed title now included in the parse results:

YAML Example 2.

---
- Lars subscriptions in Google Reader
- - "@Bibliotek"
  - :title: 24 Ways
    :url: 24ways.xml
- - "@Politics"
  - :title: Jon Udell
    :url: JonsNewRadio.xml
  - :title: Linux Magazine Full Feed
    :url: LinuxMag.xml

The code to parse Example 2 must keep track of the nesting depth. Note also that regular expressions are used in some case statements to match the path for nested elements. The complete program is in the NESTED branch on GitHub.

Listing Two: Core parser functions for keeping track of levels.

    title = nil
    feed_stack = [[]]
    tag_stack = []

    tag = nil	# force scope
    feed = {}
    
    while reader.read
      case reader.node_type
      when XML::Reader::TYPE_ELEMENT
	tag = reader.name
	tag_stack.push(tag)
	case tag_stack.join('>')
	when /^opml>body(>outline)+$/
	  feed = {:title => reader['text'].strip}
	  feed[:url] = reader['xmlUrl'].strip if reader['xmlUrl']
	  if feed[:url].nil?	# Category/folder start?
	    feed_stack.push([feed[:title]])
	  else
	    feed_stack[-1] << feed
	    feed = {}
	  end
	  if reader.empty_element?
	    tag_stack.pop
	  end
	end
      when XML::Reader::TYPE_TEXT, XML::Reader::TYPE_CDATA
	case tag_stack.join('>')
	when 'opml>head>title'
	  title = reader.value.strip
	  feed_stack[0].unshift(title)
	end
      when XML::Reader::TYPE_END_ELEMENT
	case tag_stack.join('>')
	when /^opml>body(>outline)+$/
	  tmp = feed_stack.pop
	  feed_stack[-1] << tmp
	end
	tag_stack.pop
      end
    end

    # flatten feed_stack
    while feed_stack.size > 1
      tmp = feed_stack.pop
      feed_stack[-1] << tmp
    end

    return feed_stack[0]


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 
Dr. Dobb's TV