Channels ▼
RSS

Open Source

Fast XML Parsing in Ruby


The final OPML parser in Listing Three is overkill, but illustrates techniques needed for a larger parser without the larger volume of code. The parser code can frequently be split in two functions, one for the header (information common to all elements) and one for the body (multiple elements of the same or similar kind). It also has a lower-level API for the body to allow the user to supply a callback for each element in the body (suitable for very large files that won't fit in memory).

Listing Three: The final OPML parser.

  def OpmlSpeedReader.parse_header(reader, stack)
    title = nil

    # skip over any leading XML comments
    begin
      status = reader.read
    rescue LibXML::XML::Error
      raise NotOPML, 'Not XML'
    else
      raise NotOPML, 'Empty file' if !status	# EOF
    end while reader.node_type == XML::Reader::TYPE_COMMENT

    if reader.node_type == XML::Reader::TYPE_ELEMENT && reader.name != 'opml'
      raise NotOPML, reader.name
    end

    while status
      case reader.node_type
      when XML::Reader::TYPE_ELEMENT
	stack << reader.name
	path = stack.join('>')
	ignore = false
	case path
	when 'opml>body'
	  break		# end of header
	else
	  ignore = true
	end
	stack.pop if reader.empty_element?
      when XML::Reader::TYPE_TEXT, XML::Reader::TYPE_CDATA
	path = stack.join('>')
	ignore = false
	case path
	when 'opml>head>title'
	  title = reader.value.strip
	end
      when XML::Reader::TYPE_END_ELEMENT
	stack.pop
      end
      status = reader.read
    end
    title
  end


  def OpmlSpeedReader.parse_body(reader, stack)
    feed = {}		# force scope
    begin	# post-test loop
      case reader.node_type
      when XML::Reader::TYPE_ELEMENT
	stack << reader.name
	path = stack.join('>')
	case path
	when /^opml>body(>outline)+$/
	  feed[:title] = if reader['title']
	                   reader['title'].strip
			 else
			   reader['text'].strip
			 end
	  feed[:url] = reader['xmlUrl'].strip if reader['xmlUrl']
	  yield(feed, stack.size - 3) unless feed.empty?
	  feed = {}
	end
	stack.pop if reader.empty_element?
      when XML::Reader::TYPE_END_ELEMENT
	stack.pop
      end
    end while reader.read
  end


  def self.parse(reader)
    parser_stack = []
    title = OpmlSpeedReader.parse_header(reader, parser_stack)

    parser_stack.pop

    feed_stack = [NamedArray.new(title)]
    OpmlSpeedReader.parse_body(reader, parser_stack) do |feed, depth|
      if feed.size > 1
	raise if ((depth+1) <=> feed_stack.size) == -1	# ASSERT
	feed_stack[-1] << Feed.new(feed[:title], feed[:url])
      else
	case (depth+1) <=> feed_stack.size
	when +1
	  raise		# ASSERT
	when 0
	  feed_stack.push(NamedArray.new(feed[:title]))
	when -1
	  tmp = feed_stack.pop
	  feed_stack[-1] << tmp
	  feed_stack.push(NamedArray.new(feed[:title]))
	else
	  raise		# ASSERT
	end
      end
    end

    # Nested feeds (e.g. Google categories) need final flattening.
    while feed_stack.size > 1
      tmp = feed_stack.pop
      feed_stack[-1] << tmp
    end

    feed_stack[0]
  end

The code in Listing Three is on GitHub in the master branch. In the REXML branch is the same functionality using the REXML library instead of libxml. At arm's length, they look the same.

An RSS parser built on these principles is on GitHub at http://github.com/AustinBlues/RSS-Speed-Reader. The overall structure is the same, just additional data fields of interest and a lot more paths to match against. There are six RSS standards (RDF, RSS 0.9, 1.0, and 2.0, and Atom 0.3 and 1.0). Some RSS feed generators use namespaces on tag names, others don't. There are at least 16 different paths for some data values.

Techies like numbers. Figure 1 shows the benchmark times for the average of the middle 3 of five runs. Each run parses 7 large RSS 2.0 files 10 times (enough for a decent size timing sample). These benchmarks parsed RSS only, not ATOM. The library versions are:


Figure 1: Initial parser timings.

Library			Version
=======			=======
Hpricot			0.8.4	 
Nokogiri		1.5.0	 
libxml2	& libxml-ruby	2.7.3 & 1.1.4

Library	  API		time (seconds)
=======	  ===		==============
Hpricot	  DOM		0.73
libxml2	  SAX		0.51
Nokogiri  DOM		0.42
libxml2	  pull		0.19
libxml2	  DOM		0.18

Extending the parsers to handle all six RSS formats adds code complexity and slower run-times. The benchmarks in Figure 2 may be comparing oranges to grapefruits, all citrus, but not the same. Feedzirra is just used here to parse one in-memory RSS feed at a time, but it can also fetch multiple feeds over HTTP in parallel and parse any of the RSS standards and easily extended to other XML formats. I have my own code to do fetching in parallel and it is a couple of hundred lines. That additional capability may slow Feedzirra's results.

Figure 2: Comparison of parsing RSS formats.

Library	     	Version
======	     	=======
FeedTools     	0.2.29     81.82s
Ruby built-in   3.10s
Feedzirra       0.0.31	    0.92s
RssSpeedReader  0.1.3	    0.28s

If I hadn't already done the work, I might be tempted by Feedzirra. The speed difference isn't that much and letting someone else write and maintain several hundred lines of good qualty code is generally a win.

The sheer volume of code in a pull parser can be overwhelming, but knowing the structure, I find it easy to navigate about. If a new bit of data needs to be extracted, a new path for an existing bit of data added, or even a whole new RSS standard supported, there is little doubt where to add the code. This RSS parser has been very stable.

The OPML parser was actually developed after the RSS parser, and as they say, "just wrote itself." Write me if you find some improvements.


Jeffrey Taylor (jeff.taylor@ieee.org) has been programming for fun for 40 years. He currently is writing and doing Ruby on Rails development, after decades of C/C++ programming with side trips through university teaching. His last article for Dr. Dobb's was Hypothesis-Driven Development.


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video