The final OPML parser in Listing Three is overkill, but illustrates techniques needed for a larger parser without the larger volume of code. The parser code can frequently be split in two functions, one for the header (information common to all elements) and one for the body (multiple elements of the same or similar kind). It also has a lower-level API for the body to allow the user to supply a callback for each element in the body (suitable for very large files that won't fit in memory).
Listing Three: The final OPML parser.
def OpmlSpeedReader.parse_header(reader, stack)
title = nil
# skip over any leading XML comments
begin
status = reader.read
rescue LibXML::XML::Error
raise NotOPML, 'Not XML'
else
raise NotOPML, 'Empty file' if !status # EOF
end while reader.node_type == XML::Reader::TYPE_COMMENT
if reader.node_type == XML::Reader::TYPE_ELEMENT && reader.name != 'opml'
raise NotOPML, reader.name
end
while status
case reader.node_type
when XML::Reader::TYPE_ELEMENT
stack << reader.name
path = stack.join('>')
ignore = false
case path
when 'opml>body'
break # end of header
else
ignore = true
end
stack.pop if reader.empty_element?
when XML::Reader::TYPE_TEXT, XML::Reader::TYPE_CDATA
path = stack.join('>')
ignore = false
case path
when 'opml>head>title'
title = reader.value.strip
end
when XML::Reader::TYPE_END_ELEMENT
stack.pop
end
status = reader.read
end
title
end
def OpmlSpeedReader.parse_body(reader, stack)
feed = {} # force scope
begin # post-test loop
case reader.node_type
when XML::Reader::TYPE_ELEMENT
stack << reader.name
path = stack.join('>')
case path
when /^opml>body(>outline)+$/
feed[:title] = if reader['title']
reader['title'].strip
else
reader['text'].strip
end
feed[:url] = reader['xmlUrl'].strip if reader['xmlUrl']
yield(feed, stack.size - 3) unless feed.empty?
feed = {}
end
stack.pop if reader.empty_element?
when XML::Reader::TYPE_END_ELEMENT
stack.pop
end
end while reader.read
end
def self.parse(reader)
parser_stack = []
title = OpmlSpeedReader.parse_header(reader, parser_stack)
parser_stack.pop
feed_stack = [NamedArray.new(title)]
OpmlSpeedReader.parse_body(reader, parser_stack) do |feed, depth|
if feed.size > 1
raise if ((depth+1) <=> feed_stack.size) == -1 # ASSERT
feed_stack[-1] << Feed.new(feed[:title], feed[:url])
else
case (depth+1) <=> feed_stack.size
when +1
raise # ASSERT
when 0
feed_stack.push(NamedArray.new(feed[:title]))
when -1
tmp = feed_stack.pop
feed_stack[-1] << tmp
feed_stack.push(NamedArray.new(feed[:title]))
else
raise # ASSERT
end
end
end
# Nested feeds (e.g. Google categories) need final flattening.
while feed_stack.size > 1
tmp = feed_stack.pop
feed_stack[-1] << tmp
end
feed_stack[0]
end
The code in Listing Three is on GitHub in the master branch. In the REXML branch is the same functionality using the REXML library instead of libxml. At arm's length, they look the same.
An RSS parser built on these principles is on GitHub at http://github.com/AustinBlues/RSS-Speed-Reader. The overall structure is the same, just additional data fields of interest and a lot more paths to match against. There are six RSS standards (RDF, RSS 0.9, 1.0, and 2.0, and Atom 0.3 and 1.0). Some RSS feed generators use namespaces on tag names, others don't. There are at least 16 different paths for some data values.
Techies like numbers. Figure 1 shows the benchmark times for the average of the middle 3 of five runs. Each run parses 7 large RSS 2.0 files 10 times (enough for a decent size timing sample). These benchmarks parsed RSS only, not ATOM. The library versions are:
Figure 1: Initial parser timings.
Library Version ======= ======= Hpricot 0.8.4 Nokogiri 1.5.0 libxml2 & libxml-ruby 2.7.3 & 1.1.4 Library API time (seconds) ======= === ============== Hpricot DOM 0.73 libxml2 SAX 0.51 Nokogiri DOM 0.42 libxml2 pull 0.19 libxml2 DOM 0.18
Extending the parsers to handle all six RSS formats adds code complexity and slower run-times. The benchmarks in Figure 2 may be comparing oranges to grapefruits, all citrus, but not the same. Feedzirra is just used here to parse one in-memory RSS feed at a time, but it can also fetch multiple feeds over HTTP in parallel and parse any of the RSS standards and easily extended to other XML formats. I have my own code to do fetching in parallel and it is a couple of hundred lines. That additional capability may slow Feedzirra's results.
Figure 2: Comparison of parsing RSS formats.
Library Version ====== ======= FeedTools 0.2.29 81.82s Ruby built-in 3.10s Feedzirra 0.0.31 0.92s RssSpeedReader 0.1.3 0.28s
If I hadn't already done the work, I might be tempted by Feedzirra. The speed difference isn't that much and letting someone else write and maintain several hundred lines of good qualty code is generally a win.
The sheer volume of code in a pull parser can be overwhelming, but knowing the structure, I find it easy to navigate about. If a new bit of data needs to be extracted, a new path for an existing bit of data added, or even a whole new RSS standard supported, there is little doubt where to add the code. This RSS parser has been very stable.
The OPML parser was actually developed after the RSS parser, and as they say, "just wrote itself." Write me if you find some improvements.
Jeffrey Taylor (jeff.taylor@ieee.org) has been programming for fun for 40 years. He currently is writing and doing Ruby on Rails development, after decades of C/C++ programming with side trips through university teaching. His last article for Dr. Dobb's was Hypothesis-Driven Development.


