Jocelyn Paine

Dr. Dobb's Bloggers

Yet More XML: with Prolog

March 22, 2009

I just saw Mark Nelson's "More on XML", with his account of how difficult Visual C++ and MSXML make it to extract a node not all that far down from the root of an XML tree. Since Mark was good enough to show us his XML file, I tried the same task with SWI-Prolog.

Here's Mark's XML file. From it, he wants the contents of the Title element:

<ISBNdb server_time="2009-03-19T02:01:00Z">
<BookList total_results="1" page_size="10" page_number="1" shown_results="1">
<BookData book_id="the_data_compression_book" isbn="1558514341">
<Title>The Data Compression Book</Title>
<TitleLong></TitleLong>
<AuthorsText>Mark Nelson, Jean-Loup Gailly, </AuthorsText>
<PublisherText publisher_id="m_t_books">M&amp;T Books</PublisherText>
</BookData>
</BookList>
</ISBNdb>

Now, SWI-Prolog has a library for parsing XML. I've used it for decoding Excel spreadsheets saved as XML, but that wasn't recently, so my memory of the library was patchy. But I knew it returns the parsed XML as a list of lists, and lists are a standard data type in Prolog. So I only needed to know how to load the library, how to invoke the parser, and how the lists it returns represent XML. Luckily, there is a very helpful recent posting about this on the SWI-Prolog mailing list, R: [SWIPL] Working with strings from Prolog super-expert Richard O'Keefe.

Let's try what Richard suggests. I load the library, then parse Mark's XML into a Prolog variable also named "XML", and display that. Good: everything works, and the variable seems to have listy things in it:

Welcome to SWI-Prolog (Multi-threaded, 32 bits, Version 5.6.64)
...Rest of banner...
1 ?- cd('c:/dobbs').
true.

2 ?- use_module(library(sgml)).
% library(option) compiled into swi_option 0.02 sec, 7,664 bytes
% library(sgml) compiled into sgml 0.03 sec, 38,328 bytes
true.

3 ?- load_xml_file('mark.xml',XML), write(XML).
[element(ISBNdb, [server_time=2009-03-19T02:01:00Z], [
, element(BookList, [total_results=1, page_size=10, page_number=1, shown_results=1], [
, element(BookData, [book_id=the_data_compression_book, isbn=1558514341], [
, element(Title, [], [The Data Compression Book]),
, element(TitleLong, [], []),
, element(AuthorsText, [], [Mark Nelson, Jean-Loup Gailly, ]),
, element(PublisherText, [publisher_id=m_t_books], [M&T Books]),
]),
]),
])]
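A side note on that listing: write/1 prints atoms without their quotes, and the stray line breaks are whitespace-only text nodes sitting between the elements. Here is a small sketch of one way to get a tidier view, assuming SWI-Prolog 7 or later (for open_string/2). The helper name parse_xml_text/2 and the demo predicate are my own inventions, not part of library(sgml); the space(remove) option of load_structure/3 drops the whitespace-only nodes, and writeq/1 keeps the quotes.

```prolog
:- use_module(library(sgml)).

% parse_xml_text(+Text, -DOM): hypothetical helper, not part of library(sgml).
% Parses XML held in a string, dropping whitespace-only text nodes.
parse_xml_text(Text, DOM) :-
    setup_call_cleanup(
        open_string(Text, In),
        load_structure(stream(In), DOM, [dialect(xml), space(remove)]),
        close(In)).

% demo: parse a cut-down fragment of Mark's file and print it with quotes.
demo :-
    parse_xml_text("<BookData isbn=\"1558514341\">\n<Title>The Data Compression Book</Title>\n</BookData>", DOM),
    writeq(DOM), nl.
```

Calling demo should print the term with 'BookData' and 'The Data Compression Book' quoted, and without the newline atoms seen in the listing above.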

I'm working from Wi-Fi in a library which will close shortly, so I'm going to be really hasty. Richard's posting tells me that the parser returns XML elements as structures holding a tag-name field, an attributes field, and a children field. An XML file will be a list that contains a top-level element, and possibly other stuff I've not had time to read about. Mark's file appears to have a top-level element called ISBNdb, with children that include a BookList element. Let's check that:

4 ?- load_xml_file('mark.xml',XML), XML=[element(_,_,Kids0)], member( element('BookList',_,Kids1), Kids0 ), write(Kids1).
[
, element(BookData, [book_id=the_data_compression_book, isbn=1558514341], [
, element(Title, [], [The Data Compression Book]),
, element(TitleLong, [], []),
, element(AuthorsText, [], [Mark Nelson, Jean-Loup Gailly, ]),
, element(PublisherText, [publisher_id=m_t_books], [M&T Books]),
]),
]

I did as before, but this time "unified" the top-level XML with a structure containing a new Prolog variable called Kids0. This is a kind of pattern-matching that puts the third field of the top-level element (the level-0 children) into Kids0. Then I used the built-in predicate "member" to search Kids0 for an element whose first field was 'BookList'. I put its children into Kids1 and displayed that. And one of those level-1 children is a BookData element.
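To see the pattern-matching in isolation, here is a sketch that runs without any XML file. The term in sample_dom/1 is a hand-written fragment in the same element(Name, Attributes, Children) shape, and child/3 and demo/1 are hypothetical helper names of mine. Note that member/2 skips the newline atoms simply because they do not unify with an element/3 structure.

```prolog
:- use_module(library(lists)).

% sample_dom(-DOM): a hand-written fragment shaped like the parser's output,
% including the whitespace text nodes ('\n') between elements.
sample_dom([element('ISBNdb', [],
              [ '\n',
                element('BookList', [total_results='1'],
                        [ '\n',
                          element('BookData', [isbn='1558514341'], []),
                          '\n'
                        ]),
                '\n'
              ])]).

% child(+Name, +Children, -Kids): hypothetical helper.
% Kids is the child list of an element called Name found in Children.
child(Name, Children, Kids) :-
    member(element(Name, _, Kids), Children).

% demo(-Kids1): unify the top level, then pick out BookList's children.
demo(Kids1) :-
    sample_dom([element(_, _, Kids0)]),
    child('BookList', Kids0, Kids1).
```

The query `?- demo(Kids1).` should bind Kids1 to the BookList children, newline atoms included.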

Now I'll iterate that, and write out the second-level children:

5 ?- load_xml_file('mark.xml',XML), XML=[element(_,_,Kids0)], member( element('BookList',_,Kids1), Kids0 ), member( element('BookData',_,Kids2), Kids1 ),write(Kids2).
[
, element(Title, [], [The Data Compression Book]),
, element(TitleLong, [], []),
, element(AuthorsText, [], [Mark Nelson, Jean-Loup Gailly, ]),
, element(PublisherText, [publisher_id=m_t_books], [M&T Books]),
]

And now the third-level children:

6 ?- load_xml_file('mark.xml',XML), XML=[element(_,_,Kids0)], member( element('BookList',_,Kids1), Kids0 ), member( element('BookData',_,Kids2), Kids1 ), member( element('Title',Attrs,Kids3), Kids2), write(Kids3).
[The Data Compression Book]

Lo and behold, the title!
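The chain of member calls repeats one step three times, so it can be folded into a short recursive predicate. This is only a sketch: descend/3 and title/2 are names I made up, and title/2 assumes the Title element holds exactly one text child, as it does in Mark's file.

```prolog
:- use_module(library(lists)).

% descend(+Names, +Children, -Kids): hypothetical helper.
% Follows a list of element names downward, one level per name,
% and returns the children of the last element named.
descend([], Kids, Kids).
descend([Name|Names], Children, Kids) :-
    member(element(Name, _, Next), Children),
    descend(Names, Next, Kids).

% title(+DOM, -Title): Title is the single text child of the
% BookList/BookData/Title element below the top-level element.
title([element(_, _, Kids0)], Title) :-
    descend(['BookList', 'BookData', 'Title'], Kids0, [Title]).
```

With that loaded, `?- load_xml_file('mark.xml', XML), title(XML, T).` should bind T to the same title as query 6.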

I suppose I'm just showing that if you're lucky enough to have the right libraries, and a language that handles lists nicely and that is also interactive, it's easy to experiment and test your understanding of the data. Once those are sorted out, you can then go on to program a robust system for extracting stuff from it, including validity checks on list size and so on. Thanks for a nice example, Mark.
