Dr. Dobb's is part of the Informa Tech Division of Informa PLC




The Delphi XML SAX2 Component & MSXML 3.0


Sep01: An Expat TSAXParser Implementation


The TSAXParser component comes in two flavors: one implemented on top of MSXML, and another with the same name, properties, events, methods, and behavior, but implemented on top of James Clark's Expat library (http://www.jclark.com/bio.htm).

To use the Expat-based component on Windows, you have to download the Win32 binary of expat.dll from http://sourceforge.net/projects/expat/ (you can also download the source there and build the DLL yourself with the MSVC6 compiler).

Before you can use the Expat DLL in Delphi, you have to translate the C header file expat.h to Pascal. I used Bob Swart's Headconv 4.0 on expat.h to make a first cut of expat.pas, and then spent some hours hand-editing the result (correcting the translation errors made by Headconv, and reformatting the code to make it more readable).

I then reimplemented the TSAXParser component using the C functions and callback routines exposed by the Expat library. This was straightforward, with a couple of exceptions.

The "Element Declaration Handler" implementation proved interesting, because here I had to free memory in Delphi that had previously been allocated by Expat. Luckily, James Clark provided for this by letting you specify your own memory allocator to be used by the parser. You do this by creating the parser instance with the XML_ParserCreate_MM() function, which takes as one of its arguments a structure containing pointers to functions that implement equivalents of malloc(), free(), and realloc(). In my case, these are implemented with the Delphi memory allocator functions GetMem(), FreeMem(), and ReallocMem().
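
As a rough illustration of this allocator hand-off, the following C sketch shows the shape of the structure that XML_ParserCreate_MM() expects. It is not the component's source: the component is written in Object Pascal and wraps the Delphi routines, whereas here the C library calls stand in for them. MemorySuite is a local stand-in that mirrors the member order of XML_Memory_Handling_Suite in expat.h, so that the sketch compiles without Expat installed. The names app_malloc, app_realloc, app_free, app_suite, and the live_blocks counter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Local stand-in (assumption): same member order as
   XML_Memory_Handling_Suite in expat.h. */
typedef struct {
    void *(*malloc_fcn)(size_t size);
    void *(*realloc_fcn)(void *ptr, size_t size);
    void (*free_fcn)(void *ptr);
} MemorySuite;

/* Hypothetical bookkeeping: number of blocks currently allocated
   through the suite, whoever allocated them. */
long live_blocks = 0;

/* malloc() equivalent; a Delphi version would call GetMem(). */
void *app_malloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL)
        live_blocks++;
    return p;
}

/* free() equivalent; a Delphi version would call FreeMem(). */
void app_free(void *ptr)
{
    if (ptr == NULL)
        return;
    assert(live_blocks > 0);
    free(ptr);
    live_blocks--;
}

/* realloc() equivalent; a Delphi version would call ReallocMem(). */
void *app_realloc(void *ptr, size_t size)
{
    if (ptr == NULL)            /* behaves like malloc() */
        return app_malloc(size);
    if (size == 0) {            /* treated here as a free() */
        app_free(ptr);
        return NULL;
    }
    return realloc(ptr, size);  /* block count is unchanged */
}

/* The suite handed to the parser at creation time. */
const MemorySuite app_suite = { app_malloc, app_realloc, app_free };
```

With Expat itself, a suite like this would be passed as the second argument of XML_ParserCreate_MM(). Because the parser then allocates through the application's own routines, a block that the parser hands over in a callback can be released with the matching free routine.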

Also, Expat is a SAX1 parser (with extensions in the current version), not a SAX2 parser. To remain compatible with the MSXML version, I implemented some MSXML behavior in TSAXParser, such as the way namespaces and namespace prefixes are handled, and the reporting of attribute types. I decided to keep it simple and just maintain a couple of lists built by the element declaration and attribute declaration handlers. Later in the parsing process, the element handlers can look up this data and pass it on to the application as needed.
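
The bookkeeping for attribute types might look something like the following C sketch. The article does not show how its lists are laid out, so the fixed-size table and the names AttDecl, AttDeclList, attdecl_add, and attdecl_type are all assumptions made for illustration. One function records a declaration as a declaration handler would, and the other is the lookup an element handler could perform. The "CDATA" fallback for undeclared attributes follows the SAX2 convention.

```c
#include <assert.h>
#include <string.h>

#define MAX_DECLS 64   /* illustrative limits, not from the article */
#define MAX_NAME  64

/* One attribute declaration seen in the DTD. */
typedef struct {
    char element[MAX_NAME];
    char attribute[MAX_NAME];
    char type[MAX_NAME];
} AttDecl;

/* Hypothetical list filled while the DTD is being parsed. */
typedef struct {
    AttDecl items[MAX_DECLS];
    int count;          /* must be 0 before the first add */
} AttDeclList;

/* Record a declaration; meant to be called from an attribute
   declaration callback. Returns 1 on success, 0 if the table is
   full or a name does not fit. */
int attdecl_add(AttDeclList *list, const char *element,
                const char *attribute, const char *type)
{
    AttDecl *d;
    assert(list != NULL);
    if (list->count >= MAX_DECLS ||
        strlen(element) >= MAX_NAME ||
        strlen(attribute) >= MAX_NAME ||
        strlen(type) >= MAX_NAME)
        return 0;
    d = &list->items[list->count++];
    strcpy(d->element, element);
    strcpy(d->attribute, attribute);
    strcpy(d->type, type);
    return 1;
}

/* Look up the declared type; meant to be called from the
   start-element callback. The first matching declaration wins,
   and undeclared attributes are reported as "CDATA". */
const char *attdecl_type(const AttDeclList *list, const char *element,
                         const char *attribute)
{
    int i;
    assert(list != NULL);
    for (i = 0; i < list->count; i++) {
        const AttDecl *d = &list->items[i];
        if (strcmp(d->element, element) == 0 &&
            strcmp(d->attribute, attribute) == 0)
            return d->type;
    }
    return "CDATA";
}
```

A linear search is enough for the handful of declarations in a typical DTD. A component facing very large DTDs might prefer a sorted or hashed list.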

As I do not (yet) have a copy of Borland's Kylix (described as "Delphi for Linux"), I could not test the component on Linux, but it should run virtually unmodified with Kylix (the references to expat.dll will have to be changed to the Expat shared library, typically named libexpat.so, I guess).

Expat not only outperformed the MSXML parser, it also parsed, without a hitch, a couple of valid XML documents that caused MSXML to throw an OLE exception.

There is only one SAX2 event that I have not yet implemented in the Expat version: the "Unparsed Entity Handler." Complete source code for the Expat component and the same example applications that come with the MSXML version are available electronically; see "Resource Center," page 5.

Finally, I could also have used the C++ Xerces XML parser (by the Apache XML Project, http://xml.apache.org/), which features COM interfaces to make it compatible with MSXML on Windows, but I doubt that these would be usable in Kylix. Linking C++ code with Object Pascal is also complicated by C++ name mangling, so I preferred to use Expat.

— D.H.

