The Delphi XML SAX2 Component & MSXML 3.0


Sep01: The Delphi XML SAX2 Component & MSXML 3.0

Danny is a systems programmer at Cevi NV. He can be contacted at [email protected].


An Expat TSAXParser Implementation


SAX parsers are designed to process large XML documents quickly while using minimal memory: Instead of building the whole document in memory, they report each piece of the document to callback functions as it is parsed. In this article, I show how to use the C++ COM interfaces of Microsoft's MSXML 3.0 SAX2 parser (http://msdn.microsoft.com/xml/general/xmlparser.asp) with Borland Delphi (http://community.borland.com/delphi/). I also present TSAXParser, a Delphi component that uses these interfaces, but shields you from their complexities without sacrificing speed or functionality. All it takes to parse an XML file using TSAXParser is to drop the component on a form with a mouse click, select a few event handlers in the Object Inspector, and then call the Parse method. The complete source code for this component (along with a demo application) is available electronically; see "Resource Center," page 5. For background information on SAX parsers in general, and the MSXML 3.0 SAX2 parser in particular, see "Parsing XML," by David Cox (DDJ, January 2001) and "Programmer's Toolchest," by Eldar Musayev (DDJ, February 2001).

For speed and simplicity, I decided to use MSXML's native C++ COM interfaces, not the Visual Basic IVBSAX wrappers. Microsoft provided the IVBSAX wrappers in MSXML 3.0 to support Visual Basic and other COM-enabled languages that can't handle the C++ COM interfaces and data types. I will explain how you can implement and use these C++ COM interfaces with Borland Delphi. You can implement these C++ COM interfaces as normal classes; you don't need full-featured COM objects.

Importing the MSXML Type Library

Before using MSXML3.DLL in Delphi, you have to import its type library (Microsoft XML 3.0). This generates an Object Pascal source file (MSXML2_TLB.pas) that gives the Delphi IDE and compiler access to the COM objects and interfaces inside the DLL.

A key difference between the DOM parser in MSXML and the SAX2 parser is that the SAX parser exposes COM interfaces that have no corresponding implementation in MSXML3.DLL. You have to implement these COM interfaces instead of instantiating and using existing COM objects. You then pass a reference to your implementation to the parser, so that it can call back the functions you implemented at the appropriate moment in the parsing process.

The Import Type Library Wizard

My decision to use the lightweight C++ ISAX interfaces that descend from IUnknown instead of the VB wrappers that descend from IDispatch immediately caused some trouble with the Delphi Import Type Library wizard. The wizard generates Pascal code based on the information present in the type library. While this works well for the IDispatch-type VB interfaces, it has problems with the data types of the C++ interfaces. Example 1(a) shows how a method argument of type unsigned short * in the type library, which corresponds to a const wchar_t * in the MSXML 3.0 SDK, Example 1(b), is incorrectly converted to Word by the import wizard, Example 1(c). Either pWord or pWideChar would have been acceptable, since both represent pointers to a 16-bit word. I decided to edit MSXML2_TLB.PAS to make the argument definitions correspond closely to those in the SDK; see Example 1(d). This permits the use of the native Delphi data types and (wide) string functions. The modified file is included with the source code.
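The practical effect of the two declarations can be sketched in a small console program. This is only an illustration: it assumes the wizard emitted the argument as a by-reference Word (the text above says only that the type became Word), and the names AsImported, AsCorrected, PWord16, and Sample are hypothetical, not taken from MSXML2_TLB.pas.

```pascal
program ArgTypeDemo;
{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$APPTYPE CONSOLE}

type
  PWord16 = ^Word;  // hypothetical local alias, used only to build the call below

// Assumed shape of the imported declaration: the text arrives as a reference
// to its first 16-bit unit, so the address must be taken and cast before any
// wide-string routine can use it.
function AsImported(var pwchChars: Word; cchChars: Integer): string;
begin
  Result := WideCharLenToString(PWideChar(@pwchChars), cchChars);
end;

// Corrected declaration, matching the SDK's const wchar_t *.
function AsCorrected(const pwchChars: PWideChar; cchChars: Integer): string;
begin
  Result := WideCharLenToString(pwchChars, cchChars);
end;

var
  Sample: WideString;
begin
  Sample := 'Bookshop';
  // Both calls hand over the same address; only the Pascal-side type differs.
  WriteLn(AsImported(PWord16(PWideChar(Sample))^, Length(Sample)));
  WriteLn(AsCorrected(PWideChar(Sample), Length(Sample)));
end.
```

Both functions receive the same pointer, but only the corrected form can be passed straight to the wide-string routines without a cast.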

Implementing the C++ COM Interfaces

The abstract C++ interfaces in the SAX2 parser (abstract meaning that there is no corresponding implementation in MSXML3.DLL) derive from IUnknown. The workhorse SAX interface that has to be implemented is ISAXContentHandler, which handles most XML events. Listing One is a (corrected) Object Pascal version of the ISAXContentHandler interface definition.

When all you need is a single in-process instance of the class, Delphi offers a straightforward way of implementing interfaces that derive from IUnknown. The TInterfacedObject class implements IUnknown and is designed to be used in these circumstances. Example 2(a) is a (partial) class definition for ISAXContentHandler using TInterfacedObject. All it takes to instantiate such a class is to call the Create class constructor, Example 2(b), without the hassle of COM class factories and globally unique identifiers. (The C++ examples in the MSXML SDK use the same approach.) The complete definition of a TSAXContentHandler class that implements the ISAXContentHandler interface as it is used in the TSAXParser component is available electronically; see Resource Center, page 5.

Once the class implementing the interface has been defined, all that remains to be done is provide an implementation for each method. It's important to realize that a class that implements an interface has to provide an implementation for all the methods defined in that interface, even if these implementations don't do anything at all.
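As a rough sketch of this approach, the console program below implements a cut-down callback interface with TInterfacedObject. IMiniHandler, TMiniHandler, FakeParse, the HANDLER_OK constant, and the GUID are all made up for illustration; the real ISAXContentHandler has many more methods (see Listing One), and the real parser lives in MSXML3.DLL.

```pascal
program HandlerDemo;
{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$APPTYPE CONSOLE}

const
  HANDLER_OK = 0;  // local stand-in for S_OK, so no Windows units are needed

type
  // Hypothetical two-method interface in the style of ISAXContentHandler.
  IMiniHandler = interface(IUnknown)
    ['{6F1E2B0A-3C4D-4E5F-8A9B-0C1D2E3F4A5B}']
    function startDocument: HResult; stdcall;
    function endDocument: HResult; stdcall;
  end;

  // TInterfacedObject supplies QueryInterface, _AddRef, and _Release.
  TMiniHandler = class(TInterfacedObject, IMiniHandler)
  public
    function startDocument: HResult; stdcall;
    function endDocument: HResult; stdcall;
  end;

function TMiniHandler.startDocument: HResult; stdcall;
begin
  WriteLn('startDocument');
  Result := HANDLER_OK;
end;

// Every interface method needs a body, even one that does nothing.
function TMiniHandler.endDocument: HResult; stdcall;
begin
  Result := HANDLER_OK;
end;

// Stand-in for the parser: it only ever sees the interface reference.
function FakeParse(const Handler: IMiniHandler): HResult;
begin
  Result := Handler.startDocument;
  if Result = HANDLER_OK then
    Result := Handler.endDocument;
end;

var
  Handler: IMiniHandler;
begin
  Handler := TMiniHandler.Create;  // no class factory, no registration
  if FakeParse(Handler) = HANDLER_OK then
    WriteLn('parse finished');
end.
```

Because the object is held only through the interface variable, reference counting frees it when the variable goes out of scope.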

An Example Implementation

In Listing Two, which includes the ContentHandler's startElement() function, the parameters passed to the function are pointer/size pairs to the namespace URI, the local name, and the qualified name of the current element, plus a pointer to an ISAXAttributes interface. The ISAXAttributes interface gives access to the attributes of this element. (This interface is implemented in MSXML3.DLL, so you don't have to provide an implementation for it.)

First of all, the function checks whether the application has set the OnStartElement event method. If not, the function immediately returns S_OK and parsing continues. If the event handler has been set, the first thing that happens is a call to the GetLineColumn procedure to retrieve the current line and column position in the XML document. This is done through the ISAXLocator interface, which was saved earlier when the parser called the putDocumentLocator() implementation. For performance reasons, the application can disable GetLineColumn by setting the UpdateLocation property to False.

Next, the pWideChar C++ COM-style parameters (pwchNamespaceUri, pwchLocalName, pwchQName) are converted to Delphi AnsiStrings, and the attributes are retrieved using the getName(), getValue(), and getType() methods of the ISAXAttributes interface parameter. The properties of each attribute are stored in a TSXPAttribute object containing the name, value, and type of the attribute, and all attributes of this element are stored in a list (TSXPAttributeList, derived from TList) for easy access by the application. The OnStartElement event procedure of the application is then called with the converted arguments and the attribute list. If OnStartElement does not raise an exception, S_OK is returned and parsing continues; otherwise, TSAXContentHandler.StartElement() returns E_FAIL and parsing is aborted.

The out Parameters

Contrary to normal COM memory allocation practices, you should not use CoTaskMemFree() to free the out parameter strings returned by functions such as ISAXAttributes.getValue(), with the exception of the getProperty() functions. Most of the time, the SAX2 functions in MSXML3.DLL do not allocate memory for passing a copy of the string parameters to the event handlers, but instead return a pointer and size pair pointing directly to the data in your input buffer. After all, SAX parsers should be fast.
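A minimal sketch of what a pointer/size pair implies for the receiving code follows. The buffer contents and the variable names are hypothetical; a WideString merely stands in for the parser's input buffer, and no MSXML call is involved.

```pascal
program PtrLenDemo;
{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$APPTYPE CONSOLE}

var
  Buffer: WideString;     // stands in for the parser's input buffer
  pName: PWideChar;       // pointer half of the pair
  cchName: Integer;       // size half of the pair, in characters
  ElementName: string;
begin
  Buffer := '<book genre="fiction">';
  pName := PWideChar(Buffer) + 1;  // points at the 'b' of "book"
  cchName := 4;
  // The name is not followed by a terminating #0, so the length must be
  // honored. The characters are copied into a Delphi string right away;
  // the pointer itself is never freed by the receiver.
  ElementName := WideCharLenToString(pName, cchName);
  WriteLn(ElementName);
end.
```

Treating pName as a null-terminated string here would run on into the rest of the buffer, which is why Listing Two always converts with an explicit length.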

The TSAXParser Component

After playing around with the MSXML SAX2 parser in Delphi, I found it tedious to write variations of the same code each time I wanted to test some other feature of the parser. I realized that it would be easier to have a native Delphi component that would:

  • Generate the event handlers needed with a simple mouse click in the object inspector.
  • Let me completely ignore the event handlers I don't need.
  • Give access to all (or at least most) events and properties of the ISAXXMLReader, ISAXContentHandler, ISAXLexicalHandler, ISAXDTDHandler, ISAXDeclHandler, ISAXAttributes, ISAXLocator, and ISAXErrorHandler interfaces from a single component.

And all this preferably without having to deal with any COM stuff, of course.

This is what TSAXParser is designed to do. The way the component works is straightforward. The unit that contains TSAXParser also contains an implementation for the ISAXContentHandler, ISAXLexicalHandler, ISAXDeclHandler, ISAXDTDHandler, and ISAXErrorHandler interfaces. These classes are instantiated and "owned" by the TSAXParser component. The only purpose of the event handlers in these interface implementations is to convert the COM C++ data types to Pascal format, and to call the corresponding event methods of the application, if any, that have been set by means of TSAXParser. The "if any" matters: You do not have to provide dummy implementations for the events that don't interest you. The parser-generated events are silently ignored if you didn't provide an event handler. The complete definition of the TSAXParser class is available electronically.

The constructor of TSAXParser creates an instance of the SAX2 reader, immediately followed by the creation of the five handler classes. These are handed over to the reader. Example 3 is code for the constructor of TSAXParser.

A close look at the TSAXParser class definition reveals that the customary private fields you expect for storing the event method properties are missing. The property access methods in Example 4 show why: The properties are not stored in TSAXParser, but directly in the handler classes created in the constructor. This way the address of any TSAXParser event handler created in the object inspector is immediately stored in the corresponding ISAX handler. All that is left for you to do is write code in the event handlers you decide to implement.
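The delegation idea can be sketched without any COM code. In the program below, TMiniParser, TMiniHandler, TApp, and TElementEvent are hypothetical stand-ins, not the actual TSAXParser classes; the point is only that the component has no private event field and its property accessors read and write the handler object directly.

```pascal
program EventDelegationDemo;
{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$APPTYPE CONSOLE}

type
  TElementEvent = procedure(const LocalName: string) of object;

  // Hypothetical stand-in for one of the handler classes.
  TMiniHandler = class
  public
    OnStartElement: TElementEvent;
    procedure Fire(const LocalName: string);
  end;

  // Hypothetical stand-in for the component: note the missing FOnStartElement.
  TMiniParser = class
  private
    FHandler: TMiniHandler;
    function GetOnStartElement: TElementEvent;
    procedure SetOnStartElement(Value: TElementEvent);
  public
    constructor Create;
    destructor Destroy; override;
    procedure Parse;
    property OnStartElement: TElementEvent
      read GetOnStartElement write SetOnStartElement;
  end;

  TApp = class
    Calls: Integer;
    procedure HandleStart(const LocalName: string);
  end;

procedure TMiniHandler.Fire(const LocalName: string);
begin
  if Assigned(OnStartElement) then  // unassigned events are silently skipped
    OnStartElement(LocalName);
end;

constructor TMiniParser.Create;
begin
  inherited Create;
  FHandler := TMiniHandler.Create;
end;

destructor TMiniParser.Destroy;
begin
  FHandler.Free;
  inherited Destroy;
end;

function TMiniParser.GetOnStartElement: TElementEvent;
begin
  Result := FHandler.OnStartElement;
end;

procedure TMiniParser.SetOnStartElement(Value: TElementEvent);
begin
  FHandler.OnStartElement := Value;  // stored in the handler, not here
end;

procedure TMiniParser.Parse;
begin
  FHandler.Fire('book');  // pretend the reader reported one element
end;

procedure TApp.HandleStart(const LocalName: string);
begin
  Inc(Calls);
  WriteLn('start: ', LocalName);
end;

var
  Parser: TMiniParser;
  App: TApp;
begin
  App := TApp.Create;
  Parser := TMiniParser.Create;
  try
    Parser.Parse;                             // no handler set: nothing happens
    Parser.OnStartElement := App.HandleStart;
    Parser.Parse;                             // now the event reaches the application
  finally
    Parser.Free;
    App.Free;
  end;
end.
```

Keeping the method pointer in the handler saves one indirection on every callback, at the cost of slightly longer property accessors in the component.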

While TSAXParser controls access to the event method properties of the handler classes, these classes also maintain a couple of properties on behalf of the TSAXParser that owns them: Every handler can continually update the Line and Column properties before calling the corresponding Delphi event method. It uses the ISAXLocator interface that is made available by ISAXContentHandler.PutDocumentLocator() for this purpose. This gives a Delphi application using TSAXParser automatic access to the current line and column number in the document being parsed.

Because the ISAXLocator getLine()/getColumn() methods are slow, I disabled this feature by default. I found that retrieving line/column information for each event adds about 15 percent overhead to the parsing process. You can enable this automatic location update at any time in code, or at design time in the Object Inspector, by setting the UpdateLocation property to True. Even if UpdateLocation is False, you can still request a snapshot update of the Line/Column properties by a call to the GetLocation procedure.

Figure 1 shows the classic example file Bookshop.xml after parsing by a demo application (sxpdemo.dpr, available electronically). Each line shows the current line number in the XML file, an indentation representing the element nesting, the ISAX handler that caused the event (coded as CH, DTDH, LexH, or DeclH), and the formatted arguments as they were received by the event handler.

A Minimal Filter Application

Listing Three is a complete filter application that scans the Bookshop XML example file for <book> elements with the attributes "genre=fiction" and "in_stock=yes." If such an element is found, all elements found between the <book> and the next </book> tag are shown in a memo field.

The application consists of a form with a TSAXParser component, a TMemo control, a Boolean to preserve state information, and a few lines of code inside the FormCreate, SXPStartElement, SXPCharacters, and SXPEndElement event handlers.

FormCreate writes a title line to the memo and fires up the parser. SXPStartElement checks for a <book> tag; when it finds one, it checks the genre and in_stock attributes (note that TSXPAttributeList.GetItem() is case insensitive). If genre and in_stock match the search condition, the state flag is set. SXPEndElement resets the flag when it finds a </book> tag, and SXPCharacters writes the name and text of each element to the memo while the state flag is set. It doesn't get any easier than that. Figure 2 shows the filter demo.

Install the TSAXParser component (found in MSSAXParser.pas) before you run the demo application. Installing TSAXParser also installs the DOM components in MSXML2_TLB.pas. MSSAXParser.dcr and MSXML2_TLB.dcr contain icons for some of these components. TSAXParser shows up as the SXP icon in the XML tab on the component palette.

Conclusion

Using Delphi to implement the C++ COM interfaces of the Microsoft SAX2 parser lets you achieve maximum speed when parsing XML documents. The TSAXParser component presented here shields you from those COM interfaces, and makes using the MS SAX2 parser a trivial exercise. The work that TSAXParser does behind the scenes on your behalf comes at a price: You lose some of the flexibility you get when using the COM interfaces directly, and there is a small speed penalty (but TSAXParser still beats an equivalent VB application). The ease of use makes such a component an attractive proposition. And you can always fall back on the bare COM interfaces when speed is important.

DDJ

Listing One

// *********************************************************************//
// Interface: ISAXContentHandler
// Flags:     (16) Hidden
// GUID:      {1545CDFA-9E4E-4497-A8A4-2BF7D0112C44}
// *********************************************************************//
ISAXContentHandler = interface(IUnknown)
['{1545CDFA-9E4E-4497-A8A4-2BF7D0112C44}']
function  putDocumentLocator(
          const pLocator: ISAXLocator): HResult; stdcall;
function  startDocument: HResult; stdcall;
function  endDocument: HResult; stdcall;
function  startPrefixMapping(
          const pwchPrefix: pWideChar;
          cchPrefix: SYSINT;
          const pwchUri: pWideChar;
          cchUri: SYSINT): HResult; stdcall;
function  endPrefixMapping(
          const pwchPrefix: pWideChar;
          cchPrefix: SYSINT): HResult; stdcall;
function  startElement(
          const pwchNamespaceUri: pWideChar;
          cchNamespaceUri: SYSINT;
          const pwchLocalName: pWideChar;
          cchLocalName: SYSINT;
          const pwchQName: pWideChar;
          cchQName: SYSINT;
          const pAttributes: ISAXAttributes): HResult; stdcall;
function  endElement(
          const pwchNamespaceUri: pWideChar;
          cchNamespaceUri: SYSINT;
          const pwchLocalName: pWideChar;
          cchLocalName: SYSINT;
          const pwchQName: pWideChar;
          cchQName: SYSINT): HResult; stdcall;
function  characters(
          const pwchChars: pWideChar;
          cchChars: SYSINT): HResult; stdcall;
function  ignorableWhitespace(
          const pwchChars: pWideChar;
          cchChars: SYSINT): HResult; stdcall;
function  processingInstruction(
          const pwchTarget: pWideChar;
          cchTarget: SYSINT;
          const pwchData: pWideChar;
          cchData: SYSINT): HResult; stdcall;
function  skippedEntity(
          const pwchName: pWideChar;
          cchName: SYSINT): HResult; stdcall;
end;


Listing Two

// ISAXContentHandler.StartElement callback function
function  TSAXContenthandler.startElement(
                         const pwchNamespaceUri: pWideChar;
                         cchNamespaceUri: SYSINT;
                         const pwchLocalName: pWideChar;
                         cchLocalName: SYSINT;
                         const pwchQName: pWideChar;
                         cchQName: SYSINT;
                         const pAttributes: ISAXAttributes): HResult; stdcall;
var
  strNamespaceURI: string;
  strLocalName: string;
  strQName: string;
  Attribute: TSXPAttribute;
  i: integer;
  nAttributes : integer;
  pURI, pLocalName, pQname, pValue, pType: pWideChar;
  URIsize, Localsize, Qsize, Typesize, size: integer;
begin
  if Assigned(FOnStartElement) then begin
    GetLineColumn;
    // convert element name and URI to string
    try
      if pwchNamespaceUri <> Nil then
        strNamespaceURI :=
          WideCharLenToString(pwchNamespaceUri, cchNamespaceUri)
      else
        strNamespaceURI := '';
      if pwchLocalName <> Nil then
        strLocalName := WideCharLenToString(pwchLocalName, cchLocalName)
      else
        strLocalName := '';
      if pwchQName <> Nil then
        strQName := WideCharLenToString(pwchQName, cchQName)
      else
        strQName := '';
      // build the attribute list
      try
        if pAttributes <> Nil then begin
          with pAttributes do begin
            getLength(nAttributes);
            FAttributeList.Capacity := nAttributes;
            for i := 0 to nAttributes - 1 do begin
              getName(i, pURI, URIsize, pLocalName, Localsize, pQName, Qsize);
              getValue(i, pValue, size);
              getType(i, pType, Typesize);
              Attribute := TSXPAttribute.Create;
              with Attribute do begin
                URI       := WideCharLenToString(pURI, URIsize);
                LocalName := WideCharLenToString(pLocalName, Localsize);
                QName     := WideCharLenToString(pQName, Qsize);
                Value     := WideCharLenToString(pValue, size);
                AttType   := WideCharLenToString(pType, Typesize);
              end;
              FAttributeList.Add(Attribute);
            end; // for
          end; // with
        end; // if pAttributes <> Nil
        // now call the application event handler
        FOnStartElement(strNamespaceUri,strLocalName,strQName,FAttributeList);
      finally
        // clean up the attributes list
        for i := 0 to FAttributeList.Count - 1 do begin
          TSXPAttribute(FAttributeList.Items[i]).Free;
        end;
        FAttributeList.Clear;
      end;
      Result := S_OK;
    except
      Result := E_Fail;
    end;
  end else begin
    Result := S_OK;
  end;
end;


Listing Three

unit filterform;
interface
uses
  Windows, Messages, SysUtils, Classes, Graphics, Controls, Forms,
  Dialogs, StdCtrls, MSSAXParser;
type
  TfrmFilter = class(TForm)
    SXP: TSAXParser;
    Memo1: TMemo;
    procedure SXPStartElement(const NamespaceURI, Localname,
      QName: String; const Attributes: TSXPAttributeList);
    procedure SXPEndElement(const NamespaceURI, Localname,
      QName: String);
    procedure SXPCharacters(const Chars: String);
    procedure FormCreate(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
    bWanted: boolean;
    strElement: string;
  end;
var
  frmFilter: TfrmFilter;
implementation
{$R *.DFM}
procedure TfrmFilter.SXPStartElement(const NamespaceURI, Localname,
  QName: String; const Attributes: TSXPAttributeList);
var
  attr: TSXPAttribute;
begin
  if bWanted then begin
    strElement := LocalName;
    exit;
  end;
  if CompareText(Localname, 'book') <> 0 then exit;
  if not Assigned (Attributes) then exit;
  attr := Attributes.GetItem('genre');
  if (attr = Nil) or (CompareText(attr.Value, 'fiction') <> 0) then exit;
  attr := Attributes.GetItem('in_stock');
  if (attr = Nil) or (CompareText(attr.Value, 'yes') <> 0) then exit;
  bWanted := true;
end;
procedure TfrmFilter.SXPEndElement(const NamespaceURI, Localname,
  QName: String);
begin
  if CompareText(LocalName, 'book') = 0 then begin
    if bWanted then
      Memo1.Lines.Add('--------------------------------------------');
    bWanted := false;
  end;
end;
procedure TfrmFilter.SXPCharacters(const Chars: String);
begin
  if bWanted then
    if Trim(Chars) <> '' then
      Memo1.Lines.Add(Format('%-20s %s', [strElement, Chars]));
end;
procedure TfrmFilter.FormCreate(Sender: TObject);
begin
  Memo1.Lines.Add('Fiction books currently in stock :');
  Memo1.Lines.Add('==================================');
  SXP.Parse;
end;
end.

