The Delphi XML SAX2 Component & MSXML 3.0

Danny shows how to use the C++ COM interfaces of Microsoft's MSXML 3.0 SAX2 parser with Borland Delphi. He then presents TSAXParser, a Delphi component that uses these interfaces, but shields you from their complexities.


September 01, 2001
URL:http://www.drdobbs.com/web-development/the-delphi-xml-sax2-component-msxml-30/184404767

Danny is a systems programmer at Cevi NV. He can be contacted at [email protected].


SAX parsers are designed to process large XML documents quickly and with minimal memory use: rather than building the whole document in memory, they report its contents to your code through callback functions as they read it. In this article, I show how to use the C++ COM interfaces of Microsoft's MSXML 3.0 SAX2 parser (http://msdn.microsoft.com/xml/general/xmlparser.asp) with Borland Delphi (http://community.borland.com/delphi/). I also present TSAXParser, a Delphi component that uses these interfaces but shields you from their complexities without sacrificing speed or functionality. All it takes to parse an XML file with TSAXParser is to drop the component on a form, select a few event handlers in the Object Inspector, and call the Parse method. The complete source code for this component (along with a demo application) is available electronically; see "Resource Center," page 5. For background information on SAX parsers in general, and the MSXML 3.0 SAX2 parser in particular, see "Parsing XML," by David Cox (DDJ, January 2001) and "Programmer's Toolchest," by Eldar Musayev (DDJ, February 2001).

For speed and simplicity, I decided to use MSXML's native C++ COM interfaces, not the Visual Basic IVBSAX wrappers. Microsoft provided the IVBSAX wrappers in MSXML 3.0 to support Visual Basic and other COM-enabled languages that can't handle the C++ COM interfaces and data types. I will explain how you can implement and use these C++ COM interfaces with Borland Delphi. You can implement them as normal classes; you don't need full-featured COM objects.

Importing the MSXML Type Library

Before using MSXML3.DLL in Delphi, you have to import its type library (Microsoft XML 3.0). This generates an Object Pascal source file (MSXML2_TLB.pas) that gives the Delphi IDE and compiler access to the COM objects and interfaces inside the DLL.

A key difference between the DOM parser in MSXML and the SAX2 parser is that the SAX parser exposes COM interfaces that have no corresponding implementation in MSXML3.DLL. You have to implement these COM interfaces yourself instead of instantiating and using existing COM objects. You then pass a reference to your implementation to the parser, so that it can call back the functions you implemented at the appropriate moment in the parsing process.

The Import Type Library Wizard

My decision to use the lightweight C++ ISAX interfaces that descend from IUnknown instead of the VB wrappers that descend from IDispatch immediately caused some trouble with the Delphi Import Type Library wizard. The wizard generates Pascal code based on the information present in the type library. While this works well for the IDispatch-type VB interfaces, it has problems with the data types of the C++ interfaces. Example 1(a) shows how a method argument of type unsigned short * in the type library, which corresponds to a const wchar_t * in the MSXML 3.0 SDK, Example 1(b), is incorrectly converted to Word by the import wizard, Example 1(c). Either pWord or pWideChar would have been acceptable, since both represent pointers to a 16-bit word. I decided to edit MSXML2_TLB.PAS to make the argument definitions correspond closely to those in the SDK; see Example 1(d). This permits the use of the native Delphi data types and (wide) string functions. The modified file is included with the source code.

Implementing the C++ COM Interfaces

The abstract C++ interfaces in the SAX2 parser (abstract meaning that there is no corresponding implementation in MSXML3.DLL) derive from IUnknown. The workhorse SAX interface that has to be implemented is ISAXContentHandler, which handles most XML events. Listing One is a (corrected) Object Pascal version of the ISAXContentHandler interface definition.

When all you need is a single in-process instance of the class, Delphi offers a straightforward way of implementing interfaces that derive from IUnknown. The TInterfacedObject class implements IUnknown and is designed to be used in these circumstances. Example 2(a) is a (partial) class definition for ISAXContentHandler using TInterfacedObject. All it takes to instantiate such a class is to call the Create class constructor, Example 2(b), without the hassle of COM class factories and globally unique identifiers. (The C++ examples in the MSXML SDK use the same approach.) The complete definition of a TSAXContentHandler class that implements the ISAXContentHandler interface as it is used in the TSAXParser component is available electronically; see Resource Center, page 5.
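The pattern described above can be tried in isolation with the following minimal sketch. It is not the author's code: IDemoHandler, TDemoHandler, and the GUID are hypothetical stand-ins for ISAXContentHandler and its implementation, chosen so the program compiles without MSXML. It shows a class that inherits IUnknown from TInterfacedObject, is created with a plain call to Create, and is released automatically once the last interface reference goes away.

```pascal
program InterfaceSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  // Hypothetical callback interface standing in for ISAXContentHandler.
  // The GUID is arbitrary and only serves this sketch.
  IDemoHandler = interface(IUnknown)
    ['{6B1F3A52-4C0D-4E7A-9B7E-2D5A8C3F1E10}']
    function startDocument: HResult; stdcall;
  end;

  // TInterfacedObject supplies QueryInterface, _AddRef and _Release,
  // so only the methods of IDemoHandler have to be written.
  TDemoHandler = class(TInterfacedObject, IDemoHandler)
  public
    function startDocument: HResult; stdcall;
    destructor Destroy; override;
  end;

function TDemoHandler.startDocument: HResult; stdcall;
begin
  WriteLn('startDocument called');
  Result := S_OK;
end;

destructor TDemoHandler.Destroy;
begin
  WriteLn('handler destroyed');
  inherited Destroy;
end;

procedure Run;
var
  Handler: IDemoHandler;  // interface-typed, so it is reference counted
begin
  // No class factory or registration is involved, just a constructor call
  Handler := TDemoHandler.Create;
  Handler.startDocument;
end;  // the last reference is dropped here and the object frees itself

begin
  Run;
end.
```

Holding the object in an interface-typed variable, as this sketch does, lets the reference count decide the object's lifetime; that is a design choice of the sketch, not a statement about how TSAXParser stores its handlers.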

Once the class implementing the interface has been defined, all that remains to be done is provide an implementation for each method. It's important to realize that a class that implements an interface has to provide an implementation for all the methods defined in that interface, even if these implementations don't do anything at all.

An Example Implementation

Listing Two shows the content handler's startElement() function. The parameters passed to the function are pointer/size pairs for the namespace URI, the local name, and the qualified name of the current element, plus a reference to an ISAXAttributes interface. The ISAXAttributes interface gives access to the attributes of this element. (This interface is implemented in MSXML3.DLL, so you don't have to provide an implementation for it.)

First, the function checks whether the application has set the OnStartElement event handler. If not, the function immediately returns S_OK and parsing continues. If the event handler has been set, the function first calls the GetLineColumn procedure to retrieve the current line and column position in the XML document. GetLineColumn uses the ISAXLocator interface that was saved earlier, when the parser called the putDocumentLocator() implementation. For performance reasons, the application can disable GetLineColumn by setting the UpdateLocation property to False.

Next, the pWideChar C++ COM-style parameters (pwchNamespaceUri, pwchLocalName, pwchQName) are converted to Delphi AnsiStrings, and the attributes are retrieved using the getName(), getValue(), and getType() methods of the ISAXAttributes interface parameter. The properties of each attribute are stored in a TSXPAttribute object containing the name, value, and type of the attribute, and all attributes of this element are stored in a list (TSXPAttributeList, derived from TList) for easy access by the application. The OnStartElement event procedure of the application is then called with the converted arguments and the attribute list. If OnStartElement does not raise an exception, S_OK is returned and parsing continues; otherwise, TSAXContentHandler.startElement() returns E_FAIL and parsing is aborted.
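The rule that a callback must report failure through its return value, never by letting an exception escape, can be isolated in a few lines. The sketch below is a simplified illustration under stated assumptions: FireElementEvent, TElementEvent, HR_OK, and HR_FAIL are hypothetical names, the event is a plain procedure rather than a method of a form, and the two constants are local copies of the S_OK and E_FAIL values so that the program does not depend on the Windows unit.

```pascal
program HResultSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

const
  // Local copies of the COM result codes (names are hypothetical)
  HR_OK   = HResult(0);
  HR_FAIL = HResult($80004005);  // numeric value of E_FAIL

type
  // Simplified event type; the real component uses method pointers
  TElementEvent = procedure(const Name: string);

// Calls the application's handler, if any, and translates an
// exception into a failure code that the caller can act on.
function FireElementEvent(Handler: TElementEvent;
  const Name: string): HResult;
begin
  Result := HR_OK;
  if not Assigned(Handler) then
    Exit;                 // no handler set: report success, keep parsing
  try
    Handler(Name);
  except
    Result := HR_FAIL;    // the exception must not cross the COM boundary
  end;
end;

procedure GoodHandler(const Name: string);
begin
  WriteLn('element: ', Name);
end;

procedure BadHandler(const Name: string);
begin
  raise Exception.Create('stop parsing at ' + Name);
end;

begin
  WriteLn(FireElementEvent(nil, 'book') = HR_OK);
  WriteLn(FireElementEvent(GoodHandler, 'book') = HR_OK);
  WriteLn(FireElementEvent(BadHandler, 'book') = HR_FAIL);
end.
```

Raising an exception inside an event handler is therefore a convenient way for the application to stop the parser early.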

The out Parameters

Contrary to normal COM memory allocation practices, you should not use CoTaskMemFree() to free the out parameter strings returned by functions such as ISAXAttributes.getValue(), with the exception of the getProperty() functions. Most of the time, the SAX2 functions in MSXML3.DLL do not allocate memory for passing a copy of the string parameters to the event handlers, but instead return a pointer and size pair pointing directly to the data in your input buffer. After all, SAX parsers should be fast.
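A pointer/size pair of this kind is not a null-terminated string, so the length has to be honored when copying. The following minimal sketch imitates the situation without MSXML: the Buffer variable is a hypothetical stand-in for the parser's input buffer, and pName/cchName play the role of the values a callback would receive. WideCharLenToString is the same conversion routine used in Listing Two.

```pascal
program PointerLengthSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

var
  Buffer: WideString;   // stands in for the parser's input buffer
  pName: PWideChar;     // pointer half of the pair
  cchName: Integer;     // size half of the pair, counted in characters
  Name: string;
begin
  Buffer := '<book genre="fiction">';
  // Point at the 'b' of "book" inside the buffer. The character that
  // follows the name is a space, not a #0 terminator.
  pName := @Buffer[2];
  cchName := 4;
  // Copy exactly cchName characters into a Delphi string. The pointer
  // refers to memory owned by the buffer, so it is never freed here.
  Name := WideCharLenToString(pName, cchName);
  WriteLn(Name);
end.
```

Because the copy is bounded by cchName, the rest of the buffer after the element name is never read.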

The TSAXParser Component

After playing around with the MSXML SAX2 parser in Delphi, I found it tedious to write variations of the same code each time I wanted to test some other feature of the parser. I realized that it would be easier to have a native Delphi component that would:

And all this preferably without having to deal with any COM stuff, of course.

This is what TSAXParser is designed to do. The way the component works is straightforward. The unit that contains TSAXParser also contains an implementation for the ISAXContentHandler, ISAXLexicalHandler, ISAXDeclHandler, ISAXDTDHandler, and ISAXErrorHandler interfaces. These classes are instantiated and "owned" by the TSAXParser component. The only purpose of the event handlers in these interface implementations is to convert the COM C++ data types to Pascal format, and to call the corresponding event methods of the application, if any, that have been set by means of TSAXParser. In terms of the "if any," you do not have to provide dummy implementations for the events that don't interest you! The parser-generated events are silently ignored if you didn't provide an event handler. The complete definition of the TSAXParser class is available electronically.

The constructor of TSAXParser creates an instance of the SAX2 reader, immediately followed by the creation of the five handler classes. These are handed over to the reader. Example 3 is code for the constructor of TSAXParser.

A close look at the TSAXParser class definition reveals that the customary private fields you expect for storing the event method properties are missing. The property access methods in Example 4 show why: The properties are not stored in TSAXParser, but directly in the handler classes created in the constructor. This way the address of any TSAXParser event handler created in the object inspector is immediately stored in the corresponding ISAX handler. All that is left for you to do is write code in the event handlers you decide to implement.

While TSAXParser controls access to the event method properties of the handler classes, these classes also maintain a couple of properties on behalf of the TSAXParser that owns them: Every handler can continually update the Line and Column properties before calling the corresponding Delphi event method. It uses the ISAXLocator interface that is made available by ISAXContentHandler.PutDocumentLocator() for this purpose. This gives a Delphi application using TSAXParser automatic access to the current line and column number in the document being parsed.

Because the ISAXLocator getLine()/getColumn() methods are slow, I disabled this feature by default. I found that retrieving line/column information for each event adds about 15 percent overhead to the parsing process. You can enable this automatic location update at any time (or in the Object Inspector if you wish) by setting the UpdateLocation property to True. Even if UpdateLocation is False, you can still request a snapshot update of the Line/Column properties by a call to the GetLocation procedure.

Figure 1 shows the classic example file Bookshop.xml after parsing by a demo application (sxpdemo.dpr, available electronically). Each line shows the current line number in the XML file, an indentation representing the element nesting, the ISAX handler that caused the event (coded as CH, DTDH, LexH, or DeclH), and the formatted arguments as they were received by the event handler.

A Minimal Filter Application

Listing Three is a complete filter application that scans the Bookshop XML example file for <book> elements with the attributes "genre=fiction" and "in_stock=yes." If such an element is found, all elements found between the <book> and the next </book> tag are shown in a memo field.

The application consists of a form with a TSAXParser component, a TMemo control, a Boolean field to preserve state information, and a few lines of code in the FormCreate, SXPStartElement, SXPCharacters, and SXPEndElement event handlers.

FormCreate writes a title line to the memo and fires up the parser. SXPStartElement checks for a <book> tag; when it finds one, it checks the genre and in_stock attributes (note that TSXPAttributeList.GetItem() is case insensitive). If genre and in_stock match the search condition, the state flag is set. While the flag is set, SXPStartElement records the name of each nested element, and SXPCharacters writes that name and the element's character data to the memo. SXPEndElement resets the flag when it finds a </book> tag. It doesn't get any easier than that. Figure 2 shows the filter demo.

Install the TSAXParser component (found in MSSAXParser.pas) before you run the demo application. Installing TSAXParser also installs the DOM components in MSXML2_TLB.pas. MSSAXParser.dcr and MSXML2_TLB.dcr contain icons for some of these components. TSAXParser shows up as the SXP icon in the XML tab on the component palette.

Conclusion

Using Delphi to implement the C++ COM interfaces of the Microsoft SAX2 parser lets you achieve maximum speed when parsing XML documents. The TSAXParser component presented here shields you from those COM interfaces, and makes using the MS SAX2 parser a trivial exercise. The work that TSAXParser does behind the scenes on your behalf comes at a price: You lose some of the flexibility you get when using the COM interfaces directly, and there is a small speed penalty (but TSAXParser still beats an equivalent VB application). The ease of use makes such a component an attractive proposition. And you can always fall back on the bare COM interfaces when speed is important.

DDJ

Listing One

// *********************************************************************//
// Interface: ISAXContentHandler
// Flags:     (16) Hidden
// GUID:      {1545CDFA-9E4E-4497-A8A4-2BF7D0112C44}
// *********************************************************************//
ISAXContentHandler = interface(IUnknown)
['{1545CDFA-9E4E-4497-A8A4-2BF7D0112C44}']
function  putDocumentLocator(
          const pLocator: ISAXLocator): HResult; stdcall;
function  startDocument: HResult; stdcall;
function  endDocument: HResult; stdcall;
function  startPrefixMapping(
          const pwchPrefix: pWideChar;
          cchPrefix: SYSINT;
          const pwchUri: pWideChar;
          cchUri: SYSINT): HResult; stdcall;
function  endPrefixMapping(
          const pwchPrefix: pWideChar;
          cchPrefix: SYSINT): HResult; stdcall;
function  startElement(
          const pwchNamespaceUri: pWideChar;
          cchNamespaceUri: SYSINT;
          const pwchLocalName: pWideChar;
          cchLocalName: SYSINT;
          const pwchQName: pWideChar;
          cchQName: SYSINT;
          const pAttributes: ISAXAttributes): HResult; stdcall;
function  endElement(
          const pwchNamespaceUri: pWideChar;
          cchNamespaceUri: SYSINT;
          const pwchLocalName: pWideChar;
          cchLocalName: SYSINT;
          const pwchQName: pWideChar;
          cchQName: SYSINT): HResult; stdcall;
function  characters(
          const pwchChars: pWideChar;
          cchChars: SYSINT): HResult; stdcall;
function  ignorableWhitespace(
          const pwchChars: pWideChar;
          cchChars: SYSINT): HResult; stdcall;
function  processingInstruction(
          const pwchTarget: pWideChar;
          cchTarget: SYSINT;
          const pwchData: pWideChar;
          cchData: SYSINT): HResult; stdcall;
function  skippedEntity(
          const pwchName: pWideChar;
          cchName: SYSINT): HResult; stdcall;
end;

Listing Two

// ISAXContentHandler.StartElement callback function
function  TSAXContenthandler.startElement(
                         const pwchNamespaceUri: pWideChar;
                         cchNamespaceUri: SYSINT;
                         const pwchLocalName: pWideChar;
                         cchLocalName: SYSINT;
                         const pwchQName: pWideChar;
                         cchQName: SYSINT;
                         const pAttributes: ISAXAttributes): HResult; stdcall;
var
  strNamespaceURI: string;
  strLocalName: string;
  strQName: string;
  Attribute: TSXPAttribute;
  i: integer;
  nAttributes : integer;
  pURI, pLocalName, pQname, pValue, pType: pWideChar;
  URIsize, Localsize, Qsize, Typesize, size: integer;
begin
  if Assigned(FOnStartElement) then begin
    GetLineColumn;
    // convert element name and URI to string
    try
      if pwchNamespaceUri <> Nil then
        strNamespaceURI := WideCharLenToString(pwchNamespaceUri, cchNamespaceUri)
      else
        strNamespaceURI := '';
      if pwchLocalName <> Nil then
        strLocalName := WideCharLenToString(pwchLocalName, cchLocalName)
      else
        strLocalName := '';
      if pwchQName <> Nil then
        strQName := WideCharLenToString(pwchQName, cchQName)
      else
        strQName := '';
      // build the attribute list
      try
        if pAttributes <> Nil then begin
          with pAttributes do begin
            getLength(nAttributes);
            FAttributeList.Capacity := nAttributes;
            for i := 0 to nAttributes - 1 do begin
              getName(i, pURI, URIsize, pLocalName, Localsize, pQName, Qsize);
              getValue(i, pValue, size);
              getType(i, pType, Typesize);
              Attribute := TSXPAttribute.Create;
              with Attribute do begin
                URI       := WideCharLenToString(pURI, URIsize);
                LocalName := WideCharLenToString(pLocalName, Localsize);
                QName     := WideCharLenToString(pQName, Qsize);
                Value     := WideCharLenToString(pValue, size);
                AttType   := WideCharLenToString(pType, Typesize);
              end;
              FAttributeList.Add(Attribute);
            end; // for
          end; // with
        end; // if pAttributes <> Nil
        // now call the application event handler
        FOnStartElement(strNamespaceUri,strLocalName,strQName,FAttributeList);
      finally
        // clean up the attributes list
        for i := 0 to FAttributeList.Count - 1 do begin
          TSXPAttribute(FAttributeList.Items[i]).Free;
        end;
        FAttributeList.Clear;
      end;
      Result := S_OK;
    except
      Result := E_Fail;
    end;
  end else begin
    Result := S_OK;
  end;
end;

Listing Three

unit filterform;
interface
uses
  Windows, Messages, SysUtils, Classes, Graphics, Controls, Forms,
  Dialogs, StdCtrls, MSSAXParser;
type
  TfrmFilter = class(TForm)
    SXP: TSAXParser;
    Memo1: TMemo;
    procedure SXPStartElement(const NamespaceURI, Localname,
      QName: String; const Attributes: TSXPAttributeList);
    procedure SXPEndElement(const NamespaceURI, Localname,
      QName: String);
    procedure SXPCharacters(const Chars: String);
    procedure FormCreate(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
    bWanted: boolean;
    strElement: string;
  end;
var
  frmFilter: TfrmFilter;
implementation
{$R *.DFM}
procedure TfrmFilter.SXPStartElement(const NamespaceURI, Localname,
  QName: String; const Attributes: TSXPAttributeList);
var
  attr: TSXPAttribute;
begin
  if bWanted then begin
    strElement := LocalName;
    exit;
  end;
  if CompareText(Localname, 'book') <> 0 then exit;
  if not Assigned (Attributes) then exit;
  attr := Attributes.GetItem('genre');
  if (attr = Nil) or (CompareText(attr.Value, 'fiction') <> 0) then exit;
  attr := Attributes.GetItem('in_stock');
  if (attr = Nil) or (CompareText(attr.Value, 'yes') <> 0) then exit;
  bWanted := true;
end;
procedure TfrmFilter.SXPEndElement(const NamespaceURI, Localname,
  QName: String);
begin
  if CompareText(LocalName, 'book') = 0 then begin
    if bWanted then
      Memo1.Lines.Add('--------------------------------------------');
    bWanted := false;
  end;
end;
procedure TfrmFilter.SXPCharacters(const Chars: String);
begin
  if bWanted then
    if Trim(Chars) <> '' then
      Memo1.Lines.Add(Format('%-20s %s', [strElement, Chars]));
end;
procedure TfrmFilter.FormCreate(Sender: TObject);
begin
  Memo1.Lines.Add('Fiction books currently in stock :');
  Memo1.Lines.Add('==================================');
  SXP.Parse;
end;
end.


(a) 
HRESULT _stdcall characters(
       [in] unsigned short* pwchChars, 
       [in] int cchChars);

(b) 
HRESULT characters(
       [in] const wchar_t * pwchChars, 
       [in] int cchChars);

(c) 
function  characters(
       var pwchChars: Word; // * WRONG * //
       cchChars: SYSINT): HResult; stdcall;

(d)
function  characters(
       const pwchChars: pWideChar;
       cchChars: SYSINT): HResult; stdcall;

Example 1: The problem with the code generated by the Import Type Library wizard: (a) the type library definition; (b) the C++ definition in the SDK; (c) the wizard-generated Pascal definition; (d) a better Pascal definition.

(a)
type
  TSAXContentHandler = class(TInterfacedObject, ISAXContentHandler)
  public
    function  putDocumentLocator(const pLocator: ISAXLocator):
                                 HResult; stdcall;
    function  startDocument: HResult; stdcall;
    ... other methods ...
  end;

(b)
var
  ContentHandler: TSAXContentHandler;
begin
  Contenthandler := TSAXContentHandler.Create;  

Example 2: TSAXContentHandler class. (a) The TSAXContentHandler class implements ISAXContentHandler. The IUnknown interface is implemented by TInterfacedObject; (b) Creating an instance of the TSAXContentHandler class.

constructor TSAXParser.Create(AOwner: TComponent);
begin
  inherited Create(AOwner);
  FLine := 0;
  FColumn := 0;
  FAttributeList := TSXPAttributeList.Create;
  // create an instance of the Reader COM object
  FReader := CreateComObject(CLASS_SAXXMLReader) as ISAXXMLReader;
  // instantiate the handler classes
  FErrorHandler   := TSAXErrorHandler.Create;
  FLexicalHandler := TSAXLexicalHandler.Create;
  FContenthandler := TSAXContentHandler.Create;
  FDTDHandler     := TSAXDTDHandler.Create;
  FDeclHandler    := TSAXDeclhandler.Create;
  FContentHandler.FDocumentLocator := Nil;
  // Handlers need a reference pointing back to us
  // because they will continually update
  // our Line and Column properties on each event
  FLexicalhandler.SAXParser := Self;
  FContenthandler.SAXParser := Self;
  FDTDHandler.SAXParser := Self;
  FDeclHandler.SAXParser := Self;
  // pass the handler implementations to the reader
  FReader.putErrorHandler(FErrorhandler);
  FReader.putContentHandler(FContentHandler);
  FReader.putDTDHandler(FDTDHandler);
  Freader.putProperty('http://xml.org/sax/properties/lexical-handler',
                      FLexicalhandler as ISAXLexicalHandler);
  Freader.putProperty('http://xml.org/sax/properties/declaration-handler',
                      FDeclhandler as ISAXdeclHandler);
end;

Example 3: The constructor of TSAXParser instantiates the reader and the handler classes, and passes the handlers to the reader.

// OnComment is handled by the SAXLexicalHandler
procedure TSAXParser.SetOnComment(const Value: TOnComment);
begin
  FLexicalHandler.OnComment := Value;
end;
function  TSAXParser.GetOnComment: TOnComment;
begin
  Result := FLexicalHandler.OnComment;
end;
// OnStartDocument is handled by the SAXContentHandler
procedure TSAXParser.SetOnStartDocument(const Value: TOnStartDocument);
begin
  FContentHandler.OnStartDocument := Value;
end;
function TSAXParser.GetOnStartDocument: TOnStartDocument;
begin
  Result := FContentHandler.OnStartDocument;
end;

Example 4: Event handler property access methods of TSAXParser use the TSAXHandler classes to store/retrieve their value.

Figure 1: The SXP demo application.

Figure 2: The filter demo application.

An Expat TSAXParser Implementation

The TSAXParser component comes in two flavors: the one implemented on top of MSXML, and another with the same name, properties, events, methods, and behavior, but implemented on James Clark's Expat library (http://www.jclark.com/bio.htm).

To use the Expat-based component on Windows, you have to download the WIN32 binary of expat.dll at http://sourceforge.net/projects/expat/ (you can also download the source there and build the DLL yourself with the MSVC6 compiler).

Before you can use the Expat DLL in Delphi, you have to translate the C header file expat.h to Pascal. I first used Bob Swart's Headconv 4.0 on expat.h to make a first cut of expat.pas, and then went in for some hours of serious hand-editing the result (correcting the translation errors made by Headconv, and reformatting the code to make it more readable).

I then reimplemented the TSAXParser component using the C functions and callback routines exposed by the Expat library. This was straightforward, with a couple of exceptions.

The "Element Declaration Handler" implementation proved interesting, because here I had to actually free memory in Delphi that had been previously allocated by Expat. Luckily, James Clark provided for this by letting you specify your own memory allocator to be used by the parser. You do this by creating a new instance of the parser with the XML_ParserCreate_MM() function, which has an argument that is a structure containing pointers to memory allocation functions that implement equivalents of malloc(), free(), and realloc(), in my case using the Delphi memory allocator functions GetMem(), FreeMem(), and ReallocMem().
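As a rough illustration of the allocator structure described above, the sketch below declares a Pascal record that mirrors XML_Memory_Handling_Suite from expat.h and fills it with wrappers around GetMem(), ReallocMem(), and FreeMem(). It is not the author's translation: the Pascal type and function names are hypothetical, Cardinal is assumed to match size_t (true on 32-bit Windows), and the program deliberately does not link to expat.dll. It calls the wrappers directly to show that they round-trip; in real use a pointer to the record would be passed to XML_ParserCreate_MM().

```pascal
program MemSuiteSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  // Mirrors XML_Memory_Handling_Suite in expat.h (field order matters).
  // Cardinal stands in for size_t; this is an assumption of the sketch.
  TXMLMemoryHandlingSuite = record
    malloc_fcn:  function(Size: Cardinal): Pointer; cdecl;
    realloc_fcn: function(P: Pointer; Size: Cardinal): Pointer; cdecl;
    free_fcn:    procedure(P: Pointer); cdecl;
  end;

// malloc() equivalent on top of the Delphi memory manager
function DelphiMalloc(Size: Cardinal): Pointer; cdecl;
begin
  GetMem(Result, Size);
end;

// realloc() equivalent; ReallocMem allocates when P is nil
// and releases the block when Size is 0
function DelphiRealloc(P: Pointer; Size: Cardinal): Pointer; cdecl;
begin
  ReallocMem(P, Size);
  Result := P;
end;

// free() equivalent; a nil pointer is ignored, as with C's free()
procedure DelphiFree(P: Pointer); cdecl;
begin
  if P <> nil then
    FreeMem(P);
end;

var
  Suite: TXMLMemoryHandlingSuite;
  Block: Pointer;
begin
  Suite.malloc_fcn  := DelphiMalloc;
  Suite.realloc_fcn := DelphiRealloc;
  Suite.free_fcn    := DelphiFree;
  // Exercise the three entry points the way a C library would
  Block := Suite.malloc_fcn(16);
  Block := Suite.realloc_fcn(Block, 64);
  Suite.free_fcn(Block);
  WriteLn('allocator suite exercised');
end.
```

Because every block then comes from the Delphi memory manager, memory that the parser hands over to the application can be released with FreeMem() on the Delphi side.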

Also, Expat is a SAX1 parser (with extensions in the current version), not a SAX2 parser. To remain compatible with the MSXML version, I implemented some MSXML behavior in TSAXParser, like the way namespaces and namespace prefixes are handled, and the reporting of attribute types. I decided to keep it simple, and just maintain a couple of lists built by the element declaration and attribute declaration handlers. Later on in the parsing process, the element handlers can look up this data to pass it on to the application as needed.

As I do not (yet) have a copy of Borland's Kylix (described as "Delphi for Linux"), I could not test the component on Linux, but it should run virtually unmodified with Kylix (the references to expat.dll will have to be changed to expat.so, I guess).

Expat not only proved superior in performance to the MSXML parser, it also parsed without a hitch a couple of valid XML documents that caused MSXML to throw an OLE exception.

There is only one SAX2 event that I have not implemented yet in the Expat version — the "Unparsed Entity Handler." Complete source code for the Expat component and the same example applications that come with the MSXML version are available electronically; see "Resource Center," page 5.

Finally, I could also have used the C++ Xerces XML parser (by the Apache XML group, http://xml.Apache.org/), which features COM interfaces to make it compatible with MSXML on Windows, but I doubt that these would be usable in Kylix. And linking C++ code with Object Pascal is complicated by the C++ name mangling, so I preferred to use Expat.

— D.H.
