Danny shows how to use the C++ COM interfaces of Microsoft's MSXML 3.0 SAX2 parser with Borland Delphi. He then presents TSAXParser, a Delphi component that uses these interfaces, but shields you from their complexities.
September 01, 2001
URL:http://www.drdobbs.com/web-development/the-delphi-xml-sax2-component-msxml-30/184404767
Danny is a systems programmer at Cevi NV. He can be contacted at [email protected].
An Expat TSAXParser Implementation
SAX parsers are designed to process large XML documents quickly and with minimal memory use: rather than building the whole document in memory, they report its content to the application through callback functions. In this article, I show how to use the C++ COM interfaces of Microsoft's MSXML 3.0 SAX2 parser (http://msdn.microsoft.com/xml/general/xmlparser.asp) with Borland Delphi (http://community.borland.com/delphi/). I also present TSAXParser, a Delphi component that uses these interfaces, but shields you from their complexities without sacrificing speed or functionality. All it takes to parse an XML file using TSAXParser is to drop the component on a form, select a few event handlers in the Object Inspector, and then call the Parse method. The complete source code for this component (along with a demo application) is available electronically; see "Resource Center," page 5. For background information on SAX parsers in general, and the MSXML 3.0 SAX2 parser in particular, see "Parsing XML," by David Cox (DDJ, January 2001) and "Programmer's Toolchest," by Eldar Musayev (DDJ, February 2001).
For speed and simplicity, I decided to use the MSXML's native C++ COM interfaces, not the Visual Basic IVBSAX wrappers. Microsoft provided the IVBSAX wrappers in MSXML 3.0 to support Visual Basic and other COM-enabled languages that can't handle the C++ COM interfaces and data types. I will explain how you can implement and use these C++ COM interfaces with Borland Delphi. You can implement these C++ COM interfaces as normal classes; you don't need full-featured COM objects.
Before using MSXML3.DLL in Delphi, you have to import its type library (Microsoft XML 3.0). This generates an Object Pascal source file (MSXML2_TLB.pas) that gives the Delphi IDE and compiler access to the COM objects and interfaces inside the DLL.
A key difference between the DOM parser in MSXML and the SAX2 parser is that the SAX parser exposes COM interfaces that have no corresponding implementation in MSXML3.DLL. You have to implement these COM interfaces yourself instead of instantiating and using existing COM objects. You then pass a reference to your implementation to the parser, so that it can call back the functions you implemented at the appropriate moment in the parsing process.
My decision to use the lightweight C++ ISAX interfaces that descend from IUnknown instead of the VB wrappers that descend from IDispatch immediately caused some trouble with the Delphi Import Type Library wizard. The wizard generates Pascal code based on the information present in the type library. While this works well for the IDispatch-type VB interfaces, it has problems with the data types of the C++ interfaces. Example 1(a) shows how a method argument of type unsigned short * in the type library, which corresponds to a const wchar_t * in the MSXML 3.0 SDK, Example 1(b), is incorrectly converted to Word by the import wizard, Example 1(c). Either pWord or pWideChar would have been acceptable, since both represent pointers to a 16-bit word. I decided to edit MSXML2_TLB.PAS to make the argument definitions correspond closely to those in the SDK; see Example 1(d). This permits the use of the native Delphi data types and (wide) string functions. The modified file is included with the source code.
The abstract C++ interfaces in the SAX2 parser (abstract meaning that there is no corresponding implementation in MSXML3.DLL) derive from IUnknown. The workhorse SAX interface that has to be implemented is ISAXContentHandler, which handles most XML events. Listing One is a (corrected) Object Pascal version of the ISAXContentHandler interface definition.
When all you need is a single in-process instance of the class, Delphi offers a straightforward way of implementing interfaces that derive from IUnknown. The TInterfacedObject class implements IUnknown and is designed to be used in these circumstances. Example 2(a) is a (partial) class definition for ISAXContentHandler using TInterfacedObject. All it takes to instantiate such a class is to call the Create class constructor, Example 2(b), without the hassle of COM class factories and globally unique identifiers. (The C++ examples in the MSXML SDK use the same approach.) The complete definition of a TSAXContentHandler class that implements the ISAXContentHandler interface as it is used in the TSAXParser component is available electronically; see Resource Center, page 5.
Once the class implementing the interface has been defined, all that remains to be done is provide an implementation for each method. It's important to realize that a class that implements an interface has to provide an implementation for all the methods defined in that interface, even if these implementations don't do anything at all.
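The pattern described above can be tried in isolation, without MSXML. The following is a minimal sketch, assuming a hypothetical two-method interface IDemoHandler (with an invented GUID) and a hypothetical RunFakeParser procedure that stands in for the parser calling back into your class. Note that endDocument still needs a body even though it does nothing.

```pascal
program HandlerSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

type
  // Hypothetical callback interface, for illustration only (not part of MSXML)
  IDemoHandler = interface(IUnknown)
    ['{6B1F0C2E-3D5A-4E7B-9A41-0C2D8E5F7A13}']
    function startDocument: HResult; stdcall;
    function endDocument: HResult; stdcall;
  end;

  // TInterfacedObject supplies QueryInterface, _AddRef, and _Release
  TDemoHandler = class(TInterfacedObject, IDemoHandler)
  public
    function startDocument: HResult; stdcall;
    function endDocument: HResult; stdcall;
  end;

function TDemoHandler.startDocument: HResult; stdcall;
begin
  Writeln('startDocument called');
  Result := S_OK;
end;

function TDemoHandler.endDocument: HResult; stdcall;
begin
  // Nothing to do here, but the method must exist for the class to compile
  Result := S_OK;
end;

// Hypothetical stand-in for the parser: it only knows the interface
procedure RunFakeParser(const Handler: IDemoHandler);
begin
  if Handler.startDocument <> S_OK then
    Exit;
  Handler.endDocument;
end;

var
  Handler: IDemoHandler;
begin
  // No class factory or registration: just call the constructor
  Handler := TDemoHandler.Create;
  RunFakeParser(Handler);
  // The object is freed when the last interface reference goes away
end.
```

Holding the object in an interface variable (rather than a class variable) lets the reference counting in TInterfacedObject manage its lifetime.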
In Listing Two, which shows the content handler's startElement() function, the parameters passed to the function are pointer/size pairs for the namespace URI, the local name, and the qualified name of the current element, plus a reference to an ISAXAttributes interface. The ISAXAttributes interface gives access to the attributes of this element. (This interface is implemented in MSXML3.DLL, so you don't have to provide an implementation for it.)
First, the function checks whether the application has assigned the OnStartElement event handler. If not, the function immediately returns S_OK and parsing continues. If the event handler has been assigned, the function first calls the GetLineColumn procedure to retrieve the current line and column position in the XML document. GetLineColumn uses the ISAXLocator interface that was saved earlier, when the parser called the putDocumentLocator() implementation. For performance reasons, the application can disable GetLineColumn by setting the UpdateLocation property to False.
Next, the pWideChar C++ COM-style parameters (pwchNamespaceUri, pwchLocalName, pwchQName) are converted to Delphi AnsiStrings, and the attributes are retrieved using the getName(), getValue(), and getType() methods of the ISAXAttributes interface parameter. The properties of each attribute are stored in a TSXPAttribute object containing the name, value, and type of the attribute, and all attributes of this element are stored in a list (TSXPAttributeList, derived from TList) for easy access by the application. The OnStartElement event procedure of the application is then called with the converted arguments and the attribute list. If OnStartElement does not raise an exception, S_OK is returned and parsing continues; otherwise, TSAXContentHandler.StartElement() returns E_FAIL and parsing is aborted.
Contrary to normal COM memory allocation practices, you should not use CoTaskMemFree() to free the out parameter strings returned by functions such as ISAXAttributes.getValue(), with the exception of the getProperty() functions. Most of the time, the SAX2 functions in MSXML3.DLL do not allocate memory for passing a copy of the string parameters to the event handlers, but instead return a pointer and size pair pointing directly to the data in your input buffer. After all, SAX parsers should be fast.
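The "check the event, call it, and turn any exception into a failure code" logic can be isolated from the COM details. Here is a minimal sketch, assuming a hypothetical TGuardedHandler class and a hypothetical TApp class that plays the role of the application; neither is taken from the TSAXParser source, and the event signature is simplified to a single string.

```pascal
program EventGuardSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

type
  // Simplified event type, for illustration only
  TOnStartElement = procedure(const QName: string) of object;

  // Hypothetical stand-in for a handler class
  TGuardedHandler = class
  private
    FOnStartElement: TOnStartElement;
  public
    function StartElement(const QName: string): HResult;
    property OnStartElement: TOnStartElement
      read FOnStartElement write FOnStartElement;
  end;

  // Hypothetical application object that owns the event method
  TApp = class
    procedure RejectBooks(const QName: string);
  end;

function TGuardedHandler.StartElement(const QName: string): HResult;
begin
  Result := S_OK;          // no handler assigned: the event is ignored
  if not Assigned(FOnStartElement) then
    Exit;
  try
    FOnStartElement(QName);
  except
    Result := E_FAIL;      // never let the exception escape to the caller
  end;
end;

procedure TApp.RejectBooks(const QName: string);
begin
  if QName = 'book' then
    raise Exception.Create('book elements are not allowed');
end;

var
  Handler: TGuardedHandler;
  App: TApp;
begin
  Handler := TGuardedHandler.Create;
  App := TApp.Create;
  try
    Writeln(Handler.StartElement('book') = S_OK);    // TRUE, no handler yet
    Handler.OnStartElement := App.RejectBooks;
    Writeln(Handler.StartElement('title') = S_OK);   // TRUE
    Writeln(Handler.StartElement('book') = E_FAIL);  // TRUE
  finally
    App.Free;
    Handler.Free;
  end;
end.
```

Catching the exception inside the callback matters because the caller on the other side is C++ code that understands HRESULT values, not Delphi exceptions.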
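Because such a pointer refers to the middle of a larger buffer, the text it points to is not null terminated, and the length argument is the only reliable end marker. The sketch below illustrates this with a WideString that stands in for the parser's input buffer; the variable names are made up for the example.

```pascal
program PointerSizeSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  Buffer: WideString;   // stands in for the parser's input buffer
  pName: PWideChar;
  cchName: Integer;
begin
  Buffer := '<book genre="fiction">';

  // A pointer/size pair describing the element name inside the buffer
  pName := @Buffer[2];  // points at the 'b' of "book"
  cchName := 4;

  // Correct: honor the length that came with the pointer
  Writeln(WideCharLenToString(pName, cchName));  // book

  // Wrong for this kind of data: reads on until a null character is found
  Writeln(WideCharToString(pName));              // book genre="fiction">
end.
```

Nothing is allocated or freed for pName here, which mirrors the advice above: the memory belongs to whoever owns the buffer.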
After playing around with the MSXML SAX2 parser in Delphi, I found it tedious to write variations of the same code each time I wanted to test some other feature of the parser. I realized that it would be easier to have a native Delphi component that would:
And all this preferably without having to deal with any COM stuff, of course.
This is what TSAXParser is designed to do. The way the component works is straightforward. The unit that contains TSAXParser also contains an implementation for the ISAXContentHandler, ISAXLexicalHandler, ISAXDeclHandler, ISAXDTDHandler, and ISAXErrorHandler interfaces. These classes are instantiated and "owned" by the TSAXParser component. The only purpose of the event handlers in these interface implementations is to convert the COM C++ data types to Pascal format, and to call the corresponding event methods of the application, if any, that have been set by means of TSAXParser. In terms of the "if any," you do not have to provide dummy implementations for the events that don't interest you! The parser-generated events are silently ignored if you didn't provide an event handler. The complete definition of the TSAXParser class is available electronically.
The constructor of TSAXParser creates an instance of the SAX2 reader, immediately followed by the creation of the five handler classes. These are handed over to the reader. Example 3 is code for the constructor of TSAXParser.
A close look at the TSAXParser class definition reveals that the customary private fields you expect for storing the event method properties are missing. The property access methods in Example 4 show why: The properties are not stored in TSAXParser, but directly in the handler classes created in the constructor. This way the address of any TSAXParser event handler created in the object inspector is immediately stored in the corresponding ISAX handler. All that is left for you to do is write code in the event handlers you decide to implement.
While TSAXParser controls access to the event method properties of the handler classes, these classes also maintain a couple of properties on behalf of the TSAXParser that owns them: Every handler can continually update the Line and Column properties before calling the corresponding Delphi event method. It uses the ISAXLocator interface that is made available by ISAXContentHandler.PutDocumentLocator() for this purpose. This gives a Delphi application using TSAXParser automatic access to the current line and column number in the document being parsed.
Because the ISAXLocator getLine()/getColumn() methods are slow, I disabled this feature by default. I found that retrieving line/column information for each event adds about 15 percent overhead to the parsing process. You can enable this automatic location update at any time in code, or at design time in the Object Inspector, by setting the UpdateLocation property to True. Even if UpdateLocation is False, you can still request a snapshot update of the Line/Column properties by a call to the GetLocation procedure.
Figure 1 shows the classic example file Bookshop.xml after parsing by a demo application (sxpdemo.dpr, available electronically). Each line shows the current line number in the XML file, an indentation representing the element nesting, the ISAX handler that caused the event (coded as CH, DTDH, LexH, or DeclH), and the formatted arguments as they were received by the event handler.
Listing Three is a complete filter application that scans the Bookshop XML example file for <book> elements with the attributes "genre=fiction" and "in_stock=yes." If such an element is found, all elements found between the <book> and the next </book> tag are shown in a memo field.
The application consists of a form with a TSAXParser component, a TMemo control, a Boolean to preserve state information, and a few lines of code inside the FormCreate, SXPStartElement, SXPCharacters, and SXPEndElement event handlers.
FormCreate writes a title line to the memo and fires up the parser. SXPStartElement checks for a <book> tag; when it finds one, it checks the genre and in_stock attributes (note that TSXPAttributeList.GetItem() is case insensitive). If genre and in_stock match the search condition, the state flag is set. SXPEndElement resets the flag when it finds a </book> tag, and SXPCharacters writes the name and text of each element to the memo while the state flag is set. It doesn't get any easier than that. Figure 2 shows the filter demo.
Install the TSAXParser component (found in MSSAXParser.pas) before you run the demo application. Installing TSAXParser also installs the DOM components in MSXML2_TLB.pas. MSSAXParser.dcr and MSXML2_TLB.dcr contain icons for some of these components. TSAXParser shows up as the SXP icon in the XML tab on the component palette.
Using Delphi to implement the C++ COM interfaces of the Microsoft SAX2 parser lets you achieve maximum speed when parsing XML documents. The TSAXParser component presented here shields you from those COM interfaces, and makes using the MS SAX2 parser a trivial exercise. The work that TSAXParser does behind the scenes on your behalf comes at a price: You lose some of the flexibility you get when using the COM interfaces directly, and there is a small speed penalty (but TSAXParser still beats an equivalent VB application). The ease of use makes such a component an attractive proposition. And you can always fall back on the bare COM interfaces when speed is important.
DDJ
Listing One
// *********************************************************************//
// Interface: ISAXContentHandler
// Flags:     (16) Hidden
// GUID:      {1545CDFA-9E4E-4497-A8A4-2BF7D0112C44}
// *********************************************************************//
  ISAXContentHandler = interface(IUnknown)
    ['{1545CDFA-9E4E-4497-A8A4-2BF7D0112C44}']
    function putDocumentLocator(
      const pLocator: ISAXLocator): HResult; stdcall;
    function startDocument: HResult; stdcall;
    function endDocument: HResult; stdcall;
    function startPrefixMapping(
      const pwchPrefix: pWideChar; cchPrefix: SYSINT;
      const pwchUri: pWideChar; cchUri: SYSINT): HResult; stdcall;
    function endPrefixMapping(
      const pwchPrefix: pWideChar; cchPrefix: SYSINT): HResult; stdcall;
    function startElement(
      const pwchNamespaceUri: pWideChar; cchNamespaceUri: SYSINT;
      const pwchLocalName: pWideChar; cchLocalName: SYSINT;
      const pwchQName: pWideChar; cchQName: SYSINT;
      const pAttributes: ISAXAttributes): HResult; stdcall;
    function endElement(
      const pwchNamespaceUri: pWideChar; cchNamespaceUri: SYSINT;
      const pwchLocalName: pWideChar; cchLocalName: SYSINT;
      const pwchQName: pWideChar; cchQName: SYSINT): HResult; stdcall;
    function characters(
      const pwchChars: pWideChar; cchChars: SYSINT): HResult; stdcall;
    function ignorableWhitespace(
      const pwchChars: pWideChar; cchChars: SYSINT): HResult; stdcall;
    function processingInstruction(
      const pwchTarget: pWideChar; cchTarget: SYSINT;
      const pwchData: pWideChar; cchData: SYSINT): HResult; stdcall;
    function skippedEntity(
      const pwchName: pWideChar; cchName: SYSINT): HResult; stdcall;
  end;
Listing Two
// ISAXContentHandler.StartElement callback function
function TSAXContenthandler.startElement(
  const pwchNamespaceUri: pWideChar; cchNamespaceUri: SYSINT;
  const pwchLocalName: pWideChar; cchLocalName: SYSINT;
  const pwchQName: pWideChar; cchQName: SYSINT;
  const pAttributes: ISAXAttributes): HResult; stdcall;
var
  strNamespaceURI: string;
  strLocalName: string;
  strQName: string;
  Attribute: TSXPAttribute;
  i: integer;
  nAttributes : integer;
  pURI, pLocalName, pQname, pValue, pType: pWideChar;
  URIsize, Localsize, Qsize, Typesize, size: integer;
begin
  if Assigned(FOnStartElement) then
  begin
    GetLineColumn;
    // convert element name and URI to string
    try
      if pwchNamespaceUri <> Nil then
        strNamespaceURI := WideCharLenToString(pwchNamespaceUri, cchNamespaceUri)
      else
        strNamespaceURI := '';
      if pwchLocalName <> Nil then
        strLocalName := WideCharLenToString(pwchLocalName, cchLocalName)
      else
        strLocalName := '';
      if pwchQName <> Nil then
        strQName := WideCharLenToString(pwchQName, cchQName)
      else
        strQName := '';
      // build the attribute list
      try
        if pAttributes <> Nil then
        begin
          with pAttributes do
          begin
            getLength(nAttributes);
            FAttributeList.Capacity := nAttributes;
            for i := 0 to nAttributes - 1 do
            begin
              getName(i, pURI, URIsize, pLocalName, Localsize, pQName, Qsize);
              getValue(i, pValue, size);
              getType(i, pType, Typesize);
              Attribute := TSXPAttribute.Create;
              with Attribute do
              begin
                URI := WideCharLenToString(pURI, URIsize);
                LocalName := WideCharLenToString(pLocalName, Localsize);
                QName := WideCharLenToString(pQName, Qsize);
                Value := WideCharLenToString(pValue, size);
                AttType := WideCharLenToString(pType, Typesize);
              end;
              FAttributeList.Add(Attribute);
            end; // for
          end; // with
        end; // if pAttributes <> Nil
        // now call the application event handler
        FOnStartElement(strNamespaceUri, strLocalName, strQName, FAttributeList);
      finally
        // clean up the attributes list
        for i := 0 to FAttributeList.Count - 1 do
        begin
          TSXPAttribute(FAttributeList.Items[i]).Free;
        end;
        FAttributeList.Clear;
      end;
      Result := S_OK;
    except
      Result := E_Fail;
    end;
  end
  else
  begin
    Result := S_OK;
  end;
end;
Listing Three
unit filterform;

interface

uses
  Windows, Messages, SysUtils, Classes, Graphics, Controls, Forms, Dialogs,
  StdCtrls, MSSAXParser;

type
  TfrmFilter = class(TForm)
    SXP: TSAXParser;
    Memo1: TMemo;
    procedure SXPStartElement(const NamespaceURI, Localname, QName: String;
      const Attributes: TSXPAttributeList);
    procedure SXPEndElement(const NamespaceURI, Localname, QName: String);
    procedure SXPCharacters(const Chars: String);
    procedure FormCreate(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
    bWanted: boolean;
    strElement: string;
  end;

var
  frmFilter: TfrmFilter;

implementation

{$R *.DFM}

procedure TfrmFilter.SXPStartElement(const NamespaceURI, Localname, QName: String;
  const Attributes: TSXPAttributeList);
var
  attr: TSXPAttribute;
begin
  if bWanted then
  begin
    strElement := LocalName;
    exit;
  end;
  if CompareText(Localname, 'book') <> 0 then exit;
  if not Assigned (Attributes) then exit;
  attr := Attributes.GetItem('genre');
  if (attr = Nil) or (CompareText(attr.Value, 'fiction') <> 0) then exit;
  attr := Attributes.GetItem('in_stock');
  if (attr = Nil) or (CompareText(attr.Value, 'yes') <> 0) then exit;
  bWanted := true;
end;

procedure TfrmFilter.SXPEndElement(const NamespaceURI, Localname, QName: String);
begin
  if CompareText(LocalName, 'book') = 0 then
  begin
    if bWanted then
      Memo1.Lines.Add('--------------------------------------------');
    bWanted := false;
  end;
end;

procedure TfrmFilter.SXPCharacters(const Chars: String);
begin
  if bWanted then
    if Trim(Chars) <> '' then
      Memo1.Lines.Add(Format('%-20s %s', [strElement, Chars]));
end;

procedure TfrmFilter.FormCreate(Sender: TObject);
begin
  Memo1.Lines.Add('Fiction books currently in stock :');
  Memo1.Lines.Add('==================================');
  SXP.Parse;
end;

end.
Example 1
(a)
HRESULT _stdcall characters(
    [in] unsigned short* pwchChars,
    [in] int cchChars);
(b)
HRESULT characters(
    [in] const wchar_t * pwchChars,
    [in] int cchChars);
(c)
function characters(
  var pwchChars: Word; // * WRONG * //
  cchChars: SYSINT): HResult; stdcall;
(d)
function characters(
  const pwchChars: pWideChar;
  cchChars: SYSINT): HResult; stdcall;
Example 2
(a)
type
  TSAXContentHandler = class(TInterfacedObject, ISAXContentHandler)
  public
    function putDocumentLocator(const pLocator: ISAXLocator): HResult; stdcall;
    function startDocument: HResult; stdcall;
    ... other methods ...
  end;
(b)
var
  ContentHandler: TSAXContentHandler;
begin
  Contenthandler := TSAXContentHandler.Create;
Example 3
constructor TSAXParser.Create(AOwner: TComponent);
begin
  inherited Create(AOwner);
  FLine := 0;
  FColumn := 0;
  FAttributeList := TSXPAttributeList.Create;
  // create an instance of the Reader COM object
  FReader := CreateComObject(CLASS_SAXXMLReader) as ISAXXMLReader;
  // instantiate the handler classes
  FErrorHandler := TSAXErrorHandler.Create;
  FLexicalHandler := TSAXLexicalHandler.Create;
  FContenthandler := TSAXContentHandler.Create;
  FDTDHandler := TSAXDTDHandler.Create;
  FDeclHandler := TSAXDeclhandler.Create;
  FContentHandler.FDocumentLocator := Nil;
  // Handlers need a reference pointing back to us
  // because they will continually update
  // our Line and Column properties on each event
  FLexicalhandler.SAXParser := Self;
  FContenthandler.SAXParser := Self;
  FDTDHandler.SAXParser := Self;
  FDeclHandler.SAXParser := Self;
  // pass the handler implementations to the reader
  FReader.putErrorHandler(FErrorhandler);
  FReader.putContentHandler(FContentHandler);
  FReader.putDTDHandler(FDTDHandler);
  Freader.putProperty('http://xml.org/sax/properties/lexical-handler',
    FLexicalhandler as ISAXLexicalHandler);
  Freader.putProperty('http://xml.org/sax/properties/declaration-handler',
    FDeclhandler as ISAXdeclHandler);
end;
Example 4
// OnComment is handled by the SAXLexicalHandler
procedure TSAXParser.SetOnComment(const Value: TOnComment);
begin
  FLexicalHandler.OnComment := Value;
end;

function TSAXParser.GetOnComment: TOnComment;
begin
  Result := FLexicalHandler.OnComment;
end;

// OnStartDocument is handled by the SAXContentHandler
procedure TSAXParser.SetOnStartDocument(const Value: TOnStartDocument);
begin
  FContentHandler.OnStartDocument := Value;
end;

function TSAXParser.GetOnStartDocument: TOnStartDocument;
begin
  Result := FContentHandler.OnStartDocument;
end;
An Expat TSAXParser Implementation
The TSAXParser component comes in two flavors: the one implemented on top of MSXML, and another with the same name, properties, events, methods, and behavior, but implemented on James Clark's Expat library (http://www.jclark.com/bio.htm).
To use the Expat-based component on Windows, you have to download the WIN32 binary of expat.dll at http://sourceforge.net/projects/expat/ (you can also download the source there and build the DLL yourself with the MSVC6 compiler).
Before you can use the Expat DLL in Delphi, you have to translate the C header file expat.h to Pascal. I first ran Bob Swart's Headconv 4.0 on expat.h to make a first cut of expat.pas, and then spent some hours hand-editing the result (correcting the translation errors made by Headconv, and reformatting the code to make it more readable).
I then reimplemented the TSAXParser component using the C functions and callback routines exposed by the Expat library. This was straightforward, with a couple of exceptions.
The "Element Declaration Handler" implementation proved interesting, because here I had to actually free memory in Delphi that had been previously allocated by Expat. Luckily, James Clark provided for this by letting you specify your own memory allocator to be used by the parser. You do this by creating a new instance of the parser with the XML_ParserCreate_MM() function, which has an argument that is a structure containing pointers to memory allocation functions that implement equivalents of malloc(), free(), and realloc(), in my case using the Delphi memory allocator functions GetMem(), FreeMem(), and ReallocMem().
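The allocator hookup might look like the sketch below. It is not taken from the component's source: the record type name and the three Delphi function names are hypothetical, the field order follows Expat's XML_Memory_Handling_Suite structure (malloc, realloc, free), and the use of cdecl and Cardinal for size_t assumes a 32-bit Windows build of expat.dll. The program only exercises the three functions directly, so it runs without the DLL.

```pascal
program MemSuiteSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  // Assumed Pascal translation of Expat's XML_Memory_Handling_Suite
  TXMLMemoryHandlingSuite = record
    malloc_fcn: function(Size: Cardinal): Pointer; cdecl;
    realloc_fcn: function(P: Pointer; Size: Cardinal): Pointer; cdecl;
    free_fcn: procedure(P: Pointer); cdecl;
  end;

// malloc() equivalent on top of the Delphi memory manager
function DelphiMalloc(Size: Cardinal): Pointer; cdecl;
begin
  GetMem(Result, Size);
end;

// realloc() equivalent; ReallocMem also accepts a nil pointer
function DelphiRealloc(P: Pointer; Size: Cardinal): Pointer; cdecl;
begin
  ReallocMem(P, Size);
  Result := P;
end;

// free() equivalent; like free(), it tolerates nil
procedure DelphiFree(P: Pointer); cdecl;
begin
  if P <> nil then
    FreeMem(P);
end;

var
  Suite: TXMLMemoryHandlingSuite;
  Block: Pointer;
begin
  Suite.malloc_fcn := DelphiMalloc;
  Suite.realloc_fcn := DelphiRealloc;
  Suite.free_fcn := DelphiFree;

  // With expat.dll available, @Suite would be passed as the second
  // argument of XML_ParserCreate_MM(); here the functions are only
  // exercised directly.
  Block := Suite.malloc_fcn(16);
  PAnsiChar(Block)^ := 'x';
  Block := Suite.realloc_fcn(Block, 64);   // contents are preserved
  Writeln(PAnsiChar(Block)^ = 'x');        // TRUE
  Suite.free_fcn(Block);
end.
```

With both sides using the same memory manager, a block that Expat allocated through malloc_fcn can safely be released from Delphi code with FreeMem().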
Also, Expat is a SAX1 parser (with extensions in the current version), not a SAX2 parser. To remain compatible with the MSXML version, I implemented some MSXML behavior in TSAXParser, like the way namespaces and namespace prefixes are handled, and the reporting of attribute types. I decided to keep it simple, and just maintain a couple of lists built by the element declaration and attribute declaration handlers. Later on in the parsing process, the element handlers can look up this data to pass it on to the application as needed.
As I do not (yet) have a copy of Borland's Kylix (described as "Delphi for Linux"), I could not test the component on Linux, but it should run virtually unmodified with Kylix (the references to expat.dll will have to be changed to expat.so, I guess).
Expat not only proved superior in performance to the MSXML parser, it also parsed without a hitch a couple of valid XML documents that caused MSXML to throw an OLE exception.
There is only one SAX2 event that I have not yet implemented in the Expat version: the "Unparsed Entity Handler." Complete source code for the Expat component and the same example applications that come with the MSXML version are available electronically; see "Resource Center," page 5.
Finally, I could also have used the C++ Xerces XML parser (by the Apache XML group, http://xml.Apache.org/), which features COM interfaces to make it compatible with MSXML on Windows, but I doubt that these would be usable in Kylix. And linking C++ code with Object Pascal is complicated by C++ name mangling, so I preferred to use Expat.
D.H.