XML Programming in Python

XML brings to the document world what the database world has had for a long time -- interoperability via open systems. Sean shows how you can use Python as a development platform for XML programming.

February 01, 1998
URL:http://www.drdobbs.com/web-development/xml-programming-in-python/184410490

Dr. Dobb's Journal February 1998: XML Programming in Python

A powerful cocktail of information description, representation, and processing power

Sean, chief technical officer and cofounder of Digitome Electronic Publishing (http://www.digitome.com/), is a member of the World Wide Web Consortium's XML Special Interest Group and the Python Software Activity (PSA). He is the author of ParseMe.1st: SGML for Software Developers (Prentice Hall, 1997). Sean can be reached at [email protected].

Sidebar: XML and Python Initiatives

XML, short for "eXtensible Markup Language," is a data-description language developed under the auspices of the World Wide Web Consortium. Simply put, XML provides a standard way of describing and capturing the structure and content of information. Everything from flat "name, address, and telephone number" structures to deeply hierarchical or recursive structures can be described and captured using XML. The XML specification is freely available (http://www.w3.org/TR/WD-xml). Also available is a rapidly expanding set of XML tools, ranging from parsers and editors to end-user applications. Many people see XML as the data-representation format that will underpin the next generation of web applications. Some go further, heralding it as the "mother of all data structures" -- the open systems format to end all open systems formats.

Python, on the other hand, is an object-oriented scripting language invented and maintained by Guido van Rossum. It provides a balanced mix of functional and imperative programming features -- the usual if/while/for control structures versus lists, map, and lambda functions, for instance. It has a clean syntax, refreshingly intuitive semantics, and few "gotchas." The source code for Python is freely available at http://www.python.org/ and there are few restrictions on its use, even in commercial applications.

This highly modular, highly portable language, with its rich set of existing libraries, is easily extended -- either in Python or by building Python extensions in C/C++. Python's feature mix, particularly its excellent support for object-oriented and hierarchical data structures, makes it well suited to processing XML-encoded information. The same applies to processing HTML in Python. Add to this the variety of Internet protocols (HTTP, FTP, and the like) Python supports, and you have an excellent Internet programming tool. In short, the combination of XML and Python is a powerful cocktail of information description, representation, and processing power.

XML from 10,000 Feet

XML is a data-description language. This in itself is nothing new. The world is full of data-description languages -- RTF, TeX, and HTML, among them. Yet XML is fundamentally different, particularly in terms of XML's emphasis on the description of information structure and content as distinct from information presentation. RTF, TeX, and HTML are concerned with how information should look, focusing on notions such as page, font, color, indentation, and the like. XML, on the other hand, is concerned with what the information is and how that information is logically structured.

The easiest way to contrast the two approaches is by example. Suppose you wanted to establish a web site to sell second-hand cars and publish price information. How would you tackle the problem? You could put the information together in HTML using something like Listing One. All HTML-based solutions to this sort of problem (be they handcrafted or auto-generated) suffer because useful information about the data is removed in the translation to HTML. HTML knows nothing about cars, and wouldn't recognize a red Toyota if it saw one. More importantly, neither would an HTML search engine!

The essence of the problem is that the process of creating rendered versions of car pricing information -- to HTML, RTF, or whatever -- is a lossy transformation. You no longer have access to the fact that the page contains information on a "car" that is for sale. You cannot unambiguously locate "red" in the context of a car color. You cannot say "car.color == red" to an Internet search engine and expect it to find red cars.

This "dumbing down" of information prior to publication can be avoided with XML. Imagine a world in which you used Listing Two instead of Listing One. Listing Two is a snippet of an XML document that contains elements -- Car, Condition, and so on -- specifically intended to ensure that both the structure and content of the information are retained.
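To make the payoff concrete, here is a minimal sketch (not one of the article's listings) of the "car.color == red" query from above, run against structured car data. It assumes a modern Python 3 interpreter and its standard xml.etree.ElementTree module. The sample document, the CarsForSale root element, and the Make, Price, and Color element names are hypothetical; only Car and Condition are named in the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical inventory. Only Car and Condition come from the article;
# the root element and the other element names are assumptions.
DOCUMENT = """
<CarsForSale>
  <Car>
    <Make>Toyota</Make>
    <Price>10000</Price>
    <Condition>Good</Condition>
    <Color>Red</Color>
  </Car>
  <Car>
    <Make>Ford</Make>
    <Price>7500</Price>
    <Condition>Fair</Condition>
    <Color>Blue</Color>
  </Car>
</CarsForSale>
"""


def find_cars_by_color(xml_text, color):
    """Return the Make of every Car whose Color element matches `color`."""
    root = ET.fromstring(xml_text.strip())
    matches = []
    for car in root.iter("Car"):
        # findtext returns None when the child element is missing.
        car_color = (car.findtext("Color") or "").strip()
        if car_color.lower() == color.lower():
            matches.append(car.findtext("Make"))
    return matches


red_cars = find_cars_by_color(DOCUMENT, "red")
print(red_cars)
```

Because "Red" is found inside a Color element that itself sits inside a Car element, the match is unambiguous in a way that a text search over the HTML version cannot be.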

So far, so good. You have retained information that will be of benefit in managing, processing, and searching this information. But how can you know if a Car element contains all the pieces of information you need? In XML, "grammars" can be defined to capture this sort of information. Such grammars are called Document Type Definitions (DTDs); see Listing Three (which is commented to explain what's going on).

Given an XML document containing/referencing a DTD, applications known as "validating XML parsers" check that the document meets the grammatical requirements spelled out by the DTD. The use of such grammars in XML is strictly optional. It is perfectly legal for a class of XML parser known as "nonvalidating XML parsers" to ignore any grammar specified in a DTD. Such parsers restrict their checking to matching start and end tags and other basic checks. Documents obeying these rules are known as well-formed XML documents. Making DTDs optional in XML maintains the powerful notion of validation with respect to a grammar, while simultaneously supporting a more lightweight parse suitable for, say, client-side implementation.
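The lightweight end of that spectrum can be sketched in a few lines. The example below assumes a modern Python 3 interpreter, whose standard xml.etree.ElementTree module is built on the nonvalidating Expat parser: it checks well-formedness only and performs no DTD validation. The helper name is_well_formed and the sample strings are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, else False.

    This is a well-formedness check only; no DTD validation takes place.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Start and end tags nest properly.
good = "<Car><Color>Red</Color></Car>"
# End tags arrive in the wrong order, so the document is not well formed.
bad = "<Car><Color>Red</Car></Color>"

print(is_well_formed(good), is_well_formed(bad))
```

A validating parser would go one step further and also reject a well-formed Car element that lacked, say, a required Condition child.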

SGML, HTML, and XML

At first glance, HTML and XML documents look quite similar. This is no accident, as they share a common ancestor -- SGML (short for "Standard Generalized Markup Language," ISO 8879).

For all their similarities, however, HTML and XML are fundamentally different in a way that is of great importance to software developers. HTML is a particular set of element types (H1, IMG, TABLE, and the like) chosen by the designers of HTML to be simple to understand and easy to use for information presentation via browsers. XML, however, has no element types. Instead, it lets you roll your own element types specifically for your data and your particular application. XML users can literally make them up as they go along. Moreover, by capturing details about how these element types inter-relate in the form of a DTD, a validating XML parser can validate documents against arbitrarily strict measures of validity.

HTML is a particular tag language -- the one that gave the world the Web. In contrast, XML is a metalanguage -- a language for creating tag languages. These languages can be as presentation oriented or as information-content oriented as you care to make them. You can create HTML-like languages to build presentation applications. You can create SHCML (Second-Hand Car Markup Language), DDJAML (Dr. Dobb's Journal Article Markup Language), and so on.

A language like C, for instance, has keywords (if, while, and so on) and rules governing how they can be combined to form valid sentences known as C programs. The rules are partially captured in the grammar of the language. Such grammars can be mechanically processed into parsers with tools such as YACC. A validating XML parser is a bit like a YACC tool that, instead of generating parser source code from a grammar (DTD), actually executes the generated parser on the fly.

So how does XML relate to its parent SGML? It is a simplified subset of it. All XML documents are SGML documents -- they are simply limited in the features of SGML they can use. The reduced feature set is specifically aimed at maintaining the inherent power of SGML as a metalanguage while simultaneously making SGML "light" enough for Web use. To use a phrase popular in the XML community: "XML is SGML--, not HTML++." Common SGML DTDs include HTML, DocBook (technical documentation), and Edgar (company filings). Emerging DTDs in the XML world include CDF (push technologies), OFE (financial transactions), and OSD (software distribution).

Python from 10,000 Feet

Like all powerful programming languages, Python is difficult to describe in a nutshell. Here are a few key features (in no particular order).

Object oriented. Python supports all the usual OO stuff you expect in any modern scripting language; see Listing Four.
Dynamic. Python variables are dynamically typed. A variable can be a string one minute, and a list of associative arrays the next. More unusually, instance variables can be attached to objects dynamically. In Listing Four, object f1 obtains a baz instance variable when the bar1 method is called, not by virtue of having foo as a superclass. This is illustrated in Listing Five, where the built-in variable __dict__ is a dictionary (associative array) of all the instance variables with their values.
Powerful intrinsic types and operations. Python has a rich set of built-in types including strings, arbitrary-precision integers, lists, and dictionaries. It also has powerful "slicing" operators for constructing and deconstructing variables such as strings, lists, and the like; see Listing Six.
Functional programming. Python's list support is enhanced with support for some common "lispy" functional programming features. Anonymous functions can be used to iterate over lists to achieve various effects; see Listing Seven.
Extensions/libraries. Python is blessed with a vast array of extension modules/libraries, some implemented in Python, others as C extension modules. These include regular expressions, FTP protocol implementation, CGI interfaces, and numerical libraries, to name a few. A Python profiler and debugger (both written in Python) are part of the standard distribution. Python also supports a range of GUIs such as Tk and Win32 via MFC.
Transparent. Python exposes a large amount of its own "behind the scenes" implementation in the form of methods/variables with reserved names. Overriding default behavior is a matter of specifying implementations for these reserved names; see Listing Eight.
WYSIWYG. In Python, the block structure of a program is determined by indentation level. There are no Begin/End blocks and no dangling-else problems. As Listing Nine illustrates, there is no ambiguity about what the code means.
Garbage collection. Python implements reference-counting garbage collection. Objects are automatically destroyed when their reference count shrinks to zero. This works transparently most of the time. The only time you need to be aware of it is when creating structures with circular references. Enabling Python to garbage collect such structures involves breaking the circular links "by hand."
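Several of the features above can be sketched in a few lines. The following is an illustrative sketch only, not a reproduction of the article's listings, and it assumes a modern Python 3 interpreter rather than the 1998 one (for example, map now returns an iterator, hence the list call). The names foo, f1, bar1, and baz follow the description above; the Inventory class and all sample values are hypothetical.

```python
# Dynamic instance variables: baz exists only after bar1 has been called.
class foo:
    def bar1(self):
        self.baz = 42  # attached to the object at call time


f1 = foo()
before = dict(f1.__dict__)  # no instance variables yet
f1.bar1()
after = dict(f1.__dict__)   # now holds baz

# Slicing: take strings (or lists) apart by position.
title = "Second-Hand Car"
first_word = title[:6]   # the first six characters
last_word = title[-3:]   # the last three characters

# Functional style: apply an anonymous function to every list item.
prices = [10000, 7500, 12000]
discounted = list(map(lambda p: p * 9 // 10, prices))  # 10 percent off


# Transparency: implementing the reserved name __len__ makes len() work.
class Inventory:
    def __init__(self, cars):
        self.cars = cars

    def __len__(self):
        return len(self.cars)


print(before, after, first_word, last_word, discounted, len(Inventory(["a", "b"])))
```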

XML Processing in Python

The first step in processing XML with any programming language is to parse it and generate an in-memory representation of the tree structure it describes. A variety of XML parsers have been developed in a variety of languages, including C, C++, Java, Perl, Python, and Tcl. Given that XML documents are also SGML documents, SGML parsers can also be used. Here, I'll use the freely available NSGMLS by James Clark (http://www.jclark.com/).

Listing Ten is a complete XML document for the CarsForSale application. Using NSGMLS to parse this XML document produces the output in Listing Eleven. Each line of output can be considered an event communicated to the application by the XML parser. "(" denotes the opening of an element, "-" denotes data content, "A" denotes an attribute, "e" denotes an EMPTY element, and so on.
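As a rough sketch of how an application might consume such an event stream, the Python 3 code below turns a handful of event lines into nested dictionaries. It handles only the "(", ")", "-", and "A" events and makes simplifying assumptions: attribute lines take the form "Aname type value" and precede the start of the element they belong to, and escape sequences in data lines are not decoded. The sample events are hand-written for illustration and are not the contents of Listing Eleven; the function name parse_events is hypothetical.

```python
# Hand-written sample events in a simplified form of the parser output.
SAMPLE_EVENTS = """\
(CARSFORSALE
ACOLOR CDATA Red
(CAR
(MAKE
-Toyota
)MAKE
(PRICE
-10000
)PRICE
)CAR
)CARSFORSALE
"""


def parse_events(text):
    """Build a tree of dicts from a simplified parser event stream."""
    root = None
    stack = []          # currently open elements, innermost last
    pending_attrs = {}  # attributes seen since the last start event
    for line in text.splitlines():
        if not line:
            continue
        code, rest = line[0], line[1:]
        if code == "A":
            # Assumed form: "name type value"; the value may be absent.
            parts = rest.split(" ", 2)
            pending_attrs[parts[0]] = parts[2] if len(parts) == 3 else None
        elif code == "(":
            node = {"name": rest, "attrs": pending_attrs, "children": []}
            pending_attrs = {}
            if stack:
                stack[-1]["children"].append(node)
            else:
                root = node
            stack.append(node)
        elif code == "-":
            stack[-1]["children"].append(rest)  # data content, kept raw
        elif code == ")":
            stack.pop()
        # Any other event code is ignored in this sketch.
    return root


tree = parse_events(SAMPLE_EVENTS)
print(tree["name"], tree["children"][0]["attrs"])
```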

As Figure 1 illustrates, the data can be visualized as a tree structure in which each node has pointers to its surrounding parent, sibling(s), and first child. Listing Twelve is a simple Python class hierarchy that can capture the basic XML concepts of element, attribute, and data content information.

Listing Thirteen illustrates how a single Car element can be translated into an XMLTree-based representation. With a slightly extended set of methods, this mechanism can be used to read the output of parsers such as NSGMLS.
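For readers without the listings to hand, the sketch below shows one way such a node structure might look in modern Python 3. It is an assumption-laden illustration, not the article's XMLTree code: the class names XMLNode, Element, and Data, the append and children methods, and the sample Car content are all hypothetical. Each node keeps a pointer to its parent, its first child, and its next sibling; a previous-sibling pointer is omitted for brevity.

```python
class XMLNode:
    """Base node holding parent, first-child, and next-sibling pointers."""

    def __init__(self):
        self.parent = None
        self.first_child = None
        self.next_sibling = None

    def append(self, child):
        """Attach child as the last child of this node and return it."""
        child.parent = self
        if self.first_child is None:
            self.first_child = child
        else:
            last = self.first_child
            while last.next_sibling is not None:
                last = last.next_sibling
            last.next_sibling = child
        return child

    def children(self):
        """Yield the direct children by following the sibling chain."""
        node = self.first_child
        while node is not None:
            yield node
            node = node.next_sibling


class Element(XMLNode):
    """An element with a name and a dictionary of attributes."""

    def __init__(self, name, **attributes):
        super().__init__()
        self.name = name
        self.attributes = attributes


class Data(XMLNode):
    """A run of data content."""

    def __init__(self, text):
        super().__init__()
        self.text = text


# Build a single Car element by hand.
car = Element("Car")
car.append(Element("Make")).append(Data("Toyota"))
car.append(Element("Color")).append(Data("Red"))

names = [child.name for child in car.children()]
print(names)
```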

Serializing to XML

In Python, any class that implements the __repr__ method provides Python with a way of retrieving a string representation of the objects. This method is invoked when backquotes are used around an expression as illustrated in XMLTree; see Listing Fourteen. Also, note Python's powerful string interpolation features. The syntax "<any string>" % (list...) can be used to do printf-style formatting anywhere a string is required.

Having built the single Car tree in the variable x as shown previously, the single command

print x

produces the output in Listing Fifteen (indented for clarity).

The invocation of the __repr__ method at the XMLTree level results in a recursive walk of the entire tree structure assembling the final printable version of the tree, which is itself well-formed XML.
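The idea can be sketched as follows, assuming a modern Python 3 interpreter, where the backquote syntax no longer exists and the built-in repr() does the same job. The Element and Data classes here are hypothetical stand-ins rather than the article's XMLTree code. Escaping of special characters is delegated to the standard xml.sax.saxutils helpers.

```python
from xml.sax.saxutils import escape, quoteattr


class Data:
    """A run of data content."""

    def __init__(self, text):
        self.text = text

    def __repr__(self):
        return escape(self.text)  # turns &, <, and > into entities


class Element:
    """An element whose __repr__ serializes the whole subtree."""

    def __init__(self, name, children=(), **attributes):
        self.name = name
        self.children = list(children)
        self.attributes = attributes

    def __repr__(self):
        # printf-style interpolation with the % operator.
        attrs = "".join(
            " %s=%s" % (key, quoteattr(value))
            for key, value in self.attributes.items()
        )
        # repr() on each child recurses down the tree.
        body = "".join(repr(child) for child in self.children)
        return "<%s%s>%s</%s>" % (self.name, attrs, body, self.name)


x = Element("Car", [
    Element("Make", [Data("Toyota")]),
    Element("Color", [Data("Red")]),
])
print(x)
```

Since print falls back to __repr__ when a class defines no __str__ method, printing the root is enough to serialize the entire tree.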

Tree Walking without Recursion

With the sort of tree structures that naturally result from processing XML, recursive tree walking is a common and natural technique. However, for the occasion when a linear traversal is appropriate, we can take advantage of Python's transparency of implementation. In Python, a for loop makes repeated calls to the __getitem__ method of the object being iterated. By implementing __getitem__ in XMLTree, you can write tree traversals like Listing Sixteen. The code to implement this is included in Listing Twelve.
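A minimal sketch of the technique, assuming a modern Python 3 interpreter, which still falls back to calling __getitem__ with 0, 1, 2, and so on when a class defines no __iter__ method. This is not the Listing Twelve implementation: the constructor, the explicit-stack flattening, and the cached node list are assumptions made for illustration, and the cache would go stale if the tree were modified after the first traversal.

```python
class XMLTree:
    """A tree node that a plain for loop can walk in document order."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self._flat = None  # cached document-order list of nodes

    def _flatten(self):
        """Return all nodes in document order using an explicit stack."""
        order = []
        stack = [self]
        while stack:
            node = stack.pop()
            order.append(node)
            # Push children reversed so the leftmost child is popped first.
            stack.extend(reversed(node.children))
        return order

    def __getitem__(self, index):
        if self._flat is None:
            self._flat = self._flatten()
        # The IndexError raised past the last node ends the for loop.
        return self._flat[index]


tree = XMLTree("Car", [
    XMLTree("Make"),
    XMLTree("Price", [XMLTree("Currency")]),
    XMLTree("Color"),
])

visited = [node.name for node in tree]
print(visited)
```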

Conclusion

Many see XML as a key technology in the next wave of web-application development. The burgeoning family of XML-based languages such as CDF, OFE, and the like (see the accompanying text box entitled "XML and Python Initiatives"), combined with XML's integration into browsers such as Microsoft's Internet Explorer 4.0, points to a healthy and exciting future for XML.

XML brings to the document world what the database world has had for a long time -- interoperability via open systems. It also brings the ideas of data modeling, lossless interchange, and application independence forcefully into the document world. Thanks to the expressive power of DTDs, XML breaks down the barriers between documents and databases. In XML, traditional databases are simply documents with simple DTDs. XML is part grand unifying theory and part pragmatic solution to real-world problems. As you begin to use XML, you will find yourself less and less inclined to design data formats or hand craft lexers/parsers. Why bother when you can use XML?

As for Python, it is a pleasant language for XML processing. Its features are well matched to both the XML architecture and its world view. Open, small, elegant, pragmatic, powerful -- and freely available to all.

References: Python

Lutz, Mark. Programming Python (Nutshell Handbook) (O'Reilly & Associates, 1997).

Watters, Aaron, Guido van Rossum, and James C. Ahlstrom. Internet Programming with Python (M&T Books, 1996).

Python Language Home Page. http://www.python.org/.

Starship Python. http://www.starship.skyport.net/.

References: XML

Extensible Markup Language (XML), W3C Working Draft. http://www.w3.org/TR/WD-xml.

A Proposal for XSL. http://www.w3.org/TR/note-XSL.

Extensible Markup Language (XML). Robin Cover, Summer Institute of Linguistics. http://www.sil.org/sgml/xml.html

SiteBuilder Network Specs and Standards: XML Parser. http://www.microsoft.com/standards/xml/.

The XML FAQ. Peter Flynn, Silmaril Consultants. http://www.ucc.ie/xml.

DDJ

Listing One

<!-- A snippet of an HTML document containing "car for sale" information -->
<h1>Toyota</h1>
<ul>
<li>Price: 10000 Dollars</li>
<li>Condition: Good</li>
<li>Color: Red</li>
</ul>