XML Programming in Python

XML brings to the document world what the database world has had for a long time -- interoperability via open systems. Sean shows how you can use Python as a development platform for XML programming.


February 01, 1998
URL:http://www.drdobbs.com/web-development/xml-programming-in-python/184410490

Dr. Dobb's Journal February 1998: XML Programming in Python

A powerful cocktail of information description, representation, and processing power

Sean, chief technical officer and cofounder of Digitome Electronic Publishing (http://www.digitome.com/), is a member of the World Wide Web Consortium's XML Special Interest Group and the Python Software Activity (PSA). He is the author of ParseMe.1st: SGML for Software Developers (Prentice Hall, 1997). Sean can be reached at [email protected].


Sidebar: XML and Python Initiatives

XML, short for "eXtensible Markup Language," is a data-description language developed under the auspices of the World Wide Web Consortium. Simply put, XML provides a standard way of describing and capturing the structure and content of information. Everything from flat "name, address, and telephone number" structures to deeply hierarchical or recursive structures can be described and captured using XML. The XML specification is freely available (http://www.w3.org/TR/WD-xml). Also available is a rapidly expanding set of XML tools, ranging from parsers and editors to end-user applications. Many people see XML as the data-representation format that will underpin the next generation of web applications. Some go further, heralding it as the "mother of all data structures" -- the open systems format to end all open systems formats.

Python, on the other hand, is an object-oriented scripting language invented and maintained by Guido van Rossum. It provides a balanced mix of functional and imperative programming features -- the usual if/while/for control structures versus lists, map, and lambda functions, for instance. It has a clean syntax, refreshingly intuitive semantics, and few "gotchas." The source code for Python is freely available at http://www.python.org/ and there are few restrictions on its use, even in commercial applications.

This highly modular, highly portable language, with its rich set of existing libraries, is easily extended -- either in Python or by building Python extensions in C/C++. Python's feature mix, particularly its excellent support for object-oriented and hierarchical data structures, makes it well suited to processing XML-encoded information. This also applies to processing HTML in Python. Add to this the variety of Internet protocols (HTTP, FTP, and the like) Python supports, and you have an excellent Internet programming tool. In short, the combination of XML and Python is a powerful cocktail of information description, representation, and processing power.

XML from 10,000 Feet

XML is a data-description language. This in itself is nothing new. The world is full of data-description languages -- RTF, TeX, and HTML, among them. Yet XML is fundamentally different, particularly in terms of XML's emphasis on the description of information structure and content as distinct from information presentation. RTF, TeX, and HTML are concerned with how information should look, focusing on notions such as page, font, color, indentation, and the like. XML, on the other hand, is concerned with what the information is and how that information is logically structured.

The easiest way to contrast the two approaches is by example. Suppose you wanted to establish a web site to sell second-hand cars and publish price information. How would you tackle the problem? You could put the information together in HTML using something like Listing One. All HTML-based solutions to this sort of problem (be they handcrafted or auto-generated) suffer because useful information about the data is removed in the translation to HTML. HTML knows nothing about cars, and wouldn't recognize a red Toyota if it saw one. More importantly, neither would an HTML search engine!

The essence of the problem is that the process of creating rendered versions of car pricing information -- to HTML, RTF, or whatever -- is a lossy transformation. You no longer have access to the fact that the page contains information on a "car" that is for sale. You cannot unambiguously locate "red" in the context of a car color. You cannot say "car.color == red" to an Internet search engine and expect it to find red cars.

This "dumbing down" of information prior to publication can be avoided with XML. Imagine a world in which you used Listing Two instead of Listing One. Listing Two is a snippet of an XML document that contains elements -- Car, Condition, and so on -- specifically intended to ensure that both the structure and content of the information are retained.
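To make the contrast concrete, here is a minimal sketch of the "car.color == red" query run against markup modeled on Listing Two. It is written for current Python releases and uses xml.etree.ElementTree from the standard library, which postdates this article. The function name cars_with_color and the CARS_XML sample are illustrative assumptions, not part of the original listings.

```python
import xml.etree.ElementTree as ET

# Sample data modeled on Listing Two, with attribute names
# capitalized consistently (an assumption made for this sketch).
CARS_XML = """
<CarsForSale>
  <Car Price="10000" Units="Dollars">
    <Maker>Toyota</Maker>
    <Condition Type="Good"/>
    <Color>Red</Color>
  </Car>
  <Car Price="20000" Units="Irish Punts">
    <Maker>Ford</Maker>
    <Condition Type="Good"/>
    <Color>White</Color>
  </Car>
</CarsForSale>
"""

def cars_with_color(xml_text, color):
    """Return the Maker of every Car whose Color element equals color."""
    root = ET.fromstring(xml_text)
    makers = []
    for car in root.findall("Car"):
        if car.findtext("Color") == color:
            makers.append(car.findtext("Maker"))
    return makers

print(cars_with_color(CARS_XML, "Red"))
```

Because "Red" is located inside a Color element inside a Car element, the query cannot be confused by the word "red" appearing elsewhere on the page, which is exactly what the HTML version cannot guarantee.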

So far, so good. You have retained information that will be of benefit in managing, processing, and searching this information. But how can you know if a Car element contains all the pieces of information you need? In XML, "grammars" can be defined to capture this sort of information. Such grammars are called Document Type Definitions (DTDs); see Listing Three (which is commented to explain what's going on).

Given an XML document containing/referencing a DTD, applications known as "validating XML parsers" check that the document meets the grammatical requirements spelled out by the DTD. The use of such grammars in XML is strictly optional. It is perfectly legal for a class of XML parser known as "nonvalidating XML parsers" to ignore any grammar specified in a DTD. Such parsers restrict their checking to matching start and end tags and other basic checks. Documents obeying these rules are known as well-formed XML documents. Making DTDs optional in XML maintains the powerful notion of validation with respect to a grammar, while simultaneously supporting a more lightweight parse suitable for, say, client-side implementation.
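The well-formedness check performed by a nonvalidating parser can be sketched in a few lines. This example assumes a current Python release; xml.etree.ElementTree is built on the nonvalidating expat parser, so it checks tag matching and basic syntax but ignores any grammar. The helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

good = '<Car Price="10000"><Maker>Toyota</Maker><Condition Type="Good"/></Car>'
bad = '<Car Price="10000"><Maker>Toyota</Car></Maker>'   # start and end tags overlap

print(is_well_formed(good))
print(is_well_formed(bad))
```

Note that a document can pass this check and still violate its DTD, for instance by omitting a required element; catching that is the job of a validating parser.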

SGML, HTML, and XML

At first glance, HTML and XML documents look quite similar. This is no accident, as they share a common ancestor -- SGML (short for "Standard Generalized Markup Language," ISO 8879).

For all their similarities, however, HTML and XML are fundamentally different in a way that is of great importance to software developers. HTML is a particular set of element types (H1, IMG, TABLE, and the like) chosen by the designers of HTML to be simple to understand and easy to use for information presentation via browsers. XML, however, has no element types. Instead, it lets you roll your own element types specifically for your data and your particular application. XML users can literally make them up as they go along. Moreover, by capturing details about how these element types inter-relate in the form of a DTD, a validating XML parser can validate documents against arbitrarily strict measures of validity.

HTML is a particular tag language -- the one that gave the world the Web. In contrast, XML is a metalanguage -- a language for creating tag languages. These languages can be as presentation oriented or as information-content oriented as you care to make them. You can create HTML-like languages to build presentation applications. You can create SHCML (Second-Hand Car Markup Language), DDJAML (Dr. Dobb's Journal Article Markup Language), and so on.

A language like C, for instance, has keywords (if, while, and so on) and rules governing how they can be combined to form valid sentences known as C programs. The rules are partially captured in the grammar of the language. Such grammars can be mechanically processed into parsers with tools such as YACC. A validating XML parser is a bit like a YACC tool that, instead of generating parser source code from a grammar (DTD), actually executes the generated parser on the fly.
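The idea of executing a grammar on the fly can be illustrated by translating a single content model from Listing Three into a regular expression over child element names. This is a toy for the one model (Maker,Condition?,Color), written for a current Python release; the names CAR_MODEL and matches_car_model are hypothetical, and this is not a description of how any particular validating parser is implemented.

```python
import re
import xml.etree.ElementTree as ET

# Content model (Maker,Condition?,Color) expressed as a regex over
# space-terminated child element names.
CAR_MODEL = re.compile(r"Maker (Condition )?Color ")

def matches_car_model(car_element):
    """Check the sequence of child element names against CAR_MODEL."""
    names = "".join(child.tag + " " for child in car_element)
    return CAR_MODEL.fullmatch(names) is not None

ok = ET.fromstring("<Car><Maker>Toyota</Maker><Color>Red</Color></Car>")
wrong_order = ET.fromstring("<Car><Color>Red</Color><Maker>Toyota</Maker></Car>")

print(matches_car_model(ok))
print(matches_car_model(wrong_order))
```

The comma in the DTD becomes concatenation and the "?" carries over unchanged, which is why content models are often described as regular expressions over element names.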

So how does XML relate to its parent SGML? It is a simplified subset of it. All XML documents are SGML documents -- they are simply limited in the features of SGML they can use. The reduced feature set is specifically aimed at maintaining the inherent power of SGML as a metalanguage while simultaneously making SGML "light" enough for Web use. To use a phrase popular in the XML community, "XML is SGML--, not HTML++." Common SGML DTDs include HTML, DocBook (technical documentation), and Edgar (company filings). Emerging DTDs in the XML world include CDF (push technologies), OFE (financial transactions), and OSD (software distribution).

Python from 10,000 Feet

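Like all powerful programming languages, Python is difficult to describe in a nutshell. Here are a few key features (in no particular order), as illustrated by Listings Four through Nine:

Object orientation. Classes, methods, and derived classes that call their superclass's methods explicitly; see Listing Four.
Dynamic objects. Instance variables come into being on assignment and can be inspected at run time through __dict__; see Listing Five.
Built-in data structures. Strings, lists, and associative arrays support indexing, slicing, and nesting; see Listing Six.
Functional features. map, filter, and lambda operate directly on lists; see Listing Seven.
Special methods. A class can define a constructor, a destructor, and a string representation via __init__, __del__, and __repr__; see Listing Eight.
Block structure by indentation. Indentation alone determines which statements belong together; see Listing Nine.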

XML Processing in Python

The first step in processing XML with any programming language is to parse it and generate an in-memory representation of the tree structure it describes. A variety of XML parsers have been developed in a variety of languages, including C, C++, Java, Perl, Python, and Tcl. Given that XML documents are also SGML documents, SGML parsers can also be used. Here, I'll use the freely available NSGMLS by James Clark (http://www.jclark.com/).

Listing Ten is a complete XML document for the CarsForSale application. Using NSGMLS to parse this XML document produces the output in Listing Eleven. Each line of output can be considered an event communicated to the application by the XML parser. "(" denotes the opening of an element, "-" denotes data content, "A" denotes an attribute, "e" denotes an EMPTY element, and so on.
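Reading this line-oriented output is straightforward. The sketch below, written for a current Python release, turns lines like those in Listing Eleven into (kind, payload) events. It handles only the line types named above and passes anything else through unchanged; escape sequences inside data lines are not decoded. The function name parse_esis and the event names are assumptions made for illustration.

```python
def parse_esis(lines):
    """Turn NSGMLS-style output lines into (kind, payload) events."""
    events = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        code, rest = line[0], line[1:]
        if code == "(":
            events.append(("start", rest))
        elif code == ")":
            events.append(("end", rest))
        elif code == "-":
            events.append(("data", rest))
        elif code == "A":
            # Attribute lines look like: name TYPE value
            parts = rest.split(" ", 2)
            value = parts[2] if len(parts) == 3 else None
            events.append(("attribute", (parts[0], value)))
        else:
            # Line types not covered by this sketch
            events.append(("other", line))
    return events

sample = ["AType TOKEN Good", "(Condition", ")Condition",
          "(Color", "-Red", ")Color"]
for event in parse_esis(sample):
    print(event)
```

Note that attribute events arrive before the start event of the element they belong to, so a tree builder has to hold them until the next "(" line appears.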

As Figure 1 illustrates, the data can be visualized as a tree structure in which each node has pointers to its surrounding parent, sibling(s), and first child. Listing Twelve is a simple Python class hierarchy that can capture the basic XML concepts of element, attribute, and data content information.

Listing Thirteen illustrates how a single Car element can be translated into an XMLTree-based representation. With a slightly extended set of methods, this mechanism can be used to read the output of parsers such as NSGMLS.

Serializing to XML

In Python, any class that implements the __repr__ method provides Python with a way of retrieving a string representation of the objects. This method is invoked when backquotes are used around an expression as illustrated in XMLTree; see Listing Fourteen. Also, note Python's powerful string interpolation features. The syntax "<any string>" % (list...) can be used to do printf-style formatting anywhere a string is required.
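In current Python releases the backquote syntax no longer exists; the built-in repr() function does the same job, and print is a function. The following minimal sketch shows __repr__ and printf-style interpolation working together under those assumptions. The Attribute class is a hypothetical helper, not part of XMLTree.

```python
class Attribute:
    """A name/value pair that knows how to print itself in XML syntax."""
    def __init__(self, name, value):
        self.name = name
        self.value = value

    def __repr__(self):
        # printf-style interpolation: each %s is filled from the tuple
        return '%s = "%s"' % (self.name, self.value)

price = Attribute("Price", "10000")
print(repr(price))             # explicit call; replaces the backquote form
print("<Car %r/>" % (price,))  # %r also invokes __repr__
```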

Having built the single Car tree in the variable x as shown previously, the single command

print x

produces the output in Listing Fifteen (indented for clarity).

The invocation of the __repr__ method at the XMLTree level results in a recursive walk of the entire tree structure assembling the final printable version of the tree, which is itself well-formed XML.
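The shape of that recursion is easier to see in a stripped-down form. The sketch below, for a current Python release, keeps only the first-child and right-sibling links and omits attributes and EMPTY elements. The names Node and serialize are illustrative assumptions and differ from the classes in Listing Twelve.

```python
class Node:
    """Element node with first-child (bottom) and next-sibling (right) links."""
    def __init__(self, name, text=""):
        self.name = name
        self.text = text
        self.bottom = None   # first child
        self.right = None    # next sibling

def serialize(node):
    """Recursively serialize node, its children, then its right siblings."""
    if node is None:
        return ""
    inner = node.text + serialize(node.bottom)
    return "<%s>%s</%s>" % (node.name, inner, node.name) + serialize(node.right)

car = Node("Car")
maker = Node("Maker", "Toyota")
color = Node("Color", "Red")
car.bottom = maker      # Maker is the first child of Car
maker.right = color     # Color is the next sibling of Maker

print(serialize(car))
```

Each call emits one element and delegates its children and its siblings to further calls, so the whole tree is assembled from a single entry point.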

Tree Walking without Recursion

With the sort of tree structures that naturally result from processing XML, recursive tree walking is a common and natural technique. However, for the occasion when a linear traversal is appropriate, we can take advantage of Python's transparency of implementation. In Python, a for loop makes repeated calls to the __getitem__ method of the object being iterated. By implementing __getitem__ in XMLTree, you can write tree traversals like Listing Sixteen. The code to implement this is included in Listing Twelve.
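Current Python releases still fall back to this __getitem__ protocol when a class defines no __iter__ method: the loop passes 0, 1, 2, and so on until IndexError is raised. The sketch below applies the same trick to a tree of nested (name, children) tuples. As in XMLTree, the index is used only to detect a restart. The Walker class and the sample tree are assumptions made for illustration.

```python
class Walker:
    """Depth-first, left-to-right walk over nested (name, children) tuples."""
    def __init__(self, tree):
        self.tree = tree
        self.pending = []

    def __getitem__(self, index):
        # A for loop calls this with 0, 1, 2, ... until IndexError is raised.
        if index == 0:
            self.pending = [self.tree]      # restart at the root
        if not self.pending:
            raise IndexError(index)
        name, children = self.pending.pop()
        # Push children reversed so the leftmost child is visited first.
        self.pending.extend(reversed(children))
        return name

tree = ("CarsForSale",
        [("Car", [("Maker", []), ("Condition", []), ("Color", [])])])

for name in Walker(tree):
    print(name)
```

An explicit stack replaces the call stack here, which is what makes the traversal linear from the caller's point of view.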

Conclusion

Many see XML as a key technology in the next wave of web-application development. The burgeoning family of XML-based languages such as CDF, OFE, and the like (see the accompanying text box entitled "XML and Python Initiatives"), combined with its integration into browsers such as Microsoft's Internet Explorer 4.0, all point to a healthy and exciting future for XML.

XML brings to the document world what the database world has had for a long time -- interoperability via open systems. It also brings the ideas of data modeling, lossless interchange, and application independence forcefully into the document world. Thanks to the expressive power of DTDs, XML breaks down the barriers between documents and databases. In XML, traditional databases are simply documents with simple DTDs. XML is part grand unifying theory and part pragmatic solution to real-world problems. As you begin to use XML, you will find yourself less and less inclined to design data formats or hand craft lexers/parsers. Why bother when you can use XML?

As for Python, it is a pleasant language for XML processing. Its features are well matched to both the XML architecture and its world view. Open, small, elegant, pragmatic, powerful -- and freely available to all.

References: Python

Lutz, Mark. Programming Python (Nutshell Handbook), (O'Reilly & Associates, 1997).

Watters, Aaron, Guido van Rossum, and James C. Ahlstrom. Internet Programming with Python, (M&T Books, 1996).

Python Language Home Page. http://www.python.org/.

Starship Python. http://www.starship.skyport.net/.

References: XML

Extensible Markup Language (XML), W3C Working Draft. http://www.w3.org/TR/WD-xml.

A Proposal for XSL. http://www.w3.org/TR/note-XSL.

Extensible Markup Language (XML). Robin Cover, Summer Institute of Linguistics. http://www.sil.org/sgml/xml.html

SiteBuilder Network Specs and Standards: XML Parser. http://www.microsoft.com/standards/xml/.

The XML FAQ. Peter Flynn, Silmaril Consultants. http://www.ucc.ie/xml.

DDJ

Listing One

<!-- A snippet of an HTML document containing "car for sale" information -->
<h1>Toyota</h1>
<ul>
<li>Price:10000 Dollars
<li>Condition:Good
<li>Color:Red
</ul>


Listing Two

<!-- A snippet of an XML document containing "cars for sale" information -->
<CarsForSale>
  <Car Price = "10000" Units = "Dollars">
    <Maker>Toyota</Maker>
    <Condition Type = "Good"/>
    <Color>Red</Color>
  </Car>
  <Car Price = "20000" Units = "Irish Punts">
    <Maker>Ford</Maker>
    <Condition Type = "Good"/>
    <Color>White</Color>
  </Car>
</CarsForSale>


Listing Three

<!-- This is a snippet of an XML Document Type Definition (DTD) -->
<!-- Define an element type CarsForSale. Contains one or more Car elements -->
<!ELEMENT CarsForSale (Car)+>

<!-- A Car consists of a Maker element, an optional Condition element and a color element -->
<!ELEMENT Car (Maker,Condition?,Color)>

<!-- A Car has two associated attributes - price and units. They contain character data and both are required - i.e. a document must supply them for each Car element -->
<!ATTLIST Car
    Price CDATA #REQUIRED
    Units CDATA #REQUIRED>

<!-- Maker and Color elements consist of text -->
<!ELEMENT Maker (#PCDATA)>
<!ELEMENT Color (#PCDATA)>

<!-- Condition element does not have any content it is an EMPTY element -->
<!-- It has a "Type" attribute which can be one of Excellent, Good or Bad -->
<!ELEMENT Condition EMPTY>
<!ATTLIST Condition
    Type (Excellent|Good|Bad) #REQUIRED>


Listing Four

class foo:                  # Declare a class foo
    def bar(self):          # Declare a method bar
        self.baz = 1        # set the baz object variable to 1
f = foo()                   # Declare an instance of the class foo
class foo1(foo):            # Declare a class foo1 derived from foo
    def bar1(self):         # Declare a method bar1
        foo.bar(self)       # Call the bar method of superclass
        self.baz1 = 2       # Set the baz1 object variable to 2
f1 = foo1()                 # Declare an instance of the class foo1


Listing Five

>>>f1 = foo1()       # Declare f1
>>>print f1.__dict__ # print instance variables - Empty
{}
>>>f1.bar()         # Call the bar method - baz variable created
>>>print f1.__dict__
{'baz': 1}
>>>f1.bar1()     # Call bar1 method. Calls foo.bar thus baz1 variable created
>>>print f1.__dict__
{'baz': 1, 'baz1': 2}


Listing Six

>>>x = "Hello World"
>>>print x[-4]   # print 4th character from the end
o
>>>print x[2:4]  # print substring starting at offset 2 ending before offset 4
ll
>>>print x[:-1]  # All except last character
Hello Worl


>>>x = ["Hello World", 42, ['foo','bar']]
>>>len(x)
3
>>>x[:-1]        # Slicing works with lists too
['Hello World', 42]
>>>x = {"Hello":"World", "World":[1,2,[2.1,2.2]]}   # An associative array
>>>y = x["World"]
>>>print y
[1, 2, [2.1, 2.2]]
>>>y.reverse()   # Reverse list y in situ
>>>print y
[[2.1, 2.2], 2, 1]


Listing Seven

>>>import sys,types
>>>x = [1,2,3,4]        # x is a flat list of 4 numbers
>>>y = map (lambda e:e*e,x)     # y contains the squares of each element of x
>>>print y
[1,4,9,16]


>>>x = [1,"Hello",[2,3]]        # x is a more complex list
>>>y = filter (lambda e:type(e)==type(''),x)    # y is x, filtered to string
                                                # elements only
>>>print y
['Hello']


Listing Eight

C>type foo.py
class foo:
        "This is some documentation on foo"
        def __init__(self):
                print "foo constructor called"
                self.x = 1
        def __del__(self):
                # foo class destructor
                print "foo destructor called"
                pass                # Do nothing
        def __repr__(self):         # Return a string representation
                return "A foo object"


f = foo()       # Causes the foo constructor to be called
print f         # Causes the __repr__ method to be called
del f           # Causes the destructor to be called

C>python foo.py
foo constructor called
A foo object
foo destructor called


Listing Nine

if x == y:
    if y == z:
        print y
else:
    print x # Associated with outermost if by virtue of indentation


Listing Ten

<!DOCTYPE CarsForSale [
<!ELEMENT CarsForSale (Car)+>
<!ELEMENT Car (Maker,Condition?,Color)>
<!ATTLIST Car
    Price CDATA #REQUIRED
    Units CDATA "DOLLARS">
<!ELEMENT Maker (#PCDATA)>
<!ELEMENT Condition EMPTY>
<!ATTLIST Condition
    Type (Excellent|Good|Bad) #REQUIRED>
<!ELEMENT Color (#PCDATA)>
]>
<CarsForSale>
<Car Price = "10000" Units = "Dollars">
<Maker>Toyota</Maker>
<Condition Type = "Good"/>
<Color>Red</Color>
</Car>
</CarsForSale>


Listing Eleven

(CarsForSale
APrice CDATA 10000
AUnits CDATA Dollars
(Car
(Maker
-Toyota
)Maker
AType TOKEN Good
e
(Condition
)Condition
(Color
-Red
)Color
)Car
)CarsForSale
C


Listing Twelve

import sys

# Virtual base class for nodes in an XML tree (XMLTree)
class XMLNode:
    # Constructor
    def __init__(self):
        # Each node has four references to its surrounding nodes
        self.Top = self.Bottom = self.Left = self.Right = None

# An XMLElementNode represents an XML element in an XML Tree
class XMLElementNode(XMLNode):
    def __init__(self,gi,EmptyElement=0):
        # Call superclass constructor
        XMLNode.__init__(self)
        self.gi = gi                        # gi = Element (tag) name
        self.attributes = {}                # Empty associative array
        self.EmptyElement = EmptyElement    # Boolean. 1 for elements like "<foo/>"
    def AddAttribute (self,name,value):
        self.attributes [name] = value
    def __repr__(self):
        # Return string representation. Recursively walks children/siblings
        res = "<" + self.gi                 # Start of start-tag
        for (name,value) in self.attributes.items():    # Attributes
            res = res + ' %s = "%s"' % (name,value)
        if self.EmptyElement == 1:          # End of start-tag
            res = res + "/>"
        else:
            res = res + ">"
        if self.Bottom:
            res = res + `self.Bottom`       # traverse children
        if self.EmptyElement == 0:          # End-tag if required
            res = res + "</%s>" % self.gi
        if self.Right:                      # traverse right siblings
            res = res + `self.Right`
        return res

# An XMLDataNode represents data content in an XML Tree
class XMLDataNode(XMLNode):
    def __init__(self,datastr):
        XMLNode.__init__(self)
        self.datastr = datastr
    def __repr__(self):
        return self.datastr

# An XMLTree contains a root which is a reference to first node in tree
# It maintains a current position in tree in Position instance variable
class XMLTree:
    def __init__(self):
        # Tree starts out with a Dummy node
        self.root = XMLElementNode("?ROOT?")
        self.Position = self.root
    def __repr__(self):
        return `self.root.Bottom`
    # Add the specified node below current position
    def AddBelow(self,Node):
        self.Position.Bottom = Node
        Node.Top = self.Position
    # Add the specified node to the right of current position
    def AddRight(self,Node):
        self.Position.Right = Node
        Node.Top = self.Position.Top
        Node.Left = self.Position
    # Move current position up to parent node
    def MoveUp(self):
        self.Position = self.Position.Top
    def MoveRight(self):
        self.Position = self.Position.Right
    def MoveBelow(self):
        self.Position = self.Position.Bottom
    def MoveToRoot(self):
        self.Position = self.root
    # Predicate - return true if positioned at an XMLDataNode
    def AtData(self):
        if self.Position.__class__.__name__ == "XMLDataNode":
            return 1
        return None
    # Predicate - return true if positioned at an XMLElementNode
    # If ElementName specified, ensure positioned at that element type
    def AtElement(self,ElementName=None):
        if self.Position.__class__.__name__ != "XMLElementNode":
            return 0
        if ElementName == None:
            return 1
        return self.Position.gi == ElementName
    # Utility function to navigate to next position in Tree
    # Traversal is depth first, left to right
    def MoveNext(self):
        if self.Position.Bottom:
            self.MoveBelow()
            return 1
        while self.Position.Top:
            if self.Position.Right:
                self.MoveRight()
                return 1
            else:
                self.MoveUp()
        return 0
    # Return data content of current node
    def GetData(self):
        if self.Position.__class__.__name__ == "XMLDataNode":
            return self.Position.datastr
        sys.stderr.write ("GetData - Current Position is not a Data node")
        return None
    # Add an attribute to the current node
    def AddAttribute(self,name,value):
        self.Position.AddAttribute(name,value)
    # Override of Python's subscripting for XMLTree objects.
    # Allows use of for loop for "linear" iteration of the tree
    def __getitem__(self, key):
        if key == 0:
            self.MoveToRoot()
            return self
        else:
            if self.MoveNext():
                return self
            else:
                raise IndexError

if __name__ == "__main__":
    import string
    x = XMLTree()
    x.AddBelow (XMLElementNode("Car"))
    x.MoveBelow()
    x.AddAttribute("Price","10000")
    x.AddAttribute("Units","Dollars")
    x.AddBelow (XMLElementNode("Maker"))
    x.MoveBelow()
    x.AddBelow (XMLDataNode("Toyota"))
    x.AddRight (XMLElementNode("Condition",1))
    x.MoveRight()
    x.AddAttribute("Type","Good")
    x.AddRight (XMLElementNode("Color"))
    x.MoveRight()
    x.AddBelow (XMLDataNode("Red"))
    x.MoveToRoot()
    # Print the entire tree
    print x
    # Print only the cars worth > 500 Dollars
    for n in x:
        if n.AtElement("Car"):
            Price = string.atoi(n.Position.attributes["Price"])
            Units = n.Position.attributes["Units"]
            if Price > 500 and Units == "Dollars":
                print n.Position


Listing Thirteen

from XMLTree import *
x = XMLTree()
x.AddBelow (XMLElementNode("Car"))
x.MoveBelow()
x.AddAttribute("Price","10000")
x.AddAttribute("Units","Dollars")
x.AddBelow (XMLElementNode("Maker"))
x.MoveBelow()
x.AddBelow (XMLDataNode("Toyota"))
x.AddRight (XMLElementNode("Condition",1))
x.MoveRight()
x.AddAttribute("Type","Good")
x.AddRight (XMLElementNode("Color"))
x.MoveRight()
x.AddBelow (XMLDataNode("Red"))
x.MoveToRoot()


Listing Fourteen

# In class XMLTree
def __repr__(self):
    return `self.root.Bottom`
# In class XMLDataNode
def __repr__(self):
    return self.datastr
#In class XMLElementNode
def __repr__(self):
        res = "<" + self.gi             # Start-tag
        for (name,value) in self.attributes.items():    # Attributes
                res = res + ' %s = "%s"' % (name,value)
        if self.EmptyElement == 1:          # End of start-tag
                res = res + "/>"
        else:
                res = res + ">"
        if self.Bottom:                 # Children if any
                res = res + `self.Bottom`
        if self.EmptyElement == 0:          # End-tag if required
                res = res + "</%s>" % self.gi
        if self.Right:                  # Right siblings if any
                res = res + `self.Right`
        return res


Listing Fifteen

<Car Units = "Dollars" Price = "10000">
 <Maker>
  Toyota
 </Maker>
 <Condition Type = "Good"/>
 <Color>
  Red
 </Color>
</Car>


Listing Sixteen

# Assuming x is an XMLTree object
# Print only the cars worth > 500 Dollars
for n in x:
    if n.AtElement("Car"):
        Price = string.atoi(n.Position.attributes["Price"])
        Units = n.Position.attributes["Units"]
        if Price > 500 and Units == "Dollars":
            print n.Position # Output sub-tree as an XML fragment



Copyright © 1998, Dr. Dobb's Journal


By Sean McGrath


Figure 1: Data can be visualized as a tree structure.



XML and Python Initiatives


XML

The following is a list of some initiatives for XML:

Channel Definition Format (CDF). An initiative by Microsoft and others to use XML as the basis for describing data for "Active Channels" in web browsers such as Internet Explorer 4.0.
Cold Fusion Markup Language (CFML). An XML-based markup language from Allaire for server-side scripting of web applications.
Document Object Model (DOM). A W3C initiative to standardize the APIs for access to HTML and XML documents in a platform- and language-independent way.
Open Financial Exchange (OFE). An initiative by Microsoft, Intuit, and Checkfree to use XML to describe financial transactions.
Open Software Distribution (OSD). An initiative by Microsoft, Marimba, Installshield, and others for an XML-based software distribution mechanism over the Internet.
XML-Data. An initiative by Microsoft and others to use XML syntax to capture DTD information.
XSL. A standard for expressing how XML should be rendered. XSL provides greater formatting power than CSS/HTML, while being interoperable with both. XSL's expression language is, itself, XML combined with the use of ECMAScript for situations where a full-fledged programming language is required.

Python

Here are some initiatives for Python:

Grail. A World Wide Web browser written in Python by Guido van Rossum. Naturally, it supports the execution of Python as applets downloaded in HTML pages.
Python Image Library. Adds an image object to Python. Supports a variety of image file formats and processing options.
Bobo. A collection of software for building web applications with Python. The core of Bobo is an object-request broker (ORB) that converts HTTP requests into Python object requests.
Alice. A 3D graphics package that uses Python as its embedded scripting language.
kjParsing. A parser generator implemented in Python that generates Python code.
Python COM. A Python extension that makes Python interoperable with COM/ActiveX. It supports ActiveX scripting, allowing Python to be used like JavaScript both client side (Internet Explorer) and server side (IIS). It also allows Python to act as a Windows Scripting Host.

--S.M.

