Dr. Dobb's is part of the Informa Tech Division of Informa PLC

XML Data Binding


Mar03: XML Data Binding

Eldon is a senior software engineer and Allen a software engineer for Rogue Wave Software. They can be contacted at [email protected] and [email protected], respectively.


An XML data binding is programming language code that handles data that can be represented in XML documents. XML data binding utilities dramatically simplify the task of writing XML-enabled applications by automatically creating a data binding for you. A successful data binding provides accurate, high-performing parsing and serialization code in a fraction of the time it takes to write and maintain the same code by hand.

Before you can use an XML data binding, the document's constraints must be understood and captured in the form of an XML schema. XML documents without expressed constraints are much more difficult to share and understand. For XML data to be truly portable, the parties exchanging it need a shared understanding of its acceptable structure and format. To communicate the allowable set of constraints, you typically use a schema.

There are a variety of schema languages available, including the Document Type Definition (DTD) and XML Schema. XML Schema (http://www.w3.org/TR/xmlschema-0/) is an attempt to overcome the limitations of DTDs and lets document constraints be expressed using techniques found in many object-oriented (OO) languages; the grammar for specifying these constraints is itself expressed in XML. XML Schema also provides a rich type system of 44 built-in simple types and the ability to create user-defined types. We will focus on XML Schema because of the benefits that it provides. Most of the principles apply (with slight modifications) to other schema languages.

Data bindings can theoretically translate XML schemas into the constructs of any programming language. Given an XML Schema that incorporates principles from OO languages such as C++ and Java, the most obvious and intuitive mappings are to these languages.

Data bindings provide a higher level of abstraction through code-generation techniques. Typically, the data binding automatically creates an object model with parsing and serialization code. Most data binding utilities also let you configure, in some way, how XML Schema type definitions are mapped to language constructs.

An XML data binding utility compiles XML Schema type definitions into classes for OO languages. Figure 1 shows how an instance of a compiled class can be compared to an instance document that conforms to the schema. An instance document can be unmarshaled, or parsed, into the equivalent object-oriented instance. Likewise, the OO instance can then be marshaled, or serialized, to the XML document instance equivalent.

Parsing

There are a variety of ways to parse XML, and a data binding is free to choose any of them. A binding can be built on the DOM or SAX APIs; SAX is the obvious choice for efficiency, although it requires a more complex implementation. A SAX parser is a push parser: every item in the document (element, attribute, character data, processing instruction, and the like) is passed to a callback that the parsing implementation chooses to use or ignore. An XML pull parser, on the other hand, lets the parsing implementation step through the document itself, extracting the elements, attributes, character data, and anything else it cares about. This is a simpler approach that generally results in better performance.
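The difference between push and pull parsing can be sketched with the two streaming APIs that ship with current JDKs: SAX (push) and StAX (pull). This is only an illustration; the class name PushVersusPull and the sample string are assumptions, and the article does not say which parser any particular data binding uses. Both methods collect the element names of the same small document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of any data binding product.
class PushVersusPull {
    static final String XML =
        "<item partNum=\"872-AA\"><productName>Lawnmower</productName></item>";

    // Push (SAX): the parser drives and calls back for every start tag.
    static List<String> elementNamesWithSax() {
        List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                names.add(qName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(XML)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("SAX parse failed", e);
        }
        return names;
    }

    // Pull (StAX): the application drives and asks for the next event.
    static List<String> elementNamesWithStax() {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(XML));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("StAX parse failed", e);
        }
        return names;
    }
}
```

In the push version the control flow lives inside the parser, so any state must be kept in the handler; in the pull version the loop belongs to the caller, which is what makes it easier to structure generated code around.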

XML data bindings do not require generic parsers. Because an XML data binding utility analyzes the schema, it can generate code tailored to parsing instance documents that conform to it. This gives a data binding that uses a schema-specific parser a distinct performance advantage over an implementation based on a generic parser.
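To make the idea of a schema-specific parser concrete, here is a hand-written sketch of the kind of code a generator might emit for the USAddress type from the purchase order schema. It is an assumption about what such output could look like, not the output of any real tool; the class name UsAddressSketch and its SAMPLE constant are hypothetical. Because the schema fixes the order of the child elements, the code simply asks for each one in turn instead of dispatching on arbitrary events.

```java
import java.io.StringReader;
import java.math.BigDecimal;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical stand-in for a class generated from the USAddress complex type.
class UsAddressSketch {
    static final String SAMPLE =
        "<shipTo country=\"US\">\n"
        + "  <name>Alice Smith</name>\n"
        + "  <street>123 Maple Street</street>\n"
        + "  <city>Mill Valley</city>\n"
        + "  <state>CA</state>\n"
        + "  <zip>90952</zip>\n"
        + "</shipTo>";

    String country, name, street, city, state;
    BigDecimal zip;   // the schema declares zip as xsd:decimal

    static UsAddressSketch unmarshal(String xml, String elementName) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
            r.nextTag();   // advance to the root start tag
            r.require(XMLStreamConstants.START_ELEMENT, null, elementName);
            UsAddressSketch a = new UsAddressSketch();
            a.country = r.getAttributeValue(null, "country");
            // The xsd:sequence fixes the order, so read the children in that order.
            a.name = child(r, "name");
            a.street = child(r, "street");
            a.city = child(r, "city");
            a.state = child(r, "state");
            a.zip = new BigDecimal(child(r, "zip"));
            r.nextTag();   // advance to the root end tag
            r.require(XMLStreamConstants.END_ELEMENT, null, elementName);
            return a;
        } catch (XMLStreamException | NumberFormatException e) {
            throw new IllegalArgumentException("document does not match USAddress", e);
        }
    }

    // Reads the next child element, insisting on the expected name.
    private static String child(XMLStreamReader r, String expected)
            throws XMLStreamException {
        r.nextTag();
        r.require(XMLStreamConstants.START_ELEMENT, null, expected);
        return r.getElementText();   // leaves the reader on the matching end tag
    }
}
```

A call such as UsAddressSketch.unmarshal(UsAddressSketch.SAMPLE, "shipTo") yields typed fields directly, with no intermediate generic tree and no string-keyed lookups.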

DOM-based parsing solutions can be inefficient for large XML documents. Not only is a generic in-memory model created, but typically the data stored in this model must then be converted to different data types, which can create overhead. In contrast, SAX-based parsers result in faster implementations, but unless you already have an existing data model to populate, you'll need to create one from scratch.

Mapping Details

A data binding is a mapping to the constructs of the programming language that users want to work with. A binding for XML Schema can be complex, as the XML Schema specification is extensive.

The first issue is naming. The XML Schema type definition mechanism lets you define attributes, elements, and complex types with names that must then be mapped to the conventions of the programming language. Most data bindings let you customize the names of the type definitions being mapped. However, if the schema follows good design practices, the data binding utility should generate appropriate and acceptable names by default. Otherwise, the generated API may not be desirable to program against.
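A default name mapping could look like the following sketch. The rules here (split on hyphens, underscores, periods, and colons, then capitalize each part) are an assumption chosen for illustration; real tools define their own conventions and usually let you override them. NameMapper is a hypothetical helper.

```java
// Hypothetical helper showing one possible default XML-to-Java name mapping.
class NameMapper {
    // Maps an XML name such as "purchaseOrder" or "ship-to" to a class-style name.
    static String toClassName(String xmlName) {
        StringBuilder sb = new StringBuilder();
        // These characters are legal in XML names but unusual or illegal in Java names.
        for (String part : xmlName.split("[-_.:]+")) {
            if (part.isEmpty()) {
                continue;
            }
            sb.append(Character.toUpperCase(part.charAt(0)))
              .append(part.substring(1));
        }
        return sb.toString();
    }

    // Accessor name for an element or attribute, for example "getShipTo".
    static String toGetterName(String xmlName) {
        return "get" + toClassName(xmlName);
    }

    // Presence-test name for an optional member, for example "isCommentSet".
    static String toIsSetName(String xmlName) {
        return "is" + toClassName(xmlName) + "Set";
    }
}
```

With these rules, the orderDate attribute of the purchase order schema maps to getOrderDate and isOrderDateSet, which matches the naming style used in Listing Three.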

The data binding is also responsible for mapping the XML document instance to a type-safe OO object model. Elements and complex types can be mapped to classes in languages such as Java and C++. Complex types are directly analogous to user-defined language types. Element declarations give a specific name to a complex type definition and are not quite analogous. However, top-level elements require a representation, and in OO languages the only tool available is the class. Attributes and elements that have simple types can be represented as language primitives such as long, double, float, int, and so on. Attributes can only be simple types, but element declarations can refer to complex type definitions. The data binding creates a signature that lets attributes and child elements of a complex type be accessed.

Consider the example purchase order document and its schema in Listings One and Two, taken from the XML Schema Primer documentation (http://www.w3.org/TR/xmlschema-0/#po.xml and http://www.w3.org/TR/xmlschema-0/#po.xsd). The schema describes the documents exchanged for a simple ordering and billing application. When compiled to a Java class, the PurchaseOrderType might have an interface similar to Listing Three. The data binding includes a default constructor, an equals method, and methods for getting and setting member data as well as marshaling the object to XML data and unmarshaling it from XML data. Elements and attributes that are optional have an associated is<identifier>Set method for querying whether the element or attribute value is present after parsing. Also note that an optional validation interface might be generated, letting users choose if and when validation should occur.

The XML details exposed in this interface are minimal. XML parsing, serialization, or validation errors may occur, but in terms of working with the data, the familiar and high-level constructs available to the language are all that is necessary.
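The is<identifier>Set convention for optional members can be sketched as follows. This is a minimal illustration, assuming a binding that tracks presence with a boolean flag and serializes with the JDK's StAX writer; the class OrderSketch is hypothetical and covers only the optional comment element.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical, heavily reduced stand-in for a generated PurchaseOrderType.
class OrderSketch {
    private String comment;       // optional in the schema (minOccurs="0")
    private boolean commentSet;   // records whether a value is present

    public String getComment() {
        return comment;
    }

    public void setComment(String comment) {
        this.comment = comment;
        this.commentSet = comment != null;
    }

    public boolean isCommentSet() {
        return commentSet;
    }

    // Serializes the object, emitting <comment> only when a value is present.
    public String marshal() {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter w =
                XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartElement("purchaseOrder");
            if (commentSet) {
                w.writeStartElement("comment");
                w.writeCharacters(comment);   // escapes markup characters
                w.writeEndElement();
            }
            w.writeEndElement();
            w.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not serialize", e);
        }
        return out.toString();
    }
}
```

Client code never touches angle brackets: it calls setComment and marshal, and can ask isCommentSet after unmarshaling to distinguish an absent element from an empty one.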

XML schemas also support collections: an element declaration whose maximum occurrence constraint (maxOccurs) is greater than 1 may appear repeatedly. Such schema collections can be mapped directly to simple vector or list collection types. Support for more advanced collection types (hash tables and dictionaries) is more difficult, and they can only be configured with additional input, such as a configuration file. The example Java interface in Listing Four shows how a basic collection mapping can be established. This interface results from binding the Items complex type definition in the purchase order schema (Listing Two) to the Java language.
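The sketch below shows both cases under stated assumptions. It uses a typed List rather than the raw Vector of Listing Four (generics were not available in Java when the article was written), and the indexBy method is a hypothetical way of supplying the extra input a keyed collection needs, namely which member acts as the key.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for a class generated from the Items complex type.
class ItemsSketch {
    // Stand-in for the class generated from the anonymous "item" type.
    static class Item {
        final String partNum;
        final String productName;

        Item(String partNum, String productName) {
            this.partNum = partNum;
            this.productName = productName;
        }
    }

    // maxOccurs="unbounded" maps naturally onto an ordered list.
    private final List<Item> items = new ArrayList<>();

    public List<Item> getItemList() {
        return items;
    }

    // A keyed collection needs information the schema does not carry:
    // the caller (or a configuration file) must say which value is the key.
    public <K> Map<K, Item> indexBy(Function<Item, K> key) {
        Map<K, Item> index = new LinkedHashMap<>();
        for (Item item : items) {
            index.put(key.apply(item), item);
        }
        return index;
    }
}
```

For the purchase order example, indexBy(i -> i.partNum) would give a dictionary keyed by SKU, but nothing in the schema itself says that partNum is unique, which is why this mapping cannot be derived automatically.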

An important part of the data binding is an error model that reports useful errors. If an instance document's structure does not conform to the schema, the generated parsing implementation needs to report this to users by raising an exception. The exception should contain the line and column number where the error occurred, as well as an error message. Listing Five is an exception class for Java.
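One way to obtain line and column information is to take it from the underlying parser. The following sketch assumes a binding built on SAX and wraps the JDK's SAXParseException in a hypothetical LocatedParseException; unlike the class in Listing Five, it is unchecked and omits the source name to keep the example short. It only detects well-formedness errors, not schema violations.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical exception type carrying the error position.
class LocatedParseException extends RuntimeException {
    private final int lineNumber;
    private final int columnNumber;

    LocatedParseException(String msg, int lineNumber, int columnNumber,
                          Throwable cause) {
        super(msg + " (line " + lineNumber + ", column " + columnNumber + ")", cause);
        this.lineNumber = lineNumber;
        this.columnNumber = columnNumber;
    }

    public int getLineNumber() {
        return lineNumber;
    }

    public int getColumnNumber() {
        return columnNumber;
    }
}

class LocatedParsing {
    // Parses the text and rethrows parser errors with their position attached.
    static void checkWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
        } catch (SAXParseException e) {
            throw new LocatedParseException(
                e.getMessage(), e.getLineNumber(), e.getColumnNumber(), e);
        } catch (Exception e) {
            throw new IllegalStateException("parser setup or I/O failure", e);
        }
    }

    // Convenience for callers: the line of the first error, or -1 if none.
    static int firstErrorLine(String xml) {
        try {
            checkWellFormed(xml);
            return -1;
        } catch (LocatedParseException e) {
            return e.getLineNumber();
        }
    }
}
```

A generated unmarshaler would raise the same kind of exception for structural violations it detects itself, taking the position from the parser at the point where the unexpected element was seen.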

Using the Generated Classes

Business logic can be added to the generated API by subclassing and adding behavior. The generated classes may provide methods that can be overridden by the subclass. Typically, it is best not to change the generated code directly since these changes are lost when the generator is run again.
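As a sketch of this subclassing pattern, assume the generator produced a class like GeneratedItem below (a hypothetical stand-in reduced to two members). The hand-written PricedItem adds behavior without touching the generated file, so regenerating the code leaves the business logic intact.

```java
import java.math.BigDecimal;

// Hypothetical stand-in for a generated class; never edited by hand.
class GeneratedItem {
    private int quantity;
    private BigDecimal usPrice;

    public int getQuantity() {
        return quantity;
    }

    public void setQuantity(int quantity) {
        this.quantity = quantity;
    }

    public BigDecimal getUsPrice() {
        return usPrice;
    }

    public void setUsPrice(BigDecimal usPrice) {
        this.usPrice = usPrice;
    }
}

// Hand-written subclass that carries the business logic.
class PricedItem extends GeneratedItem {
    // Price of this order line: unit price times quantity.
    public BigDecimal lineTotal() {
        return getUsPrice().multiply(BigDecimal.valueOf(getQuantity()));
    }
}
```

One consequence of this design is that the unmarshaling code must be able to create instances of the subclass rather than the generated base class, so a tool that supports this style typically needs some form of factory or configuration hook.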

Depending on the language, different techniques for building the generated classes may be involved. In Java, an Ant script may be generated, letting users easily build the generated code. In C++, you may want makefiles that build the classes into shared libraries for a particular operating system and compiler, ready to link into your application. In either case, it is best that the code generator only regenerates the files that have changed. There are large schemas, such as Financial products Markup Language (FpML; http://www.fpml.org/), that contain over 300 type definitions. If each of these classes must be rebuilt each time a minor change to the schema is made, build times will be greater than necessary.

For creating new applications, a data binding can bring significant savings. The generated code saves on development, testing, porting, and maintenance.

Converting to XML

In many cases, you need to take existing business processes that exchange data that is not XML and convert them to XML. For such tasks, XML data-binding tools can be valuable. Given a schema that describes the data, the tool generates a set of language constructs that lets users produce and consume XML documents that conform to the schema.

Converting applications to exchange data in XML becomes a matter of creating and populating instances of the generated classes for sending, and accessing data from instances of the generated classes for receiving. Given an existing process with an existing data model, the problem becomes one of extracting data from one data model into another. When the data model is a set of programming language constructs, this may be much easier than trying to generate XML directly. It typically consists of simply assigning fields from one class to another.
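The field-by-field copy can be sketched as below. Both classes are hypothetical: LegacyCustomer stands for a class the application already has, and GeneratedUsAddress for a class generated from the USAddress type.

```java
// Hypothetical existing application class.
class LegacyCustomer {
    final String fullName;
    final String streetLine;
    final String town;
    final String stateCode;

    LegacyCustomer(String fullName, String streetLine, String town, String stateCode) {
        this.fullName = fullName;
        this.streetLine = streetLine;
        this.town = town;
        this.stateCode = stateCode;
    }
}

// Hypothetical stand-in for a class generated from the USAddress type.
class GeneratedUsAddress {
    private String name, street, city, state;

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public String getStreet() { return street; }
    public void setStreet(String street) { this.street = street; }
    public String getCity() { return city; }
    public void setCity(String city) { this.city = city; }
    public String getState() { return state; }
    public void setState(String state) { this.state = state; }
}

// Hand-written glue between the two data models.
class AddressConverter {
    // Sending: populate a generated instance from the existing model.
    static GeneratedUsAddress toGenerated(LegacyCustomer c) {
        GeneratedUsAddress a = new GeneratedUsAddress();
        a.setName(c.fullName);
        a.setStreet(c.streetLine);
        a.setCity(c.town);
        a.setState(c.stateCode);
        return a;
    }

    // Receiving: read the generated instance back into the existing model.
    static LegacyCustomer fromGenerated(GeneratedUsAddress a) {
        return new LegacyCustomer(a.getName(), a.getStreet(), a.getCity(), a.getState());
    }
}
```

The converter is the only code that knows about both models, so a schema change that renames an element affects the generated class and this one class, not the rest of the application.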

In other cases, a database is the data model. In this instance, you would need to write code that extracts data from the database into the generated classes. While this is not a difficult task, it is a common enough model that several data-binding tools provide support for simplifying this task (see http://www.rpbourret.com/xml/XMLDataBinding.htm).

Generating code for converting the data model to XML presumes the existence of an XML Schema. Other tools might be desirable for creating schemas, but discussion of them is beyond the scope of this article.

Coping with Existing Code

An XML data-binding tool can get XML data into programming language constructs, but it still creates its own data structures. In many cases, there will already be language constructs for representing and storing the data and the data needs to be transferred between the generated data structures and existing data structures. This may not be daunting in some cases, although it may require copying a number of fields from one class to another, or perhaps require augmentation of the generated classes to read/write the data using database tables.

In other cases (complex structures with deep nesting, for example), this may constitute more work or introduce a performance hit you can't afford. Having parallel data models uses more space and takes more time than having a single model, not to mention work needed to create the conversion tools. In short, there comes a point when the data-binding tool does not make sense.

When existing code causes an XML data binding to be undesirable, it might make more sense to augment existing classes with the ability to marshal/unmarshal XML—and generate the code to do so. This is the approach used in tools such as Rogue Wave's XML Streams library (http://www.roguewave.com/products/sourcepro/core/fb.cfm), which lets you adapt C++ classes by adding a few macros so that XML data may be streamed into and out of instances of the class. Castor (http://castor.exolab.org/), an open-source data binding for Java, provides similar functionality using Java's introspection capabilities. Microsoft's .NET lets classes be marked up with metadata (http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpguide/html/cpconintroducingxmlserialization.asp), allowing them to be serialized as XML. Since these approaches do not use a schema, it might be desirable to extend an XML data-binding tool to perform a similar task of inserting and extracting XML from an existing data model, but base it on XML Schemas rather than on a generic XML model.
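The introspection-based style can be illustrated with plain Java reflection. This sketch is not how Castor or any of the tools named above is implemented; ReflectiveMarshaller and ExistingAddress are hypothetical, and the code handles only flat objects whose fields can be written as text. It also relies on getDeclaredFields, whose ordering the Java specification does not guarantee, which is one reason schema-less output is hard to pin down.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

// Hypothetical reflection-based marshaler for flat objects.
class ReflectiveMarshaller {
    // Writes one child element per non-static, non-null field of the bean.
    static String marshal(String rootName, Object bean) {
        StringBuilder xml = new StringBuilder();
        xml.append('<').append(rootName).append('>');
        for (Field field : bean.getClass().getDeclaredFields()) {
            if (Modifier.isStatic(field.getModifiers()) || field.isSynthetic()) {
                continue;
            }
            field.setAccessible(true);   // reach private members of existing classes
            Object value;
            try {
                value = field.get(bean);
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("cannot read " + field.getName(), e);
            }
            if (value != null) {
                xml.append('<').append(field.getName()).append('>')
                   .append(escape(value.toString()))
                   .append("</").append(field.getName()).append('>');
            }
        }
        xml.append("</").append(rootName).append('>');
        return xml.toString();
    }

    // Minimal escaping for element content.
    private static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}

// Hypothetical existing class that was never designed with XML in mind.
class ExistingAddress {
    private final String city = "Mill Valley";
    private final String state = "CA";
}
```

The element names come straight from the field names, so the XML shape is dictated by the class rather than by an agreed schema. That is the gap a schema-driven generator for existing classes would close.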

Extending the Model

Can the XML data binding model be extended to cope with existing code? The general problem is to generate code that lets existing classes marshal themselves to XML and lets XML be unmarshaled into instances of existing classes. This can take two forms:

  • Create a schema that represents the data in an existing class and then generate methods that marshal and unmarshal XML that corresponds to the schema.
  • Adapt an existing schema to an existing class and generate marshal and unmarshal methods for this schema that work with the existing class.

Generating Schemas from Code

Creating schemas from programming language constructs involves parsing the language construct—a process that finds all possible data within the construct. Not all data is significant, however. Some data may be there only as a cache. To be an accurate representation of the data, the schema creation requires some input identifying which data members are significant.

One approach is to start with a tool (such as Rational Rose) that takes programming language code and produces UML. From UML, you can produce XMI, which can then be edited to indicate the significant data and then transformed to XML Schema.

Reconciling Object Models with Schema

Another possible scenario consists of an existing object model and schema that may not be a perfect match. This can occur when attempts are made to standardize a proprietary process. The schema created is a compromise of the various ways of handling the process, but doesn't conform exactly to any of the existing object models for the process.

In this case, you want to generate code that extracts data from the XML and places it in instances of objects in the object model. You don't want to generate new classes since you would have a redundant model and still need to get data out of the generated classes and into the existing classes. This approach also requires input identifying how data in the programming language code maps to data in the schema.

Adding Marshaling and Unmarshaling to Existing Code

Once you have a schema that describes how the data is represented in XML, the schema can be used to generate code that extracts data from an instance and inserts data into an instance. This can be done either intrusively or nonintrusively.

The intrusive approach is to actually modify the language construct, adding functions that provide the marshaling/unmarshaling. There are two advantages to this approach. First, adding marshaling/unmarshaling directly to the class alleviates access problems. They don't all go away, since there may be private data in a superclass, but there is access to all local private and protected data. Second, this approach is easier to understand: the marshal/unmarshal methods become part of the class.

The nonintrusive approach creates a parallel construct that provides the marshal/unmarshal methods. The advantage of this approach is that it doesn't require a change to existing code. However, if a class has private data with no accessors, this method will not allow this data to be marshaled/unmarshaled.
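The trade-off between the two approaches can be seen in a small sketch. Account and AccountMarshaller are hypothetical, and escaping of text content is omitted for brevity. The intrusive method lives inside the class and can see the private internalId; the nonintrusive marshaler sits alongside the class and can reach only what the public accessors expose.

```java
// Hypothetical existing class with one private member that has no accessor.
class Account {
    private final String owner;
    private final String internalId;   // private, with no accessor

    Account(String owner, String internalId) {
        this.owner = owner;
        this.internalId = internalId;
    }

    public String getOwner() {
        return owner;
    }

    // Intrusive: a method added to the class itself sees every local field.
    public String marshal() {
        return "<account><owner>" + owner + "</owner>"
            + "<internalId>" + internalId + "</internalId></account>";
    }
}

// Nonintrusive: a parallel class that would work even if Account had no
// marshal() method, but that can only use Account's public accessors.
class AccountMarshaller {
    static String marshal(Account account) {
        return "<account><owner>" + account.getOwner() + "</owner></account>";
    }
}
```

In Java, reflection with setAccessible can sometimes work around the access limit, but that is a language-specific escape hatch; in C++ the nonintrusive marshaler would need a friend declaration, which is itself a change to the existing class.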

You also need a way to map data to the schema. In the case where the schema is generated from the code, you only need to be able to identify which items are significant. If the schema was not generated from the code, then you need to create a mapping that connects each item in the schema to a data member in a class.

Conclusion

An XML binding is programming language code that represents XML data, thereby ensuring that the documents conform to their schema. The generated code enables the transfer of XML data to/from instances of the generated classes. While XML data-binding tools may not always be useful when writing code to process XML, they usually do save time in coding, testing, and maintenance. For more information, see Ronald Bourret's XML data-binding resources at http://www.rpbourret.com/xml/XMLDataBinding.htm.

DDJ

Listing One

<?xml version="1.0"?>
<purchaseOrder orderDate="1999-10-20">
  <shipTo country="US">
    <name>Alice Smith</name>
    <street>123 Maple Street</street>
    <city>Mill Valley</city>
    <state>CA</state>
    <zip>90952</zip>
  </shipTo>
  <billTo country="US">
    <name>Robert Smith</name>
    <street>8 Oak Avenue</street>
    <city>Old Town</city>
    <state>PA</state>
    <zip>95819</zip>
  </billTo>
  <comment>Hurry, my lawn is going wild!</comment>
  <items>
    <item partNum="872-AA">
      <productName>Lawnmower</productName>
      <quantity>1</quantity>
      <USPrice>148.95</USPrice>
      <comment>Confirm this is electric</comment>
    </item>
    <item partNum="926-AA">
      <productName>Baby Monitor</productName>
      <quantity>1</quantity>
      <USPrice>39.98</USPrice>
      <shipDate>1999-05-21</shipDate>
    </item>
  </items>
</purchaseOrder>


Listing Two

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
 <xsd:annotation>
  <xsd:documentation xml:lang="en">
   Purchase order schema for Example.com.
   Copyright 2000 Example.com. All rights reserved.
  </xsd:documentation>
 </xsd:annotation>
 <xsd:element name="purchaseOrder" type="PurchaseOrderType"/>
 <xsd:element name="comment" type="xsd:string"/>
 <xsd:complexType name="PurchaseOrderType">
  <xsd:sequence>
   <xsd:element name="shipTo" type="USAddress"/>
   <xsd:element name="billTo" type="USAddress"/>
   <xsd:element ref="comment" minOccurs="0"/>
   <xsd:element name="items"  type="Items"/>
  </xsd:sequence>
  <xsd:attribute name="orderDate" type="xsd:date"/>
 </xsd:complexType>
 <xsd:complexType name="USAddress">
  <xsd:sequence>
   <xsd:element name="name"   type="xsd:string"/>
   <xsd:element name="street" type="xsd:string"/>
   <xsd:element name="city"   type="xsd:string"/>
   <xsd:element name="state"  type="xsd:string"/>
   <xsd:element name="zip"    type="xsd:decimal"/>
  </xsd:sequence>
  <xsd:attribute name="country" type="xsd:NMTOKEN"
     fixed="US"/>
 </xsd:complexType>
 <xsd:complexType name="Items">
  <xsd:sequence>
   <xsd:element name="item" minOccurs="0" maxOccurs="unbounded">
    <xsd:complexType>
     <xsd:sequence>
      <xsd:element name="productName" type="xsd:string"/>
      <xsd:element name="quantity">
       <xsd:simpleType>
        <xsd:restriction base="xsd:positiveInteger">
         <xsd:maxExclusive value="100"/>
        </xsd:restriction>
       </xsd:simpleType>
      </xsd:element>
      <xsd:element name="USPrice"  type="xsd:decimal"/>
      <xsd:element ref="comment"   minOccurs="0"/>
      <xsd:element name="shipDate" type="xsd:date" minOccurs="0"/>
     </xsd:sequence>
     <xsd:attribute name="partNum" type="SKU" use="required"/>
    </xsd:complexType>
   </xsd:element>
  </xsd:sequence>
 </xsd:complexType>
 <!-- Stock Keeping Unit, a code for identifying products -->
 <xsd:simpleType name="SKU">
  <xsd:restriction base="xsd:string">
   <xsd:pattern value="\d{3}-[A-Z]{2}"/>
  </xsd:restriction>
 </xsd:simpleType>
</xsd:schema>


Listing Three

import java.io.InputStream;
import java.io.OutputStream;

// Interface sketch only: method bodies are elided in the original listing.
public class PurchaseOrderType {
   ...
  public PurchaseOrderType() {
    ...
  }
  public boolean equals(Object rhs) {
    ...
  }
  // Parsing and serialization
  public void unmarshal(InputStream in) throws XmlParseException {
    ...
  }
  public void marshal(OutputStream out) throws XmlSerializeException {
    ...
  }
  // Optional, on-demand validation
  public boolean isValid() throws XmlValidationException {
    ...
  }
  public USAddress getShipTo() {
    ...
  }
  public void setShipTo(USAddress shipTo) {
    ...
  }
  public USAddress getBillTo() {
    ...
  }
  public void setBillTo(USAddress billTo) {
    ...
  }
  // comment is optional (minOccurs="0"), so it has an is...Set method
  public String getComment() {
    ...
  }
  public void setComment(String comment) {
    ...
  }
  public boolean isCommentSet() {
    ...
  }
  public Items getItems() {
    ...
  }
  public void setItems(Items items) {
    ...
  }
  // orderDate is an optional attribute
  public java.util.Date getOrderDate() {
    ...
  }
  public void setOrderDate(java.util.Date orderDate) {
    ...
  }
  public boolean isOrderDateSet() {
    ...
  }
}


Listing Four

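import java.util.Vector;

// Interface sketch only: method bodies are elided in the original listing.
public class Items {
  ...
  public Items() {
    ...
  }
  ...
  // The repeating item element maps to a simple collection
  public Vector getItemVector() {
    ...
  }
  public void setItemVector(Vector items) {
    ...
  }
  ...
}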


Listing Five

public class XmlParseException extends Exception {
  ...
  public XmlParseException(String msg, String src, 
                                     int lineNumber, int columnNumber) {
    ...
  }
  public String toString() {
    ...
  }
  public int getLineNumber() {
    ...
  }
  public int getColumnNumber() {
    ...
  }
}

