Dr. Dobb's | XML & Relational Databases

XML & Relational Databases

Andrew and Dare first examine the merits of an XML data model over the relational data model, then translate between the XML and relational query domains.

May 01, 2003
URL:http://www.drdobbs.com/database/xml-relational-databases/184405339

As XML's use increases, it is becoming a common requirement to view traditional relational data as XML. One way of doing this is to provide a virtual XML view over the relational database. This view is provided through the mechanism of annotating an XSD schema that describes the desired XML shape. These annotations contain relational metadata used to translate queries of the XML view into SQL queries, and to transform the SQL query results into the desired XML shape. In this article, we discuss the merits of choosing an XML data model over the more traditional relational data model. We also examine the mapping process and discuss the task of translating between the XML and relational query domains. Finally, we provide an example of mapping an XSD schema to a sample database and querying the virtual XML View.

Traditional relational database systems are excellent mechanisms for storing normalized, highly structured data. However, it is not always easy to query or manipulate this data model. Database developers have traditionally used the mechanism of relational views and stored procedures to abstract away the complexities of accessing the underlying data—in other words, providing a logical relational view. However, as the logical data model becomes increasingly semistructured (data with some variance), providing a relational view to program against becomes increasingly difficult.

In general, relational databases are excellent persistent storage mechanisms for storing highly structured, normalized data—"square data." But as anyone who has modeled real-world business objects knows, the real world is not square. For instance, consider a typical customer list where there is some variability between customers. One customer has a cell phone, work phone, and home phone; while another has a cell phone and pager number. A typical first attempt at a representation with relational tables uses a column for each property; see Table 1(a).

Table 1: (a) Relational table; (b) creating normalized tables from a relational table.

However, this approach leads to sparse tables with highly denormalized data and can potentially cause performance and scalability problems for a typical relational database. The solution for most databases is to pull out properties and place them in their own normalized table that has an N:1 relationship with the master table, as in Table 1(b). It would then be possible to further normalize the customer data by separating the PhoneNumbers table into separate tables based on type, like Table 2.

Table 2: Further normalizing customer data.

The results would be a highly normalized view of customers that would require users to join several tables to get the desired Customer objects. For most experienced DBAs and SQL users, this probably wouldn't be a problem. However, for most application developers, it would probably be easier to program against a logical Customer object without exposing the details of the underlying relational storage mechanism.

The traditional technique for abstracting the physical database storage is to create a relational view or stored procedure API. However, there is an alternative—creating a virtual XML view. The difference is that users see an XML representation of the relational data as opposed to relation rowsets. To illustrate the advantages, look at a possible XML representation of the Customer List in Listing One. In this view, you are not forced to think about tables, columns, and joins. You just know that Customers have first and last names, and several different types of phone numbers. In other words, the hierarchical nature of the XML representation implies a relationship between logical entities. Further, since XML is semistructured, there are no NULL valued columns—just the absence of elements or attributes.

Listing One

<CustomerList>
    <Customer LastName="Jones" FirstName="Jeff">
        <WorkPhone>5556767</WorkPhone>
        <CellPhone>5558888</CellPhone>
        <HomePhone>5553232</HomePhone>
</Customer> 
<Customer LastName="Smith" FirstName="Jane">
        <CellPhone>5552444</CellPhone>
        <Pager>5558989</Pager>
</Customer>
</CustomerList>

Typically, the Customers table has more than one customer and, since a lot of existing XML parsers need well-formed documents with a single root, you need to add a <CustomerList> wrapper element so that the list of customers has a single root. This is a common problem when mapping relational data to XML, and is addressed as part of the mapping process.

You can further extend the customer list by adding an order table that includes a many to one relationship with the Customers table; see Table 3.

Table 3: Extending customers list.

Now, if you wanted to get a customer list that included order data, you would have to join the two tables and deal with the key relationship and duplicate customer records in the resulting rowset in cases where a given customer has several orders. However, adding it to the XML representation is straightforward. In Listing Two, the containment relationship in the XML model between the <Customer> and <Order> elements implies a logical relationship—in this case, a "Has-A" relationship. Since the XML data model allows multiple subelements of the same type, a given Customer can contain multiple Orders. Also, we have created <Order> elements with some of the columns represented as attributes and some of the columns represented as elements. This is because the XML data model provides more flexibility in that we can map column values to attributes or element values, thereby decoupling users from the underlying layout of the relation tables.

Listing Two

<CustomerList>
<Customer LastName="Jones" FirstName="Jeff">
        <WorkPhone>5556767</WorkPhone>
        <CellPhone>5558888</CellPhone>
        <HomePhone>5553232</HomePhone>
        <Order Product="Apples" Quantity="25" >
            <Shipped>09/17/2001</Shipped>
        </Order>
</Customer> 
<Customer LastName="Smith" FirstName="Jane">
        <CellPhone>5552444</CellPhone>
        <Pager>5558989</Pager>
        <Order Product="Oranges" Quantity="25" >
            <Shipped>06/15/2001</Shipped>
        </Order>
        <Order Product="Tomatoes" Quantity="100" />
</Customer>
</CustomerList>

In general, there are four significant benefits in providing an XML abstraction over a relational database:

XML can represent semistructured data in a more natural way than traditional relational data structures, such as rowsets.
XML is hierarchical. For most of us, our business object relationships are hierarchical and not flat. But in the relational world, we are forced to persist our data into flat relation tables and expose relationships between objects as key relationships. In XML, these relationships can be expressed through hierarchical containment.
XML is a platform-independent, human-readable format. If you can generate XML from your relational data, then you are no longer tied to a particular platform or programming language. In fact, this lets you expose and consume relational data to/from any client application.
The XML abstraction decouples the application code from the database. Often, applications are tightly coupled to the database schema, in particular, to table names in the database. One way to alleviate this problem is to provide a relational abstraction through the use of stored procedures and views. However, another alternative is to use the XML view mechanism to abstract the underlying database. To users of the XML view, the database just looks like a collection of XML fragments. In effect, this decouples the application from the database, and lets the underlying database schema be modified as needed.

However, providing an extensible mechanism for viewing a relational database as XML is not a trivial task.

The Mapping Process

One way for a relational database to expose XML is via a standard SQL query with a canonical mapping. An obvious mapping is where each table in the FROM clause is represented as an XML element, and the columns listed in the SELECT clause are attributes of that element. For example, the results of this query:

select CustomerID, CompanyName, Contact-
     Name, ContactTitle, City from Customers

would be transformed into XML by iterating through the results rowset, creating a new Customer element for each row and adding an attribute for every non-Null column value.

Now, if users wanted to also return orders for a given customer, the query could be rewritten to include the appropriate order information via a join between the Customers and Orders tables:

select Customers.CustomerID, Customers.                                      CompanyName, 
  Customers.ContactName, Customers.
ContactTitle, 
Customers.City, Orders.
from Customers, Orders.OrderId, Orders.                                         OrderDate
where Customers.CustomerId = Orders.                                         CustomerId
order by Customers.CustomerId

Here, the <Order> elements are created in the same way as the <Customer> elements. The elements are then nested in the order in which the tables appear in the where clause. In this case, the <Order> elements are nested inside the <Customer> elements. When there are multiple orders for a given customer, a new <Customer> element is not created until a new customer, as identified by CustomerId, is found in the query results rowset. It is necessary to add an order by clause to the query so that correctly ordered results are generated.

That is the most obvious canonical representation of a rowset. You can imagine other similar heuristics for generating XML; for example, columns could map to subelements instead of attributes. Although, this mapping technique seems like an effective technique, there are a couple of problems with it.

First, the query is still a relational query and requires the user to know the relational structures and relationships. For example, what if the query was for all Customers with order dates of a given range, the query writer would need to know how to join the Customers table to the Orders table. True, a DBA could provide relational views to alleviate this problem, however, the query writer is still required to think in relational terms.
The second problem is that the query writer is not always in control of the shape of the XML being desired. For example, what would happen if the requirement for the query over the Customers table required the CustomerId column to become an attribute and the ContactName column to become a subelement? One technique is to apply a posttransformation to the query results via XSLT. However, this technique would become fairly cumbersome for a range of XML shapes. Further, since the desired shape is not known until after the query has been executed and the results returned, the posttransformation could potentially be a performance burden.

One alternative to canonical mapping with posttransformations is to provide a virtual XML view over the relational database. In other words, view the database as a collection of XML documents instead of relational objects. A popular technique for describing XML shapes is through the use of an XSD schema. To establish the XML view of a relational database, you could annotate the XSD schema with relational metadata that could be used to translate queries over the XML view to relational queries against the underlying database. These relational queries could then produce results that were already shaped for transformation to XML. In other words, since the desired XML shape is known ahead of time, it would be possible to construct a relational query that returned a rowset that was designed for easy conversion to XML.

Listing Three is a simple XSD schema that describes a typical Customer List.

Listing Three

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:dt="urn:schemas-microsoft-com:datatypes">
    <xsd:element name="CustomerList" >
        <xsd:complexType>
                <xsd:sequence>
                    <xsd:element ref="Customer" />
                </xsd:sequence>
            </xsd:complexType>
    </xsd:element>  
    <xsd:element name="Customer" >
        <xsd:complexType>
                <xsd:sequence>
                    <xsd:element name="Order" maxOccurs="unbounded" >
                        <xsd:complexType>
                            <xsd:attribute name="OrderID" type="xsd:string"/>
                            <xsd:attribute name="OrderDate" type="xsd:date"/>
                    </xsd:complexType>
                    </xsd:element>
                </xsd:sequence>
                <xsd:attribute name="CustomerID" type="xsd:string"/>
                <xsd:attribute name="Company" type="xsd:string"/>
                <xsd:attribute name="Name" type="xsd:string"/>
                <xsd:attribute name="Title" type="xsd:string"/>
                <xsd:attribute name="City" type="xsd:string"/>
        </xsd:complexType>
    </xsd:element>
</xsd:schema>

To annotate this schema, the following constructs will be used:

Relation defines that an element or attribute maps to a table.
Field defines that an element or attribute maps to a column.
Is-constant, if True, means that the virtual XML is not derived from any tables or fields in the database and should be treated as constant content.
Relationship defines the relationship between tables in the underlying database. This specifies how the resulting XML elements are nested.
Key-fields defines the key field in the table that will be used to uniquely identify an element.

In the annotated XSD schema in Listing Four, a namespace was included to identify the mapping annotations added to the XSD schema. Next, all the attributes and elements that were known to map to columns in the database were identified using the msdata:field annotation. The remaining elements in the schema are mapped to tables in the database using the msdata:relation annotation. Because of scoping rules, the msdata:relation annotation also applies to all the column mapped attributes and subelements. Next, it is necessary to specify how the elements mapped to tables will be joined. In the xsd:annotation section, the relationship is defined to specify the same join information. To be able to use this relationship elsewhere in the document, it is given an alias. The msdata:relationship annotation is then used to specify how the hierarchy of the XML view is built when elements come from different tables. For example, to nest Order information inside of Customer information, the CustomerOrder msdata:relationship specifies the join between the underlying tables in the database. Finally, to enclose the query results in a wrapper element, the <CustomerList> element is annotated with the msdata:is-constant annotation to indicate that it is not mapped to anything in the database.

Listing Four

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:dt="urn:schemas-microsoft-com:datatypes"
            xmlns:msdata="urn:schemas-microsoft-com:mapping-schema" >
   <xsd:annotation>
            <xsd:appinfo>
                <msdata:relationship name="CustomerOrder"                
                parent="Customers"
                        parent-key="CustomerID"
                        child-key="CustomerID"
                        child="Orders" />
            </xsd:appinfo>
    </xsd:annotation>
    <xsd:element name="CustomerList" msdata:is-constant="true" >
        <xsd:complexType>
                <xsd:sequence>
                    <xsd:element ref="Customer" />
                </xsd:sequence>
            </xsd:complexType>
    </xsd:element>  

    <xsd:element name="Customer" msdata:relation = 
                          "Customers"  msdata:key-fields="CustomerID" >
        <xsd:complexType>
                <xsd:sequence>
                    <xsd:element name=
                     "Order" maxOccurs="unbounded" msdata:relation = "Orders" 
                    msdata:relationship="CustomerOrder" >
                        <xsd:complexType>
                        <xsd:attribute name="OrderID" msdata:field="OrderID"
                            type="xsd:string"/>
                            <xsd:attribute name=
                                        "OrderDate" msdata:field="OrderDate"
                            type="xsd:date"/>   
                    </xsd:complexType>
                    </xsd:element>
                </xsd:sequence>
                <xsd:attribute name="CustomerID" msdata:field=
                                          "CustomerId" type="xsd:string"/>
                <xsd:attribute name="Company" msdata:field=
                                           "CompanyName"  type="xsd:string"/>
                <xsd:attribute name="Name" msdata:field=
                                           "ContactName" type="xsd:string"/>
                <xsd:attribute name="Title" msdata:field=
                                           "ContactTitle" type="xsd:string"/>
                <xsd:attribute name="City" msdata:field=
                                           "City" type="xsd:string"/>
        </xsd:complexType>
    </xsd:element>
</xsd:schema>

The Query Mechanism

The XML views created by the mapping in the annotated XSD schema are queried using the XML Path (XPath) language. XPath is a language designed primarily for addressing parts of an XML document, which utilizes syntax similar to hierarchical paths used to address parts of a filesystem. XPath also supports the use of functions for interacting with the data selected from the document.

XPath operates over four basic types: strings, numbers, Booleans, and node-sets. Node-sets are an unordered collection of nodes containing no duplicates. In the XPath data model, an XML document is represented as a tree of nodes consisting of a single root node and one or more elements, comments, namespaces, processing instructions, text, and attribute nodes.

The most important and commonly used construct in an XPath expression is the location path. A location path consists of a series of location steps. A location step consists of three parts; an axis that describes the relationship between the current (or context node) and the nodes being selected, a node test that specifies the type or name of the nodes to select, and zero or more predicates containing an expression that acts as a filter constraining the results of the location step.

Microsoft's SQLXML 3.0, for instance, supports a subset of the axes defined in XPath 1.0. The supported XPath axes are: parent, child, attribute, and self. If no axis is specified in a location step, then the default axis applied is the child axis. Thus, a location step that does not specify an axis has the expression evaluated on the children of the currently selected node.

The following are sample XPath queries with descriptions of the results they return:

/CustomerList queries the database for a list of all the customers. The customers are returned as several <Customer> elements within a <CustomerList> element.
/CustomerList/Customer[@CustomerID="3"] queries the database for a customer with a particular customer ID. The customer(s) is (are) returned as zero or more <Customer> elements.
/CustomerList/Customer[@CustomerID="5"]/Order[@OrderDate="1999-10-25"] queries the database for all orders made on the 25th of October 1999 by the customer whose customer ID is 5. The orders are returned as zero or more <Order> elements.

The "@" sign in the aforementioned queries is an abbreviation for the attribute axis that is used to select the attributes of the currently selected element node. The XPath queries may not return well-formed XML, especially in cases where multiple element nodes are returned by a particular query. This is because an XML document must consist of only one top-level element. To convert the results of such queries to well-formed XML requires wrapping the elements returned by the query in a single root element.

Code Example

You can create virtual XML views over a relational database using tools such as SQLXML 3.0, Microsoft SQL Server 2000, and Visual Studio .NET to create annotated XSD schemas, then programmatically query and otherwise manipulate the virtual XML view. Listing Five is C# source code that shows a console application that queries the database for all the customers from a specified company using the mapping schema we presented here.

Listing Five

using System;
using System.Collections;
using System.Xml;
using System.Xml.XPath;
using Microsoft.Data.SqlXml;
    public class CustomerDetails 
    {
static string NorthwindConnString = "Provider=SQLOLEDB;
          Server=SqlServerName;database=Northwind;Integrated Security=SSPI;";
        public static void Main(string[] args)
        {
        if(args.Length != 1){ 
            Console.WriteLine("Please specify the company name"); 
            return;
        } 
                // Get value from command line options
                string companyName = args[0];
                // Set up ADO.NET command to execute an xpath query and 
                // generate results using a mapping schema
                SqlXmlCommand cmd = new SqlXmlCommand(NorthwindConnString);
                cmd.CommandText = "/CustomerList/Customer[@Company=
                                                     " + companyName + "]";
                cmd.CommandType = SqlXmlCommandType.XPath;
                // Wrap results in a root element so that we get back 
                // well-formed XML in case multiple customers from one
                // company found
                cmd.RootTag = "Customers";
                cmd.SchemaPath = "MappingSchema.xsd";
                // Get the results as a XmlReader
                XmlReader Reader;
                Reader = cmd.ExecuteXmlReader();
                // Put the results into a DOM
                XmlDocument doc = new XmlDocument(); 
                doc.Load(Reader);
                // print the name and title of each customer 
                //from the specified company 
                  foreach(XmlNode n in doc.DocumentElement.ChildNodes){
                    Console.WriteLine("NAME: {0}\nTITLE: {1}\n", 
                            n.Attributes["Name"].Value, 
                            n.Attributes["Title"].Value);
                  }
        }
}

Conclusion

Most current XML usage centers on platform-neutral data exchange and leverages the ubiquitous nature of the format. However, the XML data model is also an excellent model for hierarchical, semistructured data, even in programming scenarios where interoperability is not required. Because most business objects are often semistructured, XML becomes a much more natural way to represent the underlying data that, in general, is stored in relational databases. By providing a virtual XML view over relational data, you are relieved of having to write a lot of "plumbing code" for querying and manipulating semistructured data, while still leveraging the advantages of the relational database system.

Andrew and Dare are software design engineers with the WebData group at Microsoft. They can be contacted at [email protected] and [email protected], respectively.