XML & Relational Databases

Andrew and Dare first examine the merits of an XML data model over the relational data model, then translate between the XML and relational query domains.


May 01, 2003
URL:http://www.drdobbs.com/database/xml-relational-databases/184405339

As XML's use increases, it is becoming a common requirement to view traditional relational data as XML. One way of doing this is to provide a virtual XML view over the relational database. This view is provided through the mechanism of annotating an XSD schema that describes the desired XML shape. These annotations contain relational metadata used to translate queries of the XML view into SQL queries, and to transform the SQL query results into the desired XML shape. In this article, we discuss the merits of choosing an XML data model over the more traditional relational data model. We also examine the mapping process and discuss the task of translating between the XML and relational query domains. Finally, we provide an example of mapping an XSD schema to a sample database and querying the virtual XML View.

Traditional relational database systems are excellent mechanisms for storing normalized, highly structured data. However, it is not always easy to query or manipulate this data model. Database developers have traditionally used the mechanism of relational views and stored procedures to abstract away the complexities of accessing the underlying data—in other words, providing a logical relational view. However, as the logical data model becomes increasingly semistructured (data with some variance), providing a relational view to program against becomes increasingly difficult.

In general, relational databases are excellent persistent storage mechanisms for storing highly structured, normalized data—"square data." But as anyone who has modeled real-world business objects knows, the real world is not square. For instance, consider a typical customer list where there is some variability between customers. One customer has a cell phone, work phone, and home phone; while another has a cell phone and pager number. A typical first attempt at a representation with relational tables uses a column for each property; see Table 1(a).

Table 1: (a) Relational table; (b) creating normalized tables from a relational table.

However, this approach leads to sparse tables with highly denormalized data and can potentially cause performance and scalability problems for a typical relational database. The solution for most databases is to pull out properties and place them in their own normalized table that has an N:1 relationship with the master table, as in Table 1(b). It would then be possible to further normalize the customer data by separating the PhoneNumbers table into separate tables based on type, like Table 2.

Table 2: Further normalizing customer data.

The results would be a highly normalized view of customers that would require users to join several tables to get the desired Customer objects. For most experienced DBAs and SQL users, this probably wouldn't be a problem. However, for most application developers, it would probably be easier to program against a logical Customer object without exposing the details of the underlying relational storage mechanism.

The traditional technique for abstracting the physical database storage is to create a relational view or stored procedure API. However, there is an alternative—creating a virtual XML view. The difference is that users see an XML representation of the relational data as opposed to relation rowsets. To illustrate the advantages, look at a possible XML representation of the Customer List in Listing One. In this view, you are not forced to think about tables, columns, and joins. You just know that Customers have first and last names, and several different types of phone numbers. In other words, the hierarchical nature of the XML representation implies a relationship between logical entities. Further, since XML is semistructured, there are no NULL valued columns—just the absence of elements or attributes.

Listing One
<CustomerList>
    <Customer LastName="Jones" FirstName="Jeff">
        <WorkPhone>5556767</WorkPhone>
        <CellPhone>5558888</CellPhone>
        <HomePhone>5553232</HomePhone>
</Customer> 
<Customer LastName="Smith" FirstName="Jane">
        <CellPhone>5552444</CellPhone>
        <Pager>5558989</Pager>
</Customer>
</CustomerList>

Typically, the Customers table has more than one customer and, since a lot of existing XML parsers need well-formed documents with a single root, you need to add a <CustomerList> wrapper element so that the list of customers has a single root. This is a common problem when mapping relational data to XML, and is addressed as part of the mapping process.

You can further extend the customer list by adding an order table that includes a many to one relationship with the Customers table; see Table 3.

Table 3: Extending customers list.

Now, if you wanted to get a customer list that included order data, you would have to join the two tables and deal with the key relationship and duplicate customer records in the resulting rowset in cases where a given customer has several orders. However, adding it to the XML representation is straightforward. In Listing Two, the containment relationship in the XML model between the <Customer> and <Order> elements implies a logical relationship—in this case, a "Has-A" relationship. Since the XML data model allows multiple subelements of the same type, a given Customer can contain multiple Orders. Also, we have created <Order> elements with some of the columns represented as attributes and some of the columns represented as elements. This is because the XML data model provides more flexibility in that we can map column values to attributes or element values, thereby decoupling users from the underlying layout of the relation tables.

Listing Two
<CustomerList>
<Customer LastName="Jones" FirstName="Jeff">
        <WorkPhone>5556767</WorkPhone>
        <CellPhone>5558888</CellPhone>
        <HomePhone>5553232</HomePhone>
        <Order Product="Apples" Quantity="25" >
            <Shipped>09/17/2001</Shipped>
        </Order>
</Customer> 
<Customer LastName="Smith" FirstName="Jane">
        <CellPhone>5552444</CellPhone>
        <Pager>5558989</Pager>
        <Order Product="Oranges" Quantity="25" >
            <Shipped>06/15/2001</Shipped>
        </Order>
        <Order Product="Tomatoes" Quantity="100" />
</Customer>
</CustomerList>

In general, there are four significant benefits in providing an XML abstraction over a relational database:

However, providing an extensible mechanism for viewing a relational database as XML is not a trivial task.

The Mapping Process

One way for a relational database to expose XML is via a standard SQL query with a canonical mapping. An obvious mapping is where each table in the FROM clause is represented as an XML element, and the columns listed in the SELECT clause are attributes of that element. For example, the results of this query:

select CustomerID, CompanyName, Contact-
     Name, ContactTitle, City from Customers

would be transformed into XML by iterating through the results rowset, creating a new Customer element for each row and adding an attribute for every non-Null column value.

Now, if users wanted to also return orders for a given customer, the query could be rewritten to include the appropriate order information via a join between the Customers and Orders tables:

select Customers.CustomerID, Customers.                                      CompanyName, 
  Customers.ContactName, Customers.
ContactTitle, 
Customers.City, Orders.
from Customers, Orders.OrderId, Orders.                                         OrderDate
where Customers.CustomerId = Orders.                                         CustomerId
order by Customers.CustomerId

Here, the <Order> elements are created in the same way as the <Customer> elements. The elements are then nested in the order in which the tables appear in the where clause. In this case, the <Order> elements are nested inside the <Customer> elements. When there are multiple orders for a given customer, a new <Customer> element is not created until a new customer, as identified by CustomerId, is found in the query results rowset. It is necessary to add an order by clause to the query so that correctly ordered results are generated.

That is the most obvious canonical representation of a rowset. You can imagine other similar heuristics for generating XML; for example, columns could map to subelements instead of attributes. Although, this mapping technique seems like an effective technique, there are a couple of problems with it.

One alternative to canonical mapping with posttransformations is to provide a virtual XML view over the relational database. In other words, view the database as a collection of XML documents instead of relational objects. A popular technique for describing XML shapes is through the use of an XSD schema. To establish the XML view of a relational database, you could annotate the XSD schema with relational metadata that could be used to translate queries over the XML view to relational queries against the underlying database. These relational queries could then produce results that were already shaped for transformation to XML. In other words, since the desired XML shape is known ahead of time, it would be possible to construct a relational query that returned a rowset that was designed for easy conversion to XML.

Listing Three is a simple XSD schema that describes a typical Customer List.

Listing Three
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:dt="urn:schemas-microsoft-com:datatypes">
    <xsd:element name="CustomerList" >
        <xsd:complexType>
                <xsd:sequence>
                    <xsd:element ref="Customer" />
                </xsd:sequence>
            </xsd:complexType>
    </xsd:element>  
    <xsd:element name="Customer" >
        <xsd:complexType>
                <xsd:sequence>
                    <xsd:element name="Order" maxOccurs="unbounded" >
                        <xsd:complexType>
                            <xsd:attribute name="OrderID" type="xsd:string"/>
                            <xsd:attribute name="OrderDate" type="xsd:date"/>
                    </xsd:complexType>
                    </xsd:element>
                </xsd:sequence>
                <xsd:attribute name="CustomerID" type="xsd:string"/>
                <xsd:attribute name="Company" type="xsd:string"/>
                <xsd:attribute name="Name" type="xsd:string"/>
                <xsd:attribute name="Title" type="xsd:string"/>
                <xsd:attribute name="City" type="xsd:string"/>
        </xsd:complexType>
    </xsd:element>
</xsd:schema>

To annotate this schema, the following constructs will be used:

In the annotated XSD schema in Listing Four, a namespace was included to identify the mapping annotations added to the XSD schema. Next, all the attributes and elements that were known to map to columns in the database were identified using the msdata:field annotation. The remaining elements in the schema are mapped to tables in the database using the msdata:relation annotation. Because of scoping rules, the msdata:relation annotation also applies to all the column mapped attributes and subelements. Next, it is necessary to specify how the elements mapped to tables will be joined. In the xsd:annotation section, the relationship is defined to specify the same join information. To be able to use this relationship elsewhere in the document, it is given an alias. The msdata:relationship annotation is then used to specify how the hierarchy of the XML view is built when elements come from different tables. For example, to nest Order information inside of Customer information, the CustomerOrder msdata:relationship specifies the join between the underlying tables in the database. Finally, to enclose the query results in a wrapper element, the <CustomerList> element is annotated with the msdata:is-constant annotation to indicate that it is not mapped to anything in the database.

Listing Four

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:dt="urn:schemas-microsoft-com:datatypes"
            xmlns:msdata="urn:schemas-microsoft-com:mapping-schema" >
   <xsd:annotation>
            <xsd:appinfo>
                <msdata:relationship name="CustomerOrder"                
                parent="Customers"
                        parent-key="CustomerID"
                        child-key="CustomerID"
                        child="Orders" />
            </xsd:appinfo>
    </xsd:annotation>
    <xsd:element name="CustomerList" msdata:is-constant="true" >
        <xsd:complexType>
                <xsd:sequence>
                    <xsd:element ref="Customer" />
                </xsd:sequence>
            </xsd:complexType>
    </xsd:element>  

    <xsd:element name="Customer" msdata:relation = 
                          "Customers"  msdata:key-fields="CustomerID" >
        <xsd:complexType>
                <xsd:sequence>
                    <xsd:element name=
                     "Order" maxOccurs="unbounded" msdata:relation = "Orders" 
                    msdata:relationship="CustomerOrder" >
                        <xsd:complexType>
                        <xsd:attribute name="OrderID" msdata:field="OrderID"
                            type="xsd:string"/>
                            <xsd:attribute name=
                                        "OrderDate" msdata:field="OrderDate"
                            type="xsd:date"/>   
                    </xsd:complexType>
                    </xsd:element>
                </xsd:sequence>
                <xsd:attribute name="CustomerID" msdata:field=
                                          "CustomerId" type="xsd:string"/>
                <xsd:attribute name="Company" msdata:field=
                                           "CompanyName"  type="xsd:string"/>
                <xsd:attribute name="Name" msdata:field=
                                           "ContactName" type="xsd:string"/>
                <xsd:attribute name="Title" msdata:field=
                                           "ContactTitle" type="xsd:string"/>
                <xsd:attribute name="City" msdata:field=
                                           "City" type="xsd:string"/>
        </xsd:complexType>
    </xsd:element>
</xsd:schema>

The Query Mechanism

The XML views created by the mapping in the annotated XSD schema are queried using the XML Path (XPath) language. XPath is a language designed primarily for addressing parts of an XML document, which utilizes syntax similar to hierarchical paths used to address parts of a filesystem. XPath also supports the use of functions for interacting with the data selected from the document.

XPath operates over four basic types: strings, numbers, Booleans, and node-sets. Node-sets are an unordered collection of nodes containing no duplicates. In the XPath data model, an XML document is represented as a tree of nodes consisting of a single root node and one or more elements, comments, namespaces, processing instructions, text, and attribute nodes.

The most important and commonly used construct in an XPath expression is the location path. A location path consists of a series of location steps. A location step consists of three parts; an axis that describes the relationship between the current (or context node) and the nodes being selected, a node test that specifies the type or name of the nodes to select, and zero or more predicates containing an expression that acts as a filter constraining the results of the location step.

Microsoft's SQLXML 3.0, for instance, supports a subset of the axes defined in XPath 1.0. The supported XPath axes are: parent, child, attribute, and self. If no axis is specified in a location step, then the default axis applied is the child axis. Thus, a location step that does not specify an axis has the expression evaluated on the children of the currently selected node.

The following are sample XPath queries with descriptions of the results they return:

The "@" sign in the aforementioned queries is an abbreviation for the attribute axis that is used to select the attributes of the currently selected element node. The XPath queries may not return well-formed XML, especially in cases where multiple element nodes are returned by a particular query. This is because an XML document must consist of only one top-level element. To convert the results of such queries to well-formed XML requires wrapping the elements returned by the query in a single root element.

Code Example

You can create virtual XML views over a relational database using tools such as SQLXML 3.0, Microsoft SQL Server 2000, and Visual Studio .NET to create annotated XSD schemas, then programmatically query and otherwise manipulate the virtual XML view. Listing Five is C# source code that shows a console application that queries the database for all the customers from a specified company using the mapping schema we presented here.

Listing Five
using System;
using System.Collections;
using System.Xml;
using System.Xml.XPath;
using Microsoft.Data.SqlXml;
    public class CustomerDetails 
    {
static string NorthwindConnString = "Provider=SQLOLEDB;
          Server=SqlServerName;database=Northwind;Integrated Security=SSPI;";
        public static void Main(string[] args)
        {
        if(args.Length != 1){ 
            Console.WriteLine("Please specify the company name"); 
            return;
        } 
                // Get value from command line options
                string companyName = args[0];
                // Set up ADO.NET command to execute an xpath query and 
                // generate results using a mapping schema
                SqlXmlCommand cmd = new SqlXmlCommand(NorthwindConnString);
                cmd.CommandText = "/CustomerList/Customer[@Company=
                                                     " + companyName + "]";
                cmd.CommandType = SqlXmlCommandType.XPath;
                // Wrap results in a root element so that we get back 
                // well-formed XML in case multiple customers from one
                // company found
                cmd.RootTag = "Customers";
                cmd.SchemaPath = "MappingSchema.xsd";
                // Get the results as a XmlReader
                XmlReader Reader;
                Reader = cmd.ExecuteXmlReader();
                // Put the results into a DOM
                XmlDocument doc = new XmlDocument(); 
                doc.Load(Reader);
                // print the name and title of each customer 
                //from the specified company 
                  foreach(XmlNode n in doc.DocumentElement.ChildNodes){
                    Console.WriteLine("NAME: {0}\nTITLE: {1}\n", 
                            n.Attributes["Name"].Value, 
                            n.Attributes["Title"].Value);
                  }
        }
}

Conclusion

Most current XML usage centers on platform-neutral data exchange and leverages the ubiquitous nature of the format. However, the XML data model is also an excellent model for hierarchical, semistructured data, even in programming scenarios where interoperability is not required. Because most business objects are often semistructured, XML becomes a much more natural way to represent the underlying data that, in general, is stored in relational databases. By providing a virtual XML view over relational data, you are relieved of having to write a lot of "plumbing code" for querying and manipulating semistructured data, while still leveraging the advantages of the relational database system.


Andrew and Dare are software design engineers with the WebData group at Microsoft. They can be contacted at [email protected] and [email protected], respectively.

Terms of Service | Privacy Statement | Copyright © 2024 UBM Tech, All rights reserved.