Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Web Development

Lore: A Database Management System for XML

Roy Goldman, and

, April 01, 2000


Apr00: Lore: A Database Management System for XML

Jennifer is a professor, and Roy and Jason are Ph.D. candidates in the department of computer science at Stanford University. The Lore team can be contacted at [email protected], and more information about Lore and related XML data management projects can be found at http://www-db.stanford.edu/lore/.


HTML revolutionized the way we specify the appearance of data on the Internet. Today, XML (the eXtensible Markup Language) is changing the way we specify the meaning of data. As XML continues to proliferate, new technologies are needed for managing, searching, and querying XML data. At Stanford University, we have developed Lore, a database management system (DBMS) designed specifically for XML. In the same way that SQL enables powerful queries in a relational DBMS, Lore provides the query language Lorel for issuing expressive queries over XML data. The Lore query engine uses specialized indexes and a cost-based query optimizer to process each query as efficiently as possible. Lore also includes a new keyword-based search technique that takes full advantage of XML's structured data format.

Lore is a multiuser DBMS, supporting crash recovery, materialized views, bulk loading of sets of related XML documents, and a declarative update language. Lore also has an external data manager that enables XML data from external sources to be fetched dynamically and combined with local data during query processing. All in all, the Lore system is composed of over 150,000 lines of C++ code. Lore executables for Linux and Sun Solaris are available for public use at http://www-db.stanford .edu/lore/. An online demo lets you query, search, and browse sample XML databases. If you are working on a project that requires source code, please contact us at [email protected].

Overview of XML

XML lets document authors define their own markup tags and attribute names to assign meaning to the data elements in the document. Further, XML elements can be nested and include references to indicate data relationships, as in Listing One.

Unlike HTML, XML markup tags do not describe how to render the data. Rather, they provide descriptions of the data, allowing software to understand the meaning of the data automatically. (XSL, the eXtensible Stylesheet Language, is a separate language for specifying the presentation of XML documents. See "Rendering XML Documents Using XSL," by Sean McGrath, DDJ, July 1998.) If we store data for many movies in the aforementioned format, it becomes possible for software systems to answer queries such as "What were the titles of the 1999 films that grossed more than $200 million in the USA?" Without XML, it is more difficult to identify the meaning of data in a document. For example, if the number "$100,000,000" appears somewhere in an HTML document, is it referring to a movie's revenue or budget? It may be possible to guess the correct answer based on document context, but such techniques are error-prone and brittle. XML markup eliminates such uncertainty, letting content providers specify the meaning of their data as they create it, and allowing content consumers to easily process the data automatically.

Well-formed XML does not require tags or attributes to be predeclared, and XML places no limits on how you nest different elements. The two movies in our example have some (but not all) attributes and subelements in common. For instance, we have only domestic revenue data for one movie; the same movie lacks information about its producers, and the other movie doesn't specify its running time. Sometimes, it may be desirable to restrict the structure of XML data. XML authors can optionally supply a Document Type Definition (DTD) to serve as a grammar that valid XML documents must adhere to.

XML documents are essentially tree structured, as dictated by the nesting of elements. However, real-world data often is graph structured -- a single data element may be referenced in multiple ways. Though we omit the details here, XML offers special ID and IDREF attribute types to enable any XML element to reference another element. Further, proposed standards such as XLink (see "XLink: The XML Linking Language," by Sean McGrath, DDJ, December 1998) and XPointer permit references across documents. Thus, we can generically view any set of XML documents as a directed, ordered graph of interrelated data elements and attributes.

Querying XML with Lore

Despite the popularity and power of both relational databases and search engines, neither technology is completely appropriate for XML. Traditionally, business data has been stored in relational databases, and SQL is the standard query language. Relational databases generally require data to be much more rigidly structured than XML. A database schema must be designed carefully before any data is loaded. It is difficult to manage data that does not adhere to predefined table structures, and if the structure of the data changes in any way, then the schema must be modified as well. While object-oriented databases do not require rigid table structures, they still rely heavily on predefined, immutable schemas. At the other extreme, search engines are extremely efficient at indexing many documents for keyword-based searches. Unfortunately, search engines are based on linear text and cannot take advantage of the rich structure and data relationships afforded by XML.

Five years ago, we set out to build a new DBMS designed specifically for semistructured data (such as XML, though it did not exist at the time) -- data that includes some structure beyond plain text, yet is irregular or changes too often to be modeled easily as relational or object-oriented data. Our system is called "Lore," and is a rich prototype DBMS for managing XML data. Lore decomposes XML data into its individual elements and attributes, storing a graph physically on disk. In Lore, DTDs are not required; that is, the XML loaded into Lore need not conform to any predefined grammar or schema.

Just as SQL is used to query relational databases, we have developed "Lorel," a declarative query language for XML. Lorel leverages SQL's familiar Select/From/ Where syntax, augmenting the language with powerful path expressions for specifying traversals through the XML graph. Lorel is closely related to OQL, the standard query language for object-oriented database systems.

To illustrate, the following query expresses the aforementioned English query: Find the titles of 1999 films with domestic revenue over $200 million. We assume that all movies in our data set follow the general format in Listing One.

SELECT M.Title

FROM MovieDB.Movie M, M.Revenue R

WHERE M.Year = 1999 AND R.Type = "Domestic" AND R.USD > 200000000

To understand this query, it's easiest to start with the FROM clause, which iterates over each MovieDB.Movie path, binding the Movie to the variable M. For each element bound to M, we bind each of its Revenue elements to R. Given each pair of bindings, we check the filtering conditions in the WHERE clause. If the pair of bindings passes the checks, we return the title of the movie as the query result in the SELECT clause.

By default, Lorel queries traverse both XML elements and attributes with matching tags or attribute names. In the aforementioned query, R.USD refers to the USD attribute and all USD subelements of the element bound to R. Lorel provides an optional syntax for restricting traversals to elements or attributes only.

Lore doesn't assume that any particular structure or patterns exists within the XML data. Hence, Lorel includes some features to help users write queries when they aren't exactly sure of the database's structure. For example, the following Lorel query finds movies that involve "George Lucas" in any way.

SELECT M

FROM DB.Movie M

WHERE M.# = "George Lucas"

The # is a shortcut that represents any number of any tags or attribute names -- if there is some path in the data graph from a movie element to the string "George Lucas," the query returns that movie.

Lorel also gracefully handles situations where data is incomplete. If a path mentioned in a query does not appear in the data, the system does not generate an error. Rather, the system just ignores that path expression and continues to process the rest of the query. Lore does have a warning system that can diagnose queries when unexpected results are returned. Among other features, the warning system can notify users when an nonexistent path expression has been entered.

Lorel is a powerful language with numerous features not illustrated here, including an expression calculator, subqueries, existential and universal quantification, and aggregation.

Query Execution

In any DBMS, the query engine is responsible for translating declarative queries into query results. The heart of the Lore query engine involves matching path expressions in the query to paths in the database. A naive approach would be a top-down execution strategy: Starting at the root of our database, we could exhaustively check all possible paths to find the elements that satisfy our query. This approach can be extremely inefficient, however, depending on the database and/or the query.

As with typical DBMSs, Lore supports indexes to speed up specific types of database accesses. For example, users might create an index on movie revenue values to improve the performance of the aforementioned query that finds the high-grossing films. The disadvantage of an index is that database updates become more expensive, because the index must be updated as well. Still, indexes can offer dramatic improvements in query performance.

Lore supports several different types of indexes. The value index enables fast location of values satisfying specified conditions (as with movie revenue). The edge index quickly locates elements or attributes with a specific tag or name. The path index enables fast identification of all elements reached by a specific path (all DB.Movie.Revenue elements, for instance).

In Lore, data is stored as a directed graph: The physical links between elements are only in one direction. To enable backward navigation, Lore has one more index -- the link index -- which returns the parents of a given XML element. Lore also has several specialized indexes to enhance keyword-based searches.

By leveraging these indexes, the Lore execution engine can consider many strategies beyond the top-down approach. As the next example demonstrates, the best execution strategy for a given query depends on the shape of the database being queried.

Consider the query to "find all elements reachable via A.B.C whose value is 5." In Figure 1(a), there is only one path matching the tags A.B.C, and the value of the element reachable along that path is indeed 5. In this case, top-down evaluation from the root is the best strategy, since we won't explore any paths that don't satisfy the query.

Figure 1(b) shows a database where there are many A.B.C paths. In this case, a top-down execution strategy will perform poorly and will in fact be forced to examine the entire database. Suppose, however, that you first use a value index to locate all elements with value 5, and then traverse the graph backwards to find all matching paths. We call this a "bottom-up" strategy. For the database in Figure 1(b), such a strategy is ideal, because you again need only check one (correct) path to find the query result. In Figure 1(a), a bottom-up strategy would perform poorly: Our index would return all leaf elements, and we would have to traverse the database all the way up to the root before eliminating most of the paths from the query result.

Finally, Figure 1(c) shows a sample database where neither top-down nor bottom-up performs well. A top-down strategy will have to check many A.B.C paths. A bottom-up strategy begins with just one C element with the value 5, but as it traverses backwards to identify matching paths, it will incorrectly follow many B parents that themselves have D parents rather than the A parents needed to match the query. Indeed, in this case, a hybrid strategy is best. First, we traverse top-down to locate all elements reachable by A.B. Next, we use an index to find the C element with value 5, but we traverse only one step backwards; that is, we do not traverse the B edges. Finally, we take the intersection of our two intermediate results to form the single complete path in our answer.

In general, there may be many different options for executing a query. Lore has a query optimizer that evaluates many different strategies and ultimately chooses the query plan that it predicts will yield the best performance. The strategies considered by the query optimizer are highly dependent on which indexes have been created for a given database. For example, if we hadn't created any indexes on the databases in Figure 1, we would have no choice but to use the top-down execution plan. It is just as important that the optimizer has some idea of the shape of the database so that it can choose the best execution strategy. To this end, Lore has a module that periodically gathers important statistics about the database. This statistics module can answer questions for the optimizer such as:

  • If we traverse A.B, what are the odds of finding a C?
  • What is the distribution of integer C values in the database?

  • How many instances are there of the A.B.C?

By combining such information with a cost model for predicting the time required for different execution strategies, the optimizer works to select the best execution strategy for any query and any state of the database.

DataGuides

Neither XML nor Lore requires the structure of the data to be fixed ahead of time (that is, a DTD is not required). Without some idea of the tag and attribute patterns in a database, it can be difficult for users to formulate meaningful queries. Similarly, the execution engine also needs some understanding of a database's structure in order to process a query efficiently. Hence, Lore supports a novel feature called the "DataGuide" -- a concise, dynamically generated structural summary of an XML database. In many ways, a DataGuide serves the role of a DTD or a database schema: End users and the system itself can consult the DataGuide for useful information about the overall structure of the database. However, in Lore, the relationship is flipped around. Instead of data being required to conform to the schema or DTD, the system automatically maintains the DataGuide such that it always accurately summarizes the current structure of the database.

A DataGuide can itself be represented in XML. Each tag/attribute path in the original database appears exactly once in the DataGuide, and every path in the DataGuide appears in the original database. For our Movie data set, Listing Two represents its DataGuide.

Even though our original data set might contain thousands of movies, the path MovieDB.Movie appears only once in the DataGuide. Because DataGuide elements and attributes represent many database instances, we leave actual data values out of the DataGuide.

The DataGuide has the useful property that allows you to check whether or not a path exists in the database in time proportional to the length of the path independent of the size of the database. Also, we can use the DataGuide to store many of the statistics needed for query processing, and we use it to support the path index mentioned earlier. From a user's perspective, DataGuides are a visible component of our web-based user interface to Lore: A Java applet lets users explore the DataGuide interactively and then formulate Lorel queries directly from it.

Keyword and Proximity Search

While declarative languages such as Lorel enable expressive queries, they are hardly aimed at the typical end user. An interactive DataGuide makes it easier to formulate queries, but the Web has shown that casual users are comfortable typing keywords to initiate a search. To this end, Lore has a specialized text index that, given a specific keyword, can quickly identify data containing that keyword in its tag, attribute name, or atomic value.

For XML, a much more interesting issue is how to handle a search over multiple keywords. If you are interested in movies involving "Liam Neeson," you might try to search for "movie near Neeson." Search engines such as AltaVista handle such a search by finding HTML pages that contain both words close to each other in the document text. But in XML, the structural relationships may be much more important than the physical layout of the XML text. In our small example, the string "Liam Neeson" is actually fewer characters away from The Sixth Sense than from Star Wars. But if you take the structure of the data into account, you know that Neeson is closely related to Star Wars and unrelated to The Sixth Sense.

Generalizing this example, you would like the near operator in XML to measure proximity based on the logical relationships between data elements instead of by text alone. Lore includes an innovative approach for performing this kind of structural proximity search, where the distance between elements is measured as the shortest path between those elements in the database graph. Usage experiments show that this definition does a good job of matching our intuition for what the near operator should do in XML. Computing the shortest paths among database elements can be very expensive computationally. Hence, we have developed a specialized compact index that enables fast proximity searches.

References and Related Work

An overview of XML is available at http:// www.w3.org/TR/WD-xml-lang-970331.html, December 1997. The Lore system is described in J. McHugh, S. Abiteboul, R. Goldman, D. Quass, and J. Widom: "Lore: A Database Management System for Semistructured Data." SIGMOD Record, 26(3):54-66, September 1997. The Lorel language is covered in detail in S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wiener: "The Lorel Query Language for Semistructured Data." International Journal on Digital Libraries, 1(1):68-88, April 1997.

Originally, Lore was designed for a data model that is similar, but not identical, to XML. (XML did not exist at the time Lore was conceived.) The changes required to migrate Lore to fully support XML are described by R. Goldman, J. McHugh, and J. Widom in "From Semistructured Data to XML: Migrating the Lore data Model and Query Language," in Proceedings of the 2nd International Workshop on the Web and Databases (WebDB '99), June 1999.

Lore' s query optimizer, DataGuides, and proximity search feature, are described at the Lore web site, http://www-db.stanford .edu/lore. There are other active efforts toward developing query languages for XML. XQL has been supported by many companies. See J. Robie, J. Lapp, and D. Schach's "XML Query Language (XQL)," in Proceedings of QL '98: The Query Languages Workshop, December 1998. Papers are available at http://www.w3.org/TandS/ QL/QL98/.

XML-QL is another XML query language. See A. Deutsch, M. Fernandez, D. Florescu, A. Levy, and D. Suciu, "A Query Language for XML," in Proceedings of the Eighth International World Wide Web Conference (WWW8), 1999. Lorel, XQL, and XML-QL have many common features; we anticipate that any future XML query language standardization efforts will borrow heavily from these languages.

DDJ

Listing One

<MovieDB>
   <Movie Title = "Star Wars: Episode I - The Phantom Menace" Year = "1999">
        <Director>George Lucas</Director>
        <Producer>George Lucas</Producer>
        <Producer>Rick McCallum</Producer>
        <Writer>George Lucas</Writer>
        <Actor>Ewan McGregor</Actor>
        <Actor>Liam Neeson</Actor>
        <Revenue Type="Domestic" USD="427600000"/>
        <Revenue Type="International" USD="352100000"/>
    </Movie>
    <Movie Title="The Sixth Sense" Year="1999" Runtime= "106">
        <Director>M. Night Shyamalan</Director>
        <Writer>M. Night Shyamalan</Writer>
        <Actor>Bruce Willis</Actor>
        <Revenue Type="Domestic" USD="264000000"/>
    </Movie>
</MovieDB>

Back to Article

Listing Two

</MovieDB>
    <Movie Title="" Year="" Runtime= "">
        <Director/>
        <Writer/>
        <Actor/>
        <Revenue Type="" USD=""/>
    </Movie>
</MovieDB>

Back to Article


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.