XQuery Your Office Documents

A key benefit of ODF and OOXML for developers is the reuse of existing standards


October 21, 2007
URL:http://www.drdobbs.com/open-source/xquery-your-office-documents/202401913

Marc Van Cappellen is a Director of Development at DataDirect Technologies. Marc has more than 15 years of experience in various data access technologies including ODBC, JDBC, ADO, XQJ, SQL and XQuery.


Over the past year we have seen office documents -- spreadsheets, wordprocessing documents, and the like -- taking a more prominent role in business processes, workflows, and vertical applications. While office documents have always been common place, their proprietary binary formats have made them all but unusable to businesses which have sought to integrate office documents in their processes. New office document standards like the OpenDocument Format(ODF) and Office Open XML (OOXML), however, are making office document integration in business processes a reality.

A key benefit of ODF and OOXML for developers is the reuse of existing standards -- in essence, your office documents are XML documents, which makes available a complete palette of tools for manipulating these documents and the information they contain. Using tools and technologies available today, you can transform office documents to HTML or PDF, store them in an XML database, shred their information and store it in a relation databases, embed SOAP messages, enrich them with external information, and so on.

How? XQuery, That's How!

XQuery is a query language for XML. A W3C recommendation since January 2007, XQuery is already widely supported, with over 50 open source and commercial implementations. So why not query your office documents using XQuery?

If you look close at ODF or OOXML documents, you notice that they are in fact not XML documents, but rather a number of XML documents packaged in a ZIP file. Figure 1 is an example of an Office Open XML document opened with WinZip.

Figure 1: Office Open XML document opened with WinZip.

A similar document, saved as an OpenDocument Format document, might look like Figure 2.

Figure 2: An OpenDocument Format document,

You have great XQuery implementations to query and transform XML documents, but office documents are packaged inside ZIP files. How can you make them available to XQuery implementation? For Java-based applications, an easy solution is to use the standard fn:doc XQuery function.

Among the standard URL schemes like file: and http:, your Java virtual machine also support the jar: URL scheme. A jar archive is considered "a zip archive with logical extensions," the "logical extensions" being special files like manifest.mf or the META-INF directory located in the archives. But physically these are just zip archives, and as such you can use the jar: URL scheme to read them.

For example, to access word/document.xml in the Office Open XML document coolstuff.docx, you would use this fn:doc call:

doc('jar:file:///C:/coolstuff.docx!/word/document.xml') 

Similarly, to access content.xml in OpenDocument coolstuff.odt, use:

doc('jar:file:///C:/coolstuff.odt!/content.xml') 

This approach is not product specific -- almost any Java-based XQuery implementation that can query XML documents on a local filesystem can handle this pattern. Examples of two such popular XQuery implementations include DataDirect XQuery and Saxon.

Now that you know how to access the office documents you want to query, let's use XQuery to extract all of John's comments, ordered by date. The OOXML format stores comments in the comments.xml document; see the example provided here.

Note that the query (shown below) is all standard XQuery, without any proprietary extensions. The office document is accessed using the jar: URL scheme standard that is supported by your Java environment.

declare namespace w =
  "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
declare variable $office_doc :=
  doc('jar:file:///C:/coolstuff.docx!/word/comments.xml');
<all_john_comments>{
  for $comment in $office_doc/*/w:comment
  where $comment/@w:author = 'John'
  order by xs:dateTime($comment/@w:date)
  return
    <comment date="{$comment/@w:date}">{
      $comment//text()
    }</comment>
}</all_john_commentsg> 

The ODF, on the other hand, embeds the comments within the actual document, which is always named content.xml; see the example provided here.

As the ODF document format differs from the OOXML document format, so does the query you use to reach the embedded user comments:

declare namespace office =
  "urn:oasis:names:tc:opendocument:xmlns:office:1.0";
declare namespace dc = 
  "http://purl.org/dc/elements/1.1/";
declare namespace text = 
  "urn:oasis:names:tc:opendocument:xmlns:text:1.0";
declare variable $office_doc :=
  doc('jar:file:///C:/coolstuff.odt!/content.xml');
{
  for $comment in $office_doc/*/office:body//office:annotation
  where $comment/dc:creator = 'John'
  order by xs:dateTime($comment/dc:date)
  return
    <comment date="{$comment/dc:date}">{
      $comment//text:p//text()
    }</comment>
}</all_john_comments> 

Transforming Office Documents

The ODF reuses concepts throughout the different document formats it supports. For example, the definition of a table in a spreadsheet is almost equivalent to a table embedded in a text document. Figure 3 includes a table listing used cars, including make, model, year of manufacture, and number of miles.

Figure 3: A Table in ODF

In ODF the content of your office document is stored in the content.xml document. Availble here here is the example for the table in Figure 3.

But even consistent use of table definitions isn't of much use to us unless you have a way to get at the data they hold. Fortunately, XQuery supports user-defined functions and function libraries, like this one, which transforms an ODF table into HTML format:

declare namespace table= 
  "urn:oasis:names:tc:opendocument:xmlns:table:1.0" ;
declare function local:TableToHTML(
                      $table as element(table:table))as element(table){
  <table border="1">{
    for $row in $table/table:table-row
    return
      <tr>{
        for $cell in $row/table:table-cell
        return
          <td>{$cell//text()}</td>
      }</tr>
  }</table>
};

Once you have declared such a user-defined function in a module, you can use it to query ODF documents:

declare namespace table= 
  "urn:oasis:names:tc:opendocument:xmlns:table:1.0" ;
declare variable $office_doc :=
  doc('jar:file:///C:/usedcars.odt!/content.xml');
<html>{
  for $table in $office_doc//table:table
  return
    local:TableToHTML($table)
}</html>

This query constructs an HTML document with all tables from the ODF text document, usedcars.odt. Figure 4 is an example of the result, usedcars.html.

Figure 4: The results of a query that constructs an HTML document.

And by simply changing the container document you want to access, you can use the same function to query your ODF spreadsheet, usedcars.ods. Look similar?

declare namespace table= 
  "urn:oasis:names:tc:opendocument:xmlns:table:1.0" ;
declare variable $office_doc :=
  doc('jar:file:///C:/usedcars.ods!/content.xml');
<html>{
  for $table in $office_doc//table:table
  return
    local:TableToHTML($table)
}</html> 

Combining Office Documents with External Data

Not only can you apply XQuery to query and transform office documents; you can also use XQuery to meet other business needs, like data integration. Consider the previous used cars example, but now we need to produce an enriched HTML report that lists both new and used prices. In addition, this information is stored not in an office document, but in a relational database table name "usedcars"; see Figure 5.

Figure 5: Data storage in an RDBMS.

Several XQuery implementations offer access to data stores like relational databases. Using such a product enables developers not only to combine data from different data sources, but to use that data to enrich office documents and other XML sources.

In the following query, the relational table is accessed through the fn:collection function, and the ODF spreadsheet usedcars.ods is accessed through the jar: URL scheme. In addition to the enrichment, the query also sorts the cars by the year of manufacturing.

declare namespace table= 
  "urn:oasis:names:tc:opendocument:xmlns:table:1.0" ;
declare variable $office_doc :=
  doc('jar:file:///C:/usedcars.ods!/content.xml');
declare function local:TableToHTML(
            $table as element(table:table)) as element(table) {
  <table border="1">{
    <tr>
      <th>Make</th>
      <th>Model</th>
      <th>Year</th>
      <th>Mileage</th>
      <th>List price</th>
      <th>Used price</th>
    </tr>,
    for $row in $table/table:table-row[position() > 1]
    let $make := $row/table:table-cell[1]
    let $model := $row/table:table-cell[2]
    let $year := $row/table:table-cell[3]
    let $milleage := $row/table:table-cell[4]
    let $carinfo := collection("dbo.usedcars")/usedcars[make=$make]
                                                       [model=$model]
                                                       [year=$year]
    order by xs:int($year)
    return
      <tr>
        <td>{$make//text()}</td>
        <td>{$model//text()}</td>
        <td>{$year//text()}</td>
        <td>{$milleage//text()}</td>
        <td>{
          if($carinfo) then 
            $carinfo/listprice/text()
          else
            'unknown'
        }</td>
        <td>{
          if($carinfo) then
            $carinfo/usedprice/text()
          else
            'unknown'
        }</td>
      </tr>
  }</table>
};
<html>{
  for $table in $office_doc//table:table
  return
    local:TableToHTML($table)
}</html> 

This query might yield the results in Figure 6.

Figure 6: Results of query.

Conclusions

The XML foundation of office documents has opened up new avenues for document and data integration and manipulation. Data once locked in proprietary binary formats can now be easily leveraged using familiar XML tools and technologies, like XQuery, and existing documents can be combined with data from other sources, such as relational databases, and transformed to create new documents in formats like PDF and HTML.

Terms of Service | Privacy Statement | Copyright © 2024 UBM Tech, All rights reserved.