Channels ▼
RSS

Open Source

XQuery Your Office Documents


Marc Van Cappellen is a Director of Development at DataDirect Technologies. Marc has more than 15 years of experience in various data access technologies including ODBC, JDBC, ADO, XQJ, SQL and XQuery.


Over the past year we have seen office documents -- spreadsheets, wordprocessing documents, and the like -- taking a more prominent role in business processes, workflows, and vertical applications. While office documents have always been common place, their proprietary binary formats have made them all but unusable to businesses which have sought to integrate office documents in their processes. New office document standards like the OpenDocument Format(ODF) and Office Open XML (OOXML), however, are making office document integration in business processes a reality.

A key benefit of ODF and OOXML for developers is the reuse of existing standards -- in essence, your office documents are XML documents, which makes available a complete palette of tools for manipulating these documents and the information they contain. Using tools and technologies available today, you can transform office documents to HTML or PDF, store them in an XML database, shred their information and store it in a relation databases, embed SOAP messages, enrich them with external information, and so on.

How? XQuery, That's How!

XQuery is a query language for XML. A W3C recommendation since January 2007, XQuery is already widely supported, with over 50 open source and commercial implementations. So why not query your office documents using XQuery?

If you look close at ODF or OOXML documents, you notice that they are in fact not XML documents, but rather a number of XML documents packaged in a ZIP file. Figure 1 is an example of an Office Open XML document opened with WinZip.

Figure 1: Office Open XML document opened with WinZip.

A similar document, saved as an OpenDocument Format document, might look like Figure 2.

Figure 2: An OpenDocument Format document,

You have great XQuery implementations to query and transform XML documents, but office documents are packaged inside ZIP files. How can you make them available to XQuery implementation? For Java-based applications, an easy solution is to use the standard fn:doc XQuery function.

Among the standard URL schemes like file: and http:, your Java virtual machine also support the jar: URL scheme. A jar archive is considered "a zip archive with logical extensions," the "logical extensions" being special files like manifest.mf or the META-INF directory located in the archives. But physically these are just zip archives, and as such you can use the jar: URL scheme to read them.

For example, to access word/document.xml in the Office Open XML document coolstuff.docx, you would use this fn:doc call:

doc('jar:file:///C:/coolstuff.docx!/word/document.xml') 

Similarly, to access content.xml in OpenDocument coolstuff.odt, use:

doc('jar:file:///C:/coolstuff.odt!/content.xml') 

This approach is not product specific -- almost any Java-based XQuery implementation that can query XML documents on a local filesystem can handle this pattern. Examples of two such popular XQuery implementations include DataDirect XQuery and Saxon.

Now that you know how to access the office documents you want to query, let's use XQuery to extract all of John's comments, ordered by date. The OOXML format stores comments in the comments.xml document; see the example provided here.

Note that the query (shown below) is all standard XQuery, without any proprietary extensions. The office document is accessed using the jar: URL scheme standard that is supported by your Java environment.

declare namespace w =
  "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
declare variable $office_doc :=
  doc('jar:file:///C:/coolstuff.docx!/word/comments.xml');
<all_john_comments>{
  for $comment in $office_doc/*/w:comment
  where $comment/@w:author = 'John'
  order by xs:dateTime($comment/@w:date)
  return
    <comment date="{$comment/@w:date}">{
      $comment//text()
    }</comment>
}</all_john_commentsg> 

The ODF, on the other hand, embeds the comments within the actual document, which is always named content.xml; see the example provided here.

As the ODF document format differs from the OOXML document format, so does the query you use to reach the embedded user comments:

declare namespace office =
  "urn:oasis:names:tc:opendocument:xmlns:office:1.0";
declare namespace dc = 
  "http://purl.org/dc/elements/1.1/";
declare namespace text = 
  "urn:oasis:names:tc:opendocument:xmlns:text:1.0";
declare variable $office_doc :=
  doc('jar:file:///C:/coolstuff.odt!/content.xml');
<all_john_comments>{
  for $comment in $office_doc/*/office:body//office:annotation
  where $comment/dc:creator = 'John'
  order by xs:dateTime($comment/dc:date)
  return
    <comment date="{$comment/dc:date}">{
      $comment//text:p//text()
    }</comment>
}</all_john_comments> 


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video