Marc Van Cappellen is a Director of Development at DataDirect Technologies. Marc has more than 15 years of experience in various data access technologies including ODBC, JDBC, ADO, XQJ, SQL and XQuery.
Over the past year we have seen office documents -- spreadsheets, wordprocessing documents, and the like -- taking a more prominent role in business processes, workflows, and vertical applications. While office documents have always been common place, their proprietary binary formats have made them all but unusable to businesses which have sought to integrate office documents in their processes. New office document standards like the OpenDocument Format(ODF) and Office Open XML (OOXML), however, are making office document integration in business processes a reality.
A key benefit of ODF and OOXML for developers is the reuse of existing standards -- in essence, your office documents are XML documents, which makes available a complete palette of tools for manipulating these documents and the information they contain. Using tools and technologies available today, you can transform office documents to HTML or PDF, store them in an XML database, shred their information and store it in a relation databases, embed SOAP messages, enrich them with external information, and so on.
How? XQuery, That's How!
XQuery is a query language for XML. A W3C recommendation since January 2007, XQuery is already widely supported, with over 50 open source and commercial implementations. So why not query your office documents using XQuery?
If you look close at ODF or OOXML documents, you notice that they are in fact not XML documents, but rather a number of XML documents packaged in a ZIP file. Figure 1 is an example of an Office Open XML document opened with WinZip.
A similar document, saved as an OpenDocument Format document, might look like Figure 2.
You have great XQuery implementations to query and transform XML documents, but office documents are packaged inside ZIP files. How can you make them available to XQuery implementation? For Java-based applications, an easy solution is to use the standard fn:doc XQuery function.
Among the standard URL schemes like file: and http:, your Java virtual machine also support the jar: URL scheme. A jar archive is considered "a zip archive with logical extensions," the "logical extensions" being special files like manifest.mf or the META-INF directory located in the archives. But physically these are just zip archives, and as such you can use the jar: URL scheme to read them.
For example, to access word/document.xml in the Office Open XML document coolstuff.docx, you would use this fn:doc call:
doc('jar:file:///C:/coolstuff.docx!/word/document.xml')
Similarly, to access content.xml in OpenDocument coolstuff.odt, use:
doc('jar:file:///C:/coolstuff.odt!/content.xml')
This approach is not product specific -- almost any Java-based XQuery implementation that can query XML documents on a local filesystem can handle this pattern. Examples of two such popular XQuery implementations include DataDirect XQuery and Saxon.
Now that you know how to access the office documents you want to query, let's use XQuery to extract all of John's comments, ordered by date. The OOXML format stores comments in the comments.xml document; see the example provided here.
Note that the query (shown below) is all standard XQuery, without any proprietary extensions. The office document is accessed using the jar: URL scheme standard that is supported by your Java environment.
declare namespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"; declare variable $office_doc := doc('jar:file:///C:/coolstuff.docx!/word/comments.xml'); <all_john_comments>{ for $comment in $office_doc/*/w:comment where $comment/@w:author = 'John' order by xs:dateTime($comment/@w:date) return <comment date="{$comment/@w:date}">{ $comment//text() }</comment> }</all_john_commentsg>
The ODF, on the other hand, embeds the comments within the actual document, which is always named content.xml; see the example provided here.
As the ODF document format differs from the OOXML document format, so does the query you use to reach the embedded user comments:
declare namespace office = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"; declare namespace dc = "http://purl.org/dc/elements/1.1/"; declare namespace text = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"; declare variable $office_doc := doc('jar:file:///C:/coolstuff.odt!/content.xml'); <all_john_comments>{ for $comment in $office_doc/*/office:body//office:annotation where $comment/dc:creator = 'John' order by xs:dateTime($comment/dc:date) return <comment date="{$comment/dc:date}">{ $comment//text:p//text() }</comment> }</all_john_comments>