Dr. Dobb's is part of the Informa Tech Division of Informa PLC




Open Office Document Connector


November, 2005: Open Office Document Connector

Jean-Marie Gouarne is CTO at Genicorp (http://www.genicorp.fr), an IT services provider, and partner in Ars Aperta (http://www.arsaperta.com/en/index.html), a consulting firm focusing on open source software related strategies. His main consulting areas are business intelligence, information systems management and architecture, and software-related legal issues. In addition, Jean-Marie is an OpenDocument specialist and a Perl programmer; he created and maintains the OpenOffice::OODoc CPAN module. He can be reached at [email protected].


The OASIS OpenDocument format was officially born this year. Its basic principles and the majority of its semantics and syntax came from the OpenOffice.org project. The OpenDocument format (ODF) is poised to become a de facto standard for free office software and, in the long term, it could be used as a common basis for large-scale content management applications.

So now is the right time to talk about Perl/OpenDocument integration.

The OpenDocument Concept

The ODF is fully documented [1] and everybody can get and use the specification for free. Nobody can be deterred by legal issues or technical obscurities. An ODF-compliant file is nothing more than a compressed archive which contains a few XML members. Both the compression algorithm and the XML schema are open source. But, above all, a new philosophy is gaining ground in the office software world--any document should have a life without the tool which has been used to create it.
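To make the "compressed archive with a few XML members" idea concrete, here is a minimal sketch that builds a tiny stand-in archive in memory and then lists its members. It uses only the core IO::Compress::Zip and IO::Uncompress::Unzip modules, not OpenOffice::OODoc or Archive::Zip. The member names content.xml, styles.xml, and meta.xml are the usual ones in such files; the one-element XML bodies and the helper names build_sample_archive and read_members are illustrative assumptions, and a real document contains more members than these.

```perl
use strict;
use warnings;
use IO::Compress::Zip     qw($ZipError);
use IO::Uncompress::Unzip qw($UnzipError);

# Build a small stand-in archive in memory (illustrative content only).
sub build_sample_archive {
    my @members = (
        [ 'content.xml' => '<office:document-content/>' ],
        [ 'styles.xml'  => '<office:document-styles/>'  ],
        [ 'meta.xml'    => '<office:document-meta/>'    ],
    );
    my $archive = '';
    my $zip;
    for my $member (@members) {
        my ($name, $xml) = @$member;
        if ($zip) {
            # every further member is a new stream in the same archive
            $zip->newStream(Name => $name)
                or die "Cannot add $name: $ZipError";
        }
        else {
            $zip = IO::Compress::Zip->new(\$archive, Name => $name)
                or die "Cannot create archive: $ZipError";
        }
        $zip->print($xml);
    }
    $zip->close;
    return $archive;
}

# Walk through the archive and return the member names (in order)
# together with the uncompressed content of each member.
sub read_members {
    my ($archive) = @_;
    my $unzip = IO::Uncompress::Unzip->new(\$archive)
        or die "Cannot read archive: $UnzipError";
    my (@names, %content);
    my $status;
    for ($status = 1; $status > 0; $status = $unzip->nextStream) {
        my $name = $unzip->getHeaderInfo->{Name};
        my $data = '';
        my $chunk;
        while (($status = $unzip->read($chunk)) > 0) {
            $data .= $chunk;
        }
        last if $status < 0;
        push @names, $name;
        $content{$name} = $data;
    }
    die "Error while reading archive: $UnzipError" if $status < 0;
    return (\@names, \%content);
}

my ($names, $content) = read_members(build_sample_archive());
print "$_: $content->{$_}\n" for @$names;
```

Passing a filename instead of a scalar reference to IO::Uncompress::Unzip->new should list the members of an actual document in the same way.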

Such an idea should have been straightforward from the beginning, because the document belongs to its author and not to the author of the editing software. However, the "open document" concept sounded strange until recently, due to the particular context of the office software marketplace in the '90s. For most users, there is an unbreakable link between the tool and the content. So the editing software is both a tool and a lock.

But neither a market singularity nor a vendor lock-in policy can fully explain the delay between the first large-scale deployments of desktop software ('80s) and the emergence of a really open document format. The other factor is the cultural frontier between structured and unstructured data. Most IT specialists regarded office documents as unstructured data and, as a consequence, left them out of the scope of mainstream enterprise software, which is dedicated to structured data.

In the last few years, thanks to XML, this frontier has begun to vanish. Technically speaking, it is increasingly difficult to see a document as "unstructured data" when you can describe its structure with a DTD or one of the publicly available schema definition dialects (XSD, RelaxNG). Paradoxically, some so-called "unstructured data", such as office documents, may be nothing other than hyper-structured data, i.e. data with complex, non-tabular and flexible structures. In addition, while XML provides the technological background, information systems must become increasingly able to process documents as well as tabular databases in order to meet business needs, simply because documents are at the heart of business processes and contain a large part of the business knowledge. So minds are slowly changing, and the availability of open formats is triggering more and more direct document processing projects, which can now rely on standard APIs.

From an information systems management point of view, direct document processing has a significant advantage: it avoids the misuse of proprietary macro languages. Macros are useful tools for individual productivity tasks, but they should not be used as a development tool for mission-critical enterprise applications, because they can't be properly supported in the long term, and they can't run outside the desktop editing software (which is a poor and insecure platform). On the other hand, direct document processing applications can be written in open and powerful scripting languages, according to good software engineering practices, and can run in more robust environments. All that is good news for the IT department and, ultimately, for the business.

Introducing OpenOffice::OODoc

The OpenOffice::OODoc module is one of the many answers to these recent needs.

Born as a private project at Genicorp [2], this toolbox was primarily used in a few mid-sized organizations in several business sectors (legal, food and healthcare), in order to allow the automatic generation of very simple operational documents or reports by enterprise applications. Then the module became open source and was made available as a CPAN package at the beginning of 2004 [3]. The number of users began to grow and, as a consequence, some bugs were reported and fixed. Two major changes were introduced later.

The first one (1.301) was a rework of the basic access layer, due to performance issues when processing very large documents. XML::Twig was selected as the basic XML API. The performance concern was a direct consequence of the increasing number of applications: some users apparently began to process huge documents and/or to combine a large number of documents. The API switch provided linear performance up to thousands of text containers (corresponding to hundreds of pages in a typical text document), and such resource-consuming tasks as the full generation of tables including thousands of cells, which produced unpredictable results in the early versions, became very easy.

More recently, OpenOffice::OODoc 2.xxx was released in order to keep pace with the announced availability of OpenOffice.org 2.0. The new version supports both the original OpenOffice.org 1.0 format and the new OASIS OpenDocument format, which becomes the default format with OpenOffice.org 2.0 and the lingua franca of a growing number of other office suites. Fortunately, the new format inherits almost all the features and the general organization of the old one.

This toolbox is, to some extent, intended to allow a simple, database-like access to any object in any OpenDocument-compliant file. It relies on the file format, and not on the features of a particular office software. More precisely, OpenOffice::OODoc uses well-known Perl modules such as Archive::Zip and XML::Twig, and not the OpenOffice.org API.

Before going further, let's look at a very short script which illustrates the general logic of the interface:

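use strict;
use warnings;
use OpenOffice::OODoc;

# connect to the Writer file
my $doc = ooDocument(file => 'myfile.sxw');

# put 'Hello' in cell B4 of the table named 'MyTable'
$doc->cellValue('MyTable', 'B4', 'Hello');

# create a new paragraph at the end of the document
$doc->appendParagraph
	(
	text	=> 'The last paragraph',
	style	=> 'Text body'
	);

# write the changes to the file
$doc->save;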

This example is (I hope) self-documenting.

The ooDocument() function is a constructor. It returns an object, $doc, which is a document interface associated with a physical OpenOffice.org Writer (SXW) file.

The next two instructions do something with the content of the document. The example shows a cellValue() method, which retrieves or changes the content of a table cell (as you can see, the cell is selected by table name and user-oriented logical coordinates, and not with an arcane XPath expression), then an appendParagraph() method, which, unsurprisingly, creates a new paragraph, with the given text and style, at the end of the document. But the most important thing, for now, is the $doc object. It owns every content processing method and, above all, it hides the details of the physical access to the document.
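To illustrate what "user-oriented logical coordinates" means, here is a minimal sketch that translates a spreadsheet-style cell reference such as 'B4' into zero-based row and column indexes. The cell_coords helper is hypothetical and written only for this illustration; it is not part of OpenOffice::OODoc and does not claim to reproduce how the module resolves coordinates internally.

```perl
use strict;
use warnings;

# Hypothetical helper: convert a reference like 'B4' into
# zero-based (row, column) indexes.
sub cell_coords {
    my ($ref) = @_;
    my ($letters, $digits) = $ref =~ /^([A-Za-z]+)([1-9][0-9]*)$/
        or die "Invalid cell reference: $ref";
    # column letters work like a base-26 number: A=1 ... Z=26, AA=27
    my $col = 0;
    $col = $col * 26 + (ord(uc $_) - ord('A') + 1) for split //, $letters;
    return ($digits - 1, $col - 1);
}

my ($row, $col) = cell_coords('B4');
print "row $row, column $col\n";    # prints "row 3, column 1"
```

The point of the high-level interface is that the application never has to perform this kind of translation, or the XML navigation that follows it, by itself.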

At the end, a save() call is issued in order to physically commit the changes made in the document by the previous instructions. Before this last instruction, the original file remains unchanged.

This example is only intended to show the basic principles of this interface:

  • hide the physical file access mechanisms (zip/unzip) and the UTF8 encoding issues behind an abstract connector
  • present an object-oriented interface, where the main objects are the document connectors
  • provide a growing set of predefined, user-friendly methods in order to allow an easy access to the most frequently processed elements of content and layout, without knowledge of the underlying XML schema.

There is no revolution here. OpenOffice::OODoc is an OpenDocument-aware layer above Archive::Zip and XML::Twig. It only hides the file compression/decompression steps, spares the user from learning the OpenDocument XML specification, and provides a compact but readable document-focused language.

Simply put, its application area covers:

  1. automatic document generation or update by back-office applications (e.g., reporting)
  2. automatic data check and capture in office documents by back-office applications (e.g., form processing).

Basic Functionality

The design goal of OpenOffice::OODoc is document processing automation with a particular focus on integration between documents and enterprise data. In other words, this API allows the user to retrieve, read, update, delete or create any part of a document considered as a data structure, but it contains neither a layout rendering nor a format conversion utility. For example, you can use it to update a table, to create a bulleted list, to change the font size and the background color of a given paragraph, or to switch a page orientation from portrait to landscape, but you need OpenOffice.org or another OpenDocument-compliant desktop software to print the result or export it to PDF or some proprietary office document format.

Another point must be clearly explained. OpenOffice::OODoc works with open documents in general and is not limited to a particular class of document. In other words, it can be used against spreadsheets, presentations or drawings as well as text documents. It's a logical consequence of the OpenDocument definition itself. In proprietary office suites, there is an ad-hoc format for each document class. In the OpenDocument world, a given object is always described by the same data structure whatever the class of the containing document. For example, a table cell can be retrieved in the same way in a spreadsheet (Calc) as in a text (Writer) document.

There are two possible levels of use:

  1. A low level, which allows uncontrolled access to anything anywhere, using raw XPath expressions. This level is represented by such methods as getNodeByXPath or makeXPath. Of course, all the XML::Twig features are available as well. It's intended for advanced, OpenDocument-aware users who want to extend the high-level API. Ordinary applications should not use this level, which is not discussed further in this article.

  2. A high (or ordinary) level, allowing the user to forget the XML navigation and to retrieve and process document elements using a document-oriented vocabulary. This level, of course, can't manage every possible object in an OpenDocument, but its coverage includes the most needed objects (and can be extended according to real world projects).

This second level is the most useful and its set of "managed objects" presently includes:

  • paragraphs
  • headers
  • lists
  • tables and cells
  • bibliography fields
  • bookmarks
  • user fields
  • images
  • paragraph and text styles
  • image layout
  • page layout
  • document properties (title, author, subject, description, keywords, statistics)

Content, Styles, and Metadata

In the OpenDocument world, there is an explicit separation between several logical spaces, corresponding to several XML members in the physical archive. Because it's very close to the OpenDocument specification (and far from an interactive editing tool), OpenOffice::OODoc doesn't hide this separation, so it must be known by the user.

The most important spaces (at least for a first article) are:

  • document-content: the document body, i.e. all the text elements and non textual objects (such as graphics, bookmarks, variable fields, and so on) that appear in the page bodies
  • document-styles: the reusable layout description elements, mainly the named styles (i.e. the styles which are visible to the end-user through the stylist box in OpenOffice.org); a style can apply to a text object as well as to an image or a page layout, and generally speaking, everything which is printable has a style
  • document-meta: the general properties of a document, as the end-user can get them through, say, the different tabs of the File/Properties box in OpenOffice.org.

The default member is document-content. When an application needs to process another member of an OpenDocument (i.e. styles or metadata), it has to select that member, with an explicit "member" option, when it calls the ooDocument constructor. And when it needs to process more than one member in the same session, it must instantiate more than one document interface. Don't worry, it's not very complicated.

Suppose, for example, that we want to apply a special layout, which doesn't exist yet but should be reusable later, to all the paragraphs matching a given text filter (regex) in a given document, and give the document a new title. Our special layout will be based on the "Text body" predefined style, and its own properties will be, say, 1.8 centimeter margins, centered content and a yellow background. To do so, we need three interfaces against the same file, corresponding to document-content (to select the correct paragraphs and provide the correct style identifier to each one), document-styles (to define and register the new style) and document-meta (because the title belongs to the metadata) respectively.

The following script, where the file name, the search filter and the title come from the command line, shows one of the possible ways to get such a result:

use strict;
use warnings;
use OpenOffice::OODoc;

my $filename	= $ARGV[0];
my $filter	= $ARGV[1];
my $title	= $ARGV[2];

# create the document connectors
my $c	= ooDocument
			(
			file	=> $filename,
			member	=> "content"
			);

# the styles and metadata connectors share the file opened by $c
my $s	= ooDocument
			(
			file	=> $c,
			member	=> "styles"
			);

my $m	= ooMeta(file => $c);

# create the named paragraph style
my $bgcolor = rgb2oo("yellow");

$s->createStyle
			(
			"MyCenteredText",
			family		=> 'paragraph',
			parent		=> 'Text body',
			properties	=>
				{
				'fo:margin-left'	=> '1.8cm',
				'fo:margin-right'	=> '1.8cm',
				'fo:text-align'		=> 'center',
				'fo:background-color'	=> $bgcolor
				}
			);

# select the target paragraphs and apply the style
# (selectElementsByContent returns the list of all matching elements)
foreach my $p ($c->selectElementsByContent($filter))
	{
	$c->textStyle($p, "MyCenteredText") if $p->isParagraph;
	}

# put the new title
$m->title($title);

# commit the changes
$c->save;

In this example, the first ooDocument call creates a document interface ($c) which is, as usual, explicitly linked to a filename through the "file" option. In addition, a "member" option is passed to the ooDocument constructor in order to select the "document-content" workspace. In this case, the "member" option is provided only for the clarity of the presentation ("document-content" is the default). Then a second interface ($s) is created using the same constructor, but this time the value of the "file" option is not a filename; it's the previously created, content-focused document interface ($c) instead. The "member" option is set to "styles", so this second instance represents the document-styles workspace. With this construct, each of the two document interfaces works in its own workspace, but they consistently share a single file.

The next instruction instantiates a metadata-focused interface. This object must be created using the ooMeta function, because it differs deeply from the content- and styles-focused objects. It's the only interface with predefined title, subject, author, description, ... accessors.

The createStyle method is called in order to define a new named paragraph style ("MyCenteredText"), derived from the "Text body" style, with the required properties. As you can see, this method is called from the $s object, so the style will be created in the document-styles workspace. Then we go to the document-content workspace (represented by $c) in order to get the list of all the text elements matching the filter (selectElementsByContent). In this list, we apply the "MyCenteredText" style to each paragraph element only (we don't change the style of any non-paragraph container).

Then we register the title of our choice using, without surprise, the title accessor from $m (our metadata interface).

Finally, the save method is called from the $c object. But it could be called from the $s object as well. It's the user's choice because, as long as two or more document interfaces are connected to the same file, the save method of any one of them applies to the whole. Additional $c->save, $s->save and $m->save instructions would be counter-productive and absolutely useless here.

Conclusion

OpenOffice::OODoc is a document-oriented interface which relies on the first really open and accepted document format. In this article, I introduced the general scope and the main features of the module. Hopefully, it was enough to get you started. In upcoming articles, I will take a deeper look at some advanced features in various content and style management areas, through more examples.

References

[1] See http://www.oasis-open.org/committees/download.php/12573/OpenDocument-v1.0-os.sxw for the full specification and http://en.wikipedia.org/wiki/OpenDocument for a good abstract and some interesting links.

[2] http://www.genicorp.fr

[3] http://search.cpan.org/dist/OpenOffice-OODoc

TPJ

