
Managing Documents Using a SOAP::Lite Daemon Architecture


Bryce Harrington is a Senior Performance Engineer at the Open Source Development Labs in Beaverton, Oregon, leading OSDL's NFSv4 testing efforts. He is also a founding member of Inkscape and the Open Clip Art Library. He can be reached at bryce@bryceharrington.org.


About a year ago, I helped start a simple project to build a collection of public domain clip art, and realized that here I was, once again in need of a system for managing a bunch of documents. And yet again, none of the open-source document management systems (DMS) out there (including my previous attempts) gave the flexibility necessary to get the job done right. Over the year since then, I've made a first cut at a system that I think will give a much more flexible (and hopefully more powerful) solution.

The Open Clip Art Library is an open-source effort assembled around a community need for free, unencumbered clip art for drawing programs such as Inkscape. In its original, most basic form, it was just a tarball of images that Inkscape and Sodipodi users had contributed into the public domain. Thanks to the hard work of countless contributors and a team of dedicated administrators, the effort was quite successful, and built up a library of thousands of images.

Of course, the inevitable consequence of accumulating so many files was the challenge of how to manage them. The files aren't just static files in a directory hierarchy, but rather a living, working resource that grows and changes through time, and has a range of quirks and issues that need tracking and resolving.

As an example, questions occasionally arise about whether an image is acceptable. If someone's original work resembles a company logo too closely, that could be a trademark concern. Or if it is a mechanical trace of a copyrighted image, then copyright problems may prevent us from releasing it into the public domain. This means we need a way to check the review status of individual documents.

Furthermore, our users demand more powerful ways to organize and sort the images. Artists want the ability to view and update the images they've submitted so far. Linux distro vendors want to be able to create separate packages based on various "themes" (such as just flags or just office clip art). Power users want to be able to get a delta of new images since their last installation update.

The administrators and developers of the project have collected a set of Perl scripts that help produce the monthly releases: they validate and clean up submissions, generate navigable views, and update the images themselves. However, many of these are only run during the monthly release process, and aren't fast or dynamic enough to run in real time.

What we needed was a DMS that would give us workflow, state tracking, metadata management, and centralized storage of the clip art files. The more powerful, the better!

Document Management

When discussing document management systems, people frequently wonder how this is different from, say, a content management system (CMS) like Bricolage, or a version control system (VCS) like Subversion. All three deal with files, keep track of versioning information, and allow multiple people to work together on those files in parallel.

But scratching beneath the surface reveals some rather fundamental differences. Each system is geared for a completely different type of user, and while they share similar features, the intended use for those features is drastically different.

Content management systems typically are used to manage components of a web site, and as such deal primarily with page fragments such as news blurbs and logos. The fragments can be updated and change with time, thus usually requiring some dynamic way to redisplay the site. Often, users wish to schedule when those fragments become "live," and control the publishing process.

Version control systems are for managing source code and focus on files that are compiled together into a program. The files can be modified by various people via patches and branches. Since programmers tend to be comfortable working on the command line, many VCSs are implemented as command-line tools.

The files in a document management system are components of "Documents." A document might be 10 word processor files that make up the 10 chapters of a book. Documents are relatively standalone, compared with the contents of a CMS or VCS. Workflow, state, and authorship tend to be more vital in a DMS.

Document management systems seek to fit requirements that would be out of scope or near the edge of scope for a typical VCS or CMS:

  • Store millions of documents or terabytes of data.
  • Handle a wide variety of file formats.
  • Store and access files in nonhierarchical fashions.
  • Randomly search and quickly select sets of documents based on keywords or subject matter.
  • Lock documents to particular authors.
  • Perform bulk operations on new documents (thumbnailing, indexing, translating, OCRing, and the like).

Why Another DMS?

Browsing freshmeat.net shows that there have been a number of attempts at creating a document management system. However, most of these systems are just a web wrapper around a hierarchical file system, with a SQL database to manage metadata. Because they're web-based, bulk operations can be hard to do. If you need to upload 200 files, for example, you are faced with manually uploading them one by one. Adding new bulk operations would require the involvement of a programmer with intimate knowledge of the system. Further, even adding a minor bulk operation requires a reinstallation of the software, adding the risk of breaking something.

A much better approach would be to enable these types of operations to be done as one-off scripts, quickly coded up and run directly, without any need for changes to the server, and without need for software redeployment. In essence, the idea is to split the interface (the web forms) from the back-end logic. Communication between the two is done through a simple protocol that can be used by a web interface, command-line tools, and GUI applications.
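As a rough illustration of such a one-off script, the sketch below walks a directory and hands each SVG file to a callback. The callback stands in for whatever call the server API provides for adding a document. The `bulk_add` helper and the `.svg` filter are assumptions made for this example, not part of Document::Manager.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: call $add_cb->($path) for every .svg file under $dir.
# Returns the number of files for which the callback reported success.
sub bulk_add {
    my ($dir, $add_cb) = @_;
    my @files;
    find({
        # With no_chdir, $_ holds the full path of the current entry.
        wanted   => sub { push @files, $File::Find::name if -f && /\.svg\z/i },
        no_chdir => 1,
    }, $dir);

    my $added = 0;
    foreach my $path (sort @files) {
        if ($add_cb->($path)) {
            $added++;
        } else {
            warn "Failed to add $path\n";
        }
    }
    return $added;
}
```

In a real script, the callback would wrap the SOAP call that checks a file in, so the same loop serves 2 files or 200 without any change to the server.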

This architectural approach also opens up a lot of flexibility. With a typical DMS, you get an all-in-one solution. If you decide you like the interface of a different DMS better, then you're faced with the daunting prospect of installing that DMS, then figuring out how to dump all your documents, change history, and metadata from the old system and insert it into the new. By having the interface and back end separate, you would be able to swap out the web interface for a different one, while keeping the same back-end server.

Of course, it also allows the converse: If your users have grown accustomed to the Web or GUI interface for your document system, but the back end doesn't cut it, you can replace the server with another that uses the same API, without having to install a different interface.

This modular, generic API approach is also conducive to the way the open-source community works. In the current all-in-one DMS situation, if a random DMS administrator writes a feature to solve an immediate problem in his company's document manager, getting that feature integrated back into the main codebase may take a lot of effort. However, if the feature is implemented as a separate standalone script, it can be posted to a web site and other administrators could download it and run it directly, without any need for that feature to be incorporated into the main source.

LAMP Breaking: A Multi-Tier Server Architecture in Perl

This idea of separating the interface and back end and using a generic API is anything but new, and has been extensively used in other software applications such as e-mail, chat, and file sharing.

The LAMP (Linux-Apache-MySQL-Perl/PHP/Python) architecture is built to provide a web-based interface that can be accessed by any web browser through PHP or CGI. The program logic and UI-generation logic are bundled together, often inseparably. In our new "multi-tier" architecture, we split the program logic into three pieces: the file repository, the business logic, and a daemon.

Document::Repository

At the lowest level is a Perl module to handle the reading and writing of files to the file system. This module abstracts how documents are stored in the system via routines like add(), get(), update(), and so forth. These routines work with files, document IDs, and revision numbers to allow higher layers to be ignorant of where in the real file system the document will be stored.
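To make the abstraction concrete, here is a toy in-memory stand-in with the same three verbs. It is only a sketch: the package name `ToyRepository`, the argument order, and the return values are assumptions for illustration and do not reflect Document::Repository's actual signatures.

```perl
use strict;
use warnings;

package ToyRepository;

sub new {
    my $class = shift;
    return bless { docs => {}, next_id => 1 }, $class;
}

# Store a new document and return its ID. The first revision is number 1.
sub add {
    my ($self, $content) = @_;
    my $id = $self->{next_id}++;
    $self->{docs}{$id} = [ $content ];
    return $id;
}

# Append a new revision; returns the new revision number,
# or nothing if the ID is unknown.
sub update {
    my ($self, $id, $content) = @_;
    my $revs = $self->{docs}{$id} or return;
    push @$revs, $content;
    return scalar @$revs;
}

# Fetch a given revision (default: the latest one).
sub get {
    my ($self, $id, $rev) = @_;
    my $revs = $self->{docs}{$id} or return;
    $rev = @$revs unless defined $rev;
    return if $rev < 1 || $rev > @$revs;
    return $revs->[$rev - 1];
}

package main;

my $repo = ToyRepository->new;
my $id   = $repo->add('<svg/>');
$repo->update($id, '<svg width="10"/>');
print $repo->get($id, 1), "\n";   # first revision
print $repo->get($id), "\n";      # latest revision
```

Callers only ever see IDs and revision numbers, which is the point: the storage behind them could be a hash, a directory tree, or something else entirely.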

Document::Repository stores documents into a structured file system based on the document's ID. It's conceivable that there could be many thousands of files in the repository, but from an administrative point of view, it can be problematic to have too many entries in a single directory. Document::Repository handles this by establishing subdirectories when the number of documents exceeds 1000, and sub-subdirectories if the number exceeds 1,000,000.
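One way such a layout can work is to derive the directory from the digits of the document ID, so that no directory ever holds more than 1000 entries. The sketch below shows the general technique and is not the module's actual on-disk scheme, which adds the extra levels only once the thresholds are crossed. The function name `doc_path` and the fixed three-level layout are invented here.

```perl
use strict;
use warnings;

# Hypothetical mapping from a numeric document ID to a nested directory path.
# IDs are zero-padded to 9 digits and split into three groups of three.
sub doc_path {
    my ($root, $id) = @_;
    die "Document ID must be a non-negative integer below 10**9\n"
        unless defined $id && $id =~ /\A\d{1,9}\z/;
    my ($millions, $thousands, $units) =
        sprintf('%09d', $id) =~ /\A(\d{3})(\d{3})(\d{3})\z/;
    return join '/', $root, $millions, $thousands, $units;
}

print doc_path('/var/dms', 42), "\n";        # /var/dms/000/000/042
print doc_path('/var/dms', 1234567), "\n";   # /var/dms/001/234/567
```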

Even though it is low level, Document::Repository could be used directly, if one needed to interact with the repository in a more hands-on way than is available through the higher-level interfaces. In fact, several administrative scripts (repo_init, repo_add, repo_ls, repo_export, etc.) are included for just this purpose.

Document::Manager

The Document::Manager module embodies the primary business logic for managing documents. It provides user-accessible functionality like checkin(), checkout(), query(), properties(), adding/removing keywords, and other workflow and state functionality.

Document::Manager also serves to define the primary API for client applications. Clients are able to directly call the routines in this module to perform whatever actions they need.

For interacting with an individual document, a module called Document::Object is employed. This models a single document, providing routines for adding files to it, changing its state, updating its change log, and setting properties. Clients do not interact directly with Document::Object, only through Document::Manager.

The Daemon

We chose to use SOAP::Lite as the mechanism for communicating with the daemon. SOAP, or Simple Object Access Protocol, is an XML-based protocol for inter-program communications that works through the Web. It is similar in spirit to other Remote Procedure Call (RPC) technologies like CORBA or DCOM, but is designed to be more lightweight and easy to use. SOAP::Lite is a Perl module that implements SOAP for Perl programs.

SOAP has some downsides, unfortunately. Different implementations support different collections of features; PHP's SOAP doesn't support all the same features that Perl's SOAP::Lite does, for example. There is also some overhead to using SOAP::Lite, which suggests there may be performance or scalability issues for larger uses. XML-RPC may be a better solution from an interoperability point of view, although its feature set is more limited than SOAP's, so it also has some tradeoffs.

Despite these issues, for our purposes SOAP::Lite does the trick. Creating a daemon process with SOAP::Lite is done with a minimum amount of code:

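use strict;
use warnings;
use SOAP::Transport::HTTP;
use Document::Manager;

# Socket options are passed through to the underlying HTTP daemon.
my %args = (
    LocalPort => 8012,
    ReuseAddr => 1,
    Listen    => 5,
);

# Expose Document::Manager's routines to SOAP clients.
my $daemon = SOAP::Transport::HTTP::Daemon
    ->new(%args)
    ->dispatch_to('Document::Manager')
    ->options({ compress_threshold => 10000 });

print "Contact to SOAP server at ", $daemon->url, "\n";
$daemon->handle;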

The dmsd daemon also has code for handling command-line arguments (the port to use, SSL certs, the user to run as, etc.) and for forking to a different uid/gid, but that's pretty much it. All SOAP traffic is handled by SOAP::Lite and handed to Document::Manager.
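Option handling of this kind is commonly done with Getopt::Long. The following is a minimal sketch, not dmsd's actual code: the option names (`--port`, `--user`, `--ssl-cert`) and the `parse_dmsd_args` helper are assumptions chosen to mirror the settings described above.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical option parser; returns a hash ref of settings.
sub parse_dmsd_args {
    my @argv = @_;
    my %opt  = (port => 8012, user => undef, 'ssl-cert' => undef);

    GetOptionsFromArray(\@argv, \%opt, 'port=i', 'user=s', 'ssl-cert=s')
        or die "Usage: dmsd [--port N] [--user NAME] [--ssl-cert FILE]\n";

    die "Port must be between 1 and 65535\n"
        unless $opt{port} >= 1 && $opt{port} <= 65535;

    return \%opt;
}

my $opt = parse_dmsd_args(@ARGV);
print "Would listen on port $opt->{port}\n";
```

Validating the port before the daemon is constructed keeps a typo on the command line from surfacing later as an obscure socket error.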

It's not necessary to run dmsd as the root user, which is quite handy when it needs to run on a web server on which you have not been granted root access.

Using Document::Manager from the Client Side

Using SOAP::Lite, client-side scripts to access the daemon are easily written. A trivial client program might look like this:

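use strict;
use warnings;
use SOAP::Lite;
use SOAP::Lite::Utility;

my $doc_id = 42;

# Connect to the daemon running on localhost, port 8012.
my $soap = create_soap_instance('http://www.openclipart.org/Document/Manager',
                                'http://localhost:8012/');

# Ask the server to construct a Document::Manager object.
my $response = $soap->call(new => 1);
soap_assert($response);
my $dms = $response->result;

# Set a property on the document.
$soap->properties($dms, $doc_id, ('publisher' => 'Open Clip Art Library'));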

Note the strange calling syntax. Because properties() is a method of Document::Manager, you'd expect to call it via $dms->properties(); however, SOAP::Lite's design requires calling it from the $soap object, passing the real object in as the first parameter of the function.
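The convention is less odd than it first appears: in plain Perl, a method call is just a subroutine call with the object passed as the first argument. The sketch below shows that equivalence locally, with no SOAP involved. The `Counter` package is invented for this example.

```perl
use strict;
use warnings;

package Counter;

sub new  { my $class = shift; return bless { n => 0 }, $class }
sub bump { my ($self, $by) = @_; $self->{n} += $by; return $self->{n} }

package main;

my $counter = Counter->new;
my $first   = $counter->bump(2);            # usual method syntax
my $second  = Counter::bump($counter, 3);   # same routine, object passed explicitly
print "$first $second\n";                   # prints "2 5"
```

The $soap->properties($dms, ...) call resembles the second form, with the $soap object standing between the caller and the routine.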

To generate a listing of all the documents in the system, we might do something like this:

$response = $soap->query($dms);
soap_assert($response);

# Copy the ID list first, since $response is reused inside the loop.
my @doc_ids = @{ $response->result || [] };

printf("%-8s  %-24s  %-8s  %-8s  %-12s\n", 'ID', 'Title', 'State', 'Size', 'Date');
foreach my $doc_id (@doc_ids) {
    next unless $doc_id;
    $response = $soap->properties($dms, $doc_id);
    my $properties = $response->result;
    if (! $properties) {
        # Ask the server why the lookup failed.
        $response = $soap->get_error($dms);
        if (! $response or ! $response->result) {
            warn "Unknown error retrieving document properties\n";
        } else {
            warn $response->result, "\n";
        }
        next;
    }
    printf("%08d  %-24s  %-8s  %-8s  %-12s\n",
           $doc_id,
           ($properties->{'title'} or ''),
           ($properties->{'state'} or ''),
           ($properties->{'size'}  or ''),
           ($properties->{'date'}  or '')
           );
}

soap_assert() is a routine I made to check the SOAP call's return for any errors. The tutorials for SOAP::Lite leave out much of the error-handling syntax, but I found that without the error messages, it was difficult to figure out where things were going wrong.
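A check of this kind can be quite short. The version below is a guess at the general shape, not the actual soap_assert() from SOAP::Lite::Utility: it relies only on the `fault`, `faultcode`, and `faultstring` accessors that SOAP::Lite response objects provide, and the name `check_soap_response` is made up for this sketch.

```perl
use strict;
use warnings;

# Die with a readable message if a SOAP call failed;
# otherwise hand the response back to the caller.
sub check_soap_response {
    my ($response) = @_;
    die "SOAP call returned no response\n" unless defined $response;
    if ($response->fault) {
        die sprintf("SOAP fault %s: %s\n",
                    $response->faultcode, $response->faultstring);
    }
    return $response;
}
```

Because the routine returns the response on success, it can wrap a call directly, as in `my $dms = check_soap_response($soap->call(new => 1))->result;`.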

For remote command-line use, the WebService::AuthTicket module is used to provide ticket-based authentication. The user logs in to the server with a password, then is given a ticket that will remain valid for a limited duration. Other scripts can use that ticket for conducting operations requiring authentication.
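The ticket idea can be sketched with an HMAC-signed expiry time. This illustrates the general technique only. It is not how WebService::AuthTicket is implemented, and the function names, ticket format, and secret are all assumptions.

```perl
use strict;
use warnings;
use Digest::SHA qw(hmac_sha256_hex);

my $SECRET = 'change-me';   # server-side secret (placeholder value)

# Issue a ticket of the form "user:expiry:signature".
# Assumes user names contain no colons.
sub issue_ticket {
    my ($user, $lifetime, $now) = @_;
    $now = time unless defined $now;
    my $expires = $now + $lifetime;
    return join ':', $user, $expires, hmac_sha256_hex("$user:$expires", $SECRET);
}

# Return the user name if the ticket is authentic and unexpired, else undef.
sub check_ticket {
    my ($ticket, $now) = @_;
    $now = time unless defined $now;
    my ($user, $expires, $sig) = split /:/, $ticket, 3;
    return undef unless defined $sig && $expires =~ /\A\d+\z/;
    return undef unless $sig eq hmac_sha256_hex("$user:$expires", $SECRET);
    return undef if $now > $expires;
    return $user;
}
```

The server never has to store issued tickets: anyone can read the expiry time, but only the holder of the secret can produce a signature that matches it. A production version would also want a constant-time comparison of the signatures.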

The Future

Document::Manager still has a good bit of maturing ahead of it before it can really be depended on for production use. The near-term focus will be on implementing the functionality needed to make it useful for the Open Clip Art Library. This primarily means increased support for RDF embedded in SVG images via the SVG::Metadata module.

Currently, Document::Manager only has a few command-line scripts as its interface, but it clearly will need a web interface to be useful. CGI::Builder with Template::Toolkit support looks like one good way to implement this.

Looking further ahead, a major change is planned for the back end to enable better scalability. There is another document management project called "DoXFS" that I discovered recently. In discussions with its author, it appears we have highly compatible objectives. The distinguishing feature of DoXFS is its use of the XFS file system for storing metadata properties about documents. This approach gives DoXFS much better performance than is possible through RDBMS-based document management systems.

DoXFS would be grafted in at the Document::Repository layer, providing a higher-powered alternative to the basic class for administrators willing and able to set up a dedicated XFS file system for it.

Conclusion

Document::Manager follows Perl's ideal of extreme modularity. By dividing a large app up into discrete pieces, each focusing on a specific portion of the task and communicating with other components using concise, flexible APIs, it gives the overall application greater flexibility. Judging from the success of other widely used Perl modules, it may help entirely new kinds of applications to be created.

SOAP::Lite, though SOAP is certainly far from perfect as a protocol, enables decoupling of the interface and the back-end logic, thereby promoting a much greater degree of modularity than is usually achievable with typical LAMP applications. If this approach can catch on more widely in the open-source world, it may enable us to more swiftly solve a much broader range of needs than we currently can.

TPJ

