The dividing line between the printed word and the digital word is often a murky place to have to work. While it's true that almost all documents these days are digital, it doesn't necessarily follow that all those documents have all the potential advantages that digitized information can provide, such as easy categorization and retrieval and easy transformation to various other formats. Our modern world creates a chaotic storm of these documents every day, and it's often the job of Perl to pull order out of that chaos.
All of this became obvious in a recent conversation with a TPJ reader whose job involves the unenviable task of converting PDF documents to HTML. As he puts it: "PDF has NO structure. It's just digital paper. What's in the file is: 'Put this text at this location.'"
The documents he has to parse contain no information about the logical characteristics of their own parts. Basically, this means you don't really know what's what: how do you tell a magazine article's title from its subtitle from the caption of a figure or diagram? Having written some Perl here at TPJ to convert our documents from our page-layout program's proprietary format to HTML (and XML), I sympathize.
Why is this such a nasty issue? Shouldn't SGML and XML solve this problem? Well, yes. If your documents are created from the beginning in these information-rich markup languages, and your word processor or page-layout program saves your documents in these formats, you're golden. But good luck finding a WYSIWYG word processor or page-layout program that does a halfway-decent job of this. The current industry standards either have very naïve XML implementations, or require a more complicated configuration process than most users are willing to tolerate in order to enable such XML awareness.
And in order for a document format (like XML) to be information-rich, someone has to input that information. It's not enough to type in the title of an article; the author must then also label this title with the metainformation that in some way states that "this is a title." This is an extra step that most document creators won't take unless they are forced to. When faced with the choice of meeting a printer's deadline or labeling a document for easy processing later on, you can guess where the document author's priority would (and should) fall.
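To make that extra labeling step concrete, compare a purely presentational encoding of a title with a semantic one. The tag names below are illustrative only; they don't come from any particular DTD or from TPJ's own format:

```xml
<!-- Presentational: the title is merely big, bold text -->
<para font="Helvetica-Bold" size="24">Parsing the Chaos</para>

<!-- Semantic: the author has labeled the text's role -->
<article>
  <title>Parsing the Chaos</title>
</article>
```

A downstream program can do nothing reliable with the first form, but the second can be transformed to HTML, indexed, or restyled without guesswork.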
The real problem here is that document authors are concerned mostly with what the document looks like, not how it's logically structured. After all, if some text is at the top of a document, and it's large and bold, we naturally see it as a title. If it's text aligned underneath a picture, we see it as a caption describing that picture. This is a visual process that involves only our eyes and our brains. But there's a solution here, too: In many cases, these visual characteristics are the only metainformation you need to parse the document's structure.
At TPJ, we use a combination of these text characteristics and an object's position on the page to make decisions about tagging our content. We do it all in Perl, naturally. It allows us to entirely automate the process of conversion to HTML, but only because our articles are laid out in a very consistent way. If a document's text styles and the positions of various objects within the document change unpredictably from one document to the next, there are no patterns to work with and you're back to manually labeling all the parts of your document.
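The heuristic approach described above can be sketched in a few lines of Perl. The object fields (`size`, `bold`, `y`, `near_image`) and the thresholds are assumptions made up for this sketch, not TPJ's actual pipeline; the point is simply that visual traits alone can drive the tagging:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Classify a text object by its visual characteristics.
# Fields and thresholds are illustrative assumptions.
sub classify {
    my ($obj) = @_;
    # Large, bold text near the top of the page reads as a title.
    return 'title'   if $obj->{size} >= 18 && $obj->{bold} && $obj->{y} < 100;
    # Text adjacent to an image reads as a caption.
    return 'caption' if $obj->{near_image};
    return 'body';
}

my @objects = (
    { text => 'Parsing the Chaos', size => 24, bold => 1, y => 40,  near_image => 0 },
    { text => 'Figure 1: a PDF page', size => 9, bold => 0, y => 500, near_image => 1 },
    { text => 'The dividing line...', size => 11, bold => 0, y => 120, near_image => 0 },
);

# Emit a crude semantic tagging of each object.
for my $obj (@objects) {
    my $tag = classify($obj);
    printf "<%s>%s</%s>\n", $tag, $obj->{text}, $tag;
}
```

The fragility the column describes is visible right in the code: if a layout artist sets a title at 16 points, or floats a caption away from its figure, the rules above silently misfile the text, which is why consistent layout is a precondition for this kind of automation.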
As great as Perl is, it can't help us to glean logical structure from our documents when there's nothing to be gleaned.
The Perl Journal