Cleaning Up the Markup Mess

April, 2004: Letter from the Editor

The dividing line between the printed word and the digital word is often a murky place to work. While it's true that almost all documents these days are digital, it doesn't follow that those documents offer all the advantages that digitized information can provide, such as easy categorization, retrieval, and transformation to other formats. Our modern world creates a chaotic storm of these documents every day, and it's often the job of Perl to pull order out of that chaos.

All of this became obvious in a recent conversation with a TPJ reader whose job involves the unenviable task of converting PDF documents to HTML. As he puts it: "PDF has NO structure. It's just digital paper. What's in the file is: 'Put this text at this location.'"

The documents he has to parse contain no information about the logical characteristics of their own parts. Basically, this means you don't really know what's what—how do you tell a magazine article's title from its subtitle from the caption of a figure or diagram? Having written some Perl here at TPJ to convert our documents from our page-layout program's proprietary format to HTML (and XML), I sympathize.

Why is this such a nasty issue? Shouldn't SGML and XML solve this problem? Well, yes. If your documents are created from the beginning in these information-rich markup languages, and your word processor or page-layout program saves your documents in these formats, you're golden. But good luck finding a WYSIWYG word processor or page-layout program that does a halfway-decent job of this. The current industry-standard programs either have very naïve XML implementations or require a more complicated configuration process than most users are willing to tolerate to enable such XML awareness.

And in order for a document format (like XML) to be information-rich, someone has to input that information. It's not enough to type in the title of an article—the author must then also label this title with the metainformation that in some way states that "this is a title." This is an extra step that most document creators won't take unless they are forced to. When faced with the choice of meeting a printer's deadline or labeling a document for easy processing later on, you can guess where the document author's priority would (and should) fall.

The real problem here is that document authors are concerned mostly with what the document looks like, not how it's logically structured. After all, if some text is at the top of a document, and it's large and bold, we naturally see it as a title. If it's text aligned underneath a picture, we see it as a caption describing that picture. This is a visual process that involves only our eyes and our brains. But there's a solution here, too: In many cases, these visual characteristics are the only metainformation you need to parse the document's structure.

At TPJ, we use a combination of these text characteristics and an object's position on the page to make decisions about tagging our content. We do it all in Perl, naturally. It allows us to entirely automate the process of conversion to HTML—but only because our articles are laid out in a very consistent way. If a document's text styles and the positions of various objects within the document change unpredictably from one document to the next, there are no patterns to work with and you're back to manually labeling all the parts of your document.
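
As a rough illustration of the approach described above (written in Python for brevity, and not TPJ's actual conversion code), the sketch below guesses a logical role for each run of text from its visual traits alone and wraps it in a matching HTML tag. Everything in it is an assumption made for the example: the `TextRun` record, the `below_image` flag, the `classify` and `to_html` helpers, and especially the size and position thresholds, which only make sense for a layout as consistent as the one this paragraph describes.

```python
import html
from dataclasses import dataclass


@dataclass
class TextRun:
    """One run of text as a layout program might report it (hypothetical shape)."""
    text: str
    font_size: float           # in points
    bold: bool
    y: float                   # distance from the top of the page, in points
    below_image: bool = False  # assumed flag: the run sits directly under a picture


def classify(run: TextRun, body_size: float = 10.0) -> str:
    """Guess a logical role from purely visual characteristics.

    The thresholds are illustrative guesses, not measured values.
    """
    if run.below_image and run.font_size <= body_size:
        return "caption"
    if run.bold and run.font_size >= 2 * body_size and run.y < 150:
        return "title"
    if run.font_size > body_size and run.y < 250:
        return "subtitle"
    return "body"


# Map each guessed role to the HTML wrapper used for it.
TEMPLATES = {
    "title": "<h1>{}</h1>",
    "subtitle": "<h2>{}</h2>",
    "caption": '<p class="caption">{}</p>',
    "body": "<p>{}</p>",
}


def to_html(runs) -> str:
    """Tag every run according to its guessed role, escaping the text itself."""
    return "\n".join(
        TEMPLATES[classify(run)].format(html.escape(run.text)) for run in runs
    )


if __name__ == "__main__":
    page = [
        TextRun("Cleaning Up the Markup Mess", 24.0, True, 60.0),
        TextRun("April, 2004: Letter from the Editor", 14.0, False, 100.0),
        TextRun("Figure 1: Page layout", 9.0, False, 400.0, below_image=True),
        TextRun("Some body text.", 10.0, False, 300.0),
    ]
    print(to_html(page))
```

Rules this simple hold up only while every document follows the same layout. Once the styles or positions drift, the thresholds stop matching anything meaningful.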

As great as Perl is, it can't help us to glean logical structure from our documents when there's nothing to be gleaned.

Kevin Carlson
Executive Editor
The Perl Journal
