Migrating from HTML to XML (Web Techniques, July 2000)

Migrating from HTML to XML

By Peter Fischer

As the Internet world shifts its focus to XML and related technologies, what happens to HTML? Everywhere you go, products are becoming "XMLitized" as vendors rush to gain market share. While this is great for companies that are only now beginning to build their infrastructures, what about the rest of us whose sites have existed for years, accumulating documents architected on old HTML technology? How are we to take our millions and millions of HTML documents and bring them into the next generation of Internet computing? Fortunately, the market for tools in this space is growing, and technologies like Extensible Hypertext Markup Language (XHTML) are making it easier to migrate your repository of existing HTML documents.

The Motive

HTML began as a simple markup language for formatting data. It quickly evolved into a monster used to display data, and is now composed of many proprietary tags that aren't supported by every browser. Extraneous visual elements, like the <font> tag, only add to HTML's bloat. With the advent of newer devices whose displays are not as visually oriented, like handhelds and mobile phones, HTML is no longer capable of standing up to the new challenges of Internet- and Web-based computing. We're left with a legacy of information captured in HTML that can't evolve to support new computing platforms and paradigms.

XML promises relief. Because content creators must focus on the structure of their documents as opposed to their display, XML documents contain clean information that can be repurposed for various forms of presentation. XML is not a single, predefined markup language like HTML or WML, but rather a specification that lets you create customized markup languages for different classes of documents and data. This means you can create tags that are more appropriate to your document type than the standard HTML tags. Why use <p> tags when you could be using <Abstract> to indicate that a section of text is the abstract to your article?
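To make the idea concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree module. Only <Abstract> comes from the text above; the <Article>, <Title>, and <Body> element names are invented for illustration, as is the sample content.

```python
import xml.etree.ElementTree as ET

# A document in a made-up article vocabulary. Only <Abstract> is taken from
# the article text; <Article>, <Title>, and <Body> are assumed names.
ARTICLE_XML = """\
<Article>
  <Title>Migrating from HTML to XML</Title>
  <Abstract>How to move legacy HTML documents toward XML.</Abstract>
  <Body>XML promises relief.</Body>
</Article>
"""


def get_abstract(xml_text):
    """Return the text of the <Abstract> element, or None if it is absent."""
    root = ET.fromstring(xml_text)
    node = root.find("Abstract")
    return node.text if node is not None else None


if __name__ == "__main__":
    print(get_abstract(ARTICLE_XML))
```

Because the tag names the role of the text rather than its appearance, a program can ask for "the abstract" directly instead of guessing which <p> holds it.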

The Extensible Stylesheet Language (XSL) provides the mechanism for uniting content stored in XML files with multiple presentation formats. Many XSL-based products take an XML document and apply a style sheet to it to create an HTML document that your Web browser can display. There are also style sheets that can be applied to the same XML document to create a WML document for display on a wireless device. As new presentation languages are created, all you have to do is create a new style sheet for the documents in your repository. The "body" stays the same, but the "uniform" can be changed freely.
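The "one body, many uniforms" idea can be sketched without an XSL processor. Python's standard library does not include one, so in the sketch below two plain Python functions stand in for two style sheets; this is not XSL itself, only an illustration of rendering the same content two ways. The <Article> vocabulary and the function names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Assumed sample content in a made-up vocabulary.
CONTENT_XML = (
    "<Article><Title>Hello</Title>"
    "<Abstract>A short summary.</Abstract></Article>"
)


def to_html(xml_text):
    """Render the article as a small HTML page (stand-in for one style sheet)."""
    root = ET.fromstring(xml_text)
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    ET.SubElement(body, "h1").text = root.findtext("Title")
    ET.SubElement(body, "p").text = root.findtext("Abstract")
    return ET.tostring(html, encoding="unicode")


def to_wml(xml_text):
    """Render the same article as a single WML card (a second 'uniform')."""
    root = ET.fromstring(xml_text)
    wml = ET.Element("wml")
    card = ET.SubElement(wml, "card", title=root.findtext("Title"))
    ET.SubElement(card, "p").text = root.findtext("Abstract")
    return ET.tostring(wml, encoding="unicode")


if __name__ == "__main__":
    print(to_html(CONTENT_XML))
    print(to_wml(CONTENT_XML))
```

The content document is never touched; supporting a new device means adding another rendering function, which is the role a new style sheet plays in an XSL-based setup.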

The XHTML Stepping Stone

The trouble, of course, is that it's not easy to separate the body of an HTML document from its presentation. So how do we reengineer, reformulate, or reface the millions upon millions of HTML documents that exist today?

One option is to step over to XHTML. XHTML combines the best aspects of HTML and XML into a single, consistent technology. XHTML is a presentation language built from the XML specification. And because it's very similar to HTML, you can migrate your old documents just by cleaning them up a little. This is much easier to accomplish than the tear-down and restructuring of your document that's required when leaping directly to XML.

Cleaning up your documents is a matter of conforming to certain standards defined by XHTML. HTML in itself is not directly derived from XML, so to get the content into XHTML format, you must follow a set of rules.

First, XHTML is case sensitive. XHTML element and attribute names must be written in lowercase. With XHTML you can no longer use one of the tricks of the trade to improve HTML code readability: typing element and attribute names in uppercase and the corresponding values in lowercase. Your <P ALIGN="right"> tag must now be written as <p align="right">.

Second, XHTML is strict about the opening and closing of elements. With XHTML you're required to close the most recently opened tag first, followed by the others in succession. In other words, a section that starts with <p><i> must be closed with </i></p>, as opposed to </p></i>. While HTML technically has the same rule, it's rarely checked. With XHTML, it's always checked.
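Because XHTML is XML, any XML parser will enforce this nesting rule for you. The following sketch uses Python's standard xml.etree.ElementTree parser; the function name is an invention for this example, and the check covers well-formedness only, not validity against an XHTML DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(is_well_formed("<p><i>text</i></p>"))  # properly nested
    print(is_well_formed("<p><i>text</p></i>"))  # tags closed out of order
```

An HTML browser would quietly display both fragments, but the XML parser rejects the second one outright.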

Third, in XHTML all nonempty elements must be closed explicitly. A trick employed by many developers was to drop a lone <p> tag between paragraphs, instead of the proper <p> at the beginning of a paragraph and a closing </p> at the end. Similarly, all empty elements must be terminated. While in HTML constructs like <hr> and <br> are valid, in XHTML the corresponding valid constructs are <hr /> and <br />, respectively.

Additionally, all XHTML attribute values need to be quoted. This means constructs like <table border = 2> need to be replaced by <table border = "2">. And finally, the <head> and <body> elements are required in an XHTML document, and the <head> section must contain a <title> element.
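Several of these rules are mechanical enough to automate. The sketch below, built on Python's standard html.parser module, rewrites markup so that names are lowercase, attribute values are quoted, and empty elements are terminated. The class and function names, and the short list of empty elements, are assumptions for this example. It does not supply missing </p> end tags, repair bad nesting, or add <head> and <title>, so treat it as an illustration of the rules rather than a converter.

```python
from html import escape
from html.parser import HTMLParser

# A partial, assumed list of elements that HTML treats as empty.
EMPTY_ELEMENTS = {"br", "hr", "img", "input", "meta", "link"}


class XhtmlRewriter(HTMLParser):
    """Re-emit HTML markup following a few of the XHTML rules."""

    def __init__(self):
        # Keep entity and character references as written in the source.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _attrs(self, attrs):
        parts = []
        for name, value in attrs:
            # A valueless attribute such as "checked" becomes checked="checked".
            value = name if value is None else value
            parts.append(' %s="%s"' % (name, escape(value, quote=True)))
        return "".join(parts)

    def handle_starttag(self, tag, attrs):
        # HTMLParser has already lowercased tag and attribute names.
        if tag in EMPTY_ELEMENTS:
            self.out.append("<%s%s />" % (tag, self._attrs(attrs)))
        else:
            self.out.append("<%s%s>" % (tag, self._attrs(attrs)))

    def handle_startendtag(self, tag, attrs):
        self.out.append("<%s%s />" % (tag, self._attrs(attrs)))

    def handle_endtag(self, tag):
        # Empty elements were terminated when they were opened.
        if tag not in EMPTY_ELEMENTS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def to_xhtml_fragment(html_text):
    """Apply the lowercase, quoting, and empty-element rules to a fragment."""
    parser = XhtmlRewriter()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(to_xhtml_fragment("<P ALIGN=right>Hello<BR>world</P>"))
```

The harder repairs, such as inferring where an unclosed paragraph ends, are the ones worth handing to a dedicated tool.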

If you make these changes to your documents, you'll have a repository of files that not only can be displayed by HTML browsers, but also can be processed by XML-enabled software. I find XHTML most suitable for environments that want to shed their HTML legacy and are committed to building well-architected sites that separate information content from information presentation.


If your site has only a few documents, making the changes by hand may not be a big deal. But if your site has accumulated a wealth of documents over the years, or if you're receiving new documents every day, you'll want to look at some of the tools available to help you make the conversions. There are a number of commercial and freeware tools out there. Some can help author new XHTML documents in addition to converting your older HTML files.

A basic but useful tool is HTML Tidy, a free multiplatform console application. Tidy started out as a program to clean up HTML markup errors and reformat HTML code for legibility. However, it has been extended to perform a variety of operations on an HTML file, including support for converting HTML to XHTML.
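For batch work, Tidy can be driven from a script. The sketch below assumes a tidy executable on the PATH and uses the -asxhtml, -q, and -o options as found in current Tidy releases; the version you have installed may differ, so check its help output. The function names and file names are hypothetical.

```python
import shutil
import subprocess


def tidy_command(src, dst):
    """Build a Tidy command line that writes an XHTML version of src to dst."""
    # -asxhtml: convert to XHTML; -q: suppress nonessential output;
    # -o: write the result to the named file instead of standard output.
    return ["tidy", "-asxhtml", "-q", "-o", dst, src]


def convert(src, dst):
    """Run Tidy if it is installed; return its exit status, or None if not."""
    if shutil.which("tidy") is None:
        return None
    # Tidy exits with 0 on success, 1 if it issued warnings, 2 on errors.
    return subprocess.run(tidy_command(src, dst)).returncode


if __name__ == "__main__":
    # "old.html" and "new.xhtml" are placeholder file names.
    print(convert("old.html", "new.xhtml"))
```

Wrapping the call in a function makes it straightforward to loop over a directory tree and collect the files that Tidy could not repair on its own.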

For those of us who are used to GUI-based tools, HTML Tidy isn't the most intuitive tool, as it requires that you provide a set of command-line arguments to control its processing. Fortunately, there are a number of tools that provide a nice GUI wrapper around Tidy, as well as a number of Web sites that provide automated CGI scripts for running a page or site through HTML Tidy. I've listed a few of those sites in "Online."

One tool that's very easy to use is HTML-Kit, another freeware program, which runs on the various Windows platforms. In addition to helping HTML authors edit, format, validate, preview, and publish documents on the Web, it has a customizable GUI that uses HTML Tidy to convert documents from HTML to XHTML. Its intuitive user interface, considered to be on par with advanced development tools, provides views with splitter windows: one with the original markup document and the other with the transformed markup. Another window lists any errors and offers advice and suggestions for improving the XHTML code. I've found that this tool offers a "learn as you grow" approach for migrating HTML to XHTML.

XML in Leaps and Bounds

As I mentioned earlier, XHTML is a presentation-based language. If you're simply interested in taking the first steps toward XML integration, using a product like Tidy will get you moving in the right direction. XML-enabled programs will parse and reuse your resulting documents easily. And future browsers that are built on XML standards should have no trouble reading and displaying your files. Yet, if you have big plans for repurposing your content in many different arenas, you may want to consider leaping directly to XML. This requires you to split out the content trapped inside your existing HTML documents so that it's no longer intertwined with presentation markup. Although a human touch is sometimes necessary to sort out meaningful content from meaningless markup, there are some good tools that will make your job a lot easier.

XSpLit is a new tool from Percussion Software that was originally bundled with Percussion's Rhythmyx product line. Rhythmyx is focused on content-management solutions and makes use of XML and XSL technologies internally to map database content into a presentation format. XSpLit works "under the covers" with Rhythmyx to provide this capability. Percussion saw a void in the marketplace as well as the general applicability of the tool and has made XSpLit available as a stand-alone product. Using XSpLit, developers can easily create the XML and XSL equivalents of their existing HTML documents without investing a significant amount of up-front time learning XML. Figure 1 shows the XSpLit interface. The tabs are presented from left to right according to the various stages in the splitting process.

After you tell it which HTML file to operate on, you press the Split button to perform the actual split operation. XSpLit provides views for the original source HTML, the tidied HTML, a log window that shows errors and warnings, and the DTD. For the DTD, XSpLit displays both the original DTD and the DTD based upon the fields that are defined in the HTML document. In addition, you can view the resulting XSL style sheet and the XML content.

XSpLit lets Web developers forward-engineer their HTML documents to corresponding XSL style sheets. Based on the names used to label your original document, XSpLit creates an XML DTD file that contains format definitions. In addition, XSpLit creates a sample XML document using the labeled static content as sample data. This sample file provides a placeholder for the real content file.
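The general idea of pulling labeled content out of a page can be sketched in a few lines of Python. This is not XSpLit's algorithm or interface, which the article does not describe in detail; it is a hypothetical illustration that assumes the page is already well-formed XHTML and that fields are labeled with id attributes. The sample page and the function name are invented for the example.

```python
import xml.etree.ElementTree as ET

# Assumed input: tidied XHTML in which each content field carries an id label.
PAGE = """\
<html><head><title>Demo</title></head>
<body>
  <h1 id="headline">Migrating from HTML to XML</h1>
  <p id="abstract">A short summary of the article.</p>
  <p>Copyright notice (unlabeled, so it stays with the presentation).</p>
</body></html>
"""


def extract_content(xhtml_text, root_name="content"):
    """Collect the text of every labeled element into a new XML document."""
    page = ET.fromstring(xhtml_text)
    content = ET.Element(root_name)
    for element in page.iter():
        label = element.get("id")
        if label:
            # The label becomes the element name in the content document.
            ET.SubElement(content, label).text = element.text
    return ET.tostring(content, encoding="unicode")


if __name__ == "__main__":
    print(extract_content(PAGE))
```

What remains of the page once the labeled text is lifted out is the presentation template, which is the part a style sheet would carry in a real split.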

You can combine the new XSL style sheet with the content in the XML document to create an HTML document with its source placed in the Output HTML pane. Pressing the Launch Browser button brings up your preferred Web browser with the newly created HTML page. This is useful if you want to test your XML against your style sheet, or if you want to generate some Web pages from your new documents.

All in all, XSpLit provides a handy interface that you can start using right away to get your site up to date.

Final Thoughts

I anticipate that over the next six to eight months we'll see a significant number of tools available for HTML-to-XHTML conversions as the XHTML standard matures. In addition, expect to see other XSpLit-like tools with new features and functionality coming out in that time frame.

In an industry that always seems to crave technological innovation, products are never too far behind. What we have available today is just a preview of what will become an extensive array of solutions. It's important not to fall behind during these early stages. We're just getting started, and those who prepare now will come out ahead.

Peter is director of technical services for Quantum Enterprise Solutions. He specializes in architecting, designing, building, and integrating large-scale distributed systems using application servers, integration middleware, and Java technologies. He can be reached at [email protected].
