




WebReview.com: HTML Tidy: Keeping it Clean

Rank: 2

HTML Tidy At-A-Glance

• Developed by: Dave Raggett

• Supported platforms include Windows, MacOS, Solaris, Linux, FreeBSD, BeOS, MS-DOS, UnixWare, and others

• Cost: Free

If there's one thing I can't stand it's badly formed HTML. That's HTML that doesn't conform to a W3C standard. You've probably seen what I'm talking about all over the Web: missing tags, proprietary extensions, constructs that break in all but one or two browsers. These errors are annoying and can be avoided with very little time and effort.

There are numerous applications and online services that validate HTML syntax. More often than not, though, they're good but not great. Most will check HTML, but not correct it. If you have a lot of files, you have to check each one and make the corrections by hand, which takes a lot of time and effort. All in all, just about every app and service I've tried is either too bulky or lacks the functionality I need.

Weighing in at under 200 KB, HTML Tidy is the closest you'll get to a perfect HTML utility. Not only does it check HTML files, Tidy fixes the problems it finds. Tidy proves that a lot of functionality can be crammed into a tiny package.

Tidy is an anachronism in the world of the graphical user interface. It's a command line application, meaning you have to type a command, along with its options, to get the program to run. It may sound like an old-fashioned way of doing things, but it's anything but. The command line interface gives Tidy a great deal of flexibility, and Tidy can read its options from a configuration file, which saves you keystrokes.

How Tidy Works

Tidy fixes a number of common, and not so common, mistakes in HTML files. It does this by analyzing the markup in a file and comparing it to the HTML 4.01 specification. Depending on the options you specify, Tidy can fix the problems it finds or it can generate a log detailing the errors.

The range of problems Tidy can fix is impressive. It can add missing end tags, repair mismatched ones, correct tags that are in the wrong order, insert quotes around attribute values, and even add a missing bracket to a tag. One of the few things Tidy can't do is add summary attributes to tables. You'll have to go in and add that attribute manually.
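To make the idea of checking for missing end tags concrete, here is a minimal sketch in Python. It is not how Tidy works internally; it only illustrates the general technique of tracking open elements on a stack. The class and function names are invented for this example, and the check is stricter than HTML 4.01, which lets some end tags (such as </p> and </li>) be omitted.

```python
from html.parser import HTMLParser

# Elements that never take an end tag in HTML 4.01.
VOID = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"}


class UnclosedTagChecker(HTMLParser):
    """Toy checker (hypothetical name): reports elements left open."""

    def __init__(self):
        super().__init__()
        self.stack = []     # (tag, line number) for each element still open
        self.problems = []  # human-readable messages

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        if tag in VOID:
            return  # e.g. the end half of "<br />"
        if tag not in [open_tag for open_tag, _ in self.stack]:
            self.problems.append(f"line {self.getpos()[0]}: unexpected </{tag}>")
            return
        # Pop back to the matching start tag; anything popped on the
        # way was opened but never closed.
        while self.stack:
            open_tag, line = self.stack.pop()
            if open_tag == tag:
                break
            self.problems.append(f"line {line}: <{open_tag}> has no end tag")


def find_unclosed(markup):
    """Return a list of messages about missing or unexpected end tags."""
    checker = UnclosedTagChecker()
    checker.feed(markup)
    checker.close()
    for open_tag, line in checker.stack:
        checker.problems.append(f"line {line}: <{open_tag}> has no end tag")
    return checker.problems


if __name__ == "__main__":
    for message in find_unclosed("<p><b>bold text</p>"):
        print(message)
```

A checker like this only reports problems. Tidy goes further and writes out corrected markup.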

Total Control

Tidy's 22 command line options are enough for most purposes, but they only scratch the surface of what this utility can do. You can tap into all of Tidy's power and functionality by using a configuration file: a plain text file listing program options, which gives you access to many of Tidy's extended features. These features include HTML to XHTML conversion, fixing the so-called HTML produced by Microsoft Word, and adding ALT text to images. There are even options for formatting markup and for dealing with scripting languages. You simply specify the configuration file on the command line and let Tidy do the rest. You can find complete descriptions of the options at the Tidy home page.
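As a rough illustration of what a configuration file involves, the Python sketch below renders a handful of options in the "name: value" form Tidy reads, and builds the matching command line. The helper names and the file names (tidy.conf, page.html) are made up for this example. The option names (output-xhtml, word-2000, indent, wrap) and the -config and -m switches are believed to match Tidy's documentation, but check them against the version you have installed.

```python
def render_config(options):
    """Render a dict of options as 'name: value' lines."""
    lines = []
    for name, value in options.items():
        if isinstance(value, bool):
            value = "yes" if value else "no"  # booleans are written yes/no
        lines.append(f"{name}: {value}")
    return "\n".join(lines) + "\n"


def tidy_command(config_path, html_path):
    """Build an argument list: read options from a file, fix the page in place."""
    return ["tidy", "-config", config_path, "-m", html_path]


if __name__ == "__main__":
    # Assumed option set for cleaning up a page exported from Word.
    options = {"output-xhtml": True, "word-2000": True, "indent": "auto", "wrap": 72}
    with open("tidy.conf", "w") as handle:  # hypothetical file name
        handle.write(render_config(options))
    print(" ".join(tidy_command("tidy.conf", "page.html")))
```

Keeping the options in a dictionary makes it easy to generate one file per set of pages, which takes some of the sting out of maintaining several configurations.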

For all their usefulness, configuration files can be cumbersome to create. There are 49 options available, and you not only have to sift through the options to find the ones you want to use, but you also have to spend time building a configuration file. And because no two sets of HTML files are exactly the same, you may have to create multiple files. Keeping track of them can be a chore.

A Touch of Style

Thanks to the influence of Netscape and Microsoft, far too many Web authors use extensions like <font> and <center>. Tidy has a neat option for replacing these tags with Cascading Style Sheet (CSS) properties, making the markup compliant with the HTML 4.01 standard.

The CSS option does a good job of replacing non-standard markup, but not necessarily with the CSS you would write yourself. Tidy generates its own class names, so the CSS it adds to a file looks like this:

<style type="text/css">
 :link { color: #0000ff }
 :visited { color: #800080 }
 li.c10 {list-style: none}
 p.c9 {font-family: Arial; font-weight: bold}
 b.c8 {font-family: Arial}
 div.c7 {margin-left: 2em}
 p.c6 {font-family: Arial; font-size: 120%; font-weight: bold}
 b.c5 {font-family: Arial; font-size: 120%}
 p.c4 {font-size: 80%}
 span.c3 {font-size: 80%}
 p.c2 {font-family: Arial; font-size: 150%; font-weight: bold}
 b.c1 {font-family: Arial; font-size: 150%}
</style>

You'll undoubtedly need to do some manual editing to fit the tidied files into your format. Dave Raggett, Tidy's author, wrote that "Tidy is expected to get smarter at this in the future." I'm looking forward to that.
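Part of that manual editing can be scripted. The sketch below, in Python, swaps generated class names such as c2 for names of your own choosing, both in the style rules and in class attributes. It is a rough text substitution, not a CSS or HTML parser: the function name and the replacement names are invented for this example, and it assumes the tidied file writes class attributes with double quotes and a single class each.

```python
import re


def rename_classes(markup, mapping):
    """Replace generated class names (c1, c2, ...) using a mapping."""

    def fix_selector(match):
        return "." + mapping.get(match.group(1), match.group(1))

    def fix_attribute(match):
        return 'class="' + mapping.get(match.group(1), match.group(1)) + '"'

    # Selectors such as "p.c2 {": a class name followed, before any
    # other brace, by the opening brace of a rule.
    markup = re.sub(r"\.(c\d+)\b(?=[^{}]*\{)", fix_selector, markup)
    # Attributes such as class="c2".
    markup = re.sub(r'class="(c\d+)"', fix_attribute, markup)
    return markup


if __name__ == "__main__":
    page = (
        '<style type="text/css">\n'
        " p.c2 {font-family: Arial; font-size: 150%; font-weight: bold}\n"
        "</style>\n"
        '<p class="c2">Keeping it Clean</p>'
    )
    print(rename_classes(page, {"c2": "headline"}))
```

Names missing from the mapping are left untouched, so you can rename classes a few at a time.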

XML and XHTML Anyone?

Since we're on the cusp of Web standards, Tidy also supports both XML and XHTML. Using configuration file options, you can convert an HTML file to XHTML or XML. Or at least that's what the documentation says. The XHTML conversion works very well. Tidy adds the XHTML doctype and namespace to the file, and converts HTML tags to their XHTML equivalents. For example, tags like <BR> and <HR> become <br /> and <hr />. The conversion checks out using the W3C's XHTML validator, and renders well in any browser.
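To show what that tag conversion amounts to, here is a small Python sketch that lowercases tag names and closes empty elements the XHTML way. It is an illustration only and makes no claim about how Tidy does the job. The function name is invented, the regular expression ignores comments, doctypes, and script content, and attribute names and values are left as they are.

```python
import re

# Elements that are always empty in HTML 4.01.
EMPTY_ELEMENTS = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"}

# A start or end tag: optional slash, tag name, then everything up to ">".
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>")


def to_xhtml_tags(markup):
    """Lowercase tag names and write empty elements as <name />."""

    def convert(match):
        slash, name, rest = match.groups()
        name = name.lower()
        rest = rest.rstrip()
        if slash:
            return f"</{name}>"
        if name in EMPTY_ELEMENTS and not rest.endswith("/"):
            return f"<{name}{rest} />"
        return f"<{name}{rest}>"

    return TAG.sub(convert, markup)


if __name__ == "__main__":
    print(to_xhtml_tags("<P>One<BR>Two<HR></P>"))
```

A real conversion also has to add the doctype and namespace, quote attribute values, and close every element, which is why a dedicated tool is the better choice for anything beyond a quick experiment.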

I've never had much luck converting a Web document to XML, however. Instead of an XML file, the output is still HTML. This was really the only disappointing aspect of Tidy. But that doesn't mean it has no XML capabilities. Using the -asxml command line option, Tidy can fix errors in XML files. Not every error, mind you: Tidy can't cope with CDATA, for example, but it catches most of the major ones.

Strengths and Weaknesses

One of Tidy's biggest strengths is its portability. Versions of Tidy are available for over 15 platforms, including Windows, DOS, Mac OS, several flavors of UNIX/Linux, and BeOS. On top of that, Tidy is an Open Source application. If there isn't a version for your favorite operating system and you program in C, you can download the source code and start hacking.

Tidy is also integrated with a number of Windows editors, including NoteTab Pro and the soon-to-be-renamed 1st Page 2000. Hopefully, Tidy will be integrated with more tools in the future.

Being a command line tool, Tidy won't appeal to anyone who is used to point-and-click convenience. You can, however, create a batch file or shell script to execute Tidy with the options you commonly use. The only other downside is that you sometimes have to run Tidy two or more times to completely clean a file.
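A script can also take care of the repeated runs. The Python sketch below keeps passing markup through a cleaning step until the output stops changing. The function names and the tidy.conf file name are invented for this example. The run_tidy helper assumes a tidy executable on your path that reads standard input, writes the cleaned markup to standard output, and returns an exit status of 2 when it finds errors it cannot fix.

```python
import subprocess


def tidy_until_stable(text, tidy_once, max_passes=5):
    """Apply tidy_once repeatedly until the output stops changing.

    Returns the final text and the number of passes made, counting
    the last pass that confirmed nothing changed.
    """
    for passes in range(1, max_passes + 1):
        cleaned = tidy_once(text)
        if cleaned == text:
            return cleaned, passes
        text = cleaned
    return text, max_passes


def run_tidy(text, config_path="tidy.conf"):
    """One pass through an external tidy program (assumed to be installed)."""
    result = subprocess.run(
        ["tidy", "-config", config_path],
        input=text,
        capture_output=True,
        text=True,
    )
    if result.returncode >= 2:  # assumed to mean errors Tidy could not fix
        raise RuntimeError(result.stderr)
    return result.stdout


if __name__ == "__main__":
    # Stand-in cleaning step so the example runs without Tidy installed.
    collapse = lambda s: s.replace("  ", " ")
    print(tidy_until_stable("a    b", collapse))
```

Passing the cleaning step in as a function keeps the loop easy to try out with a stand-in. For real use, pass run_tidy instead.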

All told, Tidy is an indispensable tool for any Web author. It can save you a lot of time finding and correcting errors in your HTML, and it can ensure that your Web documents comply with standards. For that reason alone, Tidy is worth the download.


Scott is a Toronto, Canada-based freelance journalist. His articles and reviews have appeared in publications throughout North America.
