Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Web Development

Java, XML, & Literate Programming


Feb00: Java, XML, & Literate Programming

Andy is a computer consultant who lives on Guernsey in the Channel Islands. He can be contacted at [email protected].


Jon Bentley once asked the question, "Why can't we read programs in the same way that we read novels or magazine articles?" One answer is that, with the exception of comments, programs are not primarily meant for human consumption -- they're written for execution by computers. Of course, over the years, we developed tricks to make life easier for ourselves, including meaningful variable names, the use of abstraction, and so on. Still, reading code remains an effort.

One reason for this difficulty is that code is ordered in the way required by the programming language, rather than the order in which we create it and think of it. Even when working with a system as productive as Java, writing code usually consists of adding some code to one class, then more code to another class -- building up to a complete working system. Explaining code to someone often follows a similar route; you trace an execution thread from one class to another until you can see how a complete task is accomplished.

In the early 1980s, Donald Knuth popularized a solution to the problem of understanding code with his concept of "literate programming." As Figure 1 illustrates, Knuth's WEB system takes a single source document containing code and explanatory text laid out in the order you would describe a program. Running a WEB document through Knuth's Tangle program extracted the code parts and assembled them into a syntactically correct (though not very readable) program. If you run the same source through his Weave program (available electronically; see "Resource Center," page 7), you get TeX source, which can create a beautifully typeset document. TeX produces a device-independent print format called "DVI," which needs one further step to translate it into instructions for a particular printer language such as PostScript. In short, in the hands of a skilled practitioner, a literate program can be read as a well-structured essay.

Sun's javadoc utility (part of the JDK) is similar to Weave, although it targets Java. Javadoc works with some simple Java commenting conventions to produce HTML documents that act as a reference to a collection of classes. But reading a reference manual is not necessarily the best way of understanding how to use a complicated API; hence the huge number of "how to" programming books found in bookstores. The documents produced by Weave are anything but reference manuals -- they are often works of art and can be read for enjoyment as well as explanation.

While Knuth's WEB system produced output via TeX and worked for Pascal programs, Marius, the system I present here, implements some of Knuth's ideas using Java as its programming language, with HTML as the output. In the process, I also leverage the power of XML.

Marius actually refers to my literate programming system as a whole -- both the Weave and the Comb programs (also available electronically both in source and compiled form). The code to Weave and Comb is freely available and published under the Open Source Artistic License.

Assuming you have tagged Java with XML tags, Marius lets you use Weave to generate literate HTML, then use Comb to generate literate Java. More specifically, you create a source document containing both Java and explanatory text. Weave creates formatted output in HTML, then Comb extracts the Java, reassembles it, and produces a normal looking Java program (unlike Knuth's original system that deliberately produced obfuscated Pascal). To illustrate, I include a large example .lit file that describes a minor problem I found in Sun's JTree class. To generate .java files and an out.html file, you'll need Marius and the XML jar from Sun. You execute them with:

java Comb TreeExample.lit

java Weave TreeExample.lit

XML lets documents be marked up with sets of tags that can be specially defined for the purpose. There are a number of programmer tools (such as parsers) that make it easy to work with XML. In the case of Marius, the source document has to be marked up both into areas that represent code, and those that represent explanation. XML is the ideal way to accomplish this. Using Sun's free XML parser (available at http://java.sun.com/xml/) means that projects such as Marius can be created in days, rather than months.

Inside Marius

To understand how you can use Marius to create literate programs, assume that the source document contains chunks of explanatory text (called the "narrative") and chunks of Java code that will be assembled into working classes. However, since you want to present code in an order suitable for explanation, I present code before a class is defined, or indeed at any point in the document.

You use Weave to take the source document and produce HTML that can be read with an ordinary web browser. As Figure 2 illustrates, you then use the Comb program (available electronically) to take the source document and produce syntactically correct Java. Listing One, for instance, is part of a Marius source document describing a simple benchmark algorithm. Running Listing One through Weave produces HTML that looks like Figure 3 when viewed with Netscape Communicator 4.5.

The code and classes are typeset using HTML <PRE> tags that do not disturb the indentation and line arrangement of the code. The resulting HTML pages are formatted in a way that follows the XML closely. Weave only departs from this when encountering a <CLASS> tag when it outputs a small fragment of text that not only resembles a Java class declaration, but also includes a hypertext link to the actual source so it can be seen in its original form.

The code for the NFIB algorithm, which is really the focal point of that particular example, was discussed prior to the introduction of the Benchmark class. In fact in the complete example, Benchmark also presents a fairly substantial user interface that occupies far more of the code in percentage terms than the actual algorithm. Although this code must appear in the final Java output, it is captured in <CODE> tags where the attribute ELIDE (which defaults to F) has been set to T. Weave simply leaves this out of the HTML.

The ability to express code like this is the main advantage of literate programming. Cumbersome but irrelevant details can be left until some later point, or removed altogether. Small, but vital, points can be emphasized. You become an author carefully drawing attention to one area, while drawing a veil over other pieces of the code. Marius breaks up <NARRATIVE> tags into paragraphs <P>, and in fact any HTML tags legal within <P> can be included using the <![CDATA[ ]]> tag. Text that appears inside a CDATA tag is ignored by the XML parser and is simply copied to the output by Weave. As a result, you can include the usual range of HTML tricks including images, different colored fonts, Java applets, and so on. In fact, you can pretty much do anything necessary to make yourself and the code you are presenting clearer to your audience.

A Marius source document keeps code fragments, classes, and explanation together in a single source document. A single Marius source can contain several complete programs. As a result, explanation and code tend to remain synchronized with each other. The common experience of finding out-of-date code in a manual is eliminated.

Comb is responsible for reassembling working Java source code from the Marius source document. This is handled as a three-phase process after the XML parsing is complete.

  • In the first pass, Comb builds up a database of classes and code fragments. It records the NAME, EXTENDS, and WEIGHT attributes of each fragment and the NAME, EXTENDS, IMPLEMENTS, and PACKAGE attributes of each class. It also recognizes a VISIBILITY attribute that defaults to Public and an ABSTRACT attribute that defaults to F. The EXTENDS attribute in a code fragment simply means that the code fragment is part of another fragment or class. The EXTENDS attribute in a class tag has the usual Java inheritance meaning; see Figure 4.
  • The second pass visits each fragment in the database in turn, in no particular order. If the fragment extends another fragment, the extended fragment is informed. Thus, Figure 4 becomes Figure 5.

  • The third pass can now easily perform a depth-first tree traversal outputting Java source as it goes. So that code fragments can be forced to appear in the correct order (since you want readable code) the WEIGHT attribute is used to sort nodes at the same level, fragments with a lighter weight appear earlier than fragments with a heavier weight.

Unlike Knuth's original work, which was meant for printing, Marius outputs for a widely distributed hypertext system. I occasionally want a hypertext link to refer to some class or code fragment within the Marius source file. Weave marks each fragment with an anchor in the HTML so it can be referred to easily, and there is also a <REF> tag that is legal to use within <P>s, as you can see from the example.

It's common to find both import statements and javadoc comments between a package statement for a class and the actual declaration of the class itself. However, any code fragment that extends a class appears within the class itself. To get around this problem, whenever a new class is introduced, Comb also creates a bookmark in its database for fragments that must end up in this position; if the class is called "Benchmark," the bookmark is called "Benchmark.prelude." So to include the user-interface packages in the example, you might see Listing Two in the Marius source.

Comb outputs code for each class in its own file following the conventions expected by Java compilers -- if the class is X, its source is to be found in the file X.java, but it ignores directory structure. Weave simply outputs to the file out.html. These design decisions were taken to keep Comb and Weave as simple and flexible as possible. It is common to run both programs with a short script to rename the output of Weave to something more appropriate, move the Java files to their appropriate directories, then run the Java compiler.

The code that appears in the fragments is (with two unfortunate exceptions) much as you would expect to find it in an ordinary Java source file. Using XML to create the Marius source gives many advantages but it has a problem with the "<" character that is used to signal the beginning of a new tag. Thus, in our example when the code was if(n < 2) {, you actually have to write if (n < 2) {. There's a similar problem with the "&" character that has to be expressed as &. So, in Marius source, you will sometimes see expressions like:

if (object != null && object.some Call()) {

Admittedly, this looks strange. Fortunately, if you make a mistake and use the more natural < or &&, the parser produces a fairly meaningful error message that enables you to track down the problem pretty quickly.

The tags that can legally appear in the Marius source file are controlled by a Document Type Declaration (DTD). The DTD that is currently used with Marius states that a conforming and valid Marius document contains an optional <COPYRIGHT> tag followed by one or more <SECTION> tags. The text appearing between the copyright tags is placed at the beginning of every Java file output by Comb.

<!ELEMENT LIT (COPYRIGHT?, SECTION+)>

<!ELEMENT COPYRIGHT (#PCDATA)>

<!ELEMENT SECTION ((SUBSECTION | NARRATIVE | CODE | CLASS | FILE)*) >

<!ELEMENT SUBSECTION ((NARRATIVE | CODE | CLASS | FILE)*) >

Sections can be further subdivided into <SUBSECTION> tags, although the decision to use subsections tends to depend on the length of the program you are writing. In any case, sections and subsections both contain narrative, classes, and code fragments. There is also the possibility of including other text files that may be required by the program using the <FILE> tag. Comb treats these as a special case of <CLASS> except that the name of a file must be fully specified in its NAME attribute -- so you sometimes see something like this in a Marius source file:

<FILE NAME="Readme.txt">

To install Benchmark simply execute the install.exe file.

</FILE>

Similar Systems

Marius isn't the only Java/XML-based literate programming system. The ABC system (presumably named after its author Anthony B. Coates and available at http:// www.allette.com.au/xml-litprog/) shares some similarities to Marius. It's a literate programming project written in Java using XML documents as input. However, ABC is more ambitious -- eventually it will be language neutral (you could create C++ programs this way) and typesetter independent (you may be able to output PostScript or PDF, not just HTML).

Likewise, there is an excellent, language- neutral TeX-based system called "FunnelWeb" (http://www.ross.net/funnelweb/), and of course Knuth's original work rolls on, as well as a related C-based version called "CWEB." Also, Perl 5 lets insert documentation directly into Perl programs with what are called "PODS." None of these use XML as an input language, of course. They were all done before it was widely released. It's partly XML's unrecognized potential for letting you easily create small languages in the AWT sense that I find so interesting about it.

Conclusion

I have also used Marius source files to convince people that certain pieces of code contain bugs. Although this is a simple process when sitting at the opposite desk, we often have to use code that was created and is maintained by distant groups. Quite often, there's not even a phone number to call and all that can be done is to send a working example of the problem to a support e-mail address. You instantly hit the problem of emphasis. How much of the code in an example is present for support purposes, and how much of the example code demonstrates the problem? Quite often, the broken piece of code is one or two lines buried in a much larger structure -- but without the additional code it cannot be demonstrated. I've found that describing an issue with Marius usually brings a response where other methods have failed.

Marius has been kept deliberately simple because it was created as something of an experiment. Now that it has been in use for a little while, there are a variety of additional features that have been suggested -- mainly for Weave. The examples that have been produced with it have been growing steadily larger and are reaching the stage where they are becoming small books rather than large articles. Future versions of Marius will probably have to make concessions to this development and allow for tables of contents, indexes, and similar elements. The other issue that has arisen is that the HTML produced has a simple and fixed style. It would be nice to be able to make changes to the style using style sheets or some form of HTML template to allow a wider variety of HTML to be produced, so that it could be integrated with existing web sites.

DDJ

Listing One

<SECTION TITLE="The Nfib algorithm">
<NARRATIVE>
<P>Nfib is our next example of benchmarking algorithms, it is closely
related to the Fibonacci series, and gives
a broad idea of the speed of procedure calls in the interpreter. It
relies on its rather peculiar property
that the integer that it returns is the same as the number of procedure
calls needed to generate the number.</P>
<P>The Fibonacci series starts 1, 1, 2, 3, 5,... each element is the sum
of the two before. The <![CDATA[<B>NFIB</B>]]>
algorithm adds 1 at each stage giving the series 1, 1, 3, 5, 9,... Thus
counting from 0, we can see that nfib(3) = 5.
Here is the code for the method:</P>
</NARRATIVE>
<CODE NAME="nfib" EXTENDS="Benchmark" WEIGHT="70">
 public int nfib(int n) {
     if (n < 2) {
  return 1;
     } else {
         return nfib(n - 1) + nfib(n - 2) + 1;
     }
 }
</CODE>
<NARRATIVE>
<P>It's quite easy to see that nfib(2) takes exactly three calls to nfib
to calculate its return value of 3.
To get a very rough idea of procedure call speed, simply divide the time
taken to calculate a relatively large
<REF PATH ="#nfib">nfib</REF> number (e.g, nfib(37)) by the return
value.</P>
<P>The algorithm can be found as part of the benchmark class defined
here, which includes a user interface:</P>
</NARRATIVE>
<CLASS NAME = "Benchmark"
       EXTENDS = "JFrame"
       IMPLEMENTS = "ActionListener"
 PACKAGE = "com.cedillasoft.benchmarks"/>

Back to Article

Listing Two

<NARRATIVE>
<P>We are going to use the Java Foundation classes to create the user
interface so the awt and swing packages need to be imported in the
Benchmark class.</P>
</NARRATIVE>
<CODE NAME="imports" EXTENDS="Benchmark.prelude">
import java.awt.*;
import java.awt.event.*;
import javax.swing.*;
</CODE>


Back to Article


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.