Web Development

XML-Based Programming Systems

By Gregory V. Wilson, March 01, 2003

Will mixing XML and source code revolutionize programming in the coming years? This is the question Greg untangles.

Greg, a DDJ contributing editor, is the author of Practical Parallel Programming (MIT Press, 1995), and currently works on access control software for Baltimore Technologies in Toronto. Greg can be reached at [email protected].

Programming languages evolve either by turning theory into practice, or by formalizing the best practices of the day. The first approach gave us APL's arrays, Prolog's unification, and ML's type inference. The second produced structured programming (which grew out of well-nested gotos), objects (which formalized the binding between data structures and the functions that manipulate them), and Perl (a mad scientist's version of the UNIX shell).

What good programmers are doing today can, therefore, give us hints about what tomorrow's programming systems will look like. In this article, I propose that one current trend in particular—mixing XML and source code—will revolutionize programming over the next five years. To understand why, you must first look at today's mixed-mode programming systems.

Mixed Content

Early interactive web sites were built using CGI programs that produced fully formed HTML pages. It didn't take long for programmers to realize that nesting code inside HTML, instead of printing HTML from code, made visual design easier and simplified maintenance. Most second-generation sites, therefore, use template frameworks, such as ColdFusion, Active Server Pages (ASP), and Java Server Pages (JSP). These tools let us embed expressions, loops, or entire class declarations in HTML pages. A preprocessor extracts the code, executes it (either on the fly or by creating a free-standing program), and sends the output to the user's browser. Figure 1, for instance, is a JSP page containing pure HTML (green), escape sequences delimiting Java (blue), and actual Java (red).
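
The preprocessing step is easy to sketch. The following Python fragment borrows JSP's <%= ... %> expression delimiters, but the render function and its use of eval are my own assumptions for illustration, not how any real JSP engine works.

```python
import re

# Hypothetical mini-template engine: text outside <%= ... %> is copied
# verbatim; text inside is evaluated as a Python expression.
EXPRESSION = re.compile(r"<%=(.*?)%>", re.DOTALL)

def render(template, variables):
    """Replace each <%= expr %> with the value of expr."""
    def evaluate(match):
        # eval is acceptable for a sketch; never use it on untrusted input.
        return str(eval(match.group(1).strip(), {}, dict(variables)))
    return EXPRESSION.sub(evaluate, template)

page = "<p>Hello, <%= name.upper() %>! You have <%= len(items) %> items.</p>"
html = render(page, {"name": "greg", "items": [1, 2, 3]})
print(html)
```

Even in a toy like this, the page author sees one thing (the template) while the browser sees another (the expanded HTML).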

Embedding code in XML has more uses than just web-page generation. Figure 2 is an input file for Ant (a Make replacement developed as part of Apache). You put commands and dependency information in XML files; Ant then reads them, traces dependencies, and executes required actions. As with JSP, you can extend the command set by writing new Java classes, then binding those classes to particular tags.
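
To make the "traces dependencies" step concrete, here is a minimal Python sketch. The build file uses Ant's target, name, depends, and default conventions, but the execution_order function is a hypothetical stand-in: it only computes the order in which targets would run, and executes nothing.

```python
import xml.etree.ElementTree as ET

# A small build file in the spirit of Ant: each <target> names the
# targets it depends on in a comma-separated "depends" attribute.
BUILD_FILE = """
<project default="jar">
  <target name="init"/>
  <target name="compile" depends="init"/>
  <target name="test" depends="compile"/>
  <target name="jar" depends="compile,test"/>
</project>
"""

def execution_order(xml_text, goal=None):
    """Return target names in an order that satisfies every dependency."""
    project = ET.fromstring(xml_text)
    depends = {
        t.get("name"): [d.strip() for d in t.get("depends", "").split(",") if d.strip()]
        for t in project.findall("target")
    }
    order, in_progress = [], set()

    def visit(name):
        if name in order:
            return
        if name in in_progress:
            raise ValueError("circular dependency at " + name)
        in_progress.add(name)
        for dep in depends[name]:
            visit(dep)
        in_progress.discard(name)
        order.append(name)

    # Start from the requested goal, or the project's default target.
    visit(goal or project.get("default"))
    return order

print(execution_order(BUILD_FILE))
```

Because the build description is ordinary XML, the whole dependency graph is available through a generic parser rather than a Make-specific one.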

At the same time as these systems were being developed, other programmers were putting HTML in code, instead of code in HTML. The best-known example is Sun's JavaDoc, which has its roots in the literate programming (LP) system invented by Donald Knuth (http://www.literateprogramming.com/). LP's basic idea is that if code and documentation are kept in a single file, programmers will be more likely to keep them in synch. Figure 3, a Java class with its JavaDoc documentation, contains three types of content: Java source code, standard HTML markup (like the <em> emphasis tag), and short-hand markup (like the @version tag).

As useful as they are, systems such as JSP, Ant, and JavaDoc share two basic weaknesses. For one thing, they are hard to read. Source code and XML tags are difficult enough to make sense of on their own. When they are stirred together, human readers must invest a lot of cycles to disentangle them. The second, and more important, weakness is the representation gap between what you type in and what you have to debug. If a JSP page misbehaves, for example, you must mentally reverse the transformations applied by the JSP processor to figure out what to fix in the source.

Programming in XML

Sooner or later, programmers will solve both problems by abandoning flat text and storing programs as XML documents. Hype aside, XML is just an extensible way to represent nested data structures. In place of HTML's fixed tag set, XML lets you define new tags to fit specific problem domains. Namespaces then allow different sets of tags to be nested unambiguously.

Storing programs as XML documents would not be a big technical challenge. Figure 4 shows one approach, in which <doc> tags show documentation, and <code> tags show code. Everything inside the <doc> area is pure HTML; everything inside the <code> is Java, with special characters like "<" replaced by escape sequences like "&lt;".
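
A short Python sketch shows why the escaping is not a burden in practice. The <source> root element and the sample Java fragment are my own assumptions, not the layout of Figure 4; the point is only that a standard XML library escapes and unescapes special characters for you.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: a <source> root holding <doc> and <code> children.
source = ET.Element("source")
doc = ET.SubElement(source, "doc")
doc.text = "Prints the integers below a limit."
code = ET.SubElement(source, "code")
code.text = "for (int i = 0; i < limit; i++) { System.out.println(i); }"

# Serializing escapes the special characters automatically.
serialized = ET.tostring(source, encoding="unicode")
print(serialized)

# Parsing reverses the escaping, so the round trip is lossless.
restored = ET.fromstring(serialized).find("code").text
assert restored == code.text
```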

Of course, you wouldn't write these tags by hand, or even see them (unless you had to debug a broken document). Instead, you would use a WYSIWYG editor that inserted the right escape sequences for special characters, formatted HTML as you typed, and so on. You would only look at the "raw tag" view of your source code when you really needed to, just as you only look at hex dumps and assembly code when something has gone wrong at a higher level.

While die-hard Emacs fans might howl at this, it is important to remember that programmers are about the only people who use flat text and ASCII editors these days. Almost everyone else uses tools like AutoCAD and Excel that store data in a machine-friendly form, then render it to make it comprehensible to humans. The formats these tools work with are all evolving into XML. Sooner or later, general-purpose editors will load plug-ins to handle different XML formats, just as today's web browsers load plug-ins to handle audio, video, and image data. Everyone, including programmers, will learn how to use such editors (just as even crusty, old UNIX programmers had to learn how to use GUI browsers instead of Lynx as web sites became more graphical). Once rich editors are a part of programmers' daily lives, it will be natural to use them for source code.

Trendiness isn't the only force that will drive the switch to XML source code. Another compelling reason is that it will let you store many different kinds of data directly in your source files. For example, authors frequently put pictures of data structures in textbooks, while programmers put class diagrams in documentation. These aid understanding, but since today's programming editors only understand flat ASCII, the only way to insert a picture into today's source code is to use ASCII art. (Linking to an external image file in JavaDoc doesn't count, because that image can't be viewed, much less manipulated, inside the source code while you're editing.)

If source files were XML, on the other hand, pictures could be embedded directly using Scalable Vector Graphics (SVG). Mathematics could be added as well using MathML. The World Wide Web Consortium's Amaya browser (http://www.w3.org/Amaya/) shows what a seamless editor for mixed content of this kind might look like.

Of course, there would be no need to restrict rich content to a predefined tag set. Tool vendors could define their own markup for bug fix IDs, compiler directives, and so on. For example, a CASE tool could insert links between class definitions and use cases, then automatically trace them using the equivalent of a web spider to ensure that development teams had built everything they promised. Today, this must be done using specially formatted comments, external metadata files, and other complex, fragile means. Storing programs as XML documents would give tool builders a simple, uniform alternative. In short, it would let programmers take full advantage of what Jon Udell calls "the universal canvas" (http://udell.roninhouse.com/GroupwareReport.html).
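
As a rough illustration of that kind of tracing, suppose a tool recorded use cases and the classes that implement them with the tags below. The <usecase> and <implements> elements are invented for this sketch and do not come from any real CASE tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup linking class definitions to the use cases they satisfy.
PROJECT = """
<project>
  <usecase id="UC-1">Customer logs in</usecase>
  <usecase id="UC-2">Customer resets password</usecase>
  <usecase id="UC-3">Customer closes account</usecase>
  <class name="LoginForm"><implements usecase="UC-1"/></class>
  <class name="PasswordReset"><implements usecase="UC-2"/></class>
</project>
"""

def unimplemented(xml_text):
    """Return the ids of use cases that no class claims to implement."""
    root = ET.fromstring(xml_text)
    promised = {u.get("id") for u in root.iter("usecase")}
    delivered = {i.get("usecase") for i in root.iter("implements")}
    return sorted(promised - delivered)

print(unimplemented(PROJECT))
```

The check is a dozen lines because the links are ordinary elements rather than specially formatted comments.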

Going All the Way

The third reason why this switch is likely to happen is that it would kick-start the same sort of positive feedback cycle that has made UNIX such a powerful programming environment. UNIX has always been more than just an operating system. From the start, it has also embodied a philosophy of programming based on lots of little tools. While standard UNIX command-line utilities like ls, wc, and sed are useful in their own right, their real strength is that they can be combined to process text in complex ways.

The duct tape that holds UNIX's tools together is a common data format and communications protocol: newline-terminated character strings; stdin, stdout, and exit(0). Any program that reads/writes the first, using the second, can be combined with any other to increase the power of both. This led to a virtuous circle: People stored data as flat text because there were so many ways to process it, then wrote text-processing tools because so much data was stored that way. The end result was the world's first component-based programming system.

UNIX's toolset is powerful, but not quite powerful enough to manipulate programs. Source code is stored as text, but a line-oriented view does not capture its real structure. After all, what matters in a program isn't where the line breaks are. What matters is how many parameters a particular function has, or whether one if/then/else is nested inside another.

Text-based programming on UNIX never got further than CPP, the C preprocessor. C's two main successors, C++ and Java, both turned their backs on textual manipulation of programs and incorporated other mechanisms (such as templates and the import statement) to do some or all of the things that CPP did for C. In many ways, though, this was a step backward, since there was already evidence of how powerful program manipulation could be when unleashed. That evidence came primarily from languages such as Lisp.

Everything in a Lisp program is a list, including the program itself. For example, instead of writing an assignment statement as x=3+y*5;, you would write something that looks like a list: (set x (+ 3 (* y 5))). Writing programs that inspect or create other programs is therefore as natural to Lisp as string manipulation is to Perl. Consequently, Lisp programmers have been able to explore just how powerful metaprogramming can be; they've found that it enables compact, elegant solutions to hard problems. Those discoveries could, in theory, be replicated in more conventional languages, but in practice, the complexity of manipulating C++, Java, and the like puts them out of reach.
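
The idea can be imitated, loosely, in Python by letting nested lists stand in for Lisp lists. The evaluate and variables helpers below are hypothetical names of my own, and the sketch handles only the operators that appear in the example.

```python
import operator

# The article's expression (set x (+ 3 (* y 5))) as nested Python lists.
program = ["set", "x", ["+", 3, ["*", "y", 5]]]

OPERATORS = {"+": operator.add, "*": operator.mul}

def evaluate(expr, env):
    """Evaluate a nested-list expression against a dict of variables."""
    if isinstance(expr, str):          # a variable reference
        return env[expr]
    if not isinstance(expr, list):     # a literal number
        return expr
    head, *args = expr
    if head == "set":                  # (set name value)
        env[args[0]] = evaluate(args[1], env)
        return env[args[0]]
    return OPERATORS[head](*(evaluate(a, env) for a in args))

def variables(expr):
    """Inspect a program: collect every variable name it mentions."""
    if isinstance(expr, str):
        return {expr}
    if isinstance(expr, list):
        return set().union(*(variables(e) for e in expr[1:]))
    return set()

env = {"y": 4}
evaluate(program, env)
print(env["x"])                    # 3 + 4 * 5
print(sorted(variables(program)))
```

The same data structure is both run (evaluate) and inspected (variables), which is the property that makes metaprogramming feel natural in Lisp.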

Storing code as in Figure 4 is, therefore, a half-measure, because most of what is interesting about the simple class it contains isn't visible at the XML level. The name of the class, the definition of its main method, that method's parameters and local variables, the reference to the System module inside the loop—all of this is just text as far as an XML processor is concerned. To get at it, you must switch from generic tools (such as SAX, DOM, and XSLT) to language-specific ones. Experience shows that this is a big enough impedance mismatch to stifle development and adoption of metaprogramming.

The solution is to mark up all of a program's structure using XML so that you could use standard tools to inspect and manipulate code with 100-percent accuracy. The language Superx++ shows what this might look like (see Figure 5). Developed by Kimanzi Mati (http://xplusplus.sourceforge.net), Superx++ is a fairly conservative mix of ideas from C++ and various scripting languages. What makes it interesting is that it uses XML tags instead of curly braces and semicolons to show program structure. Figure 5 is a class definition, the creation of a new object (the <node> element), and a couple of print statements to the predefined xout channel (the equivalent of C's stdout).

Viewed at this level, Superx++ might win an award for the world's least readable programming language. As discussed earlier, however, a WYSIWYG editor could easily allow programmers to view and edit this code in a more comprehensible way. At the same time, programmers could use standard tools like XSLT to search and manipulate it.

So what will you do once you can manipulate source code easily and accurately? For a start, once you know the tag set that has been used to mark up the code you're working with, you can throw together outliners, class browsers, and other tools as easily as you now throw together 10-line shell scripts. Wizards and code generators like those described by J. Craig Cleaveland in Program Generators with XML and Java (Prentice Hall PTR, 2001; ISBN 0130258784) will likely also become part of your toolkit. Instead of deriving classes and overriding methods one by one, you will write generators that take a handful of parameters and do the rest of the work automatically.
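
Here is roughly what such a throwaway outliner might look like in Python. The tag set is hypothetical (it is not Superx++ syntax); I assume only that classes, methods, and parameters are each marked up as elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, not Superx++ syntax: every class, method, and
# parameter is an element, so generic XML tools can see program structure.
PROGRAM = """
<program>
  <class name="Stack">
    <method name="push"><param name="item"/></method>
    <method name="pop"/>
  </class>
  <class name="Queue">
    <method name="enqueue"><param name="item"/><param name="priority"/></method>
  </class>
</program>
"""

def outline(xml_text):
    """Return one 'Class.method(params)' line per method."""
    lines = []
    for cls in ET.fromstring(xml_text).iter("class"):
        for method in cls.findall("method"):
            params = ", ".join(p.get("name") for p in method.findall("param"))
            lines.append(f"{cls.get('name')}.{method.get('name')}({params})")
    return lines

for line in outline(PROGRAM):
    print(line)
```

Nothing here parses the programming language itself; the generic XML parser does all the work, which is exactly what a line-oriented tool cannot offer.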

Further out, we may see tools like Scheme's hygienic macros enter the mainstream (http://www.schemers.org/Documents/Standards/R5RS/). Scheme uses quoting and antiquoting to mark out which parts of code are text to be copied and which parts are to be expanded. Unlike most macro systems, this lets you safely add new syntactic forms to the language. There is no portability problem due to nonstandard code, because the macros are included in the libraries that use them. Replace "quoting" with "XML tags," and this powerful idea could be applied to Java, Python, and other mainstream languages.
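
A very rough Python sketch of the "XML tags as quoting" idea follows. The <unquote> element and the expand function are inventions for this example: everything in the template is copied as-is, except for <unquote> elements, which are replaced by the code fragment bound to their name. The sketch does nothing about hygiene (avoiding accidental name capture), which is the hard part that Scheme's macro system solves.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical tags: the template defines an "until" loop in terms of
# "while", since until(c) means while(not c).
TEMPLATE = ET.fromstring(
    '<while><not><unquote name="condition"/></not>'
    '<body><unquote name="body"/></body></while>'
)

def expand(node, bindings):
    """Return a copy of node with every <unquote> replaced by its binding."""
    if node.tag == "unquote":
        return copy.deepcopy(bindings[node.get("name")])
    result = ET.Element(node.tag, node.attrib)
    result.text = node.text
    for child in node:
        result.append(expand(child, bindings))
    return result

bindings = {
    "condition": ET.fromstring('<call name="done"/>'),
    "body": ET.fromstring('<call name="step"/>'),
}
tree = expand(TEMPLATE, bindings)
print(ET.tostring(tree, encoding="unicode"))
```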

Going Even Further

One obstacle to switching to XML is its all-or-nothing nature. Changing one tool or specification at a time is hard enough. Changing editors, compilers, debuggers, and everything else in the programming toolbox at once seems like an impossible task.

However, many programming environments are completely controlled by specific vendors, who may well choose to switch from flat text to rich markup for their own reasons. If Microsoft had made VB.NET source files XML, tens of thousands of programmers would already be putting pictures and hyperlinks in their code. Vendors like Rational would already be shipping plug-ins for Visual Studio .NET to manage metadata. Wolfram Research and The MathWorks (which own Mathematica and MATLAB, respectively) might initially switch to let their users put mathematics in code, and so on.

A second challenge is that allowing rich content brings us right back to the representation gap. If a source file can contain many different types of information, won't debugging it be even harder than debugging complex C++ templates and wizard-generated code?

The answer depends on whether programmers are willing to take one last step and start building compilers, linkers, and debuggers in the same way that we build tools for everyone else. Microsoft Word is not just an editor. Thanks to COM, it is also a library full of powerful functions for formatting and spell-checking text. Microsoft Excel is similarly more than just a spreadsheet: Many experienced Visual Basic programmers use it as a programmable calculator without ever once launching its GUI.

And Visual C++ is, well, a black box, actually. It does not have a COM interface, so you cannot write scripts to control its behavior. Instead, you must "program" it using arcane combinations of command-line flags, embedded directives, and template metaprograms like those in Todd Veldhuizen's Blitz++ library (http://www.oonumerics.org/blitz/).

The tools developed by the free and open-source software communities are no better. GCC is widely used as a compiler, but you cannot link functions from libgcc to parse or manipulate programs because no such library exists. You cannot customize the way that GDB (the GNU debugger) displays data structures because GDB does not have a scripting interface. (That said, the GDB/MI protocol does offer an RPC-like interface to much of GDB's functionality; see http://sources.redhat.com/gdb/current/onlinedocs/gdb_25.html#SEC217.) For whatever reason, we have created a world in which our own tools are less programmable than those we build for everyone else.

Extensible XML-based programming systems might finally push us to fix this. Most programmers will never build compiler plugins, any more than most web site administrators write their own Apache modules. However, pluggable tools will let them take advantage of other people's innovations more quickly than today's "wait for the next release" approach.

One example is Stanford University's SUIF compiler framework (http://suif.stanford.edu/), which lets you insert new optimization modules into the compiler as easily as you add modules to web servers. This not only lets new optimizations spread faster, but also lets you selectively bypass optimizations that aren't appropriate for particular modules. Storing programs as XML isn't necessary for this to happen, but will almost certainly spur it on.

Toffler's Law

None of this is really inevitable, of course. We could still be typing in our programs a byte at a time in 2010. But do you really believe that will be the case? Do you really believe that ours will be the only documents that aren't marked up and can't be manipulated using generic tools? As Alvin Toffler said, the future always arrives too soon, and in the wrong order. Speaking as someone who has typed in a lot of bytes over the last 20 years, programming on the universal canvas is one revolution that can't possibly arrive too soon.

Acknowledgments

Thanks to Kimanzi Mati, Simon Peyton-Jones, Jim Larus, Mike Donat, Paul Prescod, Todd Veldhuizen, Mathew Zaleski, and Mark Mitchell for many useful comments.

DDJ
