Eugene is a freelance programmer and writer. He can be contacted at eekim@eekim.com.
If you peek under the hood of high-profile open-source projects such as Mozilla, Apache, Perl, and Python, you'll find a little program called "expat" handling the XML parsing. If you've ever used the man command on your GNU/Linux distribution, then you've also used groff, the GNU version of the UNIX text formatting application, troff. If you've ever done any work with SGML, from generating documentation from DocBook to building your own SGML applications, you've undoubtedly come across sgmls, SP, and Jade.
Whether you've heard of him or not (and most likely, you haven't), James Clark has made your life easier. In addition to authoring these and other widely used open-source tools (see http://www.jclark.com/ for a complete list), Clark served as the technical lead of the original W3C XML Working Group and as the editor of the XSLT and XPath recommendations. He recently founded Thai Open Source Software Center (http://www.thaiopensource.com/). His latest project is TREX, an XML schema language. Clark sat down with Eugene Eric Kim to discuss markup languages, the standardization process, and the importance of simplicity.
DDJ: How did you get involved with groff?
JC: One of the first things that got me interested in computing was reading the UNIX Version 7 manual, including the troff manual. I was very interested in TeX, and I worked with it quite a lot. I was also interested in open source, and wanted to make my own contribution. And one of the few classic UNIX programs that had yet to be written at that point was the troff family.
DDJ: What were the major lessons you learned from your experiences with groff and TeX regarding both markup languages and formatting languages?
JC: The problem with TeX and troff is that you're trying to use one language to do three rather different things. You're using it to mark up your documents, like XML; you're using it as a style language, like CSS or XSL; and you're using it to write programs to do the formatting. Using one language for all three separate requirements makes it suboptimal for all of them, in my view.
It's suboptimal for markup because, if you have a document written in TeX or troff, it's very hard to do anything with it other than run it through TeX or troff, so that limits reuse.
It's suboptimal for writing formatting programs because it's got this bizarre syntax with backslashes all over the place, which makes the whole thing unreadable. And it's not a real programming language. One lesson I drew from TeX and groff is that you want a real programming language, not a macro processing language. When you look at the thousands of lines of TeX macros or troff macros that people produce, it's a monument to the human intellect, but it's not really the right way to solve the problem.
DDJ: How did you get involved with SGML?
JC: I was interested in using SGML as a replacement for one part of what groff was doing. Then I got Charles Goldfarb's book, The SGML Handbook, and I thought, "Hmm, this is an interesting thing. Let's see if I can write a program for it." Then Charles Goldfarb released his ARCSGML SGML parser, and I started working with that. The more I worked with it, the more I felt it needed improvements and bug fixes, and nobody else seemed to be doing that. There seemed to be a real need for turning a research-worthy tool into more of a production-quality tool, and that turned into sgmls. Working with sgmls, I got more and more dissatisfied with its basic internal structure. There were some things in SGML that would have been very hard to implement within sgmls, and I felt that I really understood how SGML parsing worked, and so I produced a completely new SGML parser, SP.
DDJ: Did you feel like there were any major itches that you got to scratch with the specification of XML?
JC: I knew how insanely complex writing an SGML parser was. SGML is really doing something very simple. It's providing a standard way to represent a tree, and your nodes have a label with names and they can have attributes. That's all it's doing. It's not a complicated concept. Yet SGML manages to make writing something that implements it into a several-man-year project.
A lot of the features do have a reasonable motivation, but when you put them all together, you just get something that's too complex. I think the complexity is misguided. It's failing to pay attention to the importance of simplicity. If a technology is too complicated, no matter how wonderful it is and how easy it makes a user's life, it won't be adopted on a wide scale.
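The "standard way to represent a tree" that Clark describes can be made concrete in a few lines. The following is only an illustrative sketch: it uses XML rather than SGML, relies on Python's standard xml.etree.ElementTree module, and the memo document and the describe helper are made up for this example.

```python
import xml.etree.ElementTree as ET


def describe(node, depth=0):
    """Return one line per node: its label, followed by its attributes."""
    attrs = " ".join(f'{k}="{v}"' for k, v in sorted(node.attrib.items()))
    line = "  " * depth + node.tag + (f" [{attrs}]" if attrs else "")
    lines = [line]
    for child in node:
        # Children are indented one level deeper than their parent.
        lines.extend(describe(child, depth + 1))
    return lines


# Hypothetical sample document: labeled nodes, some carrying attributes.
doc = ET.fromstring(
    '<memo priority="high"><to>Ann</to><body lang="en">Hi</body></memo>'
)
tree_lines = describe(doc)
print("\n".join(tree_lines))
```

Everything the markup conveys here is a label per node, a set of attributes per node, and the nesting.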
DDJ: You're well known for writing very good reference implementations for SGML and XML standards. How important is it for these reference implementations to be good implementations as opposed to just something that works?
JC: Having a reference implementation that's too good can actually be a negative in some ways.
DDJ: Why is that?
JC: Well, because it discourages other people from implementing it. If you've got a standard, and you have only one real implementation, then you might as well not have bothered having a standard. You could have just defined the language by its implementation. The point of standards is that you can have multiple implementations, and they can all interoperate.
You want to make the standard sufficiently easy to implement so that it's not so much work to do an implementation that people are discouraged by the presence of a good reference implementation from doing their own implementation.
DDJ: Is that necessarily a bad thing? If you have a single implementation that's good enough so that other people don't feel like they have to write another implementation, don't you achieve what you want with a standard, in that all implementations (in this case, there's only one of them) work the same?
JC: For any standard that's really useful, there are different kinds of usage scenarios and different classes of users, and you can't have one implementation that fits all. Take SGML, for example. Sometimes you want a really heavy-weight implementation that does validation and provides lots of information about a document. Sometimes you'd like a much lighter weight implementation that just runs as fast as possible, doesn't validate, and doesn't provide much information about a document apart from elements and attributes and data. But because it's so much work to write an SGML parser, you end up having one SGML parser that supports everything needed for a huge variety of applications, which makes it a lot more complicated. It would be much nicer if you had one SGML parser that is perfect for this application, and another SGML parser that is perfect for this other application. To make that possible, the standard has to be sufficiently simple that it makes sense to have multiple implementations.
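As a rough illustration of the "lighter weight" end of that spectrum, the sketch below uses Python's standard xml.parsers.expat binding to report nothing but elements, attributes, and character data, with no validation. It is an XML analogy rather than an SGML parser, and the collect_events helper and the sample order document are assumptions made for this example.

```python
import xml.parsers.expat


def collect_events(xml_text):
    """Parse without validating; record only elements, attributes, and data."""
    events = []
    parser = xml.parsers.expat.ParserCreate()

    def on_start(name, attrs):
        events.append(("start", name, dict(attrs)))

    def on_end(name):
        events.append(("end", name))

    def on_text(data):
        # Skip whitespace-only chunks to keep the event list short.
        if data.strip():
            events.append(("text", data))

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.CharacterDataHandler = on_text
    parser.Parse(xml_text, True)  # True marks this as the final chunk
    return events


# Hypothetical sample input.
events = collect_events('<order id="7"><item>pen</item></order>')
for event in events:
    print(event)
```

A validating, information-rich processor would be a separate and much larger program, which is the case for keeping the standard simple enough that both can exist.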
DDJ: Is there any markup software out there that you like to use and that you haven't written yourself?
JC: The software I probably use most often that I haven't written myself is Microsoft's XML parser and XSLT implementation. Their current version does a pretty credible job of doing both XML and XSLT. It's remarkable, really. If you said, back when I was doing SGML and DSSSL, that one day, you'd find as a standard part of Windows this DLL that did pretty much the same thing as SGML and DSSSL, I'd think you were dreaming. That's one thing I feel very happy about, that this formerly niche thing is now available to everybody.
DDJ: What do you think was the crucial step that transformed markup language stuff from a niche into a widespread phenomenon?
JC: I think it was XML and Jon Bosak's initiative to form the XML Working Group. Jon correctly perceived that the W3C as a standards organization had much more market acceptance than did ISO. ISO, in Internet circles, still tends to be a little bit of a dirty word. That seems to be slightly reducing, but certainly, a few years ago, if you mentioned ISO standards, everybody would go, "Ugh. Horribly complicated." They didn't want to have anything to do with it, whereas W3C standards were perceived to be sexy, web oriented, up to date.
Jon correctly perceived that if we wanted to broaden acceptance of markup technology, the way to do it was to get the W3C badge of approval on it. And he selected a good group of people, chaired us well, and we were therefore able to produce something that was good.
XML isn't going to win any prizes for technical elegance. But it was just simple enough that it could get broad acceptance, and it has just enough SGML stuff in it that the SGML community felt they could embrace it. At this stage, the fact that it's based on SGML doesn't seem terribly important, but in terms of getting it started and getting that broad acceptance, having the support of the SGML community was very important. We struck a good balance and the timing was lucky. We delivered this at a time when people were just realizing the need for something like XML. It's the happy coincidence of those things that has led to the amazing level of acceptance of XML.
DDJ: What is TREX?
JC: You can think of it as DTDs in XML syntax minus some things and plus some others. TREX just does validation. DTDs mush together both validation and interpretation of the documents, providing various things like entities and notations. Mushing them together is problematic because often you want one thing but not the other. My work with XML and SGML has convinced me that what you need is good separation between these different things. I wanted to remove from DTDs the things that augment the information in the XML document. And I wanted to add in some of the things that I think XML DTDs have always been missing.
One of the things XML DTDs removed from SGML DTDs was AND groups, which allow you to have unordered content. The SGML AND groups had a bad reputation, and don't have quite the right semantics. TREX adds them back and tries to do them right. XML also radically simplified the kinds of mixed content that you're allowed because there's a problem with the way SGML does it. Instead of restricting it, TREX solves the problem.
There are a number of other things. There's namespaces. I view XML namespaces as one of the core base standards. Then datatyping. SGML has this ad hoc collection of datatypes and allows them only for attributes. XML restricts it a little bit more. There's not really any rhyme or reason for the selection of datatypes. What datatypes you have and how you do structural validation are basically orthogonal issues. The whole point of XML and namespaces is you can cleanly mix different vocabularies. So here's the prime opportunity for modularizing the solution. We have one language, TREX, for dealing with the structural validation, and we can use a separate language for datatypes. Different domains can have their own specific set of datatypes, so I don't really buy into the idea of one universal set of datatypes that are suitable for everybody. You can have one set of datatypes that can work for a lot of the things, but I think for some applications, you're going to need your own special datatypes, so TREX tries to accommodate that.
The other goal is to be simple and easy to learn. I think you can teach somebody TREX in half an hour to an hour, and that's important. Validation is a key thing, and it doesn't have to be complicated.
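The separation Clark describes between structural validation and datatypes can be sketched as two independent pieces: a structure check that knows nothing about datatypes, and a pluggable table of datatype checks. This is not TREX syntax or any TREX API. The validate function, the DATATYPES table, and the event schema below are all hypothetical names chosen for illustration.

```python
import re
import xml.etree.ElementTree as ET
from datetime import date


def is_integer(text):
    return re.fullmatch(r"[+-]?\d+", text.strip()) is not None


def is_iso_date(text):
    try:
        date.fromisoformat(text.strip())
        return True
    except ValueError:
        return False


# Hypothetical datatype library; a different domain could supply its own.
DATATYPES = {"integer": is_integer, "iso-date": is_iso_date}


def validate(elem, expected_children, datatypes):
    """Check structure first, then delegate each value to the datatype library."""
    actual = [child.tag for child in elem]
    expected = [name for name, _ in expected_children]
    if actual != expected:
        return [f"structure: expected {expected}, found {actual}"]
    errors = []
    for child, (_, type_name) in zip(elem, expected_children):
        if not datatypes[type_name](child.text or ""):
            errors.append(f"datatype: <{child.tag}> is not a valid {type_name}")
    return errors


# Hypothetical structural schema: ordered child names, each naming a datatype.
EVENT_SCHEMA = [("seats", "integer"), ("when", "iso-date")]

good = ET.fromstring("<event><seats>40</seats><when>2001-07-01</when></event>")
bad = ET.fromstring("<event><seats>40</seats><when>July 1st</when></event>")
good_errors = validate(good, EVENT_SCHEMA, DATATYPES)
bad_errors = validate(bad, EVENT_SCHEMA, DATATYPES)
print(good_errors)
print(bad_errors)
```

Swapping in a different datatypes table changes what counts as a valid value without touching the structural check, which is the modularity argument in miniature.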
DDJ: How is TREX different from some of the other proposed schema languages, like RELAX and XSchema?
JC: The biggest difference is probably simplicity and ease of understanding, ease of learning, ease of use. XML Schemas are very complicated, and the complexity lies in directions that don't seem to me to be very useful. And, it still doesn't give me a lot of the things that I want from a schema language. I can't write an XSchema for XSLT, for example. I can get quite close, but I can't deal with some of the important things.
Another big difference is that TREX tries to treat attributes and elements as uniformly as possible. If you're designing an XML or SGML markup language, it's often pretty much arbitrary whether you represent some bit of information as an attribute or as a child element. In my view, XML processing tools and languages should try to minimize the differences between elements and attributes and should try to treat them as uniformly as possible. You can see that in XSLT and XPath. I wanted to apply that idea to schema languages.
In TREX, attributes are integrated into the content model, so it makes it easy to say, "You can have this element or you can have this attribute." It's very common. For example, in W3C's XML Schemas, in the restriction element, you can either have a base attribute that names the base type or you can have a simpleType child element that describes the base type directly rather than by referring to it by name. So you want to say, "Either I have the base attribute or I have the simpleType child element." And in TREX, you say exactly that. It's just as easy to say that as it is to say, "Either I have a foo element or a bar element."
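The idea of putting attributes and child elements into the same content model can be approximated with a toy pattern matcher. This is a deliberately simplified sketch, not TREX itself: the Attribute, Element, and Choice classes and the matches function are hypothetical, and a pattern here must account for everything the element carries.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Element:
    name: str


@dataclass(frozen=True)
class Choice:
    alternatives: tuple


def matches(pattern, elem):
    """True if elem's attributes and children are exactly what pattern allows."""
    attrs = set(elem.attrib)
    children = [child.tag for child in elem]
    if isinstance(pattern, Attribute):
        return attrs == {pattern.name} and not children
    if isinstance(pattern, Element):
        return children == [pattern.name] and not attrs
    if isinstance(pattern, Choice):
        # Attributes and elements are interchangeable alternatives here.
        return any(matches(alt, elem) for alt in pattern.alternatives)
    raise TypeError(f"unknown pattern: {pattern!r}")


# "Either I have the base attribute or I have the simpleType child element."
restriction = Choice((Attribute("base"), Element("simpleType")))

by_name = ET.fromstring('<restriction base="xs:string"/>')
inline = ET.fromstring("<restriction><simpleType/></restriction>")
both = ET.fromstring('<restriction base="xs:string"><simpleType/></restriction>')

results = [matches(restriction, e) for e in (by_name, inline, both)]
print(results)
```

The Choice pattern does not care whether its alternatives are attributes or elements, which mirrors the uniform treatment described above.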
DDJ: What do you think the role is with all these different schema languages? Do you think that it's okay to have several different schema languages? Do you think it's important to agree on a single schema language for XML?
JC: I think the problem is that the field of application of XML has become so broad, so diverse, that it's basically impossible to come up with one schema language that will suit everybody. I think the W3C schema language is probably okay for some domains, but for other domains, I think the stuff they've added is just complexity without value. I don't think you can create a language that will be satisfying to everybody.
DDJ: A lot of things that we're talking about, things like SGML, ultimately came out of standards groups. Do you think that standards groups inherently make things complicated because there are so many interests at stake?
JC: It's very hard to produce a simple specification out of a standards group. You have to have people in the group who are really committed to simplicity as one of their top priorities, and you have to have a good number of them. Because without them, everybody wants to put in their own little feature, you end up putting all these little features in, and that results in something that's so large that nobody is happy with it.
It's a tough business creating a standard. The more people you have, the harder it is. That's one of the problems with XML standards now. The W3C is so popular and XML is so popular that any XML-related standard has gazillions of people participating in it. If you're a standards body, legally, you're not allowed to restrict who participates: You've got to allow any member to participate. And if you have gazillions of people wanting to participate, and you have a policy where you've got to have something that everybody can live with, you inevitably end up with something that is not lightweight, simple, and elegant.
Maybe the solution is to try and do less design by committee. You try and have just one or two people come up with a design first, something small and simple. Then you bring it to the standards process and you try and start with that, and make relatively small changes. That's, in a way, what we did with XML. We had a fairly solid base to work with, and our scope for innovation was relatively small. We also had people on the committee who were committed to simplicity and realized our main goal here was producing something that was simpler than SGML.
Another reason we were able to keep XML reasonably simple is that people did not get interested until the very end. So we had relatively small numbers of people participating in the working group. That was a big factor in keeping XML to a manageable size.
DDJ: What do you think the role is for SGML today?
JC: Oh dear, some SGML people are going to be unhappy with me (laugh). I think SGML has served its purpose by giving birth to a child, XML, which can fulfill most of the roles of SGML. Obviously, there are lots of companies and groups that are using SGML successfully, and they've been using SGML successfully for a decade or more. If they've got applications working with SGML, they've got no particular need to change to XML. But I think that for people who are starting afresh who don't have any SGML background, XML is clearly the way to go.
DDJ: What's the next step for XML?
JC: That's a difficult question. I think XML has become so widespread, it's like asking me, "What's the next application for ASCII text? What's the next application for line-delimited files?" XML is becoming so common, it's not interesting anymore.
One of the things that I was very inspired by in working with TREX was a project from the University of Pennsylvania called XDuce (http://www.cis.upenn.edu/~hahosoya/xduce/), which is an XML processing language. One thing that is interesting about XDuce is that it uses the type information from DTDs to actually type-check your program. Statically typed languages, like Java and C++, help you catch a lot of errors. But with XML processing at the moment, you use the DTD just to validate the file. You don't really use the type information after that. The fact that a document conforms to a DTD is not used by the typing system of the programming languages.
I think one interesting direction is to try doing the kind of things that XDuce is doing, which is integrate the type system of your data, DTDs or schemas, into the type system of the programming language. You want them to all work together in a seamless way so that your compiler can catch a lot more errors when you write programs to process XML, so you can get more reliable programs.
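One way to approximate that direction in a mainstream language is to convert a document into typed objects once, at the boundary, so the rest of the program works with declared types instead of untyped tree nodes. The sketch below is not XDuce and does not derive its types from a DTD automatically. The Order and Item classes, the load_order function, and the sample document are hypothetical, and it assumes an external checker such as mypy is run over the annotations.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    name: str
    quantity: int


@dataclass
class Order:
    id: int
    items: List[Item]


def load_order(xml_text: str) -> Order:
    """Check the document shape once and return typed data."""
    root = ET.fromstring(xml_text)
    if root.tag != "order":
        raise ValueError(f"expected <order>, found <{root.tag}>")
    items = [
        Item(name=node.attrib["name"], quantity=int(node.attrib["quantity"]))
        for node in root.findall("item")
    ]
    return Order(id=int(root.attrib["id"]), items=items)


def total_quantity(order: Order) -> int:
    # A typo such as item.quantty, or adding item.name to an int, can be
    # reported by a static checker before the program ever runs.
    return sum(item.quantity for item in order.items)


# Hypothetical sample input.
order = load_order(
    '<order id="7"><item name="pen" quantity="2"/>'
    '<item name="ink" quantity="3"/></order>'
)
total = total_quantity(order)
print(total)
```

The hand-written classes stand in for what a language like XDuce gets from the schema itself, so the classes and the schema can drift apart here in a way that an integrated type system would prevent.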