Beyond HTML: An Interview with the Creator of SGML
On the eve of HTML 5, I decided to reach back into the archives to see where web markup languages stood a decade ago. This 1998 entry from Web Techniques discusses the state of things back in the day, when XML was the hot new kid on the block. It's an interview with SGML creator Charles F. Goldfarb, conducted by former DDJ executive editor Michael Floyd.
A Conversation with Charles F. Goldfarb
By Michael Floyd
It's widely known that when Tim Berners-Lee created HTML he based his hypertext publishing language on SGML, the Standard Generalized Markup Language. SGML was already an international standard, and it was being used to publish very large documents, such as airplane maintenance manuals. Now, with the emergence of eXtensible Markup Language (XML), Web developers are about to meet SGML head on.
SGML was invented by Charles F. Goldfarb, and it was he who coined the term "markup language." It all started in 1969 when Goldfarb, leading a small team at IBM, developed a language called GML. In 1974, he created SGML and subsequently wrote the first SGML parser, ARCSGML. Goldfarb would also work to turn SGML into the ISO 8879 standard, and serve as its editor.
With all the buzz surrounding XML, I recently sat down with the "Godfather of Markup," both to record some pre-Web history and to gain some insights beyond the hype of XML. Goldfarb currently edits Prentice-Hall's "Definitive XML Series from Charles F. Goldfarb," and he has just released his latest book, The XML Handbook (coauthored by Paul Prescod). Goldfarb, a graduate of Harvard Law School and Columbia College, holds the Graphic Communications Association's first International SGML Award, as well as the PIA Gutenberg Award. What follows is an edited transcript of our conversation.
Dr. Goldfarb, you led the project at IBM that invented SGML's precursor, GML. It's said that necessity is the mother of invention. What specific problem were you trying to solve?
We were trying to do an automated law-office application. I had been a lawyer (in fact, I still am). Lawyers must do research on existing case law, decisions of court, and so on, to find out which ones are applicable to a given situation, find out what the previous legal rulings have been, and then merge that with text that the lawyer has written himself. Eventually, if it's, say, a brief for the court, [he must] then compose it and print it. At the time, which was 1969 or 1970, there weren't any systems available that did these three things. So in order to get the systems to share the data we had to come up with a way to represent it that was independent of any of those applications.
Can you give a sense of the size and scope of the project? How many team members were involved?
It was a very small research project. There was initially myself and another researcher, Ed Mosher, working on it full time. Then, we had part-time consulting from a very brilliant fellow named Ray Lorie who is also one of the pioneers of relational databases. [Ray] had the most brilliant insight into the whole thing, which is that all the elements that are tagged the same way should be processed the same way. Our manager, Andy Symonds, contributed technically as well.
I understand that GML not coincidentally includes the initials of the three of you.
[Chuckles] In fact, I confess that I coined the term "markup language" for that very reason. A research department is measured, in part, by its effectiveness in technology transfer. When a product development group first gets your work, it is grateful, seeks out your help, and acknowledges what you've done. But by the time they've poured their own sweat into it for several years and a product comes out, they've often forgotten about the researchers who had the original bright idea. So the name "Generalized Markup Language" (because of its initials, GML) was my way of labeling the technology so that its origin would be unmistakable.
To launch your new book, The XML Handbook, you gave a talk at one of the Computer Literacy bookstores (in Silicon Valley) that was billed, I believe, as "dispelling the myths behind XML."
Right. That was called "The Truth about XML." The reason I chose that title, as I'm sure you're aware, was that there's an enormous amount of publicity and hype surrounding XML. And there are a lot of statements being made that, at face value, seem to be outlandish. Michael Vizard had an editorial in InfoWorld in which he basically called it a whole new processing paradigm and said that it was going to change everything from electronic commerce to how you package objects to share them. All that's true! But when you look at the explanations you see in these magazines about what XML is, there's nothing that would justify those assertions. It's just presented as "extend HTML, make up your own tags." Why would anyone think that has anything to do with these earth-shaking changes? Part of this is that "simplicity sells" seems to be the philosophy. When they drew the line on how much the Web developer needs to know to understand the power of XML, I think they drew the line in the wrong place.
As I understand it, the signing was so heavily attended that the bookshop was forced to turn people away. Were you expecting this kind of response?
No. I was told to expect, perhaps, 30 people. They counted 168 before closing the doors.
Can you tell us why there is so much interest in XML right now?
Gee, I thought they were there to see me [more laughter]. The publicity hype has been unending. But there's genuine support for it. Anyone doing electronic commerce is heavily into XML. Very few technologies have been embraced by all of the players, even archcompetitors, in the way that XML has. Even with Java, which Microsoft now embraces, they certainly weren't on board from the beginning. But in the case of XML, they were. Jon Bosak, the moving force behind this simplification of SGML, is from Sun. Two of the three coeditors of the W3C spec are Tim Bray, who was consulting for Netscape, and Jean Paoli, who is Microsoft's XML architect. Michael Sperberg-McQueen, the third coeditor, is a neutral [party], from the University of Chicago.
SGML has been applied very successfully to document management. Why is XML better suited than SGML to the Web?
SGML has options for every occasion. You're dealing with potential users whom you might think of as small niche markets, very specialized: aerospace, telecommunications, semiconductors, and so on. Those little specialized niches have document collections that are bigger than the Web. So, when you're dealing with applications of that magnitude, if they feel it's going to save them time and money to be able to leave off the end tags of paragraphs, it's worth it to them. They can afford to have software with that customization option, so it makes sense.
For the Web, what the XML committee basically did was look at all these options and come up with their own tailored version, just as a large user would do, but one that was optimized for Web purposes. So, just by eliminating choice in those syntactic areas, they reduced the size of the parser by 80 to 90 percent. So, it made it much easier to implement. Also, there's less diversity. One of the options that's available in full SGML is the ability to omit some markup when you can do so unambiguously. In XML, you're never allowed to omit markup. As a result, if you're what they call a hacker (someone who's written code that's not really parsing the XML properly, but is just kind of scanning it as you might do to locate things in a hurry), you've got a much more consistent text stream to work with. Whereas in SGML, in order to do things safely, you pretty much have to parse all of the time, which isn't that big a deal. But in a networked environment, that can matter. Also, there are requirements in XML that make the document more robust if parts of it are missing, which is more likely to happen in a networked environment than in a more controlled environment.
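[Editor's note: The point about omitted markup can be illustrated with a minimal sketch in Python, using the standard library's xml.etree.ElementTree. The two sample documents and the helper names is_well_formed and scan_paragraphs are invented here for illustration and do not come from the interview. The sketch assumes only that an XML parser rejects a document whose paragraph end tags are left off, and shows why a quick pattern scan is reliable only when every tag is explicit.]

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample documents, invented for illustration.
WELL_FORMED = "<memo><p>First paragraph.</p><p>Second paragraph.</p></memo>"
# SGML-style tag omission: the </p> end tags are left off.
OMITTED_END_TAGS = "<memo><p>First paragraph.<p>Second paragraph.</memo>"


def is_well_formed(text):
    """Return True if an XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def scan_paragraphs(text):
    """Naive scan: no real parsing, just pattern matching on explicit tags."""
    return re.findall(r"<p>(.*?)</p>", text)


if __name__ == "__main__":
    for label, doc in (("explicit tags", WELL_FORMED),
                       ("omitted end tags", OMITTED_END_TAGS)):
        print(label, is_well_formed(doc), scan_paragraphs(doc))
```

[With explicit end tags, the parser accepts the document and the scan finds both paragraphs. With the end tags omitted, the XML parser rejects the document and the scan finds nothing, which is the situation in which, as Goldfarb notes, SGML requires a real parse.]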
That leads to another question: I've seen a lot of comparisons of XML to SGML coming from some very smart people, including David Megginson (author of Structuring XML Documents), who says, "XML is simply a subset of SGML with the more esoteric features removed." The opposite end of the spectrum might be something along the lines of "XML is a severely limited form of SGML." How would you compare the two?
If a document conforms to the XML spec, then it also conforms to the SGML spec. So, in that sense, XML is a proper subset of SGML. Whether you consider that a severe limitation or a powerful functional capability depends on how important [it is] to make choices about things that XML doesn't let you make choices about. For the average Web developer, it's not an issue. You use XML. If, on the other hand, your employer is Boeing and you've got to turn out four million pages every quarter for each model of airplane, then you use SGML. If you want to deliver some of those SGML documents over an intranet, then you'll generate XML from it the same way you'd generate HTML today.
We've all heard the rumor: XML is going to replace SGML. Aside from the obvious fact that SGML is already used to manage terabytes of data, why is this scenario unlikely?
For the same reason the Gap is not going to replace Savile Row. If you can afford custom tailoring and get a suit made exactly the way you want and the way you look the best, you'll do it. The rest of us go shopping in department stores. The problem was that up till now, SGML was strictly a custom-tailoring thing. You had to have a very large document collection, and it had to be a big part of what your company did for you to justify the cost of the consultants and the design setup and the expensive software that's necessary. XML is for the mass market. You can still choose the color and fabric, but the sizing is done at the factory. As long as you're a close enough fit to one size or another then you're fine. And 99 percent of people are. So, what's going to happen is that an order of magnitude more people are going to use XML, but there's still going to be an order of magnitude more documents in SGML.
Likewise, looking at the other end of the spectrum: There's the SGML community, and then there's the HTML community. So, the HTML authors fear that their world of static elements is about to get a lot more complicated.
Now, you know they don't know those are called elements. They call 'em tags [chuckles]. But there's also a third audience, which is maybe even bigger than the other two: the people who want to shift data around. Jean Paoli told me that he sees the biggest users of XML being the guys who write Excel spreadsheets, and want to get data from their company's mainframe or from a Web server belonging to one of their company's suppliers or customers.
So, not just document exchange, but application exchange of data.
Absolutely! And, you see, not just documents, not just publishing. There's an important distinction there. These are all documents in the sense of the way that word is used in the dictionary. The key is recognizing that XML is a data representation that has the characteristics of a document. That's where the real power comes in, because [you can] process it as data (by first parsing it to extract the data) or you can present it the way you would a document. And you can do both things in the same application at the same time. That's the real breakthrough.
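[Editor's note: As a rough sketch of the "data and document at the same time" idea, the Python below parses one small XML text and then uses it both ways: once as data, to compute a total, and once as a document, to produce a readable listing. The order document, its element and attribute names, and the functions order_total and render are all hypothetical, made up for this illustration. It relies only on the standard library's xml.etree.ElementTree.]

```python
import xml.etree.ElementTree as ET

# Hypothetical order document, invented for illustration.
ORDER = """<order id="A-100">
  <item sku="widget" qty="3" price="2.50">Blue widget</item>
  <item sku="gadget" qty="1" price="10.00">Pocket gadget</item>
</order>"""


def order_total(root):
    """Treat the XML as data: pull out the numbers and compute with them."""
    return sum(int(item.get("qty")) * float(item.get("price"))
               for item in root.iter("item"))


def render(root):
    """Treat the same XML as a document: present it as readable text."""
    lines = ["Order " + root.get("id")]
    for item in root.iter("item"):
        lines.append("- {} x{}".format(item.text, item.get("qty")))
    return "\n".join(lines)


# One parse serves both uses.
root = ET.fromstring(ORDER)

if __name__ == "__main__":
    print(render(root))
    print("Total:", order_total(root))
```

[The same parsed tree feeds both functions, so a single application can calculate with the content and display it without keeping two separate representations.]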