Little Languages, Big Questions


Dr. Dobb's Journal, September 1991

Ray is a technical editor at DDJ. He can be reached at 501 Galveston Drive, Redwood City, CA 94063.


What do you think is the most widely used programming language today? Much as we'd like to think they have, neither C nor Pascal has taken over the world. It's too late for Ada and ASM, and too early for C++. So it must be Fortran or Cobol, right? The correct answer, however, is none of the above. The most widely used programming language is, in fact, Lotus 1-2-3 Macro Language.

Yes, we know "real programmers" don't use Lotus Macro Language, but a very large number of real people do use it, and get worthwhile results every day. Why such a puny language claims so large an audience might be a mystery to those of us ensconced in the flame wars of whether multiple inheritance in C++ is a Good Thing. Nevertheless, it's useful to look at how little languages were used in the past, and how they are used in the software field today. Some of this may cause you to rethink your past allegiances.

The History of Managing Complexity

In some sense, the entire history of computer programming is the evolution of strategies for managing complexity. A principal strategy for dealing with complexity has been computer language design.

Computer system designer Glenford Myers, in his classic work on computer architecture (see "References"), talks about the "semantic gap" between a machine's instruction set and the requirements of an application program. In his view, a language compiler is a technique for bridging the semantic gap between two distinct ways of representing a computational solution. When Myers wrote his book, application requirements were such that the gap could be spanned with a single step -- by writing an application in Fortran or Cobol and translating this solution into its machine-language representation. Now that requirements are more complex, the semantic gap has become so wide that it's sometimes preferable to bridge it in several steps, using a little language as a stepping-stone.

So over the years, the architecture of complex applications has undergone structural change, from a single module built out of distinct subcomponents, to a nested architecture. Designers often use an image of two Russian dolls, one placed inside the other, to describe this kind of structure.

Complex application problems become easier to solve if the application is partitioned into two (or more) nested components: a core module that provides a primitive set of services for an application area (the "engine"), and a surrounding module that provides programmatic access to these services. The surrounding module is typically a language interpreter for a simple, easily parsed computer language -- a "little language."

Some of these languages are designed on an ad hoc basis and bear no resemblance to any other language (Lotus Macro Language, for example). Other languages are carefully crafted subsets of existing programming languages such as C or Lisp. These days, you can find application engines for handling databases (Paradox), publication graphics (Display Postscript), CAD (AutoCad), symbolic mathematics (Mathematica), 3-D rendering, text-editing, spellchecking, hypertext, and many other tasks.

Using little languages with big engines is therefore not a recent invention -- but a technique that has been reinvented and rediscovered many times since we evolved from the lower life-form of assembly language and began to walk upright, using higher-level languages.

The first little languages were simple extension languages. For example, in 1963, Ivan Sutherland's landmark graphics program, Sketchpad, incorporated a sophisticated, constraint-based system for describing relationships between graphical elements; these constraints were described using a simple language. The CoreWars computer game, popular with hackers on mainframe machines during the mid-sixties, also used a simple language to describe competing processes. No doubt there were many other such instances, but the written record before 1970 is rather sketchy (at least as found in my library).

The Spawn of lex and yacc

Constructing a language that is more than a toy, while not supremely difficult, is nevertheless a laborious task that happens faster with proper tools. It was during the '70s that language construction tools came into full bloom. During those years, the pragmatic, tool-oriented company culture at AT&T's Bell Labs -- as engendered by Kernighan, Ritchie, Thompson, Bentley, and other researchers -- placed a high value on compiler-building utilities. Unix's lex and yacc are now showing their age, but at the time were a reasonably painless way to build the scanning and parsing components of a compiler or interpreter. The Unix pcc (Portable C Compiler) was built with these tools and then ported to many hardware platforms, and for many years represented the de facto definition of the C language.

To my knowledge, some little languages built at Bell Labs using lex and yacc include pic (for diagrams), tbl (table processing), eqn (equations), nroff/troff (document processing), awk (text manipulation), the various operating system shells (Bourne, Korn), bc (a Basic-like calculator), Ideal (graphics), Grap (mathematical graphs), and pico (image processing). These programs were easy to construct, and proliferated like rabbits in the /usr/bin directory on Unix systems.

Compared to bulky present-day software packages such as dBase, ToolBook, HyperCard, or contemporary C compilers, the Bell Labs languages are pretty lean and mean. On my PC, the size of the Microsoft C compiler (three passes) is three-fourths of a megabyte; the dBase IV binary is 1.8 megabytes. By contrast, the size of the awk executable is less than 60 Kbytes. The application-specific component of a 1970s-vintage Bell Labs program -- the engine -- is more like a small two-stroke motor than a gas-guzzling V8.

A few years ago, Jon Bentley of Bell Labs wrote a CACM column entitled "Little Languages," in which he discussed this idea of using little languages to address different application areas. His brief but thought-provoking article focused on the problem of creating complex graphics within a character-mode document processor, and how the pic language at Bell Labs was useful in this regard. Bentley, unfortunately, does not give us a complete run-down of the many little languages that have emerged from Bell Labs. Such an account would be most interesting. But the title remains memorable.

It is telling that the most recent language creation to emerge from AT&T -- C++, Version 2.1 -- is far from "little" (in fact, some critics claim it is monstrously large). I think the language translation requirements of C++ may have transcended the technology present in lex and yacc, given the large number of potentially ambiguous constructs.

Over the years, the little languages from Bell Labs have been used and abused in different ways. For example, some odd soul used the troff typesetting language to write a Basic interpreter. (Not that I'm one to talk: I've written a troff-like interpreter in PostScript.)

Moving to another place and time, a momentous event in the history of software technology occurred when Richard Stallman abused a little language at MIT.

Stallman's Inspired Hack

The text editor I'm using to write this is a descendant of an inspired hack using the TECO macro language. TECO is a line-oriented text editing facility that ran on various flavors of DEC computers, from PDP-8s to PDP-11s to PDP-10s. It is similar to DOS's EDLIN or Unix's ed, in that single-letter commands are used to carry out editing functions on text that has been read into a buffer.

TECO, however, provides additional commands for putting data into local variables, and for testing and branching, and therefore becomes Turing-machine equivalent. A printout of TECO macros pushes the envelope of inscrutability, making the average Unix or DOS command file seem as clear as a blue sky over Montana.

In the mid-1970s, a friend and I implemented a menu-driven mailing list application written in TECO macros. Although the entire source code was only four pages long, we spent many hours debugging and enhancing the code. This experience immensely increased my respect for Stallman's feat, which was to build an extensible, screen-oriented text editor, EMACS, on the uneven, prickly foundation of TECO macros.


The EMACS editor, of course, has since been reimplemented by Stallman and many other independent authors. Its feature list has grown to include directory editing, e-mail handling, and functions not usually associated with text editors (but since adopted by mainstream commercial products). The most recent incarnation of Stallman's editor is GnuEmacs, available from the Free Software Foundation as well as many other sources. On my Unix system, the GnuEmacs executable is about 650 Kbytes.

GnuEmacs is written using the two-step language/engine combo. The base layer is about 80,000 lines of C code that implement both basic editing and the functions that support Emacs Lisp, the Lisp dialect in which the higher-level features of the editor are implemented.

This two-layer structure has been followed by EMACS's spiritual descendants on PCs, such as Epsilon and BRIEF. (For more on the BRIEF Macro Language, see "Programmer's Workbench" in this issue.) Although BRIEF was not designed to be an Emacs clone, both the original version of BRIEF and GnuEmacs use Lisp-like macro languages. The new version of BRIEF uses a C-like language, similar to that in Epsilon. Interestingly, BRIEF translates the C-like syntax to the older Lisp form before compiling it into bytecodes.

The Lisp syntax has been chosen as an extension language model by numerous other application programs, such as AutoDesk's AutoCad. It's not just microcomputer applications -- for example, the design automation systems from Cadence and Mentor, which run on workstations, have Lisp-like extension languages. Recently, the CAD Framework Initiative (an industry standardization group) chose the Lisp-like Scheme language as its standard extension language.

The principal reasons are that Lisp syntax is computationally complete, easy to learn, and, most importantly, easy to parse. David Betz's XLisp provides a respectable amount of Lisp functionality in less than 10,000 lines of C code. Perhaps it was for this reason that XLisp was grafted onto AutoCad to become AutoLisp. (For more information on XLisp and XScheme, a dialect of Lisp, see "Testing C Compiler Performance" by David Betz, DDJ, August 1991.)

The Ultimate EMACS

Stallman's design approach has been echoed in text editor implementations far and wide. Some of this influence has been indirect, by intellectual osmosis, leading to inadvertent rediscovery. The ME text editor from Magma Systems may fall into this category. In other cases, the lineage can be traced more directly, as with Borland's word processor, Sprint, which evolved from Final Word, which came from Mince, which came out of MIT.

Perhaps the most elaborate incarnation of a text editor -- the "ultimate EMACS," so to speak -- is the technical publication system by Interleaf. The Interleaf TPS package is a high-end system for writing, editing, and producing technical documents. The software runs on workstations, X-terminals, IBM mainframes, PCs, and Macs. If you need to produce thousands of pages of specifications for an aerospace project, this is the tool of choice. Version 1.0 of the Interleaf software came out of the Scribe research project at MIT in the early '80s. It consisted of 100,000 lines of C code. The current version, Interleaf TPS 5.0, weighs in at over 1.5 million lines of C code. Given these proportions, one can hardly expect Interleaf's macro language to be a puny runt, and it is not, representing about 90 percent of a full Common Lisp implementation (Common Lisp is the Ada of Lisp-like languages), plus object-oriented extensions.

Unlike AutoCad, which keeps its AutoLisp extension language at arm's length (by implementing most of the application's functionality in C), Interleaf Lisp is not just an "extension" language, but also an implementation language. About 25 percent of the functionality of the Interleaf system is implemented in Lisp code, rather than C. This requires about 250,000 lines of Lisp, a large amount, but much less than if it were done in raw C. The boundary between the two languages is transparent enough that a C function can call a Lisp function which calls another C function, without any function being aware of what language is being used by its caller. Most of the data structures that represent a document are packaged as objects visible to both Lisp and C code.

Despite the highly evolved structure of Interleaf TPS, it is gratifying to note that remnants of the EMACS pedigree remain, in the form of the traditional arcane key bindings -- present, but unknown to most of Interleaf's customers and even its most senior technical staff.

How to Ask Hard Questions

As we have seen here, a small language can be used to implement large parts of a complex application. But there are other uses for this technique.

Often, the most important step in solving a difficult problem is to ask the right question. Many key developments in science have occurred as a result of properly formulating a key question -- as happened after the apple fell on Newton's head. Sometimes, however, before you can ask the right question, you need a language in which to express that question.

The accompanying text box touches on the ability of Lisp-like languages to express fundamental problems in computation. Beyond this, a number of AI research projects leave "pure Lisp" way behind by defining languages on top of Lisp in which ideas about knowledge and reasoning can be easily explored. The first well-known example of this is Carl Hewitt's Planner, a subset of which powered Terry Winograd's SHRDLU; many others have followed.

Alternative Syntaxes

Most of the little languages mentioned earlier use the syntax of either Lisp or C. Implementors choose Lisp for its simplicity and its strong theoretical underpinnings. C is used because of its familiarity. Alternative models include Forth and Basic.

Forth is considered by its advocates to be the archetypal little language. Not only does it have a byte-sized syntax, but its philosophy of use is congruent with the language/engine idea. The process of writing a Forth program consists of defining Forth words which are defined in terms of other words, which are ultimately defined by calls to the underlying engine (in the case of a language/engine duo), or by machine language fragments (in the case of a stand-alone system). The most widespread example of a Forth-like little language is PostScript. A less well-known example is the macro language for the Final Word editor.

Basic has not been used all that much, until recently -- but this will change. Bill Gates's longstanding affection for Basic drives Microsoft's campaign to make Basic the primary macro language for the DOS and Windows platforms. The extension language in Word for Windows is the first step in the campaign. Visual Basic (see next section) represents another volley. The principal contender has yet to be introduced by Microsoft, however.

An older use of Basic-like syntax is found in the Mumps programming language (see the code example in the sidebar). Mumps folds a rich mixture of sophisticated database functionality onto a bare-bones syntax, consisting of 26 single-letter commands (one for each letter of the alphabet, including "G" for "goto"). This strange brew of a language is the basis for one of the largest software development projects ever, the $1.2 billion government contract, now ongoing, to computerize the nation's VA hospitals.

Access to Environments

Another popular use of little languages is to interface to software environments. The environments range from rudimentary operating systems such as MS-DOS, to more complex graphical environments, such as Microsoft Windows or the Mac Finder/System.

Access to operating system functionality has always been available with DOS batch files (and on Unix, with the Bourne/C/Korn shell languages). By contrast, easy programmatic access to windowing environments has been a long time coming.

A recent example is Microsoft's Visual Basic, which deftly skirts the morass of the 600-plus functions in the Microsoft Windows API, and provides high-level access to a usable core set, plus low-level access to the rest. The classic "hello world" program in Visual Basic is only one line long, compared to the 100-line version using the raw Windows API.

On the Mac, a well-known example is the HyperTalk language found in Apple's HyperCard program. As originally conceived by Bill Atkinson, HyperCard was purely an end-user application, free from the complications of programmability. As the program evolved during development, an English-like interpreted language was added by Dan Winkler, in order to take full advantage of HyperCard's rich set of features. A more recent example on the Mac, containing functions closer to an operating system batch language, is Userland's Frontier. (Refer to Michael Swaine's "Programming Paradigms" in this issue for additional details on Frontier.)

Conclusion

In past product advertising, neither Apple nor Microsoft has been known for acknowledging intellectual debts to prior innovators. I haven't checked the ads recently, but I doubt things are different with Visual Basic and HyperCard 2.0. This is no great sin, because good marketing demands a focus on here-and-now rather than on there-and-then. The history of little languages, however, is too interesting -- and their contribution too important -- to be conveniently forgotten.

References

Bentley, Jon. "Programming Pearls: Little Languages." Communications of the ACM (August 1986).

Myers, Glenford. Advances in Computer Architecture. Wiley, 1982.

Stark, Richard. LISP, Lore, and Logic. Springer-Verlag, 1990.

Thompson, Ken. "Reflections on Trusting Trust." In ACM Turing Award Lectures. ACM Press/Addison-Wesley, 1987.

Unix Research System Papers, vol. 2. Murray Hill, N.J.: AT&T Bell Labs.

How Strong Is Your Little Language?

Computationally speaking, some little languages are 98-pound weaklings, others are bantam-weight powerhouses.

Although benchmarks exist to evaluate a compiler's code generation capability and compilation speed, there aren't any formal benchmarks that can evaluate a programming language's computational ability. In a broad sense, all but the most rudimentary languages are equivalent, because all are capable of Turing-machine computation. Looking at the details, however, languages are often different enough that you're comparing apples and oranges.

Despite this observation, one can measure a computer language's expressiveness by the ease with which fundamental problems such as the Halting Problem can be expressed. Richard Stark, in LISP, Lore, and Logic, shows how a syntactically small language like Lisp is actually a powerhouse of symbolic processing. He presents an elegant proof of the unsolvability of the Halting Problem in about 20 lines of pure Lisp, by showing how certain clearly stated initial conditions lead to a contradictory statement in Lisp.

He concludes his proof by saying, "When a computational system such as Lisp is a source of problems that can't be solved within the system, there are two [interpretations]. One is that the system is too weak to solve its own problems. The other is that it presents such a strong notion of computability that deep and unsolvable questions can arise. Either interpretation could be true.... In this case, we believe that the strong interpretation is correct."

A different, much more informal, benchmark of a language's computational power is the programming exercise that Ken Thompson (coauthor of Unix) used to pass the time in college. According to Thompson -- in his 1983 ACM Turing Award lecture -- the following exercise was popular "...for the same reason that three-legged races are popular." The goal is to write the shortest self-reproducing program: "More precisely stated...to write a source program that, when compiled and executed, will produce as output an exact copy of its source."

Thompson adds, "If you have never done this, I urge you to try it on your own. The discovery of how to do it is a revelation that far surpasses any benefit obtained by being told how to do it." Thompson's program, in C, is shown in Example 1(a). Although he considers it the starting point for "the cutest program I ever wrote," the code itself is 233 lines of somewhat homely C.

Example 1: Writing the shortest self-reproducing program: (a) Ken Thompson's C code; (b) a pure Lisp version; (c) a Mumps implementation.

  (a) char s[] = {
              '\t',
              '0',
              '\n',
              '}',
              ';',
              '\n',
              '\n',
              '/',
              '*',
              '\n',
              ...213 lines deleted...
              0
      };
      /*
       * The string s is a representation of the body
       * of this program from '0' to the end.
       */

      main()
      {
          int i;

          printf("char\ts[]={\n");
          for (i = 0; s[i]; i++)
              printf("\t%d,\n", s[i]);
          printf("%s", s);
      }

  (b) ( (lambda (x) (list x (list (quote quote) x)))
            (quote
             (lambda (x) (list x (list (quote quote) x))) ))

  (c) S N=$C(13)_$C(10)_$C(9) S Q=$C(34) S S="W N_""S
      N=$C(13)_$C(10)_$C(9) S Q=$C(34) S S=""_Q;F I=1:1:$L(
      S) S C=$E(S,I) W $S(C=$C(34):Q_Q,1:C);W Q_N;F I=1:1:$
      L(S) S C=$E(S,I) W $S(C=$C(59):N,1:C);Q;"
      W N_"S N=$C(13)_$C(10)_$C(9) S Q=$C(34) S S="_Q
      F I=1:1:$L(S) S C=$E(S,I) W $S(C=$C(34):Q_Q,1:C)
      W Q_N
      F I=1:1:$L(S) S C=$E(S,I) W $S(C=$C(59):N,1:C)
      Q

By comparison, the equivalent can be expressed in pure Lisp in only three lines of very elegant code; Example 1(b). ("Pure Lisp" is the subset of the Lisp language consisting of functions that have no side effects, that is, functions that are mathematically "pure." The code in Example 1(b) is from John McCarthy and Carolyn Talcott, and is quoted in Stark's book.)

For fun, I wrote the equivalent program in the Mumps programming language, which is a Basic-like little language that accesses a sophisticated associative database engine. Although the listing is only a couple of lines, the code is extremely grubby and cryptic; see Example 1(c).

Although each of these programs has the same underlying logic, the steps required to reach the same goal in each language are so different that you can make a case that each program implements a distinct algorithm.

How does your favorite (or most despised) language stack up? Send us your solution to this exercise and we'll publish the most interesting, elegant, and/or bizarre listings. A special prize will go to the first entry written in Lotus Macro Language.

-- R.V.


Copyright © 1991, Dr. Dobb's Journal

