*Truck is a programmer in Southern California and can be contacted at [email protected]*

It's simple enough, almost textbook. The textbook in this case is *Compilers: Principles, Techniques and Tools*, by Aho, Sethi, and Ullman (Addison-Wesley, 1986)—the "Dragon Book"—with a smattering of elementary recursion thrown in. It took me a while to figure it out, though. I was working on a formula evaluator for my company's latest product. I've implemented a lot of parsers, so this was supposed to be a fairly easy job. The last time I'd implemented one, I used a YACC-type parser, so this time I decided to write a recursive-descent parser, just for practice.

The formulas were straightforward enough and I soon had a decent parser, except for one problem. Addition, subtraction, and multiplication all passed their unit tests. So did division—until I decided to try double divide expressions. I fed it the expression, "18/6/3" and it gave me the answer "9."

I was expecting "1"; how did it get "9"? At first I thought it was a bug, but it turns out this is the "correct" answer, at least according to the parsing method.

Recursive-descent parsing works by mimicking a language grammar in code. The standard grammar for expressions looks something like this:

e := t

| t '+' e

| t '-' e

t := f

| f '*' t

| f '/' t

| '(' e ')'

That is, an expression *e* can be parsed as a term *t,* optionally followed by a plus or minus sign, followed by an expression. A term can be a factor *f,* optionally followed by a multiplication or division sign followed by a term, or it can be an expression enclosed in parentheses.

A recursive-descent parser takes a string of symbols ("18/6/3") and tries to match it recursively by calling a routine to match the expected symbol. The C++ code to implement the division rule of the grammar:

t := f

| f '/' t

looks like Listing One (leaving out the messy details of tokenizing with a lexor and error processing). The parse for "18/6/3" gets constructed in recursive calls by the *Term* routine. In the first call, the factor "18" and the operator "/" at the front of the input string are processed and the rest is passed recursively to the *Term* function for further evaluation. At this point, the parse looks like Figure 1(a). On the second call, "6" and "/" are processed; see Figure 1(b). On the third call, the "3" at the end of the string is processed by the *Factor* function; see Figure 1(c).

Because the character after the last "3" is not a division, the *Term* function returns with a result of 3.

It's only at this point that the second pending division can be performed and the *Term* function returns the result: 6 divided by 3 gives 2; see Figure 1(d).

Now the first pending division can be performed: 18 divided by 2 gives 9.

More simply, the recursive-descent parser works from right to left, where we humans generally work from left to right.

I tried several alternatives to get around this. I even considered defining it as the "correct" behavior—a feature, if you will—but a quick survey of my office indicated that I would have an uphill battle. It was when my boss started talking about left associativity of the division operation that the double divide attained "dreaded" status: I knew I'd have to find a solution.

In a YACC parser, you could simply mark division as left associative and that would take care of it. How do you handle it in a recursive-descent parser?

If I could rewrite the grammar to move the recursion to the front, then the parser would operate left to right, the same way my boss does. The simple way to do this:

t := t '/' f

doesn't work because the parser gets stuck in an infinite recursive loop (a term is a term is a term is a term is a term...) and doesn't stop until it runs out of stack space.

Back to the textbook. The Dragon book talks about eliminating left recursion from grammars. I won't bore you with the details, but if you follow the book's process, you come up with this:

t := f

| f g

g := '/' t

You create an intermediate result *g*, which takes the recursion off into another rule. For a simple calculator, this isn't much help because the corresponding routine would have to return an expression to be evaluated, not a simple number.

It does suggest another way of looking at it, however. In the Lisp community, where they do everything with recursion, there are special rules for speeding up calculations. What compiler/parser people call "right recursion," Lisp people call "tail recursion" because it occurs at the end (tail) of the calculation. (See the accompanying text box entitled "Tail Recursion" for details.)

Tail recursion can always be replaced by a simple loop. The Lisp people like this because it saves on stack space and recursive function calls; in other words, it's faster and uses less memory.

I like it because it gives me a way of reordering the calculations. The revised grammar rule for division now looks like this:

t := f ( '/' t )*

where I'm extending BNF using regular expression notation (the "*" at the end indicates zero or more occurrences of the expression in parenthesis).

The code for this turns out to be as simple as the original code. The only change is in the *Term* routine. I replace the *if* with a *while* (Listing Two), and the resulting parser gets the right answer.

DoubleDivide.cpp (available electronically; see "Resource Center," page 4) is the full code for the test program. The test program uses parsing objects to make comparing the results of the formula evaluation easier, but the recursive descent is performed by the methods of the parsing object in the same manner just described.

Recall I mentioned the lexor (see DoubleDivide.cpp). In compiler writing, it is common to separate out the job of turning the input stream into tokens. This makes it possible to get the next token from many different places, a necessity for recursive descent. For this article, I've coded a simple lexor; all it does is return either the next *char* or turn a string of digits into an *int*.

Following the lexor is the abstract parser class with some utility functions to identify the output. After that, the two implementations of the parser are in the classes *RecursiveDescent* and *TailRecursion*.

There is no provision for anything other than the expected token in these routines. This is the way recursive descent works. If the expected token doesn't show up, it's an error and the compiler reports it as such. In a more complete implementation, there would be some logic based on the tokens to determine which BNF rules to apply. For instance, a star would indicate multiplication and a slash division. If I implemented the rule about parenthesized expressions:

t := '(' e ')'

the *Term* routine would have to look at the next token, and if it was an open parenthesis, it would call the *Expression* routine. I've left all these details out because it complicated the presentation of tail recursion.

I hate writing repetitive code. My test functions in DoubleDivide.cpp are an example. The *DoesParseMatchResult* tests the results but also collects the printing for both header and detail in one place, so I can modify the formatting in one place. The actual printing function, *Line*, is a template function, so it will handle both the headers (*char const **) and results (*ints*). The class is just a convenient way to group the two static functions.

One of the nice things about recursive descent is the way the functional expression closely follows the BNF rules for the language. In the case of the dreaded double divide, making the result meet expectations turned out to have an equally nice functional expression. Even my boss liked the result.

**DDJ**

Lexor lexor( "18/6/3" ); int Factor() { int d = lexor.GetNumber(); } int Term() { int t = Factor(); if( lexor.CurrToken() == '/' ){ lexor.MoveNext(); int f = Term(); t = t / f; } return t; }Back to article

**Listing Two**

int Term() { int t = Factor(); while( lexor.CurrToken() == '/' ){ lexor.MoveNext(); int f = Factor(); t = t / f; } return t; }Back to article