Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Tools

Building Parsers with Leopurd


MAR96: Building Parsers with Leopurd

Thor is a consultant in Winston-Salem, NC. He specializes in object technology and development of distributed systems using C/C++ and PowerBuilder.


Whether it's from a user, script file, or serial line, input often must be broken into logical parts before it can be processed. Many programmers resort to nested conditions involving string comparisons, but the resulting code can be hard to understand and maintain. Sometimes a brute-force approach is not sufficient, and trusty old yacc is used to generate a "real" parser. The advantage of this approach is that only the grammar must be maintained; the disadvantage is that yacc generates table-driven parsers that are difficult to read and modify manually. In this article, I'll present leopurd, a program that generates a maintainable, table-free parser from an input grammar.

Top-Down and Bottom-Up Parsers

The two types of parsers in widespread use are top-down and bottom-up. All parsers read language tokens from an input stream and examine them against a set of production rules. Each rule has a left side--the target--and a right side describing a valid way to reach the target. More than one production rule can be associated with each target.

A typical bottom-up parser determines which production rule to apply for a given input sequence by implementing rules as lookup tables. When the parser receives the first token, it tags all rules that can begin with that token. As it reads more tokens, it eliminates candidate rules until only one remains. This process involves guesswork and backtracking. The code implementing table-driven, bottom-up parsers is illegible and difficult to modify, and parser speed often is slower than that of top-down parsers. However, plenty of tools generate bottom-up parsers automatically, and such parsers can always be generated for an unambiguous grammar that requires no more than one token look ahead. Bottom-up parsers include those generated by yacc, bison, and similar programs.

Most handcoded parsers are of the top-down variety, and start with the top-level rule in the grammar. They examine input to see if they can complete this rule. If not, other functions are called to parse any candidate rules. As each rule is completed, the corresponding function executes action code associated with the rule and returns control to the caller. Top-down parsers are easier to modify and optimize than table-driven parsers--the parser can often be made both smaller and faster. However, no standardized tools exist for generating top-down parsers, and some grammars cannot be implemented correctly as top-down parsers. (Fortunately, most popular computer languages can be easily parsed this way.)

The leopurd parser combines the best features of both top-down and bottom-up parsers. It builds legible, maintainable, optimizable top-down parsers from standard grammars similar to yacc grammars. Leopurd generates two C source files from an input grammar: yyleo.h, containing definitions and function prototypes; and yyleo.c, containing the parser source code and any user-supplied code.

The Leopurd Grammar

Leopurd's input grammar describes the syntax of the parser input. The syntax of a language tells you only if a sentence is valid; the sentence's meaning is the realm of semantics. Leopurd supports a subset of the yacc definition language, but it lacks a literal block, associativity, %type and %union statements in the definition section, and C action statements in the rules section. Future versions of leopurd should be fully compatible with yacc.

A grammar file has three sections, separated by double percent signs (%%). The first section is the definition section, where you define the unique parts of a language--the language's terminal tokens. A definition begins with %token, followed by one or more terminal-token identifiers, which can be names or single characters within single quotes. Example 1 defines four unique tokens for a simple calculator.

The rules section defines the syntax of the language, showing the order in which tokens may appear in a sentence. The left side of a production rule is an abstract (nonterminal) symbol. The right side is a combination of nonterminal and terminal symbols, arranged the way they appear in a sentence. More than one right side can be assigned to each left side if separated by a pipe symbol (|). A semicolon terminates a group of rules for a single left side. Every nonterminal symbol referenced in a right side must be defined in a left side somewhere in the grammar.

Example 2 defines the abstract symbol "expression" to be a sum or difference followed by an equal sign. The second rule illustrates how recursion is used. It defines a sum to be a NUMBER followed by '+' and another NUMBER, or a NUMBER followed by '+' and another sum. The abstract symbol difference is defined similarly. Both the rules and definition sections may contain any amount of whitespace.

A right side can be empty. Unlike yacc, leopurd requires that you define the special token EMPTY if your grammar uses empty production rules, as shown in the productions for sum. A nonterminal symbol that matches EMPTY by some rule is said to be "nullable."

The final part of the grammar, the user section, is copied verbatim to the output file yyleo.c and can be any legal C code. Here you can put code used by the parser, such as a scanner, error handler, symbol table, global variables, and a main() function for testing the parser; see Example 3.

Code Generation

Leopurd generates C code to implement a top-down recursive descent parser from an input grammar. The entry point to the parser is the function yyparse(). The parsing target for yyparse() is the top-most nonterminal symbol in the grammar. Since there can be only one yyparse(), there can be only one top-level target.

Because error handling varies widely between applications, leopurd generates only the prototype of the error handler, yyerror().

Leopurd also generates prototypes for the functions next() and match(). A call to next() causes the next token to be read from the input stream. A subsequent call to match() evaluates whether the argument matches the token read from input. These functions are highly dependent on the nature of the input and are left up to you to implement.

Two functions are generated for each nonterminal symbol: <name>_parse() and <name>_action(). The <name>_parse functions are generated completely by leopurd and perform the actual parsing of the nonterminal symbol <name>.

The <name>_action() functions are placeholders for the language's semantics. Leopurd generates a function skeleton, to which you add actions to be taken upon successful parsing of the token. Separating syntax and semantics in this way increases portability and maintainability, and simplifies testing and debugging.

Recursive-Descent Parsing

An abstract, top-down parser parses a sentence by building a parse tree, in which each node represents a token. After a tree is built, the parser traverses the tree in postorder, executing code associated with each node. If the parser cannot generate a tree, it flags an error and returns.

Recursive-descent parsers generate parse trees implicitly by calling parse functions recursively. Upon a successful parse, the leaf function returns. As each parse function returns, its action code executes.

Parsing starts with the yyparse() function, which gets the first token from the input by calling next(). yyparse() then determines which production rule to use by calling match(). When a match is found, the parser does one of two things: If it expects a terminal token, it gets the next token and checks it by calling match(); if it expects a nonterminal symbol, it calls the symbol's <name>_parse() function. This process continues until an error is encountered, in which case yyerror() is called or the production rule is parsed successfully. Upon success, the last <name>_parse() function calls <name>_action(), which executes code associated with <name>. After <name>_action() returns, the process continues until the whole parse tree is unwound. Finally, any actions associated with yyparse() are executed, and yyparse() returns.

Since functions return in the reverse of the order in which they are called, the <name>_action() code of leaf nodes exe-cutes before code closer to the root. Thus, a production far down in the grammar has precedence over one further up. This is how the grammar's structure determines operator precedence. (Right-recursive rules result in right-association for operations.)

How does the parser determine which production rule to select? How do you determine which tokens to match against? For terminal tokens, the parser just matches against the token itself. For nonterminals, it must know which terminal tokens can legally begin a particular nonterminal. That set of tokens for a nonterminal n is called "first of n" and is written FIRST(n).

For a nullable nonterminal, you must determine which terminal tokens can legally come after the empty nonterminal. That set of tokens is called "follow of n" and written FOLLOW(n). Finding the FIRST() and FOLLOW() sets is an iterative process. Once the sets are defined, leopurd knows which tokens to match in each rule.

The resulting parser will operate properly only if the FIRST() and FOLLOW() sets unambiguously determine which production rule to choose at each step. The grammar in Example 1 and Example 2 is riddled with such problems.

Leopurd produces a trace of terminal and nonterminal tokens, production rules, and FIRST() and FOLLOW() sets. The trace is directed to stdout, and can be captured in a file for debugging or grammar verification.

A Parser Example

Listing One shows a grammar for a calculator which can evaluate addition and subtraction expressions. When an equal sign (=) is parsed, the expression is evaluated.

The user section defines the functions yyerror(), next(), and match(), which are declared by leopurd, and a main() function that can read input from a file or standard input.

The function yylex() is the scanner/lexical analyzer, which recognizes tokens in the input stream and returns their values. The global variable yytext points to the first character of the current token. The token's length is stored in yyleng.

When run with the grammar in Listing One, leopurd produces output on stdout similar to Listing Two. Listings Three and Four are the generated files, yyleo.h and yyleo.c.

The <name>_parse() functions implement a ready-to-roll parser. The top-level parse function, statement, is renamed to yyparse(). If the top-level nonterminal uses recursion (like the production rule in Example 4), a redefinition of statement_parse() in yyleo.h maps it to the yyparse() function. Also, although no explicit rule says so, the FOLLOW() of the top-level nonterminal always includes "end of input."

By contrast, the <name>_action() functions are stub functions, sufficient for yyleo.c to compile cleanly. They should be fleshed out manually to provide the semantics of the language.

The "raw" parser generated by leopurd sacrifices efficiency for clarity, so you may need to hand-optimize the resulting code. One such inefficiency is duplicated code, such as the duplicated handling of matches on '-' and '+' in predicate_parse() in Listing Four. This problem could be solved in the grammar by defining a new nonterminal "operator" to match either '+' or '-'. However, a new nonterminal costs a function call at run time, so manual optimization might be preferable. Some grammars result in multiple recursive calls to the same function ("tail recursion"). Such calls can be replaced with a loop, resulting in better performance and stack-space savings. The first if statement in each parse function, which tests for matches on the FIRST() and FOLLOW() sets, often can be eliminated completely.

Source Code

The complete C source code for the current version of leopurd is available electronically (see "Availability," page 3). LITE.H contains definitions common to all source files. It defines the symbol LIST, resulting in visible run-time trace on stdout. For leopurd to run quietly, you must undefine LIST. The scanner is found in LITE1.C. LITE2.C contains functions to parse the definition and user sections. The rules section is parsed by functions in LITE3.C. LITE4.C contains functions that calculate FIRST() and FOLLOW() sets and build the parser.

Leopurd itself is constructed around several top-down parsers. The source code also shows ways to implement the next(), match(), and yyerror() functions, as well as a simple scanner, yylex().

The code generator in LITE5.C produces ANSI C output. It is relatively easy to modify to produce code for other languages.

The source code is ANSI C compliant, with the exception of calls to strdup(), a non-ANSI function supported by many compilers. They can be replaced with calls to malloc(), strlen(), and strncpy(). Leo-purd compiles cleanly using Borland C++ and Turbo C under MS-DOS, and on many UNIX platforms.

Example 1: Definition section for a simple calculator.

%token NUMBER
%token '+' MINUS
%token '='
%%

Example 2: Rules section for a simple calculator.

expression: sum '=' |
        difference '='
        ;
sum:    NUMBER '+' NUMBER |
        NUMBER '+' sum |
        EMPTY
        ;
difference:
        NUMBER MINUS NUMBER |
        NUMBER MINUS difference
        ;
%%

Example 3: Code section for a simple calculator.

#include <stdio.h>
#include <stdlib.h>
static int line_number=0;
/* print an error message */
void yyerror(char *msg_str){
    printf("Error! line %d: %s\n",
line_number,msg_str);
}
int main(){
    int rc;
    while(!(rc=yyparse()));
    exit(rc);
}

Example 4: A recursive top-level production rule.

 statement:
  while '(' expression ')' statement ;

Listing One

%token NO_TOKEN
%token EOI
%token NUMBER 
%token EMPTY
%%
statement:  expression '='
        ;
expression: factor predicate |
        EMPTY
        ;
predicate:  '+' factor predicate |
        '-' factor predicate |
        EMPTY
        ;
factor:     NUMBER |
        '(' expression ')'
        ;
%%
/***** user section *****/
#include <stdio.h>
#include <ctype.h>
#include <string.h>
int yylex(void);
#define INLEN 256
int look=NO_TOKEN;
static FILE *fp;
char *yytext="";
int yyleng;
/**** error function ****/
void yyerror(char *str){
    printf("Error! %s\n",str);
}
/**** look ahead to next token ****/
void next(void){
    look=yylex();
}
/**** match token ****/
int match(int token){
    if(look==NO_TOKEN)
        next();
    /**** uncomment to trace matches 
    printf("matching %d %d\n",look,token); /* */
    return token==look;
}
/**** keep parsing input until an error occurs ****/
int main(int argc,char *argv[]){
    int rc;
    if(argc==1) fp=stdin;
    else if(NULL==(fp=fopen(argv[1],"rb")))
        return printf("File Error!\n");
    while(!(rc=yyparse()));
    printf("%s\n",(rc==2)?"Success!":"Error");
    return fclose(fp);
}
/**** this is the scanner function ****/
int yylex(void){
    static char inbuf[INLEN]={'\0'};
    static char *p=inbuf;
    do{
        /* need another line */
        if(!*p){
            if(!fgets(inbuf,INLEN,fp))
                return EOI;
            p=inbuf;
        }
        /* strip leading whitespace */
        while(isspace(*p)) p++;
    }while(!*p);
    yytext=p;
    yyleng=1;
    switch(*p++){
        case '+': return (int)'+';  /* single char operator ret. by value*/
        case '-': return (int)'-';
        case '*': return (int)'*';
        case '/': return (int)'/';
        case '(': return (int)'(';
        case ')': return (int)')';
        case '=': return (int) '=';
        default:
            if(isdigit(*yytext)){   /* a number */
                while(isdigit(*p)){
                    p++;
                    yyleng++;
                }
                return NUMBER;
            }
            else printf("Unknown token: %c!\n",*yytext);
    }
    return NO_TOKEN;
}
/*** end of user section ****/

Listing Two

EMPTY 1003
NUMBER 1002
EOI 1001
NO_TOKEN 1000
predicate 1007
factor 1006
expression 1005
statement 1004
factor:
  '('  expression  ')'
  NUMBER
predicate:
  EMPTY
  '-'  factor  predicate
  '+'  factor  predicate
expression:
  EMPTY
  factor  predicate
statement:
  expression  '='
predicate: (nullable) 
FIRST(): '-' '+' 
FOLLOW(): ')' '=' 
factor: 
FIRST(): '(' NUMBER 
FOLLOW(): '-' '+' ')' '=' 
expression: (nullable) 
FIRST(): '(' NUMBER 
FOLLOW(): ')' '=' 
statement: 
FIRST(): '(' NUMBER '=' 
FOLLOW(): 
Top level non-terminal : statement

Listing Three

#define YYSTYPE int
int match(int);
void next(void);
void yyerror(char *);
#define NO_TOKEN 1000
#define EOI 1001
#define NUMBER 1002
#define EMPTY 1003
#define statement 1004
#define expression 1005
#define factor 1006
#define predicate 1007
int predicate_parse(void);
int predicate_action(void);
int factor_parse(void);
int factor_action(void);
int expression_parse(void);
int expression_action(void);
#define statement_parse yyparse
int yyparse(void);
int yyparse_action(void);

Listing Four

#include "yyleo.h"
#include <stdlib.h>
/***** user section *****/
#include <stdio.h>
#include <ctype.h>
#include <string.h>
int yylex(void);
#define INLEN 256
int look=NO_TOKEN;
static FILE *fp;
char *yytext="";
int yyleng;
/**** error function ****/
void yyerror(char *str){
    printf("Error! %s\n",str);
}
/**** look ahead to next token ****/
void next(void){
    look=yylex();
}
/**** match token ****/
int match(int token){
    if(look==NO_TOKEN)
        next();
    /**** uncomment to trace matches 
    printf("matching %d %d\n",look,token); /* */
    return token==look;
}
/**** keep parsing input until an error occurs ****/
int main(int argc,char *argv[]){
    int rc;
    if(argc==1) fp=stdin;
    else if(NULL==(fp=fopen(argv[1],"rb")))
        return printf("File Error!\n");
    while(!(rc=yyparse()));
    printf("%s\n",(rc==2)?"Success!":"Error");
    return fclose(fp);
}
/**** this is the scanner function ****/
int yylex(void){
    static char inbuf[INLEN]={'\0'};
    static char *p=inbuf;
    do{
        /* need another line */
        if(!*p){
            if(!fgets(inbuf,INLEN,fp))
                return EOI;
            p=inbuf;
        }
        /* strip leading whitespace */
        while(isspace(*p)) p++;
    }while(!*p);
    yytext=p;
    yyleng=1;
    switch(*p++){
        case '+': return (int)'+';  /* single char operator ret. by value*/
        case '-': return (int)'-';
        case '*': return (int)'*';
        case '/': return (int)'/';
        case '(': return (int)'(';
        case ')': return (int)')';
        case '=': return (int) '=';
        default:
            if(isdigit(*yytext)){   /* a number */
                while(isdigit(*p)){
                    p++;
                    yyleng++;
                }
                return NUMBER;
            }
            else printf("Unknown token: %c!\n",*yytext);
    }
    return NO_TOKEN;
}
/*** end of user section ****/
int predicate_parse(void){
    if((!match('-'))&&(!match('+'))&&(!match(')'))&&(!match('='))){
        yyerror("predicate");
        exit(1);
    }
    else if(match('-')){
        next();
        if(factor_parse()){
            yyerror("factor");
            return 1;
        }
        if(predicate_parse()){
            yyerror("predicate");
            return 1;
        }
    }
    else if(match('+')){
        next();
        if(factor_parse()){
            yyerror("factor");
            return 1;
        }
        if(predicate_parse()){
            yyerror("predicate");
            return 1;
        }
    }
    else if(match(')')||match('=')){
        return 0;
    }
    else return 1;
    predicate_action();
    return 0;
}
int predicate_action(void){
    return 0;
}
int factor_parse(void){
    if((!match('('))&&(!match(NUMBER))){
        yyerror("factor");
        exit(1);
    }
    else if(match('(')){
        next();
        if(expression_parse()){
            yyerror("expression");
            return 1;
        }
        if(match(')'))
            next();
        else{
            yyerror("')'");
            return 1;
        }
    }
    else if(match(NUMBER)){
        next();
    }
    else return 1;
    factor_action();
    return 0;
}
int factor_action(void){
    return 0;
}
int expression_parse(void){
    if((!match('('))&&(!match(NUMBER))&&(!match(')'))&&(!match('='))){
        yyerror("expression");
        exit(1);
    }
    else if(match('(')||match(NUMBER)){
        if(factor_parse()){
            yyerror("factor");
            return 1;
        }
        if(predicate_parse()){
            yyerror("predicate");
            return 1;
        }
    }
    else if(match(')')||match('=')){
        return 0;
    }
    else return 1;
    expression_action();
    return 0;
}
int expression_action(void){
    return 0;
}
int yyparse(void){
    if((!match('('))&&(!match(NUMBER))&&(!match('='))&&(!match(EOI))){
        yyerror("statement");
        exit(1);
    }
    else if(match(EOI)){
        return 2;
    }
    else if(match('(')||match(NUMBER)||match(')')||match('=')){
        if(expression_parse()){
            yyerror("expression");
            return 1;
        }
        if(match('='))
            next();
        else{
            yyerror("'='");
            return 1;
        }
    }
    else return 1;
    yyparse_action();
    return 0;
}
int yyparse_action(void){
    return 0;
}


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.