Dr. Dobb's | Lex and Yacc | February 01, 1996

Lex and Yacc

Ian describes how he used traditional compiler-development techniques and the MKS Lex & Yacc Toolkit to build a keyword-query compiler for a CD-ROM database.

February 01, 1996
URL:http://www.drdobbs.com/database/lex-and-yacc/184409830

FEB96: Lex and Yacc

Ian is a software developer and systems analyst. He can be reached at ActiveSystems Inc., 11 Holland Ave., Suite 700, Ottawa, ON, Canada K1Y 4S1.

Forty years ago, it took a group of experts 17 staff years to write the first Fortran compiler. Today, a single programmer with less knowledge can do essentially the same thing in a matter of weeks. What makes this possible are advances in both computer science and software-development tools. In particular, tools such as lex and yacc become powerful allies in the hands of programmers who know how to use them. For instance, I've used lex and yacc for everything from converting queries to SQL syntax, to verifying that examples of server command files were consistent with command syntax.

In this article, I'll first examine lex and yacc, focusing on the MKS Lex & Yacc Toolkit for DOS, OS/2, and Windows NT. MKS Lex builds a C/C++ or Turbo Pascal lexical analyzer that takes a stream of input and breaks it into tokens according to specific rules. MKS Yacc, on the other hand, builds a C/C++ or Turbo Pascal parser that takes a stream of tokens, matching them against a specified grammar. (If you have a UNIX system, lex and yacc may already be installed. Free versions, such as flex and bison, will compile in almost any system.) To illustrate the power of these tools, I'll then describe how I used them to build a keyword-query compiler for a CD-ROM database.

Lex and Yacc Backgrounder

Yacc generates a parser, yyparse(), that checks whether or not different parts of the input (like numbers, variable names, and operators) are in the correct order. It then processes those parts. Usually the order of processing is different from the order of input.

Lex generates a lexer, yylex(), that splits the input into substrings, which it then classifies for the parser. The lexer can also do some preliminary processing. yylex() always returns a type code (called a "token"). For some substrings, yylex() may also put a value (called an "attribute") in the global variable yylval.

Listings One and Two, extracted from the MKS Lex and Yacc documentation, build an interpreter with lex and yacc. The interpreter is a simple, four-function calculator, with up to 26 variables (A to Z) for temporary storage. Writing such a calculator in C is a fairly big job; with lex and yacc, you need only the code in the listings.

The lex code (Listing One) is just a list of regular expressions, with their associated code segments. Each code segment executes when a substring of the input matches the corresponding regular expression. Lex puts the matching substring in the global variable yytext. Thus lines 8 to 16 classify the single letters A to Z (and a to z) as VARIABLE, and produce a value from 0 to 25, which the parser will use to identify the particular variable. Similarly, lines 18 to 21 classify an unbroken string of digits as INTEGER, and convert the string to an integer for use by the parser. In this way, input strings are classified and given some initial processing before any of the input goes to the parser.

The yacc code (Listing Two) is a list of constructions (or recipes) for building new objects out of old ones. For example, lines 26 to 34 show how the calculator program handles arithmetic expressions (like sum and difference). Any INTEGER coming in from the lexer is an expression (line 27). A VARIABLE identifies an array element in which to store the value of an expression (line 28). Expressions can be constructed from previously constructed expressions (lines 29 to 32).

Objects from the lexer are normally identified in yacc code by character constants or uppercase names; objects constructed by the parser are usually identified by lowercase names. This makes it easier to read and understand the code. You attach code to each construction by placing the code in braces after the objects that you use to build new objects. The value of the new object is represented by $$ signs. The values of the old objects are represented by $1, $2, and so on, from left to right. If the parser cannot immediately use the result, it goes on a stack (managed by code that yacc produces) until it is needed. The equation a=3+4*c, for example, must be processed in reverse order of input, but you only have to write code for simple constructions: c (line 28), 4*c (line 31), 3+expression (line 29), and a=expression (line 23). In the equation a=(3+4)*c, addition must be done first. Line 33 handles this case. After an opening parenthesis, the parser can accept only a complete expression followed by a closing parenthesis. The parser stores five objects (leading parenthesis, first expression, plus, second expression, and trailing parenthesis) on a stack, then recognizes them as an expression (line 33), pulls them off the stack, and replaces them on the stack by one new expression.

The two equations illustrate the labor-saving power of yacc. You do not have to figure out when and how to save intermediate results for an infinite variety of possible calculations. You do not even need to know how such things are done. You only need to find a simple way to build complex structures out of simple structures, and write your own code to process the simple structures.

A Keyword-Query Compiler for a CD-ROM Database

The CD-ROM databases I help build allow users to retrieve data by keyword searches on one or more fields. To search for a record with either the word "harry" or the two words "tom" and "dick" in a single field, the user would enter a keyword list tom dick, harry. Words in a list separated by spaces must all appear in the field. Lists separated by commas are alternatives; a record will be selected if a field satisfies any one of the lists. Thus spaces are equivalent to a logical AND, and commas are equivalent to a logical OR.

I used an off-the-shelf search engine that requires the query to be like Example 1. While this looks like SQL, the operators <, <=, =, >=, and > do not mean what they would in a relational database. They are containment operators instead of comparison operators. For example, FIELD1 = 'joe' indicates that FIELD1 contains 'joe' (possibly with other words), not that FIELD1 has the value 'joe'. Similarly, FIELD1 < 'joe' indicates that FIELD1 contains some word less than 'joe'. A field with the value 'xerxes jack' would satisfy this condition.

I didn't want to expose nontechnical users to the engine query language, particularly when it looked like a language (SQL for relational databases) with different operators. Consequently, I wrote a parser to convert user expressions to expressions that the query engine would accept; see Example 2.

Other Design Issues

The retrieval engine was designed to work with one database, and it used global variables. I moved the engine C code into a C++ database class, changing the global variables to private variables in that class. This allowed me to have several open databases instead of one. I could use much of our existing C interface code by changing the function declarations, making the functions public members in the database class.

The compiler allocates temporary storage on the heap, and any yacc-based compiler will discard some of the corresponding pointers when it detects an error. I therefore entered each of these pointers in a list-manager class when I allocated data in the compiler. Deleting the list manager after each compiler run guarantees that all memory is freed.

Since the rest of the package was written in C++, I preferred to write the parser in C++. This is easy to do with MKS Yacc.

When searching on multiple fields, the field expressions are connected by AND or OR, but in left-to-right precedence (or from top of screen to bottom) instead of the usual precedence of AND over OR. Consequently, I used yacc only for the individual field expressions, and used a simple C function to put the expressions together.

I did not use lex because some of our applications defined keywords differently in each database. For example, one database might allow hyphenated words, but another in the same application might not. Lex requires you to specify the characters that can make up a keyword when you generate the compiler, not when you run it. This project had only a few different tokens (see Listing Three, lines 17 to 22), so it was not hard to write the lexer in C++ instead of lex.

Writing the Compiler

I wrote the compiler in stages. I simultaneously built a lexer and yacc grammar with no C++ code. The grammar and the lexer are interdependent. The lexer will not compile without a definition file that yacc produces from the grammar. The parser will not run without input from the lexer. However, you can do much of the testing before putting C++ code in the parser, because the C++ code is only used to produce the parser output.

Listing Three is the code for a working parser that reads and parses input, and produces error messages for invalid input. In short, it does everything but produce the output. If you have designed a new language, like my keyword lists, you can change the grammar until you are satisfied with the language. One of the great advantages of yacc is that you can do this kind of testing up front, before putting a lot of work into C/C++ code.

The parser gets input (lines 17 to 22 of Listing Three) from the lexer. Starting from the bottom of Listing Three, the parser makes ever larger units out of input tokens, until all input has been combined into one unit, which is the complete search expression for one field. For example, you can follow a character string (CHARSTRING) up through lines 98-99, 66-67, 57-59, and 42-43 until it becomes part of the complete field expression in lines 31-32.

After you have a working parser, you can add code. Listing Four is some of the code that illustrates how the parser works. The variable scan is a pointer to an object of type ParseItem_t, created by the lexer. The object holds a string scan->Value(), and you can append another string to that string with the member function scan->ConCat().

When a CHARSTRING comes in (lines 55-70), and the field is a character field, the corresponding string value is quoted; if the field is numeric, it is not. Then the string is saved (by saving the pointer $$) as the value for the new token value. This particular section illustrates how the yacc grammar is influenced by the code you intend to write--if I had not wanted to process strings upon arrival from the lexer, I could have left this section out and replaced all occurrences of value with CHARSTRING.

If the CHARSTRING lacks a preceding operator code, it will be concatenated with the fieldname and the '=' operator (lines 29-37). If the CHARSTRING follows an operator (<, <=, >=, >), then it will be concatenated with a fieldname and that operator (lines 29 and 40-50). Both cases produce a simplecondition.

A compoundcondition (line 4) can be a simplecondition (lines 7-9). A compoundcondition followed directly by a simplecondition will be ANDed with that simplecondition (lines 11-17). A compoundcondition followed by a comma and then a simplecondition will be ORed with that simplecondition (lines 19-26).

After the user keyword lists are converted to search expressions for each field, they are put together in an SQL command for the search engine. Since the expressions are concatenated very simply, this was done with a C++ program (Listing Seven) that called a setup program (Listing Five) for each field.

Conclusion

The secret to using lex and yacc in a production environment is to pick problems commensurate with your ability to use the tools, and to increase your ability with practice. While the CD-ROM project I've described here may be trivial to a computer-science student, yacc saved us at least a week in development time.

I found yacc handy when we changed the structure of our keyword language. Testing a change was a matter of a few minutes when we could change our yacc source, instead of several days if we had been writing in C or C++.

Lex and yacc are generally useful even when you are not in the compiler business. I have used lex and yacc to verify the syntax of languages that I was describing in developer's guides for a UNIX text-management server and a Windows text-search client. This alerted me immediately to changes that occurred in the language definitions as the design of the server and client evolved. I've also used lex and yacc to produce SGML output from proprietary text markup. We eventually switched to other tools, but we used a modified version of the lex program as a preprocessor. In another project, I used lex and awk to extract English literals from Visual Basic programs and replace them with French literals. Finally, I've used lex and MKS Toolkit (UNIX utilities for DOS) to automate corrections to 18,000 pages of scanned documents, preventing a delay of several weeks in delivery.

For More Information

MKS Lex & Yacc

Mortice Kern Systems Inc.

185 Columbia Street West

Waterloo, ON

Canada N2L 5Z5

http://www.mks.com

Example 1: Typical query by off-the-shelf search engines.

select * from table where
    FIELD1 = 'tom' and FIELD1 = 'dick' or FIELD1 = 'harry'

Example 2: Parser conversion of user expressions to expressions that the query engine accepts.

User expression     Engine query expression
joe                 AC1 = 'joe'
<2                  AC1 < 2
<10 >3, 20          AC1 < 10 AND AC1 > 3 OR AC1 = 20
10..20              AC1 BETWEEN 10 AND 20

Listing One

 1 %{
 2 #include "ytab.h"
 3 extern int yylval;
 4 %}
 5 
 6 %%
 7 
 8 [A-Z]   {
 9         yylval = *yytext - 'A';
10         return VARIABLE;
11     }
12 
13 [a-z]   {
14         yylval = *yytext - 'a';
15         return VARIABLE;
16     }
17 
18 [0-9]+  {
19         yylval = strtol(yytext, (char **)NULL, 0);
20         return INTEGER;
21     }
22 
23 0x[0-9a-fA-F]+  {
24         yylval = strtol(yytext, (char **)NULL, 16);
25         return INTEGER;
26     }
27 
28 [-()=+/*\n] return *yytext;
29 
30 [ \t]+      ;
31 
32 .       yyerror("Unknown character");

Listing Two

 1 %{
 2 #include <stdio.h>
 3 %}
 4 
 5 %token INTEGER VARIABLE
 6 %left '+' '-'
 7 %left '*' '/'
 8 
 9 %{
10 static int variables[26];
11 %}
12 
13 %%
14 
15 program:
16         program statement '\n'
17 |       program error '\n'      = { yyerrok; }
18 |       /* NULL */
19 ;
20 
21 statement:
22         expression          = { printf("%d\n", $1); }
23 |       VARIABLE '=' expression     = { variables[$1] = $3; }
24 ;
25 
26 expression:
27         INTEGER
28 |       VARIABLE            = { $$ = variables[$1]; }
29 |       expression '+' expression   = { $$ = $1 + $3; }
30 |       expression '-' expression   = { $$ = $1 - $3; }
31 |       expression '*' expression   = { $$ = $1 * $3; }
32 |       expression '/' expression   = { $$ = $1 / $3; }
33 |       '(' expression ')'      = { $$ = $2; }
34 ;

Listing Three

 1 /* $Header$ */
 2 /* Field parser for use with CD-ROM and Visual Basic DLL */
 3 /* Ian E. Gorman, ActiveSystems Inc., Ottawa, Ontario, Canada */
 4 
 5 /* Code removed to show only the yacc grammar */
 6 
 7 /* DEFINITIONS -------------------------------------------------------- */
 8 %{
 9 
10 #include <ctype.h>
11 #include <stdarg.h>
12 #include "dbinfo.hpp"       /* includes "parsetab.hpp", "token.hpp" */
13 #include "logfile.hpp"
14 
15 %}
16 
17 %token  CHARSTRING  GLOBSTRING
18 %token  ','
19 %token  AND
20 %token  OPCODE  RANGEOP
21 %token  ALL BLANKS
22 %token  '"'
23 
24 /* OPCODE and CHARSTRING have strings associated with them.
25     The other terminals are not associated with strings. */
26 
27 %%
28 
29 /* RULES -------------------------------------------------------------- */
30 
31 program:
32     condition
33 |   error
34 |   program error
35 ;
36 
37 /* This production just collects various productions into one production,
38     to reduce the number of different actions that must be coded in the
39     next higher production.
40 */
41 
42 condition:
43     compoundcondition
44 |   specialcondition
45 |   quotecondition
46 ;
47 
48 /* Special conditions that exclude any other condition */
49 
50 specialcondition:
51     BLANKS                  /* fieldname = NULL symbol */
52 |   ALL                     /* fieldname >= minimum field value */
53 ;
54 
55 /* One or more ordinary conditions (simple comparisons) */
56 
57 compoundcondition:
58 
59     simplecondition     /* Begin the parse */
60 |   compoundcondition simplecondition   %prec AND /* AND simplecondition */
61 |   compoundcondition comma simplecondition /* OR simplecondition */
62 ;
63 
64 /* ordinary conditions -- single comparisons and ranges */
65 
66 simplecondition:
67     value                   /* fieldname = value */
68 |   globvalue               /* fieldname like value */
69 |   opcode value            /* fieldname >=,>,<,<= value */
70 |   value rangeop value   /* fieldname >= value1 AND fieldname <= value2 */
71 ;
72 
73 /* Quoted string comparison -- convert a quoted string to a sequence of
74     ANDed equalities.  This is how we search for a string with embedded
75     spaces when Bookware will only search for individual words.
76 */
77 
78 quotecondition:
79     quote stringlist quote      /* completed a quoted string */
80 ;
81 
82 quote:
83     '"'                 /* tell lex to discard nonchar symbols */
84 ;
85 
86 stringlist:
87     value                   /* fieldname = value */
88 |   stringlist value        /* AND fieldname = value */
89 ;
90 /* '(' ')', ',', OPCODE are ignored inside a stringlist */
91 /* BLANKS and ALL are treated as ordinary strings inside a string list */
92 
93 
94 /* Values for alpha fields must be quoted in Bookware searches, values for
95     numeric fields must not be quoted.  This is where we decide what to do.
96 */
97 
98 value:
99     CHARSTRING
100 ;
101 
102 /* Wild cards -- xxxx% will become -- fieldname like 'xxxx%'
103     Values must be quoted in Bookware searches.
104     numeric fields must not be quoted.  This is where we decide what to do.
105 */
106 
107 globvalue:
108     GLOBSTRING
109 ;
110 
111 rangeop:
112     RANGEOP
113 ;
114 
115 opcode:
116     OPCODE
117 ;
118 
119 comma:
120     ','
121 ;

Listing Four

 1 /* Compound logical condition */
 2 /* Ian E. Gorman, ActiveSystems Inc., Ottawa, Ontario, Canada */
 3 
 4 compoundcondition:
 5 
 6     simplecondition     /* Begin the parse */
 7     {
 8         $$ = $1;
 9     }
10 
11 |   compoundcondition simplecondition   %prec AND /* AND simplecondition */
12     {
13         $$ = $1;
14         $$->ConCat(" AND ");
15         $$->ConCat($2->Value());
16         scan->ParseList->Discard($2);
17     }
18 
19 |   compoundcondition comma simplecondition /* OR simplecondition */
20     {
21         $$ = $1;
22         $$->ConCat(" OR ");
23         $$->ConCat($3->Value());
24         scan->ParseList->Discard($2);
25         scan->ParseList->Discard($3);
26     }
27 ;
28 
29 simplecondition:
30     value                   /* fieldname = value */
31     {
32         $$ = new(scan->FieldName) ParseItem_t;
33         scan->ParseList->AddAtTail($$);
34         $$->ConCat(" = ");
35         $$->ConCat($1->Value());
36         scan->ParseList->Discard($1);
37     }
38 |   globvalue               /* fieldname like value */
39     {   /* more code here */   }
40 |   opcode value            /* fieldname >=,>,<,<= value */
41     {
42             $$ = new(scan->FieldName) ParseItem_t;
43             scan->ParseList->AddAtTail($$);
44             $$->ConCat(" ");
45             $$->ConCat($1->Value());           /* opcode */
46             $$->ConCat(" ");
47             $$->ConCat($2->Value());           /* value */
48             scan->ParseList->Discard($1);
49             scan->ParseList->Discard($2);
50     }
51 |   value rangeop value   /* fieldname >= value1 AND fieldname <= value2 */
52     {   /* more code here */   }
53 ;
54 
55 value:
56     CHARSTRING
57     {   /* Encloses a string value in single quotes, when the field is not
58             numeric. */
59         if ( scan->FieldNumeric )   {
60             /* Can put a check here for non-numeric value in numeric field.
61                 If not numeric, YYERROR */
62             $$ = $1;
63         } else  {
64             $$ = new("'") ParseItem_t;
65             $$->ConCat($1->Value());
66             $$->ConCat("'");
67             scan->ParseList->Discard($1);
68         }
69     }
70 ;

Listing Five

 1 /* Produce SQL query segment from data for one field */
 2 /* Ian E. Gorman, ActiveSystems Inc., Ottawa, Ontario, Canada */
 3 
 4 /* This function is a wrapper for the yacc parser, yyparse() */
 5 
 6 char *  cParse::FieldParse(
 7     FldItem_t * Field  /* list of field data: name, opcode, expression */
 8     )
 9 {
10     char *  Result = NULL;
11     int     i;
12     yy_parse yaccparse = yy_parse(YSTACKSIZE, ystates, yvals);
13     yy_scan * yscan;
14 
15     if ( ! Field )
16         return NULL;
17 
18     yscan = new(    /* set up the token separator */
19         Field->Expr(),
20         Field->Name(),  
21         Field->IsNumber(),        /* 1 for numeric fields, 0 otherwise */
22         Field->NullValue(),
23         Field->FirstValue()
24         ) yy_scan;
25 
26     if ( ! yscan )
27         return Result;
28 
29     if ( ! yaccparse.yyparse(yscan) )  {
30         /* parse the expression into SQL query */
31         Result = new char[strlen(yscan->output->Value())+1];  /* success */
32         strcpy(Result, yscan->output->Value());
33     }
34 
35     delete yscan;
36 
37     return Result;
38 }

Listing Six

/* Abbreviated version of class yy_parse from MKS lex and yacc */
class yy_parse {
public:
   yy_scan* scan;          // pointer to scanner
   int yydebug;    // if set, tracing if compiled with YYDEBUG=1
   yy_parse(int = 150);    // constructor for this grammar
   yy_parse(int, short *, YYSTYPE *);  // another constructor
   ~yy_parse();        // destructor
   int yyparse(yy_scan * ps);  // parse with given scanner
   void    yyreset() { reset = 1; } // restore state for next yyparse()
   void    setdebug(int y) { yydebug = y; }
// The following are useful in user actions:
   void    yyerrok() { yyerrflag = 0; }    // clear error
   void    yyclearin() { yychar = -1; }    // clear input
   int YYRECOVERING() { return yyerrflag != 0; }
};

Listing Seven

char *  cParse::MakeSQL(
   cSearchList *   ListManager /* Field list manager for current database */
   )
{
   char * StrSQL;      /* new string -- SQL query */
   char * Temp;
   FldItem_t * Element;
   enum _FieldOP   LeadingOpCode, FollowingOpCode;
   ParseItem_t *   QueryString;
   if ( ! ListManager )
       return NULL;
   /* Start with SQL command verb and data base name */
   QueryString = new("SELECT * FROM dbn WHERE (") ParseItem_t;
   /* Assume no opcode before first expression */
   LeadingOpCode = eNOP;
   for ( Element = ListManager->Head()
           ; Element != NULL
           ; Element = ListManager->Next(Element)
       )   {
       FollowingOpCode = ( ListManager->Next(Element)
                       ? ListManager->Next(Element)->OpCode()
                       : eNOP );
       /* open parenthesis at beginning of several ORed field conditions */
       // if ( FollowingOpCode == eOR && LeadingOpCode != eOR )
       //  QueryString->ConCat("(");
       /* parse each field condition, enclose result in parentheses */
       QueryString->ConCat("(");
       if ( ! (Temp = FieldParse(Element)) )   {
           LogErrorDLL(__FILE__, __LINE__,
               "NULL expression while parsing query, Parsed OK to: %s\n",
               QueryString->Value()
               );
           delete QueryString;
           return NULL;            /* incomplete string would be no good */
       }
       QueryString->ConCat(Temp);
       delete Temp;
       QueryString->ConCat(")");
       /* close parenthesis at end of several ORed field conditions */
       // if ( LeadingOpCode == eOR && FollowingOpCode != eOR )
       //  QueryString->ConCat(")");
       /* if not the last field condition, append the connector to next */
       if ( ListManager->Next(Element) != NULL )
           QueryString->ConCat(SQLtoken(FollowingOpCode));
       LeadingOpCode = FollowingOpCode;
   }
   QueryString->ConCat(");");
   StrSQL = new char[strlen(QueryString->Value())+1];
   if ( StrSQL )   {
       strcpy(StrSQL, QueryString->Value());
   }
   delete QueryString;
   return StrSQL;
}
DDJ