Generating Parsers With PCYacc

If computer boot camp is your idea of fun, PCYACC probably isn't for you. But if you nee to parse input quickly and easily, then PCYACC may be worth signing up for.

June 01, 1989
URL:http://www.drdobbs.com/tools/generating-parsers-with-pcyacc/184408156

Figure 1

JUN89: GENERATING PARSERS WITH PCYACC

GENERATING PARSERS WITH PCYACC

C ++, SQL, Smalltalk, and PostScript are just a few of the grammars provided with this parser generator

Alex Lane

Alex Lane is a knowledge engineer for Technology Applications Inc. of Jacksonville, Fla. He can be reached as a lane on BIX and as ALANE on MCI mail.

YACC is an acronym for Yet Another Compiler Compiler. Its name reflects both the preoccupation with parser generators that engaged computer scientists back in the early 70s, as well as Unix's sometimes whimsical naming practices. For many years, YACC (or more properly, yacc) has been a powerful basic tool in the toolchests of compiler writers working with Unix-based systems. Recently, however, several companies have ported yacc to the MS-DOS environment. Among them is Abraxas Software, which publishes PCYACC. I reviewed Version 2.0 of the product on an ARC 386i equipped with 1 Mbyte of RAM and a hard disk.

A bare-bones definition of a compiler is "a program that reads an input file of instructions written in one language and translates them into an output file of instructions in another, lower-level language." Although the most common form of output file that the casual compiler user deals with is a linkable object code file, the output file may contain statements in any language, including a high-level language.

Indeed, the output from PCYACC is a C source program. This C program can, in turn, be compiled and linked to form an executable file. This executable file is capable, in its turn, of parsing input from a file written in whatever language was specified in the input files for PCYACC. It's perhaps a mite confusing in words, so maybe a glance at Figure 1 will help clarify the relationships among all these files.

The input, or source language, for PCYACC is what the folks at Abraxas call a grammar description language (GDL). Statements in this language resemble Backus-Naur Form (BNF) notation, which is commonly used to describe context-free grammars.

A PCYACC input program, called a grammar description program (GDP), is made up of three sections, arranged in the following order: a declaration section, a grammar rule section, and a program section. The three sections are separated from one another by the delimiter %%.

In the declaration section, anything enclosed with %{ and %} is passed directly to the output without change. This is a convenient place for C declarations and itemization of #include files. PCYACC tokens are identified in this section by the keyword %token, and operator associativity is declared using the %left, %right, and %nonassoc keywords. Precedence of operators is implied by the order in which associativity declarations appear. Data types of grammar symbols are declared with the keyword %type, and the keyword %start can be used to declare the start symbol of the grammar.

The second section of the GDP contains all the grammar description language statements for the target compiler. Statements in GDL have the following form:

  LHS: RHS {C code segment};

The LHS (left-hand side) of the rule consists of a nonterminal symbol. The RHS (right-hand side) is a sequence of zero or more grammar symbols. Following the RHS is a segment of C code that describes an associated action to be performed when the RHS is encountered. The LHS and RHS are separated by a colon, and the rule itself is terminated with a semicolon. Multiple rules having a common LHS are often expressed using a vertical bar to denote an OR choice, like this:

  LHS : RHS1 {C code for RHS1};       | RHS2 {C code for RHS2};   ...

Everything in the program section of the PCYACC input file is copied to the output without change. The manual states that the program section must contain definitions for at least three functions: the familiar main( ), a lexical-analyzer function called yylex( ), and an error-reporting function yyerror( ). I found that although definitions for these three functions must appear somewhere in the C source code for the target parsing program, the auxiliary files supplied with PCYACC show that the definitions do not, as stated in the book, have to be in the GDP.

The output of the PCYACC program is a C source file that contains any code passed in from the declaration section and all the code from the program section. In addition, the output file contains the code for a function called yyparse( ), which was generated using the grammar definition rules.

The yyparse( ) function performs look-ahead left and right (LALR) parsing for the language using the rules from the PCYACC input file. The LALR method is an efficient, bottom-up syntax-analysis technique that is used to parse virtually any programming language constructs expressible in context-free grammar terms.

The yyparse( ) function is fed by a user-written function called yylex( ), which performs lexical analysis. Basically this means that the input is broken into meaningful chunks called "tokens," which are in turn passed to the parser.

What You Get

The PCYACC package consists of a 126-page spiral bound, 8 1/2 X 11-inch manual entitled Compiler Construction on Personal Computers (with PCYACC) and three 360K diskettes. The main PCYACC diskette contains the basic PCYACC. EXE program, a couple of explanatory text files, and several subdirectories of fairly simple examples. A second diskette provides grammar description files for several languages, including K&R and ANSI C, Pascal, dBase III and IV, SQL, Smalltalk, Prolog, PostScript, and HyperTalk. The third diskette supplies a fairly complete set of grammar definition and C source files for a C++ preprocessor.

The PCYACC manual assumes that users are familiar with C, which is not an unreasonable assumption under the circumstances. It is well organized, makes good use of fonts and white space, and for the most part is clear in its explanations. Following a brief overview that includes an explanation of typographical conventions in the manual, the book uses two fairly simple examples as a basis for presenting PCYACC concepts. Much of the rest of the book then concerns itself with the theory and practice behind PCYACC, including techniques for debugging. There will be tough going here for anyone just becoming familiar with parser generators and yacc, but between the clear explanations and the numerous examples that come with PCYACC, it's an achievable goal.

From a hardware perspective, PCYACC works on most PC-style microcomputers, including PS/2 and 386-based machines. PCYACC can be run using either two floppy drives or a hard disk and one floppy drive. Software required to do compiler development with PCYACC includes a text editor (such as Brief) and a C language compiler. For those who use the Brief product, Abraxas supplies two macro files (in both compiled and source format) for use with the editor. Installation is straightforward; I merely copied the files (and subdirectories) into a subdirectory on my hard disk.

Using PCYACC

Developing software using PCYACC is a six-step procedure. First, as with any project, you identify the problem. Because, typically, parsing input is only part of the problem, PCYACC does not play a central role at this stage of the development process.

The second step is to define the source language. This definition is most commonly done using a context-free grammar notation such as BNF, although the specification for the grammar can be written directly in PCYACC's GDL. The third step is to incorporate the rules into a PCYACC grammar description file.

Next, the grammar rules should have associated action code written for them. The required support routines (main( ), yylex( ), and yyerror( )) should be coded to allow debugging of the grammar description program. This is an ideal place to use the -S command-line option, which allows the grammar description to be checked without going to the added trouble of generating a C source file. You can really appreciate this feature when working with large grammar definitions. You use PCYACC's other command-line options mostly to override default file-name conventions or to indicate actions you want the program to perform in addition to what it does automatically. Using the help option (-h), you can get a brief synopsis of all available options.

Perhaps the second most valuable service provided by a compiler is the reporting of errors in input files, where accuracy and detail count for a lot of time saved tracking down subtle errors in logic. Here, PCYACC takes advantage of the LALR parser's capability to detect syntactic errors as soon as possible while performing a left-to-right scan of the input. This results in good error trapping while generating the parser. In addition, there are a couple of features of PCYACC you can use to enhance the debugging process.

You can, for example, use the -t command-line option mentioned previously to build a parser that outputs text file parse trees to the default file YY.AST. This option is useful for debugging grammar files and studying source code written in a particular language and can help determine whether there's a problem in a new compiler or in a source file.

Alternately, you can use the -v (verbose) command-line option to produce verbose output from PCYACC in the default file YY.LRT. This file notes all the applicable rules and actions to use for all possible states of the parser. In addition, a summary at the end of the file shows a number of vital statistics, including several terminal symbols used, the number of grammar rules in the program, the number of states, and the number of errors (both of the shift/ reduce and reduce/reduce variety) found in the GDP.

Once the grammar description program has been debugged to the developer's satisfaction, PCYACC is used to compile the program into source code for yyparse( ). Other C code should be developed at this point as well. The last step is to build an executable file that correctly solves the problem you identified in the first place.

To give you an idea of how PCYACC works, let's see what it does with the specification for a crude desk calculator shown in Listing One. The code for this example draws heavily on the yacc specification for a simple desk calculator presented in the book Compilers: Principles, Techniques, and Tools, by Aho et al. (Addison-Wesley, 1986). The specification has been modified slightly for use with PCYACC.

The declaration section first passes the names of two include files to the output file and identifies a token called DIGIT. Next, both the '+' and '*' operators are declared to be left-associative, with '+' having a higher precedence than '*'. The grammar rule section contains the rules for the "language" to be used by the calculator. The symbols $$, $1, and so on have special meanings in the semantic actions. For instance, consider the following rule:

     expr: expr '+' term {$$ = $1 + $3;}

The symbol $$ refers to the value associated with the nonterminal on the LHS of the rule. The symbols $1 and $3 refer to the expr and term grammar symbols on the RHS of the rule. (If the symbol $2 were used, it would refer to the '+' operator in the RHS of the rule.)

The program part of the GDP simply provides a main( ) function (which calls the yyparse( ) generated by PCYACC); a yyerror( ) function that can accept an error string when called; and a yylex( ) function, which performs crude lexical analysis (only single-digit numbers are accepted!) and produces pairs consisting of a token and its associated value. In the case of our calculator, there is only one token (DIGIT), and its value is communicated to the parser through the yacc variable yylval.

The PCYACC specification is quite short when compared to the resulting C source program, shown in Listing Two. This listing was created by invoking PCYACC with no command-line options, as follows:

  PCYACC TEST.Y

The default name for the C source file was TEST.C. I compiled and linked this file using the Microsoft C compiler (Version 5.0) and obtained an executable file 11K in size.

Other Files

Earlier I mentioned that PCYACC comes with several grammar definition files for languages such as Pascal and Smalltalk. Before anyone gets carried away thinking that Abraxas is giving away the store and is handing out compilers for all those languages, hold on! The grammar files that come with PCYACC basically have the ability to check and see whether a file contains syntactically correct "sentences" in a particular language. So, if you go to the trouble of compiling and linking the Pascal grammar file, you'll end up with a program that can parse Pascal code and tell you whether it's succeeded or not. Although some may shrug their shoulders and stifle yawns, these files nevertheless represent some of the best example programs I've seen bundled with a professional software tool.

The one exception to the limited syntax-analysis capabilities of the grammar rule files is the collection of source code files and a PCYACC specification for a C++ preprocessor. Working with these files taught me two things about PCYACC. First, the product is capable of industrial-size projects. The PCYACC output file for the C++ preprocessor is more than 100K in size, and the overall program requires compiling and linking nearly a dozen object files. Second, for large specifications, PCYACC can take a long time to do its work. In fact, the first time I attempted to compile the C++ grammar specification, I ended up rebooting my machine because I thought it had hung. The same thing happened the second time I tried it, and only a chance look at the C++ README.DOC file (which simply notes that "this grammar takes a long time to compile") convinced me to wait patiently for PCYACC to do its job. Despite running on an 80386-based machine, PCYACC took more than five minutes to compile and output the parsing code for the C++ preprocessor.

Computer Boot Camp

Today, writing a compiler is one of those things that a typical computer scientist or engineer does in the course of pursuing an undergraduate degree. For most, the experience is sort of a computer boot camp, requiring a lot of discipline and application over the short haul but something that's eventually left behind to pursue other goals. In the old days, much of the effort of writing a compiler was devoted to the issue of parsing the input. Now this phase is one of the easiest to implement, and thanks to packages such as PCYACC, it can be easily implemented on the PC architecture.

Product Information

PYACC: Abraxas Software Inc., 7033 SW Macadam Ave., Portland, OR 97219. IBM PC, XT, AT or compatible (Macintosh and OS/2 versions also available). Requires 512K RAM and any C compiler. Price: $395.

Availability

All source code for articles in this issue is available on a single disk. To order, send $14.95 (Calif. residents add sales tax) to Dr. Dobb's Journal, 501 Galveston Dr., Redwood City, CA 94063, or call 800-356-2002 (from inside Calif.) or 800-533-4372 (from outside Calif.). Please specify the issue number and format (MS-DOS, Macintosh, Kaypro).

Generating Parsers with PCYACC by Alex Lane

[Listing One]


/* TEST.Y

This specification is based largely on the yacc
specification for a simple desk calculator from "Compilers:
Principles, Techniques and Tools," by Aho, et al. (p. 259,
Addison-Wesley, 1986).

   2/2/89 a.lane
*/

%{
#include <stdio.h>
#include <ctype.h>
%}

%token DIGIT
%left '+'
%left '*'

%%

line  :   /* nothing */
      |   expr '\n'           { printf("%d\n", $1); }
      ;
expr  :   expr '+' term       { $$ = $1 + $3; }
      |   term
      ;
term  :   term '*' factor     { $$ = $1 * $3; }
      |   factor
      ;
factor:   '(' expr ')'        { $$ = $2; }
      |   DIGIT
      ;
%%
main()    {
yyparse();
}

yylex()   {
     int c;
     if ( isdigit( ( c = getchar() ) ) )  {
          yylval = c - '0';
          return DIGIT;
     }
     return c;
}

yyerror(s)
char *s;
{
     fprintf(stderr, "PYERR: %s\n", s);
}






[Listing Two]


# line 11 "test.y"
#include <stdio.h>
#include <ctype.h>
#define DIGIT 257
#ifndef YYSTYPE
#define YYSTYPE int
#endif
YYSTYPE yylval, yyval;
#define YYERRCODE 256

# line 33 "test.y"

main()    {
yyparse();
}

yylex()   {
     int c;
     if ( isdigit( ( c = getchar() ) ) )  {
          yylval = c - '0';
          return DIGIT;
     }
     return c;
}

yyerror(s)
char *s;
{
     fprintf(stderr, "PYERR: %s\n", s);
}

FILE *yytfilep;
char *yytfilen;
int yytflag = 0;
int svdprd[2];
char svdnams[2][2];

int yyexca[] = {
  -1, 1,
  0, -1,
  -2, 0,
  0,
};

#define YYNPROD 9
#define YYLAST 218

int yyact[] = {
       5,       7,      13,       4,       8,       9,       3,       1,
       2,       0,       0,       0,       0,      12,      10,      11,
       4,       0,       3,       2,       0,       0,       0,       0,
       0,       0,       0,       0,       0,       0,       0,       0,
       0,       0,       8,       0,       0,       0,       0,       0,
       0,       0,       0,       0,       0,       0,       0,       0,
       0,       0,       0,       0,       0,       0,       0,       0,
       0,       0,       0,       0,       0,       0,       0,       0,
       0,       0,       0,       0,       0,       0,       0,       0,
       0,       0,       0,       0,       0,       0,       0,       0,
       0,       0,       0,       0,       0,       0,       0,       0,
       0,       0,       0,       0,       0,       0,       0,       0,
       0,       0,       0,       0,       0,       0,       0,       0,
       0,       0,       0,       0,       0,       0,       0,       0,
       0,       0,       0,       0,       0,       0,       0,       0,
       0,       0,       0,       0,       0,       0,       0,       0,
       0,       0,       0,       0,       0,       0,       0,       0,
       0,       0,       0,       0,       0,       0,       0,       0,
       0,       0,       0,       0,       0,       0,       0,       0,
       0,       0,       0,       0,       0,       0,       0,       0,
       0,       0,       0,       0,       0,       0,       0,       0,
       0,       0,       0,       0,       0,       0,       0,       0,
       0,       0,       0,       0,       0,       0,       0,       0,
       0,       0,       0,       0,       0,       0,       0,       0,
       0,       0,       0,       0,       0,       0,       0,       0,
       0,       0,       0,       0,       0,       0,       0,       0,
       0,       0,       0,       0,       0,       0,       0,       0,
       0,       6,
};

int yypact[] = {
     -40,   -1000,      -9,     -37,   -1000,     -40,   -1000,   -1000,
     -40,     -40,     -39,     -37,   -1000,   -1000,
};

int yypgo[] = {
       0,       7,       8,       6,       3,
};

int yyr1[] = {
       0,       1,       1,       2,       2,       3,       3,       4,
       4,
};

int yyr2[] = {
       2,       0,       2,       3,       1,       3,       1,       3,
       1,
};

int yychk[] = {
   -1000,      -1,      -2,      -3,      -4,      40,     257,      10,
      43,      42,      -2,      -3,      -4,      41,
};

int yydef[] = {
       1,      -2,       0,       4,       6,       0,       8,       2,
       0,       0,       0,       3,       5,       7,
};

int *yyxi;


/*****************************************************************/
/* PCYACC LALR parser driver routine -- a table driven procedure */
/* for recognizing sentences of a language defined by the        */
/* grammar that PCYACC analyzes. An LALR parsing table is then   */
/* constructed for the grammar and the skeletal parser uses the  */
/* table when performing syntactical analysis on input source    */
/* programs. The actions associated with grammar rules are       */
/* inserted into a switch statement for execution.               */
/*****************************************************************/


#ifndef YYMAXDEPTH
#define YYMAXDEPTH 200
#endif
#ifndef YYREDMAX
#define YYREDMAX 1000
#endif
#define PCYYFLAG -1000
#define WAS0ERR 0
#define WAS1ERR 1
#define WAS2ERR 2
#define WAS3ERR 3
#define yyclearin pcyytoken = -1
#define yyerrok   pcyyerrfl = 0
YYSTYPE yyv[YYMAXDEPTH];     /* value stack */
int pcyyerrct = 0;           /* error count */
int pcyyerrfl = 0;           /* error flag */
int redseq[YYREDMAX];
int redcnt = 0;
int pcyytoken = -1;          /* input token */


yyparse()
{
  int statestack[YYMAXDEPTH]; /* state stack */
  int      j, m;              /* working index */
  YYSTYPE *yypvt;
  int      tmpstate, tmptoken, *yyps, n;
  YYSTYPE *yypv;


  tmpstate = 0;
  pcyytoken = -1;
#ifdef YYDEBUG
  tmptoken = -1;
#endif
  pcyyerrct = 0;
  pcyyerrfl = 0;
  yyps = &statestack[-1];
  yypv = &yyv[-1];


  enstack:    /* push stack */
#ifdef YYDEBUG
    printf("at state %d, next token %d\n", tmpstate, tmptoken);
#endif
    if (++yyps - &statestack[YYMAXDEPTH] > 0) {
      yyerror("pcyacc internal stack overflow");
      return(1);
    }
    *yyps = tmpstate;
    ++yypv;
    *yypv = yyval;

  newstate:
    n = yypact[tmpstate];
    if (n <= PCYYFLAG) goto defaultact; /*  a simple state */


    if (pcyytoken < 0) if ((pcyytoken=yylex()) < 0) pcyytoken = 0;
    if ((n += pcyytoken) < 0 || n >= YYLAST) goto defaultact;


    if (yychk[n=yyact[n]] == pcyytoken) { /* a shift */
#ifdef YYDEBUG
      tmptoken  = pcyytoken;
#endif
      pcyytoken = -1;
      yyval = yylval;
      tmpstate = n;
      if (pcyyerrfl > 0) --pcyyerrfl;
      goto enstack;
    }

  defaultact:

    if ((n=yydef[tmpstate]) == -2) {
      if (pcyytoken < 0) if ((pcyytoken=yylex())<0) pcyytoken = 0;
      for (yyxi=yyexca; (*yyxi!= (-1)) || (yyxi[1]!=tmpstate); yyxi += 2);
      while (*(yyxi+=2) >= 0) if (*yyxi == pcyytoken) break;
      if ((n=yyxi[1]) < 0) { /* an accept action */
        if (yytflag) {
          int ti; int tj;
          yytfilep = fopen(yytfilen, "w");
          if (yytfilep == NULL) {
            fprintf(stderr, "Can't open t file: %s\n", yytfilen);
            return(0);          }
          for (ti=redcnt-1; ti>=0; ti--) {
            tj = svdprd[redseq[ti]];
            while (strcmp(svdnams[tj], "$EOP"))
              fprintf(yytfilep, "%s ", svdnams[tj++]);
            fprintf(yytfilep, "\n");
          }
          fclose(yytfilep);
        }
        return (0);
      }
    }


    if (n == 0) {        /* error situation */
      switch (pcyyerrfl) {
        case WAS0ERR:          /* an error just occurred */
          yyerror("syntax error");
          yyerrlab:
            ++pcyyerrct;
        case WAS1ERR:
        case WAS2ERR:           /* try again */
          pcyyerrfl = 3;
        /* find a state for a legal shift action */
          while (yyps >= statestack) {
          n = yypact[*yyps] + YYERRCODE;
          if (n >= 0 && n < YYLAST && yychk[yyact[n]] == YYERRCODE) {
            tmpstate = yyact[n];  /* simulate a shift of "error" */
            goto enstack;
            }
          n = yypact[*yyps];


          /* the current yyps has no shift on "error", pop stack */
#ifdef YYDEBUG
            printf("error: pop state %d, recover state %d\n", *yyps, yyps[-
1]);
#endif

          --yyps;
          --yypv;
        }


        yyabort:
            if (yytflag) {
              int ti; int tj;
              yytfilep = fopen(yytfilen, "w");
              if (yytfilep == NULL) {
                fprintf(stderr, "Can't open t file: %s\n", yytfilen);
                return(1);              }
              for (ti=1; ti<redcnt; ti++) {
                tj = svdprd[redseq[ti]];
                while (strcmp(svdnams[tj], "$EOP"))
                  fprintf(yytfilep, "%s ", svdnams[tj++]);
                fprintf(yytfilep, "\n");
              }
              fclose(yytfilep);
            }
          return(1);


      case WAS3ERR:  /* clobber input char */
#ifdef YYDEBUG
          printf("error: discard token %d\n", pcyytoken);
#endif
          if (pcyytoken == 0) goto yyabort; /* quit */
        pcyytoken = -1;
        goto newstate;      } /* switch */
    } /* if */


    /* reduction, given a production n */
#ifdef YYDEBUG
    printf("reduce with rule %d\n", n);
#endif
    if (yytflag && redcnt<YYREDMAX) redseq[redcnt++] = n;
    yyps -= yyr2[n];
    yypvt = yypv;
    yypv -= yyr2[n];
    yyval = yypv[1];
    m = n;
    /* find next state from goto table */
    n = yyr1[n];
    j = yypgo[n] + *yyps + 1;
    if (j>=YYLAST || yychk[ tmpstate = yyact[j] ] != -n) tmpstate =
yyact[yypgo[n]];
    switch (m) { /* actions associated with grammar rules */

      case 2:
# line 22 "test.y"
      { printf("%d\n", yypvt[-1]); } break;
      case 3:
# line 24 "test.y"
      { yyval = yypvt[-2] + yypvt[-0]; } break;
      case 5:
# line 27 "test.y"
      { yyval = yypvt[-2] * yypvt[-0]; } break;
      case 7:
# line 30 "test.y"
      { yyval = yypvt[-1]; } break;    }
    goto enstack;
}