Practical Parsing for ANSI C

Daniele discusses the design of an ANSI C parser front-end, identifying the pitfalls that make design tricky.


December 12, 2006
URL:http://www.drdobbs.com/cpp/practical-parsing-for-ansi-c/196603535

Daniele is a post-doctoral researcher at the Pacific Northwest National Laboratory focusing on supercomputing. He also has a background in low-power software optimization at the source-code level for embedded systems. He can be contacted at [email protected].


Front-ends are present in all applications that process source code—compilers, interpreters, linters, and the like. Common beliefs about the design of front-ends are that they are made up of phases (lexical, syntactic, and semantic analysis), are largely decoupled, and that lexical analysis consists of regular-expression matching. When these assumptions are true, the design of a parser is facilitated. Unfortunately, these assumptions aren't always true in the real world. In this article, I discuss the design of a front-end for ANSI C, and show the pitfalls that make it trickier than you'd expect.

Myth #1: Front-Ends Are Made Up of Independent Phases

A front-end is a program that accepts text input and produces a representation of that text in a desired form—an abstract syntax tree, for instance. Again, front-ends are present in any application that manipulates source code in a nontrivial way—compilers, interpreters, pretty printers, code analyzers (such as splint or Purify), and so on.

Anyone who has taken a compiler course recognizes the classic flow diagram of a front-end in Figure 1. It includes three phases—lexical, syntax, and semantic analysis. The lexer and parser recognize the word-level and phrase-level structure of the input text.

The lexer splits the undifferentiated stream of characters of the input program (after preprocessing, if any) into a sequence of tokens, each belonging to a well-defined category such as keyword, operator, literal constant, identifier, or type name. For example, the lexer determines that the consecutive characters "+=" constitute a single operator, rather than two separate "+" and "=" operators, and that "18.52" is a floating-point literal constant rather than the two integer literal constants "18" and "52" with a dot punctuator in the middle. The parser recognizes the language constructs (expressions, statements, declarations, and so on), and usually outputs the syntax tree corresponding to the program.

Traditionally, lexers and parsers are obtained automatically by running tools such as flex and bison over a set of specifications. Lexers are specified as a set of actions, each associated with a regular expression; parsers are specified as a set of actions, each associated with BNF syntax rules. Finally, the semantic analyzer contains all the remaining application-dependent algorithms, which implement context checks (for example, making sure that there are no nested function definitions in a C program) and connect to the back-end.

Figure 1: The classical structure of a front-end.

Unfortunately, Figure 1 can be misleading in suggesting that the three phases—lexer, parser, semantics—are independent and operate sequentially without overlaps; that is, that the input can first be entirely tokenized, then its syntax tree derived, and finally all the semantics evaluated on top of this tree. Under that assumption, you could design the lexer, parser, and semantic analyzer as separate, distinct problems. While this might be true for simple languages and applications (such as command-line calculators), it isn't true for complex languages like C. For C, it isn't even possible to complete the lexical analysis of a token before some semantic analysis has been completed for the preceding input. For instance:


typedef     unsigned long int ulint_t;  


contains the declaration of a user type ulint_t. Parsing this declaration according to the grammar of C leads to the parse tree in Figure 2(a), where ulint_t must be tokenized as an IDENTIFIER. (I write tokens such as "IDENTIFIER" in uppercase to differentiate them from nonterminal symbols in the grammar, such as "expression," "statement," and "type_name"). As a consequence of the type declaration, if this declaration is encountered:


ulint_t my_ul_int;

ulint_t must be tokenized not as an IDENTIFIER (as before), but as a TYPE_NAME. The syntax tree of this declaration is in Figure 2(b).


Figure 2: Tokens and parse tree for the declarations in the example.

C lets you define new type names, and the standard C grammar relies on the lexical distinction between IDENTIFIER and TYPE_NAME. Therefore, lexers must be able to actively enforce this distinction. Simple attempts to circumvent this difficulty do not help. For example, a lazy solution might abolish the distinction between IDENTIFIER and TYPE_NAME, consider every name an IDENTIFIER, and adapt the grammar accordingly. This solution fails, however, because it makes the grammar ambiguous, so that some declarations or statements can be parsed in multiple, inconsistent ways.

Two points are clear: first, the lexer must be able to tokenize the same name either as IDENTIFIER or as TYPE_NAME, depending on the declarations seen so far; second, to make that choice, the lexer needs semantic information (the set of type names currently in scope), which only the later phases can provide.

This leads to the structure in Figure 3—a call graph representing the simplified structure of a front-end developed with flex and bison. The application-specific code calls the parser (traditionally, function yyparse()) just once to parse the entire input. The parser calls the lexer (function yylex()) as many times as there are tokens in the input text. Every time the parser recognizes a syntax rule, an appropriate semantic action is triggered. If a declaration is recognized by the parser, the corresponding action adds a new entry to the symbol table. The new entry is visible to the lexer the next time it is invoked to fetch the following token, and it may influence whether that token is categorized as IDENTIFIER or TYPE_NAME.


Figure 3: Call-graph of a realistic front-end designed with flex and bison.

Myth #2: Lexers Are "Dumb" Objects

Listing One is the simplified flex specification for a lexer that distinguishes between identifiers and type names, depending on the context. As flex requires, sections are separated by "%%" lines.

Digit         [0-9]
Letter        [a-zA-Z_]
 ...
%%
","                { return ','; }
":"                { return ':'; }
"="                { return '='; }
 ...
"+="               { return ADD_ASSIGN;   }
">>="              { return RIGHT_ASSIGN; }
"<<="              { return LEFT_ASSIGN;  }
 ...
"auto"             { return AUTO;         }
"break"            { return BREAK;        }
"case"             { return CASE;         }
 ...
{Letter}({Letter}|{Digit})*   { return category();   }
 ...
%%
#include <blocks.hpp>
 ...
int category()
{
  Block * blockp = Block::currentp;
  ...
  while (blockp) 
  {
     if (blockp->symbol_table.name_is_defined(yytext)) 
     {
        if (blockp->symbol_table.name_is_typedef(yytext)) 
          return TYPE_NAME;    
        else 
          return IDENTIFIER; 
     }
     blockp = blockp->parentp;
  }
  return IDENTIFIER;
}
 ...
Listing One

Remember that in C, any open brace "{" creates a new block and a new scope, where new variables and types can be declared that hide the objects with the same names declared in the outer blocks. Our front-end maintains a representation of all the blocks encountered in the input source code so far, in the form of a tree of objects of class Block (and yes, I write semantic actions and support code in C++). Each block contains a pointer parentp to its surrounding block, which is NULL for the global block. Each block also contains an object of class SymbolTable—a symbol table with all the variable and type declarations that have appeared so far in that block. Class SymbolTable provides a method named name_is_typedef(...), which tells whether the given name appears as a defined type name in that block's symbol table.

A Lazy Trick

Tags are optional names you can associate to a struct or union definition in order to recall that definition later. In this example, tag_employee is a tag:

struct tag_employee 
{ 
   char     * name;
   unsigned   age;
};
/* ... */
struct tag_employee my_employee;

The peculiar thing with tags is that they live in a separate namespace from declared variables and types. Consequently, it is perfectly legal (although it may be very stupid) to use the same name both as a struct tag and as a user type name:


 1:  typedef unsigned long int ulint_t;
 2:  struct ulint_t {
 3:     char a;
 4:     float b;
 5:  } a;
 6:  void function()
 7:  {
 8:     ulint_t         my_var;
 9:     struct ulint_t  my_struct;
10:  }

Unfortunately, our current front-end fails to parse this code. In fact, it tokenizes ulint_t on line 8 as TYPE_NAME, when an IDENTIFIER is expected after the struct or union keyword, as the syntax rules from the C grammar show:


struct_or_union_specifier	
	: struct_or_union IDENTIFIER
	| struct_or_union IDENTIFIER '{' struct_declaration_list '}'
	| struct_or_union            '{' struct_declaration_list '}'
	;

To solve this, I use a simple (though not elegant) workaround. I do not make the lexer capable of distinguishing tags from identifiers and type names. Instead, I modify the grammar to make it irrelevant whether a tag is tokenized as IDENTIFIER or TYPE_NAME. To do so, for each of the previous rules involving IDENTIFIER, I add a copy that uses TYPE_NAME instead. The result is:


struct_or_union_specifier	
  : struct_or_union IDENTIFIER
  | struct_or_union IDENTIFIER      '{' struct_declaration_list '}'
  | struct_or_union TYPE_NAME
  | struct_or_union TYPE_NAME       '{' struct_declaration_list '}'
  | struct_or_union      '{' struct_declaration_list '}'   
   ;


The solution is simple and effective: it does not cause any ambiguity in the grammar, and it addresses all the cases where a tag can be used, including cast expressions such as pq = (struct tag_q *) ps;.

But Things Get Tricky...

Where declarations are concerned, C is much more permissive than you might think. This permissiveness may lead to poor readability and understandability, as in:


1: typedef unsigned long int ulint_t;
2: void function()
3: {
4:    ulint_t another_var;  /* ulint_t is used as type name */
5:    char ulint_t;         /* ulint_t is used as identifier */
6:    /*...*/
7: }

Despite first impressions, this code is perfectly legal ANSI C, but the front-end based on the lexer in Listing One fails to recognize it.

More precisely, line 1 defines ulint_t as a user type; therefore, ulint_t must be tokenized here as IDENTIFIER. Then, line 4 declares a variable of this type; therefore, ulint_t is tokenized here as TYPE_NAME, as the scoping rules require. So far, nothing is new. Now, line 5 declares a variable named ulint_t—this is allowed. Inner blocks can declare new variables and types that hide outer objects with the same names. Moreover, in the current block, there are no conflicts because no previous declarations in this block involve the name ulint_t. Since line 5 declares a variable, the lexer should tokenize ulint_t in line 5 as IDENTIFIER. Unfortunately, the lexer as specified in Listing One tokenizes it as TYPE_NAME because it finds a type declaration with that name in the outer block.

int category()
{
  Block * blockp = Block::currentp;
  ...
  while (blockp) 
  {
     if (blockp->symbol_table.name_is_defined(yytext)) 
     {
        if (blockp->symbol_table.name_is_typedef(yytext)) 
        {
           if (Block::currentp->type_stack.is_valid_type())
              return IDENTIFIER;
           else
              return TYPE_NAME;
        } 
        else 
          return IDENTIFIER; 
     }
     blockp = blockp->parentp;
  }
  return IDENTIFIER;
}
 ...
Listing Two

To solve this, the lexer must cooperate with the parser and the semantic layer. Listing Two presents a solution: the current token (ulint_t) is forced to IDENTIFIER when it is the last part of a variable declaration; that is, when what appears before the current token already constitutes a valid type specifier. Although this condition seems simple to verify, the innocent-looking test:


if (Block::currentp->type_stack.is_valid_type()) 

hides a full implementation of the C type system provided by the semantic layer. Declarations in C are defined by 108 syntax rules, which is a large part of the entire grammar of the C language. The declaration syntax of C is complex and counterintuitive, and even experienced programmers find it troublesome—qualifiers (const and volatile) may appear in different orders, type specifiers (such as long or unsigned) may modify another type specifier or come alone (then int is implied), declarations of pointers to functions are poorly readable, and much more. A real-life example of a nontrivial declaration is:


void (*signal(int sig, void (*func)(int)))(int);

My approach maintains a stack of type records in each block; the stack describes the state of the declaration currently being parsed. As each element of a declaration is parsed, a type record describing it is pushed onto the stack, in the precedence order implied by the grammar rules. Whenever a direct declarator identifier is encountered, a snapshot of the current type stack is taken, saved into the symbol table of the current block, and associated with that identifier. Figure 4 illustrates this mechanism.


Figure 4: How type stacks are used to process declarations.

Type stacks are easy data structures to process; they capture the full complexity of C declarations; and, unlike the declarations themselves, they are easy to read. To get the type of a variable in plain English, just read its type stack from top to bottom. For example, "c" in Figure 4 is an array of five pointers to integers. (For brevity, I don't discuss type-stack normalization.)

Finally, type stacks are the key to solving the issue described at the beginning of this section. In short, when the records in the current type stack already constitute a valid type, the lexer prevents tokens that are declared as user type names in outer scopes (such as ulint_t) from being tokenized as TYPE_NAME, and forces them to be IDENTIFIER.

...And Trickier

Type stacks are employed to process type specifiers, but type specifiers can appear both in declarations and in cast expressions. The two uses do not conflict, except when they occur together, as in:


1: typedef int * intp_t;
2: char pc;
3: void f() {
4:    int * pi = (intp_t) & pc;
5: }


where line 4 contains a variable declaration and initialization, and the initializer contains a cast expression. Appropriate countermeasures must be taken to tokenize intp_t in line 4 as TYPE_NAME: according to the logic described so far, it would be tokenized as IDENTIFIER, because the type stack of the enclosing declaration (int *) already constitutes a valid type. The lexer should inhibit the "force to IDENTIFIER" behavior when it is in the middle of a variable initializer. Unfortunately, whether or not we are in the middle of a variable initializer is something that only the parser can say, and only after the lexical analysis is complete. The lexical choice for intp_t would then depend on how a superset of the input text is parsed after lexical analysis: a circular dependency.

To solve the issue, I exploit the fact that cast expressions always appear between parentheses. The lexer maintains a "parenthesis level"; that is, a counter of how many parentheses are currently open. The counter is incremented at each "(" encountered, and decremented at each ")". Type names are then not forced to IDENTIFIER when the lexer is inside at least one pair of parentheses. The corresponding modification can be seen in Listing Three.

int category()
{
  Block * blockp = Block::currentp;
  ...
  while (blockp) 
  {
     if (blockp->symbol_table.name_is_defined(yytext)) 
     {
        if (blockp->symbol_table.name_is_typedef(yytext)) 
        {
           if (Block::currentp->type_stack.is_valid_type() &&
               parenthesis_level==0)
              return IDENTIFIER;
           else
              return TYPE_NAME;
        } 
        else 
          return IDENTIFIER; 
     }
     blockp = blockp->parentp;
  }
  return IDENTIFIER;
}
 ...
Listing Three
