C/C++

Practical Parsing for ANSI C

By Daniele Paolo Scarpazza, December 12, 2006

Daniele discusses the design of an ANSI C parser front-end, identifying the pitfalls that make design tricky.

But Things Get Tricky...

Where declarations are concerned, C is much more permissive than you might think. This permissiveness may lead to poor readability and understandability, as in:


1: typedef unsigned long int ulint_t;  
2: void function() 
3: {
4:    ulint_t another_var;  
 /* ulint_t is used as type name */
5:    char ulint_t;         
 /* ulint_t is used as identifier */
6:    /*...*/
7: }

Despite first impressions, this code is perfectly legal ANSI C, but the front-end based on the lexer in Listing One fails to recognize it.

More precisely, line 1 defines ulint_t as a user type; therefore, ulint_t must be tokenized here as IDENTIFIER. Then, line 4 declares a variable of this type; therefore, ulint_t is tokenized here as TYPE_NAME, as the scoping rules require. So far, nothing is new. Now, line 5 declares a variable named ulint_t—this is allowed. Inner blocks can declare new variables and types that hide outer objects with the same names. Moreover, in the current block, there are no conflicts because no previous declarations in this block involve the name ulint_t. Since line 5 declares a variable, the lexer should tokenize ulint_t in line 5 as IDENTIFIER. Unfortunately, the lexer as specified in Listing One tokenizes it as TYPE_NAME because it finds a type declaration with that name in the outer block.

int category()
{
  Block * blockp = Block::currentp;
  ...
  while (blockp) 
  {
     if (blockp->symbol_table.name_is_defined(yytext)) 
     {
        if (blockp->symbol_table.name_is_typedef(yytext)) 
        {
           if (Block::currentp->type_stack.is_valid_type())
         return IDENTIFIER;
       else 
         return TYPE_NAME;    
        } 
        else 
          return IDENTIFIER; 
     }
     blockp = blockp->parentp;
  }
  return IDENTIFIER;
}
 ...

Listing Two

To solve this, the lexer must cooperate with the parser and the semantic layer. Listing Two presents a solution. You force the current token (ulint_t) as IDENTIFIER when it is the last part of a variable declaration; for instance, when what appears before the current token is a valid type specifier. Although this condition seems simple to verify, the simple condition:


if (Block::currentp->       type_stack.is_valid_type())

hides a full implementation of the C type system provided by the semantic layer. Declarations in C are defined by 108 syntax rules, which is a large part of the entire grammar of the C language. The declaration syntax of C is complex and counterintuitive, and even experienced programmers find it troublesome—qualifiers (const and volatile) may appear in different orders, type specifiers (such as long or unsigned) may modify another type specifier or come alone (then int is implied), declarations of pointers to functions are poorly readable, and much more. A real-life example of a nontrivial declaration is:


void (*signal(int sig,       void (*func)(int))) (int);

My approach maintains a stack of type records in each block, which describes the state of the current declaration. When each element in a declaration (type record) is parsed, according to the correct precedence implied by the grammar rules, it is pushed onto a stack. Whenever a direct declarator identifier is encountered, a snapshot of the current type stack is taken, saved into the symbol table of the current block, and associated to that identifier. Figure 4 illustrates this mechanism.

[Click image to view at full size]

Figure 4: How type stacks are used to process declarations.

Type stacks are easy data structures to process, they capture the full complexity of C declarations; and unlike declarations, they are easy to read. To get the type of a variable in plain English, just read its type stack from top to bottom. For example, "c" in Figure 4 is an array of five pointers to integers. (For brevity, I don't discuss type-stack normalization.)

Finally, type stacks are the key to solve the issue described at the beginning of this section. In short, when the records in the current type stack constitute a valid type, the lexer prevents tokens that are declared as user type names outside (such as ulint_t) to be tokenized as TYPE_NAME, and forces them to be IDENTIFIER.

Previous 1 2 3 4 5 Next

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

C/C++

Practical Parsing for ANSI C

But Things Get Tricky...

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

C/C++ Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

C/C++

Practical Parsing for ANSI C

But Things Get Tricky...

Related Reading

News

Commentary

Slideshow

Video

Most Popular

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

C/C++ Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content