Channels ▼


Practical Parsing for ANSI C

Source Code Accompanies This Article. Download It Now.

But Things Get Tricky...

Where declarations are concerned, C is much more permissive than you might think. This permissiveness may lead to poor readability and understandability, as in:

1: typedef unsigned long int ulint_t;  
2: void function() 
3: {
4:    ulint_t another_var;  
 /* ulint_t is used as type name */
5:    char ulint_t;         
 /* ulint_t is used as identifier */
6:    /*...*/
7: }

Despite first impressions, this code is perfectly legal ANSI C, but the front-end based on the lexer in Listing One fails to recognize it.

More precisely, line 1 defines ulint_t as a user type; therefore, ulint_t must be tokenized here as IDENTIFIER. Then, line 4 declares a variable of this type; therefore, ulint_t is tokenized here as TYPE_NAME, as the scoping rules require. So far, nothing is new. Now, line 5 declares a variable named ulint_t—this is allowed. Inner blocks can declare new variables and types that hide outer objects with the same names. Moreover, in the current block, there are no conflicts because no previous declarations in this block involve the name ulint_t. Since line 5 declares a variable, the lexer should tokenize ulint_t in line 5 as IDENTIFIER. Unfortunately, the lexer as specified in Listing One tokenizes it as TYPE_NAME because it finds a type declaration with that name in the outer block.

int category()
  Block * blockp = Block::currentp;
  while (blockp) 
     if (blockp->symbol_table.name_is_defined(yytext)) 
        if (blockp->symbol_table.name_is_typedef(yytext)) 
           if (Block::currentp->type_stack.is_valid_type())
         return IDENTIFIER;
         return TYPE_NAME;    
          return IDENTIFIER; 
     blockp = blockp->parentp;
  return IDENTIFIER;
Listing Two

To solve this, the lexer must cooperate with the parser and the semantic layer. Listing Two presents a solution. You force the current token (ulint_t) as IDENTIFIER when it is the last part of a variable declaration; for instance, when what appears before the current token is a valid type specifier. Although this condition seems simple to verify, the simple condition:

if (Block::currentp->       type_stack.is_valid_type()) 

hides a full implementation of the C type system provided by the semantic layer. Declarations in C are defined by 108 syntax rules, which is a large part of the entire grammar of the C language. The declaration syntax of C is complex and counterintuitive, and even experienced programmers find it troublesome—qualifiers (const and volatile) may appear in different orders, type specifiers (such as long or unsigned) may modify another type specifier or come alone (then int is implied), declarations of pointers to functions are poorly readable, and much more. A real-life example of a nontrivial declaration is:

void (*signal(int sig,       void (*func)(int))) (int);

My approach maintains a stack of type records in each block, which describes the state of the current declaration. When each element in a declaration (type record) is parsed, according to the correct precedence implied by the grammar rules, it is pushed onto a stack. Whenever a direct declarator identifier is encountered, a snapshot of the current type stack is taken, saved into the symbol table of the current block, and associated to that identifier. Figure 4 illustrates this mechanism.

[Click image to view at full size]

Figure 4: How type stacks are used to process declarations.

Type stacks are easy data structures to process, they capture the full complexity of C declarations; and unlike declarations, they are easy to read. To get the type of a variable in plain English, just read its type stack from top to bottom. For example, "c" in Figure 4 is an array of five pointers to integers. (For brevity, I don't discuss type-stack normalization.)

Finally, type stacks are the key to solve the issue described at the beginning of this section. In short, when the records in the current type stack constitute a valid type, the lexer prevents tokens that are declared as user type names outside (such as ulint_t) to be tokenized as TYPE_NAME, and forces them to be IDENTIFIER.

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.