Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

C/C++

Practical Parsing for ANSI C


Myth #2: Lexers Are "Dumb" Objects

Listing One is the simplified flex specification for a lexer that distinguishes between identifiers and type names, depending on the context. As flex requires, sections are separated by "%%" lines.

Digit         [0-9]
Letter        [a-zA-Z_]
 ...
%%
","                { return ','; }
":"                { return ':'; }
"="                { return '='; }
 ...
"+="               { return ADD_ASSIGN;   }
">>="              { return RIGHT_ASSIGN; }
"<<="              { return LEFT_ASSIGN;  }
 ...
"auto"             { return AUTO;         }
"break"            { return BREAK;        }
"case"             { return CASE;         }
 ...
{Letter}({Letter}|{Digit})* 
                  { return category();   }
 ...
%%
#include <blocks.hpp>
 ...
int category()
{
  Block * blockp = Block::currentp;
  ...
  while (blockp) 
  {
     if (blockp->            symbol_table.name_is_defined(yytext)) 
     {
        if (blockp->            symbol_table.name_is_typedef(yytext)) 
          return TYPE_NAME;    
        else 
          return IDENTIFIER; 
     }
     blockp = blockp->parentp;
  }
  return IDENTIFIER;
}
 ...
Listing One

  • The first section defines a shorthand notation for digits and letters.
  • The second section gives rules for each lexical element of the language; for example, for one-character operators ("+", "=", ...), the lexer returns a token code equal to the character encoding, by convenience; for multicharacter operators ("+=", "<<=", ...) and keywords ("auto," "break," "case," ...), it returns symbolic token codes.
  • Finally, for strings of letters and digits starting with a letter, the lexer invokes function category() to determine whether the string must be recognized as an IDENTIFER or a TYPE_NAME. The third section gives the implementation of function category(). The function category() complies with the C scoping rules: It looks up the current token (stored in variable yytext) in the current block, then in the block that immediately surrounds it, and so on, until an entry is found or the outmost block is hit with no success. This is the purpose of the while loop in the function.

Remember that in C, any open brace "{" creates a new block and a new scope, where new variables and types can be declared that hide the objects with the same names declared in the outer blocks. Our front-end maintains a representation of all the blocks encountered in the input source code so far, in the form of a tree of objects of class Block (and yes, I write semantic actions and support code in C++). Each block contains a pointer parentp to its surrounding block, which is NULL for the global block. Each block also contains an object of class SymbolTable—a symbol table with all the variable and type declarations that have appeared so far in that block. Class SymbolTable provides a method named name_is_typedef(...), which tells whether the given name appears as a defined type name in that symbol table current block.


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.