Dr. Dobb's | Recursive Descent PEG Parsers Using C++ Templates

Recursive Descent PEG Parsers Using C++ Templates

YARD -- "Yet Another Recursive Descent" -- is a C++ parsing framework

January 02, 2009
URL:http://www.drdobbs.com/tools/recursive-descent-peg-parsers-using-c-te/212700432

Christopher is a freelance programmer and consultant, with a particular interest in the design and implementation of programming languages. He can be contacted at [email protected].

In this article, I introduce the YARD (short for "Yet Another Recursive Descent") C++ parsing framework. YARD is based on the PEG ("Parsing Expression Grammars") formalism, and makes heavy use of template programming. The complete source code for YARD is available online from Dr. Dobb's and Google Code.

LR Parsers and BNFs

The syntax of many programming languages is expressed using a grammar in a form called Backus Normal Form (BNF). BNF is a particular representation of Context-Free Grammars (CFG). Often people construct programming language parsers that first break the input into tokens, a process called "lexical analysis." They would then parse the list of tokens using a LR (Left-to-right scanning Right-Most derivation) parser.

Writing tokenizers and LR parsers, in particular the lookup tables used by most LR parsers, is an arduous task. However, many tools exist that, given a grammar, automatically generate the tokenizing and parsing code (Lex and YACC, for instance). But there are several problems with the traditional approach of building LR parsers using code generation tools:

Code generators are complex pieces of software that have their own peculiar syntax and a steep learning curve.
Debugging the grammars and generated code is challenging.
Converting from BNF form to an appropriate LR form is difficult.
Separate passes and tools for lexing and parsing is inconvenient.

Recursive Descent Parsers

Recursive descent (RD) parsers are a form of parser that is easier to construct by hand, but have been unfairly shunned due to a perceived lack of efficiency. The other phases of syntactic analysis (construction of the AST, for instance) far outweigh any performance hit associated with RD parsers. Furthermore, in my own comparisons I have seen YARD parsers perform on the same order of magnitude as well as an LR parser generated by YACC for parsing the C language (for simple grammar recognition tasks).

One of the pleasant aspects of RD parsers is that they resemble the BNF expression of the grammar. Most production rules in the grammar can be mapped to a function that looks at the input and sees if the current input can be constructed using the definition of the rule that it represents. Because rules are constructed from other rules, we call the functions for recognizing the other production rules. As rules can refer to themselves directly or indirectly, this process is recursive.

Parsing Expression Grammars

Recently a new grammar formalism by Bryan Ford called a "Parsing Expression Grammar" (PEG) has become increasingly popular for expressing the syntax of programming languages. PEGs can be used to construct very efficient parsers. A memoizing parser called a packrat parser has been shown to have linear time complexity.

Personally, I am most interested in the fact that the formalism is unambiguous, lends itself more naturally to the construction of parsers, and can be used to eliminate the tokenization phase. In a PEG, the entire grammar can be described with a single grammar.

PEGs can be viewed as a formal definition of a parser. Unlike a BNF, a PEG does not say how to produce legal syntactic phrases, but rather how to recognize legal syntactic phrases. This means that a PEG rule is a matching rule whereas a BNF rule is a production rule. We cannot interpret a grammar in BNF literally as a PEG, but oftentimes, the translation is straightforward. One notable feature of a PEG, which is absent in a BNF, are zero-width assertions. These are rules that do not consume input. Such rules are useful for expressing branching logic within the grammar.

Yet Another Recursive Descent (YARD) Parsing Framework

The Yet Another Recursive Descent (YARD ) parsing framework I present here is a simple yet powerful RD parsing framework that represents grammar rules as types, and PEG operators as templates. This approach lets you construct an efficient parser by writing out grammar rules as types that inherit from other types (including template instantions).

The YARD technique was inspired by the use of expression templates and operator overloading in the Boost.Spirit library by Joel de Guzman. By making templates explicit, it restricts YARD grammars to be static (they cannot be constructed at runtime). This reduces flexibility but improves the performance of the parser.

Listing One is an example of a YARD grammar, where a snippet of an XML parsing grammar is shown. Take notice of the close correspondence of the YARD grammar to the official XML specification.

struct Element :
  Or<EmptyElemTag, Seq<STag, Content, ETag> >
{ };
struct STag :
  Seq<
    Char<'<'>,
    Name,
    Star<Seq<S, Attribute> >,
    Opt<S>,
    Char<'>'>
  >
{ };
struct Attribute :
  Seq<Name, Eq, AttValue>
{ };
struct ETag :
  Seq<CharSeq<'<','/'>, Name, Opt<S>, Char<'>'> >
{ };
struct Content :
  Seq<
    Opt<CharData>,
    Star<
      Seq<
       Or<
          Element,
          Reference,
          CDSect,
          PI,
          Comment
       >,
       Opt<CharData>
      >
    >
  >
{ };
struct EmptyElemTag :
  Seq<
    Char<'<'>,
    Name,
    Star<Seq<S, Attribute> >,
    Opt<S>,
    CharSeq<'/','>'>
  >
{ };

Listing One

YARD has also inspired two other C++ parsing libraries that use the same basic architecture—the < a href="p-stade.sourceforge.net/biscuit/">Biscuit Library by Shunsuke Sogame, and the Parsing Expression Grammar Template Library (PEGTL) by Colin Hirsch. The Biscuit Library is an extension of a previous version of YARD, which includes many features from Boost. The PEGTL is an independent library that uses C++0x features such as variadic templates for greater flexibility. PEGTL provides significantly enhanced diagnostic and input abstraction facilities.

Grammar Rules

In YARD, each grammar rule is represented by a type (I use structs, to avoid having to write public everywhere in my grammar) that implements a static Match member function template. The Match function is provided implicitly by having the struct inherit either from the primitive rule types or from instantiations of the templates representing the rule operators (such as Seq, Or, and Star). The Match function template has the following signature:


template<typename ParserState_T>
static bool Match(ParserState_T& p);

The Match function template accepts a state management object and returns True or False depending on whether the input matches the corresponding rule. Match is required to restore the state parameter to its original state if it fails to find a match. Restoring the state is done automatically by the basic rule combinators.

Parsing and Parser State Management

Parsing an input stream with YARD is done by passing a parser state management object, simply called the parser from here on out, to the start rule of the grammar. The start rule is the rule that describes a well-formed document or snippet in your language.

The parser is responsible for managing the iterator to the input as well as the Abstract Syntax Tree (AST). Listing Two shows the required interface of a parser; YARD grammars are parameterized (specifically the Match template function) so that different kinds of parsers can be used with the same grammar.

class ParserState
{
public:
 // Parse function
 template<typename StartRule_T>
 bool Parse();
 // constructor
 Parser(iterator first, iterator last)
 // public typedefs
 typedef iterator; // an iterator over input
 typedef value_type; // type of input tokens
 typedef node_type; // type of nodes in abstract syntax tree
 // functions for manipulating and accessing the input iterator
 value_type GetElem();
 void GotoNext();
 iterator GetPos();
 void SetPos(iterator pos);
 bool AtEnd();
 iterator Begin();
 iterator End();
 // functions for constructing the abstract syntax tree
 void StartNode(int type);
 void CompleteNode(int type);
 void AbandonNode(int type);
};

Listing Two

The file yard_parser.hpp includes a simple parser that parses ASCII text and generates a simple AST. This can be used as is, if your parsing needs are not particularly demanding, but you will probably want to write your own parser object and possibly your own AST class if you want to squeeze more performance out of the allocation/deallocation scheme.

Rule Combinators

Rule combinators are templates that correspond to operations on rules. The combinator combines rules to create a new rule, and as such provides its own implementation of the Match template function. Most YARD grammar rules are defined as structs that inherit from instantiations of these templates. The core rule combinators are the:

Seq combinator that takes a sequence of rules as operands and tries to match each one in sequence. Seq only succeeds if all rules succeed in matching the input in order. If any rule fails, the parser is restored to its original state just before it attempted to match the first rule, and the rule created by the Seq combinator returns False.
Or combinator tries to match each rule in sequence, until one of them is successful. The rule created by Or returns True if and only if one of the rules is matched successfully.
Star combinator tries to match a single rule as many times as possible. The Star combinator will always succeed (that is, Match will return True), but it might not advance the input iterator. Plus combinator is like the Star combinator, but requires the parameter to match at least once to be successful. Opt combinator tries to match a rule precisely one time, but always returns True whether the rule succeeds. It will not advance the input iterator of the parser unless the rule parameter matched successfully. At combinator tries to match a rule, and returns True if and only if the rule matches. However, the input iterator is not advanced. This is an example of a zero-width rule and is a feature of PEG grammars that is absent from CFG grammars. NotAt combinator is similar to the At combinator in that it is a zero-width rule, but returns True only if the rule fails to match. Upon a successful match, the input iterator is not advanced.

Primitive Rules

The YARD framework comes with a number of predefined grammar rules for parsing ASCII character strings. These are in the file yard_text_grammar.hpp. If you were to extend the framework to handle other types of input (such as Unicode), it would require that you define new base rules, though the rule combinators could stay the same.

For example, the CharSeq type is a rule that attempts to match a sequence of ASCII characters. You can see it in this code example:


struct DefineKeyword :
 Seq<Word<CharSeq<'d','e','f','i','n','e'> >,WS >
{ };

In the YARD framework, the Word combinator takes a rule as a parameter, and tries to match it. The Word combinator is only successful if it can match its parameter, and immediately following the parameter there are no alphanumeric ASCII characters (which are identified using the AlphaNum rule). The Word combinator is defined as:


template<typename T>
struct Word :
   Seq<T, NotAt<AlphaNum>>
{ };

The WS rule matches whitespaces such as blank spaces, carriage returns, and tab characters.


struct WS :
Star<CharSetParser<WhiteSpaceCharSet>>>
{ };

Character sets are represented by instances of a dedicated class template. These are most prominently used by the CharSetParser rule template. A set of functions used for defining character sets can be found in the file yard_char_set.hpp.

Actions and AST Construction

An action is a parsing rule that executes some procedure in its Match template function. For example, the Finao ("Failure Is Not An Option") action attempts to match a rule to the input but throws an exception if that rule fails. An exception causes a YARD parser to halt immediately with an error message.

The Store action is particularly useful for constructing nodes in the AST (Abstract Syntax Tree). When the Match function template of a Store action is first entered, it calls the StartNode() member function of the parser state management object. The Match function is then called on the rule parameter, and if successful, CompleteNode is called on the parser state management object. Otherwise, the AbandonNode() member function of the parser state management object is called.

The parse tree used in the YARD framework is organized as a k-ary tree, where each node points to its first child, and its sibling. The tree nodes are instances of AbstractTreeNode and provide access to a type_info reference that corresponds to the rule associated with the Store action.

Tips and Tricks

It is not uncommon to have referential cycles among rules (for example, rule X refers to rule Y, which refers to rule X). These cause a compilation error. To resolve the error, simply provide a forward declaration of the latter of the two rules at the top of the header file containing the grammar.

If you want to improve performance of a parser, consider reordering the rule arguments to the sequential choice operator (Or). Also, you may want to look into alternate methods of constructing an AST by using a custom parser and new actions.

Final Words

The YARD framework is public domain code. This means that it can be used for any purpose, with no restriction, obligations, and of course, warrantee. If for some reason public domain doesn't appeal to you, I’ll give you a version with whatever license you want. You can even pay me if you want. I’d love to hear about your experiences using or modifying YARD, so drop me a note at [email protected].

Acknowledgments

I wish to warmly thank Colin Hirsch, Kris Unger, and Max Lybbert for their hard work reviewing and commenting on various versions of this article. Extra thanks to Max Lybbert for managing the YARD SourceForge site.