Recursive Descent PEG Parsers Using C++ Templates


Christopher is a freelance programmer and consultant, with a particular interest in the design and implementation of programming languages. He can be contacted at cdiggins@gmail.com.


In this article, I introduce the YARD (short for "Yet Another Recursive Descent") C++ parsing framework. YARD is based on the PEG ("Parsing Expression Grammars") formalism, and makes heavy use of template programming. The complete source code for YARD is available online from Dr. Dobb's and Google Code.

LR Parsers and BNFs

The syntax of many programming languages is expressed using a grammar written in Backus-Naur Form (BNF). BNF is a notation for Context-Free Grammars (CFG). Programming language parsers are often built in two stages: the input is first broken into tokens, a process called "lexical analysis," and the resulting list of tokens is then parsed using an LR (left-to-right scan, rightmost derivation) parser.

Writing tokenizers and LR parsers, in particular the lookup tables used by most LR parsers, is an arduous task. However, many tools exist that, given a grammar, automatically generate the tokenizing and parsing code (Lex and YACC, for instance). But there are several problems with the traditional approach of building LR parsers using code generation tools:

  • Code generators are complex pieces of software that have their own peculiar syntax and a steep learning curve.
  • Debugging the grammars and generated code is challenging.
  • Converting from BNF form to an appropriate LR form is difficult.
  • Using separate passes and tools for lexing and parsing is inconvenient.

Recursive Descent Parsers

Recursive descent (RD) parsers are easier to construct by hand, but have been unfairly shunned due to a perceived lack of efficiency. In practice, the other phases of syntactic analysis (construction of the AST, for instance) far outweigh any performance hit associated with RD parsers. Furthermore, in my own comparisons on simple grammar recognition tasks, I have seen YARD parsers perform within the same order of magnitude as an LR parser generated by YACC for parsing the C language.

One of the pleasant aspects of RD parsers is that they resemble the BNF expression of the grammar. Most production rules in the grammar can be mapped to a function that looks at the input and sees if the current input can be constructed using the definition of the rule that it represents. Because rules are constructed from other rules, we call the functions for recognizing the other production rules. As rules can refer to themselves directly or indirectly, this process is recursive.
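
As a rough illustration of this mapping, the sketch below hand-codes a recognizer for a small arithmetic grammar. The grammar, the Recognizer type, and the recognizes helper are hypothetical names invented for this example and are not part of YARD. Each rule becomes one member function, and factor calls back into expr, which is where the recursion comes from.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical grammar, chosen only for illustration:
//   expr   ::= term ('+' term)*
//   term   ::= factor ('*' factor)*
//   factor ::= digit+ | '(' expr ')'
struct Recognizer {
    const std::string& text;
    std::size_t pos;

    explicit Recognizer(const std::string& s) : text(s), pos(0) {}

    // True if the next unread character is c.
    bool at(char c) const { return pos < text.size() && text[pos] == c; }

    bool expr() {
        if (!term()) return false;
        while (at('+')) {
            ++pos;
            if (!term()) return false;
        }
        return true;
    }

    bool term() {
        if (!factor()) return false;
        while (at('*')) {
            ++pos;
            if (!factor()) return false;
        }
        return true;
    }

    bool factor() {
        if (at('(')) {
            ++pos;
            if (!expr()) return false;   // indirect recursion: factor -> expr
            if (!at(')')) return false;
            ++pos;
            return true;
        }
        std::size_t start = pos;
        while (pos < text.size() &&
               std::isdigit(static_cast<unsigned char>(text[pos]))) {
            ++pos;
        }
        return pos > start;              // at least one digit is required
    }
};

// The input is accepted only if expr matches and nothing is left over.
bool recognizes(const std::string& input) {
    Recognizer r(input);
    return r.expr() && r.pos == input.size();
}

// Small usage check; call it from main() to exercise the recognizer.
void recognizer_examples() {
    assert(recognizes("1+2*(3+4)"));
    assert(!recognizes("1+"));
}
```

Note how each function body reads almost like the right-hand side of the rule it implements.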

Parsing Expression Grammars

Recently a new grammar formalism by Bryan Ford called a "Parsing Expression Grammar" (PEG) has become increasingly popular for expressing the syntax of programming languages. PEGs can be used to construct very efficient parsers. A memoizing parser called a packrat parser has been shown to have linear time complexity.
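
To make the memoization idea concrete, here is a minimal sketch of a memoizing recognizer. It is an illustration only: it is not Ford's implementation, and it is not how YARD works. The grammar A <- '(' A ')' / '(' A ']' / 'x' and the names MemoParser, parse_a, and evaluations are assumptions chosen for the example. When the first alternative fails, the second alternative asks for A at the same position again, and the memo table answers without re-evaluating the rule.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Marker for "the rule does not match at this position".
const std::size_t kFail = std::string::npos;

// Hypothetical grammar, chosen only for illustration:
//   A <- '(' A ')' / '(' A ']' / 'x'
class MemoParser {
public:
    explicit MemoParser(std::string input)
        : evaluations(0), text_(std::move(input)) {}

    // Returns the position just past a match of A starting at pos, or kFail.
    std::size_t parse_a(std::size_t pos) {
        std::map<std::size_t, std::size_t>::const_iterator hit = memo_.find(pos);
        if (hit != memo_.end()) return hit->second;  // already computed

        ++evaluations;  // counts how often the rule body really runs
        std::size_t result = kFail;
        if (at(pos, '(')) {
            // Alternative 1: '(' A ')'
            std::size_t inner = parse_a(pos + 1);
            if (inner != kFail && at(inner, ')')) {
                result = inner + 1;
            } else {
                // Alternative 2: '(' A ']' asks for A at the same position;
                // the memo table answers without re-parsing.
                inner = parse_a(pos + 1);
                if (inner != kFail && at(inner, ']')) result = inner + 1;
            }
        } else if (at(pos, 'x')) {
            // Alternative 3: 'x'
            result = pos + 1;
        }
        memo_[pos] = result;
        return result;
    }

    // True if A matches the whole input.
    bool matches() { return parse_a(0) == text_.size(); }

    std::size_t evaluations;

private:
    bool at(std::size_t pos, char c) const {
        return pos < text_.size() && text_[pos] == c;
    }

    std::string text_;
    std::map<std::size_t, std::size_t> memo_;  // start position -> result
};

// Small usage check; call it from main() to exercise the parser.
void memo_examples() {
    MemoParser p("((x]]");
    assert(p.matches());
    assert(p.evaluations == 3);  // one evaluation per start position 0, 1, 2
}
```

Because each start position is evaluated at most once, the work grows linearly with the input in this sketch, at the cost of the memory held by the table.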

Personally, I am most interested in the fact that the formalism is unambiguous, lends itself more naturally to the construction of parsers, and can be used to eliminate the tokenization phase. With a PEG, the entire syntax, lexical structure included, can be described with a single grammar.

PEGs can be viewed as a formal definition of a parser. Unlike a BNF, a PEG does not say how to produce legal syntactic phrases, but rather how to recognize them. This means that a PEG rule is a matching rule whereas a BNF rule is a production rule. We cannot interpret a grammar in BNF literally as a PEG, but oftentimes the translation is straightforward. One notable feature of a PEG that is absent from a BNF is the zero-width assertion: a rule that tests the input without consuming it. Such rules are useful for expressing branching logic within the grammar.
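
A zero-width assertion can be sketched with plain functions, as below. This is an assumption-laden illustration rather than YARD code: the Cursor alias and the functions literal, at, not_at, any_char, and comment are hypothetical names. The assertions take the cursor by value, so the caller's position never moves, and the comment rule uses a negative assertion to stop just before the closing delimiter.

```cpp
#include <cassert>
#include <cstring>

// Cursor into a NUL-terminated input buffer (hypothetical helper type).
typedef const char* Cursor;

// Consuming rule: matches the literal lit and advances the cursor on success.
bool literal(Cursor& p, const char* lit) {
    std::size_t n = std::strlen(lit);
    if (std::strncmp(p, lit, n) != 0) return false;
    p += n;
    return true;
}

// Zero-width positive assertion (PEG "&e"): succeeds if lit comes next.
// The cursor is passed by value, so the caller's position is untouched.
bool at(Cursor p, const char* lit) {
    return literal(p, lit);
}

// Zero-width negative assertion (PEG "!e"): succeeds if lit does NOT come next.
bool not_at(Cursor p, const char* lit) {
    return !literal(p, lit);
}

// Consuming rule: any single character except the end of input.
bool any_char(Cursor& p) {
    if (*p == '\0') return false;
    ++p;
    return true;
}

// comment <- "/*" (!"*/" .)* "*/"
bool comment(Cursor& p) {
    Cursor start = p;
    if (!literal(p, "/*")) return false;
    while (not_at(p, "*/") && any_char(p)) {
        // keep consuming characters that do not begin the closing delimiter
    }
    if (literal(p, "*/")) return true;
    p = start;  // restore on failure so the caller can try an alternative
    return false;
}

// Small usage check; call it from main() to exercise the rules.
void assertion_examples() {
    Cursor p = "/* hi */x";
    assert(comment(p));
    assert(*p == 'x');  // the cursor stops right after the comment
}
```

Without the negative assertion, the "any character" loop would swallow the closing delimiter and the rule could never succeed.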

Yet Another Recursive Descent (YARD) Parsing Framework

The Yet Another Recursive Descent (YARD) parsing framework I present here is a simple yet powerful RD parsing framework that represents grammar rules as types, and PEG operators as templates. This approach lets you construct an efficient parser by writing out grammar rules as types that inherit from other types (including template instantiations).

The YARD technique was inspired by the use of expression templates and operator overloading in the Boost.Spirit library by Joel de Guzman. Making the templates explicit restricts YARD grammars to being static (they cannot be constructed at runtime). This reduces flexibility but improves the performance of the parser.

Listing One shows an example of a YARD grammar: a snippet of an XML parsing grammar. Note the close correspondence between the YARD grammar and the official XML specification.

struct Element :
  Or<EmptyElemTag, Seq<STag, Content, ETag> >
{ };
struct STag :
  Seq<
    Char<'<'>,
    Name,
    Star<Seq<S, Attribute> >,
    Opt<S>,
    Char<'>'>
  >
{ };
struct Attribute :
  Seq<Name, Eq, AttValue>
{ };
struct ETag :
  Seq<CharSeq<'<','/'>, Name, Opt<S>, Char<'>'> >
{ };
struct Content :
  Seq<
    Opt<CharData>,
    Star<
      Seq<
        Or<
          Element,
          Reference,
          CDSect,
          PI,
          Comment
        >,
        Opt<CharData>
      >
    >
  >
{ };
struct EmptyElemTag :
  Seq<
    Char<'<'>,
    Name,
    Star<Seq<S, Attribute> >,
    Opt<S>,
    CharSeq<'/','>'>
  >
{ };
Listing One
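
To give a feel for how operator templates such as those in Listing One can work, here is a minimal sketch. It is not YARD's actual source: the match function, its signature, and the Parens rule are assumptions made for this example, and the sketch uses C++11 variadic templates for brevity. Every rule type exposes a static match function that advances a cursor on success and leaves it untouched on failure.

```cpp
#include <cassert>

// Hypothetical convention: every rule type provides
//   static bool match(const char*& p)
// and restores p if it fails.

// Matches exactly one character C.
template<char C>
struct Char {
    static bool match(const char*& p) {
        if (*p != C) return false;
        ++p;
        return true;
    }
};

// Sequence: every rule must match, in order.
template<typename... Rules>
struct Seq;

template<>
struct Seq<> {
    static bool match(const char*&) { return true; }
};

template<typename First, typename... Rest>
struct Seq<First, Rest...> {
    static bool match(const char*& p) {
        const char* start = p;
        if (First::match(p) && Seq<Rest...>::match(p)) return true;
        p = start;  // backtrack so an enclosing Or can try its next choice
        return false;
    }
};

// Ordered choice: the first rule that matches wins.
template<typename... Rules>
struct Or;

template<>
struct Or<> {
    static bool match(const char*&) { return false; }
};

template<typename First, typename... Rest>
struct Or<First, Rest...> {
    static bool match(const char*& p) {
        return First::match(p) || Or<Rest...>::match(p);
    }
};

// Zero or more repetitions; never fails.
template<typename Rule>
struct Star {
    static bool match(const char*& p) {
        for (;;) {
            const char* before = p;
            // Stop on failure, or on an empty match to avoid looping forever.
            if (!Rule::match(p) || p == before) return true;
        }
    }
};

// Optional rule; never fails.
template<typename Rule>
struct Opt {
    static bool match(const char*& p) {
        Rule::match(p);
        return true;
    }
};

// A grammar rule is a named type that inherits from an operator expression.
// Naming the rule is what makes recursion possible:
//   Parens <- '(' Parens* ')'
struct Parens : Seq<Char<'('>, Star<Parens>, Char<')'> > { };

// True if the whole string is one balanced group of parentheses.
bool balanced(const char* s) {
    const char* p = s;
    return Parens::match(p) && *p == '\0';
}

// Small usage check; call it from main() to exercise the grammar.
void template_examples() {
    assert(balanced("(()())"));
    assert(!balanced("(()"));
}
```

Since the whole grammar is a set of types, the compiler sees every rule at compile time and can inline the match calls, which fits the static nature of YARD grammars described earlier.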

YARD has also inspired two other C++ parsing libraries that use the same basic architecture: the Biscuit Library (p-stade.sourceforge.net/biscuit/) by Shunsuke Sogame, and the Parsing Expression Grammar Template Library (PEGTL) by Colin Hirsch. The Biscuit Library is an extension of a previous version of YARD, which includes many features from Boost. The PEGTL is an independent library that uses C++0x features such as variadic templates for greater flexibility. PEGTL provides significantly enhanced diagnostic and input abstraction facilities.

