Recursive Descent PEG Parsers Using C++ Templates


Grammar Rules

In YARD, each grammar rule is represented by a type (I use structs, to avoid having to write public everywhere in my grammar) that implements a static Match member function template. The Match function is provided implicitly by having the struct inherit either from the primitive rule types or from instantiations of the templates representing the rule operators (such as Seq, Or, and Star). The Match function template has the following signature:


template<typename ParserState_T>
static bool Match(ParserState_T& p);

The Match function template accepts a state management object and returns True or False depending on whether the input matches the corresponding rule. Match is required to restore the state parameter to its original state if it fails to find a match. Restoring the state is done automatically by the basic rule combinators.
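The restore-on-failure requirement can be illustrated with a hand-written rule. This is only a sketch: StringState and MatchAB are hypothetical names that are not part of YARD, and the state object implements just the few members the rule needs.

```cpp
#include <cassert>
#include <string>

// Hypothetical minimal state object (not YARD's parser).
struct StringState {
    typedef std::string::const_iterator iterator;
    iterator pos, last;
    explicit StringState(const std::string& s) : pos(s.begin()), last(s.end()) {}
    bool AtEnd() const { return pos == last; }
    char GetElem() const { assert(pos != last); return *pos; }
    void GotoNext() { ++pos; }
    iterator GetPos() const { return pos; }
    void SetPos(iterator p) { pos = p; }
};

// Hypothetical rule: matches the two characters "ab".
struct MatchAB {
    template<typename ParserState_T>
    static bool Match(ParserState_T& p) {
        typename ParserState_T::iterator start = p.GetPos(); // remember where we began
        if (!p.AtEnd() && p.GetElem() == 'a') {
            p.GotoNext();
            if (!p.AtEnd() && p.GetElem() == 'b') {
                p.GotoNext();
                return true;
            }
        }
        p.SetPos(start); // failure: undo any partial progress
        return false;
    }
};
```

On the input "ax" the rule consumes 'a', fails on 'x', and puts the position back at the start before returning False, so a later alternative sees the input untouched.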

Parsing and Parser State Management

Parsing an input stream with YARD is done by passing a parser state management object, simply called the parser from here on out, to the start rule of the grammar. The start rule is the rule that describes a well-formed document or snippet in your language.

The parser is responsible for managing the iterator to the input as well as the Abstract Syntax Tree (AST). Listing Two shows the required interface of a parser. YARD grammars are parameterized on the parser type (specifically, through the template parameter of the Match function template) so that different kinds of parsers can be used with the same grammar.

class ParserState
{
public:
 // public typedefs; each parser supplies the concrete types
 typedef /* ... */ iterator; // an iterator over input
 typedef /* ... */ value_type; // type of input tokens
 typedef /* ... */ node_type; // type of nodes in abstract syntax tree
 // constructor
 ParserState(iterator first, iterator last);
 // Parse function
 template<typename StartRule_T>
 bool Parse();
 // functions for manipulating and accessing the input iterator
 value_type GetElem();
 void GotoNext();
 iterator GetPos();
 void SetPos(iterator pos);
 bool AtEnd();
 iterator Begin();
 iterator End();
 // functions for constructing the abstract syntax tree
 void StartNode(int type);
 void CompleteNode(int type);
 void AbandonNode(int type);
};
Listing Two

The file yard_parser.hpp includes a simple parser that parses ASCII text and generates a simple AST. This can be used as is, if your parsing needs are not particularly demanding, but you will probably want to write your own parser object and possibly your own AST class if you want to squeeze more performance out of the allocation/deallocation scheme.
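As a rough idea of what a custom parser object might look like, here is a sketch that follows the interface in Listing Two over a std::string. CountingParser and Digits are hypothetical names, the node_type typedef is omitted, and the AST hooks only count events rather than build a tree. It is not the parser shipped in yard_parser.hpp.

```cpp
#include <cassert>
#include <string>

// Hypothetical parser over a std::string; node events are only counted.
class CountingParser {
public:
    typedef std::string::const_iterator iterator;
    typedef char value_type;

    CountingParser(iterator first, iterator last)
        : first_(first), last_(last), pos_(first), completed_(0), abandoned_(0) {}

    // Runs the start rule from the beginning of the input.
    template<typename StartRule_T>
    bool Parse() { pos_ = first_; return StartRule_T::Match(*this); }

    value_type GetElem() const { assert(pos_ != last_); return *pos_; }
    void GotoNext() { ++pos_; }
    iterator GetPos() const { return pos_; }
    void SetPos(iterator pos) { pos_ = pos; }
    bool AtEnd() const { return pos_ == last_; }
    iterator Begin() const { return first_; }
    iterator End() const { return last_; }

    // AST hooks: a real parser would allocate and link nodes here.
    void StartNode(int) {}
    void CompleteNode(int) { ++completed_; }
    void AbandonNode(int) { ++abandoned_; }

    int Completed() const { return completed_; }
    int Abandoned() const { return abandoned_; }

private:
    iterator first_, last_, pos_;
    int completed_, abandoned_;
};

// Hypothetical start rule: one or more decimal digits.
struct Digits {
    template<typename ParserState_T>
    static bool Match(ParserState_T& p) {
        typename ParserState_T::iterator start = p.GetPos();
        while (!p.AtEnd() && p.GetElem() >= '0' && p.GetElem() <= '9')
            p.GotoNext();
        return p.GetPos() != start; // on failure nothing was consumed
    }
};
```

Because the grammar only touches the parser through these member functions, the same Digits rule would work unchanged with a parser that reads from a different container or builds a different kind of tree.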

Rule Combinators

Rule combinators are templates that correspond to operations on rules. A combinator combines rules to create a new rule, and as such provides its own implementation of the Match function template. Most YARD grammar rules are defined as structs that inherit from instantiations of these templates. The core rule combinators are:

  • Seq takes a sequence of rules as operands and tries to match each one in order. Seq only succeeds if all rules succeed in matching the input in order. If any rule fails, the parser is restored to the state it had just before it attempted to match the first rule, and the rule created by the Seq combinator returns False.
  • Or tries to match each rule in sequence, until one of them is successful. The rule created by Or returns True if and only if one of the rules is matched successfully.
  • Star tries to match a single rule as many times as possible. The Star combinator always succeeds (that is, Match returns True), but it might not advance the input iterator.
  • Plus is like the Star combinator, but requires the parameter to match at least once to be successful.
  • Opt tries to match a rule precisely one time, but always returns True whether or not the rule succeeds. It does not advance the input iterator of the parser unless the rule parameter matched successfully.
  • At tries to match a rule, and returns True if and only if the rule matches. However, the input iterator is not advanced. This is an example of a zero-width rule and is a feature of PEGs that is absent from CFGs.
  • NotAt is similar to the At combinator in that it is a zero-width rule, but returns True only if the rule fails to match. Whether it succeeds or fails, the input iterator is not advanced.
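The behavior described in the list above can be sketched in a few lines. The following is a simplified illustration, not YARD's source: everything lives in a hypothetical sketch namespace, Seq and Or take exactly two operands here, and State and Char are stand-ins invented for the example. Plus, Opt, and At follow the same pattern.

```cpp
#include <cassert>
#include <string>

namespace sketch {

struct State { // hypothetical minimal state object
    typedef std::string::const_iterator iterator;
    iterator pos, last;
    explicit State(const std::string& s) : pos(s.begin()), last(s.end()) {}
    bool AtEnd() const { return pos == last; }
    char GetElem() const { assert(pos != last); return *pos; }
    void GotoNext() { ++pos; }
    iterator GetPos() const { return pos; }
    void SetPos(iterator p) { pos = p; }
};

template<char C>
struct Char { // hypothetical primitive: matches exactly the character C
    template<typename P> static bool Match(P& p) {
        if (p.AtEnd() || p.GetElem() != C) return false;
        p.GotoNext();
        return true;
    }
};

template<typename A, typename B>
struct Seq { // both operands must match, in order
    template<typename P> static bool Match(P& p) {
        typename P::iterator start = p.GetPos();
        if (A::Match(p) && B::Match(p)) return true;
        p.SetPos(start); // restore to just before the first operand
        return false;
    }
};

template<typename A, typename B>
struct Or { // first operand that matches wins
    template<typename P> static bool Match(P& p) {
        return A::Match(p) || B::Match(p); // a failed operand restores state itself
    }
};

template<typename T>
struct Star { // zero or more repetitions; always succeeds
    template<typename P> static bool Match(P& p) {
        for (;;) {
            typename P::iterator before = p.GetPos();
            // Stop on failure, or on a zero-width match that would loop forever.
            if (!T::Match(p) || p.GetPos() == before) break;
        }
        return true;
    }
};

template<typename T>
struct NotAt { // zero-width: succeeds only if T does not match here
    template<typename P> static bool Match(P& p) {
        typename P::iterator start = p.GetPos();
        bool matched = T::Match(p);
        p.SetPos(start); // never consume input
        return !matched;
    }
};

} // namespace sketch
```

For example, applying Star<Seq<Char<'a'>, Char<'b'> > > to "ababac" consumes "abab" and stops: the third attempt matches 'a', fails on 'c', and Seq puts the position back before the 'a'.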

Primitive Rules

The YARD framework comes with a number of predefined grammar rules for parsing ASCII character strings. These are in the file yard_text_grammar.hpp. If you were to extend the framework to handle other types of input (such as Unicode), it would require that you define new base rules, though the rule combinators could stay the same.

For example, the CharSeq type is a rule that attempts to match a sequence of ASCII characters. You can see it in this code example:


struct DefineKeyword :
 Seq<Word<CharSeq<'d','e','f','i','n','e'> >,WS >
{ };

In the YARD framework, the Word combinator takes a rule as a parameter, and tries to match it. The Word combinator is only successful if it can match its parameter, and immediately following the parameter there are no alphanumeric ASCII characters (which are identified using the AlphaNum rule). The Word combinator is defined as:


template<typename T>
struct Word :
   Seq<T, NotAt<AlphaNum> >
{ };
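To see why the NotAt<AlphaNum> check matters, the sketch below rebuilds Word on top of simplified stand-ins. It assumes a C++11 compiler, since it uses variadic templates for CharSeq. Everything in the hypothetical sketch namespace (State, the two-operand Seq, and the bodies of CharSeq, AlphaNum, and NotAt) is my own illustration and not the code in yard_text_grammar.hpp.

```cpp
#include <cassert>
#include <cctype>
#include <string>

namespace sketch {

struct State { // hypothetical minimal state object
    typedef std::string::const_iterator iterator;
    iterator pos, last;
    explicit State(const std::string& s) : pos(s.begin()), last(s.end()) {}
    bool AtEnd() const { return pos == last; }
    char GetElem() const { assert(pos != last); return *pos; }
    void GotoNext() { ++pos; }
    iterator GetPos() const { return pos; }
    void SetPos(iterator p) { pos = p; }
};

// Stand-in for CharSeq: matches the given characters in order.
template<char... Cs> struct CharSeq;
template<> struct CharSeq<> {
    template<typename P> static bool Match(P&) { return true; }
};
template<char C, char... Rest> struct CharSeq<C, Rest...> {
    template<typename P> static bool Match(P& p) {
        typename P::iterator start = p.GetPos();
        if (!p.AtEnd() && p.GetElem() == C) {
            p.GotoNext();
            if (CharSeq<Rest...>::Match(p)) return true;
        }
        p.SetPos(start);
        return false;
    }
};

struct AlphaNum { // one ASCII letter or digit
    template<typename P> static bool Match(P& p) {
        if (p.AtEnd() || !std::isalnum(static_cast<unsigned char>(p.GetElem())))
            return false;
        p.GotoNext();
        return true;
    }
};

template<typename T>
struct NotAt { // zero-width negative lookahead
    template<typename P> static bool Match(P& p) {
        typename P::iterator start = p.GetPos();
        bool matched = T::Match(p);
        p.SetPos(start);
        return !matched;
    }
};

template<typename A, typename B>
struct Seq {
    template<typename P> static bool Match(P& p) {
        typename P::iterator start = p.GetPos();
        if (A::Match(p) && B::Match(p)) return true;
        p.SetPos(start);
        return false;
    }
};

template<typename T>
struct Word : Seq<T, NotAt<AlphaNum> > { };

typedef Word<CharSeq<'d','e','f','i','n','e'> > DefineWord;

} // namespace sketch
```

With these definitions, DefineWord matches the first six characters of "define x" but rejects "defined": the character sequence matches, the lookahead sees the trailing 'd', and Seq restores the position to the start.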

The WS rule matches any run of whitespace characters, such as spaces, carriage returns, and tabs.


struct WS :
 Star<CharSetParser<WhiteSpaceCharSet> >
{ };

Character sets are represented by instances of a dedicated class template. These are most prominently used by the CharSetParser rule template. A set of functions used for defining character sets can be found in the file yard_char_set.hpp.

Actions and AST Construction

An action is a parsing rule that executes some procedure in its Match template function. For example, the Finao ("Failure Is Not An Option") action attempts to match a rule to the input but throws an exception if that rule fails. An exception causes a YARD parser to halt immediately with an error message.
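A Finao-style action can be sketched as a thin wrapper around another rule. This is an assumption-laden illustration: the exception type (std::runtime_error), the message text, and the State and Char helpers in the hypothetical sketch namespace are all my own choices, not what YARD actually throws or defines.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

namespace sketch {

struct State { // hypothetical minimal state object
    typedef std::string::const_iterator iterator;
    iterator pos, last;
    explicit State(const std::string& s) : pos(s.begin()), last(s.end()) {}
    bool AtEnd() const { return pos == last; }
    char GetElem() const { assert(pos != last); return *pos; }
    void GotoNext() { ++pos; }
};

template<char C>
struct Char { // hypothetical primitive: matches exactly the character C
    template<typename P> static bool Match(P& p) {
        if (p.AtEnd() || p.GetElem() != C) return false;
        p.GotoNext();
        return true;
    }
};

// Sketch of a "failure is not an option" action: the wrapped rule must match.
template<typename T>
struct Finao {
    template<typename P> static bool Match(P& p) {
        if (!T::Match(p))
            throw std::runtime_error("required rule failed to match");
        return true;
    }
};

} // namespace sketch
```

A rule like this is typically placed after a point of no return in the grammar, for instance after an opening keyword, where backtracking to another alternative could not produce a valid parse anyway.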

The Store action is particularly useful for constructing nodes in the AST. When the Match function template of a Store action is first entered, it calls the StartNode() member function of the parser state management object. The Match function is then called on the rule parameter, and if it succeeds, the CompleteNode() member function is called on the parser state management object. Otherwise, the AbandonNode() member function is called.
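The call sequence just described might look like the sketch below. It is a hedged illustration: the integer Type template parameter follows the int arguments in Listing Two and is my assumption about how a node is tagged, and LoggingState and Digit are hypothetical helpers that record the calls instead of building a tree.

```cpp
#include <cassert>
#include <string>
#include <vector>

namespace sketch {

// Hypothetical state that logs node events instead of building a real tree.
struct LoggingState {
    typedef std::string::const_iterator iterator;
    iterator pos, last;
    std::vector<std::string> log;
    explicit LoggingState(const std::string& s) : pos(s.begin()), last(s.end()) {}
    bool AtEnd() const { return pos == last; }
    char GetElem() const { assert(pos != last); return *pos; }
    void GotoNext() { ++pos; }
    void StartNode(int) { log.push_back("start"); }
    void CompleteNode(int) { log.push_back("complete"); }
    void AbandonNode(int) { log.push_back("abandon"); }
};

struct Digit { // hypothetical primitive: one decimal digit
    template<typename P> static bool Match(P& p) {
        if (p.AtEnd() || p.GetElem() < '0' || p.GetElem() > '9') return false;
        p.GotoNext();
        return true;
    }
};

// Sketch of a Store-style action wrapping Rule_T.
template<int Type, typename Rule_T>
struct Store {
    template<typename P> static bool Match(P& p) {
        p.StartNode(Type);              // open a node before trying the rule
        if (Rule_T::Match(p)) {
            p.CompleteNode(Type);       // keep the node on success
            return true;
        }
        p.AbandonNode(Type);            // discard the node on failure
        return false;
    }
};

} // namespace sketch
```

Running Store<1, Digit> twice on "7x" logs start, complete, start, abandon: the first call matches '7' and the second fails on 'x'.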

The parse tree used in the YARD framework is organized as a k-ary tree in which each node points to its first child and to its next sibling. The tree nodes are instances of AbstractTreeNode and provide access to a type_info reference that corresponds to the rule associated with the Store action.
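A first-child/next-sibling layout stores any number of children using only two pointers per node. The sketch below shows the idea with a hypothetical Node type. It is not AbstractTreeNode: it tags nodes with an int rather than a type_info reference, and its ownership scheme is deliberately naive.

```cpp
#include <cassert>
#include <cstddef>

namespace sketch {

// Hypothetical node in a first-child/next-sibling tree.
struct Node {
    int type;
    Node* first_child;
    Node* next_sibling;

    explicit Node(int t) : type(t), first_child(0), next_sibling(0) {}

    // A node owns its subtree and the siblings to its right.
    ~Node() { delete first_child; delete next_sibling; }

    // Appends a child at the end of the sibling chain and returns it.
    Node* AddChild(int t) {
        Node* n = new Node(t);
        if (!first_child) { first_child = n; return n; }
        Node* c = first_child;
        while (c->next_sibling) c = c->next_sibling;
        c->next_sibling = n;
        return n;
    }

    // Counts children by walking the sibling chain.
    std::size_t ChildCount() const {
        std::size_t k = 0;
        for (const Node* c = first_child; c; c = c->next_sibling) ++k;
        return k;
    }

    // Returns the i-th child; i must be less than ChildCount().
    const Node* NthChild(std::size_t i) const {
        const Node* c = first_child;
        while (c && i > 0) { c = c->next_sibling; --i; }
        assert(c != 0);
        return c;
    }

private:
    Node(const Node&);            // non-copyable
    Node& operator=(const Node&);
};

} // namespace sketch
```

The trade-off is that reaching the k-th child takes k steps along the sibling chain, which is usually acceptable for parse trees that are walked in order.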

Tips and Tricks

It is not uncommon to have referential cycles among rules (for example, rule X refers to rule Y, which refers to rule X). Because a C++ name must be declared before it is used, such cycles cause a compilation error. To resolve the error, simply provide a forward declaration of the latter of the two rules at the top of the header file containing the grammar.
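The forward-declaration trick can be seen in a small grammar for nested parentheses, where Group refers to Item and Item refers back to Group. The combinators below are simplified two-operand stand-ins in a hypothetical sketch namespace, and Group, Item, State, and Char are names invented for the example.

```cpp
#include <cassert>
#include <string>

namespace sketch {

struct State { // hypothetical minimal state object
    typedef std::string::const_iterator iterator;
    iterator pos, last;
    explicit State(const std::string& s) : pos(s.begin()), last(s.end()) {}
    bool AtEnd() const { return pos == last; }
    char GetElem() const { assert(pos != last); return *pos; }
    void GotoNext() { ++pos; }
    iterator GetPos() const { return pos; }
    void SetPos(iterator p) { pos = p; }
};

template<char C> struct Char {
    template<typename P> static bool Match(P& p) {
        if (p.AtEnd() || p.GetElem() != C) return false;
        p.GotoNext();
        return true;
    }
};

template<typename A, typename B> struct Seq {
    template<typename P> static bool Match(P& p) {
        typename P::iterator start = p.GetPos();
        if (A::Match(p) && B::Match(p)) return true;
        p.SetPos(start);
        return false;
    }
};

template<typename A, typename B> struct Or {
    template<typename P> static bool Match(P& p) {
        return A::Match(p) || B::Match(p);
    }
};

template<typename T> struct Star {
    template<typename P> static bool Match(P& p) {
        for (;;) {
            typename P::iterator before = p.GetPos();
            if (!T::Match(p) || p.GetPos() == before) break;
        }
        return true;
    }
};

// Forward declaration: Group below needs the name Item before Item is defined.
struct Item;

// Group := '(' Item* ')'
struct Group : Seq<Char<'('>, Seq<Star<Item>, Char<')'> > > { };

// Item := Group | 'a'
struct Item : Or<Group, Char<'a'> > { };

} // namespace sketch
```

This works because naming Item as a template argument only requires a declaration. The body of Match, which actually calls Item::Match, is not instantiated until the grammar is used, by which point Item is fully defined.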

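If you want to improve the performance of a parser, consider reordering the rule arguments to the sequential choice operator (Or) so that the alternatives most likely to match are tried first. Because Or commits to the first alternative that succeeds, check that the new order does not change which inputs the grammar accepts. Also, you may want to look into alternate methods of constructing an AST by using a custom parser and new actions.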

Final Words

The YARD framework is public domain code. This means that it can be used for any purpose, with no restrictions, no obligations, and, of course, no warranty. If for some reason public domain doesn't appeal to you, I'll give you a version with whatever license you want. You can even pay me if you want. I'd love to hear about your experiences using or modifying YARD, so drop me a note at cdiggins@gmail.com.

Acknowledgments

I wish to warmly thank Colin Hirsch, Kris Unger, and Max Lybbert for their hard work reviewing and commenting on various versions of this article. Extra thanks to Max Lybbert for managing the YARD SourceForge site.

