Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

C/C++

The Spirit Parser Library: Inline Parsing in C++


Everyday programming often requires writing parsers. The Internet and networking opened up the information floodgates such that software engineers have to deal with more human-to-computer communication, as well as computer-to-computer communication. HTML, XML, SQL, RDF, XDR, Postscript, OS shells, scripting languages, and even tasks such as extracting command-line options and reading files require various degrees of parsing.

Unfortunately, many programmers do not approach parser creation as a language design task, and they code parsers by hand. For simple to moderately complex tasks, many programmers stay away from tools such as YACC because of the "jumbo jet" syndrome. (It's just too big and detached from the application.) Many of today's parsing tasks would have been well served by a C++ library-based parser somewhere between Posix-style regular expressions and a full-blown parser.

Spirit is an attempt to fill the missing gap between existing regular-expression tools and computer-language-oriented parser generators. It is an object-oriented, back-tracking, recursive descent parser generator framework implemented using template metaprogramming techniques. Expression templates, taking advantage of the expressiveness of operator overloading in C++, allow Spirit to approximate the syntax of EBNF (Extended Backus Normal Form) completely in C++ using a syntax closely resembling EBNF.

BNF and EBNF

BNF began as an abbreviation of Backus Normal Form and later became popularly known as the Backus-Naur Form. It is a formal notation that describes a language's syntax. BNF was developed primarily by John Backus and modified by Peter Naur to formalize the grammar of the Algol 60 programming language. It is now standard practice to use formal grammars to specify languages.

A BNF grammar consists of a number of production rules. The rules are principally written as:

<S> ::= A1 | A2 | A3 ... AN

where A1..AN are alternatives separated by the |. The symbol S on the left-hand side of ::= must be replaced by one of the alternatives on the right-hand side. Alternatives consist of terminals and non-terminals. A terminal is an atomic string that cannot be broken down. They terminate the production process. On the other hand, a non-terminal recursively refers to a rule. Here's a trivial example BNF grammar that specifies an integer number:

<integer> ::= <digit> | <integer> <digit>
<digit>   ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

In plain English, this example can be read as: "An integer is a single digit, or, an integer followed by a single digit. A digit is one of 0 or 1 or 2 or 3 ... 9." Symbols in angle brackets, <digit> and <integer>, are non-terminals while 0..9 are terminals.

Repetition is very common in grammars. With BNF, as in the example above, you have to use recursion to do repetition. This is rather awkward. Extended BNF (EBNF) adds several additional facilities to BNF to express repetition. There are quite a few EBNF dialects and even an ISO standard: ISO/IEC 14977. For simplicity and familiarity, below is an EBNF dialect that borrows the ubiquitous regular expression operators.

X*  The Kleene star. X can appear zero or more times.
X+  X can appear one or more times.
X?  Make X optional, X can appear zero or one time.

To simplify the integer rule, it can be written in EBNF like so:

integer ::= digit+
digit   ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' |
            '8' | '9'

This example is read in English as: "An integer is comprised of one or more digits." Notice that the angle brackets are removed from non-terminal symbols and quotes are introduced to indicate terminal strings.

The Spirit Library

The Spirit library lets you write a target grammar exclusively in C++. Inline EBNF grammar specifications can mix freely with other C++ code and generate immediately executable parsers thanks to the generative power of C++ templates.

Character input is not limited to 8-bit ASCII. The library can easily work with 16- and 32-bit characters such as Unicode. In fact, input data is an abstract concept not limited to characters. Spirit can process data in the form of plain bytes, user-defined structs, or even objects.

Quick Start

To elucidate, here is a trivial example. Parse a parenthesized, comma-delimited stream of unsigned integers of the form:

(i1, i2, i3 ... iN)

and store the extracted numbers in a vector. First, define a rule for parsing unsigned integers. In standard EBNF, you would write this as:

<integer> ::= <digit>+

This EBNF rule can be read as an unsigned integer is composed of one or more digits. Spirit's stylized C++/EBNF (SEBNF, for brevity) syntax attempts to mimic EBNF as closely as possible:

integer = +digit;

Constrained to C++'s syntax, we deviate slightly from standard EBNF. EBNF's positive closure +, used to signify a one-or-more occurrence of digit, becomes prefix instead of postfix. (Spirit uses +digit instead of digit+.) Also, the assignment = operator, initiating a production rule declaration, replaces EBNF's ::=.

integer is in itself a parser that can recognize numbers such as 1, 2, 123, and 456. Note that, by default, white spaces are skipped. Thus, in the example, an input such as "1 2 3" is legal. However, we want the parser to inhibit the skipping of white spaces when parsing numbers. This task is accomplished by using the lexeme directive.

 integer = lexeme[ +digit ];

digit is one of Spirit's primitives: a range of single character literals from 0..9. Actually, there are also predefined primitives for parsing integers and real numbers. Here, integer is explicitly defined.

Now we continue with the rest of the grammar. Remember that what we want is a comma-delimited string of integers enclosed in parentheses. In standard EBNF, this grammar would be written as:

<numbers> ::= 
  '(' <integer> (',' <integer>)* ')'

So how would one write this in C++ using Spirit?

numbers = 
  '(' >> integer >> *(',' >> integer) >> ')';

As already mentioned, the requisite changes from the original EBNF specification stems from the syntax restrictions imposed by C++. Most obviously, our SEBNF syntax necessitates a sequencing operator >> where in EBNF a white space would suffice. We say 'a >> b' in Spirit instead of 'a b'. Not unlike the positive closure + already mentioned, the Kleene star *, used to signify a zero-or-more occurrence of the enclosed expression (',' >> integer), is prefix instead of postfix.

Finally we have a complete grammar for our trivial example:

rule<> integer, numbers;

integer = lexeme[ +digit ];
numbers = '(' >> integer >> *(',' >> integer) >> ')';

Through the magic of expression templates, this code is perfectly valid C++. The production rule numbers is in fact an object that has a member function parse that does the work. As far as C++ is concerned, integer and numbers are objects that have to be declared before they are used. The first line starts with these requisite C++ type declarations.

The Spirit library has facilities to attach semantic actions to various points in a grammar so as to produce a complete and fully working parser. Here is a semantic action that inserts a parsed number into a vector:

std::vector<long> my_vec;

void push(char const* str, char const* end)
{
    my_vec.push_back(std::strtol(str, NULL, 10));
}

The function push accepts an iterator str that points to the first matching character, as well as an iterator end pointing to one past the last valid character. We want to attach this function to the integer production rule:

integer = lexeme[ (+digit)[&push] ];

Every time the parser recognizes a valid number, the number is appended to my_vec.

A More Complicated Example: The Classic Calculator Grammar

The following is the EBNF grammar for an arithmetic expression:

<integer>     ::= '-'? <digit>+
<factor>      ::= <integer>
              | '(' <expression> ')'
              | '-' <factor>
<term>        ::= <factor> ( '*' <factor> | '/' <factor> )*
<expression>  ::= <term> ('+' <term> | '-' <term> )*

Here is the corresponding snippet of code that uses SEBNF to define the same grammar:

rule<> expression, term, factor, integer;

integer     = lexeme[ (!ch_p('-') >> +digit) ];
factor      = integer
            | '(' >> expression >> ')'
            | '-' >> factor;
term        = factor >> *( '*' >> factor | '/' >> factor);
expression  = term >> *( '+' >> factor | '-' >> term);

The first line declares all of the rules. It is necessary to forward declare the rules because the factor rule references the expression rule, but the expression rule is defined later.

The integer rule will recognize an integer with an optional sign. The ch_p('-') generates a parser that will accept the - character. This rule uses three of the Spirit operators: !, >>, and +. !a is equivalent to a? in EBNF and indicates that a is optional. It will be matched zero or one time.

The factor rule introduces the | operator. This operator specifies alternatives in which the expression 'a | b' means that either a or b will be matched.


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.