Everyday programming often requires writing parsers. The Internet and networking opened up the information floodgates such that software engineers have to deal with more human-to-computer communication, as well as computer-to-computer communication. HTML, XML, SQL, RDF, XDR, Postscript, OS shells, scripting languages, and even tasks such as extracting command-line options and reading files require various degrees of parsing.
Unfortunately, many programmers do not approach parser creation as a language design task, and they code parsers by hand. For simple to moderately complex tasks, many programmers stay away from tools such as YACC because of the "jumbo jet" syndrome. (It's just too big and detached from the application.) Many of today's parsing tasks would have been well served by a C++ library-based parser somewhere between Posix-style regular expressions and a full-blown parser.
Spirit is an attempt to fill the missing gap between existing regular-expression tools and computer-language-oriented parser generators. It is an object-oriented, back-tracking, recursive descent parser generator framework implemented using template metaprogramming techniques. Expression templates, taking advantage of the expressiveness of operator overloading in C++, allow Spirit to approximate the syntax of EBNF (Extended Backus Normal Form) completely in C++ using a syntax closely resembling EBNF.
BNF and EBNF
BNF began as an abbreviation of Backus Normal Form and later became popularly known as the Backus-Naur Form. It is a formal notation that describes a language's syntax. BNF was developed primarily by John Backus and modified by Peter Naur to formalize the grammar of the Algol 60 programming language. It is now standard practice to use formal grammars to specify languages.
A BNF grammar consists of a number of production rules. The rules are principally written as:
<S> ::= A1 | A2 | A3 ... AN
where A1..AN
are alternatives separated by the |
. The symbol S
on the left-hand side of ::=
must be replaced by one of the alternatives
on the right-hand side. Alternatives consist of terminals and non-terminals. A
terminal is an atomic string that cannot be broken down. They terminate the production
process. On the other hand, a non-terminal recursively refers to a rule. Here's
a trivial example BNF grammar that specifies an integer number:
<integer> ::= <digit> | <integer> <digit> <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
In plain English, this example can be read as: "An integer is a single digit,
or, an integer followed by a single digit. A digit is one of 0
or 1
or 2
or 3
... 9
." Symbols in angle brackets, <digit>
and <integer>
, are non-terminals while 0..9
are terminals.
Repetition is very common in grammars. With BNF, as in the example above, you have to use recursion to do repetition. This is rather awkward. Extended BNF (EBNF) adds several additional facilities to BNF to express repetition. There are quite a few EBNF dialects and even an ISO standard: ISO/IEC 14977. For simplicity and familiarity, below is an EBNF dialect that borrows the ubiquitous regular expression operators.
X* The Kleene star. X can appear zero or more times. X+ X can appear one or more times. X? Make X optional, X can appear zero or one time.
To simplify the integer
rule, it can be written in EBNF like so:
integer ::= digit+ digit ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
This example is read in English as: "An integer is comprised of one or more digits." Notice that the angle brackets are removed from non-terminal symbols and quotes are introduced to indicate terminal strings.
The Spirit Library
The Spirit library lets you write a target grammar exclusively in C++. Inline EBNF grammar specifications can mix freely with other C++ code and generate immediately executable parsers thanks to the generative power of C++ templates.
Character input is not limited to 8-bit ASCII. The library can easily work
with 16- and 32-bit characters such as Unicode. In fact, input data is an abstract
concept not limited to characters. Spirit can process data in the form of plain
bytes, user-defined struct
s, or even objects.
Quick Start
To elucidate, here is a trivial example. Parse a parenthesized, comma-delimited stream of unsigned integers of the form:
(i1, i2, i3 ... iN)
and store the extracted numbers in a vector. First, define a rule for parsing unsigned integers. In standard EBNF, you would write this as:
<integer> ::= <digit>+
This EBNF rule can be read as an unsigned integer is composed of one or more digits. Spirit's stylized C++/EBNF (SEBNF, for brevity) syntax attempts to mimic EBNF as closely as possible:
integer = +digit;
Constrained to C++'s syntax, we deviate slightly from standard EBNF. EBNF's
positive closure +
, used to signify a one-or-more occurrence of digit
,
becomes prefix instead of postfix. (Spirit uses +digit
instead of digit+
.)
Also, the assignment =
operator, initiating a production rule declaration,
replaces EBNF's ::=
.
integer
is in itself a parser that can recognize numbers such as 1,
2, 123, and 456. Note that, by default, white spaces are skipped. Thus, in the
example, an input such as "1 2 3" is legal. However, we want the parser
to inhibit the skipping of white spaces when parsing numbers. This task is accomplished
by using the lexeme
directive.
integer = lexeme[ +digit ];
digit
is one of Spirit's primitives: a range of single character literals
from 0..9. Actually, there are also predefined primitives for parsing integers
and real numbers. Here, integer
is explicitly defined.
Now we continue with the rest of the grammar. Remember that what we want is a comma-delimited string of integers enclosed in parentheses. In standard EBNF, this grammar would be written as:
<numbers> ::= '(' <integer> (',' <integer>)* ')'
So how would one write this in C++ using Spirit?
numbers = '(' >> integer >> *(',' >> integer) >> ')';
As already mentioned, the requisite changes from the original EBNF specification
stems from the syntax restrictions imposed by C++. Most obviously, our SEBNF syntax
necessitates a sequencing operator >>
where in EBNF a white space
would suffice. We say 'a >> b'
in Spirit instead of 'a b'
.
Not unlike the positive closure +
already mentioned, the Kleene star *
,
used to signify a zero-or-more occurrence of the enclosed expression (',' >>
integer)
, is prefix instead of postfix.
Finally we have a complete grammar for our trivial example:
rule<> integer, numbers; integer = lexeme[ +digit ]; numbers = '(' >> integer >> *(',' >> integer) >> ')';
Through the magic of expression templates, this code is perfectly valid C++. The
production rule numbers
is in fact an object that has a member function
parse
that does the work. As far as C++ is concerned, integer
and
numbers
are objects that have to be declared before they are used. The
first line starts with these requisite C++ type declarations.
The Spirit library has facilities to attach semantic actions to various points in a grammar so as to produce a complete and fully working parser. Here is a semantic action that inserts a parsed number into a vector:
std::vector<long> my_vec; void push(char const* str, char const* end) { my_vec.push_back(std::strtol(str, NULL, 10)); }
The function push
accepts an iterator str
that points to the first
matching character, as well as an iterator end
pointing to one past the
last valid character. We want to attach this function to the integer
production
rule:
integer = lexeme[ (+digit)[&push] ];
Every time the parser recognizes a valid number, the number is appended to my_vec
.
A More Complicated Example: The Classic Calculator Grammar
The following is the EBNF grammar for an arithmetic expression:
<integer> ::= '-'? <digit>+ <factor> ::= <integer> | '(' <expression> ')' | '-' <factor> <term> ::= <factor> ( '*' <factor> | '/' <factor> )* <expression> ::= <term> ('+' <term> | '-' <term> )*
Here is the corresponding snippet of code that uses SEBNF to define the same grammar:
rule<> expression, term, factor, integer; integer = lexeme[ (!ch_p('-') >> +digit) ]; factor = integer | '(' >> expression >> ')' | '-' >> factor; term = factor >> *( '*' >> factor | '/' >> factor); expression = term >> *( '+' >> factor | '-' >> term);
The first line declares all of the rules. It is necessary to forward declare the
rules because the factor
rule references the expression
rule, but
the expression
rule is defined later.
The integer
rule will recognize an integer with an optional sign. The
ch_p('-')
generates a parser that will accept the -
character.
This rule uses three of the Spirit operators: !
, >>
, and
+
. !a
is equivalent to a?
in EBNF and indicates that a
is optional. It will be matched zero or one time.
The factor
rule introduces the |
operator. This operator specifies
alternatives in which the expression 'a | b'
means that either a
or b
will be matched.