Powerful parsing made easy via modern template techniques.
September 01, 2003
URL:http://www.drdobbs.com/jvm/an-algorithm-for-compressing-space-and-t/jvm/logging-in-c/cpp/the-spirit-parser-library-inline-parsing/184401692
Everyday programming often requires writing parsers. The Internet and networking opened up the information floodgates such that software engineers have to deal with more human-to-computer communication, as well as computer-to-computer communication. HTML, XML, SQL, RDF, XDR, Postscript, OS shells, scripting languages, and even tasks such as extracting command-line options and reading files require various degrees of parsing.
Unfortunately, many programmers do not approach parser creation as a language design task, and they code parsers by hand. For simple to moderately complex tasks, many programmers stay away from tools such as YACC because of the "jumbo jet" syndrome. (It's just too big and detached from the application.) Many of today's parsing tasks would have been well served by a C++ library-based parser somewhere between Posix-style regular expressions and a full-blown parser.
Spirit is an attempt to fill the missing gap between existing regular-expression tools and computer-language-oriented parser generators. It is an object-oriented, back-tracking, recursive descent parser generator framework implemented using template metaprogramming techniques. Expression templates, taking advantage of the expressiveness of operator overloading in C++, allow Spirit to approximate the syntax of EBNF (Extended Backus Normal Form) completely in C++ using a syntax closely resembling EBNF.
BNF began as an abbreviation of Backus Normal Form and later became popularly known as the Backus-Naur Form. It is a formal notation that describes a language's syntax. BNF was developed primarily by John Backus and modified by Peter Naur to formalize the grammar of the Algol 60 programming language. It is now standard practice to use formal grammars to specify languages.
A BNF grammar consists of a number of production rules. The rules are principally written as:
<S> ::= A1 | A2 | A3 ... AN
where A1..AN
are alternatives separated by the |
. The symbol S
on the left-hand side of ::=
must be replaced by one of the alternatives
on the right-hand side. Alternatives consist of terminals and non-terminals. A
terminal is an atomic string that cannot be broken down. They terminate the production
process. On the other hand, a non-terminal recursively refers to a rule. Here's
a trivial example BNF grammar that specifies an integer number:
<integer> ::= <digit> | <integer> <digit> <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
In plain English, this example can be read as: "An integer is a single digit,
or, an integer followed by a single digit. A digit is one of 0
or 1
or 2
or 3
... 9
." Symbols in angle brackets, <digit>
and <integer>
, are non-terminals while 0..9
are terminals.
Repetition is very common in grammars. With BNF, as in the example above, you have to use recursion to do repetition. This is rather awkward. Extended BNF (EBNF) adds several additional facilities to BNF to express repetition. There are quite a few EBNF dialects and even an ISO standard: ISO/IEC 14977. For simplicity and familiarity, below is an EBNF dialect that borrows the ubiquitous regular expression operators.
X* The Kleene star. X can appear zero or more times. X+ X can appear one or more times. X? Make X optional, X can appear zero or one time.
To simplify the integer
rule, it can be written in EBNF like so:
integer ::= digit+ digit ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
This example is read in English as: "An integer is comprised of one or more digits." Notice that the angle brackets are removed from non-terminal symbols and quotes are introduced to indicate terminal strings.
The Spirit library lets you write a target grammar exclusively in C++. Inline EBNF grammar specifications can mix freely with other C++ code and generate immediately executable parsers thanks to the generative power of C++ templates.
Character input is not limited to 8-bit ASCII. The library can easily work
with 16- and 32-bit characters such as Unicode. In fact, input data is an abstract
concept not limited to characters. Spirit can process data in the form of plain
bytes, user-defined struct
s, or even objects.
To elucidate, here is a trivial example. Parse a parenthesized, comma-delimited stream of unsigned integers of the form:
(i1, i2, i3 ... iN)
and store the extracted numbers in a vector. First, define a rule for parsing unsigned integers. In standard EBNF, you would write this as:
<integer> ::= <digit>+
This EBNF rule can be read as an unsigned integer is composed of one or more digits. Spirit's stylized C++/EBNF (SEBNF, for brevity) syntax attempts to mimic EBNF as closely as possible:
integer = +digit;
Constrained to C++'s syntax, we deviate slightly from standard EBNF. EBNF's
positive closure +
, used to signify a one-or-more occurrence of digit
,
becomes prefix instead of postfix. (Spirit uses +digit
instead of digit+
.)
Also, the assignment =
operator, initiating a production rule declaration,
replaces EBNF's ::=
.
integer
is in itself a parser that can recognize numbers such as 1,
2, 123, and 456. Note that, by default, white spaces are skipped. Thus, in the
example, an input such as "1 2 3" is legal. However, we want the parser
to inhibit the skipping of white spaces when parsing numbers. This task is accomplished
by using the lexeme
directive.
integer = lexeme[ +digit ];
digit
is one of Spirit's primitives: a range of single character literals
from 0..9. Actually, there are also predefined primitives for parsing integers
and real numbers. Here, integer
is explicitly defined.
Now we continue with the rest of the grammar. Remember that what we want is a comma-delimited string of integers enclosed in parentheses. In standard EBNF, this grammar would be written as:
<numbers> ::= '(' <integer> (',' <integer>)* ')'
So how would one write this in C++ using Spirit?
numbers = '(' >> integer >> *(',' >> integer) >> ')';
As already mentioned, the requisite changes from the original EBNF specification
stems from the syntax restrictions imposed by C++. Most obviously, our SEBNF syntax
necessitates a sequencing operator >>
where in EBNF a white space
would suffice. We say 'a >> b'
in Spirit instead of 'a b'
.
Not unlike the positive closure +
already mentioned, the Kleene star *
,
used to signify a zero-or-more occurrence of the enclosed expression (',' >>
integer)
, is prefix instead of postfix.
Finally we have a complete grammar for our trivial example:
rule<> integer, numbers; integer = lexeme[ +digit ]; numbers = '(' >> integer >> *(',' >> integer) >> ')';
Through the magic of expression templates, this code is perfectly valid C++. The
production rule numbers
is in fact an object that has a member function
parse
that does the work. As far as C++ is concerned, integer
and
numbers
are objects that have to be declared before they are used. The
first line starts with these requisite C++ type declarations.
The Spirit library has facilities to attach semantic actions to various points in a grammar so as to produce a complete and fully working parser. Here is a semantic action that inserts a parsed number into a vector:
std::vector<long> my_vec; void push(char const* str, char const* end) { my_vec.push_back(std::strtol(str, NULL, 10)); }
The function push
accepts an iterator str
that points to the first
matching character, as well as an iterator end
pointing to one past the
last valid character. We want to attach this function to the integer
production
rule:
integer = lexeme[ (+digit)[&push] ];
Every time the parser recognizes a valid number, the number is appended to my_vec
.
The following is the EBNF grammar for an arithmetic expression:
<integer> ::= '-'? <digit>+ <factor> ::= <integer> | '(' <expression> ')' | '-' <factor> <term> ::= <factor> ( '*' <factor> | '/' <factor> )* <expression> ::= <term> ('+' <term> | '-' <term> )*
Here is the corresponding snippet of code that uses SEBNF to define the same grammar:
rule<> expression, term, factor, integer; integer = lexeme[ (!ch_p('-') >> +digit) ]; factor = integer | '(' >> expression >> ')' | '-' >> factor; term = factor >> *( '*' >> factor | '/' >> factor); expression = term >> *( '+' >> factor | '-' >> term);
The first line declares all of the rules. It is necessary to forward declare the
rules because the factor
rule references the expression
rule, but
the expression
rule is defined later.
The integer
rule will recognize an integer with an optional sign. The
ch_p('-')
generates a parser that will accept the -
character.
This rule uses three of the Spirit operators: !
, >>
, and
+
. !a
is equivalent to a?
in EBNF and indicates that a
is optional. It will be matched zero or one time.
The factor
rule introduces the |
operator. This operator specifies
alternatives in which the expression 'a | b'
means that either a
or b
will be matched.
Take note that the factor
rule does not explicitly use ch_p
.
Instead, the '('
, ')'
, and '-'
are applied directly as
arguments to the >>
operator. If one of the operands in a binary
expression is a Spirit parser, such as the factor
rule, the other operand
may be a literal character. The result of such an expression is another Spirit
parser. ch_p
is used in the integer
rule because a unary operator
(!)
is applied. Without the ch_p
, !'-'
would simply evaluate
to false.
This example is not very useful, although it is valid code. All it can do is recognize a string that belongs to the grammar. Listing One contains the complete code for a calculator that evaluates the expression given to it.
Listing One: A calculator that evaluates the expression given to it .
// Copyright (c) 2001, Joel de Guzman and Dan Nuffer // Permission is granted to use this code without restriction as // long as this copyright notice appears in all source files. #include <boost/spirit/spirit.hpp> #include <iostream> #include <stack> #include <functional> using namespace std; using namespace spirit; stack<long> evaluation_stack; struct push_int { void operator()(char const* str, char const*) const { long n = std::strtol(str, 0, 10); evaluation_stack.push(n); } }; template <class op> struct do_op { do_op(op const& the_op) : m_op(the_op) {} op m_op; void operator()(char const*, char const*) const { long rhs = evaluation_stack.top(); evaluation_stack.pop(); long lhs = evaluation_stack.top(); evaluation_stack.pop(); evaluation_stack.push(m_op(lhs, rhs)); } }; template <class op> do_op<op> make_op(op const& the_op) { return do_op<op>(the_op); } struct do_negate { void operator()(char const*, char const*) const { long lhs = evaluation_stack.top(); evaluation_stack.pop(); evaluation_stack.push(-lhs); } }; int main() { rule<> expression, term, factor, integer; integer = lexeme[ (!ch_p('-') >> +digit)[push_int()] ]; factor = integer | '(' >> expression >> ')' | ('-' >> factor)[do_negate()]; term = factor >> *( ('*' >> factor)[make_op(std::multiplies<long>())] | ('/' >> factor)[make_op(std::divides<long>())]); expression = term >> *( ('+' >> term)[make_op(std::plus<long>())] | ('-' >> term)[make_op(std::minus<long>())]); char str[256]; cin.getline(str, 256); if (parse(str, expression, space).full) { cout << "parsing succeeded\n"; cout << "result = " << evaluation_stack.top() << "\n\n"; evaluation_stack.pop(); } else { cout << "parsing failed\n"; } }
A semantic action can be either a free function or function object (or functor
in STL terminology). One may be attached to any expression within a Spirit grammar
definition, using the expression p[a]
where p
is a parser and
a
is a semantic action. When a match is made, the action will be called
and passed beginning and ending iterators (much like an STL algorithm) to the
input that was matched.
The push_int
semantic action functor converts its input from a string
into a long int
and then pushes it onto evaluation_stack
.
The do_op
semantic action template struct
will apply another
function to the top two values popped from evaluation_stack
and then
push the result back onto the stack. It is used to perform all binary arithmetic
operations done in the calculator.
For each binary operation (+
, -
, *
, and /
), the
appropriate do_op
is created using the function objects from the standard
library to do the operations. The make_op
helper function facilitates
creating do_op
classes.
The do_negate
semantic action will be called when the unary negation
-
operator is invoked. do_negate
pops a value from evaluation_stack
,
negates it, and pushes the result back onto the stack.
After the parser has been created, the program reads a line from cin
.
The expression is then parsed using the free parse
function from the
library. Various parse
functions in Spirit can be used in different situations.
The parse
function used in the example takes as its parameters a NULL
terminated string, the top-level rule (traditionally called the start symbol)
of the grammar to be used for parsing, and a skip parser, which in this case
is spirit::space
. The skip parser instructs the involved parser what
to skip. Using spirit::space
as the skip parser simply means that all
space characters in between symbols and words in the input will be skipped.
parse
returns a parse_info struct
:
template <typename IteratorT> struct parse_info { IteratorT stop; bool match; bool full; unsigned length; };
The member stop
points to the final parse position (i.e., parsing processed
the input up to this point). match
is true if parsing is successful, which
may be so if the parser consumed all the input (full match) or if the parser consumed
only a portion of the input (partial match). full
will be true when a full
match occurs, meaning the parser consumed all the input. length
is the
number of characters consumed by the parser and is valid only if a successful
match has occurred.
Finally, if the parse was successful, the result of the expression is printed along with a success message.
Compiling the example is straightforward because the Spirit library consists
of only headers, as all the classes are templates. This makes it easy to use.
The library can be used straight out of the box. You only need to include the
spirit.hpp
header. There is no library to link against.
The most fundamental concept behind Spirit's design is the parser
class. A parser models a recognizer of a language from the simplest to the most
complex. It has a conceptual member function:
parse(iterator& first, iterator last)
This function does the work of inspecting the iterator
range and reporting
success or failure. The iterator first
, which is passed by reference, is
advanced accordingly when a match is found.
The parse
member function is conceptual, as opposed to virtual, in
the sense that the base class parser
does not really have any such member
function. Subclasses must provide one. The conceptual base class is a template
class parametized by its subclass, which gives it access to its subclass. Whenever
a parser is asked to do its task, it delegates the task to its subclass. This
process is very similar to how virtual functions work, but the difference is
that the member function is bound statically instead of dynamically (run-time
bound). James Coplien first popularized this technique of compile-time polymorphism
in an article in C++ Report entitled "Curiously Recurring Template
Patterns" [1]. Listing Two shows the
parser
class and some examples of trivial subclasses.
Listing Two: The parser class and some trivial subclass examples.
// Copyright (c) 2001, Joel de Guzman and Dan Nuffer // Permission is granted to use this code without restriction as // long as this copyright notice appears in all source files. template <typename DerivedT> struct parser { DerivedT& derived() { return *static_cast<DerivedT*>(this); } DerivedT const& derived() const { return *static_cast<DerivedT const*>(this); } }; template <typename DerivedT> struct char_parser : public parser<DerivedT> { template <typename IteratorT> match parse(IteratorT& first, IteratorT const& last) const { if (first != last) if (bool r = this->derived().test(*first)) { ++first; return match(1); } return match(); } ... }; template <typename CharT = char> class chlit : public char_parser<chlit<CharT> > { public: ... template <typename T> bool test(T ch_) const { return T(ch) == ch_; } private: CharT ch; };
Though quite simple, the example is not contrived, but it is an actual part
of the library. A chlit
object merely compares a single character for
a match and advances the iterator one step forward when successful. The success
of a parse is encoded in a match
object returned by the member function
parse
. Aside from reporting a true or false result, the number of matching
characters from the input can also be obtained from this object.
By itself, the chlit
class might not seem very useful. Yet, composed to
form a hierarchy, trivial parser classes can produce quite complex and elaborate
parsing functionalities (see the alternative
class in Listing
Three for an example).
Listing Three: The alternative class.
// Copyright (c) 2001, Joel de Guzman and Dan Nuffer // Permission is granted to use this code without restriction as // long as this copyright notice appears in all source files. template <typename A, typename B> struct alternative : public binary<A, B> , public parser<alternative<A, B> > { alternative(A const& a, B const& b) : binary<A, B>(a, b) {} template <typename IteratorT> match parse(IteratorT& first, IteratorT const& last) const { if (match hit = this->left().parse(first, last)) return hit; return this->right().parse(first, last); } };
This class is a composite parser class. It has two sub-parsers: left
and right
. parse
returns a success if one of its subjects is successful.
Through object composition alone, we can build parsers of arbitrary complexity.
The library includes a couple of predefined primitives (like chlit
) as
well as composites (such as alternative
). Primitives for parsing include
literal strings, ranges, and numerics. Composites include alternatives, sequences,
intersections, and differences. These parsers can be combined to form a hierarchy
in a multitude of ways.
Primitives and composites can be composed to form more complex composites using
operator overloading, coded in a way that mimics EBNF. Composing an alternative
,
for example requires two operands and is accomplished by overloading the operator
|
as shown here:
template <typename A, typename B> inline alternative<A, B> operator | (parser<A> const& a, parser<B> const& b) { return alternative<A, B>( a.derived(), b.derived()); }
Given two chlit
objects a
and b
, we can form an alternative
composite object using the |
operator:
a | b
Assuming that both a
and b
have the type chlit<char>
,
the resulting composite is:
alternative<chlit<char>, chlit<char> >
Continuing the example, the expression:
a | b | c
where a
, b
, and c
all have the type chlit<char>
,
will generate a nested composite with the type:
alternative<alternative< chlit<char>, chlit<char> >, chlit<char> >
All composite parsers follow the same style of object aggregation. The difference
lies in the algorithms used by each class's parse
member function.
The sequence
class appears in Listing Four.
Listing Four: The sequence class
// Copyright (c) 2001, Joel de Guzman and Dan Nuffer // Permission is granted to use this code without restriction as // long as this copyright notice appears in all source files. template <typename A, typename B> struct sequence : public binary<A, B> , public parser<sequence<A, B> > { sequence(A const& a, B const& b) : binary<A, B>(a, b) {} template <typename IteratorT> match parse(IteratorT& first, IteratorT const& last) const { IteratorT s = first; match ma, mb; if ((ma = this->left().parse(s, last)) && (mb = this->right().parse(s, last))) { first = s; return ma + mb; } return match(); } };
The sequence
class is very similar to alternative
, but it has
a slight twist in the implementation of its parse
member function. sequence
matches both its subjects (left
and right
) in strict sequence.
Like alternative
, sequence
has a corresponding overloaded operator:
the >>
operator.
So far, we have just tackled binary composites such as alternative
and
sequence
. Unary composites work similarly to binary composites but with
only a single subject. The Kleene star, used for zero or more iterations, and
the positive closure, used for one or more iterations, are examples of unary composites. Listing Five shows the Kleene star class implementation.
Listing Five: The Kleene star class implementation.
// Copyright (c) 2001, Joel de Guzman and Dan Nuffer // Permission is granted to use this code without restriction as // long as this copyright notice appears in all source files. template <typename S> struct kleene_star : public unary<S> , public parser<kleene_star<S> > { kleene_star(S const& a) : unary<S>(a) {} template <typename IteratorT> match parse(IteratorT& first, IteratorT const& last) const { match hit(0); while (match next = this->subject().parse(first, last)) hit += next; return hit; } };
As is obvious by now, binary composites inherit from the binary class that
does the work of storing a copy of its left and right subjects. In the same
vein, unary composites inherit from the unary class, similar to the binary class
but with only one subject. Most fundamentally, all parsers, whether primitive
or composite, inherit from the base parser
class.
All entities in the library are variations of the examples presented. There
are entities that deal with set
-like operations, iteration (Kleene star
and its brethren), semantic actions (hooks to external C/C++ code), directives
(unary composites that modify the functionality of its subjects), parser assertions,
and error handlers.
Ultimately, we want to encapsulate an arbitrarily complex parser expression in a placeholder variable that can be used just like any other parser. This is the tricky part. How do you encode an arbitrarily complex type, produced by a rather complex compositional expression, into a variable without knowing its exact type beforehand?
The problem is that each expression evaluates to a unique type signature.
As shown above, the expression a | b
results in a type quite different
from the expression a | b | c
, for example. Yet, we want both rule
= a | b
and rule = a | b | c
to work. The only similarity between
any types produced by compositional expressions is that they have the parser
base class. Even then, parser
is parameterized by its derived class.
Thus, two parser bases with different derived template parameters are essentially
unique types.
The solution: encode the type in a run-time polymorphic class that is derived
from an abstract_parser
class and contains a virtual member function
parse
. Make both the copy constructor and assignment operator of the
rule member templates parameterized by the type of parser argument. Whenever
the copy constructor or assignment operator of the rule is invoked, a concrete
instance of this abstract_parser
is created. A pointer to abstract_parser
is saved in a member variable.
Listing Six shows the class declaration
of our abstract_parser
. Also shown is concrete_parser
, a subclass
that knows about the type of parser and multiply inherits from parser
and abstract_parser
.
Listing Six: The class declaration of abstract_parser and concrete_parser
// Copyright (c) 2001, Joel de Guzman and Dan Nuffer // Permission is granted to use this code without restriction as // long as this copyright notice appears in all source files. template <typename IteratorT> class abstract_parser { public: abstract_parser() {} virtual ~abstract_parser() {} virtual match parse( IteratorT& first, IteratorT const& last) const = 0; }; template <typename ParserT, typename IteratorT> class concrete_parser : public ParserT , public abstract_parser<IteratorT> { public: concrete_parser(ParserT const& parser) : ParserT(parser) {} virtual ~concrete_parser() {} virtual match parse(IteratorT& first, IteratorT const& last) const { return ParserT::parse(first, last); } };
Listing Seven shows a rule's assignment
operator. This assignment operator is a template member function parameterized
by the type of the parser parameter. Any unique parser type composed using SEBNF
expressions, no matter how complex, can be encapsulated this way. Now that the
right-hand side (rhs
) of an assignment (and copy constructor) can be
encapsulated in a rule, the rule may be used anywhere just like any other parser.
Listing Seven: A rule's assignment operator
// Copyright (c) 2001, Joel de Guzman and Dan Nuffer // Permission is granted to use this code without restriction as // long as this copyright notice appears in all source files. template <typename ParserT> rule& operator = (ParserT const& parser) { assert(meta == 0); meta = new impl::concrete_parser<ParserT, IteratorT>(parser); return *this; }
Spirit makes it easy to write parsers directly in C++. Thanks to the expressiveness of C++ operator overloading and the flexibility of template metaprogramming, grammars can be written in a high-level syntax very similar to EBNF. These grammar descriptions can mix freely with C++ code. Semantic actions, attached to any part of the grammar, seamlessly link EBNF grammar specifications to C++ code. The intricacies of the parsing engine framework are hidden beneath an intuitive C++/EBNF interface. Using Spirit is an order of a magnitude easier, compared to coding a parser by hand or perhaps even using a stand-alone parser such as YACC. Since everything is done in C++, there is no need to use an external parser generator along with all the excess baggage that comes with it. Spirit is a flexible, lightweight, and simple-to-use C++ library suitable for many parsing related programming tasks.
[1] James Coplien. "Curiously Recurring Template Patterns," C++ Report, February 1995.
[2] Todd Veldhuizen. "Expression Templates," C++ Report, June 1995.
[3] Peter Naur (ed.). "Report on the Algorithmic Language ALGOL 60," CACM, May 1960.
[4] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns, Elements of Reusable Object-Oriented Software (Addison-Wesley, 1995).
[5] T. J. Parr, H. G. Dietz, and W. E. Cohen. PCCTS Reference Manual, Version 1.00 (School of Electrical Engineering, Purdue University, West Lafayette, August 1991).
[6] Adrian Johnstone and Elizabeth Scott. "RDP, A Recursive Descent Compiler Compiler," Technical Report CSD TR 97 25, Dept. of Computer Science, Egham, Surrey, England, Dec. 20, 1997.
[7] Joel de Guzman. "Spirit Version 1.1," July 2001.
[8] Joel de Guzman. "Spirit Version 1.2," November 2001.
Dan Nuffer is a software engineer. His interests include parsers, compilers, C++ XML, software engineering and Linux. Joel de Guzman is a consultant and software engineer at Boost Consulting (www.boost-consulting.com) with specialization and expertise in the implementation of generic libraries and frameworks, Joel is adept at writing code using modern C++ techniques, especially, but not limited to, template metaprogramming and functional programming in C++. After work, Joel loves to play distortion-laden heavy rock guitar, play around in his home studio, and record his own compositions.
Listing 4: The sequence class
// Copyright (c) 2001, Joel de Guzman and Dan Nuffer // Permission is granted to use this code without restriction as // long as this copyright notice appears in all source files. template <typename A, typename B> struct sequence : public binary<A, B> , public parser<sequence<A, B> > { sequence(A const& a, B const& b) : binary<A, B>(a, b) {} template <typename IteratorT> match parse(IteratorT& first, IteratorT const& last) const { IteratorT s = first; match ma, mb; if ((ma = this->left().parse(s, last)) && (mb = this->right().parse(s, last))) { first = s; return ma + mb; } return match(); } };
Listing 5: The Kleene star class implementation
// Copyright (c) 2001, Joel de Guzman and Dan Nuffer // Permission is granted to use this code without restriction as // long as this copyright notice appears in all source files. template <typename S> struct kleene_star : public unary<S> , public parser<kleene_star<S> > { kleene_star(S const& a) : unary<S>(a) {} template <typename IteratorT> match parse(IteratorT& first, IteratorT const& last) const { match hit(0); while (match next = this->subject().parse(first, last)) hit += next; return hit; } };
Listing 6: The class declaration of abstract_parser and concrete_parser
// Copyright (c) 2001, Joel de Guzman and Dan Nuffer // Permission is granted to use this code without restriction as // long as this copyright notice appears in all source files. template <typename IteratorT> class abstract_parser { public: abstract_parser() {} virtual ~abstract_parser() {} virtual match parse( IteratorT& first, IteratorT const& last) const = 0; }; template <typename ParserT, typename IteratorT> class concrete_parser : public ParserT , public abstract_parser<IteratorT> { public: concrete_parser(ParserT const& parser) : ParserT(parser) {} virtual ~concrete_parser() {} virtual match parse(IteratorT& first, IteratorT const& last) const { return ParserT::parse(first, last); } };
Listing 7: A rule's assignment operator
// Copyright (c) 2001, Joel de Guzman and Dan Nuffer // Permission is granted to use this code without restriction as // long as this copyright notice appears in all source files. template <typename ParserT> rule& operator = (ParserT const& parser) { assert(meta == 0); meta = new impl::concrete_parser<ParserT, IteratorT>(parser); return *this; }
A BNF grammar consists of a number of production rules. The rules are principally written as:
<S> ::= A1 | A2 | A3 ... ANwhere A1..AN are alternatives separated by the |. The symbol S on the left-hand side of ::= must be replaced by one of the alternatives on the right-hand side. Alternatives consist of terminals and non-terminals. A terminal is an atomic string that cannot be broken down. They terminate the production process. On the other hand, a non-terminal recursively refers to a rule. Here's a trivial example BNF grammar that specifies an integer number:
<integer> ::= <digit> | <integer> <digit> <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9In plain English, this example can be read as: "An integer is a single digit, or, an integer followed by a single digit. A digit is one of 0 or 1 or 2 or 3 ... 9." Symbols in angle brackets, <digit> and <integer>, are non-terminals while 0..9 are terminals.
Repetition is very common in grammars. With BNF, as in the example above, you have to use recursion to do repetition. This is rather awkward. Extended BNF (EBNF) adds several additional facilities to BNF to express repetition. There are quite a few EBNF dialects and even an ISO standard: ISO/IEC 14977. For simplicity and familiarity, below is an EBNF dialect that borrows the ubiquitous regular expression operators.
X* The Kleene star. X can appear zero or more times. X+ X can appear one or more times. X? Make X optional, X can appear zero or one time.To simplify the integer rule, it can be written in EBNF like so:
integer ::= digit+ digit ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'This example is read in English as: "An integer is comprised of one or more digits." Notice that the angle brackets are removed from non-terminal symbols and quotes are introduced to indicate terminal strings.
Terms of Service | Privacy Statement | Copyright © 2024 UBM Tech, All rights reserved.