Dr. Dobb's | The Spirit Parser Library: Inline Parsing in C++

The Spirit Parser Library: Inline Parsing in C++

Powerful parsing made easy via modern template techniques.

September 01, 2003
URL:http://www.drdobbs.com/jvm/an-algorithm-for-compressing-space-and-t/jvm/logging-in-c/cpp/the-spirit-parser-library-inline-parsing/184401692

Everyday programming often requires writing parsers. The Internet and networking opened up the information floodgates such that software engineers have to deal with more human-to-computer communication, as well as computer-to-computer communication. HTML, XML, SQL, RDF, XDR, Postscript, OS shells, scripting languages, and even tasks such as extracting command-line options and reading files require various degrees of parsing.

Unfortunately, many programmers do not approach parser creation as a language design task, and they code parsers by hand. For simple to moderately complex tasks, many programmers stay away from tools such as YACC because of the "jumbo jet" syndrome. (It's just too big and detached from the application.) Many of today's parsing tasks would have been well served by a C++ library-based parser somewhere between Posix-style regular expressions and a full-blown parser.

Spirit is an attempt to fill the missing gap between existing regular-expression tools and computer-language-oriented parser generators. It is an object-oriented, back-tracking, recursive descent parser generator framework implemented using template metaprogramming techniques. Expression templates, taking advantage of the expressiveness of operator overloading in C++, allow Spirit to approximate the syntax of EBNF (Extended Backus Normal Form) completely in C++ using a syntax closely resembling EBNF.

BNF and EBNF

BNF began as an abbreviation of Backus Normal Form and later became popularly known as the Backus-Naur Form. It is a formal notation that describes a language's syntax. BNF was developed primarily by John Backus and modified by Peter Naur to formalize the grammar of the Algol 60 programming language. It is now standard practice to use formal grammars to specify languages.

A BNF grammar consists of a number of production rules. The rules are principally written as:

<S> ::= A1 | A2 | A3 ... AN

where A1..AN are alternatives separated by the |. The symbol S on the left-hand side of ::= must be replaced by one of the alternatives on the right-hand side. Alternatives consist of terminals and non-terminals. A terminal is an atomic string that cannot be broken down. They terminate the production process. On the other hand, a non-terminal recursively refers to a rule. Here's a trivial example BNF grammar that specifies an integer number:

<integer> ::= <digit> | <integer> <digit>
<digit>   ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

In plain English, this example can be read as: "An integer is a single digit, or, an integer followed by a single digit. A digit is one of 0 or 1 or 2 or 3 ... 9." Symbols in angle brackets, <digit> and <integer>, are non-terminals while 0..9 are terminals.

Repetition is very common in grammars. With BNF, as in the example above, you have to use recursion to do repetition. This is rather awkward. Extended BNF (EBNF) adds several additional facilities to BNF to express repetition. There are quite a few EBNF dialects and even an ISO standard: ISO/IEC 14977. For simplicity and familiarity, below is an EBNF dialect that borrows the ubiquitous regular expression operators.

X*  The Kleene star. X can appear zero or more times.
X+  X can appear one or more times.
X?  Make X optional, X can appear zero or one time.

To simplify the integer rule, it can be written in EBNF like so:

integer ::= digit+
digit   ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' |
            '8' | '9'

This example is read in English as: "An integer is comprised of one or more digits." Notice that the angle brackets are removed from non-terminal symbols and quotes are introduced to indicate terminal strings.

The Spirit Library

The Spirit library lets you write a target grammar exclusively in C++. Inline EBNF grammar specifications can mix freely with other C++ code and generate immediately executable parsers thanks to the generative power of C++ templates.

Character input is not limited to 8-bit ASCII. The library can easily work with 16- and 32-bit characters such as Unicode. In fact, input data is an abstract concept not limited to characters. Spirit can process data in the form of plain bytes, user-defined structs, or even objects.

Quick Start

To elucidate, here is a trivial example. Parse a parenthesized, comma-delimited stream of unsigned integers of the form:

(i1, i2, i3 ... iN)

and store the extracted numbers in a vector. First, define a rule for parsing unsigned integers. In standard EBNF, you would write this as:

<integer> ::= <digit>+

This EBNF rule can be read as an unsigned integer is composed of one or more digits. Spirit's stylized C++/EBNF (SEBNF, for brevity) syntax attempts to mimic EBNF as closely as possible:

integer = +digit;

Constrained to C++'s syntax, we deviate slightly from standard EBNF. EBNF's positive closure +, used to signify a one-or-more occurrence of digit, becomes prefix instead of postfix. (Spirit uses +digit instead of digit+.) Also, the assignment = operator, initiating a production rule declaration, replaces EBNF's ::=.

integer is in itself a parser that can recognize numbers such as 1, 2, 123, and 456. Note that, by default, white spaces are skipped. Thus, in the example, an input such as "1 2 3" is legal. However, we want the parser to inhibit the skipping of white spaces when parsing numbers. This task is accomplished by using the lexeme directive.

 integer = lexeme[ +digit ];

digit is one of Spirit's primitives: a range of single character literals from 0..9. Actually, there are also predefined primitives for parsing integers and real numbers. Here, integer is explicitly defined.

Now we continue with the rest of the grammar. Remember that what we want is a comma-delimited string of integers enclosed in parentheses. In standard EBNF, this grammar would be written as:

<numbers> ::= 
  '(' <integer> (',' <integer>)* ')'

So how would one write this in C++ using Spirit?

numbers = 
  '(' >> integer >> *(',' >> integer) >> ')';

As already mentioned, the requisite changes from the original EBNF specification stems from the syntax restrictions imposed by C++. Most obviously, our SEBNF syntax necessitates a sequencing operator >> where in EBNF a white space would suffice. We say 'a >> b' in Spirit instead of 'a b'. Not unlike the positive closure + already mentioned, the Kleene star *, used to signify a zero-or-more occurrence of the enclosed expression (',' >> integer), is prefix instead of postfix.

Finally we have a complete grammar for our trivial example:

rule<> integer, numbers;

integer = lexeme[ +digit ];
numbers = '(' >> integer >> *(',' >> integer) >> ')';

Through the magic of expression templates, this code is perfectly valid C++. The production rule numbers is in fact an object that has a member function parse that does the work. As far as C++ is concerned, integer and numbers are objects that have to be declared before they are used. The first line starts with these requisite C++ type declarations.

The Spirit library has facilities to attach semantic actions to various points in a grammar so as to produce a complete and fully working parser. Here is a semantic action that inserts a parsed number into a vector:

std::vector<long> my_vec;

void push(char const* str, char const* end)
{
    my_vec.push_back(std::strtol(str, NULL, 10));
}

The function push accepts an iterator str that points to the first matching character, as well as an iterator end pointing to one past the last valid character. We want to attach this function to the integer production rule:

integer = lexeme[ (+digit)[&push] ];

Every time the parser recognizes a valid number, the number is appended to my_vec.

A More Complicated Example: The Classic Calculator Grammar

The following is the EBNF grammar for an arithmetic expression:

<integer>     ::= '-'? <digit>+
<factor>      ::= <integer>
              | '(' <expression> ')'
              | '-' <factor>
<term>        ::= <factor> ( '*' <factor> | '/' <factor> )*
<expression>  ::= <term> ('+' <term> | '-' <term> )*

Here is the corresponding snippet of code that uses SEBNF to define the same grammar:

rule<> expression, term, factor, integer;

integer     = lexeme[ (!ch_p('-') >> +digit) ];
factor      = integer
            | '(' >> expression >> ')'
            | '-' >> factor;
term        = factor >> *( '*' >> factor | '/' >> factor);
expression  = term >> *( '+' >> factor | '-' >> term);

The first line declares all of the rules. It is necessary to forward declare the rules because the factor rule references the expression rule, but the expression rule is defined later.

The integer rule will recognize an integer with an optional sign. The ch_p('-') generates a parser that will accept the - character. This rule uses three of the Spirit operators: !, >>, and +. !a is equivalent to a? in EBNF and indicates that a is optional. It will be matched zero or one time.

The factor rule introduces the | operator. This operator specifies alternatives in which the expression 'a | b' means that either a or b will be matched.

Take note that the factor rule does not explicitly use ch_p. Instead, the '(', ')', and '-' are applied directly as arguments to the >> operator. If one of the operands in a binary expression is a Spirit parser, such as the factor rule, the other operand may be a literal character. The result of such an expression is another Spirit parser. ch_p is used in the integer rule because a unary operator (!) is applied. Without the ch_p, !'-' would simply evaluate to false.

This example is not very useful, although it is valid code. All it can do is recognize a string that belongs to the grammar. Listing One contains the complete code for a calculator that evaluates the expression given to it.

Listing One: A calculator that evaluates the expression given to it .

//  Copyright (c) 2001, Joel de Guzman and Dan Nuffer
//  Permission is granted to use this code without restriction as
//  long as this copyright notice appears in all source files.

#include <boost/spirit/spirit.hpp>
#include <iostream>
#include <stack>
#include <functional>

using namespace std;
using namespace spirit;

stack<long> evaluation_stack;

struct push_int
{
    void operator()(char const* str, char const*) const
    {
        long n = std::strtol(str, 0, 10);
        evaluation_stack.push(n);
    }
};

template <class op>
struct do_op
{
    do_op(op const& the_op) : m_op(the_op) {}
    op m_op;

    void operator()(char const*, char const*) const
    {
        long rhs = evaluation_stack.top();
        evaluation_stack.pop();
        long lhs = evaluation_stack.top();
        evaluation_stack.pop();
        evaluation_stack.push(m_op(lhs, rhs));
    }
};

template <class op>
do_op<op> make_op(op const& the_op)
{
    return do_op<op>(the_op);
}

struct do_negate
{
    void operator()(char const*, char const*) const
    {
        long lhs = evaluation_stack.top();
        evaluation_stack.pop();
        evaluation_stack.push(-lhs);
    }
};

int main()
{
    rule<> expression, term, factor, integer;

    integer =
        lexeme[ (!ch_p('-') >> +digit)[push_int()] ];

    factor =
            integer
        |   '(' >> expression >> ')'
        |   ('-' >> factor)[do_negate()];

    term =
        factor >>
            *( ('*' >> factor)[make_op(std::multiplies<long>())]
             | ('/' >> factor)[make_op(std::divides<long>())]);

    expression  =
        term >>
            *( ('+' >> term)[make_op(std::plus<long>())]
             | ('-' >> term)[make_op(std::minus<long>())]);

    char str[256];
    cin.getline(str, 256);
    if (parse(str, expression, space).full)
    {
        cout << "parsing succeeded\n";
        cout << "result = " << evaluation_stack.top() << "\n\n";
        evaluation_stack.pop();
    }
    else
    {
        cout << "parsing failed\n";
    }
}

A semantic action can be either a free function or function object (or functor in STL terminology). One may be attached to any expression within a Spirit grammar definition, using the expression p[a] where p is a parser and a is a semantic action. When a match is made, the action will be called and passed beginning and ending iterators (much like an STL algorithm) to the input that was matched.

The push_int semantic action functor converts its input from a string into a long int and then pushes it onto evaluation_stack.

The do_op semantic action template struct will apply another function to the top two values popped from evaluation_stack and then push the result back onto the stack. It is used to perform all binary arithmetic operations done in the calculator.

For each binary operation (+, -, *, and /), the appropriate do_op is created using the function objects from the standard library to do the operations. The make_op helper function facilitates creating do_op classes.

The do_negate semantic action will be called when the unary negation - operator is invoked. do_negate pops a value from evaluation_stack, negates it, and pushes the result back onto the stack.

After the parser has been created, the program reads a line from cin. The expression is then parsed using the free parse function from the library. Various parse functions in Spirit can be used in different situations. The parse function used in the example takes as its parameters a NULL terminated string, the top-level rule (traditionally called the start symbol) of the grammar to be used for parsing, and a skip parser, which in this case is spirit::space. The skip parser instructs the involved parser what to skip. Using spirit::space as the skip parser simply means that all space characters in between symbols and words in the input will be skipped.

parse returns a parse_info struct:

template <typename IteratorT>
struct parse_info
{
    IteratorT  stop;
    bool       match;
    bool       full;
    unsigned   length;
};

The member stop points to the final parse position (i.e., parsing processed the input up to this point). match is true if parsing is successful, which may be so if the parser consumed all the input (full match) or if the parser consumed only a portion of the input (partial match). full will be true when a full match occurs, meaning the parser consumed all the input. length is the number of characters consumed by the parser and is valid only if a successful match has occurred.

Finally, if the parse was successful, the result of the expression is printed along with a success message.

Compiling the example is straightforward because the Spirit library consists of only headers, as all the classes are templates. This makes it easy to use. The library can be used straight out of the box. You only need to include the spirit.hpp header. There is no library to link against.

The Parser

The most fundamental concept behind Spirit's design is the parser class. A parser models a recognizer of a language from the simplest to the most complex. It has a conceptual member function:

parse(iterator& first, iterator last)

This function does the work of inspecting the iterator range and reporting success or failure. The iterator first, which is passed by reference, is advanced accordingly when a match is found.

The parse member function is conceptual, as opposed to virtual, in the sense that the base class parser does not really have any such member function. Subclasses must provide one. The conceptual base class is a template class parametized by its subclass, which gives it access to its subclass. Whenever a parser is asked to do its task, it delegates the task to its subclass. This process is very similar to how virtual functions work, but the difference is that the member function is bound statically instead of dynamically (run-time bound). James Coplien first popularized this technique of compile-time polymorphism in an article in C++ Report entitled "Curiously Recurring Template Patterns" [1]. Listing Two shows the parser class and some examples of trivial subclasses.

Listing Two: The parser class and some trivial subclass examples.

//  Copyright (c) 2001, Joel de Guzman and Dan Nuffer
//  Permission is granted to use this code without restriction as
//  long as this copyright notice appears in all source files.

template <typename DerivedT>
struct parser
{
    DerivedT&
    derived()
    { return *static_cast<DerivedT*>(this); }

    DerivedT const&
    derived() const
    { return *static_cast<DerivedT const*>(this); }
};

template <typename DerivedT>
struct char_parser : public parser<DerivedT>
{
    template <typename IteratorT>
    match
    parse(IteratorT& first, IteratorT const& last) const
    {
        if (first != last)
            if (bool r = this->derived().test(*first))
            {
                ++first;
                return match(1);
            }
        return match();
    }
    ...
};

template <typename CharT = char>
class chlit : public char_parser<chlit<CharT> >
{
public:
    ...
    template <typename T>
    bool test(T ch_) const
    { return T(ch) == ch_; }

private:

    CharT  ch;
};

Though quite simple, the example is not contrived, but it is an actual part of the library. A chlit object merely compares a single character for a match and advances the iterator one step forward when successful. The success of a parse is encoded in a match object returned by the member function parse. Aside from reporting a true or false result, the number of matching characters from the input can also be obtained from this object.

Composites

By itself, the chlit class might not seem very useful. Yet, composed to form a hierarchy, trivial parser classes can produce quite complex and elaborate parsing functionalities (see the alternative class in Listing Three for an example).

Listing Three: The alternative class.

//  Copyright (c) 2001, Joel de Guzman and Dan Nuffer
//  Permission is granted to use this code without restriction as
//  long as this copyright notice appears in all source files.

template <typename A, typename B>
struct alternative
    : public binary<A, B>
    , public parser<alternative<A, B> >
{
    alternative(A const& a, B const& b)
    : binary<A, B>(a, b) {}

    template <typename IteratorT>
    match
    parse(IteratorT& first,  IteratorT const& last) const
    {
        if (match hit = this->left().parse(first, last))
            return hit;
        return this->right().parse(first, last);
    }
};

This class is a composite parser class. It has two sub-parsers: left and right. parse returns a success if one of its subjects is successful.

Through object composition alone, we can build parsers of arbitrary complexity. The library includes a couple of predefined primitives (like chlit) as well as composites (such as alternative). Primitives for parsing include literal strings, ranges, and numerics. Composites include alternatives, sequences, intersections, and differences. These parsers can be combined to form a hierarchy in a multitude of ways.

Operators

Primitives and composites can be composed to form more complex composites using operator overloading, coded in a way that mimics EBNF. Composing an alternative, for example requires two operands and is accomplished by overloading the operator | as shown here:

template <typename A, typename B>
inline alternative<A, B>
operator | (parser<A> const& a, parser<B> const& b)
{
    return alternative<A, B>(
        a.derived(), b.derived());
}

Given two chlit objects a and b, we can form an alternative composite object using the | operator:

a | b

Assuming that both a and b have the type chlit<char>, the resulting composite is:

alternative<chlit<char>, chlit<char> >

Continuing the example, the expression:

a | b | c

where a, b, and c all have the type chlit<char>, will generate a nested composite with the type:

alternative<alternative<
     chlit<char>, chlit<char> >,
     chlit<char> >

All composite parsers follow the same style of object aggregation. The difference lies in the algorithms used by each class's parse member function. The sequence class appears in Listing Four.

Listing Four: The sequence class

//  Copyright (c) 2001, Joel de Guzman and Dan Nuffer
//  Permission is granted to use this code without restriction as
//  long as this copyright notice appears in all source files.

template <typename A, typename B>
struct sequence
    : public binary<A, B>
    , public parser<sequence<A, B> >
{
    sequence(A const& a, B const& b)
    : binary<A, B>(a, b) {}

    template <typename IteratorT>
    match parse(IteratorT& first, IteratorT const& last) const
    {
        IteratorT s = first;
        match ma, mb;
        if ((ma = this->left().parse(s, last)) &&
            (mb = this->right().parse(s, last)))
        {
            first = s;
            return ma + mb;
        }
        return match();
    }
};

The sequence class is very similar to alternative, but it has a slight twist in the implementation of its parse member function. sequence matches both its subjects (left and right) in strict sequence. Like alternative, sequence has a corresponding overloaded operator: the >> operator.

Unary and Binary Composites

So far, we have just tackled binary composites such as alternative and sequence. Unary composites work similarly to binary composites but with only a single subject. The Kleene star, used for zero or more iterations, and the positive closure, used for one or more iterations, are examples of unary composites. Listing Five shows the Kleene star class implementation.

Listing Five: The Kleene star class implementation.

//  Copyright (c) 2001, Joel de Guzman and Dan Nuffer
//  Permission is granted to use this code without restriction as
//  long as this copyright notice appears in all source files.

template <typename S>
struct kleene_star
    : public unary<S>
    , public parser<kleene_star<S> >
{
    kleene_star(S const& a)
    : unary<S>(a) {}

    template <typename IteratorT>
    match parse(IteratorT& first, IteratorT const& last) const
    {
        match hit(0);
        while (match next = this->subject().parse(first, last))
            hit += next;
        return hit;
    }
};

As is obvious by now, binary composites inherit from the binary class that does the work of storing a copy of its left and right subjects. In the same vein, unary composites inherit from the unary class, similar to the binary class but with only one subject. Most fundamentally, all parsers, whether primitive or composite, inherit from the base parser class.

All entities in the library are variations of the examples presented. There are entities that deal with set-like operations, iteration (Kleene star and its brethren), semantic actions (hooks to external C/C++ code), directives (unary composites that modify the functionality of its subjects), parser assertions, and error handlers.

The Rule

Ultimately, we want to encapsulate an arbitrarily complex parser expression in a placeholder variable that can be used just like any other parser. This is the tricky part. How do you encode an arbitrarily complex type, produced by a rather complex compositional expression, into a variable without knowing its exact type beforehand?

The problem is that each expression evaluates to a unique type signature. As shown above, the expression a | b results in a type quite different from the expression a | b | c, for example. Yet, we want both rule = a | b and rule = a | b | c to work. The only similarity between any types produced by compositional expressions is that they have the parser base class. Even then, parser is parameterized by its derived class. Thus, two parser bases with different derived template parameters are essentially unique types.

The solution: encode the type in a run-time polymorphic class that is derived from an abstract_parser class and contains a virtual member function parse. Make both the copy constructor and assignment operator of the rule member templates parameterized by the type of parser argument. Whenever the copy constructor or assignment operator of the rule is invoked, a concrete instance of this abstract_parser is created. A pointer to abstract_parser is saved in a member variable.

Listing Six shows the class declaration of our abstract_parser. Also shown is concrete_parser, a subclass that knows about the type of parser and multiply inherits from parser and abstract_parser.

Listing Six: The class declaration of abstract_parser and concrete_parser

//  Copyright (c) 2001, Joel de Guzman and Dan Nuffer
//  Permission is granted to use this code without restriction as
//  long as this copyright notice appears in all source files.

template <typename IteratorT>
class abstract_parser
{
public:
                    abstract_parser() {}
    virtual         ~abstract_parser() {}
    virtual match   parse(
                        IteratorT& first,
                        IteratorT const& last) const = 0;
};

template <typename ParserT, typename IteratorT>
class concrete_parser
    : public ParserT
    , public abstract_parser<IteratorT>
{
public:

                    concrete_parser(ParserT const& parser)
                    : ParserT(parser) {}
  virtual           ~concrete_parser() {}

  virtual match
  parse(IteratorT& first, IteratorT const& last) const
  { return ParserT::parse(first, last); }
};

Listing Seven shows a rule's assignment operator. This assignment operator is a template member function parameterized by the type of the parser parameter. Any unique parser type composed using SEBNF expressions, no matter how complex, can be encapsulated this way. Now that the right-hand side (rhs) of an assignment (and copy constructor) can be encapsulated in a rule, the rule may be used anywhere just like any other parser.

Listing Seven: A rule's assignment operator

//  Copyright (c) 2001, Joel de Guzman and Dan Nuffer
//  Permission is granted to use this code without restriction as
//  long as this copyright notice appears in all source files.

template <typename ParserT>
rule& operator = (ParserT const& parser)
{
    assert(meta == 0);
    meta = new impl::concrete_parser<ParserT, IteratorT>(parser);
    return *this;
}

Conclusion

Spirit makes it easy to write parsers directly in C++. Thanks to the expressiveness of C++ operator overloading and the flexibility of template metaprogramming, grammars can be written in a high-level syntax very similar to EBNF. These grammar descriptions can mix freely with C++ code. Semantic actions, attached to any part of the grammar, seamlessly link EBNF grammar specifications to C++ code. The intricacies of the parsing engine framework are hidden beneath an intuitive C++/EBNF interface. Using Spirit is an order of a magnitude easier, compared to coding a parser by hand or perhaps even using a stand-alone parser such as YACC. Since everything is done in C++, there is no need to use an external parser generator along with all the excess baggage that comes with it. Spirit is a flexible, lightweight, and simple-to-use C++ library suitable for many parsing related programming tasks.

Notes and References

[1] James Coplien. "Curiously Recurring Template Patterns," C++ Report, February 1995.

[2] Todd Veldhuizen. "Expression Templates," C++ Report, June 1995.

[3] Peter Naur (ed.). "Report on the Algorithmic Language ALGOL 60," CACM, May 1960.

[4] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns, Elements of Reusable Object-Oriented Software (Addison-Wesley, 1995).

[5] T. J. Parr, H. G. Dietz, and W. E. Cohen. PCCTS Reference Manual, Version 1.00 (School of Electrical Engineering, Purdue University, West Lafayette, August 1991).

[6] Adrian Johnstone and Elizabeth Scott. "RDP, A Recursive Descent Compiler Compiler," Technical Report CSD TR 97 25, Dept. of Computer Science, Egham, Surrey, England, Dec. 20, 1997.

[7] Joel de Guzman. "Spirit Version 1.1," July 2001.

[8] Joel de Guzman. "Spirit Version 1.2," November 2001.

Dan Nuffer is a software engineer. His interests include parsers, compilers, C++ XML, software engineering and Linux. Joel de Guzman is a consultant and software engineer at Boost Consulting (www.boost-consulting.com) with specialization and expertise in the implementation of generic libraries and frameworks, Joel is adept at writing code using modern C++ techniques, especially, but not limited to, template metaprogramming and functional programming in C++. After work, Joel loves to play distortion-laden heavy rock guitar, play around in his home studio, and record his own compositions.

Listing 4: The sequence class

Listing 4: The sequence class

//  Copyright (c) 2001, Joel de Guzman and Dan Nuffer
//  Permission is granted to use this code without restriction as
//  long as this copyright notice appears in all source files.

template <typename A, typename B>
struct sequence
    : public binary<A, B>
    , public parser<sequence<A, B> >
{
    sequence(A const& a, B const& b)
    : binary<A, B>(a, b) {}

    template <typename IteratorT>
    match parse(IteratorT& first, IteratorT const& last) const
    {
        IteratorT s = first;
        match ma, mb;
        if ((ma = this->left().parse(s, last)) &&
            (mb = this->right().parse(s, last)))
        {
            first = s;
            return ma + mb;
        }
        return match();
    }
};

Listing 5: The Kleene star class implementation

Listing 5: The Kleene star class implementation

//  Copyright (c) 2001, Joel de Guzman and Dan Nuffer
//  Permission is granted to use this code without restriction as
//  long as this copyright notice appears in all source files.

template <typename S>
struct kleene_star
    : public unary<S>
    , public parser<kleene_star<S> >
{
    kleene_star(S const& a)
    : unary<S>(a) {}

    template <typename IteratorT>
    match parse(IteratorT& first, IteratorT const& last) const
    {
        match hit(0);
        while (match next = this->subject().parse(first, last))
            hit += next;
        return hit;
    }
};

Listing 6: The class declaration of abstract_parser and concrete_parser

Listing 6: The class declaration of abstract_parser and concrete_parser

//  Copyright (c) 2001, Joel de Guzman and Dan Nuffer
//  Permission is granted to use this code without restriction as
//  long as this copyright notice appears in all source files.

template <typename IteratorT>
class abstract_parser
{
public:
                    abstract_parser() {}
    virtual         ~abstract_parser() {}
    virtual match   parse(
                        IteratorT& first,
                        IteratorT const& last) const = 0;
};

template <typename ParserT, typename IteratorT>
class concrete_parser
    : public ParserT
    , public abstract_parser<IteratorT>
{
public:

                    concrete_parser(ParserT const& parser)
                    : ParserT(parser) {}
  virtual           ~concrete_parser() {}

  virtual match
  parse(IteratorT& first, IteratorT const& last) const
  { return ParserT::parse(first, last); }
};

Listing 7: A rule's assignment operator

Listing 7: A rule's assignment operator

//  Copyright (c) 2001, Joel de Guzman and Dan Nuffer
//  Permission is granted to use this code without restriction as
//  long as this copyright notice appears in all source files.

template <typename ParserT>
rule& operator = (ParserT const& parser)
{
    assert(meta == 0);
    meta = new impl::concrete_parser<ParserT, IteratorT>(parser);
    return *this;
}

BNF and EBNF

A BNF grammar consists of a number of production rules. The rules are principally written as:

<S> ::= A1 | A2 | A3 ... AN

where A1..AN are alternatives separated by the |. The symbol S on the left-hand side of ::= must be replaced by one of the alternatives on the right-hand side. Alternatives consist of terminals and non-terminals. A terminal is an atomic string that cannot be broken down. They terminate the production process. On the other hand, a non-terminal recursively refers to a rule. Here's a trivial example BNF grammar that specifies an integer number:

<integer> ::= <digit> | <integer> <digit>
<digit>   ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

In plain English, this example can be read as: "An integer is a single digit, or, an integer followed by a single digit. A digit is one of 0 or 1 or 2 or 3 ... 9." Symbols in angle brackets, <digit> and <integer>, are non-terminals while 0..9 are terminals.

X*  The Kleene star. X can appear zero or more times.
X+  X can appear one or more times.
X?  Make X optional, X can appear zero or one time.

To simplify the integer rule, it can be written in EBNF like so:

integer ::= digit+
digit   ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' |
            '8' | '9'