C/C++

A Generic Parsing Engine in C++

, August 01, 1995

The generic parser our authors present requires no specific knowledge of the source language and is independent of the underlying application.

SP95: A Generic Parsing Engine in C++

A Generic Parsing Engine in C++

A portable engine for parsing using regular expressions

Todd D. Esposito and Andrew K. Johnson

Todd is a systems architect with Prodis Inc., a systems integrator and custom-software development firm in Carol Stream, Illinois. Andrew is a development engineer with Prodis. They can be reached at 708-462-8600 or on CompuServe at 70661,2717.

In building applications and user/management-level tools for our clients, we have found a recurring need for parsing technology. Whether for updating WIN.INI or running complicated macros within a custom applications, parsing keeps rearing its head. Our first few parsers were fitted to the content being parsed, with little or no room for deviation, and each new project started almost completely from scratch.

After a few such projects, we found a better way. The result is the parsing engine presented in this article: a generic parser, requiring no specific knowledge of the source language, and with no ties to the underlying application. The engine is a collection of five objects. The interface is almost exclusively contained in two objects, and then only in a few member functions, so it is easily integrated into any project. In this article, we will examine the engine's design and operation, and use it to implement a Basic-like macro language.

A Structural Overview of the Parsing Engine

The parsing engine is generic in that it contains no application- or language-related code. The parser's five classes are each fitted to a particular aspect of the parsing strategy. The relationships between these objects are outlined in Figure 1. The application using the engine typically will create a gpFlowControl object and feed it the parameters needed to construct Syntax objects, which encapsulate the language constructs. This initializes the engine. The gpFlowControl object creates all the other objects and converts the supplied parameters to the appropriate object type. The application then sends input lines to the engine. The engine returns tokens, telling the application the meaning of the input, and extracts any parameters from the input.

Starting from the top, we encounter the gpFlowControl class, which acts as traffic cop, ensuring that the correct code is executed in the proper order. Most of your application's calls will be to this object, which then passes control to its embedded gpParser. On the way back to your application, the gpFlowControl object takes a peek at the token the gpParser returned and decides if it needs to intercede. This will happen in the case of an If/Then/Else or other control structure.

Each possible command in your macro language must be paired with a "token," which is just a more-convenient representation of the command. The gpParser class substitutes tokens for the associated input patterns it receives. gpParser "learns" your language by way of the AddSyntax() member function, which takes as parameters either a string and a token, or a Syntax object. (gpFlowControl's AddSyntax() function simply passes its parameters to the embedded gpParser.) Syntax objects encapsulate the pairing between syntax and token. Note the syntax/token relationship is not necessarily one-to-one: One token can represent many different syntactic constructs.

The gpRegExp class implements regular expressions (string-matching templates), and is derived from gpString. gpRegExp overloads two key functions: the constructor and the equality operator (==).

The gpString class handles character-string manipulation and provides overloaded operators to make using strings easier (see Listing One). We built this class before the ANSI string class was sanctioned, and we still prefer ours to the ANSI version. It should be a simple matter to retrofit ANSI strings into the system. However, since we use our gpString class in all of our projects (rather than the ANSI class), this has not been a priority.

The Macro Language

Microsoft has popularized the use of Basic as a macro language for applications. While we won't comment on the wisdom of that practice, the Basic syntax structure provides a reasonable example of how to use our parsing engine. The result is the shell of a full-fledged macro language, ready to be embedded into an application. Due to space constraints, our discussion will focus on the operation of the parsing engine, leaving the implementation of the Basic constructs (such as variable typing and storage and retrieval) for you to explore.

Before we begin, we need to design and analyze our Basic syntax. The simplest macro selects a menu or invokes an application process; see Example 1(a). Example 1(b) shows how to call a function in Basic. Example 1(c) shows a more complex macro demonstrating conditional execution, and Example 1(d) uses a loop to extend Example 1(c), allowing the user to try again.

Regular Expressions Unveiled

Regular expressions are tremendously powerful. They help propel the UNIX shells (sh, csh, ksh) and text-processing utilities (sed, grep, awk, perl, and so on) to an unparalleled level of sophistication. Regular expressions, as anyone who has worked with UNIX knows, are string templates. A regular expression matches a string if the string's content fits into the regular expression's template. The DOS command line, DIR *.BAT, for example, uses a limited form of regular expressions. In this case, "*.BAT" is the regular expression, and the names of the files are the content being matched. Normally, a regular expression is not position sensitive and will match a line if any substring of the line matches. For example, the regular expressions "Goodbye," "Cruel" and "World!" will each match "Goodbye Cruel World!" This is useful, since we don't have to know a string's entire content to "find" it. Table 1 provides a description of various wildcards.

The parsing engine makes constant use of regular expressions. In fact, most of its execution time is spent inside a member function of the gpRegExp class. We implemented regular-expression matching in two steps: decomposition in the gpRegExp class's constructor, and pattern matching in the overloaded equality operator. When you create a gpRegExp object, it breaks down each regular expression into several smaller atomic expressions, each of which is classified. Figure 2 depicts how a sample regular expression is decomposed. This process creates a gpRegExp object for each atom in the expression, arranged in a typical linked list. A private member function, ParseAtoms(), serves as a common initializer for several different overloads of the constructor; see Listing Two. ParseAtoms() performs the actual decomposition of the expression in an almost recursive manner.

ParseAtoms() removes characters from its input and stores them in its internal string. It does this one character at a time, until the end of the string or some special character is reached (with two exceptions, "^" and "&," which are explained later). Once ParseAtoms() reaches a special character, it cuts the input string into two pieces: One, it retains and classifies; the other, it uses to construct the next atom, which will in turn go through the same process.

Classification of the atom is all-important, because it drives the pattern-matching engine. ParseAtoms() classifies each atom for content. These classifications are defined by the enumerated type ExpType_T. For example, Literal is a string literal that must match exactly, Wild is a wildcard matching any single character, and so on. Additionally, ParseAtoms() will classify an atom in terms of positional requirement, based on the "^" and "$" special characters, and will mark gpRegExp as meaningful if the "&" character precedes it. (Meaningful, in this sense, indicates the matched portion should be saved for later retrieval, because the calling application will need to know exactly what was matched.)

The second parameter, called "top," is a curious detail. Top controls whether the gpRegExp object is marked as a top-level (or root) atom in the linked list, and determines whether or not to handle the "^" special character. Normally, this parameter is not supplied, except that ParseAtoms() always sets it to 0 (no) when constructing the next atom. It defaults to 1 (yes), so the programmer need not think about it. In fact, it is important not to change this parameter.

Once ParseAtoms() has completed decomposition, the gpRegExp object is ready. Its primary purpose is to match things with the overloaded equality operator. The process starts at the root and travels down the list of atoms. At each level, the gpRegExp object tries to match its input. If it succeeds, it passes what remains unmatched to the next atom. If that atom returns failure, so does the caller. Thus, the root can only succeed if all of the atoms in the list succeed.

Matching is based on the atom's ExpType_T, as determined at construction time. The rules for the types Literal, Wild, Range, and Optional are simple; the Multi-0 and Multi-1 variants, however, bear investigation. Both of these attempt to match as much of the input as possible. If the next atom returns failure, the object will step back one character in the match, and try again. This process ends when the next atom returns success, or the Multi atom can no longer back up because too few characters were matched. Note that the Multi-0 types, which specify zero or more occurrences of a given pattern, will match an empty string, whereas the Multi-1 types consume at least one character.

The Parser Itself

The parser learns about the target language by means of a syntax table, which is a list of Syntax objects. These Syntax objects each contain a gpRegExp and a token. The gpRegExp defines the syntax of the command. The token indicates which command it represents and is usually put through a switch construct by the application to dispatch the proper handler, much like a WM_ message in the Windows WndProc function.

The gpParser class takes all of the credit without doing any of the work. The constructor merely stores the SyntaxList that the caller passed in. The Parse() function does little more than pass its parameter to the SyntaxList's Seek() function; see Listing Three. Seek() simply walks through the list of Syntax pattern objects, asking each if it is equal to the parameter. The Seek() function doesn't know that regular-expression matching is going on, since it just uses the equality operator. If Seek() returns success, Parse() returns the token associated with the matching syntax pattern. This portion of the code shows the true power of overloaded operators in C++.

Constructing the Language

So, how do we use this to construct a macro language? First, we must construct our syntax table. In our example, the DoCmd command takes only one form; see Example 2.

This regular expression indicates we're looking for a line beginning with the literal "DoCmd DoMenuItem" and we want to capture the three possible comma-separated parameters. We also have to assign this command a token, so we choose to define this to be TK_DOMENUITEM, and give it an arbitrary value of (TK_USERDEF + 1) with a #define. We use TK_USERDEF as a base point, so we don't conflict with any predefined tokens.

We continue this process for each command available in our language. When finished, we have produced the code in Listings Four and Five. One word of caution: The most-specific syntax patterns must precede the more-general patterns so that a general pattern (such as Call +&.*(&.*)) does not get matched when a more-specific one (such as Call +MessageBox *(&.*)) was called for.

Listing Six, syntoken.h, contains definitions for tokens such as TK_IF and TK_USERDEF. In most cases, your application will have to handle predefined and user-defined tokens. Table 2 lists the predefined tokens and what they represent.

After calling the BuildMacroEngine() function in Listing Five, read lines from the input file and pass them to the gpFlowControl object that BuildMacroEngine() returned. Do this by calling its Parse() function, sending in the line to parse and an empty StringList, which Parse() will fill with the command's parameter before returning. Parse() will return the token associated with this command.

Handling expressions, such as those between If and Then, is a simple matter of constructing a gpParser object that knows the rules of the expression's grammar. This is demonstrated in the BuildExpEngine() and EvalExp() functions in Listing Five. The gpFlowControl::Parse() function returns TK_IF, TK_WHILE, or a similar token when an expression has to be evaluated before flow control can be determined. In this case, the expression should be in the first element of StringList, and can be passed directly to EvalExp(), which passes it into the gpParser::Parse() function. Breaking down expression syntax adequately can make this process quite powerful. Once gpParser::Parse() establishes a truth value for the expression, the application calls the gpFlowControl::PostExpressionValue() function. This sets the state of the gpFlowControl object and determines how the flow-control construct will be handled.

Conclusion

The parsing engine presented provides a very powerful and generalized solution. We have used it in products from simple utilities to embedded-application scripting languages. Its implementation in C++ makes good use of inheritance and polymorphism. We have attempted to keep the engine pure so that it will port easily. In fact, we used it under DOS, Windows 3.x, Win32s, Windows NT, and OS/2 platforms (in all cases, using the Borland C++ compiler).

However, the engine does have some limitations. The gpRegExp class does not implement all of the UNIX-style, regular-expression special characters and functionality. We have designed the engine with this in mind, but have yet completed this work. And since our gpRegExp class makes all comparisons case insensitive, operations such as case-sensitive variable naming may not be supportable. Also, the gpFlowControl object does not handle flow-control constructs such as Switch/Case and For/Next. These would be relatively easy to add.

Example 1: (a) Basic statement that invokes a menu command; (b) Basic function calls; (c) typical If/Then/Else construct; (d) While loop.

(a)
DoCmd DoMenuItem FILE, OPEN, "C:\DIRNAME\FILENAME.EXT"

(b)
Call MessageBox ("Invalid Type Code!", 0, "Error")

Call RecalcBalances ()

(c)
If FileName$ > "" Then

DoCmd DoMenuItem FILE, OPEN, FileName$

Else

Call MessageBox ("You must type a file name!", 0, "Error")

Endif

(d)
Let GotFile = 0

While GotFile = 0

Call GetFileName (FileName$)

If FileName$ > "" Then

DoCmd DoMenuItem FILE, OPEN, FileName$

Call RecalcBalances ()

Let GotFile = 1

Else

Call MessageBox ("You must type a file name!", 0, "Error")

Endif

WEnd

Example 2: A syntax rule specified via a regular expression, consisting of a string literal followed by three comma-separated parameters.

DoCmd *DoMenuItem *&[A-Z]+ *, *&[A-Z]+ *, *&.*

Figure 1: Structural overview of the parsing engine.

Figure 2: Decomposing the regular expression ^_*If +&.+ +Then$. This implements the expression If <<expression>> Then.

Table 1: Wildcards and their use.


<b>Wildcard     Description</b>
.     Period matches any one character. Using this character in
      the expression "HEL." will match "HELP," "HELL," "HELD,"
      and so on. The <I>ExpType_T</I> value
      <I>Wild</I> represents this wildcard.

*     Asterisk matches zero or more of the previous character.
      This is useful if you don't know how many spaces are
      between two words, such as in: "Goodbye Cruel World!" or
      "GoodbyeCruel     World!" The regular expression to match
      this would be "Goodbye *Cruel *World!" Asterisk can be
      used to create <I>MultiWild0</I>, <I>MultiChar0,</I>
      and <I>MultiRange0</I> when paired with a period,
      regular character, or Range operator, respectively.

+     Plus matches one or more of the previous characters. This
      is useful if you don't know how many spaces are between
      two words, but know there must be at least one, such as
      in: "Goodbye Cruel World!" or "Goodbye     Cruel World!"
      The regular expression to match this would be
      "Goodbye+Cruel+World!". Note that this would not match
      "GoodbyeCruel     World!". Plus can be used to create
      <I>MultiWild1</I>, <I>MultiChar1,</I> and
      <I>MultiRange1</I> when paired with a period,
      regular character, or Range operator, respectively.

[x-y]     The Range operator pair, "[" and "]" provide for
      matching a specific character or range of characters. For
      example, if you are looking for "ABC" or "BBC" but not
      "CBC" or any other, you would use "[A-B]BC" as your
      regular expression. Range defines the <I>ExpType_T</I>
      for this construct.

{ s }     The Optional operator pair, "{" and "}" allow you to
      specify a string that may or may not be there, such as in
      "Goodbye {Cruel} World!", which matches "Goodbye Cruel
      World!" and "Goodbye World!"  Braces result in an
      <I>ExpType_T </I>of <I>Optional</I>.

&     Ampersand indicates the atom to follow is meaningful, and
      should be retained. A parsing application uses ampersand
      to extract parameters from commands. For example, if we
      need to know the name of the file passed to the File Open
      command, our regular expression is "File Open &.+"
      Whatever is matched by the ".+" atom is saved, and will
      be passed back from <I>gpFlowControl::Parse()</I>
      in the <I>lsParms</I> List object. When using
      the <I>gpParser</I> object directly, you can get
      these parameters with a call to
      <I>gpParser::DumpParameters()</I>.

^     Caret matches beginning of line. If this character is the
      first character of the regular expression, it makes the
      expression position sensitive. "^Good" would match
      "Goodbye Cruel World!" but "^Cruel" and "^World" would
      not. If caret occurs anywhere else in the regular
      expression, it is treated as a literal (technically it
      should be quoted, but we're a bit more relaxed than UNIX).

$     Dollar matches end of line. Like caret, a dollar as the
      last character makes the expression position sensitive.
      So, "Good$" and "Cruel$" would not match "Goodbye Cruel
      World!" but "rld!$" would. Unlike caret, it cannot appear
      unquoted anywhere in the regular expression except at the
      end. Dollar effectively ends the regular expression,
      making any characters following it drop into the bit
      bucket.

\     Backslash is the "quote" character. If you need to find a
      special character in a file, such as when looking for
      "$10,000,000 Winner" in your e-mail, you have to quote
      the $ so that <I>gpRegExp </I>doesn't try to
      interpret it. A dollar in the first position of an
      Expression won't find the match. In this case, you would
      use "\$10,000,000 Winner" as your regular expression.

Table 2: Predefined tokens and their uses.


<b>Token         Description</b>
TK_UNRECOGNIZED      No match was found, meaning that a syntax
                     error has occurred in the input stream.
TK_NOOP              No-Operation. This is a comment, or no
                     action should be taken because we're inside
                     a not-taken Then or Else clause or a failed
                     While loop.
TK_REWIND            Rewind the input stream (loop back to a 
                     predetermined point). The line number is in
                     the first parameter.
TK_IF                Part of an IF structure was found. Normally,
TK_ELSE              only TK_IF requires action: An expression
TK_ENDIF             needs to be evaluated. The parameter list
TK_LABEL             contains the expression. TK_LABEL is usually
TK_GOTO              treated as a NOOP, but your application may
                     want to cache this location for faster
                     rewinding later. When TK_GOTO is encountered,
                     the label should be in the parameter list.
TK_WHILE             TK_WHILE begins a While loop. The expression
TK_ENDWHILE          needing evaluation is in the parameter list.
                     TK_ENDWHILE should be treated like TK_REWIND.
TK_COMMENT           A comment was encountered. This token is
                     included for convenience.
TK_EQUALS            These tokens are used in evaluating
TK_NOT_EQUAL         comparative expressions and are included
TK_GREATER_THAN      for convenience.
TK_LESS_THAN
TK_GREATER_OR_EQUAL
TK_LESS_OR_EQUAL
TK_AND               These tokens are used in evaluating logical
TK_OR                expressions and are included for convenience.
TK_NOT
TK_UNMATCHED_ELSE    These tokens are returned whenever a
TK_UNMATCHED_ENDIF   TK_ELSE, TK_ENDIF or TK_ENDWHILE is
                     encountered, and it is not matched
                     with a corresponding TK_IF or TK_WHILE.
TK_USERDEF           Acts as the base point for application-
                     specific verbs.

Listing One

//------------------------------------------------------------------
// gpString.h - Declaration of the gpString class.
// Copyright 1994 Prodis Incorporated.
// Architect: TDE
// Developer: AKJ
//------------------------------------------------------------------
#ifndef GPSTRING_H
#define GPSTRING_H
#include <string.h>
#include <stdlib.h>
#include <ctype.h>
#define NPOS 32000
typedef unsigned size_t;
// StripType is used with ::Strip()
enum StripType { Trailing, Leading, Both, All };
// Direction and Inclusive are used with ::FindString() and ::FindChar()
enum Direction { Forward, Backward };
enum Inclusive { Of, NotOf };
class gpString
  {
    protected :
      char *cText;
      size_t nSize;
      int CharIsOfString(char cChar, char *cBuffer);
    public :
    // Constructors:
      gpString (const char *cString);
      gpString (char cChar);
      gpString (size_t nNewSize, char cFill = ' ');
      gpString ( );
      gpString (gpString &sString);
    // Destructor:
      ~gpString ( );
    // Size-related functions:
      size_t Length ( ) {return strlen(cText);}
      void Resize (size_t nNew);
      size_t Size ( ) {return nSize;}
    // Case Conversion - returns a copy:
      gpString ToUpper ( );
      gpString ToLower ( );
    // Assignment Operators:
      // Return the character in the string at a given offset.
      char &operator[] (int nPos) {return *(cText+nPos);}
      // Type conversion to char *.
      operator char *( );
      // Assign one string to another.
      gpString &operator= (gpString &oString);
      // Append another string to this one.
      gpString &operator+= (gpString &oString);
      // Concatenate two strings.
      gpString operator+ (gpString &oString);
    // Relational operators:
      // Compare two strings for equality.
      virtual int operator== (gpString &oString);
      virtual int operator== (char *cString)
            {return operator==(gpString(cString));}
      // Compare two strings for inequality.
      int operator!= (gpString &oString);   
      int operator!= (char *cString) 
            {return operator!=(gpString(cString));}
      // Compare two strings for exclusive alphanumeric precedence. 
      int operator< (gpString &oString);        
      int operator< (char *cString)
            {return operator<(gpString(cString));}
      // Compare two strings for exclusive alphanumeric antecedence.
      int operator> (gpString &oString);
      int operator> (char *cString)  
            {return operator>(gpString(cString));}
      // Compare two strings for inclusive alphanumeric precedence. 
      int operator<= (gpString &oString);
      int operator<= (char *cString) 
            {return operator<=(gpString(cString));}
      // Compare two strings for inclusive alphanumeric antecedence.
      int operator>= (gpString &oString);
      int operator>= (char *cString) 
            {return operator>=(gpString(cString));}
    // Search Functions:
      // Find the target substring within this string.
      size_t FindSubstring (gpString &sTarget
              , Direction dDirect = Forward, size_t nStart = 0);
      // Find any character of the target string.
      size_t FindChar (Inclusive iIsOf, gpString &sTarget
              , Direction dDirect = Forward, size_t nStart = 0);
    // Edit Routines:
      // Insert the new string at the given position.
      gpString &Insert (gpString &sNew, size_t nPos);
      // Remove nSize characters starting with position nPos.
      gpString &Remove (size_t nPos, size_t nSize = NPOS);
      // Replace nSize characters with the new string.
      gpString &Replace (size_t nPos, size_t nSize, gpString &sNew);
      // Remove the given characters at the given positions.
      // We default to stripping leading and trailing whitespace.
      gpString &Strip (StripType s = Both
              , gpString &sTarget = gpString (" \t"));
  };    
#endif

Listing Two

//------------------------------------------------------------------
// gpRegExp.cpp - Definition of the gpRegExp class.
// Copyright 1994 Prodis Incorporated.
// Purpose: The General Purpose Regular Expression class handles
//          pattern matching duties.
// Architect: TDE
// Developer: AKJ
// Modification History:
//      09/08/94 TDE: Original Code.
//      09/09/94 AKJ: Fixed TDE's stuff, and fleshed out functions.
//      01/17/95 AKJ: Added support for +.
//      02/22/95 AKJ: Minor fix to Opional to allow fallback.
//------------------------------------------------------------------
#include <stdlib.h>
#include <gpregexp\gpregexp.h>
#include <gpstring\gpslist.h>
#define LOWER_BOUND 1
#define UPPER_BOUND 3
// Create a gpRegExp object from a character string.
gpRegExp::gpRegExp (const char *cNewText)
        : gpString (cNewText)
  {
    nDoICount = 0;
    fTopLevel = top;
    NextAtom = 0;
    ParseAtoms ();
  }
gpRegExp::gpRegExp (char cChar, int top)
        : gpString (cChar)
  {
    nDoICount = 0;
    NextAtom = 0;    
    fTopLevel = top;
    ParseAtoms ();
  }
gpRegExp::gpRegExp (int top) : gpString ( )
  {
    nDoICount = 0;
    NextAtom = 0;    
    fTopLevel = top;
    ParseAtoms ();
  }
gpRegExp::gpRegExp (const gpString &s, int top)
        : gpString (s)
  {
    nDoICount = 0;
    NextAtom = 0;    
    fTopLevel = top;
    ParseAtoms ();
  }
gpRegExp::~gpRegExp ( )
  {
    if (NextAtom) delete NextAtom;
  }
gpRegExp &gpRegExp::operator= (gpString &oString)
  {
    if (NextAtom) delete NextAtom;
    NextAtom = 0;
    fTopLevel = 1;
    (*this) = oString;
    ParseAtoms ();
              
    return *this;
  }
gpRegExp &gpRegExp::operator= (char *cString)
  {
    if (NextAtom) delete NextAtom;
    (*this) = (cString);
    NextAtom = 0;
    fTopLevel = 1;
    ParseAtoms ();
                 
    return *this;
  }
//-------------------------------------------------------------
void gpRegExp::ParseAtoms ( )
  {
    int nPos = 0;
    int GotToken = 0;
    gpString *copy;
    ExpType = Literal;
    size_t nOffset;
    
    // Release the previous atoms and reset parameters.
    // This is ususally only for when an assign is done.
    firstOnly = 1;
    lastOnly = 0;
    if (fTopLevel)
      {
        // First, optimize by removing extraneous '.*'s 
        if (FindSubstring ("^.*") == 0)
          {
            firstOnly = 0;
            Remove (0, 3);
          }  
        else if (FindSubstring (".*") == 0)
          {
            firstOnly = 0;
            Remove (0, 2);
          }
        // Next, check for "Beginning-of-Line"
        if (*this[0] == '^')
            Remove (0, 1);
        else
            firstOnly = 0;
      }
    // Strip out the first atom in the string.
    copy = new gpString (cText);
    while (nPos < Length() && ! GotToken)
      {
        switch ((*this)[nPos])
          {
          case '\\':                  // We have to Quote the next
            Remove (nPos, 1);         // character, so remove the
            copy->Remove (nPos, 1);   // slash and get the char.
            nPos++;
            break;
            
          case '.':                   // if we get a '.'
            if ((nPos) == 0)          // and it's the first char
              if ((*this)[1] == '*')  // and it's followed by a '*'
                {
                  Remove (2);         // Then we have a '.*' token.
                  copy->Remove (0, 2);
                  GotToken = 1;
                  ExpType = MultiWild0;
                }
              else if ((*this)[1] == '+')  // if followed by '+'
                {
                  Remove (2);         // Then we have a '.+' token.
                  copy->Remove (0, 2);
                  GotToken = 1;
                  ExpType = MultiWild1;
                }
              else
                {
                  Remove (1);          // we have a plain
                  copy->Remove (0, 1); // old '.' token.
                  GotToken = 1;
                  ExpType = Wild;
                }
            else
              {
                Remove (nPos);         // we have a literal token.
                copy->Remove (0, nPos);
                GotToken = 1;
                ExpType = Literal;
              }
            break;
          case '*':                 // if we get a '*'
            if (nPos == 1)          // and it's the second character
              {
                Remove (2);         // The we have a <char>* token.
                copy->Remove (0, 2);
                GotToken = 1;
                ExpType = MultiChar0;
              }
            else
              {
                Remove (nPos - 1);  // Or, we have a literal token.
                copy->Remove (0, nPos - 1);
                GotToken = 1;
                ExpType = Literal;
              }
            break;
          case '+':                 // if we get a '+'
            if (nPos == 1)          // and it's the second character
              {
                Remove (2);         // The we have a <char>+ token.
                copy->Remove (0, 2);
                GotToken = 1;
                ExpType = MultiChar1;
              }
            else
              {
                Remove (nPos - 1);  // Or, we have a literal token.
                copy->Remove (0, nPos - 1);
                GotToken = 1;
                ExpType = Literal;
              }
            break;
          case '$':                   
            Remove (nPos);          // the buck stops here.
            copy->Remove (0);       // And we won't have any kids.
            lastOnly = 1;
            GotToken = 1;
            ExpType = Literal;
            break;
          case '[':                 // if we get a '['
            if ((nPos) > 0)         // and it's NOT the first char
              {
                Remove (nPos);      // we have a literal
                copy->Remove (0, nPos);
                ExpType = Literal;
                GotToken = 1;
              }   
            else                    // or we are beginning a range.
              nPos++;
            break;
          case ']':                 // when we get ']'
            if ((*this)[nPos + 1] == '*') // we may have [...]*
              {
                Remove (nPos + 2);
                copy->Remove (0, nPos + 2);
                GotToken = 1;
                ExpType = MultiRange0;
              }
            else if ((*this)[nPos + 1] == '+') // we may have [...]+
              {
                Remove (nPos + 2);
                copy->Remove (0, nPos + 2);
                GotToken = 1;
                ExpType = MultiRange1;
              }
            else        
              {
                Remove (nPos + 1);        // or just plain old [...]
                copy->Remove (0, nPos + 1);
                GotToken = 1;
                ExpType = Range;
              }
            break;
          case '{':                 // we have a brace
            if ((nPos) > 0)         // and it's NOT the first char
              {
                Remove (nPos);      // we have a literal
                copy->Remove (0, nPos);
                ExpType = Literal;
                GotToken = 1;
              }   
            else                    // or we are beginning an 
                                    // Optional expression
              {
                nOffset = FindChar (Of, "}");
                copy->Remove (0, nOffset+1);
                Remove (nOffset);
                Remove(0, 1);
                
                while ( (nOffset = FindChar (Of, "|")) != NPOS )
                  {
                    (*this)[nOffset] = '\0';
                    lrChildren.AddItem (
                            new gpRegExp (*this, 0) );
                    Remove (0, nOffset+1);  
                  }
                lrChildren.AddItem (
                            new gpRegExp (*this, 0) );
                GotToken = 1;
                ExpType = Optional;
              }
            break;
          case '&' :                // if we get ampersand
            if (nPos > 0)           // and we've already got an atom
              {
                GotToken = 1;       // then we stop where we are
                ExpType = Literal;
                Remove (nPos);
                copy->Remove (0, nPos);
              }
            else
              {                     // otherwise, we are starting
                nDoICount = 1;      // a meaningful atom.
                Remove (0,1);
                copy->Remove (0,1);
              }
            break;      
          default: 
            // Just copy the character.
            nPos++;
          }
      }
    // Pass the rest along to the next atom.
    if (GotToken && (*copy != ""))
        NextAtom = new gpRegExp (*copy, 0);
        // Flag this guy as NOT top level.
      
    if (copy)
        delete copy;  
  }
//------------------------------------------------------------------
//  Routine : operator== (gpString)
//  Function : sees if the given regular expression matches this gpString.  
//  Notes : Normally, the gpString may be longer than the regular
//          expression;  the comparison ends with the last character
//          in that expression.  However, if the last character of
//          the expression is '$', then an exact match is called
//          for, and the gpString may not have extra characters.
//------------------------------------------------------------------
int gpRegExp::operator==(gpString &sExpress)
  {
    int nMin = 0;
    int lMatch = 0;     // Assume that the match will fail.
    int nPos;
    gpString sBuffer;
    char cBuffer;
    gpString sStringSaver;
    
    if (fTopLevel)
        sStringSaver = sExpress;
    sLastMatch = "";
    switch (ExpType)
      {
        case Literal     : 
            nPos = sExpress.FindSubstring (cText);
            if ((firstOnly && !nPos)||(!firstOnly && (nPos != NPOS)))
              {
                sLastMatch = cText;
                sExpress.Remove (0, nPos);
                sExpress.Remove (0, Length () );
                lMatch = match_remainder (sExpress);
              }
            break;
        case MultiChar1  :
            nMin = 1;       // set our min chars to 1 and fall thru
        case MultiChar0  :
            nPos = sExpress.FindChar(NotOf,(*this)[0]);
            if (nPos == NPOS)
                nPos = sExpress.Length ();
            lMatch = DecrementingMatch(nMin, nPos, sExpress);    
            break;
        case Wild        : 
            if (sExpress.Length () )
              {
                sLastMatch = sExpress[0];
                lMatch = match_remainder(sExpress.Remove (0, 1) );
              }  
            break;
        case MultiWild1  : 
            nMin = 1;       // set our min chars to 1 and fall thru          
        case MultiWild0  : 
            nPos = sExpress.Length ();
            lMatch = DecrementingMatch(nMin, nPos, sExpress);    
            break;
        case Range       : 
            cBuffer = sExpress[0];
            cBuffer = toupper (cBuffer);    
            if ( (cBuffer >= toupper(cText[LOWER_BOUND])) && 
                 (cBuffer <= toupper(cText[UPPER_BOUND])))
              {
                lMatch = match_remainder(sExpress.Remove (0,1));
                sLastMatch = cBuffer;
              }  
            break;
        case MultiRange1 : 
            nMin = 1;       // set our min chars to 1 and fall thru
        case MultiRange0 : 
            for (nPos = 0; 
                 (toupper(sExpress[nPos]) >= 
                      toupper(cText[LOWER_BOUND])) && 
                 (toupper(sExpress[nPos]) <= 
                      toupper(cText[UPPER_BOUND]));
                 nPos++
                );
            lMatch = DecrementingMatch(nMin, nPos, sExpress);       
            break;                   
        case Optional  :
          { 
            gpString sBuffer(sExpress);
            if (lrChildren.Seek (sExpress))
                sLastMatch = lrChildren.Peek()->LastMatch();
            if ((lMatch = match_remainder (sExpress)) == 0)
              {
                sLastMatch = "";
                lMatch = match_remainder (sBuffer);
              }
          }    
      }
    if (fTopLevel)
        sExpress = sStringSaver;
    return (lMatch);            
  }
// These helpers will keep trying to match a Multi-type atom.
// First, try to match as much as possible, then try to match
// the next atom.  If that atom succeeds, good.
// If not, we need to decrement our match string by one character
// and retry.  We do this until we have reached our minimum chars.
int gpRegExp::DecrementingMatch(int nMin, int nPos, gpString &sExpress)
  {
    int lMatch = 0;
    gpString sBuffer;
    
    for (; !lMatch && (nPos >= nMin); nPos--)
      { 
        sBuffer = sExpress;
        sBuffer.Remove(0, nPos);
        lMatch = match_remainder(sBuffer);
      }  
    if (lMatch)
      {
        sLastMatch = sExpress;
        sLastMatch.Remove(++nPos);
      }
    return lMatch;         
  }
int gpRegExp::match_remainder (gpString &sExpress)
  {
    int lMatch = 1;
    if (lastOnly)
      {
        if (sExpress.Length () )
            lMatch = 0;
      }
    else
      { 
        if (NextAtom)
            lMatch = ((*NextAtom) == sExpress);
      }    
    return lMatch;        
  }
//------------------------------------------------------------------
//  Routine : DumpParameters
//  Function : travers atom tree, creating list of meaningful parameters.
//------------------------------------------------------------------
void gpRegExp::DumpParameters (StringList &lsParms)
  {
    if (nDoICount)
        lsParms.AddItem (new gpString (LastMatch () ));  
    if (NextAtom)
        NextAtom->DumpParameters (lsParms);
  }
//------------------------------------------------------------------
// The following code implements a List of gpRegExp's
// We keep it here because the Optional atoms use it.
//------------------------------------------------------------------
RegExpList::RegExpList ( ): List ()
  {
    pFirst = pLast = pCurrent = 0;
  }
RegExpList::~RegExpList ( )
  {
    for (gpRegExp *eRegExp = Reset ();
         eRegExp;
         eRegExp = GetNext ()
        )
         delete eRegExp;            
  }
gpRegExp *RegExpList::Seek (gpString &sName)
  {
    int lFound = 0;
    Reset ();
    while (!lFound && pCurrent)
      {
        if (*(Peek ()) == sName )
            lFound = 1;
        else
            GetNext ();
      }
    return pCurrent;
  }
gpRegExp *RegExpList::Seek (char *cName)
  {
    gpString sName (cName);
    return (Seek (sName) );
  }
gpRegExp *RegExpList::Reset ( ) 
  {
    return (pFirst = pCurrent = (gpRegExp *)List::Reset () );
  }
gpRegExp *RegExpList::GetNext ( ) 
  {
    return (pCurrent = (gpRegExp *)List::GetNext () );
  }
gpRegExp *RegExpList::AddItem (gpRegExp *eNew) 
  {
    pLast = pCurrent = (gpRegExp *)List::AddItem (eNew);
    if (!pFirst)
        pFirst = pLast;
    return (pCurrent);  
  }
gpRegExp *RegExpList::Peek ( ) 
  {
    return pCurrent;
  } 
gpRegExp *RegExpList::Seek (int nSequence)
  {
    if (nListSize < nSequence)
        return 0;
    Reset (); 
    for (int i = 1; i < nSequence; GetNext (), i++);
    return pCurrent;      
  }
gpRegExp &RegExpList::operator[] (int nSequence)
  {
    return (*(Seek (nSequence)));
  }
void RegExpList::Clear ( )
  {
    for (gpRegExp *eRegExp = Reset ();
         eRegExp;
         eRegExp = GetNext ()
        )
         delete eRegExp;
    List::Clear ();
    pFirst = pLast = pCurrent = 0;
  }

Listing Three

//------------------------------------------------------------------
// Syntax.cpp - Definition of the Syntax and SyntaxList classes.
// Copyright 1994 Prodis Incorporated.
// Purpose: The Syntax class pairs syntactic patterns (REs) with tokens (ints)
// Architect: AKJ
// Developer: AKJ
// Modification History:
//------------------------------------------------------------------
#include <stdlib.h>
#include <parser\syntax.h>
SyntaxList::SyntaxList ( ): List ()
  {
    pFirst = pLast = pCurrent = 0;
  }
SyntaxList::~SyntaxList ( )
  {
    for (Syntax *eSyntax = Reset ();
         eSyntax;
         eSyntax = GetNext ()
        )
         delete eSyntax;            
  }
Syntax *SyntaxList::Seek (gpString &sName)
  {
    int lFound = 0;
    Reset ();
    while (!lFound && pCurrent)
      {
        if ( (*(pCurrent->reSyntax)) == sName)
            lFound = 1;
        else
            GetNext ();
      }
    return pCurrent;
  }
Syntax *SyntaxList::Seek (char *cName)
  {
    gpString sName (cName);
    return (Seek (sName));
  }
Syntax *SyntaxList::Reset ( ) 
  {
    return (pFirst = pCurrent = (Syntax *)List::Reset () );
  }
Syntax *SyntaxList::GetNext ( ) 
  {
    return (pCurrent = (Syntax *)List::GetNext () );
  }
Syntax *SyntaxList::AddItem (Syntax *eNew) 
  {
    pLast = pCurrent = (Syntax *)List::AddItem (eNew);
    if (!pFirst)
        pFirst = pLast;
    return (pCurrent);  
  }
Syntax *SyntaxList::Peek ( ) 
  {
    return pCurrent;
  } 
//------------------------------------------------------------------
// Syntax object starts here:
Syntax::Syntax ()
  {
    reSyntax = 0;
    nToken = 0;
  }
Syntax::Syntax (gpRegExp *reExpress, int nNewToken)
  {
    reSyntax = reExpress;
    nToken = nNewToken;
  }
Syntax::Syntax (char *cExpress, int nNewToken)
  {
    reSyntax = new gpRegExp(cExpress);
    nToken = nNewToken;
  }

Listing Four

// BasicMac.h - define tokens for the BASIC-Macro language.
// We need only include the FLOW.H file; it includes everything else we need.
#include <parser\flow.h>
// Function prototypes:
int EvaluateExpression (gpString &sExpress);
gpFlowControl *BuildMacroEngine();
gpParser *BuildExpEngine( );
// These functions are prototyped, but not actually coded. These are just for 
// demonstration purposes, and need to be written in a production system.
int GetMacroLine (gpString &sLine);
int RewindMacroSource (int nLine);
void ErrorHandler();
void SetVariableToValue (gpString &p1, gpString &p2);
void DoMenuItem (gpString &p1, gpString &p2, gpString &p3);
void MessageBox (gpString &p1, gpString &p2, gpString &p3);
void CallFunctionWithParms (gpString &p1, gpString &p2);
int CheckEquals (gpString &p1, gpString &p2);
int CheckGreater (gpString &p1, gpString &p2);
int CheckLessThan (gpString &p1, gpString &p2);
// Define our Tokens:
// We base each token on TK_USERDEF to avoid conflicts with predefined tokens.
#define TK_DOMENUITEM     (TK_USERDEF + 1)
#define TK_ASSIGN         (TK_USERDEF + 2)
#define TK_MBOX           (TK_USERDEF + 3)
#define TK_CALL           (TK_USERDEF + 4)

Listing Five

// BasicMac.cpp - demonstration of using the Prodis Parsing Engine.
// Notes: We include BasicMac.h for our tokens, and it includes 
//   all other neccessary files.
#include "basicmac.h"
// Declare global variable, just for ease of demonstration
// flow control object for parsing input lines
gpFlowControl *fcMacEngine;
// parser object for parsing expressions
gpParser *pExpEngine;           
    
int main ( )
  {
    // Build the macro-language flow control object.
    fcMacEngine = BuildMacroEngine();
    // Build the expression parser object.
    pExpEngine = BuildExpEngine();
    gpString sLine;      // buffer for input lines
    int nLineNo, nToken; // line number, token
    StringList slParms;  // string list for holding parameters
    nLineNo = GetMacroLine(sLine); // Get the first line.
    
    // nLineNo will be zero when there are no more input lines.
    while(nLineNo)
      {
        // Parse the input line. The engine will return a token.
        nToken = fcMacroEngine->Parse(sLine, slParms);
        switch(nToken)
          {
            // no match was found for the line
            case TK_UNRECOGNIZED : 
                ErrorHandler();  //Send an error message
                break;
            case TK_ASSIGN :
                // first parm was the variable name
                // second parm was the value to set it to
                SetVariableToValue(slParms[1], slParms[2]);
                break; 
            case TK_IF :
            case TK_WHILE :
              { 
                // slParms[1] has the expression following the 'if'
                // or while. It must be parsed and evaluated.
                int IsTrue = EvalateExpression(slParms[1]);
                PostExpressionValue(IsTrue);
                break;
              }  
            case TK_DOMENUITEM :
                // The following function - not listed -
                // must implement the macro-language line :
                // DoCmd DoMenuItem [Parm1], [Parms2], [Parm3]
                DoMenuItem (lsParms[1], lsParms[2], lsParms[3]);
                break;
            case TK_MBOX :
                // The following function - not listed -
                // causes a Message Box to be displayed
                MessageBox(lsParms[1], lsParms[2], lsParms[3]);
                break;
            case TK_CALL :
                // The following function - not listed - 
                // Matches the first parameter with a 
                // function call and gets its arguments
                // from the second parameter.
                CallFunctionWithParms(lsParms[1], lsParms[2]);
                break;
            case TK_ENDWHILE :
                // When an ENDWHILE is returned, the first parameter
                // is the number of the line continaing the matching
                // WHILE clause, which will be re-evaluated.
                RewindMacroSource (atoi (lsParms[1]));
                break;
            case TK_ELSE :
            case TK_ENDIF :
                // We don't need to do anything here; gpFlowControl
                // will handle it all.  However, if we were 
                // interested, we could trap these lines.
                break;
          }
        nLineNo = GetMacroLine(sLine); // get the next line
      }
    return (1);           
  }
gpFlowControl *BuildMacroEngine( )
  {
    gpFlowControl fcNew = new FlowControl;
    
    fcNew.AddSyntax ("^ *If +&.+ +Then *$", TK_IF);
    fcNew.AddSyntax ("^ *Else *$", TK_ELSE);
    fcNew.AddSyntax ("^ *Endif *$", TK_ENDIF);
    fcNew.AddSyntax ("^ *Let &.+ *= *&.+ *$", TK_ASSIGN);
    fcNew.AddSyntax ("^ *While *&.* *$", TK_WHILE);
    fcNew.AddSyntax ("^ *WEnd *$", TK_ENDWHILE);
    fcNew.AddSyntax ("^ *Call *MessageBox *(&.+,&.+,&.+) *$" , TK_MBOX);
    fcNew.AddSyntax ("^ *Call *&.+ *(&.+) *$", TK_CALL);
    fcNew.AddSyntax ("^ *DoCmd +DoMenuItem +&.+,+&.+,+&.+ *$", TK_DOMENUITEM);
    return fcNew;
  }                                 
gpParser *BuildExpEngine( )
  {
    gpParser pNew = new gpParser;
    pNew.AddSyntax ("&.+ *= *&.+", TK_EQUALS);
    pNew.AddSyntax ("&.+ *< *&.+", TK_LESS_THAN);
    pNew.AddSyntax ("&.+ *> *&.+", TK_GREATER_THAN);
    return pNew;
  }
int EvalExp (gpString &sExpress)
  {
    int nReturn = 0;     //return value
    int nToken;          //token
    StringList lsParms;  //String list for parameters
    // parse the given line and get the token
    nToken = pExpEngine->Parse(sExpress, lsParms);
    switch (nToken)
      {
        //the line is : expression = expression
        //if the two expressions are equal, return TRUE
        case TK_EQUALS :
            if CheckEquals(lsParms[1], lsParms[2])
                nReturn = 1;
            break;
        //the line is : expression > expression
        //if the first expression is greater, return TRUE
        case TK_GREATER_THAN :         
            if CheckGreater(lsParms[1], lsParms[2])
                nReturn = 1;
            break;
        //the line is : expression < expression
        //if the second expression is greater, return TRUE
        case TK_LESS_THAN :                 
            if CheckLessThan (lsParms[1], lsParms[2])
                nReturn = 1;
      }
    return nReturn;  
  }

Listing Six

//------------------------------------------------------------------
// Syntoken.h - "System" Syntax tokens.
// Copyright 1994 Prodis Incorporated.
// Architect: AKJ
// Developer: AKJ
// Modification History:
//------------------------------------------------------------------
// Define "non-language" tokens:
#define TK_UNRECOGNIZED         0
#define TK_NOOP                 1
#define TK_REWIND               2
#define TK_COMMENT              3
// Define flow-control tokens:
#define TK_IF                   10
#define TK_ELSE                 11
#define TK_MISMATCHED_ELSE      12
#define TK_ENDIF                13
#define TK_MISMATCHED_ENDIF     14
#define TK_LABEL                15
#define TK_GOTO                 16
#define TK_WHILE                17
#define TK_ENDWHILE             18
#define TK_MISMATCHED_ENDWHILE  19
// Define Expression realational tokens:
// (These are just here because we ALWAYS seem to need them.)
#define TK_EQUALS               901
#define TK_NOT_EQUAL            902
#define TK_GREATER_THAN         903
#define TK_LESS_THAN            904
#define TK_GREATER_OR_EQUAL     905
#define TK_LESS_OR_EQUAL        906
#define TK_AND                  907
#define TK_OR                   908
#define TK_NOT                  909
// Define the base for "user-defined" (app-specific) tokens.
// Additional definitions should be of the form:
//      #define TK_SOMETOKEN  (TK_USERDEF + n)
#define TK_USERDEF              1000
End Listings

Previous 1 2 3 4 5

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

C/C++

A Generic Parsing Engine in C++

A Generic Parsing Engine in C++

A portable engine for parsing using regular expressions

Todd D. Esposito and Andrew K. Johnson

A Structural Overview of the Parsing Engine

The Macro Language

Regular Expressions Unveiled

The Parser Itself

Constructing the Language

Conclusion

Listing One

Listing Two

Listing Three

Listing Four

Listing Five

Listing Six

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

C/C++ Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

C/C++

A Generic Parsing Engine in C++

A Generic Parsing Engine in C++

A portable engine for parsing using regular expressions

Todd D. Esposito and Andrew K. Johnson

Related Reading

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

C/C++ Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content