Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Combining Boyer-Moore String Search with Regular Expressions


June 2000/Combining Boyer-Moore String Search with Regular Expressions


Overview

I recently needed code to do fast regular expression searching, so I started looking for articles on that subject. I found articles on regular expression searching, but the search algorithms were not that fast. In most cases they were "brute-force" implementations. Since regular expression searching is really specialized string searching, I thought that I may be able to find a fast text-searching algorithm that addressed regular expressions. I found some excellent text-searching algorithms but none addressed regular expressions. The information I found covering regular expressions was mostly in textbooks and required setting up a non-deterministic finite state machine from a context-free grammar. The information focused more on parsing and building the finite state machine than it did on text searching. The text search performed using regular expressions was done one character at a time.

I did not need anything that sophisticated, so I developed a regular expression search class that could use some sort of intelligent text-search algorithm. This class is called CRegEx; it employs the Boyer-Moore string-search algorithm. I have compiled and tested CRegEx under Windows using Visual C++ 6.0 and under AIX using xlC. (Note: some of the type names used in this article are artifacts of programming in a Windows environment. For example, an LPTSTR is a macro for a char *, a.k.a. "long pointer to string." However, in the online source listings I provide a Unix makefile that will make all those artifacts defined under Unix.)

The Boyer-Moore Algorithm

Before discussing the class CRegEx, I review the Boyer-Moore algorithm. I chose the Boyer-Moore algorithm because it is significantly faster than other string search algorithms and it is one of the easiest algorithms to implement. It compares characters from right-to-left, starting with the last character in the search pattern. The speed of the Boyer-Moore algorithm is attributed to how well it deals with characters that do not match. When a mismatch is detected, the algorithm checks if the non-matching character is in the search pattern. If it is not in the pattern, then the pattern is shifted over the entire length of the search pattern.

The example shown in Figure 1 illustrates the Boyer-Moore algorithm. (The 1's and 0's indicate match and failure to match respectively.) The first comparison in Figure 1 compares the space following "Mary" to the 'e' in the search pattern. Since a space does not exist in the pattern, M characters of the input text can be skipped, where M is equal to the length of the pattern. If the mismatched character exists in the pattern, as it does in the third comparison, where the 'e' is compared to the second 't' in "little", N characters can be skipped, where N is the distance between the mismatched character and the last character in the pattern. In this case, the distance between 't' and the last character of the pattern is 1. If the mismatch were on a 'w', four characters would have been skipped over. If there were multiple 't's, in the pattern, the number of characters skipped would still be 1 since the skip distance is calculated as the distance from the right-most 't' to the last character in the pattern.

There is a special case when the skip distance is greater than the distance between the mismatch character and the last character in the pattern. This special case occurs when the mismatch occurs to the left of where the mismatched character appears in the pattern. A good example is the eighth comparison, in which "white" is compared to " flee". The 'e's match, but the 't' fails when compared to the first 'e' in " flee". Since the 'e' has already been passed in the pattern, M-j characters can be skipped over, where j is the index of the character in the pattern that is being compared. In this case j is 3 (using a zero-based index), and the skip distance, M-j, is 2.

Using the Boyer-Moore algorithm and its ability to skip comparisons, the search in the above example required only 17 comparisons performed before a match was found. A brute force method would have taken 44 comparisons before a match was found.

The skip values for each character in the pattern are calculated prior to performing any string matching. They are stored in a table, which provides an entry for all possible characters. For single-byte character sets, the size of this table is 256 bytes. The table will need to be expanded for multibyte and Unicode character sets.

Regular Expression Syntax

Before a regular expression can be processed, the regular expression characters must be defined. The syntax supported by CRegEx contains a combination of Unix and DOS/Windows special characters. The following is a list of characters and their meanings:

? — Match zero or one character.

* — Match zero or more characters.

[] — Match any single character included within the brackets.

^ — Match must be at the beginning of the input buffer. This character must be the first character in the regular expression.

$ — Match must be at the end of the buffer. This character must be the last character in the regular expression.

\ — Literal character to match the character immediately following the backslash.

Any character in the pattern that is not listed above will be used to match exactly the character in the text being searched.

Implementing CRegEx

CRegEx has two constructors, CRegEx() and CRegEx(TCHAR *regex, bool begin=false, bool end=false). CRegEx() is the default constructor; it just initializes all the member variables. If you use the default constructor, you must call the CRegEx function SetRegEx(LPTSTR regex) to set the regular expression that is to be found in the input text. The other constructor enables you to specify the regular expression when the CRegEx object is created. The parameters begin and end can be used to force a match at the beginning and end of the input text respectively.

SetRegEx is responsible for compiling the regular expression and then initializing the skip table to be used for the Boyer-Moore string search. Compilation is performed by function Compile, which converts the regular expression into an array of type struct REGEX_CH. This struct is defined as follows:

struct REGEX_CH
{
   wchar_t ch;
   wchar_t *multiChar;
};

The variable ch can be any ASCII character as well as one of the predefined characters RE_EOS or RE_MULTI. RE_EOS is the end-of-string marker for the compiled regular expression. RE_MULTI is used to store the characters that appeared in brackets in the regular expression. The characters are stored in member multiChar with an RE_EOS as the last character.

In an earlier implementation I had defined two additional REGEX_CH characters, RE_MATCHONE and RE_MATCHANY. These REGEX_CH characters were used to compare wildcards '*' and '?'. I eliminated these characters from the current version because of the problems they created in implementing the search algorithm correctly. Because a wildcard can match any character, its presence in the regular expression would force all the entries in the skip table to be the same. If the wildcard were at the end of the search pattern then all the entries in the skip table would be zero, which turns the Boyer-Moore algorithm into the brute force method.

While wildcards at the end of the pattern made things difficult, wildcards at the beginning of the pattern are relatively easy to deal with. I handle lead wildcards by counting them and removing them from the regular expression. The rest of the pattern can then be searched for using Boyer-Moore. When a match is found, the number of lead wildcards is then subtracted from the index of the first character of the match to get the starting point for the match of the entire regular expression.

The following code from the Compile function strips out wildcards:

while (*trePtr == _T('?') || *trePtr == _T('*'))
{
   if (lead != -1)
   {
      if (*trePtr == _T('*'))
      {
         beginMatch = false;
            lead = -1;              
      }
      else
         lead++;
   }
   trePtr++;
}

Whenever a '*' is encountered, lead is set to -1. This means that a match can occur anywhere in the remaining text, and that the start of the match occurs at the beginning of the input text. The variable beginMatch is set to false because the regular expression is guaranteed to match from the beginning of the input text and it is no longer necessary to test for it. If there are only '?'s leading the pattern then lead will be incremented to keep track of how many were removed.

It is unrealistic to force the user to use wildcards only at the beginning of a regular expression. Instead, the regular expression that the user types in must be transformed so that all wildcards appear at the start of the regular expression. CRegEx achieves this transformation by breaking the regular expression into multiple subexpressions, such that each subexpression contains wildcards only at the beginning of the expression. For example, the regular expression "A*B*C*D" will be broken into four smaller regular expressions, "A", "*B", "*C", "*D".

Regular expressions are broken apart during parsing, via a recursive process. Each time a wild card is encountered a new CregEx is created using the non-default constructor. The current pointer to the regular expression is passed in as a parameter to the constructor. The wildcard becomes a leading wild card for the new regular expression, and the new instance of CRegEx completes the task of parsing and compiling the regular expression. This new CRegEx is attached to the old CRegEx via a pointer; the address of the new CRegEx is stored in the old CRegEx's member variable nextRegEx. Thus, the recursive parsing process constructs a linked list of CRegEx, which then represents the original regular expression. The regular expression "A*B*C*D" from the above example will be stored as "A"->"*B"->"*C"->"*D".

The following code segment from Compile creates the next CRegEx in the linked list and passes it the unparsed remainder of the regular expression.

else if (*trePtr == _T('*'))    // Wildcard zero or more chars
{
   nextRegEx = new CRegEx((LPTSTR)trePtr, false, endMatch);
   endMatch=false; // delegate the end Match
   break;          // Stop parsing.
}
else if (*trePtr == _T('?'))    // wildcard zero or one chars
{
   nextRegEx = new CRegEx((LPTSTR)trePtr,true,endMatch);
   endMatch = false; // delegate the end Match
   break;          // Stop parsing
}

When the new CRegEx is created, the remaining regular expression is passed in, along with the begin match and end match flags. The begin flag is set based on the type of wildcard. If a '*' is encountered then begin is set to false because a begin match is guaranteed. If '?' is encountered then the regular expression must match from the beginning, so begin is set to true. The end match is then delegated to the new RegEx and set to false.

The linked list built here will be used later to recursively search for the regular expression in the input text. When the non-default constructor is used, the parser ignores the '^' and '$' special characters in the regular expression; the begin and end parameters take precedence. This approach is necessary because '^' and '$' can occur only at the beginning and end of the regular expression respectively. Since the parser is processing segments of the original regular expression, it must treat any additional '^' and '$' symbols that occur in the regular expression as literals.

Once the pattern has been successfully compiled, CRegEx's InitSkipTable function is called to initialize the member variable skip[TMAXCHAR]. InitSkipTable stores the skip values that are required by the Boyer-Moore algorithm. Each element in skip is initialized to the pattern length. The pattern is scanned and the skip value for each character in the pattern is set to the pattern length minus the index of the character minus 1. The skip values for all the characters that are part of a RE_MULTI character are also set. The skip value for all the characters that are part of RE_MULTI will be the same since they occupy the same logical position in the pattern.

The only thing left to do now is to actually use the compiled regular expression and see if it can be found in some text. The routine that performs the matching is Match(LPTSTR text, int textLen, int &matchLen). The text parameter will be searched to find any occurrences of the regular expression. The parameter textlen is the length of the text being passed in. Match returns the length of the actual match in the matchLen parameter. This parameter enables the caller of Match to know how the wildcards expanded. Match returns a pointer to the first character in the string that matches the regular expression.

Match also contains the implementation of the Boyer-Moore string search algorithm. The code segment in Figure 2 illustrates the Boyer-Moore Algorithm.

Several modifications to the algorithm were required to support regular expressions. The first change concerns where character comparisons begin. If the match must occur at the end of the text, then comparisons start there. If the input text is large, going directly to the end of the input text will save many comparisons. This differs significantly from the way most regular expression processors work — they still compare the regular expression to the entire input text. If the match does not have to occur at the end, then comparisons start at pattern length - 1.

To provide support for bracketed character sets, this implementation of the Boyer-Moore algorithm calls function CRegEx::MatchChar to see if a character matches one of the bracketed characters. MatchChar knows how to handle RE_MULTI characters. The match fail criteria has been updated to include passing the end of the text buffer, or passing the length of the pattern plus the number of single-match wildcards, when a match at the beginning of text is required.

Match is also responsible for calling the next CRegEx in the linked list. As explained above, whenever a wildcard character is encountered during compilation, a new CRegEx object is created and linked to the current CRegEx object. If the current match is successful, it calls nextRegEx->Match to search through the remainder of the text for the next partial pattern. This process will recursively match all the subexpressions to the input text. If any of the subexpressions fail, the match fails. If they all succeed, the match succeeds, and the matchLen variable is set to the cumulative matchLen of all the subexpressions.

Using CRegEx

I have included a test program as part of the online source code (see p. 3 for downloading instructions). If the code is compiled with GREP defined the program will act similar to grep. If GREP is undefined, the program will display the offset where the match occurred in the file. It will also print out the string that was matched. CRegEx supports only a subset of the regular expression syntax that is found on Unix, but can easily be modified to support other metacharacters as well.

David Berry works for Dendrite International Inc. He is the head of development for one of Dendrite's client teams. He can be reached at [email protected].


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.