Dr. Dobb's | Regular Expressions in C++

Regular Expressions in C++

The author of the Boost regex library shows that is C++ as versatile for text processing as script-based languages like Awk and Perl.

October 01, 2001
URL:http://www.drdobbs.com/parallel/regular-expressions-in-c/184404797

Regular expressions form a central role in many programming languages, including Perl and Awk, as well as many familiar UNIX utilities such as grep and sed. The intrinsic nature of pattern matching in these languages has made them ideally suited to text processing applications, particularly for those web applications that have to process HTML. Traditionally, C/C++ users have had a hard time of it, usually being forced to use the POSIX C API functions regcomp, regexec, and the like. These primitives lack support for search and replace operations and are tied to searching narrow character C-strings. Some time ago, I began work on a modern regular expression engine that would support both narrow- and wide-character strings, as well as standard library-style iterator-based searches. This library became the regex library in Boost, which is accepted as part of the peer- reviewed boost library (see http://www.boost.org/). In this article, I'll show how regex++ can be used to make C++ as versatile for text processing as script-based languages such as Awk and Perl.

Data Validation

One of the simplest applications of regular expressions is data-input validation. Imagine that you need to store credit-card numbers in a database. If such numbers are stored in machine-readable format they will consist of a string of either 15 or 16 digits. The regular expression:

[[:digit:]]{15,16}

can be used to verify that the number is in the correct format; here I have used the extended regular expression syntax used by egrep, Awk, and Perl. Regex++ also supports the more basic syntax used by the grep and sed utilities. However, most people find that the extended syntax is both more natural and more powerful, so that is the form I will use throughout this article. I do not intend to discuss the regular expression syntax in this article, but the syntax variations supported by regex++ are described online (http://www.boost.org/doc/libs/1_53_0/libs/regex/doc/html/index.html). The documentation for Perl, Awk, sed, and grep are other useful sources of information, as is the Open UNIX Standard (http://www.opengroup.org/onlinepubs/7908799/xbd/re.html).

To use the aforementioned expression, you will need to convert it into some kind of machine-readable form. In regex++, regular expressions are represented by the template class reg_expression<charT, traits, Allocator>; this acts as a repository for the machine-readable expression and is responsible for parsing and validating the expression. reg_expression is modeled closely on the standard library class std::basic_string, and like that class, is usually used as one of two typedefs:

typedef reg_expression<char> regex; 

typedef reg_expression<wchar_t> wregex;

Listing One contains some code for validating a credit-card format; in fact, this code could hardly be simpler, consisting of just two lines.

Listing One

bool validate_card_format(const std::string& s) 
{ 
   static const boost::regex e("\\d{15,16}"); 
   return regex_match(s, e); 
}

The first line declares a static instance of boost::regex, initialized with the regular expression string; note that I have replaced the verbose (albeit POSIX standard) [[:digit:]] with the Perl-style shorthand \d. Note also that the escape character has had to be doubled up to give \\d. This is an annoying aspect of regular expressions in C/C++. Since character strings are seen by the compiler before the regular expression parser, whenever an escape character should be passed to the regular expression engine, a double backslash must be used in the C/C++ code. The second line simply calls the algorithm regex_match to verify that the input string matches the expression. My use of a static instance of boost::regex here is important — this ensures that the expression is parsed only once (the first time that it is used) and not each time that the function is called. Although the algorithm regex_match is defined inside namespace boost, I haven't prefixed the usage of the algorithm with the boost:: qualifier. This is because the Koenig lookup rules ensure that the right algorithm will be found anyway, as long as one of its arguments is a type also declared inside namespace boost. It should be noted, however, that not all compilers currently support Koenig lookup. For these compilers, a boost:: qualifier is required in front of the call to regex_match. For simplicity, however, all the examples in this article assume that the Koenig lookup is supported.

Now suppose that at some point, the application using this code is converted to Unicode. Using traditional C APIs, this could be difficult, however, the library makes this trivial — I just had to change std::string to std::wstring and boost::regex to boost::wregex (see Listing Two).

Listing Two

bool validate_card_format(const std::wstring& s) 
{ 
   static const boost::wregex e(L"\\d{15,16}"); 
   return regex_match(s, e); 
}

Search and Replace

Frankly, the examples given so far are not all that interesting. One of the key features of languages such as Perl is the ability to perform simple search and replace operations on character strings. Consider the credit-card example again — while it may be machine friendly to store credit-card numbers as long strings of digits, this is not very human friendly. Normally, people expect to see credit-card numbers as groups of three or four digits separated by spaces or hyphens. If you print out receipts containing the customer's card number, you would expect to see the number in a human-friendly form. Conversely, if you receive an order by e-mail, the chances are that the card number has not been typed in a machine-friendly form. Fortunately, regular expression search-and-replace comes to the rescue.

In Listing Three, I have defined a single regular expression that will match a card number in almost any format, along with two format strings that define how the reformatted text should look — one for a machine-readable form and one for a standardized human-readable form. The regular expression and the format strings are used by two functions (machine_readable_card_number and human_readable_card_number) that perform the text reformatting by calling the algorithm regex_merge. This algorithm searches through the input string and replaces each regular expression match with the format string. Note, however, that the format string is not treated as a string literal; instead, it acts as a template from which the actual text is generated. In this example, I've used a sed-style format string where each occurrence of \n is replaced by what matched the nth subexpression in the regular expression. Users of sed or Perl should be familiar with this kind of usage, and the library lets you choose which format string syntax you want to use by passing the appropriate flags to regex_merge. By the way, the name regex_merge comes from the idea that the algorithm merges two strings (the input text and the format string) to produce one new string.

Listing Three

// match any format with the regular expression:
const boost::regex e("\\A"              // asserts start of string
                     "(\\d{3,4})[- ]?"  // first group of digits
                     "(\\d{4})[- ]?"    // second group of digits
                     "(\\d{4})[- ]?"    // third group of digits
                     "(\\d{4})"         // forth group of digits
                     "\\z");            // asserts end of string

// format strings using sed syntax:
const std::string machine_format("\\1\\2\\3\\4");
const std::string human_format("\\1-\\2-\\3-\\4");

std::string 
machine_readable_card_number(const std::string& s)
{
    std::string result = regex_merge(s, e, machine_format, 
                           boost::match_default 
                           | boost::format_sed 
                           | boost::format_no_copy);
   if(result.size() == 0)
    throw std::runtime_error
           ("String is not a credit card number");
   return result;
}
std::string 
human_readable_card_number(const std::string& s)
{
   std::string result = regex_merge(s, e, human_format, 
                           boost::match_default 
                           | boost::format_sed 
                           | boost::format_no_copy);
   if(result.size() == 0)
    throw std::runtime_error
           ("String is not a credit card number");
   return result;
}

Error handling in Listing Three is quite simple — by passing the flag boost::format_no_copy to regex_merge, sections of the input text that do not match the regular expression are ignored and do not appear in the output string. This means that if the input does not match the expression, then an empty string will be returned by regex_merge, and the appropriate exception can be thrown. The algorithm regex_merge will search the input for all possible matches, but in this case, it requires that the expression must match the whole of the input string or nothing at all. Therefore, the expression in Listing Three starts with \\A and ends with \\z. Taken together, these ensure that the expression will only match the whole of the input string and not just one part of it (these are what Perl calls "zero width assertions").

If you study the regular expression in Listing Three, you should notice one big improvement over script-based languages; C++ lets you specify a single-string literal as a series of shorter string literals. I've taken advantage of this in Listing Three to split the regular expression up into logical sections, and then to comment each section. When the compiler sees that section of code, the comments will get discarded and the strings will merge into one long-string literal. Perhaps surprisingly, this makes regular expressions much more readable in C++ than in those traditional scripting languages that require regular expressions to be specified as a single long string.

Nontrivial Search and Replace

So far, the examples have concentrated on simple search-and-replace operations that use an existing syntax (either sed or Perl) for the format string. However, it is sometimes necessary to compute the new string to be inserted. A typical example would be a web application that uses a regular expression to locate a custom HTML tag in a file, then uses the match to perform a database lookup. The output would then be another HTML file with the custom tags replaced by the current database information. Imagine that the custom tag looks something like this:

<mergedata table="tablename" item="itemname" field="fieldname">

You need to extract the three fields, look up the item in the specified table, and then format the specified field as a string. However, there are a couple of complications here: First, the three fields (table, item, and field) can occur in any order; and second, some of the fields may be omitted, in which case default values should be used. Listing Four is the complete code for such an application.

Listing Four

#include <string>
#include <iostream>
#include <fstream>
#include <iterator>
#include <boost/regex.hpp>

const char* expression = 
   "<\\s*datamerge"                      // tag prefix
   "(?:"                                 // non-marking grouping
      "\\s+table\\s*=\\s*\"([^\"]*)\""   // $1 = table name
      "|\\s+item\\s*=\\s*\"([^\"]*)\""   // $2 = item name
      "|\\s+field\\s*=\\s*\"([^\"]*)\""  // $3 = field name
   "){1,3}"                              // grouping repeated 1, 2 or 3 times
   "\\s*>";                              // tag suffix
const boost::regex e(expression);
std::string::const_iterator endp;
std::string lookup_datamerge_string(const std::string& table,
                    const std::string& item, const std::string& field)
{
   // this should carry out a database lookup, 
   // for now just concatonate the names together:
   std::string result = table + "#" + item + "#" + field;
   return result;
}
bool grep_callback(const boost::match_results<std::string::const_iterator>& in)
{
   // get table name with default if necessary:
   std::string table = in[1];
   if(table.size() == 0) table = "default_table_name";
   // get item name (required no defaults):
   std::string item = in[2];
   if(item.size() == 0) 
      throw std::runtime_error("Incomplete datamerge field found");
   // get field name with default if necessary:
   std::string field = in[3];
   if(field.size() == 0) field = "default_field_name";
   // now carry out output, start by
   // sending everything from the end of the last match
   // to the start of this match to output:
   std::cout << std::string(in[-1]);   // output $`
   std::cout << lookup_datamerge_string(table, item, field);
   // now save end of what matched for later:
   endp = in[0].second;
   return true; // continue grepping
}
void load_file(std::string& s, std::istream& is)
{
   s.erase();
   s.reserve(is.rdbuf()->in_avail());
   char c;
   while(is.get(c))
   {
      if(s.capacity() == s.size())
         s.reserve(s.capacity() * 3);
      s.append(1, c);
   }
}
int main(int argc, char * argv[])
{
   try{
   std::filebuf ifs;
   std::filebuf ofs;
   std::streambuf* old_in = 0;
   std::streambuf* old_out = 0;
   if(argc > 1)
   {
      // redirect cin:
      ifs.open(argv[1], std::ios_base::in);
      old_in = std::cin.rdbuf(&ifs);
   }
   if(argc > 2)
   {
      // redirect cout:
      ofs.open(argv[2], std::ios_base::out);
      old_out = std::cout.rdbuf(&ofs);
   }
   std::string s;
   load_file(s, std::cin);
   endp = s.begin();
   // perform search and replace with lookup:
   boost::regex_grep(&grep_callback, s, e);
   // copy tail of file to output:
   std::string::const_iterator end = s.end();
   std::copy(endp, end, std::ostream_iterator<char>(std::cout)); 
   // reset streams:
   if(old_in) std::cin.rdbuf(old_in);
   if(old_out) std::cout.rdbuf(old_out);
   }
   catch(const std::exception& e)
   {
      std::cerr << "Exception thrown during merge: \"" 
                << e.what() << "\"" << std::endl;
   }
   return 0;
}

The key to understanding Listing Four is the regular expression at the start of the code. The expression uses a bounded repeat to match the three fields (table, item, and field). This allows up to two of those fields to be absent and acquire default values. Each time the expression repeats, one of the fields (regardless of which order they appear in) will be matched and then marked for future reference by its enclosing parentheses. Any absent fields will end up with their marked subexpression containing a Null string (so we will know that that field is absent).

The core of Listing Four is remarkably simple — the input HTML file is loaded into a string (a better choice would be a memory-mapped file, however, that would require platform-specific assumptions and I wanted to avoid such complications in an example program like this). Then, the loaded file is used as input to the algorithm regex_grep, which simply searches through the file for all matches of the regular expression. For each match, it fills in an instance of boost::match_results and passes that instance to either a callback function or a function object. The match_results class stores a set of iterators that indicate what matched each subexpression of the regular expression, so the callback function can easily extract the information needed to perform the database lookup. In Listing Four, I've simply created a dummy string from the data instead of performing an actual lookup — in real-world code, the lookup_datamerge_string procedure should be replaced by the actual database lookup code.

Of course, this isn't the only way of solving this particular problem. For example, server-side scripting via ASP or PHP pages is a popular choice. However, the method presented here does have the advantage of simplicity, and of completely separating web page design from programming — something that is important in many environments. There are also some simplifications in Listing Four; particularly the use of a global callback function, along with some global data. These can be eliminated completely by using a function object rather than a callback function as an argument to regex_grep. It is also possible to forward to a class member function using this technique (there are examples that do this in the the library documentation for regex_grep).

There's a lot more that this library can do, however, there isn't space to cover it all here. In this article, I've used std::string extensively to keep the examples as simple as possible, but underneath, the library is completely iterator based and will accept any bidirectional iterator as text input. There are also algorithms for Perl-like split operations that aren't covered here, along with narrow and wide character versions of the traditional POSIX C API functions.

Conclusion

This article shows some of the power that regular expressions in C++ can give you. This library does not seek to replace traditional regex tools such as lex. Rather, it provides a more convenient interface for rapid access to all kinds of pattern matching and text processing — something that has traditionally been limited to scripting languages. In addition, it provides a modern iterator-based implementation that allows it to work seamlessly with the C++ Standard Library, providing the versatility that C++ users have come to expect from modern libraries.

Acknowledgments

Thanks to Steve Cleary for his helpful comments while preparing this article, and also to all the library users who have contributed feedback to this project — without their help, this would never have come this far.

John is an independent programmer in England with an interest in generic programming. He can be contacted at [email protected].