Dr. Dobb's is part of the Informa Tech Division of Informa PLC




Regular Expressions in C++


Regular expressions play a central role in many programming languages, including Perl and Awk, as well as in many familiar UNIX utilities such as grep and sed. Because pattern matching is intrinsic to these languages, they are ideally suited to text-processing applications, particularly web applications that have to process HTML. Traditionally, C/C++ users have had a hard time of it, usually being forced to use the POSIX C API functions regcomp, regexec, and the like. These primitives lack support for search-and-replace operations and are tied to searching narrow-character C-strings. Some time ago, I began work on a modern regular expression engine that would support both narrow- and wide-character strings, as well as standard library-style iterator-based searches. This library, regex++, became the regex library in Boost and has been accepted as part of the peer-reviewed Boost library collection (see http://www.boost.org/). In this article, I'll show how regex++ can be used to make C++ as versatile for text processing as script-based languages such as Awk and Perl.
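For comparison, here is a rough sketch of what the POSIX route mentioned above tends to look like when called from C++. It assumes a POSIX platform that provides <regex.h>; the function name posix_all_digits is made up for illustration and is not part of any library. Note that the pattern must be compiled and freed by hand, and that the input has to be a narrow, null-terminated C-string.

```cpp
#include <regex.h>   // POSIX header, not part of standard C++
#include <string>

// Hypothetical helper: returns true if s consists only of decimal digits.
bool posix_all_digits(const std::string& s)
{
    regex_t re;
    // REG_EXTENDED selects the extended (egrep-style) syntax;
    // REG_NOSUB says we only need a yes/no answer, not submatch positions.
    if (regcomp(&re, "^[[:digit:]]+$", REG_EXTENDED | REG_NOSUB) != 0)
        return false;   // the pattern failed to compile

    // regexec only accepts a narrow, null-terminated string.
    const int rc = regexec(&re, s.c_str(), 0, nullptr, 0);
    regfree(&re);       // the compiled pattern must be released manually
    return rc == 0;
}
```

Even this small check needs explicit resource management, and nothing in the API offers a replace operation or a wide-character overload.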

Data Validation

One of the simplest applications of regular expressions is data-input validation. Imagine that you need to store credit-card numbers in a database. If such numbers are stored in machine-readable format they will consist of a string of either 15 or 16 digits. The regular expression:

[[:digit:]]{15,16}

can be used to verify that the number is in the correct format; here I have used the extended regular expression syntax used by egrep, Awk, and Perl. Regex++ also supports the more basic syntax used by the grep and sed utilities. However, most people find that the extended syntax is both more natural and more powerful, so that is the form I will use throughout this article. I do not intend to discuss the regular expression syntax in this article, but the syntax variations supported by regex++ are described online (http://www.boost.org/doc/libs/1_53_0/libs/regex/doc/html/index.html). The documentation for Perl, Awk, sed, and grep is another useful source of information, as is the Open UNIX Standard (http://www.opengroup.org/onlinepubs/7908799/xbd/re.html).

To use the aforementioned expression, you will need to convert it into some kind of machine-readable form. In regex++, regular expressions are represented by the template class reg_expression<charT, traits, Allocator>; this acts as a repository for the machine-readable expression and is responsible for parsing and validating the expression. reg_expression is modeled closely on the standard library class std::basic_string, and like that class, is usually used as one of two typedefs:

typedef reg_expression<char> regex; 

typedef reg_expression<wchar_t> wregex; 

Listing One contains some code for validating a credit-card format; in fact, this code could hardly be simpler: the function body consists of just two lines.

Listing One

#include <string>
#include <boost/regex.hpp>

bool validate_card_format(const std::string& s)
{
   static const boost::regex e("\\d{15,16}");
   return regex_match(s, e);
}

The first line of the function body declares a static instance of boost::regex, initialized with the regular expression string; note that I have replaced the verbose (albeit POSIX standard) [[:digit:]] with the Perl-style shorthand \d. Note also that the escape character has had to be doubled up to give \\d. This is an annoying aspect of regular expressions in C/C++: since character strings are seen by the compiler before the regular expression parser, whenever an escape character should be passed to the regular expression engine, a double backslash must be used in the C/C++ code. The second line simply calls the algorithm regex_match to verify that the input string matches the expression. My use of a static instance of boost::regex here is important: it ensures that the expression is parsed only once (the first time that it is used) and not each time that the function is called.

Although the algorithm regex_match is defined inside namespace boost, I haven't prefixed the usage of the algorithm with the boost:: qualifier. This is because the Koenig lookup rules ensure that the right algorithm will be found anyway, as long as one of its arguments is a type also declared inside namespace boost. It should be noted, however, that not all compilers currently support Koenig lookup. For these compilers, a boost:: qualifier is required in front of the call to regex_match. For simplicity, however, all the examples in this article assume that Koenig lookup is supported.
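On the doubled backslash discussed above: compilers that support C++11 offer raw string literals, which let the backslash reach the regular expression parser unchanged. The following is only a sketch under stated assumptions. It uses std::regex from the C++11 standard library in place of the article's boost::regex so that it is self-contained, and both function names are hypothetical.

```cpp
#include <regex>
#include <string>

// Sketch only: std::regex (C++11) stands in for the article's boost::regex.
bool validate_card_format_raw(const std::string& s)
{
    // R"(...)" is a raw string literal, so \d is written once
    // instead of being doubled up as "\\d".
    static const std::regex e(R"(\d{15,16})");
    return std::regex_match(s, e);
}

// The same check using the POSIX character class, which needs no escapes.
bool validate_card_format_posix_class(const std::string& s)
{
    static const std::regex e("[[:digit:]]{15,16}");
    return std::regex_match(s, e);
}
```

Both functions should accept and reject the same inputs; which spelling reads better is largely a matter of taste.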

Now suppose that at some point the application using this code is converted to Unicode. With traditional C APIs this could be difficult; the library, however, makes it trivial: I just have to change std::string to std::wstring and boost::regex to boost::wregex, and prefix the string literal with L (see Listing Two).

Listing Two

bool validate_card_format(const std::wstring& s) 
{ 
   static const boost::wregex e(L"\\d{15,16}"); 
   return regex_match(s, e); 
}

Search and Replace

Frankly, the examples given so far are not all that interesting. One of the key features of languages such as Perl is the ability to perform simple search and replace operations on character strings. Consider the credit-card example again — while it may be machine friendly to store credit-card numbers as long strings of digits, this is not very human friendly. Normally, people expect to see credit-card numbers as groups of three or four digits separated by spaces or hyphens. If you print out receipts containing the customer's card number, you would expect to see the number in a human-friendly form. Conversely, if you receive an order by e-mail, the chances are that the card number has not been typed in a machine-friendly form. Fortunately, regular expression search-and-replace comes to the rescue.

In Listing Three, I have defined a single regular expression that will match a card number in almost any format, along with two format strings that define how the reformatted text should look — one for a machine-readable form and one for a standardized human-readable form. The regular expression and the format strings are used by two functions (machine_readable_card_number and human_readable_card_number) that perform the text reformatting by calling the algorithm regex_merge. This algorithm searches through the input string and replaces each regular expression match with the format string. Note, however, that the format string is not treated as a string literal; instead, it acts as a template from which the actual text is generated. In this example, I've used a sed-style format string where each occurrence of \n is replaced by what matched the nth subexpression in the regular expression. Users of sed or Perl should be familiar with this kind of usage, and the library lets you choose which format string syntax you want to use by passing the appropriate flags to regex_merge. By the way, the name regex_merge comes from the idea that the algorithm merges two strings (the input text and the format string) to produce one new string.

Listing Three

// match any format with the regular expression:
const boost::regex e("\\A"              // asserts start of string
                     "(\\d{3,4})[- ]?"  // first group of digits
                     "(\\d{4})[- ]?"    // second group of digits
                     "(\\d{4})[- ]?"    // third group of digits
                     "(\\d{4})"         // fourth group of digits
                     "\\z");            // asserts end of string

// format strings using sed syntax:
const std::string machine_format("\\1\\2\\3\\4");
const std::string human_format("\\1-\\2-\\3-\\4");

std::string
machine_readable_card_number(const std::string& s)
{
   std::string result = regex_merge(s, e, machine_format,
                           boost::match_default
                           | boost::format_sed
                           | boost::format_no_copy);
   if(result.size() == 0)
      throw std::runtime_error
            ("String is not a credit card number");
   return result;
}

std::string
human_readable_card_number(const std::string& s)
{
   std::string result = regex_merge(s, e, human_format,
                           boost::match_default
                           | boost::format_sed
                           | boost::format_no_copy);
   if(result.size() == 0)
      throw std::runtime_error
            ("String is not a credit card number");
   return result;
}

Error handling in Listing Three is quite simple — by passing the flag boost::format_no_copy to regex_merge, sections of the input text that do not match the regular expression are ignored and do not appear in the output string. This means that if the input does not match the expression, then an empty string will be returned by regex_merge, and the appropriate exception can be thrown. The algorithm regex_merge will search the input for all possible matches, but in this case, it requires that the expression must match the whole of the input string or nothing at all. Therefore, the expression in Listing Three starts with \\A and ends with \\z. Taken together, these ensure that the expression will only match the whole of the input string and not just one part of it (these are what Perl calls "zero width assertions").
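For readers without Boost at hand, the same idea can be sketched with the C++11 standard library. This is an approximation rather than a drop-in equivalent of Listing Three: std::regex_replace stands in for regex_merge, the names in namespace card_sketch are hypothetical, and because the default std::regex grammar has no \A or \z, the sketch uses ^ and $ and assumes the input contains no line breaks.

```cpp
#include <regex>
#include <stdexcept>
#include <string>

namespace card_sketch {

// Same grouping as Listing Three, anchored with ^ and $ instead of \A and \z.
const std::regex e("^"
                   "(\\d{3,4})[- ]?"  // first group of digits
                   "(\\d{4})[- ]?"    // second group of digits
                   "(\\d{4})[- ]?"    // third group of digits
                   "(\\d{4})"         // fourth group of digits
                   "$");

// Applies a sed-style format string; throws if the input is not a card number.
std::string reformat(const std::string& s, const std::string& fmt)
{
    // format_no_copy drops unmatched text, so "no match" yields an empty string.
    std::string result = std::regex_replace(
        s, e, fmt,
        std::regex_constants::format_sed | std::regex_constants::format_no_copy);
    if (result.empty())
        throw std::runtime_error("String is not a credit card number");
    return result;
}

std::string machine_readable(const std::string& s)
{
    return reformat(s, "\\1\\2\\3\\4");
}

std::string human_readable(const std::string& s)
{
    return reformat(s, "\\1-\\2-\\3-\\4");
}

} // namespace card_sketch
```

Calling card_sketch::human_readable on a 16-digit string returns the digits in four hyphen-separated groups, while input that does not match the whole expression raises std::runtime_error, mirroring the error handling described above.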

If you study the regular expression in Listing Three, you should notice one big improvement over script-based languages: C++ lets you specify a single string literal as a series of shorter string literals. I've taken advantage of this in Listing Three to split the regular expression up into logical sections, and then to comment each section. When the compiler sees that section of code, the comments will get discarded and the strings will merge into one long string literal. Perhaps surprisingly, this makes regular expressions much more readable in C++ than in those traditional scripting languages that require regular expressions to be specified as a single long string.

Nontrivial Search and Replace

So far, the examples have concentrated on simple search-and-replace operations that use an existing syntax (either sed or Perl) for the format string. However, it is sometimes necessary to compute the new string to be inserted. A typical example would be a web application that uses a regular expression to locate a custom HTML tag in a file, then uses the match to perform a database lookup. The output would then be another HTML file with the custom tags replaced by the current database information. Imagine that the custom tag looks something like this:

<mergedata table="tablename" item="itemname" field="fieldname">
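One way to approach a computed replacement like this is to walk over the matches and build the output by hand. The following is a minimal sketch under several assumptions, not the article's own solution. It uses std::regex and std::sregex_iterator from the C++11 standard library rather than regex++, it assumes the three attributes always appear in the order shown and are double-quoted, and the lookup_fn callback is a hypothetical stand-in for the database query.

```cpp
#include <functional>
#include <regex>
#include <string>

// Hypothetical callback: (table, item, field) -> text to insert.
using lookup_fn = std::function<std::string(const std::string&,
                                            const std::string&,
                                            const std::string&)>;

// Replaces every <mergedata ...> tag in html with the result of lookup.
std::string merge_data(const std::string& html, const lookup_fn& lookup)
{
    static const std::regex tag(
        "<mergedata\\s+table=\"([^\"]*)\""  // 1: table name
        "\\s+item=\"([^\"]*)\""             // 2: item name
        "\\s+field=\"([^\"]*)\"\\s*>");     // 3: field name

    std::string out;
    auto copied_up_to = html.cbegin();
    for (std::sregex_iterator it(html.cbegin(), html.cend(), tag), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        out.append(copied_up_to, m[0].first);   // text before the tag
        out += lookup(m[1].str(), m[2].str(), m[3].str());
        copied_up_to = m[0].second;             // continue after the tag
    }
    out.append(copied_up_to, html.cend());      // trailing text
    return out;
}
```

Passing the lookup as a callback keeps the pattern matching separate from whatever storage actually supplies the data, so the same function can be exercised with an in-memory stub.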

