Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Regular Expressions in C++


You need to extract the three fields, look up the item in the specified table, and then format the specified field as a string. However, there are a couple of complications here: First, the three fields (table, item, and field) can occur in any order; and second, some of the fields may be omitted, in which case default values should be used. Listing Four is the complete code for such an application.

Listing Four

#include <algorithm>
#include <string>
#include <iostream>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <boost/regex.hpp>

const char* expression = 
   "<\\s*datamerge"                      // tag prefix
   "(?:"                                 // non-marking grouping
      "\\s+table\\s*=\\s*\"([^\"]*)\""   // $1 = table name
      "|\\s+item\\s*=\\s*\"([^\"]*)\""   // $2 = item name
      "|\\s+field\\s*=\\s*\"([^\"]*)\""  // $3 = field name
   "){1,3}"                              // grouping repeated 1, 2 or 3 times
   "\\s*>";                              // tag suffix
const boost::regex e(expression);
std::string::const_iterator endp;
std::string lookup_datamerge_string(const std::string& table,
                    const std::string& item, const std::string& field)
{
   // this should carry out a database lookup; 
   // for now just concatenate the names together:
   std::string result = table + "#" + item + "#" + field;
   return result;
}
bool grep_callback(const boost::match_results<std::string::const_iterator>& in)
{
   // get table name with default if necessary:
   std::string table = in[1];
   if(table.size() == 0) table = "default_table_name";
   // get item name (required no defaults):
   std::string item = in[2];
   if(item.size() == 0) 
      throw std::runtime_error("Incomplete datamerge field found");
   // get field name with default if necessary:
   std::string field = in[3];
   if(field.size() == 0) field = "default_field_name";
   // now carry out output, start by
   // sending everything from the end of the last match
   // to the start of this match to output:
   std::cout << std::string(in[-1]);   // output $`
   std::cout << lookup_datamerge_string(table, item, field);
   // now save end of what matched for later:
   endp = in[0].second;
   return true; // continue grepping
}
void load_file(std::string& s, std::istream& is)
{
   s.erase();
   s.reserve(is.rdbuf()->in_avail());
   char c;
   while(is.get(c))
   {
      if(s.capacity() == s.size())
         s.reserve(s.capacity() * 3);
      s.append(1, c);
   }
}
int main(int argc, char * argv[])
{
   try{
   std::filebuf ifs;
   std::filebuf ofs;
   std::streambuf* old_in = 0;
   std::streambuf* old_out = 0;
   if(argc > 1)
   {
      // redirect cin:
      ifs.open(argv[1], std::ios_base::in);
      old_in = std::cin.rdbuf(&ifs);
   }
   if(argc > 2)
   {
      // redirect cout:
      ofs.open(argv[2], std::ios_base::out);
      old_out = std::cout.rdbuf(&ofs);
   }
   std::string s;
   load_file(s, std::cin);
   endp = s.begin();
   // perform search and replace with lookup:
   boost::regex_grep(&grep_callback, s, e);
   // copy tail of file to output:
   std::string::const_iterator end = s.end();
   std::copy(endp, end, std::ostream_iterator<char>(std::cout)); 
   // reset streams:
   if(old_in) std::cin.rdbuf(old_in);
   if(old_out) std::cout.rdbuf(old_out);
   }
   catch(const std::exception& e)
   {
      std::cerr << "Exception thrown during merge: \"" 
                << e.what() << "\"" << std::endl;
   }
   return 0;
}


The key to understanding Listing Four is the regular expression at the start of the code. The expression uses a bounded repeat to match the three fields (table, item, and field), which allows up to two of those fields to be absent and acquire default values. Each time the expression repeats, one of the fields (regardless of the order in which they appear) is matched and then marked for future reference by its enclosing parentheses. Any absent field ends up with its marked subexpression containing an empty string, which is how the callback knows that the field was omitted.
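To make the "absent field gives an empty subexpression" point concrete, here is a minimal sketch that applies the same expression to a tag carrying only one field. It is an illustration under assumptions, not the article's code: it uses std::regex from the C++11 standard library as a stand-in for boost::regex, and the names submatch_or and describe_tag, along with the "missing_item" marker, are hypothetical. Only the single-field case is relied on here, because the ECMAScript grammar that std::regex uses by default may clear captures from earlier iterations of a repeated group, so multi-field tags might not behave as described for Boost.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: text of sub-expression n, or fallback when that
// sub-expression did not take part in the match.
std::string submatch_or(const std::smatch& m, std::size_t n,
                        const std::string& fallback)
{
    return m[n].matched ? m[n].str() : fallback;
}

// Hypothetical helper: reports the three fields of the first datamerge tag
// in text as "table#item#field", or an empty string when no tag is found.
std::string describe_tag(const std::string& text)
{
    // Same expression as Listing Four, compiled with std::regex (assumption).
    static const std::regex tag(
        "<\\s*datamerge"
        "(?:"
           "\\s+table\\s*=\\s*\"([^\"]*)\""   // $1 = table name
           "|\\s+item\\s*=\\s*\"([^\"]*)\""   // $2 = item name
           "|\\s+field\\s*=\\s*\"([^\"]*)\""  // $3 = field name
        "){1,3}"
        "\\s*>");

    std::smatch m;
    if (!std::regex_search(text, m, tag))
        return std::string();

    // Sub-expressions that never matched are reported as unmatched (empty).
    return submatch_or(m, 1, "default_table_name") + "#" +
           submatch_or(m, 2, "missing_item") + "#" +
           submatch_or(m, 3, "default_field_name");
}
```

For instance, describe_tag("<datamerge item=\"widget\">") fills in the table and field defaults because only the second subexpression took part in the match.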

The core of Listing Four is remarkably simple. The input HTML file is loaded into a string. (A memory-mapped file would be a better choice, but it would require platform-specific assumptions, and I wanted to avoid such complications in an example program like this.) The loaded file is then used as input to the algorithm regex_grep, which searches through the file for all matches of the regular expression. For each match, it fills in an instance of boost::match_results and passes that instance to either a callback function or a function object. The match_results class stores a set of iterators that indicate what matched each subexpression of the regular expression, so the callback function can easily extract the information needed to perform the database lookup. In Listing Four, I've simply created a dummy string from the data instead of performing an actual lookup; in real-world code, the body of lookup_datamerge_string should be replaced by the actual database lookup code.
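The search-and-substitute loop described above can also be sketched with the C++11 standard library. This is a swapped-in technique, not the regex_grep call from Listing Four: it uses std::sregex_iterator to visit each match, and it reads the fields in a second pass over the tag's attribute text rather than through three marked subexpressions inside a repeat. The function name merge_document and both expressions are assumptions made for this illustration; lookup_datamerge_string mirrors the dummy in the listing.

```cpp
#include <map>
#include <regex>
#include <stdexcept>
#include <string>

// Dummy lookup, as in Listing Four: a real program would query a database.
std::string lookup_datamerge_string(const std::string& table,
                    const std::string& item, const std::string& field)
{
    return table + "#" + item + "#" + field;
}

// Hypothetical helper: returns a copy of input with every datamerge tag
// replaced by the result of the lookup.
std::string merge_document(const std::string& input)
{
    // Pass 1: the whole tag; $1 = the run of one to three attributes.
    static const std::regex tag(
        "<\\s*datamerge"
        "((?:\\s+(?:table|item|field)\\s*=\\s*\"[^\"]*\"){1,3})"
        "\\s*>");
    // Pass 2: a single attribute; $1 = name, $2 = value.
    static const std::regex attr("(table|item|field)\\s*=\\s*\"([^\"]*)\"");

    std::string output;
    std::string::const_iterator last = input.begin();
    const std::sregex_iterator none;

    for (std::sregex_iterator it(input.begin(), input.end(), tag);
         it != none; ++it)
    {
        const std::smatch& m = *it;
        // Copy everything between the previous match and this one.
        output.append(last, m[0].first);

        // Collect the attributes in whatever order they appear.
        const std::string attrs = m[1].str();
        std::map<std::string, std::string> values;
        for (std::sregex_iterator a(attrs.begin(), attrs.end(), attr);
             a != none; ++a)
            values[(*a)[1].str()] = (*a)[2].str();

        std::string table = values["table"];
        if (table.empty()) table = "default_table_name";
        std::string field = values["field"];
        if (field.empty()) field = "default_field_name";
        if (values["item"].empty())   // item is required, no default
            throw std::runtime_error("Incomplete datamerge field found");

        output += lookup_datamerge_string(table, values["item"], field);
        last = m[0].second;           // remember where this match ended
    }
    // Copy the tail of the input that follows the final match.
    output.append(last, input.end());
    return output;
}
```

Returning the merged text as a string, rather than writing to std::cout as the listing does, is a choice made here only so that the sketch needs no global state.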

Of course, this isn't the only way of solving this particular problem. For example, server-side scripting via ASP or PHP pages is a popular choice. However, the method presented here does have the advantage of simplicity, and of completely separating web page design from programming, which is important in many environments. There are also some simplifications in Listing Four, in particular the use of a global callback function along with some global data. These can be eliminated completely by passing a function object rather than a callback function to regex_grep. It is also possible to forward to a class member function using this technique (there are examples that do this in the library documentation for regex_grep).
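As a rough sketch of the function-object idea, the class below keeps the "end of the last match" position as a member instead of a global such as endp. Several things here are assumptions rather than the article's code: the names merge_writer and run_merge are hypothetical, the expression is simplified to a tag with only an item field, the bracketed output is a placeholder for the lookup, and the object is driven by std::for_each over std::sregex_iterator from the standard library instead of being passed to regex_grep.

```cpp
#include <algorithm>
#include <functional>
#include <ostream>
#include <regex>
#include <string>

// Hypothetical function object: it owns the state that Listing Four keeps
// in a global, so nothing outside the object is modified.
class merge_writer
{
public:
    merge_writer(std::ostream& out, std::string::const_iterator start)
        : out_(out), last_(start) {}

    // Called once per match: write the text since the previous match,
    // then a placeholder built from the item name in $1.
    void operator()(const std::smatch& m)
    {
        out_ << std::string(last_, m[0].first);
        out_ << "[" << m[1].str() << "]";
        last_ = m[0].second;
    }

    // Write whatever follows the final match.
    void finish(std::string::const_iterator end)
    {
        out_ << std::string(last_, end);
    }

private:
    std::ostream& out_;
    std::string::const_iterator last_;
};

// Hypothetical driver: applies one merge_writer to every match in text.
void run_merge(const std::string& text, std::ostream& out)
{
    // Simplified expression (assumption): item is the only field.
    static const std::regex tag(
        "<\\s*datamerge\\s+item\\s*=\\s*\"([^\"]*)\"\\s*>");

    merge_writer writer(out, text.begin());
    // std::ref ensures the algorithm updates this writer, not a copy.
    std::for_each(std::sregex_iterator(text.begin(), text.end(), tag),
                  std::sregex_iterator(), std::ref(writer));
    writer.finish(text.end());
}
```

Because all state lives in the writer, two merges can run over different strings and streams without interfering with each other, which the global version cannot guarantee.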

There's a lot more that this library can do, but there isn't space to cover it all here. In this article, I've used std::string extensively to keep the examples as simple as possible, but underneath, the library is completely iterator based and will accept any bidirectional iterator as text input. There are also algorithms for Perl-like split operations that aren't covered here, along with narrow and wide character versions of the traditional POSIX C API functions.
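The iterator-based design can be illustrated by searching text that is not held in a std::string at all. The sketch below assumes std::regex from the C++11 standard library, whose search algorithms are likewise specified in terms of bidirectional iterators, as a stand-in for the Boost library discussed here; the function name first_number is hypothetical.

```cpp
#include <list>
#include <regex>
#include <string>

// Hypothetical helper: returns the first run of digits found in a
// std::list<char>, or an empty string when there is none.
std::string first_number(const std::list<char>& text)
{
    typedef std::list<char>::const_iterator iter;
    static const std::regex digits("[0-9]+");

    // match_results is parameterised on the iterator type, so the
    // sub-matches refer directly into the list.
    std::match_results<iter> m;
    if (std::regex_search(text.begin(), text.end(), m, digits))
        return m[0].str();
    return std::string();
}
```

A std::list offers only bidirectional traversal, so this exercises the weakest iterator category the text mentions; a container with random-access iterators would work the same way.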

Conclusion

This article shows some of the power that regular expressions in C++ can give you. This library does not seek to replace traditional regex tools such as lex. Rather, it provides a more convenient interface for rapid access to all kinds of pattern matching and text processing — something that has traditionally been limited to scripting languages. In addition, it provides a modern iterator-based implementation that allows it to work seamlessly with the C++ Standard Library, providing the versatility that C++ users have come to expect from modern libraries.

Acknowledgments

Thanks to Steve Cleary for his helpful comments while preparing this article, and also to all the library users who have contributed feedback to this project — without their help, this would never have come this far.


John is an independent programmer in England with an interest in generic programming. He can be contacted at [email protected].

