Regular Expressions in C++

By John Maddock, October 01, 2001

The author of the Boost regex library shows that C++ is as versatile for text processing as script-based languages like Awk and Perl.

You need to extract the three fields, look up the item in the specified table, and then format the specified field as a string. However, there are a couple of complications here: First, the three fields (table, item, and field) can occur in any order; and second, some of the fields may be omitted, in which case default values should be used. Listing Four is the complete code for such an application.

Listing Four

#include <algorithm>   // std::copy
#include <stdexcept>   // std::runtime_error
#include <string>
#include <iostream>
#include <fstream>
#include <iterator>
#include <boost/regex.hpp>

const char* expression = 
   "<\\s*datamerge"                      // tag prefix
   "(?:"                                 // non-marking grouping
      "\\s+table\\s*=\\s*\"([^\"]*)\""   // $1 = table name
      "|\\s+item\\s*=\\s*\"([^\"]*)\""   // $2 = item name
      "|\\s+field\\s*=\\s*\"([^\"]*)\""  // $3 = field name
   "){1,3}"                              // grouping repeated 1, 2 or 3 times
   "\\s*>";                              // tag suffix
const boost::regex e(expression);
std::string::const_iterator endp;
std::string lookup_datamerge_string(const std::string& table,
                    const std::string& item, const std::string& field)
{
   // this should carry out a database lookup, 
   // for now just concatenate the names together:
   std::string result = table + "#" + item + "#" + field;
   return result;
}
bool grep_callback(const boost::match_results<std::string::const_iterator>& in)
{
   // get table name with default if necessary:
   std::string table = in[1];
   if(table.size() == 0) table = "default_table_name";
   // get item name (required no defaults):
   std::string item = in[2];
   if(item.size() == 0) 
      throw std::runtime_error("Incomplete datamerge field found");
   // get field name with default if necessary:
   std::string field = in[3];
   if(field.size() == 0) field = "default_field_name";
   // now carry out output, start by
   // sending everything from the end of the last match
   // to the start of this match to output:
   std::cout << std::string(in[-1]);   // output $`
   std::cout << lookup_datamerge_string(table, item, field);
   // now save end of what matched for later:
   endp = in[0].second;
   return true; // continue grepping
}
void load_file(std::string& s, std::istream& is)
{
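   s.erase();
   // in_avail() may return 0 or -1, so only reserve on a positive estimate:
   std::streamsize avail = is.rdbuf()->in_avail();
   if(avail > 0) s.reserve(static_cast<std::string::size_type>(avail));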
   char c;
   while(is.get(c))
   {
      if(s.capacity() == s.size())
         s.reserve(s.capacity() * 3);
      s.append(1, c);
   }
}
int main(int argc, char * argv[])
{
   try{
   std::filebuf ifs;
   std::filebuf ofs;
   std::streambuf* old_in = 0;
   std::streambuf* old_out = 0;
   if(argc > 1)
   {
      // redirect cin:
      ifs.open(argv[1], std::ios_base::in);
      old_in = std::cin.rdbuf(&ifs);
   }
   if(argc > 2)
   {
      // redirect cout:
      ofs.open(argv[2], std::ios_base::out);
      old_out = std::cout.rdbuf(&ofs);
   }
   std::string s;
   load_file(s, std::cin);
   endp = s.begin();
   // perform search and replace with lookup:
   boost::regex_grep(&grep_callback, s, e);
   // copy tail of file to output:
   std::string::const_iterator end = s.end();
   std::copy(endp, end, std::ostream_iterator<char>(std::cout)); 
   // reset streams:
   if(old_in) std::cin.rdbuf(old_in);
   if(old_out) std::cout.rdbuf(old_out);
   }
   catch(const std::exception& e)
   {
      std::cerr << "Exception thrown during merge: \"" 
                << e.what() << "\"" << std::endl;
   }
   return 0;
}

The key to understanding Listing Four is the regular expression at the start of the code. The expression uses a bounded repeat to match the three fields (table, item, and field), which allows up to two of those fields to be absent and take default values. Each time the group repeats, one of the fields is matched (regardless of the order in which they appear) and marked for future reference by its enclosing parentheses. Any absent field ends up with its marked subexpression containing an empty string, which is how the callback knows that the field is missing.
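The order independence and defaulting described above can also be sketched with std::regex from the C++11 Standard Library. Note that this is a swapped-in technique, not the approach of Listing Four: each attribute is searched for separately instead of relying on a bounded repeat, which avoids depending on how a particular regex engine treats captures inside a repeated group. The names DatamergeFields, attribute_value, and parse_datamerge are hypothetical and do not come from the Boost library.

```cpp
#include <regex>
#include <stdexcept>
#include <string>

// Hypothetical helper type: the three fields of one datamerge tag.
struct DatamergeFields {
   std::string table, item, field;
};

// Return the quoted value of attribute `name` inside `tag`, or an empty
// string when the attribute is absent. Assumes `name` contains no
// regex metacharacters.
std::string attribute_value(const std::string& tag, const std::string& name)
{
   const std::regex attr("\\s" + name + "\\s*=\\s*\"([^\"]*)\"");
   std::smatch m;
   return std::regex_search(tag, m, attr) ? m[1].str() : std::string();
}

// Extract the fields in any order, applying the same defaults as Listing Four.
DatamergeFields parse_datamerge(const std::string& tag)
{
   DatamergeFields f;
   f.table = attribute_value(tag, "table");
   if(f.table.empty()) f.table = "default_table_name";
   f.item = attribute_value(tag, "item");
   if(f.item.empty())
      throw std::runtime_error("Incomplete datamerge field found");
   f.field = attribute_value(tag, "field");
   if(f.field.empty()) f.field = "default_field_name";
   return f;
}
```

Calling parse_datamerge on a tag that lists field before item and omits table should yield the default table name together with the two supplied values, while a tag without an item attribute raises the same exception as the listing.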

The core of Listing Four is remarkably simple: the input HTML file is loaded into a string. (A memory-mapped file would be a better choice, but that would require platform-specific assumptions, and I wanted to avoid such complications in an example program like this.) The loaded file is then used as input to the algorithm regex_grep, which simply searches through the file for all matches of the regular expression. For each match, it fills in an instance of boost::match_results and passes that instance to either a callback function or a function object. The match_results class stores a set of iterators that indicate what matched each subexpression of the regular expression, so the callback function can easily extract the information needed to perform the database lookup. In Listing Four, I've simply created a dummy string from the data instead of performing an actual lookup; in real-world code, the lookup_datamerge_string procedure should be replaced by the actual database lookup code.
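The search-and-replace flow described above (copy the text before each match, emit the looked-up replacement, then copy the tail) can be sketched with std::sregex_iterator from the C++11 Standard Library in place of regex_grep. This is only an illustration under stated assumptions: the tag pattern is simplified to a single required item attribute, and lookup and merge are hypothetical names.

```cpp
#include <regex>
#include <string>

// Hypothetical stand-in for a database lookup: wraps the item in brackets.
std::string lookup(const std::string& item)
{
   return "[" + item + "]";
}

// Replace every <datamerge item="..."> tag in `input` with its lookup result.
std::string merge(const std::string& input)
{
   static const std::regex tag(
      "<\\s*datamerge\\s+item\\s*=\\s*\"([^\"]*)\"\\s*>");
   std::string output;
   std::string::const_iterator last = input.begin();
   for(std::sregex_iterator it(input.begin(), input.end(), tag), end;
       it != end; ++it)
   {
      const std::smatch& m = *it;
      output.append(last, m[0].first);   // text between matches
      output += lookup(m[1].str());      // replacement for this match
      last = m[0].second;                // remember where this match ended
   }
   output.append(last, input.end());     // tail after the final match
   return output;
}
```

The variable last plays the same role as the global endp in Listing Four, but as a local it needs no shared state.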

Of course, this isn't the only way of solving this particular problem. For example, server-side scripting via ASP or PHP pages is a popular choice. However, the method presented here does have the advantage of simplicity, and of completely separating web page design from programming, something that is important in many environments. There are also some simplifications in Listing Four, particularly the use of a global callback function along with some global data. These can be eliminated completely by using a function object rather than a callback function as an argument to regex_grep. It is also possible to forward to a class member function using this technique (there are examples that do this in the library documentation for regex_grep).
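A minimal sketch of the function-object idea follows, again using std::regex rather than Boost so that it stands alone. The state that Listing Four keeps in globals (the end of the previous match and the output) becomes member data. MergeHandler, for_each_match, and merge_with_handler are hypothetical names; for_each_match merely plays the role that regex_grep plays in the listing, and the tag pattern is simplified to one item attribute.

```cpp
#include <regex>
#include <string>

// Hypothetical function object: owns the state that would otherwise be global.
class MergeHandler {
public:
   // `input` must outlive the handler, since iterators into it are stored.
   explicit MergeHandler(const std::string& input)
      : last_(input.begin()), end_(input.end()) {}

   // Called once per match; returning true continues the search.
   bool operator()(const std::smatch& m)
   {
      out_.append(last_, m[0].first);    // text before this match
      out_ += "[" + m[1].str() + "]";    // dummy replacement text
      last_ = m[0].second;
      return true;
   }

   // Append the unmatched tail and return the merged text.
   std::string finish()
   {
      out_.append(last_, end_);
      last_ = end_;
      return out_;
   }

private:
   std::string out_;
   std::string::const_iterator last_, end_;
};

// Hypothetical driver: invokes `handler` for each match until it returns false.
template <class Handler>
void for_each_match(const std::string& s, const std::regex& re, Handler& handler)
{
   for(std::sregex_iterator it(s.begin(), s.end(), re), end; it != end; ++it)
      if(!handler(*it)) break;
}

std::string merge_with_handler(const std::string& input)
{
   static const std::regex tag(
      "<\\s*datamerge\\s+item\\s*=\\s*\"([^\"]*)\"\\s*>");
   MergeHandler handler(input);
   for_each_match(input, tag, handler);
   return handler.finish();
}
```

Because the handler is passed by reference, its accumulated state remains available to the caller after the search completes.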

There's a lot more that this library can do, but there isn't space to cover it all here. In this article, I've used std::string extensively to keep the examples as simple as possible, but underneath, the library is completely iterator based and will accept any bidirectional iterator as text input. There are also algorithms for Perl-like split operations that aren't covered here, along with narrow and wide character versions of the traditional POSIX C API functions.
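To give a flavor of a Perl-like split, here is a small sketch using std::sregex_token_iterator from the C++11 Standard Library. It is not the Boost split algorithm mentioned above, and the function name split is hypothetical. Passing -1 as the submatch index selects the text between matches of the separator expression.

```cpp
#include <regex>
#include <string>
#include <vector>

// Split `text` on every match of `separator`, returning the pieces in order.
std::vector<std::string> split(const std::string& text,
                               const std::regex& separator)
{
   // Index -1 asks the iterator for the text between matches.
   std::sregex_token_iterator first(text.begin(), text.end(), separator, -1);
   std::sregex_token_iterator last;
   return std::vector<std::string>(first, last);
}
```

For instance, splitting "table, item ,field" on the expression \s*,\s* should produce the three names with the surrounding whitespace removed.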

Conclusion

This article shows some of the power that regular expressions in C++ can give you. This library does not seek to replace traditional regex tools such as lex. Rather, it provides a more convenient interface for rapid access to all kinds of pattern matching and text processing — something that has traditionally been limited to scripting languages. In addition, it provides a modern iterator-based implementation that allows it to work seamlessly with the C++ Standard Library, providing the versatility that C++ users have come to expect from modern libraries.

Acknowledgments

Thanks to Steve Cleary for his helpful comments while preparing this article, and also to all the library users who have contributed feedback to this project — without their help, this would never have come this far.

John is an independent programmer in England with an interest in generic programming. He can be contacted at [email protected].
