October 01, 2005
The Perl-Compatible Regular Expressions Library
Perl-strength regular expressions in native apps
Ethan McCallum
The Perl-Compatible Regular Expressions Library gives you the pattern-matching power of Perl in your C and C++ programs.
Ethan McCallum is a freelance technology consultant who spends a lot of time with Linux, C++, and Java. He can be reached at ethanqm@penguinmail.com.
Perl's regular expression (regexp) muscle makes it a favorite for text processing. If your project spec calls for C or C++, however, your choices are few: Call specialized Perl scripts with popen() (ugly), use Perl's C interface (uglier), or use the Perl-Compatible Regular Expressions (PCRE) library [1]. Philip Hazel's PCRE is a native library that implements Perl-style regexp support. It offers Perl's extended regexp semantics as well as its ability to extract matched substrings for you. PCRE is used in several projects, including the Apache httpd, the BlueFish HTML editor, and the nmap scanning tool. Its BSD-style license permits its use in both free and commercial software.
In this article, I examine PCRE, including text matching and substring extraction. I wrap up with a stub of a log-processing tool that uses PCRE as the basis of pattern-matching objects. Familiarity with Perl's regexp rules is not required, though it may help you understand some of the concepts presented here. The sample code was tested under Fedora Core 3, PCRE 4.5/5.0, and GCC 3.4.2. See the "Version Compatibility" sidebar if you are using a different version of PCRE. The complete source code is available at http://www.cuj.com/code/.
Make the Switch
Your first use of PCRE may involve converting an existing program. PCRE provides a straightforward migration path from the POSIX regexp toolkit: Its compatibility layer wraps PCRE calls in functions named for their POSIX counterparts.
Migrating existing code, then, is a two-line operation:
- Change an #include<> statement in your source file.
- Link against a different library in your makefile.
Calls to the familiar regcomp() and regexec() work as before, even though they call PCRE functions behind the scenes. This is demonstrated with step1.cc, a stub program available at http://www.cuj.com/code/. The makefile builds both POSIX and PCRE versions, based on the USING_PCRE preprocessor constant that I pass in the makefile:
#ifdef USING_PCRE
#include<pcre/pcreposix.h>
#else
#include<regex.h>
#endif
(This isn't to suggest that you maintain both POSIX and PCRE compatibility in your apps, but simply to demonstrate the ease of conversion.) If this example fails to build on your system, see the sidebar "Feeding the Linker and Preprocessor."
Despite their difference in linkage, the programs step1-posix and step1-pcre function identically. Pass the programs a regular expression on the command line, for example:
$ ./step1-posix 'foo.*bar'
and feed them lines via standard input. (Single-quote the regexp or the shell misinterprets it.) The programs report whether each line matches the supplied regexp.
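The listing for step1.cc is not reproduced here, so the following is only a minimal sketch of the kind of matching routine such a program might contain. The helper name line_matches() is hypothetical. It uses nothing beyond regcomp(), regexec(), and regfree(), which is why the same source can be built against either header:

```cpp
#include <cstddef>
#include <string>

#ifdef USING_PCRE
#include <pcre/pcreposix.h>   // PCRE's POSIX-compatible wrapper
#else
#include <regex.h>            // the system's POSIX regexp API
#endif

// Hypothetical helper (not taken from step1.cc): reports whether
// `line` matches `pattern`.
bool line_matches(const std::string& pattern, const std::string& line)
{
    regex_t re;

    // Compile with no special flags. A nonzero result means the
    // pattern is invalid; this sketch simply treats that as "no match".
    if (regcomp(&re, pattern.c_str(), 0) != 0) {
        return false;
    }

    // nmatch == 0 and pmatch == NULL: only a yes/no answer is wanted.
    const bool matched = (regexec(&re, line.c_str(), 0, NULL, 0) == 0);

    regfree(&re);   // release the compiled pattern
    return matched;
}
```

A driver would call line_matches() once per input line. A real program would compile the pattern once up front rather than on every call.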
Performing a Match
Of course, there's more to PCRE than cloning regcomp() and regexec(). The file step2.cc is a rewrite of step1.cc that demonstrates matching using PCRE's main API.
PCRE requires that you compile a regexp using pcre_compile() (line 73) before you use it to test strings:
pcre* pcre_compile(
const char *pattern,
int options,
const char **errptr,
int *erroffset,
const unsigned char *tableptr
);
The regexp is supplied as the string pattern. Pass a set of bitwise-OR'ed constants in options to alter matching behavior. Examples of such constants include PCRE_CASELESS (case-insensitive matching), PCRE_UTF8 (assume UTF-8 encoded data), and PCRE_MULTILINE (test strings may contain newlines). The pattern in step2.cc defaults to no modifiers, or 0. (You can also set some options within the pattern itself.)
If pattern compilation fails, pcre_compile() returns NULL, stores the error message through the supplied errptr pointer, and stores the index of the offending character through erroffset (the errorMessage and errorOffset variables in the sample code).
tableptr is an optional set of character tables. The example passes NULL so it uses the default tables. Store the resultant pcre* object and reuse it: It's wasteful to recompile the regexp for each use.
Equally wise, then, is to study the pattern to yield faster matching. The function pcre_study() (line 66) returns a pcre_extra* object as the result of its analysis:
pcre_extra *pcre_study(
const pcre *code,
int options,
const char **errptr
);
As with pcre_compile(), errors are stored in the provided char** parameter. The options parameter is currently unused, but exists for forward compatibility. pcre_study() returns NULL if it cannot further optimize matching. Functions that take pcre_extra pointers gracefully handle this condition, though, so the sample code doesn't test pcre_study()'s return value.
PCRE uses an int[] as a work area for matching. This array's size is based on the pattern's total number of potential substring matches. Lines 72-79 size the array at runtime and call pcre_fullinfo() to extract the number of potential matches from the pcre* variable re. (Standard C++ has no variable-length arrays; GCC accepts them as an extension, and new[] or std::vector is the portable route.)
int totalMatches ;
pcre_fullinfo(
re ,
reStudy ,
PCRE_INFO_CAPTURECOUNT ,
&totalMatches
) ;
When matching against several patterns in a single program (or single thread), it is more memory efficient to reuse a shared work area sized for the pattern with the greatest number of matches. The work area for PCRE 4 and 5 is slightly larger than that of Version 3. Lines 109-115 use preprocessor macros to determine the compile-time library version and size the work area accordingly. (An excerpt of this routine is shared in the sidebar "Version Compatibility.")
pcre_exec() tests an input string for a match (lines 134-143):
int pcre_exec(
const pcre *code,
const pcre_extra *extra,
const char *subject,
int length,
int startoffset,
int options,
int* ovector,
int ovecsize
);
The parameters code and extra are the compiled regexp and study results, respectively. subject and length are the string to test against the pattern and its length, respectively. It's possible to test starting from an arbitrary point in the subject, but for simple matches against the entire string, the startoffset is 0 (the beginning). ovector and ovecsize refer to the work area and its size, respectively.
pcre_exec() returns one more than the number of substring matches (parenthesized patterns within the regexp) in the subject string. A successful match against a regexp with no substrings thus yields 1. Return codes less than 1 indicate a problem. 0 means the work area is too small: Perhaps you missized it, or there's an error in your regexp that causes extra matching to occur. (Check especially for over-escaped parentheses.) PCRE_ERROR_NOMATCH indicates that the subject did not match the regexp. Several other error constants are described in detail in the pcreapi man page.
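The return-code rules above can be summarized in a small helper. This is only a sketch: the enum, the function names, and the constant kErrorNoMatch are hypothetical. kErrorNoMatch stands in for PCRE_ERROR_NOMATCH (which pcre.h defines as -1) so that the fragment compiles without the PCRE headers.

```cpp
// Hypothetical classification of a pcre_exec() return value.
enum ExecOutcome {
    MATCHED,              // rc >= 1
    WORK_AREA_TOO_SMALL,  // rc == 0
    NO_MATCH,             // rc == PCRE_ERROR_NOMATCH
    OTHER_ERROR           // any other negative value
};

// Stand-in for PCRE_ERROR_NOMATCH; real code would include <pcre.h>
// and use the library's own constant.
const int kErrorNoMatch = -1;

ExecOutcome classify_exec_result(int rc)
{
    if (rc > 0)  return MATCHED;
    if (rc == 0) return WORK_AREA_TOO_SMALL;
    if (rc == kErrorNoMatch) return NO_MATCH;
    return OTHER_ERROR;
}

// On success the return value is one more than the number of
// parenthesized substrings that were captured.
int captured_substrings(int rc)
{
    return (rc > 0) ? rc - 1 : 0;
}
```

Keeping this logic in one place means the calling loop can switch on the outcome instead of repeating numeric comparisons after every pcre_exec() call.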
Finally, lines 180-181 clean up the pcre* and pcre_extra* objects allocated earlier. Behind the scenes, PCRE allocates this memory using pcre_malloc() and you must free it using pcre_free(). These functions call plain malloc() and free() by default. You can assign a custom allocator by pointing the (global) variables pcre_malloc and pcre_free at your own functions.
Unlike step1.cc, step2.cc supports the more powerful Perl-style regexps. For example:
$ ./step2 'January (\S+) 2005'
matches any contiguous set of nonspace characters between the words "January" and "2005." You can also use the POSIX-style character classes inside a bracket expression. For example, [[:alnum:]] matches any alphanumeric character, so the following expression would match January of any year:
$ ./step2 'January [[:alnum:]]*'
Refer to the sidebar "Prototyping Regexps" for hints on how to prototype your regexps.
Setting Options Within the Pattern
PCRE and Perl support the (?N) operator to set matching options within the pattern itself. This is an alternative to hard-coding an option (such as PCRE_CASELESS) in the call to pcre_compile().
Replace N with one or more characters, such as (?i) for a caseless match. The operator affects the regexp up to the end of the enclosing parenthesized group, or to the end of the pattern if there is none. For example:
(?i)foobar
matches the string foobar with any capitalization; whereas in the regexp:
foo((?i)bar)baz
only bar is matched in a case-insensitive manner.
The option letters match their Perl counterparts:
- (?i) (PCRE_CASELESS), case-insensitive matching.
- (?m) (PCRE_MULTILINE), test strings may span multiple lines.
- (?s) (PCRE_DOTALL), lets "." match even newlines in test strings.
- (?x) (PCRE_EXTENDED), permits space and comments in regexps.
You may specify multiple modifiers, such as (?ix) for a caseless, extended-format regexp.
The (? sequence is used elsewhere in PCRE (such as callouts and named substring matches) and can be considered a general control sequence.
Extracting Substrings
POSIX regexp support boils down to a simple question: "Does string X match pattern Y?" Perl and PCRE let you mark and extract specific substrings in the matching text segments. pcre_get_substring_list() hands you the matched substrings as an array:
int pcre_get_substring_list(
const char *subject,
int* ovector,
int stringcount,
const char ***listptr
);
Here, subject is the string tested by pcre_exec(). ovector and stringcount are the work area and number of matched strings (the value returned by pcre_exec()), respectively. PCRE stores the captured matches in listptr. Free the listptr array using pcre_free_substring_list(). Note that array index 0 holds the text matched by the entire pattern, while 1 is the first matched substring.
The functions pcre_copy_substring() and pcre_get_substring() return individual matched substrings based on their numeric position in the regexp:
int pcre_copy_substring(
const char *subject,
int *ovector,
int stringcount,
int stringnumber,
char *buffer,
int buffersize
);
int pcre_get_substring(
const char *subject,
int *ovector,
int stringcount,
int stringnumber,
const char **stringptr
);
pcre_get_substring() allocates memory for a new string; call pcre_free_substring() to release it. By comparison, pcre_copy_substring() copies the string to the user-supplied buffer array, so it does not need to be explicitly freed.
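To make the role of the work area concrete, here is a rough sketch of the copying variant. It relies on the layout described in the pcreapi documentation, where the offsets for match i sit in ovector[2*i] (start) and ovector[2*i+1] (one past the end). The function name and its error convention are hypothetical; this is not PCRE's implementation.

```cpp
#include <cstring>

// Hypothetical imitation of what a copy-style extraction does with the
// work area filled in by pcre_exec(). Returns the substring length, or
// -1 on any failure (PCRE itself reports distinct error codes).
int copy_from_work_area(const char* subject, const int* ovector,
                        int stringcount, int stringnumber,
                        char* buffer, int buffersize)
{
    if (stringnumber < 0 || stringnumber >= stringcount) {
        return -1;                      // no such match
    }

    const int start = ovector[2 * stringnumber];
    const int end   = ovector[2 * stringnumber + 1];
    if (start < 0 || end < start) {
        return -1;                      // group did not take part in the match
    }

    const int length = end - start;
    if (length + 1 > buffersize) {
        return -1;                      // no room for the text plus '\0'
    }

    std::memcpy(buffer, subject + start, length);
    buffer[length] = '\0';
    return length;
}
```

For the subject "January 15 2005" and the pattern January (\S+) 2005, a successful match would leave the offsets 0, 15, 8, 10 at the front of the work area, so match 1 copies out "15".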
PCRE uses the same match-counting rules as Perl: You calculate a match position from outer to inner parentheses, then from left to right. Consider the following regexp excerpt:
... ((\S+) (\d+)) ...
If the outermost parentheses bound match N, then (\S+) is match N+1, and (\d+) is match N+2.
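Outer-to-inner, then left-to-right, amounts to numbering groups by the position of their opening parenthesis. The sketch below applies that rule to a pattern string. It is a deliberately simplified illustration, not a regexp parser: the function name is hypothetical, and it assumes the pattern has no parentheses inside character classes or \Q...\E quoting.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the offset of every capturing '(' in `pattern`, in order.
// Element i of the result belongs to match number i + 1.
std::vector<std::size_t> capture_group_offsets(const std::string& pattern)
{
    std::vector<std::size_t> offsets;

    for (std::size_t i = 0; i < pattern.size(); ++i) {
        if (pattern[i] == '\\') {
            ++i;                 // skip the escaped character, e.g. "\("
            continue;
        }
        if (pattern[i] != '(') {
            continue;
        }

        // "(?" introduces a control sequence such as (?i) or (?C1),
        // which does not capture; the named form (?P<name> does.
        const bool special = (i + 1 < pattern.size() && pattern[i + 1] == '?');
        const bool named   = special && pattern.compare(i + 2, 2, "P<") == 0;

        if (!special || named) {
            offsets.push_back(i);
        }
    }
    return offsets;
}
```

Applied to ((\S+) (\d+)), the function reports opening parentheses at offsets 0, 1, and 7, which correspond to matches N, N+1, and N+2 in the discussion above.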
Simplify code maintenance by storing symbolic names for substring matches in an enum. Alternatively, you can tag your matches and fetch them by name. Inside a substring match's parentheses, precede the pattern of interest with ?P<name>. Pass name as the stringname parameter of pcre_copy_named_substring() and pcre_get_named_substring(). (These operate similarly to pcre_copy_substring() and pcre_get_substring(), respectively.) For example:
pcre_get_named_substring( pcre* , ... subject, work area, match count ... , "Foo" , ... )
fetches the text matching the regexp fragment:
(?P<Foo>\d+)
The stub programs step3.cc and step4.cc demonstrate extracting substrings by numeric index and name, respectively.
Callouts
Perl's (?{ ... code ... }) blocks fire code as parts of a string are matched against the regexp. The callout is the PCRE equivalent. Mark a regexp portion using (?CN) syntax, where N is a number from 0 to 255. For example:
(?C1)(foo)(?C2)(bar)(?C3)
PCRE calls the function assigned to the variable pcre_callout as it reaches each marker while matching the subject string. This function has the signature:
int function( pcre_callout_block* )
As pcre_callout is a global variable, there can be only one callout function per program. In turn, pcre_callout_block.callout_number identifies the callout marker (it's the N in (?CN)) such that pcre_callout can distinguish between callout points. pcre_callout can thus be a simple switch() block that calls other functions based on the callout's number.
C++ users take note: The callout function must be a global or static (class) function; object member functions are not permitted. Furthermore, the function must be exposed with C linkage using an extern "C" declaration. There's nothing to stop you from using the callout function as a pass-through to an object, though. You can assign an arbitrary object to the callout_data member of a regexp's pcre_extra structure (and set the PCRE_EXTRA_CALLOUT_DATA bit in its flags member so that PCRE honors the field). That object will be available in the callout function as the pcre_callout_block.callout_data member. (You must cast it from void* to your expected object type.) To not use callouts, (re)set the value of pcre_callout to NULL. The stub program step5.cc (available at http://www.cuj.com/code/) uses a callout to print the last substring match made in the subject string.
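The pass-through idea might look roughly like the following. To keep the sketch buildable without the PCRE headers, callout_block_stub is a hypothetical stand-in that carries only the two pcre_callout_block members discussed above. CalloutLog and forward_callout() are likewise invented for illustration and are not taken from step5.cc.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for pcre_callout_block (real code would include
// <pcre.h> and use the library's structure).
struct callout_block_stub {
    int   callout_number;  // the N in (?CN)
    void* callout_data;    // the pointer stored in pcre_extra.callout_data
};

// Hypothetical object that wants to be told about each callout.
class CalloutLog {
public:
    void record(int number) { seen_.push_back(number); }
    const std::vector<int>& seen() const { return seen_; }
private:
    std::vector<int> seen_;
};

// A free function with C linkage, as a callout must be. It recovers the
// object from the void* and forwards the call to a member function.
extern "C" int forward_callout(callout_block_stub* block)
{
    CalloutLog* log = static_cast<CalloutLog*>(block->callout_data);
    if (log != NULL) {
        log->record(block->callout_number);
    }
    return 0;   // in PCRE, a zero return lets matching proceed as normal
}
```

With the real library, the equivalent function would take a pcre_callout_block* and be assigned to pcre_callout before pcre_exec() is called. Because each regexp carries its own callout_data, the single global function can still serve many objects.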
Putting It All Together
The sample application (named "app" and available online) uses the techniques described here to postprocess netfilter/iptables firewall logs. A Matcher object represents a regular expression. Its operator() member function calls PCRE code to test strings against the regexp. On success, a MatchInfo* is returned. MatchInfo is a lightweight wrapper around PCRE data types that lets calling code fetch substring matches by descriptive name or numeric index. As an alternative, callers may specialize the template version of PCREMatch::operator() to work directly with the raw PCRE match data.
The PacketInfo class holds source and destination host/port pairs. Its accessor member functions are used by Output objects to further process that data. For example, the supplied TextOutput class prints the info to an output stream. Output could also be subclassed to export the data to a database or XML. PacketInfo and Output objects meet inside a Processor object, which receives logfile data from main().
Matcher and MatchInfo classes wrap PCRE calls. They are generic and may thus be copied to other apps. By comparison, PacketInfo, Output, and Processor are specific to the sample app.
The Wrap-Up
PCRE lets you bring Perl functionality into your native code without resorting to unpleasant methods. Without PCRE to do the heavy lifting, PCREMatch and MatchInfo would have hidden some very ugly code behind their interfaces.
Acknowledgment
Thanks to Derek Ashmore for reviewing (and improving) this article.
References
- [1] http://www.pcre.org/.
CUJ