Dr. Dobb's | Regex Arcana | August 01, 2004

Regex Arcana

URL:http://www.drdobbs.com/web-development/regex-arcana/184416136

Jeff "japhy" Pinyan, author of Regexp::Parser, is finishing up at RPI in Troy, NY. He can be contacted at [email protected].

Perl's regular expressions (hereafter known as "regexes") have come a long way since Version 1.0. They're like the eccentric uncle who spends a few years sojourning somewhere exotic and comes back older, wiser, and more worldly. Or perhaps, in Perl's case, more out-of-this-worldly. Perl's regexes aren't what computer science purists would call "regular"—that idea was thrown out the window when backreferences became something you could include in the pattern itself, as shown in this simple dynamic regex:

# find any doubled words
my $text = "Four score and and seven years ago";
while ($text =~ /\b(\w+)\W+\1\b/g) {
  print "'$1' repeats itself";
  # 'and' repeats itself
}

This regex is self-referential: The backreference \1 tells the regex engine to match whatever has been captured into $1. This means the regex doesn't know for certain what it will match until runtime. And this is only the tip of the iceberg.

I'm going to share two pieces of regex arcana with you—features of the Perl regex engine that aren't used often, but offer those who can master them great power.

Code Evaluation

The code evaluation assertion (?{ CODE }) was introduced in Perl 5.005, but it was rough around the edges. In Perl 5.6, it was fine-tuned and more adjustments were made by 5.8. The code evaluation assertion allows a regex to execute an arbitrary block of Perl code during the course of its pattern matching. This device becomes particularly interesting when one realizes that the backtracking stack is much like a variable scoping stack.

Delayed Execution

The delayed execution assertion (??{ CODE }) is the heart of dynamic regexes. Using it is like walking across a bridge as you're building it. Its true beauty is revealed when you create a Regexp object (via qr//) whose delayed execution assertion refers to that self-same Regexp object. You have then created a recursive regex, one that is capable of matching nested data structures or acting like a proper grammar.
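As a rough sketch of that recursion, here is a pattern that matches balanced parentheses. The variable name $balanced and the sample strings are illustrative and do not come from this article; declaring the variable with our is an assumption made so that the code inside the pattern sees the same variable.

```perl
use strict;
use warnings;

# $balanced is a hypothetical name; it is declared with "our" so the
# delayed execution assertion inside the pattern refers to this variable.
our $balanced;
$balanced = qr{
  \(                        # an opening parenthesis
  (?:
      (?> [^()]+ )          # a run of non-parenthesis characters
    |
      (??{ $balanced })     # or a nested balanced group
  )*
  \)                        # the matching closing parenthesis
}x;

# Pull the first balanced group out of a longer string.
my ($group) = "x = f(a, g(b, c)) + (d" =~ /($balanced)/;
print "$group\n";           # (a, g(b, c))

# An unclosed parenthesis cannot match from start to end.
my $whole = "((a)" =~ /^$balanced$/ ? "balanced" : "unbalanced";
print "$whole\n";           # unbalanced
```

The pattern is only compiled once; each time the engine reaches the assertion, it fetches the Regexp object and matches it at the current position.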

Getting Up to Speed

This article is going to use some regex variables you might not be familiar with, so I introduce them to you in Table 1.

Here is a simple example exhibiting these variables:

"perl" =~ /(.(..).)/;
# $1  = "perl"
# $2  = "er"
# $+  = "er" (in this case, $2)
# $^N = "perl" (in this case, $1)
# @-  = (0, 0, 1)
# @+  = (4, 4, 3)

After a successful pattern match against $str, the pair of values $-[X] and $+[X] hold the offsets in $str of the Xth capture group. $-[0] and $+[0] are the offsets for $&:

substr($str, 0, $-[0]);               # produces $`
substr($str, $-[0], $+[0] - $-[0]);   # produces $&
substr($str, $+[0]);                  # produces $'
substr($str, $-[1], $+[1] - $-[1]);   # produces $1
substr($str, $-[2], $+[2] - $-[2]);   # produces $2 (etc.)
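To see those offsets at work, here is a small self-contained sketch. The sample string and the variable names ($pre, $match, $post, $one, $two) are made up for illustration.

```perl
use strict;
use warnings;

my $str = "I love perl scripts";   # illustrative sample string
my ($pre, $match, $post, $one, $two);

if ($str =~ /(p(er)l)/) {
  $pre   = substr($str, 0, $-[0]);               # text before the match
  $match = substr($str, $-[0], $+[0] - $-[0]);   # the whole match
  $post  = substr($str, $+[0]);                  # text after the match
  $one   = substr($str, $-[1], $+[1] - $-[1]);   # same text as $1
  $two   = substr($str, $-[2], $+[2] - $-[2]);   # same text as $2
  print "<$pre|$match|$post> 1=$one 2=$two\n";
  # <I love |perl| scripts> 1=perl 2=er
}
```

Nothing here touches the prematch, match, or postmatch variables themselves; everything is rebuilt from @- and @+.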

You are probably aware that $& and its friends are bad news. They cause reductions in the speed of your programs, so use them only when debugging. If you use them once in your code, you might as well use them as many times as you can in that code, because you've already suffered the hit.

Using (?{ ... })

Our first incantation is the code evaluation assertion. It executes the code inside it when it is encountered, and then the regex continues; the assertion always succeeds, unless its contents interrupt the execution of your program.
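A minimal sketch of the "always succeeds" behavior: each block below ends with an expression whose value is false, yet the match carries on regardless. The names @trace and $flag are hypothetical, chosen only for this example.

```perl
use strict;
use warnings;

our @trace;       # hypothetical names, used only in this sketch
our $flag = 1;

my $matched = "abc" =~ m{
  a (?{ push @trace, "after a"; $flag = 0 })
  b (?{ push @trace, "after b"; $flag = 0 })
  c
}x;

print join(", ", @trace), "\n";              # after a, after b
print $matched ? "matched\n" : "failed\n";   # matched
```

The value of the last expression in the block has no bearing on whether the regex matches; the block is run purely for its side effects.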

Let's look at some applications of this assertion.

Inspecting a Regex's State

It has been said that "you can't observe the behavior of a system without affecting the system's behavior." Not so, at least in Perl 5.8. The following regex uses code assertions to show how a particular optimization in the engine works:

# the /x modifier allows embedded whitespace
# to help improve readability and clarity
"salad" =~ m{
  .* (?{ print "[$&] " }) a
  .* (?{ print "($&) " }) a
}x;
# output:
# [sal] [s] (sal)

We often see backtracking as a character-at-a-time procedure, but it's really much more efficient than that. In this case, something like .*a makes the engine jump from one "a" to the next when matching or backtracking. The code assertion doesn't get in the way of this, even though it sits between the .* and the a. (Sadly, in Perl 5.6, it did get in the way. Try that code in 5.6 and you'll see it stops the optimization from occurring.)

You have access to the regex variables inside a code assertion; my example above uses $&, although it's general practice to avoid that variable. You can also access $1 and other digit variables, $^N, $+, and the arrays @- and @+. One caveat, though, is that you can't access a capture group until it has been completed; that is, the closing parenthesis for $1 must be passed before you can see the contents of $1 in a code assertion, otherwise you'll be seeing its previous value. This code won't print anything inside the quotes because $1 hasn't been closed before it's printed:

"perl" =~ m{
  ( (?: . (?{ print "'$1'\n" }) )+ )
}x;

In order to inspect the digit variables as they're being created, you need to be cunning: Immediately prior to the capture group, use a code evaluation to store the current location in the string in a variable. Then use substr() on $_, which inside a code evaluation is a copy of the string you're matching against:

"perl" =~ m{
  (?{ $p = $-[0] })
  ( (?:
    .
    (?{ print substr($_, $p, $+[0] - $p), "\n" })
  )* )
}x;

That $_ trick is undocumented as far as I can tell, which means it might not stay that way in the future. (But considering Perl 6 regexes will be radically different, we should have fun while we can.)
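More recent versions of the perlre documentation do describe $_ inside a code block as the string being matched, and mention pos() as a way to read the current match position. Assuming a reasonably modern perl, the same inspection can be sketched like this; the names $start and @prefixes are hypothetical:

```perl
use strict;
use warnings;

our $start;       # hypothetical names for this sketch
our @prefixes;

my $ok = "perl" =~ m{
  (?{ $start = pos() })
  (?:
    .
    (?{ push @prefixes, substr($_, $start, pos() - $start) })
  )+
}x;

print join(",", @prefixes), "\n";   # p,pe,per,perl
```

Each pass through the group records everything consumed so far, which is what the capture group would hold if it were closed at that point.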

In this way, code evaluation assertions are good debugging tools because you can use them without interfering with the execution of the regex and you can get helpful diagnostic information, such as where you are in the string, what you've captured, and so on.

Capturing Repetitions

I've often been asked why a regex like /(\w)+/ only returns the last word character captured:

"japhy" =~ /(\w)+/ and print "<$1>";  # <y>

The reason is that the regex is continually storing what \w matches in $1, and each time, it's overwriting the old contents. It won't create $2, $3, and so forth, and it doesn't create a @1 array. So the question remains: How can I keep track of repeated capture groups?

The answer works on the principle that the matching and backtracking in a regex are very similar to a variable scoping stack if you use local() on your variables inside code evaluations. local() works differently in a regex; it gives a variable a value that