Channels ▼

Jocelyn Paine

Dr. Dobb's Bloggers

Arbno, the Cursor, and Snobol Patterns in Prolog

January 26, 2010

Can any Prolog or SWI-Prolog expert tell me how to implement an equivalent of Snobol's "arbno" pattern for use in Definite Clause Grammars, efficiently and elegantly? This is the pattern that matches an arbitrary number of repetitions of an argument pattern. And how would you define Snobol's "$" and "@"?

In my daily string-hacking, I still use Snobol, despite its age. And I use SWI-Prolog, with Definite Clause Grammars (DCGs) to specify the syntax of strings. Because compared with Snobol patterns and DCGs, which other executable syntax notations stand a snowball's chance in Hell?

But DCGs are only a notation for writing grammars and turning these into parsers. One still has to code the syntactic primitives that the grammar rules will call.

For example, there is a Snobol pattern function named "span" which returns patterns that match a non-empty sequence of characters. The pattern:

   span( "abc" ) 
would match "a", and "ab", and "cb", and "acbbcabb": non-empty sequence of "a"'s, "b"'s, and "c"'s. I can implement this easily enough in DCG rules, but it's nice not to have to bother.

Another example: the "rem" pattern matches the remainder of a string. The "$" operator assigns the string matched by the pattern on its left to the variable on its right. And the "break" pattern function returns patterns that match everything up to a specified character. So the pattern:

   break( "XYZ" ) $ Pre rem $ Post 
would match "Xabc" and assign the null string to Pre and "Xabc" to Post. It would match "abcXdef" and assign "abc" to Pre and "Xdef" to Post.

(If the above paragraphs don't mean much, you may like to read my explanations of Snobol in Programs That Transform Their Own Source Code; or: the Snobol Foot Joke and More Technonecrophilia with Snobol One-Liners. I also recommend Mark Emmer's very useful A Snobol4 Tutorial, sections from which I've linked to for further reading in these essays.)

Yet another example is the "arb" pattern, which matches an arbitrary string. So the pattern:

   arb $ Pre span( "abc" ) $ Mid rem $ Post 
would match "XabccbaY", assigning "X" to Pre, "abccba" to Mid, and "Y" to Post.

Which brings me to "arbno". This is a pattern function that implements a kind of higher-order "arb". As already mentioned, it returns patterns that match an arbitrary number of repetitions of the argument. So the pattern:

   arbno( span("abc") "X" )  
would match "abcXbcXcbaaX". And
   arbno( span("abc") "X" ) $ Matched 
would match "abcXbcXcbaaX" and assign it to Matched.

That's Snobol pattern matching for you. The spans and the breaks and the arbs and the rems are nice to have around, and I've written SWI-Prolog equivalents which I can call from DCG rules. But I don't like my code for arbno.

Before I show you it, I'll demonstrate my coding style by showing you "span". First, — and assuming you are comfortable with DCGs and the "phrase" predicate, as well as Prolog in general — here is me calling "span" from the SWI-Prolog top level:

 2 ?- <strong>phrase( span("ab"), "aXXXX", Out ).</strong><br /> Out = [88, 88, 88, 88].<br /> <br /> 3 ?- <strong>phrase( span("ab"), "abbaXXXX", Out ).</strong><br /> Out = [88, 88, 88, 88].<br /> <br /> 4 ?- <strong>phrase( span("ab"), "XXXX", Out ).</strong><br /> false.<br /> <br /> 5 ?- <strong>phrase( span("ab",Matched), "abbaXXXX", Out ).</strong><br /> Matched = [97, 98, 98, 97],<br /> Out = [88, 88, 88, 88]. 
In this output, "Out" is the third argument of "phrase". As customary, it holds the string remaining after the match. In three of the matches, this is four 88s: that is, four "X"'s. I shouldn't need to remind Prolog programmers that double-quoted strings are merely a handy notation for lists of character codes. The [97, 98, 98, 97] are "abba".

Now I'll define "span":

 span( Chars ) --><br />   span( Chars, _Matched ).<br /> <br /> <br /> span( Chars, [ C | RestMatched ] ) --><br />   any( Chars, [C] ),<br />   span_tail( Chars, RestMatched ).<br /> <br /> <br /> span_tail( Chars, Matched ) --><br />   span( Chars, Matched ), !.<br /> <br /> span_tail( _Chars, [] ) --><br />   []. 

I used something else to build "span". It's an uncomplicated pattern function named "any", which returns a pattern that match any of the characters in its argument. Thus, the pattern:

   any( "abc" ) 
matches "a", "b", or "c".

This is "any" in Prolog:

 any( Chars ) --><br />   any( Chars, _Matched ).<br /> <br /> <br /> any( Chars, [C] ) --><br />   [ C ],<br />   { memberchk( C, Chars ) }, !. 

By the way, you will notice that I have two arities of pattern. The one-argument variant is directly analogous to the Snobol equivalent. Its argument takes the set of characters making up the substring to be matched. (In "span" and "any", at least. In other patterns, this argument may be different. In "len", for instance, it's a number of characters.)

The extra argument of the two-argument variant returns the substring matched. You can see this in my example of "span". I coded this because I found it often useful. Snobol has other ways to get this matched substring, namely the "." and "$" operators. But as I'll explain later on, I can't decide how to implement these efficiently.

Now that I've demonstrated my coding conventions, I'll return to "arbno". Here is a call which matches repetitions of span("abc") followed by "X". It shows the alternative solutions, invoked in SWI-Prolog by typing a semicolon:

 6 ?- <strong>phrase( arbno( (span("abc"),"X") ), "abcXbcXcbaaX", Out ).</strong><br /> Out = [97, 98, 99, 88, 98, 99, 88, 99, 98|...] <strong>;</strong><br /> Out = [98, 99, 88, 99, 98, 97, 97, 88] <strong>;</strong><br /> Out = [99, 98, 97, 97, 88] <strong>;</strong><br /> Out = [] <strong>;</strong><br /> false. 
As before, Out is the string left unmatched by "phrase"'s first argument; 97 is the character code for "a"; and 88 is the character code for "X".

This is how I define arbno:

 arbno( Pat ) --><br />   arbno( Pat, _Matched ).<br /> <br /> <br /> arbno( Pat, Matched ) --><br />   arbno( Pat, [], Matched ).<br /> <br /> <br /> arbno( _Pat, MatchedSoFar, MatchedSoFar ) --><br />   [].<br /> <br /> arbno( Pat, MatchedSoFar, Matched ) --><br />   { copy_term( Pat, PatCopy ) },<br />   $( PatCopy, MatchedHere ),<br />   { append( MatchedSoFar, MatchedHere, MatchedSoFarAndHere ) },<br />   arbno( Pat, MatchedSoFarAndHere, Matched ).<br /> <br /> <br /> $( Pat, Matched, In, Out ) :-<br />   phrase( Pat, In, Out ),<br />   append( Matched, Out, In ). 

My first clause has the same purpose as the first clauses for "span" and "any". It defines the variant that doesn't return the substring matched. You can see that it ignores the _Matched variable in its tail.

The variant of the predicate that does return the substring matched is next. This is "arbno"'s core, if patterns can have cores. Unlike "span" and "any", it uses an accumulator argument to do so. Its first clause can be read declaratively as:

 An arbitrary number of repetitions of Pat can match the empty list. The substring matched before this call followed by the substring matched by it is the same as the substring matched before this call. 

The second clause can be read as:

 An arbitrary number of repetitions of Pat can match Pat followed by  an arbitrary number of repetitions of Pat. The substring matched before this call followed by the substring matched by it is the same as the substring matched before this call, followed by the substring matched by Pat, followed by the substring matched by the remaining  repetitions of Pat. 

That declarative reading doesn't explain my call to copy_term. This is a standard predicate, which here copies Pat, replacing variables that occur in it by new variables in the copy. Why do I use it? Had I not, the first time Pat was called, it would probably have bound the variables in its arguments. (Whether it would in fact have is up to its author, but there's no point in having variables if you're not going to put something into them.)

But then consider the recursive call to arbno, and hence to Pat. This will need to bind those variables too. But the values it needs to bind them to almost certainly won't be the same as the values the first call bound them to; and so the call will fail. Copying the variables avoids this.

The next tail goal calls my analogue of Snobol's immediate assignment operator. I introduced this earlier, when I said that

   break( "XYZ" ) $ Pre rem $ Post 
would match "Xabc" and assign the null string to Pre and "Xabc" to Post; or would match "abcXdef" and assign "abc" to Pre and "Xdef" to Post. Here's an example call to my analogue. I've made it a binary operator, but I shan't bother to show the call to "op":
 7 ?- <strong>phrase( ( break("XYZ")$Pre, rem$Post ), "abcXdef", Out ).</strong><br /> Pre = [97, 98, 99],<br /> Post = [88, 100, 101, 102],<br /> Out = [] <strong>;</strong><br /> false. 

In arbno, I am using "$" to grab the substring matched by Pat, which the "append" then appends to make MatchedSoFarAndHere.

Here, again, is my definition of "$":

 $( Pat, Matched, In, Out ) :-<br />   phrase( Pat, In, Out ),<br />   append( Matched, Out, In ). 

But there is a problem. In "$" and in the second clause of arbno, I seem to use too many "append"'s, but I can't think of a more efficient method of getting the substring matched by arbno. Can you? The "append" in

   $( Pat, Matched ) 
seems a particularly egregious way to capture the substring that Pat matches. In effect, it subtracts two strings that I can get easily, namely that remaining before matching Pat, and that remaining after doing so.

Perhaps, though, my arbno isn't the most useful analogue of Snobol's anyway. Because as I explained earlier, my patterns have an optional extra argument that returns the substring they match. You might want to pass these to arbno. Or you might want to call your own predicates that pass information back in variables. But because of my copy_term call, these variables will never get bound.

So here's a different try at an arbno which does capture bindings made by the pattern it calls. It resembles setof/3, in that: it calls a goal (the pattern); and takes a "template" argument which is a term containing the variables whose bindings we want to get back; and returns a list of terms containing these bindings. So by analogy with setof, I'll name it seqof.

Here, I show a call that matches "abcXbcXcbaaX" against an arbitrary number of span("abc") followed by "X"'s. The output shows alternative solutions, with their values for List and Out. List is the list of terms returned by seqof. This contains values for Matched, which holds the substring matched by "span". Out is the substring remaining after the entire match: returned, as before, in "phrase"'s third argument:

 8 ?- <strong>phrase( seqof(Matched,(span("abc",Matched),"X"),List), "abcXbcXcbaaX", Out ).</strong><br /> List = [],<br /> Out = [97, 98, 99, 88, 98, 99, 88, 99, 98|...] <strong>;</strong><br /> List = [[97, 98, 99]],<br /> Out = [98, 99, 88, 99, 98, 97, 97, 88] <strong>;</strong><br /> List = [[97, 98, 99], [98, 99]],<br /> Out = [99, 98, 97, 97, 88] <strong>;</strong><br /> List = [[97, 98, 99], [98, 99], [99, 98, 97, 97]],<br /> Out = [] . 

Next, my definition of seqof. Like arbno, it calls copy_term. This copies Pat, and also the template, which I've named Vars. Copying them both in the same call means the copied variables in Vars will be shared properly with their occurrences in the copied pattern. Calling the copied pattern will bind them (if its author has written it in a sensible way), and they will be captured in the first element of seqof's third argument, the list:

 seqof( _Vars, _Pat, [] ) --><br />   [].<br /> <br /> seqof( Vars, Pat, [ VarsCopy | List ] ) --><br />   { copy_term( Pat-Vars, PatCopy-VarsCopy ) },<br />   PatCopy,<br />   seqof( Vars, Pat, List ).   

To conclude, I've shown you how one can implement equivalents of some Snobol patterns in Prolog DCGs, and thereby facilitate string-processing. Other programmers know so too: I'm sure I saw Richard O'Keefe post about "span" in the SWI-Prolog mailing list, though I can't now find the post. Patterns via DCGs are hardly rocket science, assuming that rocket science means something impossible to comprehend and implement. I suspect, though, that we can't implement all the Snobol patterns as DCGs without forcing the user to write extra arguments to their DCG rules, to pass global pattern-matcher state from pattern to pattern.

I say this because there are Snobol patterns that seem to demand such state. Such as:

   &anchor = 1<br />   "abcd" arb len(1) $ output @output fail<br /> end 

This matches the string "abcd" against an arbitrary substring followed by a single character, the latter matched by len(1). The Snobol "fail" is like a Prolog "fail": it makes the matcher backtrack and try alternative solutions. In this case, solutions for "arb", which will match successively longer substrings.

The "len(1) $ output" assigns len's single character to "output", a variable that Snobol associates with the standard output stream. This displays that character on a line by itself. And the "@output" calls the "@" operator, which assigns the pattern matcher's cursor to "output". The cursor is an integer giving the current position in the string being matched. So the program displays a sequence of characters and character positions:

 C:\dobbs>\snobol4\SNOBOL4.EXE at.sno<br /> <br /> SNOBOL4+          Version 2.24.    8087 present.<br /> (c) Copyright 1984,1998 Catspaw, Inc., Salida, Colo.  All Rights Reserved.<br /> <br /> No errors<br /> <br /> a<br /> 1<br /> b<br /> 2<br /> c<br /> 3<br /> d<br /> 4 

But I can't see how to define a pattern predicate that gets the cursor's value, unless one passes it in an extra DCG argument. Which would inconvenience the person calling the pattern predicates, because they'd have to pass this argument into every goal in their rules and and then out again. Although we could, I suppose, write a pre-processor that translates DCG rules into Prolog such that they all have these extra arguments. Except for these extra arguments, it would be the standard DCG translator.

Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 


Video