Channels ▼

Jocelyn Paine

Dr. Dobb's Bloggers

Snobol Patterns in Prolog IV: bal, and the Use of Failure to Diagnose Patterns

February 06, 2010

It's a pun between logic and control flow: a pun where you hear "A or B" with one ear but "Do A; until something fails, then try B" with the other. In most languages, there are Boolean expressions that return a TRUE or a FALSE, and conditions that jump to a THEN or an ELSE. But in Snobol, there are pattern matches that succeed or they fail; and if a match fails, the matcher will backtrack to already-matched subpatterns, seeking alternative matches that make the current match succeed. And in Prolog, there are calls to predicates that succeed or they fail; and if a call fails, Prolog will backtrack to already-called predicates, seeking alternative solutions that make the current call succeed. "It is not enough", as Gore Vidal wrote, "to succeed. Others must fail". I am going to show you how I implemented Snobol's "bal" pattern, which matches a string that's balanced in round brackets, with Prolog Definite Clause Grammar rules; and how to use failure to probe "bal"'s behaviour.

First, what is "bal" for? Snobol programmers use it when processing source code, including Snobol source code. A good example is the function-defining function "dexp" from James Gimpel's book Algorithms in Snobol4. I explained how it works in my Programs That Transform Their Own Source Code; or: the Snobol Foot Joke, and I'll recap now.

In effect, "dexp" is a macro-expander. It takes a string containing a function defined by an expression, such as

dexp( "ave(x,y) = (x + y) / 2.0" )
and expands it into a Snobol statement containing the function's body. For "ave", that statement is
"ave ave = (x + y) / 2.0 :s(return)f(freturn)"
Then it compiles the body by calling the built-in "code" function. As I demonstrated in Programs That Transform Their Own Source Code, one can then enter the code by jumping to the label that starts it, "ave". Finally, "dexp" calls the built-in "define" function to make "ave" a function having this label as entry point. It calls "define" as follows, passing it "ave"'s name and local parameters:
define( "ave(x,y)" )

The original string — that is, "dexp"'s argument — contains a bracketed list of local parameters, "(x,y)". This list needs to be removed in order to make "ave"'s body. Also, the combination of function name and local parameters, "ave(x,y)", must be recognisable in order that "dexp" can split it away from the expression following and pass it to "define". To recognise it, "dexp" uses "bal".

It uses "bal" in the following pattern:

break('(') . Name bal . Args
I explain this and Snobol in more detail in Programs That Transform Their Own Source Code. But the gist is that the "break('(') . Name" matches everything up to but not including the opening bracket and assigns it to the variable Name. Then "bal . Args" matches the opening bracket up to and including the closing bracket, and assigns it to the variable Args.

Unlike the other patterns I've written about in these essays, I myself haven't used my Prolog "bal" for real-world tasks. Perhaps that's because most of what I've applied Snobol and Prolog to recently has been processing files of drug-trial data: numerical measurements to be fed to the statistical language R. Perhaps it's because when I do manipulate source code, I write Definite Clause Grammars from scratch, following the syntax of the programming language. But some Prolog programmers may find "bal" useful. And I also want to use it in demonstrating how "fail" can make both languages enumerate all possible matches of a pattern, thereby helping us check the pattern's specification.

Here are some calls to my Prolog "bal":

2 ?- phrase( bal(Matched), "()", Rest ).
Matched = [40, 41],
Rest = [] ;

3 ?- phrase( bal(Matched), "(", Rest ).

4 ?- phrase( bal(Matched), "(a)(b)", Rest ).
Matched = [40, 97, 41],
Rest = [40, 98, 41] ;
Matched = [40, 97, 41, 40, 98, 41],
Rest = [] ;

5 ?- phrase( bal(Matched), "x(a)(b)y", Rest ).
Matched = [120],
Rest = [40, 97, 41, 40, 98, 41, 121] ;
Matched = [120, 40, 97, 41],
Rest = [40, 98, 41, 121] ;
Matched = [120, 40, 97, 41, 40, 98, 41],
Rest = [121] ;
Matched = [120, 40, 97, 41, 40, 98, 41, 121],
Rest = [] ;

In these calls, I'm using the built-in phrase/3 predicate to invoke "bal". Its second argument is the complete string to be matched — what Snobol calls the subject. For example "x(a)(b)y". Its third argument, Rest, is what remains unmatched. The argument to "bal", Matched, will be unified with the substring matched by it. Double-quoted string constants are, as usual, represented as lists of character codes: codes 40 and 41 are '(' and ')', while 97 is 'a' and 120 is 'x'. I'm typing the calls into SWI-Prolog's top-level interpreter: the semicolon makes the interpreter backtrack and seek an alternative solution.

Now here's my definition of "bal" in Definite Clause Grammar (DCG) rules. The second rule is the "bal" that returns the substring it matches; the first rule discards this argument:

bal -->
  bal( _Matched ).

bal_prim( [C] ) -->
  notany( "()", [C] ), !.

bal_prim( "()" ) -->
  "()", !.

bal_prim( Matched ) -->
  "(", !, bal( Matched1 ), ")",
  { append( [ 0'( | Matched1 ], ")", Matched ) }.

bal_tail( [] ) -->

bal_tail( Matched ) -->
  bal( Matched ).

bal( Matched ) -->
  bal_prim( Matched1 ),
  bal_tail( Matched2 ),
  { append( Matched1, Matched2, Matched ) }.

notany( Chars ) -->
  notany( Chars, _Matched ).

notany( Chars, [C] ) -->
  [ C ],
  { not( memberchk( C, Chars ) ) }, !.

I hope the code is straightforward, though it would be nice to do away with the two "append"'s. One reason I'm publishing it is that it took me a little bit of work to discover "bal"'s specification. Namely, and in terms of the above listing, what to define as a primitive component in bal_prim, and how exactly to code the recursive call to "bal".

I say this because I found a 1998 posting by Mark-Jason Dominus to org.perl.perl5-porters, Pattern matching in SNOBOL4. He says that if Snobol "bal" were not built in, it could be defined as:

bal = span('()') | '(' *bal ')' | arbno(*bal)
We can read this declaratively as:
The longest possible substring composed only of round brackets; OR:
an opening round bracket followed by "bal" followed by a closing round bracket; OR:
any number, including none, of repetitions of "bal".

But the "span" seemed wrong, because it could match an unbalanced string such as "))(". And the third alternative also seemed wrong, because "arbno" can match any number, including none, of repetitions. So it could match the empty string, which "bal" doesn't.

So — and this is where failure and backtracking enter — I checked "bal"'s behaviour with this little Snobol program:

  &anchor = 1
  &fullscan = 1

  "x(a)()xxx(" bal $ output fail


This matches the string "x(a)()xxx(" against the pattern. The "bal $ output" displays the matched substring on a line by itself, by assigning it to "output", a variable that Snobol associates with the standard output stream. And the "fail" backtracks into "bal", making it try an alternative match. Because "fail" can never succeed, the pattern will keep backtracking until "bal" finds no more matches. But in doing so, it will display all these matches. This tells me exactly which substrings "bal" will match, from which I can see whether any of them is empty.

This is the program's output, and you can see that none of the alternatives is empty. The output shows how "bal" steps neatly along, matching successively longer balanced portions of the subject:

C:\dobbs>\snobol4\SNOBOL4.EXE bal.sno

SNOBOL4+ Version 2.24. 8087 present.
(c) Copyright 1984,1998 Catspaw, Inc., Salida, Colo. All Rights Reserved.

No errors


For comparison, here is corresponding output from my Prolog "bal". I have called it from the built-in predicate forall/2, which finds all solutions to the first argument, invoking the second for each one. Here, that's all solutions to "bal(M)"; the call to "format" displays M:

6 ?- forall( phrase( bal(M), "x(a)()xxx(", _ ), format('~s~n', [M]) ).

In fact, "forall" merely packages Prolog "fail" into a cleaner control structure. I could have got the same effect with:

7 ?- phrase( bal(M), "x(a)()xxx(", _ ), format('~s~n', [M] ), fail.

Another example where failure lists all alternative matches is in the next program. I think I saw this example, or one similar, in that standard reference book The SNOBOL4 Programming Language by R. E. Griswold, J. F. Poage, and I. P. Polonsky.

The program matches a string against a pattern made by concatenating two sets of alternative strings and a single string. As before, "fail" causes all matches to be generated. One slight change, not significant, is that now I've stored the main part of the pattern in a variable, "bread". I've concatenated "fail" to this in the pattern-match statement:

  &anchor = 1
  &fullscan = 1

  bread = ("bre" | "br" | "b") $ output
+         ("a" | "ea" | "rea") $ output
+         "d" $ output

  "bread" bread fail


Here is the output:

C:\dobbs>\snobol4\SNOBOL4.EXE bread.sno

SNOBOL4+ Version 2.24. 8087 present.
(c) Copyright 1984,1998 Catspaw, Inc., Salida, Colo. All Rights Reserved.

No errors


Here is my Prolog equivalent. My "bread" pattern calls the immediate-assignment operator "$" that I defined in Arbno, the Cursor, and Snobol Patterns in Prolog. Apart from that, it uses only standard DCG primitives: the semicolon is a DCG alternation operator, analogous to Snobol "|" and translated by Prolog's DCG pre-processor into the ";/2" predicate for disjunction of goals:

bread -->
  ("bre" ; "br" ; "b") $ S1,
  ("a" | "ea" | "rea") $ S2,
  "d" $ S3,
  { format( '~s,~s,~s~n', [S1,S2,S3] ) }.

And here is its output, which I'll run once with an explicit "fail", and once from "forall". The "{fail}"'s insert a call to fail/0 into the first argument to "phrase". Protecting goals from the DCG pre-processor with curly brackets in this way is a standard feature of Prolog:

8 ?- phrase( (bread,{fail}), "bread", [] ).

9 ?- forall( phrase( (bread,{fail}), "bread", [] ), true ).

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.