Channels ▼

Parsing Complex Text Structures

Source Code Accompanies This Article. Download It Now.

Jun00: Programmer's Toolchest

Ian is a software developer in the R&D division of OmniMark Technologies. He can be contacted at [email protected]

Regular expressions support a very productive programming technique in awk, lex, and Perl. The expressions constitute a language precisely describing the patterns that can occur in text. Different patterns can be paired with action code, and applied to an incoming stream of text, each pattern selecting the part of the input to which its actions will be applied. This method of programming is concise and powerful, but many text-processing problems involve structures (such as nested sets of parentheses) that cannot be described by regular expressions.

A pattern language that includes recursive patterns and conditional pattern matching can handle complex text structures without supplementary programming. In this article, I will use the OmniMark pattern language from OmniMark Technologies (the company I work for) to illustrate that the more powerful patterns can be used to parse computer languages -- a job that might otherwise be done with tools such as lex and yacc.

The OmniMark Programming Language

OmniMark, which uses an English-like syntax, is available as a command-line version on many different platforms, and as a Windows IDE. OmniMark programs are completely portable between all command-line versions and the IDE. The IDE can single-step through any of the OmniMark programs in this article, making their operation much easier to understand. The command-line version and documentation are freely available at, while the IDE is free for noncommercial use.

OmniMark transforms streams of data via a combination of rule-based and procedural programming. find rules use patterns to select parts of the input stream. The patterns can place different parts of the selected data in variables, for use by the procedure that is part of the find rule. markup rules, on the other hand, can capture any part of a structured (XML or SGML) document, for handling by an associated procedure. The find and markup rules can be used together to set up coprocesses with information transfer between the coprocesses. The processes can work together to read an input stream, convert the input to SGML or XML markup, and verify the output. (I used this technique to build a reliable WinHelp-RTF-to-SGML converter that has been in use for three years.)

OmniMark functions can be internal functions written with OmniMark code or external functions written with C or C++.

OmniMark has three internal data types: switch (Boolean), counter (integer), and stream (character data). These types are organized into shelves, which are like arrays, except that a shelf can have a numeric index and a set of hash keys at the same time. There is also a set of operators to use the shelves as queues, stacks, and hash-keyed tables.

External data types are created and modified via external function libraries, but they are still managed by OmniMark, in shelves.

A stream item can be attached to a memory buffer, file, or any other source or destination for sequential data. For example, an external source can be written in C or C++ to get a TCP/IP connection and to return data on demand from the connection.

The default output can be redirected to any stream or set of parallel streams. The original default is resumed when exiting the scope in which the redirection was done. Redirection can be nested, creating a stream stack of output directions that will be resumed as scopes are exited.

Programmers often want to invert the sequence of output data relative to input data. OmniMark provides referents to support this inversion. You write a named referent at the point where you need the output, and you set the value of the referent at some other convenient time. OmniMark automatically takes care of all data buffering, including multiple passes on the data when required.

OmniMark programs can throw and catch exceptions. When an exception is thrown, exiting from scopes and destroying objects may create other unfinished business -- throwing new exceptions. The original exception will remain active until caught by its handler, but the new exceptions will also be resolved correctly.

OmniMark Patterns

OmniMark patterns use a language similar to the regular expression language used in awk, sed, lex, and Perl, but with different notation. OmniMark patterns are not delimited by slashes, white space has no significance, character classes are named, and literals must be quoted. Table 1 lists some of the differences. OmniMark patterns can span several lines. Table 2 summarizes the character classes and occurrence operators in OmniMark patterns.

You can identify subpatterns in a pattern, and you can assign the corresponding input text to named pattern variables. Example 1 shows the use of pattern variables to swap the first two fields of a tab-delimited file. The find rule is the way that OmniMark specifies pattern-action pairs. Examples 2 and 3 show corresponding rules in Perl and sed.

When input is matched by any pattern, the input is moved out of the input buffer. Consequently, only one pattern will match any particular string of input data.

OmniMark patterns or subpatterns can be Boolean functions. The function reads the current input buffer, and decides whether to accept (True) or reject (False). If the function accepts, the input text is added to the pattern buffer, where it is accessible to pattern variables. If the function rejects, the input text is pushed back onto the input buffer, where it will be accessible to testing by another pattern.

Listing One shows how pattern functions can be used to recursively match nested sets of parentheses. The paren-block function defines a parenthesized block as an interior part enclosed by a balanced set of parentheses. The paren-block-interior function defines the interior as a zero-length string or any concatenation of parenthesized blocks and characters other than the opening or closing parentheses.

The paren-block function attempts to match a string that begins with an opening parenthesis and ends with a closing parenthesis. If the match succeeds, the matched text goes into the pattern buffer and the function returns True. If the match fails, any text that has been read gets pushed back to the input buffer, and the function returns False.

The paren-block-interior function makes repeated attempts to match input either as a character that is not a parenthesis, or as a complete parenthesized block of text. Each success adds text to the pattern buffer. The first failure breaks the loop, and the function returns True. Since a failure can occur on the first attempt, the function will accept an empty string as a possible interior part of a block.

You can see how recursive pattern matching works by following the string (a(b)c in Listing One. If you have the OmniMark IDE, put the input in a file, and single step through the source code.

The leading "(" is picked up by the first find rule, because it can be the first character of an outer block. The character a is picked up as part of the interior of the outer block. The second "(" is picked up as the beginning of an inner block. The b is in the interior of the second block, which is completed by the first ")." Matching continues with the c as part of the outer block. At this point, there are no more characters, the outer block cannot be completed, and the entire pattern has failed to match. Consequently, the entire string is pushed back on the input buffer, ready for an attempt to match the pattern in the second find rule.

The leading "(" has already been rejected by the first rule, and passes to the second find rule, which accepts any single character. The a goes to the first rule, is rejected, and passes to the second rule. The second "(" goes to the first rule, and begins an attempt to match a new block. Matching continues with the b and the ")," which completes a successful match, and the substring (b) is printed. The following c fails the first find rule and is discarded after succeeding in the second find rule.

Recursion of pattern functions creates an implicit stack to hold partially completed structures until their substructures are completed. Therefore, a set of pattern functions corresponding to a properly defined grammar can be used to parse a programming language.

Defining a Grammar

A parser that depends only on pattern matching must be a recursive-descent predictive parser, because pattern matching occurs sequentially, from left to right, and each part of a pattern is tried only after the preceding parts match. This means that some grammars, particularly those with left recursion, cannot be handled by a pattern-matching parser. Consequently a grammar may need rewriting before you can use it in a pattern-matching parser.

Example 4 is a left-recursive grammar for a four-function calculator, a grammar that would work with yacc. A pattern-matching parser based on this grammar would require left-recursive pattern functions like the one in Example 5. The function can loop forever because it can invoke itself without ever reading any input, just by matching on the line that adds two expressions.

Example 6 shows an equivalent (describing the same language) grammar with right recursion instead of left recursion. Example 7 shows one of the pattern functions for this grammar. The function cannot loop forever, because it must read some input before it can invoke itself.

The new grammar needs some additional changes to account for the left associativity of arithmetic operators. Example 8 shows how addition and subtraction would be done with the grammar in Example 7. The actual operations do not occur until the pattern is completely recognized. However, the function can recursively call itself, matching as long as there is another sequence of minus signs and digits. Each of the recursive matches will be complete before its enclosing match, and subtraction will occur from right to left instead of from left to right as it should. The expression 3-2-1 will evaluate as 3-(2-1) instead of (3-2)-1.

To prevent right-to-left evaluation, left-associative operators like the arithmetic operators must be evaluated in patterns where the operator is not followed by a recursive pattern function. Example 9 shows a grammar meeting this requirement. The asterisks identify the productions (or patterns) in which numeric evaluation and computation will occur. This is the grammar that I used for the four-function calculator.

The Four-Function Calculator

Listing Two is a four-function calculator, written with patterns and pattern functions in OmniMark code, following the grammar in Example 9. For comparison, Listings Three and Four (available electronically; see "Resource Center," page 5) show the lex and yacc code for a similar four-function calculator.

The calculator is driven by the repeat ...again control loop at the bottom of the file in the process rule. The repeat scan...again loop tries to match arithmetic expressions, one to a line of input, on a continuous basis. A request to quit and a blank line are also treated as acceptable input. Any other input, up to the end of the current line, is treated as a syntax error.

The two catch actions trap any expected errors (stack underflow would actually be a design fault), thereby preventing premature termination of the program.

The always action executes whether or not a catch has executed, and guarantees that the next iteration of the outer loop will start with the correct initial conditions. For example, it clears both the stack and the remainder of an input line after a division-by-zero error.

The patterns within the pattern functions all allow blank space because, without separate processing of tokens and grammar, everything must be done at once. To follow the operation of the parser, assume that the input is a string of digits. The string will be recognized in the following steps (you can follow the step-by-step execution with the IDE):

1. The control loop reads #main-input and tries to match the pattern function expression.

2. The expression function can match when the term and r-expression functions match.

3. The term function can match when the factor and the r-term functions match.

4. The factor function matches because the integer pattern matches the string of digits. The input text is captured in the pattern variable int and is pushed on the stack as an integer value.

5. The r-term function matches the empty string, because the two preceding matches in the function fail. The term function has now matched.

6. The r-expression function matches the empty string, because the two preceding matches in the function fail.

7. The expression function is now matched, an integer is popped from the stack and sent to #main-output.

Now consider the expression "4+9."

1. The control loop reads #main-input and tries to match the pattern function expression.

2. The expression function can match when the term and r-expression functions match.

3. The term function matches 4 as in steps 3 to 5 of the preceding example. The first integer value is pushed on to the stack.

4. The r-expression function can match when the add-term function and a second invocation of r-expression matches.

5. The add-term function can match if the term function can match the text after the plus sign.

6. The term function matches 9 as in steps 3 to 5 of the preceding example. A second integer value is pushed on to the stack.

7. The add-term function is now matched. Two integers are popped and their sum is pushed.

8. The second invocation of the function r-expression matches the empty string, because the two preceding matches in the function fail.

9. The first invocation of the r-expression function is now matched on its first pattern (add-term r-expression).

10. The expression function is now matched, an integer is popped from the stack and sent to the #main-output.

As a third example, consider the expression "(4+9)."

1. The expression function can match if an opening parenthesis, a second invocation of expression, and a closing parenthesis can match.

2. The opening parenthesis matches.

3. A second invocation of expression matches as in steps 1 to 10 of the preceding example.

4. The closing parenthesis matches.

5. The first invocation of expression is now matched, an integer is popped from the stack and sent to the #main-output.

In all but the simplest expressions, the pattern functions invoke each other recursively. This is the essential feature that makes it possible to parse text structures, including those described by context-free grammars, with nothing more than a set of rules that combine pattern matching with action code.


With recursive pattern matching, parsers for small languages do not require a lot of programming, and the programs can be quite readable and maintainable. Many other text-processing problems, such as converting RTF text to SGML or XML documents, are comparable to writing parsers for small languages. The greater part of the work goes into planning and strategy, so that most of your resources are directed at the problem to be solved, instead of being absorbed by the tools needed to solve the problem.


Thanks to Shirley van Blaricom and Stacy Finkelstein for helpful comments and suggestions.


Listing One

#!/bin/omnimark -sb
; parnest.xom
; omnimark: pattern functions used in recursive pattern matching
; Run this program with the command
;   omnimark -sb parnest.xom parnest.txt
; where parnest.txt is an input file with nested sets of parentheses
; This program extracts nested matching parentheses from a file
; and prints the outermost set with all of the intervening text
; as a single unit.
; In the following line, the underlined text would be printed out
;   ( skip this ( but ( show ) this ) and ( not this
;               ---------------------
; Try using this program as input to itself:
;   omnimark -sb parnest.xom parnest.xom

; forward definition because functions refer to each other
define switch function paren-block
define switch function paren-block-interior
    repeat scan #current-input
        match [any except "()"]     ;any except start or end of block
        match paren-block           ;any contained block
    return true
define switch function paren-block
    return #current-input matches ( "(" paren-block-interior ")" )

find paren-block => text
    ; list each outer block
    output text || "%n"
find any        ; Grab everything rejected by the first 'find' rule
    ; discard all characters
    submit #main-input  ;Send all input through the 'find' rules above

Back to Article

Listing Two

;   Four-function calculator written with OmniMark pattern matching rules.
;   Supports + - * / and () with correct precedence
;   Requires OmniMark 5.1 or later (free from
;   Run the program with the following comannd line:
;       omnimark -s dc.xom
;   Exit by typing "quit" or EOF (ctrl-Z in Windows, ctrl-D in Unix)
declare #main-input has unbuffered
declare #main-output has unbuffered

; integer stack, with stack operators
global counter Stack variable initial-size 0
declare catch StackUnderflow
define function push
(   value counter num   )
    set new Stack to num
define counter function pop
    local counter num
    throw StackUnderflow when number of Stack < 1   ;nothing to pop
    set num to Stack    
    remove Stack        ;discard top of stack
    return num
; patterns and pattern matching functions

declare catch DivisionByZero
macro integer is
(   digit+  )

define switch function expression
elsewhere                           ;forward definition

define switch function factor
    do scan #current-input
        match blank* "(" blank* expression blank* ")"
            ;do nothing
        match blank* integer => int
            push( int )
            return false
    return true
define switch function multiply-factor
    do scan #current-input
        match blank* "*" blank* factor
            push( pop() * pop() )
            return true
    return false
define switch function divide-factor
    do scan #current-input
        match blank* "/" blank* factor
            local counter num
            set num to pop()
            throw DivisionByZero when num = 0
            push( pop() / num )
            return true
    return false
define switch function r-term
    do scan #current-input
        match multiply-factor r-term
        match divide-factor r-term
    ; if we match nothing else, we can match the empty string
    return true
define switch function term
    return #current-input matches ( factor r-term )
define switch function add-term
    do scan #current-input
        match blank* "+" term
            push( pop() + pop() )
            return true
    return false
define switch function subtract-term
    do scan #current-input
        match blank* "-" term
            push( 0 - pop() + pop() )
            return true
    return false
define switch function r-expression
    do scan #current-input
        match add-term r-expression
        match subtract-term r-expression
    ; if we match nothing else, we can match the empty string
    return true
define switch function expression
    return #current-input matches ( term r-expression )
; control loop
        repeat scan #main-input
            match expression =>text blank* "%n"
                output "d" % pop() || "%n"
                put #error "Stack error: " || "d" % number of Stack
                            || "%n"
                    when number of Stack != 0
                clear Stack
            match blank* ul "quit"
            match blank* "%n"
            match   any-text+ =>text "%n"?
                put #error "Syntax error: %x(text)%n"
                clear Stack
    catch DivisionByZero
        put #error "Divide by zero%n"
    catch StackUnderflow
        put #error "Stack underflow%n"
        do scan #main-input
            match any-text* "%n"
        clear Stack

Back to Article

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.