Parallel

High-Speed Finite-State Machines

By Brenton Hoff, November 01, 1997

Many programming problems can be efficiently handled with a custom virtual machine-and many virtual machines can be implemented as finite-state machines. Brenton presents a technique for implementing a virtual machine, using text-processing as an example application.

Dr. Dobb's Journal November 1997: High-Speed Finite-State Machines

Byte-oriented applications benefit from an assembly trick

Brenton is the principal software engineer of Grouse Software, based in Adelaide, South Australia. He can be contacted at [email protected].

Sidebar: A High-Speed Static Huffman Decoder

Many problems can be efficiently handled with a custom virtual machine. This machine provides an application-specific language that is more powerful, expressive, and concise than any general-purpose programming language. Because it can be carefully tailored to the particular problem domain, a virtual machine can provide a compact, efficient way to solve a broad class of problems. However, simulating virtual machines can be time consuming.

In this article, I'll describe a technique I used for implementing a virtual machine for text processing; this virtual-machine architecture forms the core of both a simple word-count utility and a fast grep program. The grep utility is called Grouse Grep (ggrep). "Grouse" is Aussie slang for "very good" (although most of my acquaintances have noted that the definition of "grouser" is "a grumbler"). Source code and executables for these programs are available electronically from DDJ (see "Availability," page 3) and at ftp://ftp.grouse.com.au/pub/wc/.

High-Speed Implementation

Many virtual machines can be implemented as finite-state machines. Each of the machine's states describes how the machine reacts to events while in that state. The fastest way to implement this is to use a table for each state. Each table contains an entry for each event that describes how to handle that event. There is almost no overhead; only a few instructions are required to select the action and dispatch control to the appropriate routine.

You can speed up your implementation further by using threaded code. Instead of storing a code for each action, you just store the address of a handling routine directly in the table. This technique is common in interpreted languages (such as Forth) and is a good method of reducing the cost of handing control to the next action.

On the 80x86 family, you can build compact state tables by using single-byte addresses. Just make sure that all of your handlers start within the same 256-byte code page.

The best way to develop these high-performance machines is to prototype the machine states and actions in a high-level language, keeping an eye on how the assembly version will work, then port the machine to assembly when the machine's operation has been debugged. Many applications require the state tables to be generated dynamically, and it is convenient and efficient to use a high-level language for this process.

Word/Line Count Utility

As an example, I'll show how I converted a prototype word-count utility written in C into a compact, highly optimized 80x86 assembly-language version.

Counting words requires just two states -- a word state and a whitespace state. Each byte is classified as a word character or a whitespace character as part of the table lookup. The machine's actions increment the word or line counters when appropriate. Figure 1 shows the state machine, and lists the actions required.

The C implementation in Listing One is straightforward. To keep the file-system overhead low, the file is brought into memory in fairly large blocks, and the memory is examined directly by the state machine. The machine's state is maintained by a state-table pointer, which points to the whitespace-state table or the word-state table. Word and line counters are also maintained.

The trickiest part is stopping the machine at the end of each buffer. Maintaining a counter virtually doubles the cost of the innermost loop. Instead, I add a marker byte to the end of the buffer, and create additional actions to handle this marker. I've chosen the byte 238 (hex EE), since this value is uncommon in typical text files.

The key part of this state machine is three lines:

NextCh = *pText++;
Action = pTab[NextCh];
switch (Action) {

By carefully choosing registers and memory layout, each of these three lines becomes a single assembly-language instruction. Since this sequence is only four bytes, you can simply copy it at the end of each action, avoiding another jump:

lodsb
xlat
jmp ax

The lodsb/xlat sequence uses si (pText) and bx (pTab) to load the corresponding action into the lower byte of ax. As long as all of the actions are in the same 256-byte page, you can initialize the upper part of ax outside the inner loop.

The assembly-language version of the word-count machine is also available electronically. Note that the values chosen to represent the actions in the C program are the entry points of the corresponding assembly-language action. The word count runs twice as quickly as the C version, but the utility is heavily limited by the file-system interface, spending more time waiting for the file to be read than it does actually performing the count.

Regular Expressions

A more sophisticated example of this approach is Grouse Grep, my implementation of the grep search utility. By compiling the regular expression (RE) into a fast state machine, my program runs significantly faster than traditional recursive search implementations.

I'll start by showing how to compile a regular expression into a simple, unoptimized state machine, and then discuss some optimizations. The basic regular expression match is performed on a line terminated with NUL.

Matching Individual Bytes

The simplest search case is finding a single byte (such as "x") in the line. This search requires a one-state machine with three actions -- COMPLETED, ABANDON, and AGAIN. The state describes how the machine operates while it is scanning for the desired byte.

The table entry for "x" contains the action COMPLETED, which stops the search and reports the current position as the successful matching text. The table entry for NUL contains the action ABANDON. Since NUL marks the end of the line, this action will stop the search if no match is found.

All other entries in the table contain AGAIN, which instructs the search engine to move to the next character of the line, translate the byte into the action in the state table, and perform the specified action. This action makes the search engine scan the line from left to right.

Example 1 describes the state table for this search, and provides both C and Intel 80x86 assembly versions of the search engine.

Matching Classes

Matching classes of characters merely involves writing COMPLETED for each character of the table that matches. For example, a state table for matching the class "[aeiou]" is implemented by writing COMPLETED in the "a", "e", "i", "o", and "u" table entries, along with ABANDON for NUL and AGAIN for the remaining entries. The same idea applies to handling case insensitivity and the period character (which matches anything).

Timing information for the 486 shows that the cost of the lodsb/xlat/jmp sequence is nominally 14 clock cycles. This is slightly more costly than conventional code when matching a single byte, but is cheaper than any alternative code for matching classes of bytes.

Backtracking

The examples so far contain only a single search state. Searching for a string such as "koala" requires a search engine with at least two states: a "search for k" state and a "match oala" state. If a "k" is found, the search advances to the next state and looks for "oala".

If the match attempt fails, the search must backtrack to the initial state and continue searching for the next "k". The search engine must remember the text position and the match state for each point where the search may resume after failing to complete a match attempt.

Most existing regular expression searches use recursion to maintain backtracking information. While simple and effective, this is slower than the table-driven machine, because of the bookkeeping associated with function calls and returns. Most regular expression programs try to avoid recursion where possible because of this expense. The finite-state machine implementation eliminates this overhead entirely.

When a backtracking pathway is found that may be required later in the search, a text/state table pointer pair (describing how to resume) is pushed onto a stack. Backtracking from any state is then a simple matter of popping the pair off the stack and invoking the next threaded action according to the state table.

An additional bonus for the assembly-language version is that the CPU stack can be used directly to maintain the backtracking, allowing the stack to be implemented very efficiently.

String Matches

Rather than implement a "match string" action, it's more convenient to give each character of the search string its own state, to allow multicharacter match cases in any position. Three additional search engine actions are required:

ADVANCE, which advances to the next state for the next text character when the current state matches.
START_PUSH, which pushes a backtracking state/text context when "k" is found so that the "k" search can resume if the "oala" scan is unsuccessful.
NO_MATCH, which reverts to the last backtracking state/text context, executed during any of the "o", "a", "l", or "a" states if no match found.

Figure 2 shows the state machine for this search, including arrows for the various actions possible in each state. It also introduces an ok and fail state. These states handle the ultimate success or failure of the search, and allow the remaining states to perform matches without any line management overhead. Every entry of the fail table contains ABANDON; every entry of the ok table contains COMPLETED. The backtracking stack is initialized with a fail reference to catch the underflow when all backtracking options are exhausted.

Anchor to Start or End

To anchor the search to the start of the line, merely change the AGAIN actions in the first match state to ABANDON. If the first state does not match the first character of the line, the match fails, without the engine searching the remainder of the line for a possible match start.

Anchoring the search to the end of the line is easy to implement by editing the ok state so that almost all bytes contain NO_MATCH, and only end-of-line characters such as LF, CR, and NUL contain COMPLETED.

Optional Elements

Optional elements are implemented by adding two actions to the engine. These actions convert dead ends into match paths, and add an optional path to the backtracking stack if the match succeeds. In both cases, the option is included if available, but the match can advance if it is not. See Figure 3, which shows the state tables to search for "ding?o".

Where the input matches the optional state (e.g., "dingo"), an additional match path omitting the state is pushed as the match advances. The path pushed consists of the next state plus the previous character. Where the first path attempted fails to match, the alternative is selected by the backtracking mechanism.

If the optional element isn't matched (e.g., "dino"), the BACK_AND_ADVANCE action instructs the machine to retry the match with the next state.

The undo behavior is achieved by adding a "zero trip" case to the backtracking stack as part of the match. The next state and the previous character are added to the backtracking stack, so that if the first match path fails, the alternative path, which effectively undoes the optional match, is attempted. Similarly, when matching "quoka" using "quokk?a", the "a" does not match the second "k", but because the action is BACK_AND_ADVANCE instead of NO_MATCH, the "a" is retried in the next state, leading to a successful match.

Iteration

Iteration is specified by "*", meaning "zero or more," and "+", meaning "one or more." "*" iteration repeats the previous RE zero or more times: "xy*z" matches "xz", "xyz", "xyyz", "xyyyz", and so on. The "+" operator is the same, except that the previous expression is duplicated before the iteration, so that "a[bc]+d" is equivalent to "a[bc][bc]*d". Once this substitution is made, both cases can be treated identically. The correct method of handling iteration is to try to move forward as far as possible within the iteration state, and push option points into the backtracking stack as the loop proceeds. If the iteration fails, revert to the last item on the backtracking stack and try to match the states after the iteration again.

Figure 4 shows the additional action required for iteration, and displays the tables and actions to search for "bil*by".

For the text "biby", the "l" match state fails, but the BACK_AND_ADVANCE action means the "b" is retried in the fourth state, leading to a successful match. For the text "billlllby", the "l"s are matched and the match stays in the same state. Again, the "b" does not match the iteration but is matched in the next state.

This completes the minimal "language" required to perform basic RE searches. This version usually runs faster than recursive implementations, but slower than fixed-string searches.

The code to map the RE from the plain text specification to the state tables was written in C. The search engine was prototyped in C and then ported to assembly. The assembly source to implement this language is available electronically.

Optimizing for Speed

The speed of grep is not the speed of the regular expression match -- it is the speed at which the search engine can discard lines that can't match the RE. Improving the speed of the search breaks down into the following areas:

Minimizing the effort required to read the file from the operating system and break the data into lines.
Optimizing searching by using string-scanning techniques such as the Boyer-Moore optimization or by exploiting the CPU's search instructions.
Editing the search order to look for easy parts of the RE first so that lines that can't match are eliminated quickly.
Analyzing relationships between tables and optimizing state transitions (especially iterative backtracking).

File Buffering

The main way to reduce both the file-system cost and the line-splitting cost is to read the file directly into a large buffer in memory, and to adapt the search algorithm to recognize and handle line separators as they are encoded in the file. However, the file is often larger than the memory buffer. The solution is to read as much of the file as possible, then trim the buffer so it only contains complete lines. The file is rewound to the trim point, so that the next buffer will start at the start of a line.

LF is treated as the line separator for counting and display purposes, and CR also acts as a line terminator for searches anchored to the end of the line. Since the NUL is no longer needed as a line terminator, it is used as a buffer terminator. This allows almost all search actions to ignore the end-of-buffer cases.

The search optimizations are implemented by inserting an additional machine state between the fail state and the first expression state, called the "start state." This state receives actions specific to the speed optimizations without disrupting the actions of any of the existing expression search states.

String Searching

Some options of grep, such as selecting lines that don't match, work much better when the file is presented as a series of lines. Other options, such as line-number reporting, require that the start search examine each byte. However, simple searches can afford to skip bytes during the match start search, allowing the use of the Boyer-Moore algorithm (see "A Fast String Searching Algorithm," by R.S. Boyer and J.S. Moore, Communications of the ACM, October 1977). This technique is the main method used by other searches to gain speed.

The Boyer-Moore algorithm is simple: If you are searching for a simple string such as "galah", don't look in the first position for "g" -- look in the fifth position for "h". If you find an "h", start a normal match attempt from the first character. If you don't find an "h", look at the character you inspected: If it isn't in the string at all, skip forward by the length of the string. Even when the character is in the string, the search can still skip forward: For example, finding an "l" when searching for "galah" allows the search to skip forward two bytes.

A series of skip-byte actions were added to the search engine. The optimized search for "galah" has SKIP4 for "g", SKIP2 for "l", SKIP1 for "a", and START_OFFSET_MATCH for "h", with all other characters having SKIP5, except for the zero byte, which has ABANDON. In general, the skip size for an element is found by counting the number of NO_MATCH entries for that element in the tables leading up to the last state of the string.

CPU Byte Search

Scanning for a single byte is best done by using the processor's byte-search instruction. On the 486, this instruction scans memory twice as quickly as the lodsb/xlat/jmp sequence. A BYTE_ SEARCH action is used in the starting state where the Boyer-Moore algorithm is not available because the string is only a single literal ("%", for example) or because the search contains elements that cannot be optimized after the initial literal ("x.*y.*z").

Easiest First

Some regular expressions, such as "x[^a]*[a-z]*[^c]*wombat" are very inefficient if searched from left to right, as the first elements require massive amounts of backtracking. However, these expressions usually have an easier component that can be searched efficiently -- in this case, "wombat". Ggrep looks for easy strings in the RE, and splits the RE in two if a worthwhile position is found. The optimization is implemented by analyzing and editing the regular expression before attempting to produce the table-driven version.

In the example, the code searches for "wombat", and, where found, splits the line and searches for "x[^a]*[^b]*[^c]$" (anchored to the end of the line to ensure the split search may be reassembled). The line matches if both parts of the RE match, and the text selected by each match is combined to produce the final match.

Some care is required when implementing this optimization to ensure that the search continues examining the current line if the first part of the regular expression does not match. In addition, extra effort is required if the match is to report the longest possible part of the line matching the specified expression.

Optimizing Iteration

The amount of backtracking generated by iteration states can be significantly reduced by inspecting the next state after the iteration. Consider the expression "rock.*wallaby". While looping through the ".*" state (which matches everything up to the end of the line), it is easy to see that proceeding to the next state is only worthwhile if the character is "w": Any other character will not match.

These redundant backtracking cases can be eliminated by replacing AGAIN_ PUSH_ADVANCE with AGAIN for every element of the iteration table where the next table has NO_MATCH. This optimization is done as a post-processing step once the expansion of the search into tables is complete.

The result is that the search does not incur the cost of pushing the redundant backtracking paths, and (more importantly) reduces the number of backtracking paths that may need to be considered. In many cases, this optimization raises the performance of iterative searches close to that of simple string searches.

Acknowledgments

Thanks to Ross Williams, David Knight, Christian Adami, Ted Bullen, Martin Hogan, Mark Rawolle, and Simon Hackett for reviewing a draft of this article and providing valuable suggestions. Thanks also to my family for their support.

Listing One

#include "fcntl.h"#include "io.h"
#include <stdio.h>


</p>
typedef unsigned short  UINT16;   /*Unsigned 16-bit integer*/
typedef unsigned char   BYTE;     /*unsigned byte*/


</p>
#define PAGE_ALIGNED far
#define ENDMARKER       0xEE


</p>
#define EE_NOP          0x00      /*Buf end check, NOP if not*/
#define NOP             0x04      /*Stay in same state*/
#define EE_WORD         0x08      /*Buf end check, WORD if not*/
#define WORD            0x0c      /*Words++, change to word*/
#define WHITE           0x14      /*Change to whitespace*/
#define LF_WHITE        0x1a      /*Lines++, change to whitespace*/
#define LF_NOP          0x1c      /*Lines++, stay in same state*/
#define BUF_SIZE        (28*2048u)


</p>
/*Global variables updated directly by counting loop*/
unsigned long Lines;
unsigned long Words;
BYTE *pCurrentTable;


</p>
BYTE Tables[512];
#define WhiteTable      (&Tables[0])
#define WordTable       (&Tables[256])


</p>
static BYTE Buf[BUF_SIZE + 2];
static void InitTables(void) {
    int i;
    for (i = 0; i < 256 ; i++) WhiteTable[i] = WORD;
    WhiteTable[ 9]   = WhiteTable[11] = WhiteTable[12] = NOP;
    WhiteTable[13]   = WhiteTable[32] = NOP;
    WhiteTable[10]   = LF_NOP;
    WhiteTable[ENDMARKER] = EE_WORD;


</p>
    for (i = 0; i < 256; i++) WordTable[i] = NOP;
    WordTable[ 9]   = WordTable[11] = WordTable[12] = WHITE;
    WordTable[13]   = WordTable[32] = WHITE;
    WordTable[10]   = LF_WHITE;
    WordTable[ENDMARKER] = EE_NOP;
}
/*C and assembly versions of word counter*/
void PAGE_ALIGNED CountC(UINT16 NrBytes, BYTE *pText);
void PAGE_ALIGNED CountAsm(UINT16 NrBytes, BYTE *pText);
static int WCFile(char *pFilename);
int main(int argc, char **argv) {
    InitTables();
    while (--argc) {   /*Count each file*/
        if (! WCFile(*++argv))
            return 1;
    }
    return 0;
}
static int WCFile(char *pFilename) {
    int NrBytes;
    int Handle;
    /*Attempt to open file for reading as raw bytes*/
    Handle = open(pFilename, O_BINARY | O_RDONLY);
    if (Handle == -1) {
        printf("\nCan't open: %s", pFilename);
        return 0;
    }
    Lines = 0;
    Words = 0;
    pCurrentTable = WhiteTable;
    do {
        /*Read a slab of the file into memory*/
        NrBytes = read(Handle, Buf, BUF_SIZE);
        /*Exit loop if there's an error*/
        if (NrBytes == -1) break;
        /*Process the slab using CountC or CountAsm*/
        CountAsm(NrBytes, Buf);
        /*Stop if finished with file*/
    } while (NrBytes >= BUF_SIZE);
    printf("%7lu %7lu %7lu %s\n", Lines, Words, tell(Handle), pFilename);
    (void) close(Handle);
    return 1;
} /*WCFile*/
void PAGE_ALIGNED CountC(UINT16 NrBytes, BYTE *pText) {
    BYTE NextCh;
    BYTE Action;
    BYTE *pTab = pCurrentTable;
    BYTE *pEnd = &pText[NrBytes];
    /*Add endmarker to buffer to trigger end test below*/
    *pEnd++ = ENDMARKER;
    /*Loop through each byte*/
    for (;;) {
        NextCh = *pText++;
        Action = pTab[NextCh];
        switch (Action) {
        case EE_NOP:   if (pText == pEnd) break;
                       /*FALLTHROUGH*/
        case NOP:      continue;


</p>
        case EE_WORD:  if (pText == pEnd) break;
                       /*FALLTHROUGH*/
        case WORD:     Words++;
                       pTab = WordTable;
                       continue;
        case WHITE:    pTab = WhiteTable;
                       continue;
        case LF_WHITE: pTab = WhiteTable;
                       /*FALLTHROUGH*/
        case LF_NOP:   Lines++;
                       continue;
        }
        /*End of buffer found: remember in-word context and exit*/
        pCurrentTable = pTab;
        return;
    }
} /*CountC*/


</p>


</p>

Back to Article

DDJ

1 2 3 4 5 6 7 8 Next

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Parallel

High-Speed Finite-State Machines

Byte-oriented applications benefit from an assembly trick

High-Speed Implementation

Word/Line Count Utility

Regular Expressions

Matching Individual Bytes

Matching Classes

Backtracking

String Matches

Anchor to Start or End

Optional Elements

Iteration

Optimizing for Speed

File Buffering

String Searching

CPU Byte Search

Easiest First

Optimizing Iteration

Acknowledgments

Listing One

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Parallel Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

Parallel

High-Speed Finite-State Machines

Byte-oriented applications benefit from an assembly trick

High-Speed Implementation

Word/Line Count Utility

Regular Expressions

Matching Individual Bytes

Matching Classes

Backtracking

String Matches

Anchor to Start or End

Optional Elements

Iteration

Optimizing for Speed

File Buffering

String Searching

CPU Byte Search

Easiest First

Optimizing Iteration

Acknowledgments

Listing One

Related Reading

News

Commentary

Slideshow

Video

Most Popular

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Parallel Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content