When it comes to full-text indexing, we usually think of methods such as inverted indices that break text on word boundaries, consequently requiring search terms to be whole words. Yet most of us have had the experience of searching for not-quite-words such as C++, VM/CMS, SQL*Plus, 127.0.0.1, or <blink>, only to find that they were skipped by an inverted index or broken into less-distinctive pieces. The same goes when you are working with data such as DNA sequences, where you need to find any sequence of symbols quickly.
In this article, I examine an indexing method that lets you find any character sequence in the source text in time only proportional to the sequence length, using a structure that can compress the entire source text and index into less space than the text alone. This technique is exceptionally fast at detecting and counting occurrences of any string in the source text. The fact that you can build a string match incrementally, adding one character at a time and seeing the result at each step, gives you the flexibility to explore variable patterns such as regular expressions with maximum effectiveness.
Finding any character sequence in source text, and fast!
In Fast String Searching with Suffix Trees (DDJ, August 1996), Mark Nelson addressed full-text indexing using suffix trees; in Data Compression with the Burrows-Wheeler Transform (DDJ, September 1996), he focused on the use of the Burrows-Wheeler Transform (BWT) for compression. While the BWT is commonly known as a data-compression technique, researchers have found that block-sorted data has a structure that lends itself naturally to search, while using space close to its minimal compressed size. This was first demonstrated in the FM index (see Opportunistic Data Structures with Applications, by Paolo Ferragina and Giovanni Manzini, Proceedings of the 41st IEEE Symposium on Foundations of Computer Science, 2000). In short, the same transformation that yields high compression ratios by grouping similar substrings together also lets you find arbitrary substrings with little overhead.
Block Sorting Reviewed
When block sorting, you construct a list of n blocks, each consisting of the source text S of length n (usually followed by a special end-of-string character $), cyclically shifted from zero to n-1 positions. When you sort the blocks, you get a matrix M; see Figure 1(a). The first column of M is called F, and the last, L. F has a simple structure: it contains all the characters of S in sorted order, duplicates included. Column L has a more complex structure that contains enough information to reconstruct the original string, and it usually forms the basis for BWT compression.
Figure 1(a): Block-sorted text. Start with the source string "abracadabra." Append an end-of-file metacharacter $, with the property that $<a-z. Cyclically shift the string from 0 to n-1 places, and sort the resulting list. Each row is called a "block," containing original text left-shifted 0 to n-1 places. M is the 12×12 block-sorted matrix for string "abracadabra$," F is the first column of the matrix, L is the last column.
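To make the discussion concrete, here is a small sketch (not the listing from bwtindex.cc) that builds M for "abracadabra$" the naive way, by materializing and sorting all n cyclic shifts, then reads off the F and L columns:

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::string s = "abracadabra$";          // source text with end marker $
    std::size_t n = s.size();

    std::vector<std::string> m(n);           // the n cyclic shifts
    for (std::size_t i = 0; i < n; ++i)
        m[i] = s.substr(i) + s.substr(0, i); // shift left by i places
    std::sort(m.begin(), m.end());           // the block-sorted matrix M

    std::string fcol, lcol;                  // first and last columns
    for (const std::string& row : m) {
        fcol += row.front();
        lcol += row.back();
    }
    std::cout << "F: " << fcol << "\n";      // $aaaaabbcdrr
    std::cout << "L: " << lcol << "\n";      // ard$rcaaaabb (the BWT output)
}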
In its naive form, M contains n² characters, but a simple trick represents each block by its first character and a link to the block starting one character to the right; see Figures 1(b) and 1(c). To decode a block, you read its first character, follow its link to the next block, read its character and link, and repeat the process until the desired number of characters have been read. This character-and-link representation cuts the space requirement from O(n²) to O(n).
Figure 1(b): Replacing "abracadabra$" with "a" and a pointer to "bracadabra$a".
Figure 1(c): Twelve blocks replaced with F column and 12 pointers to suffixes.
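Here is a sketch of the character-and-link representation. The array names first_char and link are my own (bwtindex.cc organizes this differently), and a real index would never build the full matrix; it appears here only to keep the example short:

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::string s = "abracadabra$";
    std::size_t n = s.size();

    std::vector<std::string> m(n);           // block-sort as before
    for (std::size_t i = 0; i < n; ++i)
        m[i] = s.substr(i) + s.substr(0, i);
    std::sort(m.begin(), m.end());

    std::vector<char>        first_char(n);  // the F column
    std::vector<std::size_t> link(n);        // block starting one char to the right
    for (std::size_t i = 0; i < n; ++i) {
        first_char[i] = m[i][0];
        std::string shifted = m[i].substr(1) + m[i][0];   // left-shift one place
        link[i] = std::lower_bound(m.begin(), m.end(), shifted) - m.begin();
    }

    // Read back block 3 ("abracadabra$") one character at a time by chasing links.
    std::string out;
    for (std::size_t i = 3, k = 0; k < n; i = link[i], ++k)
        out += first_char[i];
    std::cout << out << "\n";                // abracadabra$
}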
The links act as a permutation on M, which I call FL because it permutes the orderly F column into the higher-entropy L column. FL is the permutation caused by shifting M one column to the left and resorting it; each row i moves to a new position FL[i].
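You can check this characterization on the example string. The sketch below (again mine, not the bwtindex.cc listing) computes FL by shifting each row one column to the left and locating it in the resorted matrix, then verifies that FL carries each F-column character onto the L column; row numbers are 0-based:

#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::string s = "abracadabra$";
    std::size_t n = s.size();

    std::vector<std::string> m(n);
    for (std::size_t i = 0; i < n; ++i)
        m[i] = s.substr(i) + s.substr(0, i);
    std::sort(m.begin(), m.end());

    std::vector<std::size_t> FL(n);
    for (std::size_t i = 0; i < n; ++i) {
        // Shift row i one column to the left; its position after resorting is FL[i].
        std::string shifted = m[i].substr(1) + m[i][0];
        FL[i] = std::lower_bound(m.begin(), m.end(), shifted) - m.begin();
    }

    // FL maps the F column onto the L column: L[FL[i]] == F[i] for every row.
    for (std::size_t i = 0; i < n; ++i)
        assert(m[FL[i]].back() == m[i].front());

    for (std::size_t v : FL) std::cout << v << ' ';
    std::cout << "\n";                       // 3 0 6 7 8 9 10 11 5 2 1 4
}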
Since the F column is a sorted list of characters, the next space saver is to change from explicitly storing the F column to simply recording the starting position of each character's range in F, using an array that I call "C"; see Figure 1(c). Given a position i in M, you look through C to find which character's section of F contains i. The F() method in bwtindex.cc applies this idea. Also see Figure 1(d).
Figure 1(d): Discard F column and use C to identify characters.
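The following sketch shows the idea; the actual F() method in bwtindex.cc may differ in its details, but the effect is the same: C records the first row of each character's range, and F(i) returns the character whose range contains row i:

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::string s = "abracadabra$";
    std::size_t n = s.size();

    // C[c] = number of characters in s that sort before c, which is also the
    // first row of c's section in the F column.
    std::vector<std::size_t> C(257, 0);
    for (unsigned char c : s) ++C[c + 1];
    for (std::size_t c = 1; c < 257; ++c) C[c] += C[c - 1];

    // F(i): the character at row i of F, found by locating the section of C
    // that contains i (a linear scan here; binary search works just as well).
    auto F = [&](std::size_t i) {
        std::size_t c = 0;
        while (C[c + 1] <= i) ++c;
        return static_cast<char>(c);
    };

    for (std::size_t i = 0; i < n; ++i) std::cout << F(i);
    std::cout << "\n";                       // the F column: $aaaaabbcdrr
}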
By storing only FL and C, you have a reasonable, but not minimal, representation of M, and you can decode successive characters of the source from left to right. The decode method in bwtindex.cc shows how to carry out this iteration. See Figure 1(e).
Figure 1(e): Decoding from any position. Chase pointers and translate each position to a character using C. The decode method in bwtindex.cc shows this technique.
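As a stand-in for that listing, here is one way such a decode routine might look; it rebuilds FL and C from the naive matrix purely for demonstration, whereas a real index would store only those two structures:

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::string s = "abracadabra$";
    std::size_t n = s.size();

    std::vector<std::string> m(n);           // demo only; never stored in practice
    for (std::size_t i = 0; i < n; ++i)
        m[i] = s.substr(i) + s.substr(0, i);
    std::sort(m.begin(), m.end());

    std::vector<std::size_t> FL(n), C(257, 0);
    for (std::size_t i = 0; i < n; ++i) {
        std::string shifted = m[i].substr(1) + m[i][0];
        FL[i] = std::lower_bound(m.begin(), m.end(), shifted) - m.begin();
        ++C[static_cast<unsigned char>(s[i]) + 1];
    }
    for (std::size_t c = 1; c < 257; ++c) C[c] += C[c - 1];

    // Translate a row number to its first character via C.
    auto F = [&](std::size_t i) {
        std::size_t c = 0;
        while (C[c + 1] <= i) ++c;
        return static_cast<char>(c);
    };

    // decode(i, k): read k characters of the source starting at row i of M.
    auto decode = [&](std::size_t i, std::size_t k) {
        std::string out;
        for (std::size_t j = 0; j < k; ++j, i = FL[i]) out += F(i);
        return out;
    };

    std::cout << decode(2, 5) << "\n";       // row 2 begins "abra$...": prints abra$
}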
Useful Properties of the Permutation
Figure 2 shows how FL is order-preserving on blocks that start with the same character. That is, given two blocks i and j that both start with c, lexical comparison implies that if i<j, then FL[i]<FL[j]. This is one of the core elements of the BWT.
Figure 2: This map is called FL because it rearranges the F column into the L column. FL is order-preserving for strings that start with the same character, so each character's section in FL contains an ascending list of integers. Here, you see the "fan-out" for blocks prefixed by "a."
This order-preserving property means that FL consists of sections of integers in ascending order, one section for each character. You can search one of these sections for a target value quickly, using binary search.
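The sketch below demonstrates this on the section for "a" in the example string: its FL values are ascending, so std::lower_bound locates a target value in O(log n) steps. The target value 8 is arbitrary, chosen only for illustration:

#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::string s = "abracadabra$";
    std::size_t n = s.size();

    std::vector<std::string> m(n);
    for (std::size_t i = 0; i < n; ++i)
        m[i] = s.substr(i) + s.substr(0, i);
    std::sort(m.begin(), m.end());

    std::vector<std::size_t> FL(n), C(257, 0);
    for (std::size_t i = 0; i < n; ++i) {
        std::string shifted = m[i].substr(1) + m[i][0];
        FL[i] = std::lower_bound(m.begin(), m.end(), shifted) - m.begin();
        ++C[static_cast<unsigned char>(s[i]) + 1];
    }
    for (std::size_t c = 1; c < 257; ++c) C[c] += C[c - 1];

    // The section of rows whose blocks start with 'a' is [C['a'], C['a'+1]).
    std::size_t lo = C['a'], hi = C['a' + 1];
    assert(std::is_sorted(FL.begin() + lo, FL.begin() + hi));   // ascending section

    // Binary-search that section for the first row whose FL value is at least 8.
    auto it = std::lower_bound(FL.begin() + lo, FL.begin() + hi, std::size_t{8});
    std::cout << "row " << (it - FL.begin()) << ", FL = " << *it << "\n";  // row 4, FL = 8
}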