Design

Full-Text Searching & the Burrows-Wheeler Transform

By Kendall Willets, December 01, 2003

Here's an indexing method that lets you find any character sequence in the source text using a structure that can fit the entire source text and index into less space than the text alone.

Pattern Matching

If you pick an arbitrary pattern string P, say "abr," one way to find all occurrences of it is to search the sorted blocks in M, finding the range of blocks that start with a, then narrowing it to blocks prefixed by "ab," and so on, extending the pattern from left to right. This method is workable, but a more efficient algorithm (first developed by Ferragina and Manzini) works in the opposite direction, extending and matching the pattern one character to the left at each turn.

To understand this method inductively, first consider how to match one character c, then how to extend a single character beyond a pattern that has already been matched. The answer to the first problem is easy, since you know blocks in the range C[c]...C[c+1]-1 start with c. I call this range "R_c."

To left-extend a pattern match, consider the string cP formed by prepending a character c onto the already matched string P. Use FL, starting from the range of locations prefixed by P, mapping FL inversely to find the interval of blocks prefixed by cP, as follows.

Given the next character c and the range R_P of blocks prefixed by P, you need to find the range R_cP of blocks prefixed by cP. You know two things about R_cP:

It must fall within the range R_c of blocks starting with c, that is, R_cP is a subrange of R_c.
FL must map all blocks in R_cP into blocks in R_P, because every block starting with cP must left-shift to a block that starts with P.

My approach is to start with the widest possible range R_c, and narrow it down to those entries that FL maps into R_P. Because of the sorting, you know that entries prefixed by cP form a contiguous range. Since FL is order-preserving on R_c, you can find R_cP as follows:

Scan R_c from the start until you find the lowest position i that FL maps into R_P.
Scan backwards from the end to find the highest position j that FL maps into R_P (in practice, you use binary search, but the idea is the same).

The resulting [i,j] range is R_cP, the range of blocks prefixed by cP. Figures 3(a), 3(b), 3(c), and 3(d) show this narrowing-down process; the refine method implements this algorithm in bwtindex.cc.

Figure 3(a): Start by finding the range of all blocks starting with "r."

Figure 3(b): To find R_br, find bs that precede rs. I show two copies of F for clarity. Working right-to-left, the copy on the right represents the previous range, and the one on the left is new. To search: 1. Take all blocks starting with b; 2. search this set for the first i where FL[i]startp, and the last j where FL[j]<=endp; everything in this [i,j] range must map into the target interval between startp and endp. This [i,j] interval identifies all bs that are followed by r; in other words, all blocks starting with "br." i and j become the new startp/endp for the next step.

Figure 3(c): Extending "br" to "abr." The next char is "a," so you start with the range of F containing as. Narrow this range down from all a's to a's followed by "br," again by inverting FL from the "br" range into the "a" range. This process can continue indefinitely, left-extending the pattern one character at each step. For any pattern, this process will give you the start and end (and count) of a range of matches in M. From any position in this range, you can decode forward to show the context of each match, but you don't know the offset of each match in the source file.

Figure 3(d): What you don't know is the distance of each match to end/start of text.

The result at each step in this process is a start/end position for a range of blocks prefixed by the pattern matched so far. The difference between these positions is the number of matches in the text, and starting from any position in this range, you can decode characters in and following each match.

The Location Problem

There is one valuable piece of information you haven't found: The exact offset of each match within the original text. I can call this the "location problem," because there is virtually no information in a sorted block to tell you how far you are from the start or end of the text, unless you decode and count all the characters in between.

There are a number of solutions to the location problem that I won't address here except to say that all of them require extra information beyond FL and C, or any BWT representation. The simple but bulky solution is just to save the offset of each block in an array of n integers, reducing the problem to a simple lookup, but adding immensely to the space requirement. The problem is how to get the equivalent information into less space.

Some approaches rely on marking an explicit offset milepost at only a few chosen blocks, so you quickly encounter a milepost while decoding forward from any block. Others use the text itself as a key, to index locations by unique substrings. Another possibility lets you jump ahead many characters at a time from certain blocks, so as to reach the end of the text more quickly while counting forward. The variety of possible solutions makes it impossible to cover them here.

A Word About Compression

Recall that I promised a full-text index that consumes only a few bits per character, but so far you've only seen a structure taking at least one int per character—hardly an improvement. However, the integers in FL have a distribution that makes them highly compressible. You already know FL contains long sections of integers in ascending order. Another useful fact is that consecutive entries often differ by only one; in normal text, as many as 70 percent of these differences are one, with the distribution falling off rapidly with magnitude. My own experiments using simple differential and gamma coding have shrunk FL to fewer than 4 bits per character, and more sophisticated methods (see "Second Step Algorithms in the Burrows-Wheeler Compression Algorithm," by Sebastian Deorowicz; Software: Practice and Experience, Volume 32, Issue 2, 2002), have shrunk FL to even more competitive levels.

The practical problem with compression is that elements of FL then vary in size, so finding an element FL[i] requires scanning from the beginning of the packed array. To eliminate most of the scanning, you need to use a separate bucket structure, which records the value and position of the first element of each bucket. To find FL[i], you scan forward from the beginning of the closest bucket preceding i, adding the encoded differences from that point until position i is reached. The process is laborious, but does not affect the higher level search and decoding algorithms.

Kendall is a software engineer living in San Francisco and can be contacted at [email protected].

Previous 1 2 3 4 5 6 7 8 9 10 Next

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Design

Full-Text Searching & the Burrows-Wheeler Transform

Pattern Matching

Figure 3(a): Start by finding the range of all blocks starting with "r."

Figure 3(d): What you don't know is the distance of each match to end/start of text.

The Location Problem

A Word About Compression

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

Design

Full-Text Searching & the Burrows-Wheeler Transform

Pattern Matching

Figure 3(a): Start by finding the range of all blocks starting with "r."

Figure 3(d): What you don't know is the distance of each match to end/start of text.

The Location Problem

A Word About Compression

Related Reading

News

Commentary

Slideshow

Video

Most Popular

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content