Channels ▼
RSS

Database

Algorithm Alley: Fast String Searching with Suffix Trees


In the January/February, 1996 issue of Dr. Dobb's Sourcebook, Michael Abrash described how to greatly accelerate three-dimensional graphics by precomputing the correct information. The suffix trees Mark Nelson describes here are similar in concept-precalculating this one data structure makes future searches very quick. Since string searching is an important component of modern compression algorithms, a faster string search has some interesting repercussions. If you haven't heard about suffix trees, that's because no one knew how to build one quickly. At least, not until last year....

— Tim Kientzle


Imagine you've just been hired as a programmer for a DNA sequencing project. Researchers are busy slicing and dicing viral genetic material, producing fragmented sequences of nucleotides. They send these sequences to your server, which is then expected to locate the sequences in a database of genomes. The genome for a given virus can have hundreds of thousands of nucleotide bases, and you have hundreds of viruses in your database. Clearly, a simple brute-force search won't be fast enough to satisfy impatient PhDs.

Since the database that you are testing against is invariant, you can preprocess it to simplify the search. One preprocessing approach is to build a search trie-a type of tree that has N possible branches from each node, where N is the number of characters in the alphabet. For searching through input text, a straightforward approach to a search trie yields a suffix trie. "Suffix" refers to the fact that the trie contains all of the suffixes of a given block of text (perhaps a viral genome).

Figure 1 shows a suffix trie for the word BANANAS. Note that, starting at the root node, each of the suffixes of BANANAS is found in the trie, starting with BANANAS, ANANAS, NANAS, and finishing up with a solitary S. Because of this organization, you can search for any substring of the word by starting at the root and following matches down the tree until exhausted.


Figure 1.

What makes the suffix trie a nice construct is that if you have an input text of length N and search string of length M, a traditional brute-force search will take as many as N*M character comparisons to complete. Optimized searching techniques (such as the Boyer-Moore algorithm) can guarantee searches that require no more than M+N comparisons, with even better average performance. But the suffix trie requires only M character comparisons, regardless of the length of the text being searched! This means you could determine whether or not the word BANANAS was in the collected works of William Shakespeare by performing just seven character comparisons!

The catch is that suffix tries require O(N2) time and space to construct. This quadratic performance rules out the use of suffix tries where they are needed most-to search through long blocks of data.

Under the Spreading Suffix Tree

Edward McCreight gave a definitive illustration of the data structure used to circumvent this problem in his 1976 paper, "A Space-economical Suffix Tree Construction Algorithm" (Journal of the ACM, 23:262-272, 1976). The suffix tree retains the same topology as the suffix trie, but eliminates nodes that have only a single descendant. This process, known as "path compression," means that individual edges in the tree now represent sequences of text instead of single characters.

Figure 2 shows what the suffix trie from Figure 1 looks like when converted to a suffix tree. You can see that the tree still has the same general shape, just far fewer nodes. By eliminating every node with just a single descendant, the node count is reduced from 23 to 11.


Figure 2.

The reduction in the number of nodes, in concert with an efficient algorithm, ensures that the time and space requirements for constructing a suffix tree are reduced from O(N2) to O(N). In the worst case, a suffix tree can be built with a maximum of 2N nodes, where N is the length of the input text. So for a one-time investment proportional to the length of the input text, you can create a tree that turbocharges your string searches.

McCreight's original algorithm for constructing a suffix tree had a few disadvantages. Principal among them was the requirement that the tree be built right to left; by adding the longest suffixes first. This ruled the algorithm out for online processing, making it much more difficult to use for applications such as data compression.

Twenty years later, Esko Ukkonen developed a slightly modified version of the algorithm that works from left to right. Both my sample code and the descriptions that follow are based on Ukkonen's work, described in "On-line Construction of Suffix Trees" (Algorithmica, September 1995).

For a given string of text, T, Ukkonen's algorithm starts with an empty tree, then adds each prefix of T to the suffix tree. For example, when creating the suffix tree for BANANAS, B is inserted into the tree, then BA, then BAN, and so on. When BANANAS is finally inserted, the tree is complete.

Adding a new prefix to the tree is done by walking through the tree and visiting each of the current tree's suffixes. You start at the longest suffix (BAN in Figure 3), and work down to the shortest suffix, which is the empty string.


Figure 3.

Each suffix ends at a node that consists of one of these three types:

  • A leaf node. In Figure 4, the nodes labeled 1, 2, 4, and 5 are leaf nodes.


Figure 4.
  • An explicit node. The nonleaf nodes that are labeled 0 and 3 in Figure 4 are explicit nodes. They represent a point on the tree where two or more edges part ways.
  • An implicit node. In Figure 4, prefixes such as BO, BOO, and OO all end in the middle of an edge. These positions are referred to as implicit nodes. They would represent nodes in the suffix trie, but path compression eliminated them. As the tree is built, implicit nodes are sometimes converted to explicit nodes.

In Figure 4, there are five suffixes in the tree (including the empty string) after adding BOOK to the structure. Adding the next prefix, BOOKK, to the tree means visiting each of the suffixes in the existing tree, and adding the letter K to the end of the suffix.

The first four suffixes-BOOK, OOK, OK, and K-all end at leaf nodes. Because of the path compression applied to suffix trees, adding a new character to a leaf node will always just add to the string on that node. It will never create a new node, regardless of the letter being added.

After all of the leaf nodes have been updated, you still need to add K to the empty string, which is found at node 0. Since there is already an edge leaving node 0 that starts with the letter K, you don't have to do anything. The newly added suffix K will be found at node 0 and will end at the implicit node found one character down along the edge leading to node 2. Figure 5 shows the final shape of the resulting tree.


Figure 5.

Things get Knotty

Updating the tree in Figure 4 was relatively easy, involving two types of updates: simply extending an edge, and performing an implicit update. Adding BOOKKE to the tree in Figure 5 demonstrates the two other types of updates. In the first type, a new node is created to split an existing edge at an implicit node, followed by the addition of a new edge. The second type of update consists of adding a new edge to an explicit node.

When adding BOOKKE to the tree in Figure 5, you once again start with the longest suffix, BOOKK, and work your way to the shortest-the empty string. Updating the longer suffixes is trivial as long as you are updating leaf nodes. In Figure 5, the suffixes that end in leaf nodes are BOOKK, OOKK, OKK, and KK. The first tree in Figure 6 shows what the tree looks like after these suffixes have been updated using the simple string extension.


Figure 6.

The first suffix in Figure 5 that doesn't terminate at a leaf node is K. When updating a suffix tree, the first nonleaf node is defined as the active point of the tree. All of the suffixes that are longer than the suffix defined by the active point will end in leaf nodes. None of the suffixes shorter than it will terminate in leaf nodes.

The suffix K terminates in an implicit node part way down the edge defined by KKE. When testing nonleaf nodes, you need to see if they have any descendants that match the new character being appended; in this case, that would be E.

A quick look at the first K in KKE shows that it only has a single descendant: K. This means you have to add a descendent to represent E. This is a two-step process. First, split the edge holding the arc so that it has an explicit node at the end of the suffix being tested. The middle tree in Figure 6 shows what the tree looks like after the split.

Once the edge has been split and the new node has been added, you have a tree that looks like that in the third position of Figure 6. Note that the K node, which has now grown to be KE, has become a leaf node.

Updating an Explicit Node

After updating the suffix K, you have to update the next-shorter suffix, which is the empty string. The empty string ends at explicit node 0, so just check to see if it has a descendant that starts with the letter E. A quick look at the tree in Figure 6 shows that node 0 doesn't have this descendent, so another leaf node is added, which yields the tree in Figure 7.


Figure 7.

Generalizing the Algorithm

By taking advantage of a few of the characteristics of the suffix tree, you generate a fairly efficient algorithm. The first important trait is "once a leaf node, always a leaf node." Any node created as a leaf will never be given a descendant; it will only be extended through character concatenation. More importantly, every time a new suffix is added to the tree, you are going to automatically extend the edges leading into every leaf node by a single character. That character will be the last character in the new suffix.

This makes management of the edges leading into leaf nodes easy. Any time you create a new leaf node, just set its edge to represent all the characters from its starting point to the end of the input text. Even though you don't know what those characters are, you know they will be added to the tree eventually. Because of this, once a leaf node is created, just forget about it. If the edge is split, its starting point may change, but it will still extend all the way to the end of the input text.

This means that you only have to worry about updating explicit and implicit nodes starting at the active point, which was the first nonleaf node. Given this, you progress from the active point to the empty string, testing each node for update eligibility.

However, you can save some time by stopping the update earlier. As you walk through the suffixes, add a new edge to each node that doesn't have a descendant edge starting with the correct character. When you finally do reach a node that has the correct character as a descendant, simply stop updating. Knowing how the construction algorithm works, you can see that if you find a certain character as a descendent of a particular suffix, you are bound to also find it as a descendent of every smaller suffix.

The point where you find the first matching descendent is called the "end point" and it has a particularly useful feature. Since you add leaves to every suffix between the active point and the end point, you know that every suffix longer than the end point will soon be a leaf node. This means the end point will turn into the active point on the next pass over the tree!

By confining updates to the suffixes between the active point and end point, you can cut the processing required to update the tree. And by keeping track of the end point, you automatically know what the active point will be on the next pass. Example 1 is a first pass at the update algorithm using this information.

Example 1: Updating the suffix tree (in C-like pseudocode).

Update( new_suffix )
{
  current_suffix = active_point
  test_char = last_char in new_suffix
  done = false;
  while ( !done ) {
    if current_suffix ends at an explicit node {
      if the node has no descendant edge starting with test_char
        create new leaf edge starting at the explicit node
      else
        done = true;
    } else {
      if the implicit node's next char isn't test_char {
        split the edge at the implicit node
        create new leaf edge starting at the split in the edge
      } else
        done = true;
    }
    if current_suffix is the empty string
      done = true;
    else
       current_suffix = next_smaller_suffix( current_suffix )
  }
  active_point = current_suffix
}


The Suffix Pointer

The pseudocode in Example 1 is more or less accurate, but glosses over one difficulty-as you navigate through the tree, you repeatedly call next_smaller_suffix(). This routine has to find the implicit or explicit node corresponding to a particular suffix.

If you do this by simply walking down the tree to find the correct node, the algorithm isn't going to run in linear time. To get around this, add one additional pointer to the tree-the suffix pointer. The suffix pointer is found at each internal node. Each internal node represents a sequence of characters that start at the root. The suffix pointer points to the node that is the first suffix of that string. So if a particular string contains characters 0 through N of the input text, the suffix pointer for that string will point to the node that is the termination point for the string starting at the root that represents characters 1 through N of the input text.

Figure 8 shows the suffix tree for the string ABABABC. The first suffix pointer is found at the node that represents ABAB, and points to BAB. Likewise, BAB has its own suffix pointer, which points to the node for AB.


Figure 8.

The suffix pointers are built as the tree is updated. As I move from the active point to the end point, I keep track of the parent node of each of the new leaves I create. Each time I create a new edge, I also create a suffix pointer from the parent node of the last leaf edge I created to the current parent edge. (Obviously, I can't do this for the first edge created in the update, but I can for all the remaining edges.)

With the suffix pointers in place, navigating from one suffix to the next is simply a matter of following a pointer. This critical addition to the algorithm is what reduces it to an O(N) algorithm.

Tree Houses

To illustrate suffix trees, I wrote STREE.CPP, a program that reads in a string of text from standard input and builds a suffix tree. A second version, STREED.CPP, has extensive debug output.

Understanding STREE.CPP is just a matter of understanding the workings of the data structures it contains. The most important data structure is the Edge object; see Example 2.

Example 2: The Edge class.


class Edge {
 public :
    int first_char_index;
    int last_char_index;
    int end_node;
    int start_node;
    void Insert();
    void Remove();
    Edge();
    Edge( int init_first_char_index,
          int init_last_char_index,
          int parent_node );
    int SplitEdge( Suffix &s );
    static Edge Find( int node, int c );
    static int Hash( int node, int c );
};

Each time a new edge in the suffix tree is created, a new Edge object is created to represent it. The four data members of the object are:

  • first_char_index and last_char_index. Each edge in the tree has a sequence of characters from the input text associated with it. To ensure that the storage size of each edge is identical, I just store two indices into the input text to represent the sequence.
  • start_node. The number of the node that represents the starting node for this edge. Node 0 is the root of the tree.
  • end_node. The number of the node that represents the end node for this edge. Each time an edge is created, a new end node is created as well. The end node for every edge will not change over the life of the tree, so this can be used as an edge id as well.

A frequent task performed when building the suffix tree is a search for the edge emanating from a particular node based on the first character in its sequence. On a byte-oriented computer, there could be as many as 256 edges originating at a single node. To make the search reasonably quick and easy, I store the edges in a hash table, using a hash key based on their starting node number and the first character of their substring. The Insert() and Remove() member functions are used to manage the transfer of edges in and out of the hash table.

The second important data structure used when building the suffix tree is the Suffix object. Remember that updating the tree is done by working through all of the suffixes of the string currently stored in the tree, starting with the longest and ending at the end point. A Suffix is simply a sequence of characters that starts at node 0 and ends at some point in the tree.

It makes sense that we can then safely represent any suffix by defining the position in the tree of its last character, since we know the first character starts at node 0, the root. The Suffix object (see Example 3) defines a given suffix using that system.

Example 3: The Suffix class.

class Suff ix {
 public :
   int origin_node;
   int first_char_index;

   int last_char_index;
   Suff ix( int node, int start, int stop );
   int Explicit();
   int Implicit();
   void Canonize();
};

The Suffix object defines the last character in a string by starting at a specific node, then following the string of characters in the input sequence pointed to by the first_char_index and last_char_index members. For example, in Figure 8, the longest suffix, ABABABC, would have an origin_node of 0, first_char_index of 0, and last_char_index of 6.


Figure 8.

Ukkonen's algorithm requires that you work with these Suffix definitions in canonical form. The Canonize() function is called to perform this transformation any time a Suffix object is modified. The canonical representation of the suffix simply requires that the origin_node in the Suffix object be the closest parent to the end point of the string. This means that the suffix string represented by the pair (0, "ABABABC"), would be canonized by moving first to (1, "ABABC"), then (4, "ABC"), and finally (8,"").

When a suffix string ends on an explicit node, the canonical representation will use an empty string to define the remaining characters in the string. An empty string is defined by setting first_char_ index to be greater than last_char_index. When this is the case, you know that the suffix ends on an explicit node. If first_char_index is less than or equal to last_char_index, the suffix string ends on an implicit node.

Given these data structure definitions, you will find the code in STREE.CPP to be a straightforward implementation of the Ukkonen algorithm. For additional clarity, use STREED.CPP to dump copious debug information out at run time.

Acknowledgment

I was convinced to tackle suffix-tree construction by reading Jesper Larsson's paper for the 1996 IEEE Data Compression Conference. Jesper was also kind enough to provide me with sample code and pointers to Ukkonen's paper.


Mark is the author of C++ Programmer's Guide to the Standard Template Library (IDG Books, 1995) and The Data Compression Book, Second Edition (M&T Books, 1995).


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video