Database

Ternary Search Trees

By Jon Bentley and Bob Sedgewick, April 01, 1998

When you have to store a set of strings, what data structure do you use? Jon and Robert suggest one place you can start is with ternary search trees, which combine the time efficiency of digital tries with the space efficiency of binary search trees.

Inserting a New String

The insert function inserts a new string into the tree, and does nothing if it is already present. We insert the string s with the code root = insert(root, s);. The first if statement detects running off the end of new node, initializes it, and falls through to the standard case. Subsequent code takes the appropriate branch, but branches to eqkid only if characters remain in the string.

   Tptr insert(Tptr p, char *s)
   { if (p == 0) { 
          p = (Tptr) malloc(sizeof(Tnode)); 
          p->splitchar = *s; 
          p->lokid = p->eqkid = p->hikid = 0; 
     } 
     if (*s < p->splitchar) 
          p->lokid = insert(p->lokid, s); 
     else if (*s == p->splitchar) { 
          if (*s != 0) 
              p->eqkid = insert(p->eqkid, ++s); 
     } else 
          p->hikid = insert(p->hikid, s); 
     return p; 
  }

This code is short but subtle, and worth careful study.

The resulting tree stores the strings themselves, but no other information. Real symbol tables store additional data with each string. To illustrate this problem, we'll store a pointer to every string in the tree; this data will be used by later search algorithms. We could add a new info field to each node, but that would be wasteful. Instead, we'll exploit the fact that a node with a null splitchar cannot have an eqkid, and store the data in that field. We could make a union that contains the two possible pointers, but that would be syntactically clumsy (we would have to refer to the union at each reference). We will instead use the sleazy hack of coercing a string into a Tptr. The code inside *s == p->splitchar test, therefore, becomes the following:

if (*s == 0) 
    p->eqkid = (Tptr) insertstr; 
else 
    p->eqkid = insert(p->eqkid, ++s);

Faster Insertion Functions

We can improve insertion in two different ways. We'll start by tuning the insertion code, and consider different orders in which we can insert the nodes into the tree.

A major cost of the insert function is calling malloc to create each node. The function insert2 in the demo program (available electronically) uses the ancient C technique of allocating a buffer of nodes and dealing them out as needed. Profiling shows that this eliminates most of the time spent in storage allocation. We measured the time to insert the 234,936 words in a dictionary file (one word per line, for a total of 2,486,813 characters). Table 1 lists the run times in seconds on three 200-MHz machines, with the compilers optimizing code for speed. The insert3 function (also available electronically) uses other common techniques to reduce the run time even further: transforming recursion to iteration, keeping a pointer to a pointer, reordering tests, saving a difference in a comparison, and splitting the single loop into two loops. These changes are beneficial on all machines, and substantial on the SPARC.

Table 1: Run times of insertion functions.

Better Insertion Orders

In what order should you insert the nodes into a tree? No matter in what order you insert the nodes, you end up with the same digital search trie -- the data structure is totally insensitive to insertion order. Binary search trees are at the opposite end of the spectrum: If you insert the nodes in a good order (middle element first), you end up with a balanced tree. If you insert the nodes in sorted order, the result is a long skinny tree that is very costly to build and search. Fortunately, if you insert the nodes in random order, a binary search tree is usually close to balanced.

Ternary search trees fall between these two extremes. You can build a completely balanced tree by inserting the median element of the input set, then recursively inserting all lesser elements and greater elements. A simpler approach first sorts the input set. The recursive build function inserts the middle string of its subarray, then recursively builds the left and right subarrays. We use this method in our experiments; it is fast and produces fairly well-balanced trees. The cost of inserting all words in a dictionary with function insert3 is never more than about 10 percent greater than searching for all words. D.D. Sleator and R.E. Tarjan describe theoretical balancing algorithms for ternary search trees in "Self-Adjusting Binary Search Trees" (Journal of the ACM, July 1985).

Comparison to Other Data Structures

Symbol tables are typically represented by hash tables. We conducted a simple experiment to compare ternary search trees to that classic data structure; the complete code is available electronically.

To represent n strings, our hash code uses a chained table of size tabsize=n. The hash function, from Section 6.6 of B. Kernighan and D. Ritchie's The C Programming Language, Second Edition (Prentice Hall, 1988), is reasonably efficient and produces good spread:

int hashfunc(char *s) 
{   unsigned n = 0; 
        for ( ; *s; s++) 
                 n = 31 * n + *s; 
        return n % tabsize; 
}

The body of the search function is as follows:

for (p = tab[hashfunc(s)]; p; p = p->next) 
    if (strcmp(s, p->str) == 0) 
       return 1; 
return 0;

For fair timing, we replaced the string comparison function strcmp with inline code (so the hash and tree search functions used the same coding style). We used the same dictionary to compare the performance of hashing and ternary trees. We first built the tree by inserting each word in turn (middle element first, then recurring). Finally, we searched for each element word in the input file. Table 2 presents the times in seconds for the two data structures across three machines. For both building and (successful) searching, the two structures have comparable search times.

Table 2: Run times of data structures.

Ternary search trees are usually noticeably faster than hashing for unsuccessful searches. Ternary trees can discover mismatches after examining only a few characters, while hashing always processes the entire key. On some data sets with very long keys and mismatches in the first few characters, ternary trees took less than one-fifth the time of hashing.

Functions insert3 and search combine to yield a time-efficient symbol table. On one dictionary described at our web site, ternary search trees used about three times the space of hashing. An alternative representation of ternary search trees is more space efficient: When a subtree contains a single string, we store a pointer to the string itself (and each node stores three bits telling whether its children point to nodes or strings). This leads to code that is less efficient, but it reduces the number of tree nodes close to the space used by hashing.

We analyzed the worst-case performance of several aspects of ternary search trees in our theoretical paper. In "The Analysis of Hybrid Trie Structures" (Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms, January 1998), J. Clement, P. Flajolet, and B. Valle analyze the expected performance of a broad class of hybrid trees that includes ternary trees, and describe extensive experiments to support their theory.

Other Operations on Ternary Trees

Most standard techniques on binary search trees can be applied immediately to their ternary cousins. For instance, we can print the strings in the tree in sorted order with a recursive traversal:

void traverse(Tptr p) 
{    if (!p) return; 
     traverse(p->lokid); 
     if (p->splitchar) 
         traverse(p->eqkid); 
     else 
         printf("%s/n", (char *) p->eqkid); 
     traverse(p->hikid); 
}

Simple recursive searches can find the predecessor or successor of a given element or list all items in a given range. If we add a count field to every node, we can quickly count the elements in a given range, count how many words begin with a given substring, or select the mth largest element. Most of these operations require logarithmic time in a ternary tree, but linear time in a hash table.

Previous 1 2 3 Next

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Database

Ternary Search Trees

Inserting a New String

Faster Insertion Functions

Better Insertion Orders

Comparison to Other Data Structures

Other Operations on Ternary Trees

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

Database

Ternary Search Trees

Inserting a New String

Faster Insertion Functions

Better Insertion Orders

Comparison to Other Data Structures

Other Operations on Ternary Trees

Related Reading

News

Commentary

Slideshow

Video

Most Popular

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content