### Inserting a New String

The `insert` function inserts a new string into the tree, and does nothing if it is already present. We insert the string `s` with the code `root = insert(root, s);`. The first `if` statement detects running off the end of the tree; it makes a new node, initializes it, and falls through to the standard case. Subsequent code takes the appropriate branch, but branches to `eqkid` only if characters remain in the string.

```c
Tptr insert(Tptr p, char *s)
{
    if (p == 0) {
        p = (Tptr) malloc(sizeof(Tnode));
        p->splitchar = *s;
        p->lokid = p->eqkid = p->hikid = 0;
    }
    if (*s < p->splitchar)
        p->lokid = insert(p->lokid, s);
    else if (*s == p->splitchar) {
        if (*s != 0)
            p->eqkid = insert(p->eqkid, ++s);
    } else
        p->hikid = insert(p->hikid, s);
    return p;
}
```

This code is short but subtle, and worth careful study.
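As a study aid, the following self-contained sketch pairs the routine with a membership search written in the same style. The `Tnode` layout follows the field names used in the text; the `search` function here is our companion sketch, not necessarily the article's own search routine.

```c
#include <stdlib.h>

typedef struct tnode *Tptr;
typedef struct tnode {
    char splitchar;
    Tptr lokid, eqkid, hikid;
} Tnode;

/* The insert function from the text, repeated so the sketch compiles. */
Tptr insert(Tptr p, char *s)
{
    if (p == 0) {
        p = (Tptr) malloc(sizeof(Tnode));
        p->splitchar = *s;
        p->lokid = p->eqkid = p->hikid = 0;
    }
    if (*s < p->splitchar)
        p->lokid = insert(p->lokid, s);
    else if (*s == p->splitchar) {
        if (*s != 0)
            p->eqkid = insert(p->eqkid, ++s);
    } else
        p->hikid = insert(p->hikid, s);
    return p;
}

/* Membership test: follow the three-way branches; a string is present
   exactly when we reach a node whose splitchar is the terminating null. */
int search(Tptr p, char *s)
{
    while (p) {
        if (*s < p->splitchar)
            p = p->lokid;
        else if (*s == p->splitchar) {
            if (*s == 0)
                return 1;
            s++;
            p = p->eqkid;
        } else
            p = p->hikid;
    }
    return 0;
}
```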

The resulting tree stores the strings themselves, but no other information. Real symbol tables store additional data with each string. To illustrate this problem, we'll store a pointer to every string in the tree; this data will be used by later search algorithms. We could add a new `info` field to each node, but that would be wasteful. Instead, we'll exploit the fact that a node with a null `splitchar` cannot have an `eqkid`, and store the data in that field. We could make a `union` that contains the two possible pointers, but that would be syntactically clumsy (we would have to refer to the union at each reference). We will instead use the sleazy hack of coercing a string into a `Tptr`. The code inside the `*s == p->splitchar` test therefore becomes the following:

```c
if (*s == 0)
    p->eqkid = (Tptr) insertstr;
else
    p->eqkid = insert(p->eqkid, ++s);
```
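Putting the pieces together, the pointer-storing variant might be sketched as follows. Here `insertstr` is a variable in which we save the original string before descending (the demo program's actual bookkeeping may differ), and `lookup` is our name for a search that coerces the `eqkid` field back to a string.

```c
#include <stdlib.h>

typedef struct tnode *Tptr;
typedef struct tnode {
    char splitchar;
    Tptr lokid, eqkid, hikid;
} Tnode;

static char *insertstr;  /* original string, saved before descending */

/* Modified insert: at the terminating node, eqkid holds the string
   pointer, coerced to a Tptr as described in the text. */
static Tptr ins(Tptr p, char *s)
{
    if (p == 0) {
        p = (Tptr) malloc(sizeof(Tnode));
        p->splitchar = *s;
        p->lokid = p->eqkid = p->hikid = 0;
    }
    if (*s < p->splitchar)
        p->lokid = ins(p->lokid, s);
    else if (*s == p->splitchar) {
        if (*s == 0)
            p->eqkid = (Tptr) insertstr;
        else
            p->eqkid = ins(p->eqkid, ++s);
    } else
        p->hikid = ins(p->hikid, s);
    return p;
}

Tptr insert(Tptr p, char *s)
{
    insertstr = s;
    return ins(p, s);
}

/* Retrieve the stored pointer, or 0 if the string is absent. */
char *lookup(Tptr p, char *s)
{
    while (p) {
        if (*s < p->splitchar)
            p = p->lokid;
        else if (*s == p->splitchar) {
            if (*s == 0)
                return (char *) p->eqkid;
            s++;
            p = p->eqkid;
        } else
            p = p->hikid;
    }
    return 0;
}
```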

### Faster Insertion Functions

We can improve insertion in two different ways. We'll start by tuning the insertion code, then consider different orders in which we can insert the nodes into the tree.

A major cost of the `insert` function is calling `malloc` to create each node. The function `insert2` in the demo program (available electronically) uses the ancient C technique of allocating a buffer of nodes and dealing them out as needed. Profiling shows that this eliminates most of the time spent in storage allocation. We measured the time to insert the 234,936 words in a dictionary file (one word per line, for a total of 2,486,813 characters). Table 1 lists the run times in seconds on three 200-MHz machines, with the compilers optimizing code for speed. The `insert3` function (also available electronically) uses other common techniques to reduce the run time even further: transforming recursion to iteration, keeping a pointer to a pointer, reordering tests, saving a difference in a comparison, and splitting the single loop into two loops. These changes are beneficial on all machines, and substantial on the SPARC.
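The buffer-allocation idea behind `insert2` can be sketched as follows; `BUFSIZE`, `newnode`, and the bookkeeping variables are our names, not necessarily those in the demo program.

```c
#include <stdlib.h>

typedef struct tnode *Tptr;
typedef struct tnode {
    char splitchar;
    Tptr lokid, eqkid, hikid;
} Tnode;

/* Allocate nodes from a large block and deal them out one at a time,
   calling malloc only when the current block is exhausted. */
#define BUFSIZE 1000
static Tptr buf;        /* next free node in the current block */
static int bufleft = 0; /* nodes remaining in the current block */

static Tptr newnode(void)
{
    if (bufleft-- == 0) {
        buf = (Tptr) malloc(BUFSIZE * sizeof(Tnode));
        bufleft = BUFSIZE - 1;
    }
    return buf++;
}

/* The insert function from the text, with malloc replaced by newnode. */
Tptr insert(Tptr p, char *s)
{
    if (p == 0) {
        p = newnode();
        p->splitchar = *s;
        p->lokid = p->eqkid = p->hikid = 0;
    }
    if (*s < p->splitchar)
        p->lokid = insert(p->lokid, s);
    else if (*s == p->splitchar) {
        if (*s != 0)
            p->eqkid = insert(p->eqkid, ++s);
    } else
        p->hikid = insert(p->hikid, s);
    return p;
}
```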

**Table 1: Run times of insertion functions. **

### Better Insertion Orders

In what order should you insert the nodes into a tree? No matter in what order you insert the nodes, you end up with the same digital search trie -- the data structure is totally insensitive to insertion order. Binary search trees are at the opposite end of the spectrum: If you insert the nodes in a good order (middle element first), you end up with a balanced tree. If you insert the nodes in sorted order, the result is a long skinny tree that is very costly to build and search. Fortunately, if you insert the nodes in random order, a binary search tree is usually close to balanced.

Ternary search trees fall between these two extremes. You can build a completely balanced tree by inserting the median element of the input set, then recursively inserting all lesser elements and greater elements. A simpler approach first sorts the input set. The recursive `build` function inserts the middle string of its subarray, then recursively builds the left and right subarrays. We use this method in our experiments; it is fast and produces fairly well-balanced trees. The cost of inserting all words in a dictionary with function `insert3` is never more than about 10 percent greater than searching for all words. D.D. Sleator and R.E. Tarjan describe theoretical balancing algorithms for ternary search trees in "Self-Adjusting Binary Search Trees" (*Journal of the ACM*, July 1985).
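The sorted-array approach can be sketched like this, assuming an `insert` function such as the one shown earlier; `build` is our rendering of the recursive function the text describes.

```c
#include <stdlib.h>

typedef struct tnode *Tptr;
typedef struct tnode {
    char splitchar;
    Tptr lokid, eqkid, hikid;
} Tnode;

Tptr insert(Tptr p, char *s)
{
    if (p == 0) {
        p = (Tptr) malloc(sizeof(Tnode));
        p->splitchar = *s;
        p->lokid = p->eqkid = p->hikid = 0;
    }
    if (*s < p->splitchar)
        p->lokid = insert(p->lokid, s);
    else if (*s == p->splitchar) {
        if (*s != 0)
            p->eqkid = insert(p->eqkid, ++s);
    } else
        p->hikid = insert(p->hikid, s);
    return p;
}

/* Insert the middle string of the sorted subarray a[lo..hi],
   then recursively build the lesser and greater halves. */
void build(Tptr *rootp, char **a, int lo, int hi)
{
    int mid = (lo + hi) / 2;
    if (lo > hi)
        return;
    *rootp = insert(*rootp, a[mid]);
    build(rootp, a, lo, mid - 1);
    build(rootp, a, mid + 1, hi);
}
```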

### Comparison to Other Data Structures

Symbol tables are typically represented by hash tables. We conducted a simple experiment to compare ternary search trees to that classic data structure; the complete code is available electronically.

To represent `n` strings, our hash code uses a chained table of size `tabsize=n`. The hash function, from Section 6.6 of B. Kernighan and D. Ritchie's *The C Programming Language*, Second Edition (Prentice Hall, 1988), is reasonably efficient and produces good spread:

```c
int hashfunc(char *s)
{
    unsigned n = 0;
    for ( ; *s; s++)
        n = 31 * n + *s;
    return n % tabsize;
}
```

The body of the `search` function is as follows:

```c
for (p = tab[hashfunc(s)]; p; p = p->next)
    if (strcmp(s, p->str) == 0)
        return 1;
return 0;
```
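For completeness, here is a self-contained sketch of the chained table these fragments come from. The hash function is the one quoted above; the node type, `hinsert`, and the fixed `TABSIZE` are our scaffolding (the experiments themselves set `tabsize=n`).

```c
#include <stdlib.h>
#include <string.h>

#define TABSIZE 4096
static unsigned tabsize = TABSIZE;

typedef struct hnode {
    char *str;
    struct hnode *next;
} Hnode;

static Hnode *tab[TABSIZE];  /* chained table, initially all null */

/* The hash function quoted in the text. */
int hashfunc(char *s)
{
    unsigned n = 0;
    for ( ; *s; s++)
        n = 31 * n + *s;
    return n % tabsize;
}

/* Insert a string at the front of its chain; do nothing if present. */
void hinsert(char *s)
{
    int h = hashfunc(s);
    Hnode *p;
    for (p = tab[h]; p; p = p->next)
        if (strcmp(s, p->str) == 0)
            return;
    p = (Hnode *) malloc(sizeof(Hnode));
    p->str = s;
    p->next = tab[h];
    tab[h] = p;
}

/* The search body from the text, wrapped in a function. */
int hsearch(char *s)
{
    Hnode *p;
    for (p = tab[hashfunc(s)]; p; p = p->next)
        if (strcmp(s, p->str) == 0)
            return 1;
    return 0;
}
```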

For fair timing, we replaced the string comparison function `strcmp` with inline code (so the hash and tree `search` functions used the same coding style). We used the same dictionary to compare the performance of hashing and ternary search trees. We first built the tree by inserting each word in turn (middle element first, then recursing), then searched for each word in the input file. Table 2 presents the times in seconds for the two data structures across three machines. For both building and (successful) searching, the two structures have comparable run times.

**Table 2: Run times of data structures. **

Ternary search trees are usually noticeably faster than hashing for unsuccessful searches. Ternary trees can discover mismatches after examining only a few characters, while hashing always processes the entire key. On some data sets with very long keys and mismatches in the first few characters, ternary trees took less than one-fifth the time of hashing.

Functions `insert3` and `search` combine to yield a time-efficient symbol table. On one dictionary described at our web site, ternary search trees used about three times the space of hashing. An alternative representation of ternary search trees is more space efficient: When a subtree contains a single string, we store a pointer to the string itself (and each node stores three bits telling whether its children point to nodes or strings). This leads to code that is less efficient, but it reduces the number of tree nodes enough to bring the space close to that used by hashing.

We analyzed the worst-case performance of several aspects of ternary search trees in our theoretical paper. In "The Analysis of Hybrid Trie Structures" (*Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms*, January 1998), J. Clément, P. Flajolet, and B. Vallée analyze the expected performance of a broad class of hybrid trees that includes ternary trees, and describe extensive experiments to support their theory.

### Other Operations on Ternary Trees

Most standard techniques on binary search trees can be applied immediately to their ternary cousins. For instance, we can print the strings in the tree in sorted order with a recursive traversal:

```c
void traverse(Tptr p)
{
    if (!p)
        return;
    traverse(p->lokid);
    if (p->splitchar)
        traverse(p->eqkid);
    else
        printf("%s\n", (char *) p->eqkid);
    traverse(p->hikid);
}
```

Simple recursive searches can find the predecessor or successor of a given element or list all items in a given range. If we add a `count` field to every node, we can quickly count the elements in a given range, count how many words begin with a given substring, or select the `m`th largest element. Most of these operations require logarithmic time in a ternary tree, but linear time in a hash table.