Channels ▼
RSS

Database

Supercharging Sequential Searches

Source Code Accompanies This Article. Download It Now.


DEC90: SUPERCHARGING SEQUENTIAL SEARCHES

Speed plus compression equals faster searches

Walter is an analyst at Phoenix Mutual Life. He can be reached at 5 Burns Ave., Enfield, CT 06082 or 203-745-9159.


We all know a sequential search is slow. Search time increases linearly with the size of the list; and as a list grows beyond a few items, the search time quickly becomes unbearable. Nevertheless, because it is easy to code, works on just about any list, and provides acceptable speed for short lists, the sequential search remains one of the most commonly used search algorithms.

Consequently, there is much to be gained by speeding up the sequential search algorithm, while maintaining its inherent generality and simplicity. This article describes a simple algorithm that can often speed up a sequential search by a factor of two or more. And there's a special bonus -- a list can also be compressed, often to half its original size. The improvement results from a better method of comparing each key. The number of key comparisons is the same as with any sequential search, but the time spent comparing each key is dramatically reduced.

As with any sorted sequential list, the number of keys compared will be, on average, half the number of keys in the list. The number of key comparisons, however, is not the whole story. Each key comparison is itself a sequential search (a search for a non-matching character), so the number of character comparisons (I'll assume the keys are character strings) is N / 2 * K / 2 (or NK / 4) where N is the number of items in the list and K is the key length.

A sorted list, however, presents us with an interesting opportunity. The opportunity arises from the fact that the sort brings similar keys together. Often, the first few characters of one key duplicate the first few characters of the preceding key. As a consequence, a typical sequential search spends much of its time comparing the same leading characters over and over. If those redundant characters are skipped the search will be faster. The approach described here, called the suffix list, speeds up the search by eliminating those needless comparisons. Here's how it works.

The list is kept in ascending order and each key is divided into two parts, a prefix and a suffix. The prefix is the portion of the key which matches the previous key. The suffix is the remainder of the key, beginning with the first character that differs from the previous key. The suffix list stores an integer that represents the prefix length along with each key.

Unlike a simple list, in which the complete key for each item is used during the search, a suffix list uses only the suffix in key comparisons. The prefix itself is ignored, because its length -- the number of characters that match the previous key -- provides all the information needed for the search.

The data can be stored as a linked list, as an array of fixed length items, or as a series of variable length items concatenated one right after the other in contiguous memory. Listing One, page 100, shows code which uses a linked list. It doesn't matter which method you use; the basic principles are the same, although the methods for traversing the list differ.

A linked list has the advantage of making insertion and deletion of items easier, but storing the list in contiguous memory uses less space.

The seq_cell_S structure in Figure 1(a) illustrates a typical structure for building a linked list. The sfx_cell_S structure in Figure 1(b), on the other hand, illustrates a structure for building a list to be used with a suffix search. The only difference is the addition of the element cell.pfxcnt, which stores the prefix length.

Figure 1: The seq_cell_S structure in (a) illustrates a typical structure for building a linked list. The sfx_cell_S structure in (b), on the other hand, illustrates a structure for building a list to be used with a suffix search. The only difference is the addition of the element cell.pfxcnt, which stores the prefix length.

  (a)

  struct seq_cell-S {
                       struct seq_cell_S *next; /* Next node */
                       struct seq_cell_S *prev; /* Previous node */
                       char key[1];             /* Key value */
                       }cell;

  (b)

  struct sfx_cell_S {
                       struct seq_cell_S *next; /* Next node */
                       struct seq_cell_S *prev; /* Previous node */
                       unsigned char pfxcnt;    /* Prefix Length */
                       char key[1];             /* Key value */
                       }cell;

cell.key is the first byte of a null terminated string that contains the key. Note that the cell.key is not a pointer to a string, but is the actual location of the beginning of the string. The cell.prev and cell.next elements are pointers to the previous and following cells in the list, respectively.

Figure 2 shows a list of city names, the prefix counts, and prefix and suffix values. In seq_cell_S, the prefix and suffix are not stored separately. The full key is stored in cell.key, as normal, but is now supplemented by the prefix length, which is kept in cell.pfxlen.

Figure 2: A list of city names, the prefix counts, and prefix and suffix values

  Standard      Pfxlen  Prefix        Suffix
-------------------------------------------------

  Acampo         0                    Acampo
  Acton          2      Ac            ampo
  Adelanto       1      A             delanto
  Adin           2      Ad            in
  Agoura Hills   1      A             goura Hills
  Agoura Hills  12      Agoura Hills
  Aguanga        2      Ag            uanga
  Ahwahnee       1      A             hwahnee
  Alameda        1      A             lameda
  Alamo          4      Alam          o

Searching

Like any sequential search, a pattern key is compared to each successive key in the list. The search starts at the beginning of the list and continues until a matching item is found or until an item greater than the pattern is found. An example of code that performs this task is contained in the Search( ) function in Listing One.

Unlike a standard sequential search, a suffix list search does not examine the actual key value for every item in the list. Instead, the prefix length for each item is compared to a running count of the number of characters matched so far in the pattern. When the search begins, the match count is zero; nothing has been matched. The search progresses, and the match count increases as each character of the pattern is matched until, when the item is found, the match count is equal to the pattern length.

Only when the match count is equal to the item's prefix count does the actual key value come into play.

If the prefix length is greater than the number of characters matched in the pattern, the search skips directly to the next item in the list. Why? Observe that the next character to compare is part of the prefix for the current key; it is, by definition, the same as the character in the same position in the previous key. It is also the first character in the last key that did not match the pattern. Obviously, if the character did not match in the last key, it will not match in this one. So the search can safely jump to the next item in the list.

If the prefix length is less than the match count, the search ends in failure -- the pattern is not in the list. This happens only when a character position that has already been successfully matched contains a new and different character. The list is in ascending order, so that new character would have to be greater than the one already matched in that position. Therefore, the pattern key would have to have come before the current item if it were in the list.

If the prefix length does, in fact, equal the match count, the suffix must be compared to the pattern. The comparison proceeds character by character, beginning with the first character of the suffix and with the first unmatched character in the pattern. The match count is incremented for each matching character. If it turns out that the pattern matches the suffix exactly, the item has been found, and the search is over. If the pattern is greater than the item, the search continues. If the pattern is less than the item, the item is not on the list, and the search fails.

Consider the following example, which searches for Adept in the list in Figure 2. When the search begins, the match count is zero. The prefix count of the first item is, of course, also zero. So Adept is compared to Acampo. Only the first character matches, so the match count becomes 1, and the search continues. The prefix count for the next item in the list, Acton, is 2, which is greater than the match count, so it is skipped. The prefix count of the third item, Adelanto, is the same as the match count, so the pattern and suffix are compared. The next two characters of the pattern, de, match the corresponding characters in the key, so the match count is advanced by 2, to become 3. The next character, l, is less than p, so the search continues. The fourth item, Adin, has a prefix count of 2, which is less than the match count. The search is over because the item is not in the list.

From the previous example it's clear that this algorithm makes relatively few character comparisons. The average number of comparisons will be between N / 2 + K and N + K. (Where N is the number of records, K is the average key length, and the comparison of the prefix count to the match count is a single comparison.) This is generally far less than the average of KN / 4 character comparisons which a standard search would make.

Insertion

Before an item can be inserted into the list, the appropriate location for the item has to be found. After all, the list has to remain correctly sorted after the new item is added. The first step in adding an item, therefore, is to search for the item by using the algorithm described above. In the case of duplicate items, the search must continue to the last matching item.

After the position for the new item is found, the key has to be separated into prefix and suffix. That's easy, because the prefix was already identified during the search. Search( ) saves the match count from the item just before the position at which the new key is to be inserted. That match count is, by definition, the prefix count for the new item. We just allocated enough memory to hold the cell structure, and then insert the new item, including the prefix length and key value, into the list. The whole process is not much different from that of any other linked list insertion.

But there is a wrinkle. Remember that each item's prefix length depends on the previous item. It's possible that after insertion there will be greater similarity between the new key and the next one. If that's the case, the prefix length of the next item in the list will change. Because the list is sorted, the prefix length will increase if it changes.

The prefix lengths tell us all we need to know to adjust the next item. The prefix length of the new item will never be less than the existing prefix length of the next item -- if it were, the list wouldn't be properly sorted. If the prefix length of the new item is greater than that of the next item, the prefix length of the next item will not change. Only if the prefix lengths of the two items are the same will the prefix length of the next item change. In that case, the two suffixes must be compared and the prefix length adjusted accordingly.

Let's insert Adept into the sample list. The first step is to search the list, just as we did before. That search ended just after Adelanto with a match count of 3. So the new node is inserted into the list between Adelanto and Adin with a prefix length of 3. The prefix length for Adin is not the same as the prefix length for the new entry, so the prefix length for Adin does not change.

Deletion

Deletion of an item from the list is similar to insertion. Again, the first step is to search through the list until the desired item is found; the second step is to remove it.

The actual removal of the item is no different than removal of an item from any other list -- memory allocated from the heap must be freed, pointers updated, and so on. But just as with insertion, the prefix length of the next item following the deleted item may change. This time it will get smaller, never larger.

The adjustment of the prefix length depends, not surprisingly, on the prefix length of the item being deleted and the prefix length of the item following it. The new prefix length for the next item will be the lesser of the two.

Let's delete Adept. The search proceeds as before, this time ending successfully with Adept. We unlink it from the list, but before releasing the memory we compare the prefix length of the deleted item to that of the item following it. The new prefix length of Adin is already less than that of Adept, so the prefix length for Adin does not change.

Compression

You will have observed that the prefix portion of the key is never used. In fact, as far as the search is concerned, it can be eliminated completely. The prefix will never have to be reconstituted for basic list operations. Even when part of the suffix must be rebuilt on deletion, all of the information needed is contained in the suffix of the key being deleted.

By eliminating the prefix, the list can be stored more compactly. That's an obvious advantage if the list is kept entirely in memory. But it's also an advantage if the list must be retrieved from disk frequently -- the more compact the data, the greater its chance of being in cache.

How much space does it save? Tests run on a list of 250 city names and zip codes give some idea of the improvement possible. The original list used 20 bytes for each city name. The average length of the actual names was about 8.2 characters and the average suffix length was about 5.7 characters. The suffix structure includes a single unsigned charto store the prefix length, and the variable-length keys require a null terminator, so the net result (including 4 bytes for links) is a savings of 20 - (5.7 + 6) = 8.3 bytes per record. That's a 41 percent savings relative to a fixed length table. (See Table 1.)

Table 1: Typical savings provided by compression technique

  Type of List             List Size   Percent Saved
  --------------------------------------------------

  Fixed length records     20.0 bytes        0%
  Linked list (full keys)  13.2             33%
  Linked suffix list       11.7             41%
  Contiguous suffixes       7.7             61%

Of course the amount of space saved depends upon the data. The greater the similarity between keys, the greater the savings. Best of all, when duplicate keys occur, the suffix is null and none of the key is stored.

Performance

A search of the 250 city names was about twice as fast as a standard sequential search. The improvement agrees with predictions based on the formulas for character comparisons presented earlier.

There are a few other tricks for speeding up a sequential search. These include using the search pattern as an end marker, unrolling the loop, or using a self-organizing list.

The self-organizing list is generally the most effective of the three. When the distribution obeys Zipf's law, it takes NK / log(2)N character comparisons. The self-organizing list is a substantial improvement over a standard sequential search; but it is generally not quite as fast as the suffix search. The trade-off between a self-organizing list and a suffix list will not favor the self-organizing list unless the key length is less than the log (base 2) of the number of records.

Applications

This algorithm was originally devised to search nodes in a B-tree. In a B-tree each node contains keys that are very similar -- sometimes all of the keys are identical -- so the suffix search is substantially faster than a standard sequential search. But the real payoff is that by eliminating the prefix many more keys fit in a node -- and that reduces the number of disk hits, which are relatively time-consuming. A binary search, which is often used to find a key in a B-tree node, is still faster than the suffix search, but the reduction of the number of disk hits more than makes up for the slower search. (If duplicate keys are permitted in the B-tree, the binary search must be followed by a sequential search anyway.)

There are, of course, other applications where keys with similar prefixes are common: directory lists, compiler symbol tables, and so forth. Similar improvements ought to be possible there, too. It makes sense to compress keys by removing the redundant prefix, and that was the original objective of this method. It was somewhat surprising, however, to find that the compression improves the speed of the search. One would expect the elimination of the prefix portion of the key to make list maintenance more awkward. Instead, the prefix count turns out to be more useful than the actual characters of the prefix.

There are drawbacks to the method. If the prefix is eliminated, the keys have to be reconstructed when needed. The programs are also a bit more complex than a standard sequential search. But for many applications, the advantages far outweigh those drawbacks.

_SUPERCHARGING SEQUENTIAL SEARCHES_ by Walter Williams

[LISTING ONE]

<a name="0270_000e">

/***********************************************************************
 * SS.C -- Sample Sorted Sequential Suffix Search (c) 1989 Walter Williams
 ***********************************************************************/

#include <stdio.h>
#include <string.h>
#include <malloc.h>

#ifndef TRUE
#define TRUE    1
#define FALSE   0
#endif

typedef struct snode_S
    {
    struct snode_S *prev;   /* Address of previous node in list     */
    struct snode_S *next;   /* Address of next node in list         */
    unsigned int pfxlen;    /* Number of characters in prefix       */
    char key[1];            /* First character of key               */
    } snode_T, *snode_TP;

/************************ Function Prototypes ***************************/
snode_TP Search(char *, snode_TP, int *, unsigned int *);
snode_TP Insert(char *, snode_TP);
snode_TP Delete(char *, snode_TP);

/*----------------------------------------------------------------------*/
/* SEARCH() -- Search the list for a pattern key.                       */
/* 'pattern' is a null terminated string containing the key which is    */
/*           the object of the search.                                  */
/* 'list' is the address of a dummy node which contains head and tail   */
/*           pointers for a linked list.                                */
/* 'exact' is the address of a flag which is TRUE for an exact match    */
/*           and FALSE if the pattern is not found.                     */
/* 'match' is the address of an unsigned int to use as a match counter  */
/* The return value is a pointer to the structure containing the        */
/* matching key, or the next largest node if the pattern was not found. */
/*----------------------------------------------------------------------*/

snode_TP Search(char *pattern, snode_TP list, int *exact, unsigned int *match)
    {
    snode_TP cnode;     /* Pointer to current node              */
    char *sp;           /* Suffix pointer                       */
    int tm= 0;          /* Temp storage for match count         */
    /***/
    *exact= FALSE;                      /* Assume unsuccessful search   */
    *match= tm;
    for (cnode= list->next;  cnode != list;  cnode= cnode->next)
        {
                                /* Compare match count to prefix count  */
        if (tm < cnode->pfxlen)
            continue;
        else if (tm > cnode->pfxlen)
            break;
        else /* (tm == cnode->pfxcnt) */
            {
                 /* Compare the actual key suffix, maintain match count */
            sp= cnode->key + cnode->pfxlen;
            while (*pattern == *sp && *sp && *pattern)
                {
                ++sp;
                ++pattern;
                ++tm;
                }
                    /* Done if suffix greater than or equal to pattern  */
            if (*pattern < *sp )
                {
                break;
                }
            else if (*pattern == '\0' && *sp == '\0')
                {
                *match= tm;
                *exact= TRUE;
                break;
                }
            }
        *match= tm;
        }
    return (cnode);
    }

/*--- INSERT() Adds an item to the list.  ---*/
snode_TP Insert(char *pattern, snode_TP list)
    {
    snode_TP cnode;         /* Node we are inserting                    */
    snode_TP nnode;         /* Next node after cnode                    */
    char *sp;               /* Pointer to suffix                        */
    unsigned int match;
    int exact;
    /***/
                                /* Find spot where we insert the node   */
    nnode = Search(pattern, list, &exact, &match);
    if (exact == TRUE)          /* Skip to first non-matching key       */
        {
        nnode = nnode->next;
        while (nnode != list && nnode->key[nnode->pfxlen] == '\0')
            nnode = nnode->next;
        }
                                /* Allocate space for the new node      */
    cnode = (snode_TP) malloc(sizeof(snode_T) + strlen(pattern));
    cnode->pfxlen = match;
    strcpy(cnode->key, pattern);
                                /* Link it into the list ahead of nnode */
    cnode->next = nnode;
    cnode->prev = nnode->prev;
    nnode->prev->next = cnode;
    nnode->prev = cnode;
                                /* Update pfxlen in following node      */
    sp = nnode->key + nnode->pfxlen;
    if (cnode->pfxlen == nnode->pfxlen)
        {                       /* Compare the two suffixes             */
        nnode->pfxlen= 0;
        while (*sp == *pattern && *pattern && *sp)
            {
            ++sp;
            ++pattern;
            ++nnode->pfxlen;
            }
        }
    return (cnode);
    }

/*--- DELETE() Deletes an item from the list ---*/
snode_TP Delete(char *pattern, snode_TP list)
    {
    snode_TP cnode;         /* Node we are deleting                     */
    snode_TP nnode;         /* Next node after cnode                    */
    int exact;              /* Flag set if exact match                  */
    unsigned int match;     /* No. of characters matched in pattern     */
    /***/
                            /* Find the node we want to delete          */
    cnode = Search(pattern, list, &exact, &match);
    if (exact == FALSE)     /* Abort if not an exact match              */
        {
        printf("%s not found\n", pattern);
        nnode= NULL;
        }
    else
        {                   /* Remove it from the list                  */
        cnode->next->prev = cnode->prev;
        cnode->prev->next = cnode->next;
        nnode = cnode->next;/* Save for return value                    */
                            /* Update suffix in following node          */
        if (cnode->pfxlen < cnode->next->pfxlen)
            cnode->next->pfxlen = cnode->pfxlen;
                            /* Release deleted node                     */
        free((char *) cnode);
        printf("%s deleted\n", pattern);
        }
    return (nnode);
    }












Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video