An Incremental String Search in C

This data-matching algorithm narrows the search space with each keystroke


June 22, 2007
URL:http://www.drdobbs.com/cpp/an-incremental-string-search-in-c/200000227

Searching and sorting routines are an essential part of any programmer's toolkit. A number of sorting routines compete for the programmer's favor, distinguished from on another by speed, the format of the data they expect, and complexity. Most programmers choose a sequential or binary search algorithm. This article discusses a variant technique that is sometimes very useful.

Sequential and binary searches take a key and a sorted list as input, returning information about the location of the list entry that matches a given key. Sequential and binary searches, as they're usually implemented, expect to be given full information about the key to be searched for; neither attempts to find partial matches.

It's often more convenient to seek entries in a sorted list that are compatible with a given search string as characters in the search string are made available. Instead of waiting for the user to enter the full key to search on, we can examine the underlying list dynamically to determine which entries in the list are candidates for the item we're searching for. This method is usually referred to as incremental search. A few design decisions should be made up front when implementing an incremental search routine. How should the routine behave if a search string with no matches is entered? Should it refuse to accept characters in the search string with no match in the search list, or accept them and indicate that no match is possible? What if the data contains duplicate entries? What should be done when one entry in the string array being searched is a prefix of another? This situation would arise if two entries in the array were ZIMMER and ZIMMERMAN. If you typed ZIMMER as your search string, should the incremental search immediately report that it's found an exact match or await further instructions?

Regardless of the answers to these questions, any incremental search routine should certainly return some indicator of success or failure. If any partial matches are found, the range of indices that yield a match should also be made available to the calling routine.

Later, we'll look at a C function, inc_search, which performs incremental search on certain types of string arrays. Without going into too much detail now, suffice it to say that inc_search returns a 1 if it finds a partial match, a 0 if it doesn't, and a -1 if the search is aborted by the user.

To illustrate the basic idea behind our implementation of incremental search, suppose we define variables arr and base by:

typedef char STR[6]; 
STR arr[] = {"ABOVE","BASTE", 
  "BLUE","CANE"}; 
STR *base = arr; 

The first line defines the user type STR, which consists of strings of five or fewer characters (recall that strings in C are terminated by an ASCII zero). The second declares arr to be a four-element array of strings and define the entries in the array. The third line declares base to be a pointer to a variable of type STR and initializes base to point to the origin of the array arr. The effect of these declarations is shown in Figure 1.

Figure 1: Initial configuration base and ar.

Since base and arr point to data objects of type STR, we can say that base[0] = arr[0], base[1] = arr[1] , and so on. Now suppose the user enters `B' as the first character in the search string. A simple linear search would tell us that variables base[1] and base[2] provide partial matches. We'd like to arrange things so that when the user chooses the next character in the search string, the same linear search routine will be able to tell us the new candidates for the search. We can do this if we make base point to the second character in arr[1] and restrict the maximum index for base to 1, as shown in Figure 2.

Figure 2: Pointer configuration after entering code>.

If the user enters `L' as the search character, we can apply linear search to base[0] through base[1] and determine that a match results only for index 1. This terminates the search in this example. In effect, we're using base as a movable template for our string search. Each time a search character is entered, the range of indices that yield a match is computed and the value of base is shifted accordingly.

Three points bear mentioning. One is that the algorithm depends upon representing the string to be searched as a two-dimensional array of characters. We're assuming that characters in the strings are stored in contiguous memory locations.

Second, base points to a data object of size six. This ensures that base[0], base[1], and so on are offset from one another by six bytes, so the relative offsets that apply to elements of the array arr are preserved.

Finally, the strings referenced by the base pointers in Figures 1 and 2 are in alphabetical order. This is no accident. It's a consequence of the fact that entries in arr are in sorted order and we're limiting the offsets that are added to base. If our search algorithm had referenced base[2] in Figure 2 (a pointer to the string "ANE"), then base[0] through base[2] would not be in sorted order. Part of the housekeeping the incremental search algorithm must perform involves knowing what range of offsets we can safely add to base to preserve sorted order.

Coding Details

Only three functions are needed to implement an incremental string search. One function, getkey, is a simple keyboard filter. The second, get_range, determines the range of indices in a string array that match a given input character. The third, inc_search, uses the previous two functions to perform the search. Definitions of these functions and their required support code are shown in Listing One. getkey essentially reads and echoes any printable character entered from the keyboard, including a carriage return. In such cases, getkey returns the ASCII code of the key pressed.

#define MAXSTR 25    /* Maximum number of strings in array */
#define MAXLEN 30     /* Maximum length of each string */
#define ESC 27        /* ASCII code for Escape */

/* if MIXCASE is nonzero, comparisons are case-sensitive */
#define MIXCASE 0
#if MIXCASE
    #define CASEMOD(x) x
#else
    #define CASEMOD toupper
#endif

typedef char STR[MAXLEN+1];             /* Definition of type STR */
typedef STR STARRAY[MAXSTR]:            /* ... of type STARRAY    */

int get_range(search_char,array,last_index,ptr_first,ptr_last)
char search_char;
STR *array;
int last_index, *ptr_first, *ptr_last;
{
  int j, k;     /* Loop variables */
  int found;    /* Indicates if a match has been found */

  search_char = CASEMOD(search_char);
  for (j=0; j<=last_index; j++)  /* Seek first match */
    if (found = CASEMOD(array[j][0]) == search_char)
      {                          /* A match has been found at index j */
        k = j                    /* Store value in k */
    break;
      }
  if (found)
    {          /* A match has been found at index k; determine its extent */
      for (j=k; CASEMOD(array[j+1][0]) == search_char; j++)
        ;      /* Set j to last matching index */
      *ptr_first = k;
      *ptr_last = j;
      return(1);
     }
  return(0);
}

int getkey(void)
{ char ch;

  while (1)
    {
      if ((ch = getch()) == '\'' || ch == '\" ')
        return(0);
      if (ch == ESC || ch == '\r' || ch >= 32 && ch <= 125)
        { putch(ch);
          return(ch);
        }
      if (ch == 0) getch();  /* Discard function key input */
    }
}
int inc_search(arr,imax,ptr_start,ptr_end)
STARRAY arr;
int *ptr_start, *ptr_end;
{  int first,last,ch;    /* Parameters for GET_RANGE call */
   int OK = 1;           /* Return value of GET_RANGE; defaults to one */
   int start,end         /* First and last indices that yield a match */
   STR *base;            /* Starting location for search */

   end = imax;
   base = arr; start = 0;
   while ( (ch = getkey()) != '\r' && ch != ESC
           && (OK = get_range(ch,base,imax, &first, &last)))
     {                          /* No CR or ESC has been entered,
                                /* and the list of matches is nonempty */
       end = start + last;      /* Adjust values of end and start */
       start += first;
       if (strcmp(arr[start],arr[end])==0)
         break;           /* Abort loop if there's a unique match */
       else               /* A match has been found, but it's not unique */
          {               /* Adjust start location for search */
            base = &base[first][1];
            imax = last - first;  /* Adjust max search index */
          }
     }
   if (ch == ESC)
     return(-1);               /* Return -1 if search was aborted */
                               /* Else return values as described above */
   *ptr_start = start;
   *ptr_end = end;
   return(OK);

}

/* The code that appears from this point on is used solely to */
/* demonstrate the incremental search facility. */

STARRAY arr =
  { "Anderson","Bradley","Burke","Federico","Fikar",
    "Frazer","Geary","Grando","Haas","Lehr","Lemke","Mahoney",
    "Martin","Martino","Milne","Milnor","Montgomery","Rivera",
    "Rivers"};

int do_again(void)
{ char ch;

  printf("\n Do again ? (Y/N) ");
  while ((ch = toupper(getch())) != 'Y' && ch != 'N')
 
    ;
  putch(ch);
  return(ch);
}
main()
{  int k,maxindex,start,end;

   /* Print the array to be searched */
   for (k=0; k<=MAXSTR-1; k++)
     if (arr[k][0] == 0)
       { maxindex = k-1;
     printf("\n");
     break;
       }
     else printf("%-20s",arr[k]);
   for (k=1; k<=80; k++)
     putch('=');
   /* Perform incremental search */
   do  {
     printf("\n Enter a search string : ");
     switch (inc_search(arr,maxindex, &start, &end)) {
       case -1 : printf ("\n Aborted by user.");
                 break;
       case  0 : printf("\n No matches found.");
                 break;
       case  1 : printf("\n Matches are : \n");
                 for (k=start; k<=end; k++)
                   printf("%-20s",arr[k]);
         printf("\n"):
       }
   }
   while (do_again() == 'Y');
}

Listing One.

Some exceptions exist. If a single or double quotation mark is entered, the key is not echoed and getkey returns a value of zero. The reason for choosing this particular return value is simple. Since quotes are used to indicate a literal match, the return value of zero will match the ASCII zero terminator for a string only if an exact match exists between the search string entered thus far and string to which it's being compared.

get_range searches the leading characters in a string array for entries that match a given character. For the reasons discussed in the previous section, get_range must know the maximum array index it's allowed to reference; otherwise, it may not be searching a sorted array. If get_range succeeds in finding a match, it returns a value of one and passes the corresponding range of indices back to the caller through its argument list. If get_range fails to find a match, its return value is zero.

inc_search does the actual searching. The header for this function is int inc_search (arr,imax,ptr_start, prt_end) . Here arr is a pointer to a two-dimensional array of characters, and the integer imax gives the largest first index that arr can safely reference. If the search routine finds one or more matches for the search string, the formal parameters ptr_start and prt_end (which are pointers to integers) return the range of indices in arr[] that yield a match.

inc_search begins by setting the range of matching indices to be the entire set of array indices for the search array arr. This ensures that if the user presses Enter without first entering search characters, the function will (correctly) report that all array entries are compatible with the search string. It also sets base — the variable indicating the starting location for the array to be searched — equal to arr.

Next, inc_search calls getkey to get an input character. If this character is different from Escape or Enter, it's fed to get_range, which reports on its success or failure in finding a match. If get_range, failed or succeeded only for a single array value, the search terminates. The only other possibility is that a match occcured over two or more array values. Let first and last be the corresponding range of indices. We need to move the array pointer base so it points to the second character of base[first] . This is easily done in C with the statement base = e base[first][1].

After moving the origin of the search array this way, we start the entire cycle again: Ask for a character, find all possible matches, and move the array pointer accordingly. The process eventually ends in success, failure, or an abort.

The performance of the inc_search routine presented here seems quite acceptable, at least for modestly sized arrays. If you're applying the functions to an array of several hundred entries, some changes might be in order. get_range uses a sequential search to find matches. For large arrays, it might be better to use binary search initially, then switch to sequential search when the range of indices to be checked is less than a hundred. You may also wish to include code to allow for editing of the search string. In any case, the routines presented here should be short and clear enough to meet your needs.

Terms of Service | Privacy Statement | Copyright © 2024 UBM Tech, All rights reserved.