Design

Generating Perfect Hash Functions

By Thomas Gettys, February 01, 2001

Thomas presents an efficient hash technique guaranteed to generate a perfect hash function for an arbitrary set of numeric search keys.

Feb01: Algorithm Alley

Thomas is a senior software developer for PSC Inc. He can be contacted at tpgettys @teleport.com.

Hashing is a searching technique. There are many ways to search a collection for an item, some very clever and elegant. For raw speed, however, hashing is how it is done, period.

Open any book on algorithms and you will find a chapter on searching. In that chapter, there will be a discussion on hashing. But after a short introduction to hashing and hash functions, you'll find that the bulk of the material is about collision resolution (see, for instance, The Art of Computer Programming, Vol.3: Sorting and Searching, by D.E. Knuth, Addison-Wesley, 1997).

There are a couple of very good reasons for this focus on collision resolution. The first is that what makes for a good hash function can be very application specific. Are the search keys numbers or character strings? How is the population of keys distributed (compare the names under "G" in your phone book with the words that begin with "G" in your dictionary)? Considerations such as these will affect the design of your hash function.

About all that can be reasonably said is to state the properties that good hash functions typically have in common and present some examples that have been found to work well (see "Hash Functions," by Bob Jenkins, DDJ, September 1997). Note in passing that there are interesting similarities between hash functions and random-number generators.

The second reason collision resolution techniques receive such attention is that collisions are a stone cold fact of life with dynamic search lists. Even with the very best hash function in hand, collisions will occur when inserting items into a hash table.

In The Art of Computer Programming, Vol.3: Sorting and Searching, Knuth mentions the so-called "Birthday Paradox" in this context. Suppose you have a hash table with 365 entries and a hash function that uniformly maps a person's birthday to a table location. The probability that there will be one or more collisions when 23 people are inserted into the table is greater than 50 percent!

Dictionaries

What if, however, you are blessed with the problem of searching a static list? There will be no insertions or deletions; the only operation is to determine if a given key is, or is not, in the list. This is called the "dictionary problem," and occurs in many settings.

Since the list to be searched is static, it makes sense to look for a hash function that minimizes the number of collisions. A hash function for which there are no collisions is called a "perfect hash function" (PHF). A PHF for which the hash table has no holes in it (that is, the hash table is only as big as the search list) is called a "minimal perfect hash function" (MPHF).

There are many techniques for finding a PHF (or MPHF) for a given set of search keys; which to use is somewhat dependent on the characteristics of your particular set of keys. An excellent survey of the current state-of-the-art is to be found in the monograph "Perfect Hashing," by Z.J. Czech, G. Havas, and B.S. Majewski (Theoretical Computer Science, 1997), which also features an extensive bibliography.

In this article, I'll demonstrate one of the techniques for generating a PHF for a set of numeric search keys that has come in handy on several occasions in my work on barcode decoding.

There are several barcode symbologies in use today, each with its own particular character set (for more information on barcodes, see "A C++ Class for Generating Bar Codes," by Douglas Reilly, DDJ, July 1993). The bars and spaces of a character are normalized and combined into a single number. The relationship between this number and the character it is defined to represent is arbitrary, and so a table search is necessary to perform the translation. Using a PHF, I can determine with a single calculation if the number corresponds to a valid character, and if so, which one.

Algorithm Overview

The algorithm I use to generate a perfect hash function is surprisingly easy to appreciate:

1. Start with a square array that is t units on a side.

2. Place each key K in the square at location (x,y), where x=K/t, y=K mod t.

3. Slide each row some amount so that no column has more than one entry.

4. Collapse the array down into a linear array; this is our hash table.

5. The hash function uses t and the displacements from step 3 to locate K.

An Example

To illustrate, I will use as the set of static keys the set S={0, 3, 4, 7, 10, 13, 15, 18, 19, 21, 22, 24, 26, 29, 30, 34}. Since each key is to be mapped into a square of size t, do you see that t must be at least 6? If you try t=5, for example, you would map the key K=34 to (x,y)=(6,4), which is outside the 5×5 square.

You therefore must chose t such that t*t>=max(S). In Figure 1, I have inserted the keys from set S into square A by computing the coordinates as x=K/6 and y=K=mod6 (x is the row number and y the column number).

This completes the first two steps in the algorithm overview. To accomplish step 3, the rows are shifted to the right until there is at most one key in each column, recording the amount of each shift in the array r. This is known as the "first-fit method (FFM). The result of applying the FFM to array A is shown in Figure 2.

The next step is to simply collapse the shifted rows down into hash table C (see Figure 3). Finally, you realize the hash function H(K) as follows:

x=K/t
y=K mod t
index=r[x]+y
H(K)=C[index]

For example, suppose K=15. You have that x=2 and y=3, r[2]=5, so the index into C is 5+3=8. Since C[8]=15, we have determined that 15 is an element of S.

What if K=17? Well, x=2 and y=5, so the index is r[2]+5=5+5=10. Since C[10]=19, you know that 17 is not an element of S.

A Small Improvement

The real secret sauce here is in finding a good set of row displacements. As it turns out, finding a set of displacements that minimizes the size of C is known to be a hard problem. There is a heuristic that often works well, however. It is known as the "first-fit decreasing method."

The only change from what I just did is in step 3; instead of taking the rows in turn, you start with the most populated of the rows first. Thus, try taking the row in the order: 3, 0, 4, 1, 2, 5; see Figure 4. In this case, you achieve a minimal perfect hash function. While this cannot be expected in general, it has been my experience that the first-fit decreasing method does result in acceptable table compression. I will always try several values of t and chose the one that yields the smallest hash table.

Conclusion

I've presented an efficient technique that is guaranteed to generate a perfect hash function for an arbitrary set of numeric search keys. Modifications to this technique that make it suitable for letter-oriented keys can be found in chapter 5 of "Perfect Hashing," by Z.J. Czech, G. Havas, and B.S. Majewski (Theoretical Computer Science, 1997).

I hope that you are more encouraged to consider using hashing, knowing that a perfect hash function can be obtained quickly and with certainty.

DDJ

Listing One

/***********************************************************************
File: FFDM.C - First Fit Decreasing Method
Overview:
 This program implements the FFDM algorithm for generating a Perfect Hash
 Function (PHF) for a given set of integer search keys.
Invocation:
 FFDM KEYFILE t-VALUE
 KEYFILE is the fully-qualified name of the text file containing the integer
 search keys, one per line.
 t-VALUE is the "magic" number to use for the hash function.  The value of
 t must be such that t*t > max(key).
Notes & Caveats:
-The algorithm is described in the article "Perfect Hashing" in Theoretical
 Computer Science 182, 1997, by Czech, Havas & Majewski.
-The data structure names mimic those from the article "Perfect Hashing" to
 make it easy to compare the two; the downside is that the names are not well
 chosen!
**************************************************************************/

#include <stdio.h>      // fopen() fclose()
#include <stdlib.h>     // atoi() exit()
#include <string.h>     // strcpy()

// the following section of definitions can be extracted into a header file

// error codes
#define Success           0
#define t_value_ERROR    -1
#define fopen_ERROR      -2
#define key_value_ERROR  -3

// application-specific constants
#define t_Max           100   // must be at least sqrt(max_key)
#define HashTable_Max  1000   // upper limit for
#define InvalidKey       -1   // a key value that's impossible for your app

struct RowStruct
{
   int RowNumber;                      // the row number in array A[][]
   int RowItemCnt;                     // the # of items in this row of A[][]
};

// global data

// the arrays A[][], r[] and C[] are those in the article "Perfect Hashing"
int A[t_Max][t_Max];       // A[i][j]=K (i=K/t, j=K mod t) for each key K
int r[t_Max];              // r[R]=amount row A[R][] was shifted
int C[HashTable_Max];      // the shifted rows of A[][] collapse into C[]

// Row[] exists to facilitate sorting the rows of A[][] by their "fullness"
struct RowStruct Row[t_Max];     // entry counts for the rows in A[][]

/***********************************************************************
Function: void InitArrays(void)
Overview: Prepares the algorithm data structures for use.
Parameters: none
Return Value: none
Notes & Caveats:
-A row offset may be 0, so the items in r[] are set to a negative value to
 indicate that the offset for each row is not known yet.
-Every item in A[][] and C[] is set to a value that is known to be an invalid
 key for the specific application.  -1 is often a good choice.
************************************************************************/

void InitArrays(void)
{
   int row, column;
   for (row = 0; row < t_Max; row++)
   {
      r[row] = -1;                     // valid offsets are non-negative
      Row[row].RowNumber  = row;       // insert the row numbers and
      Row[row].RowItemCnt = 0;         //  indicate that each row is empty

      for (column = 0; column < t_Max; column++)
      {
         A[row][column] = InvalidKey;
      }
   }
   for (row = 0; row < HashTable_Max; row++)
   {
      C[row] = InvalidKey;
   }
}
/*************************************************************************
Function: int ReadKeyData(char *Filename, int t, int *KeyCount)
Overview: Reads the file of seach keys and maps them into the array A[][].
Parameters:
 char *Filename   - the name of the file containing the search keys
 int t            - the number of rows in A[][]; max(key) must be < t*t
 int *KeyCount    - pointer to location to place number of keys read
Return Value:
 fopen_ERROR      - the specified file could not be opened (doesn't exist)
 key_value_ERROR  - a search key value was too large (depends on t)
 Success          - successful completion of responsibilities
Notes & Caveats:
-The number of items in each row is also computed and returned in
 Row[row].RowItemCnt.
-The number of keys is returned to the caller via a pointer.  If an error
 is detected the number of keys reflects how many keys were read before the
 error condition was detected.
************************************************************************/

int ReadKeyData(char *Filename, int t, int *KeyCount)
{
   FILE *fptr;
   int key;
   int row, column;

   *KeyCount = 0;               // set # keys=0 before attempting fopen
   fptr = fopen(Filename, "rt");
   if (fptr == NULL) return(fopen_ERROR);
// fill data structures using search key data
   while ((fscanf(fptr, "%d", &key)) == 1)
   {
      row    = key / t;
      column = key % t;
      if (row >= t) return(key_value_ERROR);
      A[row][column] = key;
      Row[row].RowItemCnt++;
      (*KeyCount)++;
   }
   return(Success);
}
/***********************************************************************
Function: void SortRows(int t)
Overview: sort Row[] in descending order of row fullness.
Parameters:
 int t - the number of rows in A[][]; max(key) must be < t*t
Return Value: none
Notes & Caveats:
-The algorithm needs to know which row of A[][] is most full, 2nd most full,
 etc.  This is most easily done by sorting an array of row-item-counts and
 remembering which row the item counts go with.  That is what the array
 "struct RowStruct Row[]" does for us.
-I saw no point in trying to be clever here, so a simple bubble sort is used.
************************************************************************/

void SortRows(int t)
{
   int i, j;
   struct RowStruct tmp;
   for (i = 0; i < t-1; i++)
   {
      for (j = i+1; j < t; j++)
      {
         if (Row[i].RowItemCnt < Row[j].RowItemCnt)
         {
            tmp    = Row[i];
            Row[i] = Row[j];
            Row[j] = tmp;
         }
      }
   }
}
void main(int argc, char *argv[])
{
   int t, NumKeys;
   int k, ndx, rc, row, offset, PrintFlag;
   char Filename[64];
// process the command-line arguments
   if (argc < 3)
   {
      printf("usage: FFDM KEYFILE t-VALUE\n");
      printf("where: KEYFILE is the name of your file of numeric keys\n");
      printf("       t-VALUE is a number such that t*t > max(key)\n");
      exit(-1);
   }
   strcpy(Filename, argv[1]);
   t = atoi(argv[2]);

   if (t > t_Max)
   {
      printf("t may not exceed %d\n", t_Max);
      exit(t_value_ERROR);
   }
   if (argc > 3) PrintFlag = 1;
   else          PrintFlag = 0;
// initialize data structures
   InitArrays();

// read in the user's key data
   rc = ReadKeyData(Filename, t, &NumKeys);
   if (rc != 0)
   {
      printf("ReadKeyData() failed with error %d\n", rc);
      exit(rc);
   }
// prime the algorithm - sort the rows by their fullness
   SortRows(t);

// do the First-Fit Descending Method algorithm
// For each non-empty row:
// 1. shift the row right until none of its items collide with any of
//    the items in previous rows.
// 2. Record the shift amount in array r[].
// 3. Insert this row into the hash table C[].

   for (ndx = 0; Row[ndx].RowItemCnt > 0; ndx++)
   {
      row = Row[ndx].RowNumber;        // get the next non-empty row
      for (offset = 0; offset < HashTable_Max-t-1; offset++)
      {
         for (k = 0; k < t; k++)       // does this offset avoids collisions?
         {
            if ((C[offset+k] != InvalidKey) &&
                (A[row][k]   != InvalidKey)) break;
         }
         if (k == t)
         {
            r[row] = offset;          // record the shift amount for this row
            for (k = 0; k < t; k++) // insert this row into the hash table
            {
               if (A[row][k] != InvalidKey)
               {
                  C[offset+k] = A[row][k];
               }
            }
            break;
         }
      }
      if (offset == HashTable_Max-t-1)
      {
         printf("failed to fit row %d into the hash table\n", row);
         printf("try increasing the hash table size\n");
         exit(-1);
      }
   }
// all done!  locate the "right-most" hash table entry
   for (k = 0; k < HashTable_Max; k++)
   {
      if (C[k] != InvalidKey) offset = k+1;
   }
// print the results
   printf("t value          : %d\n",  t);
   printf("Number of keys   : %d\n",  NumKeys);
   printf("Hash table size  : %d\n",  offset);
   printf("Table utilization: %f%\n", 100.0*NumKeys/offset);

   if (PrintFlag != 0)
   {
      printf("\noffset table r[]\n");
      printf("row offset\n");
      for (k = 0; k < t; k++)
      {
         printf("%2d  %3d\n", k, r[k]);
      }
      printf("\nhash table C[]\n");
      for (k = 0; k < offset; k++)
      {
         printf("%d\n", C[k]);
      }
   }
}

1 2 3 4 5 Next

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Design