Dr. Dobb's | Tables Within Tables: C++ and Brent's Technique

Tables Within Tables: C++ and Brent's Technique

June 03, 2007
URL:http://www.drdobbs.com/cpp/tables-within-tables-c-and-brents-techni/199900579

One of the primary goals of object-oriented programming is to enable programmers to create code that is reusable and applicable to more than one project. The C++ class facility is one way to address that goal. This article explores the use of the class facility in C++ to write generic code and shows an implementation of an interesting variation on hashing known as "Brent's technique."

Brent's technique was invented by R.P. Brent, a computer scientist and mathematician about whom I know little other than his citations in Donald Knuth's The Art of Computer Programming, Vol 3: Sorting and Searching. Brent's technique is a variation on the primary/secondary hashing technique Knuth calls "Algorithm D" (Listing One).

INSERT IS
  address = H1(data)
  if table[address] is empty
    insert data in table[address]
  else
    address = H1(data) + H2(data)
    while table[address] isn't empty
      address = address + H2(data)
    end while
  end if
  insert data in table[address]

LOOKUP IS
  address = H1(data)
  if table[address] isn't equal to data then
    address = H1(data) + H2(data)
    while table[address] isn't equal to data and
      table[address] isn't empty
      address = address + H2(data)
    end while
  end if
  if table[address] is empty then
    return empty
  else
    return table[address]
  end if

Listing One: Pseudocode for Algorithm D: Define two mapping functions--H1 and H2

Hash Tables

Hash tables are a method of implementing the abstract data type Dictionary, which supports the following operations:

CREATE - create an empty dictionary
INSERT - insert an object into the dictionary
LOOKUP - find an object in the dictionary or return an error
REMOVE - remove an object from the dictionary.

The simplest implementation of a dictionary is a linear list. To insert an object, stick it in the first empty slot in the list. To find an object, look at each object in the list, succeed if you find it, and fail otherwise. To delete an object, mark its slot as empty. This is fine for small numbers of objects, but the time taken by the LOOKUP operation is proportional to the number of objects in the list.

Hashing is one of the methods used to improve the time complexity of LOOKUP. Define a function that maps the data contained in an object into a table address. Instead of looking at every object in the table to see if an object is present, compute the address from the object you're seeking and look at that element of the table. Either it's there or it isn't. The time complexity of a hash LOOKUP is determined by the time it takes to compute the hash function, which is independent of the table size.

The catch is that if you are using a finite table (which is generally the case!), more than one key can map to the same address. This duplicate mapping is called a collision. You can minimize collisions by increasing the size of the table; if the table is most empty, new objects are unlikely to collide with existing ones. But there will always be a practical limit to how much memory you can afford to waste. Therefore, the time complexity of a hashing algorithm is determined primarily by how efficiently it can resolve collisions.

Use a second function to handle collisions. Add the secondary hash value to the primary one to get a new address and, while a collision still exists, add the secondary hash value to the current address. Eventually, you will arrive at an empty slot in which to put your data.

The best hash function is the simplest one that sufficiently minimizes collisions. Knuth discusses several hash functions, but the best is based on modulo arithmetic and careful selection of table size. If the table size is prime, you will get an even distribution of objects in the table. So your primary hash function becomes:

H1(data) = data MOD tablesize

or, more properly:

H1(data) = map(data) MOD tablesize

where map is some function that derives an integer from the data in the key.

The secondary function must never collide with the primary function. To prevent such collisions, select a number relatively prime to the table size. You can start at any address in the table and visit every other address by repeatedly adding in the secondary hash function.

The best behavior of all would be to use twin primes -- two prime numbers that have the property p1 - p2 = 2. (Why this is the case isn't within the scope of this article; take it on faith or read Knuth.) LOOKUP closely follows the logic of INSERT. The sequence of address produced by adding the secondary hash value is called the collision list for a particular object. To find an object, look down its collision list until you find either the object or an empty slot.

Brent's Technique

Brent's technique (Listing Two) is a method for optimizing the time complexity of successful LOOKUP operations.

INSERT IS
  address = H1(data)
  depth = 0
  while table[address] isn't empty
    address = address + H2(data)
    increment(depth)
  end while
  if depth < 2 then
    table[address] = data
  else
    address = H1(data)
    for i = 0 to depth
      address2 = address + H2(table[address])
      depth2 = 0
      while depth2 <= i and table[address2] isn't empty
        address2 = address + H2(table[address])
        increment(depth2)
      end while
      if depth2 <=  i
        table[address2] = table[address]
        return
      end if
      address = address + H2(data)
    end for
    table[address] = data
  end if
  insert data in table[address]

Listing Two: Pseudocode for Brent's technique.

For certain applications (notably compilers), the number of successful LOOKUPs is overwhelmingly larger than the number of INSERTs. Brent's technique works by spending more time inserting objects into the table in hopes of making the table better suited for LOOKUP. A C++ implementation is shown in Listing Three.

#include <stdio.h>
#if !defined(_ZTC_)
#include <= stream.h>
// Bad old UNIX doesn't have ansi headers
#define UINT_MAX ((unsigned)-1)
typedef unsigned size_t;  // should be correct most places.
#else
#include <= stream.hpp>= 
#include <= limits.h>= 
#endif
#include "hashtab.hpp"

// find_tabsize returns the next twin prime >= the desired table size
static size_t find_tabsize(size_t tabsize) {
// Each element of primes is the lower twin of a twin prime.  The table
// also has the property that primes[i] >= (primes[i]+100) to keep the
// table size down.
  static unsigned primes[] = {
  5, 107, 227, 347, 461, 569, 809, 1019, 1151, 1277, 1427, 1607, 1721,
  1871, 1997, 2111, 2237, 2339, 2549, 2657, 2789, 2969, 3119, 3251, 3359,
  3461, 3581, 3767, 3917, 4019, 4127, 4229, 4337, 4481, 4637, 4787, 4931,
  5099, 5231, 5417, 5519, 5639, 5741, 5849, 6089, 6197, 6299, 6449, 6551,
  6659, 6761, 6869, 7127, 7307, 7457, 7559, 7757, 7877, 8009, 8219, 8387,
  8537, 8819, 8969, 9239, 9341, 9461, 9629, 9767, 9929, 10037, 10139, 10271,
  10427, 10529, 10709, 10859, 11057, 11159, 11351, 11489, 11699, 11831, 11939,12041,
  12161, 12377, 12539, 12821, 13001, 13217, 13337, 13679, 13829, 13931, 14081,14249,
  14387, 14549, 14867, 15137, 15269, 15581, 15731, 15887, 16061, 16187, 16361,16631,
  16829, 16979, 17189, 17291, 17417, 17579, 17681, 17789, 17909, 18041, 18251,18521,
  18911, 19079, 19181, 19379, 19541, 19697, 19841, 19961, 20147, 20357, 20477,20639, 
  20747, 20897, 21011, 21191, 21317, 21491, 21599, 21737, 21839, 22037, 22157,22271,
  22481, 22619, 22739, 22859, 22961, 23201, 23369, 23537, 23669, 23831, 24107,24371,
  24917, 25031, 25169, 25301, 25409, 25577, 25799, 25931, 26111, 26249, 26681,26861,
  27059, 27239, 27407, 27527, 27689, 27791, 27917, 28097, 28277, 28409, 28547,28661,
  29021, 29129, 29387, 29567, 29669, 29879, 30011, 30137, 30269, 30389, 30491,30839,
  31079, 31181, 31319, 31511, 31721, 31847,
  };
#  define PTABSIZ (sizeof(primes)/sizeof(size_t))

  int i;
  for(i = 0; i <=  PTABSIZ; i++)
    if(tabsize <= = primes[i])
      return primes[i] + 2;
  cerr < < "Table size too big!_n";
  exit(-1);
}
#define empty ((void *) 0)
#define deleted ((void *) -1)

// constructor for the hashtab class
hashtab::hashtab(size_t tablesize,
    size_t (*map)(void *element),
    int (*compare)(void *e1, void *e2)) {
  // round up table size to nearest twin prime
  this->tablesize = find_tabsize(tablesize);
  // allocate the table
  table = new void *[this->tablesize];
  if(table == NULL) {
    cerr << "Out of memory in hashtab constructor";
    exit(1);
  }
  // clear the table
  for(int i = 0; i < this->tablesize; i++)
    table[i] = empty;
  // store comparison and mapping functions
  this->map = map;
  this->compare = compare;
  // clear insertion count
  inserted = 0;
}

// insert an element into the hash table, using brent's technique.
void hashtab::insert(void *element) {
  size_t mapped_val,h1,h2,cur,depth,i;
  size_t c2,cur2,depth2;

  // test predicate for finding an empty slot in the table.
#  define VACANT(x) (table[(x)] == empty || table[(x)] == deleted)
  // check for table full -- always leave at least
  // one empty slot in the table, or you'll search endlessly for
  // keys not in a full table!
  if(++inserted == tablesize) {
    cerr << "hash table full_n";
    exit(1);
  }

  // map key to value and do primary hash
  mapped_val = map(element);
  h1 = mapped_val % tablesize;
  // secondary hash
  h2 = (mapped_val % (tablesize - 2)) + 1;
  // compute depth of collision list
  for(cur = h1, depth = 0; !VACANT(cur); cur = (cur + h2) % tablesize) {
    // if the element is already in the table then replace it and
    // return
    if(compare(element,table[cur]) == 0) {
      table[cur] = element;
      return;
    }
    depth++;
  }
  // if we aren't too deep, just jam the element here.
  if(depth < 2) {
    table[cur] = element;
    return;
  }
  for(i = 0,cur = h1; i < depth; i++,cur = (h1+h2) % tablesize) {
    // secondary hash of current element of collision list
    c2 = (map(table[cur]) % (tablesize - 2)) + 1;
    // see if this table element would be displaced fewer
    // positions than the object being inserted.
    for(depth2 = 0,cur2 = (cur + c2<=) % tablesize;
      !VACANT(cur2) && depth2 <=  i;
      depth2++,cur2 = (cur2 + c2) % tablesize)
      ;
    // if we're still going to improve matters by inserting here
    if(depth2 <=  i) {
      table[cur2] = table[cur];
      break;
    }
  }
  // postcondition : cur is the index of an empty slot in which
  // to place element.
  table[cur] = element;
}

#define NOTFOUND (UINT_MAX)

// lookup function private to hashtab class.
size_t hashtab::_lookup(void *element) {
  size_t mapped_val,h1,h2,cur;
  // map key to value and do primary hash
  mapped_val = map(element);
  h1 = mapped_val % tablesize;
  // secondary hash.
  h2 = (mapped_val % (tablesize - 2)) + 1;
  for(cur = h1;
    table[cur] != empty &&
    (table[cur] == deleted ? 1 : compare(element,table[cur] != 0);
    cur = cur + h2) % tablesize)  
    ;
  return (table[cur] == empty ? NOTFOUND : cur);
}

// find an element in the table
void *hashtab::lookup(void *element) {
  size_t index = _lookup(element);
  return (index == NOTFOUND ? empty : table[index]);
}

// remove an element from the table.
void hashtab::remove(void *element) {
  int x;
  if(inserted-- == 0) {
    cerr << "deletion from empty hash table_n";
    exit(1);
  }
  x = _lookup(element);
  if(x != NOTFOUND)
    table[x] = deleted;
}

#if defined(DEBUG_HASHTAB) && defined(_ZTC_)
void hashtab::dump_tab(void (*display)(void *e)) {
  size_t i;
  for(i = 0; i < tablesize; i++) {
    if(table[i] == empty)
      cout << "empty";
    else if(table[i] == deleted)
      cout << "deleted";
    else
      display(table[i]);
    if (i && (i % 5 == 0))
      cout.put('_n');
    else
      cout.put(' ');
  }
  if (i % 5 == 0)
    cout.put('_n');
}
#endif

Listing Three: Brent.CPP.

First, compute the length of the collision list for the object being inserted. If the length is less than two, simply insert the object at the end of its collision list.

If the depth of the collision list is greater than two, look at each element in the list. If the first element would only need to be moved one place down the list, move it and replace it with the object to be inserted. If the second element would have to be moved fewer than two places down its collision list, replace it.

In general, if the n^th element in an object's collision list would have to be moved fewer than n places down its own collision list, replace it. This increases the access time for the object being displaced but decreases the access time for the object being inserted, giving you a net improvement over the straight Algorithm D.

Since the worst-case performance of LOOKUP is directly proportional to the length of the longest collision list, the updated table can be searched more quickly because each insertion balances the collision list lengths as fairly as possible. The proof of the superiority of Brent's technique over Algorithm D is, as they say, left as an exercise for the reader.

If you are interested in hashing in general, read the chapter on hashing in Knuth's book. The treatment is complete without being tedious.

C++ and Brent's Technique

The class mechanism of C++ is perfect for hiding the details of implementing an abstract data type. A class has public and private parts; put the data type's permissible operations in the public part and hide all the nasty details in the private part.

The ideal hash table class would be totally generic and manage arbitrary objects without regard for their contents. This isn't possible because C++ is not a pure object-oriented language. You can't pass types as parameters to a constructor. The hash table class needs to know some things about the objects it is trying to manage, such as what empty and deleted objects look like and how to map an object to an address and decide if two objects are equal.

The solution I found (Listing Four) is similar to the usual C technique of using function pointers to implement generic functions. The hash table contains pointers to objects instead of the objects themselves. A pointer to comparison and mapping functions must be passed to the constructor for the type.

#define DEBUG_HASHTAB
// hashtab.hpp -- class definition for hash table
class hashtab {
  size_t tablesize;                    // size of the table
  size_t inserted;                     // # of elements inserted
  void **table;                        // actual table
  size_t (*map)(void *element);        // map a key to the table index type
  int (*compare)(void *e1,void *e2);   // compare two elements a la strcmp
  size_t_lookup(void *element);        // internal lookup function
public:
  hashtab(size_t tablesize,
    size_t (*map)(void *element),
    int (*compare)(void *e1, void *e2));
  );
#endif
};

Listing Four: HASHTAB.HPP.

The constructor function for a hash table uses the function find_tab_size to round up the table size to a suitable twin prime. I generated the table of primes with a hacked-up version of the standard sieve of Eratosthenes program (it's useful for something besides benchmarking, after all!).

An example usage would be something like the one in HASHTEST.CPP (Listing Five).

// inttab class

#include "hashtab.hpp"

size_t imap(void *element) { return *((int *)element); }

int icompare(void *e1,void *e2) {
        if(*(int *)e1 > *(int *)e2)
                return 1;
        else if(*(int *)e1 == *(int *)e2)
                return 0;
        else
                return --1;
}

class inthashtab  : hashtab {
public:
        inthashtab(size_t size) : (size,imap,icompare) {}
        int *lookup(int x) {
                return (int *)hashtab: :lookup((void *)&x);
        }
        void insert(int x) {
                int *ip;
                if(hashtab: :lookup(&x) != empty) return;
                ip = new int;
                *ip = x;
                hashtab: :insert((void *)ip);
        }
        void remove(int x) {
                int *ip = (int *)hashtab: :lookup((void *)&x);
                if(ip != NULL) {
                        hashtab: :remove((void *)ip);
                        delete ip;
                }
         }
};

Listing Five: HASHTEST.CPP.

The member functions of the derived class inthashtab take different arguments from the parent class hashtab, so it is not necessary to make the members in the parent class virtual -- the members of inthashtab overload the names of the parent class functions instead of redefining them.

This approach works but has one potential problem. If you wanted to use a pointer or reference variable of the type hashtab, you wouldn't be able to call the samenamed member functions of the derived class. You might try to get around this by making the parent class functions virtual, but C++ won't allow you to redefine the type of a virtual function in a derived class.

So you can't indiscriminately assign pointers to derived hash table objects to hash table pointers and expect to use them. However, this restriction is not onerous since you generally don't need polymorphic access to instances of a hash table class. If you think of a reason to do this (besides perversity), I'd be interested in hearing it.

The inttab class is a trivial example. When the only data in an object is a key, the LOOKUP function degenerates into a Boolean function. You don't retrieve anything from the table; you just find out whether something was already there. Also, using new to create each table element is inefficient for small objects; ideally, you should use memory-allocation techniques. But inttab should give you some idea of what you can do with the hashtab class.

You aren't limited to using the hashtab class the way I did. Instead of deriving a new subclass from hashtab, use hashtab variables as private data for a class with an entirely different interface. Drop hashtab right into the associative array class discussed in Bjarne Stroustrup's The C++ Programming Language. You can even use multiple hashtab variables to look up objects in the table using more than one key!

C++ allows you to develop generic, reusable data types that can successfully hide most ugly implementation details. Since C++ is not a pure object-oriented language, we can't hide them all. But this isn't really a problem: our's is an impure world, and the mixed nature of C++ may be a better fit than pure object-oriented languages. It is sometimes nice to work at a low level -- something you can't do with a pure language like Smalltalk.

Brent's technique is a very good fit for certain applications. Having a general-purpose hash class in the toolbox leaves no excuse for using linear searches for lookups!