An Existential Dictionary

You can avoid the overhead of conventional search techniques by recording the existence of a key--without storing the key itself.


November 01, 1990
URL:http://www.drdobbs.com/database/an-existential-dictionary/184408440


Superimposed coding packs a lot of information into a small space

This article contains the following executables: E FLOYD.ZIP

Edwin is manager of a data center for the Hughston Foundation, a non-profit orthopaedic and sports medicine facility in Columbus, Georgia. He is an occasional contributor to DDJ. Edwin can be reached through CompuServe at 76067,747.


Have you ever needed a search routine that could determine with fair accuracy whether or not a particular piece of information exists without the overhead of a conventional search? If so, the "superimposed coding" technique I present in this article may be the tool you've been waiting for. I've implemented the system as a Turbo Pascal object that supports superimposed code dictionaries, which I call "existential dictionaries" because they record the fact that a key exists without storing the key itself. Existential dictionaries can be used for a variety of applications, including spell checking, document retrieval, and database applications.

The technique of superimposed coding packs a lot of information into a small space. An existential dictionary for a 10,000-word spelling checker, for instance, can occupy as little as 23K. In the case of database applications, an existential dictionary can save time. For example, consider a 10,000-record database indexed on one field. A superimposed code dictionary for 10,000 keys can be as small as 18K -- small enough to hold in memory. Checking the dictionary for a key before searching the index for that key eliminates almost every unnecessary index search, reducing such searches by roughly a factor of one thousand.

Small existential dictionaries can be used as "surrogate keys" to help locate documents indexed by author and subject keyword terms. The size of a surrogate key file is typically 10 to 20 percent of the size of a conventional index, and the surrogate key file can be substantially quicker when searching for certain types of queries. Surrogate key files are also used in place of conventional indices in some very large databases.

How Superimposed Coding Works

Consider a hypothetical tiny spelling checker with a word list of 400 English words. To construct an existential dictionary for it, start with an array of 8000 bits, all initially off. For each word in the list, generate a hash number between 0 and 7999, and turn the corresponding bit in the array on. If the hash function is sufficiently random (a non-trivial problem), it will distribute the on bits uniformly in the table with very few collisions. (A "collision" happens when two words hash to the same bit position.) The code bits for all words are superimposed in the same table, without regard to collisions.

To test an unknown word for membership in the list, first generate the word's hash number. If the corresponding bit is off, we are certain that the word is not in the list. If the bit is on, the word is either in the list, or its hash number simply happens to match the hash number of a word in the list. This is called a "false drop." (A false drop happens when the existence test succeeds incorrectly due to a collision.) For this dictionary, the probability of a false drop is about 400/8000, or 1/20. We want to reduce this probability as much as is practical.

One way to reduce the false drop probability is to increase the size of the bit table. For instance, we can double the size of the bit table to 16,000 and change the range of the hash function correspondingly. The new probability will then be about 1/40. Jon Bentley, in his priceless book Programming Pearls, describes a spelling-checker program written by Doug McIlroy. Doug's program uses a 2^27 (134 million) bit table to store existence bits for about 30,000 words. His false drop probability is a little less than 1/4000. By an ingenious representation trick, Doug manages to compress this 16-Mbyte table to 40K, which, plus an index, makes for a total dictionary size of less than 64K.

Another way to reduce the false drop probability is to represent each word by more than one bit. In our example, we can use two hash functions on each word to turn about 800 bits on out of a total of 8000. In this case, the probability for a single-bit collision increases to about 1/10. A false drop now requires two collisions with a combined probability of 1/10 * 1/10 or about 1/100. Thus, using two bits per word improves the accuracy of this dictionary fivefold. Using three bits per word will improve the accuracy even further.
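To make the idea concrete, here is a minimal Turbo Pascal sketch of a two-bit-per-word table. It is illustrative only: SimpleHash is a deliberately crude stand-in for a well-randomized hash function (the article develops a much better one later), and the table size matches the 8000-bit example above.

  Const
    TableBits  = 8000;    { N: size of the bit table in bits }
    TableBytes = 1000;    { TableBits div 8 }

  Var
    BitTable : Array[0..TableBytes - 1] Of Byte;

  Procedure ClearTable;
  { All bits start out off. }
  Begin
    FillChar(BitTable, SizeOf(BitTable), 0);
  End;

  Procedure SetBit(BitNo : LongInt);
  Begin
    BitTable[BitNo Div 8] := BitTable[BitNo Div 8] Or ($80 Shr (BitNo Mod 8));
  End;

  Function BitIsOn(BitNo : LongInt) : Boolean;
  Begin
    BitIsOn := (BitTable[BitNo Div 8] And ($80 Shr (BitNo Mod 8))) <> 0;
  End;

  Function SimpleHash(Salt : Word; Var s : String) : LongInt;
  { A deliberately crude stand-in for a real hash function; the Salt
    parameter makes the two "hash functions" differ. }
  Var
    i : Integer;
    h : LongInt;
  Begin
    h := Salt;
    For i := 1 To Length(s) Do
      h := ((h * 31) + Ord(s[i])) And $7FFFFFFF;
    SimpleHash := h;
  End;

  Procedure InsertWord(Var w : String);
  { Superimpose two bits for w onto the shared table. }
  Begin
    SetBit(SimpleHash(1, w) Mod TableBits);
    SetBit(SimpleHash(2, w) Mod TableBits);
  End;

  Function WordMayExist(Var w : String) : Boolean;
  { FALSE means "definitely not present"; TRUE means "present, barring a false drop." }
  Begin
    WordMayExist := BitIsOn(SimpleHash(1, w) Mod TableBits) And
                    BitIsOn(SimpleHash(2, w) Mod TableBits);
  End;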

More Details

Our dictionary is not limited to English words. In general, any sequence of bytes, or "key," can be represented as a set of bits in a bit table. The number of bits used to represent a key is independent of the length of the key. As more bits are used to represent each key and the bit table fills up, two effects become noticeable.

First, more collisions occur. The number of on bits in the table is significantly less than the number of keys multiplied by the number of bits per key. In fact, if our hash functions randomize well and the bit table is not too small, the number of on bits will be very close to the value predicted by the BitsOn equation in Figure 1(a). In this equation, N represents the size of the table in bits, e is the natural logarithm base, K represents the number of keys, and B represents the number of bits per key. (See the accompanying textbox entitled "Derivation of the BitsOn Equation.") If we use ten bits per word in our hypothetical dictionary, then N= 8000, K= 400, and B=10. According to this formula, about 3150 bits are on.

Figure 1: Dictionary equations (a): BitsOn equation (b): FalseDrop equation (c): Optimal table-size equation (d): Expected collisions

  (a) BitsOn = N - N * e^(-BK/N)

  (b) FalseDrop = (BitsOn / N)^B

  (c) N = KB / (-Loge(0.5)) ~= KB / 0.693147...

  (d) Expected collisions = K - N + N * e^(-K/N)

In general, the false drop probability is the product of all the single-bit collision probabilities, each of which is BitsOn/N (see Figure 1(b)). This equation yields a false drop probability of (3150/8000)^10, or about 1/11,000, for our example dictionary.

The second effect of using more bits to represent each key is that the accuracy of the dictionary continues to improve more and more slowly until half of the bits in the table are on. At that point, the accuracy begins to deteriorate. At the optimal point, when half of the bits are on, the false drop probability is 1/2^B. Given the number of keys and the number of bits per key, we can rearrange the equations described earlier into a simple formula for the optimal table size, as shown in Figure 1(c).

To use Doug McIlroy's problem as an example, we have 30,000 keys and we want the false drop probability to equal 1/4000 or less. Representing each key by 12 bits in an optimal bit array will yield a false drop probability of 1/2^12, or 1/4096. In this case, the array size will be 30,000 * 12 / 0.693147 = 519,370 bits, or 64,921 bytes, which is somewhat larger than Doug's dictionary. The procedure for building our bit table is considerably faster and simpler than Doug's procedure, which requires a sort. Also, by making our bit table a little larger than the optimal size, we can add keys dynamically. For instance, a 65,520-byte table at 12 bits per key becomes optimal at about 30,277 keys. At 31,000 keys, the false drop probability is still only 1/3369.
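As a quick cross-check on this arithmetic, the optimal-size formula of Figure 1(c) reduces to a one-line Pascal function; the sketch below is essentially the same calculation the DictionaryBytes function in Listing One performs.

  Function OptimalBytes(Keys : LongInt; BitsPerKey : Byte) : LongInt;
  { Figure 1(c), converted from bits to bytes: N = K*B / -Ln(0.5), then divided by 8. }
  Begin
    OptimalBytes := Round(Keys * BitsPerKey / (-Ln(0.5) * 8.0));
  End;

  { OptimalBytes(30000, 12) returns 64921, matching the figure computed above. }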

The Hash Function

With these formulae, we can engineer a bit table for any number of keys and with any desired degree of accuracy. The calculated performance is guaranteed, as long as the hash function is sufficiently random. I evaluated a number of hash functions found in the literature for speed and randomization and eventually created one based upon a random-number generator described by Stephen Park and Keith Miller in a 1988 Communications of the ACM article entitled "Random Number Generators: Good Ones are Hard to Find." The "Minimal Standard" generator described in their article is a simple multiplicative congruential algorithm that generates a sequence of seeds:

  Next seed = (seed*16807) modulo 2,147,483,647

The Park-Miller generator is a 31-bit, full-period generator. Given any 31-bit starting seed (except zero), it will not return to that seed or repeat a number for 2,147,483,646 cycles. The sequence passes most tests of randomness. It is not, of course, truly random. A given starting seed always generates the same sequence of "random" numbers. Because of this, we call it a "pseudo-random" number generator.
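Listing One implements this generator with Carta's machine-code shortcut. For readers who prefer portable, high-level code, a functionally equivalent sketch (my own, using the Schrage integer decomposition discussed in Park and Miller's paper, so that no intermediate result overflows a LongInt) looks like this:

  Const
    RandA = 16807;         { the multiplier, 7^5 }
    RandM = 2147483647;    { the modulus, 2^31 - 1 }
    RandQ = 127773;        { RandM Div RandA }
    RandR = 2836;          { RandM Mod RandA }

  Var
    Seed : LongInt;        { current seed; initialize to any value in 1..RandM - 1 }

  Function NextRand : LongInt;
  { Returns the next seed, (Seed * RandA) Mod RandM, computed without overflow. }
  Begin
    Seed := RandA * (Seed Mod RandQ) - RandR * (Seed Div RandQ);
    If Seed <= 0 Then Seed := Seed + RandM;
    NextRand := Seed;
  End;

Park and Miller give a handy check: starting from a seed of 1, the 10,000th value returned should be 1,043,618,065.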

In the January, 1990 Communications of the ACM, David Carta described a fast machine-language implementation of the Park-Miller generator. My version of the Carta implementation for the 8086 instruction set requires only two 16-bit multiply operations, plus some shifts and adds.

A good pseudo-random number generator simplifies the task of creating hash functions. The generator eliminates the need for a separate hash function for each dictionary bit. Instead, we need only a single hash function to produce a suitable starting seed. The random number generator will then beget an endless supply of unique, random, 31-bit integers. Each integer, modulo N, is a bit position to turn on or test.

We still need to construct a suitable hash function to generate the starting seed. I found this task unexpectedly frustrating. The entire responsibility for avoiding collisions rests with the seed-generator function. If this function generates the same seed number for two different keys, the bits selected will also be the same for those two keys. Therefore, the seed function must generate as few 31-bit seed collisions as possible among the keys of interest and should also be quick. What is a reasonable collision performance to expect from a hash function? I set out to devise a suitable test.

First, I needed a large number of typical keys. I started with a list of 109,584 unique English words obtained from a public domain source. The words averaged 8.5 characters each. I planned to use a hash function to generate a 31-bit number for each word. If the hash function randomized well, it would produce about the same number of collisions as a random selection of 109,584 items from a population of 2^31, with replacement. I computed the number of collisions to expect in a random selection of K items from a population of N (see Figure 1(d)). According to this equation, for K = 109,584 and N = 2,147,483,647, I could expect about three collisions, which is not enough for a meaningful test.

Accordingly, I expanded the number of keys. I generated three keys from each English word: all uppercase, all lowercase, and first letter capitalized with the rest of the word lowercase. (I excluded the single-letter words "A" and "I.") I now had 328,751 keys. I also included the 10-digit ASCII numbers from zero to 109,583, which brought the count to K = 438,335 unique keys and 45 expected collisions. To evaluate each hash function, I used it on the keys to generate 438,335 31-bit numbers. Next, I sorted the numbers and counted the duplicates. Throughout the remainder of this article, I'll refer to this procedure as the "collision test."

In addition, I devised a realistic test of the hash function's false-drop performance. I partitioned the word list into seven disjoint subsets of 8000 to 18,000 words each. To evaluate each hash function, I used it to create an optimal bit table for each subset, using 14 bits per word. I shifted each word to lowercase before hashing. Then, for each subset, I counted the number of false drops from the entire word list. I left all the test words in uppercase, so that none of them matched any word in a subset. The false drop probability for each bit table is about 1/16,384. So, a good hash function should produce about 7*109,584/16,384, or about 47 false drops for all subsets combined. In the following text, I'll refer to this procedure as the "practical test."

The behavior of hash functions on this scale is somewhat counterintuitive. A widely used, 32-bit CRC function, for instance, generated thousands of collisions. Eventually, I ran across the following reference to a simple hash algorithm in Donald Knuth's The Art of Computer Programming:

    1. Generate a sequence of as many pseudo-random numbers as there are bytes in the key.

    2. Multiply each byte in the key by a different member of the sequence.

    3. Add up the results.

Since I had a source of random numbers, I tried this hash algorithm with an initial seed of "1." The resulting collision count was 61. This result was close, but there were still two problems with this algorithm. It ignored trailing null bytes, and, unless the starting seed was varied, it used the same sequence of multipliers for every key. The first problem was easy to fix -- I simply added a constant to each byte before multiplying. (I used 65,280, or FF00h.)

I am ashamed to say how many blind alleys I ran down before settling on a solution to the second problem. Because the starting seed completely determines the random number sequence, you might think that enough variation could be easily introduced by setting the seed to a simple function of the length or the terminal key characters. Not so. The only reliable method I've found to randomize the starting seed is to run the hash function twice in the following manner: The first time you run the function, start with a constant seed. The second time around, use the result of the first run as the seed.

Each different starting seed defines a different hash function. I tested 50 starting seeds and discovered that most of them performed well on the collision test, but poorly on the practical test. I knew that after the hash function determines the initial seed, the practical test uses the raw random-number generator to select bits. This fact eventually led me to suspect the "randomness" of the Park-Miller generator's output. Knuth describes a simple shuffling algorithm that improves the randomness of even poor generators. After I incorporated an eight-way shuffle into the Park-Miller generator, the false drop count improved significantly, although it was still 17 percent above the expected count. The eight-way shuffling algorithm is shown in Figure 2.

Figure 2: The eight-way shuffling algorithm

  Establish the initial seed.

  Build the shuffle table:
     fill an eight-entry table with the next eight Park-Miller numbers;
     save the ninth Park-Miller number in a variable called NextOut.

  To generate the next shuffled random number:
     select a table entry using NextOut, bits 17 through 19;
     replace NextOut with the contents of the table entry;
     replace the table entry with the next Park-Miller number;
     return the contents of NextOut as the random number.

Even shuffling will not randomize effectively enough where extreme accuracy is required. An optimal bit table for 26 or more bits per key will suffer from the unavoidable collision behavior of the initial hash function. In such a case, I would not use the random-number generator at all. Instead, I would repeat the hash function with a different starting seed for each bit. This method would require a fast processor to achieve good performance.

Table 1 shows the first 30 seeds and their collision and false-drop counts. Repeating the hash algorithm for each bit produced 43 false drops.

Table 1: Seed performance

          False-Drop               False-Drop
  Seed    Coll    Raw     Shuf     Seed     Coll    Raw      Shuf

-----------------------------------------------------------------

   1     36      64       61        16       41       51       62
   2     48      66       48        17       37       57       58
   3     53      55       49        18       56       72       59
   4     41      67       41        19       38       80       65
   5     40      59       49        20       47       80       72
   6     48      47       62        21       46       67       54
   7     47      62       54        22       39       71       44
   8     47      64       58        23       50       70       48
   9     36      63       64        24       33       57       64
  10     43      69       54        25       46       66       56
  11     43      59       49        26       35       46       58
  12     45      67       51        27       51       59       46
  13     39      68       48        28       40       67       55
  14     40      65       50        29       51       60       67
  15     48      68       47        30       53       54       58

                    Average:    44    63    55

There is no evidence of any real pattern here, and no reason to believe that any particular starting seed is better than the others for all possible key sets. I selected a starting seed of 26, which led to the recommended hash- and bit-selection procedure outlined in Figure 3 (a high-level Pascal rendering of the procedure follows the figure). This procedure produces bit tables that perform according to the design information given earlier.

Figure 3: Recommended hash and bit selection procedure

  Function Hash (seed, key)
    Set the initial value of the function result to zero.
    For each byte in the key:
           generate the next Park-Miller seed;
           add (key byte + 65,280) * seed to the function result.
  Limit the final result to the low-order 31 bits;
  if it's zero, increment it.

  Employ the function twice:

  Hash (Hash (26, key), key)

  The final hash result is the initial seed for generating bit numbers:

  Build the shuffle table.
  Repeat B times:
     generate the next shuffled random number;
     set (or test) the bit specified by the number modulo N.

  If you have a fast machine and wish to set more than 26 bits per key:

    Set the initial seed to 1.
    Repeat B times:
       Set (or test) the bit specified by Hash (Hash (seed, key), key);
       increment the seed.
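Listing One implements this procedure in inline assembler for speed. The fragment below is only a rough high-level sketch of it: it reuses NextRand and Seed from the earlier generator sketch and SetBit, BitIsOn, and TableBits from the earlier bit-table sketch, and for brevity it feeds the raw generator straight to the bit selector, whereas Listing One interposes the eight-way shuffle of Figure 2.

  Function KeyHash(InitSeed : LongInt; Var Key : String) : LongInt;
  { The Figure 3 hash: for each byte, advance the Park-Miller seed and
    accumulate (byte + $FF00) * seed.  Overflow is ignored, as in Listing
    One; the result is forced into the range 1..2^31 - 1. }
  Var
    i   : Integer;
    Sum : LongInt;
  Begin
    Seed := InitSeed;
    Sum := 0;
    For i := 1 To Length(Key) Do
      Sum := Sum + NextRand * (Ord(Key[i]) + $FF00);
    Sum := Sum And $7FFFFFFF;
    If Sum = 0 Then Sum := 1;
    KeyHash := Sum;
  End;

  Procedure InsertKey(Var Key : String; B : Byte);
  { Set B bits for Key: run the hash twice, then pick bits with the generator. }
  Var
    i : Byte;
  Begin
    Seed := KeyHash(KeyHash(26, Key), Key);
    For i := 1 To B Do
      SetBit(NextRand Mod TableBits);
  End;

  Function KeyMayExist(Var Key : String; B : Byte) : Boolean;
  { TRUE unless one of Key's bits is off; a TRUE result may be a false drop. }
  Var
    i  : Byte;
    Ok : Boolean;
  Begin
    Seed := KeyHash(KeyHash(26, Key), Key);
    Ok := True;
    i := 1;
    While Ok And (i <= B) Do Begin
      Ok := BitIsOn(NextRand Mod TableBits);
      Inc(i);
    End;
    KeyMayExist := Ok;
  End;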

Implementation

The implementation presented here is a Turbo Pascal unit (see Listing One, page 110) that supplies a Dictionary object; the methods are listed in Table 2, and the two non-object functions in Table 3. (Refer to the actual code for details about how to use these routines.) I've also developed a complete batch-oriented spelling checker to illustrate the use of the dictionary objects. Due to space constraints, the spelling checker isn't listed in this article, although the complete program (including documentation and sample terms) is available through DDJ.

Table 2: Methods for Dictionary object

  Method                                      Description
  -------------------------------------------------------------------------

  Init(MaxKeys, BitsPerKey)                   A constructor that calculates
                                              the size of the bit
                                              table.  Acquires and
                                              initializes storage from the
                                              heap.

  InsertString, InsertBlock, and InsertHash   Boolean functions which set
                                              bits in the bit table.  These
                                              functions return TRUE if all
                                              the bits are already on.

  StringInDictionary, BlockInDictionary, and  Boolean functions which
  HashInDictionary                            return TRUE if all bits
                                              associated with a key (or,
                                              for HashInDictionary, with
                                              an initial hash seed) are
                                              on in the bit table.

  SaveDictionary and RestoreDictionary        Preserve the attributes and
                                              contents of the bit table in
                                              a disk file.

  EstError and ActError                       Floating-point functions
                                              which return the false drop
                                              probability.  EstError
                                              estimates it by the BitsOn
                                              and FalseDrop equations;
                                              ActError counts the on bits
                                              and uses the FalseDrop
                                              equation to compute the
                                              exact probability.  In
                                              numerous tests, these
                                              functions have always
                                              returned very nearly the
                                              same result.

Table 3: Non-object functions

  DictionaryBytes(MaxKeys, BitsPerKey)  A LongInt function that uses the
                                        Optimal Table Size equation to
                                        compute the size in bytes of an
                                        optimal bit table for the
                                        specified key count and bits
                                        per key.

  DictHash(Data, Size)                  A LongInt function that provides
                                        access to the hash algorithm.

The save file produced by the SaveDictionary method is intended for direct, binary portability to another hardware platform, a Tandem NonStop minicomputer. The only concession made to the Tandem architecture is the byte order of binary integers. These are byte-reversed before saving to disk, and reversed again after restoration.

The random-number generator, the hash function, and the bit-set and test functions are implemented as inline-assembler macros. The bit table size is limited to 65,520 bytes as a concession to Turbo Pascal's maximum GetMem size, and to improve the performance of the bit-set and test functions. You can overcome this limitation by distributing keys to multiple dictionaries. For instance, my Tandem spelling checker divides the dictionary by the first letter of a word into seven bit tables: A-B, C-D, E-H, I-N, O-R, S-T, and U-Z. These are the same partitions used in the practical test. Observe one caution when using multiple bit tables: You must add together the false drop probabilities for each existence test performed on the same key. For instance, if you test a key in two dictionaries and each dictionary has a false drop probability of 1/4096, the combined false drop probability is 2/4096 or 1/2048. My spelling checker selects the appropriate dictionary for the unknown word by the word's first character, and tests only in that dictionary.
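To show how the unit is used, here is a minimal, hypothetical driver program built only from the interface in Listing One and Table 2 (the file names WORDS.TXT and WORDS.DCT are stand-ins):

  Program DictDemo;
  Uses Dict;
  Var
    D : Dictionary;
    f : Text;
    w : String;
  Begin
    D.Init(10000, 12);                  { up to 10,000 keys at 12 bits per key }
    Assign(f, 'WORDS.TXT');             { one word per line }
    Reset(f);
    While Not Eof(f) Do Begin
      ReadLn(f, w);
      If D.InsertString(w) Then         { TRUE: every bit was already on }
        WriteLn('Possible duplicate: ', w);
    End;
    Close(f);
    WriteLn('Estimated false drop probability: ', D.EstError:10:8);
    w := 'zebra';
    If D.StringInDictionary(w) Then
      WriteLn(w, ' is in the list (barring a false drop)')
    Else
      WriteLn(w, ' is definitely not in the list');
    D.SaveDictionary('WORDS.DCT');      { reload later with RestoreDictionary }
    D.Done;
  End.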

Performance

To get an idea of the speed of the dictionary methods, I performed a timing benchmark on an 8-MHz NEC-V20. (This machine runs about three times as fast as the original PC.) First, I built an optimal, 14-bit dictionary from a subset of the 109,584 word list. The subset consisted of the 17,639 words beginning with "S" and "T." The false drop probability, as determined by counting the on bits, was 1/16,450.

Second, I read the entire 109,584-word list and tested each word against the dictionary. Actually, I read the list twice: first I timed the read alone, and then I timed the read with the dictionary method calls. The difference in timings was an estimate of the processor time spent in the dictionary methods. In this benchmark, the methods inserted 250 keys per second and tested 435 keys per second. Existence testing was faster than inserting because almost all of the words, except those beginning with "S" and "T," were rejected. The existence test terminates as soon as it finds the first zero bit. Key length and number of bits per key affect performance. For instance, with the same files at 12 bits per key, the benchmark inserted 273 keys per second and tested 447 keys per second. At 14 bits per key, on a 20-MHz 80386, the benchmark inserted 1558 keys per second and tested 2611 keys per second.

The number of iterations of the Park-Miller algorithm determines the performance of the hash and bit-selection algorithms. The hash algorithm performs 2L iterations, where L is the length of the key. The total number of iterations for the hash and bit-selection algorithms is 2L+B+9, where B is the number of bits per key. If we repeat the hash function for each bit, we perform 2BL iterations. When the hash function is repeated for each bit, the 80386 benchmark can insert only 361 keys per second and can test 1712 keys per second.

Conclusion

Superimposed coding is an elegant way to test for the existence of a key. This approach has applications in many areas of information retrieval. It is a fast method, especially with the shortcuts afforded by the Park-Miller random-number generator. The equations available allow us to engineer an optimal existential dictionary with almost any desired capacity and accuracy.

Further Reading

    1. Bentley, Jon. Programming Pearls. Reading, Mass.: Addison-Wesley, 1986.

    2. Berra, P. Bruce; S.M. Chung, and N.I. Hachem. "Computer Architecture for a Surrogate File to a Very Large Data/Knowledge Base." IEEE Computer (March 1987).

    3. Carta, David G. "Two Fast Implementations of the 'Minimal Standard' Random Number Generator." Communications of the ACM (January 1990).

    4. Knuth, Donald E. The Art of Computer Programming. vol. 2, pp. 32-33; vol. 3, pp. 559-567. Reading, Mass.: Addison-Wesley, 1973.

    5. Park, Stephen K., K.W. Miller. "Random Number Generators: Good Ones are Hard to Find." Communications of the ACM (October 1988).

    6. Peterson, James L. "A Note on Undetected Typing Errors." Communications of the ACM (July 1986).

    7. Salton, Gerard. Automatic Text Processing. Reading, Mass.: Addison-Wesley, 1989.

Derivation of the BitsOn Equation

Consider an array of N bits, all initially off, and an iterative process that selects bits at random and turns them on. What is the probability of a "collision," that is, the probability that the random process will select a bit that is already on, at the ith iteration? On the first iteration (i = 0), the probability is zero because all bits are off. On the second iteration (i = 1), the probability is 1/N, because exactly one bit is on. Let Ji be the number of bits on at the ith iteration. Then, the probability of a collision (call it Pi) is as shown in Figure 4(a). The probability of a non-collision (call it Qi) is shown in Figure 4(b).

Figure 4: BitsOn equations (a): Collision probability (b): Non-collision probability (c): Recurrence formula (d): Explicit equation (e): Logarithm (f): Rearranged (g): Limit (h): Approximation (i): BitsOn (j): Optimal table (k): Solved for N.

  (a) Pi = Ji / N

  (b) Qi = 1 - Pi = 1 - Ji/N = (N - Ji) / N

  (c) Qi = Q(i-1) - Q(i-1) * (1/N) = Q(i-1) * (N - 1)/N

  (d) Qi = ((N - 1)/N)^i

  (e) Loge(Qi) = Loge((N - Ji)/N) = i * Loge((N - 1)/N)

  (f) i/N = Loge((N - Ji)/N) / (N * Loge((N - 1)/N))

  (g) N * Loge((N - 1)/N) approaches -1 as N grows large

  (h) i/N = -Loge((N - Ji)/N)

  (i) Ji = N - N * e^(-i/N)

  (j) Ji/N = (N - N * e^(-i/N))/N = 1 - e^(-i/N) = 1/2,
      so e^(-i/N) = 1 - 1/2 = 0.5

  (k) N = i / (-Loge(0.5)) = BK / (-Loge(0.5))

We can see that each time the process succeeds in turning a bit on that was previously off, the probability of a non-collision is reduced by 1/N. But the probability of selecting an off bit, and thus succeeding in turning it on, is the current probability of a non-collision. Combining these facts, we can write a recurrence formula for the value of Q at any iteration, based on the previous iteration, see Figure 4(c).

Knowing that Q0 = 1, we can convert the recurrence formula to an explicit equation for Qi (refer to Figure 4(d)). By substituting and taking the logarithm of both sides, we arrive at the equation shown in Figure 4(e). We can rearrange this equation into the equation shown in Figure 4(f). As N becomes larger, the expression shown in Figure 4(g) rapidly becomes so close to -1 that we can neglect the difference. (Hint: it's a "sick" limit; N Loge((N - 1)/N) = Loge(((N - 1)/N)^N), and ((N - 1)/N)^N approaches 1/e as N grows.) So, for practical purposes, the equation becomes that shown in Figure 4(h). Solving for Ji, we have the equation in Figure 4(i), which is the expression for the number of bits that were originally off, but are now on, in a table of N bits, after i iterations of the random process. In our application, the number of iterations is simply the product of the number of keys and the number of bits per key. So, i = BK, and we have the BitsOn equation.

The false drop probability is Pi^B. By differentiation, we can show that Ji/N = 1/2 minimizes the false drop probability. (The proof is left as an exercise to the reader. Hint: differentiate (BitsOn/N)^B with respect to B.) So, in an optimal table, the relationship shown in Figure 4(j) holds true. Solving for N, we arrive at the optimal table-size equation shown in Figure 4(k). This is the expression for the size in bits of an optimal bit table containing K keys with B bits per key.
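For readers who want to work the exercise, here is a sketch of one way to do it (my notation, not the author's), writing x = BK/N and treating K and N as fixed, so that BitsOn/N = 1 - e^(-x):

  \[
  F(B) = \left(1 - e^{-x}\right)^{B}, \qquad x = \frac{BK}{N}
  \]
  \[
  \frac{d}{dB}\,\ln F(B) = \ln\!\left(1 - e^{-x}\right) + \frac{x\,e^{-x}}{1 - e^{-x}}
  \]

Setting e^(-x) = 1/2 (that is, BitsOn/N = 1/2, so x = Loge(2)) makes the derivative equal to -Loge(2) + Loge(2) = 0; it is negative for smaller B and positive for larger B, so the false drop probability is smallest exactly when half the bits are on, as stated above.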

The value of - Ln(0.5) showed up early in the process of developing these routines. I experimented with various sizes of bit tables by counting iterations while turning on random bits, until half of the bits were on. The ratio of iterations to table size in bits was always about 0.693. This derivation was motivated by a desire to explain this mysterious number.

-- E.F.

_AN EXISTENTIAL DICTIONARY_ by Edwin T. Floyd

[LISTING ONE]



{$A+,B-,D-,E-,F+,I+,L-,N-,O-,R-,S-,V+}
Unit Dict;
Interface
{ DICT.PAS dictionary object and methods to create and use a superimposed
  code dictionary.  Copyright Edwin T. Floyd, 1990. }

Type
  Dictionary = Object
    DictArray : Pointer;  { Pointer to dictionary bit array }
    DictCount : LongInt;  { Number of key entries in this dictionary }
    DictSize : Word;      { Number of bytes in dictionary bit array }
    DictBits : Byte;      { Number of bits per key entry }

    Constructor Init(MaxKeys : Word; BitsPerKey : Byte);
    { Initialize dictionary, specify maximum keys and bits per key. }

    Constructor RestoreDictionary(FileName : String);
    { Restore dictionary saved on disk by SaveDictionary }

    { Note: Use either Init or RestoreDictionary, not both. }

    Destructor Done;
    { Release storage allocated to dictionary. }

    Function DictionarySize : Word;
    { Returns number of bytes that will be written by SaveDictionary. }

    Procedure SaveDictionary(FileName : String);
    { Save dictionary in a disk file. }

    Function InsertString(Var s : String) : Boolean;
    { Insert string in dictionary; returns TRUE if string is already there. }

    Function StringInDictionary(Var s : String) : Boolean;
    { Returns TRUE if string is in dictionary. }

    Function InsertBlock(Var Data; Len : Word) : Boolean;
    { Insert block in dictionary; returns TRUE if block is already there. }

    Function BlockInDictionary(Var Data; Len : Word) : Boolean;
    { Returns TRUE if block is in dictionary. }

    Function InsertHash(Hash : LongInt) : Boolean;
    { Insert hash in dictionary; returns TRUE if hash is already there. }

    Function HashInDictionary(Hash : LongInt) : Boolean;
    { Returns TRUE if hash is in dictionary. }

    Function EstError : Real;
    { Returns estimated probability of error. }

    Function ActError : Real;
    { Returns actual probability of error (slow, counts bits). }

  End;

Function DictionaryBytes(MaxKeys : LongInt; BitsPerKey : Byte) : LongInt;
{ Returns the size in bytes of the optimal dictionary bit table for the
  indicated key and bit-per-key counts. }

Function DictHash(Var Data; Len : Word) : LongInt;
{ Hash data block to a positive long integer. }

Implementation

Const
  MagicNumber = $E501205F; { Used to validate dictionary save file }
  RandMult = 16807;        { =7**5; RandMult must be expressible in 16 bits.
                           48271 may give better "randomness" (see ACM ref.) }
  ShuffleBits = 3;
  ShuffleShift = 16 - ShuffleBits;
  ShufTableEnd = $FFFF Shr ShuffleShift;
  HashSeed : Word = 26;    { Initial hash seed }
  RandSeed : LongInt = 1;  { Random number seed: 0 < RandSeed < 2**31-1 }

Type
  SaveFileHeader = Record
  { Header for dictionary save file (all numbers are byte-reversed) }
    Magic : LongInt;       { Magic number for validity test }
    BitsCount : LongInt;   { Bits-per-key and entry count }
    Size : Word;           { Size of dictionary bit map in bytes }
  End;

Var
  ShufTable : Array[0..ShufTableEnd] Of LongInt;
  NextOut : Word;

Function IRand : LongInt;
{ Return next "minimal standard", 31 bit pseudo-random integer.  This function
  actually computes (RandSeed * RandMult) Mod (2**31-1) where RandMult is
  a 16 bit quantity and RandSeed is 32 bits (See Carta, CACM 1/90). }
Inline(
  $A1/>RandSeed+2/       {         mov     ax,[>RandSeed+2]}
  $BF/>RandMult/         {         mov     di,>RandMult}
  $F7/$E7/               {         mul     di}
  $89/$C3/               {         mov     bx,ax}
  $89/$D1/               {         mov     cx,dx}
  $A1/>RandSeed/         {         mov     ax,[>RandSeed]}
  $F7/$E7/               {         mul     di}
  $01/$DA/               {         add     dx,bx}
  $83/$D1/$00/           {         adc     cx,0 ; cx:dx:ax = Seed * Mult }
  $D0/$E6/               {         shl     dh,1 ; split p & q at 31 bits }
  $D1/$D1/               {         rcl     cx,1}
  $D0/$EE/               {         shr     dh,1 ; cx = p, dx:ax = q }
  $01/$C8/               {         add     ax,cx}
  $83/$D2/$00/           {         adc     dx,0 ; dx:ax = p + q }
  $71/$09/               {         jno     done}
  $05/$01/$00/           {         add     ax,1 ; overflow, inc(p + q) }
  $83/$D2/$00/           {         adc     dx,0}
  $80/$E6/$7F/           {         and     dh,$7F ; limit to 31 bits }
                         {done:}
  $A3/>RandSeed/         {         mov     [>RandSeed],ax}
  $89/$16/>RandSeed+2);  {         mov     [>RandSeed+2],dx}

Function Hash(Seed : LongInt; Var Data; Len : Word) : LongInt;
{ Hash a block of data into a random long integer.  This is actually
  equivalent to the following:

     RandSeed := Seed;
     Hash := 0;
     For i := 1 To Len Do Hash := Hash + IRand * (Data[i] + $FF00);
     Hash := Hash AND $7FFFFFFF;
     If Hash = 0 Then Inc(Hash);

  Overflow is ignored.  The seed is kept in registers; RandSeed is not
  affected by this routine. }
Inline(
  $59/                   {      pop     cx     ; cx := len}
  $5E/                   {      pop     si     ; bx:si := @data}
  $5B/                   {      pop     bx}
  $58/                   {      pop     ax     ; dx:ax := seed}
  $5A/                   {      pop     dx}
  $E3/$59/               {      jcxz    alldone}
  $FC/                   {      cld}
  $1E/                   {      push    ds}
  $8E/$DB/               {      mov     ds,bx}
  $55/                   {      push    bp}
  $31/$DB/               {      xor     bx,bx}
  $53/                   {      push    bx     ; zero accumulator}
  $53/                   {      push    bx}
  $89/$E5/               {      mov     bp,sp}
                         {next:                ; for each byte of data...}
  $51/                   {      push    cx}
  $BF/>RandMult/         {      mov     di,>RandMult}
  $89/$C3/               {      mov     bx,ax}
  $89/$D0/               {      mov     ax,dx  ;   compute next seed}
  $F7/$E7/               {      mul     di}
  $93/                   {      xchg    ax,bx}
  $89/$D1/               {      mov     cx,dx}
  $F7/$E7/               {      mul     di}
  $01/$DA/               {      add     dx,bx}
  $83/$D1/$00/           {      adc     cx,0   ; cx:dx:ax = Seed * Mult}
  $D0/$E6/               {      shl     dh,1   ; split p & q at 31 bits}
  $D1/$D1/               {      rcl     cx,1}
  $D0/$EE/               {      shr     dh,1   ; cx = p, dx:ax = q}
  $01/$C8/               {      add     ax,cx}
  $83/$D2/$00/           {      adc     dx,0   ; dx:ax = p + q}
  $71/$09/               {      jno     noovfl}
  $05/$01/$00/           {      add     ax,1   ; overflow, inc(p + q)}
  $83/$D2/$00/           {      adc     dx,0}
  $80/$E6/$7F/           {      and     dh,$7F ; limit to 31 bits}
                         {noovfl:}
  $89/$C3/               {      mov     bx,ax  ; save seed}
  $89/$D1/               {      mov     cx,dx}
  $AC/                   {      lodsb          ; get next byte + $FF00}
  $B4/$FF/               {      mov     ah,$FF}
  $89/$C7/               {      mov     di,ax}
  $F7/$E1/               {      mul     cx     ; multiply by seed}
  $97/                   {      xchg    ax,di}
  $F7/$E3/               {      mul     bx}
  $01/$FA/               {      add     dx,di}
  $01/$46/$00/           {      add     [bp+0],ax ; accumulate}
  $11/$56/$02/           {      adc     [bp+2],dx}
  $89/$D8/               {      mov     ax,bx}
  $89/$CA/               {      mov     dx,cx}
  $59/                   {      pop     cx}
  $E2/$B9/               {      loop    next   ;  until out of data}
                         {;}
  $58/                   {      pop     ax}
  $5A/                   {      pop     dx}
  $5D/                   {      pop     bp}
  $1F/                   {      pop     ds}
  $80/$E6/$7F/           {      and     dh,$7F}
                         {alldone:}
  $89/$C3/               {      mov     bx,ax}
  $09/$D3/               {      or      bx,dx}
  $75/$01/               {      jnz     exit}
  $40);                  {      inc     ax}
                         {exit:}

Procedure Shuffle;
{ Load the shuffle table }
Begin
  For NextOut := 0 To ShufTableEnd Do ShufTable[NextOut] := IRand;
  NextOut := Word(IRand) Shr ShuffleShift;
End;

Function SIRand : LongInt;
{ Return the next shuffled random number }
Var
  y : LongInt;
Begin
  y := ShufTable[NextOut];
  ShufTable[NextOut] := IRand;
  NextOut := Word(y) Shr ShuffleShift;
  SIRand := y;
End;

Function TestBit(Var BitArray; Size : Word; BitNo : LongInt) : Boolean;
{ Returns TRUE if indicated bit number, modulo size of bit array, is set.
  Size is in bytes. }
Inline(
                         {; dx:ax := BitNo}
  $58/                   {      pop     ax}
  $5A/                   {      pop     dx}
                         {; bl := bit mask}
  $88/$C1/               {      mov     cl,al}
  $80/$E1/$07/           {      and     cl,$07}
  $B3/$80/               {      mov     bl,$80}
  $D2/$EB/               {      shr     bl,cl}
                         {; dx:ax := byte offset}
  $D1/$EA/               {      shr     dx,1}
  $D1/$D8/               {      rcr     ax,1}
  $D1/$EA/               {      shr     dx,1}
  $D1/$D8/               {      rcr     ax,1}
  $D1/$EA/               {      shr     dx,1}
  $D1/$D8/               {      rcr     ax,1}
                         {; dx := byte offset}
  $5F/                   {      pop     di}
  $39/$D7/               {      cmp     di,dx}
  $77/$0E/               {      ja      quickdiv}
                         {; protect against overflow}
  $89/$F9/               {      mov     cx,di}
                         {protloop:}
  $D1/$E1/               {      shl     cx,1}
  $39/$D1/               {      cmp     cx,dx}
  $76/$FA/               {      jbe     protloop}
  $F7/$F1/               {      div     cx}
  $89/$D0/               {      mov     ax,dx}
  $31/$D2/               {      xor     dx,dx}
                         {quickdiv:}
  $F7/$F7/               {      div     di}
                         {; es:di := seg:ofs of byte}
  $5F/                   {      pop     di}
  $01/$D7/               {      add     di,dx}
  $07/                   {      pop     es}
                         {; test bit}
  $30/$C0/               {      xor     al,al}
  $26/$22/$1D/           {      es:and  bl,[di]}
  $74/$02/               {      jz      notset}
  $FE/$C0);              {      inc     al}
                         {notset:}

Function SetBit(Var BitArray; Size : Word; BitNo : LongInt) : Boolean;
{ Sets the indicated bit number modulo size of bit array.  Returns TRUE if
  bit was already set.  Size is in bytes. }
Inline(
                         {; dx:ax := BitNo}
  $58/                   {      pop     ax}
  $5A/                   {      pop     dx}
                         {; bl := bit mask}
  $88/$C1/               {      mov     cl,al}
  $80/$E1/$07/           {      and     cl,$07}
  $B3/$80/               {      mov     bl,$80}
  $D2/$EB/               {      shr     bl,cl}
                         {; dx:ax := byte offset}
  $D1/$EA/               {      shr     dx,1}
  $D1/$D8/               {      rcr     ax,1}
  $D1/$EA/               {      shr     dx,1}
  $D1/$D8/               {      rcr     ax,1}
  $D1/$EA/               {      shr     dx,1}
  $D1/$D8/               {      rcr     ax,1}
                         {; dx := byte offset mod size }
  $5F/                   {      pop     di}
  $39/$D7/               {      cmp     di,dx}
  $77/$0E/               {      ja      quickdiv}
                         {; protect against overflow}
  $89/$F9/               {      mov     cx,di}
                         {protloop:}
  $D1/$E1/               {      shl     cx,1}
  $39/$D1/               {      cmp     cx,dx}
  $76/$FA/               {      jbe     protloop}
  $F7/$F1/               {      div     cx}
  $89/$D0/               {      mov     ax,dx}
  $31/$D2/               {      xor     dx,dx}
                         {quickdiv:}
  $F7/$F7/               {      div     di}
                         {; es:di := seg:ofs of byte}
  $5F/                   {      pop     di}
  $01/$D7/               {      add     di,dx}
  $07/                   {      pop     es}
                         {; test bit}
  $30/$C0/               {      xor     al,al}
  $88/$DC/               {      mov     ah,bl}
  $26/$22/$25/           {      es:and  ah,[di]}
  $74/$04/               {      jz      notset}
  $FE/$C0/               {      inc     al}
  $EB/$03/               {      jmp     short set}
                         {notset:}
  $26/$08/$1D);          {      es:or   [di],bl}
                         {set:}

Function LongSwap(n : LongInt) : LongInt;
{ Reverse bytes in a LongInt. }
Inline(
  $5A/                   {      pop    dx}
  $58/                   {      pop    ax}
  $86/$C4/               {      xchg   ah,al}
  $86/$D6);              {      xchg   dh,dl}

Function DictionaryBytes(MaxKeys : LongInt; BitsPerKey : Byte) : LongInt;
Begin
  DictionaryBytes := Round(MaxKeys * BitsPerKey / (-Ln(0.5) * 8));
End;

Function DictHash(Var Data; Len : Word) : LongInt;
Begin
  DictHash := Hash(Hash(HashSeed, Data, Len), Data, Len);
End;

Constructor Dictionary.Init(MaxKeys : Word; BitsPerKey : Byte);
Var
  DictBytes : LongInt;
Begin
  DictBytes := DictionaryBytes(MaxKeys, BitsPerKey);
  If DictBytes > $FFF0 Then Begin
    WriteLn(DictBytes, ' bytes optimal for dictionary, but ', $FFF0,
      ' is maximum size dictionary.  Using max size.');
    DictBytes := $FFF0;
  End Else If DictBytes > MaxAvail Then Begin
    WriteLn(DictBytes, ' bytes optimal for dictionary, but only ', MaxAvail,
      ' bytes are available.  Using ', MaxAvail);
    DictBytes := MaxAvail;
  End Else If DictBytes < 16 Then DictBytes := 16;
  DictSize := DictBytes;
  GetMem(DictArray, DictSize);
  FillChar(DictArray^, DictSize, 0);
  DictCount := 0;
  DictBits := BitsPerKey;
End;

Constructor Dictionary.RestoreDictionary(FileName : String);
Var
  Header : SaveFileHeader;
  DictBytes : LongInt;
  f : File;
  OldMode : Byte;
Begin
  OldMode := FileMode;
  FileMode := $40;
  Assign(f, FileName);
  Reset(f, 1);
  BlockRead(f, Header, SizeOf(Header));
  With Header Do Begin
    Magic := LongSwap(Magic);
    Size := Swap(Size);
    DictBytes := FileSize(f) - SizeOf(Header);
    If (Magic <> MagicNumber) Or (Size <> DictBytes) Or (Size < 16)
    Or (Size > $FFF0) Then Begin
      WriteLn('File ', FileName, ' is not a dictionary save file.');
      Halt(1);
    End;
    DictSize := Size;
    DictBits := BitsCount And $FF;
    DictCount := LongSwap(BitsCount And $FFFFFF00);
    GetMem(DictArray, DictSize);
    BlockRead(f, DictArray^, DictSize);
    Close(f);
    FileMode := OldMode;
  End;
End;

Destructor Dictionary.Done;
Begin
  FreeMem(DictArray, DictSize);
  DictArray := Nil;
  DictSize := 0;
  DictBits := 0;
  DictCount := 0;
End;

Function Dictionary.DictionarySize : Word;
Begin
  DictionarySize := DictSize + SizeOf(SaveFileHeader);
End;

Function Dictionary.InsertString(Var s : String) : Boolean;
Begin
  InsertString := InsertBlock(s[1], Length(s));
End;

Function Dictionary.StringInDictionary(Var s : String) : Boolean;
Begin
  StringInDictionary := BlockInDictionary(s[1], Length(s));
End;

Function Dictionary.InsertBlock(Var Data; Len : Word) : Boolean;
Begin
  InsertBlock := InsertHash(DictHash(Data, Len));
End;

Function Dictionary.BlockInDictionary(Var Data; Len : Word) : Boolean;
Begin
  BlockInDictionary := HashInDictionary(DictHash(Data, Len));
End;

Function Dictionary.InsertHash(Hash : LongInt) : Boolean;
Var
  i : Byte;
  InDict : Boolean;
Begin
  InDict := True;
  RandSeed := Hash;
  Shuffle;
  For i := 1 To DictBits Do
    If Not SetBit(DictArray^, DictSize, SIRand) Then InDict := False;
  If Not InDict Then Inc(DictCount);
  InsertHash := InDict;
End;

Function Dictionary.HashInDictionary(Hash : LongInt) : Boolean;
Var
  i : Byte;
  InDict : Boolean;
Begin
  InDict := True;
  RandSeed := Hash;
  Shuffle;
  i := 0;
  While (i < DictBits) And InDict Do Begin
    If Not TestBit(DictArray^, DictSize, SIRand) Then InDict := False;
    Inc(i);
  End;
  HashInDictionary := InDict;
End;

Procedure Dictionary.SaveDictionary(FileName : String);
Var
  Header : SaveFileHeader;
  f : File;
Begin
  Assign(f, FileName);
  ReWrite(f, 1);
  With Header Do Begin
    Magic := LongSwap(MagicNumber);
    Size := Swap(DictSize);
    BitsCount := LongSwap(DictCount) + DictBits;
  End;
  BlockWrite(f, Header, SizeOf(Header));
  BlockWrite(f, DictArray^, DictSize);
  Close(f);
End;

Function Dictionary.EstError : Real;
Begin
  EstError := Exp(Ln(1.0-Exp(-(DictCount*DictBits)/(DictSize*8.0)))*DictBits);
End;

Function Dictionary.ActError : Real;
Var
  AllBits, BitsOn, i : LongInt;
Begin
  AllBits := LongInt(DictSize) * 8;
  BitsOn := 0;
  For i := 0 To Pred(AllBits) Do
    If TestBit(DictArray^, DictSize, i) Then Inc(BitsOn);
  ActError := Exp(Ln(BitsOn / AllBits) * DictBits);
End;
