## Superimposed coding packs a lot of information into a small space

This article contains the following executables: FLOYD.ZIP

*Edwin is manager of a data center for the Hughston Foundation, a non-profit orthopaedic and sports medicine facility in Columbus, Georgia. He is an occasional contributor to DDJ. Edwin can be reached through CompuServe at 76067,747.*

Have you ever needed a search routine that could determine with fair accuracy whether or not a particular piece of information exists without the overhead of a conventional search? If so, the "superimposed coding" technique I present in this article may be the tool you've been waiting for. I've implemented the system as a Turbo Pascal object that supports superimposed code dictionaries, which I call "existential dictionaries" because they record the fact that a key exists without storing the key itself. Existential dictionaries can be used for a variety of applications, including spell checking, document retrieval, and database applications.

The technique of superimposed coding packs a lot of information into a small space. An existential dictionary for a 10,000-word spelling checker, for instance, can occupy as little as 23K. In the case of database applications, an existential dictionary can save time. For example, consider a 10,000-record database indexed on one field. A superimposed code dictionary for 10,000 keys can be as small as 18K -- small enough to hold in memory. Checking the dictionary for a key before searching the index eliminates almost all searches for keys that are not present, cutting such unnecessary index searches by a factor of roughly one thousand.

Small existential dictionaries can be used as "surrogate keys" to help locate documents indexed by author and subject keyword terms. The size of a surrogate key file is typically 10 to 20 percent of the size of a conventional index, and the surrogate key file can be substantially quicker when searching for certain types of queries. Surrogate key files are also used in place of conventional indices in some very large databases.

### How Superimposed Coding Works

Consider a hypothetical tiny spelling checker with a word list of 400 English words. To construct an existential dictionary for it, start with an array of 8000 bits, all initially off. For each word in the list, generate a hash number between 0 and 7999, and turn the corresponding bit in the array on. If the hash function is sufficiently random (a non-trivial problem), it will distribute the on bits uniformly in the table with very few collisions. (A "collision" happens when two words hash to the same bit position.) The code bits for all words are superimposed in the same table, without regard to collisions.

To test an unknown word for membership in the list, first generate the word's hash number. If the corresponding bit is off, we are certain that the word is not in the list. If the bit is on, the word is either in the list, or its hash number simply happens to match the hash number of a word in the list. This is called a "false drop." (A false drop happens when the existence test succeeds incorrectly due to a collision.) For this dictionary, the probability of a false drop is about 400/8000, or 1/20. We want to reduce this probability as much as is practical.
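For illustration only, the one-bit-per-word scheme can be sketched in a few lines of Python. (The article's implementation is a Turbo Pascal object, and its hash function is far more careful than the stand-in used here, which simply seeds Python's `random` module with the word.)

```python
import random

TABLE_BITS = 8000  # size of the bit table from the example

def hash_bit(word, table_bits=TABLE_BITS):
    # Stand-in hash for illustration: seed a PRNG with the word and
    # draw one bit position. The article builds a much better hash.
    return random.Random(word).randrange(table_bits)

table = [False] * TABLE_BITS
words = ["apple", "banana", "cherry"]
for w in words:
    table[hash_bit(w)] = True  # superimpose the code bit

def maybe_in_list(word):
    # False means "certainly absent"; True means "present, or a false drop"
    return table[hash_bit(word)]
```

Every word that was inserted tests positive; an absent word tests positive only on a collision, with probability (number of words)/8000.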

One way to reduce the false drop probability is to increase the size of the bit table. For instance, we can double the size of the bit table to 16,000 and change the range of the hash function correspondingly. The new probability will then be about 1/40. Jon Bentley, in his priceless book Programming Pearls, describes a spelling-checker program written by Doug McIlroy. Doug's program uses a 2^{27} (134 million) bit table to store existence bits for about 30,000 words. His false drop probability is a little less than 1/4000. By an ingenious representation trick, Doug manages to compress this 16-Mbyte table to 40K, which, plus an index, makes for a total dictionary size of less than 64K.

Another way to reduce the false drop probability is to represent each word by more than one bit. In our example, we can use two hash functions on each word to turn about 800 bits on out of a total of 8000. In this case, the probability for a single-bit collision increases to about 1/10. A false drop now requires two collisions with a combined probability of 1/10 * 1/10 or about 1/100. Thus, using two bits per word improves the accuracy of this dictionary fivefold. Using three bits per word will improve the accuracy even further.

Our dictionary is not limited to English words. In general, any sequence of bytes, or "key," can be represented as a set of bits in a bit table. The number of bits used to represent a key is independent of the length of the key. As more bits are used to represent each key and the bit table fills up, two effects become noticeable.

First, more collisions occur. The number of on bits in the table is significantly less than the number of keys multiplied by the number of bits per key. In fact, if our hash functions randomize well and the bit table is not too small, the number of on bits will be very close to the value predicted by the BitsOn equation in Figure 1(a). In this equation, N represents the size of the table in bits, e is the natural logarithm base, K represents the number of keys, and B represents the number of bits per key. (See the accompanying textbox entitled "Derivation of the BitsOn Equation.") If we use ten bits per word in our hypothetical dictionary, then N = 8000, K = 400, and B = 10. According to this formula, about 3150 bits are on.

#### Figure 1: Dictionary equations (a): BitsOn equation (b): FalseDrop equation (c): Optimal table-size equation (d): Expected collisions

(a) BitsOn = N - N e^{-BK/N}

(b) FalseDrop = (BitsOn / N)^{B}

(c) N = KB / (-Log_e(0.5)) ~ KB / 0.693147...

(d) Expected collisions = K - N + N e^{-K/N}

In general, the false drop probability is the product of all the single-bit collision probabilities, each of which is BitsOn/N (see Figure 1(b)). For our example, this equation yields a false drop probability of (3150/8000)^{10}, or about 1/11,000.

The second effect of using more bits to represent each key is diminishing returns: each additional bit improves the accuracy of the dictionary by a smaller amount, until half of the bits in the table are on. Beyond that point, the accuracy begins to deteriorate. At the optimal point, when half of the bits are on, the false drop probability is 1/2^{B}. Given the number of keys and the number of bits per key, we can rearrange the equations described earlier into a simple formula for the optimal table size, as shown in Figure 1(c).

To use Doug McIlroy's problem as an example, we have 30,000 keys and we want the false drop probability to equal 1/4000 or less. Representing each key by 12 bits in an optimal bit array will yield a false drop probability of 1/2^{12}, or 1/4096. In this case, the array size will be 30,000*12/0.693147 = 519,370 bits, or 64,921 bytes, which is somewhat larger than Doug's dictionary. The procedure for building our bit table is considerably faster and simpler than Doug's procedure, which requires a sort. Also, by making our bit table a little larger than the optimal size, we can add keys dynamically. For instance, a 65,520-byte table at 12 bits per key becomes optimal at about 30,277 keys. At 31,000 keys, the false drop probability is still only 1/3369.
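The figures quoted above are easy to check numerically. The following Python sketch (mine, not part of the article's code) evaluates the Figure 1 equations for both examples:

```python
import math

def bits_on(n, k, b):
    # Figure 1(a): expected number of on bits
    return n - n * math.exp(-b * k / n)

def false_drop(n, k, b):
    # Figure 1(b): false drop probability
    return (bits_on(n, k, b) / n) ** b

def optimal_bits(k, b):
    # Figure 1(c): optimal table size in bits
    return k * b / -math.log(0.5)

# 400-word dictionary at 10 bits per key: about 3150 bits on, ~1/11,000 false drops
on = bits_on(8000, 400, 10)
p = false_drop(8000, 400, 10)

# 30,000 keys at 12 bits per key: optimal table of about 64,921 bytes
size_bytes = optimal_bits(30000, 12) / 8

# A 65,520-byte table holding 31,000 keys at 12 bits per key: still about 1/3400
p_overfull = false_drop(65520 * 8, 31000, 12)
```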

### The Hash Function

With these formulae, we can engineer a bit table for any number of keys and with any desired degree of accuracy. The calculated performance is guaranteed, as long as the hash function is sufficiently random. I evaluated a number of hash functions found in the literature for speed and randomization and eventually created one based upon a random-number generator described by Stephen Park and Keith Miller in a 1988 Communications of the ACM article entitled "Random Number Generators: Good Ones are Hard to Find." The "Minimal Standard" generator described in their article is a simple multiplicative congruence algorithm that generates a sequence of seeds:

Next seed = (seed*16807) modulo 2,147,483,647

The Park-Miller generator is a 31-bit, full-period generator. Given any 31-bit starting seed (except zero), it will not return to that seed or repeat a number for 2,147,483,647 cycles. The sequence passes most tests of randomness. It is not, of course, truly random. A given starting seed always generates the same sequence of "random" numbers. Because of this, we call it a "pseudo-random" number generator.
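In a high-level language the generator is a one-liner. Here is a Python sketch, together with the self-check published in Park and Miller's article: starting from seed 1, the 10,000th number in the sequence is 1,043,618,065.

```python
M = 2_147_483_647  # 2**31 - 1, a Mersenne prime
A = 16807          # 7**5

def park_miller(seed):
    # "Minimal Standard" multiplicative congruential generator:
    # next seed = (seed * 16807) mod (2**31 - 1)
    while True:
        seed = (seed * A) % M
        yield seed

# Park and Miller's published check value
gen = park_miller(1)
for _ in range(10_000):
    x = next(gen)
```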

In the January, 1990 Communications of the ACM, David Carta described a fast machine-language implementation of the Park-Miller generator. My version of the Carta implementation for the 8086 instruction set requires only two 16-bit multiply operations, plus some shifts and adds.

A good pseudo-random number generator simplifies the task of creating hash functions. The generator eliminates the need for a separate hash function for each dictionary bit. Instead, we need only a single hash function to produce a suitable starting seed. The random number generator will then beget an endless supply of unique, random, 31-bit integers. Each integer, modulo N, is a bit position to turn on or test.

We still need to construct a suitable hash function to generate the starting seed. I found this task unexpectedly frustrating. The entire responsibility for avoiding collisions rests with the seed-generator function. If this function generates the same seed number for two different keys, the bits selected will also be the same for those two keys. Therefore, the seed function must generate as few 31-bit seed collisions as possible among the keys of interest and should also be quick. What is a reasonable collision performance to expect from a hash function? I set out to devise a suitable test.

First, I needed a large number of typical keys. I started with a list of 109,584 unique English words obtained from a public domain source. The words averaged 8.5 characters each. I planned to use a hash function to generate a 31-bit number for each word. If the hash function randomized well, it would produce about the same number of collisions as a random selection of 109,584 items from a population of 2^{31}, with replacement. I computed the number of collisions to expect in a random selection of K items from a population of N (see Figure 1(d)). According to this equation, for K = 109,584 and N = 2,147,483,647, I could expect about three collisions, which is not enough for a meaningful test.

Accordingly, I expanded the number of keys. I generated three keys from each English word: All uppercase, all lowercase, and first letter capitalized with the rest of the word lowercase. (I excluded the single-letter words "A" and "I.") I now had 328,751 keys. I also included the 10-digit ASCII numbers from zero to 109,583, which brought the count to K = 438,335 unique keys and 45 expected collisions. To evaluate each hash function, I used it on the keys to generate 438,335 31-bit numbers. Next, I sorted the numbers and counted the duplicates. Throughout the remainder of this article, I'll refer to this procedure as the "collision test."
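Both expected-collision counts follow directly from the Figure 1(d) equation; a small Python computation (mine, not the article's) confirms them:

```python
import math

def expected_collisions(k, n):
    # Figure 1(d): expected collisions when K items are drawn,
    # with replacement, from a population of N
    return k - n + n * math.exp(-k / n)

N = 2**31 - 1
few = expected_collisions(109_584, N)   # about 3 -- too few for a meaningful test
many = expected_collisions(438_335, N)  # about 45
```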

In addition, I devised a realistic test of the hash function's false-drop performance. I partitioned the word list into seven disjoint subsets of 8000 to 18,000 words each. To evaluate each hash function, I used it to create an optimal bit table for each subset, using 14 bits per word. I shifted each word to lowercase before hashing. Then, for each subset, I counted the number of false drops from the entire word list. I left all the test words in uppercase, so that none of them matched any word in a subset. The false drop probability for each bit table is about 1/16,384. So, a good hash function should produce about 7*109,584/16,384, or about 47 false drops for all subsets combined. In the following text, I'll refer to this procedure as the "practical test."

The behavior of hash functions on this scale is somewhat counterintuitive. A widely used, 32-bit CRC function, for instance, generated thousands of collisions. Eventually, I ran across the following reference to a simple hash algorithm in Donald Knuth's The Art of Computer Programming:

- 1. Generate a sequence of as many pseudo-random numbers as there are bytes in the key.

- 2. Multiply each byte in the key by a different member of the sequence.

- 3. Add up the results.

Since I had a source of random numbers, I tried this hash algorithm with an initial seed of "1." The resulting collision count was 61. This result was close, but there were still two problems with this algorithm. It ignored trailing null bytes, and, unless the starting seed was varied, it used the same sequence of multipliers for every key. The first problem was easy to fix -- I simply added a constant to each byte before multiplying. (I used 65,280, or FF00h.)

I am ashamed to say how many blind alleys I ran down before settling on a solution to the second problem. Because the starting seed completely determines the random number sequence, you might think that enough variation could be easily introduced by setting the seed to a simple function of the length or the terminal key characters. Not so. The only reliable method I've found to randomize the starting seed is to run the hash function twice in the following manner: The first time you run the function, start with a constant seed. The second time around, use the result of the first run as the seed.
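Putting these pieces together, here is a Python sketch of the byte-multiply hash with the FF00h offset and the run-twice seeding. The function names and the latin-1 byte handling are my own; the article's version is inline assembler.

```python
M = 2_147_483_647
A = 16807

def pm_next(seed):
    # One step of the Park-Miller generator
    return (seed * A) % M

def knuth_hash(seed, key):
    # Knuth's recipe as used in the article: multiply each key byte
    # (plus a constant) by a successive pseudo-random number, and sum.
    h = 0
    for b in key.encode("latin-1"):
        seed = pm_next(seed)
        h += (b + 0xFF00) * seed  # the 65,280 (FF00h) constant defeats trailing nulls
    h &= 0x7FFFFFFF               # keep the low-order 31 bits
    return h or 1                 # never produce a zero seed

def dict_hash(key, start=26):
    # Run the hash twice: the first pass's result seeds the second pass
    return knuth_hash(knuth_hash(start, key), key)
```

With the FF00h offset, a trailing null byte now changes the hash, and the double application varies the multiplier sequence from key to key.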

Each different starting seed defines a different hash function. I tested 50 starting seeds and discovered that most of them performed well on the collision test, but poorly on the practical test. I knew that after the hash function determines the initial seed, the practical test uses the raw random-number generator to select bits. This fact eventually led me to suspect the "randomness" of the Park-Miller generator's output. Knuth describes a simple shuffling algorithm that improves the randomness of even poor generators. After I incorporated an eight-way shuffle into the Park-Miller generator, the false drop count improved significantly, although it was still 17 percent above the expected count. The eight-way shuffling algorithm is shown in Figure 2.

#### Figure 2: The eight-way shuffling algorithm

- 1. Establish the initial seed.

- 2. Build the shuffle table: fill an eight-entry table with the next eight Park-Miller numbers; save the ninth Park-Miller number in a variable called NextOut.

- 3. To generate the next shuffled random number: select a table entry using bits 17 through 19 of NextOut; replace NextOut with the contents of the table entry; replace the table entry with the next Park-Miller number; return the contents of NextOut as the random number.
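A Python sketch of Figure 2's algorithm (mine, not the article's), written to accept any generator that yields 31-bit numbers:

```python
M, A = 2_147_483_647, 16807

def park_miller(seed):
    while True:
        seed = (seed * A) % M
        yield seed

def shuffled(gen):
    # Eight-way shuffle from Figure 2
    table = [next(gen) for _ in range(8)]  # the next eight numbers
    next_out = next(gen)                   # the ninth
    while True:
        i = (next_out >> 17) & 0x7         # bits 17-19 select a table entry
        next_out = table[i]                # entry becomes the pending output
        table[i] = next(gen)               # refill the entry
        yield next_out
```

The shuffle only reorders the underlying stream, so the full-period property is preserved: no value is emitted twice within the generator's period.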

Even shuffling will not randomize effectively enough where extreme accuracy is required. An optimal bit table for 26 or more bits per key will suffer from the unavoidable collision behavior of the initial hash function. In such a case, I would not use the random-number generator at all. Instead, I would repeat the hash function with a different starting seed for each bit. This method would require a fast processor to achieve good performance.

Table 1 shows the first 30 seeds and their collision and false-drop counts. Repeating the hash algorithm for each bit produced 43 false drops.

#### Table 1: Seed performance

| Seed | Coll | Raw | Shuf | Seed | Coll | Raw | Shuf |
|-----:|-----:|----:|-----:|-----:|-----:|----:|-----:|
|  1 | 36 | 64 | 61 | 16 | 41 | 51 | 62 |
|  2 | 48 | 66 | 48 | 17 | 37 | 57 | 58 |
|  3 | 53 | 55 | 49 | 18 | 56 | 72 | 59 |
|  4 | 41 | 67 | 41 | 19 | 38 | 80 | 65 |
|  5 | 40 | 59 | 49 | 20 | 47 | 80 | 72 |
|  6 | 48 | 47 | 62 | 21 | 46 | 67 | 54 |
|  7 | 47 | 62 | 54 | 22 | 39 | 71 | 44 |
|  8 | 47 | 64 | 58 | 23 | 50 | 70 | 48 |
|  9 | 36 | 63 | 64 | 24 | 33 | 57 | 64 |
| 10 | 43 | 69 | 54 | 25 | 46 | 66 | 56 |
| 11 | 43 | 59 | 49 | 26 | 35 | 46 | 58 |
| 12 | 45 | 67 | 51 | 27 | 51 | 59 | 46 |
| 13 | 39 | 68 | 48 | 28 | 40 | 67 | 55 |
| 14 | 40 | 65 | 50 | 29 | 51 | 60 | 67 |
| 15 | 48 | 68 | 47 | 30 | 53 | 54 | 58 |

"Coll" is the collision-test count; "Raw" and "Shuf" are false-drop counts with the raw and shuffled generators, respectively. Averages: Coll 44, Raw 63, Shuf 55.

There is no evidence of any real pattern here, and no reason to believe that any particular starting seed is better than the others for all possible key sets. I selected a starting seed of 26, which led to the recommended hash- and bit-selection procedure outlined in Figure 3. This procedure produces bit tables that perform according to the design information given earlier.

#### Figure 3: Recommended hash and bit selection procedure

Function Hash(seed, key):

- 1. Set the initial value of the function result to zero.

- 2. For each byte in the key: generate the next Park-Miller seed; add (key byte + 65,280) * seed to the function result.

- 3. Limit the final result to the low-order 31 bits; if it's zero, increment it.

Employ the function twice: Hash(Hash(26, key), key). The final hash result is the initial seed for generating bit numbers:

- 1. Build the shuffle table.

- 2. Repeat B times: generate the next shuffled random number; set (or test) the bit specified by the number modulo N.

If you have a fast machine and wish to set more than 26 bits per key:

- 1. Set the initial seed to 1.

- 2. Repeat B times: set (or test) the bit specified by Hash(Hash(seed, key), key); increment the seed.
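Combining the hash, the shuffle, and the bit table gives a compact sketch of the whole dictionary in Python. This is an illustrative reimplementation of the procedure, not the article's Turbo Pascal listing; the class and method names are my own.

```python
import math

M, A = 2_147_483_647, 16807

def pm_next(seed):
    return (seed * A) % M

def hash31(seed, key):
    # The article's hash: sum of (byte + FF00h) * successive Park-Miller seeds
    h = 0
    for b in key.encode("latin-1"):
        seed = pm_next(seed)
        h += (b + 0xFF00) * seed
    h &= 0x7FFFFFFF
    return h or 1

class ExistentialDictionary:
    def __init__(self, max_keys, bits_per_key):
        self.b = bits_per_key
        self.n = round(max_keys * bits_per_key / -math.log(0.5))  # optimal bit count
        self.table = bytearray((self.n + 7) // 8)

    def _bits(self, key):
        # Seed the generator with Hash(Hash(26, key), key), then draw
        # B shuffled random numbers; each, modulo N, is a bit position.
        seed = hash31(hash31(26, key), key)
        shuf = []
        for _ in range(8):                 # shuffle table: next eight numbers
            seed = pm_next(seed)
            shuf.append(seed)
        next_out = seed = pm_next(seed)    # the ninth becomes NextOut
        for _ in range(self.b):
            i = (next_out >> 17) & 7       # bits 17-19 select a slot
            next_out = shuf[i]
            seed = pm_next(seed)
            shuf[i] = seed
            yield next_out % self.n

    def insert(self, key):
        for bit in self._bits(key):
            self.table[bit >> 3] |= 0x80 >> (bit & 7)

    def __contains__(self, key):
        # A miss is certain as soon as any selected bit is off
        return all(self.table[bit >> 3] & (0x80 >> (bit & 7))
                   for bit in self._bits(key))
```

Inserted keys always test positive; an absent key tests positive only if all B of its bits happen to be on.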

### Implementation

The implementation presented here is a Turbo Pascal unit (see Listing One, page 110) that supplies a Dictionary object; the methods are listed in Table 2, and the two non-object functions in Table 3. (Refer to the actual code for details about how to use these routines.) I've also developed a complete batch-oriented spelling checker to illustrate the use of the dictionary objects. Due to space constraints, the spelling checker isn't listed in this article, although the complete program (including documentation and sample terms) is available through DDJ.

#### Table 2: Methods for Dictionary object

| Method | Description |
|--------|-------------|
| Init(MaxKeys, BitsPerKey) | A constructor that calculates the size of the bit table, then acquires and initializes storage from the heap. |
| InsertString, InsertBlock, InsertHash | Boolean functions which set bits in the bit table. These functions return TRUE if all the bits are already on. |
| StringInDictionary, BlockInDictionary, HashInDictionary | Boolean functions which return TRUE if all bits associated with a key or initial hash seed are on in the bit table. |
| SaveDictionary, RestoreDictionary | Preserve the attributes and contents of the bit table in a disk file. |
| EstError, ActError | Floating-point functions which return the false drop probability. EstError estimates it by the BitsOn and FalseDrop equations; ActError counts the on bits and uses the FalseDrop equation to compute the exact probability. In numerous tests, these functions have always returned very nearly the same result. |

#### Table 3: Non-object functions

| Function | Description |
|----------|-------------|
| DictionaryBytes(MaxKeys, BitsPerKey) | A LongInt function that uses the Optimal Table Size equation to compute the size in bytes of an optimal bit table for the specified key count and bits per key. |
| DictHash(Data, Size) | A LongInt function that provides access to the hash algorithm. |

The save file produced by the SaveDictionary method is intended for direct, binary portability to another hardware platform, a Tandem NonStop minicomputer. The only concession made to the Tandem architecture is the byte order of binary integers: these are byte-reversed before saving to disk, and reversed again after restoration.

The random-number generator, the hash function, and the bit-set and test functions are implemented as inline-assembler macros. The bit table size is limited to 65,520 bytes as a concession to Turbo Pascal's maximum GetMem size, and to improve the performance of the bit-set and test functions. You can overcome this limitation by distributing keys to multiple dictionaries. For instance, my Tandem spelling checker divides the dictionary by the first letter of a word into seven bit tables: A-B, C-D, E-H, I-N, O-R, S-T, and U-Z. These are the same partitions used in the practical test. Observe one caution when using multiple bit tables: You must add together the false drop probabilities for each existence test performed on the same key. For instance, if you test a key in two dictionaries and each dictionary has a false drop probability of 1/4096, the combined false drop probability is 2/4096 or 1/2048. My spelling checker selects the appropriate dictionary for the unknown word by the word's first character, and tests only in that dictionary.
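The caution about multiple tables is simple arithmetic; a quick check with exact rational numbers (mine, not the article's code; strictly, the sum is an upper bound on the combined probability, essentially exact for probabilities this small):

```python
from fractions import Fraction

# Testing the same key in two dictionaries, each with a false drop
# probability of 1/4096, roughly doubles the chance of a false drop:
combined = Fraction(1, 4096) + Fraction(1, 4096)  # 1/2048
```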

### Performance

To get an idea of the speed of the dictionary methods, I performed a timing benchmark on an 8-MHz NEC-V20. (This machine runs about three times as fast as the original PC.) First, I built an optimal, 14-bit dictionary from a subset of the 109,584 word list. The subset consisted of the 17,639 words beginning with "S" and "T." The false drop probability, as determined by counting the on bits, was 1/16,450.

Second, I read the entire 109,584-word list and tested each word against the dictionary. Actually, I read the list twice: first I timed the read alone, and then I timed the read with the dictionary method calls. The difference in timings was an estimate of the processor time spent in the dictionary methods. In this benchmark, the methods inserted 250 keys per second and tested 435 keys per second. Existence testing was faster than inserting because almost all of the words, excepting those beginning with "S" and "T," were rejected; the existence test terminates as soon as it finds the first zero bit. Key length and the number of bits per key affect performance. For instance, with the same files at 12 bits per key, the benchmark inserted 273 keys per second and tested 447 keys per second. At 14 bits per key, on a 20-MHz 80386, the benchmark inserted 1558 keys per second and tested 2611 keys per second.

The number of iterations of the Park-Miller algorithm determines the performance of the hash and bit-selection algorithms. The hash algorithm performs 2L iterations, where L is the length of the key. The total number of iterations for the hash and bit-selection algorithms is 2L+B+9, where B is the number of bits per key. If we repeat the hash function for each bit, we perform 2BL iterations. When the hash function is repeated for each bit, the 80386 benchmark can insert only 361 keys per second and can test 1712 keys per second.

### Conclusion

Superimposed coding is an elegant way to test for the existence of a key. This approach has applications in many areas of information retrieval. It is a fast method, especially with the shortcuts afforded by the Park-Miller random-number generator. The equations available allow us to engineer an optimal existential dictionary with almost any desired capacity and accuracy.

### Further Reading

- 1. Bentley, Jon. Programming Pearls. Reading, Mass.: Addison-Wesley, 1986.

- 2. Berra, P. Bruce; S.M. Chung, and N.I. Hachem. "Computer Architecture for a Surrogate File to a Very Large Data/Knowledge Base." IEEE Computer (March 1987).

- 3. Carta, David G. "Two Fast Implementations of the 'Minimal Standard' Random Number Generator." Communications of the ACM (January 1990).

- 4. Knuth, Donald E. The Art of Computer Programming. vol. 2, pp. 32-33; vol. 3, pp. 559-567. Reading, Mass.: Addison-Wesley, 1973.

- 5. Park, Stephen K., and Keith W. Miller. "Random Number Generators: Good Ones are Hard to Find." Communications of the ACM (October 1988).

- 6. Peterson, James L. "A Note on Undetected Typing Errors." Communications of the ACM (July 1986).

- 7. Salton, Gerard. Automatic Text Processing. Reading, Mass.: Addison-Wesley, 1989.

### Derivation of the BitsOn Equation

Consider an array of N bits, all initially off, and an iterative process that selects bits at random and turns them on. What is the probability of a "collision," that is, the probability that the random process will select a bit that is already on, at the i^{th} iteration? On the first iteration (i = 0), the probability is zero because all bits are off. On the second iteration (i = 1), the probability is 1/N, because exactly one bit is on. Let J_{i} be the number of bits on at the i^{th} iteration. Then, the probability of a collision (call it P_{i}) is as shown in Figure 4(a). The probability of a non-collision (call it Q_{i}) is shown in Figure 4(b).

#### Figure 4: BitsOn equations (a): Collision probability (b): Non-collision probability (c): Recurrence formula (d): Explicit equation (e): Logarithm (f): Rearranged (g): Limit (h): Approximation (i): BitsOn (j): Optimal table (k): Solved for N.

(a) P_{i} = J_{i} / N

(b) Q_{i} = 1 - P_{i} = 1 - J_{i}/N = (N - J_{i}) / N

(c) Q_{i} = Q_{i-1} - Q_{i-1} (1/N) = Q_{i-1} (N - 1)/N

(d) Q_{i} = ((N - 1)/N)^{i}

(e) Log_e(Q_{i}) = Log_e((N - J_{i})/N) = i Log_e((N - 1)/N)

(f) i/N = Log_e((N - J_{i})/N) / (N Log_e((N - 1)/N))

(g) N Log_e((N - 1)/N) -> -1

(h) i/N = -Log_e((N - J_{i})/N)

(i) J_{i} = N - N e^{-i/N}

(j) J_{i}/N = 1 - e^{-i/N} = 1/2, so e^{-i/N} = 1 - 1/2 = 0.5

(k) N = i / (-Log_e(0.5)) = BK / (-Log_e(0.5))

We can see that each time the process succeeds in turning a bit on that was previously off, the probability of a non-collision is reduced by 1/N. But the probability of selecting an off bit, and thus succeeding in turning it on, is the current probability of a non-collision. Combining these facts, we can write a recurrence formula for the value of Q at any iteration, based on the previous iteration, see Figure 4(c).

Knowing that Q_{0} = 1, we can convert the recurrence formula to an explicit equation for Q_{i} (refer to Figure 4(d)). By substituting and taking the logarithm of both sides, we arrive at the equation shown in Figure 4(e). We can rearrange this equation into the equation shown in Figure 4(f). As N becomes larger, the expression shown in Figure 4(g) rapidly becomes so close to -1 that we can neglect the difference. (Hint: it's a "sick" limit.) So, for practical purposes, the equation becomes that shown in Figure 4(h). Solving for J_{i}, we have the equation in Figure 4(i), which is the expression for the number of bits that were originally off, but are now on, in a table of N bits, after i iterations of the random process. In our application, the number of iterations is simply the product of the number of keys and the number of bits per key. So, i = BK, and we have the BitsOn equation.

The false drop probability is P_{i}^{B}. By differentiation, we can show that J_{i}/N = 1/2 minimizes the false drop probability. (The proof is left as an exercise to the reader. Hint: differentiate (BitsOn/N)^{B} with respect to B.) So, in an optimal table, the relationship shown in Figure 4(j) holds true. Solving for N, we arrive at the optimal table-size equation shown in Figure 4(k). This is the expression for the size in bits of an optimal bit table containing K keys with B bits per key.

The value of - Ln(0.5) showed up early in the process of developing these routines. I experimented with various sizes of bit tables by counting iterations while turning on random bits, until half of the bits were on. The ratio of iterations to table size in bits was always about 0.693. This derivation was motivated by a desire to explain this mysterious number.
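That experiment is easy to reproduce. A short Python simulation (mine, not the article's; it uses Python's `random` module as a stand-in for the bit-selection process) turns on random bits until half are on and compares the iteration count to N:

```python
import random

# Turn on randomly selected bits until half of an N-bit table is on;
# the ratio of iterations to N should approach -Ln(0.5) = 0.693...
random.seed(1)
N = 100_000
table = [False] * N
iterations = 0
on = 0
while on < N // 2:
    i = random.randrange(N)
    if not table[i]:
        table[i] = True
        on += 1
    iterations += 1
ratio = iterations / N  # close to 0.693
```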

-- E.F.

_AN EXISTENTIAL DICTIONARY_ by Edwin T. Floyd

<a name="0237_0014"> {$A+,B-,D-,E-,F+,I+,L-,N-,O-,R-,S-,V+} Unit Dict; Interface { DICT.PAS dictionary object and methods to create and use a superimposed code dictionary. Copyright Edwin T. Floyd, 1990. } Type Dictionary = Object DictArray : Pointer; { Pointer to dictionary bit array } DictCount : LongInt; { Number of key entries in this dictionary } DictSize : Word; { Number of bytes in dictionary bit array } DictBits : Byte; { Number of bits per key entry } Constructor Init(MaxKeys : Word; BitsPerKey : Byte); { Initialize dictionary, specify maximum keys and bits per key. } Constructor RestoreDictionary(FileName : String); { Restore dictionary saved on disk by SaveDictionary } { Note: Use either Init or RestoreDictionary, not both. } Destructor Done; { Release storage allocated to dictionary. } Function DictionarySize : Word; { Returns number of bytes that will be written by SaveDictionary. } Procedure SaveDictionary(FileName : String); { Save dictionary in a disk file. } Function InsertString(Var s : String) : Boolean; { Insert string in dictionary; returns TRUE if string is already there. } Function StringInDictionary(Var s : String) : Boolean; { Returns TRUE if string is in dictionary. } Function InsertBlock(Var Data; Len : Word) : Boolean; { Insert block in dictionary; returns TRUE if block is already there. } Function BlockInDictionary(Var Data; Len : Word) : Boolean; { Returns TRUE if block is in dictionary. } Function InsertHash(Hash : LongInt) : Boolean; { Insert hash in dictionary; returns TRUE if hash is already there. } Function HashInDictionary(Hash : LongInt) : Boolean; { Returns TRUE if hash is in dictionary. } Function EstError : Real; { Returns estimated probability of error. } Function ActError : Real; { Returns actual probability of error (slow, counts bits). 
} End; Function DictionaryBytes(MaxKeys : LongInt; BitsPerKey : Byte) : LongInt; { Returns the size in bytes of the optimal dictionary bit table for the indicated key and bit-per-key counts. } Function DictHash(Var Data; Len : Word) : LongInt; { Hash data block to a positive long integer. } Implementation Const MagicNumber = $E501205F; { Used to validate dictionary save file } RandMult = 16807; { =7**5; RandMult must be expressable in 16 bits. 48271 may give better "randomness" (see ACM ref.) } ShuffleBits = 3; ShuffleShift = 16 - ShuffleBits; ShufTableEnd = $FFFF Shr ShuffleShift; HashSeed : Word = 26; { Initial hash seed } RandSeed : LongInt = 1; { Random number seed: 0 < RandSeed < 2**31-1 } Type SaveFileHeader = Record { Header for dictionary save file (all numbers are byte-reversed) } Magic : LongInt; { Magic number for validity test } BitsCount : LongInt; { Bits-per-key and entry count } Size : Word; { Size of dictionary bit map in bytes } End; Var ShufTable : Array[0..ShufTableEnd] Of LongInt; NextOut : Word; Function IRand : LongInt; { Return next "minimal standard", 31 bit pseudo-random integer. This function actually computes (RandSeed * RandMult) Mod (2**31-1) where RandMult is a 16 bit quantity and RandSeed is 32 bits (See Carta, CACM 1/90). 
} Inline( $A1/>RandSeed+2/ { mov ax,[>RandSeed+2]} $BF/>RandMult/ { mov di,>RandMult} $F7/$E7/ { mul di} $89/$C3/ { mov bx,ax} $89/$D1/ { mov cx,dx} $A1/>RandSeed/ { mov ax,[>RandSeed]} $F7/$E7/ { mul di} $01/$DA/ { add dx,bx} $83/$D1/$00/ { adc cx,0 ; cx:dx:ax = Seed * Mult } $D0/$E6/ { shl dh,1 ; split p & q at 31 bits } $D1/$D1/ { rcl cx,1} $D0/$EE/ { shr dh,1 ; cx = p, dx:ax = q } $01/$C8/ { add ax,cx} $83/$D2/$00/ { adc dx,0 ; dx:ax = p + q } $71/$09/ { jno done} $05/$01/$00/ { add ax,1 ; overflow, inc(p + q) } $83/$D2/$00/ { adc dx,0} $80/$E6/$7F/ { and dh,$7F ; limit to 31 bits } {done:} $A3/>RandSeed/ { mov [>RandSeed],ax} $89/$16/>RandSeed+2); { mov [>RandSeed+2],dx} Function Hash(Seed : LongInt; Var Data; Len : Word) : LongInt; { Hash a block of data into a random long integer. This is actually equivalent to the following: RandSeed := Seed; Hash := 0; For i := 1 To Len Do Hash := Hash + (IRand * (Data[i] + $FF00); Hash := Hash AND $7FFFFFFF; If Hash = 0 Then Inc(Hash); Overflow is ignored. The seed is kept in registers; RandSeed is not affected by this routine. 
} Inline( $59/ { pop cx ; cx := len} $5E/ { pop si ; bx:si := @data} $5B/ { pop bx} $58/ { pop ax ; dx:ax := seed} $5A/ { pop dx} $E3/$59/ { jcxz alldone} $FC/ { cld} $1E/ { push ds} $8E/$DB/ { mov ds,bx} $55/ { push bp} $31/$DB/ { xor bx,bx} $53/ { push bx ; zero accumulator} $53/ { push bx} $89/$E5/ { mov bp,sp} {next: ; for each byte of data...} $51/ { push cx} $BF/>RandMult/ { mov di,>RandMult} $89/$C3/ { mov bx,ax} $89/$D0/ { mov ax,dx ; compute next seed} $F7/$E7/ { mul di} $93/ { xchg ax,bx} $89/$D1/ { mov cx,dx} $F7/$E7/ { mul di} $01/$DA/ { add dx,bx} $83/$D1/$00/ { adc cx,0 ; cx:dx:ax = Seed * Mult} $D0/$E6/ { shl dh,1 ; split p & q at 31 bits} $D1/$D1/ { rcl cx,1} $D0/$EE/ { shr dh,1 ; cx = p, dx:ax = q} $01/$C8/ { add ax,cx} $83/$D2/$00/ { adc dx,0 ; dx:ax = p + q} $71/$09/ { jno noovfl} $05/$01/$00/ { add ax,1 ; overflow, inc(p + q)} $83/$D2/$00/ { adc dx,0} $80/$E6/$7F/ { and dh,$7F ; limit to 31 bits} {noovfl:} $89/$C3/ { mov bx,ax ; save seed} $89/$D1/ { mov cx,dx} $AC/ { lodsb ; get next byte + $FF00} $B4/$FF/ { mov ah,$FF} $89/$C7/ { mov di,ax} $F7/$E1/ { mul cx ; multiply by seed} $97/ { xchg ax,di} $F7/$E3/ { mul bx} $01/$FA/ { add dx,di} $01/$46/$00/ { add [bp+0],ax ; accumulate} $11/$56/$02/ { adc [bp+2],dx} $89/$D8/ { mov ax,bx} $89/$CA/ { mov dx,cx} $59/ { pop cx} $E2/$B9/ { loop next ; until out of data} {;} $58/ { pop ax} $5A/ { pop dx} $5D/ { pop bp} $1F/ { pop ds} $80/$E6/$7F/ { and dh,$7F} {alldone:} $89/$C3/ { mov bx,ax} $09/$D3/ { or bx,dx} $75/$01/ { jnz exit} $40); { inc ax} {exit:} Procedure Shuffle; { Load the shuffle table } Begin For NextOut := 0 To ShufTableEnd Do ShufTable[NextOut] := IRand; NextOut := Word(IRand) Shr ShuffleShift; End; Function SIRand : LongInt; { Return the next shuffled random number } Var y : LongInt; Begin y := ShufTable[NextOut]; ShufTable[NextOut] := IRand; NextOut := Word(y) Shr ShuffleShift; SIRand := y; End; Function TestBit(Var BitArray; Size : Word; BitNo : LongInt) : Boolean; { Returns TRUE if 
  indicated bit number, modulo size of bit array, is set.
  Size is in bytes. }
Inline(
{; dx:ax := BitNo}
  $58/         { pop ax}
  $5A/         { pop dx}
{; bl := bit mask}
  $88/$C1/     { mov cl,al}
  $80/$E1/$07/ { and cl,$07}
  $B3/$80/     { mov bl,$80}
  $D2/$EB/     { shr bl,cl}
{; dx:ax := byte offset}
  $D1/$EA/     { shr dx,1}
  $D1/$D8/     { rcr ax,1}
  $D1/$EA/     { shr dx,1}
  $D1/$D8/     { rcr ax,1}
  $D1/$EA/     { shr dx,1}
  $D1/$D8/     { rcr ax,1}
{; dx := byte offset}
  $5F/         { pop di}
  $39/$D7/     { cmp di,dx}
  $77/$0E/     { ja quickdiv}
{; protect against overflow}
  $89/$F9/     { mov cx,di}
{protloop:}
  $D1/$E1/     { shl cx,1}
  $39/$D1/     { cmp cx,dx}
  $76/$FA/     { jbe protloop}
  $F7/$F1/     { div cx}
  $89/$D0/     { mov ax,dx}
  $31/$D2/     { xor dx,dx}
{quickdiv:}
  $F7/$F7/     { div di}
{; es:di := seg:ofs of byte}
  $5F/         { pop di}
  $01/$D7/     { add di,dx}
  $07/         { pop es}
{; test bit}
  $30/$C0/     { xor al,al}
  $26/$22/$1D/ { es:and bl,[di]}
  $74/$02/     { jz notset}
  $FE/$C0);    { inc al}
{notset:}

Function SetBit(Var BitArray; Size : Word; BitNo : LongInt) : Boolean;
{ Sets the indicated bit number modulo size of bit array.
  Returns TRUE if bit was already set. Size is in bytes.
}
Inline(
{; dx:ax := BitNo}
  $58/          { pop ax}
  $5A/          { pop dx}
{; bl := bit mask}
  $88/$C1/      { mov cl,al}
  $80/$E1/$07/  { and cl,$07}
  $B3/$80/      { mov bl,$80}
  $D2/$EB/      { shr bl,cl}
{; dx:ax := byte offset}
  $D1/$EA/      { shr dx,1}
  $D1/$D8/      { rcr ax,1}
  $D1/$EA/      { shr dx,1}
  $D1/$D8/      { rcr ax,1}
  $D1/$EA/      { shr dx,1}
  $D1/$D8/      { rcr ax,1}
{; dx := byte offset mod size}
  $5F/          { pop di}
  $39/$D7/      { cmp di,dx}
  $77/$0E/      { ja quickdiv}
{; protect against overflow}
  $89/$F9/      { mov cx,di}
{protloop:}
  $D1/$E1/      { shl cx,1}
  $39/$D1/      { cmp cx,dx}
  $76/$FA/      { jbe protloop}
  $F7/$F1/      { div cx}
  $89/$D0/      { mov ax,dx}
  $31/$D2/      { xor dx,dx}
{quickdiv:}
  $F7/$F7/      { div di}
{; es:di := seg:ofs of byte}
  $5F/          { pop di}
  $01/$D7/      { add di,dx}
  $07/          { pop es}
{; test bit}
  $30/$C0/      { xor al,al}
  $88/$DC/      { mov ah,bl}
  $26/$22/$25/  { es:and ah,[di]}
  $74/$04/      { jz notset}
  $FE/$C0/      { inc al}
  $EB/$03/      { jmp short set}
{notset:}
  $26/$08/$1D); { es:or [di],bl}
{set:}

Function LongSwap(n : LongInt) : LongInt;
{ Reverse bytes in a LongInt. }
Inline(
  $5A/      { pop dx}
  $58/      { pop ax}
  $86/$C4/  { xchg ah,al}
  $86/$D6); { xchg dh,dl}

Function DictionaryBytes(MaxKeys : LongInt; BitsPerKey : Byte) : LongInt;
Begin
  DictionaryBytes := Round(MaxKeys * BitsPerKey / (-Ln(0.5) * 8));
End;

Function DictHash(Var Data; Len : Word) : LongInt;
Begin
  DictHash := Hash(Hash(HashSeed, Data, Len), Data, Len);
End;

Constructor Dictionary.Init(MaxKeys : Word; BitsPerKey : Byte);
Var
  DictBytes : LongInt;
Begin
  DictBytes := DictionaryBytes(MaxKeys, BitsPerKey);
  If DictBytes > $FFF0 Then Begin
    WriteLn(DictBytes, ' bytes optimal for dictionary, but ', $FFF0,
            ' is maximum size dictionary. Using max size.');
    DictBytes := $FFF0;
  End
  Else If DictBytes > MaxAvail Then Begin
    WriteLn(DictBytes, ' bytes optimal for dictionary, but only ',
            MaxAvail, ' bytes are available.
Using ', MaxAvail);
    DictBytes := MaxAvail;
  End
  Else If DictBytes < 16 Then
    DictBytes := 16;
  DictSize := DictBytes;
  GetMem(DictArray, DictSize);
  FillChar(DictArray^, DictSize, 0);
  DictCount := 0;
  DictBits := BitsPerKey;
End;

Constructor Dictionary.RestoreDictionary(FileName : String);
Var
  Header : SaveFileHeader;
  DictBytes : LongInt;
  f : File;
  OldMode : Byte;
Begin
  OldMode := FileMode;
  FileMode := $40;
  Assign(f, FileName);
  Reset(f, 1);
  BlockRead(f, Header, SizeOf(Header));
  With Header Do Begin
    Magic := LongSwap(Magic);
    Size := Swap(Size);
    DictBytes := FileSize(f) - SizeOf(Header);
    If (Magic <> MagicNumber) Or (Size <> DictBytes) Or
       (Size < 16) Or (Size > $FFF0) Then Begin
      WriteLn('File ', FileName, ' is not a dictionary save file.');
      Halt(1);
    End;
    DictSize := Size;
    DictBits := BitsCount And $FF;
    DictCount := LongSwap(BitsCount And $FFFFFF00);
    GetMem(DictArray, DictSize);
    BlockRead(f, DictArray^, DictSize);
    Close(f);
    FileMode := OldMode;
  End;
End;

Destructor Dictionary.Done;
Begin
  FreeMem(DictArray, DictSize);
  DictArray := Nil;
  DictSize := 0;
  DictBits := 0;
  DictCount := 0;
End;

Function Dictionary.DictionarySize : Word;
Begin
  DictionarySize := DictSize + SizeOf(SaveFileHeader);
End;

Function Dictionary.InsertString(Var s : String) : Boolean;
Begin
  InsertString := InsertBlock(s[1], Length(s));
End;

Function Dictionary.StringInDictionary(Var s : String) : Boolean;
Begin
  StringInDictionary := BlockInDictionary(s[1], Length(s));
End;

Function Dictionary.InsertBlock(Var Data; Len : Word) : Boolean;
Begin
  InsertBlock := InsertHash(DictHash(Data, Len));
End;

Function Dictionary.BlockInDictionary(Var Data; Len : Word) : Boolean;
Begin
  BlockInDictionary := HashInDictionary(DictHash(Data, Len));
End;

Function Dictionary.InsertHash(Hash : LongInt) : Boolean;
Var
  i : Byte;
  InDict : Boolean;
Begin
  InDict := True;
  RandSeed := Hash;
  Shuffle;
  For i := 1 To DictBits Do
    If Not SetBit(DictArray^, DictSize, SIRand) Then
      InDict := False;
  If Not InDict Then
    Inc(DictCount);
  InsertHash :=
InDict;
End;

Function Dictionary.HashInDictionary(Hash : LongInt) : Boolean;
Var
  i : Byte;
  InDict : Boolean;
Begin
  InDict := True;
  RandSeed := Hash;
  Shuffle;
  i := 0;
  While (i < DictBits) And InDict Do Begin
    If Not TestBit(DictArray^, DictSize, SIRand) Then
      InDict := False;
    Inc(i);
  End;
  HashInDictionary := InDict;
End;

Procedure Dictionary.SaveDictionary(FileName : String);
Var
  Header : SaveFileHeader;
  f : File;
Begin
  Assign(f, FileName);
  ReWrite(f, 1);
  With Header Do Begin
    Magic := LongSwap(MagicNumber);
    Size := Swap(DictSize);
    BitsCount := LongSwap(DictCount) + DictBits;
  End;
  BlockWrite(f, Header, SizeOf(Header));
  BlockWrite(f, DictArray^, DictSize);
  Close(f);
End;

Function Dictionary.EstError : Real;
Begin
  EstError := Exp(Ln(1.0 - Exp(-(DictCount * DictBits) / (DictSize * 8.0))) * DictBits);
End;

Function Dictionary.ActError : Real;
Var
  AllBits, BitsOn, i : LongInt;
Begin
  AllBits := LongInt(DictSize) * 8;
  BitsOn := 0;
  For i := 0 To Pred(AllBits) Do
    If TestBit(DictArray^, DictSize, i) Then
      Inc(BitsOn);
  ActError := Exp(Ln(BitsOn / AllBits) * DictBits);
End;
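For readers who want to experiment with the scheme without a Turbo Pascal compiler, here is a minimal Python sketch of the same idea. It is not a translation of the listing: the class and method names (`ExistentialDictionary`, `insert`, `est_error`) are my own, and Python's `random.Random` seeded from the key stands in for the listing's `Hash`/`IRand`/`SIRand` machinery. Only the sizing rule (`DictionaryBytes`) and the error estimate (`EstError`) are carried over formula for formula.

```python
import math
import random

def dictionary_bytes(max_keys: int, bits_per_key: int) -> int:
    # Same sizing rule as the listing's DictionaryBytes: the array is
    # sized so that roughly half the bits are set when it is full.
    return round(max_keys * bits_per_key / (-math.log(0.5) * 8))

class ExistentialDictionary:
    """Superimposed-coding dictionary: records that a key exists
    without storing the key itself (a Bloom-filter-style bit array)."""

    def __init__(self, max_keys: int, bits_per_key: int):
        self.bits_per_key = bits_per_key
        self.size_bytes = max(16, dictionary_bytes(max_keys, bits_per_key))
        self.bits = bytearray(self.size_bytes)
        self.count = 0

    def _bit_positions(self, key: str):
        # Seed a PRNG from the key and draw bits_per_key positions,
        # mirroring the listing's hash-seeds-the-generator approach.
        rng = random.Random(key)
        nbits = self.size_bytes * 8
        return [rng.randrange(nbits) for _ in range(self.bits_per_key)]

    def insert(self, key: str) -> bool:
        """Set the key's bits; return True if all were already set
        (the analogue of InsertHash)."""
        already = True
        for pos in self._bit_positions(key):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (0x80 >> bit):
                already = False
                self.bits[byte] |= 0x80 >> bit
        if not already:
            self.count += 1
        return already

    def __contains__(self, key: str) -> bool:
        # The analogue of HashInDictionary: all bits set => "probably in".
        return all(self.bits[pos // 8] & (0x80 >> pos % 8)
                   for pos in self._bit_positions(key))

    def est_error(self) -> float:
        # Same formula as EstError: (1 - e^(-count*k/m))^k, m in bits.
        m = self.size_bytes * 8
        return (1.0 - math.exp(-(self.count * self.bits_per_key) / m)) \
               ** self.bits_per_key
```

The sizing formula reproduces the figures quoted in the introduction: at 10 bits per key, `dictionary_bytes(10000, 10)` comes to 18,034 bytes, the roughly 18K cited for the 10,000-key database dictionary.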