
Arithmetic Coding and Statistical Modeling


FEB91: ARITHMETIC CODING AND STATISTICAL MODELING

This article contains the following executables: NELSON.ARC

Mark is the vice president of development and senior developer for Greenleaf Software Inc., of Dallas, Texas. He can be reached through the DDJ office.


Most of the data compression methods in common use today fall into one of two camps: dictionary-based schemes or statistical methods. In the world of small systems, dictionary-based data compression techniques seem to be more popular at this time. However, by combining arithmetic coding with powerful modeling techniques, statistical methods for data compression are actually able to achieve better performance. This article discusses how to combine arithmetic coding with modeling methods to achieve some impressive compression ratios.

Terms of Endearment

In general, data compression operates by taking "symbols" from an input "text," processing them, and writing "codes" to a compressed file. For the purposes of this article, symbols are usually bytes, but they could just as easily be pixels, 80-bit floating point numbers, or EBCDIC characters. To be effective, a data compression scheme needs to be able to transform the compressed file back into an identical copy of the input text. Needless to say, it also helps if the compressed file is smaller than the input text!

Dictionary-based compression systems operate by replacing groups of symbols in the input text with fixed length codes. A well-known example of a dictionary technique is LZW data compression. (See "LZW Data Compression," DDJ, October 1989). LZW operates by replacing strings of essentially unlimited length with codes that usually range in size from 9 to 16 bits.

Statistical methods of data compression take a completely different approach: They operate by encoding symbols one at a time. The symbols are encoded into output codes, the length of which varies based on the probability or frequency of the symbol. Low probability symbols are encoded using many bits, and high probability symbols are encoded using fewer bits.

In practice, the dividing line between statistical and dictionary methods is not always so distinct. Some schemes can't be clearly put in one camp or the other, and there are always hybrids which use features from both techniques. However, the methods discussed in this article use arithmetic coding to implement purely statistical compression schemes.

How Arithmetic Coding Works

Only in the last ten years has a respectable candidate to replace Huffman coding been successfully demonstrated: arithmetic coding. Arithmetic coding completely bypasses the idea of replacing an input symbol with a specific code. Instead, it takes a stream of input symbols and replaces it with a single floating point output number. The longer (and more complex) the message, the more bits are needed in the output number. It was not until recently that practical methods were found to implement this on computers with fixed-sized registers.

The output from an arithmetic coding process is a single number less than 1 and greater than or equal to 0. This single number can be uniquely decoded to create the exact stream of symbols that went into its construction. To construct the output number, the symbols being encoded have to have set probabilities assigned to them. For example, if I were to encode the random message BILL GATES, I would have a probability distribution that looks like Figure 1.

Figure 1: Probability distribution for BILL GATES


  Character  Probability
  ----------------------

   SPACE     1/10
   A         1/10
   B         1/10
   E         1/10
   G         1/10
   I         1/10
   L         2/10
   S         1/10
   T         1/10

Once the character probabilities are known, the individual symbols need to be assigned a range along a "probability line," which is nominally 0 to 1. It doesn't matter which characters are assigned which segment of the range, as long as it is done in the same manner by both the encoder and the decoder. The nine-character symbol set used here would look like Figure 2.

Figure 2: A nine-character symbol set

  Character  Probability  Range
  -----------------------------------

  SPACE      1/10         0.00 - 0.10
  A          1/10         0.10 - 0.20
  B          1/10         0.20 - 0.30
  E          1/10         0.30 - 0.40
  G          1/10         0.40 - 0.50
  I          1/10         0.50 - 0.60
  L          2/10         0.60 - 0.80
  S          1/10         0.80 - 0.90
  T          1/10         0.90 - 1.00

Each character is assigned the portion of the 0 - 1 range that corresponds to its probability of appearance. Note also that the character "owns" everything up to, but not including the higher number. So the letter T in fact has the range 0.90 - 0.9999 ....

The most significant portion of an arithmetic coded message belongs to the first symbol to be encoded. When encoding the message BILL GATES, the first symbol is B. For the first character to be decoded properly, the final coded message has to be a number greater than or equal to 0.20 and less than 0.30. To encode this number, we keep track of the range within which this number could fall. So after the first character is encoded, the low end of this range is 0.20 and the high end is 0.30.

After the first character is encoded, we also know that the range for our output number is bounded by the low and high numbers. During the rest of the encoding process, each new symbol to be encoded will further restrict the possible range of the output number. The next character to be encoded, I, owns the range 0.50 through 0.60. If this was the first number in our message, we would set these as our low- and high-range values. But I is the second character; therefore, we say that I owns the range corresponding to 0.50 - 0.60 in the new subrange of 0.2 - 0.3. This means that the new encoded number will have to fall somewhere in the 50 to 60th percentile of the currently established range. Applying this logic will further restrict our number to the range between 0.25 and 0.26. The algorithm to accomplish this for a message of any length is shown in Figure 3. Figure 4 shows this process followed through to its natural conclusion with our chosen message. So the final low value, 0.2572167752, will uniquely encode the message BILL GATES using our present encoding scheme.

Figure 3: Encoding algorithm for a message of any length

  Set low to 0.0
  Set high to 1.0
  While there are still input symbols do
       get an input symbol
       range = high - low
       high = low + range * high_range(symbol)
       low = low + range * low_range(symbol)
  End of While
  output low

Figure 4: Resulting message

  New Character  Low Value    High Value
  ---------------------------------------

                 0.0          1.0
       B         0.2          0.3
       I         0.25         0.26
       L         0.256        0.258
       L         0.2572       0.2576
       SPACE     0.25720      0.25724
       G         0.257216     0.257220
       A         0.2572164    0.2572168
       T         0.25721676   0.2572168
       E         0.257216772  0.257216776

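To make the narrowing process concrete, here is a minimal C sketch of the idealized encoder, using the ranges from Figure 2. It uses double precision floats purely for illustration; as the "Practical Matters" section explains, a real implementation uses integer math instead.

  #include <stdio.h>
  #include <string.h>

  /* The symbol ranges from Figure 2. */
  struct range { char symbol; double low; double high; };

  static struct range table[] = {
      { ' ', 0.00, 0.10 }, { 'A', 0.10, 0.20 }, { 'B', 0.20, 0.30 },
      { 'E', 0.30, 0.40 }, { 'G', 0.40, 0.50 }, { 'I', 0.50, 0.60 },
      { 'L', 0.60, 0.80 }, { 'S', 0.80, 0.90 }, { 'T', 0.90, 1.00 }
  };

  double encode( const char *message )
  {
      double low = 0.0, high = 1.0;
      size_t i, j;

      for ( i = 0 ; i < strlen( message ) ; i++ )
          for ( j = 0 ; j < sizeof( table ) / sizeof( table[ 0 ] ) ; j++ )
              if ( table[ j ].symbol == message[ i ] )
              {
                  double range = high - low;            /* current interval  */
                  high = low + range * table[ j ].high; /* narrow the top    */
                  low  = low + range * table[ j ].low;  /* narrow the bottom */
                  break;
              }
      return low;
  }

  int main( void )
  {
      printf( "%.10f\n", encode( "BILL GATES" ) );   /* about 0.2572167752 */
      return 0;
  }
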
Given this encoding scheme, it is relatively easy to see how the decoding process will operate. We find the first symbol in the message by seeing which symbol owns the code space in which our encoded message falls. Because 0.2572167752 falls between 0.2 and 0.3, we know that the first character must be B. We must now remove the B from the encoded number. We know the low and high ranges of B, so we can remove their effects by reversing the process that put them in. First, we subtract the low value of B from 0.2572167752, giving 0.0572167752. Then we divide by the range of B, 0.1, and get 0.572167752. Now we can calculate where that lands, which is in the range of the next letter, I.

The algorithm for decoding the incoming number looks like Figure 5. Note that I have conveniently ignored the problem of how to decide when there are no more symbols left to decode. This can be handled by either encoding a special EOF symbol, or carrying the stream length along with the encoded message. The decoding algorithm for the BILL GATES message will proceed as shown in Figure 6.

Figure 5: Algorithm for decoding an incoming number

  get encoded number
  Do
       find symbol whose range straddles the encoded number
       output the symbol
       range = symbol_high_value - symbol_low_value
       subtract symbol_low_value from encoded number
       divide encoded number by range
  until no more symbols

Figure 6: Resulting message

  Encoded Number  Output Symbol  Low  High  Range
  -----------------------------------------------

  0.2572167752    B              0.2  0.3   0.1
  0.572167752     I              0.5  0.6   0.1
  0.72167752      L              0.6  0.8   0.2
  0.6083876       L              0.6  0.8   0.2
  0.041938        SPACE          0.0  0.1   0.1
  0.41938         G              0.4  0.5   0.1
  0.1938          A              0.1  0.2   0.1
  0.938           T              0.9  1.0   0.1
  0.38            E              0.3  0.4   0.1
  0.8             S              0.8  0.9   0.1
  0.0

To summarize, the encoding process consists simply of narrowing the range of possible numbers with every new symbol. The new range is proportional to the predefined probability attached to that symbol. Decoding is the inverse procedure: The range is expanded in proportion to the probability of each symbol as it is extracted.
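
The decoding side can be sketched in the same toy style, reusing the table from the encoder sketch above. The count parameter sidesteps the end-of-stream question mentioned earlier; a real program would encode a special EOF symbol or carry the stream length along.

  /* Peel symbols off the encoded number by reversing the narrowing step. */
  void decode( double encoded, int count )
  {
      int i;
      size_t j;

      for ( i = 0 ; i < count ; i++ )
          for ( j = 0 ; j < sizeof( table ) / sizeof( table[ 0 ] ) ; j++ )
              if ( encoded >= table[ j ].low && encoded < table[ j ].high )
              {
                  putchar( table[ j ].symbol );
                  encoded -= table[ j ].low;                   /* remove symbol */
                  encoded /= table[ j ].high - table[ j ].low; /* expand range  */
                  break;
              }
      putchar( '\n' );
  }

  /* decode( encode( "BILL GATES" ), 10 ) prints BILL GATES. */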

Practical Matters

The process of encoding and decoding a stream of symbols using arithmetic coding is not too complicated. But at first glance, it seems completely impractical. Most computers support floating point numbers of up to 80 bits or so. Does this mean you have to start over every time you finish encoding 10 or 15 symbols? Do you need a floating point processor? Can machines with different floating point formats communicate using arithmetic coding?

As it turns out, arithmetic coding is best accomplished using standard 16-bit and 32-bit integer math. No floating point math is required, nor would it help to use it. Instead, we use an incremental transmission scheme in which fixed-size integer state variables receive new bits at the low end and shift them out the high end, forming a single number that can be as many bits long as the computer has storage to hold.

In the previous section, I showed how the algorithm works by keeping track of a high and low number that bracket the range of the possible output number. When the algorithm first starts up, the low number is set to 0.0, and the high number to 1.0. To work with integer math, first change the 1.0 to 0.999..., or .111... in binary.

To store these numbers in integer registers, we first justify them so the implied decimal point is on the left-hand side of the word. Then we load as many of the initial high and low values as will fit into the word size we are working with. My implementation uses 16-bit unsigned math, so the initial value of high is 0xFFFF, and low is 0. We know that the high value continues with FFs forever, and low continues with 0s forever, so we can shift those extra bits in with impunity when they are needed.

If you imagine our BILL GATES example in a 5-digit register, the decimal equivalent of our setup would look like Figure 7(a). To find our new range numbers, we need to apply the encoding algorithm from the previous section. We first calculate the range between the low and high values. The difference between the two registers will be 100000, not 99999, because the high register has an implied infinite string of 9s appended to it, so we increment the calculated difference. We then compute the new high value using the formula from the previous section: high = low + range * high_range(symbol).

Figure 7: (a) Decimal equivalent of setup; (b) high and low look after calculation; (c) resulting message

  (a)

  HIGH:  99999
  LOW:   00000

  (b)

  HIGH:  29999  (999...)
  LOW:   20000  (000...)

  (c)

                          HIGH   LOW    RANGE   CUMULATIVE OUTPUT
  ---------------------------------------------------------------

  Initial state           99999  00000  100000
  Encode B (0.2-0.3)      29999  20000
  Shift out 2             99999  00000  100000  .2
  Encode I (0.5-0.6)      59999  50000          .2
  Shift out 5             99999  00000  100000  .25
  Encode L (0.6-0.8)      79999  60000  20000   .25
  Encode L (0.6-0.8)      75999  72000          .25
  Shift out 7             59999  20000  40000   .257
  Encode SPACE (0.0-0.1)  23999  20000          .257
  Shift out 2             39999  00000  40000   .2572
  Encode G (0.4-0.5)      19999  16000          .2572
  Shift out 1             99999  60000  40000   .25721
  Encode A (0.1-0.2)      67999  64000          .25721
  Shift out 6             79999  40000  40000   .257216
  Encode T (0.9-1.0)      79999  76000          .257216
  Shift out 7             99999  60000  40000   .2572167
  Encode E (0.3-0.4)      75999  72000          .2572167
  Shift out 7             59999  20000  40000   .25721677
  Encode S (0.8-0.9)      55999  52000          .25721677
  Shift out 5             59999  20000          .257216775
  Shift out 2                                   .2572167752
  Shift out 0                                   .25721677520

In this case, the high range is 0.30, which gives a new high value of 30000. Before storing this value, we need to decrement it (again because of the implied digits appended to the integer value), so the new value of high is 29999. The calculation of low follows the same path, with a resulting new value of 20000. High and low now look like Figure 7(b).

At this point, the most significant digits of high and low match. Due to the nature of our algorithm, high and low can continue to grow closer to one another without ever quite matching. Therefore, once the most significant digit matches, that digit will never change. We can then output that digit as the first digit of our encoded number. This is done by shifting both high and low left by one digit, shifting a 9 into the least significant digit of high and a 0 into the least significant digit of low. The equivalent operations are performed in binary in the C implementation of this algorithm.

As this process continues, high and low are continually growing closer together, then shifting digits out into the coded word. The process for our BILL GATES message looks like Figure 7(c). Note that after all the letters have been accounted for, two extra digits need to be shifted out of either the high or low value to finish up the output word.
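
In C, the heart of this process looks something like the sketch below. This is only the basic shape of the idea, working one bit at a time with 16-bit registers; output_bit() is a stand-in for a bit-oriented output routine, low_count, high_count, and scale are the symbol's cumulative counts (the integer version of the 0.2 - 0.3 style ranges), and underflow handling is deliberately left out until the next section.

  #include <stdio.h>

  static unsigned short int low  = 0;
  static unsigned short int high = 0xffff;

  static void output_bit( int bit )          /* placeholder for real bit I/O */
  {
      putchar( bit ? '1' : '0' );
  }

  void encode_step( unsigned short int low_count,
                    unsigned short int high_count,
                    unsigned short int scale )
  {
      unsigned long range = (unsigned long) ( high - low ) + 1;

      /* The +1 and -1 adjustments account for the implied string of 1s
       * after high and 0s after low, as described in the text. */
      high = (unsigned short int) ( low + range * high_count / scale - 1 );
      low  = (unsigned short int) ( low + range * low_count  / scale );

      /* While the most significant bits match, they can never change:
       * ship the bit, then shift a 1 into high and a 0 into low. */
      while ( ( high & 0x8000 ) == ( low & 0x8000 ) )
      {
          output_bit( ( high & 0x8000 ) != 0 );
          high = (unsigned short int) ( ( high << 1 ) | 1 );
          low  = (unsigned short int) ( low << 1 );
      }
  }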

A Complication

This scheme works well for incrementally encoding a message. There is enough accuracy retained during the double precision integer calculations to ensure that the message is accurately encoded. However, there is potential for a loss of precision under certain circumstances.

In the event that the encoded word has a string of 0s or 9s in it, the high and low values will slowly converge on a value, but their most significant digits may not match immediately. For example, high and low may look like Figure 8(a). At this point, the calculated range is going to be only a single digit long, which means the output word will not have enough precision to be accurately encoded. Even worse, after a few more iterations, high and low could look like Figure 8(b).

Figure 8: (a) High and low; (b) after a few more iterations, a new high and low; (c) setting an underflow counter to remember what was thrown away

  (a)
  HIGH:  700004
  LOW:   699995

  (b)
  HIGH:  70000
  LOW:   69999

  (c)
              Before  After
  -------------------------

  HIGH:       40344   43449
  LOW:        39810   38100
  Underflow:  0       1

At this point, the values are permanently stuck. The range between high and low has become so small that any calculation will always return the same values. But because the most significant digits of both words are not equal, the algorithm can't output the digit and shift. It seems like an impasse.

In the original algorithm, if the most significant digits of high and low matched, we shifted a digit out. To prevent the underflow problem just encountered, a second test needs to be applied when the most significant digits don't match but differ by exactly one. In that case, we check whether the second most significant digit in high is a 0 and the second digit in low is a 9. If so, we are on the road to underflow and need to take action.

When underflow rears its ugly head, we head it off with a slightly different shift operation. Instead of shifting the most significant digit out of the word, we just delete the second digits from high and low, and shift the rest of the digits left to fill up the space. The most significant digit stays in place. We must then set an underflow counter to remember that we threw away a digit, and we aren't quite sure whether it was going to end up as a 0 or a 9. This operation is shown in Figure 8(c).

After every recalculation operation, if the most significant digits don't match up, we can check for underflow digits again. If they are present, we shift them out and increment the counter.

When the most significant digits do finally converge to a single value, we first output that value. Then we output all the previously discarded "underflow" digits. The underflow digits will be all 9s or 0s, depending on whether High and Low converged to the higher or lower value. In the C implementation of this algorithm, the underflow counter keeps track of how many 1s or 0s to put out.
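
In binary, the test and the squeeze described above can be folded into the bit-shipping loop. The sketch below replaces the simple while loop from the earlier encode_step() sketch; it is one plausible way to arrange the logic, with output_bit() again standing in for the real bit I/O. The 0x4000 masks examine the second most significant bits, playing the role that the second decimal digits play in the discussion above.

  static long underflow_bits = 0;      /* how many digits were thrown away */

  static void ship_bits( void )
  {
      for ( ; ; )
      {
          if ( ( high & 0x8000 ) == ( low & 0x8000 ) )
          {
              int bit = ( high & 0x8000 ) != 0;
              output_bit( bit );
              while ( underflow_bits > 0 )   /* now we know what they were */
              {
                  output_bit( !bit );
                  underflow_bits--;
              }
          }
          else if ( ( low & 0x4000 ) && !( high & 0x4000 ) )
          {
              underflow_bits++;              /* remember the lost digit    */
              low  &= 0x3fff;                /* delete second bit of low   */
              high |= 0x4000;                /* and second bit of high     */
          }
          else
              break;
          low  = (unsigned short int) ( low << 1 );
          high = (unsigned short int) ( ( high << 1 ) | 1 );
      }
  }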

Decoding

In the "ideal" decoding process, we had the entire input number to work with, so the algorithm had us do things such as "divide the encoded number by the symbol probability." In practice, we can't perform an operation like that on a number that could be billions of bytes long. Just like the encoding process, the decoder can operate using 16- and 32-bit integers for calculations.

Instead of maintaining two numbers, high and low, the decoder has to maintain three integers. The first two, high and low, correspond exactly to the high and low values maintained by the encoder. The third number, code, contains the current bits being read in from the input bit stream. The code value will always lie between the high and low values. As they come closer and closer to it, new shift operations will take place, and high and low will move back away from code.

The high and low values in the decoder correspond exactly to the high and low that the encoder was using. They will be updated after every symbol, just as they were in the encoder and should have exactly the same values. By performing the same comparison test on the upper digit of high and low, the decoder knows when to shift a new digit into the incoming code. The same underflow tests are performed as well, in lockstep with the encoder.

In the ideal algorithm, it was possible to determine what the current encoded symbol was just by finding the symbol whose probability range enclosed the present value of the code. In the integer math algorithm, things are somewhat more complicated: The probability scale is determined by the difference between high and low, so instead of the range lying between 0.0 and 1.0, it lies between two positive 16-bit integer counts. The current probability is determined by where the present code value falls within that range. Dividing (value - low) by (high - low + 1) gives the actual probability for the present symbol.

Earlier, we saw how each character "owned" a probability range in the scale ranging from 0.0 to 1.0. To implement arithmetic coding using integer math, this ownership was restated as a low and high count along an integer range from 0 to the maximum count.
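
Putting the last two paragraphs together, the decoder's lookup might be sketched like this, in the same 16-bit terms as the earlier encoder fragments. The code register holds the bits read in so far, scale is the total of the counts in the current context, and the +1 and -1 adjustments compensate for integer truncation; the names are illustrative rather than taken from the coder module that accompanies the listings.

  static unsigned short int code;     /* bits read from the input so far */

  static short int current_count( unsigned short int scale )
  {
      unsigned long range = (unsigned long) ( high - low ) + 1;

      return (short int)
             ( ( ( (unsigned long) ( code - low ) + 1 ) * scale - 1 ) / range );
  }

The count returned is then looked up in the current context's table to find the symbol whose low and high counts straddle it, just as the floating point decoder looked for the symbol whose range straddled the encoded number.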

Modeling

The need to accurately predict the probability of symbols in the input data is inherent to the nature of arithmetic coding. The principle of this type of coding is to reduce the number of bits needed to encode a character as its probability of appearance increases. So if the letter "e" represents 25 percent of the input data, it would only take 2 bits to code. If the letter "z" represents only 0.1 percent of the input data, it might take 10 bits to code. If the model is not generating probabilities accurately, it might take 10 bits to represent "e" and 2 bits to represent "z," causing data expansion instead of compression. This is the first condition the model has to meet: the probabilities it produces must accurately reflect the data being compressed.
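
Those bit counts come straight from the information content of a symbol, -log2(p) bits for a symbol of probability p; a quick check:

  #include <math.h>
  #include <stdio.h>

  int main( void )
  {
      printf( "p = 0.25  -> %.2f bits\n", -log( 0.25  ) / log( 2.0 ) );  /* 2.00 */
      printf( "p = 0.001 -> %.2f bits\n", -log( 0.001 ) / log( 2.0 ) );  /* 9.97 */
      return 0;
  }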

The second condition is that the model needs to make predictions that deviate from a uniform distribution. The better the model is at making these predictions, the better the compression ratios will be. For example, a model could be created that assigned all 256 possible symbols a uniform probability of 1/256. This model would create an output file that was exactly the same size as the input file, because every symbol would take exactly 8 bits to encode. Only by correctly finding probabilities that deviate from a uniform distribution can the number of bits be reduced, leading to compression. Of course, the increased probabilities have to accurately reflect reality, as prescribed by the first condition.

It may seem that the probability of a given symbol occurring in a data stream is fixed, but this is not quite true. Depending on the model being used, the probability of the character can change quite a bit. For example, when compressing a C program, the probability of a newline character in the text might be 1/40. This probability could be determined by scanning the entire text and dividing the number of the character's occurrences by the total number of characters. But if we use a modeling technique that looks at a single previous character, the probabilities change. In that case, if the previous character was a "}", the probability of a newline character goes up to 1/2. This improved modeling technique leads to better compression, even though both models were generating accurate probabilities.

Finite Context Modeling

The type of modeling I will present in this article is referred to as "finite-context" modeling. It is based on a very simple idea: The probabilities for each incoming symbol are calculated based on the context in which the symbol appears. In all the examples I will show here, the context consists of nothing more than previously encountered symbols. The "order" of the model refers to the number of previous symbols that make up the context.

The simplest finite-context model is an order-0 model, in which the probability of each symbol is independent of any previous symbols. To implement this model, we need only a single table containing the frequency counts for each symbol that might be encountered in the input stream. An order-1 model has to keep track of 256 different tables of frequencies, because it needs a separate set of counts for each possible context. Likewise, an order-2 model needs to be able to handle 65,536 different context tables.
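
In their most naive form, the tables implied above are nothing more than arrays of counts; the listings build their CONTEXT structures dynamically instead, precisely because the flat arrays get out of hand so quickly.

  unsigned long order0_counts[ 256 ];           /* one table of counts      */
  unsigned long order1_counts[ 256 ][ 256 ];    /* one table per one-symbol */
                                                /* context                  */
  /* An order-2 model would need 65,536 such tables, which is why MODEL-2.C
   * allocates a context table only after that context has been seen. */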

Adaptive Modeling

Logically, as the order of the model increases, the compression ratios ought to improve as well. For example, the probability of the symbol "u" appearing in this article may only be five percent, but if the previous context character is "q", the probability goes up to 95 percent. Being able to predict characters with high probability lowers the number of bits needed, and larger contexts ought to let us make better predictions.

Unfortunately, as the order of the model increases linearly, the memory consumed by the model increases exponentially. With an order-0 model, the space consumed by the statistics could be as small as 256 bytes. But once the order of the model increases to 2 or 3, even the most cleverly designed models will consume hundreds of kilobytes.

One conventional way of compressing data is to take a single pass over the symbols to be compressed to gather the statistics for the model. A second pass is then made to actually encode the data. The statistics are then usually prepended to the compressed data so that the decoder will have a copy of them to work with. This will obviously have serious problems if the statistics for the model consume more space than the data to be compressed.

The solution is to perform "adaptive" compression, in which both the compressor and the decompressor start with the same model. The compressor encodes a symbol using the existing model, then updates the model to account for the new symbol. The decompressor likewise decodes a symbol using the existing model, then updates the model. As long as the algorithm to update the model operates identically for the compressor and the decompressor, the process can operate perfectly without having to pass a statistics table from the compressor to the decompressor.
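
The overall shape of such a scheme is easy to sketch. The routines named here are placeholders standing in for the real encoder, decoder, and model; the point is only that both sides code with the current model first and fold the symbol into the model second, in exactly the same order, so the two models never diverge.

  #include <stdio.h>

  extern void encode_symbol( FILE *out, int c );
  extern int  decode_symbol( FILE *in );
  extern void update_model( int c );
  #define DONE 256                      /* an out-of-band end-of-stream code */

  void compress( FILE *in, FILE *out )
  {
      int c;

      while ( ( c = getc( in ) ) != EOF )
      {
          encode_symbol( out, c );      /* code with the current model...    */
          update_model( c );            /* ...then add the symbol to it      */
      }
      encode_symbol( out, DONE );
  }

  void expand( FILE *in, FILE *out )
  {
      int c;

      while ( ( c = decode_symbol( in ) ) != DONE )
      {
          putc( c, out );               /* decode with the current model...  */
          update_model( c );            /* ...then make the identical update */
      }
  }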

The place where adaptive compression suffers is in the cost of updating the model. When updating the count for a particular symbol using arithmetic coding, the update code has the potential cost of updating the cumulative counts for all the other symbols as well, leading to code that has to perform an average of 128 arithmetic operations for every single symbol encoded or decoded.
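
To see where a figure like 128 comes from, imagine the cumulative totals kept in a flat array, with totals[ i ] holding the sum of the counts of symbols 0 through i - 1. Bumping one symbol's count then means bumping every cumulative total above it, on average half of the 256 entries. This is only an illustration of the cost, not the scheme the listings actually use.

  void update_cumulative_totals( long totals[ 257 ], int symbol )
  {
      int i;

      for ( i = symbol + 1 ; i <= 256 ; i++ )   /* everything above moves up */
          totals[ i ]++;
  }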

Because of the high cost in both memory and CPU time, higher order adaptive models have become practical only in the last ten years. It is somewhat ironic that as the cost of disk space and memory goes down, so does the cost of compressing the data they store. As these costs continue to decline, we will be able to implement even more effective programs than are practical today.

Highest-Order Modeling

An order-0 model doesn't take into account any of the previous symbols from the text file when calculating the probabilities for the current symbol. By looking at the previous characters in the text file, or the "context," we can more accurately predict the incoming symbols.

When we move up to an order-2 or order-3 model, one problem is that each character must still have a nonzero probability in every context, so that it can be encoded if and when it appears, even in a context where it has never been seen before. The solution is to leave the initial counts of all the symbols at 0 for a given context, and to have a method of falling back to a different context when a previously unseen symbol occurs. This is done by emitting a special "Escape" code. For example, in the context REQ, we could set the Escape code to a count of 1 and all other symbols to a count of 0. The first time the character "U" followed REQ, we would have to emit an Escape code, followed by the code for "U" in a different context. During the model update immediately following that, we would increment the count for "U" in the REQ context to 1, giving it a probability of 1/2. The next time it appeared, it would be encoded in only 1 bit, with the probability increasing and the number of bits decreasing with each appearance.

The obvious question is: What do we use as our "fall-back" context after emitting an Escape code? In the MODEL-2 program (see the following section entitled "Implementation"), if the order-3 context generates an Escape code, the next context tried is that of order-2. This means that the first time the context REQ is used, and "U" must be encoded, an Escape code is generated. Following that, the MODEL-2 program drops back to an order-2 model, and tries to encode the character "U" using the context EQ. This continues on down through the order-0 context. If the Escape code is still generated at order-0, we fall back to a special order (-1) context. The -1 context is set up at initialization to have a count of 1 for every possible symbol. It is never updated, so it is guaranteed to be able to encode every symbol.
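
This fall-back loop is visible in the driver code of COMP-2.C (Listing Two): convert_int_to_symbol() reports whether it had to emit an Escape, and the driver keeps re-encoding the same character at successively lower orders until it finally succeeds. The fragment below is lifted from main() in that listing, with comments added.

  do {
      escaped = convert_int_to_symbol( c, &s );   /* try the current order  */
      encode_symbol( compressed_file, &s );       /* emit symbol or Escape  */
  } while ( escaped );                            /* escaped: drop an order */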

Implementation

In writing this article, I've developed a number of programs, ranging from routines that implement the arithmetic coding algorithm itself to higher-order modeling. Due to space constraints, all of these fully commented files are available electronically (see "Availability," page 3). The modules presented here in the magazine implement an order-2 compression program. The source files consist of MODEL.H (Listing One), the header file; COMP-2.C (Listing Two), the main module for the compression program that handles escape codes and does compression ratio checking; EXPAND-2.C (Listing Three), the main module for decompression; and MODEL-2.C (Listing Four), the highly optimized source for a variable-order compression program that can be used with COMP-2 or EXPAND-2.

MODEL-2.C has prodigious memory requirements. When running on DOS machines limited to 640K, the programs have to be limited to order-1, or perhaps order-2 for text with a higher redundancy ratio. To examine compression ratios for higher orders on binary files, there are three choices. First, the programs can be compiled with Zortech C, using EMS __handle pointers. Second, they can be built using a DOS Extender, such as Rational Systems 16/M. Third, they can be built on a machine that supports virtual memory, such as VMS. The code distributed here was written in an attempt to be portable across all three options.

I found that with an extra megabyte of EMS, I could compress virtually any ASCII file on my PC using order-3 compression. Some binary files require more memory. My Xenix system had no problem with order-3 compression, and turned in the best performance overall in terms of speed. I had mixed results with DOS Extenders. For unknown reasons, my tests with Lattice C/286 and Microsoft C + DOS 16/M ran much slower than Zortech's EMS code, although logic indicates that the opposite should be true. This was not an optimization problem either, because the Lattice and Microsoft implementations ran faster than Zortech's when inside 640K. Executables built with the Lattice C/286 and the Zortech 2.06 code are available electronically. The Rational Systems 16/M license agreement requires royalty payments, so that code cannot be distributed.

Testing and Comparing: CHURN

To test compression programs, I've created a general-purpose test program, CHURN, that simply churns through a directory and all of its subdirectories, compressing and then expanding all the files it finds there. CHURN and a Unix variant, CHURNX, are not printed here for reasons of space, but both are available electronically (see page 3).

Table 1 shows the results returned when testing various compression programs against two different bodies of input. The first sample, TEXTDATA, is roughly 1 Mbyte of text-only files, consisting of source code, word processing files, and documentation. The second sample, BINDATA, is roughly 1 Mbyte of randomly selected files, containing executables, database files, binary images, and so on.

Table 1: Compression testing results

             TEXTDATA - Total input bytes:    987070   Text data
  -----------------------------------------------------------------------

  COMPRESS  Unix 16-bit LZW implementation   351446 bytes  2.85 bits/byte
  LZHUF     Sliding window dictionary        313541 bytes  2.54 bits/byte
  PKZIP     Proprietary dictionary based     292232 bytes  2.37 bits/byte
  Model-2   Highest context, max order-3     239327 bytes  1.94 bits/byte

             BINDATA - Total input bytes:     989917   Binary data
  -----------------------------------------------------------------------

  COMPRESS  Unix 16-bit LZW implementation   662692 bytes  5.35 bits/byte
  PKZIP     Proprietary dictionary based     503827 bytes  4.06 bits/byte
  LZHUF     Sliding window dictionary        503224 bytes  4.06 bits/byte
  Model-2   Highest context, max order-3     500055 bytes  4.03 bits/byte

For comparison, three dictionary-based coding schemes were also run on the datasets. COMPRESS is a 16-bit LZW implementation in widespread use on Unix systems. The PC implementation that uses 16 bits takes up about 500K of RAM. LZHUF is an LZSS program with an adaptive Huffman coding stage written by Haruyasu Yoshizaki, later modified and posted on Fidonet by Paul Edwards. This is essentially the same compression used in the LHARC program. Finally, the commercial product PKZIP 1.10 by PKWare (Glendale, Wisc.) was also tested on the datasets. PKZIP uses a proprietary dictionary-based scheme, which is discussed in the program's documentation.

Conclusions

In terms of compression ratios, these tests show that statistical modeling can perform at least as well as dictionary-based methods. At present, however, these programs are somewhat impractical, due to their high resource requirements. MODEL-2 is fairly slow, compressing data with speeds in the range of 1 Kbyte per second, and needs huge amounts of memory to use higher-order modeling. However, as memory becomes cheaper and processors more powerful, schemes such as the ones shown here may become practical. Currently, they could be applied to circumstances where either storage or transmission costs are extremely high.

Order-0 adaptive modeling using arithmetic coding could be usefully applied today to situations requiring extremely low memory consumption. The compression ratios might not be as good as those of more sophisticated models, but the memory consumption is minimized.

References

The June 1987 issue of Communications of the ACM provides the definitive overview of arithmetic coding. Most of the article is reprinted in the book Text Compression, by Timothy C. Bell, John G. Cleary, and Ian H. Witten. This book provides an excellent overview of both statistical and dictionary-based compression techniques.

Bell, Timothy C., John G. Cleary, and Ian H. Witten. Text Compression. Englewood Cliffs, N.J.: Prentice Hall, 1990.

Nelson, Mark. "LZW Data Compression." Dr. Dobb's Journal (October, 1989).

Storer, J.A. Data Compression. Rockville, Md.: Computer Science Press, 1988.

Witten, Ian H., Radford M. Neal, and John G. Cleary. "Arithmetic Coding for Data Compression." Communications of the ACM (June, 1987).

[LISTING ONE]

/*
 * model.h
 *
 * This file contains all of the function prototypes and
 * external variable declarations needed to interface with
 * the modeling code found in model-1.c or model-2.c.
 */

/*
 * External variable declarations.
 */
extern int max_order;
extern int flushing_enabled;
/*
 * Prototypes for routines that can be called from MODEL-X.C
 */
void initialize_model( void );
void update_model( int symbol );
int convert_int_to_symbol( int symbol, SYMBOL *s );
void get_symbol_scale( SYMBOL *s );
int convert_symbol_to_int( int count, SYMBOL *s );
void add_character_to_model( int c );
void flush_model( void );



[LISTING TWO]

/*
 * comp-2.c
 *
 * This module is the driver program for a variable order
 * finite context compression program.  The maximum order is
 * determined by command line option.  This particular version
 * also monitors compression ratios, and flushes the model whenever
 * the local (last 256 symbols) compression ratio hits 90% or higher.
 *
 * To build this program:
 *
 * Turbo C:     tcc -w -mc comp-2.c model-2.c bitio.c coder.c
 * QuickC:      qcl /AC /W3 comp-2.c model-2.c bitio.c coder.c
 * Zortech:     ztc -mc comp-2.c model-2.c bitio.c coder.c
 * *NIX:        cc -o comp-2 comp-2.c model-2.c bitio.c coder.c
 *
 * Command line options:
 *
 *  -f text_file_name  [defaults to test.inp]
 *  -c compressed_file_name [defaults to test.cmp]
 *  -o order [defaults to 3 for model-2]
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "coder.h"
#include "model.h"
#include "bitio.h"

/*
 * The file pointers are used throughout this module.
 */
FILE *text_file;
FILE *compressed_file;

/*
 * Declarations for local procedures.
 */
void initialize_options( int argc, char **argv );
int check_compression( void );
void print_compression( void );

/*
 * The main procedure is similar to the main found in COMP-1.C.
 * It has to initialize the coder, the bit oriented I/O, the
 * standard I/O, and the model.  It then sits in a loop reading
 * input symbols and encoding them.  One difference is that every
 * 256 symbols a compression check is performed.  If the compression
 * ratio exceeds 90%, a flush character is encoded.  This flushes
 * the encoding model, and will cause the decoder to flush its model
 * when the file is being expanded.  The second difference is that
 * each symbol is repeatedly encoded until a successful encoding
 * occurs.  When trying to encode a character in a particular order,
 * the model may have to transmit an ESCAPE character.  If this
 * is the case, the character has to be retransmitted using a lower
 * order.  This process repeats until a successful match is found for
 * the symbol in a particular context.  Usually this means going down
 * no further than the order -1 model.  However, the FLUSH and DONE
 * symbols do drop back to the order -2 model.  Note also that by
 * all rights, add_character_to_model() and update_model() logically
 * should be combined into a single routine.
 */
void main( int argc, char **argv )
{
    SYMBOL s;
    int c;
    int escaped;
    int flush = 0;
    long int text_count = 0;

    initialize_options( --argc, ++argv );
    initialize_model();
    initialize_output_bitstream();
    initialize_arithmetic_encoder();
    for ( ; ; )
    {
        if ( ( ++text_count & 0x0ff ) == 0 )
            flush = check_compression();
        if ( !flush )
            c = getc( text_file );
        else
            c = FLUSH;
        if ( c == EOF )
            c = DONE;
        do {
            escaped = convert_int_to_symbol( c, &s );
            encode_symbol( compressed_file, &s );
        } while ( escaped );
        if ( c == DONE )
            break;
        if ( c == FLUSH )
        {
            flush_model();
            flush = 0;
        }
        update_model( c );
        add_character_to_model( c );
    }
    flush_arithmetic_encoder( compressed_file );
    flush_output_bitstream( compressed_file );
    print_compression();
    fputc( '\n', stderr );
    exit( 0 );
}

/*
 * This routine checks for command line options, and opens the
 * input and output files.  The only other command line option
 * besides the input and output file names is the order of the model,
 * which defaults to 3.
 */
void initialize_options( int argc, char **argv )
{
    char text_file_name[ 81 ];
    char compressed_file_name[ 81 ];

    strcpy( compressed_file_name, "test.cmp" );
    strcpy( text_file_name, "test.inp" );
    while ( argc-- > 0 )
    {
        if ( strcmp( *argv, "-f" ) == 0 )
        {
            argc--;
            strcpy( text_file_name, *++argv );
        }
        else if ( strcmp( *argv, "-c" ) == 0 )
        {
            argc--;
            strcpy( compressed_file_name, *++argv );
        }
        else if ( strcmp( *argv, "-o" ) == 0 )
        {
            argc--;
            max_order = atoi( *++argv );
        }
        else
        {
            fprintf( stderr, "\nUsage: COMP-2 [-o order] " );
            fprintf( stderr, "[-f text file] [-c compressed file]\n" );
            exit( -1 );
        }
        argc--;
        argv++;
    }
    text_file = fopen( text_file_name, "rb" );
    compressed_file = fopen( compressed_file_name, "wb" );
    if ( text_file == NULL || compressed_file == NULL )
    {
        printf( "Had trouble opening one of the files!\n" );
        exit( -1 );
    }
    setvbuf( text_file, NULL, _IOFBF, 4096 );
    setbuf( stdout, NULL );
    printf( "Compressing %s to %s, order %d.\n",
            text_file_name,
            compressed_file_name,
            max_order );
}

/*
 * This routine is called to print the current compression ratio.
 * It prints out the number of input bytes, the number of output bytes,
 * and the bits per byte compression ratio.   This is done both as a
 * pacifier and as a seat-of-the-pants diagnostic.  A better version
 * of this routine would also print the local compression ratio.
 */
void print_compression()
{
    long total_input_bytes;
    long total_output_bytes;

    total_input_bytes  =  ftell( text_file );
    total_output_bytes = bit_ftell_output( compressed_file );
    if ( total_output_bytes == 0 )
        total_output_bytes = 1;

    fprintf( stderr,"%ld/%ld, %2.3f\r",
        total_input_bytes,
        total_output_bytes,
             8.0 * total_output_bytes / total_input_bytes );
}

/*
 * This routine is called once every 256 input symbols.  Its job is to
 * check to see if the compression ratio hits or exceeds 90%.  If the
 * output size is 90% of the input size, it means not much compression
 * is taking place, so we probably ought to flush the statistics in the
 * model to allow for more current statistics to have greater impact.
 * This heuristic approach does seem to have some effect.
 */
int check_compression()
{
    static long local_input_marker = 0L;
    static long local_output_marker = 0L;
    long total_input_bytes;
    long total_output_bytes;
    int local_ratio;

    print_compression();
    total_input_bytes  =  ftell( text_file ) - local_input_marker;
    total_output_bytes = bit_ftell_output( compressed_file );
    total_output_bytes -= local_output_marker;
    if ( total_output_bytes == 0 )
        total_output_bytes = 1;
    local_ratio = (int)( ( total_output_bytes * 100 ) / total_input_bytes );

    local_input_marker = ftell( text_file );
    local_output_marker = bit_ftell_output( compressed_file );

    if ( local_ratio > 90 && flushing_enabled )
    {
        fprintf( stderr, "Flushing... \r" );
        return( 1 );
    }
    return( 0 );
}



[LISTING THREE]

/*
 * expand-2.c
 *
 * This module is the driver program for a variable-order finite
 * context expansion program.  The maximum order is determined by
 * command line option.  This particular version can respond to
 * the FLUSH code inserted in the bit stream by the compressor.
 *
 * To build this program:
 *
 * Turbo C:     tcc -w -mc expand-2.c model-2.c bitio.c coder.c
 * QuickC:      qcl /W3 /AC expand-2.c model-2.c bitio.c coder.c
 * Zortech:     ztc -mc expand-2.c model-2.c bitio.c coder.c
 * *NIX:        cc -o expand-2 expand-2.c model-2.c bitio.c coder.c
 *
 *
 * Command line options:
 *
 *  -f text_file_name  [defaults to test.inp]
 *  -c compressed_file_name [defaults to test.cmp]
 *  -o order [defaults to 3 for model-2]
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "coder.h"
#include "model.h"
#include "bitio.h"

/*
 * Declarations for local procedures.
 */
void initialize_options( int argc, char **argv );
void print_compression( void );

/*
 * Variables used throughout this particular file.
 */
FILE *text_file;
FILE *compressed_file;

/*
 * The main loop for expansion is very similar to the expansion routine
 * used in the simpler compression program, EXPAND-1.C.  The routine
 * first has to initialize the standard i/o, the bit oriented i/o,
 * the arithmetic coder, and the model.  The decompression loop
 * differs in a couple of respects.  First of all, it handles the
 * special ESCAPE characters, removing them from the input
 * bit stream but otherwise just throwing them away.  Secondly,
 * it handles the special FLUSH character.  Once the main decoding
 * loop is done, the cleanup code is called, and the program exits.
 */
void main( int argc, char *argv[] )
{
    SYMBOL s;
    int c;
    int count;
    long int counter = 0;

    initialize_options( --argc, ++argv );
    initialize_model();
    initialize_input_bitstream();
    initialize_arithmetic_decoder( compressed_file );
    for ( ; ; )
    {
        do
        {
            get_symbol_scale( &s );
            count = get_current_count( &s );
            c = convert_symbol_to_int( count, &s );
            remove_symbol_from_stream( compressed_file, &s );
        } while ( c == ESCAPE );
        if ( c == DONE )
            break;
        if ( ( ++counter & 0xff ) == 0 )
            print_compression();
        if ( c != FLUSH )
            putc( (char) c, text_file );
        else
        {
            fprintf( stderr, "\rFlushing...      \r" );
            flush_model();
        }
        update_model( c );
        add_character_to_model( c );
    }
    print_compression();
    fputc( '\n', stderr );
    exit( 0 );
}

/*
 * This routine checks for command line options, and opens the
 * input and output files.  The only other command line option
 * besides the input and output file names is the order of the model,
 * which defaults to 3.
 */
void initialize_options( int argc, char **argv )
{
    char text_file_name[ 81 ];
    char compressed_file_name[ 81 ];

    strcpy( compressed_file_name, "test.cmp" );
    strcpy( text_file_name, "test.out" );
    while ( argc-- > 0 )
    {
        if ( strcmp( *argv, "-f" ) == 0 )
        {
            argc--;
            strcpy( text_file_name, *++argv );
        }
        else if ( strcmp( *argv, "-c" ) == 0 )
        {
            argc--;
            strcpy( compressed_file_name, *++argv );
        }
        else if ( strcmp( *argv, "-o" ) == 0 )
        {
            argc--;
            max_order = atoi( *++argv );
        }
        else
        {
            fprintf( stderr, "\nUsage: EXPAND-2 [-o order] " );
            fprintf( stderr, "[-f text file] [-c compressed file]\n" );
            exit( -1 );
        }
        argc--;
        argv++;
    }
    text_file = fopen( text_file_name, "wb" );
    compressed_file = fopen( compressed_file_name, "rb" );
    setvbuf( text_file, NULL, _IOFBF, 4096 );
    setbuf( stdout, NULL );
    printf( "Decoding %s to %s, order %d.\n",
            compressed_file_name ,
            text_file_name,
            max_order );
}

/*
 * This routine is called to print the current compression ratio.
 * It prints out the number of input bytes, the number of output bytes,
 * and the bits per byte compression ratio.   This is done both as a
 * pacifier and as a seat-of-the-pants diagnostic.  A better version
 * of this routine would also print the local compression ratio.
 */
void print_compression()
{
    long input_bytes;
    long output_bytes;

    output_bytes = ftell( text_file );
    input_bytes = bit_ftell_input( compressed_file );
    if ( output_bytes == 0 )
        output_bytes = 1;
    fprintf( stderr,
             "\r%ld/%ld, %2.3f    ",
        input_bytes,
        output_bytes,
             8.0 * input_bytes / output_bytes );
}


[LISTING FOUR]

/*
 * model-2.c
 *
 * This module contains all of the modeling functions used with
 * comp-2.c and expand-2.c.  This modeling unit keeps track of
 * all contexts from 0 up to max_order, which defaults to 3.
 * In addition, there is a special context -1 which is a fixed model
 * used to encode previously unseen characters, and a context -2
 * which is used to encode EOF and FLUSH messages.
 *
 * Each context is stored in a special CONTEXT structure, which is
 * documented below.  Context tables are not created until the
 * context is seen, and they are never destroyed.
 *
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "coder.h"
#include "model.h"

/*
 * This program consumes massive amounts of memory.  One way to
 * handle large amounts of memory is to use Zortech's __handle
 * pointer type.  So that my code will run with other compilers
 * as well, the handle stuff gets redefined when using other
 * compilers.
 */
#ifdef __ZTC__
#include <handle.h>
#else
#define __handle
#define handle_calloc( a )        calloc( (a), 1 )
#define handle_realloc( a, b )    realloc( (a), (b) )
#define handle_free( a )          free( (a) )
#endif

/* A context table contains a list of the counts for all symbols
 * that have been seen in the defined context.  For example, a
 * context of "Zor" might have only had 2 different characters
 * appear.  't' might have appeared 10 times, and 'l' might have
 * appeared once.  These two counts are stored in the context
 * table.  The counts are stored in the STATS structure.  All of
 * the counts for a given context are stored in an array of STATS.
 * As new characters are added to a particular context, the STATS
 * array will grow.  Sometimes, the STATS array will shrink
 * after flushing the model.
 */
typedef struct {
                unsigned char symbol;
                unsigned char counts;
               } STATS;

/*
 * Each context has to have links to higher order contexts.  These
 * links are used to navigate through the context tables.  For example,
 * to find the context table for "ABC", I start at the order 0 table,
 * then find the pointer to the "A" context table by looking through
 * the LINKS array.  At that table, we find the "B" link and go to
 * that table.  The process continues until the destination table is
 * found.  The table pointed to by the LINKS array corresponds to the
 * symbol found at the same offset in the STATS table.  The reason that
 * LINKS is in a separate structure instead of being combined with
 * STATS is to save space.  The leaf context nodes don't need
 * next pointers, since they are in the highest order context.  In the
 * leaf nodes, the LINKS pointer is simply NULL.
 */
typedef struct {
                 struct context *next;
               } LINKS;

/*
 * The CONTEXT structure holds all of the known information about
 * a particular context.  The links and stats pointers are discussed
 * immediately above here.  The max_index element gives the maximum
 * index that can be applied to the stats or link array.  When the
 * table is first created, and stats is set to NULL, max_index is set
 * to -1.  As soon as a single element is added to stats, max_index is
 * incremented to 0.
 *
 * The lesser context pointer is a navigational aid.  It points to
 * the context that is one less than the current order.  For example,
 * if the current context is "ABC", the lesser_context pointer will
 * point to "BC".  The reason for maintaining this pointer is that
 * this particular bit of table searching is done frequently, but
 * the pointer only needs to be built once, when the context is
 * created.
 */
typedef struct context {
                         int max_index;

                         LINKS __handle *links;
                         STATS __handle *stats;
                         struct context *lesser_context;
                       } CONTEXT;
/*
 * max_order is the maximum order that will be maintained by this
 * program.  EXPAND-2 and COMP-2 both will modify this int based
 * on command line parameters.
 */
int max_order=3;
/*
 * *contexts[] is an array of current contexts.  If I want to find
 * the order 0 context for the current state of the model, I just
 * look at contexts[0].  This array of context pointers is set up
 * every time the model is updated.
 */
CONTEXT **contexts;
/*
 * current_order contains the current order of the model.  It starts
 * at max_order, and is decremented every time an ESCAPE is sent.  It
 * will only go down to -1 for normal symbols, but can go to -2 for
 * EOF and FLUSH.
 */
int current_order;
/*
 * This variable tells COMP-2.C that the FLUSH symbol can be
 * sent using this model.
 */
int flushing_enabled=1;
/*
 * This table contains the cumulative totals for the current context.
 * Because this program is using exclusion, totals has to be calculated
 * every time a context is used.  The scoreboard array keeps track of
 * symbols that have appeared in higher order models, so that they
 * can be excluded from lower order context total calculations.
 */
short int totals[ 258 ];
char scoreboard[ 256 ];

/*
 * Local procedure declarations.
 */
void error_exit( char *message );
void update_table( CONTEXT *table, unsigned char symbol );
void rescale_table( CONTEXT *table );
void totalize_table( CONTEXT *table );
CONTEXT *shift_to_next_context( CONTEXT *table, unsigned char c, int order);
CONTEXT *allocate_next_order_table( CONTEXT *table,
                                    unsigned char symbol,
                                    CONTEXT *lesser_context );

/*
 * This routine has to get everything set up properly so that
 * the model can be maintained properly.  The first step is to create
 * the *contexts[] array used later to find current context tables.
 * The *contexts[] array indices go from -2 up to max_order, so
 * the table needs to be fiddled with a little.  This routine then
 * has to create the special order -2 and order -1 tables by hand,
 * since they aren't quite like other tables.  Then the current
 * context is set to \0, \0, \0, ... and the appropriate tables
 * are built to support that context.  The current order is set
 * to max_order, the scoreboard is cleared, and the system is
 * ready to go.
 */

void initialize_model()
{
    int i;
    CONTEXT *null_table;
    CONTEXT *control_table;

    current_order = max_order;
    contexts = (CONTEXT **) calloc( sizeof( CONTEXT * ), 10 );
    if ( contexts == NULL )
        error_exit( "Failure #1: allocating context table!" );
    contexts += 2;
    null_table = (CONTEXT *) calloc( sizeof( CONTEXT ), 1 );
    if ( null_table == NULL )
        error_exit( "Failure #2: allocating null table!" );
    null_table->max_index = -1;
    contexts[ -1 ] = null_table;
    for ( i = 0 ; i <= max_order ; i++ )
        contexts[ i ] = allocate_next_order_table( contexts[ i-1 ],
                                               0,
                                               contexts[ i-1 ] );
    handle_free( (char __handle *) null_table->stats );
    null_table->stats =
         (STATS __handle *) handle_calloc( sizeof( STATS ) * 256 );
    if ( null_table->stats == NULL )
        error_exit( "Failure #3: allocating null table!" );
    null_table->max_index = 255;
    for ( i=0 ; i < 256 ; i++ )
    {
        null_table->stats[ i ].symbol = (unsigned char) i;
        null_table->stats[ i ].counts = 1;
    }

    control_table = (CONTEXT *) calloc( sizeof(CONTEXT), 1 );
    if ( control_table == NULL )
        error_exit( "Failure #4: allocating null table!" );
    control_table->stats =
         (STATS __handle *) handle_calloc( sizeof( STATS ) * 2 );
    if ( control_table->stats == NULL )
        error_exit( "Failure #5: allocating null table!" );
    contexts[ -2 ] = control_table;
    control_table->max_index = 1;
    control_table->stats[ 0 ].symbol = -FLUSH;
    control_table->stats[ 0 ].counts = 1;
    control_table->stats[ 1 ].symbol = -DONE;
    control_table->stats[ 1 ].counts = 1;

    for ( i = 0 ; i < 256 ; i++ )
        scoreboard[ i ] = 0;
}
/*
 * This is a utility routine used to create new tables when a new
 * context is created.  It gets a pointer to the current context,
 * and gets the symbol that needs to be added to it.  It also needs
 * a pointer to the lesser context for the table that is to be
 * created.  For example, if the current context was "ABC", and the
 * symbol 'D' was read in, add_character_to_model would need to
 * create the new context "BCD".  This routine would get called
 * with a pointer to "BC", the symbol 'D', and a pointer to context
 * "CD".  This routine then creates a new table for "BCD", adds it
 * to the link table for "BC", and gives "BCD" a back pointer to
 * "CD".  Note that finding the lesser context is a difficult
 * task, and isn't done here.  This routine mainly worries about
 * modifying the stats and links fields in the current context.
 */

CONTEXT *allocate_next_order_table( CONTEXT *table,
                                    unsigned char symbol,
                                    CONTEXT *lesser_context )
{
    CONTEXT *new_table;
    int i;
    unsigned int new_size;

    for ( i = 0 ; i <= table->max_index ; i++ )
        if ( table->stats[ i ].symbol == symbol )
            break;
    if ( i > table->max_index )
    {
        table->max_index++;
        new_size = sizeof( LINKS );
        new_size *= table->max_index + 1;
        if ( table->links == NULL )
            table->links = (LINKS __handle *) handle_calloc( new_size );
        else
            table->links = (LINKS __handle *)
                 handle_realloc( (char __handle *) table->links, new_size );
        new_size = sizeof( STATS );
        new_size *= table->max_index + 1;
        if ( table->stats == NULL )
            table->stats = (STATS __handle *) handle_calloc( new_size );
        else
            table->stats = (STATS __handle *)
                handle_realloc( (char __handle *) table->stats, new_size );
        if ( table->links == NULL )
            error_exit( "Failure #6: allocating new table" );
        if ( table->stats == NULL )
            error_exit( "Failure #7: allocating new table" );
        table->stats[ i ].symbol = symbol;
        table->stats[ i ].counts = 0;
    }
    new_table = (CONTEXT *) calloc( sizeof( CONTEXT ), 1 );
    if ( new_table == NULL )
        error_exit( "Failure #8: allocating new table" );
    new_table->max_index = -1;
    table->links[ i ].next = new_table;
    new_table->lesser_context = lesser_context;
    return( new_table );
}

/*
 * This routine is called to increment the counts for the current
 * contexts.  It is called after a character has been encoded or
 * decoded.  All it does is call update_table for each of the
 * current contexts, which does the work of incrementing the count.
 * This particular version of update_model() practices update exclusion,
 * which means that if lower order models weren't used to encode
 * or decode the character, they don't get their counts updated.
 * This seems to improve compression performance quite a bit.
 * To disable update exclusion, the loop would be changed to run
 * from 0 to max_order, instead of current_order to max_order.
 */
void update_model( int symbol )
{
    int i;
    int local_order;

    if ( current_order < 0 )
        local_order = 0;
    else
        local_order = current_order;
    if ( symbol >= 0 )
    {
        while ( local_order <= max_order )
        {
            if ( symbol >= 0 )
                update_table( contexts[ local_order ], (unsigned char) symbol );
            local_order++;
        }
    }
    current_order = max_order;
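/*
 * The next symbol starts over at the highest order, and the scoreboard
 * is wiped so the exclusion list starts out empty again.
 */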
    for ( i = 0 ; i < 256 ; i++ )
        scoreboard[ i ] = 0;
}
/*
 * This routine is called to update the count for a particular symbol
 * in a particular table.  The table is one of the current contexts,
 * and the symbol is the last symbol encoded or decoded.  In principle
 * this is a fairly simple routine, but a couple of complications make
 * things a little messier.  First of all, the given table may not
 * already have the symbol defined in its statistics table.  If it
 * doesn't, the stats table has to grow and have the new guy added
 * to it.  Secondly, the symbols are kept in sorted order by count
 * in the table so that the table can be trimmed during the flush
 * operation.  When this symbol is incremented, it might have to be moved
 * up to reflect its new rank.  Finally, since the counters are only
 * bytes, if the count reaches 255, the table absolutely must be rescaled
 * to get the counts back down to a reasonable level.
 */

void update_table( CONTEXT *table, unsigned char symbol )
{
    int i;
    int index;
    unsigned char temp;
    CONTEXT *temp_ptr;
    unsigned int new_size;
/*
 * First, find the symbol in the appropriate context table.  The first
 * symbol in the table is the most active, so start there.
 */
    index = 0;
    while ( index <= table->max_index &&
            table->stats[index].symbol != symbol )
        index++;
    if ( index > table->max_index )
    {
        table->max_index++;
        new_size = sizeof( LINKS );
        new_size *= table->max_index + 1;
        if ( current_order < max_order )
        {
            if ( table->max_index == 0 )
                table->links = (LINKS __handle *) handle_calloc( new_size );
            else
                table->links = (LINKS __handle *)
                   handle_realloc( (char __handle *) table->links, new_size );
            if ( table->links == NULL )
                error_exit( "Error #9: reallocating table space!" );
            table->links[ index ].next = NULL;
        }
        new_size = sizeof( STATS );
        new_size *= table->max_index + 1;
        if ( table->max_index == 0 )
            table->stats = (STATS __handle *) handle_calloc( new_size );
        else
            table->stats = (STATS __handle *)
                handle_realloc( (char __handle *) table->stats, new_size );
        if ( table->stats == NULL )
            error_exit( "Error #10: reallocating table space!" );
        table->stats[ index ].symbol = symbol;
        table->stats[ index ].counts = 0;
    }
/*
 * Now I move the symbol up ahead of any entries that share its count,
 * so that the table stays sorted by count.
 */
    i = index;
    while ( i > 0 &&
            table->stats[ index ].counts == table->stats[ i-1 ].counts )
        i--;
    if ( i != index )
    {
        temp = table->stats[ index ].symbol;
        table->stats[ index ].symbol = table->stats[ i ].symbol;
        table->stats[ i ].symbol = temp;
        if ( table->links != NULL )
        {
            temp_ptr = table->links[ index ].next;
            table->links[ index ].next = table->links[ i ].next;
            table->links[ i ].next = temp_ptr;
        }
        index = i;
    }
/*
 * The switch has been performed, now I can update the counts
 */
    table->stats[ index ].counts++;
    if ( table->stats[ index ].counts == 255 )
        rescale_table( table );
}

/*
 * This routine is called when a given symbol needs to be encoded.
 * It is the job of this routine to find the symbol in the context
 * table associated with the current order, and return the low and
 * high counts associated with that symbol, as well as the scale.
 * Finding the table is simple.  Unfortunately, once I find the table,
 * I have to build the table of cumulative counts, which is
 * expensive, and is done elsewhere.  If the symbol is found in the
 * table, the appropriate counts are returned.  If the symbol is
 * not found, the ESCAPE symbol probabilities are returned, and
 * the current order is reduced.  Note also the kludge to support
 * the order -2 character set, which consists of negative numbers
 * instead of unsigned chars.  This ensures that no match will ever
 * be found for the DONE or FLUSH symbols in any "normal" table.
 */
int convert_int_to_symbol( int c, SYMBOL *s )
{
    int i;
    CONTEXT *table;

    table = contexts[ current_order ];
    totalize_table( table );
    s->scale = totals[ 0 ];
    if ( current_order == -2 )
        c = -c;
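/*
 * Scan the current context for the symbol.  A count of zero means the
 * symbol is present in the table but holds no part of the range, so it
 * has to be escaped just as if it were missing.
 */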
    for ( i = 0 ; i <= table->max_index ; i++ )
    {
        if ( c == (int) table->stats[ i ].symbol )
        {
            if ( table->stats[ i ].counts == 0 )
                break;
            s->low_count = totals[ i+2 ];
            s->high_count = totals[ i+1 ];
            return( 0 );
        }
    }
    s->low_count = totals[ 1 ];
    s->high_count = totals[ 0 ];
    current_order--;
    return( 1 );
}
/*
 * This routine is called when decoding an arithmetic number.  In
 * order to decode the present symbol, the current scale in the
 * model must be determined.  This requires looking up the current
 * table, then building the totals table.  Once that is done, the
 * cumulative total table has the symbol scale at element 0.
 */
void get_symbol_scale( SYMBOL *s )
{
    CONTEXT *table;

    table = contexts[ current_order ];
    totalize_table( table );
    s->scale = totals[ 0 ];
}

/*
 * This routine is called during decoding.  It is given a count that
 * came out of the arithmetic decoder, and has to find the symbol that
 * matches the count.  The cumulative totals are already stored in the
 * totals[] table, from the call to get_symbol_scale, so this routine
 * just has to look through that table.  Once the match is found,
 * the appropriate character is returned to the caller.  Two possible
 * complications.  First, the character might be the ESCAPE character,
 * in which case the current_order has to be decremented.  The other
 * complication is that the order might be -2, in which case we return
 * the negative of the symbol so it isn't confused with a normal
 * symbol.
 */
int convert_symbol_to_int( int count, SYMBOL *s)
{
    int c;
    CONTEXT *table;

    table = contexts[ current_order ];
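/*
 * Walk down the cumulative totals built by get_symbol_scale() until the
 * decoder's count falls inside a symbol's range.  Stopping at c == 1
 * means the count fell in the ESCAPE range; otherwise the match is
 * stats[ c-2 ].
 */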
    for ( c = 0; count < totals[ c ] ; c++ )
        ;
    s->high_count = totals[ c-1 ];
    s->low_count = totals[ c ];
    if ( c == 1 )
    {
        current_order--;
        return( ESCAPE );
    }
    if ( current_order < -1 )
        return( (int) -table->stats[ c-2 ].symbol );
    else
        return( table->stats[ c-2 ].symbol );
}


/*
 * After the model has been updated for a new character, this routine
 * is called to "shift" into the new context.  For example, if the
 * last context was "ABC", and the symbol 'D' had just been processed,
 * this routine would want to update the context pointers so that
 * contexts[1]=="D", contexts[2]=="CD" and contexts[3]=="BCD".  The
 * potential problem is that some of these tables may not exist.
 * The way this is handled is by the shift_to_next_context routine.
 * It is passed a pointer to the "ABC" context, along with the symbol
 * 'D', and its job is to return a pointer to "BCD".  Once we have
 * "BCD", we can follow the lesser context pointers in order to get
 * the pointers to "CD" and "C".  The hard work was done in
 * shift_to_next_context().
 */
void add_character_to_model( int c )
{
    int i;
    if ( max_order < 0 || c < 0 )
       return;
    contexts[ max_order ] =
       shift_to_next_context( contexts[ max_order ],
                              (unsigned char) c, max_order );
    for ( i = max_order-1 ; i > 0 ; i-- )
        contexts[ i ] = contexts[ i+1 ]->lesser_context;
}

/*
 * This routine is called when adding a new character to the model. From
 * the previous example, if the current context was "ABC", and the new
 * symbol was 'D', this routine would get called with a pointer to
 * context table "ABC", and symbol 'D', with order max_order.  What this
 * routine needs to do then is to find the context table "BCD".  This
 * should be an easy job, and it is if the table already exists.  All
 * we have to do in that case is follow the back pointer from "ABC" to "BC".
 * We then search the link table of "BC" until we find the link to 'D'.
 * That link points to "BCD", and that value is then returned to the
 * caller.  The problem crops up when "BC" doesn't have a pointer to
 * "BCD".  This generally means that the "BCD" context has not appeared
 * yet.  When this happens, it means a new table has to be created and
 * added to the "BC" table.  That can be done with a single call to
 * the allocate_next_order_table routine.  The only problem is that the
 * allocate_next_order_table routine wants to know what the lesser context for
 * the new table is going to be.  In other words, when I create "BCD",
 * I need to know where "CD" is located.  In order to find "CD", I
 * have to recursively call shift_to_next_context, passing it a pointer
 * to context "C" and the symbol 'D'.  It then returns a pointer to
 * "CD", which I use to create the "BCD" table.  The recursion is guaranteed
 * to end if it ever gets to order -1, because the null table is
 * guaranteed to have a link for every symbol to the order 0 table.  This is
 * the most complicated part of the modeling program, but it is
 * necessary for performance reasons.
 */
CONTEXT *shift_to_next_context( CONTEXT *table, unsigned char c, int order)
{
    int i;
    CONTEXT *new_lesser;
/*
 * First, try to find the new context by backing up to the lesser
 * context and searching its link table.  If I find the link, we take
 * a quick and easy exit, returning the link.  Note that there is a
 * special kludge for context order 0.  We know for a fact that
 * the lesser context pointer at order 0 points to the null table,
 * order -1, and we know that the -1 table only has a single link
 * pointer, which points back to the order 0 table.
 */
    table = table->lesser_context;
    if ( order == 0 )
        return( table->links[ 0 ].next );
    for ( i = 0 ; i <= table->max_index ; i++ )
        if ( table->stats[ i ].symbol == c )
            if ( table->links[ i ].next != NULL )
                return( table->links[ i ].next );
            else
                break;
/*
 * If I get here, it means the new context did not exist.  I have to
 * create the new context, add a link to it here, and add the backwards
 * link to *his* previous context.  Creating the table and adding it to
 * this table is pretty easy, but adding the back pointer isn't.  Since
 * creating the new back pointer isn't easy, I duck my responsibility
 * and recurse to myself in order to pick it up.
 */
    new_lesser = shift_to_next_context( table, c, order-1 );
/*
 * Now that I have the back pointer for this table, I can make a call
 * to a utility to allocate the new table.
 */
    table = allocate_next_order_table( table, c, new_lesser );
    return( table );
}

/*
 * Rescaling the table needs to be done for one of three reasons.
 * First, if the maximum count for the table has exceeded 16383, it
 * means that arithmetic coding using 16 and 32 bit registers might
 * no longer work.  Secondly, if an individual symbol count has
 * reached 255, it will no longer fit in a byte.  Third, if the
 * current model isn't compressing well, the compressor program may
 * want to rescale all tables in order to give more weight to newer
 * statistics.  All this routine does is divide each count by 2.
 * If any counts drop to 0, the counters can be removed from the
 * stats table, but only if this is a leaf context.  Otherwise, we
 * might cut a link to a higher order table.
 */
void rescale_table( CONTEXT *table )
{
    int i;

    if ( table->max_index == -1 )
        return;
    for ( i = 0 ; i <= table->max_index ; i++ )
        table->stats[ i ].counts /= 2;
    if ( table->stats[ table->max_index ].counts == 0 &&
         table->links == NULL )
    {
        while ( table->max_index >= 0 &&
                table->stats[ table->max_index ].counts == 0 )
            table->max_index--;
        if ( table->max_index == -1 )
        {
            handle_free( (char __handle *) table->stats );
            table->stats = NULL;
        }
        else
        {
            table->stats = (STATS __handle *)
                handle_realloc( (char __handle *) table->stats,
                                 sizeof( STATS ) * ( table->max_index + 1 ) );
            if ( table->stats == NULL )
                error_exit( "Error #11: reallocating stats space!" );
        }
    }
}

/*
 * This routine has the job of creating a cumulative totals table for
 * a given context.  The cumulative low and high for symbol c are going to
 * be stored in totals[c+2] and totals[c+1].  Locations 0 and 1 are
 * reserved for the special ESCAPE symbol.  The ESCAPE symbol
 * count is calculated dynamically, and changes based on what the
 * current context looks like.  Note also that this routine ignores
 * any counts for symbols that have already shown up in the scoreboard,
 * and it adds all new symbols found here to the scoreboard.  This
 * allows us to exclude counts of symbols that have already appeared in
 * higher order contexts, improving compression quite a bit.
 */
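/*
 * As a concrete example, suppose a context holds 'A' with a count of 5
 * and 'B' with a count of 3, and nothing is excluded.  The escape count
 * works out to 1, and totals[] ends up as { 9, 8, 3, 0 }.  'A' is then
 * coded with low = totals[ 2 ] = 3 and high = totals[ 1 ] = 8, 'B' with
 * low = 0 and high = 3, and ESCAPE with low = 8 and high = 9, all
 * against a scale of totals[ 0 ] = 9.
 */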
void totalize_table( CONTEXT *table )
{
    int i;
    unsigned char max;

    for ( ; ; )
    {
        max = 0;
        i = table->max_index + 2;
        totals[ i ] = 0;
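/*
 * Build the cumulative totals from the last entry down to element 1,
 * skipping any symbol the scoreboard says was already counted in a
 * higher order context.  The order -2 control table ignores the
 * scoreboard.
 */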
        for ( ; i > 1 ; i-- )
        {
            totals[ i-1 ] = totals[ i ];
            if ( table->stats[ i-2 ].counts )
                if ( ( current_order == -2 ) ||
                     scoreboard[ table->stats[ i-2 ].symbol ] == 0 )
                     totals[ i-1 ] += table->stats[ i-2 ].counts;
            if ( table->stats[ i-2 ].counts > max )
                max = table->stats[ i-2 ].counts;
        }
/*
 * Here is where the escape calculation needs to take place.
 */
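/*
 * The escape count is a heuristic: roughly
 * ( 256 - max_index ) * max_index / ( 256 * max ) + 1, so a context
 * full of low-count symbols gives the escape a larger share of the
 * range, while a context dominated by one heavily used symbol gives it
 * less.  Adding totals[ 1 ] leaves totals[ 0 ] holding the scale for
 * the whole context, escape included.
 */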
        if ( max == 0 )
            totals[ 0 ] = 1;
        else
        {
            totals[ 0 ] = (short int) ( 256 - table->max_index );
            totals[ 0 ] *= table->max_index;
            totals[ 0 ] /= 256;
            totals[ 0 ] /= max;
            totals[ 0 ]++;
            totals[ 0 ] += totals[ 1 ];
        }
        if ( totals[ 0 ] < MAXIMUM_SCALE )
            break;
        rescale_table( table );
    }
    for ( i = 0 ; i < table->max_index ; i++ )
        if ( table->stats[ i ].counts != 0 )
            scoreboard[ table->stats[ i ].symbol ] = 1;
}

/*
 * This routine is called when the entire model is to be flushed.
 * This is done in an attempt to improve the compression ratio by
 * giving greater weight to upcoming statistics.  This routine
 * starts at the given table, and recursively calls itself to
 * rescale every table in its list of links.  The table itself
 * is then rescaled.
 */
void recursive_flush( CONTEXT *table )
{
    int i;

    if ( table->links != NULL )
        for ( i = 0 ; i <= table->max_index ; i++ )
            if ( table->links[ i ].next != NULL )
                recursive_flush( table->links[ i ].next );
    rescale_table( table );
}

/*
 * This routine is called to flush the whole table, which it does
 * by calling the recursive flush routine starting at the order 0
 * table.
 */
void flush_model()
{
    recursive_flush( contexts[ 0 ] );
}

void error_exit( char *message)
{
    putc( '\n', stdout );
    puts( message );
    exit( -1 );
}


Copyright © 1991, Dr. Dobb's Journal

