Jan 02: Algorithm Alley

When it comes to compression, properly handling the end of the bitstream requires modeling what is really happening. David examines one way to do this using arithmetic coding.


January 01, 2002

David is a math teacher at El Paso Community College. He can be contacted at [email protected].


Arithmetic compression and Huffman compression share a problem that does not appear to be fully solved in the literature: how to end the compression bitstream so that it matches the 8-bit binary granularity of modern filesystems.

While at first glance this might seem difficult to solve, in truth it is straightforward if you separate the task into small, easily accomplished parts. Furthermore, solving the problem yields an additional benefit: Proper file endings can make the compression completely "bijective," or one-to-one. (Bijective algorithms with a compressor C and decompressor D have the interesting property that for any file X, C(D(X))=X. This means that not only can any file be compressed using the algorithm, but any file can be decompressed as well. This property is particularly valuable when compressed files are going to be encrypted because it avoids giving a cracker any clues in the form of invalid compressed files.)

Properly handling the end of the bitstream first requires modeling what is really happening. To this end, I'll focus on arithmetic coding. (For more information, see "Arithmetic Coding and Statistical Modeling," by Mark R. Nelson, DDJ, February 1991.)

The first level of abstraction is to see that the compressor is looking at a series of symbols followed by an EOF signal from the operating system. For example, a typical compressor reads in the symbols S0:S1:...:Sn:EOF one at a time and passes them to a probability-modeling module. In turn, the modeler passes each character's probability to the arithmetic encoder, which updates the output with new high and low ranges. The encoder's output for each symbol is either the null symbol N, or a series of symbols drawn from 0, 1, and C (for carry).

The null symbol represents the case where the arithmetic encoder causes less than 1 bit of information to be output. This is what leads to most of the problems with arithmetic compression since it means that more than one symbol can map to a single bit. But the trick is just to use the token N. When you decompress, you won't have a problem because you will be working on the set of N, C, 0, and 1.
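
To see why the null symbol arises, recall that a symbol with probability p carries -log2(p) bits of information; for p greater than 0.5, that is less than one bit. The short program below is my own illustration (it is not part of the aribd.cpp/unaribd.cpp programs mentioned later); it simply prints this quantity for a few probabilities.

    #include <cmath>
    #include <cstdio>

    // A symbol with probability p carries -log2(p) bits. When p > 0.5,
    // that is less than one bit, so the arithmetic encoder may emit
    // nothing at all (the token N) for that symbol.
    int main()
    {
        for (double p : {0.9, 0.6, 0.5, 0.25})
            std::printf("p = %.2f  ->  %.3f bits\n", p, -std::log2(p));
        return 0;
    }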

Now assume you want to change the symbols 01001CC:CC:N:N:011CC:EOF into 1s and 0s. The Ns aren't a problem since they are dropped. The carry bits are output using the conventional rules for arithmetic coding. If a series of Cs is followed by a 1, slide the 1 in front of the Cs and change the Cs to 0s. If a 0 or EOF follows, slide a 0 in front of the Cs and change the Cs to 1s. (This is because you are really mapping to a structure that has an infinite number of zeros at the end.) For this example, you now have 0100101:11:::11101:10000... EOF is replaced by an infinite number of zeroes. (The ":" characters are the original symbol separators.)
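
For illustration, here is a minimal C++ sketch of these sliding rules (my own code, not from the programs accompanying this article). The token stream is held as a string of the characters N, C, 0, and 1, and EOF behaves like a trailing 0.

    #include <string>

    // Resolve carry tokens into plain bits. A 1 after a run of Cs slides
    // in front and turns the Cs into 0s; a 0 (or EOF) slides a 0 in
    // front and turns the Cs into 1s. Ns are simply dropped.
    std::string resolveCarries(const std::string &tokens)
    {
        std::string out;
        int pending = 0;                  // unresolved C tokens so far
        for (char t : tokens) {
            if (t == 'N') continue;       // null symbols emit no bits
            if (t == 'C') { ++pending; continue; }
            out += t;                     // the 0 or 1 slides in front
            out.append(pending, t == '1' ? '0' : '1');
            pending = 0;
        }
        if (pending > 0) {                // EOF acts like a trailing 0
            out += '0';
            out.append(pending, '1');
        }
        return out;
    }
    // resolveCarries("01001CCCCNN011CC") yields "010010111111011";
    // appending the infinite zeros gives the example's mapped stream.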

Now comes the tricky part. Notice that the last symbol is :011CC:. If the last symbol contains a 1 or at least two Cs, then the compression at this point is fully reversible. The problem occurs when the last symbol is of the form :C:, :000:, or :N:. Because the EOF marker has been dropped, the decompressor would have no way of knowing that a hidden trailing :N: or :000: (where "000" can be any number of zeroes) was last.

Working around this requires another fix. Although there are several ways to tackle the problem, I use the following approach in my arithmetic compressor: Line up the symbols in order so that the least-likely symbols are closer to the high end of the probability model and the most-likely symbols are closer to the low end. Then all the symbols have at least one 1 bit in their expansion, except possibly the most-likely symbol.

There are two special cases where the output of the arithmetic coder may not contain any 1 bits. In the first case, the most-probable symbol has a 0.5 or smaller chance of occurring; its encoding is then one or more bits, all of which may be 0s. In the second case, the most-probable symbol occurs with probability greater than 0.5, so its encoding may be the null symbol and the :N: field has a high possibility of being the last token in the first mapping. But notice that the next most-probable symbol's encoding is always 1 bit or longer and will contain a 1 in the final mapping to the infinite file.
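
As a sketch of this ordering (my own illustration with assumed names): with counts {A:6, B:3, C:1}, sorting most-probable-first gives the cumulative intervals A: [0.0, 0.6), B: [0.6, 0.9), and C: [0.9, 1.0). Only A's interval touches 0, so only A can be encoded without producing a 1 bit.

    #include <algorithm>
    #include <vector>

    struct Sym { int ch; unsigned count; };

    // Order the model most-probable-first so that symbol alone sits at
    // the low end of [0,1). Every other symbol's interval then starts
    // above zero, forcing at least one 1 bit into its encoding.
    void orderModel(std::vector<Sym> &model)
    {
        std::sort(model.begin(), model.end(),
                  [](const Sym &a, const Sym &b) { return a.count > b.count; });
    }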

So here's the next trick. If the file being compressed ends with the most-probable symbol (and thereby might lack a 1 bit), you add the next most-probable symbol to the file. If the file ends with a tail of most-probable symbols followed by a series of the second most-probable symbol, you add one more second most-probable symbol to the end of that series. This mapping is reversible so far, and the last symbol mapped has at least one 1 bit set. Again, in the example 0100101:11:::11101:10000..., EOF is replaced by an infinite number of zeroes.
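
One way this padding rule might look in code (a sketch with assumed names, not taken from the article's programs), with the symbol stream held as a string, mps the most-probable symbol, and second the next most probable:

    #include <string>

    // Append one extra next-most-probable symbol when the file ends in
    // a way that might otherwise leave no 1 bit in the final encoding.
    void padTail(std::string &symbols, char mps, char second)
    {
        if (!symbols.empty() && symbols.back() == mps) {
            symbols += second;               // ends in the MPS: add one
        } else if (!symbols.empty() && symbols.back() == second) {
            size_t i = symbols.size();
            while (i > 0 && symbols[i - 1] == second)
                --i;                         // skip the run of 'second'
            if (i > 0 && symbols[i - 1] == mps)
                symbols += second;           // MPS tail plus run: add one
        }
    }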

Since, in this case, the file didn't end in :C:, :N:, or :000:, you don't need an extra symbol at the end. Notice that the ":" characters designated as symbol separators are not really needed because on decompression, you are only looking at 0s and 1s. The example now becomes 0100101111110110000... This is still not a byte-aligned stream, but it is a unique string for any arithmetic compressor that knows the next trick.

If you divide the output stream into bytes with the separator character, it then looks like: 01001011:11110110:00000000:00000000:... Because of the concept of the infinite file, you drop all the trailing :00000000: fields; they are implicitly there. This gives a new stream value of :01001011:11110110:. Finally, this is the compressed output.
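
A sketch of the packing-and-trimming step (again my own code, not from aribd.cpp):

    #include <cstdint>
    #include <string>
    #include <vector>

    // Pack the resolved bit string into bytes, then drop the implicit
    // trailing all-zero bytes (the infinite zeros are understood).
    std::vector<uint8_t> toBytes(const std::string &bits)
    {
        std::vector<uint8_t> out;
        for (size_t i = 0; i < bits.size(); i += 8) {
            uint8_t b = 0;
            for (size_t j = 0; j < 8; ++j)
                b = (b << 1) | (i + j < bits.size() && bits[i + j] == '1');
            out.push_back(b);
        }
        while (!out.empty() && out.back() == 0)
            out.pop_back();              // trailing zero bytes are implicit
        return out;
    }
    // toBytes("010010111111011") == {0x4B, 0xF6}, that is,
    // 01001011:11110110, the compressed output above.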

Now that the output is byte aligned, my decoder restores the input file exactly, as long as it assumes the file ends with an infinite string of zeros.
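
On the decoder side, the infinite-zeros assumption can be as simple as a bit reader that never runs dry. A sketch with assumed names:

    #include <cstdio>

    // Deliver bits from the compressed file; once the real bytes run
    // out, keep delivering 0s to match the infinite-zeros model.
    int nextBit(std::FILE *f)
    {
        static int byte = 0, left = 0;
        if (left == 0) {
            int c = std::getc(f);
            byte = (c == EOF) ? 0 : c;   // past EOF: zero bytes forever
            left = 8;
        }
        --left;
        return (byte >> left) & 1;
    }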

For just a bit more savings, there is one trick left to apply. Notice that with this algorithm you have compressed the file to binary, where any combination of bytes is possible except one that ends in an all-zero byte. Think of the file as containing three types of byte symbols: the all-0 symbol (:00000000:), the leading-1 symbol (:10000000:), and all the rest (type R). If a file ends with the tail of an all-0 symbol (:00000000:) followed by one or more leading-1 symbols (:10000000:), then you can chop off the last symbol.
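
Here is a hedged sketch of how that chop might be coded (my own illustration): skip the trailing run of :10000000: bytes and, if a :00000000: byte immediately precedes the run, drop the final byte. Because the rule is applied whenever the pattern occurs, the decompressor can restore the chopped byte unambiguously.

    #include <cstdint>
    #include <vector>

    // If the byte stream ends with a 0x00 byte followed by one or more
    // 0x80 bytes, chop off the final 0x80; the decompressor re-appends it.
    void chopLastSymbol(std::vector<uint8_t> &bytes)
    {
        size_t i = bytes.size();
        while (i > 0 && bytes[i - 1] == 0x80)
            --i;                         // trailing leading-1 symbols
        if (i < bytes.size() && i > 0 && bytes[i - 1] == 0x00)
            bytes.pop_back();            // chop one :10000000: symbol
    }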

Conclusion

So is getting a standard arithmetic coder to conform to the definition of bijective worth the effort? I think so, for two compelling reasons: First, no bits are wasted at the end of the compressed file, so the coder squeezes out every last bit of compression. Second, because every possible file is a valid compressed file, a cracker attacking an encrypted compressed file gets no clues in the form of invalid compressed files.

To illustrate how you can implement the concepts presented here, I'm providing electronically (see "Resource Center," page 5) a pair of command-line programs — a compressor (aribd.cpp) and decompressor (unaribd.cpp). Instructions on their use are included in the source-code comments.

DDJ
