Welcome to this issue of the DDJ Data Compression Newsletter.
Welcome to this issue of the DDJ Data Compression Newsletter. Last month's
newsletter featured a short puzzle, based on a compression scheme I received
recently from a clever reader. Those of you who didn't read last month's
newsletter can catch up on the puzzle by going to the archived copy on
the DDJ site:
http://www.ddj.com/maillists/compression/0105cm/0105cm001.htm
In essence, the puzzle proposes a scheme for infinitely recursive compression
based on the notion that Huffman coding will always be defeated by using
the same model with an arithmetic encoder.
The puzzle is clearly based on a fallacy, because we know that an infinite
cycle of compression is impossible you can't endlessly recycle
a file through different compressors and get a continuously smaller output
file.
The trick is exposing the flaw in the theory, and I'm happy to say that
the readers of the newsletter and posters to comp.compression were pretty
good at smoking out the problems.
The first *big* problem in the theory is the premise that an arithmetic
coder will always defeat a Huffman coder in a compression battle. It's
very close to the truth, but not quite true.
The more precise way to state this is to say that when you are coding
a symbol using an entropy coder, the arithmetic coder will come closer
to using the exact number of bits specified by the probability you have
passed it.
In practice, this means that if you compress a complete file using a
strict order-0 static model, the file encoded using the arithmetic coder
will nearly always be smaller. At worst, it should tie.
However, if you instead choose to use an adaptive model, all bets are
off. Sometimes the arithmetic coder will create a smaller output file,
other times it won't. I have a short program below that demonstrates the
problem pretty effectively.
The program below assumes that you have a two-symbol alphabet, consisting
of ones and zeros. This is a worst-case scenario for a Huffman coder,
since it has to use a minimum of one bit to encode every symbol. In a
situation like this, it has to use a single bit to encode every symbol,
meaning it can't possibly compress any file!
When run with the input data as shown above, the program produces the
following output:
However, if I change the input stream to "010101010101010101", I get
the following output:
In this case, the arithmetic coder lost to the Huffman coder, which
is in a no-win situation to begin with!
So what does this mean? We're passing good probabilities to the encoder
in the adaptive situation, but the arithmetic coder fails to give an improvement.
The answer is simple: in the adaptive situation, we're passing changing
guesses about the probability of the symbols. Those probabilities are
constantly changing, and are only going to approximate the true probability
of the stream over the long haul.
The superiority of the arithmetic coder hinges completely on the fact
that the modeler passes it accurate probabilities. In the case of the
adaptive coder, this just doesn't happen.
This takes the puzzle up one level of difficulty. We know need to find
a file X that has probability model M. When compressed with a Huffman
coder, the file generates X'. When compressed with an arithmetic coder,
it generates X''.
The easy way to defeat this paradox would be to say that there may not
be a file X such that encoding with a Huffman coder creates X'. Unfortunately,
you are guaranteed that it *is* in fact possible. You can do this by considering
X' to be a binary stream of ones and zeros. As the example above showed
earlier, after passing through a Huffman coder, this will also produce
an identical stream of ones and zeros. So for every file X', there is
a Huffman encoding algorithm that creates X'.
Even worse, the two-symbol encoding program will in fact create a smaller
file when using an arithmetic encoder. If the Huffman file has an imbalance
of ones and zeros such that they don't each have exactly 50 percent probability
of appearing, the arithmetic coding program may well create an X'' that
is smaller than X'.
I'd love to hear your explanation of how we get out of this mess! As
always, your comments are welcome at [email protected].
Terms of Service | Privacy Statement | Copyright © 2024 UBM Tech, All rights reserved.
August 13, 2001
URL:http://www.drdobbs.com/dr-dobbs-data-compression-newsletter/184404644
//
// AvsH.cpp
//
#include
Input chars = 19
Huffman bits = 19
Arithmetic = 16.2423
Input chars = 18
Huffman bits = 18
Arithmetic = 19.8172