Dr. Dobb's Data Compression Newsletter



August 13, 2001
URL: http://www.drdobbs.com/dr-dobbs-data-compression-newsletter/184404644

DDJ Data Compression Newsletter Issue #19

Welcome to this issue of the DDJ Data Compression Newsletter. Last month's newsletter featured a short puzzle, based on a compression scheme I received recently from a clever reader. Those of you who didn't read last month's newsletter can catch up on the puzzle by going to the archived copy on the DDJ site:

http://www.ddj.com/maillists/compression/0105cm/0105cm001.htm

In essence, the puzzle proposes a scheme for infinitely recursive compression, based on the notion that a Huffman coder will always be defeated by an arithmetic coder using the same model.

The puzzle is clearly based on a fallacy, because we know that an infinite cycle of compression is impossible — you can't endlessly recycle a file through different compressors and get a continuously smaller output file.

The trick is exposing the flaw in the theory, and I'm happy to say that the readers of the newsletter and posters to comp.compression were pretty good at smoking out the problems.


The first *big* problem in the theory is the premise that an arithmetic coder will always defeat a Huffman coder in a compression battle. It's very close to the truth, but not quite true.

The more precise way to state this is to say that when you are coding a symbol using an entropy coder, the arithmetic coder will come closer to using the exact number of bits specified by the probability you have passed it.

In practice, this means that if you compress a complete file using a strict order-0 static model, the file encoded using the arithmetic coder will nearly always be smaller. At worst, it should tie.
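
To put some numbers on that claim, here is a minimal sketch of the static case. The three-symbol alphabet, its probabilities, and its Huffman code lengths are made up for the illustration; the program simply compares the whole-bit Huffman lengths against the ideal arithmetic cost of -log2(p) bits per symbol.

//
// StaticModel.cpp
//
#include <iostream>
#include <cmath>

using namespace std;

int main()
{
  // A hypothetical static order-0 model. For these probabilities the
  // Huffman algorithm assigns code lengths of 1, 2, and 2 bits.
  const double prob[]     = { 0.7, 0.2, 0.1 };
  const int    huff_len[] = { 1, 2, 2 };

  double avg_huffman = 0;
  double avg_arith = 0;
  for ( int i = 0 ; i < 3 ; i++ ) {
    avg_huffman += prob[ i ] * huff_len[ i ];
    avg_arith += prob[ i ] * ( -log( prob[ i ] ) / log( 2.0 ) );
  }
  cout << "Huffman    = " << avg_huffman << " bits/symbol\n";
  cout << "Arithmetic = " << avg_arith << " bits/symbol\n";
  return 0;
}

For this model the Huffman coder averages 1.3 bits per symbol, while the arithmetic coder gets by with roughly 1.16: the arithmetic coder can track the fractional -log2(p) costs, and the Huffman coder is stuck with whole-bit code lengths.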

However, if you instead choose to use an adaptive model, all bets are off. Sometimes the arithmetic coder will create a smaller output file, other times it won't. I have a short program below that demonstrates the problem pretty effectively.

The program below assumes that you have a two-symbol alphabet, consisting of ones and zeros. This is a worst-case scenario for a Huffman coder: since it must use a minimum of one bit to encode every symbol, with only two symbols it is forced to spend exactly one bit on each one, meaning it can't possibly compress any file!

//
// AvsH.cpp
//
#include <iostream>
#include <cmath>

using namespace std;

// Return the next character from the test data, or '\0' at the end.
char get_data()
{
  static const char *pData = "1111101011110011111";
  return *pData++;
}

int main()
{
  int ones = 1;        // adaptive counts start at one so neither
  int zeros = 1;       // symbol ever has a zero probability
  double total_a = 0;  // accumulated arithmetic cost, in bits
  double total_h = 0;  // accumulated Huffman cost, in bits
  for ( ; ; ) {
    double p;
    int c = get_data();
    if ( c == '\0' )
      break;
    else if ( c == '1' ) {
      p = (double) ones / (ones+zeros);
      ones++;
    } else {
      p = (double) zeros / (ones+zeros);
      zeros++;
    }
    total_h += 1.0;             // Huffman: exactly one bit per symbol
    total_a += -log(p)/log(2);  // arithmetic: the ideal -log2(p) bits
  }
  cout << "Input chars  = " << (ones + zeros - 2) << "\n";
  cout << "Huffman bits = " << total_h << "\n";
  cout << "Arithmetic   = " << total_a << "\n";
  return 0;
}

When run with the input data as shown above, the program produces the following output:

Input chars  = 19
Huffman bits = 19
Arithmetic   = 16.2423

However, if I change the input stream to "010101010101010101", I get the following output:

Input chars  = 18
Huffman bits = 18
Arithmetic   = 19.8172

In this case, the arithmetic coder lost to the Huffman coder, which is in a no-win situation to begin with!


So what does this mean? We seem to be passing good probabilities to the encoder in the adaptive situation, yet the arithmetic coder fails to deliver an improvement.

The answer is simple: in the adaptive situation, we're passing the encoder guesses about the probability of the symbols. Those guesses change after every symbol, and they only approximate the true probabilities of the stream over the long haul.

The superiority of the arithmetic coder hinges completely on the fact that the modeler passes it accurate probabilities. In the case of the adaptive coder, this just doesn't happen.
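
One way to see how much those shifting estimates cost is to build the model the other way around. The program below is my own two-pass variation on AvsH.cpp, not something from the puzzle: the first pass counts the ones and zeros, and the second pass charges the arithmetic coder -log2(p) bits using those fixed, accurate probabilities.

//
// StaticVsAdaptive.cpp
//
#include <iostream>
#include <cstring>
#include <cmath>

using namespace std;

int main()
{
  static const char *pData = "010101010101010101";
  int length = (int) strlen( pData );

  // Pass 1: count the symbols to build a static order-0 model.
  int ones = 0;
  int zeros = 0;
  for ( int i = 0 ; i < length ; i++ )
    if ( pData[ i ] == '1' )
      ones++;
    else
      zeros++;

  double p_one  = (double) ones  / length;
  double p_zero = (double) zeros / length;

  // Pass 2: charge the ideal arithmetic cost for each symbol.
  double total_a = 0;
  for ( int i = 0 ; i < length ; i++ ) {
    double p = ( pData[ i ] == '1' ) ? p_one : p_zero;
    total_a += -log( p ) / log( 2.0 );
  }
  cout << "Input chars  = " << length  << "\n";
  cout << "Huffman bits = " << length  << "\n";
  cout << "Arithmetic   = " << total_a << "\n";
  return 0;
}

Fed the alternating string, this static model ties the Huffman coder at exactly 18 bits instead of losing to it; fed the original test string from AvsH.cpp, it needs roughly 14 bits.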

This takes the puzzle up one level of difficulty. We now need to find a file X that has probability model M. When compressed with a Huffman coder, the file generates X'. When compressed with an arithmetic coder, it generates X''.

The easy way to defeat this paradox would be to say that there may not be a file X whose Huffman encoding is X'. Unfortunately, you are guaranteed that it *is* in fact possible. Consider X' to be a binary stream of ones and zeros. As the example above showed, a two-symbol Huffman coder spends exactly one bit on each symbol, so running any stream of ones and zeros through it produces an identical stream of ones and zeros. So for every file X', there is a Huffman encoding that creates X'.

Even worse, the two-symbol scheme may well create a smaller file when using an arithmetic encoder. If the Huffman file has an imbalance of ones and zeros, so that the two symbols don't each appear with exactly 50 percent probability, the arithmetic coding program can create an X'' that is smaller than X'.
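
How much smaller depends only on the size of that imbalance. The short sketch below, again my own illustration rather than code from the puzzle, prints the binary entropy H(p), which is the number of bits per input bit a static order-0 arithmetic coder needs when ones appear with probability p:

//
// Imbalance.cpp
//
#include <iostream>
#include <cmath>

using namespace std;

// Binary entropy in bits per symbol: H(p) = -p*log2(p) - (1-p)*log2(1-p)
double h( double p )
{
  if ( p <= 0.0 || p >= 1.0 )
    return 0.0;
  return -p * log( p ) / log( 2.0 )
         - ( 1.0 - p ) * log( 1.0 - p ) / log( 2.0 );
}

int main()
{
  for ( int i = 50 ; i <= 95 ; i += 5 ) {
    double p = i / 100.0;
    cout << "p(1) = " << p << "  bits per input bit = " << h( p ) << "\n";
  }
  return 0;
}

The cost drops below one bit per input bit as soon as the probability moves away from 0.5; at the roughly 79 percent mix of ones in the AvsH.cpp test string, it is down to about 0.74 bits per bit.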


I'd love to hear your explanation of how we get out of this mess! As always, your comments are welcome at [email protected].
