Channels ▼
RSS

Dr. Dobb's Data Compression Newsletter


DDJ Data Compression Newsletter Issue #19

Welcome to this issue of the DDJ Data Compression Newsletter.


Welcome to this issue of the DDJ Data Compression Newsletter. Last month's newsletter featured a short puzzle, based on a compression scheme I received recently from a clever reader. Those of you who didn't read last month's newsletter can catch up on the puzzle by going to the archived copy on the DDJ site:

http://www.ddj.com/maillists/compression/0105cm/0105cm001.htm

In essence, the puzzle proposes a scheme for infinitely recursive compression based on the notion that Huffman coding will always be defeated by using the same model with an arithmetic encoder.

The puzzle is clearly based on a fallacy, because we know that an infinite cycle of compression is impossible — you can't endlessly recycle a file through different compressors and get a continuously smaller output file.

The trick is exposing the flaw in the theory, and I'm happy to say that the readers of the newsletter and posters to comp.compression were pretty good at smoking out the problems.


The first *big* problem in the theory is the premise that an arithmetic coder will always defeat a Huffman coder in a compression battle. It's very close to the truth, but not quite true.

The more precise way to state this is to say that when you are coding a symbol using an entropy coder, the arithmetic coder will come closer to using the exact number of bits specified by the probability you have passed it.

In practice, this means that if you compress a complete file using a strict order-0 static model, the file encoded using the arithmetic coder will nearly always be smaller. At worst, it should tie.

However, if you instead choose to use an adaptive model, all bets are off. Sometimes the arithmetic coder will create a smaller output file, other times it won't. I have a short program below that demonstrates the problem pretty effectively.

The program below assumes that you have a two-symbol alphabet, consisting of ones and zeros. This is a worst-case scenario for a Huffman coder, since it has to use a minimum of one bit to encode every symbol. In a situation like this, it has to use a single bit to encode every symbol, meaning it can't possibly compress any file!

//
// AvsH.cpp
//
#include <cmath>
#include <iostream>

using namespace std;

char get_data()
{
  static char *pData = "1111101011110011111";
  return *pData++;
}

main()
{
  int ones = 1;
  int zeros = 1;
  double total_a = 0;
  double total_h = 0;
  for ( ; ; ) {
    double p;
    int c = get_data();
    if ( c == '\0' )
      break;
    else if ( c == '1' ) {
      p = (double) ones / (ones+zeros);
      ones++;
    } else {
      p = (double) zeros / (ones+zeros);
      zeros++;
    }
    total_h += 1.0;
    total_a += -log(p)/log(2);
  }
  cout << "Input chars  = " << (ones + zeros - 2) << "\n";
  cout << "Huffman bits = " << total_h << "\n";
  cout << "Arithmetic   = " << total_a << "\n";
  return 0;
}

When run with the input data as shown above, the program produces the following output:

Input chars  = 19
Huffman bits = 19
Arithmetic   = 16.2423

However, if I change the input stream to "010101010101010101", I get the following output:

Input chars  = 18
Huffman bits = 18
Arithmetic   = 19.8172

In this case, the arithmetic coder lost to the Huffman coder, which is in a no-win situation to begin with!


So what does this mean? We're passing good probabilities to the encoder in the adaptive situation, but the arithmetic coder fails to give an improvement.

The answer is simple: in the adaptive situation, we're passing changing guesses about the probability of the symbols. Those probabilities are constantly changing, and are only going to approximate the true probability of the stream over the long haul.

The superiority of the arithmetic coder hinges completely on the fact that the modeler passes it accurate probabilities. In the case of the adaptive coder, this just doesn't happen.

This takes the puzzle up one level of difficulty. We know need to find a file X that has probability model M. When compressed with a Huffman coder, the file generates X'. When compressed with an arithmetic coder, it generates X''.

The easy way to defeat this paradox would be to say that there may not be a file X such that encoding with a Huffman coder creates X'. Unfortunately, you are guaranteed that it *is* in fact possible. You can do this by considering X' to be a binary stream of ones and zeros. As the example above showed earlier, after passing through a Huffman coder, this will also produce an identical stream of ones and zeros. So for every file X', there is a Huffman encoding algorithm that creates X'.

Even worse, the two-symbol encoding program will in fact create a smaller file when using an arithmetic encoder. If the Huffman file has an imbalance of ones and zeros such that they don't each have exactly 50 percent probability of appearing, the arithmetic coding program may well create an X'' that is smaller than X'.


I'd love to hear your explanation of how we get out of this mess! As always, your comments are welcome at [email protected].


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.