Data Compression with Arithmetic Encoding


Notes and Summary So Far

The algorithm demonstrated to this point can be tested with fp_proto.cpp in the attached project. You need to be very aware that what we have seen so far is not really a valid or useful encoder. It does, however, demonstrate with some accuracy the general flow of the algorithm; a sketch of the encoding loop follows the list below. The key observations so far are:

  • Characters are encoded according to their position in the probability range [0, 1).
  • We keep track of the current state using variables high and low. A valid output result will be any number in the range [low,high).
  • As each character is encoded, it compresses the range [low,high) in a way that corresponds exactly to its position in the probability range.
  • We have a few invariants that result from the math used in the algorithm. The value of low never decreases, the value of high never increases, and low is always less than high.
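To make these observations concrete, here is a minimal sketch of the floating-point encoding loop in the spirit of fp_proto.cpp. This is my own illustration, not the code from the attached project; in particular, the flat getProbability() model is an assumption made to keep the sketch self-contained.

  #include <iostream>
  #include <sstream>
  #include <utility>

  // Toy model for illustration only: each byte value c owns the slice
  // [c/256.0, (c+1)/256.0) of the probability range.
  std::pair<double, double> getProbability( unsigned char c )
  {
      return { c / 256.0, ( c + 1 ) / 256.0 };
  }

  int main()
  {
      std::istringstream input( "WXY" );
      double high = 1.0;
      double low = 0.0;
      char c;
      while ( input.get( c ) ) {
          double range = high - low;
          std::pair<double, double> p = getProbability( (unsigned char) c );
          high = low + range * p.second;  // high never increases
          low = low + range * p.first;    // low never decreases
      }
      // Any number in [low, high) is a valid encoding of "WXY".
      std::cout.precision( 17 );
      std::cout << low << " " << high << "\n";
  }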

The big problem with this demonstration algorithm is that it depends on C++ double values to hold the message state. Floating-point variables have limited precision. This means that eventually, as the range between high and low continues to narrow, the distance between them becomes too small to represent with floating-point variables. With the model used here, you can encode 10 characters or so, but after that, the algorithm won't work.
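The collapse is easy to demonstrate. The following sketch (again my own illustration) repeatedly encodes a symbol that owns one percent of the probability range, and reports when the interval between low and high degenerates:

  #include <iostream>

  // Repeatedly encode a symbol owning [0.22, 0.23) and watch the
  // interval collapse once its width falls below the precision of
  // a double in the neighborhood of 0.22.
  int main()
  {
      double high = 1.0;
      double low = 0.0;
      for ( int i = 1; i <= 60; i++ ) {
          double range = high - low;
          high = low + range * 0.23;
          low = low + range * 0.22;
          if ( high <= low ) {  // no representable values left in [low, high)
              std::cout << "precision exhausted after " << i << " symbols\n";
              return 0;
          }
      }
  }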

The rest of this article will present a fully functional algorithm. Conceptually, it will look very much like the code already presented. The big difference is that the variables used in the encoder and decoder, high, low, and message, are no longer going to be of the C++ type double. Instead, they will be arbitrarily long binary variables.

For our program to handle binary numbers of arbitrary length, they will be processed a bit at a time: bits are read in, calculations are performed, and bits are output and then discarded once they are no longer needed. The details of how this is done are where all of the work in this algorithm appears.

In addition to those modifications, the rest of this article will cover a few other points that have not been dealt with yet, including:

  • How to deal with the end of the stream.
  • How to choose the actual output number from the range [low,high).
  • Why arithmetic coding is superior to Huffman coding as an entropy coder.

Unbounded Precision Math with Integers

In this, the second part of the exposition, you are going to have to wrap your head around some unconventional mathematics. The algorithm that was described in the first part of this article is still going to be faithfully executed, but it will no longer be implemented using variables of type double. It will instead be implemented with integer variables and integer math, albeit in interesting ways.

The basic concept that we implement is this: The values of high, low, and the actual encoded message are going to be binary numbers of unbounded length. In other words, by the time we finish encoding the Project Gutenberg copy of Moby Dick, high and low will be millions of bits long, and the output value itself will be millions of bits long. All three will still represent a number greater than or equal to 0, and less than 1.

An even more interesting facet of this is that even though the three numbers in play are millions of bits long, each time we process a character we only do a few operations of simple integer math, using 32 bits of precision (or perhaps 64 if you like).

Number Representation

Recall that in the reference version of the algorithm, low and high were initialized like this:

  double high = 1.0;
  double low = 0.0;

In the integer version of the algorithm, we switch to a representation like this:

  unsigned int high = 0xFFFFFFFFU;
  unsigned int low = 0;

Both numbers have an implied decimal point leading their values, which would mean that high is actually (in hex) 0.FFFFFFFF or (in binary) 0.1111...1111, and low is 0.0. The number that we output will likewise have an implied decimal point before the first bit.

But this is not quite right: in the first implementation, high was 1.0. The value 0.FFFFFFFF is close, but it is just a bit less than 1.0. How is this dealt with?

This is where a bit of mind-bending math comes into play. Although high has 32 bits in memory, we consider it to have an infinite string of binary 1s trailing off its right end. So it isn't just 0.FFFFFFFF; that string of Fs (or 1s) continues off into infinity. The bits are out there, but they haven't been shifted into memory yet.

And while it may not be obvious, Google can help convince you that an infinite string of 1s starting with 0.1 is actually equal to 1. (The short story: if x = 0.111... in binary, then 2x = 1.111..., so 2x - x = 1.0, and therefore x = 1.0.) Likewise, low is considered to be an infinitely long binary string, with 0s hanging off its right end into infinity.

Resulting Changes to the Model

Remember that the final implementation is going to be entirely implemented using integer math. Previously, the model returned probabilities as a pair of floating-point numbers representing the range that a particular symbol owns.

In the updated version of the algorithm, a symbol still owns a specific range on the number line greater than or equal to 0 and less than 1. But we are now going to represent these bounds using a pair of fractions: upper/denom and lower/denom. This doesn't actually affect our model code too much. The sample model we used in the previous section returned, for example, .22 and .23 for the character 'W'. Now, instead, it will return {22, 23, 100} in a structure called prob, sketched below.
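The article hasn't shown the definition of prob yet, but given the member names used by the encoder in the next section, it presumably looks something like this (the exact declaration is my assumption):

  // A probability range expressed as integer fractions: the symbol
  // owns [lower/denominator, upper/denominator) on the number line.
  struct prob {
      unsigned int lower;        // e.g. 22 for 'W'
      unsigned int upper;        // e.g. 23 for 'W'
      unsigned int denominator;  // e.g. 100
  };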

The Real Encoder: First Pass

Before getting into the final working code, I'm presenting some code that implements the basic algorithm, but takes a couple of shortcuts around real problems. This not-quite-done code looks like this:

  unsigned int high = 0xFFFFFFFFU;
  unsigned int low = 0;
  char c;
  while ( input >> c ) {
    // Narrow [low, high] to the slice owned by c. (The multiplication
    // here can overflow 32 bits; that is one of the shortcuts this
    // first pass takes around real problems.)
    int range = high - low + 1;
    prob p = model.getProbability(c);
    high = low + (range * p.upper)/p.denominator;
    low = low + (range * p.lower)/p.denominator;
    // Shift out any leading bits that low and high now agree on.
    for ( ; ; ) {
      if ( high < 0x80000000U )
        output_bit( 0 );        // both MSBs are 0
      else if ( low >= 0x80000000U )
        output_bit( 1 );        // both MSBs are 1
      else
        break;                  // MSBs differ; wait for more input
      low <<= 1;
      high <<= 1;
      high |= 1;                // 1 into high's LSB, 0 into low's
    }
  }

The first part of this loop is conceptually and functionally identical to the floating-point code given in part 1. We take the range, that is, the difference between low and high, and allocate some subset of that range to the character being encoded, depending on the probability returned from the model. The result is that low gets a little larger, and high gets a little smaller.

The second part of the code is a little more interesting. The inner loop is new, and what it does is new as well; the simplified floating-point algorithm didn't have anything like it. It performs range checks on low and high, looking for situations in which the values have the same most-significant bit (MSB).

The first check looks to see whether high has dropped below 0x80000000, in which case its MSB is 0. Because we know that low is always less than high, its MSB will also be 0. And because the two values only get closer to one another, the MSB of both values will forever be 0.

The other range check looks to see if low has increased above 0x7FFFFFFF, in which case, both it and high will have MSB values of 1, and will always have an MSB of 1.

In either of these cases, remember that we have three invariants to work with: high only decreases, low only increases, and high is always greater than low. So once high has a leading bit of 0, it will never change. Once low has a leading bit of 1, it will never change. If this is the case, we can output that bit to the output stream — we know what it is with 100% certainty, so let's shift it out to the output stream and get rid of it.
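A worked example may help. The values here are my own illustration, chosen so that the loop runs a few times:

  // Suppose that after narrowing the range we have:
  //   low  = 0x64000000   (binary 0110 0100 ...)
  //   high = 0x7C000000   (binary 0111 1100 ...)
  // Both MSBs are 0, so we call output_bit(0) and shift:
  //   low  = 0xC8000000
  //   high = 0xF8000001
  // Now both MSBs are 1, so we call output_bit(1) and shift:
  //   low  = 0x90000000
  //   high = 0xF0000003
  // Both MSBs are 1 again, so we call output_bit(1) and shift:
  //   low  = 0x20000000
  //   high = 0xE0000007
  // The MSBs now differ, so the loop breaks and waits for the
  // next character to narrow the range further.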

And after we output the bit, we discard it: shifting low and high one bit to the left discards the MSB. We shift a 1 into the least-significant bit of high, and a 0 into the least-significant bit of low. Thus, we keep working with 32 bits of precision by expelling the bits that no longer contribute anything to our calculations. In this particular implementation, we just keep 32 bits in working registers, with some number of bits already sent to the output, and some number of bits still pending in the input.
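The code above treats output_bit() as a given. It hasn't been defined yet, but a minimal sketch, assuming we simply pack bits into bytes and write them to a std::ostream, might look like this:

  #include <ostream>

  // Hypothetical bit-packing helper behind output_bit(): accumulates
  // bits MSB-first and flushes a byte to the stream every eight bits.
  class BitWriter {
  public:
      explicit BitWriter( std::ostream &out )
          : m_out( out ), m_byte( 0 ), m_count( 0 ) {}
      void output_bit( int bit )
      {
          m_byte = (unsigned char)( ( m_byte << 1 ) | ( bit & 1 ) );
          if ( ++m_count == 8 ) {
              m_out.put( (char) m_byte );
              m_byte = 0;
              m_count = 0;
          }
      }
      void flush()  // pad the final partial byte with zero bits
      {
          while ( m_count != 0 )
              output_bit( 0 );
      }
  private:
      std::ostream &m_out;
      unsigned char m_byte;
      int m_count;
  };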

