Data Compression with Arithmetic Encoding


Arithmetic coding is a common technique in both lossless and lossy data compression. It is an entropy encoding method, in which frequently seen symbols are encoded with fewer bits than rarely seen symbols. It has some advantages over well-known techniques such as Huffman coding. This article describes the CACM87 implementation of arithmetic coding in detail, giving you a good understanding of all the details needed to implement it.

On a historical note, this is an update of an article I wrote more than 20 years ago. That article was published in the print edition of Dr. Dobb's Journal, which meant that a lot of editing was done in order to avoid excessive page count. In particular, that Dr. Dobb's piece combined two topics: a description of arithmetic coding along with a discussion of compression using Prediction by Partial Matching (PPM).

Because space considerations are no longer a limiting factor on the Web, I hope to do justice to the fascinating details of arithmetic coding here. PPM, a worthy topic of its own, will be discussed in a later article. This new effort may be long, but I hope it is the thorough explanation of the subject I wanted to give in 1991.

I think the best way to understand arithmetic coding is to break it into two parts, and I'll use that idea in this article. First, I give a description of how arithmetic coding works, using regular floating-point arithmetic implemented using standard C++ data types. This approach allows for a completely understandable, but slightly impractical, implementation. In other words, it works, but it can only be used to encode very short messages.

The second section of the article describes an implementation in which we switch to doing a special type of math on unbounded binary numbers. This is a somewhat mind-boggling topic in itself, so it helps if you already understand arithmetic coding: you don't have to get hung up trying to learn two things at once.

To wrap up, I present working sample code written in modern C++. It isn't the most optimized code in the world, but it is portable and easy to add to your existing projects. It should be perfect for learning and experimenting with this coding technique.

Fundamentals

The first thing to understand about arithmetic coding is what it produces. Arithmetic coding takes a message (often a file) composed of symbols (nearly always eight-bit characters), and converts it to a floating-point number greater than or equal to zero and less than one. This floating-point number can be quite long — effectively your entire output file is one long number — which means it is not a normal data type that you are accustomed to using in conventional programming languages. My implementation of the algorithm will have to create this floating-point number from scratch, bit by bit, and likewise read it in and decode it bit by bit. This encoding process is done incrementally. As each character in a file is encoded, a few bits will be added to the encoded message, so it is built up over time as the algorithm proceeds.

The second thing to understand about arithmetic coding is that it relies on a model to characterize the symbols it is processing. The job of the model is to tell the encoder what the probability of a character is in a given message. If the model gives an accurate probability of the characters in the message, they will be encoded very close to optimally. If the model misrepresents the probabilities of symbols, the encoder may actually expand a message instead of compressing it!
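
To make this concrete before diving into the encoder, here is a rough sketch of the kind of interface such a model might expose. The two function names match those used in the sample code later in this article; the struct name Model and the exact signatures are my choice for this exposition:

// A hypothetical model interface (std::pair comes from <utility>).
struct Model {
  // Returns the half-open interval [low, high) that symbol c owns
  // on the number line, with 0 <= low < high <= 1.
  static std::pair<double,double> getProbability( char c );
  // Inverts the mapping: returns the symbol whose interval
  // contains the value d, where 0 <= d < 1.
  static char getSymbol( double d );
};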

Encoding with Floating-Point Math

The term arithmetic coding covers two separate processes: encoding messages and decoding them. I'll start by looking at the encoding process with sample C++ code that implements the algorithm in a very limited form using the C++ double type. The code in this first section is only useful for exposition; that is, don't try to do any real compression with it.

To perform arithmetic encoding, we first need to define a proper model. Remember that the function of the model is to provide probabilities of a given character in a message. The conceptual idea of an arithmetic coding model is that each symbol will own its own unique segment of the number line of real numbers between 0 and 1. It's important to note that there are many different ways to model character probabilities. Some models are static, never changing. Others are updated after every character is processed. The only two things that matter to us are that the model attempts to accurately predict the probability a character will appear, and that the encoder and decoder have identical models at all times.

As an example, we can start with an encoder that can encode only an alphabet of 100 different characters. In a simple static model, we will start with capital letters, then move to the lower case letters. This means that the first symbol, 'A', will own the number line from 0 to .01, 'B' will own .01 to .02, and so on. (In all cases, this is strictly a half-closed interval, so the probability range for 'A' is actually >= 0 and < .01.)

With this model, my encoder can represent the single letter 'B' by outputting a floating-point number that is less than .02 and greater than or equal to .01. So for example, an arithmetic encoder that wanted to create that single letter could output .15 and be done.

Obviously, an encoder that just outputs single characters is not much use. To encode a string of symbols involves a slightly more complicated process. In this process, the first character defines a range of the number line that corresponds to the section assigned to it by the model. For the character 'B', that means the message is between .01 and .02.

The next character in the message then further divides that existing range proportionate to its current ownership of the number line. So some other letter that owns the very end of the number line, from .99 to 1.0, would change the range from [.01,.02) to [.0199, .020). This progressive subdividing of the range is just simple multiplication and addition, and is best understood with a simple code sample. My first pass in C++, which is far from a working encoder, might look like this:

double high = 1.0;   // upper end of the current interval
double low = 0.0;    // lower end of the current interval
char c;
while ( input >> c ) {
  std::pair<double,double> p = model.getProbability(c);
  double range = high - low;
  // narrow [low,high) to the slice of it that c owns
  high = low + range * p.second;
  low = low + range * p.first;
}
output << low + (high-low)/2;  // emit a value from the final interval

After the entire message has been processed, we have a final range, [low,high). The encoder outputs a floating-point number right in the center of that range.

Examining the Floating-Point Prototype

The first pass encoder is demonstrated in the attached project as fp_proto.cpp. To get it working, I also needed to define a simple model. In this case, I've created a model that can encode 100 characters, with each having a fixed probability of .01, starting with 'A' in the first position. To keep things simple, I've only fleshed the class out enough to encode the capital letters from the ASCII character set:

struct {
  // Each capital letter owns a fixed .01-wide slice of the number
  // line: 'A' owns [0.00,0.01), 'B' owns [0.01,0.02), and so on.
  static std::pair<double,double> getProbability( char c )
  {
    if (c >= 'A' && c <= 'Z')
      return std::make_pair( (c - 'A') * .01, (c - 'A') * .01 + .01);
    else
      throw "character out of range";
  }
} model;

So in this probability model, 'A' owns the range from 0.0 to 0.01, 'B' from .01 to .02, 'C' from .02 to .03, and so on. (Note that this is not an accurate or effective model, but its simplicity is useful at this point.) For a representative example, I called this encoder with the string "WXYZ". Let's walk through what happens in the encoder:

We start with high and low set to 1.0 and 0.0. The encoder calls the model to get the probabilities for letter 'W', which returns the interval [0.22, 0.23) — the range along the probability line that 'W' owns in this model. If you step over the next two lines, you'll see that low is now set to 0.22, and high is set to 0.23.

If you examine how this works, you'll see that as each character is encoded, the range between high and low becomes narrower and narrower, but high will always be greater than low. Additionally, the value of low never decreases, and the value of high never increases. These invariants are important in getting the algorithm to work properly.

So after the first character is encoded, we know that no matter what other values are encoded, the final number in the message will be less than .23 and greater than or equal to .22. Both low and high will be greater than or equal to .22 and less than .23, and low will be strictly less than high. This means that when decoding, we will be able to determine that the first character is 'W' no matter what happens after this, because the final encoded number will fall into the range owned by 'W'. The narrowing process is roughly shown in Figure 1:

Figure 1: Narrowing process.
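
One way to convince yourself of those invariants is to instrument the encoding loop with assertions. This is my own illustrative variant of the loop, not part of fp_proto.cpp, and it assumes <cassert> is included:

double high = 1.0;
double low = 0.0;
char c;
while ( input >> c ) {
  std::pair<double,double> p = model.getProbability(c);
  double range = high - low;
  double new_high = low + range * p.second;
  double new_low = low + range * p.first;
  assert( new_low >= low );      // low never decreases
  assert( new_high <= high );    // high never increases
  assert( new_low < new_high );  // the interval never collapses
  high = new_high;
  low = new_low;
}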

Let's see how this narrowing works when we process the second character, 'X'. The model returns a range of [.23, .24) for this character, and the subsequent recalculation of high and low results in an interval of [.2223, .2224). So high and low are still inside the original range of [.22, .23), but the interval has narrowed.

After the final two characters are included, the output looks like this:

Encoded message: 0.2223242550
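
For reference, here is the interval after each character of "WXYZ" is encoded; you can verify these values by stepping through the loop by hand:

Character    low           high
(start)      0.0           1.0
'W'          0.22          0.23
'X'          0.2223        0.2224
'Y'          0.222324      0.222325
'Z'          0.22232425    0.22232426

The printed value, 0.222324255, is the midpoint of that final interval.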

I'll discuss how the exact value we want to output needs to be chosen, but in theory at least (for this particular message), any floating-point number in the interval [0.22232425, 0.22232426) should properly decode to the desired message.

Decoding With Floating-Point Math

I find the encoding algorithm to be very intuitive. The decoder reverses the process, and is no more complicated, but the steps might not seem quite as obvious. A first-pass algorithm at decoding this message would look something like this:

void decode(double message)
{
  double high = 1.0;
  double low = 0.0;
  for ( ; ; )
  {
    double range = high - low;
    // find the symbol whose interval contains the scaled message value
    char c = model.getSymbol((message - low)/range);
    std::cout << c;
    if ( c == 'Z' )   // 'Z' is the hardcoded end-of-message marker
      return;
    // narrow the interval exactly as the encoder did
    std::pair<double,double> p = model.getProbability(c);
    high = low + range * p.second;
    low = low + range * p.first;
  }
}

The math in the decoder basically reverses the math from the encode side. To decode a character, the probability model just has to find the character whose range covers the current value of the message. When the decoder first starts up with the sample value of 0.222324255, the model sees that the value falls within the interval owned by 'W', [0.22, 0.23), so the model returns 'W'. In fp_proto.cpp, the decoder portion of the simple model looks like this:

static char getSymbol( double d)
{
  // Invert the model's mapping: each .01-wide slice maps back to one
  // letter. The upper bound is 0.26 because this simple model only
  // handles 'A' through 'Z'.
  if ( d >= 0.0 && d < 0.26)
    return 'A' + static_cast<int>(d*100);
  else
    throw "message out of range";
}

In the encoder, we continually narrow the range of the output value as each character is processed. In the decoder, we do the same narrowing of the portion of the message we are inspecting for the next character. After the 'W' is decoded, high and low will define an interval of [0.22, 0.23), with a range of .01. So the formula that calculates the next probability value to be decoded, (message - low)/range, yields .2324255, which lands right in the middle of the range covered by 'X'.
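
Carrying the same calculation through all four steps of the sample message, 0.222324255, produces:

(0.222324255 - 0.0)      / 1.0      = 0.222324255  ->  'W'
(0.222324255 - 0.22)     / 0.01     = 0.2324255    ->  'X'
(0.222324255 - 0.2223)   / 0.0001   = 0.24255      ->  'Y'
(0.222324255 - 0.222324) / 0.000001 = 0.255        ->  'Z'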

This narrowing continues as the characters are decoded, until the hardcoded end-of-message character, 'Z', is reached. Success!
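
Putting the pieces together, here is my own self-contained reconstruction in the spirit of fp_proto.cpp. The attached project remains the reference, and details there may differ, but this sketch compiles on its own and round-trips the sample message:

#include <iostream>
#include <iomanip>
#include <sstream>
#include <utility>

struct {
  static std::pair<double,double> getProbability( char c )
  {
    if ( c >= 'A' && c <= 'Z' )
      return std::make_pair( (c - 'A') * .01, (c - 'A') * .01 + .01 );
    else
      throw "character out of range";
  }
  static char getSymbol( double d )
  {
    if ( d >= 0.0 && d < 0.26 )
      return 'A' + static_cast<int>( d * 100 );
    else
      throw "message out of range";
  }
} model;

double encode( std::istream &input )
{
  double high = 1.0;
  double low = 0.0;
  char c;
  while ( input >> c ) {
    std::pair<double,double> p = model.getProbability( c );
    double range = high - low;
    high = low + range * p.second;
    low = low + range * p.first;
  }
  return low + ( high - low ) / 2;  // midpoint of the final interval
}

void decode( double message )
{
  double high = 1.0;
  double low = 0.0;
  for ( ; ; ) {
    double range = high - low;
    char c = model.getSymbol( ( message - low ) / range );
    std::cout << c;
    if ( c == 'Z' )  // hardcoded end-of-message marker
      return;
    std::pair<double,double> p = model.getProbability( c );
    high = low + range * p.second;
    low = low + range * p.first;
  }
}

int main()
{
  std::istringstream input( "WXYZ" );
  double message = encode( input );
  std::cout << "Encoded message: " << std::setprecision( 10 ) << message << "\n";
  decode( message );  // should print WXYZ
  std::cout << "\n";
  return 0;
}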

