 Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.

# Data Compression with Arithmetic Encoding

### Notes and Summary So Far

The algorithm demonstrated to this point can be tested with `fp_proto.cpp` in the attached project. You need to be very aware that we have seen so far is not really a valid or useful encoder. It does, however, demonstrate with some accuracy the general flow of the algorithm. The key observations so far are:

• Characters are encoded according to their position in the probability range [0, 1).
• We keep track of the current state using variables `high` and `low`. A valid output result will be any number in the range [`low`,`high`).
• As each character is encoded, it compresses the range [`low`,`high`) in a way that corresponds exactly to its position in the probability range.
• We have a few invariants that result from the math used in the algorithm. The value of `low` never decreases, the value of `high` never increases, and `low` is always less than `high`.

The big problem with this demonstration algorithm is that it depends on C++ `double` values to hold the message state. Floating-point variables have limited precision. This means that eventually, as the range between `high` and `low` continues to narrow, the distance between them becomes too small to represent with floating-point variables. With the model used here, you can encode 10 characters or so, but after that, the algorithm won't work.

The rest of this article will present a fully functional algorithm. Conceptually, it will look very much like the code already presented. The big difference is that the variables used in the encoder and decoder, `high`, `low`, and `message`, are no longer going to be of the C++ type `double`. Instead, they will be arbitrarily long binary variables.

For our program to handle binary numbers of arbitrary length, they will be processed a bit at a time — with bits being read in, calculations being performed, and then bits being output and then discarded as they are no longer needed. The details of how this is done are where all of the work appears in this algorithm.

In addition to those modifications, the rest of this article will cover a few other points that have not been dealt with yet, including:

• How to deal with the end of the stream.
• How to choose the actual output number from the range` [low,high)`.
• Why arithmetic coding is superior to Huffman coding as an entropy coder.

### Unbounded Precision Math with Integers

In this, the second part of the exposition, you are going to have to wrap your head around some unconventional mathematics. The algorithm that was described in the first part of this article is still going to be faithfully executed, but it will no longer be implemented using variables of type `double`. It will instead be implemented with integer variables and integer math, albeit in interesting ways.

The basic concept that we implement is this: The values of `high`, `low`, and the actual encoded message are going to be binary numbers of unbounded length. In other words, by the time we finish encoding the Project Gutenberg copy of Moby Dick, `high` and `low` will be millions of bits long, and the output value itself will be millions of bits long. All three will still represent a number greater than or equal to 0, and less than 1.

An even more interesting facet of this is that even though the three numbers in play are millions of bits long, each time we process a character, we will only do a few operations of simple integer math; 32 bits or perhaps 64 if you like.

### Number Representation

Recall that in the reference version of the album, `low` and `high` were initialized like this:

```  double high = 1.0;
double low = 0.0;
```

In the integer version of the algorithm, we switch to a representation like this:

```  unsigned int high = 0xFFFFFFFFU;
unsigned int low = 0;
```

Both numbers have an implied decimal point leading their values, which would mean that `high` is actually (in hex) 0.FFFFFFFF or (in binary) 0.1111...1111, and `low` is 0.0. The number that we output will likewise have an implied decimal point before the first bit.

But this is not quite right — in the first implementation `high` was 1.0. The value 0.FFFFFFF is close, but it is just a bit less than 1.0. How is this dealt with?

This is where a bit of mind-bending math comes into play. Although `high` has 32 bits in memory, we consider it to have an infinitely long trail of binary 1s trailing off the right end. So it isn't just 0.FFFFFFFF, that string of Fs (or 1s) continues off into infinity — they are out there, but haven't been shifted into memory yet.

And while it may not be obvious, Google can help convince you that an infinite string of of 1s starting with 0.1 is actually equal to 1. (The short story: 2x - x = 1.0, therefore x = 1.0.) Likewise, `low` is considered to be an infinitely long binary string, with 0s hanging off the end out to the last binary place.

### Resulting Changes to the Model

Remember that the final implementation is going to be entirely implemented using integer math. Previously, the model returned probabilities as a pair of floating-point numbers representing the range that a particular symbol owns.

In the updated version of the algorithm, a symbol still owns a specific range of on the number line greater than equal to 0 and less than 1. But we are now going to represent these using a pair of fractions: `upper`/`denom` and `lower`/`denom`. This doesn't actually affect our model code too much. The sample model we used in the previous section returned, for example, .22 and .23 for character 'W'. Now, instead, it will return {22, 23, 100} in a structure called `prob`.

### The Real Encoder: First Pass

Before getting into the final working code, I'm presenting some code that implements the basic algorithm, but takes a couple of shortcuts around real problems. This not-quite-done code looks like this:

```  unsigned int high = 0xFFFFFFFFU;
unsigned int low = 0;
char c;
while ( input >> c ) {
int range = high - low + 1;
prob p = model.getProbability(c);
high = low + (range * p.upper)/p.denominator;
low = low + (range * p.lower)/p.denominator;
for ( ; ; ) {
if ( high < 0x80000000U )
output_bit( 0 );
else if ( low >= 0x80000000U )
output_bit( 1 );
else
break;
low <<= 1;
high << = 1;
high |= 1;
}
}
```

The first part of this loop is conceptually and functionally identical to the floating-point code given in part 1. We take the range — that is, the difference between and `low` and `high` — and we allocate some subset of that range to the character being encoded, depending on the probability returned from the model. The result is that `low` gets a little larger, and `high` gets a little smaller.

The second part of the code is a little more interesting. The `while` loop is new, and what it does is new as well — the simplified floating-point algorithm didn't have anything like this. It performs range checks on `low` and `high`, looking for situations in which the values have the same most-significant bit (MSB).

The first check looks to see whether `high` has dropped below 0x80000000, in which case its MSB is 0. Because we know that `low` is always less than high, its MSB will also be 0. And because the two values only get closer to one another, the MSB of both values will forever be 0.

The other range check looks to see if `low` has increased above 0x7FFFFFFF, in which case, both it and `high` will have MSB values of 1, and will always have an MSB of 1.

In either of these cases, remember that we have three invariants to work with: `high` only decreases, `low` only increases, and `high` is always greater than `low`. So once `high` has a leading bit of 0, it will never change. Once `low` has a leading bit of 1, it will never change. If this is the case, we can output that bit to the output stream — we know what it is with 100% certainty, so let's shift it out to the output stream and get rid of it.

And after we output the bit, we discard it. Shifting `low` and `high` one bit to the left discards that MSB. We shift in a 1 into the least-significant bit of `high`, and a 0 into the least significant of `low`. Thus, we keep working on 32 bits of precision by expelling the bits that no longer contribute anything to the precision of our calculations. In this particular implementation, we just keep 32 bits in working registers, with some additional number already sent to the output, and some other number pending for input.

### More Insights To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.

## Featured Reports ## Featured Whitepapers 