
# Data Compression with Arithmetic Encoding

Figure 2 shows how the math system now works while in the middle of some arbitrary compression run. Even though we are using 32-bit math, the algorithm is now dealing with arbitrarily long versions of `high` and `low`:

Figure 2: How the system now works.

As `low` and `high` converge, their matching digits are shifted out on the left side, presumably to a file. These digits are never going to change, and are no longer needed as part of the calculation. Likewise, both numbers have an infinite number of binary digits that are being shifted in from the right — 1s for `high` and 0s for `low`.
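The shifting-out process described above can be sketched in a few lines. This helper is my own illustration, not code from the article's package: it peels off leading bits while `low` and `high` agree, shifting 0s into `low` and 1s into `high` from the right.

```cpp
#include <cstdint>
#include <vector>

// Shift out the matching leading bits of low and high, collecting them
// as settled output bits. 0s enter low's LSB, 1s enter high's LSB.
std::vector<int> shift_out_matching(uint32_t &low, uint32_t &high)
{
    std::vector<int> settled;
    while ((low & 0x80000000u) == (high & 0x80000000u)) {
        settled.push_back(low >> 31);   // the matching MSB is final
        low <<= 1;                      // a 0 shifts in on the right
        high = (high << 1) | 1;         // a 1 shifts in on the right
    }
    return settled;
}
```

Starting from, say, `low = 0xC0000000` and `high = 0xC8000000`, the helper emits the four settled bits 1, 1, 0, 0 and leaves the two values with differing MSBs again.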

### A Fatal Flaw and the Workaround

In the code just presented, we have a pretty reasonable way of managing the calculation that creates an arbitrarily long number. This implementation is good, but it isn't perfect.

We run into problems with this algorithm when the values of `low` and `high` start converging on a value of 0.5, but don't quite cross over the line that would cause bits to be shifted out. A sequence of calculations that runs into this problem might produce values like these:

```
low=7C99418B high=81A60145
low=7FF8F3E1 high=8003DFFA
low=7FFFFC6F high=80000DF4
low=7FFFFFF6 high=80000001
low=7FFFFFFF high=80000000
```

Our algorithm isn't quite doing what we want here. The numbers are getting closer and closer together, but because neither has crossed over the 0.5 divider, no bits are getting shifted out. This process eventually leaves the values of the two numbers in a disastrous position.

Why is it disastrous? The initial calculation of `range` is done by subtracting `low` from `high`. In this algorithm, `range` stands in as a proxy for the number line. The subsequent calculation is intended to find the proper subsection of that value for a given character. For example, in the earlier sample, we might have wanted to encode a character that has a range of [0.22,0.23). If the value of `range` is 100, then we can clearly see that the character will subdivide that with values of 22 and 23.
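As a quick check of that arithmetic, a one-line helper (hypothetical, for illustration only) reproduces the subdivision step: it scales a symbol's bound into the current range.

```cpp
// Hypothetical helper illustrating the subdivision step: scale a
// symbol's bound (a lower or upper count) into the current range.
unsigned int subdivide(unsigned int low, unsigned int range,
                       unsigned int bound, unsigned int denominator)
{
    return low + (range * bound) / denominator;
}
```

With `low` equal to 0 and `range` equal to 100, the interval [0.22,0.23) scaled by a denominator of 100 gives `subdivide(0, 100, 22, 100)` equal to 22 and `subdivide(0, 100, 23, 100)` equal to 23, matching the example.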

But what happens if `low` and `high` are so close that `range` has a value of just 1? Our algorithm breaks down. It no longer matters what the probability of the next character is: subdividing the range [0,1) produces the same result regardless. And if every character is encoded identically, the decoder has no way to distinguish one character from another.
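A toy demonstration (small numbers, and my own simplified sketch of the update step) makes the breakdown concrete: once `range` is 1, two very different symbols produce identical intervals.

```cpp
// Toy version of the interval update, with a simplified prob struct.
struct prob { unsigned int lower, upper, denominator; };

void update(unsigned int &low, unsigned int &high, prob p)
{
    unsigned int range = high - low + 1;
    high = low + (range * p.upper) / p.denominator;
    low  = low + (range * p.lower) / p.denominator;
}
```

When `low == high == 500` (so `range` is 1), updating with a symbol occupying [0.22,0.23) and updating with one occupying [0.80,0.90) both leave `low` and `high` at 500; the two symbols have become indistinguishable to the decoder.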

Ultimately, this can decay to the point where `low == high`, which breaks one of our key invariants. It's easy to work out what kind of disaster results at that point.

Later on in this article, I'll go over some finer points in the algorithm that show why you run into trouble long before the two values are only 1 apart. We'll come up with specific requirements for the accuracy needed by the value of `range`. Even without those detailed requirements, clearly something needs to be done here.

The fix to the algorithm is shown in the code below. This version of the algorithm still does the normal output of a single bit when `low` goes above 0.5 or `high` drops below it. But when the two values haven't converged, it adds a check of the next most-significant bits to see if we are headed towards the problem of near-convergence. This will be the case when the two most-significant bits of `high` are 10 and the two most-significant bits of `low` are 01. When this is the case, we know that the two values are converging, but we don't yet know what the eventual output bit is going to be.

```
unsigned int high = 0xFFFFFFFFU;
unsigned int low = 0;
int pending_bits = 0;
char c;
while ( input >> c ) {
    unsigned int range = high - low + 1;
    prob p = model.getProbability(c);
    high = low + (range * p.upper)/p.denominator;
    low = low + (range * p.lower)/p.denominator;
    for ( ; ; ) {
        if ( high < 0x80000000U ) {
            output_bit_plus_pending( 0, pending_bits );
            low <<= 1;
            high <<= 1;
            high |= 1;
        } else if ( low >= 0x80000000U ) {
            output_bit_plus_pending( 1, pending_bits );
            low <<= 1;
            high <<= 1;
            high |= 1;
        } else if ( low >= 0x40000000U && high < 0xC0000000U ) {
            pending_bits++;
            low <<= 1;
            low &= 0x7FFFFFFF;
            high <<= 1;
            high |= 0x80000001;
        } else
            break;
    }
}

void output_bit_plus_pending( bool bit, int &pending_bits )
{
    output_bit( bit );
    while ( pending_bits > 0 ) {
        output_bit( !bit );
        pending_bits--;
    }
}
```

So what do we do when we are in this near-converged state? We know that sooner or later, either `high` is going to drop below 0.5 or `low` will go above it. In the first case, both values will have leading bits of 01, and in the second, they will both have leading bits of 10. As the convergence increases, the leading bits will extend to either 01111... or 10000..., for some finite number of digits.

Given this, we know that once we figure out what the first binary digit in the string is going to be, the subsequent bits will all be the opposite. So in this new version of the algorithm, when we are in the near-convergence state, we simply discard the second most-significant bit of `high` and `low`, shifting the remaining bits left, while retaining the MSB. Doing that means also incrementing the `pending_bits` counter to acknowledge that we need to deal with it when convergence finally happens. An example of how this looks when squeezing out the converging bit in `low` is shown in Figure 3. Removing the bit from `high` is basically identical, but of course a 1 is shifted into the least-significant bit (LSB) during the process.

Figure 3: Squeezing out the converging bit.

The bit twiddling that makes this happen can be a bit difficult to follow, but the important thing is that the process has to adhere to the following rules:

1. The low 30 bits of both `low` and `high` are shifted left one position.
2. The LSB of `low` gets a 0 shifted in.
3. The LSB of `high` gets a 1 shifted in.
4. The MSB of both words is unchanged — after the operation, it will still be set to 1 for `high` and 0 for `low`.
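The four rules can be expressed directly as the bit operations used in the encoder listing. These two helpers are my own restatement of those operations, isolated so they can be checked against the rules:

```cpp
#include <cstdint>

// Squeeze out the second most-significant bit of low. The mask keeps
// the MSB at 0, and the shift brings a 0 into the LSB.
uint32_t squeeze_low(uint32_t low)
{
    return (low << 1) & 0x7FFFFFFF;
}

// Squeeze out the second most-significant bit of high. The OR forces
// the MSB back to 1 and shifts a 1 into the LSB.
uint32_t squeeze_high(uint32_t high)
{
    return (high << 1) | 0x80000001;
}
```

For example, `squeeze_low(0x7FFFFFF6)` yields 0x7FFFFFEC (MSB still 0, LSB 0), and `squeeze_high(0x80000001)` yields 0x80000003 (MSB still 1, LSB 1).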

The final change that ties all this together is the introduction of the new function `output_bit_plus_pending()`. Each time we manage this near-convergence process, we know that another bit has been stored away, although we don't yet know whether it is a 1 or a 0. We keep a count of these consecutive bits in `pending_bits`. When we finally reach a situation where an actual MSB can be output, we emit it, followed by all the pending bits that have been stored up. And of course, the pending bits will be the opposite of the bit being output.
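To see the pending-bit mechanism in action, here is a self-contained version, with the bit-oriented output stream replaced by a vector purely for illustration:

```cpp
#include <vector>

std::vector<int> out;  // stand-in for the article's bit-oriented output

void output_bit(int bit) { out.push_back(bit); }

// Emit the resolved bit, then the stored-up opposite bits.
void output_bit_plus_pending(bool bit, int &pending_bits)
{
    output_bit(bit);
    while (pending_bits > 0) {
        output_bit(!bit);
        pending_bits--;
    }
}
```

With three pending bits and a resolved bit of 0, the call emits 0 followed by 1, 1, 1: exactly the 01111... pattern described above.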

This fixed-up version of the code does everything we need to encode properly. The final working C++ code will have a few differences, but they are mostly just tweaks to help with flexibility. The code shown above is more or less the final product.

### The Real Decoder

I've talked about some invariants that exist in this algorithm, but one I have skipped over is this: The values of `high` and `low` that are produced as each character is processed in the encoder will be duplicated in the decoder. These values operate in lock-step right down to the least-significant bit.

A consequence of this is that the code in the decoder ends up looking a lot like the code in the encoder. The manipulation of `high` and `low` is effectively duplicated. And in both the encoder and the decoder, the values of these two variables are manipulated using a calculated `range`, along with the probability for the given character.

The difference between the two comes from how we get the probability. In the encoder, the character is known because we are reading it directly from the file being processed. In the decoder, we have to determine the character by looking at the value of the message we are decoding: where it falls on the [0,1) number line. It is the job of the model to figure this out in the function `getChar()`.

The compressed input is read into a variable named `value`. This variable is another one of our pseudo-infinite variables, like `high` and `low`, with the primary difference being what is shifted into it in the LSB position. Recall that `high` gets an infinite string of 1s shifted in, and `low` gets an infinite string of 0s. `value` gets something completely different: it has the bits from the encoded message shifted into it. So at any given time, `value` contains 32 of the long string of bits that represent the number the encoder created. On the MSB side, bits that are no longer used in the calculation are shifted out of `value`; on the LSB side, as those bits are shifted out, new bits of the message are shifted in.

The resulting code looks like this:

```
unsigned int high = 0xFFFFFFFFU;
unsigned int low = 0;
unsigned int value = 0;
for ( int i = 0 ; i < 32 ; i++ ) {
    value <<= 1;
    value += m_input.get_bit() ? 1 : 0;
}
for ( ; ; ) {
    unsigned int range = high - low + 1;
    unsigned int count = ((value - low + 1) * m_model.getCount() - 1) / range;
    int c;
    prob p = m_model.getChar( count, c );
    if ( c == 256 )
        break;
    m_output.putByte(c);
    high = low + (range * p.high)/p.count - 1;
    low = low + (range * p.low)/p.count;
    for ( ; ; ) {
        if ( low >= 0x80000000U || high < 0x80000000U ) {
            low <<= 1;
            high <<= 1;
            high |= 1;
            value <<= 1;
            value += m_input.get_bit() ? 1 : 0;
        } else if ( low >= 0x40000000U && high < 0xC0000000U ) {
            low <<= 1;
            low &= 0x7FFFFFFF;
            high <<= 1;
            high |= 0x80000001;
            value -= 0x40000000;  // squeeze out the converging bit of value too
            value <<= 1;
            value += m_input.get_bit() ? 1 : 0;
        } else
            break;
    }
}
```

This code looks very similar to the final encoder. The updating of the values is nearly identical; the decoder simply adds `value`, which is updated in tandem with `high` and `low`. The nature of the algorithm introduces another invariant: `value` will always be greater than or equal to `low` and less than `high`.

### A Sample Implementation

All of this is put together in the production code included in ari.zip. The use of templates makes this code very flexible and should make it easy to plug it into your own applications. All the needed code is in header files, so project inclusion is simple.

In this section, I'll discuss the various components that need to be put together in order to actually do some compression. The code package has four programs:

• `fp_proto.cpp`, the floating-point prototype program. Useful for experimentation, but not for real work.
• `compress.cpp`, which compresses a file using command-line arguments.
• `decompress.cpp`, which decompresses a file using command-line arguments.
• `tester.cpp`, which puts a file through a compress/decompress cycle, tests for validity, and outputs the compression ratio.

The compression code is implemented entirely as a set of template classes in header files. These are:

• `compressor.h`, which completely implements the arithmetic compressor. The compressor class is parameterized on input stream type, output stream type, and model type.
• `decompressor.h`, the corresponding arithmetic decompressor with the same type parameters.
• `modelA.h`, a simple order-0 model that does an acceptable job of demonstrating compression.
• `model_metrics.h`, a utility class that helps a model class set up some types used by the compressor and decompressor.
• `bitio.h` and `byteio.h`, the streaming classes that implement bit-oriented I/O.

This article is going to skip over the details pertaining to the bit-oriented I/O classes implemented in `bitio.h` and `byteio.h`. The classes provided will allow you to read or write from `std::iostream` and `FILE *` sources and destinations, and can be pretty easily modified for other types. Details on the implementation of these classes can be found in my article C++ Generic Programming Meets OOP, which includes a bit-oriented I/O class as an illustrative example.

All the code requires a C++11 compiler, but could be modified to work with earlier versions without much trouble. The makefile will build the code with g++ 4.6.3 or clang 3.4. The solution file included will build the projects with Visual C++ 2012.

### model_metrics.h

In the code I've shown you so far, I blithely assume that your architecture uses 32-bit math and efficiently supports unsigned integer math. This is somewhat of an unwanted and unneeded restriction, so my production code will define the math parameters using templates. The way it works is that the compressor and decompressor classes both take a `MODEL` type parameter, and they rely on that type to provide a few typedefs and constants:

• `CODE_VALUE`: the integer type used to perform math. In the sample code shown so far, we assumed this was `unsigned int`, but signed and longer types are perfectly valid.
• `MAX_CODE`: the maximum value of a code; in other words, the highest value that `high` can reach. If we were using all 32 bits of an `unsigned int`, this would be 0xFFFFFFFF, but as we will see shortly, it will normally be quite a bit smaller than the maximum that can fit into an `int` or `unsigned int`.
• `FOURTH`, `ONE_HALF`, `THREE_FOURTHS`: the three values used when testing for convergence. In the sample code, these were set to the full 32-bit values 0x40000000, 0x80000000, and 0xC0000000, but they will generally be smaller. The values could be calculated from `MAX_CODE`, but we let the model define them for convenience.
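These constants hang together in a simple way. The following sketch (my own, not the actual `model_metrics.h` source) derives them from a chosen `CODE_VALUE_BITS`; the value 17 anticipates the discussion of overflow and underflow below.

```cpp
#include <cstdint>

// Sketch of deriving the compressor's constants from CODE_VALUE_BITS.
// Names follow the article; the arithmetic is standard bit manipulation.
typedef uint32_t CODE_VALUE;
const int CODE_VALUE_BITS = 17;  // example value discussed in the article

const CODE_VALUE MAX_CODE      = (CODE_VALUE(1) << CODE_VALUE_BITS) - 1;
const CODE_VALUE FOURTH        = CODE_VALUE(1) << (CODE_VALUE_BITS - 2);
const CODE_VALUE ONE_HALF      = 2 * FOURTH;
const CODE_VALUE THREE_FOURTHS = 3 * FOURTH;
```

With 17 code-value bits, `MAX_CODE` comes out to 0x1FFFF, and the convergence constants to 0x8000, 0x10000, and 0x18000.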

`struct prob` is the structure used to return probability values from the model. It is defined in the model because the model will know what types it wants the three values in this structure to be. The three values are `low`, `high`, and `count`, which define the range of a given character being encoded or decoded.
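Based on that description, the structure looks something like the sketch below. The real header chooses the field types through `model_metrics.h`; here a plain `unsigned int` stands in for illustration.

```cpp
// Sketch of struct prob as described in the text: low, high, and count
// define a symbol's slice of the scaled interval [0, count).
typedef unsigned int CODE_VALUE;

struct prob {
    CODE_VALUE low;    // scaled lower bound of the symbol's interval
    CODE_VALUE high;   // scaled upper bound of the symbol's interval
    CODE_VALUE count;  // total of all frequency counts (the scale)
};
```

For the earlier example of a symbol occupying [0.22,0.23) with a denominator of 100, the model would return `{22, 23, 100}`.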

The header file `model_metrics.h` contains a utility class that helps the model class figure out what these types are. In general, the choice of `CODE_VALUE` is going to be defined using the natural int or unsigned int type for your architecture. Calculating the value of `MAX_CODE` requires a little bit more thinking though.

#### Digression: Overflow Avoidance

It's a given that we are doing the math in the encoder and decoder using type `CODE_VALUE`. We need to make sure that these three lines of code don't generate an intermediate result that won't fit in that type:

```
CODE_VALUE range = high - low + 1;
high = low + (range * p.high / p.count) - 1;
low = low + (range * p.low / p.count);
```

The values of `high` and `low` will have some number of bits, which in `model_metrics.h` we designate by `CODE_VALUE_BITS`. The maximum number of bits that can be in the counts `low` and `high` returned from the model in `struct prob` is designated in the header file by `FREQUENCY_BITS`. The same header defines `PRECISION` as the maximum number of bits that can be contained in type `CODE_VALUE`. Looking at the calculation above shows that this expression must always be true:

```
PRECISION >= CODE_VALUE_BITS + FREQUENCY_BITS
```

On machines commonly in use today, `PRECISION` will be 31 if we use signed integers, or 32 if we use unsigned. If we arbitrarily split the difference, we might decide that `CODE_VALUE_BITS` and `FREQUENCY_BITS` could both be 16. This avoids overflow, but it doesn't address a second constraint, discussed next.

#### Digression: Underflow Avoidance

In the decoder, we have the following code that is executed each time we decode a new character:

```
CODE_VALUE range = high - low + 1;
CODE_VALUE scaled_value = ((value - low + 1) * m_model.getCount() - 1) / range;
int c;
prob p = m_model.getChar( scaled_value, c );
```

The value returned by `m_model.getCount()` will be a frequency count, so it will be represented by `FREQUENCY_BITS`. We need to make sure that there are enough possible values of `scaled_value` that it can be used to look up even the smallest values in the model.

Because of the invariants described earlier, we know that `high` and `low` have to be at least `ONE_FOURTH` apart. The MSB of `high` has to be 1, making it greater than `ONE_HALF`, and the MSB of `low` has to be 0, making it less than `ONE_HALF`. In addition, the special processing for near-convergence ensures that if `high` is less than `THREE_FOURTHS`, then `low` must be less than `ONE_FOURTH`; likewise, if `low` is greater than `ONE_FOURTH`, then `high` must be greater than `THREE_FOURTHS`. So the worst case for convergence between these values is represented, for example, when `low` is `ONE_HALF-1` and `high` is `THREE_FOURTHS`. In this case, `range` is going to be just a bit over `ONE_FOURTH`.

Now let's consider what happens when we are using 16 bits for our frequency counts. The largest value returned by `m_model.getCount()` will be 65,535. Let's look at what might happen if we were using just eight bits in `CODE_VALUE`, making `MAX_CODE` 255. In the worst case, `high` might be 128 and `low` might be 63, giving a range of 65. Because `value` has to lie between `low` and `high` (back to invariants), there are only going to be 65 possible values of `scaled_value`. Because the range occupied by a given symbol can be as small as 1, we need to be able to call `m_model.getChar()` with any value in the range [0,65536) to be able to properly decode a rarely appearing character.

Thus, a `MAX_CODE` of 255 won't cut it. In the worst case, we are dividing the range occupied by `CODE_VALUE` calculations by 4, so we are in effect chopping two bits off the range. To get that scaled back up so it can generate all the needed values of `scaled_value`, we end up with this prerequisite:

```
CODE_VALUE_BITS >= FREQUENCY_BITS + 2
```

If we have a 32-bit `CODE_VALUE` type to work with, this means that a comfortable value of `CODE_VALUE_BITS` will be 17, making `FREQUENCY_BITS` 15. In general, we want `FREQUENCY_BITS` to be as large as possible, because this gives us the most accurate model of character probabilities. Values of 17 and 15 maximize `FREQUENCY_BITS` for 32-bit unsigned integer math.
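The two constraints from these digressions can be checked at compile time. This snippet (my own sketch; the constant names follow the article) verifies the suggested 32-bit unsigned split:

```cpp
// The two digression constraints, checked for the suggested split of
// 32-bit unsigned math: 17 code-value bits and 15 frequency bits.
const int PRECISION       = 32;  // bits in unsigned int on typical machines
const int CODE_VALUE_BITS = 17;
const int FREQUENCY_BITS  = 15;

static_assert(PRECISION >= CODE_VALUE_BITS + FREQUENCY_BITS,
              "intermediate products must fit in CODE_VALUE");
static_assert(CODE_VALUE_BITS >= FREQUENCY_BITS + 2,
              "a range of ONE_FOURTH must still resolve every scaled_value");
```

Because both inequalities are tight at 17 and 15, raising `FREQUENCY_BITS` any further would force one of the asserts to fail.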
