Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.

# Data Compression with Arithmetic Encoding

### modelA.h

I haven't talked too much about modeling in this article. The focus is strictly on the mechanics of arithmetic coding. We know that this depends on having an accurate model of character probabilities, but getting deeply into that topic requires another article.

For this article, the sample code uses a simple adaptive model, which I call `modelA`. This model has character probabilities for 257 symbols — all possible symbols in an eight-bit alphabet, plus one additional end-of-file (EOF) symbol. All the characters in the model start out with a count of 1, meaning each has an equal chance of being seen, and that each will take the same number of bits to encode. After each character is encoded or decoded, its count is updated in the model, making it more likely to be seen, and thus reducing the number of bits that will be required to encode it in the future.

The model is implemented by using a `CODE_VALUE` array of size 258. The range for a given character `i` is defined by the count at `i` and the count at `i`+1, with the total count being found at location 257. Thus, the routine to get the probability looks like this:

```prob getProbability(int c)
{
prob p = { cumulative_frequency[c],
cumulative_frequency[c+1],
cumulative_frequency[257] };
if ( !m_frozen )
update(c);
pacify();
return p;
}
```

Because of the way arithmetic coding works, updating the count for character i means the counts for all characters greater than `i` have to be incremented as well, because adjusting the range for one position on the number line is going to squeeze everyone to the right — the counts in `cumulative_frequency` are in fact cumulative, representing the sum of all characters less than `i`. This means the update function has to work its way through the whole array. This is a lot of work, and there are various things we could do to try to make the model more efficient. But `ModelA` is just for demonstration, and the fact that the array probably fits neatly in cache means this update is adequate for now.

One additional factor we have to watch out for is that the we can't exceed the maximum number of frequency bits in our count. This is managed by setting a flag to freeze the model once the maximum frequency is reached. Again, there are a number of things we could do to make this more efficient, but for demonstration purposes this is adequate:

```  void inline update(int c)
{
for ( int i = c + 1 ; i < 258 ; i++ )
cumulative_frequency[i]++;
if ( cumulative_frequency[257] >= MAX_FREQ )
m_frozen = true;
}
```

The decoder has to deal with the same issues when looking up a character based on the `scaled_value`. The code does a linear search of the array, whose only saving grace is that it ought to be sitting in the cache:

```prob getChar(CODE_VALUE scaled_value, int &c)
{
for ( int i = 0 ; i < 257 ; i++ )
if ( scaled_value < cumulative_frequency[i+1] ) {
c = i;
prob p = {cumulative_frequency[i],
cumulative_frequency[i+1],
cumulative_frequency[257]};
if ( !m_frozen)
update(c);
return p;
}
throw std::logic_error("error");
}
```

In a future article, the notion of how to develop efficient models will be a good topic to cover. You won't want to use `modelA` as part of a production compressor, but it does a great job of letting you dig into the basics of arithmetic compression.

When you instantiate `modelA` you have three optional template parameters, which you can use to tinker with the math used by the compressor and decompressor. (Of course, the compressor and decompressor have to use the same parameters).

```template<typename CODE_VALUE_ = unsigned int,
int CODE_VALUE_BITS_ = (std::numeric_limits<CODE_VALUE_>::digits + 3) / 2,
int FREQUENCY_BITS_ = std::numeric_limits<CODE_VALUE_>::digits - CODE_VALUE_BITS_>
struct modelA : public model_metrics<CODE_VALUE_, CODE_VALUE_BITS_, FREQUENCY_BITS_>
{
```

On a 32-bit compiler, opting for all defaults will result in you using 17 bits for `CODE_VALUE` calculations, and 15 bits for frequency counts. Static assertions in the `model_metrics` class check to ensure that your selections will result in correct compression.

### compressor.h

This header file contains all the code that implements the arithmetic compressor. The class that does the compression is a template class, parameterized on the input and output classes, as well as the model. This allows you to use the same compression code with different types of I/O in as efficient a manner as possible.

The compressor engine is a simple C++ object that has an overloaded `operator()()`, so the normal usage is to instantiate the engine then call it with input, output, and model parameters. Because of the way that C++ manages template instantiation, the easiest way to actually use the engine is via a convenience function, `compress()`, that takes care of those details and can be called without template parameters, as in:

```std::ifstream input(argv[1], std::ifstream::binary);
std::ofstream output(argv[2], std::ofstream::binary);
modelA<int, 16, 14> cmodel;
compress(input, output, cmodel);
```

You can see the (simple) implementation of the convenience function in the attached source.

The `operator()()` is where all the work happens, and it should look very similar to the trial code shown earlier in this article. The difference is that what is shown here is fully fleshed out, with all the details taken care of:

```int operator()()
{
int pending_bits = 0;
CODE_VALUE low = 0;
CODE_VALUE high = MODEL::MAX_CODE;
for ( ; ; ) {
int c = m_input.getByte();
if ( c == -1 )
c = 256;
prob p = m_model.getProbability( c );
CODE_VALUE range = high - low + 1;
high = low + (range * p.high / p.count) - 1;
low = low + (range * p.low / p.count);
//
// On each pass there are six possible configurations of high/low,
// each of which has its own set of actions. When high or low
// is converging, we output their MSB and upshift high and low.
// When they are in a near-convergent state, we upshift over the
// next-to-MSB, increment the pending count, leave the MSB intact,
// and don't output anything. If we are not converging, we do
// no shifting and no output.
// high: 0xxx, low anything : converging (output 0)
// low: 1xxx, high anything : converging (output 1)
// high: 10xxx, low: 01xxx : near converging
// high: 11xxx, low: 01xxx : not converging
// high: 11xxx, low: 00xxx : not converging
// high: 10xxx, low: 00xxx : not converging
//
for ( ; ; ) {
if ( high < MODEL::ONE_HALF )
put_bit_plus_pending(0, pending_bits);
else if ( low >= MODEL::ONE_HALF )
put_bit_plus_pending(1, pending_bits);
else if ( low >= MODEL::ONE_FOURTH && high < MODEL::THREE_FOURTHS ) {
pending_bits++;
low -= MODEL::ONE_FOURTH;
high -= MODEL::ONE_FOURTH;
} else
break;
high <<= 1;
high++;
low <<= 1;
high &= MODEL::MAX_CODE;
low &= MODEL::MAX_CODE;
}
if ( c == 256 ) //256 is the special EOF code
break;
}
pending_bits++;
if ( low < MODEL::ONE_FOURTH )
put_bit_plus_pending(0, pending_bits);
else
put_bit_plus_pending(1, pending_bits);
return 0;
}
inline void put_bit_plus_pending(bool bit, int &pending_bits)
{
m_output.put_bit(bit);
for ( int i = 0 ; i < pending_bits ; i++ )
m_output.put_bit(!bit);
pending_bits = 0;
}
```

 Digression: Internal End of File The output stream created by this encoder depends on a special EOF character to indicate that it is complete. That character, 256, is the last symbol encoded and output to the stream. When the decoder reads a symbol of 256, it stops decoding and storing symbols, and it knows it has finished. This method of terminating a stream works pretty well in most scenarios. There are some interesting alternatives you can take if you are compressing individual files or other data types (such as network packets) that have a system-imposed EOF. As a file already has an EOF in place, we are wasting some information by encoding our own EOF symbol. David Scott has done a lot of work in this area, and some of his results are here. His general technique for eliminating internal EOF signaling in any compression algorithm is to first make the algorithm bijective. When a compression algorithm is bijective, it means that any file will decompress properly to some other file. This is generally not the case for most compressors: there are usually illegal files that will break the decompressor, and the code I have presented here definitely has that problem. The next step is to ensure that when the input file reaches an EOF, we can terminate the compressed file at that point with a valid sequence of bytes that ends on a byte boundary. This means that when the decompressor sees an EOF on input, it knows two things. First, it knows that it has properly decoded and output all symbols up to this point. Second, that it has finished. Depending on an external EOF to terminate a compressed stream won't always work. For example, if you are reading data from some sort of stream that doesn't have external termination, you will need an internal EOF marker, as used here, to delineate the end of the compressed stream.

Digression: Ending the Encoded Value

At some point when `c==256`, `high` and `low` will stop converging, and we will break out of the main encoding loop - we are more or less done. The question at this point is: how many more bits do we need to output to ensure that the decoder can decode the last symbol(s) properly? The easiest way to do this might be to just output `CODE_VALUE_BITS` of `low`, ensuring that all the precision we have left is used. But we don't actually need this much precision. All we need to guarantee is that when the decoder is reading in `value`, the final bits ensure that `low <= value < high`.

Because of the way the encoding loop works, we know that when we break out of it, `high` and `low` are in just one of three states:

highlow
11xxx01yyy
11xxx00yyy
10xxx00yyy

Examining this, we can see that if `low` starts with `00`, we can write the bits `01` and be satisfied that any values for the remaining bits will result in `value` being greater than `low` and less than `high`. If `low` starts with `01`, we can write the bits `10` and again be ensured that the desired condition holds.

Because there may be additional pending bits left over from the encoding loop, we write `01` or `10` by calling `put_bit_plus_pending()` with a `0` or `1`, after incrementing `pending_bits`. That incremented value of `pending_bits` ensures that if a `0` is written, it will be followed by a `1`, and vice versa.

So just two bits (plus any pending) are all that are needed to properly terminate the bit stream.

### More Insights

 To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.