Channels ▼
RSS

C/C++

Data Compression with Arithmetic Encoding


modelA.h

I haven't talked too much about modeling in this article. The focus is strictly on the mechanics of arithmetic coding. We know that this depends on having an accurate model of character probabilities, but getting deeply into that topic requires another article.

For this article, the sample code uses a simple adaptive model, which I call modelA. This model has character probabilities for 257 symbols — all possible symbols in an eight-bit alphabet, plus one additional end-of-file (EOF) symbol. All the characters in the model start out with a count of 1, meaning each has an equal chance of being seen, and that each will take the same number of bits to encode. After each character is encoded or decoded, its count is updated in the model, making it more likely to be seen, and thus reducing the number of bits that will be required to encode it in the future.

The model is implemented by using a CODE_VALUE array of size 258. The range for a given character i is defined by the count at i and the count at i+1, with the total count being found at location 257. Thus, the routine to get the probability looks like this:

prob getProbability(int c)
{
  prob p = { cumulative_frequency[c], 
             cumulative_frequency[c+1], 
             cumulative_frequency[257] };
  if ( !m_frozen ) 
    update(c);
  pacify();
  return p;
}

Because of the way arithmetic coding works, updating the count for character i means the counts for all characters greater than i have to be incremented as well, because adjusting the range for one position on the number line is going to squeeze everyone to the right — the counts in cumulative_frequency are in fact cumulative, representing the sum of all characters less than i. This means the update function has to work its way through the whole array. This is a lot of work, and there are various things we could do to try to make the model more efficient. But ModelA is just for demonstration, and the fact that the array probably fits neatly in cache means this update is adequate for now.

One additional factor we have to watch out for is that the we can't exceed the maximum number of frequency bits in our count. This is managed by setting a flag to freeze the model once the maximum frequency is reached. Again, there are a number of things we could do to make this more efficient, but for demonstration purposes this is adequate:

  void inline update(int c)
  {
    for ( int i = c + 1 ; i < 258 ; i++ )
      cumulative_frequency[i]++;
    if ( cumulative_frequency[257] >= MAX_FREQ ) 
      m_frozen = true;
  }

The decoder has to deal with the same issues when looking up a character based on the scaled_value. The code does a linear search of the array, whose only saving grace is that it ought to be sitting in the cache:

prob getChar(CODE_VALUE scaled_value, int &c)
{
  for ( int i = 0 ; i < 257 ; i++ )
    if ( scaled_value < cumulative_frequency[i+1] ) {
      c = i;
      prob p = {cumulative_frequency[i], 
                cumulative_frequency[i+1],
                cumulative_frequency[257]};
      if ( !m_frozen)
        update(c);
      return p;
    }
  throw std::logic_error("error");
}

In a future article, the notion of how to develop efficient models will be a good topic to cover. You won't want to use modelA as part of a production compressor, but it does a great job of letting you dig into the basics of arithmetic compression.

When you instantiate modelA you have three optional template parameters, which you can use to tinker with the math used by the compressor and decompressor. (Of course, the compressor and decompressor have to use the same parameters).

template<typename CODE_VALUE_ = unsigned int, 
         int CODE_VALUE_BITS_ = (std::numeric_limits<CODE_VALUE_>::digits + 3) / 2,
         int FREQUENCY_BITS_ = std::numeric_limits<CODE_VALUE_>::digits - CODE_VALUE_BITS_>
struct modelA : public model_metrics<CODE_VALUE_, CODE_VALUE_BITS_, FREQUENCY_BITS_> 
{

On a 32-bit compiler, opting for all defaults will result in you using 17 bits for CODE_VALUE calculations, and 15 bits for frequency counts. Static assertions in the model_metrics class check to ensure that your selections will result in correct compression.

compressor.h

This header file contains all the code that implements the arithmetic compressor. The class that does the compression is a template class, parameterized on the input and output classes, as well as the model. This allows you to use the same compression code with different types of I/O in as efficient a manner as possible.

The compressor engine is a simple C++ object that has an overloaded operator()(), so the normal usage is to instantiate the engine then call it with input, output, and model parameters. Because of the way that C++ manages template instantiation, the easiest way to actually use the engine is via a convenience function, compress(), that takes care of those details and can be called without template parameters, as in:

std::ifstream input(argv[1], std::ifstream::binary);
std::ofstream output(argv[2], std::ofstream::binary);
modelA<int, 16, 14> cmodel;
compress(input, output, cmodel);

You can see the (simple) implementation of the convenience function in the attached source.

The operator()() is where all the work happens, and it should look very similar to the trial code shown earlier in this article. The difference is that what is shown here is fully fleshed out, with all the details taken care of:

int operator()()
{
  int pending_bits = 0;
  CODE_VALUE low = 0;
  CODE_VALUE high = MODEL::MAX_CODE;
  for ( ; ; ) {
    int c = m_input.getByte();
    if ( c == -1 )
      c = 256;
    prob p = m_model.getProbability( c );
    CODE_VALUE range = high - low + 1;
    high = low + (range * p.high / p.count) - 1;
    low = low + (range * p.low / p.count);
    //
    // On each pass there are six possible configurations of high/low,
    // each of which has its own set of actions. When high or low
    // is converging, we output their MSB and upshift high and low.
    // When they are in a near-convergent state, we upshift over the
    // next-to-MSB, increment the pending count, leave the MSB intact,
    // and don't output anything. If we are not converging, we do
    // no shifting and no output.
    // high: 0xxx, low anything : converging (output 0)
    // low: 1xxx, high anything : converging (output 1)
    // high: 10xxx, low: 01xxx : near converging
    // high: 11xxx, low: 01xxx : not converging
    // high: 11xxx, low: 00xxx : not converging
    // high: 10xxx, low: 00xxx : not converging
    //
    for ( ; ; ) {
      if ( high < MODEL::ONE_HALF )
        put_bit_plus_pending(0, pending_bits);
      else if ( low >= MODEL::ONE_HALF )
        put_bit_plus_pending(1, pending_bits);
      else if ( low >= MODEL::ONE_FOURTH && high < MODEL::THREE_FOURTHS ) {
        pending_bits++;
        low -= MODEL::ONE_FOURTH;  
        high -= MODEL::ONE_FOURTH;  
      } else
        break;
      high <<= 1;
      high++;
      low <<= 1;
      high &= MODEL::MAX_CODE;
      low &= MODEL::MAX_CODE;
    }
    if ( c == 256 ) //256 is the special EOF code
      break;
  }
  pending_bits++;
  if ( low < MODEL::ONE_FOURTH )
    put_bit_plus_pending(0, pending_bits);
  else
    put_bit_plus_pending(1, pending_bits);
  return 0;
}
inline void put_bit_plus_pending(bool bit, int &pending_bits)
{
  m_output.put_bit(bit);
  for ( int i = 0 ; i < pending_bits ; i++ )
    m_output.put_bit(!bit);
  pending_bits = 0;
}

Digression: Internal End of File

The output stream created by this encoder depends on a special EOF character to indicate that it is complete. That character, 256, is the last symbol encoded and output to the stream. When the decoder reads a symbol of 256, it stops decoding and storing symbols, and it knows it has finished. This method of terminating a stream works pretty well in most scenarios. There are some interesting alternatives you can take if you are compressing individual files or other data types (such as network packets) that have a system-imposed EOF. As a file already has an EOF in place, we are wasting some information by encoding our own EOF symbol.

David Scott has done a lot of work in this area, and some of his results are here. His general technique for eliminating internal EOF signaling in any compression algorithm is to first make the algorithm bijective. When a compression algorithm is bijective, it means that any file will decompress properly to some other file. This is generally not the case for most compressors: there are usually illegal files that will break the decompressor, and the code I have presented here definitely has that problem.

The next step is to ensure that when the input file reaches an EOF, we can terminate the compressed file at that point with a valid sequence of bytes that ends on a byte boundary. This means that when the decompressor sees an EOF on input, it knows two things. First, it knows that it has properly decoded and output all symbols up to this point. Second, that it has finished.

Depending on an external EOF to terminate a compressed stream won't always work. For example, if you are reading data from some sort of stream that doesn't have external termination, you will need an internal EOF marker, as used here, to delineate the end of the compressed stream.

Digression: Ending the Encoded Value

At some point when c==256, high and low will stop converging, and we will break out of the main encoding loop - we are more or less done. The question at this point is: how many more bits do we need to output to ensure that the decoder can decode the last symbol(s) properly? The easiest way to do this might be to just output CODE_VALUE_BITS of low, ensuring that all the precision we have left is used. But we don't actually need this much precision. All we need to guarantee is that when the decoder is reading in value, the final bits ensure that low <= value < high.

Because of the way the encoding loop works, we know that when we break out of it, high and low are in just one of three states:

highlow
11xxx01yyy
11xxx00yyy
10xxx00yyy

Examining this, we can see that if low starts with 00, we can write the bits 01 and be satisfied that any values for the remaining bits will result in value being greater than low and less than high. If low starts with 01, we can write the bits 10 and again be ensured that the desired condition holds.

Because there may be additional pending bits left over from the encoding loop, we write 01 or 10 by calling put_bit_plus_pending() with a 0 or 1, after incrementing pending_bits. That incremented value of pending_bits ensures that if a 0 is written, it will be followed by a 1, and vice versa.

So just two bits (plus any pending) are all that are needed to properly terminate the bit stream.


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.