Data Compression with Arithmetic Encoding


Figure 2 shows how the math system now works while in the middle of some arbitrary compression run. Even though we are using 32-bit math, the algorithm is now dealing with arbitrarily long versions of high and low:

Figure 2: How the system now works.

As low and high converge, their matching digits are shifted out on the left side, presumably to a file. These digits are never going to change, and are no longer needed as part of the calculation. Likewise, both numbers have an infinite number of binary digits that are being shifted in from the right — 1s for high and 0s for low.
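The shift-out process can be sketched in a few lines of C++. The helper name here is mine, purely for illustration, and the bits are collected into a string rather than written to a file:

```cpp
#include <cstdint>
#include <string>

// Illustrative helper: emit the leading bits that low and high already
// agree on, then scale both values back up. A 0 enters low from the
// right and a 1 enters high, standing in for their infinite tails.
std::string shift_out_matched_bits(uint32_t &low, uint32_t &high)
{
    std::string emitted;
    while ((low & 0x80000000U) == (high & 0x80000000U)) {
        emitted += (low & 0x80000000U) ? '1' : '0';
        low <<= 1;               // infinite 0s shift in on the right
        high = (high << 1) | 1;  // infinite 1s shift in on the right
    }
    return emitted;
}
```

For example, low = 0xC0000000 and high = 0xD0000000 share the leading bits 110, so those three bits would be emitted and both values rescaled.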

A Fatal Flaw and the Workaround

In the code just presented, we have a pretty reasonable way of managing the calculation that creates an arbitrarily long number. This implementation is good, but it isn't perfect.

We run into problems with this algorithm when the values of low and high start converging on a value of 0.5, but don't quite cross over the line that would cause bits to be shifted out. A sequence of calculations that runs into this problem might produce values like these:

low=7C99418B high=81A60145
low=7FF8F3E1 high=8003DFFA
low=7FFFFC6F high=80000DF4
low=7FFFFFF6 high=80000001
low=7FFFFFFF high=80000000

Our algorithm isn't quite doing what we want here. The numbers are getting closer and closer together, but because neither has crossed over the 0.5 divider, no bits are getting shifted out. This process eventually leaves the values of the two numbers in a disastrous position.

Why is it disastrous? The initial calculation of range is done by subtracting low from high. In this algorithm, range stands in as a proxy for the number line. The subsequent calculation is intended to find the proper subsection of that value for a given character. For example, in the earlier sample, we might have wanted to encode a character that has a range of [0.22,0.23). If the value of range is 100, then we can clearly see that the character will subdivide that with values of 22 and 23.

But what happens if low and high are so close that range has a value of just 1? Our algorithm breaks down: it no longer matters what the probability of the next character is, because subdividing the range [0,1) gives the same result every time. And if we get that identical result for every character, the decoder has no way to distinguish one character from another.

Ultimately, this can decay to the point where low == high, which breaks one of our key invariants. It's easy to work out what kind of disaster results at that point.
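A tiny sketch makes the breakdown concrete. The helper below is illustrative, not from the final code; it applies the narrowing calculation to a hypothetical interval where low and high have fully converged:

```cpp
#include <cstdint>

struct Interval { uint32_t low, high; };

// Illustrative only: narrow [low, high] for a symbol whose cumulative
// counts are [lo_count, hi_count) out of `total`, using the same
// arithmetic as the encoder. The multiply is widened to 64 bits so the
// demo itself cannot overflow.
Interval narrow(uint32_t low, uint32_t high,
                uint32_t lo_count, uint32_t hi_count, uint32_t total)
{
    uint64_t range = uint64_t(high) - low + 1;
    return { uint32_t(low + (range * lo_count) / total),
             uint32_t(low + (range * hi_count) / total - 1) };
}
```

With low == high the computed range is 1, so two different symbols, say [22,23) and [50,60) out of 100, collapse to the same pair of values, and the new high actually lands below the new low.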

Later on in this article, I'll go over some finer points in the algorithm that show why you run into trouble long before the two values are only 1 apart. We'll come up with specific requirements for the accuracy needed by the value of range. Even without those detailed requirements, clearly something needs to be done here.

The fix to the algorithm is shown in the code below. This version of the algorithm still does the normal output of a single bit when low goes above 0.5 or high drops below it. But when the two values haven't converged, it adds a check of the next most-significant bits to see if we are headed towards the problem of near-convergence. This will be the case when the two most-significant bits of high are 10 and the two most-significant bits of low are 01. When this is the case, we know that the two values are converging, but we don't yet know what the eventual output bit is going to be.

unsigned int high = 0xFFFFFFFFU;
unsigned int low = 0;
int pending_bits = 0;
char c;
while ( input >> c ) {
  unsigned int range = high - low + 1;
  prob p = model.getProbability(c);
  high = low + (range * p.high)/p.count - 1;
  low = low + (range * p.low)/p.count;
  for ( ; ; ) {
    if ( high < 0x80000000U ) {
      output_bit_plus_pending( 0, pending_bits );
      low <<= 1;
      high <<= 1;
      high |= 1;
    } else if ( low >= 0x80000000U ) {
      output_bit_plus_pending( 1, pending_bits );
      low <<= 1;
      high <<= 1;
      high |= 1;
    } else if ( low >= 0x40000000U && high < 0xC0000000U ) {
      pending_bits++;
      low <<= 1;
      low &= 0x7FFFFFFF;
      high <<= 1;
      high |= 0x80000001;
    } else
      break;
  }
}

void output_bit_plus_pending(bool bit, int &pending_bits)
{
  output_bit( bit );
  for ( int i = 0 ; i < pending_bits ; i++ )
    output_bit( !bit );
  pending_bits = 0;
}

So what do we do when we are in this near-converged state? We know that sooner or later, either high is going to go below 0.5 or low will go above it. In the first case, both values will have leading bits of 01, and in the second, they will both have leading bits of 10. As this convergence increases, the leading bits will extend to either 01111... or 10000..., with some finite number of digits extended.

Given this, we know that once we figure out what the first binary digit in the string is going to be, the subsequent bits will all be the opposite. So in this new version of the algorithm, when we are in the near-convergence state, we simply discard the second most-significant bit of high and low, shifting the remaining bits left, while retaining the MSB. Doing that means also incrementing the pending_bits counter to acknowledge that we need to deal with it when convergence finally happens. An example of how this looks when squeezing out the converging bit in low is shown in Figure 3. Removing the bit from high is basically identical, but of course a 1 is shifted into the least-significant bit (LSB) during the process.

Figure 3: Squeezing out the converging bit.

The bit twiddling that makes this happen can be a bit difficult to follow, but the important thing is that the process has to adhere to the following rules:

  1. The low 30 bits of both low and high are shifted left one position.
  2. The LSB of low gets a 0 shifted in.
  3. The LSB of high gets a 1 shifted in.
  4. The MSB of both words is unchanged — after the operation, it will still be set to 1 for high and 0 for low.
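In code, the four rules boil down to two one-line helpers (the names are mine, for illustration):

```cpp
#include <cstdint>

// Rule 1+2+4 for low: shift left, a 0 enters the LSB, and the mask
// forces the MSB back to 0.
uint32_t squeeze_low(uint32_t low)
{
    return (low << 1) & 0x7FFFFFFFU;
}

// Rule 1+3+4 for high: shift left, then set both the LSB (the 1 that
// shifts in) and the MSB (which must stay 1) in a single OR.
uint32_t squeeze_high(uint32_t high)
{
    return (high << 1) | 0x80000001U;
}
```

For example, squeezing low = 0x7FFFFFF6 yields 0x7FFFFFEC, and squeezing high = 0x80000001 yields 0x80000003; in both cases the second most-significant bit is discarded while the MSB survives.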

The final change that ties all this together is the introduction of the new function output_bit_plus_pending(). Each time we go through this near-convergence process, we know that another bit has been stored away, but we don't yet know whether it is a 1 or a 0. We keep a count of all these consecutive bits in pending_bits. When we finally reach a situation where an actual MSB can be output, we write it, followed by all the pending bits that have been stored up. And of course, the pending bits will be the opposite of the bit being output.

This fixed-up version of the code does everything we need to properly encode. The final working C++ code will have a few differences, but they are mostly just tweaks to help with flexibility. The code shown above is more or less the final product.

The Real Decoder

I've talked about some invariants that exist in this algorithm, but one I have skipped over is this: The values of high and low that are produced as each character is processed in the encoder will be duplicated in the decoder. These values operate in lock-step right down to the least-significant bit.

A consequence of this is that the code in the decoder ends up looking a lot like the code in the encoder. The manipulation of high and low is effectively duplicated. And in both the encoder and the decoder, the values of these two variables are manipulated using a calculated range, along with the probability for the given character.

The difference between the two comes from how we get the probability. In the encoder, the character is known, because we are reading it directly from the file being processed. In the decoder, we have to determine the character by looking at the value of the message we are decoding, specifically, where it falls on the [0,1) number line. It is the job of the model to figure this out, in the function getChar().

The compressed input is read into a variable named value. This variable is another one of our pseudo-infinite variables, like high and low, with the primary difference being what is shifted into it in the LSB position. Recall that high gets an infinite string of 1s shifted in, and low gets an infinite string of 0s. value gets something completely different: the bits of the encoded message are shifted into it. So at any given time, value contains 32 of the long string of bits that represent the number the encoder created. On the MSB side, bits that are no longer used in the calculation are shifted out of value; and on the LSB side, as those bits are shifted out, new bits of the message are shifted in.

The resulting code looks like this:

unsigned int high = 0xFFFFFFFFU;
unsigned int low = 0;
unsigned int value = 0;
for ( int i = 0 ; i < 32 ; i++ ) {
  value <<= 1;
  value += m_input.get_bit() ? 1 : 0;
}
for ( ; ; ) {
  unsigned int range = high - low + 1;
  unsigned int scaled_value = ((value - low + 1) * m_model.getCount() - 1) / range;
  int c;
  prob p = m_model.getChar( scaled_value, c );
  if ( c == 256 )
    break;
  m_output.putByte(c);
  high = low + (range * p.high)/p.count - 1;
  low = low + (range * p.low)/p.count;
  for ( ; ; ) {
    if ( low >= 0x80000000U || high < 0x80000000U ) {
      low <<= 1;
      high <<= 1;
      high |= 1;
      value <<= 1;
      value += m_input.get_bit() ? 1 : 0;
    } else if ( low >= 0x40000000U && high < 0xC0000000U ) {
      low <<= 1;
      low &= 0x7FFFFFFF;
      high <<= 1;
      high |= 0x80000001;
      value -= 0x40000000;
      value <<= 1;
      value += m_input.get_bit() ? 1 : 0;
    } else
      break;
  }
}

So this code looks very similar to the final encoder. The manipulation of high and low is nearly identical; the decoder simply adds value, which is updated in tandem with them. The nature of the algorithm introduces another invariant: value will always be greater than or equal to low and less than or equal to high.

A Sample Implementation

All of this is put together in the production code included in ari.zip. The use of templates makes this code very flexible and should make it easy to plug it into your own applications. All the needed code is in header files, so project inclusion is simple.

In this section, I'll discuss the various components that need to be put together in order to actually do some compression. The code package has four programs:

  • fp_proto.cpp, the floating-point prototype program. Useful for experimentation, but not for real work.
  • compress.cpp, which compresses a file using command-line arguments.
  • decompress.cpp, which decompresses a file using command-line arguments.
  • tester.cpp, which puts a file through a compress/decompress cycle, tests for validity, and outputs the compression ratio.

The compression code is implemented entirely as a set of template classes in header files. These are:

  • compressor.h, which completely implements the arithmetic compressor. The compressor class is parameterized on input stream type, output stream type, and model type.
  • decompressor.h, the corresponding arithmetic decompressor with the same type parameters.
  • modelA.h, a simple order-0 model that does an acceptable job of demonstrating compression.
  • model_metrics.h, a utility class that helps a model class set up some types used by the compressor and decompressor.
  • bitio.h and byteio.h, the streaming classes that implement bit-oriented I/O.

This article is going to skip over the details pertaining to the bit-oriented I/O classes implemented in bitio.h and byteio.h. The classes provided will allow you to read or write from std::iostream and FILE * sources and destinations, and can be pretty easily modified for other types. Details on the implementation of these classes can be found in my article C++ Generic Programming Meets OOP, which includes a bit-oriented I/O class as an illustrative example.

All the code requires a C++11 compiler, but could be modified to work with earlier versions without much trouble. The makefile will build the code with g++ 4.6.3 or clang 3.4. The solution file included will build the projects with Visual C++ 2012.

model_metrics.h

In the code I've shown you so far, I blithely assume that your architecture uses 32-bit math and efficiently supports unsigned integer math. This is somewhat of an unwanted and unneeded restriction, so my production code will define the math parameters using templates. The way it works is that the compressor and decompressor classes both take a MODEL type parameter, and they rely on that type to provide a few typedefs and constants:

  • CODE_VALUE The integer type used to perform math. In the sample code shown so far, we assumed this was unsigned int, but signed and longer types are perfectly valid.
  • MAX_CODE The maximum value of a code: in other words, the highest value that high can reach. If we are using all 32 bits of an unsigned int, this would be 0xFFFFFFFF, but as we will see shortly, it will normally be quite a bit smaller than the maximum value that fits in an int or unsigned int.
  • ONE_FOURTH, ONE_HALF, THREE_FOURTHS The three values used when testing for convergence. In the sample code, these were set to full 32-bit values of 0x40000000, 0x80000000, and 0xC0000000, but they will generally be smaller than this. The values could be calculated from MAX_CODE, but we let the model define them for convenience.

struct prob is the structure used to return probability values from the model. It is defined in the model because the model will know what types it wants the three values in this structure to be. The three values are low, high, and count, which define the range of a given character being encoded or decoded.
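Concretely, a prob definition along the lines just described might look like this. This is a sketch that assumes CODE_VALUE is plain unsigned int; in the production code, the model picks the type via model_metrics.h:

```cpp
// Assumption for this sketch: the model's math type is unsigned int.
typedef unsigned int CODE_VALUE;

// Probability information returned by the model for one character.
struct prob {
    CODE_VALUE low;    // cumulative count where this character's range starts
    CODE_VALUE high;   // cumulative count where this character's range ends
    CODE_VALUE count;  // total of all counts (the scaling denominator)
};
```

For a character occupying [0.22,0.23) of the number line with a total count of 100, the model would return {22, 23, 100}.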

The header file model_metrics.h contains a utility class that helps the model class figure out what these types are. In general, the choice of CODE_VALUE is going to be defined using the natural int or unsigned int type for your architecture. Calculating the value of MAX_CODE requires a little bit more thinking though.

Digression: Overflow Avoidance

It's a given that we are doing the math in the encoder and decoder using type CODE_VALUE. We need to make sure that these three lines of code don't generate an intermediate result that won't fit in that type:

CODE_VALUE range = high - low + 1;
high = low + (range * p.high / p.count) - 1;
low = low + (range * p.low / p.count);

The values of high and low will have some number of bits, which in model_metrics.h we designate by CODE_VALUE_BITS. The maximum number of bits that can be in the counts low and high returned from the model in struct prob is defined in the header file by FREQUENCY_BITS. The same header defines PRECISION as the maximum number of bits that can be contained in type CODE_VALUE. Looking at the calculation above shows you that this expression must always be true:

PRECISION >= CODE_VALUE_BITS + FREQUENCY_BITS

On machines commonly in use today, PRECISION will be 31 if we use signed integers, 32 if we use unsigned. If we arbitrarily split the difference, we might decide that CODE_VALUE_BITS and FREQUENCY_BITS could both be 16. This will avoid overflow, but in fact it doesn't address a second constraint, discussed next.
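The constraint lends itself to a compile-time check. This sketch uses the 16/16 split just mentioned; the constants are illustrative, not the ones model_metrics.h finally settles on:

```cpp
// Illustrative constants: unsigned 32-bit math with the 16/16 split.
const int PRECISION       = 32;
const int CODE_VALUE_BITS = 16;
const int FREQUENCY_BITS  = 16;

// If this fails, range * p.high can produce an intermediate result
// that does not fit in CODE_VALUE.
static_assert(PRECISION >= CODE_VALUE_BITS + FREQUENCY_BITS,
              "overflow constraint violated");
```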

Digression: Underflow Avoidance

In the decoder, we have the following code that is executed each time we decode a new character:

CODE_VALUE range = high - low + 1;
CODE_VALUE scaled_value =  ((value - low + 1) * m_model.getCount() - 1 ) / range;
int c;
prob p = m_model.getChar( scaled_value, c );

The value returned by m_model.getCount() will be a frequency count, so it will be represented by FREQUENCY_BITS. We need to make sure that there are enough possible values of scaled_value so that it can be used to look up the smallest values in the model.

Because of the invariants described earlier, we know that high and low have to be at least ONE_FOURTH apart. The MSB of high has to be 1, making it at least ONE_HALF, and the MSB of low has to be 0, making it less than ONE_HALF. In addition, the special processing for near-convergence ensures that if high is less than THREE_FOURTHS, then low must be less than ONE_FOURTH. Likewise, if low is greater than or equal to ONE_FOURTH, then high must be at least THREE_FOURTHS. So the worst case for convergence between these values is represented, for example, when low is ONE_HALF-1 and high is THREE_FOURTHS. In this case, range is going to be just a bit over ONE_FOURTH.

Now let's consider what happens when we are using 16 bits for our frequency counts. The largest value returned by m_model.getCount() will be 65,535. Let's look at what might happen if we were using just eight bits in CODE_VALUE, making MAX_CODE 255. In the worst case, high might be 128 and low might be 63, giving a range of 66. Because value has to lie between low and high (back to invariants), this means there are only going to be 66 possible values of scaled_value. Because the range occupied by a given symbol can be as small as 1, we need to be able to call m_model.getChar() with any value in the range [0,65536) to be able to properly decode a rarely appearing character. Thus, a MAX_CODE of 255 won't cut it.

In the worst case, we are dividing the range occupied by CODE_VALUE calculations by 4, so we are in effect chopping two bits off the range. To get that scaled up so it can generate all the needed values of scaled_value, we end up with this prerequisite:

CODE_VALUE_BITS >= FREQUENCY_BITS + 2

If we have a 32-bit CODE_VALUE type to work with, this means that a comfortable value of CODE_VALUE_BITS will be 17, making FREQUENCY_BITS 15. In general, we want FREQUENCY_BITS to be as large as possible, because this provides us with the most accurate model of character probabilities. Values of 17 and 15 maximize FREQUENCY_BITS for 32-bit unsigned integer math.
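Pulling the two digressions together, here is one way the metrics might be derived in code. This is a sketch of the idea; the actual definitions live in model_metrics.h and may differ in detail:

```cpp
// Illustrative derivation of the metrics for unsigned 32-bit math.
const int PRECISION       = 32;
const int CODE_VALUE_BITS = 17;
const int FREQUENCY_BITS  = 15;

static_assert(PRECISION >= CODE_VALUE_BITS + FREQUENCY_BITS,
              "overflow constraint violated");
static_assert(CODE_VALUE_BITS >= FREQUENCY_BITS + 2,
              "underflow constraint violated");

typedef unsigned int CODE_VALUE;

const CODE_VALUE MAX_CODE      = (CODE_VALUE(1) << CODE_VALUE_BITS) - 1; // 0x1FFFF
const CODE_VALUE ONE_FOURTH    = CODE_VALUE(1) << (CODE_VALUE_BITS - 2); // 0x08000
const CODE_VALUE ONE_HALF      = 2 * ONE_FOURTH;                         // 0x10000
const CODE_VALUE THREE_FOURTHS = 3 * ONE_FOURTH;                         // 0x18000
const CODE_VALUE MAX_FREQ      = (CODE_VALUE(1) << FREQUENCY_BITS) - 1;  // 0x7FFF
```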

