Channels ▼
RSS

C/C++

Data Compression with Arithmetic Encoding


decompressor.h

The decompressor code mirrors the compressor code. The engine is a template class that is parameterized on the input and output stream types, and the model type. A convenience function called decompress() makes it easy to instantiate the engine and then decompress without needing to declare template parameters:

std::ifstream input(argv[1], std::ifstream::binary);
std::ofstream output(argv[2], std::ofstream::binary);
modelA<int, 16, 14> cmodel;
decompress(input, output, cmodel);

The convenience function is in the attached source package.

The actual operator code is shown here. This is very much like the sample code shown earlier, with all the additional loose ends tied up so that this is production-ready:

int operator()()
{
  CODE_VALUE high = MODEL::MAX_CODE;
  CODE_VALUE low = 0;
  CODE_VALUE value = 0;
  for ( int i = 0 ; i < MODEL::CODE_VALUE_BITS ; i++ ) {
    value <<= 1;
    value += m_input.get_bit() ? 1 : 0;
  }
  for ( ; ; ) {
    CODE_VALUE range = high - low + 1;
    CODE_VALUE scaled_value =  ((value - low + 1) * m_model.getCount() - 1 ) / range;
    int c;
    prob p = m_model.getChar( scaled_value, c );
    if ( c == 256 )
      break;
    m_output.putByte(c);
    high = low + (range*p.high)/p.count -1;
    low = low + (range*p.low)/p.count;
    for( ; ; ) {
      if ( high < MODEL::ONE_HALF ) {
        //do nothing, bit is a zero
      } else if ( low >= MODEL::ONE_HALF ) {
        value -= MODEL::ONE_HALF;  //subtract one half from all three code values
        low -= MODEL::ONE_HALF;
        high -= MODEL::ONE_HALF;
      } else if ( low >= MODEL::ONE_FOURTH && high < MODEL::THREE_FOURTHS ) {
        value -= MODEL::ONE_FOURTH;
        low -= MODEL::ONE_FOURTH;
        high -= MODEL::ONE_FOURTH;
      } else
        break;
      low <<= 1;
      high <<= 1;
      high++;
      value <<= 1;
      value += m_input.get_bit() ? 1 : 0;
    }
  }
  return 0;
}

Digression: Priming the Input Pump

One of the implementation problems when performing arithmetic coding is the management of the end of the stream. An example of this problem can be imagined if we have a best-case scenario in which we encode an entire file using just two or three bits. The actual compressed file will of course need to have one byte, but that's it.

We run into a problem in the decode, because we need to be able to fill the initial value of value with the appropriate number of bits when decoding. A typical number would be 17, which would require that we read three bits into value before we start encoding. Something like this is executed when the encoder starts:

  for ( int i = 0 ; i < MODEL::CODE_VALUE_BITS ; i++ ) {
    value <<= 1;
    value += m_input.get_bit() ? 1 : 0;
  }

For every eight bits read via get_bit(), there will be one call to read a byte from the underlying file or other input stream. We could just ignore EOF conditions and return 0xFF or 0x00 whenever reading a byte. This would work pretty well as long as we properly detect the end of stream in our encoded stream. But in the case of encoder error or file corruption, it could result in the decoder running forever, reading in bogus values, decoding them to other bogus values, and writing them to output.

Things would be much better if this error condition was detected. What we would really like is for the attempt to read past the end of file to generate an error. To accomplish this, the bit-oriented I/O class I use for input has a specialized constructor that asks for the number of bits that will be used in a CODE_VALUE. When constructing it in the convenience function, I pass this along:

  input_bits<INPUT> in(source,MODEL::CODE_VALUE_BITS);

This value tells the input class that I may need to read that many bits past the EOF, but no more. When constructed, I store that number in the input class member variable m_CodeValueBits.

The code that reads bytes when they are needed now has processing that looks like this:

m_CurrentByte = m_Input.getByte();
if ( m_CurrentByte < 0 ) {
  if ( m_CodeValueBits <= 0 ) 
    throw std::logic_error("EOF on input");
  else
    m_CodeValueBits -= 8;
}

This gives the error checking we need while still allowing a few reads past the end of file.

Implementation Notes

When looking at the code, it will help to have some additional background observations on the implementation:

Using Exceptions for Fatal Errors

Using C++ exceptions properly can be a really difficult task. Ensuring that all possible paths through stack unwinding leave your program in a correct state is not for the faint of heart. This is less of a problem when you are using exceptions strictly as a fatal error mechanism, as I discuss here.

The demo code I use here throws exceptions in response to the following error conditions:

  • EOF on the input stream.
  • A request for a character from the model with a scaled_value larger than the maximum value currently defined — this is a logic error that should never happen.

Tightening up this code would give some other places that this would be useful; for example, failures when writing output.

Propagating errors this way is efficient, and helps keep the code clean — no checking for failures from the model or I/O objects, and it provides a very consistent way to implement error handling when you write new classes for modeling or I/O. To catch these errors, the normal template for compressing or decompressing is going to look like this:

try {
  //
  // set up I/O and model
  // then compress or decompress
    return 0; //the success path
  }
  catch (std::exception &ex)
  {
    std::cerr << "Failed with exception: " << ex.what() << "\n";
  }
  return 255; //failure path

Building the Demo Programs

There are four executables that can be built from the attached source package:

  • fp_proto: The floating-point prototype code. This code is useful to demonstrate the basics of how arithmetic coding works, but isn't much good for real work, with reasons outlined in the article.
  • compress: A compressor that accepts an input and output file name on the command line, then compresses using a simple order-0 model. You should see modestly good compression with this file. To get superior compression requires some work on more advanced modeling.
  • decompress: A decompressor that accepts an input and output file name on the command line. This will work properly with files created by the corresponding compress program.
  • tester: A test program that accepts a single file name on the command line. It executes a compress to a temporary file, then a decompress, then compares the result with the original file. The app exits with 0 if the comparison succeeds, 255 in the event of a failure.

If you are working on a Windows box, you can build all four applications from the attached ari.sln solution file. This should work with Visual C++ 2012 or later.

If you are working on Linux, you can build all four files with the attached makefile, using make all. By default, this will use g++ to compile, but if you change one option in the file, you can easily switch to clang. The code has been tested with g++ version 4.6.3, and clang++ version 3.4.

The code uses a few C++11 features, so if you try to build the projects with earlier versions of any of these compilers you may run into some problems (mostly with the I/O code). It should be reasonably easy to strip out some of the functionality and create classes that will support either iostreams or FILE * objects.

Testing

The tester program provides a good way to exercise the code. If you are on a Linux system, you create a directory called corpus, populate it with a tree of as many files as you like, then execute make test, which will run the tester program against all files. If a failure occurs, the process should abort with message, allowing you to see which file failed:

[email protected]:$ make test
find corpus -type f -print0 | xargs -0r -n 1 ./tester
compressing corpus/about/services/trackback/index.html... 5.41378
compressing corpus/about/services/feed/index.html... 5.71502
compressing corpus/about/services/index.html... 5.41378
compressing corpus/about/trackback/index.html... 5.39995
compressing corpus/about/feed/index.html... 5.25144
compressing corpus/about/serial1/trackback/index.html... 5.43231
compressing corpus/about/serial1/feed/index.html... 5.32922
compressing corpus/about/serial1/index.html... 5.43231
compressing corpus/about/tdcb/trackback/index.html... 5.37245

The tester app also prints out the number of bits/byte that was achieved by the compressor, that output can be collected to provide stats. It's a little harder to do this with a one-liner under Windows, so I am afraid I don't have an easy bulk test option.

Logging

When an arithmetic compressor goes bad, it can be exceedingly difficult to determine what went wrong. There are two common sources of error. The encoder itself can go awry, some error with bit twiddling, or failure to maintain the invariants that have been discussed ad nauseum here. The second possible source of error is in the model itself. There are a few things that can go wrong here, but the most common error is a loss of synch between the encoder and decoder models. For things to work properly, both models have to operate in perfect lockstep.

Some low-level logging can be turned on in the encoder by building with the LOG constant defined. You can do this by editing the makefile and adding -DLOG to the compiler command line, or defining it in the C++|Preprocessor|Preprocessor Definitions area of the project properties for Windows builds.

Once you have the projects reconfigured, you can do a clean and rebuild of the projects. From that point on, the compressor will create a log file called compressor.log, and the decompressor will create a file called decompressor.log. These log files will output the state of the compressor or decompressor as each character is encoded or decoded, with output like this:

0x2f(/) 0x0 0x1ffff => 0x5da2 0x5f9f
0x2f(/) 0xd100 0x1cfff => 0xff74 0x1016d
0x0d 0xba00 0x1b6ff => 0xc6b2 0xc7ab
0x0a 0xb200 0x1abff => 0xbb9d 0xbc92
0x2f(/) 0x9d00 0x192ff => 0xcb2f 0xce01
0x2f(/) 0xcbc0 0x1807f => 0xed8d 0xf04f
0x20 0x6340 0x113ff => 0x7a19 0x7ac4
0x20 0x3200 0x189ff => 0x5e4d 0x60e7
0x42(B) 0x2680 0x173ff => 0x83a0 0x84e2
0x57(W) 0xa000 0x1e2ff => 0x11492 0x115c8
0x54(T) 0x9200 0x1c8ff => 0xfe53 0xff7c
0x2e(.) 0x5300 0x17cff => 0x8a98 0x8bb4
0x43(C) 0x9800 0x1b4ff => 0xe994 0xeaa2
0x50(P) 0x9400 0x1a2ff => 0xef56 0xf056
0x50(P) 0x5600 0x156ff => 0xac4c 0xae31

The data you see there is the character being encoded, both in hex and text representation (if printable), followed by the values of low and high before and after the character is processed. If you diff the two log files, the only difference between the two should be in the last character position, because the decompressor doesn't update its state when it sees an EOF:

[email protected]:$ diff compressor.log decompressor.log
10587c10587
> 0x100 0x27a0 0x10cff => 0x10cfa 0x10cff
---
> 0x100 0x27a0 0x10cff
[email protected]:$

Any other differences indicate an error. You can use the line number of the position to determine exactly where things went awry.

EOF

By itself, an entropy encoder is not a magic bullet. A good compressor needs sophisticated modeling to back it up. With that good model, an arithmetic encoder will give you excellent results, and should outperform Huffman or other older and more revered coders. There is an entire class of coders referred to as Range encoders that use the ideas here, tailored for greater efficiency.

The computational requirements of an arithmetic encoder are definitely going to be higher than with something like Huffman, but modern CPU architectures should be able to minimize the impact this has on your program. Arithmetic encoders are particularly better suited for adaptive models when compared to Huffman coding.

Source is attached here.

Your comments and corrections are welcome.


Mark Nelson is a frequent contributor to Dr. Dobb's and lead author of the Data Compression Book.


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.