An Algorithm for Online Data Compression

Compressing data on the fly involves different tradeoffs than when you have all the time and space in the world.


October 1998

Sergey Ignatchenko

Figure 1: The compression/decompression process
Figure 2: Calgary Corpus data compression results (unlimited block size and memory usage)
Figure 3: Calgary Corpus data compression results (unlimited block size)
Figure 4: Calgary Corpus data compression results (unlimited memory usage)


Lossless data compression algorithms have a wide range of applications. The most common, file compression, is nearly unrestricted in terms of memory usage and operation time. The most essential algorithm metric for file compression is the compression ratio. But in the case of online data compression, the situation is quite different. For the purpose of this article, I define online data compression as the compression of stream-based material where the compressor/decompressor pair must be able to process data at a rate equivalent to the input rate. Moreover, in some cases there must be a constant upper bound on the delay between compressor input and decompressor output.

Online data compression imposes a number of restrictions on compression algorithms. First, these algorithms must meet strict requirements for low memory usage and high processing speed. Second, data flow in interactive systems usually consists of small data packets (several hundred bytes), and each of these packets should be delivered as quickly as possible. For the compression algorithm described in this article, this second requirement means that the packets must be self-contained. That is, if packet sequence {A, B} is compressed into packet sequence {A', B'}, then the information contained in packet A' should be sufficient to decompress packet A. Thus, at the end of compression of packet A, all information needed for decompression of packet A must be placed in packet A'. This procedure is usually referred to as a flush. Frequent flushes reduce the compression ratio for all compression algorithms, but some algorithms are more vulnerable than others to frequent flushes. For online compression, flush frequency depends on the size of the data packet.
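To make the flush requirement concrete, here is a minimal sketch of packet-by-packet compression using the C interface from Listing 1. It assumes, as the article's packet model implies (though the listing itself does not spell it out), that each LZHLCompress call ends with a flush, so its output is a self-contained packet while the compressor keeps its history across calls:

#include "lzhl.h"   /* the interface shown in Listing 1 */

/* Compress and transmit two packets A and B. Buffer sizing and
   error handling are simplified; real code should size 'out'
   using LZHLCompressorCalcMaxBuf(). */
void sendTwoPackets( void* a, size_t aSz, void* b, size_t bSz )
    {
    unsigned char out[ 4096 ];
    size_t n;
    LZHL_CHANDLE comp = LZHLCreateCompressor();

    /* packet A is flushed: 'out' now holds a self-contained A' */
    n = LZHLCompress( comp, out, a, aSz );
    /* ...transmit n bytes... */

    /* B' is also self-contained, yet it may still reference bytes
       of A through the history retained inside the compressor */
    n = LZHLCompress( comp, out, b, bSz );
    /* ...transmit n bytes... */
    (void)n;

    LZHLDestroyCompressor( comp );
    }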

In this article I describe the stream-based algorithm LZH-Light. LZH-Light is similar to the well-known Deflate [1], but has significant advantages over Deflate in terms of online data compression. I will outline those advantages shortly, then explain how the LZH-Light algorithm works.

Why LZH-Light?

At present, a large number of different lossless compression algorithms exist. Deflate is one of the most popular, and zlib [2] (http://www.cdrom.com/pub/infozip/zlib/) is one of its most popular implementations. The Deflate algorithm is a combination of the LZ77 algorithm and Huffman encoding, and it is very widely used for offline compression. However, its usefulness for online compression is limited by its memory usage, its processing speed, and its sensitivity to frequent flushes.

The LZH-Light algorithm was developed to overcome these problems. Like Deflate, it is a combination of the LZ77 algorithm and Huffman-like encoding, but its design favors exactly the qualities that online compression demands: low memory usage, high processing speed, and inexpensive flushes.

Although the LZH-Light algorithm yields lower offline compression ratios than the Deflate algorithm, LZH-Light can be more suitable for online compression.

LZH-Light Description

My description of the LZH-Light technique consists of two parts: data formats, and algorithms that support the data formats. This description is not complete; it presents basic principles rather than a detailed specification. The full source code for the current LZH-Light implementation, which can be used as reference material, is available on the CUJ ftp site.

Data Formats

The LZH-Light algorithm compresses data in two stages: the LZ77 stage and the Huffman-like stage. At the first (LZ77) stage, raw source data, interpreted as a byte stream, is converted into an intermediate stream of symbols. Each symbol in this stream belongs to the range 0-273. In addition, some symbols may have one or two attached suffixes, each suffix consisting of one or several bits. In the second (Huffman-like) stage, the intermediate stream is converted into a compressed output bit stream. This bit stream is LZH-Light's compression output. The compression/decompression process is depicted in Figure 1.

The LZH-Light algorithm contains a parameter that affects the data format. This parameter is LZBUFSIZE, the size of the sliding window for LZ77's compressor and decompressor. On one hand, the ability to change LZBUFSIZE allows the programmer to control memory usage. On the other hand, the existence of such a parameter means that LZH-Light is actually not a single algorithm, but instead is a family of algorithms with incompatible data formats.

The Intermediate Data Format

The intermediate stream of symbols produced by LZ77's compression (S in Figure 1) will be reproduced on the output side (S' in Figure 1) of the Huffman-like decompressor, and fed as input into the LZ77 decompressor. This intermediate stream of symbols consists of simple instructions for the LZ77 decompressor.

Symbols 0 through 255 represent literal bytes, which the decompressor copies to its output unchanged, while symbols 256 through 271 instruct it to copy a string that appeared earlier, specified by a length and a backward distance. The length of such a string may range from 4 to 521, and the backward distance may range from 0 to LZBUFSIZE. (If the backward distance is less than the string's length, then the referenced string overlaps the current position.) Tables 1 and 2 show how the ch code, the first bit suffix, and the second bit suffix are combined to represent the length and the backward distance.

Thus, the LZ77 compressor's task is to build an intermediate stream S, such that the LZ77 decompressor, which applies the previously described instructions to S' (identical to S), will output a raw byte stream identical to the input raw byte stream of the compressor.

The LZ77 compressor tries to encode (using symbols 256-271) every substring of the input raw stream that has occurred earlier. Such a substring must be at least four bytes long, and its previous occurrence must begin no more than LZBUFSIZE bytes before the current position. For example, in the stream ABCDXABCD, the second ABCD can be encoded as a match of length 4 at backward distance 5.

Huffman-Like Encoding Format

The second stage of LZH-Light compression is similar to Huffman coding. At every given point during compression, a Huffman-like coding table exists. This table defines how many and which bits are used to encode each symbol of the intermediate stream. (Recall that these symbols range from 0 to 273.) If a symbol of the intermediate stream has bit suffixes, these suffixes do not undergo Huffman-like encoding; they are placed into the output stream directly after the bits representing the symbol. The decompressor maintains its own copy of this table. At intervals, the compressor updates the table and places information about these updates in its output stream. The decompressor responds to this information, so that the decompressor's table stays identical to the compressor's table at every corresponding point in the decompression process.

The Huffman-like encoding table used in LZH-Light differs from the classic Huffman coding table. All symbols of the intermediate stream (0-273) are divided into 16 groups. Each group, identified by a group number grp (0-15), has an associated bit count nGroupBit(grp) ranging from 0 to 8, and contains no more than 2^nGroupBit(grp) symbols. A symbol ch of group grp is encoded by first placing the four bits of grp into the output stream, followed by nGroupBit(grp) bits giving the symbol's position within its group. For example, if ch is the third symbol in group 1, and nGroupBit for group 1 is 2, the Huffman-like encoder sends the binary sequence 000110 to the output stream: the first four bits (0001) are the group number, and the last two bits (10, i.e., 2 in decimal, counting positions from 0) indicate that ch is the third symbol in its group.
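As an illustration, here is a minimal sketch of this encoding step. The table arrays symbolGroup[], symbolPos[], and nGroupBit[], and the putBits() helper, are names invented for this sketch; the real implementation's data structures may differ:

/* Hypothetical tables; in the real implementation these are
   rebuilt periodically from symbol statistics. */
static int symbolGroup[ 274 ];   /* group number (0-15) of each symbol   */
static int symbolPos[ 274 ];     /* position of each symbol in its group */
static int nGroupBit[ 16 ];      /* position bits used by each group     */

extern void putBits( int nBits, unsigned value ); /* appends bits to output */

/* Encode one intermediate-stream symbol (0-273). */
void encodeSymbol( int ch )
    {
    int grp = symbolGroup[ ch ];
    putBits( 4, grp );                            /* 4-bit group number    */
    putBits( nGroupBit[ grp ], symbolPos[ ch ] ); /* position within group */
    /* Any bit suffixes attached to ch are emitted verbatim right
       after this, without Huffman-like encoding. */
    }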

This approach has both advantages and disadvantages compared to classic Huffman coding. Every code is between 4 and 12 bits long (four group bits plus zero to eight position bits), which keeps the tables compact and encoding and decoding simple; on the other hand, because code lengths are constrained to this pattern, the encoding can never be better than a true Huffman code built from the same statistics.

Huffman-Like Table Synchronization

For correct operation of Huffman-like coding, it is necessary to synchronize the coding tables of the compressor and the decompressor. In LZH-Light the compressor drives this synchronization: as described above, it periodically places table-update information in its output stream, and the decompressor applies each update at the corresponding point in decompression, so the two tables never diverge.

The LZH-Light Algorithms

Strictly speaking, a description of the LZH-Light data format provides sufficient information for implementing a compressor and a decompressor that are LZH-Light-compliant. Nevertheless, studying the algorithms used in the current implementation may be helpful. I describe some of these algorithms in this section.

Searching for Duplicate Strings

The LZ77 compression stage relies on detecting duplicate strings in the input stream. The LZ77 compressor uses a hash table to find such strings. This table (hashTable) is constructed with a hash function that operates on substrings of length LZMATCH. Each cell hashTable[hashValue] contains a position index, indicating where a substring of length LZMATCH with a hash value of hashValue was recently encountered.

For each input byte, the compressor calculates the hash value for the string composed of the previous LZMATCH-1 symbols from the input stream plus the input byte. In other words, the hash function operates as a sliding window of length LZMATCH over the input stream. If the hash table contains an entry for this hash value and this entry was encountered no earlier than LZBUFSIZE symbols before, then the compressor calculates the match length between the current string and the previously encountered string (a match length less than LZMATCH can occur as a result of a hash table collision).

Obviously, the data model being used is less accurate than tree-like data models (and less accurate than Deflate's data model), and yields a worse compression ratio. Nevertheless, this effect can be considerably reduced. One of the biggest contributors to LZH-Light's lower accuracy is hash-table collisions. Consider the input stream ABCDEFGH...ABCDEFGH. Suppose that while analyzing the second substring "ABCDEFGH" the compressor was prevented from finding a match due to a hash-table collision. In this case the probability of finding a match for any of the substrings "BCDEFGH", "CDEFGH", or "DEFGH" is still very high. Thus, to increase the compression ratio the compressor tries to match not only in the forward direction — starting with the symbols at the position indicated by the hash table — but also in the reverse direction to match as many symbols as possible preceding this position. (In the current implementation this algorithm is found in the section enclosed in #ifdef LZBACKWARDMATCH ... #endif.)

Another way to increase the compression ratio is by using the so-called lazy match (see [1]). After a match of length N beginning with the symbol ch is found, the compressor repeats the entire search process (including hashing) starting at the next input byte after ch. If the compressor finds a new match with length M that is longer than N, it abandons the match found at ch. It then places ch into the intermediate stream without modification, followed by the code indicating the match of length M. Otherwise, the compressor places the code indicating the match of length N into the intermediate stream.
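The following sketch shows the overall shape of this search loop. All names here (calcHash, matchLen, and the emit helpers) are illustrative stand-ins, not the actual identifiers of the implementation, and the constants are example values:

#define LZMATCH   4                 /* illustrative value */
#define LZBUFSIZE 32768             /* illustrative value */
#define HASH_SIZE (1 << 14)

extern unsigned calcHash( const unsigned char* s );     /* hash of LZMATCH bytes */
extern int matchLen( const unsigned char* a,
                     const unsigned char* b, int max ); /* common-prefix length  */
extern void emitLiteral( int ch );            /* intermediate symbols 0-255      */
extern void emitMatch( int len, int dist );   /* symbols 256-271 plus suffixes   */

void compressBlock( const unsigned char* buf, int len )
    {
    static int hashTable[ HASH_SIZE ];  /* stores position+1; 0 means empty */
    int pos = 0;
    while( pos + LZMATCH <= len )
        {
        unsigned h = calcHash( buf + pos );
        int prev = hashTable[ h ] - 1;
        hashTable[ h ] = pos + 1;
        if( prev >= 0 && pos - prev <= LZBUFSIZE )
            {
            int n = matchLen( buf + prev, buf + pos, len - pos );
            /* the real compressor also extends the match backward
               (LZBACKWARDMATCH) and tries a lazy match at pos+1 */
            if( n >= LZMATCH )
                {
                emitMatch( n, pos - prev );
                pos += n;   /* hash updates for the skipped positions
                               are omitted here for brevity */
                continue;
                }
            }
        emitLiteral( buf[ pos++ ] );
        }
    while( pos < len )               /* tail too short to start a match */
        emitLiteral( buf[ pos++ ] );
    }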

I obtained interesting results from studying different methods of hash-value calculation. The classical method of hash-value calculation for a given string s of length LEN can be expressed as follows:

//ROTL( x, shift ) rotates the bits of x left by shift positions;
//a portable definition is added here to make the fragment complete
#define ROTL( x, shift ) \
    (((x) << (shift)) | ((x) >> (sizeof(unsigned)*8 - (shift))))

unsigned hash = 0;
for( int i=0; i < LEN ; ++i )
    hash = ROTL( hash, SHIFT ) ^ s[i];
hash %= TABLE_SIZE;

(SHIFT is chosen more or less arbitrarily, and the best results are obtained when TABLE_SIZE is a prime number.)

zlib uses an alternative method:

#define SHIFT \
    ((TABLE_BITS+LEN-1)/LEN)
unsigned hash = 0;
for( int i=0; i < LEN ; ++i )
    hash = (hash << SHIFT) ^ s[i];
//TABLE_SIZE is 1 << TABLE_BITS
hash &= (TABLE_SIZE - 1);

Using the classical method ensures a smaller number of hash-table collisions (and, therefore, a higher compression ratio), but it is much slower than zlib's method; the slowdown comes from the % operation. (zlib's choice of SHIFT has a further property: since SHIFT*LEN >= TABLE_BITS, each byte's contribution is shifted entirely out of the hash after LEN steps, which is what allows the hash to be updated incrementally as the window slides.) Analysis suggests a third method, which is similar to the classical method but much faster:

unsigned hash = 0;
for( int i=0; i < LEN ; ++i )
    hash = ROTL(hash, SHIFT) ^ s[i];
//TABLE_SIZE is 1 << TABLE_BITS
//(note the *8: the shift count must be in bits, not bytes)
hash = (hash * A + B) >>
    (sizeof(unsigned)*8 - TABLE_BITS);

Here the last calculation is the formula of a linear congruential pseudo-random number generator, where A and B are specially selected constants. (Multiplicative constants of this kind are well studied; Knuth's golden-ratio multiplier 2654435761 is a common choice for 32-bit hashes, though the constants used here are the implementation's own.) This method yields slightly fewer hash-table collisions than the classical method and is much faster, at least on x86 processors, where the * operation is up to three times faster than the % operation.

Group Selection for Huffman-Like Encoding

To achieve the highest possible compression ratio using groups of symbols, the sum of stat(SYM) (the observed frequency of symbol SYM) over the symbols belonging to group 0 should equal the corresponding sum for group 1, and so on. In practice strict equality is not possible, but the grouping algorithm used in the implementation achieves a rather accurate approximation.

In a practical implementation, the algorithm is more complex due to restrictions imposed on groups (see the section "Huffman-Like Table Synchronization" above) and optimizations.
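As a rough illustration only (the actual grouping algorithm is more elaborate, as just noted), one simple way to approximate the balancing is to cut the symbols, sorted by descending frequency, into 16 consecutive runs of roughly equal total frequency:

#define NSYM    274
#define NGROUPS 16

/* Illustrative sketch: 'order' is assumed to already hold the symbol
   numbers sorted by descending stat[] (sorting omitted for brevity);
   groupOf[sym] receives the group assigned to each symbol. */
void buildGroups( const unsigned stat[ NSYM ],
                  const int order[ NSYM ], int groupOf[ NSYM ] )
    {
    unsigned total = 0, acc = 0, target;
    int i, grp = 0;

    for( i = 0; i < NSYM ; ++i )
        total += stat[ i ];
    target = total / NGROUPS + 1;   /* desired frequency mass per group */

    for( i = 0; i < NSYM ; ++i )
        {
        groupOf[ order[ i ] ] = grp;
        acc += stat[ order[ i ] ];
        if( acc >= target && grp < NGROUPS - 1 )
            {
            ++grp;      /* this group has enough mass; start the next */
            acc = 0;
            }
        }
    }

Because the most frequent symbols land in the lowest-numbered groups, which can be given small nGroupBit values, they receive the shortest codes.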

Frequency of Changes to Coding Table

In accordance with the described data format, the LZH-Light compressor chooses when to change the tables used in Huffman-like encoding. At present this is one of the least developed parts of the current LZH-Light implementation: the tables are simply recalculated after every HUFFRECALCLEN (by default 4096) symbols of the intermediate stream.
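In sketch form (all identifier names invented here), this policy amounts to nothing more than a counter check in the symbol-counting path:

#define HUFFRECALCLEN 4096

static unsigned stat[ 274 ];          /* symbol frequency counters */
static int symbolsSinceRecalc = 0;

extern void rebuildTables( const unsigned* stat ); /* regroup the symbols   */
extern void emitTableUpdate( void );  /* place update info in output stream */

void countSymbol( int ch )
    {
    ++stat[ ch ];
    if( ++symbolsSinceRecalc >= HUFFRECALCLEN )
        {
        rebuildTables( stat );
        emitTableUpdate();            /* keeps the decompressor in sync */
        symbolsSinceRecalc = 0;
        }
    }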

Current Implementation of LZH-Light

Although the current implementation of the LZH-Light algorithm is written in C++, it has a C interface, which is defined in lzhl.h (see Listing 1). This allows use of this implementation with both C and C++ programs.

To increase the algorithm's operation speed, all algorithm customizations are performed at compile time with the help of macro definitions (see the _lzhl.h file in the source archives). As mentioned previously, the only parameter that affects the LZH-Light data format is LZBUFSIZE. All other parameters affect only compressor operation and are used to balance compression speed, compression ratio, and memory usage. Since a large number of LZH-Light configurations can be obtained through the macro definitions in _lzhl.h, comprehensive testing and benchmarking were performed only for several recommended configurations (see Table 5).

The program in Listing 2 can be used for testing various configurations of the LZH-Light algorithm. This program provides sample compression/decompression of a single file. It is not optimal; its only purpose is to verify proper operation for the chosen configuration.

Efficiency and Performance

The efficiency of LZH-Light and of the zlib implementation of Deflate was measured on the Calgary Corpus under varying restrictions on block size and memory usage.

Results of the measurements are shown in Figure 2, Figure 3, and Figure 4.

The figures make it possible to compare the two algorithms' behavior as these restrictions are tightened.

Uses for LZH-Light

The primary field of application for LZH-Light is online compression of reliable data streams. At present, an earlier modification of this algorithm is used in a prominent Russian electronic stock market system named RTS (http://www.rtsnet.ru). Data in this system is transferred over a wide area network and compresses well. At the same time, security requirements dictate encryption, which must be performed after compression. Applying the LZH-Light algorithm by itself reduced network traffic by a factor of 4-4.5. Since LZH-Light runs five to six times faster than the encryption algorithm (DES), and encrypting the compressed stream leaves DES only about a quarter of the original data to process, the total processing time for a data block drops by a factor of 2-2.5 (roughly one fifth of the original DES time spent on compression plus about one quarter spent on encryption). Thus, in this particular case, applying the compressor reduced both network traffic and CPU usage.

As an example of an LZH-Light application, Listing 3 shows the interface to an lzhl_tcp library. This library wraps TCP with a compression/decompression layer based on the LZH-Light algorithm. Using this library is identical to using standard TCP; the only (and obvious) limitation is that the wrapper must be used on both sides of the TCP connection.
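A client built on the wrapper might look like the following sketch. It assumes Winsock conventions (hence the SOCKET type), that the declarations from Listing 3 are available through a header (called lzhl_tcp.h here, a name invented for this sketch), and that calls the library does not wrap, such as connect, remain the standard ones; error handling is omitted:

#include <winsock2.h>   /* or the BSD sockets headers, as appropriate */
#include <string.h>
#include "lzhl_tcp.h"   /* hypothetical header with Listing 3's declarations */

void talkCompressed( const struct sockaddr_in* addr )
    {
    char reply[ 512 ];
    SOCKET s = lzhl_socket( AF_INET, SOCK_STREAM, 0 );
    connect( s, (const struct sockaddr*)addr, sizeof *addr );

    /* data is compressed transparently on the way out... */
    lzhl_send( s, "GET quotes\r\n", 12, 0 );

    /* ...and decompressed transparently on the way in */
    lzhl_recv( s, reply, sizeof reply, 0 );

    lzhl_closesocket( s );
    }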

The LZH-Light algorithm can be applied to more than communication channel compression. It is potentially useful for any application for which high compression speed and low memory usage are essential.

References

[1] P. Deutsch. RFC1951: DEFLATE Compressed Data Format Specification version 1.3.

[2] P. Deutsch, J.L. Gailly. RFC1950: ZLIB Compressed Data Format Specification version 3.3.

Sergey Ignatchenko is a senior software developer working on a major Russian electronic stock trading system named RTS (http://www.rtsnet.ru). He specializes in Object-Oriented design and development of distributed systems, and can be reached at [email protected].


Listing 1: A C interface to LZH-Light

LZHL_CHANDLE LZHLCreateCompressor( void );

size_t LZHLCompressorCalcMaxBuf( size_t );

size_t LZHLCompress( LZHL_CHANDLE, void* dst,
                     void* src, size_t srcSz );

void LZHLDestroyCompressor( LZHL_CHANDLE );

LZHL_DHANDLE LZHLCreateDecompressor( void );

int  LZHLDecompress( LZHL_DHANDLE, void* dst, size_t* dstSz,
                     void* src, size_t* srcSz );

void LZHLDestroyDecompressor( LZHL_DHANDLE );
/* End of File */

Listing 2: A program to test LZH-Light configurations

#include <stdio.h>
#include <stdlib.h>
#include "lzhl.h"

#ifndef min
#define min( a, b ) ((a) < (b) ? (a) : (b))
#endif

#define BLOCKSIZE 4000000

void usage()
    {
    printf( "Usage: LZHL {+|-} <input_file> <output_file>\n" );
    }

int main( int argc, char** argv )
    {
    if( argc < 4 )
        {
        usage();
        return 1;
        }
    if( *argv[ 1 ] == '+' )
        {
        FILE* f;
        int i, rawLen, compLen;
        unsigned char* rawBlock;
        unsigned char* compBlock;
        LZHL_CHANDLE comp;
        
        f = fopen( argv[ 2 ], "rb" );
        if( !f )
            abort();
        fseek( f, 0, SEEK_END );
        rawLen = ftell( f );
        fseek( f, 0, SEEK_SET );
        rawBlock = (unsigned char*)malloc( rawLen );
        compBlock = (unsigned char*)
            malloc( LZHLCompressorCalcMaxBuf( rawLen ) );
        fread( rawBlock, 1, rawLen, f );
        fclose( f );

        comp = LZHLCreateCompressor();
        compLen = 0;

        for( i=0; i < rawLen ; i += BLOCKSIZE )
            {
            int l = min( BLOCKSIZE, rawLen - i );
            int lComp = LZHLCompress( comp, compBlock + compLen,
                                      rawBlock + i, l );
            compLen += lComp;
            }
        LZHLDestroyCompressor( comp );

        f = fopen( argv[ 3 ], "wb" );
        if( !f )
            abort();

        // rawLen is stored as a byte sequence to avoid problems 
        // with little_endian/big_endian
        for( i=0; i < 4 ; ++i )
            fputc( (unsigned char)(rawLen >> i*8), f );
        fwrite( compBlock, 1, compLen, f );
        fclose( f );
        }
    else if( *argv[ 1 ] == '-' )
        {
        FILE* f;
        int i, fileLen, rawLen, compLen;
        size_t srcSz, dstSz;
        unsigned char* rawBlock;
        unsigned char* compBlock;
        LZHL_DHANDLE decomp;
        
        f = fopen( argv[ 2 ], "rb" );
        if( !f )
            abort();

        fseek( f, 0, SEEK_END );
        fileLen = ftell( f );
        fseek( f, 0, SEEK_SET );
        if( fileLen < 4 )
            abort();

        rawLen = 0;
        for( i=0; i < 4 ; ++i )
            rawLen |= (fgetc( f ) << i*8);
        compLen = fileLen - 4;

        rawBlock = (unsigned char*)malloc( rawLen );
        compBlock = (unsigned char*)malloc( fileLen );
        if( !rawBlock || !compBlock )
            abort();
        /* the 4 header bytes were already consumed above */
        fread( compBlock, 1, compLen, f );
        fclose( f );

        srcSz = compLen;
        dstSz = rawLen;
        decomp = LZHLCreateDecompressor();
        for(;;)
            {
            int Ok = LZHLDecompress( decomp,
                                     rawBlock + rawLen - dstSz,
                                     &dstSz,
                                     compBlock + compLen - srcSz,
                                     &srcSz );
            if( !Ok )
                abort();
            if( srcSz == 0 )
                break;
            }
        LZHLDestroyDecompressor( decomp );

        f = fopen( argv[ 3 ], "wb" );
        if( !f )
            abort();
        fwrite( rawBlock, 1, rawLen, f );
        fclose( f );
        }
    else
        {
        usage();
        return 1;
        }


    return 0;
    }
//End of File

Listing 3: An LZH-Light interface to a TCP library
SOCKET lzhl_socket( int af, int type, int protocol );
int lzhl_send( SOCKET s, const char* buf, int len, int flags );
int lzhl_recv( SOCKET s, char* buf, int len, int flags );
int lzhl_closesocket( SOCKET sock );
/* End of File */

Table 1: Formation of first bit suffix and match lengths for intermediate symbols


Table 2: Formation of second bit suffix

The above values depend on the value of LZBUFSIZE. LZBUFBITS is log2(LZBUFSIZE)
(i.e., LZBUFSIZE = 1 << LZBUFBITS). Backward distance is calculated as
BaseDisp + bit_suffix_2 & ( (1 << (Len-3)) -1 ).
