Embedded Systems

Run Length Encoding Revisited

By Phil Daley, May 01, 1989

The run-length encoding technique we presented in February generated a flurry of response, and so here's another approach to the problem.

MAY89: RUN LENGTH ENCODING REVISITED

Phil was a technical editor of MICRO The 6502 Journal for two years and is now a software engineer at Mentor Resources, 1 Tara Blvd., Nashua, NH 03062; 603-888-2580.

Many files stored on computer systems have a lot of wasted space: Often, records have duplicated spaces or fixed length records that aren't fully used, so blank areas of unused space exist that could be compressed out of the records. (Notice how much smaller an .EXE file is after being processed by EXEPACK, if it wasn't linked with /E.) In my work with user interfaces and screen files, a good deal of each record is either blanks or zeros. Routines to squeeze this space out of records can be very simple (saving a minimum amount of space) or very complex (designed by the program for each record and requiring significant amounts of processing time). What brought this all to mind was the RLE presented in Robert Zigon's "Run Length Encoding" (DDJ, February 1989). The approach presented in that article, however, doesn't do well with runs of unique characters, doubling the length of each unique character as it goes. Two changes to the encoding rules can result in a dramatic increase in the compression factor without a large increase in processing time: 1. Don't compress less than three duplicated characters; and 2. Mark areas of unduplicated characters the same as duplicated ones, using a bit to indicate whether the next run is compressed or not.

Every byte in the file is marked as either "compressed" or "uncompressed." The markers (or "compression bytes") are 1 byte in size. A compression byte has the high bit set if it indicates a run of redundant bytes, and clear if it marks a set of unique bytes. The other seven bits are used to indicate the length (1 - 128) of the run of compressed or uncompressed bytes.

In the case of a compression byte with the high bit set (>=128), the next byte is the character that is duplicated. When a compression byte indicates unique characters (<=127), the next compression_byte + 1 characters are the normal bytes in the file. The worst case (all unique characters) becomes 129 bytes for 128 bytes, and the best case is 2 bytes for 128 duplicated bytes. This is illustrated in Figure 1, which equals a total of 10 bytes, the same as Zigon's. Notice that the CCD pattern only increased the character count by 1 byte. This would be true for any number of unique bytes in a row, up to 128. Zigon's method increases each different character to 2 bytes each.

Figure 1: Compression byte with high list set

  Input buffer: A A A A B B B B B B C C D E E E E E

  Output buffer: 83 41 85 42 02 43 43 44 84 45

    Translation: 83   (4 compressed bytes value) 41
                 85   (6 compressed bytes value) 42
                 02   (3 uncompressed bytes value) 43 43 44
                 84   (5 compressed bytes value) 45

In my application of saving screen images and other database-type records, where a good deal of the record is either spaces or zeros, this routine saves 50 percent of the record space, on the average. The routine does require you to save the record length along with the record, but most C implementations of variable length indexed records require this information.

While I haven't coded this in assembly because the C implementation runs fast enough for my practical purposes, it probably runs a little slower than Zigon's for compression (See Listing One). The uncompress routine is probably just as fast as his, because of its simplicity.

Notes

The disk program files were compiled using the large model, the only one for which I keep libraries on my disk. The source would produce smaller, faster .OBJ and .EXE files if it were recompiled with the small memory model.

Availability

All source code for articles in this issue is available on a single disk. To order, send $14.95 (Calif. residents add sales tax) to Dr. Dobb's Journal, 501 Galveston Dr., Redwood City, CA 94063; or call 800-356-2002 (from inside Calif.) or 800-533-4372 (from outside Calif.). Please specify the issue number and format (MS-DOS, Macintosh, Kaypro).

_RLE Revisited_ by Phil Daley

[LISTING ONE]

<a name="00fc_0008">

/***************************************************************************
*      PROGRAM          RLE.C                       *
*      written by Phil Daley                       *
*      February 3, 1989                        *
***************************************************************************/
int main(void);
int uncompress(unsigned char *,int,unsigned char *) ;
int compress(unsigned char *,int,unsigned char *) ;
int process_comp(unsigned char *,int,int) ;
int process_uncomp(unsigned char *,int,int) ;
#include   <stdio.h>
#include   <memory.h>
unsigned char   screen[24][80] = {
"IMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM;",
":                                                                              :",
":                                                                              :",
":                                                                              :",
":          o   This is a sample screen that would be typical of the type       :",
":              that would present information to a user for instructions       :",
":              or a help screen, etc.                                          :",
":                                                                              :",
":                                                                              :",
":          o   While it contains a lot of unique characters in the text        :",
":              lines, it also contains a lot of white space in empty lines     :",
":              and margins.                                                    :",
":                                                                              :",
":                                                                              :",
":          o   It would be unusual for the compression algorithm used          :",
":              here to find any repeated characters other than spaces          :",
":              and the border.                                                 :",
":                                                                              :",
":                                                                              :",
":                                                                              :",
":                                                                              :",
":                                                                              :",
":                                                                              :",
"HMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM<"} ;

unsigned char   new[2000] ;

/**********************    main      ***********************/
int main()         /*  this is a demo main  */
{
    int     orig_length = 1920 ;
    int     compressed_length ;
    int     i, j ;

    compressed_length = compress(screen[0],orig_length,new) ;
    printf("The original screen (1920) compressed to %d bytes\n",compressed_length) ;
    memset(screen,0,1920) ;      /*  erase the orig  */
    orig_length = uncompress(new,compressed_length,screen[0]) ;
    printf("And back to the original (%d) length\n",orig_length) ;
    for (i = 0; i < 24; i++)
   for (j = 0; j < 80; j++)   /*  show it  */
       printf("%c",screen[i][j]) ;
    return(0) ;
}
/***********************   compress   ****************************/
compress(in_array,in_size,out_array)
unsigned char   *in_array ;
int      in_size ;
unsigned char   *out_array ;
{
    register int i = 0 ;
    register int j = 0 ;
    register int k ;
    register int l ;

    while (i < in_size) {
   if (in_array[i] == in_array[i + 1]  && in_array[i + 1] == in_array[i + 2]) {
       k = process_comp(in_array,i,in_size) ;
       out_array[j++] = (unsigned char)k | 0x80 ;
       out_array[j++] = in_array[i] ;
       i += k ;
   }
   else  {
       k = process_uncomp(in_array,i,in_size) ;
       out_array[j++] = (unsigned char)k ;
       for (l = 0; l < k; l++)

      out_array[j++] = in_array[i++] ;
   }
    }
    return(j) ;
}
/***********************   process_comp   ****************************/
process_comp(in_array,i,in_size)
unsigned char   *in_array ;
int      i ;
int      in_size ;
{
    register int len = 0 ;

    while (in_array[i] == in_array[i + 1] && i < in_size) {
   len++ ;
   i++ ;
   if (len == 126)
       break ;
    }
    return(len + 1) ;
}
/***********************   process_uncomp   ****************************/
process_uncomp(in_array,i,in_size)
unsigned char   *in_array ;
int      i ;
int      in_size ;
{
    register int len = 0 ;

    while ((in_array[i] != in_array[i + 1] || in_array[i] != in_array[i + 2]) && i < in_size) {
   len++ ;
   i++ ;
   if (len == 127)
       break ;
    }
    return(len) ;
}


/**********************    uncompress   ***********************/
uncompress(in_array,in_size,out_array)
unsigned char   *in_array ;
int      in_size ;
unsigned char   *out_array ;
{
    register int    i ;
    register int    j ;
    register int    k=0 ;
    register int    l ;

    for (i = 0; i < in_size;) {
   j = in_array[i++] ;
   if (j > 128) {
       for (j -= 128; j > 0; j--)
      out_array[k++] = in_array[i] ;
       i++ ;
   }
   else
       for (l = 0; l < j; l++)
      out_array[k++] = in_array[i++] ;
    }
    return(k) ;
}


-30-

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Embedded Systems

Run Length Encoding Revisited

Figure 1: Compression byte with high list set

Notes

Availability

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Embedded Systems Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

Embedded Systems

Run Length Encoding Revisited

Related Reading

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Embedded Systems Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content