Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Embedded Systems

Run Length Encoding Revisited


MAY89: RUN LENGTH ENCODING REVISITED

Phil was a technical editor of MICRO The 6502 Journal for two years and is now a software engineer at Mentor Resources, 1 Tara Blvd., Nashua, NH 03062; 603-888-2580.


Many files stored on computer systems have a lot of wasted space: Often, records have duplicated spaces or fixed length records that aren't fully used, so blank areas of unused space exist that could be compressed out of the records. (Notice how much smaller an .EXE file is after being processed by EXEPACK, if it wasn't linked with /E.) In my work with user interfaces and screen files, a good deal of each record is either blanks or zeros. Routines to squeeze this space out of records can be very simple (saving a minimum amount of space) or very complex (designed by the program for each record and requiring significant amounts of processing time). What brought this all to mind was the RLE presented in Robert Zigon's "Run Length Encoding" (DDJ, February 1989). The approach presented in that article, however, doesn't do well with runs of unique characters, doubling the length of each unique character as it goes. Two changes to the encoding rules can result in a dramatic increase in the compression factor without a large increase in processing time: 1. Don't compress less than three duplicated characters; and 2. Mark areas of unduplicated characters the same as duplicated ones, using a bit to indicate whether the next run is compressed or not.

Every byte in the file is marked as either "compressed" or "uncompressed." The markers (or "compression bytes") are 1 byte in size. A compression byte has the high bit set if it indicates a run of redundant bytes, and clear if it marks a set of unique bytes. The other seven bits are used to indicate the length (1 - 128) of the run of compressed or uncompressed bytes.

In the case of a compression byte with the high bit set (>=128), the next byte is the character that is duplicated. When a compression byte indicates unique characters (<=127), the next compression_byte + 1 characters are the normal bytes in the file. The worst case (all unique characters) becomes 129 bytes for 128 bytes, and the best case is 2 bytes for 128 duplicated bytes. This is illustrated in Figure 1, which equals a total of 10 bytes, the same as Zigon's. Notice that the CCD pattern only increased the character count by 1 byte. This would be true for any number of unique bytes in a row, up to 128. Zigon's method increases each different character to 2 bytes each.

Figure 1: Compression byte with high list set

  Input buffer: A A A A B B B B B B C C D E E E E E

  Output buffer: 83 41 85 42 02 43 43 44 84 45

    Translation: 83   (4 compressed bytes value) 41
                 85   (6 compressed bytes value) 42
                 02   (3 uncompressed bytes value) 43 43 44
                 84   (5 compressed bytes value) 45

In my application of saving screen images and other database-type records, where a good deal of the record is either spaces or zeros, this routine saves 50 percent of the record space, on the average. The routine does require you to save the record length along with the record, but most C implementations of variable length indexed records require this information.

While I haven't coded this in assembly because the C implementation runs fast enough for my practical purposes, it probably runs a little slower than Zigon's for compression (See Listing One). The uncompress routine is probably just as fast as his, because of its simplicity.

Notes

The disk program files were compiled using the large model, the only one for which I keep libraries on my disk. The source would produce smaller, faster .OBJ and .EXE files if it were recompiled with the small memory model.

Availability

All source code for articles in this issue is available on a single disk. To order, send $14.95 (Calif. residents add sales tax) to Dr. Dobb's Journal, 501 Galveston Dr., Redwood City, CA 94063; or call 800-356-2002 (from inside Calif.) or 800-533-4372 (from outside Calif.). Please specify the issue number and format (MS-DOS, Macintosh, Kaypro).

_RLE Revisited_ by Phil Daley

[LISTING ONE]

<a name="00fc_0008">

/***************************************************************************
*      PROGRAM          RLE.C                       *
*      written by Phil Daley                       *
*      February 3, 1989                        *
***************************************************************************/
int main(void);
int uncompress(unsigned char *,int,unsigned char *) ;
int compress(unsigned char *,int,unsigned char *) ;
int process_comp(unsigned char *,int,int) ;
int process_uncomp(unsigned char *,int,int) ;
#include   <stdio.h>
#include   <memory.h>
unsigned char   screen[24][80] = {
"IMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM;",
":                                                                              :",
":                                                                              :",
":                                                                              :",
":          o   This is a sample screen that would be typical of the type       :",
":              that would present information to a user for instructions       :",
":              or a help screen, etc.                                          :",
":                                                                              :",
":                                                                              :",
":          o   While it contains a lot of unique characters in the text        :",
":              lines, it also contains a lot of white space in empty lines     :",
":              and margins.                                                    :",
":                                                                              :",
":                                                                              :",
":          o   It would be unusual for the compression algorithm used          :",
":              here to find any repeated characters other than spaces          :",
":              and the border.                                                 :",
":                                                                              :",
":                                                                              :",
":                                                                              :",
":                                                                              :",
":                                                                              :",
":                                                                              :",
"HMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM<"} ;

unsigned char   new[2000] ;

/**********************    main      ***********************/
int main()         /*  this is a demo main  */
{
    int     orig_length = 1920 ;
    int     compressed_length ;
    int     i, j ;

    compressed_length = compress(screen[0],orig_length,new) ;
    printf("The original screen (1920) compressed to %d bytes\n",compressed_length) ;
    memset(screen,0,1920) ;      /*  erase the orig  */
    orig_length = uncompress(new,compressed_length,screen[0]) ;
    printf("And back to the original (%d) length\n",orig_length) ;
    for (i = 0; i < 24; i++)
   for (j = 0; j < 80; j++)   /*  show it  */
       printf("%c",screen[i][j]) ;
    return(0) ;
}
/***********************   compress   ****************************/
compress(in_array,in_size,out_array)
unsigned char   *in_array ;
int      in_size ;
unsigned char   *out_array ;
{
    register int i = 0 ;
    register int j = 0 ;
    register int k ;
    register int l ;

    while (i < in_size) {
   if (in_array[i] == in_array[i + 1]  && in_array[i + 1] == in_array[i + 2]) {
       k = process_comp(in_array,i,in_size) ;
       out_array[j++] = (unsigned char)k | 0x80 ;
       out_array[j++] = in_array[i] ;
       i += k ;
   }
   else  {
       k = process_uncomp(in_array,i,in_size) ;
       out_array[j++] = (unsigned char)k ;
       for (l = 0; l < k; l++)

      out_array[j++] = in_array[i++] ;
   }
    }
    return(j) ;
}
/***********************   process_comp   ****************************/
process_comp(in_array,i,in_size)
unsigned char   *in_array ;
int      i ;
int      in_size ;
{
    register int len = 0 ;

    while (in_array[i] == in_array[i + 1] && i < in_size) {
   len++ ;
   i++ ;
   if (len == 126)
       break ;
    }
    return(len + 1) ;
}
/***********************   process_uncomp   ****************************/
process_uncomp(in_array,i,in_size)
unsigned char   *in_array ;
int      i ;
int      in_size ;
{
    register int len = 0 ;

    while ((in_array[i] != in_array[i + 1] || in_array[i] != in_array[i + 2]) && i < in_size) {
   len++ ;
   i++ ;
   if (len == 127)
       break ;
    }
    return(len) ;
}


/**********************    uncompress   ***********************/
uncompress(in_array,in_size,out_array)
unsigned char   *in_array ;
int      in_size ;
unsigned char   *out_array ;
{
    register int    i ;
    register int    j ;
    register int    k=0 ;
    register int    l ;

    for (i = 0; i < in_size;) {
   j = in_array[i++] ;
   if (j > 128) {
       for (j -= 128; j > 0; j--)
      out_array[k++] = in_array[i] ;
       i++ ;
   }
   else
       for (l = 0; l < j; l++)
      out_array[k++] = in_array[i++] ;
    }
    return(k) ;
}


-30-














Copyright © 1989, Dr. Dobb's Journal


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.