Channels ▼

Al Williams

Dr. Dobb's Bloggers

Slimming Strings With Custom Base-40 Packing

April 01, 2011

It's probably not strictly true, but to me one of the biggest differences between programming embedded systems and programming for desktop or server systems is that you spend so much time obsessing over the size of your embedded software. When my embedded systems were running in 2K of EEPROM and my desktops had gigabytes of memory that was an easy distinction to make. But lately some of my "embedded" systems have snappy ARM processors, run Linux, and have LCD touchscreens. Embedded indeed!

What these little embedded PCs don't have, however, is giant hard drives and huge expanses of memory. So space is still at a premium. You can always hand optimize your C code, but the compiler optimization is normally pretty good these days. Still, with all these user interfaces, I find I have some systems that contain an excess of strings.Text messages for help. Text messages for logging. Text messages for prompting the user. And text isn't all that space-efficient.

In the old days, we used to use a very simple form of compression to save space on limited magnetic media (tape, if you must know). It occurred to me that this compression technique -- known as base 40 packing -- would be easy enough to implement on even the simplest embedded system. The algorithm assumes that you can do integer division. Well, you need multiplication too, but those can be shifts, so the algorithm is simple and can be made pretty fast.

The compression is not dramatic but for a lot of strings it can be significant. The algorithm can always squeeze 3 characters into two bytes. A 33% reduction is pretty good, but there's no free lunch. The price of this compression is you can only encode 40 characters (but with my hack we'll increase this). For some applications, that's actually plenty. Let's understand the basic algorithm.

When you have a hex (base 16 number) with three digits each digit stands for a certain "count." So 123 (in hex) means 1*256 + 2*16 + 3. More precisely, it is really 1 times 16 to the 2nd power, plus 2 times 16 to the 1st power, plus 3 times 16 to the 0th power. This numerical encoding scheme is how base 40 packing works. The idea is that we will form a three-digit number using base 40. Each digit will be worth 0 to 39 (hence, 40 characters are available).

The selection of 40 is not arbitrary. If you just split up a 16-bit number, you'd wind up with three 5 bit numbers (and one extra bit). You can only encode 32 symbols in 5 bits. So simple packing won't get more than 32 symbols in the same space. Using base 40, the biggest number we can generate is 39*1600+39*40+39 = 63,999, which fits in 16 bits.

If you can stand all upper-case or all lower-case letters, 40 characters isn't bad. You can store 26 letters, 10 digits, and still have four characters left over for space, new line, etc. Unfortunately, many systems today need mixed case letters and more symbols. To fix that you can borrow a page from the old-fashioned baudot teletype machines. Assume that uppercase letters and punctuation will be relatively infrequent. Then, you can define one symbol to be a "shift" to select a second character set. To keep things simple, I usually just make the shift apply to the next character. However, if you really want to be a tightwad, you could make a shift apply until an unshift character. What works best will depend on the strings you are encoding.

As an example, consider this header file, which defines a character set with two "cases":

#ifndef __CTABLE_H
#define __CTABLE_H

char ctable[2][39]=
  {
    {
      '\0','a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','0','1','2','3','4','5','6','7','8','9',' ','\n'
    },
    {
      '\0','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z','(','!','@','#',',','.','?','/','*',')','<','>'       
    }
  };

  
#endif

That's a lot of characters. I'm purposely wasting space by putting zero in first on both subarrays because it makes debugging easier. Also, there's no reason the "upper" case can't use one extra symbol (meaning a double shift is actually a character). I just didn't do that.

The encoding can be sloppy because you'll do that once on a PC and use the output of the encoder to define the text in your embedded source (for example, an array of hex words in C or some DW directives if you are using assembler). Here's a simple encoder:

/* Encode strings as base 40 triplets.
   Al Williams DDJ
*/

#include <stdio.h>

// The character table is a 2x38 array. The 2 arrays are for
// the shift state and by convention 0 is always 0
// code 39 (not in the array which tops out at index 38)
// causes the shift to be set for the next character

#include "ctable.h"   // needs to be the same as decoder!


#define ZEROCHAR '~'  // use this to represent zero in input
// what to use for unknown symbols -- note must not be shifted!
#define UNKNOWN_SYM 37  // space in the default encoding

int pending=0;  // encode needs to send more than one symbol
 

// return a symbol (or symbols, see pending) for a given character
int encode(int c)
{
  int i;
  static int shifted=-1;
// if we are mid shift, just send the pending symbol
  if (shifted!=-1)  
    {
      i=shifted;
      shifted=-1;
      pending=0;
      return i;
    }
  // make it easy for the user to enter a zero
  if (c==ZEROCHAR) c='\0';
  // simple search for the symbol -- run time
  // performance isn't a big deal here because
  // we will just run this on the PC at design time
  for (i=0;i<39;i++)
    {
      if (ctable[0][i]==c)
	{
	  pending=0;  
	  return i;  // b40 symbol
	}
      if (ctable[1][i]==c)
	{
	  pending=1;
	  shifted=i;
	  return 39;    // shift code
	}
    }
  // use 0 for any unmatched character
  // you might prefer using your symbol for '.' or '?'
  // and you can set that in the #define
  fprintf(stderr,"Warning character %c (%x) unmatched\n",c,c);
  pending=0;
  return UNKNOWN_SYM;  // unmatched character
}

// Emit a 16 bit quantity
// in the d array (d[2] is first digit, d[0] is last digit)
  void emit(int *d)
  {
    unsigned int v;
    v=d[2]*1600+d[1]*40+d[0];
    printf("%04X,",v);
  }
  


int main(int argc, char *argv[])
{
  int d[3];  // current B40 "digits"
  int cp=3;  // pointer in d array
  int c;     // current character


  // Note you have to press ^D twice if not at the start of a line!  
  while ((c=getchar())!=EOF)
    {
      int cval;
      pending=1;   // make sure we call encode once
      while (pending)  // while encode has something to say
	{
	  int cenc;
	  cenc=encode(c);  // could be index (0-37) or shiftcode
	  d[--cp]=cenc;
	  // every 3 characters we have to emit something
	  if (cp==0)
	    {
	      emit(d);
	      cp=3;  // reset d array
	    }
	}
      
    }
  
  // flush any left overs
  if (cp!=3)
    {
    while (cp!=-1) d[--cp]=0;
    emit(d);  // write that last word
    }
  // done!  
  return 0;
  
}
 

The only tricky part is handling the end of file. Otherwise, it is just a simple lookup in the table to find the three "digits" and making up the 16-bit word with them. The decoder is even simpler, which is a good thing because you might have to hand code it in assembly:


/* Decode strings from base 40 triplets.
   Al Williams DDJ
*/

#define TESTMAIN 1   // build test harness?


#include <stdio.h>

// The character table is a 2x38 array. The 2 arrays are for
// the shift state and by convention 0 is always 0
// code 39 (not in the array which tops out at index 38)
// causes the shift to be set for the next character
#include "ctable.h"   // needs to be the same as decoder!

// Decode a triplet into out[start] and beyond
// since shift characters don't create characters you could have
// from 1 to 3 characters in out (since [shift]X[shift] is legal)
// so the return value is how many characters were set
int decode(int in,char *out, int start)
{
  static shiftstate=0;
  int count=3;  // assume we will emit 3 characters
   
  unsigned int tmp;
  // get first character
  tmp=in/1600;
  if (tmp==39)  // test for shift
    {
    shiftstate=1;
    count--;  // won't count as output
    }
  else
    {
      // real character, so look it up and reset shift
      // which may not have been set anyway
     out[start++]=ctable[shiftstate][tmp];
     shiftstate=0;
    }
  // get 2nd character and repeat logic
  // keep in mind that X*1600 => X*1024+X*512+X*64
  // so this could be optimized if necessary
  tmp=(in-tmp*1600)/40;
  if (tmp==39)
    {
      shiftstate=1;
      count--;
    }
  else
    {
    out[start++]=ctable[shiftstate][tmp];
    shiftstate=0;
    }
  // get 3rd character.. if you don't have a mod
  // function you could say tmp=(in-tmp*40)
  // don't forget that x*40 = x*32+x*8 so you could
  // make * 40 very efficient if you needed to optimize
  tmp=in%40;
  if (tmp==39)
    {
      shiftstate=1;
      count--;
    }
  else
    {
      out[start++]=ctable[shiftstate][tmp];
      shiftstate=0;
    }
  return count;   // return count
}

#if TESTMAIN==1
int main(int argc, char *argv[])
{
  int v,c;
  char buf[4];
  // grab a triplet
  while (scanf("%x,",&v)>0) 
    {
      // decode it
      c=decode(v,buf,0);
      buf[c]='\0';
      // for us printing it is fine
      printf("%s",buf);
    }
  printf("\n");
}

#endif

Of course, this only makes sense if you have enough strings that the savings will be greater than the space used by the look-up table and the decoder (don't forget to get rid of the test main before you determine the size on your platform). You can further optimize the multiplications in the code to be shifts and adds (see the comments in the code).

Although this is an old technique, it lends itself well to simplified CPUs where space is going to be tightest. The shift code eats away at the 33% savings, of course, but you can mitigate it somewhat by experimenting with different lookup tables. For example, maybe your strings contain only a few lowercase q's but lots of uppercase S characters. You can scramble the table so that S is encoded in the "lower" case and the lowercase q becomes a shifted character.

That's just one weapon in the space-saving arsenal. Drop me a comment and let me know how you like to save space and I'll share some more of mine in future blogs.

Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 


Video