Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.

# Al Williams

May 04, 2012

In years past, I used to travel around teaching week-long seminars on different programming topics. Although I haven't done that for awhile, I do miss teaching and working with students. I guess because I miss it, I try to drop in on some high-school engineering classes from time to time (and yes, these days high schools have engineering classes). I also talk to regular classes about what it means to be an engineer.

## More Insights

More >>

More >>

### Webcasts

More >>

One theme I talk about with kids is that engineers often make trades. I want a car that is perfectly safe, costs \$10, and gets 200 miles per gallon of gas. But I can't have all of those things, so I make trades, or perhaps trade-offs would be a better way to put it. More metal makes the car safer, but makes it get less gas mileage and cost more. That's a trade.

The classic trade in computer science is trading memory for execution speed. You can see that in a nearly pure form by looking at how you'd write a C function to count the number of bits set in a word (sometimes known as the bit population count).

At first glance this seems pretty easy:

```unsigned int value, count;
for (count=0; value; value >>=1)  count+=value & 1;
```

That's easy enough and doesn't take much memory once compiled. The downside is the speed. If you have an 8-bit byte, you could make 8 passes through the loop. At least the code does stop early if you run out of set bits, but you can't count on it ever taking less than 8 passes for a byte. For a 32-bit word, you could take 32 loops.

One easy way to speed things up is to trade memory for speed. I can go faster if I'm willing to invest in memory for a lookup table.

```unsigned int value, count;
unsigned char bitset[] =  { 0, 1, 1, 2, 1, 2, 2, 3, …. };
count=bitset[value];
```

That's great and takes only one array lookup. But the price is a 256 element array to process a byte. A 32-bit word? Forget about it! You'd need 232 entries in your table.

On the other hand, both of these are extreme cases. Consider this code:

```unsigned int value, count=0;
unsigned char bit4set[] =
{ 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4 };
while (value)
{
count+=bit4set(value&0xF);
value>>=4;
}
```

This is the middle trade. The table only works on 4 bits at a time, and you have to (potentially) take more than one pass through the loop. For a byte, you might make 0, 1, or 2 loops. A 32-bit word could take as much as 8 loops (which is still better than 32).

There's at least one more engineering lesson to be learned here: You can optimize code to some point, but really big optimizations are usually due to selecting a smarter algorithm. Consider this algorithm that Peter Wegner published in 1960:

```unsigned int value, count;
for (count=0; value; count++)
{
value&=value-1;
}
```

You have to think about this one a little. If the number is already zero, the `for` loop doesn't execute and `count` is zero, as it should be. Any other number is going to have a 1 in some position. Calculating `value-1` gives me the same number except that the rightmost 1 bit will become a 0, and all the bits to the right of that will be 1s. For example:

```00000100 -> 00000011
10101000 -> 10100111
00000001 -> 00000000
```

Using the `and` operator ensures that all the rightmost bits will go to 0, but none of the bits ahead of the rightmost 1 will change. What this means is that the loop only executes once for every 1 bit in the word. Granted, it could still take 8 loops to count a byte, but assuming the inputs are evenly distributed, that won't happen often. Using the simple non-table approach, half the possible inputs would take 8 loops and 64 would require 7 loops. Only one bit pattern would take a single loop.

With the `value-1` method, there are 8 different byte-sized inputs that take a single loop. There is only one input (`11111111`) that actually requires 8 loops. So while it is memory efficient, it should be faster on the average.

For a final engineering lesson from this simple example, you can also consider reading the documentation. If you use GCC, you could use the `__builtin_popcount` function (here) to provide the correct count. You can presume that it uses assembly language and any special features the CPU might have and ought to be faster than anything you could build yourself in C. Then again, maybe characterizing each method in theory and then in practice would be a pretty good exercise for one of these high-school classes.

 To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.