Stock in Trade
In years past, I used to travel around teaching week-long seminars on different programming topics. Although I haven't done that for awhile, I do miss teaching and working with students. I guess because I miss it, I try to drop in on some high-school engineering classes from time to time (and yes, these days high schools have engineering classes). I also talk to regular classes about what it means to be an engineer.
More Insights
White Papers
- Evaluating the Performance of Shared WAN Links for Data Center Backup and Disaster Recovery
- Demystifying Unified Communications
Reports
More >>Webcasts
- Real Time Analytics: A Case Study Webinar
- Banking on Results: Turn an Avalanche of Data into Actionable Insight
One theme I talk about with kids is that engineers often make trades. I want a car that is perfectly safe, costs $10, and gets 200 miles per gallon of gas. But I can't have all of those things, so I make trades, or perhaps trade-offs would be a better way to put it. More metal makes the car safer, but makes it get less gas mileage and cost more. That's a trade.
The classic trade in computer science is trading memory for execution speed. You can see that in a nearly pure form by looking at how you'd write a C function to count the number of bits set in a word (sometimes known as the bit population count).
At first glance this seems pretty easy:
unsigned int value, count; for (count=0; value; value >>=1) count+=value & 1;
That's easy enough and doesn't take much memory once compiled. The downside is the speed. If you have an 8-bit byte, you could make 8 passes through the loop. At least the code does stop early if you run out of set bits, but you can't count on it ever taking less than 8 passes for a byte. For a 32-bit word, you could take 32 loops.
One easy way to speed things up is to trade memory for speed. I can go faster if I'm willing to invest in memory for a lookup table.
unsigned int value, count;
unsigned char bitset[] = { 0, 1, 1, 2, 1, 2, 2, 3, …. };
count=bitset[value];
That's great and takes only one array lookup. But the price is a 256 element array to process a byte. A 32-bit word? Forget about it! You'd need 232 entries in your table.
On the other hand, both of these are extreme cases. Consider this code:
unsigned int value, count=0;
unsigned char bit4set[] =
{ 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4 };
while (value)
{
count+=bit4set(value&0xF);
value>>=4;
}
This is the middle trade. The table only works on 4 bits at a time, and you have to (potentially) take more than one pass through the loop. For a byte, you might make 0, 1, or 2 loops. A 32-bit word could take as much as 8 loops (which is still better than 32).
There's at least one more engineering lesson to be learned here: You can optimize code to some point, but really big optimizations are usually due to selecting a smarter algorithm. Consider this algorithm that Peter Wegner published in 1960:
unsigned int value, count;
for (count=0; value; count++)
{
value&=value-1;
}
You have to think about this one a little. If the number is already zero, the for loop doesn't execute and count is zero, as it should be. Any other number is going to have a 1 in some position. Calculating value-1 gives me the same number except that the rightmost 1 bit will become a 0, and all the bits to the right of that will be 1s. For example:
00000100 -> 00000011 10101000 -> 10100111 00000001 -> 00000000
Using the and operator ensures that all the rightmost bits will go to 0, but none of the bits ahead of the rightmost 1 will change. What this means is that the loop only executes once for every 1 bit in the word. Granted, it could still take 8 loops to count a byte, but assuming the inputs are evenly distributed, that won't happen often. Using the simple non-table approach, half the possible inputs would take 8 loops and 64 would require 7 loops. Only one bit pattern would take a single loop.
With the value-1 method, there are 8 different byte-sized inputs that take a single loop. There is only one input (11111111) that actually requires 8 loops. So while it is memory efficient, it should be faster on the average.
For a final engineering lesson from this simple example, you can also consider reading the documentation. If you use GCC, you could use the __builtin_popcount function (here) to provide the correct count. You can presume that it uses assembly language and any special features the CPU might have and ought to be faster than anything you could build yourself in C. Then again, maybe characterizing each method in theory and then in practice would be a pretty good exercise for one of these high-school classes.

