Code Finessing

When it comes down to it, what counts is quality code that is correct, readable, and efficient.


October 05, 2006
URL:http://www.drdobbs.com/open-source/code-finessing/193104882

Diomidis is an Associate Professor in the Department of Management Science and Technology of the Athens University of Economics and Business. He is also the author of the "Open Source Perspective" books Code Reading (Software Development Productivity Award, 2004) and Code Quality (Addison-Wesley, 2006). He can be contacted at [email protected].


CScout is a refactoring browser for C code that has been in the works for the past five years. When I set out to apply CScout to the Linux kernel source code, I discovered that it failed to correctly expand a couple of C macros, causing the analysis to fail. This prompted me to reimplement CScout's macro expansion using a precise functional specification, then to optimize away the severe degradation in time performance that resulted, and finally to tidy up the mess the optimization left behind. The work touched on many interesting subjects: correctness, performance, compiler optimizations, and readability. Although the domain I worked on is specialized, the lessons I learned have everyday applicability.

CScout (www.spinellis.gr/cscout) can process workspaces of multiple C projects, mapping the complexity introduced by the C preprocessor back into the original source code files. CScout takes advantage of modern hardware advances (fast processors and large memory capacities) to analyze C source code beyond the level of detail and accuracy provided by current compilers and linkers. For example, CScout lets you rename an arbitrary identifier in your code, and the change correctly propagates to exactly those source code elements (proper C code and macros) that form an equivalent correct code body. To satisfy this tall order, CScout incorporates in its code base a C preprocessor with parts of a compiler and a linker.

The C Preprocessor

The power of C/C++ preprocessor macros comes from two deceptively simple constructs. With the preprocessor's object-like macros, such as #define PI 3.1415927, you can define compile-time constants and guide conditional compilation; with function-like macros, such as #define sqr(x) ((x) * (x)), you can write inline and generic functions. Both facilities can also be used, or abused, to morph your code toward forms you believe to be more readable than plain C/C++. (An early implementation of the UNIX shell appeared to be written in an Algol-like language, implemented—you guessed it—through C macros.)
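
For example, a handful of definitions along these lines (illustrative, in the spirit of the shell's macros rather than copied from them) is enough to give C an Algol flavor:

#include <stdio.h>

/* Illustrative macros morphing C's syntax toward an Algol-like notation. */
#define IF      if (
#define THEN    ) {
#define ELSE    } else {
#define FI      }

int main(void)
{
    int n = 2;

    IF n > 1 THEN
        puts("many");
    ELSE
        puts("few");
    FI
    return 0;
}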

However, the apparent simplicity of these two constructs is just the lid on the proverbial can of worms. A macro can call other macros, it can generate new macro names as a result of its expansion, and it can also recursively call itself. Correctly handling these cases when implementing the preprocessor is anything but simple. If the expansion is too conservative, then some pretty amazing code that relies on the preprocessor ceases to work. For example, the Boost preprocessor library (www.boost.org/libs/preprocessor/doc/index.html) uses nested macro expansions to provide powerful generative metaprogramming facilities: code constructs that generate other code at compile time. More spectacularly, two past winners of the Obfuscated C Code Contest (www.ioccc.org) had the preprocessor generate a list of prime numbers. If, on the other hand, the expansion is too aggressive, the preprocessor will go into an endless loop. This is definitely not something we would want to happen when we compile our code—internal compiler errors are bad enough.
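
To see how an expansion can generate a new macro name, consider this small example (the macro and function names are my own, purely for illustration):

#include <stdio.h>

static void do_read(void) { puts("reading"); }

#define GLUE(a, b)   a ## b
#define PREFIX(x)    GLUE(handler_, x)
#define handler_read do_read

int main(void)
{
    /* PREFIX(read) -> GLUE(handler_, read) -> handler_read -> do_read */
    PREFIX(read)();
    return 0;
}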

Fortunately, the C Standardization Committee has set precise rules for determining when nested macros can be recursively expanded. The relevant wording in the C Standard is as follows (hold your breath):

Then, the resulting preprocessing token sequence is rescanned, along with all subsequent preprocessing tokens of the source file, for more macro names to replace. If the name of the macro being replaced is found during this scan of the replacement list (not including the rest of the source file's preprocessing tokens), it is not replaced. Furthermore, if any nested replacements encounter the name of the macro being replaced, it is not replaced. These nonreplaced macro name-preprocessing tokens are no longer available for further replacement even if they are later (re)examined in contexts in which that macro name preprocessing token would otherwise have been replaced.

Understanding and implementing this specification is anything but simple. Specialists use the (literally) colorful expression "painting a macro with blue paint" to refer to the concept of hiding macros that have already been replaced. A search for preprocessor and "blue paint" in Google gives you many heated discussions arguing over the finer points of this specification.
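
As a minimal illustration of the rule, which you can check by running the file through the preprocessor alone (for example, with gcc -E), consider two macros that refer to each other:

#define FOO (1 + BAR)
#define BAR (2 + FOO)

int x = FOO;
/* Preprocesses into: int x = (1 + (2 + FOO));
   The innermost FOO turns up while FOO itself is being replaced, so it is
   "painted blue" and left alone; without this rule the expansion would
   never terminate. */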

Going Functional

Throughout CScout's development, after some initial struggle to get macro expansion to behave correctly, my implementation of the C preprocessor remained stable. Over time, I have accumulated about 50 test programs covering both the preprocessor's basic functionality and obscure problems I have encountered and fixed over the years. I run these test cases as part of the regression tests that precede every release.

In the process of applying CScout to the Linux kernel source code, I encountered a bug that I simply couldn't fix within the framework of the code I had written. The simplified version of the Linux code was:


#define A B
#define B C
#define X(val) Y(val)
#define C(a)        D((a))

X((A(1)) | (A(2)));


Because a function-like macro's arguments are fully macro-expanded before they are substituted, A(1) expands through B and C into D((1)), and the statement should therefore expand into:


Y((D((1))) | (D((2))));


However, every change I made to my code to fix this problem resulted in the failure of some other regression test.

At that point I decided to start from scratch, looking for a model implementation of the macro expansion algorithm that would help me implement mine correctly. I thus found that, back in the 1980s, when the members of the C Standardization Committee wrestled with defining macro expansion, they decided to come up with an algorithm that behaved appropriately and then translated that algorithm into standardese English. To that end, Dave Prosser, a member of the committee, drafted a pseudocode implementation in a functional style. The idea behind the algorithm was to expand as many macros as possible, as long as there was no danger of falling into an infinite recursion trap. I was unable to find the complete algorithm on the Web, but Dave kindly e-mailed me the original document source code (written using the UNIX troff macros), which I've now made available as a PDF document; see www.spinellis.gr/blog/20060626/.

The algorithm is written in a pure functional style, using lists (formed by concatenating their elements with the operator "•") and recursion instead of iteration. It doesn't use any destructive assignments to variables; all variables are simply shorthand for other values. This style makes it very easy to reason about the code (a convenience for the Standardization Committee), but, as I painfully discovered later on, can lead you to a performance tar pit.

Prosser's algorithm works on preprocessing tokens derived from processing a source-code file and its macros. It uses the notion of a hide set associated with each token to decide whether to expand the token or not. Initially, each token starts with an empty hide set, but during macro expansion the tokens accrue in their hide sets the macros that were used during expansion. Here is the algorithm excerpt for expanding function-like macros:

expand(TS)
{
    ...
    if TS is T^HS • ( • TS' and T is a "()'d macro" then
        check TS' is actuals • )^HS' • TS'' and
            actuals are "correct for T"
        return expand(subst(ts(T), fp(T), actuals, (HS ∩ HS') ∪ {T}, {}) • TS'');

The above sequence means that if the token sequence TS to be expanded consists of a token T with a hide set HS followed by an open bracket and a token sequence TS', then the algorithm checks that TS' contains the macro's actual parameters followed by a closing bracket with a hide set HS' and a sequence TS''. At that point, expand calls an auxiliary function subst to substitute the macro's formal arguments with the actual arguments, and then recursively expands the result. The subst function also applies a new hide set to the result, which is formed by the intersection of the hide sets HS and HS' with the addition of macro T. The intersection of the two hide sets lets the algorithm perform the maximum number of replacements without going into an infinite loop.

The subst function takes as its arguments an input sequence consisting of a macro's body, the macro's formal and actual parameters, the hide set to apply to the result, and the output sequence generated (empty at first). It replaces the formal parameters found in the macro's body with the actual parameters supplied by its caller, while also handling the complications of the stringizing and token concatenation operators (# and ##). Without the code handling the two operators, its body is as follows:

subst(IS, FP, AP, HS, OS)
{
    if IS is {} then
        return hsadd(HS, OS);
    else if IS is T • IS' and T is FP[i] then
        return subst(IS', FP, AP, HS, OS • expand(select(i, AP)));
    note IS must be T^HS' • IS'
    return subst(IS', FP, AP, HS, OS • T^HS');
}

This means that when the input sequence IS is empty, then the result is the application of the hide set HS to the output sequence OS that has been built up. Otherwise, the head token T is taken from the input sequence and subst is called recursively with the input sequence's tail IS' and a new output sequence. If T is a formal parameter FP[i], then the output sequence will have on its end the expansion of the corresponding actual parameter; otherwise, it will simply have T.

Having been bitten by many subtle problems in my first implementation of macro expansion, I decided to implement Prosser's algorithm in exactly the form it was written. I could then run my regression tests on it and be sure that I was starting from a solid base. I decided to tackle efficiency issues later. For representing the data structures in Prosser's algorithm, I used the closest equivalents available in the C++ Standard Library: list for the token sequences, and set for the hide sets. I could thus use Standard Library methods like front() and push_back() to manipulate the lists, and algorithms like set_intersection to process the hide sets. For example, this is how I form the hide set for calling the subst function:


HideSet hs;
set_intersection(head.get_hideset().begin(), head.get_hideset().end(),
    close.get_hideset().begin(), close.get_hideset().end(),
    inserter(hs, hs.begin()));
hs.insert(macro.get_name_token());
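
For readers unfamiliar with the idiom, here is a self-contained sketch of the same intersection-plus-insertion pattern; purely for illustration, it treats a hide set as a plain set of macro-name strings rather than CScout's actual token-based HideSet:

#include <algorithm>
#include <iostream>
#include <iterator>
#include <set>
#include <string>

typedef std::set<std::string> HideSet;   // illustrative stand-in

int main()
{
    HideSet head_hs;    // hide set of the macro name token
    head_hs.insert("A"); head_hs.insert("B"); head_hs.insert("X");
    HideSet close_hs;   // hide set of the closing parenthesis
    close_hs.insert("B"); close_hs.insert("X"); close_hs.insert("C");

    // Form (HS intersect HS') union {T}: intersect the two hide sets ...
    HideSet hs;
    std::set_intersection(head_hs.begin(), head_hs.end(),
        close_hs.begin(), close_hs.end(),
        std::inserter(hs, hs.begin()));
    // ... and add the name of the macro being replaced.
    hs.insert("T");

    for (HideSet::const_iterator i = hs.begin(); i != hs.end(); ++i)
        std::cout << *i << '\n';        // prints B, T, X
    return 0;
}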

After I understood the algorithm, implementing it proved to be a straightforward task. In fact, the first implementation of the new macro processing class was 112 lines (17 percent) shorter than its predecessor. I always find lower line counts a good sign—less code to comprehend, fewer chances for bugs. Indeed, two days, 10 source file revisions, and 74 additional lines later, my implementation passed all my regression tests, including the troublesome Linux code. By comparison, the previous implementation had taken three years and 32 file revisions to reach the same state.

Crunch Time

Satisfied with the correctness of the code, I set out to measure its performance. Functional code is often criticized for its poor time and space performance, so I was worried about the price I would have to pay for the speed with which I had gotten the algorithm running. Surprisingly, I didn't see a dramatic performance difference in the running of the regression tests. The runtime increased by 21 percent, a difference I subsequently decreased to 17 percent by calling expand only on tokens for which macros were defined. After some inconclusive profiling runs, I reasoned that the remaining speed difference could be an acceptable price to pay for an implementation guaranteed to be correct over one that was demonstrably incorrect, and decided to leave it at that. I thus continued my experiments with CScout and the Linux kernel source code.

At that point, things started going horribly wrong. Files that CScout could previously process in a few seconds now took tens of minutes. I was very surprised, because my regression tests also include the processing of real-world applications, like Brian Kernighan's original implementation of the awk language, and the performance drop in these tests was nothing like the orders of magnitude I was observing.

I reasoned that maybe some particular lines contained macros that were confusing the expansion algorithm, and inserted a debug statement to print the file and line number of every line being processed. Sitting in front of the screen watching the lines fly by, I stared in disbelief when I saw CScout pausing at a line for more than 10 seconds. I noted its number, and that of a couple of other lines that prompted similar pauses. All of them had a call to strcmp() in common. The corresponding GCC header files offered a hint for solving the mystery. In those files, strcmp() is defined as a macro through some fairly large and hairy definitions. Consequently, a simple invocation like strcmp(a, b) expands into more than 900 tokens (there's no point in holding your breath here):


__extension__ ({ size_t __s1_len, __s2_len; 
   (__builtin_constant_p (a) && __builtin_constant_p ( b) && 
   (__s1_len = strlen (a), __s2_len = strlen ( b), 
   (!((size_t)(const void *)((a) + 1) - (size_t)(const void 
   [... 24 similar lines removed ...]

b))[2]); if ( __s2_len > 2 && __result == 0) 
   __result = (__s1[3] - ((__const unsigned char *) 
   (__const char *) (  b))[3]); } } __result; }))) : 
    __builtin_strcmp (a,  b)))); });


This code compiles into fast and compact assembly code, but the expansion that led to it severely strained my new implementation of C macro expansion. The reason is a seldom-appreciated fact about recursive algorithms: a naïve recursive implementation can change an algorithm's complexity class. You probably know that the most dramatic differences in a program's performance appear when you switch between algorithms of different complexity classes. For example, substituting a linear search (complexity class O(N)) with a binary search (complexity O(log N)) cuts the half million operations required on average to search through a million elements down to 19. Conversely, implementing macro expansion (which should be linear in the number of tokens processed) through a quadratic algorithm skyrockets the number of operations required to process the roughly 900 tokens of our strcmp sequence from 900 to 810,000.
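
The figures above are easy to verify with a quick back-of-the-envelope program (not CScout code) that counts the elements each search examines:

#include <iostream>
#include <vector>

int main()
{
    const int N = 1000000;
    std::vector<int> v(N);
    for (int i = 0; i < N; i++)
        v[i] = i;

    int target = N / 2;             // an element halfway through the sequence
    long linear = 0, binary = 0;

    // Linear search: examines O(N) elements.
    for (int i = 0; i < N; i++) {
        linear++;
        if (v[i] == target)
            break;
    }

    // Binary search: examines O(log N) elements.
    int lo = 0, hi = N - 1;
    while (lo <= hi) {
        binary++;
        int mid = lo + (hi - lo) / 2;
        if (v[mid] == target)
            break;
        else if (v[mid] < target)
            lo = mid + 1;
        else
            hi = mid - 1;
    }

    std::cout << "linear: " << linear       // about half a million
        << ", binary: " << binary << '\n';  // around 20 (log2 of a million)
    return 0;
}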

So how did recursion make the complexity of my implementation of the algorithm quadratic? Recall that the two main functions of the algorithm work on sequences of the tokens to be expanded. When processing N tokens, these functions get invoked roughly N times. Now, if each function call constructs a list of N elements, the algorithm's complexity becomes O(N × N), which is quadratic. You may counter here that I shouldn't be constructing separate data structures from scratch, but merely modifying the existing structures in place and passing references to them on each recursive call. Unfortunately, following this road in an undisciplined manner is a sure way to introduce subtle bugs. The reason my new code worked without errors was precisely that I implemented it in a pure functional style.
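
To make the effect concrete, here is a toy program (again, not CScout code) in which a recursive function receives its token sequence by value; every call then copies the tokens that remain, so processing N tokens copies roughly N²/2 elements overall:

#include <iostream>
#include <list>
#include <string>

static long elements_copied;    // rough proxy for the work the copies cause

// Functional style: the sequence is passed by value, so every call gets
// (and pays for) a fresh copy of the tokens that remain.
static void process(std::list<std::string> tokens)
{
    elements_copied += tokens.size();   // size of the copy made for this call
    if (tokens.empty())
        return;
    tokens.pop_front();                 // "consume" one token
    process(tokens);                    // copies the rest all over again
}

int main()
{
    // Roughly the size of the expanded strcmp() invocation discussed above.
    std::list<std::string> tokens(900, "tok");
    process(tokens);
    std::cout << "elements copied: " << elements_copied << '\n';  // ~405,000
    return 0;
}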

Quality Time

Luckily, there's a way we can arrive at a clean algorithm implementation without suffering a performance penalty. This almost-magic road is called "tail recursion elimination." You can apply this technique in places where functions get recursively called from their end (tail); that is, at the point where they return to their caller. In such cases you can replace the recursive call with an assignment of the new argument values to the old ones and a jump (via a goto statement) to the function's entry point. In fact, many optimizing compilers automatically perform this optimization, saving you the trouble of doing it by hand.
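
As a minimal illustration of the transformation on a toy function (not on subst itself), a tail-recursive accumulator can be rewritten by hand as follows:

#include <iostream>

// Tail-recursive original: the recursive call is the last thing done.
static long sum(long n, long acc)
{
    if (n == 0)
        return acc;
    return sum(n - 1, acc + n);     // tail call
}

// The same function after manual tail recursion elimination: assign the
// new argument values to the old ones and jump back to the entry point.
static long sum_iterative(long n, long acc)
{
again:
    if (n == 0)
        return acc;
    acc += n;
    n--;
    goto again;
}

int main()
{
    std::cout << sum(1000, 0) << '\n';             // 500500
    std::cout << sum_iterative(1000, 0) << '\n';   // 500500, constant stack
    return 0;
}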

As most calls in Prosser's expansion algorithm are indeed tail recursive, my first reaction was to see if the compiler realized this fact and optimized the code accordingly. I compiled the code with GCC's -S switch, which generates the intermediate assembly language file, and peeked at the result. A set of 12 separate lines recursively calling subst inside the function's body confirmed my fear: GCC hadn't realized it could optimize the calls away (at least in Version 3.4.2, which I was using).

I thus set out to perform this operation by hand. I did this in several small steps, each one under version control. The large number of regression tests at hand allowed me to proceed with confidence. If a step failed, I reverted the changes and performed it again, more carefully. This is how an excerpt of my initial implementation of subst looked in its recursive form:


static PtokenSequence
subst(const Macro &m, dequePtoken is,
    const mapArgval &args, HideSet hs, PtokenSequence os,
    bool skip_defined, const Macro *caller)
{
    if (is.empty())
        return (hsadd(hs, os));
    // ... (code elided; tail holds the tokens remaining to be processed)
    for (ti = tail.begin(); ti != tail.end(); ti++)
        if (!ti->is_space() &&
            (ai = find_formal_argument(args, *ti)) != args.end()) {
            tail.erase(tail.begin(), ++ti);
            os.push_back(stringize(ai->second));
            return (subst(m, tail, args,
                hs, os, skip_defined, caller));
        }


The first step involved mechanically changing the tail recursive calls into jumps to the function's beginning. In this step, I also altered the code to directly modify the function's arguments is and os, instead of constructing new ones. The corresponding code in the function's body took this form:


again:
    if (is.empty())
        return (hsadd(hs, os));
    for (ti = is.begin(); ti != is.end(); ti++)
        if (!ti->is_space() &&
            (ai = find_formal_argument(args, *ti)) != args.end()) {
            is.erase(is.begin(), ++ti);
            os.push_back(stringize(ai->second));
            goto again;
        }


Running the offending strcmp() Linux statement with the above code showed a dramatic improvement in performance. The processing time dropped by almost two orders of magnitude from 12.2 seconds to 0.16 seconds—at a price. My code ended up littered with 12 instances of a goto statement. Although not all uses of goto are bad (I can't see a problem if a small percentage of a program's functions contain a single label), I try as a matter of principle to avoid the use of a goto whenever I can use an equally readable alternative.

If my implementation were coded in Java, an alternative would be to place the function's body within a labeled loop and use a continue statement to branch to the loop's beginning. Because C++ doesn't provide an explicit facility for branching from within nested loops to an enclosing loop's next iteration, my next step involved converting a number of inner loops, which searched for the first nonspace token, into calls to a function I named find_nonspace. This exercise had the added benefit of reducing the code's indentation by one level. Consequently, I felt that through this change I repaid some of the style debt I had incurred by dropping the recursive algorithm. This is the code's new look:


again:
    if (is.empty())
        return (hsadd(hs, os));
    ti = find_nonspace(is.begin(), is.end());
    if (ti != is.end() && (ai =
        find_formal_argument(args, *ti)) != args.end()) {
        is.erase(is.begin(), ++ti);
        os.push_back(stringize(ai->second));
        goto again;
    }
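
For completeness, the find_nonspace helper mentioned above can be as simple as the following sketch (the actual signature in CScout may differ):

// Return an iterator to the first token that is not white space,
// or end if there is none.
template <class Iterator>
static Iterator
find_nonspace(Iterator begin, Iterator end)
{
    while (begin != end && begin->is_space())
        ++begin;
    return (begin);
}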


At that point, it became child's play to put the function's body inside a while loop and substitute all instances of goto with continue. In a number of cases, I was also able to factor a couple of continue statements into a single one, further improving the code's readability. This is the final look of the goto-less code:


while (!is.empty()) {
    ti = find_nonspace(is.begin(), is.end());
    if (ti != is.end() && (ai =
        find_formal_argument(args, *ti)) != args.end()) {
        is.erase(is.begin(), ++ti);
        os.push_back(stringize(ai->second));
        continue;
    }
    // ... (handling of the remaining token cases elided)
}
return (hsadd(hs, os));


In three steps, I moved from a clean recursive implementation to equally readable and much more efficient iterative code. The in-between stopovers were somewhat messy, but version control and regression tests allowed me to navigate through them with confidence.

Lessons Learned

I have to admit that there were times during this implementation work when I felt frustrated and doubted that the new code would ever work reliably and efficiently. In retrospect, things went remarkably well. The dictum "first make the code correct, then make it fast" proved itself once more; I think this is the prudent course to take when dealing with tricky algorithms. I confess that the degree to which recursion and the functional programming style burdened my program's performance surprised me: this was the first time I witnessed firsthand an unintentional change in the complexity class of an algorithm. I was also disappointed that GCC was unable to optimize the recursive tail calls; it seems that the state of the art in compiler technology I heard advertised as an undergraduate isn't quite what it was claimed to be. Finally, I was lucky to have at hand a large set of tests to verify the extensive changes I made to the code. I thus realized why unit tests are such an integral part of agile methodologies, which rely on frequent refactoring exercises.

When I committed the final version of the implementation to the version-control system, I felt confident that the code was correct, readable, and efficient. It was quality code, for which I felt proud. And this is what counts in the end.
