Down The Rabbit Hole
Last week, I described how I almost traced a bug to the wrong program. Having come this far, I would like to describe the actual source of the bug — such a strange source that I would never have suspected it when I began looking.
White PapersMore >>
- State of Cloud 2011: Time for Process Maturation
- SaaS 2011: Adoption Soars, Yet Deployment Concerns Linger
- Real results: Speeding quality application delivery with DevOps [in financial services]
- Client Windows Migration: Expert Tips for Application Readiness
After several false starts, I had found that an early pass in a multipass compiler was sometimes producing output in which widely spaced, apparently random characters were replaced by
'0' characters (i.e., not characters with all bits zero, but rather the representation of the digit 0). The question remained unanswered as to why this might be happening.
As before, the first thing I did was to look at the circumstances. I found that this early compiler pass was being run as follows:
- Run the preprocessor on the program's input and put the results in a temporary file.
- Create a second temporary file, and copy data into that file, which represents the compiler's output from the standard library. This stop avoids having to recompile the standard-library declarations every time the compiler is run.
- Run the first compiler pass with its input coming from the temporary file that contains the preprocessor's output, and its output being appended to the second temporary file.
In UNIX shell terms, then, this compiler pass was being run as follows:
preprocessor <input >temp1 cp library temp2 phase1 <temp1 >>temp2
Here we are following the usual convention that < precedes a file used as a program's input, > precedes a file used as a program's output, and >> precedes a file to which output is being appended. The
cp program works with file names as its arguments, so it doesn't need < or >.
As usual, an important early step in debugging any program is to recreate the problem with as little extra context as possible. Accordingly, I started by saving the preprocessor output in a file that I will call
testprog. I then reproduced the rest of the process:
cp library temp2 phase1 <testprog >>temp2
Sure enough, the output in
temp2 would occasionally include spurious
The next step was to note that the compiler did not actually look at the library file, so I did not need to use it. The resulting simplification was
phase1 <testprog >temp2
To my astonishment, this simplified version worked every time, even though
phase1 did not even have read permission on its output. Could it be that the file append operation was broken? To find out, I tried appending to an empty file:
cp nullfile temp2 phase1 <testprog >>temp2
This version also worked perfectly every time. Perhaps there was something about a nonempty file that caused a failure.
To find out, I created a file that contained
"Hello, world!\n" and used that file as input:
cp hello temp2 phase1 <testprog >>temp2
Again, this version worked perfectly.
At this point, I was really confused. How could a program that is not capable of reading its output file possibly behave in a way that depended on the prior contents of its output file? To prove that the compiler was not reading its output file, I tried breaking things up this way:
cp library temp2 phase1 <testprog >temp3 cat temp3 >>temp2
cat program stands for catenate; it copies each of its input files to its output file. In this case it has only one input file, so it effectively appends the contents of
phase1 in this way guaranteed that it would not have access to the contents of
To my utter astonishment, this revision of the program failed. Not every time — but it never failed every time. In fact, I could execute
phase1 <testprog >temp3
and then repeatedly execute
cp library temp2 cat temp3 >>temp2
and I would still see occasional failures.
In other words, I had just proved that the compiler bug I was hunting was not in the compiler at all, but rather in the operating system — specifically, in the part of the operating system that implemented the >> operation. Moreover, whatever the bug was, it depended somehow on the contents of the file on which the >> operation was being run. At this point, I could hand the problem off to the operating-system development group.
The entire experience had a distinct Alice-In-Wonderland feel to it. At first I thought there was something wrong with my program, because the compiler was producing error messages. Then I realized that I could compile the same program twice and have it sometimes compile and sometimes fail to compile, so I suspected that part of the compiler was producing the error messages. Then I realized that when that compiler phase produced its error messages, it was doing so because it was getting incorrect input from the previous phase — which, in turn, was producing incorrect output because of a bug in the operating system. So every time I thought I had found the problem, I was just going one step further into indeterminacy.
This whole episode taught me several important lessons about debugging:
- Be sure you know what's broken. What looks like a bug in your program might be a bug in the compiler; what looks like a compiler bug might be an operating-system bug.
- Part of knowing what's broken is proving it. The way to prove it is to come up with a failing test case that is so simple that the problem cannot be anywhere else.
- Even if a failure looks like it's impossible, it might be happening anyway. In this particular case, even though the compiler was not reading its output file, the operating-system bug made the compiler's success depend somehow on the output file's initial contents.
- It is possible to learn quite a bit about what is wrong with a program, even if you have neither a debugger nor the program's source code available.
These lessons are worth taking to heart; they have saved me an enormous amount of trouble over the years that would otherwise have been spent chasing bugs that weren't.
Believe it or not, this story is still not over. Next week, I'm going to explain what the operating-system developers told me about the problem once they finally found it. Somehow I doubt you'll be surprised to learn that their first attempt to reproduce the problem failed, despite the simple test case I had given them.