Isolating A Superbug
Let's recap what has happened so far:
Compiling my program would occasionally produce error messages that seemed unrelated to the program. Compiling exactly the same program again would usually work. I thought I had traced the problem to one phase of the compiler behaving consistently. However, when I tried to reproduce the problem, I learned that the real problem was that the previous phase was sometimes producing incorrect output.
Finally, I discovered that the problem wasn't in the compiler at all! Concatenating a copy of the compiler output to a copy of a preamble file that the compiler always put at the beginning:
cp preamble result
cat output >>result
would cause result occasionally to have random characters changed to '0'. This misbehavior happened even though
cp nullfile result
cat output >>result
and
cat preamble output >result
both worked every time.
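For readers who want to run a comparison like this themselves, here is a minimal Python sketch. It imitates the two strategies with ordinary file operations rather than running cp and cat, and every function name in it is invented for illustration. It is not the procedure used in the story, only one way such an experiment might be automated.

```python
import os
import shutil
import tempfile


def copy_then_append(preamble: str, output: str, result: str) -> None:
    """Imitate `cp preamble result` followed by `cat output >>result`."""
    shutil.copyfile(preamble, result)
    with open(output, "rb") as src, open(result, "ab") as dst:
        dst.write(src.read())


def concatenate_once(preamble: str, output: str, result: str) -> None:
    """Imitate `cat preamble output >result`."""
    with open(result, "wb") as dst:
        for name in (preamble, output):
            with open(name, "rb") as src:
                dst.write(src.read())


def count_failures(strategy, preamble_bytes: bytes, output_bytes: bytes,
                   trials: int = 200) -> int:
    """Run one strategy repeatedly; count results that differ from the expected bytes."""
    expected = preamble_bytes + output_bytes
    failures = 0
    with tempfile.TemporaryDirectory() as tmp:
        preamble = os.path.join(tmp, "preamble")
        output = os.path.join(tmp, "output")
        result = os.path.join(tmp, "result")
        with open(preamble, "wb") as f:
            f.write(preamble_bytes)
        with open(output, "wb") as f:
            f.write(output_bytes)
        for _ in range(trials):
            strategy(preamble, output, result)
            with open(result, "rb") as f:
                if f.read() != expected:
                    failures += 1
    return failures
```

Calling count_failures once with copy_then_append and once with concatenate_once, on the same data, gives two numbers to compare. On a healthy system both should be zero; an intermittent bug like the one described here would show up as a small nonzero count for only one of the strategies.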
This last misbehavior was genuinely puzzling: How could it possibly be that concatenating preamble and output would always work with one strategy but not another? The version that worked used the cat command; the one that failed used both the cat and cp commands. Could something be wrong with the cp command? No, because the part of the output that the cp command copied was always correct; it was the part of the output copied by the cat command that was wrong.
So something in the cat command was misbehaving, and that misbehavior depended on the contents of the preamble file. I knew that an empty preamble file worked and the particular one I was using failed. What about other files? After some experimentation, I had a bunch of examples of files that worked and others that failed. What did the ones that failed have in common?
After staring at them for some time, and constructing files to test various hypotheses, I realized something important:
Every preamble file that caused a failure had an odd number of characters.
Moreover, even when the preamble had an odd number of characters,
cat preamble output >result
always worked. What could possibly cause this command to behave differently from
cp preamble result
cat output >>result
Of course! The second one appends to a file. And how does appending work? By opening the file, seeking to the end, and writing. If preamble has an odd number of characters, then that seek will go to an odd offset, and the seek wouldn't happen at all in
cat preamble output >result
In other words, I had a new hypothesis: Seeking to an odd position in a file, then writing, occasionally caused spurious characters to appear.
This was an easy hypothesis to test: Write a program that creates a file, seeks to an odd position, and writes data to the file. Sure enough, once in a while, the file would pick up characters that didn't belong there. Finally, I had isolated this bug accurately enough that I could send it to the operating-system group.
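A test along these lines might look like the following Python sketch. It is not the program from the story (the language, the function names, and the filler byte are all assumptions made for illustration); it simply creates a file, seeks to a chosen offset, writes, and checks what ended up on disk.

```python
import os
import tempfile


def write_at_offset(path: str, offset: int, payload: bytes) -> bytes:
    """Create a file of `offset` filler bytes, seek to its end, write `payload`,
    and return the file's full contents."""
    with open(path, "wb") as f:
        f.write(b"x" * offset)
    # Reopen without truncating and seek explicitly before writing:
    # the same steps an append performs.
    with open(path, "r+b") as f:
        f.seek(0, os.SEEK_END)
        f.write(payload)
    with open(path, "rb") as f:
        return f.read()


def count_corruptions(offset: int, payload: bytes, trials: int = 1000) -> int:
    """Repeat the seek-and-write test; count runs whose contents are not as expected."""
    expected = b"x" * offset + payload
    bad = 0
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "seektest")
        for _ in range(trials):
            if write_at_offset(path, offset, payload) != expected:
                bad += 1
    return bad
```

Running count_corruptions with an odd offset and again with an even one gives a direct check of the hypothesis: if only the odd offset ever yields a nonzero count, the parity of the seek position is what matters. On a correctly working system both counts should be zero.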
This story contains several lessons that are not obvious at first glance:
- You don't always need the source code for a program to isolate bugs in it.
- Bugs in one program sometimes behave in ways that lead you to think they're in another program entirely.
- If something is happening that appears to be impossible, what's wrong is your understanding of what is happening. So you need to correct your understanding.
- Every time you can rule out part of a program as the source of your bug, you're that much closer to finding it.
Next week, I'll reveal how it is possible for an operating-system bug to cause such a bewildering symptom.