Debugging by Hypothesis
Last week's article got an unusually large number of comments, so I'm going to continue this theme with another example. This time, the failing program was a compiler.
There's a technical term for people who claim that their programs don't compile because of a compiler failure rather than a bug in their own programs: arrogant. In this particular case, though, this arrogance was justified, because I would compile a program and get a bunch of error messages, and then I would compile exactly the same program again without any problems at all. Moreover, the error messages seemed to have nothing to do with my program: They would complain about invalid characters that my program most definitely did not have.
It's particularly hard to find out what is wrong when doing the same thing twice gives different results. Even worse was that the failures were rare: The spurious error messages would occur only about one time in 10. Moreover, I had nothing to do with the compiler in question, so I couldn't fix the bug even if I knew what it was. On the other hand, I needed to be able to compile my programs. What could I do?
I decided that my only chance of getting anywhere was to figure out how to reproduce the problem. If I could say to the compiler folks: "When you do X, Y, and Z, your compiler occasionally produces spurious error messages," then I might be able to get them to fix it.
This particular compiler had several phases, which were run from a shell script. As a result, the logical first step was to figure out which phase was failing. Once I had done that, I could easily change the shell script to capture a copy of the input to that phase. My hope was that I could run just that phase with that particular input, thereby provoking a failure. I would then be able to give this input file to the compiler group and tell them that their compiler occasionally, but not always, failed with this particular input.
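The capture step can be sketched as follows. This is a minimal illustration in Python rather than the shell script the compiler actually used, and it assumes each phase reads its input on standard input and writes its output on standard output, which may not match the real compiler. The function name `run_phases` and the phase names are hypothetical.

```python
import subprocess
from pathlib import Path


def run_phases(phases, source, keep_dir):
    """Run each phase on the previous phase's output, saving every input.

    phases   -- list of (name, argv) pairs; each command is assumed to read
                stdin and write stdout (an assumption made for this sketch)
    source   -- path to the original source file
    keep_dir -- directory that receives a copy of each phase's input

    Returns (name, saved_input_path) for the first phase that exits with a
    nonzero status, or None if every phase succeeds.
    """
    keep_dir = Path(keep_dir)
    keep_dir.mkdir(parents=True, exist_ok=True)
    data = Path(source).read_bytes()
    for name, argv in phases:
        saved = keep_dir / f"{name}.input"
        saved.write_bytes(data)  # capture the input before the phase runs
        result = subprocess.run(argv, input=data, capture_output=True)
        if result.returncode != 0:
            # Keep the error messages next to the input that provoked them.
            (keep_dir / f"{name}.stderr").write_bytes(result.stderr)
            return name, saved
        data = result.stdout  # this phase's output feeds the next phase
    return None
```

When a phase fails, the returned path points at exactly the bytes that phase was given, which is the file one would hope to hand to the compiler group.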
This is probably a good time to take a step back and look at the principles behind this strategy.
- An important early step in debugging any problem is to learn how to make it fail.
- If you can divide a failing program into several parts (compiler phases in this case), and you can show that the failure is happening in a particular part, the problem will be easier to find.
- If someone else is going to be fixing the problem you have found, it is important to bundle up all the information needed to produce a failure.
These principles seem straightforward enough. However, an important one is missing:
- Once you have bundled up the information needed to produce a failure, you must verify that your information actually does produce a failure.
I would have been embarrassed indeed had I overlooked this last principle, because when I ran the failing compiler phase on the input file that I had captured from a prior failure, it worked perfectly. Every time. Apparently, something about the act of capturing the input file caused the problem to go away.
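With a failure that shows up only about one time in 10, a single successful run proves little, so the verification has to be repeated many times. A minimal sketch of such a check, in Python, might look like this. The function name `failure_rate` is hypothetical, and the sketch assumes that the phase reads its input on standard input and signals failure with a nonzero exit status.

```python
import subprocess
from pathlib import Path


def failure_rate(argv, input_path, runs=100):
    """Replay one captured input through one command many times.

    Returns (failures, runs), where a failure is any nonzero exit status.
    """
    data = Path(input_path).read_bytes()
    failures = 0
    for _ in range(runs):
        result = subprocess.run(argv, input=data, capture_output=True)
        if result.returncode != 0:
            failures += 1
    return failures, runs
```

If the true failure rate were 1 in 10, the chance of 100 consecutive clean runs would be 0.9 to the 100th power, or roughly 0.003 percent. A result of zero failures in 100 runs would therefore be strong evidence that the captured input alone does not reproduce the problem.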
I invite readers to speculate about what might cause this state of affairs before I continue the story next week.