When Systematic Testing Isn't Enough
Last week, I promised to describe some aspects of software that make testing difficult. As I was organizing my thoughts to do so, I noticed Andrew Binstock's article, which argues that the whole idea of unit testing is so important and useful as to be beyond dispute.
So let me get this out of the way up front: He's right. Unit testing — which I sometimes think of as sanity testing — is an extremely important idea, and everyone ought to be doing it. In fact, hardware designers have been doing it literally for decades. They call it BIST, which stands for Built-In Self-Test. In both hardware and software, the idea is the same: It is better to have a program or device say "I'm broken" than for it to go nuts, and much better than for it to produce incorrect results quietly. As Andrew Binstock says:
Today (and for years now), I write code and then I write unit tests that exercise the edge cases and one or two main cases. Right away, I can tell if I missed something obvious or if my implementation has a slight burble that mishandles cases I expected to flow through easily.
The issue that I want to address is when this kind of testing is not enough. In particular, I'll argue that a testing strategy that says "When all the tests pass, you're done" is not enough for anything beyond trivial programs.
One reason that such a mechanical approach doesn't work in practice is that once a program grows beyond what a few programmers can handle, it will never be completely bug-free. For example, a handful of relatively unimportant bugs may require an architectural change to fix, and it may be much easier to wait for that change than to fix each of the bugs individually and then redo the fixes for the new architecture. Alternatively, a bug report might come in when testing is almost complete, and fixing that bug might delay shipment unacceptably. In either of these circumstances, it won't do just to run the tests and wait to ship until all the tests succeed.
However, even if we disregard such situations — treating them as project-management issues rather than as testing issues — there are still at least five kinds of bugs that systematic testing does an unusually bad job of revealing, and for which other strategies are therefore necessary:
- Performance bugs. I'm not talking about the kind of bugs that might come from using an insertion sort or a bubble sort instead of a more sensible algorithm; rather, I'm talking about bugs — or even architectural errors — that result in systems that work fine on a small scale but perform unacceptably under heavy but realistic loads. Detecting such bugs requires simulated load testing at the very least; and of course the load has to be a sensible model of the conditions that the system actually encounters.
- Resource leaks. Curiously, even programs in languages with built-in garbage collection can leak memory. For example, suppose we execute a statement such as y = f(x), where x and y are both large data structures. As long as the variable x continues to exist, the entire data structure that it represents will stick around, and x will continue to consume memory even if the programmer knows that x will not be used again. Of course, programmers can solve these problems by setting variables such as x to a null value once they know that the variables will not be used again. However, the same argument can be made for the case of deleting variables explicitly in languages without garbage collection.
- Security vulnerabilities. Testing for security bugs requires a different mindset than testing for ordinary bugs. Security testers must assume that problems they encounter will be the work of a malicious adversary: if you like, that they will come from Machiavelli rather than from Murphy.
- Timing bugs in parallel systems. Such bugs can be extraordinarily hard to find because the same program can work on one occasion and fail on another.
- Corrupted data structures. Such problems can be particularly nasty when the data structures are on a disk or network. The trouble is that a program can ask for several data-structure operations, but they might not all be executed. For example, a network or power failure might halt such a sequence of operations in the middle. What is worse, especially in the case of disk operations, is that programs cannot generally even assume that all the operations before a particular point in time will have been executed and the ones after that point will not be executed. For example, a disk controller might rearrange the sequence of operations in its queue, thereby causing an operation to be executed even though one requested before it is not executed.
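To make the resource-leak item concrete, here is a minimal Python sketch of the y = f(x) situation. The class BigTable and the function f are hypothetical stand-ins invented for illustration; the weak reference is used only as a probe to observe whether the object that x named still exists.

```python
import gc
import weakref


class BigTable:
    """Hypothetical stand-in for a large data structure."""

    def __init__(self, n):
        self.rows = list(range(n))


def f(table):
    """Hypothetical transformation that builds a new structure from the old one."""
    return BigTable(len(table.rows))


x = BigTable(100_000)
probe = weakref.ref(x)  # observes the object without keeping it alive
y = f(x)

# x is never used again, but the name still exists, so the object survives
# even a forced collection.
gc.collect()
alive_while_named = probe() is not None

# Setting x to a null value drops the last reference; now the memory can go.
x = None
gc.collect()
alive_after_clearing = probe() is not None

print(alive_while_named, alive_after_clearing)
```

Run as written, this prints True False. Note that a functional unit test of f would pass in either case, because the leak changes how much memory the program holds, not what it computes.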
Unit tests by themselves don't do a very good job of detecting any of these kinds of problems. Clearly, a strategy is necessary that goes well beyond unit testing — regardless of how much effort the programming language exerts in order to avoid undefined behavior.
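As one illustration of why a passing test says so little about timing bugs, the following Python sketch simulates two workers that each increment a shared counter in two separate steps, a read followed by a write. It uses no real threads; instead it enumerates every possible interleaving of the four steps. The names run, results, and lost_updates are hypothetical, chosen only for this example.

```python
from itertools import permutations


def run(schedule):
    """Execute one interleaving of two workers, "A" and "B".

    Each worker appears twice in the schedule: its first appearance is
    the read of the shared counter, its second is the write of read + 1.
    """
    counter = 0
    seen = {}  # value each worker read, keyed by worker name
    for worker in schedule:
        if worker not in seen:
            seen[worker] = counter        # step 1: read
        else:
            counter = seen[worker] + 1    # step 2: write back
    return counter


# All distinct orderings of the four steps (two per worker).
schedules = sorted(set(permutations("AABB")))
results = {"".join(s): run(s) for s in schedules}

# The correct final value is 2; any other value is a lost update.
lost_updates = sorted(s for s, value in results.items() if value != 2)

for name, value in results.items():
    print(name, value)
```

Of the six possible schedules, only AABB and BBAA produce the correct total of 2; the other four lose an update. A test that happens to run under one of the two benign schedules passes, which is exactly how the same program can work on one occasion and fail on another.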

