Limiting The Harm From Failures
Last week, I observed that different kinds of software failure affect the people who use the software in different ways. Software failures are harmful for two different reasons:
- The program fails to produce a correct result.
- The program produces an incorrect result.
Although these two situations may appear to be the same, they are actually very different. A program that fails to produce a correct result often does so by crashing. Not only does a crashing program fail to produce any results — correct or incorrect — in the future, but the reason that it crashed is often obscure. Obviously, such a situation can be annoying. On the other hand, relying on an incorrect result with no knowledge that it is incorrect has equally obvious problems.
This dual nature of software failure presents a dilemma to programmers. Suppose, for example, that there is an easy way (such as an assert statement) to make a program crash if it is about to produce an incorrect result. By using such a technique thoroughly, we can avoid incorrect results, but only at the cost of making it much more likely that the program will fail to produce any result at all.
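To make the tradeoff concrete, here is a minimal sketch in Python. The function integer_sqrt is a hypothetical example, not taken from any particular library. It computes its answer from a floating-point estimate, which is an assumption chosen because such estimates can be slightly off for very large inputs, and then uses an assert statement to check the answer before returning it.

```python
def integer_sqrt(n):
    """Return the largest integer r such that r * r <= n, for n >= 0."""
    # Floating-point estimate: cheap, but it can be off for very large n.
    r = int(n ** 0.5)
    # Postcondition check: crash here rather than hand back a wrong root.
    assert r * r <= n < (r + 1) * (r + 1), "integer_sqrt: wrong result"
    return r
```

For small inputs the check passes silently. For an input large enough that the estimate is wrong, the assertion stops the program instead of letting the bad value escape, which trades an incorrect result for no result at all. Note that Python skips assert statements when run with the -O option, so the same source can exhibit either kind of failure depending on how it is run.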
The relative (un)desirability of these two kinds of failure varies dramatically from one context to another. At one extreme is a flight-control system for an airliner. Just about any kind of incorrect result is less disastrous than having the system fail to respond altogether, because the pilot will probably notice that the airplane is misbehaving and disconnect the errant component. An example of the other extreme is a missile launch system: It would almost always be better for the missile to fail to launch at all than for it to launch unexpectedly.
Software testing is usually aimed at reducing both of these kinds of failure. However, unless our programs are always bug-free, testing by itself does not address how a program fails when it does so. To address this problem, we must figure out what kinds of failures we are willing to accept, and then design our systems so that even if they do fail, they will probably fail in acceptable ways. In effect, how our software handles error conditions will depend on the context in which it is used.
By implication, if we are writing a program, such as a data-structure library, that is intended to be used in a variety of contexts, then it must be able to accommodate a variety of error-handling strategies. This requirement means that such a library is more useful if it is possible to customize its error handling to fit the diverse contexts in which that library will be used. Next week, I'll get specific about how we can accommodate such error-handling diversity as part of system design.
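As a rough illustration of what customizable error handling could look like, the Python sketch below lets the caller choose what a small container does when it is misused. Every name in it (Stack, on_error, raise_on_error, warn_and_continue) is hypothetical and chosen only for this example; it shows one possible shape for the idea, not necessarily the approach that later installments describe.

```python
import sys


def raise_on_error(message):
    """Policy: stop the current operation by raising an exception."""
    raise LookupError(message)


def warn_and_continue(message):
    """Policy: report the problem, then carry on with a placeholder value."""
    print("warning:", message, file=sys.stderr)
    return None


class Stack:
    """A stack whose response to misuse is chosen by the caller."""

    def __init__(self, on_error=raise_on_error):
        self._items = []
        self._on_error = on_error  # called with a message when pop() fails

    def push(self, item):
        self._items.append(item)

    def pop(self):
        if not self._items:
            # Delegate the decision: the policy may raise or return a value.
            return self._on_error("pop from empty stack")
        return self._items.pop()
```

A caller that prefers no result to a wrong one can keep the default policy, so popping an empty Stack raises LookupError. A caller that must keep running can write Stack(on_error=warn_and_continue) and accept a placeholder None instead. The library code is identical in both cases; only the policy passed in by the context differs.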

