How Can Broken Software Promise Anything?
Last week, I started discussing how a program might cope with one of its components reporting that it couldn't do what was asked of it. I argued that when part of a system fails, it must
- Report the failure in a way that the rest of the system can recognize, and
- Leave the system in a state from which recovery is possible.
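As a minimal sketch of these two requirements in C++, consider a component that stages its work on a private copy and commits only when every step has succeeded. The names NameList, replace_all, and try_replace are hypothetical, invented here for illustration; the technique is the familiar build-then-swap idiom, and it is only one of several ways to satisfy both requirements.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical component: holds a list of non-empty names.
class NameList {
public:
    // Replaces the contents with `candidates`.
    // Throws std::invalid_argument if any candidate is empty; in that case
    // the existing contents are left exactly as they were.
    void replace_all(const std::vector<std::string>& candidates) {
        std::vector<std::string> staged;   // work on a private copy
        staged.reserve(candidates.size());
        for (const std::string& name : candidates) {
            if (name.empty())
                throw std::invalid_argument("empty name");  // report the failure
            staged.push_back(name);
        }
        assert(staged.size() == candidates.size());
        names_.swap(staged);               // commit; vector::swap does not throw
    }

    const std::vector<std::string>& names() const { return names_; }

private:
    std::vector<std::string> names_;
};

// Caller-side recovery: returns true on success. On failure the list still
// holds its previous contents, so the caller can carry on or try again.
bool try_replace(NameList& list, const std::vector<std::string>& candidates) {
    try {
        list.replace_all(candidates);
        return true;
    } catch (const std::invalid_argument&) {
        return false;
    }
}
```

The exception is the recognizable report; the untouched names_ member is the state from which recovery is possible. Of course, this helps only if the staging code itself behaves as written, which is exactly the question the rest of this article takes up.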
How can the second of these requirements ever be met in the case of software errors? If the program is broken, how can one ever be confident in any assessment of what it did?
One way to illustrate the answer is to describe a paper I read once. I no longer remember the exact details, but I can summarize the abstract:
This paper is in two parts.
In the first part, we prove that it is impossible to play a fair game of telephone poker.
In the second part, we describe an effective procedure for playing a fair game of telephone poker.
At first glance, this abstract contradicts itself. Either it is possible to play a fair game of telephone poker or it isn't. If it isn't, then there's no such thing as an effective procedure for doing so. However, the real point of the paper was that although a perfectly fair game of telephone poker is theoretically impossible, it is possible to come as close as we like to that ideal by using encryption keys that are long enough.
Analogously, when we design systems, we can divide them into subsystems and arrange that when one subsystem goes awry, that failure does not affect the rest of the system. As with our telephone poker example, it's not necessary to contain the failures in every case, as long as we can contain them in enough cases that the remaining risk is too small to worry about.
For example, most computers today have some kind of hardware-assisted firewall between processes. As a result, it is hard for a failure in one process to affect the memory of any other process. Of course, just because it's hard doesn't mean it's impossible. Perhaps an evil chip designer left a back door in the protection system for that specific purpose. Perhaps the hardware fails in a way that allows the firewall to be breached at the very moment a software error tries to breach it. Perhaps the firewall itself is broken. Nevertheless, the existence of such firewalls means that when we write everyday programs, we do not have to worry that our programs' memory will be silently corrupted by other programs running on the same machine.
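To make the process firewall concrete, here is a rough sketch that assumes a POSIX system (fork and waitpid). It runs a task in a child process, so that even a task that scribbles on memory and then crashes cannot touch the parent's copy of that memory. The names run_isolated, Outcome, shared_counter, healthy_task, and corrupting_task are hypothetical, made up for this illustration.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdlib>

enum class Outcome { Succeeded, ReportedFailure, Crashed, CouldNotStart };

// Lives in the parent's memory; each child gets its own private copy.
int shared_counter = 0;

// Runs `task` in a separate process and reports how it ended.
// A task returns 0 for success and nonzero for a failure it detected itself.
Outcome run_isolated(int (*task)()) {
    pid_t pid = fork();
    if (pid < 0)
        return Outcome::CouldNotStart;
    if (pid == 0) {
        // Child: run the task and leave without returning to the caller.
        _exit(task() == 0 ? 0 : 1);
    }
    // Parent: wait for the child and classify its fate.
    // (A production version would also retry waitpid on EINTR.)
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return Outcome::CouldNotStart;
    if (WIFEXITED(status))
        return WEXITSTATUS(status) == 0 ? Outcome::Succeeded
                                        : Outcome::ReportedFailure;
    return Outcome::Crashed;  // terminated by a signal, such as SIGABRT
}

int healthy_task() { return 0; }

// Stands in for a software error: damages state, then dies.
int corrupting_task() {
    shared_counter = 999;  // changes only the child's copy
    std::abort();
}
```

Calling run_isolated(corrupting_task) should yield Outcome::Crashed while the parent's shared_counter remains 0. The crash has become a recognizable report, and the parent's state is one from which recovery is possible, subject to the same caveats about the firewall itself that apply to any such protection.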
Because large systems usually have ways of containing damage, we can use such containment to turn software bugs into predictable, recoverable failures. Not all the time, and not with complete certainty, but often enough to make a real difference in how reliable our systems are.
Next week I'll talk about two techniques that C++ programmers can use to make their programs more reliable, particularly in the area of data structures: class invariants and data-structure audits.

