Operating Systems Help Us Limit Harm From Failures
Last week, I suggested that programs should use assert statements to handle conditions that "can't happen," such as invariant failures. I want to continue that discussion by looking at how a user might deal with the failure of such a program.
- Bad Data is Costing You Millions of Dollars
- Data Quality Issues in the Configuration Management Database (CMDB)
Suppose you call a function that reports failure. If you are to recover from such a situation at all, it is important to be able to limit the harm that the function's failure might have done. For example, suppose the function reads an input file. If it reports failure — even if that failure can be due only to a bug in the function itself — it is nice to be able to trust that the function did not change the contents of the input file.
The ability to limit harm in such circumstances is an important service of an operating system. For example, if you tell the operating system that a file is read-only, then it is much harder for even a broken program to change that file's contents than if you do not make the file read-only. Similarly, most modern operating systems make it somewhere between difficult and impossible for a bug in one process to change memory belonging to a different process.
Processes do something else important, too: They limit the scope of "program termination." In C++ terms, when we talk about a program terminating because of an assert, what we really mean is that a process is terminating. Moreover, the operating system usually prevents a process that terminates from corrupting memory belonging to other processes.
As a result, most operating systems make it possible for us to use processes to deal with assertion failure: We can put part of our computation in a separate process, and then, if that process fails, we can be reasonably confident that that failure did not sneak back across the process boundary and corrupt our own computation.
This strategy is important for building large, reliable systems. For example, Microsoft Windows has the notion of a service, which is a process that the operating system runs automatically. Associated with each service is a set of instructions for what should happen if that service terminates. Should the system restart the failed process? If so, should it do so immediately, or should it wait a while? If the process fails again, what should the system do? Is there a limit to the number of times the system should try to restart the process before giving up? Should the system take some other action in addition to (or instead of) restarting the process? And so on.
Of course the system's mechanism for starting and restarting services might itself be broken. For that matter, there might be bugs that create holes in the firewalls between processes. However, both of these mechanisms are fairly easy to understand, and therefore to implement and test. Moreover, they are apt to be very widely used, which makes them much less likely to fail than the individual services they control.
In short, assert statements, together with facilities such as operating-system-assisted firewalls, suggest a strategy of:
- Arranging for parts of a system to fail as quickly as possible when they do fail, and
- Wrapping those parts in a way that limits the scope of the failure.
Those wrappers might consist of putting different parts of the system in different processes, or running different parts at different times and using files to pass information between them, or even running them on different computers connected by a network. Either way, the underlying idea is the same: Arrange for each part of the system to be able to complete its work successfully or to report failure — and limit the scope of that failure — as aggressively as possible.
Of course, it's not always possible to limit the ill effects from a failure completely. Next week, I'll discuss how we can use data-structure audits to reduce the harm in such cases.