Data Structure Audits
Last week, I suggested using the operating system to help you put firewalls around parts of a system that might fail. I want to continue by discussing how to clean up after such subsystem failures.
Many data structures can be divided into two parts, which we might call the data part and the structure part. The idea is that a data structure contains a collection of data, together with additional information that can be used to access those data particularly easily or quickly. For example, the data in a typical filesystem are the contents of the disk blocks that constitute the filesystem. Each file has a corresponding part of the filesystem that keeps track of the blocks that constitute it. Directories (also known as folders) store collections of name/value pairs that make it possible to locate particular files. Moreover, there is usually an additional data structure that keeps track of all of the disk blocks that are not part of any file, so that when a user program wants to create a file, or add to a file that already exists, it is easy to find space on disk to do so.
It is not hard to imagine an invariant for a filesystem. It might specify that every disk block is part of exactly one of the following: a file, a directory, free space, or one of the auxiliary data structures that keep track of the files, directories, or free space. Because this property is an invariant, operating-system code that manipulates the filesystem can assume that it is true, and such code is responsible for ensuring that the invariant remains true whenever user code is executing.
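To make the idea concrete, here is a minimal sketch of what checking such an invariant might look like. It assumes a toy model that is not any real filesystem's layout: the disk is just a range of block numbers, each file is a list of the blocks it owns, and free space is a collection of block numbers. It also assumes the invariant means that every block is claimed exactly once. The name `check_invariant` and its parameters are hypothetical.

```python
from collections import Counter

def check_invariant(total_blocks, files, free_blocks):
    """Return a list of violations; an empty list means the invariant holds.

    total_blocks: number of blocks on the (hypothetical) disk
    files: dict mapping a file name to the list of block numbers it owns
    free_blocks: iterable of block numbers recorded as free
    """
    # Count how many times each block is claimed, by a file or by free space.
    claims = Counter()
    for blocks in files.values():
        claims.update(blocks)
    claims.update(free_blocks)

    problems = []
    # Every block on the disk must be claimed exactly once.
    for block in range(total_blocks):
        if claims[block] == 0:
            problems.append(f"block {block} is unaccounted for")
        elif claims[block] > 1:
            problems.append(f"block {block} is claimed {claims[block]} times")
    # Nothing may claim a block that does not exist.
    for block in claims:
        if not 0 <= block < total_blocks:
            problems.append(f"block {block} is out of range")
    return problems
```

In this toy model, `check_invariant(4, {"a": [0, 1], "b": [2]}, [3])` returns an empty list, whereas letting both files claim block 1 produces one complaint about the doubly claimed block and another about block 2, which nothing accounts for any more.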
Just about anyone who has spent much time working with filesystems has learned two things about them:
- Despite the best of intentions, it is not always possible to maintain the invariant in the face of a hardware crash or power failure.
- Accordingly, most modern operating systems come with some kind of filesystem verifier that can be used to ensure that the underlying data structures — and associated invariants — are valid when required.
Filesystem checkers are tremendously useful. Not only can running such a program after a crash repair any damage that the crash might have caused (or that might have caused the crash!), but it also reports whether damage was present. Such reports are particularly useful to filesystem designers, because the designers may be able to redesign their filesystems to make them more robust against particularly common kinds of failure. This benefit to designers comes in addition to the benefit to users that an ordinary crash need not cause extensive data loss.
A filesystem checker is an example of a data-structure auditing program. An auditing program typically verifies or reconstructs the structure part of a data structure from the data part, and it has two prerequisites:
- The data have to be stored in such a way that, even if the structure becomes corrupted, it is still possible to reconstruct it.
- Someone has to spend the time to write the auditing program and figure out when to use it.
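As an illustration of both prerequisites, here is a small sketch of an audit for a structure much simpler than a filesystem. The assumptions are mine: the data part is a list of (key, value) records with unique keys, and the structure part is an index that maps each key to its record's position. Because every record carries its own key, the index can always be rebuilt from the records alone. The function name `audit_index` is hypothetical.

```python
def audit_index(records, index):
    """Rebuild the index from the records and report any damage found.

    records: list of (key, value) pairs with unique keys; the data part
    index: dict mapping each key to its position in records; the structure part
    Returns (rebuilt_index, report), where report lists the discrepancies.
    """
    # Reconstruct the structure part purely from the data part.
    rebuilt = {key: pos for pos, (key, _value) in enumerate(records)}

    report = []
    # Compare the existing index against the reconstruction.
    for key, pos in rebuilt.items():
        if key not in index:
            report.append(f"missing entry for {key!r}")
        elif index[key] != pos:
            report.append(f"wrong position for {key!r}: {index[key]} should be {pos}")
    # Entries in the old index that correspond to no record at all.
    for key in index:
        if key not in rebuilt:
            report.append(f"stale entry for {key!r}")
    return rebuilt, report
```

The caller can install the rebuilt index to repair the damage and can log the report, which plays the same role as a filesystem checker's account of what it found.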
I first learned about data-structure audits from a fellow I met at a conference. He had spent many years working on telephone switching systems at a time when processors were so expensive that a typical telephone central office had only two of them, despite a requirement of no more than eight hours of downtime for any reason over 40 years. He said that this level of reliability required three separate strategies.