Data Structure Audits
I've already mentioned the first one: using two separate processors. I imagine that each processor had its own power supply, and probably its own connections in and out of the central office. The idea was that even a complete failure of either processor should not affect the other.
White PapersMore >>
- Transforming Operations - Part 1: Managing Outsourced Development in Telecommunications
- High Performance Computing in Finance: Best Practices Revealed
The second strategy was that each processor was constantly monitoring the other one. If a processor stopped responding, or started behaving abnormally, the other processor would restart it. This strategy is similar to the firewall notion that I described last week: Even if one processor went haywire, it is very unlikely that the failure could affect the other processor because there was so little communication between the two. Therefore, the odds are overwhelming that if a processor failed, the other one would restart it and life would go on.
I already knew about these first two strategies. What I learned from the switching expert at this conference was that these strategies weren't good enough by themselves to meet the reliability requirements. The reason was that bugs, both hardware and software, would gradually corrupt the processors' internal data structures to the point where that corruption would interfere with the switching system's operation.
This problem lead to the third strategy: All data structures in the system were designed so that they could be audited and repaired. Whenever a processor had nothing else to do, it would go through all of the data structures in its memory, looking for anomalies in each data structure and rebuilding the structure part of that data structure from the corresponding data. He told me that this auditing strategy was so effective that when it was turned off, a processor would crash every day or two; but with auditing turned on, years could elapse between crashes.
The approach of designing data structures to be auditable, and auditing them from time to time, has two advantages. Not only does it simplify recovery after failures (and sometimes makes those failures invisible, or nearly so, if the failing component can be restarted automatically), but data-structure auditing is a powerful debugging tool.
In fact, it's more than that — it's a bug prevention tool. If you have an effective data auditor, you can call it in an assert statement at strategic points throughout your program. If it ever detects a failure, you have found a bug — even though you might never have been able to construct an external test case to trigger that bug. And even though continuously auditing your data structures may make your program unacceptably slow for production, you can always turn on
NDEBUG in the version of the program you ship. If someone sends you a test case that causes the program to fail, you can try it again with
NDEBUG turned off; perhaps the data-structure auditing will find that bug too.
In short, if you can divide your data structures into a data part and a structure part, and you can write an audit program that is capable of rebuilding the structure from the data, you can use the auditor not only to make your programs more reliable, but to get them working more quickly than you might be able to do otherwise.