So There Is a Reason I'm Sitting Here at the Airport
At this writing, I'm sitting in an airport. LAX, to be specific. Waiting to get out of Los Angeles. But according to the airlines, there appears to be some nationwide glitch that's disrupted my plans. Ever curious, and not getting the full story from CNN, I dropped a note to someone who knows about this kind of stuff. Bill Curtis is chief scientist at Cast Software, and I was able to catch up with him while waiting for takeoff.
Q: Bill, I'm at the airport right now, and a glitch has delayed my flight. What's the problem?
A: Media reports indicate an outage in a telecommunication system that links computers involved in filing flight plans.
Q: Is this the result of a common pattern?
A: Until we know more about the specific cause of the outage, it is hard to say. It could be a software defect or it could be a hardware failure. The common pattern is that a failure in one part of a system can propagate throughout other parts of the system, crashing or degrading overall system performance. Even if future outages have different causes, a common theme is that many parts of the system are very old and need to be upgraded to handle the volume and complexities of modern air traffic.
Q: Are there other such patterns you've seen?
A: Most big system outages are not the result of a single failure. They are the result of a problem in one part of the system beginning a chain reaction of problems in other parts. With the size and complexity of modern computer systems, many of the most critical defects occur in the interactions among different parts of the system. It is very hard for humans to imagine and test for all the possible interactions that can result in damaging consequences.
Q: What is the solution to these types of problems?
A: The FAA needs to upgrade the system. However, they have had problems with large upgrades in the past because they were unable to manage them effectively. The Federal government needs to upgrade its ability to acquire the large, complex systems needed to run the nation’s infrastructure. At the same time, those developing computer systems, and especially software, need to stop thinking of themselves as ‘artists’ and start thinking of themselves as ‘engineers’. We need to provide them with better automated tools to augment our limited ability to comprehend all the interactions in large, complex systems. Software engineering is a very young engineering discipline, and not enough software developers are skilled in developing robust software architectures that can handle outages in other parts of the system and sustain their processing. Airplanes are designed with backup systems and other defensive mechanisms so they can keep flying when some part of the system fails. We need to pursue the same type of robustness in upgrading the air traffic control system.
Q: Bill, they just called my flight -- finally. Thanks for your time.

