Written in Blood


Jun03: Embedded Space

Ed is an EE, PE, and author in Poughkeepsie, New York. You can contact him at [email protected].


We pray for one last landing

On the globe that gave us birth;

Let us rest our eyes on fleecy skies

And the cool, green hills of Earth.

"The Green Hills of Earth"

—Robert Heinlein

On a typical American day, 145 people die in traffic accidents, 74 from falls, and 68 by accidental poisoning. There are 540 deaths by accidental injury, 160 suicides, and 92 homicides. Deaths by disease completely dwarf those numbers: 3800 from heart disease, 3000 from various cancers, and 920 from strokes.

That's every day, on the average—not every week or every year.

Our concept of the likelihood of an event bears little relation to its actual probability. Dying one by one in our automobiles, homes, or hospitals seems almost acceptable. Dying in groups has more impact, perhaps because such events produce more distraught people for live TV interviews.

Tenerife, 1977: 578 dead when two 747s collide on a runway. Japan, 1985: 520 dead when a 747 fails by explosive decompression. New York, 1996: 230 dead when a 747 fuel tank explodes.

Boston, 1942: 492 dead in a nightclub fire. Kentucky, 1977: 164 dead in a supper club fire. Rhode Island, 2003: 97 dead in a nightclub fire started by indoor pyrotechnics. (Perhaps I'm behind the entertainment power curve, but isn't "indoor pyrotechnics" a generically bad idea?)

Despite those numbers, we continue flying and going out for entertainment, correctly viewing the risks as minimal. In fact, more people die on American roads every few months than in the entire history of commercial aviation.

Engineering has evolved a rich vocabulary for ways to describe a system's ability to operate correctly. Its reliability indicates your confidence that it will not fail. A fault tree reveals how its components interact and how a fault becomes a failure. A risk assessment shows you what happens after a failure occurs. The overall risk is simply the product of how much a failure costs and the probability it will occur.
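To put numbers on that last sentence (the figures here are invented purely for illustration): a failure that would cost $10 million and has a 1-in-10,000 chance per mission contributes $10,000,000 × 0.0001 = $1,000 of risk per mission, exactly the same as a $10,000 failure expected once every ten missions. A cheap, frequent failure can therefore carry as much risk as a rare catastrophe.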

Sometimes, though, failure is not an option.

Reliability

The English language has enough syntactic and semantic redundancy that you can extract great meaning from even a small sentence fragment. One Saturday morning, while bagging some red peppers in the grocery store, I overheard one worker say to another "...mutter mutter... Space Shuttle...mutter mutter..."

I walked over to Mary and said, "Behold the power of rumor. Two words—Space Shuttle." At that moment, we knew. The space program isn't news unless someone dies.

Unlike commercial aviation, unlike automobile travel, unlike any commercial transportation system, each fatal accident paralyzes our manned space program. Each incident also wipes out a major chunk of what's sometimes called our "spacefaring fleet."

Four vehicles do not constitute a fleet. They make up, at best, a collection of hand-tended, one-off, custom vehicles. Standing down for two years after the Challenger accident revealed both the system's fragility and how little confidence we can place in its continued performance. The Columbia disintegration may have a more severe effect, despite NASA's brave front.

It wasn't supposed to work out this way. We had the script and the plan. We lacked only the means.

Jules Verne began the story in 1865 with From the Earth to the Moon, the first more-or-less numerically accurate portrayal of space travel. Inspired by Verne's story, Konstantin Tsiolkovsky worked out actual launch and free-space physics in Russia just before the Wright Flyer demonstrated that men could fly as well as their dreams, but his work remained buried in the early years of Bolshevik Russia.

Shortly thereafter, Robert Goddard developed the first workable liquid-fueled rockets in a Massachusetts field. After a series of tests ended in a particularly spectacular crash, the state fire marshal forbade any further launches, forcing a move to a site that offered good weather and distant neighbors in Roswell, New Mexico. Yes, that Roswell.

Across the Atlantic, Hermann Oberth also memorized Verne and worked out further details. He inspired Wernher von Braun, who eventually got an offer he couldn't refuse from the Nazi Army. By 1944, his A4 rocket had become the V2, which, while not quite a practical weapon, was good enough for der Führer.

As David Reynolds points out in Apollo, it's easy to trace von Braun's subsequent career path. Every rocket designed by his team sported a snappy white-and-black paint job, ostensibly for visual tracking. Photos of the V2 and Saturn V clearly exhibit the lineage.

There's no time for debugging when your system stands on a tail of flame. Guidance computers can switch to their backups, but nozzles and turbopumps simply cannot fail. In other words, you must be willing to accept whatever failure rate your system's reliability predicts.

A simple way to establish reliability, given the likelihood of each component's successful operation, is to multiply all those probabilities together. If a system has 10 components, each with a 99 percent chance of working correctly, the overall system will work 90 percent of the time: 0.99^10. With 100 components, the odds drop to 37 percent. Lash a thousand such parts together and you'll reach orbit once every 23,000 tries.
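Here's a back-of-the-envelope sketch of that arithmetic, assuming independent components that must all work for the system to work (the 99 percent figure is the hypothetical one from the paragraph above):

/* Serial-reliability arithmetic: assumes every component is
 * independent and the system fails if any single part fails. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double p = 0.99;                  /* chance one part works */
    const int counts[] = { 10, 100, 1000 };
    const int n = sizeof counts / sizeof counts[0];

    for (int i = 0; i < n; i++) {
        double system = pow(p, counts[i]);  /* every part must work */
        printf("%4d parts: system works %.5f of the time\n",
               counts[i], system);
    }
    return 0;
}

The last line is the once-in-23,000-tries figure: 0.99^1000 is about 0.000043. Push per-part reliability to 0.999999 and a million-part stack still works only about 37 percent of the time, which is the sense in which anything less than exactly 1.00 raised to the millionth power is zero.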

The Saturn V booster had roughly a million parts, yet NASA launched 13 successful missions in 13 attempts. The comparable Russian N1 booster racked up zero for four and seared its Baikonur launch complex off the map in the process. In round numbers, anything less than exactly 1.00 raised to the millionth power is zero.

Fault Trees

By now, essentially everyone knows that Version 1.0 of any software is fraught with peril. Open-source software is no different, except that you can contribute to the bug hunt by rummaging in the code. In either case, we've come to accept errors in code as pretty much the normal state of affairs.

That principle applies equally well to rocketry, which is why von Braun developed a long-term plan to reach the planets, if not the stars. Broad-winged shuttles would haul supplies to a 1000-mile orbit. A rotating-torus station, assembled in orbit, would provide living quarters for construction crews building Moon landers. Those hard-vacuum tools and techniques would soon produce interplanetary vessels.

Never mind that von Braun didn't know how to dissipate the energy differential between orbital and landing velocities, that rotating toroids are inherently unstable, or that boosting supplies out of Earth's gravity well is breathtakingly inefficient. Fixing those problems is a matter of engineering.

The key point is that each step would provide experience, paid in sweat and blood and tears, for the next increment. The first few shuttles would certainly have Version 1.0 growing pains, but after a few years, the fleet, a real fleet with tens or hundreds of shuttles, would gain enough reliability to become dependable.

The machinery and controls proven in Earth-orbit shuttles would provide engineering experience for the Moon lander. Construction techniques honed on the station would create graving yards for Mars-bound ships. Organizational details, smoothed out through decades of practice, would mean that by 2001, more or less, Pan Am could be running the shuttles.

That plan, popularized in magazines, books, TV, and even Disney movies, galvanized the American public. Everyone knew how it should be done and that we would do it. Everyone, that is, except the folks who could make it happen. President Eisenhower had no interest in rocketry other than as a means to deliver atomic warheads.

The rest is familiar history: Sputnik, the Moon race, imploding funds, Skylab, the Shuttle, the incredible shrinking station formerly known as Freedom, and on and on. Apollo may have been the only way to reach the moon in a decade, but without repeated practice on standardized hardware and iterative development of successive projects, we did not build a foundation for the continued presence of humans in space.

When the cumulative probabilities aren't in your favor, you work smarter. A system's fault tree starts with an undesirable outcome ("losing a wing") at the root and adds branches for all possible causes ("structural burn-through"). Each branch ramifies further as you identify subcauses ("losing a tile"), until eventually you reach the level of individual components ("adhesive"). You can include the effect of redundant components, error-detection-and-correction hardware, and so forth; the final result resembles nothing so much as a giant flow chart.

By assigning probabilities to each component's failure and working back through the tree, you can determine the overall likelihood of the undesirable outcome. If that number is too high, which is usually the case, the high-risk branches of the fault tree identify where you should devote your resources. A Monte Carlo simulation can evaluate the effect of uncertainties in your numbers and show unexpected failure linkages.
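Here's a minimal sketch of that roll-up, borrowing the event names from the previous paragraph; the gate structure, the "backup barrier," and every probability are invented for illustration, not real Shuttle data, and the leaf events are assumed independent:

/* Toy fault tree: the top event "lose a wing" requires structural
 * burn-through, modeled as BOTH a lost tile AND a failed backup
 * barrier (an AND gate); a tile is lost if EITHER the adhesive
 * fails OR debris strikes it (an OR gate). */
#include <stdio.h>
#include <stdlib.h>

static const double p_adhesive = 0.001;  /* adhesive bond fails          */
static const double p_debris   = 0.002;  /* debris impact on the tile    */
static const double p_barrier  = 0.01;   /* backup thermal barrier fails */

int main(void)
{
    /* Analytic roll-up: OR gate = 1 - product of (1 - p),
       AND gate = product of p, for independent events. */
    double p_tile = 1.0 - (1.0 - p_adhesive) * (1.0 - p_debris);
    double p_top  = p_tile * p_barrier;
    printf("analytic top-event probability: %.2e\n", p_top);

    /* Monte Carlo check: sample each leaf, propagate the gates. */
    const long trials = 10000000L;
    long hits = 0;
    srand(42);
    for (long i = 0; i < trials; i++) {
        int adhesive = (double)rand() / RAND_MAX < p_adhesive;
        int debris   = (double)rand() / RAND_MAX < p_debris;
        int barrier  = (double)rand() / RAND_MAX < p_barrier;
        int tile     = adhesive || debris;   /* OR gate  */
        if (tile && barrier)                 /* AND gate */
            hits++;
    }
    printf("Monte Carlo estimate:           %.2e\n",
           (double)hits / trials);
    return 0;
}

In practice, the Monte Carlo pass earns its keep when the leaf probabilities are sampled from uncertainty distributions rather than treated as exact constants; that is where the unexpected failure linkages tend to show up.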

What makes fault-tree analysis work is a background of experience with components and their failure modes in your application. Misapplying a component, misunderstanding how it can fail, or using a component with unknown failure modes can produce a tidy fault tree full of meaningless reassurance.

Firmware, unfortunately, fits poorly into the fault-tree analysis model. You can't assign a probability that a particular function will fail, because software is deterministic: given the same input data, it does exactly the same thing every time. You can't assume two functions are independent, as an unrelated fault can lock up a system and kill both threads. Worse, you cannot assume that small, low-probability faults produce small, well-contained failures, because software doesn't obey Newton's Laws.

Unseen Risks

The probabilistic models that evaluate the reliability and risks of complex projects have enabled us to build bridges, buildings, bombs, and boosters. Their very success makes the failures harder to understand because the concatenation of unlikely events required for a catastrophe seems impossibly unlikely.

Aircraft design has been in that zone for decades. Fatal accidents occur only after several tandem faults, often over the course of hours or days. The odds against each successive fault are high and the system can survive each individual incident, but the final failure destroys the craft.

Building codes also work that way. Panic bars on doors, lighted exit signs, occupancy limits, and stairwell design all reduce the effects of a fire. Omit any one feature and you're probably safe, any two and you're still fine, omit all of them and still nothing happens for years on end. Then one day the bodies pile up, and an investigation adds further standards to the accumulated base of knowledge.

The Shuttle isn't in that zone: only five orbiters were ever built, and the base design has been largely static for a quarter of a century. Lessons have been learned and the reliability of each craft is improving, but a fleet of (now) three works differently from a fleet of hundreds. Worse, those lessons can't be incorporated in better designs without building those designs.

Even though design checklists must always be written in blood, we do learn from our mistakes. Henry Petroski shows clearly that we design the safest systems we can, based on the always-painful tradeoff between cost and function, then learn how those designs interact with the real world. Our failures form the basis of better understanding and greater safety, but that experience cannot be gained by extrapolations or simulations.

Going through the rigors of a risk assessment forces you to catalog all the what-ifs, then return to imagine even more gotchas. After a few iterations, perhaps guided by formal tools, you think you've got them all, only to wake up sweating when you realize those redundant fiber cables all pass through the same railroad tunnel or you've stacked those high-availability servers under a fire sprinkler.

Software has a long way to go before it reaches that level of analysis—a single unpatched program can now bring down a data center. Firmware cannot be evaluated in isolation from its hardware, as failures in either propagate to the other at lightning speed.

The general public is beginning to experience systemic firmware failures that cause death and injury. Unfortunately, with essentially all embedded systems being one-off designs, lessons learned in one system can't be rigorously applied to the next. I suspect only Moore's Law can save firmware from itself: Firmware must come in standard bricks with known properties.

Proponents of various software methodologies will argue that better specifications, better design, and better testing can save the day, that those lessons are indeed being learned. To some extent I agree, but I contend that only when we have stable firmware-in-hardware bricks, whether large or small, listed in a catalog with guaranteed price, performance, and reliability specs that measurably improve over time, can we make real progress.

Moore's Law will make those bricks feasible when the cost of our accustomed practices becomes unacceptable. That cost has honed engineering to an edge where a few hundred airline deaths can be balanced against the benefit to millions of lives. When we can measure software's reliability, incrementally improve it, and truly balance costs against benefits, then we'll have real software engineering.

Manned space travel can continue with hand-tooled craft, or it can move to a true fleet of assorted vehicles that improve based on accumulated experience. I hold scant hope for the latter, given NASA's track record of terminating any project that could conceivably supplant the Shuttle. Perhaps von Braun's winged shuttles aren't the best design; we'll never know until we try.

As for embedded systems, are we prepared to learn from lessons written in blood?

Contact Release

Signals from Pioneer 10 dropped below our detection threshold this year, 31 years into a 21-month mission. The final whisper from 82 AU said only that Pioneer continues onward into the Deep Dark. Godspeed, old friend—may we meet again at Aldebaran. More at http://spaceprojects.arc.nasa.gov/Space_Projects/pioneer/PNhome.html.

Everyone raised with the Heinlein canon recalled the last song of Rhysling, the Blind Singer of the Spaceways, on February 1. We don't live in Heinlein's alternate future, but he captured the human essentials of ours. "The Green Hills of Earth" appears in several collections, notably The Past Through Tomorrow (out of print). More on Heinlein at http://www.nitrosyncretic.com/rah/archives.html.

Dan Simmons captured something quite different in his short story "Two Minutes Forty-Five Seconds," collected in Prayers to Broken Stones (Spectra, ISBN 0553762524). Amazon.com lists the paperback at $23.00, so check your favorite used bookstore or library.

Go to http://catless.ncl.ac.uk/Risks/ for the Forum On Risks To The Public In Computers And Related Systems. A study of applying Probabilistic Risk Analysis to generate public information is at http://www.wmsym.org/wm99/pqrsta/57/57-4.pdf. A good intro to multiple-failure reliability lives at http://www.crhc.uiuc.edu/EASY/Papers02/EASY02-herbert.pdf. Read anything by Henry Petroski, starting with the list at http://www.americanscientist.org/other/Petroski/Petroski-books.html.

David West Reynolds's Apollo: The Epic Journey to the Moon (Harcourt, ISBN 0151009643) makes clear what we did not see while watching in real time. The results of a small-scale space race were quite different: http://www.moonrace2001.org/contest.shtml.

Details of von Braun's V2 project unfold at http://www.v2rocket.com/. Those for whom Buchenwald is just a foreign word will find this disturbing; those who can never forget know it already.

You'll find airline safety statistics with useful links at http://www.airsafe.com/ or go to the source at http://www.ntsb.gov/aviation/aviation.htm. The CDC has a database chock-full of intriguing how-we-die numbers at http://www.cdc.gov/ncipc/wisqars/. Download the CSV files to a spreadsheet in order to print them.

DDJ

