Life Support

By Ed Nisley, November 01, 2001

It's no surprise that microcontrollers and other embedded devices are used in life-support systems. But can you really, really trust them?

Nov01: Embedded Space

Ed is an EE, PE, and author in Poughkeepsie, New York. You can contact him at [email protected].

Somewhere in the first few pages of every semiconductor data book, you'll find a paragraph like this:

Don't even think of using our products in a life-support application without first getting our explicit, written permission. Should our product fail, we don't want to hear from you, your lawyer, or your late customer's lawyer. Believe this!

It's usually wrapped up in somewhat more formal language, but you get the notion.

Life-support equipment sustains a human life and, should it fail, will cause a human death. Pacemakers, blood pumps, ventilators, defibrillators, and similar medical widgetry spring immediately to mind as systems where the failure of a single component can cause a system failure and a loss of life.

In recent years, though, embedded systems have taken control of products that you'd not ordinarily consider members of that elite group. As we become more dependent on automation for functions previously handled by humans, that automation must become more robust because the consequences of failure can be severe.

If you think "life support" means scenes from "ER," a few vignettes may change your channel.

You've Watched the Movie

While Apollo 13 was en route to the Moon, Mission Control in Houston noted that the fill-level sensor for Oxygen Tank 2 in the Service Module was "off-scale high," but suspected a sensor failure. Figure 1 relates a conversation that you should memorize.

At 55 hours and 53 minutes into the flight, in response to a reminder from Mission Control, the crew turned on the fans inside three separate oxygen tanks to stir their contents. A minute after the fans started, telemetry data reported high temperatures in Tank 2, anomalous oxygen system pressures, and voltage and current transients, followed immediately by unexpected acceleration along all three axes and fuel cell output voltage failures.

Oxygen Tank 2, serial 10024X-TA0008, had been installed in the Apollo 10 Service Module, but failed a preflight test. It fell about two inches while being removed, then was repaired, tested, and installed in the Apollo 13 SM, where it failed a subsequent test. To empty the tank, internal heaters and fans ran from a 65-V power supply for nearly eight hours, considerably longer than specified by the normal procedure.

Block II Oxygen Tank heater assemblies were rated for use with a 65-V supply, but included older Block I over-temperature protection switches with a 30-V limit. Although the voltage specification had been increased years earlier and all the other heater components modified accordingly, the switches were not changed. This was an oversight that caused no problems, as Apollo 7 through 12 used mismatched assemblies without incident.
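A mismatch like this is the kind of thing a mechanical cross-check of part ratings against the supply voltage can flag. Here is a minimal sketch, assuming a hypothetical parts list. Only the 65-V supply and the 30-V switch limit come from the account above; the part names, the other ratings, and the function name are illustrative placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    name: str
    max_volts: float  # highest voltage the part is rated to handle


def underrated_parts(parts, supply_volts):
    """Return the names of parts rated below the supply voltage."""
    return [p.name for p in parts if p.max_volts < supply_volts]


# Hypothetical heater assembly. Only the 30-V switch limit is from the
# article; the other entries are placeholders for "modified accordingly."
HEATER_ASSEMBLY = [
    Part("heater element", 65.0),
    Part("wiring harness", 65.0),
    Part("over-temperature switch", 30.0),
]
```

Calling underrated_parts(HEATER_ASSEMBLY, 65.0) returns a list containing only the over-temperature switch. A check like this helps only if the parts list is kept current, which is exactly where a rushed schedule tends to bite.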

Fault analysis later showed that Block I switch contacts would weld shut if they began to open with 65 V applied. That failure allowed parts of the assembly to reach nearly 1000F during TA0008's eight-hour bake, damaging the Teflon insulation and partially melting the wires. Figure 2 shows similar contacts after enduring those conditions in a test.

TA0008 passed subsequent tests that, unfortunately, did not include verifying that the protective switch would open at the correct temperature. There was, thus, no indication that anything was wrong inside the tank, a heavily insulated and sealed pressure vessel with an interior that could not be visually inspected.

It is possible that the fill-level indications earlier in the flight were not sensor failures after all. For all we know, the previous fan cycle caused a small, self-extinguishing fire that pushed the sensor "off-scale high." At this late date, we will never be certain.

When the Apollo 13 crew turned the fans on again at 55:53, the damaged wiring inside Tank 2 arced and the insulation caught fire. Teflon, aluminum, and even solder burn vigorously in pure oxygen.

Do you recognize this scenario? It occurs whenever design changes and verification testing meet an impossible schedule. Even with eminently competent people on the job, oversights and omissions will occur. Those errors may not be immediately obvious, may not be found by subsequent testing, and may lie dormant for years.

Fortunately, Apollo 13 returned safely, through the liberal use of duct tape and Yankee ingenuity. As I pored over the transcripts and failure analyses, I wondered if I could do as well in the same situation.

Not embedded enough? Shift your focus from translunar space to a Houston parking garage.

You've Read the Signs

Every elevator carries a placard reminding you not to use it in case of fire. You probably know that elevator control systems park the cars at ground level when they detect smoke or fire. Fire, however, might not be the only hazard an elevator occupant faces.

This past spring, Tropical Storm Allison flooded much of Houston to unprecedented levels. According to an article in the Houston Chronicle, Kristie Tautenhahn, a law firm proofreader who worked second shift, stayed in her office overnight rather than risk driving home. At about 5:30 am on June 9, the security department announced that water was entering the building's underground parking garage and that employees should move their cars up.

Tautenhahn's car was parked near Level 4, the bottom level. She boarded the elevator and descended to at least Level 3. A police spokesman quoted by the Chronicle said "It appears that water began rushing into the elevator and it malfunctioned and she drowned in the elevator."

When I related that story to my friends in the biz, each and every one had the same sick feeling you have right now. Was Level 3 flooded? Did that elevator car inexorably descend below the water level in the shaft? What could an elevator passenger possibly do as water began pouring through the door joints?

The rest of the story emerged a week later, in Chronicle articles dated June 17 and 19. Witnesses reported that Tautenhahn left the elevator on Level 3 and walked downward toward her car. At that moment, a wall between the garage and a drainage tunnel collapsed, releasing a torrent of water that pushed her back into the elevator. The Harris County Medical Examiner's Office ruled that her death was caused by drowning, with a head injury as a contributing factor.

Although her death remains a tragedy, knowing that it didn't have an Edgar Allan Poe subtext may help at least a little bit.

However, what if that wall had failed a few minutes earlier? What if the lower levels of the garage had been full of water? What if she had punched the Level 4 button instead of Level 3? What if, what if, what if?

Building construction and safety regulations require smoke sensors that prevent elevator doors from opening into a fire and command all cars to return to ground level. As nearly as I can tell, water sensors that prevent an elevator car from descending into a flooded shaft are not required, although I wouldn't be surprised to discover that they are for below-grade installations. If not, they may well be mandatory in the near future.
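One way to picture such an interlock is to treat a wet pit sensor much like the smoke sensor and refuse calls to below-grade floors. The following is a minimal sketch, not a description of any real controller. The floor numbering, the single boolean water sensor, and the function name are all assumptions made for illustration.

```python
GROUND_FLOOR = 0  # assumption: floors below grade are negative numbers


def allowed_destination(requested_floor, water_in_pit, smoke_detected):
    """Return the floor the car may travel to, given hypothetical sensor inputs."""
    if smoke_detected:
        # Behavior described in the text: cars return to ground level.
        return GROUND_FLOOR
    if water_in_pit and requested_floor < GROUND_FLOOR:
        # Hypothetical flood interlock: never descend below grade.
        return GROUND_FLOOR
    return requested_floor
```

With water in the pit, a call to Level 3 of an underground garage (floor -3 in this numbering) sends the car to ground level instead, while calls to floors above grade go through unchanged. The sketch ignores the hard questions, such as what to do with a car that is already below grade when the sensor trips.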

The regulations I've seen (and I am not a civil or structural engineer) specify drains or sump pumps in elevator pits, although these seem intended to remove fire sprinkler system water that enters the shafts. Obviously, the Houston flooding far exceeded any reasonable expectations for an elevator sump pump.

Once you begin thinking about elevator operation, how many weird failure modes can you envision? More than 150 years of relentless design improvements have eliminated essentially all mechanical failures, but the new hazards seem endless. Remember when Aum Shinrikyo released Sarin gas in the Tokyo subway? Wouldn't it be nice if elevator doors didn't open on that level? Is it fair to expect designers to consider such a remote possibility?

A project's design point simultaneously tells you what to build and limits your implementation choices. Envisioning all the potential problems that may arise when the system's use (through misuse or abuse) strays beyond the design point can be extremely difficult, particularly for projects with even moderate constraints on schedule, budget, materials, and headcount. Some events you know might happen are simply so rare that you can't justify any additional cost to prevent them.

Simply put, how much are you willing to spend for a feature that will (almost) certainly never be activated? Conversely, how much would you spend to have it when it's needed?

Not sufficiently plausible? Check your kitchen.

You've Seen the Light

You've certainly noticed the relentless pressure to web-enable nearly every embedded system, whether or not it makes any sense. Very often, the driving force is economic from the supplier's side, not convenience from the consumer's side, with technical merit having little to do with it.

Earlier this year, the cheese in our refrigerator sprouted a spectacular case of mold. It turned out that the door switch had failed, continuously lighting two 40-W bulbs. The cooling system couldn't handle the additional 80 W and the refrigerator temperature hit 55F while the freezer moved into Antarctic territory.

Hmmm, switch contacts. Fortunately, the refrigerator neither caught fire nor exploded. I will not provide a voice transcript of our kitchen conversation.

Being that type of guy, I tore the refrigerator apart and replaced the switch myself. It cost about 10 bucks at the local parts store, but it was a special-order item. We managed to cope with gloom inside the box for the three days the new switch took to arrive.

Now, web enabling a refrigerator makes no sense until you consider service costs. As with our switch, repairing an appliance typically requires two service calls by an actual human technician: One to figure out what's wrong, and another to replace the failed part. Eliminating the second trip and reducing the cost of the service event by nearly half might justify more hardware in the refrigerator.

However, expecting a customer to run diagnostic routines and read the codes back over the phone won't fly. Expecting the refrigerator to note that it's getting much too warm and call for service might, even if the fridge lacks a biohazard sensor to sniff the cheese.
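The "getting much too warm" check could be as simple as a timer on an over-temperature condition, so that an ordinary door opening doesn't trigger a call. Here is a minimal sketch under stated assumptions: the 40F limit, the 30-minute hold, and the class name are all illustrative, and the service call itself is reduced to a latched flag.

```python
class WarmBoxMonitor:
    """Flag a service request when the box stays too warm for too long."""

    def __init__(self, limit_f=40.0, hold_minutes=30):
        self.limit_f = limit_f            # assumed alarm threshold, degrees F
        self.hold_minutes = hold_minutes  # assumed time above limit before calling
        self.minutes_warm = 0
        self.service_requested = False

    def sample(self, temp_f, minutes_elapsed=1):
        """Feed one temperature reading; return True once service is needed."""
        if temp_f > self.limit_f:
            self.minutes_warm += minutes_elapsed
        else:
            self.minutes_warm = 0  # a cool reading resets the timer
        if self.minutes_warm >= self.hold_minutes:
            self.service_requested = True  # latched until someone clears it
        return self.service_requested
```

Fed one reading per minute, a box sitting at 55F requests service on the thirtieth consecutive warm sample, while a few warm minutes followed by a cool reading start the count over.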

The catch? Justifying the expense and nuisance of a phone hookup, plus the support infrastructure. Web enabling the control system and adding nifty consumer features goes a long way toward making the whole deal palatable, at least on high-end refrigerators. After all, you've surely had the desire to check your cheese through your web phone in rush-hour traffic.

Not life supportish enough? Put them all together.

Now, Do the Design

Here's the crux of the problem: Web-enabled, deeply embedded systems will have the potential for life-threatening behavior. Even though the systems aren't in conventional life-support applications, they can directly contribute to injury or even death.

Essentially by definition, deeply embedded controllers have a long product life, static code, limited physical access, and users who couldn't care less. Unlike the chaos in the PC market, which measures product lives in months, an embedded controller may perform the same job for decades. Which, as long as it's untouched, is perfectly all right.

Crackers, however, delight in breaking into systems: Anything with a fixed IP address has "Crack Me!" painted all over it. The diversity and sheer weirdness of embedded systems may delay the inevitable, as will the lack of embedded engineering skills in the cracker population, but not enough to make a difference.

Suppose tens of thousands of refrigerators suddenly begin sending fake messages to, say, the local police and hospital servers? Or each one dials 911 every 10 minutes? Or, in always-on homes, it monopolizes your DSL bandwidth by FTP-ing bulk junk to or from random servers? Or it sets the thermostat to 55F without updating the display?

You've seen proposals for health-care monitoring and diagnosis over the Internet. When that comes to pass, as it surely will, a distributed Denial-of-Service attack on the monitoring center may prove fatal. When you dial 911 and a refrigerator, not necessarily your own, is on the line, who you gonna call? If your medication goes nasty when stored above 40F, would you notice before your next dose? You can imagine some fairly ugly scenarios here, too.

Although I'd like to believe that a refrigerator can't perform DDoS attacks, it's entirely possible. Code in flash ROM and provision for remote updates provide the key. Given access to many identical units and any way to modify the code, even if the refrigerator doesn't run Linux, even if the net connection runs a dial-up link through CHAP with DHCP, the cracks will appear in short order.
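Since remote updates are the way in, the least a device can do is refuse code it cannot authenticate. The sketch below is not something the column prescribes. It shows one such check using HMAC-SHA256 from Python's standard library, and the function name and the idea of a per-device secret key are assumptions for illustration.

```python
import hashlib
import hmac


def update_is_authentic(image: bytes, tag: bytes, key: bytes) -> bool:
    """Return True only if tag is the HMAC-SHA256 of image under key."""
    expected = hmac.new(key, image, hashlib.sha256).digest()
    # compare_digest runs in constant time, so it leaks no timing hints.
    return hmac.compare_digest(expected, tag)
```

Note the weakness that the "many identical units" remark points to: if every unit shares one symmetric key, extracting it from a single refrigerator lets an attacker forge updates for all of them. Public-key signatures avoid putting the signing secret in the field, at the cost of more code and more processing.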

Crackers have no compunction about hardware destruction, making failures increasingly expensive and hard to diagnose. Not to mention that the afflicted appliance can't phone for a firmware update any longer.

Whenever you read about web-enabled and Internet-aware gadgets, ask yourself how secure they are and what hostile code could accomplish, not just to the gadget, but to the Internet and society at large.

Even when talented embedded systems designers use the latest development methodologies, mistakes will occur. Companies are under the same pressures that produced TA0008, their products are subject to the same unexpected conditions as that Houston garage, and their users are as well trained as ordinary refrigerator browsers.

Code Red will look tame.

Reentry Checklist

Figure 2 comes from the NASA Apollo 13 image collection at http://images.jsc.nasa.gov/images/pao/AS13/10075548.htm. You can search the entire collection by keyword.

NASA's Mission Transcript Collection (SP-2000-4602, two CDs) includes PDFs of the radio traffic transcripts from Redstone 3 through Apollo 17, including, for all you SETI junkies, the famous "Bogey at 10 o'clock high" report from Gemini 7. Send a self-addressed, padded envelope stamped with $1.95 to NASA Headquarters Information Center, Mail Code CI-4, 300 E Street SW, Room 1H23, Washington, D.C. 20546-0001.

A summary of the Apollo 13 mission and the official problem findings at http://www.hq.nasa.gov/office/pao/History/alsj/a13/a13.summary.html shows just how much detail you can recover from a nearly infinite paper trail. The Apollo 13 mission site is at http://nssdc.gsfc.nasa.gov/planetary/lunar/apollo13info.html. An unofficial annotated transcript appears at http://members.accessus.net/~090/awh/as13.html. The tank serial number appears as TA0008 or TA0009; I used NASA sources.

The Houston Chronicle news archives, searchable for a nominal fee, reside at http://www.chron.com/. Use keywords "Tautenhahn" and "elevator."

If you swear you'll never ride an elevator again, the short story "Descending" by Thomas Disch might scare you off escalators, too. It's collected in Fundamental Disch (Bantam Books, 1980; ISBN 0-553-13670-4).

Get the scoop on some soon-to-be web-enabled appliances at http://www.myappliance.carrier.com/. The reality may be more pedestrian than you expect, but this is the cutting edge of the cleaver. It's not just refrigerators!

Don't read Set Phasers on Stun and Other True Tales of Design, Technology, and Human Error by Steven Casey (Aegean Publishing, 1993; ISBN 0-9636178-7-7) near bedtime.

DDJ
