Tools

Reliability: The Hard and the Soft

By Ed Nisley, May 01, 2005

Ed looks at how to shake the errors out of reliability-critical systems.

Ed's an EE, PE, and author in Poughkeepsie, NY. Contact him at [email protected] with "Dr Dobbs" in the subject to avoid spam filters.

The supreme misfortune is when theory outstrips performance.
—Leonardo da Vinci

Leonardo was describing how light and shade affect the colors on a sphere, not how to design an embedded system, but his precept remains painfully true for us today, five centuries later. We have no shortage of theories describing how to create reliable systems, but reality continues to smack us upside the head.

Theory says that design reviews, of both software and hardware, weed out errors ranging from incomplete and misleading specifications and bad implementation choices, all the way down to simple typos. The reality is that many eyes often see what should be there, not what's actually in place.

Theory says that the choice of language affects the overall error rate, as (for example) C and its descendants permit errors in places where Java and its relations don't even have places. Java is not a silver bullet, although it is a step in the right direction.

Several recent, large-scale space missions provide perfect examples of how errors can lurk undiscovered in even the most reliability-conscious programs. In theory, the procedures applied during the programs should preclude these errors, but, in practice, even one overlooked error can be catastrophic.

Let's examine some in-flight errors, then see what Java becomes when it's ready to take flight.

Errors of Commission

One of the fundamental truths of hardware design is that stuff breaks. When your mission requires high reliability, you must include redundant hardware, detect device failures, and activate those backup gizmos. Spacecraft take this to an extreme because repair isn't possible: The hardware must be both inherently reliable to reduce the number of failures and include spares so the mission can continue despite failures.

NASA's Genesis spacecraft spent three years in space collecting solar-wind particles in silicon, diamond, germanium, sapphire, aluminum, glass, and gold sheets. After reentering the Earth's atmosphere, the craft would deploy a parasail and be snagged in mid-air by a helicopter to avoid any shock or contamination caused by ground contact. In fact, the sample collection capsule would be opened in a Class-10 clean room.

From all reports, the mission proceeded flawlessly, right up to the moment when the first drogue parachute should have deployed. Something was obviously wrong, though, as the long-range tracking camera showed the disk-shaped capsule tumbling wildly without a parachute. Figure 1 shows the result: Genesis smacked into the desert floor at just under 200 mph, shattering the collectors, cracking the sample capsule, and coating everything with dried mud.

Two redundant avionics packages, each with two of the accelerometer switches in Figure 2, failed to detonate the parachute-ejection charges. The switches should have activated at specific accelerations during reentry, but all four failed.

Subsequent investigation revealed that the switches were installed upside-down on the circuit boards. The expected acceleration simply held them more firmly inactive.

Although your first reaction might be to blame the schlub with the soldering iron, the Genesis Mishap Board discovered that the original design specified the wrong orientation. Everyone who reviewed the design, and surely their number was legion, evidently assumed the schematic diagram and board stuffing guide were correct.

Circuit-design programs include a logical symbol that shows what the part does, a physical symbol that shows what it looks like, and a list linking logical and physical pins. A single part may be available in several different packages, so you cannot assume a simple relationship between the schematic symbol and the silkscreen image on the circuit board.

When you create a new part for a circuit-design program's library, you draw its logical and physical symbols, then create the pin list. Large organizations have departments doing this stuff for a living, but the task always boils down to somebody reading the part's datasheet, drawing pictures, and copying pin numbers.

The accelerometer switch in Figure 2 is about as simple a part as can be: two leads on a can. Inside the hermetically sealed can is a weight riding on a carefully calibrated spring (that's my guess, anyway). Unlike a resistor, however, the can is not symmetric: orientation matters! The datasheet surely describes which way the acceleration must be applied to activate the switch, but keeping acceleration, deceleration, board orientation, and trajectory all correctly aligned is a challenge.

If I were creating that library part, I'd want to attach an ohmmeter to a real switch and give it a few vigorous shakes to be sure I knew which end was supposed to be up. Maybe that happened and something else went wrong? Maybe the board orientation changed after the first layout?

What has not been revealed so far is how the error managed to pass all the tests presumably applied to the overall assembly. I find it hard to believe that nobody applied the appropriate acceleration to the complete sample return capsule to activate the switches, but so it goes.

Errors of Omission

The eminently successful Cassini-Huygens mission to Saturn provides an agonizing demonstration of how things go wrong despite the most rigorous checking and verification procedures known to mankind. What's really painful is that a second error very nearly negated an amazing fix for the first error.

The mission plan called for the Huygens probe to transmit data to Cassini through two separate radio channels while entering Titan's atmosphere and landing on its surface. Because the probe decelerates relative to the spacecraft, the radio signals are Doppler-shifted: Cassini approaches Titan at about 5.5 km/s as Huygens floats downward under its parachute.

The RF receiver and data demodulator circuitry had been used successfully in Earth-orbiting satellites, so they were well-understood designs with a good track record. The Italian firm that provided the package refused to document the circuitry, asserting that it contained proprietary IP, and NASA and JPL accepted what was literally a black-box subsystem, acting under the reasonable assumption that it would work as required.

System tests throughout the decade-long process of designing and building the craft showed no problems. As James Oberg described the situation in IEE Spectrum, "However, a proposal for a so-called full-up high-fidelity test of the radio link between the probes (where every system is subjected to a simulation of the exact signals and conditions it will experience during flight) had been rejected because it would have required disassembly of some of the communications components...The reassembled spacecraft would then have had to undergo exhaustive and expensive recertification."

Cassini-Huygens launched in October 1997 and spent the next seven years executing precise gravity-boosted flybys of Earth, Venus, and Jupiter on the way to Saturn. Ah, if only we had the propulsion of any third-rate science-fiction rocket ship!

Although Cassini's radio receivers could handle the expected 38-kHz Doppler shift, which is a more-or-less straightforward matter of bandwidth in the RF chain, the demodulator converting the received signal into data bits was designed for a nominal 8192 b/s data rate. A relative velocity not only Doppler-shifts the RF carrier frequency, it also affects the effective data rate and, alas, a 5.5-km/s speed differential puts the data rate well outside the demodulator's limits. The original mission plan would have methodically chopped the data stream to garbage.

Oberg tells an antiDilbertian story of an engineering manager (!) following a hunch, an engineer designing a complex test and making on-the-fly adjustments to verify another hunch, carefully documenting the results and presenting them to a disbelieving organization, and eventually convincing a recalcitrant bureaucracy that Something Really Was Wrong. Fixing the demodulator would be a Simple Matter of Firmware if it were in hand, but there was no way to insert the new code without dismantling the probe.

The solution involved orbital mechanics: They adjusted Cassini's Saturn and Titan approach trajectory to reduce the downrange velocity difference and, thus, the Doppler shift and the data rate change. As you probably saw in January, the Huygens probe worked perfectly.

Half the probe's image data was lost anyway, because somebody forgot to turn on one of the two receivers. It seems the command to activate Channel A wasn't in the sequence controlling Cassini during Huygen's Titan entry, and the data stream fell on a deaf ear after all.

CBS News quoted David Southwood, director of science for the European Space Agency, as saying "We should remember we're human and we should learn lessons, so I will institute an ESA inquiry on how the command came to be missing."

High-Reliability Language

What makes those horror stories particularly interesting is that they have nothing to do with software. Unlike most of the conspicuous embedded-systems failures in recent years, the software worked perfectly. There's a simple reason for that: The development process that produces aerospace-grade software is designed to prevent errors, so only the truly bizarre goofs remain.

Back on earth, if an error in your code can drop your customer's aircraft from the sky, the FAA requires certification to DO-178B. Achieving DO-178B Level A certification, mandatory for flight-critical software, involves producing literally dozens of weighty documents that itemize every aspect of your software-development process.

DO-178B certification applies to the entire system, not just a functional block or routine. In fact, demonstrating that your high-level source code implements the specifications is necessary, but not sufficient, as you must also demonstrate complete test coverage on the corresponding assembly-language output—no dead code, no untested branches, no uninitialized variables, no compiler-generated weirdness, no nothing!

The requirements go further. You must ensure both temporal and spatial behavior by demonstrating that your code cannot stall in an infinite loop, that it will respond within the specified time limits, and that it cannot scribble over itself or anything else. You must do this, not by handwaving arguments, but by evaluating the code: routine by routine, line by line, even the library functions and compiler boilerplate that you didn't write.

Pop Quiz: List all the C++ compilers that produced output easily certified to DO-178B. Write legibly.

The requirements go even further. Your test cases must stimulate the entire as-built program; you may use unit test cases during development, but they're not acceptable for certification. Each test case must relate to a specific system-level functional requirement; if the complete set of tests doesn't produce the proper coverage, then you have a specifications problem.

Dynamic memory allocation? Multithreaded applications? Asynchronous exceptions? Only if you can prove correct operation to someone who isn't ruffled by hand-powered breezes.

Ada compilers have been producing DO-178B certifiable code since their earliest days, but the entire cadre of Ada programmers is reaching retirement age with no replacements on tap. C and C++ programmers are cheap and readily available, but the ensuing code presents nightmarish certification problems.

Java, on the other hand, is sufficiently denatured to eliminate most C and C++ horrors, while including many high-level features people have come to depend on. Can Java be adapted for DO-178B certification?

Yes, indeed, as versions of Java and its underlying JVM designed for high-reliability and real-time embedded systems are now inching toward approval by Sun and the Java Community. To meet the requirements for safety-critical DO-178B certification, though, here's a short list of what happens to a language that's already widely regarded as quite safe...

Dynamic class loading and dynamic memory allocation Go Away, which also eliminates the need for garbage collection. That allows hard, real-time performance and prevents a host of memory allocation issues.

All temporary data is stored on the stack, so the compiler can guarantee that the program never dereferences a dangling pointer. You hardly ever hear of the stack in ordinary Java programming, do you?

As a side effect, those changes eliminate 99+ percent of the JDK library routines, paring it down to about 10K more-or-less easily verified lines of code. Nearly everything a Java programmer has come to expect in the JDK won't be there for a safety-critical application.

Asynchronous control transfers, meaning all that exception handling stuff, Go Away. You must figure out how to handle errors with provably correct results and timings.

You can have separate threads, but without time slicing between them. You must ensure that overall system timing works properly regardless of how the threads play out. Rate Monotonic Scheduling is your friend.

Two interesting side effects fall out naturally from all that paring: The overall code footprint drops 99.5 percent to about 100 KB, and the performance jumps to about 90 percent of comparable C projects. For embedded-systems code, those are highly desirable attributes.

Now, if your first reaction is to reject programming in such a straitjacket, go back and reread the first part of this column. Would those projects have fared as well with your code in charge?

If your second reaction is to wonder what it would take to get your code up to spec, you're thinking the right thoughts. Full-throttle DO-178B certification isn't appropriate for most projects, but perhaps it's time to start rejecting structural complexity in favor of simple languages and routines that, well, just get the job done.

Reentry Checklist

Remember that my intent is not to slag NASA or ESA, but to highlight the extreme difficulty of catching that last error. Those organizations have an incredibly good and very public track record; it's likely that we'd all do better if our track records were similarly public, painful though it may be.

Read Leonardo's track record and be awed: The Notebooks of Leonardo DaVinci (Definitive Edition in One Volume) edited by Edward MacCurdy, Konecky & Konecky, 2003. Even 500 years later, you'll feel inadequate.

Watch the Genesis capsule reentry at http://anon.nasa-global.speedera.net/anon.nasa-global/genesis/genesis.mov. A note on the Genesis Mishap Board's preliminary report is at http://discovery.nasa.gov/news_101804.html.

The IEEE Spectrum article describing the Cassini receiver demodulator error is at http://www.spectrum.ieee.org/WEBONLY/ publicfeature/oct04/1004titan.html, and the ESA version is available at http://www.esa.int/spacecraftops/ESOC-Article-fullArticle_par-40_1103125842574.html.

IEEE Spectrum describes the Cassini receiver command error at http://www .spectrum.ieee.org/WEBONLY/wonews/ jan05/0105nhuygens.html and a CBS News version may still be at http://cbsnews.cbs.com/network/news/space/recent.html.

Just read the list of DO-178B's documentation requirements to show how little you're doing now: http://www.lynuxworks .com/solutions/milaero/do-178b.php3. Because DO-178B describes a set of processes, rather than prescribing a set of tests, it's far more complex than I've described here.

Kelvin Nilson of Aonix describes some aspects of real-time Java at http://www.stsc.hill.af.mil/crosstalk/2004/12/index.html.The illegibly small GIF table on this page compares ordinary and real-time Java: http://www.rtcmagazine.com/home/ printthis.php?id=100111.

DDJ

1 2 3 Next

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Tools