We will teach you not to expect anything. That way, you will be ready for it.
Broken Angels, Richard K. Morgan
Mechanical and civil engineers must consider how their projects can fail, right at the start of the design process, by calculating the stress applied to each component and the strength required to withstand it. Electrical engineers apply similar calculations to their circuits by considering voltage, current, and thermal ratings. In each case, engineers determine the project's expected behavior based on well-known properties of the bulk material and the finished components.
Software design isn't engineering simply because it does not deal with physical materials that have known properties. Software components don't exhibit a graded response to increasing stress: A single, trivial, data-dependent error can cause complete, instantaneous failure with no prior warning. Programs simply don't have a well-defined safe operating area.
Last month I described the homebrew ISBN bar-code reader we used to catalog our book collection and mentioned that all of the error-detection tests eventually branch to the startup code. While that blithe response might be okay for a hobby project, somewhat more structured error handling is generally in order for a real project.
Let's take at another, seemingly simple, gadget with a different approach to error handling.
A typical nonPC consumer product can hide remarkable complexity behind an exceedingly simple facade. The instructions always tout ease-of-use and customers have come to expect simple operations. From the designer's standpoint, however, a severely limited interface both reduces the degrees of freedom available to break the software and makes debugging the inevitable in-the-field failures essentially impossible.
Figure 1 shows a desperate measure that certainly made sense during developmenta message useful only to folks who write and maintain the source code. One can only hope there aren't thousands of itemized error checks among the KLOCs, but that's certainly implied.
The gadget in question is my GoVideo Rave-MP AMP256 music player, and that error occurs when I insert a Secure Digital card or copy music to the internal Flash memory. At least, the failure sometimes occurs when I do those things. As you well know, intermittent failures are the most difficult ones to debug.
The player does not respond to any buttons, not even the power switch, with that message on the display. After removing and reinstalling the AAA cell, it sometimes gets wedged while powering up. Rarely, the white LED backlight may be turned on with the LCD inactive, indicating that the firmware hasn't restarted. When it does restart correctly, the LCD will be active, but it might be either completely blank or show the initial Rave-MP logo for as long as I care to wait.
Restoring normal operation requires considerable fiddling: Remove the cell, remove the SD card, wait a few minutes, install the cell, turn the power on, wait for it to index the music files in memory, turn it off again, install the SD card, and hope for the best. Some variation of those steps may or may not result in a successful recovery, but eventually it returns to normal operation and, almost always, the music files stored on the internal Flash memory remain intact.
After reading the manual and perusing the FAQ at http://www.rave-mp.com/ to no avail, I summarized what I knew about the problem in an e-mail to their Customer Service folks. Their stereotyped response arrived in a few days, advising me to consult the FAQs. I responded by pointing out that I'd already done so, but I'm still awaiting a response.
So much for Customer Service.
Being that sort of bear, I figured it'd be interesting to see what's inside. Figure 2 shows the front face of the circuit board, with the 128x64 LCD panel and 10 switches that constitute the entire user interface. Circuitry beneath the LCD implements an FM radio tuner.
Figure 3 shows the rear face of the board, with an SD card socket atop the built-in 256-MB Flash memory and the single-chip controller that does the heavy lifting. The earphone jack at the top right and the USB jack at the bottom are the only connections to the external world. The earphone cable doubles as the FM radio antenna.
Obviously, there are no user-serviceable components inside.
The markings on the controller chip led me to the SigmaTel web site (http://www.sigmatel.com/). The STMP3510 controller turns out to be a single-chip Digital Signal Processor with a wide variety of on-chip peripheral interfaces. SigmaTel doesn't provide datasheets, but its product selection guides and high-level descriptions hint at what's inside.
The processor has a 24-bit datapath and two internal 24-bit ROMs connected directly to the DSP core. I suspect the 16-K ROM contains boot code, well-defined peripheral routines, and digital filter coefficients. The 90-K ROM might be a Flash array holding the firmware and ready for a field upgrade.
On-chip hardware blocks connect the DSP to a wide assortment of external gadgetry: buttons, color and monochrome LCDs, backlights, various Flash memory cards, USB, I2C, and I2S interfaces. It even has an IDE hard-drive interface. I'm sure not all of this can be active at once.
The USB interface implements the Mass Storage Device Class functions, so it looks like an ordinary USB Flash drive when plugged into a PC. Windows, Mac, and Linux users can simply copy (or drag-and-drop) MP3 files from their hard drives to the player, even if that occasionally causes collateral damage of some sort.
The product description touts the chip's ability to play not only MP3 files, but audio files encrypted with any of several Digital Rights Management schemes. It can even display JPEG graphic files, a feature that certainly isn't implemented in the AMP256. Similarly, although one of the spec bullets touts Ogg Vorbis compatibility, my player doesn't seem to recognize Ogg files.
It evidently runs well with WMA files from Microsoft's Windows Media Player 10, one of the ways to install DRM-protected files into the Flash ROM. Apple's iTunes doesn't work and the company cautions that "other music management software" might not play, either. The DRM features lock music to a unique on-chip serial number.
It should come as no surprise that the DRM software routines are licensed, not purchased. The royalties for that code probably comprise a significant chunk of the final cost-of-goods number, but there's no way for me to know.
DSP chips sport an architecture specialized for high-speed signal processing, rather than general-purpose logic. Given the chip's limited program storage, I suspect it's constrained more by what fits than by what's possible, making feature selection a zero-sum game. Perhaps, should you need three separate DRM routines, you must then give up JPEG images on a color LCD.
However, completely different devices can use the same chip by surrounding it with a different assortment of peripherals and loading a different slug of firmware. That's the ideal situation for both SigmaTel, which can sell a standard hardware device in high volume, and its customers, who can get a media player on the market without developing everything from scratch.
The SigmaTel SDK, which its web site mentions without giving any details, certainly includes routines that control the I/O devices and present a useful API to higher level code. Implementing a media player thus becomes more a matter of choosing the desired options and integrating them with a thin layer of your custom code, rather than starting from hardware specs and a blank IDE template.
One unpleasant consequence is that a single error in a low-level SDK routine can appear in several different players sold by several different vendors. I mentioned this last September in the context of security, but it's just as applicable to any common function provided by Other People's Code. Identifying such a common-mode failure would be difficult, but I'm sure there are examples out there in the wild.
An MP3 player's user interface doesn't allow for a lot of error handling, so the program must take reasonable recovery action without involving users. For example, it should simply ignore files it can't handle (such as Ogg Vorbis), skip over files with bad data (such as damaged MP3 files), and step through subdirectories automatically.
Although the AMP256 firmware does all that, it's probably missing a test for a filesystem condition that shows up only under certain conditions during the first power-on after new files appear. That uncaught condition causes a catastrophic downstream failure that eventually falls into one of those generic "this can't possibly happen" handlers that we've all used.
At which point, for lack of anything smarter, the whole player wedges.
After the Crash
Industrial-grade control systems sometimes include a watchdog timer to regain control after a system crash. The program must continuously refresh the watchdog timer, typically by wiggling an output bit, at a specific rate that prevents it from triggering a hardware reset. The timer directly resets the hardware, rather than presenting a software-controlled interrupt, under the reasonable assumption that a failed program has entered unknown territory and cannot be trusted to respond correctly.
On the high end, real-time control systems have fail-safe redundant hardware that maintains correct operation despite failed hardware. Flight-control hardware may include triple- or quad-redundant hardware channels for input sensing, computation, and output actuation. Recovery mechanisms within each channel may use watchdog timers to regain control, but the overall system must produce correct, on-time outputs even while an affected channel restarts.
Consumer devices generally don't include watchdog timers, even if they'd benefit from an automatic reset after a failure. Economics motivates part of the decision, as a watchdog timer represents a per-unit hardware cost that provides no benefit during normal operation that, in a market where even a manual reset switch may be too expensive, simply can't be justified. The customer will probably toggle the power switch anyway after the device crashes, so why add exotic hardware?
Unfortunately, flipping the power switch may not have the intended effect. The power switch for many devices no longer directly affects the power supply; gone are the days when the main AC power cord passed through a Big Red Switch on the way to the power supply.
As a result, the "power switch" merely presents a hardware interrupt to the CPU, which can subsequently ignore it. The only reliable way to reset a badly wedged system is to unplug it or, more inconveniently, remove its battery.
The STMP3510 feature list says that it will run "up to 50 hours" from a single AAA cell. The entire AMP256 player certainly doesn't last that long because it includes a white backlight LED, external Flash ROM, and other gadgets that draw far more current. Even with those additional loads, removing and immediately replacing the cell often doesn't cause a complete reset.
The AMP256 includes a step-up power supply to produce a constant voltage from the steadily decreasing AAA cell's output. I suspect that the supply's output filter capacitor holds enough charge to maintain the DSP's internal state for a minute or two and, perhaps, the power-on reset circuitry doesn't handle the situation correctly.
Alas, a microcontroller may not start up correctly after a hard crash, even with a good hardware reset, if the firmware depends on the validity of a corrupted data structure.
Avoiding the Debris Field
An embedded system with a limited user interface must not only handle ordinary runtime errors with aplomb, it must also handle catastrophic errors by anticipating and correcting unexpected situations. Yes, your design must expect the unexpected, because the user certainly won't.
The only fundamental assumption you can make during a power-on restart is that the CPU hardware works correctly; if that's untrue, then any other assumptions are worthless. Drop me a note if you've ever caught a partially bad CPU with a firmware diagnostic test, as I've never had that pleasure.
Your code should verify a hash value (or at least a nontrivial checksum) over the entire firmware image before assuming that it can run successfully. There isn't much you can do with a broken program, but having a bare-bones backup image that can display a meaningful error message is a good start. It's even better if the backup image can load a new firmware slug from the external world.
Software state variables stored in battery-backed or nonvolatile memory form a rich compost to sprout program errors during startup. Before your program resumes from a previous state, you must ensure that all variables represent a valid state before you use any of them. That may be as simple as storing an overall hash before shutting down or as complex as a transaction-based state update mechanism.
Because software is so brittle, a single inconsistent value can have catastrophic consequences, so your state verification must cover every variable. Your code should be able to fall back to a previous state or, in the worst case, rebuild a default, known-good state on the fly.
After good firmware and good variables, what's left? Only trivial matters like including all the error messages in your instruction manual and getting your Customer Service folks up to speed on error recovery. But you do those anyway, right?
For a view of the engineering complexity behind seemingly simple mechanical devices, take a look at the Federal Highway Administration's guide for support structures at http://www.fhwa.dot.gov/bridge/signinspection.cfm.
Fortunately, software isn't prone to fatigue failure. The mid-1950s DeHavilland DH106 Comet jet aircraft failures showed aircraft engineers just how lethal repeated stresses can be: Read the chilling accident report at http://www.geocities.com/CapeCanaveral/Lab/8803/cometcrs.htm.
Small-scale embedded system crashes aren't particularly photogenic, but high-end crashes can be in-your-face events. Just for fun, feed "windows crash airport picture" (without the quote marks) into your favorite search engine. Failing that, The Gallery at http://www.windowscrash .com/ should keep you amused for hours.
The tech background of Morgan's Broken Angels (ISBN 0739442201) and Altered Carbon (ISBN 0345457684) will haunt you forever, should you happen to think uploading your consciousness to a computer is a Good Thing. Great Expectations is Dicken's story of destructive human stresses: http://www.gutenberg.org/etext/1400.