Loose Ends

Ed catches up on reading and responding to his (e)mail.


June 08, 2007
URL:http://www.drdobbs.com/embedded-systems/loose-ends/199902736

Ed is an EE and author in Poughkeepsie, NY. Contact him at [email protected] with "Dr Dobb's" in the subject to avoid spam filters.


To a good first approximation, all e-mail is now spam: Something over 91 percent of all messages, as of late last year, are unwanted trash. Attack probes against my firewall dropped to one every 146 seconds shortly after the December 2006 earthquake near Taiwan knocked out undersea data cables between Asia and the U.S., but returned to one a minute in short order.

E-mail from you folks provides one reason not to pull the plug. In this column, I'll go into more detail in response to some thoughtful inquiries and suggestions.

Power Kills

A reader aliased as Naj Hajek asked the age-old question of whether 'tis better, from a longevity standpoint, to leave consumer electronics turned on all the time or to turn them off when not in use. This turns out to be one of those simple questions with a complex answer.

From an energy-saving standpoint, turning off unused equipment has an immediate payback. Although most contemporary equipment dissipates a few watts even when nominally off, simply because what was once the power switch has become just another microcontroller input bit, power dissipation in standby (or idle or sleep or whatever) mode is much less than in fully active operation.

At my rule-of-thumb rate of $2 per always-on watt per year, many cheap consumer devices cost more to run than to buy. Inexpensive plug-in power monitors, such as the P3 Kill-A-Watt, show the actual power consumption that affects your utility bill.
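For the record, here's the arithmetic behind that rule of thumb, wrapped in a few lines of C. The 23-cents-per-kilowatt-hour rate is my round number for illustration, not anybody's actual tariff, and the standby wattages are just plausible examples:

#include <stdio.h>

int main(void)
{
    const double rate_per_kwh   = 0.23;          /* assumed utility rate, $/kWh */
    const double hours_per_year = 24.0 * 365.25; /* about 8766 hours */

    /* a few plausible standby loads, in watts (illustrative values only) */
    const double standby_watts[] = { 1.0, 3.0, 8.0 };
    size_t i;

    for (i = 0; i < sizeof standby_watts / sizeof standby_watts[0]; i++) {
        double kwh_per_year  = standby_watts[i] * hours_per_year / 1000.0;
        double cost_per_year = kwh_per_year * rate_per_kwh;
        printf("%4.1f W standby -> %6.1f kWh/yr -> $%5.2f/yr\n",
               standby_watts[i], kwh_per_year, cost_per_year);
    }
    return 0;
}

One always-on watt works out to about 8.8 kWh per year, so the two-dollar figure corresponds to an electric rate in the low-twenty-cent range.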

In terms of reliability, though, the answer may be different. Almost by definition, higher power dissipation corresponds directly with higher operating temperature; just put your hand on the exhaust duct of your PC's CPU to verify that assertion.

It should come as no surprise that metals and plastics change size with changes in temperature. A material's coefficient of thermal expansion (α) is the fractional change in length per degree of temperature change, usually expressed in parts per million per degree. For example, copper has α = 17×10⁻⁶/K or, more simply, just 17. That means heating a 10.000-mm rod by 30 K (or 30 °C) would increase its length to 10.005 mm = 10 + (10 × 30 × 17×10⁻⁶).

Big deal, right?

Alas, not everything expands and contracts at the same rate. Old-style tin-lead solder has α = 17, new-style lead-free solder is around 22, aluminum is 23, glass fibers are 16-50, epoxy ranges from 15 to 100, and plastics are all over the map. Circuit boards contain all of those materials, firmly bonded together, and all undergoing the same temperature cycles.

The power transistor in Figure 1 has four mechanical connections: its aluminum heatsink and three copper leads. It's firmly mounted to the heatsink with a steel screw through a plastic bushing, with a thermally conductive plastic sheet separating it from the heatsink. The center of the screw is 21 mm above the solder joints on the other side of the circuit board.

Figure 1: The aluminum heatsink on this power transistor anchors it firmly to the circuit board and applies thermal stress to the three copper leads. The dust on the brown capacitor standing in front shows which way the wind blew in this gadget.

If the transistor normally operates at 70 °C, about 40 °C above room temperature, its copper leads and tab will expand by 21 × 40 × 17×10⁻⁶ = 0.014 mm. The aluminum heatsink, however, expands by 21 × 40 × 23×10⁻⁶ = 0.019 mm, a difference of 0.005 mm. That lengthening occurs every time the power turns on, followed by relaxation when the power goes off.

Below the transistor, the 2-mm-thick circuit board is mostly epoxy with some glass fibers. It may heat up by 30 °C and expand 0.006 mm, nearly as much as the difference between the transistor's two metals, with the net effect that the distance between the mounting screws and the solder joints varies by about 0.01 mm.
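If you'd rather let the computer do the arithmetic, a few lines of C reproduce the numbers above; the span, CTE values, and temperature swing come straight from this discussion:

#include <stdio.h>

int main(void)
{
    const double span_mm      = 21.0;   /* screw center to solder joints */
    const double cte_copper   = 17e-6;  /* fractional expansion per kelvin */
    const double cte_aluminum = 23e-6;
    const double delta_t      = 40.0;   /* kelvins above room temperature */

    double copper_growth   = span_mm * delta_t * cte_copper;
    double aluminum_growth = span_mm * delta_t * cte_aluminum;

    printf("copper leads grow       %.3f mm\n", copper_growth);    /* 0.014 */
    printf("aluminum heatsink grows %.3f mm\n", aluminum_growth);  /* 0.019 */
    printf("differential per cycle  %.3f mm\n",
           aluminum_growth - copper_growth);                       /* 0.005 */
    return 0;
}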

That doesn't sound like much, but it turns out that solder isn't particularly ductile. After years of repeated thermal cycles, the solder can crack around a copper lead, electrically isolating that lead from the circuit board. Of course, the connection becomes intermittent before it fails completely, so whacking the thing upside the head helps for a while, until at the end, "it just stops working" forever.

I talked to my buddy Eks, an ingenious fellow closing in on his 100th patent, and he mentioned he's seeing this problem in a disturbing number of fairly recent electronic gadgets. He drew the sketch in Figure 2 to show where the cracks occur. Figure 3 shows the nice-looking solder fillets (pronounced "fil-it" in the metalworking trades) around each lead of a dual diode, which Eks says are typical of the failed joints he's seen. You generally cannot identify the cracked joints by eye, even under a microscope.

Figure 2: My friend Eks sketched a power transistor's lead to show where solder cracks create a nonconductive slip joint after years of thermal cycling.

Figure 3: These power-diode leads have graceful solder fillets, but without X-ray eyes, you can't detect internal cracks just by looking. The red ink on the leads is probably an inspection mark.

Eks says the only certain repair involves resoldering all the joints on a circuit board to fix the single failure, a tedious, labor-intensive process that he's willing to perform on his pet gadgets, but is obviously out of the question for most boards and most folks.

This is not something you can cure with software, although if the specs call for frequent on-off cycles for a high-power gadget, you might ask the hardware folks if they've really considered the effect of thermal stress.

Eks and I agree that most gadgets will outlive their warranty long before their solder joints crack. However, in deeply embedded and long-lived applications, this is precisely the sort of failure that drives repair techs over the edge: Reliable gear that slowly goes crazy, then fails completely, with no obvious cause.

Bottom line: It doesn't matter unless you're worried about the power bill.

State Machines, Redux

Whenever I mention state machines, an e-mail blizzard arrives from folks who have been using them for years and wonder why the rest of you haven't gotten with the program. One such reader, Peter Wolstenholme, is a coauthor of Modeling Software with Finite State Machines and sent a copy along for my perusal.

The authors start with three "trivial truths" about software development.

In particular, they note that UML Use Cases tend to be "sunny-day scenarios" describing how the program should work when everything goes right. That approach has the unfortunate side effect of requiring explicit definitions for all possible misbehaviors, truly a thankless and basically impossible task.

Although a finite state machine must cope with the same situations, it moves from one known state to another under the direction of clear rules describing the allowable input values and conditions. Unknown situations either trigger an entry to a specific error state or perform no action at all. The machine simply cannot branch into the bushes or digress into "this can't happen" code.
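A minimal sketch in C shows the shape of the idea; the states and events here are my own invention for illustration, not anything from the book:

#include <stdio.h>

/* Control-style state machine: every (state, event) pair either names an
 * explicit next state or falls through to "no action". Nothing unspecified
 * can happen. States and events are invented for this example. */
typedef enum { IDLE, RUNNING, FAULT } state_t;
typedef enum { EV_START, EV_STOP, EV_SENSOR_FAIL } event_t;

static state_t step(state_t s, event_t ev)
{
    switch (s) {
    case IDLE:
        if (ev == EV_START) return RUNNING;
        break;                       /* anything else: stay put, do nothing */
    case RUNNING:
        if (ev == EV_STOP)        return IDLE;
        if (ev == EV_SENSOR_FAIL) return FAULT;
        break;
    case FAULT:
        break;                       /* only a reset (not shown) leaves FAULT */
    }
    return s;                        /* unknown input: no action, no surprises */
}

int main(void)
{
    state_t s = IDLE;
    s = step(s, EV_START);           /* IDLE -> RUNNING */
    s = step(s, EV_SENSOR_FAIL);     /* RUNNING -> FAULT */
    s = step(s, EV_START);           /* ignored: FAULT has no rule for it */
    printf("final state: %d\n", s);  /* prints 2 (FAULT) */
    return 0;
}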

Their guiding principle is this: "A behavior specification should solve the problem and must be comprehensible for outside persons..." They advocate using collections of small, special-purpose state machines linked together with a straightforward protocol, so that each machine does exactly what's needed and the higher level machines control lower levels without meddling in their innards. The specs for those machines are both human-readable documentation and runtime control, with the added benefit that anything not specified simply can't happen.

This may sound a lot like the usual hierarchical decomposition, but it has the advantage of using components that actually perform predictable functions. The interface handles command-and-control logic, not the usual data-passing-and-tinkering we've all fallen into.

Chapter 7, "Misunderstandings About Finite State Machines", refutes many of the reasons you use for not employing state machines. In particular, control-system state machines differ in several key ways from the string-parsing machines usually taught in comp-sci courses and, as a result, can handle many input conditions without triggering the dreaded combinatorial explosion.

Of course, nobody ever knows the specifications for a program before it's written, but state machines let you define what the program will actually do (and not do!) and change those definitions to match the mutating specs, without having code that does weird things you've never considered.

When a state machine does something unexpected, it's generally a case of an input not doing what it should. At that point, you know the machine's exact state and can match up the input with the spec to see what should be happening. I predict you'll often find the spec didn't define that situation, so the program is actually doing what it should: Nothing!

One particularly useful difference from usual practice is that input values indicate whether they're valid. For example, a digital input bit can be High, Low, or Unknown; it's not just a simple Boolean value. Although you can't always detect an error condition, having to decide what to do with an Unknown value steers your attention in the right direction.
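In C terms, that amounts to widening the input type beyond a bare Boolean, something like this sketch (the type and function names are mine, not StateWORKS vocabulary):

#include <stdio.h>

/* A digital input that carries its own validity, not just true/false. */
typedef enum { INPUT_LOW, INPUT_HIGH, INPUT_UNKNOWN } din_t;

/* The spec must say what happens for all three values, not just two. */
static const char *interlock_action(din_t door_switch)
{
    switch (door_switch) {
    case INPUT_HIGH:    return "door closed: allow motor start";
    case INPUT_LOW:     return "door open: inhibit motor";
    case INPUT_UNKNOWN: return "sensor not trusted: inhibit motor, raise alarm";
    }
    return "unreachable";
}

int main(void)
{
    printf("%s\n", interlock_action(INPUT_UNKNOWN));
    return 0;
}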

The book also describes their proprietary StateWORKS virtual machine, which collects all the tedious and generic state-machine control logic into a solid lump. The book carries a license for the entry-level version, although you need not buy anything to use the principles.

The StateWORKS virtual machine is a Posix-conformant C++ program that runs under a variety of operating systems, so the state machines don't run on bare-metal microcontrollers. Wolstenholme mentioned in an e-mail that one user wrote a Java program that ate the state-machine descriptions and spat out "intricate and unreadable C code" that he then compiled for the target system. In effect, StateWORKS became the system definition and verification model, with the C code implementing the machines "as designed."

Bottom line: This one is worth reading even if you think your current methods are working fine, because it'll shake your confidence.

Finding (the) Fault

The real problem with software is that everything is deeply "intertwingled," so an error's symptoms rarely identify it. Even after you demonstrate a failure, you must search the entire heap of source code for the error. Worse, firing up a debugger or activating the compiler's tracing hooks sometimes makes the error Go Away without curing it.

My electronics workbench (the real one with solder splashes and scorch marks) has several PC-based test instruments, so I follow mailing lists to keep up with new software and firmware. I generally wait for a few weeks after a new release, so that more adventurous folks than I can find Things That Go Wrong. Unfortunately, it's a habit that regularly pays off.

One instrument's recent beta triggered a gnarly problem. The developer (who, understandably, wishes to remain anonymous) could reproduce the failure, except when the code was compiled for debugging. Worse, the instrument's source hadn't changed, apart from the trivial matter of converting from .NET 1.1 and Visual Studio 2003 to .NET 2.0 and VS2005.

The problem turned out to be an uninitialized variable used by code that, evidently, worked fine in the .NET 1.1 infrastructure and failed in 2.0. The debugger was useless because .NET's threading model has undergone drastic revision and the new debugger doesn't work with old code and the old code works fine. Got that?

It's easy to say you should never leave an uninitialized variable lying around and that proper source control/analysis/testing/verification would catch this. It's much harder to actually make that happen in real life.
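The trap looks deceptively simple in a contrived C fragment like this one; whether it "works" depends entirely on what happens to be in that memory, which is exactly why it can hide for years:

#include <stdio.h>

/* BUG on purpose: 'sum' is never initialized, so the result depends on
 * whatever happens to be on the stack -- which may be zero for years. */
static double average(const double *samples, int count)
{
    double sum;                       /* should be: double sum = 0.0; */
    int i;

    for (i = 0; i < count; i++)
        sum += samples[i];
    return sum / count;
}

int main(void)
{
    double v[] = { 1.0, 2.0, 3.0 };
    printf("average = %f\n", average(v, 3));  /* 2.0 only if the stack cooperates */
    return 0;
}

A decent compiler will usually warn about a case this obvious if you turn the warnings on; the real-world versions hide behind pointers, threads, and library boundaries where static analysis can't follow.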

An uninitialized variable starts life with the wrong value, but variables in embedded systems can have other problems throughout their lifetime. Indeed, it seems that a straightforward programming error killed the Mars Global Surveyor orbiter. Bob Paddock sent a pointer to the initial report by NASA's John McNamee:

We think that the failure was due to a software load we sent up in June of last year. This software tried to synch up two flight processors. Two addresses were incorrect—two memory addresses were overwritten. As the geometry evolved, we drove the [solar] arrays against a hard stop and the spacecraft went into safe mode. The radiator for the battery pointed at the sun, the temperature went up, and battery failed. But this should be treated as preliminary.

Don't you hate it when that happens?

Several readers with personal experience working with NASA employees tell me that, as I expected, the old-school NASA can-do spirit is still alive in the trenches, despite decades of mismanagement. They also suggest that, in their experience, contractors and aerospace-company employees aren't quite so dedicated to the cause.

I explore NASA's failures because they have excellent documentation, not to castigate them. If you know how to find similar reports on other projects, I'll be more than happy to put them to good use!

Last Tab

Given the tonnage of spam, it's almost certain that organizations upstream of my inbox have deleted worthwhile messages. If I don't respond to your note, it's because I didn't get it; try again with different wording.

Resources

Spam numbers as of late last year at www.postini.com/news_events/pr/pr110606.php.

More on the Taiwan earthquake and cables at en.wikipedia.org/wiki/2006_Hengchun_earthquake.

Some details on the Kill-A-Watt power meter: www.p3international.com/products/special/P4400/P4400-CE.html.

NIST tabulation of solder properties: www.boulder.nist.gov/div853/lead%20free/solders.html.

Thermal cycling and chip solder joints: www.imec.be/IMECAT/documents/08_2004_Eurosime_Vandevelde_paper.pdf

Most elements and some compounds appear on Wikipedia: en.wikipedia.org/wiki/Copper.

More on StateWORKS at www.stateworks.com and Modeling Software with Finite State Machines by R. Wagner, R. Schmuki, T. Wagner, and P. Wolstenholme, Auerbach Publications, 2006.

Look up "intertwingled" in The Jargon File at catb.org/~esr/jargon/html/index.html.

That quote on the MGS failure comes from www.spaceref.com/news/viewnews.html?id=1185. I want to see what the review board comes up with, as it's likely to be relevant to our earthbound code, too.

Bob Paddock's Software Safety Site at www.softwaresafety.net.
