Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Embedded Space


Mar01: Embedded Space

SEU Meets Embedded Linux

Ed is an EE, PE, and author in Poughkeepsie, New York. You can contact him at [email protected].


Our story begins 10 megayears and 8500 parsecs away, as an old star enters the accretion disk around the Eater of All Mass at the core of our Milky Way galaxy. Shredded by the black hole's gravitational force, the 3.5 gigakelvin stellar core ejects a gout of plasma, rich in iron nuclei, along the magnetic lines of force twining the disk.

The plasma spews from the core nearly at right angles to the accretion disk. The Eater eventually captures most of the ions, but a thin stream, perhaps only a few megatons, squirts away from the maelstrom at just over escape velocity.

Away from the galactic core, the ions repel each other and spread the plasma into what we'd call a "hard vacuum." Although galactic magnetic and electric fields are almost undetectably faint, they have eons to work their will. Over the next six or seven million years, the iron nuclei reach relativistic speeds as they loop and swirl above the galaxy.

Meanwhile, back on Earth, we're in the late Miocene epoch. North America is bulldozing the Pacific plate at 1.25 cm/year, with Silicon Valley one mile down and rising. Dog prototypes are at Release 0.5, horses at 0.8, penguins at 0.9 (odd tenths=unstable version), but humans remain in design review. A "Personal Digital Assistant" is the claw at the tip of a paw.

You may fill in the ensuing three megayears as you prefer. Let us fast-forward to the near future...

SEU Minus 400 Microseconds

An iron nucleus accelerated to 0.9999999997c has an Einsteinian mass of 0.2 ng. That may not sound like much, but its energy of 1020 eV or 16 joules is enough to run a nightlight bulb for maybe two seconds.

One such nucleus proceeds along a straight line, essentially unaffected by the solar system's magnetic and gravitational fields, until encountering a helium atom 80 kilometers above Earth. The two nuclei merge, then shatter into a sleet of highly energetic nuclei and Greek letters, aligned in a tight cone along the original vector.

On Earth, you're nursing a hot cuppa on the patio of your favorite caffeine-with-bits shop. Surrounded by a dense fog of Bluetooth RF chatter, you're tapping out a buy order for some particularly ravaged dot-com stock (with staggering upside potential, indeed) on your shiny-new, Penguin-Powered PDA.

You've just adjusted the number of shares to 4096 (you are a geek, right?), verified the price (just under a buck a share), and tapped the Confirm icon. As far as you're concerned, the transaction is complete.

However, if you were to look up (with your subatomic sensors), you'd spot a white-hot blaze, boresighted on your location. Those charged Greek letters, encountering increasingly dense air, shatter more atoms and liberate every subatomic particle known to man. None of these fragments will reach Ground Zero, however: The particles form what's called a "cosmic ray air shower."

Within your PDA, the UI passes your updated number of shares from the screen to your stock-ticker-and-broker Internet app, which stores it in RAM. Although Bluetooth signals travel at the speed of light, the PDA still takes time to create a message, negotiate a connection, get airtime, transmit, and eventually receive a response.

SEU Minus 5 Nanoseconds

Two meters from your PDA, the sixth generation of collision fragments in the air shower continues at nearly the speed of light. A relativistic helium nucleus slams into a stationary oxygen atom, and a 1 GeV neutron emerges from the ensuing debris.

It punches neatly though the PDA's case, the metallic EMI shield, the epoxy body of a DRAM chip, and scores a direct hit on a silicon atom in the memory array. There isn't much unused silicon surface, so the neutron has probably clobbered something vital.

The PDA's RF power amp emits the first signal required to initiate a transaction. The nearest transceiver with a hardwired Internet link is 10 meters, roughly 30 nanoseconds, away. The neutron does its damage before the signal reaches the receiver.

Single Event Upset!

A strange brew of nuclear fragments emerges from the shattered silicon nucleus. Most of the particles carry an electric charge, many decay into electrons (or, shudder, positrons), and others collide with silicon nuclei to produce even more particles. Unlike the kilometers-long cosmic-ray air shower, this chaos occurs within a tight cone several microns long, oriented along the original neutron's velocity.

Moore's Law has modeled (and, to some extent, dictated) the shrinkage of LSI dimensions for nearly four decades. At the start of the integrated circuit era, IC transistors and interconnects were visible to the naked eye. By now, however, individual transistors are approaching the size where quantum effects determine their behavior.

In particular, the number of electrons stored in a single DRAM cell has fallen dramatically. The storage capacitors in those old 16-Kb memory chips designed in the 1980s held about a million electrons each. Now, with memory chips approaching 1 Gb, the number of electrons is getting downright countable.

You can work it out: Q=C×V, charge equals capacitance times voltage. Storage cells have about 30 fF of capacitance and little more than 1 V across the plates. There being 6.24×1018 electrons per coulomb of charge — call it 200,000 electrons.

The current flowing between logic gates has also decreased dramatically. Recall that 1 ampere of current is 1 coulomb/second, so a current of 1 mA involves about 1013 electrons.

However, currents in integrated circuits don't flow uniformly for a whole second, so you must factor time into the equation. Over the course of a nanosecond, you're watching a parade of only 10,000 electrons. At that level, a few thousand electrons dropped from space is no small thing!

The beam of charged particles sears downward, adjacent to the storage volume of a DRAM cell. As fate would have it, that cell's capacitor holds no net charge. Tens of thousands of electrons distribute themselves uniformly across one conducting plate, charging the cell enough to just barely flip its state. The remainder of the particles dissipate rapidly in the bulk silicon.

SEU Plus 1 Millisecond

A Single Event Upset (SEU), also known as a "soft error," is a transient fault in the normal operation of a circuit. The hardware operates normally after an SEU, but it may produce incorrect results as the value propagates onward. The next time you use the circuit, you'll get the right answer. This tends to make error diagnosis somewhat difficult; by definition, SEUs are not repeatable.

Conversely, a hard error causes a permanent, solid, easily verified fault in a particular circuit. Every time you present the same inputs, the same incorrect output will ensue.

It's worth noting the difference between a fault, a failure, and an error. In the hardware domain, a fault occurs when a particular circuit doesn't deliver the expected output. A failure, on the other hand, results from actually using the incorrect output. An error occurs when the system behaves incorrectly. (But, frankly, the terminology is both confusing and misused. Let me know how you define the terms; I'll report any consensus.)

For example, changing a DRAM cell just before writing a new value into the cell is a fault, not a failure. A change in the output of a circuit that's unused during a particular computation may occur with complete impunity. Back in the days before protected-mode (PM) operating systems, you could destroy big hunks of PM circuitry on a 386 CPU and never notice the difference. Contrary to what your parents taught you, this is a case where "If nobody sees me, I can get away with it."

Even when a failure occurs, the system may operate correctly. Applying an Error Correcting Code (ECC) to a memory bank can deliver corrected data to the CPU in spite of memory faults.

Many network servers depend on ECC-equipped memory to reduce the effect of transient memory errors, whatever their cause. Standard ECC coding, which uses 8 bits to protect 8 bytes, can correct all single-bit failures and detect all double-bit failures. If more than 2 bits fail, however, error detection effectiveness goes downhill.

Desktop and PDA systems, as a rule, do not include ECC for one simple reason — product cost. The savage economics of consumer electronics dictate that the selling price is all important; increasing that price, even to reduce the probability of an after-sale glitch, just won't happen.

Come on, who would believe a subatomic particle from outer space could clobber a PC or PDA?

So the SEU in your stricken DRAM cell produces a failure as the CPU fetches the variable containing that bit. Combined with other data and massaged by your stock-ticker-and-broker app, it becomes an error in part of a message destined for the Internet.

SEU Plus 100 Milliseconds

My HP48GX calculator manual details the steps required to recover from what HP euphemistically calls a "freeze," wherein the calculator stops responding to the keyboard. You turn the thing over, pry off the upper-right rubber foot, and poke a paper clip in the hole thus exposed. That closes the master reset switch inside, gives the CPU a mighty whack upside the head, and sets things aright.

UNIX in general — and Linux in particular — benefits from the perception that it rarely needs the PC equivalent of a poke with a paper clip. You'll find Linux web servers running for months at a time with no resets or reboots; that's the expected outcome, not something to be particularly proud of.

While such reliability may be true of the Linux kernel, it certainly isn't true of some apps. One reason servers can remain up for long periods of time is that they run specific programs with well-controlled inputs, programs that have survived relentless debugging. On the desktop, alas, anything goes: Maybe the kernel stays up, but the apps can crash pretty freely.

However, a PDA more closely resembles a server than a desktop box: It runs a well-defined set of programs with reasonably well-defined user input, using core routines burned in ROM. You generally don't dump quite so much, hmmmm, stuff in a PDA, nor do you stir it around quite so violently.

PDA users also have pretty much the same reliability expectations as server users — it should never crash.

So we've pretty much set the stage for disaster. You have the least expensive hardware that will work, no error detection or correction, circuits that are either at or fast approaching the lower limit of SEU resistance, and an inherent belief in bulletproof software.

On the plus side, your online stockbroker, knowing full well that security is vital, insists that you use a cryptographic system to sign outgoing messages with your digital signature. The server can verify that a message actually did come from your hands, because no other person knows your secret key.

Your ticker-and-broker app wraps your signature around the buy order and ships it off to the server. All along the way, from the RF to the Internet and back, strong crypto protects your message from both snooping and alteration.

It makes no difference because your data has been altered at the source, where it's presumably under your direct control and supervision.

SEU Plus 15 Seconds

You stare in disbelief at the PDA's little screen as hot coffee rises into your sinuses. It seems your stockbroker's server has just executed and confirmed an order on your behalf to purchase 266,240 shares at, yes, just under a buck a share. Ever the geek, you instantly realize two things: Somehow, somewhere, something added 218 shares to your order and, worse, you have three days to come up with a quarter million bucks. There's no chance of denying this one, because your broker can show an unbroken chain of traceability right back to your PDA.

Welcome to the wonderful world of ubiquitous, embedded computing!

Is It Really Embedded?

One may argue that a PDA isn't really an embedded system or that Linux doesn't have a ghost of a chance against, say, PalmOS. That's missing the point. Up-and-coming PDAs will be the first conspicuous collision of SEU-susceptible hardware with the public's absolutely unshakable conviction that "Nothing Can Go Wnorg." In the desktop world, we've become accustomed to occasional odd behavior from our PCs, even if the number of transient faults due to, say, cosmic rays remains less than the glitches due to various software foibles. Raise your hand if you've never restored your PC from a backup (raise both hands if you wished you had a backup). Ever wonder why your PC froze without warning?

Even if you discount death-from-space scenarios, we must think harder about reliability for systems that people will depend on for more than just casual surfing. The economic forces that propel Moore's Law bear down equally hard on the embedded systems that control nearly everything around you.

It may be okay if your fancy home control system occasionally blinks the lights at 3:00am. It's certainly not acceptable for your refrigerator to defrost all day, just because 1 bit in its countdown timer mysteriously flipped. You can easily fill in other, far more critical, examples...

Embedded software can be written with essentially zero bugs. Embedded hardware can be built with essentially perfect fault tolerance. Your challenges are now to understand how it's done, then make it happen with a less-than-infinite budget.

Or maybe you like caffeine in your sinuses?

Reentry Checklist

You'll find a good introduction to cosmic ray SEUs distributed across the Web under search terms "single event upset," "+cosmic+DRAM," and "soft (fail* | error)." IBM has several interesting papers on the subject that you can track down by searching http://www.ibm.com/.

You should read "IBM Experiments in Soft Fails in Computer Electronics (1978-1994)" by James F. Ziegler, et. al, IBM Journal of Research and Development, V40N1 (http://www,research.ibm.com/journal/rd/ziegl/zieglert.html). Although six-years-old by now, it gives a chilling summary of just how vulnerable our solid-state stuff really is.

Software reliability is only one part of the puzzle, but you can't go wrong with McGuire's Writing Solid Code (Microsoft Press, 1993) and Steve McConnell's Code Complete (Microsoft Press, 1993). Only after you realize that errors really matter can you begin worrying about external influences.

And, hey, it's not called Embedded Space for nothing!

DDJ


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.