Simulator Sickness


Oct02: Embedded Space

Ed is an EE, PE, and author in Poughkeepsie, New York. You can contact him at [email protected].


Evolution selects your descendants for a better match with your environment. It can't predict the future and never makes a preemptive tweak to get ahead of the game, which is why we have a real problem with virtual reality. Perhaps our ancestors were never in situations where their eyes, ears, and body reported different motions, or perhaps those situations simply weren't survivable.

Military designers got there first with flight simulators presenting the visual part of flying, only to discover their pilots had problems doing aerobatics while sitting firmly on the ground. It took decades to get the mechanical part of simulation up to speed and, even today, simulators don't cope well with continuous barrel rolls and end-over-end tumbles.

Your brain can resolve minor disagreements between your eyes and the balance organs in your ears, to which anyone with a kid on a merry-go-round can attest. When the conflicting stimuli reach a certain level, however, the inevitable reaction (and the literally dirty secret of the space program) is violent nausea. This is much easier to handle on the ground, where it's called "simulator sickness," than in free fall where vomitus gets into everything.

You've probably gotten that queasy feeling when a problem arose in a thoroughly tested and debugged system. It's worse when you know you've already applied the latest and greatest simulation tools. Might a field call concerning one of your products turn your stomach completely over?

What we'll see here may not be as dramatic as puking into life support filters, but it can have far worse consequences. Welcome to the embedded systems version of simulator sickness, where what you simulate may not be what you actually deliver.

The Hard Line

We got into this mess one step at a time, where each step made perfect sense. Early logic chips held only a few gates in human-scale packages that could be probed with existing technology and fingers. As IC design rules and packages shrank, test equipment manufacturers responded by shrinking their probes.

Although IC layouts shrink reasonably well, a fact that Moore's Law has both demonstrated and driven, circuit board designers are struggling with the consequences. We're now faced with two nasty problems—access to test points and RF design constraints.

Figure 1 shows the state of the art in logic analyzer connections, where the slab-sided connector at the top extracts 2 bytes from the pad layout below. A standard oscilloscope probe and an ordinary pencil point provide a sense of scale; the silver dots between them are the contact pads. Fairly obviously, if you haven't designed test points into the circuit, you're never going to get signals out through a hand-held probe.

The RF-design problem arises from a collision of clock frequencies and board dimensions: Beyond a certain frequency, a wire becomes an antenna and the old layout rules stop working. Contemporary PC signals have hit that barrier and complex gadgets like optical routers are well beyond it. When a centimeter or two of wire radiates the signal into space, you can't simply poke a probe onto a pad and hope to see anything useful.
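
As a back-of-the-envelope check (my numbers, not the author's): a 1-GHz signal traveling over FR-4, with an effective dielectric constant of roughly 4, has a wavelength of about 30 cm divided by 2, or 15 cm. The common rule of thumb says a trace stops acting like a simple wire and starts acting like a transmission line, and a potential antenna, once its length passes about a tenth of a wavelength. That works out to roughly 1.5 cm here, which is exactly the centimeter or two of wire in question.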

Early microprocessors were simple enough that third-party companies could, with perhaps a boost from the CPU designers, build in-circuit emulators that replaced the entire CPU chip with external logic. Fat packs of ribbon cable connected the CPU socket on the target system to the ICE box, where a customized user interface displayed what would have gone on inside a stock CPU.

As clock frequencies increased, packages shrank, and more complex peripherals appeared on the CPU chip, ICE design became increasingly difficult. Eventually, ball-grid-array chip connections made it essentially impossible to connect an ICE to the target system. Although they're available for some CPUs, ICE design seems to be falling behind the thin edge of the systems-design wedge.

Somewhat in parallel with ICE development, some microprocessor families had "bondout parts" that provided access to internal signals of interest through additional package pins. You could clamp a logic analyzer to those pins and watch the internal buses dance, or run the signals to a full-throttle ICE for complete system control.

Increasing clock frequencies, shrinking chip packages, and the economics of producing a custom chip for a very small market doomed bondout CPUs. The proliferation of embedded CPU variants, each requiring a separate bondout part, was the final straw.

What made increasing sense was building debugging support directly into the chip and making it available to the folks who used to design ICE hardware. After all, with all those gates available, why not devote some to making life easier for software developers?

The Motorola Background Debugging Mode (BDM) interface began with simple start-stop breakpoints, register and memory access, and program control through a small auxiliary port. A debugger could use the BDM hardware to build a perfectly reasonable version of a classic software debugger, which a number of companies now offer with a magic pod that handles the hardware hocus-pocus.

Because many embedded applications cannot stop while you scratch your head and ponder the screen, BDM evolved to include full-speed tracing. The latest Motorola ColdFire embedded processors squirt status and tracing information off the chip through a dedicated set of pins, from which an external ICE-oid pod can reconstruct control flow and datapath values on the fly. Although some CPU intervention is required, an embedded process that can't tolerate a few missing cycles here and there is on the way to destruction anyway.

To show just how desperate the situation has become, it seems ColdFire CPU cores devote nearly a quarter of their real estate to BDM debugging and tracing hardware. That's almost as amazing as the New York Department of Transportation suddenly striping a quarter of its lanes for bicycles! It seems that Moore's Law has finally produced some tangible benefits for folks who write embedded code.

Wouldn't it be better if we could write code good enough that a quarter of the hardware needn't be dedicated to a last-ditch defense against software errors? Finding design errors in the final code is both staggeringly difficult and incredibly expensive. That wonderful "I found it!" feeling (you know what I'm talking about, right?) fades when you ponder the costs of finding, fixing, verifying, and distributing the new version.

And the Soft

Although you'd like to believe that software has evolved at the same frenetic pace as the hardware, that's not quite so. In fact, the debuggers in use today are lineal descendants of DEC's mid-1960s ur-DDT (the "T" stood for "tape"), albeit with glossy GUI front ends.

The central problem comes from our inability to think faster: We simply cannot comprehend software errors presented at a billion instructions per second. We must halt the CPU, display its state, look at memory, perhaps tweak a few values, and then continue onward. That's somewhat unfair, but if stop-drop-and-roll debugging sounds familiar, well, everybody's done it at one point or another.
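
To make that concrete, here is a sketch of one pass through that cycle as it might look from a GDB front end driving a debug pod over a serial link; the device name and the symbols (motor_control, pwm_duty) are illustrative assumptions, not any particular vendor's interface:

# attach to the target through the debug pod
(gdb) target remote /dev/ttyS0
# halt the CPU at the routine of interest
(gdb) break motor_control
(gdb) continue
# display its state, look at memory, perhaps tweak a few values...
(gdb) info registers
(gdb) x/8xw &pwm_duty
(gdb) set variable pwm_duty = 0
# ...and then continue onward
(gdb) continue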

As an example, TASKING (their choice of caps) produces a debugger for multiple-CPU, single-chip systems that's arguably the state of the art for such things. It handles multiple CPU cores with a layer of code between the hardware and the debugger that presents a single CPU's state to the debugger and switches among cores as needed. How well this works in practice I cannot say, but it's immensely better than a few blinking LEDs and scope traces, which was as good as it got just a few decades ago.

Rather than thinking faster, which seems impossible, we're now trying to think earlier with better modeling tools and simulations of entire systems. By pretending that we know how the system will work, we can actually find quite a few errors before they're hidden inside the hardware or buried in layers of code.

Essentially by definition, simulators exclude parts of the real world in order to reduce the remaining problem to a manageable size. Digital logic simulators omit voltage levels vital to analog design. Analog simulators don't worry about conductor dimensions that are vital to RF design. No simulator deals with the colors that matter to marketing folks coming up with a nice-looking translucent package.

Deciding what to simulate when you verify your design is a vital part of the process but, often, the easy-to-simulate parts aren't the most critical ones. Hold on tight as we examine two data points.

Ooops Upside the Head

The benefits of software-design review meetings cannot be overstated: There's nothing like trying to explain what you're doing in front of an interested bunch of people to bring out the worst parts of your design. That reviews happen more-or-less publicly, while we're trained to think more-or-less privately, surely has something to do with why many people regard design reviews with the same enthusiasm as root-canal work.

A Usually Reliable Source participated in a review of some OS-level software that runs on heterogeneous groups of multiCPU machines. A key point of the design required locking a globally known file to ensure atomic access. The presenter described how this was accomplished from his C program using the appropriate OS resources.

This is a well-known problem with well-known solutions and the proposed code seemed perfectly fine. After all, these folks have been doing multiCPU stuff long enough to get it right.
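
For a flavor of what was on the screen, here is a minimal sketch of one well-known solution, a POSIX advisory write lock taken through fcntl(); the path, function name, and error handling are my illustrative assumptions, not the code that was actually presented:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Take an exclusive advisory lock on a globally known file.
 * Returns the open descriptor, which the caller keeps until the
 * atomic work is done, then unlocks (F_UNLCK) and closes.
 */
int lock_global_file(void)
{
    int fd = open("/var/lock/global.resource", O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    struct flock lk = { 0 };
    lk.l_type   = F_WRLCK;   /* exclusive write lock       */
    lk.l_whence = SEEK_SET;
    lk.l_start  = 0;
    lk.l_len    = 0;         /* zero length = whole file   */

    if (fcntl(fd, F_SETLKW, &lk) < 0) {  /* block until the lock is ours */
        perror("fcntl");
        close(fd);
        return -1;
    }

    return fd;
}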

Then my URS asked whether the file would be protected against accesses from processes running as shell scripts on the same machine (but perhaps on a different CPU), from processes running under a different OS on another node, or with various other combinations of language, OS, and location.

The silence was both protracted and glacial as everyone did a mental dash through the obvious solutions. It seems that file locking depends heavily on the functions of a particular OS and isn't, at least in its current incarnation, globally enforceable.

Ooops.

At least the review caught that one, even if it's still an outstanding problem with no obvious solution. Conversely, a recent Japanese experience shows how a problem with a trivial solution can elude rigorous scrutiny until it kills a project stone-cold dead.

The Japanese H2 booster series had five consecutive successful launches followed by two embarrassing failures, at which point Japan's National Space Development Agency (NASDA) initiated the H2-A series. The new boosters were intended to reduce launch costs by 50 percent and improve reliability. As you might expect, they were painstakingly designed and carefully simulated.

The first few flights of a new booster generally carry payloads that, if the booster fails, won't break the bank. Japan's Institute for Space and Astronautical Science proposed a $4.5 million Demonstrator of Atmospheric Reentry System with Hyper Velocity device (DASH) for H2-A's second test flight. DASH would make a few orbits with an apogee near the geosynchronous belt before reentering the atmosphere at 10 kilometers per second over the Sahara to study the effects of high thermal loading. It was a relatively simple device, at least in rocket-science terms, and was extensively tested before launch.

Although the H2-A booster again worked perfectly, DASH didn't separate from the second stage and could not complete its mission. After marching through the paper trail, ISAS discovered that a draftsman had swapped a few pins in a DASH-to-booster connector while drawing up the construction blueprints.

Ooops.

The ISAS mission summary puts it this way: "Such a simple mistake should have been easily found through various checkout procedures before launch, but in the case of DASH, this flight connector was not used throughout the ground checkouts. [...] enough function tests of DPU could not be executed through this flight connector [...] another special cable and connector had been used for the ground checkouts."

There's enough blame to go around.

Better design methods can reduce the number of errors built into the code, and simulation can find the ones hidden within complex internal hardware. After that, on-chip debugging aids can help isolate errors that occur only in the final product under real conditions.

Avoiding simulator sickness requires more than just verifying the easy stuff, though. You must also look at what didn't go through the fancy design methods, can't be simulated, and ought to be trivially easy, then verify that those pieces match up with the simulator's assumptions.

It's your belly on the line.

Reentry Checklist

You can find Tektronix test equipment at http://www.tektronix.com/ and ColdFire BDM tracing doc at http://e-www.motorola.com/brdata/PDFDB/docs/MCF5XXXDBWP.pdf.

Altium-TASKING's view of the debugging problem is at http://www.tasking.com/technology/sdte-debuggers.html.

The DASH report is at http://www.isas.ac.jp/dtc/dash-e/dash-e.html, with more at http://www.satnewsasia.com/Stories/764.html and http://spaceflightnow.com/h2a/f2/status.html. Your browser will fetch a PDF telling more than you want to know about the H2-A booster from http://www.nasda.go.jp/This/This-e/h2a_e.html.

It's October and time for those of us in Daylight Saving Time areas to fall back one hour. Reader Matt Seitz points out that it's not Savings, as detailed at http://webexhibits.org/daylightsaving/b.html.

Speaking of time zones, far-traveling reader Colin Sheperton once asked a Nepalese border guard why their time zone was off by 15 minutes and received this summary of world affairs: "Nepal is not India."

DDJ

