Reality Checks
In the May column I asked if boot-time software diagnostics had ever flushed out a hardware problem. I felt that failing hardware would also wipe out the tests, leaving an inert lump.
Paul Bristow reports that he "wrote the code for a real-time system that also executed the power-on test as its idle task to keep the processor doing something useful for...98 percent of its time." The test checked the CPU, exercised the RAM nondestructively, and verified the ROM checksum, over and over and over again. Months passed with no errors.
"But then one day, on one system," he says, "I got an error message: checksum wrong on the ROM. Two weeks later, [a] similar message. And over the next few weeks, messages steadily got more frequent, but the equipment still worked perfectly, controlling the hardware and talking to the operators, and doing the test.
"Examination of the board revealed the cause of the problem: it was an erasable ROM containing the running code, and there was no sticker to stop the light starting to erase it. So the stored ROM code was becoming iffy.
"So the ROM was intermittently producing a wrong byte or two, but never with high enough frequency to show up in the running code!"
If I'd never occasionally forgotten to put a sticker on my EPROMs, I'd be in a position to throw stones, but, well, BTDT.
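For readers who haven't written such a thing, here is a minimal sketch of what one pass of an idle-task self test along the lines Paul describes might look like. It is not his code: the function names, the simple 16-bit additive checksum, and the 0x55/0xAA test patterns are all my assumptions, and the CPU check is left out entirely. A real system would also have to lock out interrupts around each RAM cell test.

```c
#include <stddef.h>
#include <stdint.h>

/* Failure flags returned by idle_self_test() (hypothetical names). */
enum { FAIL_ROM = 1, FAIL_RAM = 2 };

/* Assumed checksum: sum of all ROM bytes modulo 2^16.
 * Production code often uses a CRC instead. */
uint16_t rom_checksum(const uint8_t *rom, size_t len)
{
    uint16_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum = (uint16_t)(sum + rom[i]);
    }
    return sum;
}

/* Nondestructive test of one RAM cell: save the contents, write and
 * read back two complementary patterns, then restore the contents.
 * Returns 1 if the cell held both patterns, 0 otherwise.
 * Assumption: the caller keeps interrupts away from this cell. */
int ram_cell_ok(volatile uint8_t *cell)
{
    uint8_t saved = *cell;
    int ok = 1;

    *cell = 0x55;
    if (*cell != 0x55) ok = 0;
    *cell = 0xAA;
    if (*cell != 0xAA) ok = 0;

    *cell = saved;
    return ok;
}

/* One pass of the self test; the idle task would call this forever.
 * Returns 0 if everything passed, else a mask of FAIL_* flags. */
unsigned idle_self_test(const uint8_t *rom, size_t rom_len,
                        uint16_t expected_sum,
                        volatile uint8_t *ram, size_t ram_len)
{
    unsigned failures = 0;

    if (rom_checksum(rom, rom_len) != expected_sum) {
        failures |= FAIL_ROM;
    }
    for (size_t i = 0; i < ram_len; i++) {
        if (!ram_cell_ok(&ram[i])) {
            failures |= FAIL_RAM;
        }
    }
    return failures;
}
```

Because the loop rereads every ROM byte on every pass, it samples each location far more often than the running code ever does, which is presumably why it caught the fading EPROM long before the application stumbled.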
Chris Sawran, while working on an 8051 USB Mass Storage gadget, discovered "that at seemingly 'random'...times, especially during a data file DMA transfer, the DMA would stop."
He "eventually found that under some obscure situations, the GPIO direction and/or data registers were getting cleared..." and "After many tests, and many more explanations, I finally managed to capture the bug with a small test case.
"When the CPU was pushing arguments onto the stack, if the argument being pushed to a data-space address...happened to have the same SFR address [as] a specific set of MCU registers, then a glitch in the hardware clock trees assigned the value on the data lines to both the RAM address and the SFR register simultaneously."
After the hardware guys took a look at the design, they discovered, much to their horror, that "the defect was shown to also exist in the previous generation chip, which was selling and now 'out in the field'."
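The failure Chris describes is easier to picture with a toy model. On 8052-style parts, the upper 128 bytes of internal RAM and the SFRs share the address numbers 0x80 through 0xFF: indirect accesses, including stack pushes, reach RAM, while direct accesses reach the SFRs. The sketch below models a correct push alongside a push with the defect. The structure, the function names, and the single affected_sfr parameter are my inventions for illustration; the actual chip and its affected registers aren't identified in Chris's note.

```c
#include <stdint.h>

/* Toy model of 8052-style internal memory (hypothetical structure). */
typedef struct {
    uint8_t iram[256];  /* internal RAM; 0x80..0xFF reached indirectly */
    uint8_t sfr[128];   /* SFRs at direct addresses 0x80..0xFF */
    uint8_t sp;         /* stack pointer */
} Mcu8051;

/* Correct PUSH: increment SP, then store the value in RAM at SP. */
void push_correct(Mcu8051 *m, uint8_t value)
{
    m->sp++;
    m->iram[m->sp] = value;
}

/* PUSH with the modeled defect: when the stack address matches the
 * address of an affected SFR, the value lands in the SFR as well.
 * affected_sfr is an assumed address in the range 0x80..0xFF. */
void push_glitchy(Mcu8051 *m, uint8_t value, uint8_t affected_sfr)
{
    m->sp++;
    m->iram[m->sp] = value;
    if (affected_sfr >= 0x80 && m->sp == affected_sfr) {
        m->sfr[affected_sfr - 0x80] = value;
    }
}
```

In this model, pushing a zero argument while SP sits one below an affected address silently clears that register, which would match the cleared GPIO direction and data registers Chris saw. It also suggests why the bug looked random: it depends on stack depth at the instant of the push, not on anything the code does on purpose.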
Pop Quiz: What would you do in a situation like that? Analyze the code to determine whether the error could occur in real life? Recall the product? Issue a service notice? Bury your findings and hope nobody notices?
Extra credit: Estimate the expected liability cost of each course of action. Assume all documents you produce as part of the estimation process will be used as evidence against you.
Chris observes "it's tough to really trust any of your code again after that..."
Truer words were never spoken!