It's A Heisenbug!
It's finally time to wrap up the discussion I started in December about what seemed at first to be a mysterious compiler bug. Last week, I described the bug's characteristics that I had carefully isolated: If a program seeked to an odd offset in a file, then wrote data, then each character thus written had a small but nonzero probability of changing to an ASCII zero character.
- The Role of the WAN in Your Hybrid Cloud
- Stop Malware, Stop Breaches? How to Add Values Through Malware Analysis
I could reproduce this problem reliably, so I thought it would be easy for the operating-system group to find and fix. Nevertheless, the first time they tried it, they could not reproduce the problem. After some discussion, we realized the reason: When they tried it, they did so on a machine with no other users. When I encountered the problem, there were most definitely other users on the machine.
I should mention at this point that the problem occurred well before the first personal computers. The machine in question was probably about the size of an upright piano, and served a department of 60 or so programmers. As a result, the operating-system group had exclusive use of the machine for two hours every evening, which allowed them to test prospective new versions of the operating system without inconveniencing anyone else if it should happen to crash. Sure enough, when the operating-system group put a synthetic load on the machine, they were able to reproduce the problem. That was quite a relief!
Once the operating-system folks were able to reproduce the problem, it was clear where the faulty code had to be. The problem occurred only on writes starting at odd offsets. As it happened, there was one place in the operating system that executed different code depending on whether offsets were even or odd.
The reason for this even-odd choice was that the computer had 16-bit words. In other words, each word contained two 8-bit characters. This machine characteristic, in turn, influenced a very frequently executed piece of assembly-language code that copied data from a user's address space into a system I/O buffer.
The computer in question had two different instructions to move data from one part of memory to another:
move word and
move byte. As an optimization, the operating system's data-movement routine checked whether the source and destination both started at even or odd addresses. If they did so, the operating system could move the data more quickly by executing a series of
move word instructions, with the data-movement loop preceded or followed, as needed, by a single
move byte instruction to handle extra characters at the beginning or end of the data.
The tricky part came when the source and destination were differently aligned. The obvious way to handle that case would have been to execute a series of
move byte instructions, but the operating-system group thought they knew better. What they did instead was to save a single character in a register, and then:
- Copy a word (two bytes) onto the stack.
- Shift the register and pop one byte off the stack into the other half of the register.
- Store the register into memory.
- Pop the other byte off the stack into the register.
In other words, each trip through the loop copied two characters, which straddled a word boundary on the destination side but not on the source side. Of course, the code had to handle the extra characters, if any, at the beginning and end of the loop; but that was not particularly difficult.
What turned out to be the source of the problem was when the code tried to "pop one byte off the stack." On this particular machine, that operation was impossible: The stack could only be pushed or popped one word at a time. So what the code actually did was to copy one byte, change the stack pointer by one word, and then copy the other byte.
Note that I just described three separate operations:
- Copy a byte from the stack.
- Change the stack pointer by one word.
- Copy a byte from the stack.
These operations seemed innocent enough on the surface. However, between (2) and (3), one of these data bytes was in memory that was no longer part of the stack, because the stack pointer had been changed.
And indeed, that was the source of the problem: If an I/O interrupt happened to occur between instructions (2) and (3), the machine's state would be pushed onto the stack, where it would overwrite the single byte waiting to be transferred in step (3). As it happened, the normal state of this machine had the same bit pattern as a zero character, which explained why the character would always change to zero.
Bugs of this kind are often called Heisenbugs, after Werner Heisenberg. Heisenberg is best known for the uncertainty principle, which says (for example) that it is possible to determine the position or momentum of a particle accurately, but not both at the same time. As a result, the term has come to denote a bug that stems from uncertainty about the order in which two parts of a program are executed. Such bugs can be the most difficult of all to find.
Indeed, there is a sense in which this bug wasn't really found at all. In particular, there was no point in this whole process at which any of us used a machine-language debugger or any other automated debugging, inspection, or testing tools. Instead, the entire process was more like repeating the following steps:
- Figure out which part of the system is failing.
- Make it fail consistently.
- Find the circumstances that are necessary to make it fail.
Each step in this process provided information that led to the next step; the last step made it possible to refine the first step further the next time around. Specifically, the problem appeared first to be in the compiler, then in a particular phase of the compiler, then in the earlier phase that provided bad data to the later phase, then in the I/O part of the operating system, and finally in a specific part of the operating system's I/O facilities. Each of these refinements grew out of understanding the precise circumstances in which the problem occurred.
I often find it surprising how much you can figure out about what is wrong with a program by repeating these three steps.