Testing Behavior, Not Just Results
When you were a student, did you ever lose partial credit for an exam question that you answered correctly because you did so for the wrong reason? Or because you didn't show your work? Hold that thought.
Last week, I noted that if you're trying to verify a program's average performance, no single test will establish either that the program is performing as it should or that it is not doing so. For example, Quicksort, the algorithm that I used for last week's example, usually runs in O(n log n) time, but it might be as slow as O(n²). If a single test shows that it is running in O(n²) time, is the slowness a bug or a coincidence?
Similar reasoning applies to multiple tests. Although repeated tests can make one more confident that the program's performance is as it should be, certainty is never completely attained; it is merely approached. How, then, does one verify that such a program is behaving properly? One way is to examine not only what results the program produces but also the steps that it performs along the way. In the case of Quicksort, for example, detailed examination might be able to verify that the program actually implements the Quicksort algorithm, not just that it produces correctly sorted results and usually runs in the amount of time one would expect.
In the case of Quicksort, it's not terribly difficult to determine whether the algorithm is doing what it should — provided that one can put monitoring code inside the implementation itself. Quicksort involves three steps:
- Choose a pivot element.
- Rearrange the elements of the sequence so that all elements that are less than the pivot precede all elements that are equal to the pivot, which in turn precede all elements greater than the pivot.
- Recursively sort the two subsequences that are unequal to the pivot.
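The three steps above can be sketched in a few lines of Python. This is a minimal illustration of the algorithm's structure, not a tuned implementation; the pivot here is simply the middle element, and the rearrangement is done by building new lists rather than swapping in place:

```python
def quicksort(seq):
    """Sort a list using the three Quicksort steps described above."""
    if len(seq) <= 1:
        return seq
    # Step 1: choose a pivot element (here, simply the middle element).
    pivot = seq[len(seq) // 2]
    # Step 2: rearrange into elements less than, equal to,
    # and greater than the pivot.
    less = [x for x in seq if x < pivot]
    equal = [x for x in seq if x == pivot]
    greater = [x for x in seq if x > pivot]
    # Step 3: recursively sort the two subsequences unequal to the pivot.
    return quicksort(less) + equal + quicksort(greater)
```

For example, `quicksort([3, 1, 4, 1, 5])` returns `[1, 1, 3, 4, 5]`.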
Each of these steps is easy to verify — so long as what the program does to carry out these steps is visible to the test code in the first place. Such verification might use a special "test version" of the Quicksort code, which would describe (i.e., write onto a log file) what it was doing as it went along. This version might start by reporting the position of each element in its input sequence, then identify which element it chose as the pivot. It might continue by reporting each pair of elements that it swaps, and then, for verification, how the elements had been rearranged at the end of step (2).
The result of running Quicksort instrumented in this way would be a file that traces all of the operations that the Quicksort program performs. The test program might then take actions such as verifying that the elements being swapped were actually part of the sequence, that after the swaps were finished, the sequence met the ordering requirements outlined in (2), and that the reordering was followed by the appropriate recursive calls.
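Here is one hypothetical way such a test version and its checker might look in Python. For brevity, the trace is a list of tuples standing in for the log file, and the partitioning is a simple two-way (Lomuto-style) scheme rather than the full three-way rearrangement of step (2); the point is only to show the shape of the instrumentation and the checks:

```python
def instrumented_quicksort(seq, log, lo=0, hi=None):
    """A 'test version' of Quicksort that records its steps in a trace."""
    if hi is None:
        hi = len(seq) - 1
        log.append(("input", list(seq)))   # report the initial sequence
    if lo >= hi:
        return
    # Step 1: choose a pivot (here, the last element) and report the choice.
    pivot = seq[hi]
    log.append(("pivot", pivot, lo, hi))
    # Step 2: rearrange, reporting every swap as it happens.
    i = lo
    for j in range(lo, hi):
        if seq[j] < pivot:
            log.append(("swap", seq[i], seq[j]))
            seq[i], seq[j] = seq[j], seq[i]
            i += 1
    log.append(("swap", seq[i], seq[hi]))
    seq[i], seq[hi] = seq[hi], seq[i]
    # Report how the elements were rearranged at the end of the partition.
    log.append(("partitioned", list(seq[lo:hi + 1]), lo, hi, i))
    # Step 3: recursively sort the subsequences on either side of the pivot.
    instrumented_quicksort(seq, log, lo, i - 1)
    instrumented_quicksort(seq, log, i + 1, hi)

def check_trace(log):
    """Verify the trace: swapped elements came from the input, and each
    partition left smaller elements before the pivot, larger ones after."""
    input_elems = []
    for entry in log:
        if entry[0] == "input":
            input_elems = list(entry[1])
        elif entry[0] == "swap":
            _, a, b = entry
            assert a in input_elems and b in input_elems
        elif entry[0] == "partitioned":
            _, segment, lo, hi, p = entry
            pivot = segment[p - lo]
            assert all(x < pivot for x in segment[:p - lo])
            assert all(x >= pivot for x in segment[p - lo + 1:])
    return True
```

A real test would also confirm that the recorded recursive calls cover exactly the two subsequences on either side of each pivot, but even this sketch catches an implementation that produces sorted output by some means other than the steps it claims to perform.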
This kind of verification is similar to the idea of asking students to show their work on an exam. For Quicksort, and many other programs, it is not enough for a tester to verify that the program produces the right answer; it is also important that it do so in the right way and for the right reason.
Testing a program by instrumenting it and looking at the resulting log is often called "white-box testing," as opposed to "black-box testing," which looks only at a program's documented behavior. White-box testing is an essential tool in a number of contexts. We have just seen one such context: verifying indirectly that a program's performance will be reasonable by verifying directly that it uses an algorithm known to have reasonable performance. However, this context is far from the only one.
For example, programs often use resources while they are running and release those resources when they are finished. Memory leaks are an obvious example of failing to release resources that are no longer needed, but memory is not the only resource that must be managed in this way. If you look only at a program's overt behavior, you may not be able to know with confidence whether or not the program is leaking memory. For that matter, the operating system may not even report memory usage accurately. Nevertheless, for testing purposes, one might instrument such a program by logging every time the program allocates or frees a resource. One might then run a program that examines the log files to ensure that every resource that is allocated is also freed — exactly once.
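A checker for such a log might look like the following sketch, which assumes a hypothetical log format of lines such as `alloc h1` and `free h1`, where `h1` is a resource identifier:

```python
from collections import Counter

def check_resource_log(lines):
    """Examine an alloc/free log. Return (mismatched, bogus), where
    'mismatched' lists resources not freed exactly as often as allocated
    (leaks and double-frees), and 'bogus' lists resources that were
    freed but never allocated."""
    allocated = Counter()
    freed = Counter()
    for line in lines:
        op, res_id = line.split()
        if op == "alloc":
            allocated[res_id] += 1
        elif op == "free":
            freed[res_id] += 1
    mismatched = [r for r in allocated if freed[r] != allocated[r]]
    bogus = [r for r in freed if r not in allocated]
    return mismatched, bogus
```

Running it on `["alloc h1", "alloc h2", "free h1"]` reports `h2` as mismatched — a leak the program's overt behavior might never reveal.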
White-box testing is hardly a new idea. In fact, one form of white-box testing has been around since the earliest days of computing, namely parity checking. Although many small computers these days do not include parity checking, it was an essential tool in earlier computers because memory and processors were much less reliable than they are today. Even arithmetic units often did parity checking on their computations, analogously to how one might use casting out nines to verify a hand computation. If every memory operation and every computation carries a small amount of extra information that acts like a simplified model of the data or computation, then hardware failures become much easier to detect. In effect, a parity bit is a rough way of tracking what is supposed to happen during a computation. If the parity check is correct, that does not prove that the computation itself is correct; but a parity-check failure proves that something is broken.
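The casting-out-nines analogy can be made concrete. A number's value modulo 9 acts as a one-digit summary, and the summary of a product must equal the product of the summaries. As with a parity bit, a mismatch proves an error, while a match proves nothing:

```python
def check_multiplication(a, b, claimed_product):
    """Casting out nines: if the mod-9 summaries disagree, the claimed
    product is surely wrong; if they agree, it merely might be right."""
    return (a % 9) * (b % 9) % 9 == claimed_product % 9
```

Here `check_multiplication(1234, 5678, 7006652)` passes, and any claimed product that is off by, say, one fails. The check's weakness is the same as a parity bit's: an error that happens to be a multiple of 9 (for instance, claiming 12 × 12 = 153) slips through undetected.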
There is (at least) one other context in which white-box testing is important, and that is in making sure that a program does not attempt any undefined operations — especially not operations that rely on dangling pointers or out-of-range indices. Next week I'll talk about why these kinds of operations are so important to detect and prevent, and how white-box testing can help in doing so.