Channels ▼

Andrew Koenig

Dr. Dobb's Bloggers

Testing Is Not Verification and Vice Versa

January 02, 2014

Last week, I suggested by example that looking only at a program's externally visible behavior — black-box testing — can let major bugs go undetected. In particular, such testing cannot easily verify that a program will never behave (or misbehave) in a particular way, and that security considerations were one reason why such guarantees were often important.

Here is an example that makes this general claim concrete. I once used a time-sharing implementation of an interpretive language that was intended to be semantically safe: Every operation that a user could perform was intended to have a well-defined result (which might be a diagnostic message). To that end, the implementation checked for validity each individual operation that user programs performed as they did so.

Unfortunately, one of those checks contained a subtle bug: When a program tried to allocate a multidimensional array, the allocator computed the amount of memory it needed by multiplying the array's dimensions. Unfortunately, it forgot to check for overflow. As a result, a program could allocate a very large two-dimensional array — much larger than was actually capable of fitting in memory. For example, allocating a 222-by-214+1 array would compute a size of 236+214. On a 32-bit machine, this size would quietly overflow to 214, which was well within the available memory for allocation purposes.

Bugs of this sort are particularly pernicious because testing will often fail to reveal them. In this particular case, for example, an array that was only too big for memory, but not big enough to overflow, would cause an appropriate diagnostic message. So would an array with a size that overflowed to a value that still would not fit in memory. To trigger the bug, one would have to allocate an array with an overall size that was in bounds when viewed modulo 232 bytes. In that case, the size computation would yield what looked like a plausible value after the overflow. Although aggressively systematic testing might reveal this particular kind of misbehavior, the bug would be much easier to find if whoever was doing the testing knew enough about the code being tested to construct tests that were particularly likely to evoke bugs.

On the surface, this overflow bug does not seem particularly harmful, because it's the kind of bug that users normally would not encounter — and user programs that would legitimately fit in memory would never encounter. However, in practice, this bug made possible a data structure that looked like an ordinary two-dimensional array that overlaid all of memory. As a result, it gave user programs the ability to access any part of the computer's memory that was not protected by hardware. Further experimentation showed that it was possible to use this ability to turn on the bit that gave the user administrative privileges, after which the entire system was within reach.

This example shows a key way in which security bugs differ from others. Ordinary programs are very unlikely to create a multidimensional array of just the right "size." As a result, there is not much reason to test for such bugs because they happen so rarely.

However, in the presence of malicious users, not only do such bugs take on a malevolence far beyond their surface qualities, but there is good reason to believe that a user who discovers such a bug will not report it; instead saving the knowledge of its existence for later use. In other words, not only are obscure security bugs more severe than they might seem, but developers cannot rely on users to report such bugs when they do encounter them.

For this reason, such bugs are particularly important to find before the system is released to users. However, in the case of this particular bug, we have already seen two reasons why ordinary testing is not enough to be confident in finding it:

  • The bug was triggered by actions that should have caused a runtime diagnostic message, and
  • Many similar actions would have caused that message.

It is not even good enough to say that a white-box test should have revealed it, because the knowledge needed to construct such a white-box test would also have made it possible to anticipate and avoid the problem in the first place.

What is needed, then, is a way of anticipating unanticipated errors. One way to approach this paradoxical goal might be to arrange that memory accesses be checked for validity separately from the computation that yielded the address to be accessed. In other words:

  • The program allocates an array by multiplying its dimensions to yield a size.
  • This size is used to allocate a block of memory.
  • The block's size can be figured out, either by asking the memory allocator or by storing it explicitly as part of the block.

As originally written, when it came time to use such a block, the system would take a sequence of array indices as input, check whether each one was within its corresponding bound, then compute the address corresponding to that array element. The authors of the code assumed that this check was enough to ensure validity, because each individual array bound was known to be valid. The problem in this case, then, was that the authors assumed that if each individual bound was valid, all of them together were also valid.

If the authors' attitude had been that such assumptions weren't enough by themselves, then whenever a user program tried to access memory in an array, the system might have verified that the address to be accessed was actually a valid address in the block of memory, as defined by the block's address and size. Such verification would be able to preserve an important property: Even a nonsensical request would not cause the system to access unauthorized memory. This property would hold even if the address computation overflowed, because either it would yield a valid address, in which case there would be no harm done, or it wouldn't, in which case the result would be a runtime diagnostic message.

In short, this kind of program is one in which internal validity checking might be able to reveal bugs that neither black- nor white-box testing would be likely to encounter. C++ is particularly well suited to such internal validity checking because C++ class definitions make it possible to implement data structures for which all accesses pass through a single, common piece of code. A validity check in that common code will run as part of every access. If that check is implemented as part of an assert statement, it is even possible to turn it off should it be too expensive. As ever, it is an open question whether the ability to turn off such sanity checks is a feature or a bug.

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.



I agree with the diagnosis here, but think the cure is not so clear cut. In my experience, bugs are obvious in hindsight, but not before. Additionally, any complex system is likely to have an enormous state-space, with all sorts of chaotic and rare behaviours. Therefore I tend to think verification is not possible under reasonable resource limits.

So we do need to test more, in any way we can, but lets not imagine it will avoid weirdness. Presumably the programmers in the example were competent and had the usual face-palm moment when they saw the problem.

I would tend to add random black-box testing, pervasive diagnostic support in production and significant allocated resource to complexity reduction refactorings.