How C Makes It Hard To Check Array Bounds
Last week, I said I would discuss the tension between performance and safety. An important part of that tension comes from the nature of C arrays and memory management.
In theory, an array is a simple data structure: When you want to access one of its elements, you provide the index of that element and you gain read or write access to that element. The trouble with this description is that it implies that you provide an index each time you access an element. Using the index to find the element is often an expensive computation, especially if the element size is not a power of two: In an expression such as ++a[i]
, it would not be surprising for five times as much time to elapse in computing the address of a[i]
as in incrementing it.
For at least 50 years, compiler writers have been trying to access array elements more quickly. Much of their effort has centered around loops such as
for (size_t i = 0; i < n; ++i) c[i] = a[i] + b[i];
This example converts three indices to their corresponding memory locations each time through the loop, with all of the attendant overhead. Many compilers would effectively rewrite this loop along the following lines. Here, we assume that Pointer
is an appropriate type to point to an element of a
, b
, or c
:
Pointer ap = &a[0], bp = &b[0], cp = &c[0], aend = ap + n; while (ap < aend) *cp++ = *ap++ + *bp++;
This rewritten code replaces three index computations by three index increment operations, which are typically much faster. This replacement is harder than it looks, because the compiler has to be able to verify that no part of the loop aside from the for
statement itself changes the value of i
. This verification is trivial in this particular example, but in real code, it is often much more difficult.
Among the most important differences between C and its predecessors is that C gave programmers a way to make these kinds of optimizations explicit, rather than relying on the compiler to do so for them. C made these optimizations explicit by unifying the notions of arrays and pointers in a portable way, and by allowing programmers to do most of the index computations themselves that compilers for earlier languages had done for them.
It may seem strange that taking an automatic computation and making it manual would be an advantage, but many programmers would rather optimize code by hand instead of relying on the compiler to do it for them and then not knowing for certain whether it had actually been done. At least partly for this reason, it became the norm among C programmers to use pointers rather than indices to access array elements.
In addition to being faster in many cases, pointers have another big advantage over arrays: A pointer to an array element is a single value that is enough to identify that element uniquely. Suppose, for example, that we want to write a function that manipulates a range of array elements. Without pointers, we need three parameters to identify the range: the array and two indices. By using pointers, we can get by with only two parameters.
Moreover, pointers are a way to unify dynamically allocated memory with other memory. For example, the malloc
library function returns a pointer to dynamic memory, after which we can use that pointer to create whatever data structures we desire. Once we have created these data structures in this memory, we can use pointers to parts of those data structures as a way to allow other functions to access parts of those data structures. Those functions, in turn, have no need to know anything about the nature of the memory that they are using.
In exchange for this convenience, pointers offer three hazards that are not present in code that refer to array elements by indices.
First, because a pointer refers to an array element independently of the array itself, it is possible for the pointer to persist even after the array has gone away. For example, once we have stored a copy of &a[0]
in ap
, we have created the risk that we will try to use *ap
after a
has gone away. This risk did not exist before in that form, because once array a
had vanished, we no longer had any way of referring to its elements.
Second is the possibility of pointer arithmetic. If we use a pair of pointers to represent the endpoints of a range of array elements, we must be able to figure out the addresses of intermediate elements — which we do by arithmetic on those pointers. This pointer arithmetic opens up much more creative ways of creating invalid addresses — ways that are also harder to check — than does mere arithmetic on integers.
Finally, using pointers to represent ranges requires the existence of pointers that, although valid by themselves, do not point to valid memory. An example of such a pointer is aend
in our earlier example. We create aend
and use it as an upper bound for our loop, but if we were ever to try to evaluate *aend
, the effect of doing so would be undefined. Such a pointer is called an off-the-end pointer; the existence of such pointers makes it much more difficult to verify that C programs do not contain bounds errors.
We shall see more about the reasons for this difficulty next week.