Moving Data and Address Arithmetic
Last week, I observed that an important reason for moving data in C++ programs, rather than copying it, is to be able to implement types such as vector
efficiently. Such types use the fact that a computer's memory addresses are numbers, and those numbers can be the results of computation. For example, the C++ standard guarantees that all of the elements of a vector
are stored in contiguous memory locations. This property makes it possible to locate any vector
element quickly from its index once the program knows the location and index of any other element.
If a data structure uses address arithmetic to determine the location of its data, then from time to time programs will almost surely find it necessary to put data at a particular place in memory — namely, at the address that is the result of a particular computation. If that location is already occupied, then the program must move data around so that that location is no longer occupied. For example, if a vector
is currently stored in n contiguous chunks of memory, and the program wants to append an n+1st element to the vector
, that element must wind up in the memory immediately following the nth element. If something else occupies that, then either that something else must move, or the program must reallocate the entire vector
.
A program can often avoid address arithmetic by using pointers. Instead of storing data in contiguous memory, such a program typically stores the address of each piece of data along with one or more of its neighbors. It is possible to implement such a data structure without caring at all about the memory addresses at which its elements are stored. However, storing an element's address only with its neighbors often implies that there is no quick way to jump from one place in the data structure to another place that is not a neighbor. In an odd reversal of interface and implementation, data structures that store their elements consecutively make random access easier, while data structures that store their elements in random locations support only sequential access.
Programs that don't care about the addresses of their data rarely need to move those data. After all, we can view the whole point of moving an object as a way of changing the object's address without changing its contents; if we don't care what the address is, we don't need to change it. As a result, C++ library data structures such as list
do not need to be able to move their elements once those elements have been created.
The tradeoff between moving data to appropriate addresses and using pointers to keep track of data in arbitrary locations is one of the fundamental tradeoffs that influences programmers' choice of data structures. The C++ library offers two extremes and a compromise, which, like most compromises, combines advantages and disadvantages of both of the extremes.
One extreme is vector
. All of a vector
's elements are always contiguous, so accessing the nth element takes the same amount of time for all values of n. Maintaining this contiguous data structure requires reallocating all of the vector
's elements whenever there is no room for a new element.
The other extreme is list
. You can think of each element of a list
as being totally independent from all others and coupled with pointers to its immediate neighbors. As a result, the library never needs to move list
elements once they are created, not even to append new elements to a list
. On the other hand, locating the nth element of a list typically requires n pointer operations. Moreover, each element has at least a couple of pointers' worth of space overhead, which implies that list
s can waste quite a bit of memory unless the list
elements themselves are large.
The compromise data structure, deque
, often gets glossed over in descriptions of the C++ library. It is typically implemented as a collection of chunks, which are vector
-like data structures, together with an auxiliary data structure that makes it possible to find the right chunk quickly. A deque
is almost surely harder to implement than either list
or vector
. However, from a user's viewpoint, it combines some nice properties of the two data structures.
For example, as with list
, you can append elements to a deque
at either end, not just at one end. Moreover, doing so does not require reallocating the deque
's already existing elements. As with vector
, you can access a deque
's elements efficiently in random order. Moreover, deque
does not have nearly the space overhead per element that list
does.
Because of these properties, some writers have suggested using deque
whenever there is no obvious reason to choose a different data structure. This advice leads naturally to the question of what some of those reasons might be.
One such reason is that like vector
and unlike list
, deque
does not support fast insertion and deletion in the middle. Indeed, although list
never requires moving elements around, deque
will do so for insertions and deletions anywhere but at the ends of the structure. Compared with vector
, a deque
will typically have more overhead for vector
-like operations such as incrementing and decrementing iterators, accessing elements, and so on. Moreover, in addition to fast insertion and deletion, list
has specialized operations for transferring elements from one list
object to another without moving the elements themselves. Because deque
uses address arithmetic to support random access, it sometimes needs to move elements to other memory locations, something that list
never needs to do.
Because moving data is so tied up with address arithmetic, programs that avoid address arithmetic can often avoid moving data. In fact, it is possible to write programs that never change the contents of memory once they have put something there in the first place. Such programs can have surprising properties: Programmers have to jump through some unusual hoops in order to write them, but there are some kinds of reliability and security problems that become much less likely. Next week, I'll continue this discussion with some examples of such programs and programming techniques.