Moving Data — Why It Matters
In a recent article , I talked about the idea of moving, rather than copying, data as an important new idea in C++0x. This post, and the next few, will go into more detail about the idea, along with some of the surrounding history and theory.
Brian Kernighan once laid down two rules for programmers who want to optimize their programs:
- Don't do it.
- (for experts only) Don't do it yet.
I agree. I also agree with his similar prescription: "Make it work before you make it faster." Computers exist to make things easier for people, not the other way around. The idea of moving, rather than copying data, is mostly an optimization. So why do I think it's important?
- Unlike many other optimizations, this one often works invisibly. For example, if I use
push_backto append an element to a
vector<string>, that code is apt to run much more quickly on an implementation that supports data movement than on one that does not, and I did not have to change any of my code to take advantage of that speed improvement.
- Moving data is a notion that, until now, was implemented — not quite correctly — as two separate operations. For example, if I write
a = b;
I am saying that I want to make a copy of
band put that copy in
a. If I know that I am no longer going to use the value of
a, then I have overspecified what I want the machine to do. I can make that overspecification less important by writing
a = b; clear(b);
clearis a (hypothetical) function that gets rid of
b's contents without actually destroying its object, but between these two statements, there is still a moment during which two copies of
b's value will exist. By writing
a = std::move(b);
I am making it clear that there is no need ever to make a second copy of
- Because moving an object does not create a second copy of the object, it becomes possible to define classes with objects that can be moved but not copied. Indeed, the
iostreamclasses have this characteristic.
It makes sense not to allow stream objects to be copied, because if it were possible to do so, the library would have to assign a meaning to program fragments such as
istream in = std::cin; std::cin >> x;
Questions would arise: Does reading from
std::cinaffect what the next read from
inwill see? Does using
put_backon one of the streams affect the other? What about closing one of them and reopening it with another file? Or, for that matter, the same file?
Allowing a stream to be moved but not copied makes it possible, for example, to have a
vector<ifstream> without worrying about how to answer these questions.
Moreover, the optimization has more of an effect than one might think:
- Many implementations take surprisingly long to allocate and free dynamic memory, because in order to do so, the system must maintain a data structure that keeps track of all allocated memory in the entire machine. Such a structure can be large and complicated, hence slow to manipulate.
- Modern processors usually run much faster than the memory to which they are attached. To compensate for this speed difference, they have caches that they pre-fill with data needed for future instructions. When an instruction dereferences a pointer, as is usually needed for programs that use dynamic memory, the processor has to wait until the memory can report its contents; this delay can take as long as executing many instructions.
Jon Bentley once conducted an informal test to see how many elements one had to put into a
vector before it would take longer to insert and delete elements in a
vector than in a
list. He found the crossover point for vectors of built-in types (such as
int) to be between 50 and 100 elements. In other words, deleting an element from a
vector, and then copying all of the subsequent elements, took less time than the dynamic memory allocation associated with deleting an element from a
list, until the
vector grew to many tens of elements. That difference came both from the overhead of dynamic memory allocation and the ability to use the processor's built-in cache.
Kernighan's admonition about optimization applies differently to compiler writers and library designers and implementers than it does to application programmers, because many programmers and machines use a single compiler and library. As a result, an optimization in a compiler will make a lot of programs run faster — and some of those programs might well have been unacceptably slow were it not for the optimization. In effect, an optimization that is part of a compiler — if done correctly — is an optimization that a lot of that compiler's users no longer need to consider.
In short, moving data without copying it is not only an optimization, it's a big optimization. Moreover, it's a way to specify a common operation more accurately and succinctly. As a result, library authors can say more accurately how they intend their libraries to work, and a lot of expensive memory allocation and deallocation becomes completely unnecessary. Although I don't normally care much for optimizations, I'll make an exception for this one.