
C++ Made Easier: Plain Old Data


April 2002/C++ Made Easier

Refresh your memory on why traditional techniques for processing plain old data don't apply to objects.


Introduction

Several people have recently asked us questions similar to this one: “Can I use memcpy to copy an object of type string?”

Our first impulse is to say that if you have to ask, you shouldn’t be doing it — because you will get in trouble if you try. Nevertheless, the concepts behind the question are interesting enough to merit a closer look.

Briefly, the answer is that you can use memcpy safely to copy an object only if the object’s type is what is called a POD type, which stands for “Plain Old Data.” Because string is not a POD type, there is no guarantee that it is safe to use memcpy on a string.

What memcpy Does

As its name suggests, the memcpy function, from the C Standard library, copies memory:

void* memcpy(void* dest,
  const void* source, size_t n);

The source and dest arguments each refer to the initial byte of an n-byte region of memory; the two regions must not overlap. The memcpy function copies the memory in the source region to the memory in the dest region, obliterating whatever contents the dest region might have had previously. The memcpy function returns a copy of the dest pointer.

For example, suppose we write:

int x = 42;
int y;
memcpy(&y, &x, sizeof(int));

As it happens, int is a POD type, so it is safe to use memcpy on int objects. Accordingly, after executing these statements, y will have a value of 42, just as if we had executed:

y = x;

instead of calling memcpy.
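The comparison above can be packaged as a small check. This is only a sketch: copy_int_with_memcpy is a hypothetical helper name introduced here for illustration, not something from the C library.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: copies an int through memcpy and returns the copy.
int copy_int_with_memcpy(int x)
{
    int y = 0;
    // Copy exactly sizeof(int) bytes from x's memory into y's memory.
    void* returned = std::memcpy(&y, &x, sizeof(int));
    // memcpy hands back its dest argument.
    assert(returned == &y);
    return y;
}
```

Calling copy_int_with_memcpy(42) should give back 42, exactly as y = x would.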

The question, then, is what will happen if we write:

string s = "Hello, world!";
string t;
memcpy(&t, &s, sizeof(string));

Will t have Hello, world! as its value or will the value be different? Will such a program even work at all?

The answer is that the program is not guaranteed to work, because string is not a POD type. Indeed, it is likely that this program fragment will cause a crash, as we shall see. The rest of this article explains what a POD type is and gives an idea of why memory-manipulation functions such as memcpy are generally safe only when applied to objects of POD type.

Fundamental Types

We can think of the memory in any computer that supports C++ as being composed of a collection of bytes, each of which contains an implementation-defined number of bits. All bytes contain the same number of bits. In a C++ implementation, that number must be at least eight; if the computer hardware does not support 8-bit or larger bytes, the C++ implementation must fake it in software. Most computers that support C++ have bytes that are exactly 8 bits long, but we have seen computers with bytes as long as 64 bits.

There are three important facts to know about bytes:

  1. A byte is the smallest addressable unit of memory. That is, every region of memory that pointers can delimit comprises a whole number of bytes. Accordingly, a byte address (which C++ expresses with the void* type) together with an integer (which represents an object’s size) suffices to refer to the memory that any object occupies.
  2. Every bit in a region of memory is part of exactly one byte. In particular, there is no information that might somehow fall into the cracks between the bytes [1].
  3. The sizeof operator, when given an object or a type as its argument, returns the number of bytes in an object of that type. All objects of a given type are the same size, so only the type matters.

These three properties imply that if x is an object, we can use ((void*)&x) and sizeof(x) together to represent the memory that x occupies. The question, then, is whether there is any more to x than the contents of its memory. It is that question that the POD notion exists to address: if a type is a POD type, the implication is that there is nothing more to an object of that type than the contents of its memory.
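As a rough illustration of “an address plus a size,” the sketch below views an object as its bytes and then rebuilds an int from them. The names bytes_of and int_from_bytes are assumptions made for this example. The code also relies on std::vector::data, which arrived in a later standard than the one current when this article was written.

```cpp
#include <cassert>
#include <climits>
#include <cstring>
#include <vector>

// Every C++ implementation has bytes of at least eight bits.
static_assert(CHAR_BIT >= 8, "a byte has at least eight bits");

// Hypothetical helper: returns the sizeof(x) bytes that x occupies.
// Those bytes are the whole story only when T is a POD type.
template <class T>
std::vector<unsigned char> bytes_of(const T& x)
{
    const unsigned char* first = reinterpret_cast<const unsigned char*>(&x);
    return std::vector<unsigned char>(first, first + sizeof(x));
}

// Hypothetical helper: rebuilds an int from bytes produced by bytes_of.
int int_from_bytes(const std::vector<unsigned char>& bytes)
{
    assert(bytes.size() == sizeof(int));
    int result;
    std::memcpy(&result, bytes.data(), sizeof(int));
    return result;
}
```

Because int is a POD type, int_from_bytes(bytes_of(42)) should recover 42. Nothing was lost by going through raw bytes.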

The fundamental types, by which we mean here the arithmetic, enumeration, and pointer types (the C++ Standard’s own name for this group is scalar types), are POD types. In other words, the value of an object of a fundamental type depends entirely on the contents of the region of memory that corresponds to that object. It follows that using memcpy to copy an object of a fundamental type will copy that object’s value.

To see what’s happening more clearly, let’s look again at our earlier example:

int x = 42;
int y;
memcpy(&y, &x, sizeof(int));

Here, x and y are of fundamental type (int). They are therefore of POD type, so the bytes that constitute them completely determine their values. Moreover, every object of type int comprises sizeof(int) bytes.

When we call memcpy, it copies a number of bytes given by its third argument (in this case, sizeof(int)) from the region of memory occupied by x to the region occupied by y. Accordingly, the call to memcpy has the same effect as executing:

y = x;

because there is nothing more to x or y than the contents of the corresponding memory.

Structures

Let’s expand our universe by using memcpy to copy a structure. For example:

struct Point {
  int x, y;
};

Point p, q;
p.x = 123;
p.y = 456;
memcpy(&q, &p, sizeof(Point));

Will the call to memcpy still have the same effect as executing:

q = p;

or does the fact that Point is a user-defined type make memcpy not work?

The answer is that this structure is a POD type, because it is still so simple that its memory entirely determines its value, and therefore memcpy is safe to use in this context.

Structure Assignment

When we defined our Point structure, we did not give it an assignment operator. When we try to assign objects of such a type, the compiler treats such an assignment as being equivalent to assigning the objects’ data members. In other words, executing:

q = p;

has the same effect as executing:

q.x = p.x;
q.y = p.y;

Because the x and y members of Point are of fundamental type, we can use memcpy to copy those members. Accordingly, it is also safe to use memcpy to copy the entire object.
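To make the equivalence concrete, here is a minimal sketch that copies a Point three ways and compares the results. The function copies_agree is a hypothetical name used only for this illustration.

```cpp
#include <cassert>
#include <cstring>

struct Point {
    int x, y;
};

// Hypothetical helper: copies p by assignment, member by member, and with
// memcpy, then reports whether all three copies hold the same values.
bool copies_agree(const Point& p)
{
    Point by_assignment;
    by_assignment = p;                 // compiler-generated assignment

    Point by_member;
    by_member.x = p.x;                 // what that assignment amounts to
    by_member.y = p.y;

    Point by_memcpy;
    std::memcpy(&by_memcpy, &p, sizeof(Point));

    return by_assignment.x == by_member.x && by_assignment.y == by_member.y
        && by_memcpy.x == by_member.x && by_memcpy.y == by_member.y;
}
```

For a Point with x set to 123 and y set to 456, copies_agree should return true.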

Suppose, now, that we were to redefine Point to include an assignment operator:

struct Point {
  int x, y;
  Point& operator=(const Point&);
};

We have deliberately omitted the definition of this assignment operator so that you won’t be tempted to think that you know what it does. It should now be clear that defining the assignment operator for Point has removed the guarantee that memcpy is safe to use on Point objects, because without seeing the definition of the assignment operator, we have no way of knowing that it has the same effect as calling memcpy.

Even if we define our assignment operator to have the same effect as the compiler-generated one:

Point& Point::operator=(const Point& p)
{
  x = p.x;
  y = p.y;
  return *this;
}

we should no longer consider it safe to use memcpy to copy a Point object, because doing so would rely on knowledge of the inner workings of the Point type, and those workings might change in the future.

In other words, we should trust memcpy only when we are confident that using it to copy an object has the same effect as that object’s assignment operator. If the object, or any of its (non-static) data members, has a user-defined assignment operator, the compiler would have to read the definition of that operator to figure out whether it has the same effect as the compiler-generated one, and such figuring is, in general, provably beyond the reach of any program. Therefore, a type that has a user-defined assignment operator, or that has one in any of its data members, is not a POD.
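One way to see why a user-defined assignment operator matters is to give it a side effect that a raw memory copy could not reproduce. The sketch below is hypothetical: CountingPoint and assign_twice are invented for this example. The std::is_trivially_copyable trait comes from a later C++ standard than the one this article discusses. We use it here only as the modern standard’s way of asking whether memcpy is guaranteed to work.

```cpp
#include <cassert>
#include <type_traits>

struct PlainPoint {
    int x, y;
};

// Hypothetical variant whose assignment operator also keeps a count.
struct CountingPoint {
    int x, y;
    static int assignments;
    CountingPoint& operator=(const CountingPoint& p)
    {
        x = p.x;
        y = p.y;
        ++assignments;   // bookkeeping that a memcpy would silently skip
        return *this;
    }
};
int CountingPoint::assignments = 0;

// The compiler makes no promise about memcpy once operator= is user-defined.
static_assert(std::is_trivially_copyable<PlainPoint>::value,
              "PlainPoint may be copied with memcpy");
static_assert(!std::is_trivially_copyable<CountingPoint>::value,
              "CountingPoint may not be copied with memcpy");

// Hypothetical helper: assigns twice and returns how many assignments
// the class recorded.
int assign_twice()
{
    CountingPoint a = {1, 2};
    CountingPoint b = {0, 0};
    int before = CountingPoint::assignments;
    b = a;
    b = a;
    return CountingPoint::assignments - before;
}
```

Here assign_twice should report 2. Had the copies been made with memcpy, x and y might well have ended up the same, but the count would not have moved.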

A More Precise Definition

We have seen two aspects of POD types: the fundamental types are POD types, and structures with user-defined assignment operators are not POD types. Here are the rest of the details:

  • Arithmetic types, enumeration types, and pointers (including pointers to functions and pointers to members) are POD types.
  • An array is a POD type if its elements are.
  • A structure or union is a POD if all of the following are true:
    • Every one of its non-static data members is a POD.
    • It has no user-declared constructors, assignment operators, or destructor.
    • It has no private or protected non-static data members.
    • It has no base classes.
    • It has no virtual functions.

The idea is that a class is a POD type if it has nothing to hide about its representation. Therefore, we can be sure that the value of an object of such a type is nothing more or less than the values of its components, so that copying the object is equivalent to copying its memory.
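Compilers that support the type-traits library from later C++ standards can check rules much like these at compile time. The sketch below is an approximation, not the definition given above. The later standard’s notion of POD is somewhat looser about private members and base classes, so the examples are limited to cases where the two definitions agree. The name is_pod_like and the example structures are assumptions made for illustration.

```cpp
#include <string>
#include <type_traits>

// Hypothetical trait: approximates "POD" as trivial plus standard-layout.
template <class T>
struct is_pod_like
    : std::integral_constant<bool,
          std::is_trivial<T>::value && std::is_standard_layout<T>::value> {};

struct Point { int x, y; };
struct WithDestructor { int x; ~WithDestructor() {} };
struct WithVirtual { int x; virtual void f() {} };
struct WithStringMember { std::string name; };

// Arithmetic types, pointers, and pointers to members qualify.
static_assert(is_pod_like<int>::value, "arithmetic type");
static_assert(is_pod_like<int*>::value, "pointer");
static_assert(is_pod_like<int Point::*>::value, "pointer to member");

// A simple structure qualifies, and so does an array of them.
static_assert(is_pod_like<Point>::value, "plain structure");
static_assert(is_pod_like<Point[4]>::value, "array of PODs");

// Each of these breaks one of the rules in the list.
static_assert(!is_pod_like<WithDestructor>::value, "user-declared destructor");
static_assert(!is_pod_like<WithVirtual>::value, "virtual function");
static_assert(!is_pod_like<WithStringMember>::value, "non-POD data member");
static_assert(!is_pod_like<std::string>::value, "string itself");
```

If any of these assertions were wrong, the file would fail to compile, which makes this a cheap guard to place next to any code that calls memcpy on a structure.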

Discussion

Let us return to our original question: Is it safe to use memcpy to copy a string? We know that the string class has constructors, so it is not a POD. Therefore, the answer must be no. But what happens if we try anyway? Saying that a class is not a POD is saying only that memcpy is not guaranteed to work. It is not necessarily guaranteed to fail either. The question is whether copying the object is equivalent to copying its memory. To answer this question for the string class, we need to think about how it might be implemented.

A plausible implementation uses the string object itself to store an integer that represents the string’s length and a pointer to dynamically allocated memory that contains the string’s characters. This integer and pointer are fundamental types, so surely it must be possible to use memcpy to copy them, right?

Wrong. Here’s the problem:

string s = "hello", t = "world";
memcpy(&s, &t, sizeof(string));

Before we call memcpy, both s and t contain pointers that refer to separately allocated memory: the pointer in s refers to the characters of hello, and the pointer in t refers to the characters of world.

Calling memcpy will overwrite the pointer in s with a copy of the pointer in t. Now, both pointers will point to the same memory, and the memory holding hello, to which s formerly referred, will be inaccessible and will therefore never be freed.

This program fragment will therefore leak memory. Moreover, when it comes time to destroy s and t, the effect of doing so is likely to be to try to deallocate the same memory twice, resulting in a crash.

This example should make it clear why the rules for defining POD types exclude types such as string. The moment a class author defines a constructor, destructor, or assignment operator, we can no longer be confident that copying an object of that class is equivalent to copying the object’s memory.
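The aliasing can be observed without risking a crash by modeling the plausible implementation as a plain structure and managing the memory by hand. Everything in this sketch is hypothetical. RawString, make_raw, and memcpy_makes_aliases are names invented here, and real library strings are laid out differently and may not allocate at all for short strings.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for the layout described in the text: a length and
// a pointer to separately allocated characters. It is itself a POD, so
// memcpy on it is well defined. What goes wrong is ownership of the buffer.
struct RawString {
    std::size_t length;
    char* data;
};

// Hypothetical helper: allocates a buffer and copies text into it.
RawString make_raw(const char* text)
{
    RawString r;
    r.length = std::strlen(text);
    r.data = static_cast<char*>(std::malloc(r.length + 1));
    assert(r.data != 0);
    std::memcpy(r.data, text, r.length + 1);   // include the terminator
    return r;
}

// Returns true if, after the memcpy, s and t share a single buffer.
bool memcpy_makes_aliases()
{
    RawString s = make_raw("hello");
    RawString t = make_raw("world");
    char* old_s_data = s.data;          // saved only so this demo can clean up

    std::memcpy(&s, &t, sizeof(RawString));

    bool aliased = (s.data == t.data);  // both now refer to "world"

    // A string destructor would free s.data and t.data, which is the same
    // block twice. Here each allocation is freed exactly once by hand.
    std::free(old_s_data);              // would leak without the saved copy
    std::free(t.data);
    return aliased;
}
```

The saved pointer old_s_data is the crux. A real string class keeps no such spare copy, so the hello buffer is lost and the world buffer has two owners.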

Summary

Functions such as memcpy, which deal with a class object’s memory directly, undercut the class author’s intentions. Doing so is dangerous unless the intentions are at so low a level as to make them impossible to undercut. Such a class is called a POD (Plain Old Data) to indicate that there is nothing more to the class than its contents. The moment data abstraction enters the picture, be it through constructors, destructors, assignment operators, base classes, or virtual functions, it is time to use only the operations that the class provides, and eschew low-level operations such as memcpy.

Note

[1] This fact is not as obvious as it sounds. For example, we have seen computers with 36-bit words in which the usual way to represent characters is to stuff five seven-bit characters into a word, with one bit per word left over. This implementation strategy fails on two counts: bytes contain fewer than eight bits, and there are bits that are not part of any byte.

A C++ implementation could solve the first problem by using eight-bit bytes, but that strategy would still leave unused bits in each word. Therefore, a correct solution would have to involve bytes with a size that divides evenly into 36, namely 9, 12, 18, or 36 bits.
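The arithmetic behind that list is easy to check mechanically. In this small sketch, candidate_byte_sizes is a hypothetical helper that lists every size of at least a given minimum that divides the word size with no bits left over.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: returns every byte size, in bits, that is at least
// min_bits and divides word_bits evenly.
std::vector<int> candidate_byte_sizes(int word_bits, int min_bits)
{
    assert(word_bits > 0 && min_bits > 0);
    std::vector<int> sizes;
    for (int bits = min_bits; bits <= word_bits; ++bits) {
        if (word_bits % bits == 0) {
            sizes.push_back(bits);
        }
    }
    return sizes;
}
```

For a 36-bit word and the eight-bit minimum that C++ requires, the helper yields 9, 12, 18, and 36.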

Andrew Koenig is a member of the Large-Scale Programming Research Department at AT&T’s Shannon Laboratory, and the Project Editor of the C++ standards committee. A programmer for more than 30 years, 15 of them in C++, he has published more than 150 articles about C++ and speaks on the topic worldwide. He is the author of C Traps and Pitfalls and co-author of Ruminations on C++.

Barbara E. Moo is an independent consultant with 20 years’ experience in the software field. During her nearly 15 years at AT&T, she worked on one of the first commercial projects ever written in C++, managed the company’s first C++ compiler project, and directed the development of AT&T’s award-winning WorldNet Internet service business. She is co-author of Ruminations on C++ and lectures worldwide.

