A Simple, Immutable, Node-Based Data Structure
We continue the discussion of programming without mutable data by looking at what operations make sense for such data structures. For example, last week I noted that array-like data structures aren't terribly useful in a world that doesn't allow data to change, because it's hard to implement even such simple operations as appending to an array efficiently. The difficulty is that in an environment with immutable data, you can't just append a value to an array; you have to create a new array that contains the old array along with the value that you want to append.
I suggested one way of finessing around the problem: Use clever data structures so that the old and new arrays actually share storage. In other words, if x is an array-like data structure, and you execute

auto y = append(x, v);

then what is really stored is only a (single) copy of y; under the hood, x refers to the same data along with an indication that the last element of y isn't part of x. I also pointed out that this technique has limitations. In particular, after you execute

auto z = append(x, w);

it is impossible for y and z to share storage, which means that all of the elements of x must be copied.
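To make the idea of sharing concrete, here is one possible sketch in C++. The SharedArray name and its representation (a shared buffer plus a length) are purely illustrative assumptions, not the only way to arrange this; the point is just that append can reuse the existing buffer when the argument's view ends exactly where the buffer ends, and must copy otherwise.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Illustrative sketch: an immutable "view" of an array that can share storage.
// Each view keeps a shared buffer plus the number of elements that belong to it.
template <typename T>
struct SharedArray {
    std::shared_ptr<std::vector<T>> buffer;
    std::size_t length = 0;
};

// Appending reuses the buffer only when this view ends exactly where the
// buffer ends; otherwise the elements visible through x must be copied.
template <typename T>
SharedArray<T> append(const SharedArray<T>& x, const T& v) {
    if (x.buffer && x.length == x.buffer->size()) {
        x.buffer->push_back(v);                 // cheap case: share storage with x
        return {x.buffer, x.length + 1};
    }
    auto copy = std::make_shared<std::vector<T>>();   // expensive case: copy x's elements
    if (x.buffer)
        copy->assign(x.buffer->begin(), x.buffer->begin() + x.length);
    copy->push_back(v);
    return {copy, x.length + 1};
}
```

Notice that the buffer is mutated under the hood even though every view behaves as immutable data. With this scheme the first append(x, v) is cheap, but a second append(x, w) to the same x finds that the buffer has already grown past x's view and must fall back to copying, which is exactly the limitation described above.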
A reader pointed out that if the value to which you are appending is an rvalue, such as in

auto w = append(foo(), v);

then it is possible to arrange for the compiler to reuse foo()'s memory and avoid copying its elements. Like my suggestion, this one can be useful in some circumstances and has shortcomings in others. In particular, both of these suggestions have the problem that two statements that appear to behave similarly might have dramatically different execution times, as one of them (such as the call to append(x, v) above) does not need to copy the elements of x, but the other one (i.e., the call to append(x, w)) does.
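Continuing the SharedArray sketch above, one hedged way to express the reader's suggestion is an additional overload that takes its source by rvalue reference and steals the buffer whenever nothing else still refers to it:

```cpp
// Illustrative rvalue overload for the SharedArray sketch above: when the
// temporary uniquely owns its buffer, reuse it instead of copying.
template <typename T>
SharedArray<T> append(SharedArray<T>&& x, const T& v) {
    if (x.buffer && x.buffer.use_count() == 1) {
        x.buffer->resize(x.length);          // drop any elements beyond this view
        x.buffer->push_back(v);
        return {std::move(x.buffer), x.length + 1};
    }
    return append(x, v);                     // x is an lvalue here: falls back to the copying overload
}
```

Whether appending copies still depends on details that are invisible at the call site, which is the pitfall both suggestions share.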
This need to copy data does not come from the fact that the user of these data structures is asking to "copy" them. A crucially important characteristic of immutable data is that there is never any need to copy it because of its contents alone; the copy always has the same value as the original. The only reason to copy elements of an array-like data structure is that the elements' location in memory is itself the result of a computation. If we never change a value after we create it, and we don't care where in memory that value is located, we don't need to copy values.
To explore this idea, consider a simplified version of the C++ standard-library pair template. Without getting too hung up on the details of specific programming languages, imagine a data structure with the following properties:
- For any values x and y, pair(x, y) is a (single) value that contains both x and y.
- If we define z as pair(x, y), then first(z) is x and second(z) is y.
If we were talking about C or C++ here, I would probably want to say that first(z) evaluates to a copy of x or something like that. However, in the world of immutable data, this distinction is unnecessary.
As we're not talking about C or C++, let's simplify matters still further by assuming that our data structure is dynamically typed. Then all we really need in order to make our data structure useful is one more function and one value:
- For any value x, ispair(x) is true if and only if x is a pair.
- The special value nil is not a pair, and is not equal to any other value.
Without going into implementation details, let me observe again that because we are dealing entirely with immutable data, the execution time for pair(x, y) does not depend on the complexity of x or y.
The data structure I've just described is the core of Lisp and its derivatives, and it has remained so over the more than 50 years that the Lisp family of languages has been in existence. Next week, I'm going to discuss how a data structure of this kind can represent both programs and data, and how it can be used to solve a wide variety of programming problems more easily than you might think.