Dr. Dobb's | Discriminated Unions (I)

Discriminated Unions (I)

Discriminated unions are an important data abstraction, very helpful in database interfaces, efficient parsers, and various other type-safe abstractions. Usually, support for discriminated unions must be present in the language. This column presents a full-blown source-level implementation of discriminated unions in existing C++, an implementation that satisfies most of the requirements that one would normally ask from an innate language feature, syntactic nicety included.

April 01, 2002
URL:http://www.drdobbs.com/cpp/discriminated-unions-i/184403821

Believe me: in spite of the appearances, if you were looking for an article on programming, you're in the right place. It's not about discrimination against work unions. This installment of "Generic<Programming>" discusses the discriminated union data type.

Discriminated union is a concept with many aliases: disjoint union, variant type, or algebraic type. Regardless of what you call them, discriminated unions come in very useful as data abstractions in a variety of applications. Many languages (mostly the functional languages) rely on discriminated unions for pretty much all interesting data structures. It is possible that you will use discriminated unions with C++ in the near future as well, because...

But what is a discriminated union, and what's in it for you? You will find much more about this soon, but first, per popular request, allow me to make an addendum to my previous column.

An Add-Hoc Addendum to Ad-Hoc Visitation

The previous installment of "Generic<Programming>" [1] defined a class template AdHocVisitor, which implements Visitor-like functionality over a class hierarchy post factum. (In Latin, "post factum" stands for "without modifying that class hierarchy.")

AdHocVisitor's semantics are pretty simple. Say you instantiate:

typedef AdHocVisitor<cons<TextArea, VectorGraphics, 
    Bitmap>::type>
    DocElementVisitor;

Then, AdHocVisitor's workings will ensure proper visitation: when you call DocElementVisitor::StartVisit, the appropriate overload Visit(TextArea*), Visit(VectorGraphics*), or Visit(Bitmap*) will be automatically invoked.

The mechanism through which AdHocVisitor performs this argument type deduction is equivalent to an if-else chain; that is, the code does something like:

if (TextArea* pTextArea = dynamic_cast<TextArea*>(p))
{
    Visit(pTextArea);
}
else if (VectorGraphics* pVectorGraphics =
    dynamic_cast<VectorGraphics*>(p))
{
    Visit(pVectorGraphics);
}
else if (Bitmap* pBitmap = dynamic_cast<Bitmap*>(p))
{
    Visit(pBitmap);
}
else
{
    throw "Unknown type passed";
}

As my previous column explains, the way AdHocVisitor implements the cascading tests is through a recursive template that operates on a typelist, but the net semantics is the same as that of the above chain of if-else statements.

Now imagine that later you define a new class FormattedTextArea that inherits TextArea. Naturally, you would append a type to the DocElementVisitor's definition so that you can visit FormattedTextArea objects as well:

typedef AdHocVisitor<cons<TextArea, VectorGraphics, 
    Bitmap, FormattedTextArea>::type>
    DocElementVisitor;

When visiting with DocElementVisitor, the added dynamic_cast<FormattedTextArea*> test will be placed at the bottom of the if-else food chain, which means that it will never succeed — any FormattedTextArea "Is-A" TextArea, so dynamic_cast<TextArea*> works. Consequently, the code will always think that a FormattedTextArea is really a TextArea and will visit it as such. The last test "starves" because is always preempted by the precedent one. This never helps.

A simple solution is to shuffle the typelist so that FormattedTextArea comes before TextArea:

typedef AdHocVisitor<cons< FormattedTextArea, TextArea, 
    VectorGraphics, Bitmap>::type>
    DocElementVisitor;

A more elaborate solution automates this task. Indeed, Loki comes with a nice DerivedToFront typelist shuffler that does exactly what we need here. You need to look up Loki's code [2] or the explanations [3] for the full story, but the usage is remarkably simple:

typedef DerivedToFront<cons< FormattedTextArea, TextArea, 
    VectorGraphics, Bitmap>::type>
    ShuffledHierarchy;
typedef AdHocVisitor<ShuffledHierarchy::Result>
    DocElementVisitor;

So this is pretty much the addendum. Now, when writing the previous column, the word limit was looming close, and I decided not to address this warning, but kept the focus on typelists, not on hierarchy visitation subtleties. In doing this, I was reminded of a remark made by Scott Meyers in an email:

"Too many digressions [...] prevent readers from absorbing the full impact of what you want to say, because they get distracted. [...] When writing, I think it's important to maintain focus."

Sometimes real life is more fun than the best sitcom, because the first person to send an email about the previous column's omission was ... Scott Meyers himself. The second person was Uwe Behrens. Scott really deserves an extra bonus because he also knew all about the DerivedToFront-based solution. Obviously, Uwe hasn't read Modern C++ Design [3]; otherwise he'd be much more optimistic about this approach than his comments suggest:

"I have used similar schemes in the past for my own applications, including implementations of the Visitor pattern, and during that, I have learned about a serious flaw in this approach. And I learnt it the hard way (i.e., in night-long sessions with a debugger)."

To conclude, the bad news is that I didn't mention an important issue in "Typelists and Applications" [1], and the good news is that that issue has an automated solution [2, 3]. Thanks much to Scott Meyers and Uwe Behrens.

So What Are Discriminated Unions?

Ok, back to le menu du jour. There's a paper on discriminated unions [4] that is the basis of the article you're now reading. That paper, however, is written for a narrow, specialized audience — so many readers asked for a more detailed presentation.

A discriminated union is a data type that's able to store one value of a bounded collection at a time and to offer typesafe set and access primitives for that value.

There are a couple of keywords here that deserve close attention. First off, the offered access must be typesafe, which means that the C vintage union feature, although it seems to qualify, fails because it doesn't offer that typesafe access — in other words, it's not discriminate.

union SomeData
{
    int anInteger_;
    double aDouble_;
    char* aPointer_;
};

The compiler simply stores the three member variables in overlapped memory and doesn't stop you from illegal uses such as:

// Likely a silent reinterpret_cast
SomeData dt;
dt.aPointer_ = "Hello, cruel world of undiscriminated unions";
int weird = dt.anInteger;

The second important point to make is about the bounded aspect of the collection of types. That marks an important difference between full-blown dynamic polymorphism and discriminated unions. You can think of a pointer to a base class as some sort of a union, because it can point to anything derived from that base class. However, the set is unbounded — you don't know the set of possible types beforehand. In contrast, a discriminated union only stores a bounded set of possible types, set known at its definition point. Some of those types, however, can of course be polymorphic themselves.

When describing union, most C and C++ textbooks refer to it as of an inherently low-level, uninteresting facility — hey, if you really care about saving memory, here's a little feature for you. That's all that C's union can offer you. Discriminated unions, however, are a very useful high-level abstraction. This and the next installment of "Generic<Programing>" will try to convince you of this.

Uses of Discriminated Unions

Discriminated unions come in handy whenever you need to manipulate heterogeneous objects (in state and interface), in a homogenous way. (Polymorphism is great at manipulating objects with a heterogeneous state, but uniform interface.)

Consider a data type intended to store a database field's value. Say database fields can be integers, floating-point numbers, strings, or dates. They don't really have anything in common, except that they live together in the database. If you want to implement a data structure suitable for storing and manipulating database fields, you need to be able to hold any of the database's field types at a given moment. This is a task that fits variants like a glove.

Another example is storing data in a dynamically typed language, such as Scheme or Basic. A variable can have any type of a finite, small set of possibilities.

And yet another example would be items passed around in a parser. For example, during syntactical parsing, you obtain lexemes that each can be either a keyword, an identifier, a literal constant (integer, floating-point, string, etc.), an operator, etc. A discriminated union is best at storing such a lexical element. An object-oriented approach would allow nonsensical operations, such as obtaining the integral value of a keyword. In other words, it's not that nicely typed.

As we'll see below, when carefully implemented, discriminated unions can also have significant (a marketing person would say "overwhelming") efficiency advantages over other approaches.

Current Implementations

The obvious way of implementing discriminated unions in C and C++ is to use a "tagged union": you store a classic union comprising the types of interest together with an integer that tells what field is currently valid. Consider the DatabaseField example:

struct DatabaseField
{
    enum TypeTag { typeNull, typeInt, typeDouble, 
 typeString, typeDate }
        tag_;

   union
    {
        int fieldInt_;
        double fieldDouble_;
        char* fieldString_;
    Date fieldDate_;
    };
    ...
  };

Whenever you assign to any field* member of DatabaseField, you need to set the tag_ member appropriately as well. Conversely, before accessing the members of the union, you need to consult the tag_ first to make sure what exact type the DatabaseField cloaks.

This protocol can be encapsulated by a collection of get/set functions (GetInt, SetInt, GetString, SetString, etc.). The problem is, such an approach is clumsy, limited, and rigid. You need to maintain by hand a three-way relationship — type tag to data to get/set functions. Besides, couldn't there be a nicer interface than the clumsy get/set primitives?

The totally object-oriented approach to implementing discriminated unions is very simple: you derive all of your possible types in the union from a common class, say, DBField. So you have StringDBField and IntDBField and so on. Then, your discriminated union is simply a reference to DatabaseField. You use dynamic_cast (or whatever you call it in the OO language of choice) to retrieve the actual type.

This is one of those cases where object orientation is that square peg in a round hole. A hierarchy makes sense when your types have sensibly close operations; however, there's not much that strings, doubles, integers, and dates have in common. A hierarchy cannot effectively bring structure to such a setup. You need to either rely on casts or be obligated to write embarrassing functions, such as "give me the floating-point value of this date." Sure signs of a bad design.

A Modern Implementation

We will build together an industry-strength implementation of discriminated unions that offers almost everything that you would expect from a similar language-provided feature.

The first observation to make is that, given that discriminated unions are all about collections of types, we can nicely use the typelists we introduced in the previous column. Therefore, our first shot at implementation looks like this:

template <class TList>
class Variant // DiscriminatedUnion would be too long
{
    ...
};

You would instantiate Variant with a typelist:

typedef Variant<cons<int, double, String, Date>::type> DatabaseField;

In addition to letting the user choose the set of supported types, a good implementation should provide:

Value semantics: default and copy constructors, assignment, and destructor
Constructors that initialize Variant with a value of any type in the typelist
Assignment operator for all types in a typelist
Typesafe type interrogation and retrieval mechanisms
A swap primitive, useful in a variety of idioms
Conversions

Last but not least, Variant should provide efficient underlying plumbing. Don't forget we are talking about a primitive mechanism that is usually wired straight into the language.

Storage: It's Not Only about Efficiency

So, how do you want to store the actual value of the variant type?

The first shot at a design might just as well use a union for storage. It does exactly what we need — allows variables of different types to share the same underlying memory. With a little trick, we can take the types from a typelist and put them in a union. (Did you know that unions can be templates?)

template <class TList> union ConfigurableUnion;
  
template <class H, class T> 
union ConfigurableUnion< typelist<H, T> > 
{
    H head_;
    ConfigurableUnion<T> tail_;
};
  
template <class H, class T> 
union ConfigurableUnion< typelist<H, null_type> > 
{
    H head_;
};

ConfigurableUnion is a union in which you can nicely control the collection of types to store.

However, as soon as you try to do this:

typedef ConfigurableUnion<std::string, int> ImpossibleType;

the compiler reminds you that unions don't accept types with constructors and destructors as members, which renders ConfigurableUnion useless for any serious programs [5].

The second idea is to simply store pointers instead of straight types and to use free store allocation for the data. This idea would work, but is has two problems: a little one and a big one. (Exactly which is which depends on your views leaning more towards purism or towards performance.)

One problem is that this solution is not as efficient as the in-situ allocation that a union would offer. Using free store incurs a time overhead, and possibly a space overhead as well (due to the housekeeping data that the allocator must maintain).

One more fundamental (or less relevant, depending on your views) problem is that using free store allocation changes the outbound exception guarantees that Variant can make on construction. If using in-situ allocation, Variant's constructor could throw only the same kinds of exceptions as the copy constructor of the passed-in type. In particular, if you instantiate Variant with types whose copy doesn't throw, the Variant won't throw as well. Variant blends in nicely in any code that uses exceptions without biasing it — in a sense, Variant is to exceptions what tofu is to cooking.

If we design Variant to use free store allocation, its constructors can throw std::bad_alloc in addition to whatever behavior its stored types have. In particular, it can throw exceptions even when instantiated with "nice" types, such as int or char. This is not good because it loosens the guarantees that Variant can make to its clients, which in turn must loosen the guarantees that they can offer, and so on.

In conclusion, it would be best if Variant used in-situ allocation.

The search continues with the obvious idea of in-situ allocation: we could use an untyped buffer of characters, right?

template <class TList>
class Variant 
{
    // compute the needed buffer size
    enum { neededSize = ... }; 
    unsigned char buffer_[neededSize];
    ...
};

We actually can compute neededSize quite easily. Just think of implementing a recursive algorithm that gets the largest element in a typelist: you compare the head with the largest element in the tail (which you compute recursively), and you choose the larger of the two.

template <class TList> struct MaxSize;
  
template <> 
struct MaxSize<null_type>
{
    enum { result = 0 };
};

template <class Head, class Tail> 
struct MaxSize< typelist<Head, Tail> >
{
private:
    enum { tailResult = size_t(MaxSize<Tail>::result) };
public: 
    enum { result = sizeof(Head) > tailResult ? 
        sizeof(Head) : size_t(tailResult) };
};

As you can see, MaxSize recursively applies itself to compute the internal variable tailResult, and then it chooses the larger number. The limit case (null typelist) just returns the smallest possible positive number — zero.

Given MaxSize, our Variant definition becomes:

template <class TList>
class Variant 
{
    enum { neededSize = MaxSize<TList>::result };
    unsigned char buffer_[neededSize];
    ...
};

If you heard an awful sound, that's our current storage implementation hitting the wall: it doesn't work. Worse, it might work sometimes — with some types, on some machines, on some compilers, and during certain phases of the moon. Get ready for tough times — we're having alignment issues.

Indeed, the problem with our approach is that the compiler doesn't know that we plan on storing "true" data inside buffer_. For all it knows, the compiler must care about a buffer of characters.

It turns out that, in most current computer architectures, some data types must be stored at specific offsets in memory. For example, typically four-byte integers can be allocated only at addresses that are a multiple of four. That doesn't mean that all eight-byte data must be allocated at multiple-of-eight addresses — their alignment could be four as well. (No data, however, can require alignment greater than its own size, because that would break the array storage rules.)

The rationale for the notion of alignment is the physical arrangement of memory banks, the bus, and the processor. For example, certain memory banks are physically linked to the lowest-order eight bits of the memory bus, so by construction their content cannot directly reach the higher-order bits.

If you try to access data that's not properly aligned, one of several things might happen. You can get the data you wanted, only much slower, because essentially the processor simulates in software, by shuffling data around, what should be a hardware operation. (The x86 processors do this.) Many other processors, however, terminate your program immediately on grounds of an attempt to execute an illegal memory access.

The good news is that most of the time, you don't need to worry about alignment: the processor nicely takes care of it by allocating all data at properly aligned addresses. The bad news is that you do need to care exactly now, because buffer_[neededSize] really means you're taking your life in your hands.

How can we guarantee that buffer_ is properly aligned? Well, that's exactly what union does — it guarantees enough size for its largest member and alignment for its most restrictive member. This is what we need, so our storage should look something like this:

union 
{
    unsigned char buffer_[neededSize];
    Align dummy_;
};

We never use dummy_; it's there just for alignment purposes. We now only need to define the Align type properly. We'll investigate exactly how to do that in the next installment. Until then, happy coding and ideas galore!

Bibliography

[1] A. Alexandrescu. "Generic<Programming>: Typelists and Applications," C/C++ Users Journal C++ Experts Forum, February 2002, <www.cuj.com/experts/2002/alexandr.htm>.

[2] You can download Loki from <www.moderncppdesign.com>.

[3] A. Alexandrescu. Modern C++ Design (Addison-Wesley Longman, 2001).

[4] A. Alexandrescu. "An Implementation of Discriminated Unions in C++," The 2001 C++ Template Programming Workshop, held in conjunction with OOPSLA 2001, <www.oonumerics.org/tmpw01/alexandrescu.pdf>.

[5] Rumor has it that the next revised C++ Standard, fondly called C++0x by those who keep score at home, will support types with constructors as union members. Anyway, types with destructors can't be allowed, and that makes sense. Given that the union doesn't store a discriminator, how would the compiler know which destructor to call when going out of scope?

Andrei Alexandrescu is a Ph.D. student at University of Washington in Seattle, and author of the acclaimed book Modern C++ Design. He may be contacted at www.moderncppdesign.com. Andrei is also one of the featured instructors of The C++ Seminar