When describing union
, most C and C++ textbooks refer to it as of an inherently low-level, uninteresting facility hey, if you really care about saving memory, here's a little feature for you. That's all that C's union
can offer you. Discriminated unions, however, are a very useful high-level abstraction. This and the next installment of "Generic<Programing>" will try to convince you of this.
Uses of Discriminated Unions
Discriminated unions come in handy whenever you need to manipulate heterogeneous objects (in state and interface), in a homogenous way. (Polymorphism is great at manipulating objects with a heterogeneous state, but uniform interface.)
Consider a data type intended to store a database field's value. Say database fields can be integers, floating-point numbers, strings, or dates. They don't really have anything in common, except that they live together in the database. If you want to implement a data structure suitable for storing and manipulating database fields, you need to be able to hold any of the database's field types at a given moment. This is a task that fits variants like a glove.
Another example is storing data in a dynamically typed language, such as Scheme or Basic. A variable can have any type of a finite, small set of possibilities.
And yet another example would be items passed around in a parser. For example, during syntactical parsing, you obtain lexemes that each can be either a keyword, an identifier, a literal constant (integer, floating-point, string, etc.), an operator, etc. A discriminated union is best at storing such a lexical element. An object-oriented approach would allow nonsensical operations, such as obtaining the integral value of a keyword. In other words, it's not that nicely typed.
As we'll see below, when carefully implemented, discriminated unions can also have significant (a marketing person would say "overwhelming") efficiency advantages over other approaches.
Current Implementations
The obvious way of implementing discriminated unions in C and C++ is to use a "tagged union": you store a classic union
comprising the types of interest together with an integer that tells what field is currently valid. Consider the DatabaseField
example:
struct DatabaseField { enum TypeTag { typeNull, typeInt, typeDouble, typeString, typeDate } tag_; union { int fieldInt_; double fieldDouble_; char* fieldString_; Date fieldDate_; }; ... };
Whenever you assign to any field*
member of DatabaseField
, you need to set the tag_
member appropriately as well. Conversely, before accessing the members of the union, you need to consult the tag_
first to make sure what exact type the DatabaseField
cloaks.
This protocol can be encapsulated by a collection of get
/set
functions (GetInt
, SetInt
, GetString
, SetString
, etc.). The problem is, such an approach is clumsy, limited, and rigid. You need to maintain by hand a three-way relationship type tag to data to get
/set
functions. Besides, couldn't there be a nicer interface than the clumsy get
/set
primitives?
The totally object-oriented approach to implementing discriminated unions is very simple: you derive all of your possible types in the union from a common class, say, DBField
. So you have StringDBField
and IntDBField
and so on. Then, your discriminated union is simply a reference to DatabaseField
. You use dynamic_cast
(or whatever you call it in the OO language of choice) to retrieve the actual type.
This is one of those cases where object orientation is that square peg in a round hole. A hierarchy makes sense when your types have sensibly close operations; however, there's not much that strings, double
s, integers, and dates have in common. A hierarchy cannot effectively bring structure to such a setup. You need to either rely on casts or be obligated to write embarrassing functions, such as "give me the floating-point value of this date." Sure signs of a bad design.
A Modern Implementation
We will build together an industry-strength implementation of discriminated unions that offers almost everything that you would expect from a similar language-provided feature.
The first observation to make is that, given that discriminated unions are all about collections of types, we can nicely use the typelists we introduced in the previous column. Therefore, our first shot at implementation looks like this:
template <class TList> class Variant // DiscriminatedUnion would be too long { ... };
You would instantiate Variant
with a typelist:
typedef Variant<cons<int, double, String, Date>::type> DatabaseField;
In addition to letting the user choose the set of supported types, a good implementation should provide:
- Value semantics: default and copy constructors, assignment, and destructor
- Constructors that initialize
Variant
with a value of any type in the typelist - Assignment operator for all types in a typelist
- Typesafe type interrogation and retrieval mechanisms
- A
swap
primitive, useful in a variety of idioms - Conversions
Last but not least, Variant
should provide efficient underlying plumbing. Don't forget we are talking about a primitive mechanism that is usually wired straight into the language.
Storage: It's Not Only about Efficiency
So, how do you want to store the actual value of the variant type?
The first shot at a design might just as well use a union
for storage. It does exactly what we need allows variables of different types to share the same underlying memory. With a little trick, we can take the types from a typelist and put them in a union. (Did you know that union
s can be templates?)
template <class TList> union ConfigurableUnion; template <class H, class T> union ConfigurableUnion< typelist<H, T> > { H head_; ConfigurableUnion<T> tail_; }; template <class H, class T> union ConfigurableUnion< typelist<H, null_type> > { H head_; };
ConfigurableUnion
is a union
in which you can nicely control the collection of types to store.
However, as soon as you try to do this:
typedef ConfigurableUnion<std::string, int> ImpossibleType;
the compiler reminds you that unions don't accept types with constructors and destructors as members, which renders ConfigurableUnion
useless for any serious programs [5].
The second idea is to simply store pointers instead of straight types and to use free store allocation for the data. This idea would work, but is has two problems: a little one and a big one. (Exactly which is which depends on your views leaning more towards purism or towards performance.)
One problem is that this solution is not as efficient as the in-situ allocation that a union
would offer. Using free store incurs a time overhead, and possibly a space overhead as well (due to the housekeeping data that the allocator must maintain).
One more fundamental (or less relevant, depending on your views) problem is that using free store allocation changes the outbound exception guarantees that Variant
can make on construction. If using in-situ allocation, Variant
's constructor could throw only the same kinds of exceptions as the copy constructor of the passed-in type. In particular, if you instantiate Variant
with types whose copy doesn't throw, the Variant
won't throw as well. Variant
blends in nicely in any code that uses exceptions without biasing
it in a sense, Variant
is to exceptions what tofu is to cooking.
If we design Variant
to use free store allocation, its constructors can throw std::bad_alloc
in addition to whatever behavior its stored types have. In particular, it can throw exceptions even when instantiated with "nice" types, such as int
or char
. This is not good because it loosens the guarantees that Variant
can make to its clients, which in turn must loosen the guarantees that they can offer, and so on.