C/C++

GenSerial: A Generic C++ Serialization Library

By Evan Sherbrooke, September 01, 2004

The GenSerial library is designed for loading and saving complex files, importing/exporting with other file formats, and implementing generic deep copies, among other things.

September, 2004: GenSerial: A Generic C++ Serialization Library

Evan Sherbrooke holds a B.A. in Mathematics, an M.S. in Computer Science, and a Ph.D. in Computer Aided Design and Manufacturing from MIT. He is currently developing next-generation surfacing technology for CAD systems. Evan can be contacted at [email protected].

Sooner or later, in almost every serious project a fundamental question comes up: "How will users load and save data?" Depending on the complexity of the project, that question can have a wide range of answers. For example, a simple address book application might only need to handle a single address book entry class; saving to a file might simply involve calling a method of the class called Write and passing it an std::ostream&. Loading from a file could be equally simple—just read in an integer indicating the number of records in the file, and call a Read method that many times.

On the other hand, consider a full-scale computer-aided-design application. Its database is likely to be incredibly complicated, perhaps containing graph-like structures representing topology, reference-counted shared pointers, template classes, multiple inherited classes, abstract base classes, and the like. In such cases, simple Read/Write methods are cumbersome to implement and can lead to all kinds of maintenance nightmares as developers forget to update them, or update them incorrectly. When data needs to be upgradeable from one release of the application to the next, the nightmares get even worse.

In this article, I'll introduce GenSerial, a library designed to lighten the burden on developers in situations such as this. GenSerial has no Read/Write methods that need to be written, and the algorithms that actually read and write data are completely decoupled from the classes themselves. That means that you can create new writing/ reading functions for different purposes, and you don't have to change any of the classes to do it. For example, you might use the simple text-based serialization algorithm supplied with GenSerial for in-house development and debugging, and switch to a custom binary format in production code for smaller data size. Or you might want to write a translator to export your data to some other format. GenSerial adds a simple interface to your serializable classes so that your custom serialization methods can iterate over the data in those classes in a generic manner.

The burden GenSerial places on developers is extremely light. GenSerial requires only a single, simple macro in the class definition itself, and a number of simple macros in some C++ source file equal to the number of serialized fields in the class plus two. Thus far, the library has only been tested in Visual Studio .NET 2003, but it should be portable to any Standards-conforming compiler that can also use the free Boost library (http:// www.boost.org/). GenSerial is open source and is available from SourceForge (http:// www.sourceforge.net/) and at http://www .cuj.com/code/.

An Instructive Example

Consider a simple geometric example. Create a simple 2D Point class and use it to define a Circle, which is part of some Shape hierarchy, as in Listing 1. As you can see, only minor changes to these classes are needed to make them serializable. In particular, there are three different DECLARE_ macros and the Shape class inherits from AbstractSerializable. The only other restriction imposed is that any nonabstract class must have a default constructor, although it need not be public.

Take a look at the Point class. Many serialization systems require that any serializable class inherit from some common base class, but GenSerial does not. There are cases where a common serializable base class is desirable (as we'll see in the case of the Shape class), but there are also cases where it is not. For example, suppose the Point class was used to store points in a large triangular mesh. If Point were to inherit from a serializable base class, then at least an extra word of storage per Point would be needed for the virtual function table. That can add up to a lot of wasted memory when a lot of Points are used. To tell GenSerial that the serialization information does not have to be accessed virtually, I use the DECLARE_ SIMPLE_SERIALIZABLE macro [1].

Of course, there are situations where it is necessary to have the serialization information accessed virtually. What if all of the Shape instances are stored in the BunchOfShapes class in Listing 1? BunchOfShapes contains a vector of smart pointers; it would be nice if you could pass a single instance of BunchOfShapes to GenSerial and have it automatically serialize the correct concrete subclasses of Shape stored in the vector. That is, if one of the pointers in the vector is in fact a Circle, GenSerial should read and write it as a Circle automatically.

The DECLARE_ABSTRACT_SERIALIZABLE and DECLARE_SERIALIZABLE macros make this dynamic dispatching easy. The word ABSTRACT in the middle of the macro used by class Shape tells GenSerial that the Shape class is serializable, but it cannot be instantiated by itself; a concrete derived class must be used instead. The DECLARE_SERIALIZABLE macro inside class Circle, on the other hand, indicates that Circle is a concrete class, and its most immediate serializable parent is class Shape. That means that when a Circle is read or written, its Shape part is automatically read or written first.

Because both Circle and Shape inherit from AbstractSerializable, the access to serialization information is dynamic. That means that if you have a Circle instance expressed as a pointer to Shape, a simple virtual function call gives you the information necessary to serialize or deserialize the Circle instance.

Listing 2 shows part of a .cpp file for these objects. The .cpp file contains declarations of the objects' actual serialization tables. Fortunately, the declarations are quite simple; each class has a BEGIN/END macro, and between the macros are a number of SERIALIZABLE_FIELD macros corresponding to the serializable fields of the class. You'll note that there are no DEFINE_ABSTRACT_SERIALIZABLE_BEGIN/END macros; DEFINE_SERIALIZABLE_BEGIN/END works for either abstract or concrete classes. Also notice that SERIALIZABLE_FIELD doesn't care about the type of the field at all. Using the magic of templates (and a whole lot of partial template specialization), GenSerial can automatically detect the type and figure out how to serialize it.

That's all that any developer who's not actually writing the load/save code for the project needs to know. Using just those simple macros, you can make almost any class fully serializable.

What's in the Box?

The basic strategy of GenSerial is to turn each class into a virtual container, where the elements of the container are serializable fields. GenSerial uses a generalized iterator class to traverse each class, and then relies on the power of the modern C++ compiler to do the rest.

The DECLARE macros in the header file do a few simple things:

They declare a public interface into the virtual container: const SerializerInterface * getSerializer() const. Because the interface is public, algorithms that actually do the reading and writing of the database don't have to be granted explicit friendship by the class.
They define a couple of compile-time flags that indicate whether the getSerializer() function is virtual (in the case of DECLARE_SERIALIZABLE or DECLARE_ABSTRACT_SERIALIZABLE) or not (in the case of DECLARE_SIMPLE_SERIALIZABLE), and whether the class is abstract (in the case of DECLARE_ABSTRACT_SERIALIZABLE) or not.
They define some private static data, shared among all instances of the class, where the actual serialization table and serializer are both stored. The serialization table is the virtual container of serializable fields, and the serializer is the helper class that knows how to interact with the table.

Then, inside the .cpp file, the DEFINE and SERIALIZABLE_FIELD macros do the following:

They define the serialization table and serializer for the class, along with the necessary functions for accessing them.
If the serializer is polymorphic, they add an entry in the table to tell GenSerial about the superclass.
They register the serializer in a global hash table, indexed by the name of the class. That way, when a file is read, GenSerial immediately finds the serializer for each class and uses it to create any necessary instances.

Each SERIALIZABLE_FIELD macro simply declares a single entry in the serialization table. Here is where the power of the compiler really comes into play. Based on the type of the field, the compiler stores a single word of type information (for example, a primitive type such as char or double, a smart pointer, contained class, or whatever), an offset indicating where the field is with respect to the beginning of the class, and a pointer to a serializer for that field. It is this serializer pointer for the field that allows us any number of levels of recursion in serializing the most complicated types.

The serializer for a class T is actually only one type of serializer. It is of type Serializer<T> and inherits from SerializerInterface. There are other serializers that conform to the SerializerInterface requirements as well. For example, standard containers such as std::vector have serializers that can iterate any number of times depending on the size of the vector. std::pair has a serializer that iterates twice, once for each part of the pair. Because they all use the same common SerializerInterface, the serializer algorithms generally don't care what specific serializer is being used.

How to Read or Write a Database

GenSerial supplies a simple algorithm for writing a text-based stream version of a database. It proceeds essentially along the lines outlined in Marshall Cline's C++-FAQ (http://www.parashift.com/c++-faq-lite/), using a two-pass algorithm:

On the first pass, the algorithm traverses the database, recording any smart pointers and assigning them a unique integer identifier. It also makes sure that any dependent or contained classes are fully processed.
On the second pass, the algorithm writes out a header containing information about all of the classes that need to be written. This information includes the class name and the field names of the class, as well as a unique type identifier that will be used to identify the class on read. Then it actually writes the database, one instance at a time, to the stream, replacing any pointers in the instances with the unique integer identifiers created on the first pass.

Because each of the two passes of the algorithm involves traversing the database in a similar manner, GenSerial gathers the common functionality into a single function that is in charge of iterating over serialization tables. The iteration is recursive, so that unless a field is of a primitive type such as double or char, iteration also takes place over the field's own serializer.

The process of reading the database is essentially the reverse of the two-pass writing algorithm, with one or two additional wrinkles:

The algorithm reads the header information and creates a map of serializers based on the unique type identifier associated with each class. Because the class name is stored in the header, it can be used to look up the appropriate serializer in the serializer registry. Then the algorithm reads in each instance, using the serializer to create the class and populate each field except for smart pointers. Smart pointers are left null, but their addresses are recorded for resolution on the next step along with the unique integer identifier that was stored in the stream.
Once all the data is read, the pointers are resolved.

The important thing is that both reading and writing algorithms can proceed without needing anything from the classes involved except the access provided by SerializerInterface. That means you can change your classes or serialization algorithms independently of one another.

Other Applications

GenSerial is intended for loading and saving files, and even for import and export with other file formats. But it is also easy to use it as the basis of a generic Undo/Redo system. The system simply identifies the data that is changed by a transaction and uses GenSerial to record it in a stream of bytes. If Undo should become necessary, then GenSerial reads this stream of bytes to restore the changed objects to their original state. In fact, GenSerial even supplies two functions, serializableObjectToTextStream and textStreamToSerializableObject, to make this process easy to implement. These functions handle not only an instance of an object itself, but any dependent objects that might be contained or referred to via smart pointers. You can use the same two functions to implement a Redo that does not require executing the transaction again.

You can also use GenSerial to implement a generic deep copy. For example, copying a complicated graph structure containing vertices, edges, and embedded data is trivial. It simply requires a call to GenSerial's copySerializableObjectAndSubObjects function, which internally serializes and then deserializes the object and any dependent objects.

Limitations

GenSerial has been extensively used in a project involving template classes, multiple inheritance, and all kinds of complicated data structures. It even reads a file correctly if some fields in a class have been removed, added, or rearranged since the file was written. However, it does currently have the following limitations:

GenSerial does not yet provide a generic class upgrade mechanism, although its design should allow the addition of such a mechanism fairly easily.
Although multiple inherited classes are supported, the macros as supplied only allow one of the base classes to contain actual serialized data [2]. A few more DECLARE and DEFINE macros should suffice to remove this limitation.
GenSerial will only serialize pointer fields correctly if they are either boost::shared_ptr or boost::scoped_ptr. It should be easy enough to add support for your own custom smart pointer, but I recommend against supporting raw pointers directly. There are perfectly good alternatives to raw pointers now, and a library that refuses to serialize raw pointers is probably doing you a favor. If you must serialize raw pointers (for example, when dealing with a lot of legacy data structures), you can modify GenSerial to allow it, but think carefully before doing it. It might be easier to replace the raw pointer with a smart pointer type from boost or elsewhere.
GenSerial requires that every serializable class have a default constructor, not necessarily public nor declared explicitly.

Open-source development of GenSerial is continuing, so many of these limitations should disappear in future releases.

Acknowledgments

Macros have been used to aid in serialization for many years, most notably in systems that make use of MFC to serialize data. Macros were used more fully at Parametric Technology Corp. to generate serialization tables for classes in C, and the idea was extended and greatly enhanced in C++by Paul Licameli at Revit Technology, which is where I first came across it. I am particularly indebted to Paul for interesting and informative discussions on the topic.

Notes

[1] This macro does not prevent you from having either a superclass or a subclass of Point, or even from using virtual or multiple inheritance. It will, however, mean that when a Point instance is serialized, it will not be a superclass or subclass instance of Point that is stored, but an instance of Point itself.
[2] This is not actually as much of a limitation as it might sound, because one of the most common uses of multiple inheritance is to implement a protocol hierarchy, where client-accessible interface classes contain no data. In such cases, GenSerial works just fine.

1 2 3 Next

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

C/C++