Dr. Dobb's | DataStructures as Objects

DataStructures as Objects

Improving programmer productivity becomes increasingly difficult as programs grow more and more complex. Jiri examines what's involved in writing complex programs, which include a complex interaction among objects and employ many data structures and design patterns.

October 01, 1999
URL:http://www.drdobbs.com/cpp/datastructures-as-objects/184411070

Oct99: <b>DataStructures as Objects

Jiri is president of CodeFarms and author of Taming C++: Pattern Classes and Persistence for Large Projects. He can be contacted at [email protected].

In almost every case, software differs significantly from application to application. Telephone switches and airline reservations databases, for instance, aren't the same as C++ compilers, text editors, missile-control system, or stress calculations for concrete bridges. Consequently, it is unlikely that we could improve software productivity equally for such diverse applications.

In particular, complex programsthose which include a complex interaction among objectsemploy many data organizations (data structures or design patterns). They use so many, in fact, that practically every class participates in one or more data organization. Organizations such as collections, aggregates, graphs, or hash tables involve two classes; many-to-many involves three; pattern composites involve three or more; while a tree involves only one class. For examples of complex systems, see Table 1.

An Example

Figure 1 illustrates an electrical circuit where blocks B1,B2,... have terminals T1,T2,... connected into nets N1,N2,... Nets represent only logical connections. A physical implementation of the circuit would represent connections as wires, composed of horizontal and vertical segments of certain width. A typical operation used in many algorithms is, for a given net, to find all the blocks to which it is attached or, for a given block, to find all adjacent nets.

Figure 2 is the Unified Modeling Language (UML) diagram describing this problem. There are four classes involved in four one-to-many relations. Each terminal is attached to exactly one net and exactly one block. Listing One shows how you would implement these relations in C. Because this implementation style is efficient both in space and in the access speed of the data, it is frequently used in performance-sensitive applications (sometimes even in C++ today).

Still, this implementation has problems:

The numerous pointers implementing the data permeate structures, and unless you have a diagram at hand, you can't see what is going on. In this example, there are only four relations. Imagine a problem with 25 relations!
It is easy to make an error in setting one of the pointers. Such errors cause program crashes, and are extremely hard to find. If you visited any programming department these days, you would find many programmers walking through the pointer chains with a debugger.
Maintenance and modifications of such programs are difficult. Replacing or removing a data structure means, more or less, wading through the entire code.
The same data structures are coded again and again, without any efficient code reuse.

Most C++ programmers would likely implement this data organization using a collection class from a library, such as the Standard Template Library (STL) or Rogue Wave's tools.h++, which also have iterator classes associated with them. A good C++ implementation would also use a String class for all names, and cout<< instead of printf(), but that is not essential here. For instance, the C++ implementation in Listing Two is an improvement over the C version in Listing One in the following ways:

Some, but not all, of the pointers disappeared from the class (structure) definitions, and even though you still have to walk through all the classes, it is easier to see the underlaying data organization.
It is much less likely to make a pointer error, because we use safe, well-debugged library functions such as Collection<T>::add().
Modifications of this code are also easier. For example, if you remove collection Circuit::blocks from class Circuit, the compiler tells you about all the places that have to be modified.

It may come as a surprise, but the C++ version implements a completely different data structure than the original C version. The reason is that practically all class libraries implement the collection not as a linked list, but as an array of pointers; see Listing Three.

Why? Because this is the way collections are implemented in Smalltalk, and the first extensively used class librarythe NIH class library by Gorlen et al. (see Data Abstraction and Object-oriented Programming in C++, by K.E. Gorlen, S.M. Orlow, and P.S. Plexico, John Wiley 1990, ISBN 0-471-92346)followed the Smalltalk style. Generic linked lists can be implemented with C++ templates, but both the implementation and the use of array-based collections are easier. This, of course, is a chicken-and-egg situation, because templates were introduced after array-based collections were already in use, which influenced the design of templates.

Besides some effect on the required space and performance, this also means that the Collection is a Bag, instead of a Set as in the C version. In the original problem, a Terminal can be only under one net and one block. The Collection (Bag) is better suited for problems such as Figure 3, where a student can take several courses. As Listing Four illustrates, however, two Collections are required if you need two-way access.

Missed Opportunity

By switching from linked lists to arrays of pointers, we missed an opportunity that improves the robustness and the quality of any data structure. When coding linked lists in both C and C++, it is much better to use rings, rather than NULL ending lists found in every textbook.

When an object is attached to a ring, its pointer is never NULL. When an object is disconnected, you can set all its pointers to NULL. This permits a checkin the run time and with only a few if statementsto determine whether or not an object can be safely added to the data structure or destroyed.

For example, in the C version in Listing One, you can rewrite functions addTermToNet() and addTermToBlock() to be safer when adding an object to a list, or when destroying it; see Listing Five. This technique must be used by the class library, not just by the application programmer. Since I started using it, I rarely need a debugger, and my data structures are free of errors more or less after the code compiles. By going to array-based collections, we deprived ourselves of this opportunity.

The C implementation is what is sometimes called an "intrusive data structure" because pointers or other values are inserted into the objects that participate in it. The array-based Collection<T> is an example of an "indirectly linked data structure" where you can add a collection of terminals to the Circuit, without adding any pointers or other members to the Terminal class (for more information, see my article series "Intrusive Data Structures," The C++ Report, May/July/October 1998).

Even though the C++ version is much better than the C one, it nonetheless suffers drawbacks:

You can add the same terminal to two nets, and the program will not catch it. Also, if you destroy a terminal or a block that is part of the data structure, the program will mysteriously crashsometimes much later than the moment when this happened, and the error is extremely hard to find.
The definition of the data structure is still spread through all the classes. In real-life applications, programs with more than 15 or 20 collections are common, and to keep track of how the data is connected is physically impossible, even for top-grade programmers.
Some data structures are inherently intrusive. In Figure 1, Terminal must have a pointer to Net. In Figure 3, Student must have a Collection<Course>. This means that the definition of one data structure is spread over several classes.

The first drawback can be solved by reference counting, but complicates the logic of the code, and implies performance and space penalties. The second and third drawbacks are the primary reasons for introducing graphical diagrams such as UML.

Spaghetti++

The fact that we need UML diagrams means that something in our programming style is out of control. We cannot follow what is going on. A similar thing happened in the past. Flowcharts were considered essential, until structured programming was invented. Very few people use flowcharts today.

Before we eliminated goto statements from programs, program statements created a complicated graph. Today, the graph is reduced to a tree, which is easier to follow. Prior to structured programming, we had a graph of statements. Now we have a graph of classes. The complexity moved one level up, but is showing its horns again. Part of this messiness is that data structures and design patterns are not represented as objects. They are built into our classes, but do not stand out as clearly defined entities.

Assume you design a hypothetical class library that provides a class for each data structure. To introduce a data structure into your code, you create an instance of that class, but you do not attach it to any of your objects. This class gives you the complete interface needed for the data structure. The pointers/arrays that form the data structure will beby some magicautomatically inserted under the statement (className) PARTICIPATES. This statement indicates that class "className" participates in one or more data structures. You don't want to know what is under that statement or how it got there; see Listing Six.

This has a remarkable effect. The data structures completely disappeared from the class definitions, and are now all together as one compact schema definition. Those few lines are equivalent to a UML diagram, and an automatic conversion back and forth would be simple to implement. Our schema actually provides more information (implementation details) than the UML diagram, and because it is an integral part of the code, it can never be out of date. Hopefully, the library is implemented with rings, so it is also protected against pointer errors or mistakes in using the library.

If you want to start the architecture design without implementing specific details, you can limit yourself to one-to-many and many-to-many organizations in the beginning. As your design grows, you replace them later by some other data structure.

What I've said to this point can be interpreted in another way. Without any performance penalty, you are treating program internal data as a memory-resident, custom-designed database. You have the schema that describes the logic of the database, data integrity is guaranteed by implementing the data structures with rings, and methods of the hypothetical class library form the database interface.

If you work with the class libraries commonly in use today (see Listing Two, for example), class Block has a member termsOnBlock, which is a collection of Terminals. To add a terminal to this collection you write:

Block *bp; Terminal *tp;

...

bp->termsOnBlock.add(tp);

The logic of this statement says: "Go to block bp, get its data structure termsOnBlock, and add tp to it." This reflects the philosophy of the data structures being built-in and distributed throughout the application classes.

When using our hypothetical class library, there is a paradigm shift in how the data is accessed. The difference may appear insignificant, but it changes the concept of how you treat the data structures. Adding a terminal to a block now becomes:

Block *bp; Terminal *tp;

...

termsOnBlock.add(bp,tp);

This says: "Go to data structure termsOnBlock and, under block bp, add terminal tp to this data structure." The logic is different. Data structures have been elevated to the same level of visibility as application classes. If you want to know what data structure is involved here, you go to the schema and instantly see it.

Figures 4(a) and 4(b) show the difference between how the two different approaches look at the same data structure. In the new approach in Figure 4(b), there is a duality between data objects and their relations. Relations (data structures) are treated as self-standing objects, not as something that must be attached to an application class. The new approach always gives you two rows of entities with easy-to-understand links between them, and if you encounter a reference to any data structure somewhere in the code, you just go the schema and see right away what the data structure does. For more complex data structures, the commonly used approach in Figure 4(a) results in a network of many objects, which is difficult to interpret or remember. I call this situation "Spaghetti++."

A New Approach

The central idea of the new approach is to evolve one piece of code that compiles and runs correctly at all stages of the designeven as a preliminary architecture design. There is no rapid prototyping or code redesigns. The software is designed rapidly, but as it evolves into the final product, it is always testable and always correct.

You begin with a skeleton of all the classes you neednot many members in them yetas in Listing Six. You define the data structure by typing in the schema, and then code some simple functions. For example, howBig() in Listing Seven returns the current size of the problem. Such a function permits instant evaluation of the impact of changes in the size of the internal data; for example, when replacing a collection by a hash table.

Another useful function is Circuit::autoCheck(), which traverses the data in all chains and directions, and verifies that everything is correct. Calling this function is much more useful than walking through the data with a debugger. From the beginning of the project, I also like code functions such as those shown in Listings Seven and Eight. Function inputData() lets you create tests for special situations, and randomData() lets you test for large, realistic data sets.

The system architect then gives the programmer a small running program with the schema in place. This improves the communication and ensures that the intended architecture is really implemented. The coding is easier with the schema always at hand.

Most of the errors related to the data organization are detected by the compiler. Remaining data errors are automatically detected at run time. Bugs caused by errors in the program logic cannot be prevented and must be found by standard methods, but these errors are usually easier to find, especially when data errors are immediately detected.

Testing is faster and much more thorough, because the program has been tested repeatedly during its evolution, using functions such as autoCheck(), prtNets(), and so on. With each new layer of logic, new test functions are added.

Every program eventually reaches the point where it has to be redesigned. When using the schema, even major changes to data organization are simple. For example, if you have the schema in Listing Six:

Collection<Circuit,Block> blocks;

Collection<Circuit,Net> nets;

Aggregate<Block,Terminal> termsOnBlock;

Aggregate<Net,Terminal> termsOnNet;

and decide to replace the Collection of blocks by a hash table and add to each Circuit a collection of all Terminals, you just change the schema to:

Hash<Circuit,Block> blocks;

Collection<Circuit,Net> nets;

Collection<Block,Terminal>termsOnBlock;

Aggregate<Net,Terminal> termsOnNet;

Collection<Circuit,Terminal> allTerms;

The compiler pinpoints all places in the code that you have to update. Besides making modifications easy, the schema helps new programmers to quickly penetrate the organization of the data, and avoid costly mistakes.

Persistent Data

When the software requires storage of complex data to disk, 50 percent of the total development time (sometimes more) is spent on this task alone. Some C++ class libraries support serialization, but coding two serialization functions for every class is tedious and error prone. At one time, I had the opportunity to examine a commercial system for business data processing that had over 1 million lines of C++, one third of which were serialization functions. Because of the complexity of serialization functions, it reached the point that nobody dared to add any new features to that entire system. Also, serialization is the least efficient method of storing data to disk (see my book Taming C++: Pattern Classes and Persistence for Large Projects, Addison-Wesley 1994, ISBN 0-201-52826-6).

If we can supplement our hypothetical library by automatic persistence, which would not require coding serialization functions (as in Java, for example), we would also significantly improve the design processbut only for those applications that store data to disk. The mechanism can be implemented as a DiskUtility class, which we add to the schema:

DiskUtility util;

we call

util.save("myFile",cp,"Circuit");

when saving circuit cp and all connected data to myFile, or we call

util.open("myFile",&cp,"Circuit");

when reading data from myFile and returning the new cp pointer.

In Java, you don't have to code serialization functions, but internally the storage mechanism uses serialization.

Practical Results

My company (CodeFarms) has implemented two libraries that closely resemble the hypothetical library described here. Data Object Library (DOL) (originally called Organized C++) uses a code generator to create blocks of pointers hidden under PARTICIPATES(...). It also generates functions that make the data persistent. The Pattern Template Library (PTL) uses C++ templates and multiple inheritance to implement the PARTICIPATES(...) code segments. It does not provide persistence. (Both manuals and versions of these libraries for major platforms and compilers are freely available at http://www.codefarms.com/.)

DOL includes several methods of automatic persistence, one of them (called "memory blasting") is an order of magnitude faster than serialization. DOL's drawback is that, prior to compilation, a code generator has to be called. The generator requires only 1/10 of the compilation time, and has to be called only when the schema, inheritance, or members of some classes changenot before every compilation. The code generator does not mangle existing source, it only generates a new header file, which must be included. However, the additional step is not smoothly supported by integrated environments such as Microsoft Developer Studio.

PTL's drawbacks are multiple layers of multiple inheritance, which are transparent but still make classes internally complex. In addition to data structures, PTL also provides design patterns not available in DOL such as Composite, Flyweight, and a fast, dynamically reconfigurable Finite State Machine.

In general, our experience is that, with this new approach, complex projects that do not need persistence can be developed three to five times faster, depending on how users are acquainted with the new technology. When persistence is required, a 5- to 10-fold improvement can be expected. Part of this success is that the new method decreases the required manpower, which in turn decreases the communication among people, which then results in an additional reduction of the required development time.

Is STL a Good Standard?

There is an interesting anomaly in STL: Its iterators are already coded in the style I recommend (independent interface classes), but its data organizations are designed to be either members ofor be inherited byapplication classes. It does not support the paradigm shift described here, and it does not provide multiway data structures such as aggregate, graphs, or many-to-many. Practically all STL classes are two-class, one-way organizations.

If the new approach dramatically improves the efficiency of the design process (and I believe I gave enough evidence), then STL should either be redesigned, replaced, or complemented with classes for two-way data structuressee Al Stevens's "C Programming" column (DDJ, December 1998) for an implementation of two-way data structures with STL.

What About Java?

Though Java's creators claim they removed pointers, Java's libraries still work with references and are subject to potential misuse as described for STL earlier. Java has automatic persistence, but it is based on serialization, and its performance is even poorer than that of the C++ serialization.

I can't figure out how to implement the hypothetical library in Javamaybe you can. DOL uses a code generator to create the PARTICIPATES() statements, and inserts them as macros. In PTL, PARTICIPATES() statements hide multiple inheritance. Neither method would work in Java.

Any language that allows a seamless implementation of the new approach will quickly topple all other languages because of the enormous potential to improve software productivity.

DDJ

Listing One

struct Circuit {
    Block *blocks;
    Net *nets;
};
struct Block {
    Block *next;
    Terminal *term;
    char *name;
};
struct Net {
    Net *next;
    Terminal *term;
    char *name;
};
struct Terminal {
    Block *nextOnBlock;
    Block *nextOnNet;
    Block *block;
    Net *net;
    char *name;
};
/* print all nets in the circuit */
void prtNets(Circuit *c){
    Net *np;
    for(np=nets; np; np=np->next){
        printf("%s\n",np->name);
    }
}
/* print nets connected to block bp */
void prtTerms(Block *bp){
    Terminal *tp;
    for(tp=bp->term; tp; tp=tp->next){
        printf("%s\n",tp->net->name);
    }
}
/* add Terminal to a Block */
void addTermToBlock(Terminal *tp, Block *bp){
    tp->nextOnBlock=bp->term;
    bp->term=tp;
    block=bp;
}
/* add Terminal to a Net */
void addTermToNet(Terminal *tp, Net *np){
    tp->nextOnNet=np->term;
    np->term=tp;
    tp->net=np;
}
/* create Terminal, attach it to block and net */
Terminal *createTerminal(Block *bp, Net *np, char *name){
    Terminal *tp; char *p;
    tp=(Terminal*)malloc(sizeof(Terminal));
    p=malloc(strlen(name)+1);
    if(!tp || !p)return NULL;
    addTermToBlock(tp,bp);
    addTermToNet(tp,np);
    strcpy(p,name);
    tp->name=p;
    return tp;
}

Back to Article

Listing Two

template<class T> Collection {
public:
    void add(T *tp);
    ...
};
template<class T> CollIterator {
public:
    CollIterator(Collection<T>& col);
    T* operator++();
    ...
};
class Circuit {
    Collection<Block> blocks;
    Collection<Net> nets;
public:
    void prtNets();
};
class Block {
    Collection<Term> termsOnBlock;
    char *name;
public:
    void prtTerms();
};
class Net {
    Collection<Term> termsOnNet;
    char *name;
};
class Terminal {
    Block *block;
    Net *net;
    char *name;
public:
    Terminal(Block *bp, Net *np, char *tName);
};
/* print all nets in the circuit */
void Circuit::prtNets(){
    Net *np;
    CollIterator<Net> it(nets);
    while(np= ++it){
        printf("%s\n",np->name);
    }
}
/* print nets connected to block bp */
void Block::prtTerms(){
    Terminal *tp;
    CollIterator<Terminal> it(termsOnNet);
    while(np= ++it){
        printf("%s\n",tp->net->name);
    }
}
/* replaces the original function createTerminal() */
Terminal:Terminal(Block *bp, Net *np, char *tName){
    name= new char[strlen(name)+1];
    if(name)strcpy(name,tName);
    bp->blocks.add(this);
    block=bp;
    np->nets.add(this);
    net=np;
    return tp;
}

Back to Article

Listing Three

template<class T> Collection {
    T **p;
    int sz,used;
    void growArray();
public:
    Collection(){p=NULL; sz=used=0;}
    void add(T *tp){
        if(used>=sz)growArray();
        p[used]=tp; used++;
    }
    ...
};

Back to Article

Listing Four

class Course {
    Collection<Student> students;
    ...
};
class Student {
    Collection<Course> courses;
    ...
};

Back to Article

Listing Five

void addTermToNet(Terminal *tp, Net *np){
    if(tp->nextOnNet || tp->net){
        printf("error: cannot add term=%x\n",tp);
        return;
    }
    tp->nextOnNet=np->term;
    np->term=tp;
    tp->net=np;
}
/* returns NULL when np properly destroyed */
Net *destroyNet(Net *np){
    if(next || term || name){
        printf("error: net=%x not disconnected\n",np);
    }
    else {free np; np=NULL;}
    return np;
}

Back to Article

Listing Six

// ---------- class library --------------------
// In Collection, Child does not know its Parent.
template<class Parent, class Child> Collection {
public:
    void add(Parent *pp, Child *cp);
    ...
};
template<class Parent, class Child> CollIterator {
public:
    CollIterator(Parent *pp);
    Child* operator++();
    ...
};
// Aggregate is a Collection, where each Child keeps a pointer to its Parent.
template<class Parent, class Child> Aggregate {
public:
    void add(Parent *pp, Child *cp);
    ...
};
template<class Parent, class Child> AggrIterator {
public:
    AggrIterator(Parent *pp);
    Child* operator++();
    ...
};
// -------- application code -------------------
class Circuit {
    PARTICIPATES(Circuit);
public:
    void prtNets();
};
class Block {
    PARTICIPATES(Block);
    char *name;
public:
    void prtTerms();
};
class Net {
    PARTICIPATES(Net);
    char *name;
};
class Terminal {
    PARTICIPATES(Terminal);
public:
    Terminal(Block *bp, Net *np, char *tName);
};
/* --------  DATA STRUCTURE SCHEMA -------- */
Collection<Circuit,Block>  blocks;
Collection<Circuit,Net>    nets;
Aggregate<Block,Terminal> termsOnBlock;
Aggregate<Net,Terminal>   termsOnNet;
/* ---------------------------------------- */
/* print all nets in the circuit */
void Circuit::prtNets(){
    Net *np;
    CollIterator<Circuit,Net> it(this);
    while(np= ++it){
        printf("%s\n",np->name);
    }
}
/* print nets connected to block bp */
void Block::prtTerms(){
    Terminal *tp;
    AggrIterator<Block,Terminal> it(this);
    while(np= ++it){
        printf("%s\n",tp->net->name);
    }
}
/* replaces the original function createTerminal() */
Terminal:Terminal(Block *bp, Net *np, char *tName){
    name= new char[strlen(name)+1];
    if(name)strcpy(name,tName);
    blocks.add(bp,this);
    nets.add(np,this);
    return tp;
}

Back to Article

Listing Seven

int Circuit::howBig(int numBlocks,int numNets,int numTerminals){
    return numBlocks*sizeof(Block) +
           numNets*sizeof(Net) +
           numTerminals*sizeof(Terminal);
}

Back to Article

Listing Eight

    // input a small data set from a manually coded ascii file;
void Circuit::inputData(char *fileName);
    // generate large pseudorandom data of given size
void Circuit::randomData(int numTerms,int maxTermsOnNet,int
				maxTermsOnBlock);

Back to Article

Figure 1: Example of a logical circuit.

Figure 2: UML diagram for the circuit from Figure 1.

Figure 3: In this situation, the bag is the proper representation between Course and Student.

(a)

(b)

Figure 4: (a) Mental concept when using existing class libraries; (b) mental concept when using the hypothetical library.

Table 1: Examples of the new method in use in some complex systems. (*)= Total effort from concept to the release of the tested product. (+)= Proprietory information or information not available.