Dr. Dobb's | The C5 Generic Collection Library

The C5 Generic Collection Library

Collection libraries provide functionality for storing and manipulating collections of related data items.

June 08, 2007
URL:http://www.drdobbs.com/windows/the-c5-generic-collection-library/windows/the-c5-generic-collection-library/199902700

Niels holds a Ph.D. in mathematics and is a software developer at Edlund A/S in Copenhagen. Peter is a professor in information technology at the IT University of Copenhagen. They can be contacted at [email protected] and [email protected], respectively.

Collection libraries provide functionality for storing and manipulating collections of related data items. Typically, they implement array lists, linked lists, hash- and order-based sets, bags (multisets), hash- and tree-based dictionaries, priority queues, and so on. Collection libraries are useful in applications ranging from computer games and computer graphics, to optimization, compilers, operating systems, and web servers, to name a few. A good library lets you work at a higher level of abstraction, uses generic types for improved expressiveness and type safety, and improves performance by providing good algorithms and data structures.

Microsoft .NET 2.0 comes with a standard collection library in namespace System.Collections.Generic (msdn2.microsoft.com/en-US/library/system.collections.generic.aspx). Although useful and fast for many purposes, SCG lacks basic features such as sets and bags, and advanced features such as interval queries on sorted sets and sorted dictionaries, event listeners on collection updates, and the like. And to squeeze out the last nanoseconds of efficiency, the standard collection library mostly uses nonvirtual methods, which limits reuse and extendibility of the collection classes.

Another .NET collection library, Peter Golde's PowerCollections (www.wintellect.com/MemberOnly/PowerCollections.aspx), extends rather than replaces the standard .NET collection classes. However, this approach is also a disadvantage because it inherits some of its limitations. Additionally, the PowerCollections license may prohibit use in projects that involve GPL software.

With all this in mind, we designed and developed the C5 collection library, which supports advanced functionality and accommodates new and currently unforeseen kinds of collection classes. The C5 collection library, which was built for Microsoft .NET 2.0 and Mono 1.2, is open source and available in source and binary form at www.itu.dk/research/c5. The library's license lets you freely use it with commercial and open-source projects.

Design Principles

For starters, we studied existing Java and Smalltalk collection libraries and drew up a list of design objectives:

The library should provide well-known collection abstractions such as lists, sets, bags (multisets), priority queues, double-ended queues, stacks, and dictionaries (finite maps).
The library should provide standard implementations of these, using data structures such as array lists, doubly linked lists, hash-based and comparer-based collections, and interval heaps.
Functionality should be orthogonal so that it can be freely combined.
The library should provide convenient but hard-to-implement functionality, such as persistent sorted collections and sublist views on hash-indexed linked lists.
The library should fit with existing C# concepts such as enumerables, the foreach statement, and events, and make good use of C# 2.0 features such as generic types and methods, the yield statement (iterator methods), nullable value types, and anonymous methods.
The functionality of a collection class should be completely characterized by the interfaces it implements. This permits "programming to interface, not to implementation." For instance, the classes ArrayList<T> and LinkedList<T> have the same functionality and implement the same interfaces.
Good asymptotic complexity is more important than nanosecond efficiency. Thus, existing operations on .NET's standard array list List<T> may be faster by a constant factor than operations on C5 ArrayList<T>, but other operations may simply be nonexistent, and therefore not efficiently implementable on List<T>.
The library should be tested and documented well enough to be widely usable. This includes documentation of the asymptotic running time of each operation on each collection implementation.

Figures 1 and 2 show the inheritance hierarchies of the collection types and dictionary types.

[Click image to view at full size]

Figure 1: The collection type hierarchy.

Figure 2: The dictionary type hierarchy.

Unusual Functionality

In addition to what you'd expect from a good collection library, C5 offers some unusual functionality:

Slidable sublist views that reconcile array lists (fast access by index) with linked lists (fast insertion and deletion of single items), without exposing implementation details such as linked-list nodes. Sublist views make for good orthogonality because all operations that can be applied to lists can also be applied to sublist views. For instance, you need to provide only one Reverse() method, not both Reverse(int start) and Reverse(int start, int length)—the latter two can be performed as Reverse() on suitable sublist views. The same conceptual economy applies to numerous other operations such as several kinds of search, backwards search, tests, shuffling, and so on.
Listenable collection change events allow one to efficiently track changes to collections by attaching event handlers. This makes it easy to build composite collections, such as multidictionaries, using the given components.
Fast snapshots of tree-based collections let you have many almost-identical versions of a set or bag. This supports time- and space-efficient implementation of geometric algorithms.
Handles can be associated with priority queue items to permit fast access, update, or removal of items. In practice, a priority queue is pretty useless without this functionality, yet many collection library implementations do not provide it.

Using C5

To illustrate the use of C5, consider a simple index of a file—a sorted list of all words that appear in the file, and for each such word, a sorted list of the lines in which it appears. We use a dictionary to map each word to a set of line numbers; this avoids duplicate line numbers for each word. Using a tree dictionary makes sure the words come out sorted when the dictionary is printed, and similarly for the line numbers. For each word, the line numbers are represented by a TreeSet<int>, and hence the entire file index is represented by a TreeDictionary<String, TreeSet<int>>.

Let filename be the name of the file. Listing One(a) then builds the index in the tree-dictionary index. The foreach loop body is not optimal: It makes at least two searches in index for each string s; namely, one call to Contains and one or two calls to the indexer index[s]. Since this is a common pattern, dictionaries in the C5 library provide a method bool Find(K k, out V v) that combines the Contains(k) test with the lookup of the associated value v. Using Find, the number of searches is reduced to one or two; see Listing One(b).

(a)
IDictionary<String, TreeSet<int>> index
   = new TreeDictionary<String, TreeSet<int>>();
Regex delim = new Regex("[^a-zA-Z0-9]+");
using (TextReader rd = new StreamReader(filename)) {
   int lineno = 0;
   for (String line = rd.ReadLine(); line != null; line = rd.ReadLine()) {
      String[] res = delim.Split(line);
      lineno++;
      foreach (String s in res)
         if (s != "") {
            if (!index.Contains(s))
               index[s] = new TreeSet<int>();
            index[s].Add(lineno);
         }
   }
}

(b) 
foreach (String s in res)
   if (s != "") {
      TreeSet<int> theSet;
      if (!index.Find(s, out theSet))
         index[s] = theSet = new TreeSet<int>();
      theSet.Add(lineno);
   }

(c)
foreach (String word in index.Keys) {
   Console.Write("{0}: ", word);
   foreach (int ln in index[word])
      Console.Write("{0} ", ln);
   Console.WriteLine();
}

(d)
 ...
characterized: 150
characters: 229 230 319 321
checks: 78
circular: 348
class: 149 320 402 492 516
classes: 7 105 110 117 152 226 232 393
classic: 331
 ...

Listing One

The number of searches can be further reduced to exactly one for each string s by using the C5 method FindOrAdd, at the expense of creating a single empty TreeSet<int> in addition to those needed for all the words in the index. Regardless of how the index is built, it can be printed in alphabetical order using standard C# idioms; see Listing One(c). Listing One(d) is a fragment of the output.

A similar approach can be used to find all anagrams, such as "generate" and "teenager," in a text. From a word, you can compute the corresponding bag of characters, such as {a, e, e, e, g, n, r, t g}; this represents the anagram class of the word. You can create a dictionary mapping bags of characters to sets of words, and populate the dictionary in much the same way as the previous file index.

A Convex Hull Example

The convex hull of a point set is the smallest convex set containing all the points; see Figure 3. The convex hull can be described by a set of corner points of the point set. The classic algorithm for finding the convex hull works by sorting all the points, splitting them into a lower and an upper list (separated by the dashed line in Figure 3), and then scanning each list. The scanning considers triples (p₀; p₁; p₂) of neighbor points, deletes point p₁ if it does not lie outside the line from p₀ to p₂, then goes on to consider the next triple of points. For efficiency, this algorithm should be implemented using linked lists to hold the items (points). If an array list were used, then each point deletion would require sliding of all subsequent points in the array, which is slow.

Figure 3: Convex hull of a point set. The dashed arrow separates the lower and upper point lists. The grey points will be deleted during scanning of the lists; the black remaining points are the "corner" points that span the convex hull.

Then how should we access the items in the linked list? Item access by index is very slow, but we don't want client code to handle individual linked-list nodes (as in the standard .NET LinkedList<T>), because then we have to prevent such code from violating all sorts of invariants—a list node should belong to exactly one list, lists should not be circular, and so on. The C5 solution to this problem is to use slidable list views to access list items in constant time, whether in linked lists or in array lists.

In Listing Two, list contains a sequence of points, and view is a sublist view comprising three points. The while loop executes so long as there are more points in the list. If the middle point p₁ lies outside the line from point p₀ to point p₂, then the view will next be slid to the right; otherwise, p₁ is deleted, and the view will next be slid to the left (unless the view is at the beginning of the underlying list).

IList<Point> view = list.View(0, 0);
int slide = 0;
while (view.TrySlide(slide, 3))
   if (Point.Area2(view[0], view[1], view[2]) < 0) // right turn
      slide = 1;
   else { // left or straight
      view.RemoveAt(1);
      slide = view.Offset != 0 ? -1 : 0;
   }
}

Listing Two

A Multidictionary Example

A multidictionary is a dictionary that maps a key to a set of values. Multidictionaries are found in the C++ Standard Template Library and in Peter Golde's PowerCollections for .NET. The C5 collection library does not include a multidictionary implementation, but one can easily be built from the given dictionary and set classes. Here we show how all operations can be efficient thanks to listenable update events.

A hash-based multidictionary with key type K and value type V can be declared by deriving from a hash-based dictionary whose key type is K and whose value type is a collection of V:



public class MultiHashDictionary<K,V> : 
     HashDictionary<K, ICollection<V>> {
  ...
}

The multidictionary should override the Contains(K) method, which tests whether a given key is associated with any value:


public override bool Contains(K k) {
   ICollection<V> values;
   return base.Find(k, out values) 
     && values != null && !values.IsEmpty;
}

The call base.Find(k, out values) returns True if key k is in the dictionary and if so, binds the associated value to values; otherwise, it returns False. The inherited Contains method cannot be used because it will also return True if k is associated with a null reference or with an empty collection.

Similarly, the multidictionary's indexer this[k] should be overridden to return a new empty collection when k is not associated with any collection, or is associated with a null reference. Moreover, the multidictionary should have a new method Add(K,V) to add a single (key, value) pair:


public virtual void Add(K k, V v) {
   ICollection<V> values;
   if (!base.Find(k, out values) 
         || values == null) {
      values = new HashSet<V>();
      Add(k, values);
   }
   values.Add(v);
}

The negated call !base.Find(k, out values) returns False if key k is in the dictionary; if so, it binds the associated value to values. Otherwise, it returns True. It is worthwhile to contrast the new Add method with the inherited one, which has signature Add(K, ICollection<V>) and can be used only to associate an entire collection with key k.

Note that the Add method uses new HashSet<V>() to create a new hash-based set containing the value v when inserting a key k not already in the dictionary. It might as well have created a hash-based bag, a tree-based set, a tree-based bag, or some other collection of items of type V. This is part of the reason why the C5 library does not provide a multidictionary as a predefined type: Many different kinds of multidictionaries can be constructed, and the C5 library provides the components to construct the one you want. In fact, this multidictionary class could be generalized by adding a further generic parameter W, which should stand for the desired type of value collection:


public class MultiHashDictionary<K, V, W> 
     : HashDictionary<K, W>
   where W : ICollection<V>, new()
{ ... }

In the implementation { ... } of the multidictionary, we just need to replace new HashSet<V>() with new W(), and voilà, we have a typesafe and general multidictionary implementation that works for W being tree set, tree bag, hash set, hash bag, hashed linked list, and so on. Similar to the new Add(K,V) method, there should be a new method Remove(K,V) to remove a single (key, value) pair; it uses much the same logic.

The only thing missing from our multidictionary collection is a Count property that returns the total number of (key, value) pairs in the multidictionary. The inherited Count property returns the number of (key, value set) pairs in the base dictionary, which may be both smaller and larger than the number of (key, value) pairs. The straightforward way to compute Count is to sum the counts of all value sets, as in Listing Three(a). However, this takes time linear in the number of keys in the multidictionary. If the Count property needs to be computed frequently, this may be very slow. We could address this problem by adding a private field count to the MultiDictionary<K,V> class, and then override methods in the multidictionary to try to keep the count field up to date. However, this does not work because one may add items to a value set without calling any methods on the multidictionary. Also, a value set may be associated with more than one key in the multidictionary, so that each addition or removal should be counted more than once. What can you do?

(a)
public new virtual int Count {
   get {
      int count = 0;
      foreach (KeyValuePair<K,ICollection<V>> entry in this)
        if (entry.Value != null)
          count += entry.Value.Count;
      return count;
   }
}

(b)
public class MultiHashDictionary<K,V> : HashDictionary<K, ICollection<V>> {
   private int count = 0; // Cached value count, updated by events only
   private void IncrementCount(Object sender, ItemCountEventArgs<V> args) {
      count += args.Count;
   }
   private void DecrementCount(Object sender, ItemCountEventArgs<V> args) ...
   private void ClearedCount(Object sender, ClearedEventArgs args) ...
   public MultiHashDictionary() {
      ItemsAdded +=
         delegate(Object sender,
            ItemCountEventArgs<KeyValuePair<K,ICollection<V>>> args) {
         ICollection<V> values = args.Item.Value;
         if (values != null) {
            count += values.Count;
            values.ItemsAdded += IncrementCount;
            values.ItemsRemoved += DecrementCount;
            values.CollectionCleared += ClearedCount;
            }
         };
      ItemsRemoved += ...; // Similar to ItemsAdded
   }
   ...
}

Listing Three

The solution is straightforward—use C5's collection update events. We declare event handlers for adding and removing items from a value set and for clearing a value set, and attach all three event handlers to every value set. But how do we make sure to attach these handlers to every value set, regardless of how and by whom it was created? We use collection update events once more, this time attaching event listeners not to the value sets but to the base dictionary. When a value set is added to the dictionary, the three value set event listeners get attached to the value set, and if the value set is later removed again, then the event listeners get detached. This automatically handles the case where the same value set is associated with multiple keys in the base dictionary. Listing Three(b) presents a solution. Now it is easy to define a constant-time Count property; it simply returns the count field:


public new virtual int Count {
   get { return count; }
}

Clearly, it is worthwhile to install all the event listeners and maintain the count field only if the Count property may be called often. The point of the example is that the C5 library provides a safe and transparent means for maintaining derived fields (such as count), whereas other collection libraries do not.

Implementation Challenges

Some of the functionality in C5 is difficult to implement efficiently—the combination of hash indexes and sublist views on linked lists, for instance.

First, it is fairly easy to provide each of these two extensions in isolation. A hash-indexed linked list in C5 is implemented by maintaining a hash dictionary that maps an item x to the list node in which it appears (if any). A sublist view is implemented as a pair of the list node just before the view and the list node just after the view, and all proper lists have an artificial sentinel node added before and after the list.

However, combining the two features is nontrivial because hash-based item lookup should work on sublist views too, and we want to share a single hash dictionary between all views of a list (or else view sliding would be extremely slow). So when performing hash-based item lookup on a view, we need to quickly decide whether the resulting item falls within the given view, not somewhere else in the underlying list. Traversing the view to see whether the item is within the view would be slow and would defeat the purpose of the hash index.

Instead, we adopt a list-ordering algorithm due to Sleator and Dietz in the version described by Bender (theory.lcs.mit.edu/ ~edemaine/papers/DietzSleator_ESA2002/ paper.pdf). The basic idea is to give each list node an integer tag so that nodes appear in a strictly increasing tag order. The Sleator-Dietz-Bender algorithm describes how to efficiently redistribute node tags at insertions so that a newly inserted node can be given a tag strictly between those of its neighbors. In the C5 implementation, we use a two-level version of this basic scheme, by maintaining tags on groups of nodes at the first level, and tagging the groups themselves at the second level.

Conclusion

Informal feedback from users has been very positive, showing that the C5 library provides useful, advanced and well-tested collection classes.