Dr. Dobb's | Inside the db4o Database

Inside the db4o Database

Among other features, db4o, an open-source object database for Java and .NET, has built-in support for synchronization.

May 16, 2006
URL:http://www.drdobbs.com/database/inside-the-db4o-database/187203597

Rick is a QA Engineer for Compuware's NuMega Labs, where he has worked on both .NET and Java projects. He can be contacted at [email protected].

In the context of databases—and object databases, in particular—replication is the ability to duplicate an object from one database (call it "database A") into another ("database B"). The duplication is performed in such a way that both object instances can be modified, so that at a later time the duplicated object can be returned to its original database, and any differences between the object in A and the object in B can be resolved. There are variations on this scenario, but that's the basic idea.

Users of mobile devices might recognize this as "synchronization." In such applications, the desktop system maintains a set of "master" databases that are replicated onto the handheld device. In the process of using the handheld, you add phone numbers, delete appointments, create new shopping lists, and so on. And, at some future point, you synchronize the mobile device with the desktop, which records the changes you've made into the master database. In short, replication allows a subset of information to be drawn from one database into another, and permits the user to operate on that subset in a disconnected fashion.

The Need

Suppose you want to implement replication. What capabilities do database systems need to provide for error-free—or, at least, as close to error-free as we could reasonably get—reconciliation of data modified by disconnected users? (In this article, I assume that the databases employed are object databases. So, when I speak of "data," I speak of "objects.")

Certainly, the first requirement would be some way of identifying an object, even as clones of it are replicated from one database to another. In other words, when you put an object into the database, a unique identifier must be attached to it. That identifier must be bound to the object in such a way that, when the object is replicated into another database, the identifier goes with it. (You cannot count on the object's data content as the sole means of identifying that object.) Ideally, this identifier will be invisible and unmodifiable, except under unusual circumstances. After all, there is no reason to make its presence and value known to anything other than whatever framework handles the replication process.

Furthermore, this identifier must be universally unique. Suppose you begin with database A. You replicate a subset of its data into database B. Then you replicate data from database A into database C (some of the objects in C may also be objects in B). Suppose further that, after replication, object X was created in B, and object Y was created in C, and that both objects are of the same class. We must be guaranteed that the unique identifiers of X and Y are universally unique, even though databases B and C (and, in fact, A as well) are disconnected. If, by some chance, object X and object Y received the same IDs, then it would be difficult to impossible to reconcile both databases back to the master database.

Accurate reconciliation of disconnected databases also demands that some sort of version information be attached to an object when the object is modified. This is necessary so that synchronization can determine which of two replicated objects is the most up to date. Ideally, this would be information along the lines of "this object was modified at date and time xxx." At the least, we need a modification flag, such as the one on the Palm OS, which indicates when a record has been made dirty. Otherwise, the synchronization application will have to examine each and every object's content (that is, each object passing through synchronization), comparing it with the original (the object in the master database), to see if any changes have occurred, and attempt to deduce from those changes which object is older. Such resolution code would be unacceptably complex and slow.

Finally, a synchronization system must provide a mechanism that allows the developer to manage conflict resolution. Under normal circumstances, when an object that has been replicated is being reconciled back to the main database, the replication framework code can examine version information (assuming it exists) and determine whether the master database object or the replicated object is the most up to date. Conflicts occur when the version information is such that the winner is not apparent. The developer must be able to provide code that can resolve the confusion.

An Implementation

db4o from db4objects (www.db4objects.com) is an open-source object database that provides the capabilities that I've described. db4o is available in Java and .NET flavors; in this article, I use its Java version.

Besides being open source, db4o's features include an easy-to-grasp API, equally lucid persistence model, and the fact that it is a true object database (not an object/relational mapping database). Add to this its invisible transactioning and its adherence to ACID (atomicity, consistency, isolation, and durability) behavior, and you have an embeddable database that concentrates a lot of power in a small area.

The entire db4o library is provided in a single .JAR file (or .DLL if you're using .NET), and you'll typically find yourself handling better than 90 percent of your database work with about 10 API calls. In addition, as far as db4o is concerned, all objects are persistence capable. Classes whose objects are to be persistent need not be derived from a persistence-aware base class, nor must persistent classes implement any specialized persistence interface. If you want to put an object into the database, you more or less tell db4o, "Please put this object into the database," and it does the rest.

In fact, db4o doesn't need any description of a persistent object's structure (a schema file). When you tell db4o to put an object in the database, the runtime engine uses reflection to "spider" the object's architecture and deduce how the object is to be stored. The spidering works in reverse, too; when you fetch an object, db4o need not be told anything about that object to properly instantiate it.

Replication is relatively new in db4o, having appeared in the 5.0 release. Replication support within a given database is optional, and for good reason. Again, replication requires that db4o attach a universal identifier to each object, as well as maintain versioning information. This necessarily consumes space in the database, so it should be enabled only when required.

Specifically, the universal identifier that db4o constructs is composed of two parts:

An object ID number. Every object in the database has associated with it a unique 32-bit long. This identifier is unique within the database.
A database signature. This signature is associated with the entire database. Typically, it is about 30 bytes long, and is derived from a variety of sources. For example, one element used is the host name of the computer that the database was originally created on. The signature is guaranteed unique for all db4o databases.

So, the logical concatenation of the signature with the object ID provides a truly universal identifier for any object in any db4o database.

This, however, is not all the information you need for complete replication support. Again, versioning must be attached to objects so that modifications can be properly reconciled at synchronization. To this end, when databases are replicated, db4o records the transaction number (in both databases) of the replication operation. This transaction number effectively becomes the object's version number. (The original transaction number is also retained.) Any subsequent modification to the object causes its version number to be incremented. Consequently, when synchronization takes place at a future point, db4o compares version numbers to see which objects have been modified since the last replication.

Here, db4o takes a cautious approach. It does not presume that the object with the higher version number is the winner. Instead, db4o uses the following replication plan:

If neither object has changed since the last replication, nothing happens.
If one object has changed, and the other has not, the changed object wins (overwriting the loser).
If both objects have changed, then a user-supplied conflict resolution routine is called.

Consequently, db4o passes off to you the problem of determining which object is the more current. db4o does not assume that a bigger version number indicates a later object. The aforementioned conflict resolution routine is a sort of deus ex machina method that must be supplied to the replication process, and that steps in to decide which object is the winner.

Have "Patients"

To illustrate db4o's API, how replication is crafted, and how a conflict resolution method is supplied, I present a patient-weight monitoring database—one that a nutritionist might use to keep track of pounds that patients have added or lost.

The database holds objects from two classes: Patient and WeightEntry; see Listing One(a). (For clarity, I only show the data fields of the classes.)

(a)

public class Patient
{  private String name;
   private long patientID;
   private ArrayList weightHistory;
 ... methods for Patient ...
}
public class WeightEntry
{  private float weight;
   private long weightDate;
 ... methods for WeightEntry ...
}

(b)

public void addWeight(float _weight,
  long _timeMillis)
{
  // Create new WeightEntry
  WeightEntry _weightEntry =
    new WeightEntry(_weight, _timeMillis);
  // Attach it
  this.weightHistory.add(_weightEntry);
}

Listing One

The fields defined for Patient are reasonably obvious; they hold the patient's name and a system-assigned patient identification number.

The weightHistory ArrayList tracks the patient's weight. So, when a new weight is recorded, a new WeightEntry object is created and placed on the end of the ArrayList. (We are assuming, here, that new weight recordings are placed at the end of the list, so it stays in a time-sorted order.)

Of course, before replicating anything, you have to get it into the database in the first place. First, let's define a method—a member of Patient—that lets you add an entry onto the ArrayList, as in Listing One(b) (also available electronically). This method instantiates a WeightEntry object, initializes its fields, and attaches it to the ArrayList. Now, you can create a patient and store his information into the database; see Listing Two.

// Make the database replication-capable
Db4o.configure().generateUUIDs(Integer.MAX_VALUE);
Db4o.configure().generateVersionNumbers(Integer.MAX_VALUE);

// Open a database (Create if it does not exist)
ObjectContainer patientDB = Db4o.openFile("PATIENTA.YAP");
// Create a new Patient
Patient _patient = new Patient(
  "John Doe",
  001L);
// Add a weight
_patient.addWeight((float)190.0,
  java.lang.System.currentTimeMillis());
// Put the new patient in the database
patientDB.set(_patient);
// Commit the transaction and close the database
patientDB.commit();
patientDB.close();

Listing Two

This is really all you need to store the patient John Doe, whose ID is 001, and whose weight at the time this snippet was run was 190 lbs.

Again, db4o's API is remarkably succinct. First, the class that models a database is db4o's ObjectContainer. To put an object into a database, you merely need to pass a reference to that object to the ObjectContainer's set() method. The db4o engine discovers the object's structure invisibly, regardless of the depth and complexity of that structure. So, in our example, not only is the Patient object stored, but its member ArrayList (weightHistory) and the WeightEntry object (within the ArrayList) are all stored as well. This characteristic—that all objects referenced by an object that is persistent are also persistent—is referred to as "persistence by reachability."

Notice that, before I opened the database, I called a pair of methods that configured the db4o engine to make the database replication capable. Specifically, I set global parameters so that any database subsequently created by the application would have UUIDs and version numbers enabled. Again, these options are going to cost us a bit in terms of performance and space consumed, so you only set them for databases to be replicated.

First Replication

Once you have a database created and filled with patient information, suppose you want to replicate that information into another database. I refer to the original database as PATIENTA and the database being replicated into as PATIENTB.

In db4o, replication is a three-step process:

First, you must instantiate a ReplicationProcess object. This object must implement a ReplicationConflictHandler(), which is the deus ex machina callback method mentioned earlier. This method is called whenever the version information in both objects has changed since the last synchronization (per db4o's replication plan).
Next, you define a query that guides the replication process. In most cases, this query is crafted to return all the objects in the database (thus causing all objects to be replicated). You can modify the query to restrict it to a subset of the database's objects, and thereby fine tune which objects participate in the replication.
Finally, you execute the query. Each object returned by the query is handed over to the ReplicationProcess object's replicate() method, which does all the low-level dirty work. It all looks like Listing Three.

// Set things up for replication
Db4o.configure().generateUUIDs(Integer.MAX_VALUE);
Db4o.configure().generateVersionNumbers(Integer.MAX_VALUE);
// Open both databases
ObjectContainer patientADB = Db4o.openFile("PATIENTA.YAP");
ObjectContainer patientBDB = Db4o.openFile("PATIENTB.YAP");
// Create a ReplicationProcess object
  patientADB.ext().replicationBegin(
    patientBDB,
    new ReplicationConflictHandler() {
      public Object resolveConflict(
        ReplicationProcess replicationProcess,
        Object a,
        Object b) { return a; }
  });
//  PATIENTB
replication.setDirection(patientADB,patientBDB);
// Do the replication
Query q = patientADB.query();
ObjectSet replicationSet = q.execute();
while (replicationSet.hasNext()) {
  replication.replicate(replicationSet.next());
}
replication.commit();
// Close both databases
patientADB.close();

Listing Three

After opening the databases, you create a ReplicationProcess object via a call to the replicationBegin() method of the ObjectContainer's extended interface (available through the ext() method). The replicationBegin() method is called on the primary database's ObjectContainer, and takes as its first argument the secondary database's ObjectContainer. Its other argument—which we have supplied as an anonymous object—is the ReplicationConflictHandler object. As you can see, our conflict handler is rather simple. If given a choice between Object a (from PATIENTA) or Object b (from PATIENTB), we choose Object a every time.

Next, we call setDirection() on the ReplicationProcess object, specifying the direction of the replication as going from PATIENTA to PATIENTB.

The query object I create in the next section of code is simply an empty query, which returns all objects in the database. The execution of the query returns an ObjectSet, through which we iterate to process all the objects that are to be replicated by passing those objects into the replicate() method. When it's all done, we commit() the query and close the databases; see Listing Four.

// Set things up for replication
Db4o.configure().generateUUIDs(Integer.MAX_VALUE);
Db4o.configure().generateVersionNumbers(Integer.MAX_VALUE);
// Open both databases
ObjectContainer patientADB = Db4o.openFile("PATIENTA.YAP");
ObjectContainer patientBDB = Db4o.openFile("PATIENTB.YAP");
// Create a ReplicationProcess object
ReplicationProcess replication =
  patientBDB.ext().replicationBegin(
    patientADB,
      public Object resolveConflict(
        ReplicationProcess replicationProcess,
        Object b,
        Object a) { return b; }
  });
// Set the direction from PATIENTB to
replication.setDirection(patientBDB,patientADB);
// Do the replication
replication.whereModified(q);
ObjectSet replicationSet = q.execute();
while (replicationSet.hasNext()) {
  replication.replicate(replicationSet.next());
}
replication.commit();
// Close both databases
patientADB.close();
patientBDB.close();

Listing Four

Pretty simple. And, happily, synchronizing any changes in PATIENTB back to PATIENTA is equally simple. With one small addition, this code is more or less a mirror image of the aforementioned code. (It's a mirror image because we're replicating in the opposite direction.)

As you can see, this code is virtually identical to its preceding cousin. Most differences stem from the fact that the direction is now from PATIENTB to PATIENTA. Hence, the ReplicationProcess object and the conflict resolution handler are modified accordingly. In addition, the setDirection() method identifies PATIENTB as the originating database.

The other addition is a constraint that I attached to the query that drives the replication. Specifically, I called the ReplicationProcess's whereModified() method, with the query as its sole argument. This constrains the query so that it returns only those objects that have changed since the last replication. This significantly speeds the process because objects that have not changed need not be handed to the replication code at all. (New objects are, of course, considered by the replication process to be modified.)

Replication Complete

The complete source code for the patient database is available electronically; see "Resource Center," page 5. The code consists of a number of small, easily extended applications. One creates the primary patient database; another replicates its contents to a secondary database. Another application modifies the secondary database, and I've provided code for replicating back to the master, as well as a simple application that displays the database contents.

You can explore a database even deeper with db4o's ObjectManager. This is a standalone database browser application that lets you open a database and navigate through all its contained objects, object references, data member fields, and so on. You can even modify individual data members. In short, with the help of ObjectManager, you can experiment with more complex replication code, and examine the resulting database to see the effects.