How do you generate scalable, platform-independent persistent object identifiers in a simple, performance-friendly manner?
Regardless of what object purists might wish, relational databases (RDBs) are the most common data-storage mechanism used to persist objects. Despite the different paradigms, the reality is that people are using objects and relational databases together, and while relational databases need keys, objects do not. Therefore, I propose a simple, scalable and platform-independent strategy for assigning keys (or persistent object identifiers in the object world) to objects.
Let's start with some basics about keys. A key is a unique identifier for a relational entity in a data model, which in turn evolves into a unique identifier for a row in a relational table. Keys are also the primary source of coupling within relational databases because they define and maintain the relationships between rows in tables, coupling the schema of the tables to one another. Keys are also used to define the indices on a table. Indices improve performance within a relational database, coupling the index definition to the table schema. In short, keys are an important relational concept, and as such deserve more design consideration than they typically get.
When a key is modified, either in instance values or in column schema, the modification quickly ripples through your relational schema because of this high coupling. Unfortunately, assigning business meaning (such as a social security number) to so-called natural keys is a common practice. The problem is that anything with business meaning is liable to change (for example, the U.S. government is running out of nine-digit social security numbers). Because the change is out of your control, you run the risk of your software being inordinately impacted. Remember how fun the Year 2000 crisis was?
Having realized that keys with business meaning are an exceptionally bad idea and looking for yet another way to lock customers in with proprietary technology, many database vendors have devised schemes for generating so-called surrogate keys. Most of the leading database vendors (companies such as Oracle, Sybase, and Informix) implement a surrogate key strategy called incremental keys. This entails maintaining a counter within the database server, which is used to assign a value to newly created table rows, and writing the current value to a hidden system table to maintain consistency. Every time a row is created, the counter is incremented and that value is assigned as the key value for that row. The implementation strategies vary from vendor to vendor (the values assigned may be unique across all tables or only within a single table), but the concept is the same.
Incremental keys aren't the only surrogate-key strategy available to you. In fact, two strategies that aren't database-oriented exist: universally unique identifiers (UUIDs), from the Open Software Foundation, and globally unique identifiers (GUIDs), from Microsoft. UUIDs are 128-bit values that are created from a hash of your Ethernet card ID, or an equivalent software representation, and your computer system's current datetime. Similarly, GUIDs are 128-bit hashes of a software ID and the datetime.
Although these surrogate key strategies all work reasonably well, they aren't enterprise-ready; rather, they're enterprise-challenged. First, the strategies often are predicated on the concept of your applications actually being in communication with your database, a requirement that is difficult to fulfill if your application needs to support mobile, disconnected users. Second, the strategies typically break down in multi-database scenarios, especially when the databases are from different vendors. Third, obtaining the key value causes a dip in performance each time, especially if you need to bring the value back across the network. Fourth, there are minor compatibility glitches when porting from one product to another if the vendors use disparate key-generation schemes.
Because they input the current datetime into the hash, the GUID and UUID strategies are also "enterprise-challenged." Consider how accurately datetimes are stored. In most modern operating systems, including Windows NT, datetimes are recorded to thousandths of a second.
Even with a perfect hash, at best all you can do is generate 1,000 keys per second, per processor. That sounds impressive, but it is a hard ceiling: UUIDs and GUIDs aren't scalable, making them inappropriate for mission-critical enterprise applications. (By the way, for all you Microsoft-bashers out there, Java also stores datetimes to the thousandth of a second, so your UUID-generator class isn't enterprise-ready, either. Additionally, both UUIDs and GUIDs have the problem of not being able to guarantee uniqueness, although the odds of overlap are minuscule.)
The HIGH-LOW Strategy
So how do you generate scalable, platform-independent persistent object identifiers in a simple and performance-friendly manner? Enter the HIGH-LOW strategy. The basic idea is that a persistent object identifier is in two logical parts: a unique HIGH value that you obtain from a defined source, and an N-digit LOW value that your application assigns itself. Each time that a HIGH value is obtained, the LOW value is set to zero.
For example, if the application that you're running requests a value for HIGH, it might be assigned the value 1701. Assuming that N, the number of digits for LOW, is four, the persistent object identifiers (OIDs) that the application assigns to objects will be 17010000, 17010001, 17010002, and so on up to 17019999. At this point, a new value for HIGH is obtained, LOW is reset to zero, and you begin again. If another application requests a value for HIGH immediately after you do, it will be given the value 1702, and the OIDs assigned to the objects that it creates will be 17020000, 17020001, and so on. As you can see, as long as HIGH is unique, all values will be unique.
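The example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `HighLowGenerator`, the `next_high` callable, and the in-process counter standing in for the "defined source" of HIGH values are all hypothetical, not part of any product.

```python
import itertools


class HighLowGenerator:
    """Assigns OIDs built from a HIGH prefix and an N-digit decimal LOW suffix."""

    def __init__(self, next_high, low_digits=4):
        self._next_high = next_high          # callable returning a fresh, unique HIGH
        self._low_limit = 10 ** low_digits   # 10,000 LOW values when N is four
        self._high = None
        self._low = 0

    def next_oid(self):
        # Obtain a new HIGH on first use and whenever LOW is exhausted,
        # resetting LOW to zero each time.
        if self._high is None or self._low >= self._low_limit:
            self._high = self._next_high()
            self._low = 0
        oid = self._high * self._low_limit + self._low
        self._low += 1
        return oid


# Assumption: a shared counter plays the role of the HIGH source.
_high_counter = itertools.count(1701)


def next_high():
    return next(_high_counter)


app_a = HighLowGenerator(next_high)
app_b = HighLowGenerator(next_high)
first_a = app_a.next_oid()   # 17010000
first_b = app_b.next_oid()   # 17020000
```

Each generator touches the HIGH source only once per 10,000 identifiers; everything else happens in memory.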
So how do you calculate HIGH? There are several ways to do this. First, you could use one of the incremental key features provided by database vendors. This has the advantage of improved performance; with a four-digit LOW, you have one access on the database to generate 10,000 keys instead of 10,000 accesses. However, this approach is still platform-dependent. You could also use either GUIDs or UUIDs for the HIGH value, solving the scalability problem, although you would still have platform-dependency problems.
Do It Yourself
A third approach is to implement the HIGH calculation yourself. Write a portable utility in ANSI-compliant C, Perl, or 100% Pure Java that maintains an M-digit incremental key. You need either a single source for this key generator within your system or multiple sources that share an algorithm for generating unique values for HIGH between them. Of course, the easiest way to do so is to recursively apply the HIGH-LOW approach, with a single source with which the HIGH generators collaborate. In the previous example, perhaps the HIGH server obtained the value of 17 from a centralized source, which it then used to generate values of 1701, 1702, 1703, and so on up until reaching 1799, at which point it would obtain another two-digit value and start over again.
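Applying HIGH-LOW recursively might look like the following sketch. The `HighAllocator` class and the `central` counter are hypothetical stand-ins; a real central source would be a separate service. Note that this sketch starts its local counter at zero, so the first HIGH it produces under prefix 17 is 1700 rather than 1701.

```python
import itertools


class HighAllocator:
    """Combines a prefix obtained from a parent source with a local counter."""

    def __init__(self, parent, digits):
        self._parent = parent         # callable returning a unique prefix
        self._limit = 10 ** digits    # local values available per prefix
        self._prefix = None
        self._counter = 0

    def next_value(self):
        # Ask the parent for a new prefix only when the local range runs out.
        if self._prefix is None or self._counter >= self._limit:
            self._prefix = self._parent()
            self._counter = 0
        value = self._prefix * self._limit + self._counter
        self._counter += 1
        return value


# Assumption: a simple counter plays the role of the centralized source.
central = itertools.count(17)

# The HIGH server turns prefix 17 into 1700, 1701, ..., 1799, then fetches 18.
high_server = HighAllocator(lambda: next(central), digits=2)

# An application uses the HIGH server as its own parent, with a four-digit LOW.
app = HighAllocator(high_server.next_value, digits=4)
first_oid = app.next_value()   # 17000000
```

The same class serves both tiers, which is the appeal of the recursive approach: only the top-level source has to be a single point of coordination.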
How do you make this work in the real world? First, you want to implement a factory class that encapsulates your algorithm for generating persistent object identifiers. A factory class (Design Patterns, Gamma et al., Addison-Wesley, 1995) is responsible for creating objects, implementing any complex logic in a single place. Second, two- and four-digit numbers won't cut it. It's common to see 96-bit or 112-bit values for HIGH and 16-bit or 32-bit values for LOW. The reason why 128 bits is a magic size is that you need that many bits to have enough potential values for persistent object identifiers without having to apply a complex algorithm. For example, the first time that a new persistent object identifier is requested, the factory will obtain a new HIGH and reset LOW to zero, regardless of the values for HIGH and LOW the last time the factory was instantiated. In our example, if the first application assigned the value of 17010123 and then was shut down, the next time that application runs it would start with a new value for HIGH, say 1867, and start assigning 18670000, 18670001, and so on. Yes, this is wasteful, but when you're dealing with 112-bit HIGHs, who cares? Increasing the complexity of a simple algorithm to save a couple of bytes of storage is the mentality that brought on the Y2K crisis, so let's not make that mistake again.
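To make the bit widths concrete, here is a minimal sketch that packs a 96-bit HIGH and a 32-bit LOW into one 128-bit identifier. The function names and the choice of the 96/32 split are assumptions made for illustration; a 112/16 split would work the same way.

```python
HIGH_BITS = 96
LOW_BITS = 32


def pack_oid(high, low):
    """Combine HIGH and LOW into a single 128-bit integer."""
    if not 0 <= high < (1 << HIGH_BITS):
        raise ValueError("HIGH does not fit in 96 bits")
    if not 0 <= low < (1 << LOW_BITS):
        raise ValueError("LOW does not fit in 32 bits")
    return (high << LOW_BITS) | low


def unpack_oid(oid):
    """Split a 128-bit identifier back into its (HIGH, LOW) parts."""
    return oid >> LOW_BITS, oid & ((1 << LOW_BITS) - 1)


def oid_to_hex(oid):
    """Render the identifier as a fixed-width, 32-character hex string."""
    return format(oid, "032x")
```

With a 32-bit LOW, one trip to the HIGH source covers more than four billion identifiers, so discarding an unfinished range at shutdown costs very little.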
A third issue to consider for persistent object identifiers is polymorphism, or type-changing. A chair object, for example, may become a firewood object. If you have a chair object with 12345 as its identifier and an existing firewood object with 12345 as its identifier, you have a problem when the chair becomes firewood: namely, you must assign the chair a new identifier value and update everything that referred to it. The solution is to make your persistent object identifier values unique across all objects, and not just across types and classes of objects. Make the persistent object identifier factory class a singleton (only one instance in memory space), so that all objects obtain their identifiers from a central source.
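A singleton factory along these lines is one way to guarantee that chairs and firewood never share an identifier. The names `OidFactory`, `PersistentObject`, and the demo HIGH counter are hypothetical; the locks are there on the assumption that several threads may request identifiers at once.

```python
import itertools
import threading

# Assumption: a counter stands in for the real, shared HIGH source.
_high_counter = itertools.count(1701)


class OidFactory:
    """Single, process-wide source of OIDs shared by every class of object."""

    _instance = None
    _instance_lock = threading.Lock()

    def __init__(self, low_limit=10_000):
        self._low_limit = low_limit
        self._high = None
        self._low = 0
        self._lock = threading.Lock()

    @classmethod
    def instance(cls):
        # Create the one shared factory lazily, guarding against races.
        with cls._instance_lock:
            if cls._instance is None:
                cls._instance = cls()
            return cls._instance

    def next_oid(self):
        with self._lock:
            if self._high is None or self._low >= self._low_limit:
                self._high = next(_high_counter)
                self._low = 0
            oid = self._high * self._low_limit + self._low
            self._low += 1
            return oid


class PersistentObject:
    def __init__(self):
        # Every type draws from the same factory, so OIDs never collide.
        self.oid = OidFactory.instance().next_oid()


class Chair(PersistentObject):
    pass


class Firewood(PersistentObject):
    pass


chair = Chair()
firewood = Firewood()
```

Because the identifier carries no type information, the chair can later be stored as firewood without a new identifier.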
Fourth, never display the value of the persistent object ID, never let anyone edit it, and never let anyone use it for anything other than identification. As soon as you display or edit a value, you give it business meaning, which you saw earlier to be a very bad idea for keys. Ideally, nobody should even know that the persistent object identifier exists, except perhaps the person debugging your data schema during initial development.
Fifth, consider distributed design. You may want to buffer several HIGH values to ensure that your software can operate in disconnected fashion for quite some time. The good news is that persistent object identifiers are the least of your worries when developing software to support disconnected usage, or perhaps that's the bad news, depending on your point of view. If disconnected usage is a serious requirement for you, you may also decide to store the LOW value locally when your software shuts down.
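Buffering HIGH values could be sketched as follows. The `BufferedHighSource` class and its `fetch_highs` callable are hypothetical; `fetch_highs(count)` is assumed to contact the real HIGH source and return that many unique values.

```python
import itertools
from collections import deque


class BufferedHighSource:
    """Keeps a stock of HIGH values so OIDs can be assigned while offline."""

    def __init__(self, fetch_highs, buffer_size=5):
        self._fetch_highs = fetch_highs   # callable(count) -> list of unique HIGHs
        self._buffer_size = buffer_size
        self._buffer = deque()

    def refill(self):
        """Call while connected: top the buffer back up to buffer_size."""
        missing = self._buffer_size - len(self._buffer)
        if missing > 0:
            self._buffer.extend(self._fetch_highs(missing))

    def next_high(self):
        """Needs no connection as long as the buffer is not empty."""
        if not self._buffer:
            raise RuntimeError("HIGH buffer exhausted; reconnect and refill")
        return self._buffer.popleft()


# Assumption: a counter plays the role of the remote HIGH server.
_server = itertools.count(1701)


def fetch_highs(count):
    return [next(_server) for _ in range(count)]


source = BufferedHighSource(fetch_highs, buffer_size=3)
source.refill()                     # done while connected
offline_high = source.next_high()   # 1701, served from the local buffer
```

The buffer size is a trade-off: a larger stock buys more disconnected time at the cost of more HIGH values being abandoned if the device never uses them.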
Include an Identifier
Finally, because this strategy only works for the objects of a single enterprise, you may decide to add some sort of unique identifier to your keys to guarantee that persistent object identifiers are not duplicated between organizations. The easiest way to do this is to append your organization's internet domain name, which is guaranteed to be unique, to your identifiers if you are rolling your own HIGH values, or to simply use a UUID/GUID for your HIGH values. This will probably be necessary if your institution shares data with others or if it is likely to be involved in a merger or acquisition.
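As a small illustration of appending a domain name, the helper below builds a string identifier from HIGH, LOW, and a domain. The function name, the `@` separator, and `example.com` are assumptions chosen for the sketch; any unambiguous format would do.

```python
def qualified_oid(domain, high, low, low_digits=4):
    """Return a string OID of the form HIGH + zero-padded LOW + '@' + domain."""
    if not 0 <= low < 10 ** low_digits:
        raise ValueError("LOW does not fit in the configured number of digits")
    return f"{high}{low:0{low_digits}d}@{domain}"


example = qualified_oid("example.com", 1701, 123)   # "17010123@example.com"
```

Two organizations that both happen to issue 17010123 still end up with distinct identifiers once their domains are attached.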
The HIGH-LOW strategy for generating values for persistent object identifiers has been proven in practice to be simple, portable, scalable and viable within a distributed environment. Furthermore, it provides excellent performance because most values are generated in memory, not within your database. The only minor drawback is that you typically need to store your persistent object identifiers as strings in your database, so you can't take advantage of the performance enhancements that many databases have for integer-based keys.
The HIGH-LOW strategy works exceptionally well and is enterprise-ready. Furthermore, its application often proves to be the first step in bringing your organization's persistence efforts into the realm of software engineering.