The OMG should extend the existing UML class diagram definition to help you develop real-world, mission-critical applications using object and relational technologies.
One of the fundamental questions object developers face is how to make their objects persist; in other words, how to save them between sessions. Although the answer appears simple on the surface (you can use files, relational databases, object-relational databases, and even full-fledged objectbases), practice reveals that it is more difficult than it looks. In reality, your persistence strategy can be so complex that you inevitably need to model it. Luckily, you have the Unified Modeling Language (UML), the industry standard notation that is allegedly sufficient for modeling object-oriented software, so you should have no problem, right? Well, not quite.
The UML does not explicitly include a data model (more appropriately named a persistence model in the object world). Although you can use class models to model an objectbase's schema, as I showed in my Sept. and Oct. 1998 columns, they are not immediately appropriate for modeling the schemas of relational databases. The purists may argue that you should only use objectbases, but in reality, relational databases are a $7-billion market, which indicates that the majority of developers are using relational databases on the back end to store objects.
The problem is the object-relational impedance mismatch: the object paradigm and the relational paradigm are built on different principles. The object paradigm, on one hand, is based on the concept of object networks that have both data and behavior, networks that you traverse. Object technology employs concepts that are well supported by the UML such as classes, inheritance, polymorphism, and encapsulation. The relational paradigm, on the other hand, is based on collections of entities that only have data and rows of data that you combine, which you then process as you see fit. Relational technology employs concepts such as tables, columns, keys, relationships between tables, indices on tables, stored procedures, and data access maps. Unfortunately, though the UML doesn't support these concepts very well, we still need to be able to model them. And persistence modeling is more complicated than merely applying a few stereotypes to class diagrams.
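To make the mismatch concrete, here is a minimal sketch in Python that computes the same figure both ways: once by traversing a small object network, and once by joining rows in an in-memory SQLite database. The Customer and Order classes, the table names, and the sample data are all assumptions made up for illustration; none of them come from the figures in this column.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Order:
    number: int
    total: float


@dataclass
class Customer:
    name: str
    orders: list = field(default_factory=list)

    def total_spent(self) -> float:
        # Behavior lives with the data; related objects are reached by traversal.
        return sum(order.total for order in self.orders)


# Object view: a network of objects that you traverse.
alice = Customer("Alice", [Order(1, 20.0), Order(2, 5.5)])
object_total = alice.total_spent()

# Relational view: rows of data in tables that you combine with a join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE customer_order (
        order_number INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer (customer_id),
        total REAL
    );
    INSERT INTO customer VALUES (1, 'Alice');
    INSERT INTO customer_order VALUES (1, 1, 20.0), (2, 1, 5.5);
""")
relational_total = conn.execute(
    """SELECT SUM(o.total)
       FROM customer c JOIN customer_order o ON o.customer_id = c.customer_id
       WHERE c.name = ?""",
    ("Alice",),
).fetchone()[0]
```

Both approaches arrive at the same total, but the object version reaches the orders through a reference held by the customer, whereas the relational version has to match key values across two tables.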
Proposing a Standard
The good news is that the UML supports the concept of a profile, the definition of a collection of enhancements that extend an existing diagram type to support a new purpose. For example, the UML 1.3, available for download from http://www.rational.com, includes a standard profile for modeling software development processes. I propose a profile that extends the existing class diagram definition to support persistence modeling, which should help to make the UML usable for organizations that are developing real-world, mission-critical applications using both object and relational technologies. I hope that the UML working group of the Object Management Group (OMG) will take this proposal as input into the definition of a standard profile for persistence models.
Potential Modeling Stereotypes
Figure 1 shows an example of a logical persistence model. A logical persistence diagram is given the stereotype <<logical persistence diagram>>, one of the potential persistence modeling stereotypes described in the sidebar. Logical persistence models show the data entities your application will support, the data attributes of those entities, the relationships between the entities, and the candidate keys of the entities. You model entities using standard UML class symbols with the stereotype <<entity>>, although this stereotype is redundant if your diagram is identified as a logical persistence diagram. Entity attributes are modeled identically to the class attributes, with the exception that they always have public visibility, depicted with a plus sign (+) in the UML. Relationships between entities are modeled as either associations or aggregation associations, as you would expect, and subtyping relationships are indicated using inheritance.
Figure 1. Example of a logical persistence model.
Candidate keys, and keys in general, are one of several concepts you will find difficult to model using the UML. A key is a collection of one or more attributes or columns whose values uniquely identify a row (which is the relational equivalent of an object's data aspects). The problem is that any given entity can have zero or more natural keys. A natural key is a key whose attributes currently exist in the entity, whereas an artificial key has had one or more attributes introduced. You should mark the columns that form an entity's candidate key, as you can see with ResidentialCustomer in Figure 1, using the combination of the stereotype <<candidate key>> and a constraint indicating which candidate key or keys the column belongs to. Figure 1 shows constraints in the format {ck = #}, although {candidate key number = #} may be more appropriate, which is one of many issues an official standard profile would need to address.
Having described how to use the UML for logical persistence modeling, it is unfortunate that logical persistence models offer little if any benefit to your software development efforts. The problem is that logical persistence models add nothing useful that isn't already documented in your class model. In fact, the only thing that logical persistence models show that standard class models don't is candidate keys, and frankly, modeling candidate keys is a bad idea. Experience has shown that natural keys are one of the greatest mistakes of relational theory: they are out of your control and subject to change because they have business meaning. Keys form the greatest single source of coupling in relational databases, and when they change, those changes propagate throughout your model. It is good practice to reduce coupling within your design; therefore, you want to avoid using keys with business meaning. The implication is that you really don't want to model candidate keys.
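The coupling argument can be illustrated with a small sketch, again in Python against an in-memory SQLite database. It assumes a hypothetical customer whose phone number is used as a natural key in one schema and who has a meaningless customer_oid in the other; the table names, column names, and the rows_touched helper are invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Natural key: the phone number has business meaning and can change.
    CREATE TABLE customer_nat (phone TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE invoice_nat (invoice_id INTEGER PRIMARY KEY, customer_phone TEXT);
    INSERT INTO customer_nat VALUES ('555-0100', 'Alice');
    INSERT INTO invoice_nat VALUES (1, '555-0100'), (2, '555-0100');

    -- Surrogate key: customer_oid has no business meaning.
    CREATE TABLE customer_sur (customer_oid INTEGER PRIMARY KEY, phone TEXT, name TEXT);
    CREATE TABLE invoice_sur (invoice_id INTEGER PRIMARY KEY, customer_oid INTEGER);
    INSERT INTO customer_sur VALUES (1, '555-0100', 'Alice');
    INSERT INTO invoice_sur VALUES (1, 1), (2, 1);
""")


def rows_touched(statements):
    """Run each UPDATE and return the total number of rows it changed."""
    return sum(conn.execute(sql, params).rowcount for sql, params in statements)


# Changing the phone number under the natural-key design touches every
# table that carries the key value.
natural_cost = rows_touched([
    ("UPDATE customer_nat SET phone = ? WHERE phone = ?",
     ("555-0199", "555-0100")),
    ("UPDATE invoice_nat SET customer_phone = ? WHERE customer_phone = ?",
     ("555-0199", "555-0100")),
])

# Under the surrogate-key design the phone number is ordinary data.
surrogate_cost = rows_touched([
    ("UPDATE customer_sur SET phone = ? WHERE customer_oid = ?",
     ("555-0199", 1)),
])
```

With only two invoices the difference is three rows versus one, but the natural-key cost grows with every table and row that references the customer, while the surrogate-key cost stays at a single row.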
Tables, Columns, and Relationships
Figure 2 shows an example of a physical persistence model, which describes a relational database's schema. As you would expect, the stereotype <<physical persistence diagram>> should be applied to the UML class diagram. For tables, you should use standard class symbols with the <<table>> stereotype applied. Table columns are modeled as public attributes and the ANSI SQL type of the column (Char, Number, and so forth) should be indicated following the standard UML approach. You model simple relationships between tables as associations (relational databases don't have the concept of aggregation or subtyping and inheritance, so you would not apply these symbols to this type of diagram).
Figure 2. Example of a physical persistence model.
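One way to see how a <<table>> class symbol relates to an actual schema is to hold the model element in a small data structure and render it twice, once as notation and once as DDL. The sketch below does that in Python. The Column and Table classes, the text layout produced by to_uml, and the sample Customer table are assumptions for illustration only; the text rendering is a loose approximation of a class symbol, not the notation of any particular modeling tool.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Column:
    name: str
    sql_type: str          # for example "CHAR(40)" or "NUMERIC"
    stereotype: str = ""   # for example "oid"; empty for a plain column


@dataclass
class Table:
    name: str
    columns: list

    def to_uml(self) -> str:
        """Render the table as text in the rough shape of a UML class symbol."""
        lines = ["<<table>>", self.name]
        for col in self.columns:
            tag = f"<<{col.stereotype}>> " if col.stereotype else ""
            # Columns are public attributes, hence the plus sign.
            lines.append(f"{tag}+ {col.name}: {col.sql_type}")
        return "\n".join(lines)

    def to_ddl(self) -> str:
        """Render the same table as a CREATE TABLE statement."""
        cols = ", ".join(f"{c.name} {c.sql_type}" for c in self.columns)
        return f"CREATE TABLE {self.name} ({cols})"


customer = Table("Customer", [
    Column("customer_oid", "NUMERIC", "oid"),
    Column("name", "CHAR(40)"),
])

# The generated DDL is accepted by an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute(customer.to_ddl())
```

Calling print(customer.to_uml()) shows the stereotyped class symbol as text, and customer.to_ddl() yields the statement that creates the corresponding table.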
Figure 3 shows how to model views, which are alternative access paths to one or more tables, as class symbols with the stereotype <<view>>. As you would expect, views have UML dependencies on the tables that they provide access to, and in many ways, are the relational equivalent of facades from the object world. Indices, shown in Figure 3, are modeled as class symbols with a <<primary index>> or <<secondary index>> stereotype. You use indices to implement the primary and secondary keys, if any, in a relational database. A primary key is the preferred access method for a table, whereas secondary keys provide quick access via alternative paths. Indices are interesting because their attributes, which all have implementation visibility, imply the attributes that form the primary and secondary keys respectively of a given table. Although you could add the optional stereotypes <<primary key>> and <<secondary key>>, this information would merely clutter your diagram with redundant information.
Figure 3. Modeling views, indices, and foreign keys.
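For readers who want to see the database objects behind the <<view>> and <<secondary index>> symbols, the following Python sketch creates one of each in an in-memory SQLite database. The customer table, the customer_directory view, and the customer_by_postal_code index are hypothetical names chosen for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_oid INTEGER PRIMARY KEY,   -- primary key: the preferred access path
        name TEXT,
        postal_code TEXT
    );
    -- Secondary index: quick access via an alternative path.
    CREATE INDEX customer_by_postal_code ON customer (postal_code);
    -- View: an alternative access path that depends on the customer table.
    CREATE VIEW customer_directory AS
        SELECT name, postal_code FROM customer;
    INSERT INTO customer VALUES (1, 'Alice', 'M5V'), (2, 'Bob', 'K1A');
""")

# Clients can query the view without knowing the underlying table layout.
directory = conn.execute(
    "SELECT name FROM customer_directory ORDER BY name"
).fetchall()

# The schema catalog records which table each index belongs to.
indices = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'customer'"
).fetchall()
```

The view plays the facade role described above: it exposes two columns and hides the customer_oid column, and it stops working if the table it depends on is dropped.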
Foreign keys, columns that maintain the relationship between the rows contained in one table and those stored in another, are modeled using the <<foreign key>> stereotype, as shown in Figure 3. Foreign keys are also clunky because they are effectively columns that depend on columns in another table (either the primary key columns or one of the secondary key columns). To model this properly, you should have a dependency relationship between the two columns, although this quickly clutters up your diagrams. You could potentially indicate this type of dependency using a constraint, but I suspect this would unnecessarily complicate your models. The point is, this is yet another issue that should be addressed by a standard profile for a UML persistence model. For now, you should choose one alternative (I recommend the stereotype) and stick to it.
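The dependency of a foreign key column on a key column in another table is something the database itself can enforce. A minimal sketch in Python with an in-memory SQLite database follows; the customer and invoice tables and their columns are assumed names for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE customer (customer_oid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE invoice (
        invoice_oid INTEGER PRIMARY KEY,
        -- Foreign key: depends on the primary key column of customer.
        customer_oid INTEGER NOT NULL REFERENCES customer (customer_oid)
    );
    INSERT INTO customer VALUES (1, 'Alice');
    INSERT INTO invoice VALUES (10, 1);
""")

# An invoice that points at a customer that does not exist is rejected.
try:
    conn.execute("INSERT INTO invoice VALUES (11, 99)")
    dangling_rejected = False
except sqlite3.IntegrityError:
    dangling_rejected = True
```

The REFERENCES clause is the column-to-column dependency that is awkward to draw: it ties invoice.customer_oid to customer.customer_oid rather than merely relating the two tables.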
Figure 4 shows that you can model triggers (functions that are automatically invoked when a certain action is performed on a table) in a straightforward manner. You can model them as operations on a table using the <<trigger>> stereotype and a constraint indicating when the trigger should be invoked. Although you could use operation names such as insert(), delete(), and update() to indicate when the triggers would be invoked, the trigger-naming strategy is often specific to the database vendor, so you really want to use constraints instead. One of my general design philosophies is that you can count on having to port your database over time, therefore you want to avoid database vendor-specific features whenever possible (even if you only need to upgrade to a new version of the same database, that is still a port). Triggers are modeled with private visibility, depicted with a minus sign (-) in the UML, because they shouldn't be invoked directly.
Figure 4. Modeling triggers and stored procedures.
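As a rough illustration of what a <<trigger>> operation with an "after update" constraint stands for, the Python sketch below defines one in an in-memory SQLite database. The account and account_audit tables and the log_balance_change trigger are hypothetical, and the trigger syntax shown is SQLite's, which is exactly the kind of vendor-specific detail the model is meant to abstract away.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (account_oid INTEGER PRIMARY KEY, balance NUMERIC);
    CREATE TABLE account_audit (
        account_oid INTEGER,
        old_balance NUMERIC,
        new_balance NUMERIC
    );

    -- Invoked by the database after a balance update, never called directly.
    CREATE TRIGGER log_balance_change
    AFTER UPDATE OF balance ON account
    BEGIN
        INSERT INTO account_audit
        VALUES (OLD.account_oid, OLD.balance, NEW.balance);
    END;

    INSERT INTO account VALUES (1, 100);
""")

# The application only updates the table; the trigger writes the audit row.
conn.execute("UPDATE account SET balance = 75 WHERE account_oid = 1")
audit_rows = conn.execute("SELECT * FROM account_audit").fetchall()
```

Nothing in the application code mentions log_balance_change, which matches the private visibility given to triggers in the model.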
Stored procedures, which are operations defined within a relational database, are also clunky to model in the UML, because they don't map well to the object paradigm. Figure 4 depicts a stored procedure as a class with one operation marked with the <<stored procedure>> stereotype. A stored procedure is conceptually similar to a utility class, which implements one or more functions that are often casually related to one another at best, although it doesn't have a name and only implements one operation. You want to use a single class symbol per stored procedure, because you need the notational real estate to model the dependencies that the stored procedure has to the tables as well as the views it accesses to fulfill its responsibilities. Due to the large number of dependencies that stored procedures may have, and because any given stored procedure may implement a defined interface (modeled by the lollipop symbol), using one utility class to model all the stored procedures in a relational database quickly becomes unwieldy. You may choose to use the standard UML package symbol to aggregate similar stored procedures in your persistence model; in fact, some database vendors actually support this concept in their products.
Here, I have focused solely on the static aspect of persistence modeling rather than the dynamic nature shown in data access maps, potentially modeled via UML sequence diagrams or collaboration diagrams. Persistence modeling is a complex endeavor that has been ignored far too long within the object industry. It is clear to me, and to the majority of developers in this field, that the OMG's UML working group has dropped the ball on this issue. There is more to persistence models than adding a few stereotypes to UML class diagrams. It's time the UML community started addressing topics that are critical to the majority of object technology projects today, such as persistence modeling and user interface modeling. In theory, the UML is complete, but in practice it still has a way to go. Hopefully the OMG will choose to finish the good job it has started.
Potential Stereotypes for a UML Persistence Model
<<artificial>> Apply to a column in a physical persistence model, such as a total column in an invoice table, that has been added as the result of denormalization.
<<associative table>> Apply to a table that is introduced to resolve a many-to-many relationship between tables.
<<candidate key>> Apply to an attribute in a logical persistence model to mark it as part of a candidate key for an entity. For entities with several candidate keys you will need to add a constraint to indicate which key it is part of.
<<entity>> Apply to indicate an entity in a logical persistence model. This stereotype is redundant if the diagram is marked with <<logical persistence diagram>>.
<<foreign key>> Apply to a column in a table to indicate that it is a foreign key referencing another table.
<<logical persistence diagram>> Apply to a UML class diagram to indicate that it represents a logical persistence diagram.
<<oid>> Apply to a column in a table to indicate that it is a persistent object identifier, an artificial/surrogate key that has no business meaning.
<<physical persistence diagram>> Apply to a UML class diagram to indicate that it represents a physical persistence diagram.
<<primary index>> Apply to a class to indicate that it represents the primary physical index for a given table.
<<primary key>> Apply to a column to indicate that it forms part of the primary key for that table. This stereotype is redundant if the primary index is modeled.
<<random access file>> Apply to a class to represent a random access file approach to storage.
<<secondary index>> Apply to a class to indicate that it represents one of the secondary physical indices for a given table.
<<secondary key>> Apply to a column to indicate that it forms part of a secondary key for that table. This stereotype is redundant if the secondary index is modeled. For tables with several secondary keys you will need to add a constraint to indicate which key it is part of.
<<sequential file>> Apply to a class to represent a sequential file approach to storage.
<<stored procedure>> Apply to a class to indicate that it models a single stored procedure.
<<table>> Apply to a class to indicate that it represents a physical database table. This stereotype is redundant if the diagram is marked with <<physical persistence diagram>>.
<<trigger>> Apply to an operation to indicate that it models a trigger on the table. You will need to indicate the type of trigger with a constraint.
<<view>> Apply to a class to indicate that it represents a view within a database.