Todd is a data architect at Innovative Health Strategies. He can be reached at [email protected]
With the financial reporting requirements of the Sarbanes-Oxley Act, businesses have good reason to be concerned about the past. Enhancements to existing applications and changes that incorporate auditing needs and point-in-time recoverability in new application development have moved from back-burner wish-lists to mission-critical, immediate necessities. Beyond backups and logs at the DBMS level, there are numerous options when considering history-preserving data structure configurations. For instance, you can add start/stop timestamps or currency flags to tables in various fashions. Sometimes you can add dates to unique indexes. Best practice necessitates mapping historical needs to chosen historical data structures so that they work well together.
But problems arise for both applications and users if historical data structures do not align with actual usage of the contained history. Historical data can easily clutter a database, ultimately slowing down performance, or forcing jumps through SQL-hoops to access the correct active subset of rows in a table. There are any number of permutations that address these problems. The approach I present here offers five structural archetypes that involve the fusion of time and data inside database tables. I also examine the nature of temporal data structure patterns, and provide guidelines for establishing a history-management strategy that can be leveraged across an organization.
Regardless of the implemented data structure, a few basic audit-supporting attributes should apply to all tables. These attributes (serving as basic columns) encompass tracking a row's creation, and timing a row's last change. I recommend incorporating four standard columns on virtually every table. The four standard column titles would be some variation on "Row-Created Timestamp," "Row-Creator ID," "Row Last Updated Timestamp," and "Row Last Updated ID." The last two columns are unnecessary in situations warranting only inserts with no row updates. Thus, even on tables that only contain current data, the proper population of these four columns (along with a comprehensive set of database backups) should provide a considerable amount of point-in-time recoverability.
The temporal patterns (or archetype structures) I examine here generally echo the standardized approaches used for handling time as found inside existing business-intelligence-related (BI) multidimensional data-management practices. A variant of these BI ritual practices applies these data-management practices across all kinds of database tables, especially tables used for supporting operational applications. Multidimensional practice (like that found in star schemas) simplifies the collection of data elements into two kinds of table structures. Items are grouped together into:
- Fact tables that contain all numeric data intended for analysis.
- Dimension tables that store the various text data items available for selecting, grouping, or ordering the fact statistics.
Pointers inside the fact table rows provide the proper link to dimension rows. In many respects, the dimension tables function as indexes into the associated fact tables. Within these dimensional structures, there are three standard update strategies employed for managing value changes over time. The dimension update process is generically called "Slowly Changing Dimension," and these strategies are "Type 1," "Type 2," and "Type 3."
- Type 1 strategy retains only current values because the associated columns in a dimension table do not require the retention of history.
- Type 2 strategy inserts a new row every time one of the column values changes, which always allows association of the crucial fact with the original value of dimension columns.
- Type 3 strategy actually adds an additional column onto the dimension table, providing retention of both old and new values (Old_Customer_Name and New_Customer_Name).
For generalized database usage, the value change management archetypes in this discussion include the:
- Current-only table.
- Functional-history table.
- Row-level audit (or shadow) table.
- Column-level audit table.
- Ledger table.
Each of the structures can serve a unique kind of circumstance (as described in Table 1). Therefore, when building new applications, the database design process should include a table-by-table evaluation, identifying which type of temporal functions best suit the overall application needs. Moreover, these approaches are widely used today by developers everywhere. My focus here is on organizing these options into a framework that leads to a cohesive and organized development approach.
Clearly, instances occur where you need only the data as it currently stands. For example, an order-taking system may require a table that manages the next available service date. As capacity is scheduled for use, that availability date moves ahead. Knowing what the next available service date was as of last week or last month may not be useful. The current-only table (Figure 1) simply offers an original version of any table that omits start or stop timestamping columns, and provides an important tool to keep in the toolchest. Remember that even under the eyes of the Sarbanes-Oxley compliance reaper, some data may still have no historical value. (However, it is likely that there is much less data categorized in this manner than was considered of little historical value previously.) Furthermore, circumstances may dictate using a current-only table for performance, while using an additional table-type for duplicating the historical aspects of the data. It may be a bit obvious, but the current-only approach is indeed equivalent to the multidimensional Type 1 strategy.
When most people think of "keeping history" in their database, some variation of the functional-history table is often what comes to mind (Figure 1). A functional-history table is built by adding an activity start date and an activity stop date to a basic table structure. The row retains historic values frozen in time. When a value changes, then the original row has its activity stop date populated with the current date, and a new row, using the current date as the start date and the new column values, is inserted. Logically, this is the equivalent of the previously mentioned dimension Type 2 updating approach.
Fundamentally, this functional-history table alteration does revise the meaning of the table involved. A table originally worked as an "Employee" or "Order" table now becomes an "Employee Activity" or "Order History" table. This meaning adjustment occurs because a single item (either a single employee or a single order, per the given examples) exists in multiple rows after making this change. Consequently, effective use of the functional-history table requires making another change or two. Previously, a unique index likely existed on the item identifier (the "Employee ID" or the "Order Number"). Because any single Employee ID value may be duplicated across multiple rows as changes unfold over time, that index needs to include the activity start-date column. With this start-date addition, the index remains unique.
If you manage data models, the functional-history table, in all its time-laden glory, should exist within the conceptual data model because the functional dependencies and joins should expect a time component. Additionally, while many application tasks read this functional-history table, some need only the current values. It becomes tedious rewriting code over and over in search of a maximum start-date value. Likewise, using a test for NULLs in a stop-date column in order to determine currency can be an issue because NULLs and indexing sometimes conflict with each other. Therefore, a fairly common practice when dealing with functional-history tables involves adding a column that serves as an "Active Indicator." A "Live or Dead" flag provides a simple test to retrieve only current rows. As a DBMS managed view can be defined using the flag, even this simple test remains hidden from the application code.
The similarity of the row-level audit table to the functional-history table results from retaining values of each change event that impacts table content (see Figure 2). A distinct difference from the functional-history table exists because the row-level audit table is not meant to change the base table, but serve as an add-on structure so that you now have two tablesthe original and the row-level audit version keeping history! The row-level audit table is a virtual copy of the base table with an addition of two columns. One column contains the timestamp of the change event, while the second column identifies the basic nature of that change eventfor example, as an INSERT, UPDATE, or DELETE. Any change to the base table also inserts a row into the row-level audit table. This duplication of content and effort suggests the preference to name the row-level audit table a shadow table. In a sense, this shadow table is beneath the surface of the main application functionality, making it a hypofunctional-history table more than a functional-history table. By using a shadow structure, a current-only table can remain as part of an application, and its link to a shadow version retains change history in case the application is audited.
Another potentially useful add-on table is the column-level audit table. While ostensibly a simple variation of the previous row-level audit, it differs from the shadow table because everything has gone vertical. The row has been placed up-and-down rather than side-to-side (see Figure 2). By up-ending things in this vertical manner, the table now must include descriptive metadata. In fact, this audit table looks nothing like the original table being tracked. Structurally, the column-level audit table contains the key from the original table, a timestamp of the value change event, the name of the column with changing data, the old/original value, and lastly, the incoming new value. The type of change event may be implied by the data (a nulled old value column implying an INSERT, or a nulled new value implying a deletion), or these actions may be made explicit within another column of the table. This structure assumes that the application tracks exactly which columns are changing and only inserts rows for those columns with altered values. Including arbitrary inserts of all columns for any change, removes any usefulness from this approach because the table explodes with an overabundance of details that must be sifted to find a true change. By keeping both old and new values in this manner on a single row, the column-level audit table provides the functionality implied by a Type 3 Slowly Changing Dimension.
Applications often make use of column-level audit tables in cases that include a detailed application requirement, thus allowing users the capability to browse specific data changes easily. Under these requirements-driven circumstances, change tracking is usually limited to specific column and specific change events. The column-level audit table's existing format allows for fairly meaningful row browsing with less additional formatting necessary for presentation.
The ledger table is a specialty table. This history archetype is unlike any of the "Slowly Changing Dimension" approaches. The distinction for this unlikeness occurs because the ledger table is treated like the fact table in multidimensional designs. In the past, the ledger table structure was useful only in rare and specific circumstances, such as storing data for an actual general ledger application. Although it still fills those same requirements today, the ledger structure's usefulness is no longer quite so rare. As implied by the ledger moniker, a ledger table acts like an accounting book with the application of credits and debits (see Figure 3). With proper use of the ledger archetype, only inserts occur on this table. The ledger table is a personal favorite, often referred to as a "sticky" table; like a fly strip, once data lands here, it is meant to stay there forever.
The ledger table can require a significant amount of work. For this technique to function properly, all numbers must "balance" when summed. This means that when items change, the changes must be evaluated before allowed for insertion to the ledger table. If a previous column entry of +5 must be "updated" to a +6 value, rather than updating the +5 to a +6, either a "net change" or "gross change" approach must be used to insert new rows into the ledger table. Using a net change approach, a row is inserted for a value of +1 (see Figure 3). The previous +5 and the new +1 will sum up to the correct +6 value. Under the gross change tactic, two rows are inserted for the single change. First, a row for -5 is inserted, which logically reverses the previous +5 row. Next, a row for the +6 is inserted (see Figure 4). The ledger table, by its nature, is transactional. Circumstances of spotty or incomplete transactional pictures in the incoming data content mean that greater work must be done by the process creating the ledger table inserts.
Putting It All Together
Any specific application requires incorporating several of the structures described here; see Listing One. Even a single table may take advantage of a combination of approaches. Actually, the row-level or column-level audit functions best in tandem with either a current-only or functional-history structure. Other useful configurations may not be so obvious. For example, a functional-history table may be implemented only partially by using a subset of columns for triggering new row insertion. (For example, a change in the status code causes an update to the stop date and the insert of a new row; but an update of the text inside a comment column may not cause any response at all on this functional-history table.) Because of this partiality of the functional-history structure, perhaps a shadow table is implemented to catch all changes. In this fashion, changes that do not drive a new row insertion are not lost. While this double-table approach does increase overall application complexity, it may form a valid choice under certain circumstances.
History concerns do not have a single one-size-fits-all solution. As the various detailed archetypes demonstrate, a great deal of flexibility exists in describing the nature of a table's relationship to change over time. Current-only tables may contain noncritical data items or may be used in conjunction with other archetypes, such as row-level audit, to cover time-variant value changes. Applications often use functional-history tables so that these tables are probably the most typical of our historical data archetypes. Row-level audit structures can allow for a current-only operational table to remain smaller and more quickly responsive to queries. Column-level audit tables offer value for users who wish to browse precise changes within the application itself. And lastly, under the right set of circumstances, ledger tables can maintain an explicit track of fully comprehensive transactional change.
Alternately, overlooking the evaluation of these options can leave systemic blind spots. Such a case could happen if employing a functional-history that only inserts rows on a limited set of changes to handle one user requirement, but without duly considering a different requirement that all changes be tracked for auditing. Therefore, a hole that could have been filled by adding in a row-level audit table was missed. On the other hand, if you created a column-level audit table because some changes needed to be browsed interactively online, but then simply continued to use that one column-level audit structure for all other change tracking, a monster is born. The single column-level audit table would quickly become the largest table in the application (row-wise) and create a hazardous performance bottleneck.
Ideally, within a development group or organization, top priorities should include evolving a strategy that identifies the preferred historical archetypes and deciding a preference of the history archetype options for developers to employ. Documentation can be as simple as describing the history archetypes, or their variations germane to a specific shop's environment, and enumerating appropriate conditions suitable for the use of each archetype. Providing such a carefully thought out course of action often expedites application development and minimizes history retention implementation problems. When development practices incorporate detailed plans for establishing and maintaining history, these plans help achieve a greater level of consistency and reuse across all of an organization's applications.
/* TABLE: Employee_Current_Only */ CREATE TABLE Employee_Current_Only( Employee_ID int NOT NULL, Employee_Name varchar(100) NOT NULL, Employment_Status varchar(10) NOT NULL, Create_Dt datetime NOT NULL, Create_ID char(10) NOT NULL, Altered_Dt datetime NOT NULL, Altered_ID char(10) NOT NULL, CONSTRAINT PK_Employee_Current_Only PRIMARY KEY CLUSTERED (Employee_ID) ) go IF OBJECT_ID('Employee_Current_Only') IS NOT NULL PRINT '<<< CREATED TABLE Employee_Current_Only >>>' ELSE PRINT '<<< FAILED CREATING TABLE Employee_Current_Only >>>' go /* TABLE: Employee_Functional_History */ CREATE TABLE Employee_Functional_History( Employee_ID int NOT NULL, Start_Dt datetime NOT NULL, Employee_Name varchar(100) NOT NULL, Employment_Status varchar(10) NOT NULL, Create_Dt datetime NOT NULL, Create_ID char(10) NOT NULL, Altered_Dt datetime NOT NULL, Altered_ID char(10) NOT NULL, Active_Ind char(1) NOT NULL, Stop_Dt datetime NULL, CONSTRAINT PK_Employee_Functional_History PRIMARY KEY CLUSTERED (Employee_ID, Start_Dt) ) go IF OBJECT_ID('Employee_Functional_History') IS NOT NULL PRINT '<<< CREATED TABLE Employee_Functional_History >>>' ELSE PRINT '<<< FAILED CREATING TABLE Employee_Functional_History >>>' go /* TABLE: Employee_Row_Level_Audit */ CREATE TABLE Employee_Row_Level_Audit( Employee_ID int NOT NULL, Start_Dt datetime NOT NULL, Employee_Name varchar(100) NOT NULL, Employment_Status varchar(10) NOT NULL, Create_Dt datetime NOT NULL, Create_ID char(10) NOT NULL, Altered_Dt datetime NOT NULL, Altered_ID char(10) NOT NULL, Active_Ind char(1) NOT NULL, Action_Code char(10) NOT NULL, CONSTRAINT PK_Employee_Row_Level_Audit PRIMARY KEY CLUSTERED (Employee_ID, Start_Dt) ) go /* TABLE: Employee_Column_Level_Audit_Option_1 */ CREATE TABLE Employee_Column_Level_Audit_Option_1( Employee_ID int NOT NULL, Event_Timestamp datetime NOT NULL, Column_Name varchar(80) NOT NULL, Old_Text varchar(100) NULL, New_Text varchar(100) NULL, Old_Number decimal(10, 2) NULL, New_Number decimal(10, 2) NULL, Old_Date datetime NULL, New_Date datetime NULL, Event_ID char(10) NOT NULL, CONSTRAINT PK_Employee_Column_Level_Audit_Option_1 PRIMARY KEY CLUSTERED (Employee_ID, Event_Timestamp) ) go IF OBJECT_ID('Employee_Column_Level_Audit_Option_1') IS NOT NULL PRINT '<<< CREATED TABLE Employee_Column_Level_Audit_Option_1 >>>' ELSE PRINT '<<< FAILED CREATING TABLE Employee_Column_Level_Audit_Option_1 >>>' go /* TABLE: Employee_Column_Level_Audit_Option_2 */ CREATE TABLE Employee_Column_Level_Audit_Option_2( Employee_ID int NOT NULL, Event_Timestamp datetime NOT NULL, Column_Name varchar(80) NOT NULL, Old_Text varchar(100) NULL, New_Text varchar(100) NULL, Old_Number varchar(30) NOT NULL, Event_ID char(10) NOT NULL, CONSTRAINT PK_Employee_Column_Level_Audit_Option_2 PRIMARY KEY CLUSTERED (Employee_ID, Event_Timestamp) ) go IF OBJECT_ID('Employee_Column_Level_Audit_Option_2') IS NOT NULL PRINT '<<< CREATED TABLE Employee_Column_Level_Audit_Option_2 >>>' ELSE PRINT '<<< FAILED CREATING TABLE Employee_Column_Level_Audit_Option_2 >>>' go /* TABLE: Employee_Paycheck_Ledger */ CREATE TABLE Employee_Paycheck_Ledger( Employee_Paycheck_Event_Key uniqueidentifier NOT NULL, Employee_ID int NOT NULL, Pay_Date datetime NOT NULL, Pay_Amount decimal(10, 2) NOT NULL, Event_Type char(10) NOT NULL, Created_Dt datetime NOT NULL, Created_ID char(10) NOT NULL, CONSTRAINT PK_Employee_Paycheck_Ledger PRIMARY KEY CLUSTERED (Employee_Paycheck_Event_Key) ) go IF OBJECT_ID('Employee_Paycheck_Ledger') IS NOT NULL PRINT '<<< CREATED TABLE Employee_Paycheck_Ledger >>>' ELSE PRINT '<<< FAILED CREATING TABLE Employee_Paycheck_Ledger >>>' go IF OBJECT_ID('Employee_Row_Level_Audit') IS NOT NULL PRINT '<<< CREATED TABLE Employee_Row_Level_Audit >>>' ELSE PRINT '<<< FAILED CREATING TABLE Employee_Row_Level_Audit >>>' goBack to article