Database table normalization can improve performance.
April 26, 2006
URL: http://www.drdobbs.com/architecture-and-design/database-design-how-table-normalization/186701027
"Given our design, how can we prevent table fragmentation?" the client asked. They wanted me to provide the secret handshake that would make the RDBMS gracefully handle updates without wasting space or fragmenting storage.
Sadly, the fragmentation was a logical consequence of the design. They had a table that followed the M.E.S.S. (that's "Multiple Entity Sub-Species") design pattern. In this design, a number of different subclasses are merged into a single table. The hallmarks of this design are a large number of optional columns, leading to NULL values and sparse data.
The business reason is that raw materials ("prospects") are supplied by a variety of customers. A number of services are performed, all of which update the prospect. Each customer relationship involves slightly different features of each prospect, but the overall value-add business process is substantially similar: viewed from a distance, the processes differ in some details yet are more alike than different. All of them involve adding information to a prospect.
In this case, the situation is compounded by a C.R.E.E.P. (the "Continuously Re-Evolving Entity Pattern"). Each "prospect" has an almost indefinite number of features, including events, conditions, services, process status and relationship details. A CREEP object is often miscast as a relational row, with each feature modeled as a column. The attributes may not be sparse, but they grow without any practical boundary, and the naive mapping from attribute to column is often inappropriate.
Generally, C.R.E.E.P. designs start as a spreadsheet. When the data reaches 256 columns, this becomes impractical. It is then transformed into a database table without doing any fundamental rethinking of the business problem.
The biggest consequence of a MESS + CREEP design is that we have columns which are initially NULL but get filled with large text comments or dates. We also have text comments which evolve, rarely getting smaller. This often leads to space-allocation issues in the RDBMS. Before too long we have highly fragmented storage; in the case of Oracle, extensive row chaining. This fragmentation and chaining is a performance burden with a number of candidate solutions.
Fortunately, the customer was concerned about the potential fragmentation and willing to consider prevention instead of cure. Oracle, for example, has a curative procedure, but this customer saw an unacceptable cost in running a nightly fix on the fragmented storage: the expense and complexity of the fix processing, as well as the added risk of downtime. The customer considered prevention to have more long-term value than a workaround.
Preventing fragmentation means either pre-allocating each row at its maximum size, or normalizing the design so that updates are performed rarely. Pre-allocation has an unpleasant cost in wasted storage: spreading the relevant data across more physical blocks of storage means that the common batch-oriented table scans become unacceptably slow.
When we look closely at the normalization of this data we can identify two closely intertwined issues: The MESS table design and the CREEP business evolution. Each of these shapes the solution.
There are several ways to handle subclasses in a relational database. The MESS table unifies all sub-species by taking the union of all possible attributes and making liberal use of NULL. Another approach is to decompose each subclass into a distinct table. This makes the processing more complex, since a number of tables must be joined, leading to a number of similar programs that differ only in a few column names and a table name.
A third choice for handling subclasses is to mirror the inheritance tree in the table design. A core table contains the superclass attributes: Those features which are truly common (or very nearly common) to all subclasses. Other tables contain subclass-unique attributes and are joined to the core table as needed by specific applications. When we look at this design closely, we often see that the core table attributes are not really CREEP-style attributes: they are remarkably stable values. Rows go into the core table and are rarely updated, which minimizes fragmentation.
The second of the intertwined issues is the CREEP problem: We keep tacking features on to this entity. These attributes often form groups or clusters that have a timestamp, an actor's name, a comment string, and possibly an official event or condition name. Generally, a cluster of related columns is a distinct entity (as well as a third normal form violation.)
The recommendation is to normalize the MESS table to separate the sub-species. This will add tables to the database, and it will tend to reduce fragmentation of the data. Updates will often be focused on a sub-entity, moving around rows in smaller and more densely packed tables.
Many operations will require joins instead of scanning the MESS table. What is the tradeoff cost of this normalized architecture when compared with the MESS architecture?
We'll compare three sample table designs that reflect degrees of normalization to control storage fragmentation.
Fragmentation is a problem that leads to steady degradation of system performance, and that degradation raises the business cost of operating the system. On top of this degradation there is a background cost built into any database design that relies on technical workarounds.
The background cost of fragmentation is the processing (and support) for ongoing defragmentation. These costs include any downtime to defragment, and risk of failure in this processing. Further, the processing itself adds complexity but does not create value.
We'll try to measure the relative costs of a database which degrades through a comparison between three designs--denormalized, partially normalized, and fully normalized.
Listing One is the denormalized MESS design. The ev columns are "events" which are filled in by updates, leading to fragmentation. The business processing makes ev1 mandatory; the other events may or may not happen.
CREATE TABLE MESS(
    key     VARCHAR(20),
    ev1text VARCHAR(100),
    ev1date TIMESTAMP,
    ev2text VARCHAR(100),
    ev2date TIMESTAMP,
    ev3text VARCHAR(100),
    ev3date TIMESTAMP
);
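The update pattern that drives fragmentation can be made concrete with a short sketch. The article's measurements were made against Oracle; this uses Python's sqlite3 purely to illustrate the pattern, and the key values and event text are hypothetical.

```python
import sqlite3

# Illustrative only: sqlite3 stands in for the Oracle database the
# article tested; key values and event strings are made up.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE MESS(
        key TEXT PRIMARY KEY,
        ev1text TEXT, ev1date TEXT,
        ev2text TEXT, ev2date TEXT,
        ev3text TEXT, ev3date TEXT
    )
""")

# A row starts life with only the mandatory ev1 columns filled;
# the optional event columns are NULL.
conn.execute(
    "INSERT INTO MESS(key, ev1text, ev1date) VALUES (?, ?, ?)",
    ("P-0001", "received", "2006-04-01"),
)

# Later business steps UPDATE the same row, replacing NULLs with
# long strings -- the row grows, and in a block-structured store it
# may no longer fit where it was originally written.
conn.execute(
    "UPDATE MESS SET ev2text = ?, ev2date = ? WHERE key = ?",
    ("inspection complete: " + "x" * 80, "2006-04-10", "P-0001"),
)
conn.execute(
    "UPDATE MESS SET ev3text = ?, ev3date = ? WHERE key = ?",
    ("shipped: " + "y" * 80, "2006-04-20", "P-0001"),
)

row = conn.execute("SELECT ev1text, ev2date, ev3date FROM MESS").fetchone()
print(row)
```

Each update rewrites the one wide row in place; it is exactly this replace-NULL-with-long-text pattern that produces row chaining in Oracle.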
Listing Two shows a normalized design which controls fragmentation by isolating the ev2 and ev3 event data into a separate table. The M1X table can be sparse and can fragment.
CREATE TABLE M1C(
    key     VARCHAR(20),
    ev1text VARCHAR(100),
    ev1date TIMESTAMP
);
CREATE TABLE M1X(
    key     VARCHAR(20),
    ev2text VARCHAR(100),
    ev2date TIMESTAMP,
    ev3text VARCHAR(100),
    ev3date TIMESTAMP
);
CREATE UNIQUE INDEX M1C_X1 ON M1C( key );
CREATE UNIQUE INDEX M1X_X1 ON M1X( key );
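Queries against this design use a 1:1 join on the unique keys. A minimal sketch, again using sqlite3 in place of Oracle and hypothetical data, shows the join shape; a LEFT OUTER JOIN keeps prospects that have no optional events yet.

```python
import sqlite3

# Sketch of the 1:1 join used against the semi-normalized design.
# Table and column names follow Listing Two; data values are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE M1C( key TEXT, ev1text TEXT, ev1date TEXT );
    CREATE TABLE M1X( key TEXT, ev2text TEXT, ev2date TEXT,
                      ev3text TEXT, ev3date TEXT );
    CREATE UNIQUE INDEX M1C_X1 ON M1C( key );
    CREATE UNIQUE INDEX M1X_X1 ON M1X( key );
""")
conn.execute("INSERT INTO M1C VALUES ('P-0001', 'received', '2006-04-01')")
conn.execute("INSERT INTO M1X VALUES ('P-0001', 'inspected', '2006-04-10', NULL, NULL)")

# LEFT OUTER JOIN so prospects with no M1X row still appear in the result.
rows = conn.execute("""
    SELECT c.key, c.ev1text, x.ev2text, x.ev3text
    FROM M1C c LEFT OUTER JOIN M1X x ON c.key = x.key
""").fetchall()
print(rows)
```

Because both indexes are unique, the join never changes the row count of the core table; the result has exactly one row per prospect.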
The most interesting design is in Listing Three. This uses inserts instead of updates to fold in the additional data.
CREATE TABLE M2C(
    key VARCHAR(20)
);
CREATE TABLE M2E(
    key    VARCHAR(20),
    evt    NUMBER,
    evtext VARCHAR(100),
    evdate TIMESTAMP
);
CREATE UNIQUE INDEX M2C_X1 ON M2C( key );
CREATE INDEX M2E_X1 ON M2E( key );
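The essential property of this design is that each new event becomes a new M2E row, so existing rows are never updated and never grow. A sketch under the same assumptions as before (sqlite3 stand-in, hypothetical data):

```python
import sqlite3

# Insert-only event folding, per Listing Three. Existing rows are
# never updated; each business event is appended as a fresh M2E row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE M2C( key TEXT );
    CREATE TABLE M2E( key TEXT, evt INTEGER, evtext TEXT, evdate TEXT );
    CREATE UNIQUE INDEX M2C_X1 ON M2C( key );
    CREATE INDEX M2E_X1 ON M2E( key );
""")
conn.execute("INSERT INTO M2C VALUES ('P-0001')")
for evt, text, date in [(1, "received", "2006-04-01"),
                        (2, "inspected", "2006-04-10"),
                        (3, "shipped", "2006-04-20")]:
    conn.execute("INSERT INTO M2E VALUES ('P-0001', ?, ?, ?)", (evt, text, date))

# The 1:n join returns one row per event; GROUP BY rebuilds a
# one-row-per-prospect view when a report needs it.
count = conn.execute("""
    SELECT c.key, COUNT(e.evt)
    FROM M2C c JOIN M2E e ON c.key = e.key
    GROUP BY c.key
""").fetchone()
print(count)
```

The trade is visible in the query: the 1:n cardinality means every report must either accept multiple rows per prospect or pay for aggregation.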
I used three update scenarios for each design. The first job did only two updates--one to update each sub-entity's attributes. I compared this with scenarios doing 5 updates and 10 updates. The higher number of updates would show the effects of any storage-reclamation strategy the RDBMS used.
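The shape of the size measurement can be re-created in miniature. This is not the article's benchmark: it uses sqlite3 rather than Oracle, page counts rather than extent sizes, and made-up data; it only illustrates how update passes that lengthen text columns make the stored table grow.

```python
import sqlite3

# Toy re-creation of the growth measurement. Assumptions: sqlite3
# in place of Oracle, 100 rows as in the article's sample, and
# PRAGMA page_count as the size metric.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE MESS(
    key TEXT, ev1text TEXT, ev1date TEXT,
    ev2text TEXT, ev2date TEXT, ev3text TEXT, ev3date TEXT)""")
conn.executemany(
    "INSERT INTO MESS(key, ev1text, ev1date) VALUES (?, ?, ?)",
    [(f"P-{i:04d}", "new", "2006-04-01") for i in range(100)],
)
before = conn.execute("PRAGMA page_count").fetchone()[0]

# The 10-update scenario: repeated passes that replace NULLs and
# short strings with ~100-character comments.
for _ in range(10):
    conn.executemany(
        "UPDATE MESS SET ev2text = ?, ev3text = ? WHERE key = ?",
        [("note " * 20, "note " * 20, f"P-{i:04d}") for i in range(100)],
    )
after = conn.execute("PRAGMA page_count").fetchone()[0]
print(before, after)
```

Even in this toy, the page count after the update passes exceeds the count before them, which is the growth ratio the table below reports for the real designs.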
Table 1 shows the cost of fragmentation in size and time. This table shows each design separately, measuring how storage size grows and how query time grows due to fragmentation. These results cannot be used to compare designs with one another; they only show how a particular design gets larger and slower due to fragmentation.
The minimal impact of fragmentation occurred with only two updates. The maximum effect was with 10 updates.
                Fragmented Size      Fragmented Time
                min       max        min       max
    Denorm      1.58      2.67       1.11      1.13
    Semi-Norm   1.63      2.13       1.49      1.56
    Norm        1.73      2.23       1.46      2.04
The MESS storage grew by a factor from 1.58 to 2.67, as expected. Each update replaced NULLs or short strings with longer strings, taking up more storage, leading to rows being placed elsewhere in the file structure. Storage was consumed rapidly by updates to the MESS.
The MESS query time, however, did not grow as rapidly as the storage did. This is interesting, and most likely reflects the very small size of the sample data (100 rows). Since the database only occupies a few physical blocks, it can be read quite rapidly in spite of fragmentation. A larger database would have a larger performance penalty.
The partially normalized storage grew by a factor from 1.63 to 2.13. Separating the columns that change from the columns that are static reduces fragmentation, as expected. Query time, using joins and unique indexes, grew by 49 percent to 56 percent after the series of updates.
The fully normalized storage grew by a factor from 1.73 to 2.23. The fully normalized version had one row in each table before fragmentation, and a number of rows after fragmentation. Query time grew between 46 percent and 104 percent after the series of updates. This dramatic slow-down is likely due to the change in cardinality from 1:1 to 1:n.
Comparison between structures reveals that the partially normalized design has a performance penalty of just 14 percent compared with the MESS design. Without fragmentation, the partially normalized structure may actually return results faster than the denormalized MESS. The fully normalized structure, with its 1:n join, has a performance penalty of 68 percent to 131 percent when compared with the original MESS.
A semi-normalized design does not endure the same level of fragmentation as a denormalized MESS design. Since it uses a 1:1 join instead of a 1:n join, performance is generally quite good. Further, change can often be isolated to the extension table, offering some protection from disruptive change. The rate of fragmentation is the lowest, and the performance penalty from the 1:1 join is also quite low.
The management overview is this: The semi-normalized version requires the most insight to create. It requires understanding the attributes and their usage. Investments made in understanding the application data and processing pay dividends by reducing administrative busy-work and the risk of problems caused by that overhead. Further, understanding the application can lead to optimization of the data and the associated processing.
Steven is a database developer who focuses on data warehousing and the associated e-business architectures. He can be contacted at [email protected]