Described by Martin Fowler in his aptly titled book, Refactoring (Addison-Wesley, 1999), and popularized by Extreme Programming (XP), refactoring is a useful technique to restructure code in a disciplined manner. By making certain small changes to your code that retain its original semantics, you can support new requirements and keep your design as simple as possible, allowing you to safely and easily evolve your code without putting existing functionality at risk.
As opposed to traditional development, in which your design is first modeled in detail (an approach often referred to as "big design up front"BDUF), with refactoring, your system's design can evolve over time with an "emergent" method. Refactoring works very well for programming source code, and refactoring tools such as Instantiations Inc.'s jFactor (www.instantiations.com/jfactor) and IntelliJ's IDEA (www.intellij.com/idea) are now available for Java. However, the general consensus is that refactoring data schemas is much more difficult than refactoring source code.
Impediments to Data Refactoring
Why? Because databases are unfortunately notorious for high coupling. Within the database, there is coupling between tables via foreign keys to other tables, and further coupling between tables and database code, such as triggers and stored procedures, which manipulates those tables. A more significant problem is the coupling between a database and the systems that access it, including online applications, batch jobs, reporting applications and data extraction systems. Within these systems, the same data tables and columns are often accessed from several parts of the system, again increasing the system's interconnections. Therefore, a simple data refactoring can trigger a cascade of other refactorings within your database and source code.
Applying Good Principles
Luckily, good design principles like encapsulation, loose coupling and high cohesion help to make data refactoring easier. Through their familiarity with the rules of data normalization, data professionals are acutely cognizant of coupling and cohesion issues between data entities (see C.J. Date's An Introduction to Database Systems 7/e, Addison-Wesley, 2001). Unfortunately, the importance of encapsulation isn't as well recognized by the data community. In fact, in his book, Foundation for Object/Relational Database Systems (Addison-Wesley, 1998), Date brags that encapsulation is a "non-issue." In reality, when a database is well encapsulatedperhaps through a persistence framework such as Castor (http://castor.exolab.org) or CocoBase (www.thoughtinc.com), by stored procedures, or even by data objectsit becomes significantly easier to evolve because coupling between systems and your database is dramatically reduced.
Politics can't be avoided, and control of an organization's data is always a political hotbed. Data administration groups are often very protective of their production environments, which is a good thing when managed appropriately. However, centralized data groups commonly tend toward prescriptive and documentation-intensive processes that dramatically slow down development and hobble emergent design efforts. Producing detailed logical or physical data models early in a project is not very agile and is more appropriate for a BDUF approach. Data administration groups can aid agile development efforts; they just need to reconsider the way that they operate.
The Process of Data Refactoring
First, don't get carried away. A development team shouldn't modify a production database on a whim; first, you should try out ideas in your own development sandbox, examine the production implications of proposed changes (more on this later), and consider making those changes in the production database only if and when it makes sense. To paraphrase Ron Jeffries (www.xprogramming.com), we're agile, not stupid.
For data refactoring to work, effective data migration and conversion are crucialevery time you change the database schema, you still need to store the same data to maintain the original semantics of your systems. Typically, you'd write scripts to copy the affected data to a secondary location, convert the old schema to the new one and then translate the copied data, so you can write it to the new schema. You actually need two scripts: one that evolves your development database schema and another that will eventually be used to help migrate your production database schema. The production script is an accumulation of the scripts for your development environmenta necessary approach because you can refactor your development environment whenever you need tobut your production environment can be changed only when you release your system into production. You must back up your database before running these scripts, so you can restore your database if required. Yes, writing these scripts is difficult, but with experience, it becomes a much easier task.
While production realities will constrain development, you also need to be realistic. For example, a common refrain of data groups states that the organization has x terabytes of data; therefore, you can't change the production schema because it's so big. I believe this is a red herring because the vast majority of projects, including reporting database/warehouse efforts, will typically deal with only a fraction of the available datathus, even a collection of data refactorings will have a negligible impact. The extent of the modifications required for systems coupled to the changed portions of the schema is the real issueit's not the size of your database; it's how you use it that counts.
Include a DBA on Your Team
By far, the best approach for supporting data refactoring is to have an experienced database administrator (DBA) as an active team member. A DBA will apply the data refactorings taking into account any necessary constraints from the current production databases. Furthermore, she'll help the team through the corporate database administration processes and reviews required to transition into production. Naturally, she'll also need to work with the people who apply the supporting refactorings to the systems that access the changed data schema, as well.
Your DBA should be well versed in your company's database standards, guidelines and chosen tools, so she can apply them effectively. Important standards and guidelines typically focus on naming conventions for tables, columns and stored procedures. Tool compatibility is also important to ensure that your data models are usable by others in your organization and to reduce licensing costs.
Having a DBA on your team is a different approach for organizations accustomed to a centralized data administration group that functions merely as a quality gate to production and interacts with project teams only through reviews. For data refactoring to work, however, data administration groups must actively support development teams as best they can, and developers may have to tolerate a few constraints that reflect the realities that the data administration folks must put up with. When everyone comes to the table willing to work toward an effective solution, it's possible to overcome the object-data divide (see "Crossing the Object-Data Divide," Thinking Objectively, Mar. and Apr. 2000) and minimize constraints without putting your production environment at risk.
Refactoring in Practice
Refactoring a well-designed database schema is easy when you don't need to worry about the systems that access the database. However, unless you've chosen to encapsulate data access, you often have extensive coupling between systems and a database, which makes data refactoring a tough task. Data refactoring works best when you apply one refactoring at a time, so you can iterate and incrementally release your work. With the right approach and the right team support, you can bring the well-known benefits of refactoring to your database as well as to your code.
A summary of known data refactorings is listed at www.agilemodeling.com/essays/dataRefactorings.htm.
I'd like to thank Charles T. Betz, Ron Jeffries and Chris Roffler for their input, either on the Agile Modeling mailing list or in private, about my work to date in data refactoring. There's more to come.
Scott W. Ambler