In This Issue
- Questioning Traditional Data Management
- Hot Links
The Data Warehouse Institute (TDWI) has estimated that data quality problems cost U.S. businesses $611 billion annually. Where I come from that's serious money. Last September, Dr. Dobb's Journal ran a data quality survey which discovered that the majority of IT organizations recognized that they had data quality problems but were struggling to address them effectively. I believe that the primary reason for this is that data management groups have based their processes on assumptions which prove to be questionable at best and downright false at worst. Let's explore the assumptions of the traditional data management community and see what alternatives the agile community has to offer.
Assumption #1: It's expensive to evolve a database schema. As I show in Refactoring Databases, this assumption is clearly false. As TDWI shows, there are significant data quality problems out there, so it behooves data management professionals to adopt a safe and straightforward technique such as database refactoring. When existing data professionals first hear about refactoring they often profess that it's a great idea for small databases, which it is, but that it isn't realistic for "large databases" due to the sheer volume of data. This is yet another assumption, or more accurately an excuse, which must be overcome. If there is a quality problem with a voluminous data source, that's all the more reason to address it.
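To make the technique concrete, here is a minimal sketch of the Rename Column refactoring from Refactoring Databases, using Python and an in-memory SQLite database purely for illustration (the table and column names are hypothetical): the new column is introduced alongside the old one, and a trigger keeps the two synchronized during a transition period so that unmigrated applications keep working.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical legacy schema: a customer table with a poorly named column.
cur.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, fname TEXT)")
cur.execute("INSERT INTO customer (id, fname) VALUES (1, 'Ada')")

# Step 1: introduce the better-named column alongside the old one,
# rather than replacing it outright, and backfill the existing data.
cur.execute("ALTER TABLE customer ADD COLUMN first_name TEXT")
cur.execute("UPDATE customer SET first_name = fname")

# Step 2: keep the two columns synchronized during the transition
# period so applications that still write the old column stay correct.
cur.execute("""
    CREATE TRIGGER sync_first_name AFTER UPDATE OF fname ON customer
    BEGIN
        UPDATE customer SET first_name = NEW.fname WHERE id = NEW.id;
    END
""")

# Legacy code writes the old column; the trigger propagates the change.
cur.execute("UPDATE customer SET fname = 'Grace' WHERE id = 1")
print(cur.execute("SELECT first_name FROM customer WHERE id = 1").fetchone()[0])
# prints Grace

# Step 3, once the transition period ends: drop fname and the trigger.
```

The point of the transition period is that the refactoring stays safe at every step: old and new access paths both work until every application has been migrated.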
Assumption #2: You need to model the details up front. As I show in Agile Database Techniques, and in several print columns, it is not only possible to do data modeling in an evolutionary/agile manner, it is highly desirable. Detailed up-front modeling actually proves to be incredibly risky in practice because people become committed to their original design and are either unwilling or unable to change strategies later on. For all the talk within the data community about designing everything up front, we often see overbuilt databases with extra columns, tables, and views which actually detract from the quality of the design. We also see existing columns and tables being used for purposes other than originally intended, once more detracting from the quality.
With an evolutionary approach you can identify the details when you need them, which proves to be more efficient than up-front modeling for several reasons. First, you can focus on modeling only what you need when you need it, reducing the overall modeling effort. Second, because your knowledge of the domain grows as the project progresses you can ask more intelligent questions when you model storm in a just-in-time (JIT) manner instead of at the beginning of the project. Third, if you follow the common agile practice of delivering working software on a regular basis your stakeholders have greater knowledge of the solution space later in the project and therefore can give you more intelligent answers.
Yes, you should do some initial up-front modeling, but at a very high level. Changing requirements and the mistakes of us pesky humans will always necessitate evolving your code and schema. A high-level model enables you to act on what you know by identifying a likely direction to go in, yet puts you in a position where your design can evolve over time based on your improving understanding of the domain. You can address potential performance and data sourcing issues in your initial modeling efforts without the burden of capturing unnecessary details early in the lifecycle.
Assumption #3: You need to write everything down. This also proves to be a questionable assumption in practice. Instead of writing static documentation why not write executable documentation such as a test? For example, why write a performance requirement when you can just as easily write a performance test which validates whether your design is actually performant? With a test-first approach to design the information is still captured but in a far more valuable format.
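As a sketch of what such executable documentation might look like, here is a hypothetical performance requirement ("an order lookup by primary key completes in under 50 ms") captured as a test rather than as prose, using Python's unittest and an in-memory SQLite database; the table, threshold, and names are all illustrative assumptions.

```python
import sqlite3
import time
import unittest

class OrderLookupPerformanceTest(unittest.TestCase):
    """Executable documentation: the requirement 'order lookup by id
    completes in under 50 ms' captured as an assertion, not prose."""

    def setUp(self):
        # A small, hypothetical orders table to run the lookup against.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
        self.conn.executemany("INSERT INTO orders VALUES (?, ?)",
                              ((i, i * 1.5) for i in range(10_000)))

    def test_lookup_under_threshold(self):
        start = time.perf_counter()
        row = self.conn.execute(
            "SELECT total FROM orders WHERE id = ?", (4242,)).fetchone()
        elapsed = time.perf_counter() - start
        self.assertIsNotNone(row)
        self.assertLess(elapsed, 0.05)  # the requirement itself, as code
```

Unlike a written requirement, this one is re-validated every time the suite runs: if a schema change degrades the lookup, the "documentation" fails loudly instead of going quietly stale.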
Assumption #4: You need to take a data-driven approach. It shouldn't be a surprise that data professionals believe in the importance of their specialty. There is a consistent belief among data professionals that the information model pervades the overall system, which is completely true. But then again security issues also pervade the overall system, as does usage, as does functionality, as does... you get the point. Taking a "data-driven" approach is a preference, but from the state of data quality within the industry it doesn't appear to be a very good one. Most modern methodologies, such as Rational Unified Process (RUP), Dynamic Systems Development Method (DSDM), and Microsoft Solutions Framework (MSF), tend to promote a usage-driven approach, with use cases, user interface prototypes, and usage scenarios respectively being the main modeling artifacts, not data models. What I've observed on numerous occasions is that many data professionals will ignore or simply be oblivious to the plethora of other important issues that we must also deal with as IT professionals.
Assumption #5: Reviews and inspections are an effective way to ensure quality. Once again, that doesn't seem to be working out so well in practice. The most effective way to ensure data quality is to actually test your database, and better yet to do so continuously via regression testing. From what I can tell there is very little recognition within the data community that database regression testing is critical to ensuring data quality: sadly, the plethora of data quality books and papers rarely mention database testing, let alone discuss it coherently. This is a huge blind spot within the data community.
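Here is a minimal sketch of what database regression testing can look like, assuming a hypothetical account table and using Python's unittest with an in-memory SQLite database: each test starts from a known clean state and verifies that the schema's data quality rules still hold, so any change that silently weakens them is caught on the next run.

```python
import sqlite3
import unittest

def create_schema(conn):
    # Hypothetical schema under test; a real suite would target your
    # actual database and its migration scripts.
    conn.execute("""CREATE TABLE account (
        id      INTEGER PRIMARY KEY,
        email   TEXT NOT NULL UNIQUE,
        balance REAL NOT NULL DEFAULT 0 CHECK (balance >= 0))""")

class AccountSchemaRegressionTest(unittest.TestCase):
    def setUp(self):
        # A known, clean state before every test -- the key to
        # repeatable regression runs.
        self.conn = sqlite3.connect(":memory:")
        create_schema(self.conn)

    def test_rejects_duplicate_email(self):
        self.conn.execute("INSERT INTO account (email) VALUES ('a@example.com')")
        with self.assertRaises(sqlite3.IntegrityError):
            self.conn.execute("INSERT INTO account (email) VALUES ('a@example.com')")

    def test_rejects_negative_balance(self):
        with self.assertRaises(sqlite3.IntegrityError):
            self.conn.execute(
                "INSERT INTO account (email, balance) VALUES ('b@example.com', -5)")
```

Run as part of every build, tests like these turn data quality rules from review-meeting checklist items into constraints that are verified automatically, hundreds of times a day.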
Assumption #6: You need to govern data. Yes, someone does need to govern the data within your organization, but that doesn't imply a traditional command-and-control approach. Adopting traditional data management techniques such as detailed modeling and reviews is little more than throwing additional bureaucracy at the data governance problem. Good governance is collaborative and it is an enabling force, not a controlling force. Considering the typical relationship that data management groups have with development teams, I believe that there is very little chance that they will succeed at data governance. In July 2006, a Dr. Dobb's survey which explored the current state of data management within IT organizations found that two-thirds of respondents go around their data groups to get their jobs done in a timely and effective manner. This result, and the lackluster performance of the traditional data community during the past three decades, should give us all reason for concern.
The agile database techniques that I've mentioned in this newsletter, such as database refactoring, database regression testing, and agile data modeling, are technically straightforward to adopt. Yes, they require a bit of training and mentoring at first and they could definitely do with better tool support. But the real challenge is cultural. Traditional data management appears to be based on some very questionable assumptions and few people within that community have seen fit to challenge them. As the agile community has clearly shown over the past few years these assumptions don't seem to hold water in practice. I invite the data community to step back and reconsider what they hold to be true.
Data Warehousing Special Report: Data Quality and the Bottom Line by Wayne W. Eckerson
The column Whence Data Management? summarizes the results of a July 2006 survey which explored the current state of data management practices within IT organizations and discovered that there is significant room for improvement.
The column Whence Data Quality? summarizes the results of a September 2006 survey which explored the extent of data quality techniques currently employed within IT organizations.
In the Summer of 2004 I wrote a three-part series of columns describing in detail how to take an agile approach to data modeling.
Larry English has written a great article about evolutionary approaches to database development entitled Kaizen and Information Quality.
One Truth Above All Else Anti-Pattern explores another of the great assumptions of traditional data management.
Agile Database Techniques: Effective Strategies for the Agile Software Developer won a 2004 Jolt Productivity Award. It overviews a collection of agile techniques such as database refactoring, database testing, and agile data modeling.
Refactoring Databases: Evolutionary Database Design won a 2006 Jolt Productivity award. It describes in detail the process of refactoring an existing database schema and contains implementation details for over 60 common database refactorings.
Database Regression Testing describes in detail how to regression test a relational database.