Whence Data Quality?

Scott reports on his survey that focused on data quality issues and the application of various data-oriented development techniques.


January 11, 2007
URL:http://www.drdobbs.com/database/whence-data-quality/196900212

Scott is a DDJ Senior Contributing Editor and author of numerous IT books. He can be contacted at www.ambysoft.com/scottAmbler.html.


According to a survey I performed in July 2006, 96 percent of respondents believe that data is a corporate asset. Unfortunately, that survey, reported in "Whence Data Management?" (DDJ, November 2006), showed that far fewer act on this belief. Only a minority of firms performed database regression testing or database refactoring, and not surprisingly, the majority of respondents reported having production data problems. These results concerned me, so in September 2006, I ran a second survey that focused on data quality issues and the application of various data-oriented development techniques. This month, I report the findings of that survey.

We sent the survey out to the DDJ mailing list and received 1137 responses. By primary job role, there were 585 developers, 168 IT managers, 107 project managers, 102 data professionals, and 174 people in other roles. A little more than 98 percent of the respondents were from North America, although I suspect that the trends the survey reveals are applicable internationally; I leave it up to you to be the judge of that. Respondents worked in various sizes of IT organizations (see Figure 1), 78 percent worked in the private sector, and 71 percent had 10 or more years of experience in IT.

Figure 1: Size of the IT organization.

Current State of Data Quality

Respondents were asked how they felt about the quality of their production data. Figure 2 summarizes the results by IT organization size. Data quality is highest within smaller organizations, presumably because they haven't had time to make serious mistakes and/or because they've adopted modern database development techniques. An interesting trend is that quality drops as the organization gets larger, until about the mid-sized organization level, where quality starts to rise again. My guess is that mid-sized organizations can still survive with a bit of "data chaos," but eventually an organization reaches a size where it must improve its data management approach if it is to survive.

Figure 2: Current state of data quality (% respondents) by IT organization size.

The survey asked whether the respondents worked in organizations with defined service-level agreements (SLAs) for database performance and for database availability. Of the respondents who knew the answer (the survey allowed "I don't know" as a possible answer for many questions), 30 percent and 42 percent, respectively, responded positively. Having SLAs in place seemed to correlate with improved data quality: 60 percent of respondents where one or both SLAs existed indicated that data quality was either perfect or pretty good, compared with 52 percent where no SLAs existed. Serious data quality problems were reported by 5 percent and 10.3 percent, respectively.

Figure 3 depicts the correlation of various approaches to data naming conventions with data quality—this is important because it is an indicator of the health of the relationship between the data management group and developers. The survey showed that when developers willingly followed the data naming conventions, data quality was better than when the conventions were enforced by the data group. Both of these approaches were much better than having inconsistently followed conventions, which in turn, was better than having no data naming conventions at all. (At www.agiledata.org, I describe a collection of strategies for promoting a more effective, collaborative relationship between data professionals and developers.)

Figure 3: Data naming conventions and data quality.
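Naming conventions are easier to follow willingly when they are cheap to check. As a minimal sketch (not part of the survey), assuming a hypothetical SQLite database and a lowercase snake_case convention, a small script can flag violating columns automatically:

```python
import re
import sqlite3

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")

def naming_violations(conn):
    """Return (table, column) pairs that break a lowercase snake_case convention."""
    violations = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        for col in conn.execute(f"PRAGMA table_info({table})"):
            name = col[1]  # table_info rows are (cid, name, type, notnull, default, pk)
            if not SNAKE_CASE.match(name):
                violations.append((table, name))
    return violations

# Hypothetical schema with one offending column name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER, FirstName TEXT)")
print(naming_violations(conn))  # [('customer', 'FirstName')]
```

Run as part of a build, a check like this turns the convention from an enforcement battle into fast, impersonal feedback.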

Reaction Time Is Critical

When I speak about agile database development techniques at conferences, I like to ask the audience whether they could successfully rename a column in a production database and deploy that change within a day. Many people in the audience laugh because they know that their organization is unable to accomplish this seemingly trivial task. I asked a similar question in the survey, and the results are summarized in Figure 4 by organization size. Interestingly, we once again see the same trend as in Figure 2.

Figure 4: Length of time to rename production column (% respondents) by size of IT organization.

On average, 11 percent reported that it would take 3 months to rename a column, 7 percent said it would take more than 3 months, and 8 percent worked in organizations where they felt it was far too risky to even attempt the rename. Too risky? Yikes! It seems to me that those organizations have convinced themselves that it is exceedingly difficult to evolve a relational database schema. As Pramod Sadalage and I show in Refactoring Databases (Addison-Wesley, 2006), this is not the case: regardless of how tightly external programs are coupled to the database schema, even when hundreds of heterogeneous programs access the database, it is possible to safely and rapidly make changes. Better yet, we show how to make database changes that offer significantly greater value, such as moving a column to another table, splitting a column that is currently used for multiple purposes, and fixing data quality problems.
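To make this concrete, here is a minimal sketch of the transition-period strategy behind the rename-column refactoring, illustrated with Python's sqlite3 module; the Customer table, FName column, and first_name target name are all hypothetical. The new column is added alongside the old one, backfilled, and kept synchronized by triggers until every program has migrated and the old column can be dropped:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: we want to rename Customer.FName to first_name.
    CREATE TABLE Customer (id INTEGER PRIMARY KEY, FName TEXT);
    INSERT INTO Customer (id, FName) VALUES (1, 'Ada');

    -- Step 1: introduce the new column alongside the old one.
    ALTER TABLE Customer ADD COLUMN first_name TEXT;

    -- Step 2: backfill existing rows.
    UPDATE Customer SET first_name = FName;

    -- Step 3: keep the columns synchronized during the transition period,
    -- so old and new programs both keep working until the old column is retired.
    CREATE TRIGGER sync_first_name_ins AFTER INSERT ON Customer
    BEGIN
        UPDATE Customer SET first_name = NEW.FName WHERE id = NEW.id;
    END;
    CREATE TRIGGER sync_first_name_upd AFTER UPDATE OF FName ON Customer
    BEGIN
        UPDATE Customer SET first_name = NEW.FName WHERE id = NEW.id;
    END;
""")

# An "old" program still writes FName; the new column stays in sync.
conn.execute("INSERT INTO Customer (id, FName) VALUES (2, 'Grace')")
print(conn.execute(
    "SELECT first_name FROM Customer WHERE id = 2").fetchone())  # ('Grace',)
```

In practice you would also add a reverse trigger for programs that write only the new column, and set a firm deadline for dropping the old one.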

If you can't easily evolve your database schema, or improve the quality of the data within it, then clearly, you can't respond effectively to changes within your business environment. Nor can you easily fix the problems within your database, problems that will only cost you more and more money over time. Look at Figure 4 again, and ask yourself how many respondents appear to work in organizations where their databases are an anchor around their necks, and not the assets that they desperately want them to be.

Database Testing

In the September survey, 66 percent reported that they do some form of database testing. In the July survey, I asked about database regression testing, a more complicated form of database testing, and found that less than half of respondents do that form of testing. Figure 5 depicts the relationship between various forms of testing and overall database quality. Many respondents indicated that they take more than one approach to database testing, so the results are commingled. The organizations that do no database testing at all seem to be in the worst shape, which should come as no surprise. Testing at the end of the lifecycle is an improvement, but appears to be the least effective time to test; apparently we need to rethink traditional approaches. Testing at the end of each development iteration is more effective still, and taking a test-driven design (TDD) approach appears to be the most effective of all.

Figure 5: Database testing and data quality.
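As a hedged sketch of what a database regression test might look like (the customer and orders tables are hypothetical, and SQLite stands in for a production engine), the checks below assert data-quality invariants, referential integrity and a CHECK constraint, against a throwaway copy of the schema:

```python
import sqlite3

def make_test_db():
    """Build a throwaway in-memory database so tests never touch production."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customer(id),
            total REAL NOT NULL CHECK (total >= 0)
        );
        INSERT INTO customer VALUES (1, 'ada@example.com');
        INSERT INTO orders VALUES (10, 1, 25.0);
    """)
    return conn

def check_no_orphaned_orders(conn):
    """Referential integrity: every order must point at a real customer."""
    orphans = conn.execute("""
        SELECT o.id FROM orders o
        LEFT JOIN customer c ON c.id = o.customer_id
        WHERE c.id IS NULL
    """).fetchall()
    assert orphans == [], f"orphaned orders: {orphans}"

def check_negative_total_rejected(conn):
    """Bad data should be refused at the source by the CHECK constraint."""
    try:
        conn.execute("INSERT INTO orders VALUES (11, 1, -5.0)")
    except sqlite3.IntegrityError:
        return
    raise AssertionError("CHECK constraint did not reject a negative total")

conn = make_test_db()
check_no_orphaned_orders(conn)
check_negative_total_rejected(conn)
print("all database regression checks passed")
```

Run after every schema change, or as the starting point of a TDD cycle, such checks catch quality regressions the moment they are introduced rather than in production.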

Data Modeling

The survey also asked about the approach to data modeling taken by project teams, and Figure 6 correlates the answers to data quality. It's interesting to note that evolutionary/agile approaches to data modeling prove to be just as effective as traditional/serial approaches, and that both are better than doing no data modeling at all. What we don't know from the survey is how much data modeling is actually occurring. In my experience, traditional teams seem to do a lot more modeling than agile teams, so potentially, agile teams are achieving the same results as traditional teams for a smaller investment. I suspect that a more detailed study is required to tease out what is really happening.

Figure 6: Approach to data modeling and data quality.

Conclusion

These surveys have shown that we clearly have some serious problems when it comes to data quality. At www.ambysoft.com/surveys, I have posted the source data (with the e-mails removed), the original questions, and PowerPoint slide decks summarizing critical findings from all of my surveys, including the most recent one. Please use these assets to communicate the challenges with traditional approaches to data management within your organization. Better yet, please analyze the data for yourself and report your findings back to the IT community. Now is the time to start digging ourselves out of the "data morass" that we find ourselves in.

Implications from the Survey

  1. Database service-level agreements (SLAs) appear to lead to improved data quality.
  2. A collaborative approach to data management is more effective than a command-and-control approach, which in turn, is better than no approach at all.
  3. A large percentage of organizations struggle to evolve their database schema in a timely manner, thereby reducing their competitiveness in the marketplace.
  4. The earlier and more often you test your database in the development lifecycle, the greater the data quality.
  5. Evolutionary/agile approaches to data modeling are just as effective as traditional approaches, and both approaches correlated to improved data quality.
—S.A.
