Future of the Web and the Cloud: Data Sharing or Data Silos?
Missing from many discussions about the Internet, enterprise computing, social networks and cloud computing is the benefit for which the 'data bank' and 'data base' were conceived: to facilitate data sharing, consistency and data integrity. As a result, we find ourselves working through data quality issues and unnecessary integration problems made more complex by data silos. Complicating the situation are those reactionary developers whose enthusiasm for a sea of databases causes them to turn a blind eye to the productivity losses from data silos, including the effort and expense of reconciling and integrating data.
Fifty years ago the typical approach to software development was to build disjoint applications that created ad hoc data stores. Eventually organizations recognized the importance of data sharing by integrated information systems. When the idea of data as an asset gained traction, computer scientists pursued technology for maintaining data hubs and data banks, work exemplified by E. F. Codd's paper "A Relational Model of Data for Large Shared Data Banks". Today we have the Protein Data Bank, National Practitioner Data Bank, Cosmic Data Bank, NEA Data Bank, various national and state DNA data banks and other manifestations of data as a shared asset. But we also have the information soup of the World Wide Web, which Jim Gray called the world's largest database.
The early data processing paradigm, marked by much redundancy and duplication and little information sharing, evolved into the notion of building applications over a unified and integrated data base. The data base would serve as a cornucopia of facts which would beget useful applications.
A fundamental reason for operating with a data base was to have one place in an organization's information structure for storing a fact that was accessible by multiple applications. This meant data could be an asset shared by an entire organization instead of being locked up in a data store accessible to a single application. A second benefit was efficiency; a database management system (DBMS) represented common logic that did not have to be rewritten for each new application program.
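The idea of one stored fact serving several applications can be illustrated with a minimal sketch. It assumes an in-memory SQLite database and a hypothetical `customer` table; the "billing" and "shipping" functions are stand-ins invented for illustration, not part of any system described here. The point is only that both applications read the same row, so a single update is visible to each, and neither has to reimplement storage or query logic.

```python
import sqlite3


def create_shared_db() -> sqlite3.Connection:
    """Create the shared data base (hypothetical schema, in-memory for the sketch)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT NOT NULL)"
    )
    # The fact 'customer 1 is located in Boston' is stored exactly once.
    conn.execute(
        "INSERT INTO customer (id, name, city) VALUES (1, 'Acme Corp', 'Boston')"
    )
    conn.commit()
    return conn


def billing_address(conn: sqlite3.Connection, customer_id: int) -> str:
    """Hypothetical 'billing' application: reads name and city from the shared table."""
    name, city = conn.execute(
        "SELECT name, city FROM customer WHERE id = ?", (customer_id,)
    ).fetchone()
    return f"{name}, {city}"


def shipping_city(conn: sqlite3.Connection, customer_id: int) -> str:
    """Hypothetical 'shipping' application: reads the same city fact."""
    (city,) = conn.execute(
        "SELECT city FROM customer WHERE id = ?", (customer_id,)
    ).fetchone()
    return city


def relocate_customer(conn: sqlite3.Connection, customer_id: int, new_city: str) -> None:
    """Update the fact in its one location; every application sees the change."""
    conn.execute("UPDATE customer SET city = ? WHERE id = ?", (new_city, customer_id))
    conn.commit()


if __name__ == "__main__":
    db = create_shared_db()
    relocate_customer(db, 1, "Chicago")
    print(billing_address(db, 1))
    print(shipping_city(db, 1))
```

Had each application kept its own data store, the relocation would have required two updates and a reconciliation step to confirm that they agreed.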
Organizations such as IBM, GE, MITRE, SDC and TRW pushed the early development of database technology, along with pioneers such as Charles Bachman, Dwight Buetell, Don Nelson, Ted Olle, Dick Pick and John A. Postley. CODASYL, the consortium that published the 1960 COBOL standard, published the first database standard in 1971. The seminal work of David L. Childs on set-theoretic data structures and of E. F. Codd on the relational model leveraged Georg Cantor's set theory to provide a foundation for today's SQL databases.
By the 1980s software technology had advanced to the point where there was emphasis, even on small projects, on reusing code and defining formats for shared data. Development with CASE tools, modeling software and data dictionaries, such as Digital's Common Data Dictionary, was commonplace. The database management system (DBMS) had become mainstream technology. By the 1990s, companies such as IBM, Oracle, Sybase, Ingres, Informix and Microsoft were competing in the multibillion-dollar market for database software. SQL databases became a favored solution for enterprise applications, including mission-critical applications. And the original notion of a data base as the single repository of facts for an organization had morphed into the database, a container for data and logic that was managed by a DBMS. Organizations typically had disparate databases at the workgroup, department and enterprise level.
Because of competition and other influences, some organizations undertook the creation of data warehouses. This was in some measure a return to the original concept of having a single authoritative source for facts. The size increases for data warehouses have been dramatic. In 1995, Wal-Mart's 7.5-terabyte data warehouse was among the world's largest. In less than five years it grew to 24 terabytes, and today there are petabyte-sized data warehouses. eBay's data warehouse is 5 petabytes and Wal-Mart's has grown to 2.5 petabytes. Although a data warehouse provides the base for analytical queries and business intelligence processing, it does not represent the 1960s goal of a single unified data base that can sustain an organization's operational and management information systems.
The growth of networks, distributed databases, object databases, and embedded databases has moved us further away from the goal of an integrated, unified data base. Next we'll look at more of the data silo issues, and at remediation with enterprise data models, data integration, federated data and linked data.