Information Density, Mathematical Identity, Set Stores and Big Data
Performance when executing queries for transactions and analytics remains a universal concern for those supporting data warehousing, high-volume web sites, business intelligence and cloud computing. One factor has been growth and the sheer volume of data being stored and retrieved puts a strain on computing resources. The term Big Data has become a convenient handle to label software that operates with very large data sets and the data access technology that's necessary.Two of my previous blog postings discussed David Childs' work on extended set theory (XST) providing a mathematical foundation for managing data. Extended sets could be an important solution to the data management challenges of Big Data applications.
Childs' research has focused on a mathematical identity for data and the use of mathematical expressions for behavior. The desirability of an algebraic approach for operations with various data types has been explored by other notable researchers. David J. DeWitt and Scott L. Vanderberg of the University of Wisconsin published "Algebraic Support for Complex Objects with Arrays, Identity, and Inheritance".
DeWitt and Vanderberg describe a viable algebraic query processor for an advanced data model, with databases of "completely arbitrary structure, multisets, arrays ... and a form of object identity that allows (but does not force) any part of a structure to have identity separate from its value."
SQL databases emerged as a result of applying set theory to data management, but using data operations on sets doesn't tie us to the relational model. For example, data types such as graphs and rings are defined as sets with various properties. The operating method for users of a CODASYL network model database is to define set ownership and membership and then traverse through links when accessing data.
Extended set theory is an augmentation of classical set theory. Instead of set operations with ordered pairs, extended set theory involves operations with n-tuples. In a seminar at Microsoft Research, Childs expressed this as
"In extended set theory, n-tuples are just sets with integer scopes (superscripts on element values)."
Today Big Data and scalability have captured the imagination of system architects and database designers. Web sites such as Facebook and Flicker present a unique scalability challenge, but many organizations are doing analytics with gigabyte-sized and larger data sets. Sharding, a process that distributes rows for a single table across multiple partitions, has been used for very large web databases with high transaction volumes. This partitioning and distribution of data across multiple servers permits the use of a shared-nothing architecture to provide scalability.
Techniques for partitioning data have become an area of interest for database designers and system architects. Besides sharding, other approaches include hash partitioning, range partitioning and list partitioning but extended set theory can play a role.
The advantage of the Childs' set-theoretic data model is it enables us to model data at a logical level and dynamically restructure it to match query requirements. Childs describes this advantage of extended set theory as adapting the data to the query instead of adapting the query to the data. That means a query can operate with an informationally-dense data set, requiring fewer I/O operations to deliver excellent performance against gigabyte-sized data sets, such as those generated for the Transaction Processing Council's TPC-H queries.
Benchmarks have shown extended set theory offers some performance advantages when processing large data sets. Indeed some of the execution times for decision-support queries have been startling when compared to IBM DB2, Microsoft SQL Server and Oracle.
Many researchers and developers concerned about Big Data and cloud computing scalability have accepted Brewer's CAP Theorem and the need for partitioning data for performance reasons. XST treats all data representations as mathematical objects and provides a mathematical foundation for operations with data that's been partitioned.
Analytics and business intelligence typically produce longer-running queries than a online transaction processing (OLTP) mix, but performance and response times can still be an issue, particularly with complex ad hoc queries. A sharding scheme may provide efficiency for OLTP but it can induce a performance penalty if a complex query spans partitions.
Childs' XST might prove to be an important tool for addressing data partitioning problems, assuming more developers and architects are willing to investigate a mathematical foundation for managing data. That was done by a generation of developers and architects who accepted Codd's relational model and the SQL databases that followed.
Childs has been researching set-theoretic operations for decades and most recently published "Set-Store Data Access Architectures". It compares mechanical and mathematical data models and explores column-store, row-store and set-store architectures. The paper explains a benefit of mathematical data access strategies:
" 9.3 Mathematical DAMs Since with mathematical data models updating is always a constructive process and no modifications to existing data are required, the only database design concern is to make sure all data required to support application data access requirements, is that all the data is actually loaded.
9.4 DAM Strategies This is not meant to imply that use of mathematical data models do not require considerable work and design intelligence to support high performance I/O data access, but only that the effort is less predictive and more adaptive in nature."
Childs has said the recent paper is ";an attempt to remedy forty years of set-theoretic obfuscation.";
One wonders whether this will re-kindle interest in data management based on a mathematical foundation or whether there's simply more energy going into building something new based on different objectives.
Oracle CEO Larry Ellison quite succinctly summed up the problem of capturing and sustaining mindshare for technology:
"The computer industry is the only industry that is more fashion-driven than women's fashion."
It will be interesting to watch whether set-store data access architectures become high fashion for processing large data sets.
Part 2: "Laying the Foundation" It will be interesting to watch whether set-store data access architectures become high fashion for processing large data sets.