Performance Evaluations and a Universal Language for Data
Ralph Stout conducted benchmarks at Information Builders of software running TPC-H queries, including two major SQL products. He reported a performance advantage for software using extended set processing (XSP) technology, which recognizes the mathematical identity of data.
Results were similar for a performance evaluation that was conducted by BAE Systems. Tim Ellis of BAE offers feedback about Ralph Stout's benchmarks and his findings about the XSP solution.
Ralph's point is a very good one, and really at the core of the failings of the existing RDBMSs. With destructive updating, they must spend inordinate amounts of system resources trying to ensure that no data would be lost in the event of a transaction failure. And to Dave's recent revelation, it's also at the core of why RDBs are so slow: they cannot use the full buffered I/O capacity of current file systems [by using larger, more efficient paging sizes], because to do so would necessitate expensive record bookkeeping and index updating (typically via more indirect indexes) with each transaction.
Each of these actions involves one or more seeks of the file system. Any paging large enough (say, multi-megabyte) to gain an I/O transfer rate advantage will result in hundreds or thousands of record update operations (seeks) when the block is saved back to disk, and therefore brings the whole operation back to a crawl. So they have to balance maximizing page size to improve basic I/O transfer performance against the disk seek costs of destructive updates.
In XSP, the sets are, by definition, immutable and therefore there are no destructive updates. This removes the restriction on I/O buffer sizes and allows the data to be moved at full transfer rate capacities. In practice, buffer sizes should only be limited by data set size and therefore automatically optimized by the adaptive data restructuring. The buffers will, by definition, be exactly as big as they need to be for any and all information access operations.
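The contrast between destructive updating and immutable sets can be made concrete with a minimal sketch. The class below is a hypothetical illustration, not Algebraix's actual storage engine: "updating" an immutable set never touches the original, so each write is one whole new set rather than many scattered in-place record changes.

```python
# Minimal sketch (hypothetical API): with immutable sets, an "update" writes a
# brand-new set in one large sequential pass; the original set is never touched,
# so there is no per-record seek-and-rewrite bookkeeping.

class ImmutableSetStore:
    """Append-only store: sets are never modified after creation."""

    def __init__(self):
        self._sets = {}      # set_id -> frozenset of records
        self._next_id = 0

    def put(self, records):
        """Write a whole new set in one pass; return its identity."""
        set_id = self._next_id
        self._next_id += 1
        self._sets[set_id] = frozenset(records)  # one large sequential write
        return set_id

    def update(self, set_id, new_records):
        """'Updating' creates a new set; the original stays intact."""
        return self.put(self._sets[set_id] | frozenset(new_records))

    def get(self, set_id):
        return self._sets[set_id]

store = ImmutableSetStore()
v1 = store.put([("alice", 100), ("bob", 200)])
v2 = store.update(v1, [("carol", 300)])
assert store.get(v1) == frozenset([("alice", 100), ("bob", 200)])  # v1 untouched
assert ("carol", 300) in store.get(v2)
```

Because v1 still exists unchanged after the update, there is nothing to roll back if the write of v2 fails partway.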
As for points to make to the RDB community 'digerati', I would offer the following:
Data Integrity - A common point made about the value of traditional RDBs is their ability to ensure data integrity, with specific claims about the RDB architecture's approach of continually saving and preserving data as soon as it enters the DB or begins a transaction. As Ralph points out, this is far easier said than done, but RDBs must assure their vast customer base that their data is indeed safe. I think the XSP counterpoint here should focus on the immutability of XSP sets: as long as incoming data has been stored somewhere (anywhere) as an extended set, it is safe until explicitly deleted.
Further, there is no risk during a transaction, no matter how long, that data could be lost due to failure of the transaction itself or the system on which it's running. The only thing that could be lost is the portion of the result or intermediate data set that was "in flight" at the time of the failure. But since the original data is still safe in an immutable set, the operation can be simply re-executed. In addition, this re-execution can still take advantage of all completed sub-operations since their resultant sets and mathematical identity would also have been saved to immutable extended sets, thereby recouping much of the resources consumed by the original transaction operations.
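The re-execution argument above follows from a simple property: because input sets are immutable, an operation's result is fully determined by the operation and the identities of its inputs. The sketch below (an assumed design, not Algebraix's actual implementation) shows how completed sub-operations can be cached by identity and reused when a failed transaction is re-run.

```python
# Sketch (assumed design): immutable inputs mean a result is determined by
# (operation name, input identities), so completed sub-operations can be
# cached and reused when a failed transaction is re-executed.

class SetEngine:
    def __init__(self):
        self.sets = {}    # identity -> frozenset
        self.cache = {}   # (op_name, input identities) -> result identity

    def store(self, name, records):
        self.sets[name] = frozenset(records)
        return name

    def apply(self, op_name, op, *input_ids):
        key = (op_name, input_ids)
        if key in self.cache:          # completed sub-operation: reuse it
            return self.cache[key]
        result = op(*(self.sets[i] for i in input_ids))
        result_id = f"{op_name}{input_ids}"
        self.sets[result_id] = frozenset(result)   # saved as an immutable set
        self.cache[key] = result_id                # ...with its identity
        return result_id

engine = SetEngine()
a = engine.store("a", range(10))
b = engine.store("b", range(5, 15))
r1 = engine.apply("intersect", lambda x, y: x & y, a, b)
# A re-executed transaction repeats the same call and hits the cache:
r2 = engine.apply("intersect", lambda x, y: x & y, a, b)
assert r1 == r2
```

On re-execution the intersection is not recomputed; the engine returns the identity of the already-saved result set, which is the "recouping" described above.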
Duplication - The buzzword today among large data centers and warehouses is "de-duplication": how do I minimize any duplication of data within my vast holdings so that
a) I don't run the risk of losing referential integrity or having concurrency issues when one copy of a data set is modified and another is not, and
b) I minimize the storage hardware, power, and cooling costs?
(A recent stat I heard is that a current rule of thumb for data warehouses is for each 10 Petabytes of storage, the annual electricity alone is ~$1.1M based on current rates.)
There appears to be considerable work going into this de-duplication effort to find, verify, remove, and re-reference links within RDBs. Since XSP inherently uses the most cost-effective (based on set operation costs) data available, and since XSP inherently maintains a complete catalog of the mathematical identities of its data sets, the only reason that multiple exact copies of the same data would exist in an XSP system is if the same data were ingested multiple times (a common problem in large data systems with multiple data sources). However, this could be resolved with a simple set difference operation that defines a unique output set. There will be duplication of some data, to be sure, in the form of variously repartitioned data structures left over from past set operations. But this is of value for two reasons:
1) it supports faster access performance by allowing the XSP system to use the least cost data set available, and 2) it provides a level of data security in the event of a disk failure.
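The set-difference de-duplication step described above is short enough to show directly. This is an illustrative fragment with made-up records, not a real ingest pipeline: a single difference against existing holdings yields only the genuinely new records, even when a source re-delivers data it already sent.

```python
# Sketch of de-duplication by set difference: when the same data arrives
# from multiple sources, differencing against existing holdings yields a
# unique output set of only the genuinely new records.

holdings = frozenset([("id1", "alice"), ("id2", "bob")])
incoming = frozenset([("id2", "bob"), ("id3", "carol")])  # partial re-ingest

new_records = incoming - holdings        # set difference: the unique output set
holdings = holdings | new_records        # a new immutable set, not an in-place edit

assert new_records == frozenset([("id3", "carol")])
assert len(holdings) == 3
```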
A new point here is that, based on more recent work with set-based processors, we've seen that these overlapping set partitions do not add significantly to the total storage since, in practice, most operation results are a tiny fraction (several orders of magnitude smaller) of the total holdings of a large database. For example, if my holdings are PBs, and my retained result sets are typically ~MBs or even GBs in size, I can retain around a million result sets and still not double the size of my holdings. How much space is taken up by typical RDB index and secondary structures? :-)
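The back-of-envelope arithmetic behind that claim checks out; the figures below are illustrative numbers chosen to match the example, not measured data.

```python
# Back-of-envelope check of the storage claim above (illustrative numbers):
# a million GB-sized retained result sets against 1 PB of holdings.
PB = 10**15   # bytes
GB = 10**9    # bytes

holdings = 1 * PB
avg_result_set = 1 * GB
num_result_sets = 10**6

retained = num_result_sets * avg_result_set
assert retained == 1 * PB                      # a million GB results = 1 PB
assert holdings + retained <= 2 * holdings     # holdings at most doubled
```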
Fault Tolerance - To my last point on disk failure, I would point out that while duplication of data in extended sets would decrease the likelihood of data loss in the event of a disk or track failure, it would not guarantee against data loss without some set-theoretic help. Given the huge size (PB-class and moving to exabytes) of today's data systems, multiple failures resulting in irrecoverable loss of data will occur, even on dual-parity RAID 6 disk systems. I strongly suspect (though admittedly I haven't given it sufficient thought to fully bake the concept) that it would be possible to design a set of extended set operations which would guarantee that adaptively partitioned extended sets could be constructed and preserved in such a way that ALL data would be automatically duplicated at least once and on separate hardware. This should be quite possible and still require far less total storage than today's RDBs, which use from 3X to 10X+ storage to hold all the indexes and secondary storage structures needed to safely store, find, and update their records.
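The placement guarantee Tim speculates about can at least be stated as a simple rule over the partition catalog. The sketch below is an assumption of mine, not a worked-out XSP design: every partition gets a fixed number of copies, each on a distinct disk.

```python
# Sketch (assumed placement rule, not an XSP design): guarantee that every
# partition is stored at least twice, with each copy on separate hardware,
# by round-robin assignment over the available disks.

def place(partitions, disks, copies=2):
    """Assign each partition's copies to distinct disks, round-robin."""
    if copies > len(disks):
        raise ValueError("need at least as many disks as copies")
    placement = {}
    for i, part in enumerate(partitions):
        # consecutive offsets modulo len(disks) are always distinct
        placement[part] = [disks[(i + c) % len(disks)] for c in range(copies)]
    return placement

layout = place(["p0", "p1", "p2", "p3"], ["diskA", "diskB", "diskC"])
for part, locations in layout.items():
    assert len(set(locations)) == 2   # two copies, always on distinct disks
```

A single disk failure then leaves at least one intact copy of every partition, which is the "automatically duplicated at least once and on separate hardware" property.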
Sam Alexander has recently written about Childs, Kuratowski, extended sets, Algebraix Data and his Romaji Dictionary. About extended sets he commented:
Hence the crucial point of it all: if you're willing to manipulate extended sets, rather than requiring one specific form (raw, relational or hierarchical), then you never have to worry about converting data from one form to another. It's a universal language of data.
Source: "Extended Set Theory, Extended Set Algebra, and Algebraix Data"