Movement on the Big Data Front
The enthusiasm for Big Data applications has us putting persistent data solutions under a microscope these days. Although Big Data applications all involve operations on large data sets, their function can vary from online transaction processing to analytics to semantics-driven information retrieval. And an application might be using a distributed key-value store, a row- or column-order store, a set store, a triples store or some other technology.
My previous blog posting about David Childs' extended set theoretical model caught the attention of Dr. Hasan Sayani, the Graduate School Program Director of Software Engineering for UMUC. Dr. Sayani has long been acquainted with the extended set theoretical model and read Childs' most recent paper about set-store architectures. I sensed frustration when Dr. Sayani wrote:
I have been following Dave since the 60's when I was at Michigan and though I see the value in what he has accomplished I have failed to ignite any attention among those who might benefit from it!
Perhaps Dr. Sayani will be cheered by the recent developments on the product front and in the blogosphere. In a few years, we might be able to point to Big Data as the spark for adoption of extended set theory (XST). Certainly the blogosphere has shown interest in XST, including Jerome Pineau's "Big Honking Databases" blog and Ron Jeffries' XProgramming blog.
Pineau has written about business intelligence and his success with extended set theory as implemented by the XSPRADA engine. In October 2009, XSPRADA Corporation became Algebraix Data Corporation, with an analytic database product named A2DB based on new patented technology:
Algebraix Data Corporation today announced that it has been granted U.S. Patent No. 7,613,734, for its systems and methods for providing data sets using a store of algebraic relations.
Algebraix is one of the more recent entries in the analytic database race that's fueled in part by the interest of venture capitalists, by established companies offering new products, such as Oracle Exadata, and by advocates of open source software, such as HBase, Hadoop and HadoopDB.
The BI community represents only one slice of the Big Data user pie. The slice that represents the Linked Data / Web 3.0 / Semantic Web community isn't as large, but that community is growing. In March 2010, Oxford University and the University of Southampton announced that a new Institute for Web Science will lead the way in Web 3.0 development with £30 million in funding from the UK government:
Web 3.0 will take the web to a whole new level by publishing data in a linkable format so that users and developers can see and exploit the relationships between different sets of information.
Cassandra, Hadoop Map/Reduce, Greenplum and other engines come up frequently in discussions about Big Data. But if Sir Tim Berners-Lee has his way, we'll be having more discussions about solutions for Really Big Data.
The W3C Resource Description Framework (RDF) defines a triples data model that's gained acceptance for Semantic Web applications, Linked Data and building out Web 3.0. There are a variety of data stores capable of handling billions of RDF triples, including OpenLink Virtuoso, Ontotext BigOWLIM, AllegroGraph, YARS2, and Garlik 4store.
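To make the triples data model concrete, here is a minimal Python sketch. It is only an illustration: the `Triple` type, the `match` helper and the `ex:` names are hypothetical, and none of the stores named above is implemented this way. Each statement is a (subject, predicate, object) triple, and a query pattern leaves some positions open.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical sample data; the "ex:" names are made up for illustration.
triples = {
    Triple("ex:alice", "ex:knows", "ex:bob"),
    Triple("ex:bob", "ex:knows", "ex:carol"),
    Triple("ex:alice", "ex:worksFor", "ex:acme"),
}


def match(store, subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    )


# Every "knows" statement, whatever its subject and object.
knows = match(triples, predicate="ex:knows")
```

A real triples store adds indexes over the three positions so that a pattern like this does not require a full scan, which is what makes billions of triples tractable.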
Raytheon BBN Technologies has approached the triples store problem with a cloud-based technology known as SHARD (Scalable, High-Performance, Robust and Distributed). SHARD is a distributed data store for RDF triples that supports SPARQL queries. It's based on Cloudera Hadoop Map/Reduce and it's been deployed in the cloud on Amazon EC2. SHARD uses an iterative process, with one MapReduce operation executed for each clause of a SPARQL query. According to Kurt Rohloff, a researcher at Raytheon BBN, SHARD "performs better than current industry-standard triple-stores for datasets on the order of a billion triples."
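The clause-at-a-time idea can be sketched in a few lines of Python. This is a toy, single-process illustration and not SHARD's code: the function names, the `?x`-style variable syntax and the sample data are all assumptions, and a plain loop stands in for what would be a distributed MapReduce job per clause.

```python
def is_var(term):
    """Treat terms starting with '?' as query variables (an assumed convention)."""
    return term.startswith("?")


def extend_binding(clause, triple, binding):
    """Try to match one clause against one triple under an existing binding.

    Returns the extended binding, or None if the triple does not fit.
    """
    new = dict(binding)
    for term, value in zip(clause, triple):
        if is_var(term):
            # Bind the variable, or check it agrees with an earlier binding.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new


def run_query(clauses, triples):
    """Evaluate the clauses one at a time, carrying partial bindings forward."""
    bindings = [{}]
    for clause in clauses:  # one pass per clause, standing in for one job
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = extend_binding(clause, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# Hypothetical data and a two-clause query: who knows someone who knows someone?
data = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:knows", "ex:carol"),
    ("ex:alice", "ex:worksFor", "ex:acme"),
]
query = [("?x", "ex:knows", "?y"), ("?y", "ex:knows", "?z")]
results = run_query(query, data)
```

Each pass narrows or extends the set of partial answers, so the number of passes grows with the number of clauses in the query rather than with the size of the data set.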