Document Database Stores
CouchDB and MongoDB are representative of the JSON class of document databases, whereas many other products store documents encoded as XML. MongoDB is a popular product based on a client-server database architecture, with B-tree indexes and communication over TCP/IP networks.
When it comes to distributed processing of massive data sets, Hadoop MapReduce has become the red-hot technology du jour. Researchers at Yahoo, for instance, ran it on 3,800 nodes to sort a petabyte of data in 16.25 hours.
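Google developed and recently patented MapReduce. A map function turns each input record into a list of intermediate key-value pairs. MapReduce groups those pairs by key, and a reduce function merges the values collected for each key into a smaller list of values.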
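The map, group, and reduce flow can be sketched in a few lines of Python. This is a single-process illustration only, not Hadoop or Google code; the names map_words, reduce_counts, and run_mapreduce are hypothetical, and a real framework would spread the map and reduce calls across many nodes.

```python
from collections import defaultdict


def map_words(doc_id, text):
    """Map step: emit one (word, 1) pair for every word in a document."""
    return [(word.lower(), 1) for word in text.split()]


def reduce_counts(word, counts):
    """Reduce step: merge all values that were emitted for one key."""
    return sum(counts)


def run_mapreduce(inputs, mapper, reducer):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in inputs.items():
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical input: two tiny documents keyed by an id.
docs = {"d1": "the quick fox", "d2": "the lazy dog the end"}
word_counts = run_mapreduce(docs, map_words, reduce_counts)
```

Here word_counts maps each distinct word to the number of times it appears across both documents. The grouping step in the middle is the part the framework performs on the application's behalf.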
The Apache Hadoop Project includes the Hadoop Distributed File System (HDFS), MapReduce, the HBase database, the Pig analysis language, the Hive query and analysis tool, and other software. HBase is a distributed column store, modeled after Google Bigtable, that can serve as input or output for MapReduce.
HBase is one of several column stores competing in the analytics and business intelligence market. Storing tables in column-major order can provide substantial performance improvements over row-major stores for analytic workloads. Improved locality and cache performance speed up retrieval-oriented queries, but insertion performance is poor. Other column stores include Sybase IQ, Vertica, and C-Store, an open source collaboration among several universities.
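The trade-off between the two layouts can be illustrated with a toy in-memory table. The sketch below is an assumption-laden simplification, not how HBase or any product named here stores data; the table contents and function names are hypothetical.

```python
# Row-major layout: one record per row, all attributes stored together.
rows = [
    {"id": 1, "region": "east", "sales": 120},
    {"id": 2, "region": "west", "sales": 80},
    {"id": 3, "region": "east", "sales": 50},
]

# Column-major layout: one contiguous list per attribute.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def total_sales_row_major(table):
    """Aggregate by visiting every row, including attributes we do not need."""
    return sum(row["sales"] for row in table)


def total_sales_column_major(table):
    """Aggregate by scanning a single column and nothing else."""
    return sum(table["sales"])


def insert_column_major(table, record):
    """An insert must touch every column, which is why writes cost more."""
    for name, value in record.items():
        table[name].append(value)


insert_column_major(columns, {"id": 4, "region": "west", "sales": 30})
```

The aggregate over the column layout reads only the sales list, whereas the insert has to append to each of the three lists.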
Increased interest in semantic searching and Linked Data has brought RDF triple stores into the spotlight. These offerings include AllegroGraph, Bigdata, Garlik, Jena, Ontotext Big-OWLIM, OpenLink Virtuoso, Oracle 11g, and Sesame. Several have been deployed on Amazon EC2 to exploit the distributed processing power of the cloud. Raytheon BBN researchers have also used Hadoop MapReduce to create a distributed RDF store that supports SPARQL query processing.
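For readers unfamiliar with the data model, a triple store holds subject-predicate-object statements and answers queries by pattern matching. The following is a minimal sketch of that idea using plain Python tuples; it is not SPARQL and does not reflect the API of any product listed above, and the match function and sample data are hypothetical.

```python
# Each statement is a (subject, predicate, object) triple.
triples = {
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "worksFor", "acme"),
}


def match(store, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    pattern = (subject, predicate, obj)
    return sorted(
        triple
        for triple in store
        if all(want is None or want == have for want, have in zip(pattern, triple))
    )


who_knows_whom = match(triples, predicate="knows")
alice_at_acme = match(triples, subject="alice", obj="acme")
```

A production store would index the triples by subject, predicate, and object rather than scanning the whole set for every pattern.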
Restrictions And Best Practices
To ensure durability and data integrity, SQL databases provide logging and data replication. NoSQL offerings need a similar safety net. Cassandra, for example, supports transaction logging and automatic replication. Tokyo Cabinet and HBase support write-ahead logging. Tokyo Cabinet and CouchDB support master-master replication, whereas MongoDB supports master-slave replication and replica pairs.
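Write-ahead logging means recording a change durably before applying it, so that state can be rebuilt by replaying the log after a crash. The sketch below shows the idea with a JSON-lines file; the TinyStore class and its log format are hypothetical and do not represent how Tokyo Cabinet, HBase, or Cassandra implement logging.

```python
import json
import os
import tempfile


class TinyStore:
    """Toy key-value store that logs each write before applying it."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()

    def _replay(self):
        """Rebuild in-memory state from the log, oldest entry first."""
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                self.data[record["key"]] = record["value"]

    def put(self, key, value):
        # 1. Append the change to the log and force it to disk.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Only then apply the change to the in-memory state.
        self.data[key] = value


log_path = os.path.join(tempfile.mkdtemp(), "store.log")
store = TinyStore(log_path)
store.put("user:1", {"name": "Ada"})

# Simulate a restart: a fresh instance recovers its state from the log alone.
recovered = TinyStore(log_path)
```

Replication follows a similar pattern in spirit, since shipping the same ordered log to another node lets that node reach the same state.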
Architects using document-oriented databases must deal with how to store different document types and whether to have a separate database for each one. Alternatives to separate databases include using an attribute to specify type or using separate collections.
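The two alternatives to separate databases can be compared with plain dictionaries standing in for JSON documents. This is an illustrative sketch only; the field name "type", the sample documents, and the helper by_type are assumptions rather than a convention required by CouchDB or MongoDB.

```python
# Option 1: one collection, with an attribute that names the document type.
documents = [
    {"type": "customer", "name": "Ada", "city": "London"},
    {"type": "order", "customer": "Ada", "total": 42.5},
    {"type": "order", "customer": "Bob", "total": 10.0},
]


def by_type(docs, doc_type):
    """Select documents of one type by filtering on the type attribute."""
    return [doc for doc in docs if doc.get("type") == doc_type]


orders = by_type(documents, "order")

# Option 2: one collection per document type.
collections = {}
for doc in documents:
    collections.setdefault(doc["type"], []).append(doc)
```

With a type attribute, every query must include the filter. With separate collections the type is implied by where the document lives, at the cost of managing more collections.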
Because the new generation of data stores is intended to address scalability and availability needs, certain restrictions apply for maximum efficiency. With Amazon SimpleDB, for example, the time limit for queries is five seconds. If a query takes longer, SimpleDB returns partial results, and the application must make additional calls to obtain complete results. SimpleDB restricts a query result to a maximum of 250 items, whereas Google recently lifted the App Engine data store query result limit of 1,000 items.
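An application that must cope with partial results typically loops until the service reports that nothing is left. The sketch below simulates that loop against an in-memory list. The function query_page and its continuation token are hypothetical stand-ins, not the SimpleDB or App Engine API; only the 250-item cap is taken from the text above.

```python
ITEMS = [f"item-{i}" for i in range(600)]  # hypothetical data set
PAGE_LIMIT = 250  # per-result cap, mirroring the limit described above


def query_page(next_token=None):
    """Hypothetical service call: return one page and a continuation token."""
    start = next_token or 0
    end = min(start + PAGE_LIMIT, len(ITEMS))
    token = end if end < len(ITEMS) else None  # None means no more results
    return ITEMS[start:end], token


def fetch_all():
    """Keep calling until the service stops returning a continuation token."""
    results, token, calls = [], None, 0
    while True:
        page, token = query_page(token)
        results.extend(page)
        calls += 1
        if token is None:
            return results, calls


all_items, call_count = fetch_all()
```

With 600 items and a cap of 250 per result, the loop needs three calls to assemble the complete result.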
In horizontally partitioned systems, queries that require cross-shard joins are expensive, so the design and algorithms for partitioning require skill and knowledge of data-usage patterns. When complex queries such as aggregation are required, NoSQL operational databases aren't typically a good fit, but they can provide source data for separate solutions for analytics. Organizations using a key-value datastore sometimes need the indexing and query capabilities of SQL. They can turn to other software that supports indexing and queries, such as Apache Lucene. Regardless of whether your organization is using SQL or NoSQL databases, it's still a good idea to use version control and separate databases for testing and production.
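The cost of cross-shard work is easier to see in a small model of hash partitioning. The sketch below is a simplification under stated assumptions: four in-memory dictionaries stand in for shards, keys are placed by a SHA-256 hash, and the names shard_for, put, get, and total_where are hypothetical.

```python
import hashlib

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(key):
    """Pick a shard from a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS


def put(key, value):
    shards[shard_for(key)][key] = value


def get(key):
    """Key lookup: exactly one shard is consulted."""
    return shards[shard_for(key)].get(key)


def total_where(predicate):
    """Aggregate on a non-key attribute: every shard must be scanned."""
    return sum(
        order["total"]
        for shard in shards
        for order in shard.values()
        if predicate(order)
    )


for i in range(20):
    put(f"order:{i}", {"customer": f"c{i % 3}", "total": i})

one_order = get("order:7")
c0_total = total_where(lambda order: order["customer"] == "c0")
```

A lookup by key goes to a single shard, but the aggregate by customer has to visit all four because orders were partitioned by order key. Partitioning by customer instead would reverse that trade-off, which is why the choice depends on how the data is used.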
For all the areas that NoSQL options address, we're still left with the question of which database software to adopt. The answer depends on fundamental issues: How much and what types of data will you store? Will it be used for complex queries? How many concurrent users are you supporting? And will your database scale as it takes on more users and data? SQL or NoSQL, this is the place to start.