Channels ▼


The NoSQL Alternative

Document Database Stores

CouchDB and MongoDB are representative of the JSON class of document database, whereas there are a large number of products that store documents encoded as XML. MongoDB is a popular product based on client-server database architecture with b-tree indexes and communication over TCP/IP networks.

MongoDB manages JSON object collections and provides scalability with sharding and replication. Queries are JSON objects, and MongoDB also provides 2-D geospatial searches. There are driver APIs for various languages, including JavaScript, Java, Perl, PHP, Python, Ruby, and C++. Notable production deployments include, The New York Times, Disqus, Electronic Arts, and Business Insider.

CouchDB is a schema-less data store that provides a REST-style API for document-centric CRUD (create, retrieve, update, delete) operations. CouchDB can do retrieval based on key values, and it can operate with Hadoop MapReduce for nontrivial queries. It's also possible to generate views using JavaScript functions. Creating a view can be time-consuming but subsequent retrievals with the view are speedier.

CouchDB supports multimaster replication and distribution of data across multiple instances. It manages documents in JSON format and uses the SpiderMonkey JavaScript engine, making it well-suited for Web applications, such as Erlang, HTTP, JavaScript, PHP, Python, and Ruby.

For applications where XML documents and XQuery are preferred over JSON documents and JavaScript queries, there are a number of open source and commercial products. Besides the XML document stores, there are dozens of XQuery processors. The list includes Apache Xindice, Berkeley DB, eXist-db, IBM DB2, MonetDB, Mark Logic, Sedna, WebMethods Tamino, and TigerLogic XDMS.

Distributed Processing

When it comes to distributed processing of massive data sets, Hadoop MapReduce has become the red-hot technology du jour. Researchers at Yahoo, for instance, used 3,800 nodes with it to sort a petabyte of data in 16.25 hours.

Google developed and recently patented MapReduce. The map function produces a list of key-value pairs that MapReduce turns into a list of values.

The Apache Hadoop Project includes the Hadoop Distributed File System (HDFS), MapReduce, HBase database, Pig analysis language, Hive query and analysis tool, and other software. HBase is a distributed column store, modeled after Google Bigtable that can serve as input or output for MapReduce.

HBase is one of several column stores competing in the analytics and business intelligence market. Storing tables in column-major order provides substantial performance improvements over row-major stores. Benefits such as improved locality and cache performance make for better performance of retrieval-oriented queries, but performance is poor for insertion queries. Other column stores include Sybase IQ, Vertica, and CStore, an open source collaboration among several universities.

Increased interest in semantic searching and Linked Data has brought RDF triples store into the spotlight. These offerings include AllegroGraph, Bigdata, Garlik, Jena, Ontotext Big-OWLIM, OpenLink Virtuoso, Oracle 11g, and Sesame. Several have been deployed on Amazon EC2 to exploit the distributed processing power of the cloud. Raytheon BBN researchers have also used Hadoop MapReduce to create a distributed RDF store that supports SPARQL query processing.

Restrictions And Best Practices

To ensure durability and data integrity, SQL databases provide logging and data replication. NoSQL offerings need a similar safety net. Cassandra, for example, supports transaction logging and automatic replication. Tokyo Cabinet and HBase support write-ahead logging. Tokyo Cabinet and CouchDB support master-master replication, whereas MongoDB supports master-slave replication and replica pairs.

Architects using document-oriented databases must deal with how to store different document types and whether to have a separate database for each one. Alternatives to separate databases include using an attribute to specify type or using separate collections.

Because the new generation of data stores is intended to address scalability and availability needs, certain restrictions apply for maximum efficiency. With Amazon SimpleDB, for example, the time limit for queries is five seconds. If a query takes longer, SimpleDB returns partial results, and the application must make additional calls to obtain complete results. SimpleDB restricts a query result to a maximum of 250 items, whereas Google recently lifted the AppEngine data store query result limit of a thousand items.

In horizontally partitioned systems, queries that require cross-shard joins are expensive, so the design and algorithms for partitioning require skill and knowledge of data-usagepatterns. When complex queries such as aggregation are required, NoSQL operational databases aren't typically a good fit, but they can provide source data for separate solutions for analytics. Organizations using a key-value datastore sometimes need the indexing and query capabilities of SQL. They can turn to other software that supports indexing and queries, such as Apache Lucene. Regardless of whether your organization is using SQL or NoSQL databases, it's still a good idea to use version control and separate databases for testing and production.

For all the areas that NoSQL options address, we're still left with the question of which database software to adopt. The answer depends on fundamental issues: How much and what types of data will you store? Will it be used for complex queries? How many concurrent users are you supporting? And will your database scale as it takes on more users and data? SQL or NoSQL, this is the place to start.

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.