PaaS, Public, Private Clouds
A Platform-as-a-Service (PaaS) solution bundles developer tools and a data store, but users who opt for an infrastructure provider or build a private cloud must match the data store or database to their application requirements and budget. Open source and commercial products span a wide range of capabilities, from scalable simple data stores to robust platforms for complex query and transaction processing.
Databases, data stores, and data access software for cloud computing must be evaluated for suitability for both public and private clouds and for the class of applications supported. For example, Amazon Dynamo was built to operate in a trusted environment, without authentication and authorization requirements. Whether the environment supports multi-tenant or multi-instance applications also influences the database decision.
Databases and Data Stores
Data management options for the cloud include single-format data stores, document databases, column data stores, semantic data stores, federated databases and object-relational databases. The latter group includes "Swiss Army Knife" servers from IBM, Microsoft, OpenLink, and Oracle that process SQL tables, XML documents, RDF triples and user-defined types.
Building a petabyte size web search index is a very different problem from processing an order or mapping wireless networks. The requirements of the application and data store for those tasks are quite different. For new applications hosted in the cloud, developers will look primarily to several classes of data store:
- SQL/XML (object-relational) databases
- Column data stores
- Distributed hash table (DHT), simple key-value stores
- Tuple space variants, in-memory databases, entity-attribute-value stores and other non-SQL databases with features such as filtering, sorting, range queries and transactions.
Because this cornucopia of data stores has diverse capabilities, it's important to understand application requirements for scalability, load balancing, consistency, data integrity, transaction support and security. Some newer data stores are an exercise in minimalism. They avoid joins and don't implement schemas or strong typing, instead storing data as strings or blobs. Scalability with very large data set operations is a requirement for cloud computing, which has contributed to the recent enthusiasm for the DHT and distributed key-value stores.
Associative arrays, dictionaries, hash tables, rings, and tuple spaces have been around for years, as have entity-attribute-value (EAV) stores, database partitions and federated databases. But cloud computing puts an emphasis on scalability and load balancing by distributing data across multiple servers. The need for low-latency data stores has created an Internet buzz about key-value stores, distributed hash tables (DHT), entity-attribute-value stores and data distribution by sharding.
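The key-distribution idea behind DHTs can be illustrated with consistent hashing, the technique commonly used to spread keys across servers so that adding or removing a node relocates only a fraction of the keys. This is an illustrative sketch, not any product's implementation; the class and server names are made up.

```python
# Sketch of consistent hashing for a distributed key-value store.
# Names (HashRing, server-a, ...) are illustrative, not a real API.
import bisect
import hashlib

class HashRing:
    """Maps keys to nodes on a hash ring; each node gets many virtual
    points on the ring so keys spread evenly."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas      # virtual points per server
        self._ring = []               # sorted list of (hash, node)
        for node in nodes:
            self.add_node(node)

    def _hash(self, value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.replicas):
            bisect.insort(self._ring, (self._hash(f"{node}:{i}"), node))

    def get_node(self, key):
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, ""))  # first point >= hash
        if idx == len(self._ring):                # wrap around the ring
            idx = 0
        return self._ring[idx][1]

ring = HashRing(["server-a", "server-b", "server-c"])
print(ring.get_node("user:1001"))   # every key maps to exactly one server
```

Because lookups depend only on the hash, any client computes the same placement without a central directory, which is what makes DHT-style stores easy to scale and load balance.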
Tuple spaces are a distributed shared memory solution that originated with the Linda project at Yale and spawned more than 20 implementations, including Object Spaces, JavaSpaces, GigaSpaces, LinuxTuples, IBM TSpaces, and PyLinda. The GigaSpaces eXtreme Application Platform is available as a pay-per-use service on Amazon EC2; it includes a local and distributed Jini transaction manager, support for the Java Transaction API (JTA) and JDBC, and b-tree and hash-based indexing. Amazon SimpleDB also provides standard tuple space interfaces, but adds secondary indexing and support for additional query operators.
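The tuple space model boils down to a few operations: write a tuple into the space, read a tuple matching a template, and take (read and remove) one. A minimal in-memory sketch in the Linda style, purely illustrative and not any product's API:

```python
# Toy Linda-style tuple space: write/read/take with template matching.
# Illustrative only; real systems add blocking reads, leases, transactions.
class TupleSpace:
    def __init__(self):
        self._tuples = []

    def write(self, tup):
        self._tuples.append(tup)

    def _match(self, tup, template):
        # None in a template is a wildcard matching any value
        return len(tup) == len(template) and all(
            t is None or t == v for v, t in zip(tup, template))

    def read(self, template):
        """Return a matching tuple without removing it (None if absent)."""
        for tup in self._tuples:
            if self._match(tup, template):
                return tup
        return None

    def take(self, template):
        """Remove and return a matching tuple (None if absent)."""
        tup = self.read(template)
        if tup is not None:
            self._tuples.remove(tup)
        return tup

space = TupleSpace()
space.write(("job", 42, "pending"))
print(space.read(("job", None, None)))   # -> ('job', 42, 'pending')
print(space.take(("job", None, None)))   # removes the tuple
```

Producers and consumers coordinate only through the space, which is why the model distributes well: workers take tuples as tasks without knowing who wrote them.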
For large data sets and databases, partitioning data has long facilitated parallel query processing and load balancing. Horizontal partitioning, referred to as sharding, has caught the attention of developers looking to build multi-terabyte cloud databases because of its success at Amazon, Digg, eBay, Facebook, Flickr, Friendster, Skype, and YouTube.
SQLAlchemy and Hibernate Shards, object-relational mappers for Python and Java, respectively, provide sharding that's useful for cloud database design. Google developed Hibernate Shards for data clusters before donating it to the Hibernate project. You can do manual sharding for a platform such as Google AppEngine, use SQLAlchemy or Hibernate Shards for Python or Java development, or use a cloud data store such as MongoDB that provides administrative commands for creating shards.
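Manual sharding of the kind mentioned above usually means routing each row to one of N databases by hashing a shard key. A hedged sketch, with hypothetical shard names standing in for real connection strings:

```python
# Hash-based shard routing for manual sharding. The shard names are
# hypothetical placeholders for real database connection strings.
import hashlib

SHARDS = ["users_shard_0", "users_shard_1", "users_shard_2"]

def shard_for(user_id: str) -> str:
    """Pick a shard deterministically from the shard key."""
    h = int(hashlib.sha1(user_id.encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

# All rows for a given user land on the same shard, so single-user
# queries touch one database; cross-shard queries must fan out.
print(shard_for("alice"))
```

Note the trade-off this simple modulo scheme makes: it balances load well, but changing the shard count remaps most keys, which is why libraries like Hibernate Shards let you plug in your own shard-selection strategy.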
Distributed Hash Table, Key-Value Data Stores
Distributed hash tables and key-value stores are tools for building scalable, load balanced applications, not for enforcing rigid data integrity or the atomicity, consistency, isolation and durability (ACID) properties of transactions. They have limited applicability for applications doing ad hoc queries and complex analytics processing.
Products in this group include memcached, MemcacheDB, Project Voldemort, Scalaris, and Tokyo Cabinet. Memcached is ubiquitous and a popular caching solution for database-powered web sites. It's a big associative array accessed with get and put functions, using a key that uniquely identifies the data. It's particularly useful for caching information produced by expensive SQL queries, such as counts and aggregate values. MemcacheDB is a distributed key-value data store that conforms to the memcached protocol but uses Berkeley DB for data persistence.
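The caching pattern described above is commonly called cache-aside: check the cache first and fall back to the database only on a miss. In this sketch a plain dict stands in for the memcached client, and the query function is a placeholder for a slow SQL aggregate:

```python
# Cache-aside pattern as typically used with memcached. A dict stands in
# for the memcached client; real clients expose similar get/set calls.
cache = {}

def expensive_count_query(table):
    # Placeholder for a slow SQL aggregate, e.g. SELECT COUNT(*) FROM ...
    print(f"running query against {table}")
    return 1_000_000

def cached_count(table):
    key = f"count:{table}"        # the key uniquely identifies the data
    value = cache.get(key)
    if value is None:             # cache miss: compute and store
        value = expensive_count_query(table)
        cache[key] = value
    return value

cached_count("orders")   # first call hits the database
cached_count("orders")   # second call is served from the cache
```

In production the cached value would also carry an expiration time, since cache-aside by itself serves stale data until the entry is evicted or refreshed.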
Scalaris is a distributed key-value store, implemented in Erlang, which has a non-blocking commit protocol for transactions. Using the Web interface, you can read or write a key-value pair, with each operation being an atomic transaction. Using Java, you can execute more complex transactions. Scalaris has strong consistency and supports symmetric replication, but does not have persistent storage.
The open source Tokyo Cabinet database library is causing a buzz in online discussions of key-value stores. It's blazingly fast, capable of storing 1 million records in 0.7 seconds using the hash table engine and 1.6 seconds using the b-tree engine. The data model is one value per key, and it supports LZW compression. When keys are ordered, it can do prefix and range matching. For transactions, it offers write-ahead logging and shadow paging. Tokyo Tyrant is a database server version of Tokyo Cabinet that's been used to cache large SQL databases for high-volume applications.
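Prefix and range matching over ordered keys, the kind of lookup a b-tree engine supports, can be sketched with a sorted key list standing in for the on-disk index. This is an illustration of the technique, not Tokyo Cabinet's API:

```python
# Prefix and range matching over ordered keys; a sorted list stands in
# for a b-tree index like Tokyo Cabinet's b-tree engine.
import bisect

keys = sorted(["user:1", "user:2", "user:3", "order:7", "order:9"])

def range_match(lo, hi):
    """All keys k with lo <= k <= hi, found by binary search."""
    return keys[bisect.bisect_left(keys, lo):bisect.bisect_right(keys, hi)]

def prefix_match(prefix):
    """All keys starting with prefix; a high sentinel caps the range."""
    return range_match(prefix, prefix + "\xff")

print(prefix_match("user:"))              # -> ['user:1', 'user:2', 'user:3']
print(range_match("order:7", "order:9"))  # -> ['order:7', 'order:9']
```

The point is that ordered storage turns a prefix query into a cheap range scan; a pure hash table engine cannot answer either query without scanning every key.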
Some products in this group support queries over ranges of keys, but ad hoc queries and aggregate operations (sum, average, grouping) require programming because they are not built in.
Hadoop MapReduce would be a nominee for the Academy Award for parallel processing of very large data sets, if one existed. It's fault-tolerant and has developed a strong following in the grid and cloud computing communities, including developers at Google, Yahoo, Microsoft, and Facebook. Open source Hadoop is available from Apache, a commercial distribution is available from Cloudera, and Amazon offers an Elastic MapReduce service based on Hadoop.
MapReduce operates over the Hadoop Distributed File System (HDFS), with file splits and data stored as key-value pairs. HDFS partitions data across multiple machines so that batches can be processed in parallel, reducing processing time. MapReduce is suitable for processing very large data sets for purposes such as building search indexes or data mining, but not for online applications requiring sub-second response times. Frameworks built on top of Hadoop, such as Hive and Pig, are useful for extracting information from databases for Hadoop processing. The eHarmony.com site is an example of the marriage of an Oracle database and Amazon Elastic MapReduce, using the latter for analytics involving millions of users.
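The MapReduce key-value flow can be shown with word count, the canonical example, simulated here in plain Python. On Hadoop the map and reduce functions would run in parallel across HDFS file splits; this sketch only mirrors the dataflow:

```python
# Word count, the canonical MapReduce example, simulated sequentially
# to show the map -> group-by-key -> reduce dataflow.
from collections import defaultdict

def map_phase(document):
    # Emit (word, 1) for each word, mirroring a Hadoop mapper
    for word in document.split():
        yield (word, 1)

def reduce_phase(pairs):
    # Group values by key, then sum each group, mirroring the
    # shuffle-and-reduce step
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

docs = ["the cloud scales", "the cloud stores the data"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(pairs))   # {'the': 3, 'cloud': 2, ...}
```

Because mappers touch one split at a time and reducers see one key's values at a time, the same program scales from two documents to billions of pages without changing the map or reduce logic.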
EAV stores are derived from data management technology that pre-dates the relational model for data. They do not have the full feature set of an SQL DBMS, such as a rich query model based on a non-procedural, declarative query language. But they are more than a simple key-value data store. EAV data stores from major cloud computing providers include Amazon SimpleDB, Google AppEngine datastore and Microsoft SQL Data Services. And one type, the RDF datastore used for knowledge bases and ontology projects, has been deployed in the cloud.
Google Bigtable uses a distributed file system, and it can store very large data sets (petabyte size) on thousands of servers. It's the underlying technology for the Google AppEngine datastore. Google uses it, in combination with MapReduce, for indexing the Web and for applications such as Google Earth. Bigtable is a solution for projects that require analyzing a large collection, for example the one billion web pages and 4.78 billion URLs in the ClueWeb09 data set from Carnegie Mellon University. For those seeking an open source alternative to Bigtable for use with Hadoop, Hypertable and HBase have developed a following. Hypertable runs on top of a distributed file system, such as HDFS. HBase data is organized by table, row and multi-valued columns, and there's an iterator-style interface for scanning a range of rows. Hypertable is implemented in C++, whereas HBase is implemented in Java.
The Google AppEngine includes a schemaless data store that's optimized for reading, supports atomic transactions and consistency, and stores entities with properties. It permits filtering and sorting on keys and properties. It has 21 built-in data types, including list, blob, postal address and geographical point. Applications can define entity groupings as the basis for performing transactional updates and use GQL, a SQL-like query language. Access to the Google AppEngine datastore is programmable using Python interfaces for queries over objects known as entities. The datastore is also programmable using Java Data Objects (JDO) and Java Persistence API. Although AppEngine bundles a data store, the AppScale project provides software for operating with data stores such as HBase, Hypertable, MongoDB and MySQL.
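The entity-with-properties model behind stores like SimpleDB and the AppEngine datastore can be illustrated with a toy in-memory store that supports GQL-style filtering and sorting on properties. This is a hedged sketch of the data model, not any provider's API; the entities and query helper are invented for illustration:

```python
# Toy EAV-style entity store: schemaless entities with properties,
# plus filtering and sorting on those properties. Illustrative only;
# not the SimpleDB or AppEngine datastore API.
entities = [
    {"kind": "Employee", "name": "Ann", "hire_year": 2007},
    {"kind": "Employee", "name": "Bob", "hire_year": 2005},
    {"kind": "Employee", "name": "Cal", "hire_year": 2008},
]

def query(kind, filter_fn, order_by):
    """Roughly: SELECT * FROM kind WHERE <filter> ORDER BY <property>."""
    rows = [e for e in entities if e["kind"] == kind and filter_fn(e)]
    return sorted(rows, key=lambda e: e[order_by])

# GQL analogue:
#   SELECT * FROM Employee WHERE hire_year > 2005 ORDER BY hire_year
for e in query("Employee", lambda e: e["hire_year"] > 2005, "hire_year"):
    print(e["name"], e["hire_year"])
```

Because entities of the same kind need not share properties, the store is schemaless; the price, as noted above, is that joins and rich declarative queries are left to the application.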