Amazon SimpleDB is a schemaless, Erlang-based, eventually consistent data store suited for high-availability applications. Its data model organizes data into domains, which are large collections of items. Each item is a hash table of attributes stored as key-value pairs. An attribute can have multiple values, and there are no joins. Queries can return an itemName, all attributes, the attribute count or an attribute list. All data is stored in a single format (untyped strings) with no constraints applied, so every predicate comparison is lexicographical. For accurate query results you must therefore store data in a format whose string order matches its natural order, for example by padding numbers with leading zeros and writing dates in ISO 8601:2004 format.
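The effect of lexicographical comparison can be seen in a minimal Python sketch. It does not call SimpleDB. It only shows how values might be encoded as strings before they are stored. The function names encode_int and encode_timestamp and the 10-digit width are assumptions made for illustration, and the integer encoding handles non-negative values only.

```python
from datetime import datetime, timezone

PAD_WIDTH = 10  # assumed maximum number of digits; pick one that fits your data


def encode_int(value: int, width: int = PAD_WIDTH) -> str:
    """Return a zero-padded string whose text order matches numeric order."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    text = str(value)
    if len(text) > width:
        raise ValueError("value does not fit in the padded width")
    return text.zfill(width)


def encode_timestamp(moment: datetime) -> str:
    """Return an ISO 8601 UTC string; expects a timezone-aware datetime."""
    if moment.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


raw = ["9", "10", "100"]
padded = [encode_int(int(v)) for v in raw]

print(sorted(raw))     # text order: '10' and '100' sort before '9'
print(sorted(padded))  # padded strings sort in numeric order
print(encode_timestamp(datetime(2009, 1, 2, 3, 4, 5, tzinfo=timezone.utc)))
```

Fixed-width, most-significant-first encodings work because a string comparison then inspects the same positions a numeric or chronological comparison would.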
Azure Services Platform
Microsoft's Windows Azure, like Google AppEngine and Force.com, offers a platform for cloud computing that includes a data store and other features for application development. Microsoft .NET Services provide a service bus and authentication, and Live Services are application building blocks. Microsoft also offers SharePoint Services and Dynamics CRM Services in the Azure cloud. As with Amazon S3 and EC2, communication with the Azure Services Platform is based on the web services model, with Microsoft supporting SOAP and REST. Microsoft Azure bundles SQL Data Services (SDS) and exposes Azure Table Storage via ADO.NET Data Services. The database Azure currently offers is a single instance of SQL Server that's limited to 10 gigabytes of storage. To store more than that, you must partition the data and scale horizontally.
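One common way to partition data is to hash each record key and use the result to pick a partition. The Python sketch below illustrates that idea with in-memory dictionaries standing in for separate database instances. It is not Azure's or SDS's partitioning mechanism, and the names partition_for, put and get, along with the count of four partitions, are hypothetical.

```python
import hashlib

PARTITION_COUNT = 4  # assumed number of database instances

# Each dict stands in for one size-limited database instance.
partitions = [dict() for _ in range(PARTITION_COUNT)]


def partition_for(key: str, partition_count: int = PARTITION_COUNT) -> int:
    """Map a key to a partition index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


def put(key: str, record: dict) -> None:
    """Store the record in the partition that owns the key."""
    partitions[partition_for(key)][key] = record


def get(key: str):
    """Look up the record in the partition that owns the key."""
    return partitions[partition_for(key)].get(key)


put("customer-42", {"name": "Ada"})
print(get("customer-42"))
```

A stable hash such as SHA-256 is used instead of Python's built-in hash(), which varies between processes for strings. With simple modulo placement, changing the partition count moves most keys, so a production design would need a migration plan.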
For those with a history of using industrial-strength databases, a big adjustment when moving to the new entity-attribute-value (EAV) stores is the lack of strong typing. SimpleDB uses string values to store everything, so comparisons and sorting require that you pad numbers with leading zeros. Microsoft SQL Data Services provides Base64, Boolean, datetime, decimal, and string types. With more than 20 types, Google AppEngine has more built-in types than SimpleDB or SQL Data Services.
RDF and Semantic Data Stores
Social networking and e-commerce have shown us there are classes of web applications that must operate with massive data stores and support a user base measured in millions. Cloud computing is often touted as a vehicle for scaling out that type of site and powering Web 3.0 applications. Tim Berners-Lee has said a web of linked data will evolve from the web of linked documents. This has produced a surge of interest in data stores that can handle very large knowledge bases and data sets encoded to impart semantics using the W3C Resource Description Framework (RDF), and in the W3C SPARQL query language.
Interest in RDF, microformats and linked data has raised awareness of the capabilities and capacity of RDF data stores. Because there are a number of RDF data stores, the benchmark wars are reminiscent of the Transaction Processing Performance Council (TPC) benchmark competition among SQL vendors. RDF data is stored as subject-predicate-object triples. The leading RDF data stores often store additional information for versioning and temporal queries, and they are capable of storing and querying billions of triples. A W3C wiki identifies more than a dozen triple stores, about half citing deployments or benchmarks with 1 billion triples or more.
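The subject-predicate-object model can be illustrated with a toy in-memory store in Python. This is a teaching sketch, not how any of the products discussed here are implemented: it scans every triple on each query, whereas real triple stores rely on indexes. The class names and the ex: identifiers in the sample data are hypothetical.

```python
from typing import Iterator, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


class TripleStore:
    """A toy store that keeps triples in a set and matches by full scan."""

    def __init__(self) -> None:
        self._triples = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self._triples.add(Triple(subject, predicate, obj))

    def match(
        self,
        subject: Optional[str] = None,
        predicate: Optional[str] = None,
        obj: Optional[str] = None,
    ) -> Iterator[Triple]:
        """Yield triples matching the pattern; None acts as a wildcard."""
        for triple in self._triples:
            if subject is not None and triple.subject != subject:
                continue
            if predicate is not None and triple.predicate != predicate:
                continue
            if obj is not None and triple.obj != obj:
                continue
            yield triple


store = TripleStore()
store.add("ex:alice", "ex:knows", "ex:bob")
store.add("ex:alice", "ex:worksFor", "ex:acme")
store.add("ex:bob", "ex:knows", "ex:carol")

# Who does anyone know? Results come back in no particular order.
for triple in store.match(predicate="ex:knows"):
    print(triple.subject, "knows", triple.obj)
```

A pattern with wildcards in some positions is roughly what a single SPARQL triple pattern expresses. A full SPARQL query combines several such patterns with shared variables.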
Sesame, Jena, and Mulgara are popular open source solutions. OpenLink Virtuoso is a universal server that loaded 110,500 triples per second in a recent benchmark. The Virtuoso Universal Server (Cloud Edition) is a pre-packaged AMI for EC2. In addition to SQL and XML databases, it provides online backup to Amazon S3 buckets and installable RDFizer cartridges. Franz AllegroGraph RDFStore offers a vehicle for building RDF-based federated knowledge stores in the cloud. It supports SPARQL queries, Prolog, and RDFS++ reasoning. On Amazon EC2, it stored and indexed a 10-billion triple data set in 6.19 hours using 10 large EC2 instances. SQL/XML products, including Oracle 11g and IBM Boca for DB2, can also store RDF triples. On the patent front, Microsoft has been active with applications for methods to store RDF triples and convert SPARQL queries to SQL.
Document Stores, Column Stores
Storing data by column, rather than in the row orientation of the traditional SQL platform, does not subvert the relational model. When combined with data compression and a shared-nothing, massively parallel processing (MPP) architecture, it can sustain high-performance applications doing analytics and business intelligence processing. By using a Sybase IQ or Vertica column store with a cloud computing service, organizations can roll their own scalable BI solutions without a heavy capital outlay for server hardware. Sybase IQ processes complex analytics queries, accelerates report processing and includes a word index for string processing, such as SQL LIKE queries. It provides connectivity via standard data access APIs, and its Rcube schemas provide a performance advantage over the star schema typically used for relational data warehouses and data marts. Vertica Analytic Database is a solution from a company co-founded by Michael Stonebraker. Vertica supports a grid architecture, terabyte-sized databases, and standards-based connectivity. It makes pay-as-you-go analytics available to Amazon EC2 users, with a large AMI instance, drivers for ODBC, JDBC, Python, and Ruby, and a database size of 1 terabyte per node as you scale out to multiple nodes.
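To see why column orientation pairs well with compression and analytics, consider a small Python sketch that pivots rows into columns and run-length encodes one of them. It is a simplified illustration with made-up sample data, not a description of how Sybase IQ or Vertica store data. The function names are hypothetical, and to_columns assumes every row has the same keys.

```python
rows = [
    {"region": "east", "units": 5},
    {"region": "east", "units": 7},
    {"region": "east", "units": 2},
    {"region": "west", "units": 4},
    {"region": "west", "units": 9},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [(value, count) for value, count in runs]


columns = to_columns(rows)

# An aggregate touches only the one column it needs.
print(sum(columns["units"]))

# Sorted, low-cardinality columns compress into a few runs.
print(run_length_encode(columns["region"]))
```

Because each column holds values of one kind, repeated values sit next to each other and compress well, and a query such as a sum reads one column instead of every field of every row.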