Cloud Databases: Platforms, Engines, and Data Stores
The cloud database landscape today is marked by diversity. In that respect it is not unlike the database landscape 20 years ago. Back then there were products that ran on the desktop, on file servers, on mainframes, and on SQL servers. Today we have all of those database types, but we also have powerful database engines embedded in mobile devices and client-server databases hosted in the cloud. The cloud database community is experiencing the same "cornucopia of APIs" effect that the SQL database community experienced long ago. New cloud database platforms have arrived in recent years, bringing with them disparate programming interfaces for data access.
We're seeing thousands of Memcached and NoSQL deployments. NewSQL platforms are emerging and the largest database vendors (IBM, Microsoft, Oracle) are carving out their share of the cloud computing market.
Moving to the cloud does not require rewriting application code. Organizations can port enterprise applications and databases to the cloud, such as by using Amazon EC2 Relational Database AMIs to operate with EnterpriseDB, IBM DB2, MySQL, Microsoft SQL Server, Oracle Database 11g, PostgreSQL, Sybase IQ and SQL Anywhere, and Vertica. This strategy moves the data to the cloud without having to reprogram a data access layer that uses ADO.NET, ODBC, JDBC, or proprietary APIs.
The same holds for service-oriented architecture (SOA): organizations can host parts of an SOA in the cloud while preserving the APIs and the logic of data services and XML-based web services.
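The idea of preserving the data access layer can be illustrated with a short sketch. It uses Python's DB-API rather than ODBC or JDBC, and an in-memory SQLite database stands in for both the on-premises and the cloud-hosted server. The table, the function names, and the connection factories are hypothetical. The query code is written once, and only the connection factory would change when the database moves.

```python
import sqlite3
from typing import Callable, List, Tuple

# A connection factory hides where the database lives. In a real migration the
# factory would wrap a driver and a host name; here both "locations" are
# in-memory SQLite databases (an assumption made purely for illustration).
ConnectionFactory = Callable[[], sqlite3.Connection]


def make_demo_database() -> sqlite3.Connection:
    """Create a throwaway database with a hypothetical customers table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO customers (id, name) VALUES (?, ?)",
        [(1, "Acme"), (2, "Globex")],
    )
    conn.commit()
    return conn


def list_customers(connect: ConnectionFactory) -> List[Tuple[int, str]]:
    """Data access code: unchanged no matter which factory is passed in."""
    conn = connect()
    try:
        return conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
    finally:
        conn.close()


on_premises = make_demo_database   # stand-in for the existing server
cloud_hosted = make_demo_database  # stand-in for the same engine in the cloud

print(list_customers(on_premises))
print(list_customers(cloud_hosted))
```

In practice the swap would usually amount to a new host name or connection string in configuration, which is why the strategy avoids reprogramming.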
Developers who are creating new applications for the cloud, rather than migrating existing code and databases, can use the aforementioned platforms or look to other cloud database alternatives. There are too many choices for one article, so we'll look at selected platforms here.
Google App Engine
Developers using Google App Engine have a choice of APIs for data storage, using the built-in Datastore and High Replication Datastore (HRD) for Python and Java development. Besides APIs such as Mail, Images, and Taskqueue, App Engine developers can operate on in-memory data with Memcache. The Datastore Java API supports the Java Persistence API (JPA) and queries using Java Data Objects (JDO). The HRD provides the advantage of replicating data across data centers using a solution based on the Paxos algorithm. There is also a master/slave replication alternative that supports asynchronous replication across data centers.
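A common way to combine an in-memory cache such as Memcache with a persistent datastore is the cache-aside pattern. The sketch below is a minimal illustration of that pattern, not App Engine code: both classes are hypothetical stand-ins, and the key and value are made up.

```python
from typing import Dict, Optional


class InMemoryCache:
    """Hypothetical stand-in for a Memcache client with get/set methods."""

    def __init__(self) -> None:
        self._items: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._items.get(key)

    def set(self, key: str, value: str) -> None:
        self._items[key] = value


class SlowDatastore:
    """Hypothetical stand-in for the persistent datastore; counts reads."""

    def __init__(self, rows: Dict[str, str]) -> None:
        self._rows = dict(rows)
        self.reads = 0

    def fetch(self, key: str) -> Optional[str]:
        self.reads += 1
        return self._rows.get(key)


def get_profile(key: str, cache: InMemoryCache, store: SlowDatastore) -> Optional[str]:
    """Cache-aside read: try the cache first, fall back to the datastore."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = store.fetch(key)
    if value is not None:
        cache.set(key, value)  # populate the cache for later readers
    return value


cache = InMemoryCache()
store = SlowDatastore({"user:1": "Ada"})
first = get_profile("user:1", cache, store)   # miss: reads the datastore
second = get_profile("user:1", cache, store)  # hit: served from memory
print(first, second, store.reads)
```

The second lookup never touches the datastore, which is the benefit an in-memory tier is meant to deliver.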
Developers can also check the status of Java and Python services using the Google App Engine status dashboard. Google has committed to supporting hosted SQL, full-text search, and MapReduce in upcoming releases of App Engine. All paid users will have a 99.95% uptime service-level agreement, but HRD has achieved 99.999% uptime since its launch.
Amazon Relational Database Service (RDS)
Amazon EC2 developers can use preconfigured AMIs for the database products discussed above or opt for the Amazon Relational Database Service (RDS) to operate with Oracle and MySQL databases. RDS is a web service that provides pay-per-use capacity and freedom from the overhead of database administration, such as backups, replication, and the arduous task of keeping up with security patches. With RDS, developers are able to use familiar tools and administrators can use Amazon CloudWatch to monitor storage and compute resource consumption. A principal advantage of RDS is the scalability and elasticity of the cloud: you can bring resources to bear when you need them.
Amazon RDS for MySQL supports read replicas and deployment across multiple Availability Zones. Amazon RDS for Oracle provides automatic host replacement in the event of a compute instance failure, but it does not currently support replication. Amazon RDS for Oracle also supports provisioned database storage. Each DB Instance can select from 5 GB to 1 TB for its primary data set, at a rate of $0.10 to $0.12 per gigabyte per month.
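The storage rates quoted above translate into a simple calculation. The following sketch covers the storage charge only, not instance hours or I/O, and it assumes that 1 TB is counted as 1024 GB; the function name is hypothetical.

```python
from typing import Tuple

# Rates quoted in the text, expressed in whole cents to avoid float rounding.
LOW_RATE_CENTS = 10   # $0.10 per GB per month
HIGH_RATE_CENTS = 12  # $0.12 per GB per month
MIN_GB = 5
MAX_GB = 1024  # assumption: 1 TB is treated as 1024 GB


def monthly_storage_cost_cents(size_gb: int) -> Tuple[int, int]:
    """Return the (low, high) monthly storage charge in cents for size_gb."""
    if not MIN_GB <= size_gb <= MAX_GB:
        raise ValueError(f"size must be between {MIN_GB} and {MAX_GB} GB")
    return size_gb * LOW_RATE_CENTS, size_gb * HIGH_RATE_CENTS


for size in (MIN_GB, 100, MAX_GB):
    low, high = monthly_storage_cost_cents(size)
    print(f"{size} GB: ${low / 100:.2f} to ${high / 100:.2f} per month")
```

Under those assumptions, the smallest data set costs well under a dollar a month and the largest a little over a hundred dollars.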
Microsoft SQL Azure
Microsoft SQL Azure is a pay-as-you-go solution that enables developers familiar with SQL Server to use familiar tools when they move to a cloud database. SQL Azure integrates with Visual Studio and SQL Server and supports Transact-SQL (T-SQL) and the TDS protocol. Like Database.com, it's built with a multi-tenant architecture, meaning that other users are sharing the database storage. For data access, SQL Azure supports ODBC, ADO.NET, JDBC, and PHP. SQL Azure also provides data synchronization capabilities based on Microsoft Sync Framework technologies. This permits data replication and synchronization within and across data center boundaries.
Developers can create OData feeds from SQL Azure or Microsoft SQL Server. OData, which grew out of ADO.NET Data Services, operates with technologies such as HTTP, AtomPub, and JSON to deliver information from applications and services.
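Because OData rides on HTTP, a client queries a feed by composing a URL. The sketch below builds such a URL from the standard `$filter`, `$top`, and `$format` query options. The service root, the entity set, and the property name are hypothetical, and nothing is sent over the network.

```python
from urllib.parse import quote, urlencode

# Hypothetical feed root; a real SQL Azure OData endpoint would differ.
SERVICE_ROOT = "https://example.invalid/odata"


def build_odata_query(entity_set: str, filter_expr: str, top: int) -> str:
    """Compose an OData query URL from standard system query options."""
    options = {
        "$filter": filter_expr,  # server-side row filter
        "$top": str(top),        # limit the number of entries returned
        "$format": "json",       # ask for a JSON response
    }
    # Keep "$" literal in option names and percent-encode spaces in values.
    query = urlencode(options, safe="$", quote_via=quote)
    return f"{SERVICE_ROOT}/{entity_set}?{query}"


print(build_odata_query("Products", "Price gt 10", 5))
```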
ParAccel
Like Sybase IQ and Vertica, ParAccel is a columnar database. That means its data model is optimal for analytics, not transaction processing. It can operate as a disk-based or in-memory database. According to ParAccel CEO Barry Zane, the sweet spot for the in-memory database is 2 terabytes (TB), whereas it's 25 TB for an on-disk database. ParAccel offers multiple programming interfaces, including an SQL API and Map/Reduce.
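Why a columnar layout favors analytics can be seen in a toy comparison. This is a sketch of the general idea only, not of ParAccel's storage engine, and the sample orders are made up. An aggregate over one attribute has to visit every full record in a row layout, but only a single list in a column layout.

```python
from typing import Any, Dict, List

# Row-oriented layout: each record is stored together (suits OLTP lookups).
rows: List[Dict[str, Any]] = [
    {"order_id": 1, "region": "east", "amount": 120},
    {"order_id": 2, "region": "west", "amount": 80},
    {"order_id": 3, "region": "east", "amount": 50},
]

# Column-oriented layout: each attribute is stored together (suits scans).
columns: Dict[str, List[Any]] = {
    name: [row[name] for row in rows] for name in rows[0]
}


def total_amount_row_store(table: List[Dict[str, Any]]) -> int:
    """Aggregate by visiting every full record, unused fields included."""
    return sum(row["amount"] for row in table)


def total_amount_column_store(table: Dict[str, List[Any]]) -> int:
    """Aggregate by reading only the one column the query needs."""
    return sum(table["amount"])


print(total_amount_row_store(rows), total_amount_column_store(columns))
```

Both functions return the same total; the difference is how much data each must read, which grows with the number of columns in a wide table.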
Xeround MySQL Cloud Database
Xeround's claim to fame is tailoring MySQL for the cloud environment. The Xeround offering is a distributed database that runs in memory across multiple nodes. It's suitable for transactional applications, not for analytics and business intelligence (BI). One notable aspect of Xeround is that it is cloud-agnostic, running on Rackspace, Heroku, and Amazon Web Services (AWS). The sweet spot for Xeround's performance advantages is databases whose size ranges from 2 gigabytes (GB) to 50 GB. Xeround supports auto scaling after you set upper and lower thresholds for CPU utilization, memory utilization, and connections. Xeround has published online transaction processing (OLTP) benchmark results that compare Amazon RDS MySQL performance with that of Xeround Cloud Database, for 1 to 240 concurrent users.
DBT-2 is a transaction processing benchmark, similar to TPC-C, with the source code available at SourceForge.
DBT-2 simulates a parts supplier with several workers accessing a database, updating customer information and checking on parts inventories. The test metrics include transactions per second, CPU utilization, I/O activity, and memory utilization. For 60 to 240 users, Xeround sustained 8000 transactions per minute, whereas RDS performance peaked at 15 users and declined from 30 to 240 users.
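Threshold-driven auto scaling of the kind Xeround describes can be sketched as a small decision function. This is not Xeround's algorithm: the policy (grow if any metric exceeds its upper bound, shrink only if every metric is below its lower bound), the metric names, and the threshold values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Threshold:
    """Lower and upper bounds for one monitored metric."""
    lower: float
    upper: float


def scaling_decision(metrics: Dict[str, float], limits: Dict[str, Threshold]) -> str:
    """Return 'scale_out', 'scale_in', or 'hold' for the current readings.

    Assumed policy: grow if any metric exceeds its upper bound,
    shrink only if every metric is below its lower bound.
    """
    if any(metrics[name] > limit.upper for name, limit in limits.items()):
        return "scale_out"
    if all(metrics[name] < limit.lower for name, limit in limits.items()):
        return "scale_in"
    return "hold"


# Hypothetical thresholds for CPU (%), memory (%), and open connections.
limits = {
    "cpu": Threshold(lower=20.0, upper=80.0),
    "memory": Threshold(lower=30.0, upper=85.0),
    "connections": Threshold(lower=10.0, upper=200.0),
}

print(scaling_decision({"cpu": 91.0, "memory": 40.0, "connections": 50.0}, limits))
print(scaling_decision({"cpu": 10.0, "memory": 12.0, "connections": 3.0}, limits))
print(scaling_decision({"cpu": 50.0, "memory": 40.0, "connections": 50.0}, limits))
```

Requiring every metric to be low before scaling in is a conservative choice that avoids shrinking a cluster that is idle on CPU but still holding many connections.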
Cloudant
Cloudant has emerged as a platform that operates with semi-structured data in a highly distributed, scalable architecture. Although it is NoSQL database technology, CEO Mike Miller has described Cloudant as relationally complete. Cloudant is available as a download or as a cloud database on Amazon EC2. Hosting a large multi-tenant architecture on EC2 proved to be a challenge.
Cloudant users communicate with the database from the web browser, using the HTTP protocol. Cloudant feels this enables a client to talk directly to the database without going through a middle-tier server. But in my mind this raises questions about the type of application for which Cloudant is appropriate. Without middle-tier servers, application-specific logic such as business rules must be handled in the client or in the database. Since the client is the web browser, that means having to create sophisticated HTML, scripts, and browser add-ins.
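Talking to a document database directly over HTTP looks roughly like the following sketch. It assumes the CouchDB-style document API (PUT and GET on `/{db}/{docid}`), which Cloudant's CouchDB heritage suggests but which should be checked against the service's own documentation. The host, database name, and document are hypothetical, and the requests are built but never sent.

```python
import json
from urllib.request import Request

# Hypothetical account host and database name.
BASE_URL = "https://example-account.invalid"
DATABASE = "orders"


def build_put_document(doc_id: str, document: dict) -> Request:
    """Build (but do not send) a CouchDB-style PUT /{db}/{docid} request."""
    body = json.dumps(document).encode("utf-8")
    return Request(
        url=f"{BASE_URL}/{DATABASE}/{doc_id}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def build_get_document(doc_id: str) -> Request:
    """Build (but do not send) a CouchDB-style GET /{db}/{docid} request."""
    return Request(url=f"{BASE_URL}/{DATABASE}/{doc_id}", method="GET")


put_request = build_put_document("order-1001", {"customer": "Acme", "total": 250})
get_request = build_get_document("order-1001")
print(put_request.get_method(), put_request.full_url)
print(get_request.get_method(), get_request.full_url)
```

Note that nothing in these requests enforces a business rule such as a credit limit; with no middle tier, that logic has to live in the client or in the database.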
The size sweet spot for optimal Cloudant performance is 1 GB to 100 TB. Cloudant has recently introduced a solution that bundles Apache Lucene plus CouchDB, an integration that supports full-text search and real-time analytics.
NuoDB
NuoDB, formerly NimbusDB, is perhaps best known because its chief architect is industry icon Jim Starkey. Starkey has long been involved in creating database management systems and is the father of multi-version concurrency control (MVCC). NuoDB represents the NewSQL approach of supporting scalability with a radical shift of architecture; it offers support for standard SQL grammar and APIs. NuoDB provides the capability to scale out horizontally without giving up support for transactions that have atomicity, consistency, isolation, and durability (ACID) properties. NuoDB is not generally available yet, so there's no empirical data about optimal database sizes. The founders will point out that it's intended more for transactional applications than for big data applications, such as analytics with very large data sets.
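The core idea of MVCC is that writers add new versions instead of overwriting data, so readers keep a consistent view without blocking. The toy store below sketches that idea only. It is not NuoDB's design, it omits transactions, conflict detection, and garbage collection, and all names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


class VersionedStore:
    """Toy multi-version store: writers append versions, readers use snapshots."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[Tuple[int, str]]] = {}
        self._clock = 0  # timestamp of the latest committed write

    def write(self, key: str, value: str) -> int:
        """Commit a new version of key and return its commit timestamp."""
        self._clock += 1
        self._versions.setdefault(key, []).append((self._clock, value))
        return self._clock

    def snapshot(self) -> int:
        """Return a timestamp that fixes what a reader is allowed to see."""
        return self._clock

    def read(self, key: str, snapshot_ts: int) -> Optional[str]:
        """Return the newest version committed at or before snapshot_ts."""
        visible = None
        for commit_ts, value in self._versions.get(key, []):
            if commit_ts <= snapshot_ts:
                visible = value  # versions are appended in timestamp order
        return visible


store = VersionedStore()
store.write("balance", "100")
reader_snapshot = store.snapshot()  # a reader starts here
store.write("balance", "75")        # a later writer does not block the reader

print(store.read("balance", reader_snapshot))   # the reader's consistent view
print(store.read("balance", store.snapshot()))  # a new reader sees the update
```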
Database.com
Cloud computing has IT organizations thinking about the advantages of computing as an operating expense that does not require large capital outlays for infrastructure. This trend is behind the interest in Database-as-a-Service, such as the Database.com service from Salesforce.com. Developers writing apps for the Salesforce CRM service or for the Force.com platform use a database structure built with a multi-tenant architecture. Because a database can include data from a multitude of Salesforce customers, the architecture provides a very sophisticated security model. Besides authentication and authorization capabilities (found with a robust DBMS), Database.com supports encryption, roles, profiles, sharing rules, and privileges tied to user ID and session. In addition, the validity of database requests can be checked against access hours and originating IP address. Database.com is an SQL database solution. Clients are able to access data using JDBC drivers, ODBC drivers, and REST and SOAP web services interfaces. Developers are able to use a variety of languages and tools, including Java, PHP, Ruby, .NET, Adobe AIR, and Apex.
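Checking a request against access hours and originating IP address can be sketched in a few lines. This illustrates the kind of rule described above, not Database.com's implementation. The policy values and the function name are hypothetical, and the access window is assumed not to cross midnight.

```python
from datetime import time
from ipaddress import ip_address, ip_network
from typing import Sequence


def request_allowed(
    source_ip: str,
    request_time: time,
    allowed_networks: Sequence[str],
    start: time,
    end: time,
) -> bool:
    """Accept a request only from a trusted network during permitted hours."""
    address = ip_address(source_ip)
    from_trusted_network = any(
        address in ip_network(network) for network in allowed_networks
    )
    within_hours = start <= request_time <= end  # assumes a same-day window
    return from_trusted_network and within_hours


# Hypothetical policy: one office network, 08:00 to 18:00.
networks = ["203.0.113.0/24"]
opens, closes = time(8, 0), time(18, 0)

print(request_allowed("203.0.113.7", time(9, 30), networks, opens, closes))
print(request_allowed("203.0.113.7", time(22, 0), networks, opens, closes))
print(request_allowed("198.51.100.4", time(9, 30), networks, opens, closes))
```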
Not a Slam Dunk
Selecting a database platform, with its associated tools and middleware choices, for deployment in the cloud is a daunting task. The choice is a bit easier when the job is to move an existing database into the cloud, but there are still configuration issues to resolve, such as whether to fail over to a different EC2 availability zone. Not even a move into a private cloud is guaranteed to be a slam dunk due to infrastructure issues, such as storage options, operating systems, and hypervisors.
Ken North is a well-known expert in database technologies and regular contributor to Dr. Dobb's.