The federated database provides a solution when data is distributed because volume, workload or other considerations make it impractical to combine it into a single database. Open SkyQuery and Flickr have been showcases for federation.
SkyQuery runs distributed queries over federated astronomical data sources. Flickr uses sharding to support billions of queries per day over the federated MySQL databases that manage its 2 billion photos. That type of success and the scalability requirements of cloud computing have put new emphasis on federated data and sharding. Mergers and acquisitions may also force the creation of federated data stores so that business intelligence and other queries can run against disparate CRM databases.
IBM has been using GaianDB, based on Apache Derby, to test the performance of a lightweight federated database engine. IBM distributed the database over 1,000 nodes, which GaianDB was able to query in one-eighth of a second. Fetching a million rows took five seconds.
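The sharding and federated-query ideas described above can be sketched in a few lines. This is only an illustration under stated assumptions, not a description of how Flickr, SkyQuery or GaianDB is implemented: four in-memory SQLite databases stand in for separate shard servers, and the `photos` table, the hash-based routing and the function names are all hypothetical.

```python
import hashlib
import sqlite3

# Assumption: four in-memory SQLite databases stand in for separate shard servers.
NUM_SHARDS = 4
shards = [sqlite3.connect(":memory:") for _ in range(NUM_SHARDS)]
for db in shards:
    db.execute(
        "CREATE TABLE photos (photo_id INTEGER PRIMARY KEY, owner TEXT, title TEXT)"
    )


def shard_for(owner: str) -> sqlite3.Connection:
    """Map an owner (the shard key) to one shard using a stable hash."""
    digest = hashlib.sha256(owner.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % NUM_SHARDS]


def add_photo(photo_id: int, owner: str, title: str) -> None:
    """Write the row only to the shard that owns this key."""
    shard_for(owner).execute(
        "INSERT INTO photos (photo_id, owner, title) VALUES (?, ?, ?)",
        (photo_id, owner, title),
    )


def photos_by_owner(owner: str) -> list:
    """Single-shard query: the shard key says which database to ask."""
    rows = shard_for(owner).execute(
        "SELECT title FROM photos WHERE owner = ? ORDER BY photo_id", (owner,)
    )
    return [title for (title,) in rows]


def total_photos() -> int:
    """Federated query: ask every shard, then combine the partial counts."""
    return sum(
        db.execute("SELECT COUNT(*) FROM photos").fetchone()[0] for db in shards
    )


add_photo(1, "alice", "Sunset")
add_photo(2, "bob", "Harbor")
add_photo(3, "alice", "Dunes")
```

Queries that include the shard key touch a single database, while queries that do not (such as the total count) must fan out to every shard and merge the results. That fan-out is the part a federated engine handles on the application's behalf.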
Platform and API Issues
Database options for public cloud computing can be limited by the choice of cloud provider. Platform providers, such as Google App Engine and Force.com, offer a specific platform for development, including predefined APIs and data stores. But private clouds and infrastructure providers, such as GoGrid, Joyent and Amazon EC2, enable the cloud user to match the software, database environment and APIs to requirements.
Ease of development is an important aspect of a cloud database solution, with application programming interfaces (APIs) being a major factor. Some data access programming for the cloud can be done with familiar APIs, such as Open Database Connectivity (ODBC), JDBC, Java Data Objects (JDO) and ADO.NET.
For certain classes of applications, security is an obstacle to using public cloud services, but not an insurmountable one. Current thinking on the subject emphasizes encryption, authorization, authentication, digital certificates, roles and policy-based security controls. Database backups to the cloud can be encrypted. Communications can use secure networking and encrypted data.
Java and .NET offer robust cryptographic solutions for applications and services accessing databases. Operating systems and robust SQL databases offer additional layers of security. SQL databases provide features such as row-level encryption and role-based assignment of privileges and access to data. But even with multi-level security, one serious threat to databases in the public cloud and the corporate data center is a breach of hypervisor security by an authorized employee. To help ensure data security, Amazon EC2 provides for the definition of security groups. However, you must call an Amazon API function to monitor the security group descriptions manually, and there is no logging function to monitor failed attempts at authentication.
Security differs depending on whether you use SaaS, a platform provider or an infrastructure provider. Because SaaS providers offer a bundle with tools, APIs and services, the SaaS user is not caught up in choosing the optimal data store and security model. However, those creating private clouds or using an infrastructure provider must select a data management solution that's consistent with the application's security requirements.
Salesforce.com hosts applications on Oracle databases using a multi-tenancy model. On the other hand, Amazon EC2 is an example of multi-instance security. If you fire up an AMI running Oracle, DB2 or Microsoft SQL Server, you have a unique instance that does not serve other tenants. The process of authorizing database users, defining roles and granting privileges is your responsibility when using IaaS.
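The difference between the two models can be illustrated with a small sketch. It shows one common way to isolate tenants in each model and makes no claim about how Salesforce.com or Amazon EC2 works internally. The `accounts` table, the tenant names and the in-memory SQLite databases are assumptions made for the example.

```python
import sqlite3

# Multi-tenancy (assumed design): one shared database, every row tagged with a tenant.
shared = sqlite3.connect(":memory:")
shared.execute("CREATE TABLE accounts (tenant_id TEXT NOT NULL, name TEXT NOT NULL)")
shared.executemany(
    "INSERT INTO accounts (tenant_id, name) VALUES (?, ?)",
    [("acme", "Alpha Ltd"), ("acme", "Beta Inc"), ("globex", "Gamma LLC")],
)


def tenant_accounts(tenant_id: str) -> list:
    """Isolation depends on every query filtering by tenant_id."""
    rows = shared.execute(
        "SELECT name FROM accounts WHERE tenant_id = ? ORDER BY name", (tenant_id,)
    )
    return [name for (name,) in rows]


# Multi-instance (assumed design): each tenant gets a database of its own.
instances = {}
for tenant, names in {"acme": ["Alpha Ltd", "Beta Inc"], "globex": ["Gamma LLC"]}.items():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE accounts (name TEXT NOT NULL)")
    db.executemany("INSERT INTO accounts (name) VALUES (?)", [(n,) for n in names])
    instances[tenant] = db


def instance_accounts(tenant_id: str) -> list:
    """Isolation comes from the connection itself; no tenant filter is needed."""
    rows = instances[tenant_id].execute("SELECT name FROM accounts ORDER BY name")
    return [name for (name,) in rows]
```

In the shared design, a single query that omits the tenant filter exposes another customer's rows, so the provider carries that responsibility. In the per-instance design the boundary is the database itself, and managing users, roles and privileges inside it falls to whoever runs the instance.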
Fault-Tolerance and Cloud Failover
One of the exciting possibilities introduced by cloud service providers is being able to configure fault-tolerant, highly available systems and hot backups for disaster recovery. It's possible to configure and operate a private cloud for a fairly seamless failover to Amazon EC2, for example. It would require replicating data in the private and public cloud, implementing the Amazon APIs and availability zones, IP assignment and load balancing for the private cloud, and using server configurations compatible with Amazon instances. The latter would be necessary to avoid breaking applications or services due to changes in endianness, the Java heap size and other dissimilarities. A recent dialogue with IBM about the inverse scenario, deploying databases in a public cloud and moving them to a private cloud, revealed this would be more of a challenge.
The SQL database became predominant even though an earlier generation of databases delivered ACID properties and excellent performance on Create, Read, Update, Delete (CRUD) operations. Those earlier databases, like some of the software mentioned here, required a programmer to write code to navigate through data in order to perform queries, such as an aggregation query. SQL platforms, by contrast, provided an ad hoc query solution that did not require procedural programming because it used a declarative query language and provided built-in aggregation functions. The logic that must be programmed in an application or service, versus built-in with the database engine, is an element of the total cost of ownership (TCO) of a data store solution.
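The contrast between navigating data in application code and asking the engine for an aggregate can be shown with a small sketch. The `orders` data, the table layout and the function names are hypothetical, and an in-memory SQLite database stands in for a SQL platform.

```python
import sqlite3

# Hypothetical sample data: (region, amount) pairs.
orders = [("east", 120.0), ("west", 75.5), ("east", 30.0), ("west", 24.5)]


def totals_procedural(records) -> dict:
    """Application code walks every record and maintains the totals itself."""
    totals = {}
    for region, amount in records:
        totals[region] = totals.get(region, 0.0) + amount
    return totals


# Load the same data into a SQL engine.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (region TEXT, amount REAL)")
db.executemany("INSERT INTO orders (region, amount) VALUES (?, ?)", orders)


def totals_declarative(conn: sqlite3.Connection) -> dict:
    """The query states what is wanted; the engine decides how to compute it."""
    rows = conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
    return dict(rows.fetchall())
```

Both functions return the same totals per region. The difference is where the logic lives: the loop must be written, tested and maintained with the application, while the `GROUP BY` query relies on aggregation the database engine already provides.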
The direction an organization takes on cloud computing, whether to go the private cloud route or use a public cloud, will determine what options are available for data management. For those who walk the PaaS path, the focus will be on the platform's capabilities, not the data store per se. Those who walk the private cloud or IaaS paths will have to choose a hardware and software configuration, including a data store that fits the business goals and requirements of applications running in the cloud. Many factors will influence the choice of one or more of a spectrum of cloud database solutions, ranging from simple data stores to platforms that support complex queries and transaction processing.
Not every project requires the full functionality of a SQL database manager, so there's a definite need for lightweight, fast, scalable data stores.