Supercomputing, The Cloud, Big Data, and NoSQL
If you count yourself among the informed members of the software and computing community, you're undoubtedly aware of NoSQL, "Big Data", cloud computing, and supercomputing. Sometimes a technology that has become trendy is a branch on an evolutionary tree; other times it's a revolutionary departure from the long-established status quo.
The arrival of new technology often rekindles the perennial debate over the merits of "tried-and-true" versus "new and improved". The latter often introduces new words into our lexicon, recent examples being Big Data, NoSQL, and cloud computing. Supercomputing has been with us for a while, but 2011 brought significant strides, including IBM Watson, Tianhe-1A, and an Amazon virtual supercomputer.
IBM Watson can process 200 million pages of text in 3 seconds. (How's that for having enough capacity for big data workloads?) China claimed the supercomputer crown with Tianhe-1A and its capacity to perform 2.5 thousand trillion calculations per second (2.5 petaflops), roughly 40% faster than the Cray XT5 Jaguar at Oak Ridge National Laboratory. One of the more interesting approaches to solving large-scale computing problems is the Amazon virtual supercomputer, an ad hoc cluster assembled for an Amazon EC2 customer: a pharmaceutical company that paid $1,279 per hour to rent 30,000 cores. That virtual supercomputer had enough capacity to rank 42nd on the Top500 list of the world's fastest supercomputers.
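To put that rental price in perspective, here's a quick back-of-the-envelope calculation. The per-core figure and the 8-hour scenario below are derived from the numbers above, not from the original report:

```python
# Back-of-the-envelope cost of the 30,000-core EC2 run described above.
cores = 30_000          # cores rented
hourly_cost = 1_279     # USD per hour for the whole cluster

per_core_hour = hourly_cost / cores
print(f"Cost per core-hour: ${per_core_hour:.4f}")   # -> $0.0426

# A hypothetical 8-hour workload on the full cluster:
print(f"8-hour run: ${8 * hourly_cost:,}")           # -> $10,232
```

At pennies per core-hour, renting supercomputer-class capacity by the hour starts to look very different from buying and operating it.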
My previous Dr. Dobb's blog post, "Terabytes to Petabytes: Reflections on 1999-2009", discussed the surge of interest in the cloud and Big Data.
Having enormous computing and storage capabilities is undoubtedly a prime factor in the growing importance of Big Data. We have capacity for analytics and data visualization that was unheard of a decade ago, including the ability to process large data volumes from disparate sources. Those sources include SQL databases and other structured data; machine-generated streams such as click streams, web logs, RFID and sensor data, and high-speed, low-latency data feeds; and a host of unstructured data, such as tweets.
The desire to build social networks and web-scale applications has led to systems that support millions of concurrent users and store and process information about hundreds of millions more. The availability of seemingly unlimited capacity has generated enthusiasm for Hadoop and other solutions for processing large data sets. The major players in the SQL database space, for example, are integrating Hadoop with their database product lines.
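For readers who haven't met Hadoop's programming model, here is a minimal sketch of the map/shuffle/reduce data flow in plain, single-process Python. A real Hadoop job would be written against the Hadoop APIs (or via Hadoop Streaming) and run in parallel across a cluster, but the shape of the computation is the same; the sample documents are invented for illustration.

```python
from collections import defaultdict

# A toy word count illustrating the map -> shuffle -> reduce data flow
# that Hadoop runs in parallel across a cluster. This single-process
# version is for illustration only.

documents = [
    "big data needs big storage",
    "cloud storage scales with data",
]

# Map: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group values by key (the Hadoop framework does this for you).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts for each word.
totals = {word: sum(counts) for word, counts in groups.items()}
print(totals)  # {'big': 2, 'data': 2, 'storage': 2, ...}
```

The appeal of the model is that the map and reduce steps are trivially parallelizable, so the same word-count logic scales from two strings to petabytes of logs.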
These new computing and storage requirements have revived, in some circles, a debate over whether to supplant tried-and-true languages, architectures, and database solutions. Recent rounds of that debate have centered on the attributes and capabilities of competing database solutions: horizontal scalability and sharding, ACID versus BASE properties (strong versus eventual consistency), schemas and type support, granularity of encryption, and query methods.
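To make the sharding half of that list concrete, here is a minimal sketch of hash-based sharding, one common route to horizontal scalability; the shard count and user IDs are invented for illustration:

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count, chosen for illustration

def shard_for(key: str) -> int:
    """Route a record key to a shard with a stable hash.

    hashlib is used instead of Python's built-in hash(), which is
    randomized per process and would route keys inconsistently
    across nodes.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Different user IDs land on different shards, so reads and writes
# spread horizontally across machines instead of piling onto one.
for user_id in ("alice", "bob", "carol"):
    print(f"{user_id} -> shard {shard_for(user_id)}")
```

Note that simple modulo placement reshuffles most keys whenever NUM_SHARDS changes, which is one reason production systems often reach for consistent hashing instead.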
One of the more interesting debates concerns types, schemas, type-less programming, and schema-less databases; I'll take a closer look at these issues in an upcoming blog post.
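In the meantime, the core contrast can be sketched in a few lines of Python. This is a hypothetical example in which SQLite stands in for a schema-first relational store and JSON documents for a schema-less document store, with all table, column, and field names invented:

```python
import json
import sqlite3

# Schema-first: a relational table declares its columns and types up
# front, and every row must conform to them.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
db.execute("INSERT INTO users VALUES (1, 'alice', 34)")

# Schema-less: each record can carry its own shape; JSON documents
# stand in here for what a NoSQL document store would hold.
docs = [
    {"id": 1, "name": "alice", "age": 34},
    {"id": 2, "name": "bob", "tags": ["admin"]},  # new field, no migration
]
for doc in docs:
    print(json.dumps(doc))
```

The relational table rejects rows that don't match its declared columns, while the document store accepts a record with a brand-new field and no migration, which is exactly the flexibility, and the risk, at the heart of the schema debate.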