Recently, a lot of attention in the data-management space has been paid to "scale out" systems such as Hadoop and HBase. The interest in the cloud has further emphasized this focus on horizontal scaling. But while the need to scale out is critical for tackling Big Data problems, the need to scale up is equally important. You can enjoy dramatic hardware savings, as well as much better throughput, with a system that pays attention to scaling up.
- The Role of the WAN in Your Hybrid Cloud
- Book Expert: Advanced Analytics with Spark: Patterns for Learning Data at Scale
Scalability has two faces: up and out. Scaling up means that the system performs better as you add more resources (CPU, memory, etc.) to a single node in the system. Scaling out also involves adding more nodes to a distributed system. But there is more to this scalability is also about performance.
When you build a complex distributed application, you work with certain building blocks. The end result has to scale out, so you can easily add more hardware resources in the face of higher load. So if you build your scale out system using high-performance, upwardly scalable components, you can dramatically reduce your cluster size. This really starts to show as you run bigger clusters (on the order of thousands of machines). You should scale up in order to scale out well.
A system that is built around performance software scales better. For example, HipHop scales better than PHP because you can run the same workload on fewer servers.
Scaling Relational Databases
Relational databases are no different. Working with a large, complex data sets in the context of an RDBMS involves challenges for both scaling up and out.
Traditionally, databases scale up very well. A lot of research and development time has gone into making sure enterprise-class database systems run well on powerful hardware, and there are many forces within the industry aligned to make this happen not just software vendors, but also hardware manufacturers who want to show how many more transactions per second they can push on new hardware.
Scaling relational databases out is a difficult problem. However, it's been solved for some systems, and before I dive into how relational databases scale out, let's classify the different database systems and scale out approaches.
OLTP vs. OLAP
OLTP and OLAP are the two standard database workloads. OLTP is handles mission-critical, real-time transactional workloads. It enables processing in which the system responds immediately to user requests. An ATM machine for a bank is a typical example of a commercial transaction-processing application. Any Web or mobile application using a database is also an OLTP application.
OLAP workloads contain a lot of aggregations over large amounts of data. It allows you to slice and dice your historical data as well analyze it and provide forecasts. But since one size doesn't fit all, scaling approaches for OLTP and OLAP are substantially different.
There are several products that scale OLAP well. The most well-known open source product on the market today is Hadoop/Hive. There are also commercial OLAP databases available, of course, including Vertica, Greenplum, and others.
Here is what enables database scaling for these systems:
- Columnar indexes and compression. This is a way compress data efficiently with the ability to compute aggregates without decompression. This saves the amount of work you need to do per machine for a given query. Note that this is not a scaling out technology; however, it does save the amount of computation per node on the distributed system. It's a performance play.
- Intra-parallel query execution. As a distributed system runs a query, it breaks it apart into subqueries and pushes them into individual nodes. These subqueries can be evaluated in parallel by partitioning the work across cores. It's a scale up technique that helps build a better scalable system.
- Shared-nothing architecture. This is a scale out approach that doesn't have a single point of contention.
- Distributed query optimizer. This is a system that intelligently breaks a SQL query apart and pushes it into the SQL nodes.
Vertica, in particular, has invested into scale up with its columnar indexes and compression. This has allowed it to run analytics on smaller clusters, sometimes faster than Hadoop.
Unlike OLAP, OLTP workloads have different requirements.
- They are real time, and randomly access individual data records. This means that columnar indexes won't work as well in OLTP as in OLAP.
- Fewer queries touch large quantities of data. This means that parallel query execution won't help as much.
- OLAP systems often sacrifice parallel real-time data ingestion in favor of batched data loading.
- OLTP systems are mission-critical and require close to 100% uptime.
Two common strategies for scaling OLTP are share-everything systems and sharding. Share-everything systems are meant to run on top of expensive I/O subsystems, such as SANs, but they ultimately have a bottleneck with regards to disk I/O. One of the commercial implementations of the shared-disk system is Oracle RAC. This implementation solves the problems of CPU-bound workloads, but breaks down once you start running I/O-bound workloads that hit a storage bottleneck.
With sharding, because OLTP queries tend to be very simple, it's possible to just run a layer that manages multiple servers running RDMSs shards that don't know anything about the fact that they work in a clustered environment. Facebook is known for running MySQL shards. Scaling up this way is beneficial because it allows you to reduce the hardware footprint. There are a lot of companies that choose to run sharded systems, mostly due to the fact that they rarely need to run queries that cross shard boundaries. Sharding is built around systems that scale up well and perform better than clustering solutions such as HBase.
Scaling out is incredibly important in the world of big data, but in the long run, systems that invest in scaling up as well will deliver more value through better performance and savings on hardware.
Nikita Shamgunov is the CTO of MemSQL. (Andrew Binstock is on vacation.)