In addition to ensuring that all the reduce inputs with the same key end up in the same partition (a partition may contain many keys), and therefore at the same reducer, the shuffle sorts each partition by key. Sorting means that a reducer can process all the values for a given key in one batch, without the framework needing to buffer them in memory.
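The partition-then-sort behavior can be sketched in a few lines of plain Python (a conceptual illustration only, not the Hadoop API; the hash-modulo rule mirrors Hadoop's default partitioner):

```python
from collections import defaultdict

def partition(key, num_reducers):
    # Hadoop's default rule: hash the key modulo the number of reducers.
    return hash(key) % num_reducers

def shuffle(map_outputs, num_reducers):
    """Route each (key, value) pair to a reducer partition, then sort
    each partition by key so that a reducer sees all of one key's
    values together, with no in-memory buffering required."""
    partitions = defaultdict(list)
    for key, value in map_outputs:
        partitions[partition(key, num_reducers)].append((key, value))
    for pairs in partitions.values():
        pairs.sort(key=lambda kv: kv[0])  # sorting brings equal keys together
    return dict(partitions)
```

All pairs with the same key land in one partition, and within that partition they are adjacent after the sort, which is exactly the property the reducer relies on.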
The output of the
reduce function is written back to HDFS, where it can be viewed by the user, or fed into another MapReduce job in a workflow of jobs. The intermediate map output and reduce input files are deleted once the job is complete.
The implementation seems straightforward, but what Hadoop does very well is protect the user in the face of machine failure, which is common on large clusters simply by the laws of probability. If a node that stores a block replica or runs a map or reduce task fails, the MapReduce runtime reschedules the task on another node. The user may not even notice the failure, and the job completes in the usual way.
Similarly, if a disk fails, HDFS will re-replicate its blocks to another node in the cluster. No file is unavailable during the time that re-replication is taking place, since the other two replicas of each block are available on other nodes in the cluster.
The Hadoop Ecosystem
HDFS and MapReduce form the core of Hadoop, but they are just two parts in a growing system of components for large-scale data processing. Just like the term "Linux" can be understood to mean both the kernel and a larger distribution of the kernel and applications, so the term "Hadoop" can encompass a stack of components running on HDFS and MapReduce, forming a sort of Big Data Operating System. Today, there are many vendors offering Hadoop distributions, and in the remainder of this article, I look at some of the components that can be found in most distributions.
Analysis: Pig, Hive, and More
The first projects to emerge in the Hadoop ecosystem were for users who wanted to run analytics but either could not or did not want to write MapReduce jobs. Apache Pig and Apache Hive provide languages and runtimes for data analysis. Pig offers a dataflow language for operating on sets of tuples, while Hive offers a dialect of SQL that is fairly close to SQL-92. I will show examples of both languages in the next part of this series.
Hive is no longer the only option for SQL on Hadoop, although it is the most mature option. Over the last year or so, various vendors and open source projects (such as Apache Drill and Cloudera Impala) have announced Hadoop SQL implementations that replace the MapReduce runtime that Hive depends on with a lower latency system.
For Java programmers, the Cascading framework and Apache Crunch are two libraries that provide higher-level APIs than MapReduce. The simple key-value abstraction is replaced by operations on tuples, such as joins, and multi-stage jobs are handled implicitly by the library (operations are compiled into MapReduce jobs). Both projects also provide Scala APIs, which makes for very concise queries. One advantage that both systems have over Pig and Hive is that writing user-defined functions (UDFs) is smoother in Cascading and Crunch, because they are written in the host language (Java or Scala), rather than a different language (Java for Pig and Hive, although Pig also supports Python).
For data scientists, the Apache Mahout project offers a rich set of machine-learning libraries, including algorithms such as clustering and classification. Many of the algorithms are written to use MapReduce, so they scale to large datasets.
Realtime Datastore: HBase
HDFS performs very well for the use cases it was designed for: streaming reads and writes of large files. However, there are other data access patterns where a datastore with different characteristics is more appropriate. Apache HBase supports fast random access, for both reads and writes, to large tabular datasets that may consist of billions of rows and millions of columns. Unlike some NoSQL datastores, however, HBase reads and writes are strongly consistent. Another useful characteristic of HBase is that you do not have to forfeit the ability to perform batch analysis: HBase can act as a source or a sink for MapReduce jobs, and Pig, Hive, and the other analysis frameworks can run over datasets stored in HBase.
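The essence of HBase's data model — rows kept sorted by row key, supporting point reads, point writes, and range scans — can be modeled in miniature (a toy sketch only; real HBase adds column families, cell versions, and distribution of key ranges across region servers):

```python
import bisect

class MiniTable:
    """Toy model of an HBase-style table: rows sorted by row key,
    each row holding a column -> value map."""

    def __init__(self):
        self._keys = []   # row keys, kept sorted
        self._rows = {}   # row key -> {column: value}

    def put(self, row, column, value):
        # Random write: insert or update a single cell.
        if row not in self._rows:
            bisect.insort(self._keys, row)
            self._rows[row] = {}
        self._rows[row][column] = value

    def get(self, row, column):
        # Random read: fetch a single cell by row key and column.
        return self._rows.get(row, {}).get(column)

    def scan(self, start, stop):
        # Range scan over the sorted row keys (stop is exclusive),
        # the access pattern HBase serves efficiently.
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]
```

Because the keys stay sorted, both single-cell lookups and contiguous scans are cheap, which is what distinguishes this access pattern from HDFS's streaming reads of whole files.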
Data Ingest: Flume and Sqoop
A common problem that new users of Hadoop face is how to get their data into Hadoop. From the beginning, there has been a way to copy data efficiently between Hadoop clusters (using a tool called distcp), but this presented a chicken-and-egg problem: How do you get all your data into Hadoop in the first place? Two ingest tools have emerged to help out: Apache Flume and Apache Sqoop. Flume is a system for collecting log data and storing it in HDFS or HBase. Flume agents run on the cluster, continuously processing events from sources such as server log files or syslog (many others are supported as well) and reliably delivering them to the target datastore. By contrast, Sqoop is a bulk ingest system, and the canonical use case is performing a nightly dump of all the data in a transactional relational database into Hadoop for offline analysis.
Workflow Scheduling: Oozie
Hadoop clusters have to support many users, sometimes across different departments of an organization. Typically, they run a mix of ad hoc queries and production jobs that must run to a particular schedule. Managing the cluster's resources so that users' needs are balanced and production deadlines are met is the job of the MapReduce scheduler. Another project, Apache Oozie, provides a service for defining, scheduling, and running workflows, which are pipelines of dependent jobs. A simple workflow might be a MapReduce job that does data preparation followed by a Hive query that produces a daily report. More complex workflows may be made up of tens or even hundreds of jobs.
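The core idea of a workflow engine — run each job only after its dependencies have completed — amounts to a topological ordering of a job graph. A minimal sketch in pure Python (the job names are illustrative; real Oozie workflows are defined in XML and submitted to the Oozie server, not expressed this way):

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def run_workflow(jobs, deps):
    """Run callables in dependency order.

    jobs: name -> callable standing in for a cluster job
    deps: name -> set of names that must finish first
    A real workflow engine would submit each step to the cluster and
    wait for completion before launching its dependents.
    """
    order = TopologicalSorter(deps).static_order()
    return [jobs[name]() for name in order]

# The two-step example from the text: data preparation, then a report.
example_jobs = {
    'prepare': lambda: 'prepared',
    'report':  lambda: 'reported',
}
example_deps = {'report': {'prepare'}}
```

Calling `run_workflow(example_jobs, example_deps)` runs `prepare` before `report`, regardless of the order the jobs were declared in.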
Data and Metadata: Avro and HCatalog
Most users of Hadoop start by storing data as plain text. Text is simple to work with; however, it is inefficient for large datasets and there are lots of good alternatives that work with Hadoop.
Apache Avro is a language-independent, compact, fast, binary data format that is growing in popularity and is supported by most components in the Hadoop ecosystem. Avro data is described by a schema, and one interesting feature is that the schema is stored in the same file as the data it describes, so files are self-describing. Avro does not require code generation, making it possible to write generic tools for processing Avro data with any schema.
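The self-describing property can be illustrated with a toy container that writes the schema ahead of the records (this is not the real Avro binary format — Avro containers also carry a magic number, codec metadata, and sync markers — it only demonstrates the embedded-schema idea):

```python
import json
import struct

def write_container(path, schema, records):
    """Write a toy self-describing file: a length-prefixed JSON schema,
    followed by length-prefixed JSON records."""
    with open(path, "wb") as f:
        for obj in [schema] + list(records):
            body = json.dumps(obj).encode("utf-8")
            f.write(struct.pack(">I", len(body)))  # 4-byte length prefix
            f.write(body)

def read_container(path):
    """A generic reader: no generated code and no out-of-band schema
    needed, because the schema travels with the data."""
    with open(path, "rb") as f:
        def next_chunk():
            raw = f.read(4)
            if not raw:
                return None
            (n,) = struct.unpack(">I", raw)
            return f.read(n)

        schema = json.loads(next_chunk())
        records = []
        while (chunk := next_chunk()) is not None:
            records.append(json.loads(chunk))
        return schema, records
```

Any tool can open such a file, read the schema, and interpret the records that follow — which is what lets generic Avro tools process data with any schema.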
There has been a recent resurgence of interest in columnar formats in Hadoop. Columnar formats store the values of each column together on disk, rather than writing data row by row. They are ideal for storing datasets with many columns when most queries access only a few of the columns, since the unread columns can be skipped very efficiently. Hive has provided a columnar format for a while in the form of RCFile. Recently, Trevni was released as a part of Avro with the goal of providing a high-performance, language-neutral columnar format for Hadoop.
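The row-versus-column trade-off is easy to see in miniature (a conceptual sketch; real columnar formats such as RCFile and Trevni also add per-column encoding and compression, which this ignores):

```python
def to_columns(rows, columns):
    """Pivot row-oriented records into per-column vectors, the layout
    a columnar format uses on disk."""
    return {col: [row[col] for row in rows] for col in columns}

def project(column_store, wanted):
    # A query touching few columns reads only those vectors; the
    # unread columns are simply skipped, never deserialized.
    return {col: column_store[col] for col in wanted}

rows = [
    {"id": 1, "name": "alice", "age": 30},
    {"id": 2, "name": "bob",   "age": 40},
]
store = to_columns(rows, ["id", "name", "age"])
```

A query that only needs `age` — say, an average — touches a single vector (`store["age"]`), while a row-oriented layout would force it to read every field of every row.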
One of Hive's successes has been its metastore: the database of table definitions that describe the data stored in HDFS. The Apache HCatalog project was started to open up the metastore so that other projects, such as Pig and MapReduce, could share table definitions with Hive.
Packaging and Deployment: Bigtop and Ambari
With such a large number of components in a Hadoop deployment, it is difficult to pick a combination that works well together. Apache Bigtop relieves users of this burden by testing and packaging a known-good set of Hadoop components. Hadoop distributions such as Cloudera's CDH and Hortonworks' HDP build on Bigtop for their testing and packaging.
Running a cluster is more complex than simply installing the packages and starting the service daemons. Projects such as Apache Ambari and Cloudera Manager provide a cluster-level interface for managing configuration, monitoring, alerts, log file search, inter-service dependencies, service upgrades, and other details.
Next Generation Hadoop
Hadoop version 2 has a new resource management framework called YARN (Yet Another Resource Negotiator), which generalizes data processing beyond MapReduce. The prototypical YARN application is MapReduce, which will not be going away because it shines for batch computations over very large datasets. However, YARN opens up the processing power of a Hadoop cluster to new distributed processing algorithms, such as large-scale graph processing (which is inefficient in MapReduce because the graph has to be materialized to disk between successive MapReduce jobs in the workflow).
In the next article in this series, I will demonstrate how to write and run a MapReduce job on Hadoop.
Tom White has been an Apache Hadoop committer since February 2007, and is a member of the Apache Software Foundation. He is an engineer at Cloudera, a company set up to offer Hadoop tools, support, and training. He is the author of the best-selling O'Reilly book, Hadoop: The Definitive Guide.