Infrastructure for Operating on Big Data in Compute Clusters
Many of the most interesting data-rich applications are those that compute on large data sets. These "Big Data" applications, beyond simply operating on data that is big, are also typically data hungry in that the quality of their results improves with the quantity of data available.
As a result, one of the challenges associated with these applications is constructing a data storage system that scales in both capacity and delivered throughput. That is, as more data become available, or more throughput is required, administrators need the ability to easily expand the infrastructure that stores the data as well as the computing resources that will operate on it. Fortunately, these applications tend to be easy to parallelize. Further, they tend to operate on large data objects, and as a result, they are typically limited by the delivered bandwidth — not the seek performance — of a storage system.
Therefore, commodity, disk-based cluster hardware, deployed at scale, is a good fit for supporting big data applications. For example, the Open Cirrus cluster at Intel Labs Pittsburgh consists of a modest 200 servers; yet, it provides more than 1300 computing cores and over 600 TB of disk storage — enough to accommodate many big data problems. Here, each server acts as both a storage and compute server. Without careful engineering, however, a commodity cluster network can become the bottleneck for big data applications.
For example, the core networking infrastructure of the Intel Open Cirrus cluster consists of commodity 1-Gbps Ethernet components. There are approximately 10 racks with 15-40 servers per rack. Each rack has a top-of-rack (TOR) switch to which the servers in that rack are connected via 1-Gbps connections; the TOR switches in turn are connected to a central switch via (trunked) 4-Gbps connections. If a parallel big data application were forced to process its stored data by transmitting them all between racks, this central switch would present a bottleneck and the maximum throughput would be 10 x 4 Gbps = 40 Gbps (5 GB/s). However, the data objects are actually stored on over 700 magnetic disk drives, each of which can deliver close to 50 MB/s, for a theoretical throughput of approximately 35 GB/s. If the storage devices were solid state devices, such as the Intel X25-M which can sustain 250 MB/s, rather than magnetic media, the theoretical throughput could be even higher (175 GB/s).
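These back-of-the-envelope figures can be checked with a quick calculation, using only the counts given above:

```python
# Reproduce the bottleneck estimates from the text.
racks = 10
uplink_gbps = 4                           # trunked rack-to-core links
network_gbps = racks * uplink_gbps        # 40 Gbps at the central switch
network_gb_per_s = network_gbps / 8       # 5 GB/s (8 bits per byte)

disks = 700
disk_mb_per_s = 50                        # commodity magnetic disk
storage_gb_per_s = disks * disk_mb_per_s / 1000   # 35 GB/s

ssd_mb_per_s = 250                        # e.g., Intel X25-M sustained read
ssd_gb_per_s = disks * ssd_mb_per_s / 1000        # 175 GB/s

print(network_gb_per_s, storage_gb_per_s, ssd_gb_per_s)
```

The storage devices can therefore deliver roughly seven times (magnetic) to 35 times (SSD) what the network core can carry.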
Of course, upgrading the network (to 10-Gbps Ethernet components, for example) could reduce the disparity between the throughput delivered by the storage devices and that provided by the network, but a reasonable question to ask is this: Can we deliver application performance that matches the throughput provided by the storage devices without building an expensive cluster network?
Location-Aware Storage Systems
If big data applications can be written such that each block of their data is consumed on the server where it resides, the amount of data sent over the network can be drastically reduced. In this case, significant throughput can be delivered to the application without requiring a high-throughput network. In fact, the application performance should approach the throughput delivered by the storage system, which is the limiting factor in many big data applications, regardless of network performance.
This observation was one of the motivating factors for runtimes, such as Hadoop, that are based on the map/reduce paradigm. The assumption here is that the application's (big data) input can be organized into large blocks, say 64 MB, and that these blocks need not be processed in any particular order. With this assumption, the data can be processed by sending the computation to each of the servers containing data blocks that need to be processed and by operating on the blocks on that server in whatever order is most efficient.
The key here is that the cluster file system, such as HDFS, exposes location information describing where the data objects are stored, thereby enabling applications (or application runtimes) to be location-aware. In other words, the application task running on some server, N, can query the file system to determine which data blocks are stored on server N.
By enabling applications and runtimes to query the file system for data location information, we enable the application to consume data in place, avoiding network transmission, provided that the application and file system share a common representation of location information. This common understanding is not necessarily found in a system such as the Open Cirrus software architecture, shown in Figure 2, because the application executes in a virtual environment while the file system does not. In such a scenario, an application instance running inside a virtual machine with IP address A generally will be unable to determine whether or not it is running on the physical host with IP address B, which contains a desired data block.
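To make the idea concrete, here is a minimal sketch of location-aware task placement. The block-location map and the greedy scheduling function are hypothetical illustrations of the technique, not actual HDFS or Hadoop APIs:

```python
# Hypothetical block-location map, of the kind a cluster file system
# such as HDFS exposes: block id -> set of hostnames holding a replica.
block_locations = {
    "blk_001": {"node1", "node2"},
    "blk_002": {"node2", "node3"},
    "blk_003": {"node1", "node3"},
}

def assign_local_tasks(block_locations, hosts):
    """Greedily assign each block to a host that stores a replica, so the
    block is consumed in place rather than shipped over the network."""
    assignment = {}
    load = {h: 0 for h in hosts}
    for blk, replicas in sorted(block_locations.items()):
        local = [h for h in hosts if h in replicas]
        # Prefer the least-loaded local host; fall back to any host.
        host = min(local or hosts, key=lambda h: load[h])
        assignment[blk] = host
        load[host] += 1
    return assignment

print(assign_local_tasks(block_locations, ["node1", "node2", "node3"]))
```

With replicated blocks, some local assignment usually exists for every block, so the network carries little or no input data.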
To solve this problem, researchers from Intel Labs collaborated with a team from Carnegie Mellon University to develop two additional cluster services: the Data Location Service (DLS) and the Resource Telemetry Service (RTS). The DLS provides a standard interface for applications to query a cluster regarding the location of a data block regardless of the file/storage system storing that data block.1 Typically, the location response will be a hostname or IP address (or set of such locations if the block is replicated). The application may still be unable to interpret the location information if, for example, it is executing in a virtual environment. The RTS was designed to provide the missing information by being a single clearinghouse for cluster-wide location information. Applications access the RTS through a general interface that provides the relative distance between two location identifiers. Applications, such as Hadoop, can use the RTS information to assign the processing of data blocks to particular compute tasks in such a way that data transfers over the network are minimized. Further, the application can make this assignment in the presence of virtualization and without having detailed information regarding the operation of the underlying file system.
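As an illustration of how a relative-distance interface of this kind might look, consider the following sketch. The topology tables, identifier names, and three-level distance metric are assumptions made for the example, not the actual DLS/RTS implementation:

```python
# Hypothetical topology data: a clearinghouse service relates a VM's
# address to the physical host and rack it runs on.
host_of = {"vm_a": "host1", "vm_b": "host2",
           "host1": "host1", "host2": "host2", "host3": "host3"}
rack_of = {"host1": "rack1", "host2": "rack1", "host3": "rack2"}

def distance(loc1, loc2):
    """Relative distance between two location identifiers:
    0 = same physical host, 1 = same rack, 2 = farther apart."""
    h1, h2 = host_of[loc1], host_of[loc2]
    if h1 == h2:
        return 0
    if rack_of[h1] == rack_of[h2]:
        return 1
    return 2

# A virtualized task can now tell it is co-located with a data block's
# host even though their IP addresses differ.
print(distance("vm_a", "host1"))  # 0: same physical machine
print(distance("vm_a", "host3"))  # 2: different rack
```

A scheduler can simply prefer the candidate with the smallest distance, with no knowledge of how the underlying file system stores its blocks.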
Operators of cluster computing systems are concerned not only with the performance of these systems; they are also concerned with their operational cost, particularly in the area of power consumption. A common goal is power proportionality: the energy consumed by the cluster should be proportional to the work completed. For example, if processing 100 TB of data consumes 40 kilowatt-hours, then processing 1 TB of data would ideally consume 400 watt-hours.
An important implication of power proportionality on today's systems is that, if there is not enough work available to keep the entire cluster busy, some servers should be powered down (or placed in a low-power state). However, because cluster file systems such as HDFS distribute their stored data across many servers in the cluster (often randomly), powering down servers when their processing capability is not needed may be impractical; such actions may render the data stored on those servers unavailable. This is a limitation even when the data are replicated; if every block is replicated K times in the cluster, and the blocks are placed randomly, it may be impossible to power down more than K-1 servers without some block becoming unavailable.
One solution to this problem is to modify the random placement algorithm for data blocks. Researchers at Intel Labs, in collaboration with colleagues from Georgia Institute of Technology and Carnegie Mellon University, have designed such a placement strategy. The central idea is to organize the storage servers according to the equal-work layout policy, wherein the servers are divided into K pools, where K is the replication factor. Within a pool, blocks are placed on servers such that, for whatever number of servers is active, the active servers can process equal amounts of work.
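A small simulation illustrates why pool-aware placement tolerates power-down where random placement does not. The cluster size and the placement functions below are illustrative stand-ins, not the actual equal-work implementation:

```python
import random

def unavailable_blocks(placement, powered_off):
    """Count blocks whose replicas all sit on powered-off servers."""
    return sum(1 for replicas in placement.values()
               if replicas <= powered_off)

servers = set(range(12))
K = 3  # replication factor

# Random placement: each block's K replicas land on K random servers.
random.seed(0)
random_placement = {b: set(random.sample(sorted(servers), K))
                    for b in range(200)}

# Pool-based placement: servers are divided into K pools of 4, and each
# pool stores one complete replica of every block.
pools = [set(range(i * 4, (i + 1) * 4)) for i in range(K)]
pool_placement = {b: {min(p) + b % 4 for p in pools} for b in range(200)}

# Power down two of the three pools (8 of 12 servers).
powered_off = pools[1] | pools[2]
print(unavailable_blocks(pool_placement, powered_off))    # 0: pool 0 intact
print(unavailable_blocks(random_placement, powered_off))  # many blocks lost
```

With pool-based placement, the one remaining pool still holds a full copy of the data; with random placement, powering down that many servers strands a large fraction of the blocks.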
Processing Streaming Data with SLIPstream
Different technologies are needed for streaming data than for the stored data discussed so far. Video understanding and computer-vision techniques are computationally intensive tasks that are becoming increasingly important for a variety of applications. These include both throughput-oriented applications, such as multimedia search and retrieval, and latency-sensitive interactive perception applications such as surveillance, gaming, videoconferencing, and vision-based user interfaces. Enabling these types of interactive perception applications requires not only new algorithms and techniques, but also new runtime systems that optimize latency as well as throughput and make effective use of parallel computing machinery. The Scalable Low-latency Interactive Perception on streaming data (SLIPstream) research project seeks to develop the APIs, runtimes, and tools that are needed to create and effectively execute latency-sensitive streaming applications on parallel and distributed computing resources.
A Runtime for Streaming Perception Applications
A major effort in the SLIPstream project is to develop a set of APIs and a runtime environment to rapidly develop and deploy streaming perception applications on parallel hardware. Such applications, which employ computer vision techniques to understand the environment (such as detecting when a user is visible in a video stream) or interact with a user, pose some interesting challenges. In particular, they are very computationally demanding, require very low latency in interactive settings, and often have time varying and data-dependent execution characteristics. On the other hand, they generally have characteristics that make them amenable to parallel execution: they are often structured as a series of processing steps or transformations on a stream of data items (e.g., image frames), the items are often independent or nearly so, and even processing within a frame may be decomposable into smaller units. A developer with a good understanding of the algorithms involved and with expertise in writing parallel applications should be able to make effective use of parallel hardware for perception applications. The real challenge is enabling the computer vision and video analytics domain experts, who may not have expertise in parallel application development, to write and run their code on clusters of multi-core machines.
To this end, the SLIPstream team at Intel Labs Pittsburgh has developed the Sprout runtime and APIs, which seek to preserve the ease of writing sequential code, but which expose and exploit parallelism in streaming perception applications. The high-level architecture of Sprout is shown in Figure 3. This system is designed to exploit coarse-grained parallelism in the application, at the level of algorithms and processing steps, rather than fine-grained parallelism at the instruction or loop level. The Sprout programming model requires the application be expressed as a data flow graph, with vertices corresponding to processing steps, or stages, in an application, and the edges corresponding to explicit data dependencies between these stages. This model is suited to video analytics and computer vision on video streams, as these applications inherently have such structure. Sprout provides an easy-to-use API for defining the stages, which essentially wraps the sequential implementations of each algorithm used in the application in an object class that provides strongly typed inputs, outputs, and methods to adjust tunable parameters at runtime. Sprout also provides a library of common stages and stage templates, e.g., round-robin splitters, image tilers, camera capture stages, etc., and it incorporates support for useful data types, e.g., OpenCV image types; user-defined types are easily incorporated into the system. The Sprout stage interfaces hide much of the complexity of writing and running parallel and distributed code by automating the coordination and transfer of data between stages.
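The sketch below illustrates the stage-and-dataflow model in simplified form. The classes, stage names, and pipeline are invented for this illustration and do not reflect the actual Sprout API:

```python
class Stage:
    """Sketch of a dataflow stage: wraps a sequential function and makes
    its data dependencies (the graph edges) explicit."""
    def __init__(self, name, fn):
        self.name, self.fn = name, fn
        self.downstream = []
        self.outputs = []          # collected only at sink stages

    def connect(self, other):
        """Add an edge in the data flow graph; returns `other` for chaining."""
        self.downstream.append(other)
        return other

    def push(self, item):
        result = self.fn(item)
        if self.downstream:
            for stage in self.downstream:
                stage.push(result)
        else:
            self.outputs.append(result)

# A tiny pipeline: capture -> grayscale -> detect.
capture = Stage("capture", lambda frame: frame)
grayscale = Stage("grayscale", lambda f: [sum(px) / 3 for px in f])
detect = Stage("detect", lambda pixels: max(pixels) > 128)
capture.connect(grayscale).connect(detect)

capture.push([(255, 255, 255), (0, 0, 0)])  # one "frame" of two pixels
print(detect.outputs)  # [True]
```

The point of the abstraction is that each stage body remains ordinary sequential code; a runtime is then free to place independent stages, or data-parallel copies of a stage, on different cores or machines.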
A human-readable configuration file specifies how the application is constructed from the stages, essentially describing the data flow graph of the application. This file is used by the Sprout runtime to orchestrate the deployment and execution of the application on multiple cores and machines. The runtime also monitors execution times and latencies, using both white-box (stage data) and black-box system information. It supports both manual and automatic adaptation and tuning of the application at runtime.
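To give a flavor of such a description, here is a hypothetical edge-list configuration and the parsing it implies; the syntax is invented for this sketch and is not Sprout's actual format:

```python
# Invented, human-readable configuration describing the data flow graph;
# each line names one edge between two stages.
config_text = """
capture -> grayscale
grayscale -> detect
"""

def parse_dataflow(text):
    """Parse "src -> dst" lines into a list of graph edges."""
    edges = []
    for line in text.strip().splitlines():
        src, dst = (part.strip() for part in line.split("->"))
        edges.append((src, dst))
    return edges

print(parse_dataflow(config_text))
```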
A key aspect of the Sprout runtime, and SLIPstream research in general, is the focus on coarse-grained parallelism at the level of processing stages. One point this research seeks to demonstrate is that coarse-grained parallelism is sufficient to effectively exploit parallel computing resources in perception applications. Breaking up applications into task- and data-parallel pieces at this level of granularity is easily accomplished by non-experts in parallelization, unlike low-level vectorization and loop parallelization techniques. Furthermore, this alleviates much of the need for explicit concurrency management and the complexity, correctness, and performance issues that it entails. Finally, the mechanisms employed are complementary to low-level parallelization techniques, and experienced developers are free to implement threaded and vectorized stages, as well as employ parallel libraries such as Intel IPP. Effective use of both low- and high-level parallelism for a complex vision task, using SLIPstream, has recently been demonstrated.
Automatic Tuning of Distributed Streaming Applications
As SLIPstream is primarily concerned with interactive perception applications, ensuring low end-to-end latency in the applications is paramount. Algorithms in these applications typically have many adjustable parameters that can have a significant effect on both the fidelity and computational costs. In addition, the degree to which computations are parallelized (that is, the number of data parallel instances of a stage) may be flexible. If dynamically adjustable, such parameters provide an opportunity to control latency, but how one configures a parallel application to run optimally on a particular set of cluster resources remains an open problem.
To manage latency in streaming applications, an automatic tuning system has been developed that can dynamically adjust application parameters and degrees of parallelization of stages to bound application latency while maximizing fidelity. As applications can have multiple parallel stages and dozens of tunable parameters, the system first uses performance monitoring to identify the critical stages that contribute most to latency. It further reduces complexity by analyzing, in a grey box manner, which set of parameters and degree of parallelization controls affect the critical stages. Online machine learning is employed to learn and dynamically update performance models for the critical stages as a function of parameter settings. These models, in conjunction with the structure of the application provided in the data flow graph, are used to estimate end-to-end application latency. A solver is then used to either minimize latency or, given a fidelity model, select parameters to maximize fidelity subject to a latency requirement. Ongoing research is investigating how to learn such models quickly, and how to trade off time spent refining the models versus making use of them.
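The tuning step can be pictured as a constrained search over learned models. The latency and fidelity models below are toy stand-ins for the learned per-stage models described above, and the parameter names are invented for the example:

```python
import itertools

# Toy stand-ins for learned models: per-stage latency (ms) and result
# fidelity as functions of an image scale and a data-parallelism degree.
def stage_latency(scale, tiles):
    # Work grows with image area and shrinks with data parallelism.
    return 100 * scale * scale / tiles

def fidelity(scale, tiles):
    return scale  # higher resolution -> higher fidelity (toy model)

def tune(latency_bound_ms):
    """Pick the setting that maximizes fidelity subject to the latency
    bound, preferring fewer parallel instances on ties."""
    best = None
    for scale, tiles in itertools.product([0.25, 0.5, 1.0], [1, 2, 4]):
        if stage_latency(scale, tiles) <= latency_bound_ms:
            cand = (fidelity(scale, tiles), -tiles, scale, tiles)
            best = cand if best is None else max(best, cand)
    return best[2], best[3]

print(tune(30))  # a loose bound permits full resolution with 4 tiles
print(tune(10))  # a tight bound forces the scale down to 0.5
```

In the real system the models are learned online and the search is restricted to the critical stages, but the structure of the decision is the same: estimate end-to-end latency from per-stage models, then optimize fidelity under the bound.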
SLIPstream Algorithmic Research Efforts
In addition to the systems research, SLIPstream also involves research efforts on perception algorithms and how they are implemented. In particular, most computer vision code today is written in either a purely mathematical form, with little regard to implementation efficiency (other than in the asymptotic complexity sense), or it is written with very specific hardware in mind. The former has little chance of running efficiently on any system, while the latter often takes algorithmic shortcuts and sacrifices accuracy, robustness, or generality to run fast on a target platform. With the availability of a scalable infrastructure and programming model provided by Sprout, we believe there is great scope for developing perception applications between these extremes — applications that do not sacrifice algorithmic properties, yet can be readily parallelized and scaled up with additional hardware to achieve performance goals.
Parallelized applications on computing clusters can process much more data than is possible on a single machine. This ability to use more data can have significant ramifications for algorithms and applications. More data, used correctly, can mean greater accuracy or robustness. Processing more data can also allow simpler algorithms to be effective, potentially increasing processing capacity further. Finally, by combining more data with computation, we can enable entirely new capabilities, such as the synthesis of accurate novel views of a scene from arbitrary viewpoints by merging information from dozens of different camera views.