In recent years, advances in semiconductor electronics have pushed the instrumentation of our world to unprecedented levels. Sensors are now all around us: many cell phones contain GPS receivers as well as cameras, doorways have motion detectors, stop lights sense vehicles at intersections, and satellites orbiting overhead are constantly imaging the Earth. Additionally, we have data sourced electronically: feeds from social networking sites, crawls of Web pages, repositories of medical images, results from computer simulations, etc. Many of the data objects from these sources are collected for analysis, archived, subjected to re-analysis, cross-correlated with other data objects, and processed to create additional, derived data sets.
The result is that we live in a world that is data rich. In this article, we consider two types of data sources: stored and streaming. A stored data object is just that: information that has been archived in some way. A corpus of digital images stored on a collection of magnetic disks would be an example of stored data. Streaming data objects have a real-time component; a live video feed is the canonical example of streaming data. The two types of data present different processing challenges in that applications operating on stored data are often throughput-sensitive, while those operating on streaming data are often latency-sensitive. While the two types of data present subtly different performance constraints, both require significant, scalable computing resources. For example, an image search application operating on stored images may need to scale out depending on the number of images or the complexity of the search. Similarly, an application executing a face-detection algorithm on live video may need to scale out if faces are detected and more compute-intensive face recognition algorithms are invoked.
Cloud computing technologies enable many users to share modern computing clusters while providing mechanisms for scaling applications as needed. As a result, researchers in Intel Labs are investigating what challenges arise when leveraging cloud computing technologies in the context of rich data applications operating on either stored or streaming data, and what solutions may address those challenges. This research program includes support of the Open Cirrus research testbed, development of an open-source software stack for operating on stored data, development of a runtime system for operating on streaming data, and exploration of the benefits resulting from the integration of optical networks in compute clusters.
Supporting Open Research in Cloud Computing: Open Cirrus
When considering the topic of cloud computing on large data sets, many questions suggest themselves at an early stage. How should the data be organized and stored? What are the important system software components that enable access to the data? How should those components be organized? What are the appropriate user interfaces? How can the data be processed in the most efficient manner possible?
Answering such a broad array of questions is difficult for a single research organization; a research agenda this broad naturally requires a vibrant research community. To help provide cloud computing resources to this community, Intel, HP, and Yahoo!, in collaboration with the National Science Foundation, sponsored the Open Cirrus cloud computing testbed. (Open Cirrus is a trademark of Yahoo!)
The goals of the Open Cirrus project are to:
- Foster systems-level research in cloud computing.
- Encourage new cloud computing applications and applications-level research.
- Collect and share experimental datasets.
- Develop open-source stacks and APIs for the cloud.
To achieve those goals, Open Cirrus provides a world-wide, federated collection of cloud computing sites and a software architecture designed to unify those sites into a coherent platform. The sites (shown in Figure 1), each of which includes a cluster of at least 1,000 cores, are provided by the Open Cirrus collaborating institutions: HP, Intel, Yahoo!, the University of Illinois (UIUC), Karlsruhe Institute of Technology (KIT), the Infocomm Development Authority (IDA) of Singapore, the Russian Academy of Sciences (RAS), the Electronics and Telecommunications Research Institute (ETRI) of South Korea, the Malaysian Institute Of Microelectronic Systems (MIMOS), and Carnegie Mellon University (CMU).
The Open Cirrus testbed is intended to support research at all levels of the cloud-computing stack, from the lowest layers that interact directly with hardware to the highest application layers. However, research in the upper layers requires that lower-level software be available and stable. Consequently, the Open Cirrus community has adopted a common software service architecture; the core services of this architecture are shown in Figure 2. All core software components are open-source projects managed by the Apache Software Foundation. Intel Labs has made significant contributions to the development of Zoni and Tashi, and Yahoo! has made significant contributions to the development of Hadoop and HDFS.
The primary responsibility of the lowest software layer (Zoni) is to partition the cluster into domains. A domain is a set of compute servers that are network-isolated from the rest of the servers in the cluster (by the use of VLANs). When users experiment with system software that interacts with key networking components, such as DHCP services, they, or the cluster system administrators, will first use Zoni to create an isolated domain for the experiment; in this way, if the experiment goes awry, it cannot affect the normal operation of the cluster (and other activity in the cluster cannot interfere with the experiment).
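The bookkeeping behind this kind of domain partitioning can be sketched in a few lines of Python. This is an illustrative model only, not Zoni's actual API: the `Cluster` class, its method names, and the assignment of one fresh VLAN id per domain are assumptions made for the sketch.

```python
# Illustrative model of VLAN-based domain partitioning (hypothetical API;
# Zoni's real interface may differ).

class Cluster:
    def __init__(self, servers):
        self.free = set(servers)   # servers not yet assigned to any domain
        self.domains = {}          # domain name -> (vlan id, set of servers)
        self.next_vlan = 100       # assumed starting VLAN id

    def allocate_domain(self, name, count):
        """Reserve `count` servers and isolate them on a fresh VLAN."""
        if count > len(self.free):
            raise ValueError("not enough free servers")
        picked = {self.free.pop() for _ in range(count)}
        vlan = self.next_vlan
        self.next_vlan += 1
        self.domains[name] = (vlan, picked)
        return vlan, picked

    def release_domain(self, name):
        """Tear down a domain and return its servers to the free pool."""
        _, servers = self.domains.pop(name)
        self.free |= servers
```

In this model, an experiment that needs to run its own DHCP service would be given a domain on its own VLAN, so its traffic never reaches servers in other domains; releasing the domain returns the servers to the shared pool.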
Most of the research that does not interact with core networking services, however, will take place in the primary domain of the cluster. This domain is considered to be for production use. Experiments in the primary domain are isolated from other activity by operating in virtual machine environments, and the virtual machines are managed by a cluster management layer such as Tashi. Tashi enables users to rapidly deploy virtual machine instances in the cluster by specifying attributes of the virtual machine (such as number of processors and the amount of memory) as well as the software that should run within that virtual machine.
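A virtual machine request of the kind described above might look like the following sketch. The `VmSpec` structure and its field names are assumptions for illustration; they are not Tashi's actual request format.

```python
# Hypothetical sketch of a VM instance specification for a Tashi-like
# cluster manager; field names are assumptions, not Tashi's real schema.

from dataclasses import dataclass

@dataclass
class VmSpec:
    name: str
    cores: int       # number of virtual processors
    memory_mb: int   # memory allocated to the instance
    image: str       # disk image holding the software to run inside the VM

def validate(spec):
    """Basic sanity checks a scheduler might apply before placing the VM."""
    return spec.cores > 0 and spec.memory_mb > 0 and bool(spec.image)

spec = VmSpec(name="worker-01", cores=4, memory_mb=8192,
              image="linux-hadoop-worker.img")
```

The point of such a declarative specification is that the user states only the attributes of the desired instance; the cluster manager chooses where to place it and boots the named image there.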
However, the data sets in the cluster are potentially valuable to many users and consequently are ideally not stored in virtual machine images. Instead, the Open Cirrus core services include a cluster file system that resides beneath the virtual machine layer. In this way, data stored in the cluster file system are accessible from any of the virtual machines operating in the primary domain. After evaluating many cluster storage options, the Open Cirrus community concluded that the Hadoop Distributed File System (HDFS) best fit its needs.
By leveraging the virtual machine layer, the cluster administrators can provide any number of application-level services. The Open Cirrus software service architecture explicitly suggests one such application runtime: the Hadoop map/reduce framework. This framework is particularly suitable for enabling cluster users to process data stored in the cluster file system.
Naturally, the utility of these clusters would be quite limited if they only hosted the development of these core services. Fortunately, many of the cluster users are not involved directly in research on cloud computing; instead, they simply use the Open Cirrus clusters as computing resources in the course of conducting research in some other field. This use of the cluster is welcome and encouraged, because these users provide a realistic context for evaluating the system software by providing authentic data-rich workloads.