Today, we speak casually of terabytes of data the way we used to speak of megabytes. Multi-petabyte data stores are common and growing. To handle all this data quickly and efficiently, and to extract useful business insights from it, we need programming paradigms that combine the scalable use of multicore processors and many-node clusters with data processing and analysis. No single programming paradigm can serve all problem domains, so expect many alternatives followed by eventual consolidation. We're entering an era of innovation, and the time is ripe for a robust programming paradigm well suited to big data processing and analytics. In this article, I'll discuss dataflow, a technique that manages very large data flows into and through islands of high-speed computation.
What Is Dataflow?
As a software architecture, dataflow consists of nodes in a graph connected by directed dataflow queues. The graph represents an application or program, and the nodes represent functions to be applied to data. The dataflow queues transport the data between connected functions. Figure 1 shows connectors on the left reading data from data sources. The data flows from left to right, from node to node. In most cases, each operator in the graph receives data from its input queue, applies its function to the data, and outputs the results to its output queue. For example, the second operator, titled "Missing Value," reads rows of data, looks for missing values in each row, and outputs rows with the missing values replaced to its output queue.
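To make the operator contract concrete, here is a minimal sketch of a "Missing Value" operator like the one just described. The `Operator` interface and `MissingValue` class are illustrative names I've invented for this sketch, not part of any particular dataflow library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MissingValueDemo {
    // The operator contract: consume rows, apply a function, produce rows.
    interface Operator {
        List<String[]> process(List<String[]> input);
    }

    // Replaces null fields in each row with a default value.
    static class MissingValue implements Operator {
        private final String replacement;

        MissingValue(String replacement) {
            this.replacement = replacement;
        }

        public List<String[]> process(List<String[]> input) {
            List<String[]> output = new ArrayList<>();
            for (String[] row : input) {
                String[] fixed = new String[row.length];
                for (int i = 0; i < row.length; i++) {
                    fixed[i] = (row[i] == null) ? replacement : row[i];
                }
                output.add(fixed);
            }
            return output;
        }
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"alice", null, "42"},
            new String[] {"bob", "NY", null});
        for (String[] row : new MissingValue("N/A").process(rows)) {
            System.out.println(String.join(",", row));
        }
        // prints:
        // alice,N/A,42
        // bob,NY,N/A
    }
}
```

The operator knows nothing about where its input comes from or where its output goes; that wiring is the job of the graph, as discussed next.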
Composing a dataflow graph is a process of creating instances of operators and linking them together by stitching the output of one operator to the input of another. This composition model provides great flexibility in how users link operators to create applications. By thinking of each operator as a function consuming data and producing an output, dataflow provides the higher-order mechanism (composition) for organizing those functions.
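If each operator is viewed as a function from rows to rows, composing a linear dataflow graph amounts to ordinary function composition. A minimal sketch using `java.util.function.Function` (the two operators here, trim and upper-case, are hypothetical examples):

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ComposeDemo {
    public static void main(String[] args) {
        // Two hypothetical operators: trim whitespace, then upper-case.
        Function<List<String>, List<String>> trim =
            rows -> rows.stream().map(String::trim).collect(Collectors.toList());
        Function<List<String>, List<String>> upper =
            rows -> rows.stream().map(String::toUpperCase).collect(Collectors.toList());

        // Stitch the output of one operator to the input of the next.
        Function<List<String>, List<String>> graph = trim.andThen(upper);

        System.out.println(graph.apply(List.of(" ada ", " lovelace")));
        // prints: [ADA, LOVELACE]
    }
}
```

Real dataflow frameworks replace direct function calls with queues between the stages, but the composition idea is the same: the graph is built by linking outputs to inputs.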
Dataflow's shared-nothing model yields a simple programming paradigm, somewhat analogous to the UNIX shell, where commands can be piped from one to the next to produce the desired output. Each shell command is independent, but all live by a contract to consume standard input and pipe out standard output. Dataflow operators live by a similar kind of contract, using queues. Additionally, operators in dataflow can have multiple input and output queues. They also have properties that allow manipulating their behavior. Flow control is applied to queues to enable different consumption rates between operators. Deadlock detection can also be applied to keep operators from blocking one another indefinitely.
The dataflow model provides pipeline parallelism by its very nature. As an operator consumes input it pushes its output downstream. Operators normally don't have to see all of their input before producing output. As each operator works, a pipeline of data makes its way through the graph. Each operator can be represented by a thread, or possibly a set of threads. With many operators in a dataflow graph working in parallel, dataflow readily takes advantage of multicore processors.
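A minimal sketch of this pipeline parallelism, assuming one thread per operator: the stages below communicate through bounded `ArrayBlockingQueue` instances, so a fast producer blocks until its consumer catches up (the flow control mentioned earlier), and the `EOS` sentinel is an illustrative end-of-stream convention, not a standard API.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelineDemo {
    static final int EOS = -1; // illustrative end-of-stream sentinel

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> q1 = new ArrayBlockingQueue<>(4);
        BlockingQueue<Integer> q2 = new ArrayBlockingQueue<>(4);

        // Source operator: emits 1..5, then end-of-stream.
        Thread source = new Thread(() -> {
            try {
                for (int i = 1; i <= 5; i++) q1.put(i);
                q1.put(EOS);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        // Middle operator: squares each value as soon as it arrives.
        Thread square = new Thread(() -> {
            try {
                int v;
                while ((v = q1.take()) != EOS) q2.put(v * v);
                q2.put(EOS);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        source.start();
        square.start();

        // Sink operator (main thread): sums the squares.
        int v, sum = 0;
        while ((v = q2.take()) != EOS) sum += v;
        System.out.println(sum); // prints 55 (1 + 4 + 9 + 16 + 25)

        source.join();
        square.join();
    }
}
```

Note that no stage waits for the full input: each value is squared and summed while the source is still producing, which is exactly the pipeline effect described above.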
Dataflow also supports horizontal partitioning, allowing data to be segmented and applied to a replicated section of a graph. This partitioning can be dynamic, adjusting to the number of computing resources available at execution time. Horizontal partitioning is a great way to "divide and conquer" data problems where data dependency is not an issue. It can both scale up to multicore processors and scale out to multiple nodes within a cluster.
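The partitioning idea can be sketched as follows, assuming a simple hash partitioner and a replicated summing operator (both invented for this example): the input is split into P partitions, each processed by its own copy of the operator, with P chosen from the processors available at run time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PartitionDemo {
    public static void main(String[] args) throws Exception {
        // Choose the partition count dynamically from available resources.
        int p = Runtime.getRuntime().availableProcessors();
        List<List<Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < p; i++) partitions.add(new ArrayList<>());

        // Partition step: hash each value to one of P partitions.
        for (int v = 1; v <= 100; v++) partitions.get(v % p).add(v);

        // Replicated operator: each partition is summed independently.
        ExecutorService pool = Executors.newFixedThreadPool(p);
        List<Future<Integer>> results = new ArrayList<>();
        for (List<Integer> part : partitions) {
            results.add(pool.submit(
                () -> part.stream().mapToInt(Integer::intValue).sum()));
        }

        // Merge step: combine the partial results.
        int total = 0;
        for (Future<Integer> f : results) total += f.get();
        pool.shutdown();
        System.out.println(total); // prints 5050
    }
}
```

Because summing has no dependencies between rows, the partitions never need to communicate; that independence is what makes this a "divide and conquer" fit.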
One definition of dataflow frequently found in computer science textbooks describes an architecture whereby changing the value of a variable results in the automatic recalculation of other dependent variables. This is more of the "spreadsheet" model made popular by Microsoft Excel and others.
In his seminal paper titled "The Semantics of a Simple Language for Parallel Processing," Gilles Kahn defines Kahn Process Networks (KPNs). According to Wikipedia, KPNs define a programming model "...where a group of deterministic sequential processes are communicating through unbounded FIFO channels." This matches closely with how we've been describing dataflow.
KPNs were originally designed with distributed programming in mind. They are also heavily used for modeling signal-processing systems. The functional approach of KPNs is a good fit for signal processing, allowing filters to be built as pluggable functions that can be inserted as needed within a signal flow.
Over time, the original spreadsheet concept of dataflow and KPNs have evolved into the software architecture of dataflow described here.
The concepts of dataflow are easy to grasp, which leads to "design-time scalability." As Figure 1 shows, dataflow lends itself well to GUI environments, making high-performance computing accessible to a far broader set of users.
The java.util.concurrent library has been in existence since Java 5, yet few programmers have deep experience with it. With the proliferation of multicore processors and the abundance of raw data, developers need the ability to write high-performing, scalable software, which is almost impossible if we stay in today's thread-centric, shared-memory model of programming. Dataflow provides a means of composing scalable applications using a building block approach. It does so while abstracting away low-level issues such as threads, memory synchronization, and race conditions.
Dataflow has many of the good qualities of functional programming. For example, the data flowing through dataflow queues is immutable, eliminating concerns about shared memory synchronization or side effects from upstream operators. Immutability also allows queues to have multiple readers for good functional reuse. Having no global state also limits side effects. By contract, each operator in a dataflow graph only mutates the data it receives according to the function it provides. It's easy to look at a dataflow graph and surmise how it works and what it does. The similarity of this approach to the actor model is hard to overlook.
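The multiple-readers point can be sketched in a few lines, assuming immutable batches of data (here via `List.of`, which returns an unmodifiable list): two downstream readers consume the same batch concurrently with no copying and no locking on the data, because neither can mutate what the other sees.

```java
import java.util.List;

public class MultiReaderDemo {
    public static void main(String[] args) throws InterruptedException {
        // An immutable batch: safe to hand to multiple readers at once.
        List<Integer> batch = List.of(3, 1, 4, 1, 5);

        // Two independent reader operators over the same batch.
        Thread sumReader = new Thread(() -> System.out.println(
            "sum=" + batch.stream().mapToInt(Integer::intValue).sum()));
        Thread maxReader = new Thread(() -> System.out.println(
            "max=" + batch.stream().mapToInt(Integer::intValue).max().getAsInt()));

        sumReader.start();
        maxReader.start();
        sumReader.join();
        maxReader.join();
        // prints, in some order: sum=14 and max=5
    }
}
```

Had the batch been mutable, each reader would need a defensive copy or a lock; immutability makes the fan-out free.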