This is an informal overview of the major library enhancements in Java SE 8 to take advantage of new language features, primarily lambda expressions and default methods, specified by JSR 335 and implemented in the OpenJDK Lambda Project. The new language features for Java SE8 are described in Lambda Expression in Java 8, which should be read first.
Background
Had lambda expressions (closures) been part of the Java language from
the beginning, our Collections APIs would certainly look different
than they do today. As the Java language acquires lambda expressions
as part of JSR 335, this has the unfortunate side effect of making
our Collections
interfaces look even more out of date! While it might
be tempting to start from scratch and build a replacement collections
framework ("Collections II"), replacing the collections framework
would be a major task, as Collections
interfaces permeate the entire
Java ecosystem, and the adoption lag would be many years. Instead, we
pursue an evolutionary strategy of adding extension methods to
existing interfaces (such as Collection
, List
, or Iterable
), and
adding a stream abstraction (for example, java.util.stream.Stream
) for
performing aggregate operations on datasets, and
retrofitting existing classes to provide stream views, enabling many
of the desired idioms without making people trade in their trusty
ArrayList
s and HashMap
s. (This is not to say that Collections
will never be replaced; clearly, there are limitations beyond simply
not being designed for lambdas. A more modern collections framework
may be considered for a future version of the JDK.)
A key driver for this work is making parallelism more accessible to developers. While the Java platform provides strong support for concurrency and parallelism already, developers face unnecessary impediments in migrating their code from sequential to parallel as needed. Therefore, it is important to encourage idioms that are both sequential- and parallel-friendly. This is facilitated by shifting the focus towards describing what computation should be performed, rather than how it should be performed. It is also important to strike the balance between making parallelism easier but not going so far as to make it invisible; our goal is explicit but unobtrusive parallelism. (Making parallelism transparent would introduce nondeterminism and the possibility of data races where users might not expect it.)
Internal vs. External Iteration
The collections framework relies on the concept of external
iteration, where a Collection
provides, by implementing Iterable
,
a means to enumerate its elements, and clients use this to step
sequentially through the elements of a collection. For example, if we
wanted to set the color of every shape in a collection of shapes to
red, we would write:
for (Shape s : shapes) { s.setColor(RED); }
This example illustrates external iteration; the for-each
loop calls
the iterator()
method of shapes
, and steps through the collection
one by one. External iteration is straightforward enough, but it has
several problems:
- Java's
for-each
loop is inherently sequential, and must process the elements in the order specified by the collection. - It deprives the
library
method of the opportunity to manage the control flow, which might be able to provide better performance by exploiting reordering of the data, parallelism, short-circuiting, or laziness.
Sometimes the strong guarantees of the for-each
loop (sequential,
in-order) are desirable, but often are just an impediment to
performance. The alternative to external iteration is internal iteration, where
instead of controlling the iteration, the client delegates that to the
library and passes in snippets of code to execute at various points in
the computation.
The internal-iteration equivalent of the previous example is:
shapes.forEach(s -> s.setColor(RED));
This may appear to be a small syntactic change, but the difference is
significant. The control of the operation has been handed off from
the client code to the library code, allowing the libraries not only
to abstract over common control flow operations, but also enabling
them to potentially use laziness, parallelism, and out-of-order
execution to improve performance. (Whether an implementation of
forEach
actually does any of these things is for the
implementation to determine; but with internal iteration, they
are at least possible, whereas with external iteration, they are not.)
Where external iteration mixes the what (color the shapes red) and
the how (get an Iterator
and iterate it sequentially), internal
iteration lets the client dictate the what but lets the library
control the how. This offers several potential benefits: client code
can be clearer because it need only focus on stating the problem, not
the details of how to go about solving it, and we can move complex
optimization code into libraries where it can benefit all users.
Streams
The key new library abstraction introduced in Java SE 8 is a stream,
defined in package java.util.stream
. (There are
several stream types; Stream<T>
represents a
stream of object references, and there are specializations such as
IntStream
to describe streams of primitives.)
A stream represents a sequence of values, and exposes a set of
aggregate operations that allow us to express common manipulations on
those values easily and clearly. The libraries provide convenient
ways to obtain stream views of collections, arrays, and other data
sources.
Stream operations are chained together into pipelines. For example, if we wanted to color only the blue shapes red, we could say:
shapes.stream() .filter(s -> s.getColor() == BLUE) .forEach(s -> s.setColor(RED));
The stream()
method on Collection
produces a stream view of the
elements of that collection; the filter()
operation then produces a
stream containing the shapes that are blue, and these elements are
then made red by forEach()
.
If we wanted to collect the blue shapes into a new List
, we could
say:
List<Shape> blue = shapes.stream() .filter(s -> s.getColor() == BLUE) .collect(Collectors.toList());
The collect()
operation collects the input elements into an
aggregate (such as a List
) or a summary description; the argument to
collect()
is a recipe for how to do this aggregation. In this case,
we use toList()
, which is a simple recipe that accumulates the
elements into a List
. (More detail on collect()
will be discussed shortly.)
If each shape were contained in a Box
, and we wanted to know which
boxes contained at least one blue shape, we could say:
Set<Box> hasBlueShape = shapes.stream() .filter(s -> s.getColor() == BLUE) .map(s -> s.getContainingBox()) .collect(Collectors.toSet());
The map()
operation produces a stream whose values are the result of
applying a mapping function (here, one that takes a shape and returns
its containing box) to each element in its input stream.
If we wanted to add up the total weight of the blue shapes, we could express that as:
int sum = shapes.stream() .filter(s -> s.getColor() == BLUE) .mapToInt(s -> s.getWeight()) .sum();
So far, we haven't provided much detail about the specific signatures
of the Stream
operations shown; these examples simply illustrate the
types of problems that the streams framework is designed to address.
Streams vs. Collections
Collections and streams, while bearing some superficial similarities, have different goals. Collections are primarily concerned with the efficient management of, and access to, their elements. By contrast, streams do not provide a means to directly access or manipulate their elements, and are instead concerned with declaratively describing the computational operations which will be performed in aggregate on that source. Accordingly, streams differ from collections in several ways:
- No storage. Streams don't have storage for values; they carry values from a source (which could be a data structure, a generating function, an I/O channel, etc) through a pipeline of computational steps.
- Functional in nature. Operations on a stream produce a result, but do not modify its underlying data source.
- Laziness-seeking. Many stream operations, such as filtering, mapping, sorting, or duplicate removal) can be implemented lazily. This facilitates efficient single-pass execution of entire pipelines, as well as facilitating efficient implementation of short-circuiting operations.
- Bounds optional. There are many problems that are sensible to
express as infinite streams, letting clients consume values until
they are satisfied. (If we were enumerating perfect numbers, it is
easy to express this as a filtering operation on the stream of all
integers.) While a
Collection
is constrained to be finite, a stream is not. (To terminate in finite time, a stream pipeline with an infinite source can use short-circuiting operations; alternately, you can request anIterator
from aStream
and traverse it manually.)
As an API, Streams
is completely independent from Collections
. While
it is easy to use a collection as the source for a stream
(Collection
has stream()
and parallelStream()
methods) or to
dump the elements of a stream into a collection (using the collect()
operation as shown earlier), aggregates other than Collection
can be
used as sources for streams as well. Many JDK classes, such as
BufferedReader
, Random
, and BitSet
, have been retrofitted to act
as sources for streams, and Arrays.stream()
provides stream view of
arrays. In fact, anything that can be described with an Iterator
can be used as a stream source, though if more information is
available (such as size or metadata about stream contents like
"sortedness"), the library can provide an optimized execution.
Laziness
Operations such as filtering or mapping can be performed eagerly
(where the filtering is performed on all elements before the filter
method returns), or lazily (where the Stream
representing the
filtered result only applies the filter to elements from its source as
needed.) Performing computations lazily, where practical, can be
beneficial. For example, if we perform filtering lazily, we can fuse
the filtering with other operations later in the pipeline, so as not
to require multiple passes on the data. Similarly, if we are
searching a large collection for the first element that matches a
given criteria, we can stop once we find one, rather than processing
the entire collection. (This is especially important for infinite
sources; laziness is merely an optimization for finite
sources, it makes operations on infinite sources possible, whereas an
eager approach would never terminate.)
Operations such as filtering and mapping can be thought of as naturally
lazy, whether or not they are implemented as such. On the other hand,
value-producing operations such as sum()
, or side-effect-producing
operations such as forEach()
, are "naturally eager," because they
must produce a concrete result.
In a pipeline such as:
int sum = shapes.stream() .filter(s -> s.getColor() == BLUE) .mapToInt(s -> s.getWeight()) .sum();
the filtering and mapping operations are lazy. This means that we
don't start drawing elements from the source until we start the sum
operation, and when we do perform the sum
operation, we fuse
filtering, mapping, and addition into a single pass on the data. This
minimizes the bookkeeping costs required to manage intermediate
elements.
Many loops can be restated as aggregate operations drawing from a data
source (array, collection, generator function, I/O channel), doing a
series of lazy operations (filtering, mapping, etc), and then doing a
single eager operation (forEach
, toArray
, collect
, etc), giving us filter-map-accumulate, filter-map-sort-foreach, etc. The naturally
lazy operations tend to be used to compute temporary intermediate
results, and we exploit this property in our API design. Rather than
have the filter
and map
return a collection, we instead have them
return a new stream. In the Streams API, operations that return a
stream are lazy, and operations that return a non-stream result (or
return no result, such as forEach()
) are eager. In most cases where
potentially lazy operations are being applied to aggregates, this
turns out to be exactly what we want each stage takes a stream of
input values, performs some transformation on it, and passes the
values to the next stage in the pipeline.