What It Means For Us: A Programmer's View
How will all of this change the way we write our software, if we care about harnessing mainstream hardware performance? The basic conclusions echo and expand upon ones that I proposed in "The Free Lunch is Over":
- Applications will need to be at least massively parallel, and ideally able to use non-local cores and heterogeneous cores, if they want to fully exploit the long-term continued exponential growth in compute throughput being delivered both in-box and in-cloud. After all, soon the vast majority of compute cores available to a mainstream application will be non-local.
- Efficiency and performance optimization will get more, not less, important. We're being asked to do more (new experiences like sensor-based UIs and augmented reality) with less hardware (constrained mobile form factors and the eventual plateauing of scale-in when Moore's Law ends). In December 2004 I wrote: "Those languages that already lend themselves to heavy optimization will find new life; those that don't will need to find ways to compete and become more efficient and optimizable. Expect long-term increased demand for performance-oriented languages and systems." This is still true; witness the resurgence of interest in C++ in 2011 and onward, primarily because of its expressive flexibility and performance efficiency. A program that is twice as efficient has two advantages:
  - It will be able to run twice as well on a local disconnected device, especially when Moore's Law can no longer deliver local performance improvements in any form.
  - It will always be able to run at half the power and cost on an elastic compute cloud, even as such clouds continue to expand for the indefinite future.
- Programming languages and systems will increasingly be forced to deal with heterogeneous distributed parallelism. As previously predicted, just basic homogeneous multicore has proved to be a far bigger event for languages than even object-oriented programming was, because some languages (notably C) could get away with ignoring objects while still remaining commercially relevant for mainstream software development. No mainstream language, including the just-ratified C11 standard, could ignore basic concurrency and parallelism and stay relevant in even a homogeneous-multicore world. Now expect all mainstream languages and environments, including their standard libraries, to develop explicit support for at least distributed parallelism and probably also heterogeneous parallelism; they cannot hope to avoid it without becoming marginalized for mainstream app development.
Expanding on that last bullet, what are some basic elements we will need to add to mainstream programming models (think: C, C++, Java, and .NET)? Here are a few basics that I think will be unavoidable and that must be supported explicitly in one form or another.
- Deal with the processor axis' lower section by supporting compute cores with different performance (big/fast, slow/small). At minimum, mainstream operating systems and runtimes will need to be aware that some cores are faster than others, and know which parts of an application want to run on which of those cores.
- Deal with the processor axis' upper section by supporting language subsets, to allow for cores with different capabilities, including cores that do not fully support all mainstream language features. In the next decade, a mainstream operating system (on its own, or augmented with an extra runtime like the Java/.NET VM or the ConcRT runtime underpinning PPL) will be capable of managing cores with different instruction sets and running a single application across many of those cores. Programming languages and tools will be extended to let the developer express code that is restricted to use just a subset of a mainstream programming language (e.g., the restrict() qualifiers in C++ AMP). I am optimistic that, for most mainstream languages, such a single language extension will be sufficient while leveraging existing language rules for overloading and dispatch, and thus minimizing the impact on developers.
- Deal with the memory axis for computation, by providing distributed algorithms that can scale not just locally but also across a compute cloud. Libraries and runtimes like OpenCL and TBB and PPL will be extended or duplicated to enable writing loops and other algorithms that run on large numbers of local and non-local parallel cores. Today we can write a parallel_for_each call that can run with 1,000x parallelism on a set of local discrete GPUs and ship the right data shards to the right compute cards and the results back; tomorrow we need to be able to write that same call that can run with 1,000,000,000x parallelism on a set of cloud-based GPUs and ship the right data shards to the right nodes and the results back. This is a "baby step" example in that it just uses local data (e.g., that can fit in a single machine's memory), but distributed computation; the data subsets are simply copied hub-and-spoke.
- Deal with the memory axis for data, by providing distributed data containers, which can be spread across many nodes. The next step is for the data itself to be larger than any node's memory, and (preferably automatically) move the right data subsets to the right nodes of a distributed computation. For example, we need containers like a distributed_table that can be backed by multiple and/or redundant cloud storage, and then make those the target of the same distributed parallel_for_each call. After all, why shouldn't we write a single parallel_for_each call that efficiently updates a 100 petabyte table? Hadoop enables this today for specific workloads and with extra work; this will become a standard capability available out-of-the-box in mainstream language compilers and their standard libraries.
- Enable a unified programming model that can handle the entire chart with the same source code. Since we can map the hardware on a single chart with two degrees of freedom, the landscape is unified enough that it should be able to be served by a single programming model in the future. Any solution will have at least two basic characteristics: First, it will cover the Processor axis by letting the programmer express language subsets in a way integrated holistically into the language. Second, it will cover or hide the Memory axis by abstracting the location of data, and copying data subsets on demand by default, while also providing a way to take control of the copying for advanced users who want to optimize the performance of a specific computation.
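To make the last two ideas a little more concrete, here is a minimal sketch in standard C++ of what a sharded container and a matching algorithm might look like from the caller's side. Everything in it is an assumption made for illustration: this distributed_table and this parallel_for_each are hypothetical stand-ins, not the PPL or C++ AMP functions of the same name, and the "nodes" are simulated with local threads through std::async rather than real machines. It only shows the shape of the abstraction, in which the caller writes one call and the library decides where each shard's work runs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical container: each shard stands in for the slice of a table
// held by one node. In this sketch all shards live in a single process.
template <typename T>
class distributed_table {
public:
    distributed_table(std::size_t rows, std::size_t nodes, const T& init = T{}) {
        assert(nodes > 0);
        shards_.resize(nodes);
        // Deal the rows out as evenly as possible across the nodes.
        for (std::size_t n = 0; n < nodes; ++n) {
            const std::size_t count = rows / nodes + (n < rows % nodes ? 1 : 0);
            shards_[n].assign(count, init);
        }
    }

    std::size_t node_count() const { return shards_.size(); }
    std::vector<T>& shard(std::size_t n) { return shards_[n]; }
    const std::vector<T>& shard(std::size_t n) const { return shards_[n]; }

    std::size_t size() const {
        std::size_t total = 0;
        for (const auto& s : shards_) total += s.size();
        return total;
    }

private:
    std::vector<std::vector<T>> shards_;
};

// Hypothetical algorithm: apply `f` to every row, one task per shard.
// A real distributed runtime would ship `f` to the node that owns the
// shard; std::async only simulates that with local threads.
template <typename T, typename F>
void parallel_for_each(distributed_table<T>& table, F f) {
    std::vector<std::future<void>> pending;
    for (std::size_t n = 0; n < table.node_count(); ++n) {
        std::vector<T>* shard = &table.shard(n);
        pending.push_back(std::async(std::launch::async, [shard, f] {
            std::for_each(shard->begin(), shard->end(), f);
        }));
    }
    for (auto& p : pending) p.get();  // wait for every simulated node
}

// Helper for the example: fold all shards into a single value.
template <typename T>
T sum(const distributed_table<T>& table) {
    T total{};
    for (std::size_t n = 0; n < table.node_count(); ++n)
        total = std::accumulate(table.shard(n).begin(), table.shard(n).end(), total);
    return total;
}
```

A caller would construct a table, for instance distributed_table<long> t(10, 3, 1), and then write parallel_for_each(t, [](long& x) { x *= 2; }) without mentioning shards or nodes at all. That is the property the bullets above ask for: data location is hidden by default, while the shard accessors leave a way for advanced users to take control.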
Perhaps our most difficult mental adjustment, however, will be to learn to think of the cloud as part of the mainstream machine: to view all these local and non-local cores as being equally part of the target machine that executes our application, where the network is just another bus that connects us to more cores. That is, in a few years we will write code for mainstream machines assuming that they have million-way parallelism, of which only thousand-way parallelism is guaranteed to always be available (for example, when out of WiFi range).
Five years from now, we want to be delivering apps that run well on an isolated device, and then just run faster or better when they are in WiFi range and have dynamic access to many more cores. The makers of our operating systems, runtimes, libraries, programming languages, and tools need to get us to a place where we can create compute-bound applications that run well in isolation on disconnected devices with 1,000-way local parallelism…and when the device is in WiFi range just run faster, handle much larger data sets, and/or light up with additional capabilities. We have a very small taste of that now with cloud-based apps like Shazam (which function only when online), but we still have a long way to go to realize this full vision.
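As a rough illustration of "run well in isolation, run faster when connected," the sketch below sums a vector using however many workers are reachable at the moment. The core_pool type and the parallel_sum function are hypothetical names invented for this example, and the remote cores are again only simulated by extra local threads. What matters is that the calling code and the result are identical whether the device is offline or online; only the degree of parallelism changes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical description of the cores an app can reach right now.
struct core_pool {
    std::size_t local_cores;   // always available, even when offline
    std::size_t remote_cores;  // zero when the device is out of range
    std::size_t usable() const { return local_cores + remote_cores; }
};

// Sum `data` using as many workers as the pool currently offers.
// Remote workers are simulated with local threads in this sketch.
long long parallel_sum(const std::vector<int>& data, const core_pool& pool) {
    assert(pool.local_cores > 0);  // the guaranteed baseline
    if (data.empty()) return 0;

    const std::size_t workers = std::min(pool.usable(), data.size());
    const std::size_t chunk = (data.size() + workers - 1) / workers;  // round up

    std::vector<std::future<long long>> parts;
    for (std::size_t begin = 0; begin < data.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, data.size());
        parts.push_back(std::async(std::launch::async, [&data, begin, end] {
            return std::accumulate(data.begin() + begin, data.begin() + end, 0LL);
        }));
    }

    long long total = 0;
    for (auto& p : parts) total += p.get();  // combine the partial sums
    return total;
}
```

Calling parallel_sum(data, core_pool{2, 0}) models the disconnected device, and parallel_sum(data, core_pool{2, 6}) models the same device in range of more cores. A production design would also have to weigh the cost of moving data over the network against the extra parallelism, which this sketch ignores.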