In the twilight of Moore's Law, the transitions to multicore processors, GPU computing, and hardware or infrastructure as a service (HaaS) cloud computing are not separate trends, but aspects of a single trend: mainstream computers from desktops to "smartphones" are being permanently transformed into heterogeneous supercomputer clusters. Henceforth, a single compute-intensive application will need to harness different kinds of cores, in immense numbers, to get its job done.
The free lunch is over. Now welcome to the hardware jungle.
From 1975 to 2005, our industry accomplished a phenomenal mission: In 30 years, we put a personal computer on every desk, in every home, and in every pocket.
In 2005, however, mainstream computing hit a wall. In "The Free Lunch Is Over (A Fundamental Turn Toward Concurrency in Software)," I described the reasons for the then-upcoming industry transition from single-core to multicore CPUs in mainstream machines, why it would require changes throughout the software stack from operating systems to languages to tools, and why it would permanently affect the way we as software developers have to write our code if we want our applications to continue exploiting Moore's transistor dividend.
In 2005, our industry undertook a new mission: to put a personal parallel supercomputer on every desk, in every home, and in every pocket. 2011 was special: it's the year that we completed the transition to parallel computing in all mainstream form factors, with the arrival of multicore tablets (such as iPad 2, Playbook, Kindle Fire, Nook Tablet) and smartphones (for example, Galaxy S II, Droid X2, iPhone 4S). 2012 will see us continue to build out multicore with mainstream quad- and eight-core tablets (as Windows 8 brings a modern tablet experience to x86 as well as ARM), and the last single-core gaming console holdout will go multicore (as Nintendo's Wii U replaces the Wii).
This time, it took us just six years to deliver mainstream parallel computing in all popular form factors. And we know the transition to multicore is permanent, because multicore delivers compute performance that single-core cannot, and there will always be mainstream applications that run better on a multicore machine. There's no going back.
For the first time in the history of computing, mainstream hardware is no longer a single-processor von Neumann machine, and never will be again.
That was the first act.
It turns out that multicore is just the first of three related permanent transitions that layer on and amplify each other, as the timeline in Figure 1 illustrates.
- Multicore (2005-). As explained previously.
- Heterogeneous cores (2009-). A single computer already typically includes more than one kind of processor core, as mainstream notebooks, consoles, and tablets all increasingly have both CPUs and compute-capable GPUs. The open question in the industry today is not whether a single application will be spread across different kinds of cores, but only "how different" the cores should be. That is, should they be basically the same, with similar instruction sets but in a mix of a few big cores that are best at sequential code plus many smaller cores that are best at running parallel code (the Intel MIC model slated to arrive in 2012-2013, which is easier to program)? Or should they be cores with different capabilities that may only support subsets of general-purpose languages like C and C++ (the current Cell and GPGPU model, which requires more complexity, including language extensions and subsets)?
Heterogeneity amplifies the first trend (multicore), because if some of the cores are smaller, then we can fit more of them on the same chip. Indeed, 100x and 1,000x parallelism is already available today on many mainstream home machines for programs that can harness the GPU.
We know the transition to heterogeneous cores is permanent because different kinds of computations naturally run faster and/or use less power on different kinds of cores and different parts of the same application will run faster and/or cooler on a machine with several different kinds of cores.
- Elastic compute cloud cores (2010-). For our purposes, "cloud" means specifically HaaS delivering access to more computational hardware as an extension of the mainstream machine. This trend started to hit the mainstream with commercial compute cloud offerings from Amazon Web Services (AWS), Microsoft Azure, Google App Engine (GAE), and others.
Cloud HaaS again amplifies both of the first two trends, because it's fundamentally about deploying large numbers of nodes, where each node is a mainstream machine containing multiple and heterogeneous cores. In the cloud, the number of cores available to a single application is scaling fast (in mid-2011, Cycle Computing delivered a 30,000-core cloud for under $1,300/hour using AWS), and the same heterogeneous cores are available in compute nodes (e.g., AWS already offers "Cluster GPU" nodes with dual NVIDIA Tesla M2050 GPU cards, enabling massively parallel and massively distributed CUDA applications).
In short, parallelism is not just in full bloom, but increasingly in full variety.
This article will develop four key points:
- Moore's End. We can observe clear evidence that Moore's Law is ending because we can point to a pattern that precedes the end of exploiting any kind of resource. But there's no reason to panic, because Moore's Law limits only one kind of scaling, and we have already started another kind.
- Mapping one trend, not three. Multicore, heterogeneous cores, and HaaS cloud computing are not three separate trends, but aspects of a single trend: putting a personal heterogeneous supercomputer cluster on every desk, in every home, and in every pocket.
- The effect on software development. As software developers, we will be expected to enable a single application to exploit a jungle of enormous numbers of cores that are increasingly different in kind (specialized for different tasks) and different in location (from local to very remote; on-die, in-box, on-premises, in-cloud). The jungle of heterogeneity will continue to spur deep and fast evolution of mainstream software development, but we can predict what some of the changes will be.
- Three distinct near-term stages of Moore's End. And why "smartphones" aren't, really.
Let's begin with the end of Moore's Law.
Mining Moore's Law
We've been hearing breathless "Moore's Law is ending" announcements for years. That Moore's Law would end was never news; every exponential progression must. Although it didn't end when some prognosticators expected, its end is possible to forecast: we just have to know what to look for, and that is diminishing returns.
A key observation is that exploiting Moore's Law is like exploiting a gold mine or any other kind of resource. Exploiting a gold ore deposit never just stops abruptly; rather, running a mine goes through phases of increasing costs and diminishing returns until finally the gold that's left in that patch of ground is no longer commercially exploitable and operating the mine is no longer profitable.
Mining Moore's Law has followed the same pattern. Let's consider its three major phases, where we are now in transition from Phase II to Phase III. And throughout this discussion, never forget that the only reason Moore's Law is interesting at all is because we can transform its raw resource (more transistors) into a useful form (either greater computational throughput or lower cost).
Phase I, Moore's Motherlode = Unicore "Free Lunch" (1975-2005)
When you first find an ore deposit and open a mine, you focus your efforts on the motherlode, where everybody gets to enjoy a high yield and a low cost per pound of gold extracted.
For 30 years, mainstream processors mined Moore's motherlode by using their growing transistor budgets to make a single core more and more complex so that it could execute a single thread faster. This was wonderful because it meant the performance was easily exploitable: compute-bound software would get faster with relatively little effort. Mining this motherlode in mainstream microprocessors went through two main subphases as the pendulum swung from simpler to increasingly complex cores:
- In the 1970s and 1980s, each chip generation could use most of the extra transistors to add One Big Feature (for example, on-die floating point unit, pipelining, out of order execution) that would make single-threaded code run faster.
- In the 1990s and 2000s, each chip generation started using the extra transistors to add or improve two or three smaller features that would make single-threaded code run faster, and then five or six smaller features, and so on.
Figure 2 shows how the pendulum swung toward increasingly complex single cores, with three sample chips: the 80286, 80486, and Pentium Extreme Edition 840. Note that the chips' boxes are to scale by number of transistors.
By 2005, the pendulum had swung about as far as it could go toward the complex single-core model. The motherlode has been mostly exhausted; we're still scraping ore off its walls in the form of some continued improvement in single-threaded code performance, but no longer at the historically delightful exponential rate.
Phase II, Secondary Veins = Homogeneous Multicore (2005-)
As a motherlode gets used up, miners concentrate on secondary veins that are still profitable but have a more moderate yield and higher cost per pound of extracted gold. So when Moore's unicore motherlode started getting mined out, we turned to mining Moore's secondary veins using the additional transistors to make more cores per chip. Multicore let us continue to deliver exponentially increasing compute throughput in mainstream computers, but in a form that was less easily exploitable because it placed a greater burden on software developers who had to write parallel programs that could use the hardware.
Moving into Phase II took a lot of work in the software world. We've had to learn to write "new free lunch" applications: ones that have lots of latent parallelism and so can once again ride the wave to run the same executable faster on next year's hardware, hardware that still delivers exponential performance gains but primarily in the form of additional cores. And we're mostly there: we have parallel runtimes and libraries like Intel Threading Building Blocks (TBB) and Microsoft Parallel Patterns Library (PPL), parallel debuggers and parallel profilers, and updated operating systems to run them all.
But this time the phase didn't last 30 years. We barely have time to catch our breath, because Phase III is already beginning.
Phase III, Tertiary Veins = Heterogeneous Cores (2011-)
As our miners are forced to move into smaller and smaller veins, yields diminish and costs rise. The miners are turning to Moore's tertiary veins: using Moore's extra transistors to make not just more cores, but also different kinds of cores, and in very large numbers, because the different cores are often smaller, swinging the pendulum back toward the simpler-core end of the spectrum.
There are two main categories of heterogeneity; see Figure 3.
- Big/fast vs. small/slow cores. The smallest amount of heterogeneity is when all the cores are general-purpose cores with the same instruction set, but some cores are beefier than others because they contain more hardware to accelerate execution (notably by hiding memory latency using various forms of internal concurrency). In this model, some cores are big complex ones that are optimized to run the sequential parts of a program really fast, while others are smaller cores that are optimized to get better total throughput for the scalably parallel parts of the program. However, even though they use the same instruction set, the compiler will often want to generate different code; this difference can become visible to the programmer if the programming language must expose ways to control code generation. This is Intel's approach with Xeon (big/fast) and MIC (small/slow), which both run approximately the same x86 instruction set.
- General vs. specialized cores. Beyond that, we see systems with multiple cores having different capabilities, including that some cores may not be able to support all of a mainstream language like C or C++. In 2006-2007, with the arrival of the PlayStation 3, the IBM Cell processor led the way by incorporating different kinds of cores on the same chip, with a single general-purpose core assisted by eight special-purpose SPU cores. Since 2009, we have begun to see mainstream use of GPUs to perform computation instead of just graphics. Specialized cores like SPUs and GPUs are attractive when they can run certain kinds of code more efficiently, both faster and more cheaply, which is a great bargain if your workload fits them.
GPGPU is especially interesting because we already have an underutilized installed base: A significant percentage of existing mainstream machines already have compute-capable GPUs just waiting to be exploited. With the June 2011 introduction of AMD Fusion and the November 2011 launch of NVIDIA Tegra 3, systems with CPU and GPU cores on the same chip are becoming the new norm. That installed base is a big carrot, and creates an enormous incentive for compute-intensive mainstream applications to leverage that patiently waiting hardware. To date, a few early adopters have been using technologies like CUDA, OpenCL, and more recently C++ AMP to harness GPUs for computation. Mainstream application developers who care about performance need to learn to do the same; see Table 1.
But that's pretty much it: we currently know of no other major ways to exploit Moore's Law for compute performance, and once these veins are exhausted, it will be largely mined out.
We're still actively mining for now, but the writing is on the wall, and it reads "mene, mene": diminishing returns tell us that we've entered the endgame.