To help visualize the filling-out process more concretely, let's check how mainstream hardware has progressed on this chart. The easiest place to start is with the long-standing mainstream CPU and the more recent GPU:
- From the 1970s to the 2000s, CPUs started with simple single cores and then moved downward as the pendulum swung to increasingly complex cores. They hugged the left side of the chart by staying single-core as long as possible, but in 2005 they ran out of room and turned toward multicore NUMA cache architectures; see Figure 8.
- Meanwhile, in the late 2000s, mainstream GPUs started to be capable of handling computational workloads. But because they started life in an add-on discrete GPU card format where graphics-specific cores and memory were physically located away from the CPU and system RAM, they started further upward and to the right (Specialized / Disjoint (local)). GPUs have been moving leftward to increasingly unified views of memory, and slightly downward to try to support full mainstream languages (such as adding exception handling support).
- Today's typical mainstream computer includes both a CPU and a discrete or integrated GPU. The dotted line in the graphic denotes cores that are available to a single application because they are in the same device, but not on the same chip.
Now we are seeing a trend to use CPU and specialized (currently GPU) cores with very tightly coupled memory, and even on the same die:
- In 2005, the Xbox 360 sported a multicore CPU and a GPU that could not only directly access the same RAM, but could even share L2 cache, a very unusual feature.
- In 2006 and 2007, the Cell-based PS3 console sported a single processor combining one general-purpose core with eight special-purpose SPU cores. The solid line in Figure 9 denotes cores that are on the same chip, not just in the same device.
- In June 2011 and November 2011, respectively, AMD and NVIDIA launched the Fusion and Tegra 3 architectures, multicore CPU chips that sported a compute-class GPU (hence extending vertically) on the same die (hence well to the left).
- Intel has also shipped the Sandy Bridge line of processors, which includes an integrated GPU that is not yet as compute-capable but continues to grow. Intel's main focus has been the MIC effort of more than 50 simple, general-purpose x86-like cores on the same die, expected to be commercially available in the near future.
Finally, we complete the picture with cloud HaaS; see Figure 10:
- In 2008 and 2009, Amazon, Microsoft, Google, and other vendors began rolling out their cloud compute offerings. AWS, Azure, and GAE support an elastic cloud of nodes, each of which is a traditional computer with one or more CPU cores ("big-core" and loosely coupled, therefore in the bottom-right corner of the chart, with each node's cores in the two lower-left boxes). As before, the dotted line denotes that all of the cores are available to a single application; the network is just another bus to more compute cores.
- Since November 2010, AWS also supports compute instances that contain both CPU cores and GPU cores, indicated by the H-shaped virtual machine: the application runs on a cloud of loosely coupled nodes with disjoint memory (right column), each of which contains both CPU and GPU cores (currently not on the same die, so the vertical lines are still dotted).
Putting it all together, we get a noisy profusion of life and color as in Figure 11. This may look like a confused mess, so let's note two things that help make sense of it.
First, every box has a workload that it's best at, but some boxes (and particularly some columns) are more popular than others. Two columns are notably less interesting:
- Fully unified memory models apply only to single-core processors, which are essentially being abandoned in the mainstream.
- Incoherent/weak memory models are a performance experiment that is failing in the marketplace. On the hardware side, the theoretical performance benefits of letting caches work less synchronously have already been largely duplicated in other ways by mainstream processors with stronger memory models. On the software side, all of the mainstream general-purpose languages and environments (C, C++, Java, .NET) have largely rejected weak memory models, and require a coherent model technically known as "sequential consistency for data-race-free programs" as either their only supported memory model (Java, .NET) or their default memory model (ISO C++11, ISO C11). Nobody is moving toward the middle incoherent/weak-memory strip of the chart; at best they're moving through it to get to the other side, but nobody wants to stay there.
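The discipline those languages settled on can be sketched concretely: if every access to shared data is synchronized, the program's behavior is always some simple interleaving of its threads. Here is a minimal sketch in Python (chosen for brevity; the SC-DRF guarantee itself is a C++/Java/.NET language rule, and `SafeCounter` is my own illustrative name, not from any standard):

```python
import threading

# A data-race-free counter: every access to the shared value goes through
# one lock, so the outcome is a simple interleaving of the threads -- the
# "sequential consistency for data-race-free programs" discipline, sketched
# in Python rather than C++/Java for brevity.
class SafeCounter:
    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def increment(self):
        with self._lock:      # all shared access is synchronized...
            self._value += 1  # ...so there is no data race, and no surprises

    @property
    def value(self):
        with self._lock:
            return self._value

def run_demo(iterations=100_000):
    counter = SafeCounter()

    def worker():
        for _ in range(iterations):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value  # always exactly 2 * iterations

if __name__ == "__main__":
    print(run_demo())  # 200000
```

Remove the lock and the outcome is no longer guaranteed: the racy read-modify-write can lose updates, which is exactly the kind of surprise the data-race-free rule exists to forbid.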
But all other boxes, including all rows (kinds of processors), continue to be strongly represented, and we can see why: different parts of even the same application naturally want to run on different kinds of cores.
Second, let's clarify the picture by highlighting and labeling the two regions that hardware is migrating toward in Figure 12:
Here again we see the first and fourth columns being de-emphasized, as hardware trends have gradually coalesced around two major areas. Both areas extend vertically across all kinds of cores; the most important thing to note is that they represent two mines, and the area on the left is the Moore's Law mine.
- Mine #1: "Scale in" = Moore's Law. Local machines will continue to use large numbers of heterogeneous local cores, either in-box (such as CPU with discrete GPU) or on-die (Sandy Bridge, Fusion, Tegra 3). We'll see core counts increase until Moore's Law ends, and then core counts for individual local devices will stabilize.
- Mine #2: "Scale out" = distributed cloud. Much more importantly, we will continue to see a cornucopia of cores delivered via compute clouds, either on-premises (e.g., cluster, private cloud) or in public clouds. This is a brand new mine directly enabled by the lower coupling of disjoint memory, especially loosely coupled distributed nodes.
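The "scale out" model in Mine #2 can be sketched as a toy: each node owns a disjoint slice of the data, all real work happens in that node's private memory, and only small partial results ever cross the network. A minimal single-process simulation (the node functions below are illustrative stand-ins, not a real cluster API):

```python
# Toy "scale-out" sketch: each simulated node owns a disjoint slice of the
# data (no shared memory), computes a partial result locally, and only the
# small partial results travel back across the "network" to be combined.
def partition(data, num_nodes):
    """Split data into at most num_nodes disjoint chunks, one per node."""
    chunk = (len(data) + num_nodes - 1) // num_nodes
    return [data[i:i + chunk] for i in range(0, len(data), chunk)]

def node_work(local_data):
    """Work done entirely inside one node's private memory."""
    return sum(x * x for x in local_data)

def scale_out_sum_of_squares(data, num_nodes=4):
    partials = [node_work(part) for part in partition(data, num_nodes)]
    return sum(partials)  # combining partials is the only cross-node traffic

if __name__ == "__main__":
    print(scale_out_sum_of_squares(list(range(1000))))
```

Because the nodes share nothing, adding more of them is limited by data distribution and the final combine step, not by memory coherence, which is what lets this mine keep growing after the on-die one tops out.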
The good news is that we can heave a sigh of relief at having found another mine to open. The even better news is that the new mine has a far faster growth rate than even Moore's Law. Notice the slopes of the lines when we graph the amount of parallelism available to a single application running on various architectures; see Figure 13. The bottom three lines are mining Moore's Law for "scale-in" growth, and their common slope reflects Moore's wonderful exponent, just shifted upward or downward to account for how many cores of a given size can be packed onto the same die. The top two lines are mining the cloud (with CPUs and GPUs, respectively) for "scale-out" growth, and their slope is steeper still.
If hardware designers merely use Moore's Law to deliver more big fat cores, on-device hardware parallelism will stay in double digits for the next decade, which is very roughly when Moore's Law is due to sputter, give or take about a half decade. If hardware follows Niagara's and MIC's lead and goes back to simpler cores, we'll see a one-time jump and then stay in triple digits. If we all learn to leverage GPUs, we already have 1,500-way parallelism in modern graphics cards (I'll say "cores" for convenience, though that word means something a little different on GPUs) and will likely reach five digits within the decade.
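The GPU projection in that paragraph is simple arithmetic: if parallelism doubles roughly every two years along with Moore's Law (the usual rough assumption, not a precise figure), today's 1,500-way graphics cards reach five digits within a decade:

```python
# Rough projection of on-die parallelism: the 1,500-way starting point comes
# from the text; the two-year doubling period is the usual Moore's Law
# rule of thumb, assumed here for illustration.
def projected_parallelism(start, years, doubling_period=2):
    """Parallelism after `years` if it doubles every `doubling_period` years."""
    return start * 2 ** (years // doubling_period)

if __name__ == "__main__":
    # 1,500-way GPU parallelism today, five doublings over a decade:
    print(projected_parallelism(1500, 10))  # 48000 -- five digits
```

The same doubling applied to a handful of big fat cores buys far less absolute parallelism, which is why the lines in Figure 13 share a slope but sit at very different heights.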
But all of that is eclipsed by the scalability of the cloud, whose growth line is already steeper than Moore's Law because we're better at quickly deploying and using cost-effective networked machines than we've been at quickly jam-packing and harnessing cost-effective transistors. It's hard to get data on the current largest cloud deployments because many projects are private, but the largest documented public cloud apps (which don't use GPUs) are already harnessing over 30,000 cores for a single computation. I wouldn't be surprised if some projects are exceeding 100,000 cores today. And that's general-purpose cores; if you add GPU-capable nodes to the mix, add two more zeroes.
Such massive parallelism, already available at rates of under $1,300/hour for a 30,000-core cloud, is game-changing. If you doubt that, here is a boring example that doesn't involve advanced augmented reality or spook-level technomancery: How long will it take someone who has stolen a strong password file (which we'll assume is correctly hashed and salted and contains no dictionary passwords) to retrieve 90% of the passwords by brute force using a publicly available GPU-enabled compute cloud? Hint: An AWS dual-Tesla node can test on the order of 20 billion passwords per second, and clouds of 30,000 nodes are publicly documented (of course, Amazon won't say whether it has that many GPU-enabled nodes for hire; but if it doesn't now, it will soon). To borrow a tired misquote, 640 trillion affordable attempts per second should be enough for anyone. But if that's not enough for you, not to worry; just wait a few years and it'll be 640 quadrillion affordable attempts per second.
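That closing arithmetic is easy to check. Using the figures quoted above (about 20 billion attempts per second per node, 30,000 nodes), the cloud's aggregate rate is roughly 6×10^14 attempts per second (rounded up to 640 trillion for the sake of the joke), and even an 8-character password drawn from all 94 printable ASCII symbols, an illustrative assumption on my part, falls in seconds:

```python
# Back-of-the-envelope for the brute-force example. The per-node rate and
# node count are the figures quoted in the text; the 94-symbol alphabet
# (printable ASCII) and 8-character length are illustrative assumptions.
ATTEMPTS_PER_NODE_PER_SEC = 20_000_000_000  # ~20 billion/sec per GPU node
NODES = 30_000                              # documented public cloud scale

def cloud_rate():
    """Aggregate attempts per second across the whole cloud."""
    return ATTEMPTS_PER_NODE_PER_SEC * NODES

def seconds_to_exhaust(alphabet_size=94, length=8):
    """Worst-case time to try every password of the given shape."""
    return alphabet_size ** length / cloud_rate()

if __name__ == "__main__":
    print(f"{cloud_rate():.2e} attempts/sec")     # ~6e14: ~600 trillion
    print(f"{seconds_to_exhaust():.1f} seconds")  # ~10 seconds for 94^8
```

Real attackers don't even need the worst case; 90% of the space falls proportionally faster, which is why only password length, not cleverness of symbols, moves the needle against this kind of adversary.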