On The Charts: Not Three Trends, But One Trend
Next, let's put all of this in perspective by showing that multicore, hetero-core, and cloud-core are not three trends, but aspects of a single trend. To show that, we have to show that they can be plotted on the same map. Figure 4 shows an appropriate map that lets us chart where processor core architectures and memory architectures are going, and visualize just where we've been digging around in the mine so far.
First, I describe each axis, then map out past and current hardware to spot trends, and finally draw some conclusions about where hardware is likely to concentrate.
Processor Core Types
The vertical axis shows processor core architectures. As shown in Figure 5, from bottom to top, they form a continuum of increasing performance and scalability, but also of increasing restrictions on programs and programmers in the form of additional performance issues (yellow) or correctness issues (red) added at each step.
Complex cores are the "big" traditional ones, with the pendulum swung far to the right in the "habitable zone." These are best at running sequential code, including code limited by Amdahl's Law.
Simpler cores are the "small" traditional ones, toward the left of the "habitable zone." These are best at running parallelizable code that still requires the full expressivity of a mainstream programming language.
Specialized cores like those in GPUs, DSPs, and Cell's SPUs are more limited still, and often do not yet fully support all features of mainstream languages (such as exception handling). These are best for running highly parallelizable code that can be expressed in a subset of a language like C or C++. For example, Xbox Kinect skeletal tracking requires using both the CPU and the GPU cores on the console, and would be impossible otherwise.
The farther you move upward on the chart (to the right in the blown-up figure), the better the throughput and/or power efficiency you can achieve, but the more constrained the application code becomes: it has to be more parallel and/or use only subsets of a mainstream language.
Future mainstream hardware will likely contain all three basic kinds of cores, because many applications have all these kinds of code in the same program, and so naturally will run best on a heterogeneous computer that has all these kinds of cores. For example, most PS3 games, all Kinect games, and all CUDA/OpenCL/C++AMP applications available today could not run well or at all on a homogeneous machine, because they rely on running parts of the same application on the CPU(s) and other parts on specialized cores. Those applications are just the beginning.
The horizontal axis shows six common memory architectures. From left to right, they form a continuum of increasing performance and scalability, but (except for one important discontinuity) also increasing work for programs and programmers to deal with performance issues (yellow) or correctness issues (red). In Figure 6, triangles represent cache and lower boxes represent RAM. A processor core (ALU) sits at the top of each cache "peak."
Unified memory is tied to the unicore motherlode, and the memory hierarchy is wonderfully simple: a single mountain with a core sitting on top. This describes essentially all mainstream computers from the dawn of computing until the mid-2000s. This delivers a simple programming model: Every pointer (or object reference) can address every byte, and every byte is equally "far away" from the core. Even here, programmers need to be conscious of at least two basic cache effects: locality, or how well "hot" data fits into cache; and access order, because modern memory architectures love sequential access patterns (for more on this, see my Machine Architecture talk).
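Both cache effects show up even in single-threaded code. As a sketch, the two functions below compute the same sum over the same row-major matrix; the first walks memory sequentially and uses every fetched cache line fully, while the second strides across rows and touches a new cache line on almost every read once the matrix outgrows the cache (the function names are illustrative, not from the original text):

```cpp
#include <cstddef>
#include <vector>

// Sequential access: traverse the row-major matrix row by row, so each
// cache line fetched from memory is consumed completely before eviction.
long long sum_row_major(const std::vector<long long>& m, std::size_t n) {
    long long s = 0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            s += m[i * n + j];
    return s;
}

// Strided access: same data, same result, but traversing column by column
// jumps n elements at a time, defeating both locality and the prefetcher.
long long sum_col_major(const std::vector<long long>& m, std::size_t n) {
    long long s = 0;
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            s += m[i * n + j];
    return s;
}
```

Timing the two on a matrix larger than the last-level cache typically shows the strided version running several times slower, despite performing exactly the same arithmetic.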
NUMA cache retains a single chunk of RAM, but adds multiple caches. Now, instead of a single mountain, we have a mountain range with multiple peaks, each with a core on top. This describes today's mainstream multicore devices. Here, we still enjoy a single address space and pretty good performance as long as different cores access different memory, but programmers now have to deal with two main additional performance effects:
- locality matters in new ways because some peaks are closer to each other than others (two cores that share an L2 cache vs. two cores that share only L3 or RAM), and
- layout matters because we have to keep data physically close together if it's used together (on the same cache line) and apart if it's not (to avoid the ping-pong game of false sharing).
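The layout point can be made concrete with a small sketch. Two threads each increment their own counter; when the counters are packed into adjacent bytes they share a cache line and every increment invalidates the other core's copy, while padding each counter to its own line removes the ping-pong. The struct and function names here are illustrative; 64 bytes is assumed as a common cache-line size:

```cpp
#include <atomic>
#include <thread>

// Falsely shared: the two counters sit on the same cache line, so writes
// by different threads bounce the line between cores even though the
// threads never touch each other's data.
struct Packed {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// Padded: alignas(64) gives each counter its own cache line (C++17 also
// offers std::hardware_destructive_interference_size as a portable hint).
struct Padded {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

// Hammer both counters from two threads; the results are identical for
// either layout, but the Packed version is typically several times slower.
template <class Counters>
void hammer(Counters& c, long iters) {
    std::thread t1([&] {
        for (long i = 0; i < iters; ++i)
            c.a.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&] {
        for (long i = 0; i < iters; ++i)
            c.b.fetch_add(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
}
```

Note that both layouts are correct; false sharing is purely a performance effect, which is exactly why it belongs on the "yellow" side of the chart.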
NUMA RAM further fragments memory into multiple physical chunks of RAM, but still exposes a single logical address space. Now the performance valleys between the cores get deeper because accessing RAM in a chunk not local to this core incurs a trip across the bus. Examples include bladed servers, symmetric multiprocessor (SMP) desktop computers with multiple sockets, and newer GPU architectures that provide a unified address space view of the CPU's and GPU's memory but leave some memory physically closer to the CPU and other memory closer to the GPU. Now we add another item to the menu of what a performance-conscious programmer needs to think about: copying. Just because we can form a pointer to anything doesn't mean we always should, if it means reaching across an expensive chasm on every access.
Incoherent and weak memory leaves memory unsynchronized by default, in the hope that allowing each core to have its own divergent view of the state of memory can make it run faster, at least until memory must inevitably be synchronized again. As of this writing, the only remaining mainstream CPUs with weak memory models are current PowerPC and ARM processors (popular despite their memory models rather than because of them; more on this below). This model still has the simplicity of a single address space, but now the programmer further has to take on the burden of synchronizing memory himself.
Disjoint (tightly coupled) memory bites the bullet and lets different cores see different memory, typically over a shared bus, while still running as a tightly coupled unit that has low latency and whose reliability is still evaluated as a single unit. Now the model turns into a tightly clustered group of mountainous islands, each with core-tipped mountains of cache overlooking square miles of memory, connected by bridges with a fleet of trucks expediting goods from point to point: bulk transfer operations, message queues, and similar. In the mainstream, we see this model used by 2009-2011 vintage GPUs whose on-board memory is not shared with the CPU or with each other. True, programmers no longer enjoy having a single address space and the ability to share pointers, but in exchange they have removed the entire set of programmer burdens accumulated so far and replaced them with a single new responsibility: copying data between islands of memory.
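The shape of that single responsibility can be sketched in a few lines. This is a toy model, not any real API: each "island" owns its own buffer, pointers into one island are meaningless in another, and the only way to share data is an explicit bulk copy across the "bridge". Real systems expose the same operation as calls like cudaMemcpy, clEnqueueWriteBuffer, or MPI_Send:

```cpp
#include <cstring>
#include <vector>

// Toy model of one island of disjoint memory: it owns its RAM outright,
// and nothing outside the island can address it directly.
struct Island {
    std::vector<float> ram;
};

// The one new burden this model imposes: explicit bulk transfer between
// islands. Note there is no aliasing, no false sharing, and no memory
// model to reason about across the copy; each side owns its data.
void bulk_copy(Island& dst, const Island& src) {
    dst.ram.resize(src.ram.size());
    std::memcpy(dst.ram.data(), src.ram.data(),
                src.ram.size() * sizeof(float));
}
```

The design trade is visible in the sketch: you give up shared pointers, and in return every cross-island interaction is a visible, schedulable, bulk operation rather than a hidden per-access cost.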
Disjoint (loosely coupled) memory is the cloud, where cores spread out of the box into different rooms, buildings, and datacenters. This moves the islands farther apart, and replaces the bus "bridges" with network "speedboats" and "tankers." In the mainstream, we see this model in HaaS cloud computing offerings; this is the commoditization of the compute cluster. Programmers now have to arrange to deal with two additional concerns, which often can be abstracted away by libraries and runtimes: reliability, as nodes can come and go; and latency, as the islands are farther apart.
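The reliability concern usually surfaces as retry logic around remote work. Here is a minimal sketch, assuming the runtime reports a failed or lost node as a `false` return from the call (the function name and signature are illustrative; real libraries hide this behind futures, task re-execution, or replicated state):

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Retry a remote call up to max_attempts times, doubling the wait between
// attempts (exponential backoff) so a struggling node isn't hammered.
bool call_with_retry(const std::function<bool()>& remote_call,
                     int max_attempts,
                     std::chrono::milliseconds backoff) {
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        if (remote_call())
            return true;                       // the node answered
        std::this_thread::sleep_for(backoff);  // node gone or slow: wait...
        backoff *= 2;                          // ...longer after each failure
    }
    return false;  // give up; the caller reroutes the work elsewhere
}
```

This is exactly the kind of boilerplate the libraries and runtimes mentioned above exist to absorb, so that application code sees only "run this task somewhere" rather than nodes appearing and disappearing.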
Charting the Hardware
All three trends are just aspects of a single trend: filling out the chart and enabling heterogeneous parallel computing. Figure 7 shows that the chart wants to be filled out because there are workloads that are naturally suited to each of these boxes, though some boxes are more popular than others.