Parallelization is only half the battle. The move to multicore is an opportunity for chip vendors to embrace other interesting architectural changes in an attempt to reach teraflop performance within this decade. We illustrate this by describing the architecture of the Cell BE processor. Each Cell BE contains a total of nine cores. The Power Processing Unit (PPU) is a fairly traditional CPU core, with a PowerPC instruction set architecture and a reasonably sized cache. The other eight cores are referred to as Synergistic Processing Units (SPUs), and use a nontraditional architecture to achieve high performance in a small chip area. Instead of caches, each SPU sports 256 KB of on-chip, locally addressable memory (the local store), in addition to 128 registers, each 128 bits wide. The SPUs use a vector instruction set that allows, for example, operating on groups of four floating-point numbers at once. Each SPU also features its own Memory Flow Controller (MFC), which can independently issue DMA transfers between main memory and the SPU local stores. Because the SPU's dual-issue pipeline is predictable, a programmer can determine with certainty whether a particular sequence of assembly instructions is optimal. Together, these features let a single Cell BE processor using all eight SPUs perform large matrix multiplications at over 200 Gflops, compared to about 12 Gflops for a single traditional CPU core.
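To make the local-store model concrete, here is a minimal sketch in portable C of the pattern described above: a tile of data is copied from main memory into a small local buffer, processed four floats at a time, and copied back. This is not Cell SDK code. The `memcpy` calls merely stand in for MFC DMA transfers, and the names `vec4`, `vec4_madd`, `saxpy_tiled`, and `TILE_FLOATS` are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LOCAL_STORE_BYTES (256 * 1024) /* local store size given in the text */
#define TILE_FLOATS 4096               /* assumed tile size: 16 KB per buffer */

/* Four floats, standing in for one 128-bit SPU register (an assumption). */
typedef struct { float lane[4]; } vec4;

/* Lane-wise a * b + c, mimicking a single 4-wide vector instruction. */
static vec4 vec4_madd(vec4 a, vec4 b, vec4 c)
{
    vec4 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = a.lane[i] * b.lane[i] + c.lane[i];
    return r;
}

/* Computes y = a * x + y over n floats that live in "main memory",
 * staging each tile through buffers that play the role of the local store.
 * The static buffers make this sketch non-reentrant. */
void saxpy_tiled(float a, const float *x, float *y, size_t n)
{
    static float ls_x[TILE_FLOATS], ls_y[TILE_FLOATS];
    vec4 va = {{a, a, a, a}};

    /* Both buffers together must fit in the simulated local store. */
    assert(sizeof ls_x + sizeof ls_y <= LOCAL_STORE_BYTES);

    for (size_t base = 0; base < n; base += TILE_FLOATS) {
        size_t count = n - base < TILE_FLOATS ? n - base : TILE_FLOATS;

        /* Stand-in for a DMA "get": main memory to local store. */
        memcpy(ls_x, x + base, count * sizeof(float));
        memcpy(ls_y, y + base, count * sizeof(float));

        size_t i = 0;
        for (; i + 4 <= count; i += 4) { /* four floats per step */
            vec4 vx, vy;
            memcpy(&vx, ls_x + i, sizeof vx);
            memcpy(&vy, ls_y + i, sizeof vy);
            vy = vec4_madd(va, vx, vy);
            memcpy(ls_y + i, &vy, sizeof vy);
        }
        for (; i < count; i++)           /* scalar tail, fewer than four left */
            ls_y[i] = a * ls_x[i] + ls_y[i];

        /* Stand-in for a DMA "put": local store back to main memory. */
        memcpy(y + base, ls_y, count * sizeof(float));
    }
}
```

On real hardware the transfers would be asynchronous, so a tuned version would typically overlap the copy of the next tile with computation on the current one. The sketch omits that to keep the data movement explicit, which is exactly the bookkeeping a conventional C compiler does not do for the programmer.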
The same features that make the Cell such a high-performance chip also make it hard for compilers for languages such as C/C++ to take advantage of. The assumptions these languages make about memory organization and instruction sets simply do not map well to heterogeneous processors such as the Cell. And if you think this kind of architecture is restricted to the domain of game consoles and special-purpose equipment, consider what vendors such as AMD and Intel are saying about their upcoming architectures. AMD's Fusion project aims to merge GPU-like cores with traditional x86-style cores, which, given the similarities between SPUs and GPU processors, will result in an architecture much like that of the Cell. At its 2006 developer forum, Intel showed off an 80-core processor prototype capable of more than a teraflop of compute power. Those 80 cores are not traditional x86 cores; this is an entirely different processor architecture. Intel also recently announced the Larrabee project, an attempt at converging GPU and multicore CPU architectures.
So wouldn't it be nice to be able to program with the familiar tools and languages without compromising on performance?