What's Under the Hood
With the onset of the multicore revolution, chip designers now favor a larger number of simpler, smaller, lower-clocked, more energy-efficient cores over pushing the limits of a single large, power-hungry, highly speculative, and less efficient core.
Larrabee will include 32 or more cores, each of which is much simpler than the Pentium 4, and significantly simpler than the current Intel Core 2 or Core i7 architectures. Cores will process instructions in order, consigning out-of-order execution to history as a silicon-hungry and power-inefficient luxury of the past. Not only will Larrabee save on all the circuitry required for out-of-order execution, rumor has it that branch prediction will be simpler and branch mis-prediction penalties might cost as many as 9 cycles.
Each Larrabee core features a 512-bit-wide SIMD unit. Cores share traits with other multicore architectures: a relatively small amount of local memory (256 Kbytes per core), a large number of registers (128), and a ring interconnect. Even the latency scaling across memory levels will resemble that of earlier many-cores.
When developing compute-intensive code on Larrabee, developers should keep in mind two fundamental principles:
- Transform control flow into data flow
- Use an explicit working set
The first principle is crucial for avoiding expensive mis-prediction penalties. More importantly, branch elimination is essential for SIMD parallelization. The second principle allows the best exploitation of the (small) per-core local memory: an explicitly streaming design avoids thrashing the caches, and a careful use of non-temporal hints keeps single-use data from polluting them.
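To make the first principle concrete, here is a minimal, portable C sketch (deliberately scalar, with no LRBni intrinsics) of turning a branch into a data-flow computation via a mask. The function names are mine, for illustration only:

```c
#include <stdint.h>

/* Branchy version: the compiler must emit a conditional jump,
 * which the hardware may mis-predict. */
int32_t clamp_branchy(int32_t x, int32_t hi) {
    if (x > hi)
        return hi;
    return x;
}

/* Branchless version: the comparison produces an all-ones or
 * all-zeros mask, which then selects between the two values.
 * This is exactly the shape that vectorizes: a SIMD compare
 * fills a mask, and a masked move/blend performs the select
 * across all lanes at once, with no jumps anywhere. */
int32_t clamp_dataflow(int32_t x, int32_t hi) {
    int32_t mask = -(int32_t)(x > hi);   /* 0xFFFFFFFF if x > hi, else 0 */
    return (hi & mask) | (x & ~mask);
}
```

The two functions compute the same result; only the second one has a data-flow shape that a 16-wide SIMD unit can execute one lane at a time without any control divergence.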
Some less-than-pleasant manual unrolling of subsequent iterations of a loop (I call this "vertical unrolling") should still pay off on Larrabee, given its relative abundance of registers. But at least no horizontal unrolling should be needed. Horizontal unrolling consists of merging loops that originally processed independent streams (see High-performance Regular Expression Scanning on the Cell/B.E. Processor [5, Section 5.6]) in order to fill the read-after-write dependency stalls within each loop iteration that the compiler can't fill. Horizontal unrolling is unnecessary because Larrabee cores feature 4-way Simultaneous Multi-Threading (SMT): the hardware avoids stalls by switching effortlessly to whichever other thread is ready to compute.
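For readers unfamiliar with the technique, here is a scalar C sketch of vertical unrolling on a trivial reduction (the same idea applies per SIMD lane in vector code). The function names are mine, not from the article's kernel:

```c
#include <stddef.h>
#include <stdint.h>

/* Naive reduction: every iteration depends on the previous one,
 * so each add must wait for the last (a read-after-write
 * dependency chain the compiler cannot break on its own). */
int64_t sum_naive(const int32_t *a, size_t n) {
    int64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Vertically unrolled: four independent accumulators break the
 * dependency chain, letting the core overlap the additions.
 * The extra registers this costs are cheap on Larrabee. */
int64_t sum_unrolled(const int32_t *a, size_t n) {
    int64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)          /* remainder iterations */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```

On a 4-way SMT core, the hardware's thread switching fills many of these stall slots for you, which is exactly why Larrabee makes this kind of hand-tuning less pressing than it was on the Cell/B.E.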
Additionally, Larrabee has masked scatter/gather instructions. Their impact is threefold:
- Code economy: One gather instruction may replace hundreds of scalar instructions. To convince yourself, try expressing a gather's equivalent in your favorite SIMD assembly. Remember to unpack 16 indexes from the SIMD register, do pointer arithmetic, perform 16 loads, and repack the 16 read values into the SIMD result register. If the gather is masked, prepend each of the 16 loads with tests and jumps. If needed, include type conversion/promotion code. Good luck.
- No need for explicit packing and unpacking: As noted above, a gather instruction dereferences each 32-bit index (or pointer) in the ith lane of a vector register and returns the result in the same ith lane of the result vector register. On other SIMD architectures, you must manually unpack each pointer into a scalar register or rotate it to a predefined location before use.
- Masking: No need for speculated writes. One technique to avoid expensive branches is to write output speculatively [5, Section 5.4]. This means using (unconditional, unpredicated) store instructions that always write something; when not needed, they write invalid output to ignored locations. You should never have to do this on Larrabee, because scatter instructions allow you to mask each of the 16 stores. The hardware skips the writes you don't want, and will likely complete earlier when only a few lanes are selected.
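To fix ideas, here is a scalar C model of what a 16-lane masked gather and masked scatter compute. These are illustrative functions of my own, not Larrabee mnemonics, but each one stands in for a single LRBni instruction that replaces the entire loop of loads (or stores), mask tests, and repacking:

```c
#include <stdint.h>

#define LANES 16

/* Scalar model of a masked gather: for every lane whose mask bit
 * is set, dereference base[index[i]]; lanes with a clear mask bit
 * keep their previous value. */
void masked_gather(int32_t result[LANES],
                   const int32_t *base,
                   const int32_t index[LANES],
                   uint16_t mask) {
    for (int i = 0; i < LANES; i++)
        if (mask & (1u << i))
            result[i] = base[index[i]];
}

/* Scalar model of a masked scatter: the inverse operation.
 * Masked-off lanes are simply skipped, which is why speculated
 * writes to dummy locations are unnecessary on Larrabee. */
void masked_scatter(int32_t *base,
                    const int32_t index[LANES],
                    const int32_t value[LANES],
                    uint16_t mask) {
    for (int i = 0; i < LANES; i++)
        if (mask & (1u << i))
            base[index[i]] = value[i];
}
```

Try writing the gather loop above in your favorite SIMD assembly, with its unpacking, pointer arithmetic, 16 conditional loads, and repacking, and the "code economy" point becomes vivid.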
The rest of this article is an in-depth walk-through of a single piece of code, ported to LRBni. The code I show is a simplified replacement for the kernels generated by flex, and you can use it to write your own high-performance lexical scanner.
The example I present here is, if you will, a lazy port of that tokenizer to LRBni. Given its intended use (search-engine indexing), it is oriented toward high throughput on large numbers of independent streams rather than low execution time on a single stream.