The Rest of LRBni
Several vector instructions have been added for moving bits around within each element. Vinsertfield rotates each source element by a per-instruction immediate value, then masks off a portion of the result according to two more immediate values, leaving the destination element untouched wherever the mask is zero; the net effect is to insert the rotated source field into the destination element, as in Figure 15. (Here "mask" means a normal bitmask, of the sort you might logical-and with a register, not the writemask.) Used with no bitmask, vinsertfield can also serve as a rotate-by-immediate vector instruction.
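To make the rotate-then-insert behavior concrete, here is a scalar sketch in plain C of the per-element operation just described. The exact immediate encoding (rotate count, low bit, high bit) is an assumption for illustration, not the documented LRBni encoding:

```c
#include <stdint.h>

/* Scalar sketch of vinsertfield's per-element operation: rotate src left
 * by 'rot', then copy only bits lo..hi of the rotated value into dst,
 * leaving dst's other bits untouched. With lo = 0 and hi = 31 (no
 * bitmask), this degenerates to a plain rotate-by-immediate. */
static uint32_t insertfield(uint32_t dst, uint32_t src,
                            unsigned rot, unsigned lo, unsigned hi)
{
    rot &= 31;  /* guard against a shift by 32, which is undefined in C */
    uint32_t rotated = rot ? (src << rot) | (src >> (32 - rot)) : src;
    uint32_t width = hi - lo + 1;
    uint32_t mask = (width >= 32) ? 0xFFFFFFFFu
                                  : (((1u << width) - 1u) << lo);
    return (dst & ~mask) | (rotated & mask);
}
```

Note that the writemask-free rotate usage falls out for free: `insertfield(0, x, r, 0, 31)` simply rotates `x` left by `r`.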
vbitinterleave11pi and vbitinterleave21pi allow the interleaving of bits from two registers; vbitinterleave11pi alternates bits from the two sources, starting with bit 0 of the last source, and vbitinterleave21pi alternates one bit from the last source with two bits from the first source. Bit-interleaving is useful for generating swizzled addresses, particularly in conjunction with vinsertfield, for example in preparation for texture sample fetches (volume textures in the case of vbitinterleave21pi). The following sequence generates 16 offsets into a fully-swizzled 256x256 four-component float texture from 16 8-bit x coordinates stored in v1 and 16 8-bit y coordinates stored in v2, as in Figure 16:
vxorpi v3, v3, v3
vbitinterleave11pi v0, v2, v1
vinsertfield v3, v0, 4, 4, 19
(Note that if these indices were going to be used by a gather instruction, and if the texel size were 8 bytes or less, it wouldn't be necessary to use vinsertfield to shift the address up in order to address the texels, since the gather instruction can scale by 2, 4, or 8.)
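The address computation above can be sketched in scalar C for a single texel. This is an illustration of the bit-interleaving technique, not actual LRBni semantics; the operand-order convention (even result bits from the last source) follows the description above, and the function names are my own:

```c
#include <stdint.h>

/* Scalar sketch of vbitinterleave11pi's per-element behavior: result
 * bit 0 comes from bit 0 of 'last', bit 1 from bit 0 of 'first', and so
 * on, alternating one bit from each source. */
static uint32_t bitinterleave11(uint32_t first, uint32_t last)
{
    uint32_t r = 0;
    for (int i = 0; i < 16; i++) {
        r |= ((last  >> i) & 1u) << (2 * i);      /* even result bits */
        r |= ((first >> i) & 1u) << (2 * i + 1);  /* odd result bits  */
    }
    return r;
}

/* Swizzled offset for one texel of a 256x256 four-component float
 * (16-byte-texel) texture: interleave x and y (x in the even bits,
 * matching "vbitinterleave11pi v0, v2, v1"), then shift up by
 * log2(16) = 4, which is what the vinsertfield step accomplishes. */
static uint32_t swizzled_offset(uint32_t x, uint32_t y)
{
    return bitinterleave11(y, x) << 4;
}
```

The same sketch extends to volume textures by interleaving two bits from one source per bit of the other, which is what vbitinterleave21pi does.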
There are also shuffle instructions for permuting elements from a source vector to a destination vector.
Although LRBni is primarily a vector instruction extension, it adds a few scalar instructions as well. In addition to bsfi and bsri, it adds insertfield, bitinterleave11, and bitinterleave21, the scalar versions of the vector bit-manipulation instructions described above. Prefetching and other cache-control instructions have been added as well; these are particularly important on Larrabee, where data must be fetched far enough ahead and at a high enough rate to keep the voracious vector units well-fed and fully loaded in streaming applications, without the help of out-of-order hardware.
Finally, note that in the initial version of the hardware, a few aspects of the Larrabee architecture -- in particular vcompress, vexpand, vgather, vscatter, and transcendentals and other higher math functions -- are implemented as pseudo-instructions, using hardware-assisted instruction sequences, although this will change in the future.
What Does It All Add Up To?
I'd sum up my experience in writing a software graphics pipeline for Larrabee by saying that Larrabee's vector unit supports extremely high theoretical processing rates, and LRBni makes it possible to extract a large fraction of that potential in real-world code. For example, real pixel-shader code running on simulated Larrabee hardware is getting 80% of theoretical maximum performance, even after accounting for work wasted by pixels that are off the triangle but still get processed due to the use of 16-wide vector blocks. Tim Sweeney, of Epic Games -- who provided a great deal of input into the design of LRBni -- sums up the big picture a little more eloquently:
Larrabee enables GPU-class performance on a fully general x86 CPU; most importantly, it does so in a way that is useful for a broad spectrum of applications and that is easy for developers to use. The key is that Larrabee instructions are "vector-complete."
More precisely: Any loop written in a traditional programming language can be vectorized, to execute 16 iterations of the loop in parallel on Larrabee vector units, provided the loop body meets the following criteria:
- Its call graph is statically known.
- There are no data dependencies between iterations.
Shading languages like HLSL are constrained so developers can only write code meeting those criteria, guaranteeing a GPU can always shade multiple pixels in parallel. But vectorization is a much more general technology, applicable to any such loops written in any language.
This works on Larrabee because every traditional programming element -- arithmetic, loops, function calls, memory reads, memory writes -- has a corresponding translation to Larrabee vector instructions that run it on 16 data elements simultaneously. You have: integer and floating-point vector arithmetic; scatter/gather for vectorized memory operations; and comparison, masking, and merging instructions for conditionals.
This wasn't the case with MMX, SSE, and AltiVec. They supported vector arithmetic, but could only read and write data from contiguous locations in memory, rather than random-access locations as Larrabee can. So SSE was only useful for operations on data that was naturally vector-like: RGBA colors, XYZW coordinates in 3D graphics, and so on. The Larrabee instructions are suitable for vectorizing any code meeting the conditions above, even when the code was not written to operate on vector-like quantities. It can benefit every type of application!
A vital component of this is Intel's vectorizing C++ compiler. Developers hate having to write assembly language code, and even dislike writing C++ code using SSE intrinsics, because the programming style is awkward and time-consuming. Few developers can dedicate resources to doing that, whereas Larrabee is easy; the vectorization process can be made automatic and compatible with existing code.
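To make the compare/mask/merge model described in the quote concrete, here is a plain-C sketch (not LRBni code; the function name and lane layout are illustrative assumptions) of how a loop body containing a branch maps onto the 16-wide masked-execution model: both paths are computed for all lanes, a comparison builds a writemask, and a masked merge commits the right result per lane:

```c
#include <stdint.h>

enum { LANES = 16 };  /* one Larrabee vector's worth of 32-bit lanes */

/* Scalar loop being vectorized: out[i] = (in[i] < 0) ? -in[i] : in[i];
 * The loop meets the criteria above: its call graph is statically known
 * and iterations are independent, so all 16 lanes can run in parallel. */
static void abs16(const int32_t in[LANES], int32_t out[LANES])
{
    uint16_t mask = 0;
    int32_t negated[LANES];
    for (int i = 0; i < LANES; i++) {
        if (in[i] < 0)                    /* vector compare: build the   */
            mask |= (uint16_t)1 << i;     /* 16-bit writemask            */
        negated[i] = -in[i];              /* vector negate, all lanes    */
    }
    for (int i = 0; i < LANES; i++)       /* masked merge: each lane     */
        out[i] = ((mask >> i) & 1)        /* picks a result under the    */
                     ? negated[i]         /* writemask; no per-lane      */
                     : in[i];             /* branching is needed         */
}
```

The point of the exercise is that the branch never becomes a branch in the vector code: it becomes a mask, which is exactly the mechanism a vectorizing compiler targets.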
In short, it will be possible to get major speedups from LRBni without heroic programming, and that surely is A Good Thing. Of course, nothing's ever that easy; as with any new technology, only time will tell exactly how well automatic vectorization will work, and at the least it will take time for the tools to come fully up to speed. Regardless, it will equally surely be possible to get even greater speedups by getting your hands dirty with intrinsics and assembly language; besides, I happen to like heroic coding. So in the next article we'll look under the hood, examining how rasterization, a process that is most definitely not inherently parallel, can be efficiently implemented with LRBni.