Larrabee's New Instructions In C++: A Prototype
So you've read the article and watched the video, and you're champing at the bit to start playing around with Larrabee, Intel's many-core architecture that boasts many threads and a new vector instruction set -- all in the name of pushing performance. You have everything you need except a Larrabee.
No problem. This C++ prototype library provides a C++ implementation of the Larrabee new instructions (LRBni), making it possible for you to experiment with developing Larrabee code without a Larrabee compiler and without Larrabee hardware.
Okay, you can't go full-bore Larrabee. The library doesn't try to match LRBni in every case, particularly when it comes to exceptions, flags, bit-precision, or memory alignment restrictions. Nor should you assume that the exact syntax and semantics of the functions presented in the library will be supported in future Larrabee-based hardware and software. Still, it's enough to get you going.
As Michael Abrash explains in detail in A First Look at the Larrabee New Instructions (LRBni), the Larrabee new instructions are extensions of the existing Intel-based vector graphics streaming SIMD instructions. They operate on two new sets of registers:
- 32 512-bit vector registers (v0-v31) that hold either 16 32-bit values or 8 64-bit values
- 8 16-bit vector mask registers (k0-k7), each holding a 16-bit mask
The LRB prototype primitives support these data types with the following C types:
- typedef struct { float v[16]; } _M512;
- typedef struct { double v[8]; } _M512D;
- typedef struct { int v[16]; } _M512I;
- typedef unsigned short __mmask;
Additionally, enumerated types are defined for the instructions that use immediate value operands. These are listed in the last section of this document.
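To make the register layout concrete, here is a minimal sketch of the same three vector types and the mask type. The names Vec16f, Vec8d, Vec16i, Mask16, splat, and mask_bit are hypothetical stand-ins invented for this illustration, not the library's identifiers, and the size checks assume a platform with 32-bit float and int and 64-bit double.

```cpp
#include <cassert>
#include <climits>

// Hypothetical stand-ins for _M512, _M512D, _M512I and __mmask.
struct Vec16f { float  v[16]; };
struct Vec8d  { double v[8];  };
struct Vec16i { int    v[16]; };
using Mask16 = unsigned short;

// Each vector type should cover one 512-bit register (assumes 32-bit
// float/int and 64-bit double, which holds on common platforms).
static_assert(sizeof(Vec16f) * CHAR_BIT == 512, "16 x 32-bit floats");
static_assert(sizeof(Vec8d)  * CHAR_BIT == 512, "8 x 64-bit doubles");
static_assert(sizeof(Vec16i) * CHAR_BIT == 512, "16 x 32-bit ints");
static_assert(sizeof(Mask16) * CHAR_BIT == 16,  "one mask bit per element");

// Fill all 16 elements with the same value.
Vec16f splat(float x) {
    Vec16f r{};
    for (int i = 0; i < 16; ++i) r.v[i] = x;
    return r;
}

// Test the mask bit for element i (assumes bit i maps to element i).
bool mask_bit(Mask16 k, int i) {
    assert(i >= 0 && i < 16);
    return ((k >> i) & 1u) != 0;
}
```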
Most Larrabee vector instructions have the form:
vop v1 {k1}, v2, S(v3/m)
where v1 is the destination vector register, k1 is the vector mask register, v2 is the first source vector register, and S(v3/m) is the second source -- written that way to indicate that it is the result of a swizzle/broadcast/conversion process S on either a memory location m or vector register v3. k1 is a writemask, meaning that only those elements with the corresponding bit set in k1 are computed and stored into v1. Elements in v1 with the corresponding bit clear in k1 retain their previous values, so a merging of the new element values with the previous element values is implied.
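The writemask merge can be sketched in plain C++ as a per-element loop. This is an illustration of the semantics just described, not the library's code: Vec16f, Mask16, and masked_apply are hypothetical names, and the sketch assumes that bit i of the mask controls element i.

```cpp
#include <cassert>
#include <cstdint>

struct Vec16f { float v[16]; };   // hypothetical stand-in for a 512-bit register
using Mask16 = std::uint16_t;     // hypothetical stand-in for a mask register

// Apply op(v2[i], v3[i]) only where bit i of k1 is set; every other
// element keeps the value it had in the old destination.
template <typename Op>
Vec16f masked_apply(Vec16f dest_old, Mask16 k1,
                    const Vec16f& v2, const Vec16f& v3, Op op) {
    for (int i = 0; i < 16; ++i) {
        if ((k1 >> i) & 1u) {
            dest_old.v[i] = op(v2.v[i], v3.v[i]);
        }
    }
    return dest_old;
}

// Usage: add under mask 0x0005, so only elements 0 and 2 are recomputed.
void writemask_demo() {
    Vec16f old_dest{}, a{}, b{};
    for (int i = 0; i < 16; ++i) {
        old_dest.v[i] = -1.0f;
        a.v[i] = static_cast<float>(i);
        b.v[i] = 10.0f;
    }
    Vec16f r = masked_apply(old_dest, 0x0005, a, b,
                            [](float x, float y) { return x + y; });
    assert(r.v[0] == 10.0f && r.v[2] == 12.0f);   // recomputed elements
    assert(r.v[1] == -1.0f && r.v[3] == -1.0f);   // preserved elements
}
```

Passing the old destination by value and returning the merged copy mirrors the "merge with previous values" behavior without needing a real register.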
The C++ LRBni prototype primitive functions take the sources as inputs and return the destination value. To simplify usage and to enable compiler optimizations, a pair of functions is provided for each vector instruction -- a full version that takes a mask and the old destination register as arguments, and a short version for operating on the entire vector. Each Larrabee vector instruction is implemented with two functions like this:
v1 = _mm512_mask_op(v1_old, k1, v2, v3);
v1 = _mm512_op(v2, v3);
If the destination is required as a source (for example, in the MADD operation), it is included in the inputs as v1 or v1_old. If the instruction writes to both a vector register and a vector mask register, the vector register is returned and the vector mask is written through a pointer (listed as either k1_res or k2_res). For example:
_M512I _mm512_adc_pi(_M512I v1, __mmask k2, _M512I v3, __mmask *k2_res);
_M512 _mm512_mask_add_ps(_M512 v1_old, __mmask k1, _M512 v2, _M512 v3);
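As a rough illustration of the "return the vector, write the mask through a pointer" convention, the sketch below models an add-with-carry in portable C++. It is not the library's implementation: Vec16u, Mask16, and sketch_adc are hypothetical names, and the semantics are an assumption, namely that bit i of the incoming mask is the carry-in for element i and that the carry-out of element i is written to bit i of the result mask.

```cpp
#include <cassert>
#include <cstdint>

struct Vec16u { std::uint32_t v[16]; };   // hypothetical 16 x 32-bit integer vector
using Mask16 = std::uint16_t;             // hypothetical mask register

// Element-wise a + b + carry_in bit. The sum vector is returned and the
// per-element carry-out bits are written through the carry_out pointer.
Vec16u sketch_adc(const Vec16u& a, Mask16 carry_in,
                  const Vec16u& b, Mask16* carry_out) {
    assert(carry_out != nullptr);
    Vec16u r{};
    Mask16 out = 0;
    for (int i = 0; i < 16; ++i) {
        // Widen to 64 bits so the carry out of bit 31 is observable.
        std::uint64_t sum = static_cast<std::uint64_t>(a.v[i]) + b.v[i]
                          + ((carry_in >> i) & 1u);
        r.v[i] = static_cast<std::uint32_t>(sum);
        if (sum >> 32) {
            out = static_cast<Mask16>(out | (1u << i));
        }
    }
    *carry_out = out;
    return r;
}
```

Unsigned elements are used here so that wraparound is well defined in C++; the prototype library's own integer type is declared with int elements.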
Because this programming model cannot enforce that the same variable is used for both roles when a single register is both a source and a destination, it allows for constructs that may map to more than a single Larrabee instruction.
Instead of adding the swizzle/broadcast/conversion arguments to the vector functions, the S(v/m) operation is implemented as a set of functions that operate on either a vector register or a memory location and produce a vector result. The output of these swizzle/broadcast/conversion functions can be used as any vector source input; however, each Larrabee vector instruction supports such an operation only on the last source operand. While this decoupling creates an easier programming model, it also allows for constructs that may map to more than a single Larrabee instruction.
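The decoupling can be pictured with a small sketch in which a broadcast is an ordinary function whose result feeds a vector operation. All names here (Vec16f, sketch_broadcast1, sketch_add, add_offset) are hypothetical and chosen for this illustration; they are not the library's functions.

```cpp
#include <cassert>

struct Vec16f { float v[16]; };   // hypothetical stand-in for a 512-bit register

// Hypothetical S(m): read one float from memory and replicate it
// into all 16 elements.
Vec16f sketch_broadcast1(const float* m) {
    assert(m != nullptr);
    Vec16f r{};
    for (int i = 0; i < 16; ++i) r.v[i] = *m;
    return r;
}

// Hypothetical unmasked vector add: v2 + v3, element by element.
Vec16f sketch_add(const Vec16f& v2, const Vec16f& v3) {
    Vec16f r{};
    for (int i = 0; i < 16; ++i) r.v[i] = v2.v[i] + v3.v[i];
    return r;
}

// The broadcast feeds the LAST source operand, which is the one position
// where a single instruction could fold it in. Feeding it to the first
// operand instead would still compile, but could take more than one
// instruction on real hardware.
Vec16f add_offset(const Vec16f& v2, const float* offset) {
    return sketch_add(v2, sketch_broadcast1(offset));
}
```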
For more details, go here

