

A First Look at the Larrabee New Instructions (LRBni)

Larrabee's Vector Architecture

LRBni adds two sorts of registers to the x86 architectural state. There are 32 new 512-bit vector registers, v0-v31, and 8 new 16-bit vector mask registers, k0-k7. While some core resources such as caches are shared by the core threads, that is not the case for registers; each thread has a full complement of vector and vector mask registers.

LRBni vector instructions are either 16-wide or 8-wide, so a vector register can be operated on by a single LRBni instruction as 16 float32s, 16 int32s, 8 float64s, or 8 int64s, as in Figure 2, with all elements operated on in parallel. LRBni vector instructions are also ternary; that is, they involve three vector registers, of which typically two are inputs and the third the output. This eliminates the need for most move instructions; such instructions are not a significant burden on out-of-order cores, which can schedule them in parallel with other work, but they would slow Larrabee's in-order pipeline considerably.
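The ternary, non-destructive form is easy to model in plain Python (this is an illustrative sketch, not real LRBni code; the names and element count follow the description above):

```python
VLEN = 16  # float32/int32 elements per 512-bit vector register

def vaddps(src1, src2):
    """Ternary form: the result goes to a separate destination,
    leaving both sources intact -- no extra move instruction needed."""
    assert len(src1) == len(src2) == VLEN
    return [a + b for a, b in zip(src1, src2)]

v5 = [float(i) for i in range(VLEN)]
v6 = [2.0] * VLEN
v0 = vaddps(v5, v6)   # v0 = v5 + v6; v5 and v6 are unchanged
```

With a binary (two-operand) instruction set, computing v5 + v6 while preserving v5 would first require a move; the three-operand form avoids that entirely.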

Figure 2: The vector data types supported by LRBni.

For the purposes of discussion, I divide LRBni into several broad groups:

  • vector arithmetic, logical, and shift;
  • vector mask generation;
  • vector load/store;
  • other instructions, including those that help keep the vector pipeline well fed.

I discuss each of these in turn, referring to Table 1, which lists a broad sample of LRBni instructions. (Table 1 is not a complete listing; some instructions are still evolving, and others would require too much explanation.) Vector instructions start with v, and vector mask instructions start with k. The mnemonic suffixes follow the SSE convention of "px," where p means "packed" (that is, a vector of 8 or 16 elements), and x refers to the element type:

  • s for float32 (single-precision, henceforth referred to as simply float);
  • i for int32;
  • u for unsigned int32 (used only in conversions and a few specific instructions);
  • q for int64, and d for float64 (double-precision, henceforth referred to as double).

Load and store instructions, which don't use p, use one of the following:

  • d for 32-bit quantities (dwords);
  • q for 64-bit quantities (qwords).

To keep things simple, for the most part I'm going to talk only about float and int32 operations in this article, but LRBni provides support (albeit somewhat less extensive) for double and int64 operations as well.

You can find additional information about LRBni, including instruction descriptions and prototyping libraries, here.

Table 1: Broad sampling of LRBni instructions.

Vector Arithmetic, Logical, and Shift Instructions

The arithmetic, logical, and shift vector instructions include everything you'd expect: add, subtract, add with carry, subtract with borrow, multiply, round, clamp, max, min, absolute max, logical-or, logical-and, logical-xor, logical-shift and arithmetic-shift by a per-element variable number of bits, and conversions among floats, doubles, and signed and unsigned int32s. There are also multiply-add and multiply-sub instructions, which run at the same speed as other vector instructions, thereby doubling Larrabee's peak flops. Finally, there is hardware support for transcendentals and higher math functions. The arithmetic vector instructions operate in parallel on 16 floats or int32s, or 8 doubles, although this is not fully orthogonal; most float multiply-add instructions have no int32 equivalent, for example. The logical vector instructions operate on 16 int32s or 8 int64s, and the shift vector instructions operate on 16 int32s only. The non-orthogonality of the vector instructions may seem a bit inconvenient, but it makes for lower-power hardware, which in turn makes it possible to have more cores -- and therefore more processing power.

Both the destination and the first source operand for a vector instruction must typically be vector registers (for certain instructions, one of the first two operands must be a vector mask register, as I discuss shortly), but the last source may optionally be a memory operand; this feature comes at no performance cost and saves a great many load instructions, reducing code size and freeing up the in-order pipeline to do other work. This is the reason for the existence of the reverse-subtract instructions, and also for the many variants of multiply-add and multiply-subtract, which allow you to choose which of the three operands is added to or subtracted from, although the destination must always be a vector register.

Multiply-add and multiply-sub have three vector operands like other vector instructions, but are special in that they have three sources, so the first operand must serve as both a source and the destination; hence, unlike the other instructions, most multiply-add and multiply-sub instructions have no non-destructive form. (The exception is vmadd233, a special form of multiply-add designed specifically for interpolation, which gets both offset and scale from a single operand and consequently uses only two source operands.) It's worth noting that multiply-add and multiply-sub instructions are fused; that is, no bits of floating-point precision are lost between the multiply and the add or subtract, so they are more accurate than and not exactly equivalent to a multiply instruction followed by a separate add or subtract instruction.
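You don't need LRBni hardware to see why fusing matters. The following Python sketch (my own illustration, not from the article) picks values where rounding the intermediate product throws away exactly the bits that distinguish the answer from zero; exact rational arithmetic stands in for what a fused multiply-add would keep:

```python
from fractions import Fraction

# a * b = 1 + 2**-29 + 2**-30 + 2**-59 exactly; the 2**-59 term is far
# below one ulp of a float64, so a separate multiply rounds it away.
a = 1.0 + 2.0**-29
b = 1.0 + 2.0**-30
c = -(1.0 + 2.0**-29 + 2.0**-30)

separate = a * b + c                            # multiply rounds first: 0.0
exact = Fraction(a) * Fraction(b) + Fraction(c) # what fusing preserves: 2**-59
```

A fused multiply-add rounds only once, after the full-precision product and sum, so it would recover the 2**-59 here while the separate multiply-then-add returns exactly zero.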

But wait, there's a lot more to vector instructions, which are really more like little clusters of processing functions than traditional scalar or SSE instructions -- and all at no extra cost! If there's a memory operand to a vector instruction, that operand may optionally be broadcast from one or four elements in memory up to 16 vector elements (or 8 for doubles or int64s) prior to the instruction's operation, as in Figure 3.

Figure 3: 1-to-16 {1to16} and 4-to-16 {4to16} broadcasts.

This is useful for keeping memory and cache footprint down when applying a scalar or a four-element vector across a vector operation. Alternatively, the source memory operand may be converted from one of several compact types (including float16) to float, or from a smaller integer to int32, as listed in Table 2. This is not only useful for keeping footprint down but also removes the need for a separate instruction to perform the conversion. However, a single instance of a load-op instruction can either convert or broadcast, but can't do both. If there is no memory operand, the last vector register operand may be swizzled in one of seven ways, as in Table 3, including one that supports efficient calculation of four cross-products at once. All load-op broadcasts, conversions, and swizzles are free, occurring during the normal course of vector instruction execution.
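The broadcast and swizzle patterns are simple data rearrangements, so a plain-Python model makes them concrete (a hypothetical sketch following Figure 3 and the 'aaaa' entry of Table 3; function names are my own):

```python
VLEN = 16  # float32 elements per vector

def broadcast_1to16(mem):
    # {1to16}: one element from memory fills all 16 vector elements.
    return [mem[0]] * VLEN

def broadcast_4to16(mem):
    # {4to16}: four elements from memory repeat across the four
    # 128-bit blocks of the vector.
    return list(mem[:4]) * 4

def swizzle_aaaa(src):
    # Table 3 'aaaa': within each 128-bit block (4 float32 lanes),
    # replicate the block's least-significant element to all 4 lanes.
    out = []
    for blk in range(0, VLEN, 4):
        out += [src[blk]] * 4
    return out
```

Remember the constraint from the text: a given load-op instruction can broadcast or up-convert, not both, and swizzles apply only when the operand is a register rather than memory.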

Table 2: (a) Load-op broadcasts and up-conversions for int32 operations. (b) Load-op broadcasts and up-conversions for float operations.

We're still not quite done, because every vector instruction can also perform predication. Each vector mask register contains 16 bits, neatly matching the 16 elements in a vector register. Every vector instruction can take a vector mask register as the writemask operand, and if any bit in that vector mask register is zero, the corresponding element of the destination register is left unchanged. Once again, there is no cost for this. Vector instructions can also specify no writemask, for the common case in which all 16 elements should be updated.
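The writemask rule -- zero bits preserve the old destination element -- can be sketched in a few lines of Python (an illustration of the described semantics, not real code; bit i of the mask governs element i):

```python
VLEN = 16

def apply_writemask(dest, result, k=None):
    # k is a 16-bit writemask: bit i == 0 leaves dest[i] unchanged.
    # k=None models "no writemask": all 16 elements are updated.
    if k is None:
        return list(result)
    return [r if (k >> i) & 1 else d
            for i, (d, r) in enumerate(zip(dest, result))]

old = [0.0] * VLEN
new = [float(i + 1) for i in range(VLEN)]
merged = apply_writemask(old, new, k=0b101)  # only elements 0 and 2 update
```

Note that masked-off elements keep their previous contents rather than being zeroed; that merge behavior is what makes predicated conditionals work.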

Table 3: Vector operand swizzles. Notation: dcba denotes the 32-bit elements that form one 128-bit block in the source (with 'a' least-significant and 'd' most-significant), so aaaa means that the least-significant element of the 128-bit block in the source is replicated to all four elements of the same 128-bit block in the destination; the depicted pattern is then repeated for all four 128-bit blocks in the source and destination. These can be used only if there is no memory operand, and on only one operand per vector instruction. 64-bit elements are handled identically, except that in that case the above table describes 256-bit blocks, and the action is repeated for the two 256-bit blocks in the vector.

Predication makes it possible to handle the partial vector iteration at the end of vectorized loops. More importantly, it makes it possible to handle conditionals and loops in vector code.
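The partial-iteration case is worth spelling out. Here's a small Python sketch (my own illustration) of how a vectorized loop over an arbitrary-length array would generate its writemasks, with a full mask for every complete group of 16 and a low-bits-set mask for the remainder:

```python
VLEN = 16

def tail_mask(n):
    # Writemask with the low min(n, 16) bits set; used for the final,
    # partial iteration of a 16-wide vectorized loop.
    return (1 << min(n, VLEN)) - 1

def masks_for(length):
    # One writemask per 16-wide iteration needed to cover `length` elements.
    out = []
    while length > 0:
        out.append(tail_mask(length))
        length -= VLEN
    return out

masks = masks_for(37)  # two full vectors, then a 5-element remainder
```

For 37 elements this yields two full 0xFFFF masks followed by 0x001F, so the same vector loop body handles the ragged tail with no scalar cleanup loop.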

Let's take a look at some of these features in action. First, here's a simple floating-point vector multiply:

vmulps v0, v5, v6      ; v0 = v5 * v6

Figure 4 shows how this performs 16 multiplies in parallel.

Figure 4: vmulps v0, v5, v6.

Next, let's make it a multiply-add (Figure 5):

vmadd231ps v0, v5, v6      ; v0 = v5 * v6 + v0

Figure 5: vmadd231ps v0, v5, v6.

Here, the destination is also the third source. In the instruction mnemonic, "231" refers to the placement of the three operands in the multiply-add equation. Thus, "madd231" means "multiply operand_2 with operand_3, add operand_1"; "madd132" would mean "multiply operand_1 with operand_3, add operand_2," which translates to "v0 = v0 * v6 + v5" for the three operands used above.
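The digit convention generalizes: the first two digits name the operands that are multiplied, and the third names the operand that is added, with the result always landing in the first operand. A hypothetical Python model (mine, not LRBni's) of the float variants:

```python
VLEN = 16

def vmadd_ps(order, op1, op2, op3):
    """order is a digit string like '231': multiply the operands named by
    the first two digits, add the one named by the third digit; the result
    always goes to op1 (destructive, as described in the article)."""
    ops = {"1": op1, "2": op2, "3": op3}
    m1, m2, a = (ops[d] for d in order)
    for i in range(VLEN):
        op1[i] = m1[i] * m2[i] + a[i]

v0, v5, v6 = [1.0] * VLEN, [2.0] * VLEN, [3.0] * VLEN
vmadd_ps("231", v0, v5, v6)   # v0 = v5 * v6 + v0 -> 2*3 + 1 = 7
```

Running "132" instead with the same starting values would compute v0 = v0 * v6 + v5, i.e. 1*3 + 2 = 5, matching the translation given above.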

Now we'll add predication; k1 writemasks the updating of the elements (Figure 6):

vmadd231ps v0 {k1}, v5, v6

Figure 6: vmadd231ps v0 {k1}, v5, v6.

We can make one source a load-op memory operand using the standard assortment of x86 addressing modes (Figure 7):

vmadd231ps v0 {k1}, v5, [rbx+rcx*4] 

Figure 7: vmadd231ps v0 {k1}, v5, [rbx+rcx*4].

We can broadcast from 4 elements in memory to 16 elements to operate on (Figure 8):

vmadd231ps v0 {k1}, v5, [rbx+rcx*4] {4to16}

Figure 8: vmadd231ps v0 {k1}, v5, [rbx+rcx*4] {4to16}.

Or we can upconvert from float16 format (Figure 9):

vmadd231ps v0 {k1}, v5, [rbx+rcx*4] {float16}

Figure 9: vmadd231ps v0 {k1}, v5, [rbx+rcx*4] {float16}.

One note here: Memory operands to vector instructions must be aligned to the size of the block of data loaded; for this purpose, it is the size before writemasking is applied that matters. Thus, the example in Figure 7 must be 64-byte aligned, but the example in Figure 8 only has to be 16-byte aligned, and the example in Figure 9 only has to be 32-byte aligned. (The alignment requirement is implementation-dependent, and could change in the future, but it will be true of the initial versions of Larrabee, at least.)
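The rule is just "alignment equals the size of the memory block actually loaded," which a few lines of arithmetic make concrete (an illustrative sketch; the figures referenced are the three examples above):

```python
def required_alignment(bytes_per_elem, elems_loaded):
    # Alignment equals the size of the block loaded from memory,
    # before any writemasking or up-conversion is applied.
    return bytes_per_elem * elems_loaded

full_load   = required_alignment(4, 16)  # Figure 7: 16 floats   -> 64 bytes
broadcast4  = required_alignment(4, 4)   # Figure 8: {4to16}     -> 16 bytes
half_floats = required_alignment(2, 16)  # Figure 9: {float16}   -> 32 bytes
```

The {float16} case is the one that trips people up: the vector operated on is 64 bytes wide, but only 32 bytes of float16 data are actually loaded, so 32-byte alignment suffices.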

No, it's not like any x86 assembly syntax you've ever seen, but it's actually pretty straightforward, and, as you can see, for once things are spelled out pretty clearly -- "{float16}" is a lot easier to parse than most assembly-language mnemonics I've encountered.

All of the above instructions run at the same throughput (although again that's implementation dependent), and all of the capabilities illustrated above work with any vector instruction.
