A First Look at the Larrabee New Instructions (LRBni)


Getting Data Into and Out of Vector Format

The instructions covered so far are the heart of Larrabee's data-crunching capabilities, but by themselves they'd require all their input and output to be arranged in structure-of-arrays (SOA) form, which would be unfortunate because most data is in array-of-structures (AOS) form -- not least a lot of graphics data, such as vertex arrays. Since Larrabee's initial use will be as the processor for a graphics card, it's obviously essential to be able to get data into and out of SOA format efficiently, and LRBni adds three sorts of instructions for this purpose. Of these, first and most important are the gather/scatter instructions. The key to gather is that it lets you load each element of the destination vector from any memory address, independent of where the other elements are being loaded from, as in Figure 12, which I'll discuss shortly. If you think of this as performing a separate scalar load for each element, it's obvious why it's so useful for vectorization -- it's the vector load instruction for cases where each of the 16 streams has a different data source.

Figure 12: vgatherd v1 {k1}, [rbx + v2*4]. This is a simplified representation of what is currently a hardware-assisted multi-instruction sequence, but will become a single instruction in the future.

Consider the case of checksumming an int32 array. If it's just one array, you can process it 16 values at a time, using the normal vector load instruction, vload, followed by vaddpi, to sum 16 values at a pop; or you could just do a load-op vaddpi, as in Listing One. Then, at the end, you can sum together the 16 values you've accumulated, and you're done. (If the array length weren't a multiple of 16, you'd use the writemask to do a partial sum at the end.)


; Partial code to calculate an array checksum, summing
; 16 elements at a time; code after the loop to do a final
; sum of the 16 partial sums would also be required.
; On entry:
;   rbx points to the base of the array to sum.
;   rcx is how many elements to sum (assumed to be a nonzero
;        multiple of 16).
; On exit, v0 contains the 16 partial sums.
	vxorpi	v0, v0, v0	; clear the 16 partial sums
	shr	rcx, 4		; do 16 at a time
ChecksumLoop:
	vaddpi	v0, v0, [rbx]	; load-op add of the next 16 elements
	add	rbx, 64		; 16 elements * 4 bytes each
	dec	rcx
	jnz	ChecksumLoop

Listing One
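
To make the steps that Listing One leaves out concrete, here is a minimal scalar sketch in C of the whole checksum: the 16 running partial sums, the partial final vector, and the closing sum across lanes. It is an illustration, not Larrabee code. The name checksum16 and the choice of uint32_t (so that overflow wraps) are assumptions made for this sketch, and the tail loop merely stands in for what the writemask would do.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 16  /* one 512-bit vector holds 16 32-bit elements */

/* Scalar model of Listing One plus the steps the listing leaves out.
   Unsigned arithmetic is used so that overflow wraps around. */
uint32_t checksum16(const uint32_t *data, size_t count)
{
    uint32_t partial[LANES] = {0};      /* models v0 after vxorpi */
    size_t full = count / LANES;        /* models shr rcx, 4 */
    size_t i, lane;
    uint32_t sum = 0;

    for (i = 0; i < full; i++)          /* models ChecksumLoop */
        for (lane = 0; lane < LANES; lane++)
            partial[lane] += data[i * LANES + lane];

    /* Tail: only the first (count % 16) lanes take part, which is the
       job a writemask would do on the last, partial vector. */
    for (lane = 0; lane < count % LANES; lane++)
        partial[lane] += data[full * LANES + lane];

    /* Final sum of the 16 partial sums. */
    for (lane = 0; lane < LANES; lane++)
        sum += partial[lane];
    return sum;
}
```

Because addition is associative and wraps consistently, the result is the same as a plain element-by-element sum, whatever the lane assignment.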

If, however, the value you were checksumming was a field in a structure, so a skip was required between each addition, the vgatherd instruction would allow you to parallelize in either of two different ways. You could gather 16 fields at a time from the array, as in Listing Two.


; Partial code to calculate the checksum of a specific field in
; an array of structures, summing 16 elements at a time; code
; after the loop to do a final sum of the 16 partial sums would
; also be required.
; On entry:
;   rbx points to the base of the array of structures.
;   v2 contains the offsets of the first 16 checksum fields
;        in the array relative to rbx.
;   rcx is how many elements to sum (assumed to be a nonzero
;        multiple of 16).
; On exit, v0 contains the 16 partial sums.
	vxorpi		v0, v0, v0
	shr		rcx, 4	; do 16 at a time
ChecksumLoop:
	vgatherd	v1 {k0}, [rbx + v2]
	vaddpi		v0, v0, v1
	; step to the next 16 values to checksum
	vaddpi		v2, v2, [Mem_Structure_Size_Times_16] {1to16}
	dec		rcx
	jnz		ChecksumLoop

Listing Two

Or, more generally, you could process 16 different streams and do 16 sums at once, one from each of 16 different arrays; you'd gather 16 values, one from each array, and then vaddpi them, as in Listing Three. When ChecksumLoop in Listing Three finishes, you will have accumulated the 16 sums for the 16 arrays. The structure size can even be different for each array. (Note that Listings Two and Three are almost identical; gather is so flexible that the same gather-based code can do many different things, depending on the initial conditions.)


; Calculates checksums of a specific field in 16 arrays of structures in parallel.
; On entry:
;   rbx is the base address that the offsets in v2 are relative to.
;   v2 contains the 16 offsets of the checksum field in each of the
;        16 arrays relative to rbx.
;   rcx is how many elements to sum in each array (assumed nonzero).
; On exit, v0 contains the 16 checksums.
	vxorpi		v0, v0, v0
ChecksumLoop:
	vgatherd	v1 {k0}, [rbx + v2]
	vaddpi		v0, v0, v1
	; step to the next value in each array
	vaddpi		v2, v2, [Mem_Structure_Sizes]
	dec		rcx
	jnz		ChecksumLoop

Listing Three
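
The point that Listings Two and Three differ only in their initial conditions can be seen in a scalar sketch in C. This is a model under stated assumptions, not a translation of the hardware behavior: gather_checksum, its parameters, and the byte-offset convention are names invented for this illustration, with base standing in for rbx, offsets for v2, strides for the memory operand of the second vaddpi, and sums for v0.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LANES 16

/* One loop serves both listings; only offsets[] and strides[] differ.
   offsets[] is updated in place, just as v2 is in the listings. */
void gather_checksum(const unsigned char *base, int32_t offsets[LANES],
                     const int32_t strides[LANES], size_t iterations,
                     uint32_t sums[LANES])
{
    memset(sums, 0, LANES * sizeof sums[0]);       /* vxorpi v0, v0, v0 */
    for (size_t i = 0; i < iterations; i++) {
        for (int lane = 0; lane < LANES; lane++) {
            uint32_t value;
            /* gather: each lane loads from its own address */
            memcpy(&value, base + offsets[lane], sizeof value);
            sums[lane] += value;                   /* vaddpi v0, v0, v1 */
            offsets[lane] += strides[lane];        /* step this lane */
        }
    }
}
```

For the Listing Two pattern, lane n would start at the field offset plus n structure sizes, and every stride would be 16 structure sizes; the 16 sums are then added together afterward. For the Listing Three pattern, lane n would start at the field offset within array n, and each stride would be that array's own structure size.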

Okay, those last two code listings require a bit of explanation, because the gather/scatter instructions do not follow normal addressing rules. The address for a gather or scatter is formed from the sum of a base register and the elements of a scaled index vector register, as in Figure 12. This is the only case in which a vector register can be used to address memory. More precisely, for each element to be loaded, the address is the sum of the base register and the sign-extension to 64 bits of the corresponding element of the index vector register, optionally scaled by 2, 4, or 8. Note that the 32-bit size of the elements used for the index results in a 4 GB limit on the range for gather/scatter, spanning from 2 GB below the base address to 2 GB above it (or larger if scaling by 2, 4 or 8).
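
The addressing rule is easier to check against a few lines of C. The sketch below is a scalar model only: gather_address and gather32 are names made up for this illustration, and the treatment of disabled lanes (they keep their old contents) is an assumption based on ordinary writemask behavior.

```c
#include <stdint.h>
#include <string.h>

#define LANES 16

/* Address for one element: the base plus the 32-bit index element,
   sign-extended to 64 bits and then multiplied by the scale
   (1, 2, 4, or 8). */
const unsigned char *gather_address(const unsigned char *base,
                                    int32_t index, int scale)
{
    int64_t extended = (int64_t)index;   /* sign-extend 32 -> 64 bits */
    return base + extended * scale;
}

/* Scalar model of vgatherd dst {mask}, [base + index*scale].
   Lanes whose mask bit is clear are left untouched (assumed). */
void gather32(uint32_t dst[LANES], uint16_t mask, const unsigned char *base,
              const int32_t index[LANES], int scale)
{
    for (int lane = 0; lane < LANES; lane++) {
        if (mask & (1u << lane))
            memcpy(&dst[lane], gather_address(base, index[lane], scale),
                   sizeof dst[lane]);
    }
}
```

Because the index is sign-extended, a negative element reaches below the base, which is why the unscaled range is centered on the base rather than starting at it.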

What if your gather targets aren't all contained within a 4 GB range? Then you need to wrap another loop around the basic gather loop, in order to step through the 4 GB ranges touched by the gather addresses, which is somewhat more complicated, but not unduly so.

All of the above applies for scatters, but in reverse.
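
As a rough illustration of "in reverse," here is a scalar C model of a scatter, using the same base-plus-scaled-index addressing. The name scatter32 and its parameter order are assumptions made for this sketch. The text does not say what happens when two enabled lanes target the same address; in this model the higher lane simply wins because of loop order, which should not be read as a statement about the hardware.

```c
#include <stdint.h>
#include <string.h>

#define LANES 16

/* Scalar model of a scatter: each enabled lane of src is stored to
   base + sign_extend(index[lane]) * scale. Disabled lanes store nothing. */
void scatter32(unsigned char *base, const int32_t index[LANES], int scale,
               uint16_t mask, const uint32_t src[LANES])
{
    for (int lane = 0; lane < LANES; lane++) {
        if (mask & (1u << lane))
            memcpy(base + (int64_t)index[lane] * scale, &src[lane],
                   sizeof src[lane]);
    }
}
```

With index values equal to the offsets of one field in successive structures, this writes a vector of results back into AOS data, one field per structure.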

Finally, gather and scatter support all the data conversions that vload and vstore, respectively, support, as well as writemasking. They don't support broadcast or store selection, since those would be useless for these instructions -- to broadcast in a gather, just set all the index fields to the same value (a partial broadcast is performed in Figure 12), and scatter can similarly easily perform store selection.

Another important feature is the ability to queue data efficiently with the vcompress and vexpand instructions. For vcompress, the writemask-enabled elements of the source vector are stored sequentially in memory, as in Figure 13; for vexpand, the writemask-enabled elements of the destination are loaded from a sequential stretch of memory, reversing the action of vcompress, as in Figure 14. A new scalar instruction, countbits, has been added so that the number of enabled bits in a vector mask register -- and thus the number of elements stored by vcompress or loaded by vexpand -- can easily be counted.

Figure 13: vcompressd [rbx] {k1}, v0. This is a simplified representation of what is currently a two-instruction sequence.

As with all vector instructions, vcompress and vexpand can be used without specifying a writemask, in which case all elements are stored or loaded, with no compression or expansion needed. In this mode, vcompress and vexpand function as unaligned store and load.
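
The queueing behavior can be modeled in a few lines of scalar C. This is a sketch, not a definition of the instructions: compress32, expand32, and count_bits16 are names invented here, and the choice to pack the lowest enabled lane first is an assumption about element order.

```c
#include <stdint.h>

#define LANES 16

/* Models countbits on a 16-bit vector mask. */
int count_bits16(uint16_t mask)
{
    int n = 0;
    while (mask) {
        mask &= (uint16_t)(mask - 1);   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Models vcompressd [mem] {mask}, src: enabled lanes are stored back
   to back in memory. Returns the number of elements stored. */
int compress32(uint32_t *mem, uint16_t mask, const uint32_t src[LANES])
{
    int n = 0;
    for (int lane = 0; lane < LANES; lane++)
        if (mask & (1u << lane))
            mem[n++] = src[lane];
    return n;
}

/* Models vexpandd dst {mask}, [mem]: enabled lanes are loaded from
   consecutive memory; disabled lanes are left alone (assumed).
   Returns the number of elements loaded. */
int expand32(uint32_t dst[LANES], uint16_t mask, const uint32_t *mem)
{
    int n = 0;
    for (int lane = 0; lane < LANES; lane++)
        if (mask & (1u << lane))
            dst[lane] = mem[n++];
    return n;
}
```

Advancing the queue pointer by count_bits16(mask) elements after each compress32 call is what lets successive vectors append to the same queue. With a mask of 0xFFFF, both functions reduce to a plain 16-element store and load.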

Finally, the bsf and bsr bit-scan instructions have been enhanced. Where the existing bsf instruction finds the first 1-bit starting from bit 0 and scanning up, the new bsfi instruction finds the first 1-bit starting from the bit above the bit specified by the destination operand. This allows bsfi to continue a search started with bsf, without any bit-clearing overhead. The bsri instruction similarly provides a starting point for reverse bit scans. These instructions are useful for parallel-to-serial conversion when the results of a vector operation must be processed serially, as we will see when we look at rasterization.
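
A scalar C model shows how the pair supports parallel-to-serial conversion. The function names below are made up for this sketch, and the -1 "not found" return value is a convention of the model only; the text above doesn't describe how bsfi reports that no further bit is set.

```c
#include <stdint.h>

/* Models bsf on a 16-bit mask: index of the lowest set bit,
   or -1 if the mask is zero (a convention of this model). */
int bit_scan_forward(uint16_t mask)
{
    for (int bit = 0; bit < 16; bit++)
        if (mask & (1u << bit))
            return bit;
    return -1;
}

/* Models bsfi: index of the first set bit strictly above prev,
   or -1 if there is none (again, a convention of this model). */
int bit_scan_forward_from(uint16_t mask, int prev)
{
    for (int bit = prev + 1; bit < 16; bit++)
        if (mask & (1u << bit))
            return bit;
    return -1;
}

/* Parallel-to-serial conversion: visit each enabled lane in turn
   without ever clearing bits in the mask. Returns the lane count. */
int enabled_lanes(uint16_t mask, int lanes_out[16])
{
    int n = 0;
    for (int lane = bit_scan_forward(mask); lane >= 0;
         lane = bit_scan_forward_from(mask, lane))
        lanes_out[n++] = lane;
    return n;
}
```

With plain bsf alone, the loop would have to clear each bit it found before scanning again; carrying the previous position forward removes that step.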

Figure 14: vexpandd v0 {k1}, [rbx]. This is a simplified representation of what is currently a two-instruction sequence.

