Getting Data Into and Out of Vector Format
The instructions covered so far are the heart of Larrabee's data-crunching capabilities, but by themselves they'd require all their input and output to be arranged in structure-of-arrays (SOA) form, which would be unfortunate because most data is in array-of-structures (AOS) form -- not least a lot of graphics data, such as vertex arrays. Since Larrabee's initial use will be as the processor for a graphics card, it's obviously essential to be able to get data into and out of SOA format efficiently, and LRBni adds three sorts of instructions for this purpose. Of these, first and most important are the gather/scatter instructions. The key to gather is that it lets you load each element of the destination vector from any memory address, independent of where the other elements are being loaded from, as in Figure 12, which I'll discuss shortly. If you think of this as performing a separate scalar load for each element, it's obvious why it's so useful for vectorization -- it's the vector load instruction for cases where each of the 16 streams has a different data source.
Consider the case of checksumming an int32 array. If it's just one array, you can process it 16 values at a time, using the normal vector load instruction, vload, followed by vaddpi, to sum 16 values at a pop; or you could just do a load-op vaddpi, as in Listing One. Then, at the end, you can sum together the 16 values you've accumulated, and you're done. (If the array length weren't a multiple of 16, you'd use the writemask to do a partial sum at the end.)
; Partial code to calculate an array checksum, summing
; 16 elements at a time; code after the loop to do a final
; sum of the 16 partial sums would also be required.
; On entry:
; rbx points to the base of the array to sum.
; rcx is how many elements to sum.
; On exit, v0 contains the 16 partial sums.
vxorpi v0, v0, v0
shr rcx, 4 ; do 16 at a time
ChecksumLoop:
vaddpi v0, v0, [rbx]
add rbx, 64
dec rcx
jnz ChecksumLoop
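For readers who find C easier to follow than LRBni assembly, the whole checksum, including the final sum and the partial tail that the listing leaves out, can be modeled in scalar code. This is only a sketch of the idea, not Larrabee code: the name checksum16 and the LANES constant are made up for illustration, and unsigned arithmetic is assumed so that overflow simply wraps around.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 16   /* elements per vector register (illustrative constant) */

/* Scalar model of a 16-lane checksum: lane i accumulates elements
   i, i+16, i+32, ... and the 16 lanes are summed at the end. */
uint32_t checksum16(const uint32_t *data, size_t count)
{
    uint32_t partial[LANES] = {0};            /* plays the role of v0 */
    size_t full = count / LANES * LANES;      /* elements covered by whole vectors */

    assert(data != NULL || count == 0);

    /* Main loop: one "vector add" per 16 elements. */
    for (size_t base = 0; base < full; base += LANES)
        for (size_t lane = 0; lane < LANES; lane++)
            partial[lane] += data[base + lane];

    /* Tail: a writemask would enable only the first (count - full) lanes. */
    for (size_t lane = 0; lane < count - full; lane++)
        partial[lane] += data[full + lane];

    /* Final sum of the 16 partial sums. */
    uint32_t sum = 0;
    for (size_t lane = 0; lane < LANES; lane++)
        sum += partial[lane];
    return sum;
}
```

The inner loops over lanes are what the hardware does in a single instruction; everything else maps onto the scalar loop control in the listing.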
If, however, the value you were checksumming was a field in a structure, so a skip was required between each addition, the vgatherd instruction would allow you to parallelize in either of two different ways. You could gather 16 fields at a time from the array, as in Listing Two.
; Partial code to calculate the checksum of a specific field in
; an array of structures, summing 16 elements at a time; code
; after the loop to do a final sum of the 16 partial sums would
; also be required.
; On entry:
; v2 contains the offsets of the first 16 checksum fields
; in the array relative to rbx.
; rcx is how many elements to sum.
; On exit, v0 contains the 16 partial sums.
vxorpi v0, v0, v0
shr rcx, 4 ; do 16 at a time
ChecksumLoop:
vgatherd v1 {k0}, [rbx + v2]
vaddpi v0, v0, v1
; step to the next 16 values to checksum
vaddpi v2, v2, [Mem_Structure_Size_Times_16] {1to16}
dec rcx
jnz ChecksumLoop
Or, more generally, you could process 16 different streams and do 16 sums at once, one from each of 16 different arrays; you'd gather 16 values, one from each array, and then vaddpi them, as in Listing Three. When ChecksumLoop in Listing Three finishes, you will have accumulated the 16 sums for the 16 arrays. The structure size can even be different for each array. (Note that Listings Two and Three are almost identical; gather is so flexible that the same gather-based code can do many different things, depending on the initial conditions.)
; Calculates checksums of a specific field in 16 arrays of structures in parallel.
; On entry:
; v2 contains the 16 offsets of the checksum field in each of the
; 16 arrays relative to rbx.
; rcx is how many elements to sum.
; On exit, v0 contains the 16 checksums.
vxorpi v0, v0, v0
ChecksumLoop:
vgatherd v1 {k0}, [rbx + v2]
vaddpi v0, v0, v1
; step to the next value in each array
vaddpi v2, v2, [Mem_Structure_Sizes]
dec rcx
jnz ChecksumLoop
Okay, those last two code listings require a bit of explanation, because the gather/scatter instructions do not follow normal addressing rules. The address for a gather or scatter is formed from the sum of a base register and the elements of a scaled index vector register, as in Figure 12. This is the only case in which a vector register can be used to address memory. More precisely, for each element to be loaded, the address is the sum of the base register and the sign-extension to 64 bits of the corresponding element of the index vector register, optionally scaled by 2, 4, or 8. Note that the 32-bit size of the elements used for the index results in a 4 GB limit on the range for gather/scatter (or larger if scaling by 2, 4 or 8).
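The addressing rule is easier to see in scalar form. The following C sketch models a 32-bit gather as 16 independent loads, each from the base address plus the sign-extended, scaled index for that lane, and then uses it to mimic the field-checksum loop. The names gather32 and checksum_field are hypothetical, the 16-bit mask stands in for a vector mask register, and the model assumes that masked-off lanes keep their previous contents and that all offsets fit comfortably in 32 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LANES 16

/* Scalar model of a 32-bit gather: each lane enabled in mask loads from
   base + sign_extend_to_64(index[lane]) * scale. */
void gather32(uint32_t dst[LANES], const void *base,
              const int32_t index[LANES], unsigned scale, uint16_t mask)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);

    for (int lane = 0; lane < LANES; lane++) {
        if (!((mask >> lane) & 1u))
            continue;                         /* masked-off lane: left unchanged */
        int64_t offset = (int64_t)index[lane] * (int64_t)scale;
        memcpy(&dst[lane], (const unsigned char *)base + offset,
               sizeof dst[lane]);
    }
}

/* Model of a gather-based loop that sums one 32-bit field across an array
   of structures, 16 structures per iteration. count must be a multiple
   of 16 in this sketch. */
uint32_t checksum_field(const void *base, int32_t field_offset,
                        int32_t stride, size_t count)
{
    uint32_t partial[LANES] = {0};            /* v0: 16 partial sums */
    uint32_t loaded[LANES];                   /* v1: gathered fields */
    int32_t  index[LANES];                    /* v2: byte offsets from base */

    assert(count % LANES == 0);
    for (int lane = 0; lane < LANES; lane++)
        index[lane] = field_offset + lane * stride;

    for (size_t n = count / LANES; n > 0; n--) {
        gather32(loaded, base, index, 1, 0xFFFF);
        for (int lane = 0; lane < LANES; lane++) {
            partial[lane] += loaded[lane];
            index[lane] += stride * LANES;    /* step to the next 16 structures */
        }
    }

    uint32_t sum = 0;
    for (int lane = 0; lane < LANES; lane++)
        sum += partial[lane];
    return sum;
}
```

Giving each lane its own starting offset and its own stride, instead of a shared stride times 16, turns the same loop into the 16-arrays-at-once variant.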
What if your gather targets aren't all contained within a 4 GB range? Then you need to wrap another loop around the basic gather loop, in order to step through the 4 GB ranges touched by the gather addresses, which is somewhat more complicated, but not unduly so.
All of the above applies for scatters, but in reverse.
Finally, gather and scatter support all the data conversions that vload and vstore, respectively, support, as well as writemasking. They don't support broadcast or store selection, since those would be useless for these instructions -- to broadcast in a gather, just set all the index fields to the same value (a partial broadcast is performed in Figure 12), and scatter can similarly easily perform store selection.
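To make the last point concrete, here is a scalar C sketch of how a broadcast falls out of gather and how a single-element store falls out of scatter. All function names are invented for the example, the index values are plain byte offsets with no scaling, and the model processes lanes in ascending order, which is a property of this sketch rather than a statement about the hardware.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 16

/* Scalar model of gather with byte offsets (no scaling, for brevity). */
void gather32(uint32_t dst[LANES], const void *base,
              const int32_t index[LANES], uint16_t mask)
{
    for (int lane = 0; lane < LANES; lane++)
        if ((mask >> lane) & 1u)
            memcpy(&dst[lane], (const unsigned char *)base + (int64_t)index[lane],
                   sizeof dst[lane]);
}

/* Scalar model of scatter: the same addressing, but storing instead of loading. */
void scatter32(void *base, const int32_t index[LANES],
               const uint32_t src[LANES], uint16_t mask)
{
    for (int lane = 0; lane < LANES; lane++)
        if ((mask >> lane) & 1u)
            memcpy((unsigned char *)base + (int64_t)index[lane],
                   &src[lane], sizeof src[lane]);
}

/* Broadcast as a gather: every index field holds the same offset. */
void broadcast32(uint32_t dst[LANES], const uint32_t *src)
{
    int32_t index[LANES] = {0};               /* all lanes read *src */
    gather32(dst, src, index, 0xFFFF);
}

/* Storing one selected element as a scatter: only that lane is enabled. */
void store_lane(uint32_t *dst, const uint32_t src[LANES], int lane)
{
    int32_t index[LANES] = {0};               /* every lane targets *dst */
    assert(lane >= 0 && lane < LANES);
    scatter32(dst, index, src, (uint16_t)(1u << lane));
}
```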
Another important feature is the ability to queue data efficiently with the vcompress and vexpand instructions. For vcompress, the writemask-enabled elements of the source vector are stored sequentially in memory, as in Figure 13; for vexpand, the writemask-enabled elements of the destination are loaded from a sequential stretch of memory, reversing the action of vcompress, as in Figure 14. A new scalar instruction, countbits, has been added so that the number of enabled bits in a vector mask register -- and thus the number of elements stored by vcompress or loaded by vexpand -- can easily be counted.
As with all vector instructions, vcompress and vexpand can be used without specifying a writemask, in which case all elements are stored or loaded, respectively, with no compression or expansion needed. In this mode, vcompress and vexpand function as unaligned store and load.
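The queueing behavior can be sketched in scalar C as follows. This is a model under stated assumptions, not the instructions themselves: the names compress32, expand32, and count_bits16 are hypothetical, the mask is a plain 16-bit integer, and enabled elements are assumed to be packed in ascending lane order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 16

/* Number of 1-bits in a 16-bit mask, the job the text assigns to countbits. */
unsigned count_bits16(uint16_t mask)
{
    unsigned n = 0;
    while (mask) {
        mask = (uint16_t)(mask & (mask - 1)); /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Model of vcompress: store the enabled lanes of src contiguously at dst.
   Returns the number of elements stored. */
size_t compress32(uint32_t *dst, const uint32_t src[LANES], uint16_t mask)
{
    size_t n = 0;
    assert(dst != NULL && src != NULL);
    for (int lane = 0; lane < LANES; lane++)
        if ((mask >> lane) & 1u)
            dst[n++] = src[lane];
    return n;
}

/* Model of vexpand: load the enabled lanes of dst from contiguous memory
   at src, leaving disabled lanes unchanged. Returns the number loaded. */
size_t expand32(uint32_t dst[LANES], const uint32_t *src, uint16_t mask)
{
    size_t n = 0;
    assert(dst != NULL && src != NULL);
    for (int lane = 0; lane < LANES; lane++)
        if ((mask >> lane) & 1u)
            dst[lane] = src[n++];
    return n;
}
```

A queue built on this would advance its write pointer by count_bits16(mask) elements after each compress, which is why a fast bit count matters.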
Finally, the bsf and bsr bit-scan instructions have been enhanced. Where the existing bsf instruction finds the first 1-bit starting from bit 0 and scanning up, the new bsfi instruction finds the first 1-bit starting from the bit above the bit specified by the destination operand. This allows bsfi to continue a search started with bsf, without any bit-clearing overhead. The bsri instruction similarly provides a starting point for reverse bit scans. These instructions are useful for parallel-to-serial conversion when the results of a vector operation must be processed serially, as we will see when we look at rasterization.
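As a rough illustration of the parallel-to-serial pattern, the C sketch below walks the set bits of a 16-bit mask by repeatedly scanning from just above the last bit found, without ever clearing bits in the mask. The function names are invented, and returning -1 when no bit is found is a convention of this model only; it says nothing about how the real instructions report an empty scan.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of a forward scan that starts at the bit above 'after'.
   Passing after = -1 scans from bit 0, like a plain bsf.
   Returns the bit index found, or -1 if no set bit remains
   (an assumption of this sketch). */
int scan_forward_after(uint16_t mask, int after)
{
    assert(after >= -1 && after < 16);
    for (int bit = after + 1; bit < 16; bit++)
        if ((mask >> bit) & 1u)
            return bit;
    return -1;
}

/* Visit every enabled lane in ascending order, recording lane numbers in
   out[]. Returns how many lanes were enabled. */
size_t enabled_lanes(uint16_t mask, int out[16])
{
    size_t n = 0;
    for (int bit = scan_forward_after(mask, -1);
         bit >= 0;
         bit = scan_forward_after(mask, bit))
        out[n++] = bit;                       /* per-lane serial work goes here */
    return n;
}
```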


