Rasterization on Larrabee

Intra-tile Rasterization: 16×16 Blocks

Intra-tile rasterization starts at the level of a whole tile. Tile size varies, depending on factors such as pixel size, but let's assume that we're working with a 64×64 tile. Given that starting size, we calculate the edge equation values at the 16 trivial reject and trivial accept corners of the 16×16 blocks that make up the tile, just as we did at the tile level -- but now we do these calculations 16 at a time. Let's start with the trivial reject test.

First, we calculate which corner of the tile is the trivial reject corner, calculate the value of the edge equation at that point, and set up a table containing the 16 steps of the edge equation from the value at the tile trivial reject corner to the trivial reject corners of the 16×16 blocks that make up the tile. The signs of the 16 values that result tell us which of the blocks are entirely outside the edge, and can therefore be ignored, and which are at least partially accepted, and therefore have to be evaluated further.

In Figure 18, for example, we calculate the trivial reject values for the black edge by stepping from the value we calculated earlier at the trivial reject corner of the tile, and eliminate five of the 16×16 blocks that make up the tile. The trivial reject corner for the tile is shown in red, and the 16 trivial reject corners for the blocks are shown in white. The gray blocks are those that are rejected against the black edge. You can see that their trivial reject corners all have positive edge equation values. The other 11 blocks have negative values at their trivial reject corners, so they're at least partially inside the black edge.

Figure 18: The trivial reject tests for the 16 16×16 blocks in the tile.

To make this process clearer, in Figure 19, the arrows represent the 16 steps that are added to the black edge's tile trivial reject value. Each of these steps is just an add, and we can do 16 adds with a single vector instruction, so it takes only one vector instruction to generate the 16 trivial reject values, and one more vector instruction -- a compare -- to test them.

Figure 19: The steps from the tile trivial reject corner for the black edge to the trivial reject corners of the 16 16×16 blocks for the black edge.

All this comes down to just setting up the right values, then doing one vector add and one vector compare. Remember that the edge equation is of the form Bx + Cy; therefore, to step a distance across the tile, we just set x and y to the horizontal and vertical components of that distance, evaluate the equation, and add that to the starting value. So all we're doing in Figure 19 is adding the 16 values that step the edge equation to the 16 trivial reject corners. For example, to get the edge equation value at the trivial reject corner of the upper-left block, we'd start with the value at the tile trivial reject corner, and add the amount that the edge equation changes for a step of -48 pixels in x and -48 pixels in y, as shown by the yellow arrow in Figure 20. To get the edge equation value at the trivial reject corner of the lower-left block, we'd instead add the amount that the edge equation changes for a step of -48 pixels in x only, as shown by the purple arrow. And that's really all there is to the Larrabee rasterizer -- it's just a matter of stepping the edge equation values around the tile so as to determine what blocks and pixels are inside and outside the edges.

Figure 20: Examples of stepping the edge equation Bx + Cy.

Once again, we'll do trivial accept tests as well as trivial reject tests. In Figure 21, we've calculated the trivial accept values and determined that 6 of the 16×16 blocks are trivially accepted for the black edge, and 10 of them are not trivially accepted. We know this because the values of the equation of the black edge at the trivial accept corners of the 6 pink blocks are negative, so those blocks are entirely inside the edge; while the values at the trivial accept corners of the other 10 blocks are positive, so those blocks are not entirely inside the black edge. The trivial accept values for the blocks can be calculated by stepping in any of several different ways: from the tile trivial accept corner, from the tile trivial reject corner, or from the 16 trivial reject values for the blocks. Regardless, it again takes only one vector instruction to step and one vector instruction to test the results. Combined with the results of the trivial reject test, we also know that 5 blocks are partially accepted.

Figure 21: The trivial accept tests for the 16 16×16 blocks in the tile.

The 16 trivial reject and trivial accept values can be calculated with a total of just two vector adds per edge, using tables generated at triangle set-up time,. They can be tested with two vector compares, which generate mask registers describing which blocks are trivially rejected and which are trivially accepted. We do this for the three edges, ANDing the results to create masks for the triangle, do some bit manipulation on the masks so they describe trivial and partial accept, and bit-scan through the results to find the trivially and partially accepted blocks. Each 16×16 that's trivially accepted against all three edges becomes one bin command; again, no further rasterization is needed for pixels in trivially accepted blocks.

This is not obvious stuff, so let's take a moment to visualize the process. First, let's just look at one edge and trivial accept.

For a given edge, say edge number 1, we take the edge equation value at the tile trivial accept corner, broadcast it out to a vector, and vector add it to the precalculated values of the 16 steps to the trivial accept corners of the 16×16 blocks. This gives us the edge 1 values at the trivial accept corners of the 16 blocks, as in Figure 22. (The values shown are illustrative, and are not taken from a real triangle.)

Figure 22: Edge 1 trivial accept tests for the 16 16×16 blocks.

The step values shown on the second line in Figure 22 are computed when the triangle is set up, using a vector multiply and a vector multiply-add. At the top level of the rasterizer -- testing the 16×16 blocks that make up the tile, as in Figure 22 -- those set-up instructions are just direct additional rasterization costs because the top level only gets executed once per triangle, so it would be accurate to add 6 instructions to the cost of the 16×16 code we'll look at shortly in Listings One and Two. However, as the hierarchy is descended, the tables for the lower levels (16×16-to-4×4 and 4×4-to-mask) get reused multiple times. For example, when descending from 16×16 to 4×4 blocks, the same table is used for all partial 16×16 blocks in the tile. Likewise, there is only one table for generating masks for partial 4×4 blocks, so the additional cost per iteration in Listing 3 due to table set-up would be 2 instructions divided by the number of partial 4×4 blocks in the tile. This is generally much less than 1 instruction per 4×4 block per edge, although it gets higher the smaller the triangle is.

Then we compare the results to 0; as in Figure 22, this generates a mask that contains 1-bit where a block trivial accept corner is less than zero -- that is, wherever a block trivial accept corner is inside the edge, meaning the whole block is inside the edge.

We do this for each of the three edges, ANDing the masks together to produce a composite mask, as in Figure 23. (The ANDing happens automatically as part of the mask updating, as we'll see later when we look at the code.) This composite mask has 1-bit only where a block trivially accepts against all three edges, which is to say, for blocks that are fully trivially accepted and require no further rasterization.

Figure 23: Generating the composite trivial accept mask for the three edges.

Now we can bit-scan through the composite mask, picking out the 16×16 blocks that are trivially accepted and storing a draw-block command for each of them in the rasterizer output queue. The first bit-scan would find that block 0 is trivially accepted, the second bit-scan would find that block 4 is trivially accepted, and the third bit-scan would find that there were no more trivially accepted blocks, as in Figure 24.

Figure 24: Scanning through the composite trivial accept mask to find trivially accepted 16×16 blocks.

This sort of parallel-to-serial conversion -- identifying and processing relevant elements of a vector result efficiently -- is key to getting good performance out of semi-vectorizable code. This is exactly why the bsfi instruction was added as part of the Larrabee New Instructions. Bsfi is based on the existing bit-scan instruction, but is enhanced to allow starting the scan at any bit, rather than only at bit 0.

We're finally ready to look at some real code. Listing One shows just the trivial accept code for the 16×16 blocks in a 64×64 tile; this will suffice to illustrate what we've discussed, and anything more would just complicate the explanation. Let's map these instructions to the steps we just described.

```   ; On entry:
; rsi: base pointer to thread data
; v3: steps from edge 1 tile trivial accept corner to corners of 16x16 blocks
; v4: steps from edge 2 tile trivial accept corner to corners of 16x16 blocks
; v5: steps from edge 3 tile trivial accept corner to corners of 16x16 blocks

; Step to edge values at 16 16x16 block trivial accept corners

; See if each trivial accept corner is inside all three edges
; K1 is set by the first instruction, then the result of the
; second instruction is ANDed with k1, and likewise for the third instruction
vcmpltpi k1, v0, [rsi+ConstantZero]{1to16}
vcmpltpi k1{k1}, v1, [rsi+ConstantZero]{1to16}
vcmpltpi k1{k1}, v2, [rsi+ConstantZero]{1to16}

; Get the mask; 1-bits are trivial accept corners that are
; inside all three edges
kmov eax, k1

; Loop through 1-bits, issuing a draw-16x16-block command
; for each trivially accepted 16x16 block
bsf ecx, eax
jnz TrivialAcceptDone

TrivialAcceptLoop:
; <Store draw-16x16-block command, along with (x,y) location>

bsfi ecx, eax
jnz TrivialAcceptLoop

TrivialAcceptDone:

```
Listing One: Trivial accept code for 16×16 blocks.

First, there are three vector adds to step the three edge values to the trivial accept corners of the 16×16s; these are the three vaddpi instructions.

Second, there are three vector compares to test the signs of the results to see if the blocks are trivially accepted against the edges; these are the three vcmpltpi instructions. The composite mask is accumulated in mask register k1.

Next, the kmov copies the mask of trivially accepted blocks into register eax; and finally, there is a loop to bit-scan eax to pick out the trivially accepted 16×16s one by one so they can be processed as blocks without further rasterization.

It takes only six vector instructions, one mask move, and two scalar instructions to set up, plus two scalar instructions per trivially accepted block, to find all the trivially accepted 16×16s in a tile. And, in truth, it doesn't even take that many instructions. What's shown in Listing 1 is a version of the trivial accept code that uses normal vector instructions, but there's actually a special instruction specifically designed to speed up rasterization -- vaddsetspi -- that both adds and sets the mask to the sign. I haven't mentioned it because I thought the discussion would be clearer and more broadly relevant if I used the standard vector instructions, but as you can see in Listing Two, if we use vaddsetspi, we need only three vector instructions to find all the trivially accepted 16×16 blocks.

```   ; On entry:
; rsi: base pointer to thread data
; v3: steps from edge 1 tile trivial accept corner to corners of 16x16 blocks
; v4: steps from edge 2 tile trivial accept corner to corners of 16x16 blocks
; v5: steps from edge 3 tile trivial accept corner to corners of 16x16 blocks

; Step to edge values at 16 16x16 block trivial accept corners &
; see if each trivial accept corner is inside all three edges
kxnor k1, k1		; set mask to 0xFFFF
; The result of each instruction is ANDed with k1

; Get the mask; 1-bits are trivial accept corners that are
; inside all three edges
kmov eax, k1

; Loop through 1-bits, issuing a draw-16x16-block command
; for each trivially accepted 16x16 block
bsf ecx, eax
jnz TrivialAcceptDone

TrivialAcceptLoop:
; <Store draw-16x16-block command, along with (x,y) location>

bsfi ecx, eax
jnz TrivialAcceptLoop
TrivialAcceptDone:

```
Listing Two: Trivial accept code for 16×16 blocks using vaddsetspi.

There may also be fewer active edges due to trivial accepts at the tile level (remember, edges that are trivially accepted for a tile can be -- and are -- ignored when rasterizing within the tile), and that would result in still fewer instructions. This is another case where software can boost performance by adapting to the data.

More Insights

 To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.