### Descending the Rasterization Hierarchy

We're now done rasterizing trivially accepted 16×16 blocks, but we still have to handle partially covered 16×16 blocks, and by this point, it should be obvious how that works. We descend into each partial 16×16 to evaluate the 4×4 blocks it contains, just as we descended into the 64×64 tile to evaluate the 16×16s it contained. Again, we put trivially accepted 4×4s directly into the bin. Partially accepted 4×4s, however, need to be processed into pixel masks. This is done with a vector add of a precalculated, position-independent table for each edge, stepping from the 4×4's trivial reject corner to each of the 16 pixel centers. The sign bits of the three results can then be ANDed together to form the pixel mask.
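A scalar sketch of this mask generation may help make the vector operation concrete. This assumes the usual edge-function form E(x,y) = a·x + b·y + c with negative values meaning "inside," as the article uses; the function names are mine, not Larrabee's:

```c
#include <stdint.h>

/* Given the edge-equation value at the 4x4 block's trivial reject corner
 * and a precalculated, position-independent step table (one entry per
 * pixel, stepping from that corner to each pixel center), produce a
 * 16-bit mask with a 1-bit for every pixel center inside the edge.
 * On Larrabee this is one 16-wide vector add plus sign-bit extraction. */
uint16_t edge_pixel_mask(int corner_value, const int step_table[16]) {
    uint16_t mask = 0;
    for (int i = 0; i < 16; i++) {
        int value = corner_value + step_table[i];  /* the vector add */
        if (value < 0)                             /* sign bit = inside */
            mask |= (uint16_t)(1u << i);
    }
    return mask;
}

/* The final pixel mask for the 4x4 is the AND of the three edges' masks. */
uint16_t triangle_pixel_mask(const int corners[3], int steps[3][16]) {
    return edge_pixel_mask(corners[0], steps[0]) &
           edge_pixel_mask(corners[1], steps[1]) &
           edge_pixel_mask(corners[2], steps[2]);
}
```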

Figure 25 illustrates the calculation of the pixel mask. The red arrows show how the 16 values in the precalculated, position-independent step table for each edge are added to the value at the trivial reject corner of the 4×4, shown in red, so that a single add evaluates the black edge's equation at the 16 pixel centers. The pixels shown in blue have negative values and are inside the edge. Figure 25 also shows the actual mask that is generated, with a 1-bit for each pixel inside the edge, and a 0-bit for each pixel that's outside.

Figure 25 is a good demonstration of how vector efficiency can fall off with partially vectorizable tasks, and why Larrabee uses 16-wide vectors rather than something larger. We will often do all the work needed to generate the pixel mask for a 4×4, only to find that many or most of the pixels are uncovered, so much of the work has been wasted. And the wider the vector, the lower the ratio of covered to uncovered becomes, on average, and the more work is wasted.

Listing Three shows code for calculating the pixel masks for all the partially covered 4×4 blocks in a partially covered 16×16 block. Note that this is the only case in which per-pixel work is performed; all solidly covered-block cases are handled without generating a pixel mask.

```asm
; On entry:
;   rbx: pointer to output buffer
;   rsi: base pointer to thread data
;   k1: mask of partially accepted 4x4 blocks in the current 16x16
;   v0: edge 1 trivial reject corner values for 4x4 blocks in the current 16x16
;   v1: edge 2 trivial reject corner values for 4x4 blocks in the current 16x16
;   v2: edge 3 trivial reject corner values for 4x4 blocks in the current 16x16

; Store values at corners of 16 4x4s in 16x16 for indexing into
        vstored [rsi+Edge1TrivialRejectCornerValues4x4], v0
        vstored [rsi+Edge2TrivialRejectCornerValues4x4], v1
        vstored [rsi+Edge3TrivialRejectCornerValues4x4], v2

; Load step tables from corners of 4x4s to pixel centers
        vloadd  v3, [rsi+Edge1PixelCenterTable]
        vloadd  v4, [rsi+Edge2PixelCenterTable]
        vloadd  v5, [rsi+Edge3PixelCenterTable]

; Loop through 1-bits from trivial reject test on 16x16 block (trivial accepts
; have been XORed out earlier), descending to rasterize each partially-accepted 4x4
        kmov    eax, k1
        bsf     ecx, eax
        jz      Partial4x4Done

Partial4x4Loop:
; See if each of 16 pixel centers is inside all three edges
; Use rcx, the index from the bit-scan of the current partially accepted 4x4,
; to index into the 4x4 trivial reject corner values generated at the 16x16
; level, and pick out the trivial reject corner values for the current
; partially accepted 4x4
; k2 is set by the first instruction, then the result of the second
; instruction is ANDed with k2, and likewise for the third instruction
        vcmpgtpi k2, v3, [rsi+Edge1TrivialRejectCornerValues4x4+rcx*4]{1to16}
        vcmpgtpi k2 {k2}, v4, [rsi+Edge2TrivialRejectCornerValues4x4+rcx*4]{1to16}
        vcmpgtpi k2 {k2}, v5, [rsi+Edge3TrivialRejectCornerValues4x4+rcx*4]{1to16}

; Store the mask
        kmov    edx, k2
        mov     [rbx], dx
; <Store the (x,y) location and advance rbx>
        bsfi    ecx, eax
        jnz     Partial4x4Loop

Partial4x4Done:
```

Listing Three first stores, with the three **vstored** instructions, the trivial reject corner values for the 16 4×4 blocks for each of the three edges. These values were generated earlier by the 16×16 code that ran just before this code; they're the values we'll step relative to in order to generate the final pixel mask for each partial 4×4.

Next, the step tables for the three edges are loaded with the three **vloadd** instructions. (Actually, these will probably just be preloaded into registers when the triangle is set up and remain there throughout the rasterization of the triangle, but loading them here makes it clearer what's going on.)
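How such a step table might be built at triangle setup time is worth spelling out. This is a sketch under my own assumptions: I place the trivial reject corner at the block's top-left and scale by 2 to keep half-pixel offsets in integers (the real corner choice depends on the edge's slope signs, which I gloss over here):

```c
/* Build the position-independent step table for one edge of a 4x4 block.
 * Stepping the edge function E(x,y) = a*x + b*y + c from the block corner
 * to a point (dx, dy) away changes E by a*dx + b*dy -- independent of
 * where the block sits on screen, so the table can be computed once per
 * edge at triangle setup and reused for every partially covered 4x4.
 * Illustrative assumption: the corner is the block's top-left, so pixel
 * centers sit at offsets (col + 0.5, row + 0.5); everything is scaled
 * by 2 so the table stays in integers. */
void build_step_table(int a, int b, int step_table[16]) {
    for (int row = 0; row < 4; row++)
        for (int col = 0; col < 4; col++)
            step_table[row * 4 + col] = a * (2 * col + 1) + b * (2 * row + 1);
}
```

The corner values being stepped against would need the same doubled scale for the sums to be meaningful.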

Once that set-up is complete, the code scans through the partial accept mask, which was also generated earlier by the 16×16 code. First, the mask is copied to **eax** with the **kmov** instruction, and then **bsfi** is used to find each partially accepted 4×4 block in turn.

For each partially covered 4×4 found, the code does three vector compares to evaluate the three edge equations at the 16 pixel centers. Note that each **vcmpgtpi** uses the index of the current 4×4 to retrieve the trivial reject corner value for that 4×4 from the vector generated at the 16×16 level. While we could directly add and then test the signs of the edge equations, as I described earlier, it's more efficient to instead rearrange the calculations by flipping the signs of the step tables so that the testing can be done with a single compare per edge. Put another way, instead of adding two values and seeing if the result is less than zero:

m + n < 0

it's equivalent and more efficient to compare one value directly to the negation of the other value:

m < -n

(In case you're wondering, we couldn't use this trick at the 16×16 and 64×64 levels of the hierarchy because in those cases, in addition to getting the result of the test, we also need to keep the result of the add, so we can pass it down the hierarchy as the new corner value.)
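The rearrangement is easy to verify in scalar form. In this sketch (function names are mine), the step table is pre-negated at setup, so each **vcmpgtpi** in Listing Three reduces to a bare compare of table entry against the broadcast corner value:

```c
#include <stdint.h>

/* Mirrors one vcmpgtpi in Listing Three: the pre-negated step table is
 * compared against the (broadcast) corner value; a mask bit is set where
 * neg_step_table[i] > corner, which is exactly where
 * corner + step_table[i] < 0 -- same test, no add in the inner loop. */
uint16_t edge_mask_by_compare(int corner, const int neg_step_table[16]) {
    uint16_t mask = 0;
    for (int i = 0; i < 16; i++)
        if (neg_step_table[i] > corner)
            mask |= (uint16_t)(1u << i);
    return mask;
}
```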

The result of the compares is the final composite pixel mask for that 4×4, which can be stored in the rasterization output queue. And that's it. Once things are set up, the cost to rasterize each partial 4×4 is three vector instructions, a mask instruction, and a handful of scalar instructions. Once again, if the whole tile was trivially accepted against one or two edges, then proportionately fewer vector instructions would be required.
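The overall structure of Listing Three's loop can be sketched in scalar C. Names here are mine, the bit-scan uses a gcc/clang builtin in place of **bsf**/**bsfi**, and the three edges' compares are chained with an AND just as the `{k2}`-masked **vcmpgtpi** instructions chain their results:

```c
#include <stdint.h>

/* Emulates Listing Three: scan the partial-accept mask for 4x4 blocks,
 * and for each one AND together the three edges' compare masks (step
 * tables are assumed pre-negated, so corner + step < 0 becomes a bare
 * compare). Returns the number of masks written to out_masks. */
int rasterize_partial_4x4s(uint16_t partial_mask,
                           int corners[3][16],   /* per-edge corner values, one per 4x4 */
                           int neg_steps[3][16], /* per-edge pre-negated step tables */
                           uint16_t out_masks[16]) {
    int count = 0;
    while (partial_mask) {
        /* bit-scan for the next partially accepted 4x4 (bsf/bsfi) */
        int block = __builtin_ctz(partial_mask);  /* gcc/clang builtin */
        partial_mask &= (uint16_t)(partial_mask - 1);

        uint16_t mask = 0xFFFF;
        for (int e = 0; e < 3; e++) {             /* one compare per edge */
            uint16_t edge_mask = 0;
            for (int i = 0; i < 16; i++)
                if (neg_steps[e][i] > corners[e][block])
                    edge_mask |= (uint16_t)(1u << i);
            mask &= edge_mask;                    /* {k2} mask chaining */
        }
        out_masks[count++] = mask;                /* store to output queue */
    }
    return count;
}
```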

One nice feature of the Larrabee rasterization approach is that MSAA ("multisample antialiasing") falls out for one extra vector compare per 16 samples per edge because it's just a different step from the trivial reject corner. (That's for partially accepted 4×4 blocks; for all trivially accepted blocks, there is no incremental cost for rasterization of MSAA.) In Figure 26, each pixel is divided into four samples, and the step shown is to one of the four samples rather than to the center of each pixel. This takes one compare, so a total of four compares would be required to generate the full MSAA sample mask for 16 pixels.
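In scalar form, the MSAA extension is just more compare passes against more step tables. This sketch assumes 4x MSAA with one pre-negated table per sample position (the offsets and names are illustrative):

```c
#include <stdint.h>

/* For 4x MSAA, each edge gets four step tables -- one per sample
 * position -- each stepping from the 4x4 trivial reject corner to that
 * sample's offset within every pixel instead of to the pixel center.
 * Each table costs one 16-wide compare, so four compares per edge
 * yield the full 64-sample mask for the block. Tables are assumed
 * pre-negated as in Listing Three, so a bare compare suffices. */
uint64_t edge_sample_mask_4x(int corner, int neg_tables[4][16]) {
    uint64_t mask = 0;
    for (int s = 0; s < 4; s++) {            /* one compare per sample position */
        for (int i = 0; i < 16; i++)
            if (neg_tables[s][i] > corner)   /* sample inside the edge */
                mask |= 1ull << (s * 16 + i);
    }
    return mask;
}
```

ANDing three such 64-bit masks, one per edge, gives the final sample mask for the 4×4.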