# Rasterization on Larrabee

### Descending the Rasterization Hierarchy

We're now done rasterizing trivially accepted 16×16 blocks, but we still have to handle partially covered 16×16 blocks, and by this point, it should be obvious how that works. We descend into each partial 16×16 to evaluate the 4×4 blocks it contains, just as we descended into the 64×64 tile to evaluate the 16×16s it contained. Again, we put trivially accepted 4×4s directly into the bin. Partially accepted 4×4s, however, need to be processed into pixel masks. This is done with a vector add of a precalculated, position-independent table for each edge that steps from the 4×4 trivial reject corner to the center of the pixel itself. The sign bits can then be ANDed together to form the pixel mask.

Figure 25 illustrates the calculation of the pixel mask. From the trivial reject corner of the 4×4, shown in red, a single add yields the equation value for the black edge at the 16 pixel centers. The red arrows show how the 16 values in the precalculated, position-independent step table for each edge are added to the trivial reject corner of the 4×4 to evaluate the edge equation at the 16 pixel centers. The pixels shown in blue have negative values and are inside the edge. Figure 25 shows the actual mask that is generated, with a 1-bit for each pixel inside the edge, and a 0-bit for each pixel that's outside.

Figure 25: The pixel mask is calculated at pixel centers by stepping relative to the trivial reject corner for the 4×4.

Figure 25 is a good demonstration of how vector efficiency can fall off with partially vectorizable tasks, and why Larrabee uses 16-wide vectors rather than something larger. We will often do all the work needed to generate the pixel mask for a 4×4, only to find that many or most of the pixels are uncovered, so much of the work has been wasted. And the wider the vector, the lower the ratio of covered to uncovered becomes, on average, and the more work is wasted.

Listing Three shows code for calculating the pixel masks for all the partially covered 4×4 blocks in a partially covered 16×16 block. Note that this is the only case in which per-pixel work is performed; all solidly covered-block cases are handled without generating a pixel mask.

```
; On entry:
; rbx: pointer to output buffer
; rsi: base pointer to thread data
; k1: mask of partially accepted 4x4 blocks in the current 16x16
; v0: edge 1 trivial reject corner values for 4x4 blocks in the current 16x16
; v1: edge 2 trivial reject corner values for 4x4 blocks in the current 16x16
; v2: edge 3 trivial reject corner values for 4x4 blocks in the current 16x16

; Store values at corners of 16 4x4s in 16x16 for indexing into
vstored [rsi+Edge1TrivialRejectCornerValues4x4], v0
vstored [rsi+Edge2TrivialRejectCornerValues4x4], v1
vstored [rsi+Edge3TrivialRejectCornerValues4x4], v2

; Load step tables from corners of 4x4s to pixel centers

; Loop through 1-bits from trivial reject test on 16x16 block (trivial accepts have been
; XORed out earlier), descending to rasterize each partially-accepted 4x4
kmov eax, k1
bsf ecx, eax
jnz Partial4x4Done

Partial4x4Loop:
; See if each of 16 pixel centers is inside all three edges
; Use rcx, the index from the bit-scan of the current partially accepted 4x4, to index into
; the 4x4 trivial reject corner values generated at the 16x16 level, and pick out the
; trivial reject corner values for the current partially accepted 4x4
; K2 is set by the first instruction, then the result of the
; second instruction is ANDed with k2, and likewise for the third instruction
vcmpgtpi k2, v3, [rsi+Edge1TrivialRejectCornerValues4x4+rcx*4]{1to16}
vcmpgtpi k2 {k2}, v4, [rsi+Edge2TrivialRejectCornerValues4x4+rcx*4]{1to16}
vcmpgtpi k2 {k2}, v5, [rsi+Edge3TrivialRejectCornerValues4x4+rcx*4]{1to16}

kmov edx, k2
mov [rbx], dx

; <Store the (x,y) location and advance rbx>

bsfi ecx, eax
jnz Partial4x4Loop
Partial4x4Done:

```
Listing Three: Pixel mask code for partially covered 4×4 blocks.

Listing Three first stores the trivial reject corner values for the 16 4×4 blocks for the three edges. These are the values we'll step relative to in order to generate the final pixel masks for each partial 4×4 and that were generated earlier by the 16×16 code that ran just before this code. This is done with the three vstored instructions.

Next, the step tables for the three edges are loaded with the three vloadd instructions. (Actually, these will probably just be preloaded into registers when the triangle is set up and remain there throughout the rasterization of the triangle, but loading them here makes it clearer what's going on.)

Once that set-up is complete, the code scans through the partial accept mask, which was also generated earlier by the 16×16 code. First, the mask is copied to eax with the kmov instruction, and then bsfi is used to find each partially accepted 4×4 block in turn.

For each partially covered 4×4 found, the code does three vector compares to evaluate the three edge equations at the 16 pixel centers. Note that each vcmpgtpi uses the index of the current 4×4 to retrieve the trivial reject corner value for that 4×4 from the vector generated at the 16×16 level. While we could directly add and then test the signs of the edge equations, as I described earlier, it's more efficient to instead rearrange the calculations by flipping the signs of the step tables so that the testing can be done with a single compare per edge. Put another way, instead of adding two values and seeing if the result is less than zero:

```
m + n < 0

```

it's equivalent and more efficient to compare one value directly to the negation of the other value:

```
m < -n

```

(In case you're wondering, we couldn't use this trick at the 16×16 and 64×64 levels of the hierarchy because in those cases, in addition to getting the result of the test, we also need to keep the result of the add, so we can pass it down the hierarchy as the new corner value.)

The result of the compares is the final composite pixel mask for that 4×4, which can be stored in the rasterization output queue. And that's it. Once things are set up, the cost to rasterize each partial 4×4 is three vector instructions, a mask instruction, and a handful of scalar instructions. Once again, if the whole tile was trivially accepted against one or two edges, then proportionately fewer vector instructions would be required.

One nice feature of the Larrabee rasterization approach is that MSAA ("multisample antialiasing") falls out for one extra vector compare per 16 samples per edge because it's just a different step from the trivial reject corner. (That's for partially accepted 4×4 blocks; for all trivially accepted blocks, there is no incremental cost for rasterization of MSAA.) In Figure 26, each pixel is divided into four samples, and the step shown is to one of the four samples rather than to the center of each pixel. This takes one compare, so a total of four compares would be required to generate the full MSAA sample mask for 16 pixels.

Figure 26: Stepping to one set of samples for 4X MSAA rasterization of a partially covered 4×4.

### More Insights

 To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.

# First C Compiler Now on Github

The earliest known C compiler by the legendary Dennis Ritchie has been published on the repository.

# HTML5 Mobile Development: Seven Good Ideas (and Three Bad Ones)

HTML5 Mobile Development: Seven Good Ideas (and Three Bad Ones)

# Building Bare Metal ARM Systems with GNU

All you need to know to get up and running... and programming on ARM

# Amazon's Vogels Challenges IT: Rethink App Dev

Amazon Web Services CTO says promised land of cloud computing requires a new generation of applications that follow different principles.

# How to Select a PaaS Partner

Eventually, the vast majority of Web applications will run on a platform-as-a-service, or PaaS, vendor's infrastructure. To help sort out the options, we sent out a matrix with more than 70 decision points to a variety of PaaS providers.

More "Best of the Web" >>