Why Rasterization Was the Problem Child
I can't do a proper tutorial on rasterization here, so I'll just run through a brief refresher. For our purposes, all rasterization will be of triangles. There are three edges per triangle, each defined by an equation Bx+Cy relative to any point on the edge, with the sign indicating whether a point is inside or outside the edge. Both x and y are in 15.8 fixed-point format, with a range of [-16K, +16K). The edge equations are tested at pixel or sample centers, and for cases where a pixel or sample center is right on an edge, well-defined fill rules must be observed (in this case, top-left fill rules, which are generally implemented by subtracting 1 from the edge equation for left edges and flat-top edges). Rasterization is performed with discrete math, and must be exact, so there must be enough bits to represent the edge equation completely. Finally, multisampled antialising must be supported.
Let's look at a quick example of applying the edge equation. Figure 1 shows an edge from (12, 8) to (4, 24). The B coefficient of the edge equation is simply the edge's y length: (y1 - y0). The C coefficient is the negation of the edge's x length: (x0 - x1). Thus, the edge equation in Figure 1 is (24 - 8)x + (12 - 4 )y. Since we only care about the sign of the result (which indicates inside or outside), not the magnitude, this can be simplified to 2x + 1y, where the x value used in the equation is the distance between the point of interest and any point at which the equation is known to be zero (which is to say, any point on the line). Usually, a vertex is used; for example, as the vertex at (12, 8) is used in Figure 1. All points on the edge have the value 0, as can be seen in Figure 1 for the point on the line at (8, 16).
Points on one side of the edge will have positive values, as in Figure 2 for the point at (12, 16), which has a value of 8.
Points on the other side of the edge will have negative values, as in Figure 3 for the point at (4, 12), which has a value of -12.
Simple though it is, the edge equation is the basis upon which the Larrabee rasterizer is built. By applying the three edge equations at once, it is possible to determine which points are inside a triangle and which are not. Figure 4 shows an example of how this works; the pixels shown in green are considered to be inside the triangle formed by the edges, because their centers are inside all three edges. As you can see, the edge equation is negative on the side of each edge that's inside the triangle; in fact, it gets more negative the farther you get from the edge on the inside, and more positive the farther you get from the edge on the outside.
Vectorization is an essential part of Larrabee performance -- capable of producing a speedup of an order of magnitude or more -- so the knotty question is how we can perform the evaluation in Figures 1-4 using vector processing. More accurately, the question is how we can efficiently perform the evaluation using vector processing; obviously, we could use vector instructions to evaluate every pixel on the screen for every triangle, but that would involve a lot of wasted work. What's needed is some way of using vector instructions to quickly narrow in on the work that's really needed.
We considered a lot of approaches. Let's take a look at a couple so you can get a sense of what a different tack we had to take in order to vectorize a task that's not an obvious candidate for parallelization -- and to leverage the unique strengths of CPUs.
The Pixomatic 1 Rasterization Approach
Pixomatic version 1 used a rasterization approach often used by scalar software rasterizers, decomposing triangles into 1 or 2 trapezoids, then stepping down the two edges simultaneously, on pixel centers, emitting the spans of pixels covered on each scan line, as in Figure 5.
This approach was efficient for scalar code, but it just doesn't lend itself to vectorization. There were several other reasons this approach didn't suit Larrabee well (for example, it emits pixel-high spans; but for vectorized shading, you want 4×4 blocks, both to generate 2D texture gradients and because a square aspect ratio results in the highest utilization of the vector units). But the most important reason was that I just could never come up with a way to get good results out of vectorizing edge stepping.
Another approach, often used by hardware, is sweep rasterization. An example of this is in Figure 6. Starting at a top vertex, a vector stamp of 4×4 pixels is swept left, then right, then down, and the process is repeated until the whole triangle has been swept. The edge equation is evaluated directly at each of the 16 pixels for each 4×4 block that's swept over.
Sweep rasterization is more vectorizable than the Pixomatic 1 approach because evaluating the pixel stamp is well-suited to vectorization; but on the other hand, it requires lots of badly predicted branching, as well as a significant amount of work to decide where to descend. It also fails to take advantage of the ability of CPUs to make smart, flexible decisions, which was our best bet for being competitive with hardware rasterization. So we decided sweep rasterization wasn't the right answer.