Pulling It All Together
Now that we've stepped all the way down through the rasterization hierarchy, let's go back and look again at the rasterization descent overview we started with, this time with a detailed understanding of what's going on.
Figure 27 shows a triangle and a 64×64 tile to which the triangle is to be drawn, with the tile subdivided into 16×16 blocks. Figure 27 is a repeat of Figure 8, but this time I've added dashed extensions of the edges to the border of the tile, so we can see which blocks and pixels are on which sides of the edges.
To rasterize the triangle in Figure 27, we first calculate the values of the triangle's three edge equations at the tile's trivial accept and trivial reject corners and find that the tile is neither trivially rejected nor trivially accepted by any edge. (Again, this would actually only be done for a large triangle; we would use bounding box tests for such a small triangle.) We set up the various step tables we'll use, and then we step the edge equations to their respective trivial accept and trivial reject corners of the 16 blocks, each 16×16 in size, that make up the tile, and make a mask containing the signs of the results.
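To make the block-level test concrete, here is a minimal scalar sketch in Python. It is not the Larrabee code, which steps all 16 blocks at once with vector instructions and precomputed step tables. The names make_edges and classify_blocks are hypothetical, and the sketch assumes integer pixel coordinates and a sign convention in which an edge equation is non-negative on the interior side, which may differ from the convention used elsewhere in this article.

```python
def make_edges(v0, v1, v2):
    """Return three (a, b, c) tuples with a*x + b*y + c >= 0 inside the triangle.

    Assumption: vertices are integer (x, y) pairs; the sign of each edge is
    normalized against the opposite vertex, so winding order does not matter.
    """
    edges = []
    for p, q, r in ((v0, v1, v2), (v1, v2, v0), (v2, v0, v1)):
        a = p[1] - q[1]
        b = q[0] - p[0]
        c = -(a * p[0] + b * p[1])
        if a * r[0] + b * r[1] + c < 0:   # flip so the interior is non-negative
            a, b, c = -a, -b, -c
        edges.append((a, b, c))
    return edges


def classify_blocks(edges, x0, y0, block, n=4):
    """Classify the n*n blocks of a tile; returns (reject_mask, accept_mask).

    Bit (by * n + bx) refers to the block in column bx, row by.
    """
    reject = accept = 0
    last = block - 1                      # offset of the far pixel in a block
    for by in range(n):
        for bx in range(n):
            bit = 1 << (by * n + bx)
            left, top = x0 + bx * block, y0 + by * block
            all_inside = True
            for a, b, c in edges:
                # Trivial reject corner: the corner most inside this edge.
                rx = left + (last if a > 0 else 0)
                ry = top + (last if b > 0 else 0)
                if a * rx + b * ry + c < 0:
                    reject |= bit         # even the best corner is outside
                    all_inside = False
                    break
                # Trivial accept corner: the corner least inside this edge.
                ax = left + (0 if a > 0 else last)
                ay = top + (0 if b > 0 else last)
                if a * ax + b * ay + c < 0:
                    all_inside = False
            if all_inside:
                accept |= bit
    return reject, accept


edges = make_edges((20, 20), (28, 22), (22, 30))
reject, accept = classify_blocks(edges, 0, 0, 16)
partial = ~(reject | accept) & 0xFFFF     # blocks that need further descent
```

Blocks whose bit is set in neither mask are the partially covered ones we descend into. In the real rasterizer the two corner evaluations come from stepping the edge equations rather than from recomputing them per block as this sketch does.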
We then bit-scan through the resulting mask, find that 12 of the 16 blocks are trivially rejected, and descend into each of the remaining 4 blocks in turn. In three of the blocks, we'll ultimately find that there's nothing to draw, so for the purposes of this discussion, we'll ignore those and look at the more interesting case of what happens when we descend into the block that the triangle lies inside -- the block outlined in yellow. (Note that if the triangle were large enough to fully cover a 16×16 block, that block would be trivially accepted and no further descent into that block would be required.)
Before we look at what happens when we descend into the 16×16 block containing the triangle, there's one more thing in Figure 27 that we should examine. You may have noticed that in the earlier version of this figure, Figure 8, only the one block in yellow was found, not the three green blocks. Why did the bit-scan find 4 blocks this time, when the triangle is entirely contained in one block? The reason is that the Larrabee rasterization approach, as discussed in this article, can only eliminate blocks by trivially rejecting them. If you look closely, you will see that none of the three green blocks is trivially rejected by any edge. This is an inefficiency of this rasterization method, although there are techniques, which are beyond the scope of this article, that remove much of the waste.
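The bit-scan over the block mask can be sketched as follows. This is an illustration only: set_bits is a hypothetical helper, the mask value is made up rather than read off Figure 27, and the bit layout (bit index = row × 4 + column) is an assumption of the sketch.

```python
def set_bits(mask):
    """Yield the indices of the set bits in mask, lowest first."""
    while mask:
        low = mask & -mask            # isolate the lowest set bit
        yield low.bit_length() - 1    # its index
        mask ^= low                   # clear it and continue


# A made-up trivial-reject mask with 12 of the 16 bits set.
reject_mask = 0xFCCF
remaining = ~reject_mask & 0xFFFF     # the 4 blocks that were not rejected

for index in set_bits(remaining):
    bx, by = index % 4, index // 4    # block column and row within the tile
    print(f"descend into block ({bx}, {by})")
```

Each surviving block costs one loop iteration, while rejected blocks cost nothing beyond the mask computation, which is why rejecting 12 of 16 blocks up front is such a win.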
Descending the rasterization hierarchy, we take the 16×16 block containing the triangle, subdivide it into 16 4×4 blocks, and evaluate which of those are touched by the triangle by stepping to evaluate the edge equation at each of their trivial accept and trivial reject corners for each edge, as in Figure 28. We find that 10 of the blocks are trivially rejected, and that none of the 6 remaining blocks are trivially accepted against all three edges.
We've finally reached the bottom of the rasterization hierarchy, so we can bit-scan through the partial-accept mask generated for the 16×16 block to find the partially accepted 4×4 blocks, and generate the 4×4 pixel mask for each of those blocks in turn, as in Figure 29.
Here, we see once again that the reliance on trivial reject to eliminate blocks has caused a false positive on a block that actually doesn't touch the triangle (the left-most block). It's possible to do bounding box tests to eliminate such blocks, but it's not clear whether that's more efficient than just testing for empty masks -- that is, masks with no pixels enabled -- and skipping those blocks.
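The "test for empty masks and skip" option might look like the following sketch. It is scalar Python rather than vector code, the function names are hypothetical, and it assumes each edge is given as coefficients (a, b, c) with a*x + b*y + c >= 0 on the interior side. It also ignores fill rules, so pixels exactly on an edge count as covered.

```python
def pixel_mask_4x4(edges, x0, y0):
    """Return a 16-bit coverage mask for the 4x4 block whose top-left pixel
    is (x0, y0); bit (y * 4 + x) is set when pixel (x0 + x, y0 + y) is
    inside all three edges."""
    mask = 0
    for y in range(4):
        for x in range(4):
            if all(a * (x0 + x) + b * (y0 + y) + c >= 0 for a, b, c in edges):
                mask |= 1 << (y * 4 + x)
    return mask


def rasterize_blocks(edges, block_origins):
    """Build pixel masks for candidate 4x4 blocks, skipping empty ones."""
    covered = []
    for x0, y0 in block_origins:
        mask = pixel_mask_4x4(edges, x0, y0)
        if mask == 0:                 # false positive: no pixel is covered
            continue
        covered.append(((x0, y0), mask))
    return covered


# Edge coefficients for the triangle (20, 20), (28, 22), (22, 30),
# worked out by hand for this example.
EDGES = [(-2, 8, -120), (-8, -6, 356), (10, -2, -160)]
blocks = rasterize_blocks(EDGES, [(48, 48), (20, 24)])
```

Here the block at (48, 48) produces an empty mask and is dropped, so only the block at (20, 24) would be passed on for shading.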
After completing this 16×16 block, we pop back up to rasterize the other 16×16 blocks that weren't trivially rejected (which, in this case, turn out not to contain any of the triangle). And that's really all there is to it!
Notes on Rasterization
Now that we understand the basic rasterization algorithm, let's take a quick look at some interesting implementation refinements.
In software, we don't have the luxury of custom data and ALU sizes, but we do have the luxury of adapting to input data, and this adaptive rasterization helps boost our efficiency. For example, edge evaluations have to be done with 48 bits in the worst case. Because Larrabee has no 48-bit integer support, in software we have to use 64 bits for those cases. However, we don't have to do that at all for the 90+% of all triangles that fit in a 128×128 bounding box because, in those cases, 32 bits is enough.
When we do have to do 64-bit edge evaluation, we only have to use it for tile assignment. As it turns out, within tiles up to 128×128 in size (and 128×128 is our largest tile size), any edge that the tile is not trivially accepted or rejected against can always be rasterized using 32 bits.
We can also detect triangles that fit in a 16×16 bounding box and process them with one fewer descent level, less set-up, and no trivial accept test (because there will rarely be trivially accepted 4×4s in such small triangles). Finally, triangles that fit in very small bounding boxes can be handled simply by directly calculating the masks for the 16 or 32 pixels, with little set-up and minimal processing.
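The adaptive choices described above amount to a dispatch on bounding-box size, which can be sketched as follows. The function name and path labels are hypothetical. The 128×128 and 16×16 thresholds come from the text, but the 8×4 limit for the direct-mask path is an assumption of the sketch, chosen only because it matches the figure of 32 pixels.

```python
def choose_path(vertices):
    """Pick a rasterization path from the triangle's bounding-box extent.

    vertices is a sequence of (x, y) pairs in pixel units. Whether the
    comparisons should be inclusive is a detail this sketch glosses over.
    """
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)

    if width <= 8 and height <= 4:        # assumed "very small" limit
        return "direct_masks"             # compute the pixel masks directly
    if width <= 16 and height <= 16:
        return "small_16x16"              # one fewer descent level
    if width <= 128 and height <= 128:
        return "general_32bit"            # 32-bit edge evaluation suffices
    return "general_64bit"                # 64-bit needed for tile assignment


print(choose_path([(0, 0), (100, 0), (0, 100)]))
```

Because the check is a handful of comparisons per triangle, it costs very little next to the work it saves on the common small-triangle case.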
In fact, for small triangles we could even take the z value of the closest vertex and compare it to the z buffer for the triangle's bounding box, and possibly z-reject the triangle before we even rasterize it!
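A minimal sketch of that early z-reject follows. It assumes that smaller z means closer, that the depth test is a strict less-than, that vertices have integer pixel coordinates lying inside the buffer, and that the z buffer is a plain list of rows. All of these are assumptions of the example, and z_reject is a hypothetical name.

```python
def z_reject(vertices, zbuf):
    """Return True if the triangle cannot pass the depth test anywhere in
    its bounding box, so it can be skipped without being rasterized.

    vertices is a sequence of (x, y, z) tuples; zbuf[y][x] holds the depth
    already stored for pixel (x, y).
    """
    nearest = min(z for _, _, z in vertices)      # closest point of the triangle
    xs = [x for x, _, _ in vertices]
    ys = [y for _, y, _ in vertices]
    farthest_stored = max(
        zbuf[y][x]
        for y in range(min(ys), max(ys) + 1)
        for x in range(min(xs), max(xs) + 1)
    )
    # If even the closest vertex is no nearer than the farthest stored
    # depth, no pixel of the triangle can win a strict less-than test.
    return nearest >= farthest_stored


zbuf = [[0.5] * 8 for _ in range(8)]
hidden = z_reject([(1, 1, 0.7), (4, 1, 0.8), (1, 4, 0.9)], zbuf)
```

The test is conservative: it can fail to reject a triangle that is in fact hidden, but under the stated assumptions it never rejects one that is visible. That is why the text says "possibly."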
There are other optimization possibilities I won't get into because there's just not space in this article, and of course, there's no telling how well they'll work until we try them. But one nice thing about software is that it's easy to run the experiments to check them out.
And with that, we conclude our lightning tour of the Larrabee rasterization approach, and our examination of how vector programming can be applied to a semi-parallel task. As I mentioned earlier, software rasterization will never match dedicated hardware peak performance and power efficiency for a given area of silicon, but so far, it's proven to be efficient enough. It also has a significant advantage, in that because it uses general-purpose cores, the same resources that are used for rasterization can be used for other purposes at other times, and vice versa. As Tom Forsyth puts it, because the whole chip is programmable, we can effectively bring more square millimeters to bear on any specific task as needed -- up to and including the whole chip. In other words, the pipeline can dynamically reconfigure its processing resources as the rendering workload changes. If we get a heavy rasterization load, we can have all the cores working on it. It wouldn't be the most efficient rasterizer per square millimeter, but it would be one heck of a lot of square millimeters of rasterizer, all doing what was most important at that moment; in contrast to a traditional graphics chip with a hardware rasterizer, where most of the circuitry would be idle when there was a heavy rasterization load. A little while later, when the load switches to shading, the whole Larrabee chip can become a shader if necessary. Software simply brings a whole different set of strengths and weaknesses to the table.
There's a lot to learn and rethink with Larrabee, and a lot of potential to be exploited. Only time will tell how well it all works out -- but meanwhile, it certainly is an interesting time to be a performance programmer!
Further information about Larrabee is available at www.intel.com/software/graphics.