A High-level View of Larrabee Rasterization
Larrabee takes a substantially different approach, one better suited to vectorization. In the Larrabee approach, we evaluate 16 blocks of pixels at a time to figure out which blocks are even touched by the triangle, then descended into each block that's at least partially covered, evaluating 16 smaller blocks within it, continuing to descend recursively until we had identified all the pixels inside the triangle. Here's an example of how that might work for our sample triangle.
As I'll discuss shortly, the Larrabee renderer uses a chunking architecture. In a chunking architecture, the largest rasterization target at any one time is a portion of the render target called a "tile"; for this example, let's assume the tile is 64×64 pixels, as in Figure 7.
First, we test which of the 16×16 blocks (16 of them -- we check 16 things at a time whenever possible in order to leverage the 16-wide vector units) that make up the tile are touched by the triangle, as in Figure 8.
We find that only one 16×16 block is touched -- the block shown in yellow. So we descend into that block to determine exactly what is touched by the triangle, subdividing it into 16 4×4 blocks (once again, we check 16 things at a time to be vector-friendly), and evaluate which of those are touched by the triangle, as in Figure 9.
We find that five of the 4×4s are touched, so we process each of them separately, descending to the pixel level to generate masks for the covered pixels. The pixel rasterization for the first block is in Figure 10.
Figure 11 shows the final result.
As you can see, the Larrabee approach processes 4×4 blocks, like the sweep approach. But unlike the sweep approach, it doesn't have to make many decisions in order to figure out which blocks are touched by the triangle, thanks to the single 16-wide test performed before each descent. Consequently, this rasterization approach typically does somewhat less work than the sweep approach to determine which 4×4 blocks to evaluate. The real win, however, is that it takes advantage of CPU smarts by not-rasterizing whenever possible. I'll have to walk through the Larrabee rasterization approach in order to explain what that means, but as an introduction, let me tell you another optimization story.
Many years ago, I got a call from a guy I had once worked for. He wanted me to do some consulting work to help speed up his new company's software. I asked him what kind of software it was, and he told me it was image processing software, and that the problem lay in the convolution filter, running on a Sparc processor. I told him I didn't know anything about either convolution filters or Sparcs, so I didn't think I could be of much help. But he was persistent, so I finally agreed to take a shot at it.
He put me in touch with the engineer who was working on the software, who immediately informed me that the problem was that the convolution filter involved a great many integer multiplies, which the Sparc did very slowly because, at the time, it didn't have a hardware integer multiply instruction. Instead, it had a partial product instruction, which had to be executed for each significant bit in the multiplier. In compiled code, this was implemented by calling a library routine that looped through the multiplier bits, and that routine was where all the time was going.
I suggested unrolling that loop into a series of partial product instructions, and jumping into the unrolled loop at the right point to do as many partial products as there were significant bits, thereby eliminating all the loop overhead. However, there was still the question of whether to make the pixel value or the convolution kernel value the multiplier. The smaller the multiplier, the fewer partial products would be needed, so we wanted to pick whichever of the two was smaller on average.
When I asked which was smaller, though, the engineer said there was no difference. When I persisted, he said they were random. When I said that I doubted they were random (since randomness is actually hard to come by), he grumbled. I don't know why he was reluctant to get me that information -- I guess he thought it was a waste of time -- but he finally agreed to gather the data and call me back.
He didn't call me back that day, though. And he didn't call me back the next day. When he hadn't called me back the third day, I figured I might as well get it over with and called him. He answered the phone and, when I identified myself, he said, "Oh, Hi. I'm just standing here with my managers, watching. We're all really happy."
When I asked what exactly he was happy about, he replied, "Well, when I looked at the data, it turned out 90% of the values in the convolution kernel were zero, so I just put an if-not-zero around the multiply, and now the whole program runs three times faster!"
Not-rasterizing is a lot like that, as we'll see shortly.
As noted earlier, Larrabee uses "chunked"(also known as "binned" or "tiled") rendering, where the target is divided into multiple rectangles or tiles. The rendering commands are sorted according to the tiles they touch and stored in the corresponding bins, then the contents of each bin are rendered separately to the corresponding tile. It's a bit complex, but it considerably improves cache coherence and parallelization.
For chunking, rasterization consists of two steps: The first identifies which tiles a triangle touches, and the second rasterizes the triangle within each tile. So it's a two-stage process and I'm going to discuss the two stages separately.
Figure 12 shows an example of a triangle to be drawn to a tiled render target. The light blue area is a 256×256 render target, subdivided into four 128×128 tiles.
With Larrabee's chunking architecture, the first step in rasterizing the triangle in Figure 12 is to determine which tiles the triangle touches and put the triangle in the bins for those tiles (tiles 1 and 3). Once all triangles have been binned, the second step, intra-tile rasterization, is to rasterize the triangle to tile 1 when the tile 1 bin is rendered, and to rasterize the triangle to tile 3 when the tile 3 bin is rendered.
Assignment of triangles to tiles can easily be performed for relatively small triangles -- say, up to a tile in size, which covers 90% of all triangles -- by doing bounding box tests. For example, it would be easy with bounding box tests to find out what two tiles the triangle in Figure 12 is in. Larger triangles are currently assigned to tiles by simply walking the bounding box and testing each tile against the triangle; that doesn't sound very efficient, but tile-assignment time is generally an insignificant part of total rendering time for larger triangles because there's usually a lot of shading work and the like to do for those triangles. However, if large-triangle tile-assignment time does turn out to be significant, we could use a sweep approach, as discussed earlier, or a variation of the hierarchical approach used for intra-tile rasterization, which I'll discuss next. This is a good example of how a CPU makes it easy to use two completely different approaches in order to do the 90% case well and the 10% case adequately (or well, but in a different way), rather than having to have one size fit all.
Large-triangle assignment to tiles is performed with scalar code for simplicity and because it's not a significant performance factor. Let's look at how that scalar process works because it will help us understand vectorized intra-tile rasterization later. I'll use a small triangle for the example to make the figures legible, but as previously noted, such a small triangle normally would be assigned to its tile or tiles using bounding box tests.
Once we've set up the equation for an edge (by calculating B and C, as discussed when we looked at Figure 1), the first thing we do is calculate its value at the trivial reject corner of each tile. The trivial reject corner is the corner at which an edge's equation is most negative within a tile; the selection of the trivial reject corner for a given edge is based on its slope, as we'll see shortly. We set things up so that negative means "inside" in order to allow us to generate masks directly from the sign bit. So you can think of the trivial reject corner as the point in the tile that's most inside the edge. If this point isn't inside the edge, no point in the tile can be inside the edge, and therefore, the whole triangle can be ignored for that tile.
Figure 13 shows the trivial reject test in action. Tile 0 is trivially rejected for the black edge and can be ignored because its trivial reject corner is positive, and therefore, the whole tile must be positive and must lie outside the triangle. Meanwhile, the other three tiles must be investigated further. You can see here how the trivial reject corner is the corner of each tile most inside the black edge; that is, the point with the most negative value in the tile.
Note that determining which corner is the trivial reject corner will vary from edge to edge depending on slope. For example, it would be the lower left corner of each tile for the edge shown in red in Figure 14 because that's the corner that's most inside that edge.
If you understand what we've just discussed, you're ninety percent of the way to understanding the whole Larrabee rasterizer. The trivial reject test is actually very straightforward once you understand it -- it's just a matter of evaluating the sign of a simple equation at the right point -- but it can take a little while to get it, so you may find it useful to re-read the previous section if you're at all uncertain or confused.
So that's the tile trivial reject test. The other tile test is the trivial accept test. For this, we take the value at the trivial reject corner (the corner we just discussed) and add the amount that the edge equation changes for a step all the way to the diagonally opposite tile corner, the tile trivial accept corner. This is the point in the tile where the edge equation is most positive. You can think of this as the point in the tile that's most outside the edge. If the trivial accept corner for an edge is negative, that whole tile is trivially accepted for that edge and there's no need to consider that edge when rasterizing within the tile.
Figure 15 shows the trivial accept test in action. Because the trivial accept corner is the corner at which the edge's equation is most positive, if this point is negative -- and therefore inside the edge -- all points in the tile must be inside the edge. Thus, tiles 0 and 1 are not trivially accepted for the black edge, because the equation for the black edge is positive at their trivial accept corners, but tiles 2 and 3 are trivially accepted, so rasterization of this triangle in tiles 2 and 3 can ignore the black edge entirely, saving a good bit of work.
There's an important asymmetry here. When we looked at trivial reject, we saw that it applies to the whole triangle in the tile being drawn to, in the sense that trivially rejecting a tile against any edge means the triangle doesn't touch that tile, so the triangle can be ignored in that tile. However, trivial accept applies only to the edge being checked; that is, trivially accepting a tile against an edge only means that the specific edge doesn't have to be checked when rasterizing the triangle to that tile because the whole tile is inside that edge; it has no direct implication for the whole triangle. The tile may be trivially accepted against one edge, but not against the others. In fact, it may be trivially rejected against one or both of the other edges, in which case, the triangle won't be drawn to the tile at all. This is illustrated in Figure 16, where tile 3 is trivially accepted against the black edge, so the black edge wouldn't need to be considered in rasterizing the triangle to that tile, but tile 3 is trivially rejected against the red edge, and that means that the triangle doesn't have to be rasterized to tile 3 at all.
In Figure 17, however, tile 3 is trivially accepted by all three edges, and here we come to a key point.
If all three edges are negative at their respective trivial accept corners, then the whole tile is inside the triangle, and no further rasterization tests are needed -- and this is what I meant earlier when I said the rasterizer takes advantage of CPU smarts by not-rasterizing whenever possible. The tile-assignment code can just store a draw-whole-tile command in the bin. Then the bin rendering code can simply do the equivalent of two nested loops around the shaders, resulting in a full-screen triangle rasterization speed of approximately infinity -- one of my favorite performance numbers!
By the way, this whole process should be familiar to 3D programmers because testing of bounding boxes against planes (for example, for frustum culling) is normally done in exactly the same way -- although in three dimensions instead of two -- with the same use of signs to indicate inside and outside for trivial accept and reject. Also, structures such as octrees employ a 3D version of the hierarchical recursion used by the Larrabee rasterizer.
That completes our overview of how rasterization of large triangles for tile assignment works. As I said, this is done as a scalar evaluation in the Larrabee pipeline, so the trivial accept and reject tests for each tile are performed separately. Intra-tile rasterization, to which we turn next, is much the same, but vectorized. And it is this vectorization that will give us insight into applying the Larrabee New Instructions to semi-parallel tasks.