You don't tug on Superman's cape
You don't spit into the wind
You don't pull the mask off that old Lone Ranger
And you don't mess around with Jim
The first time you heard the chorus of "You Don't Mess Around with Jim," you no doubt immediately wondered why Jim Croce left out "And you don't rasterize in software if you're trying to do GPU-class graphics, even on a high-performance part such as Larrabee." That's just common sense, right? It was obvious to me, anyway, that Larrabee was going to need a hardware rasterizer. But maybe Jim knew something I didn't, because it turned out I was wrong, and therein lies a tale.
First, though, let's back up a step and get some context.
Parallel programming for semi-parallel tasks
In A First Look at the Larrabee New Instructions (LRBni), I presented an overview of Larrabee, an upcoming multithreaded, vectorized, manycore processor from Intel. It may not be entirely apparent from that article, but the LRBni contains everything you need to do a great job running code that's designed to be vectorized, such as HLSL shaders. But what about tasks that aren't inherently scalar (in particular, where the iterations aren't serially dependent), and that at the same time can't be efficiently parallelized in any obvious way, and for which the usual algorithms aren't easily vectorized?
In the process of implementing the standard graphics pipeline for Larrabee, we (the Larrabee team at RAD -- Atman Binstock, Tom Forsyth, Mike Sartain, and me) have gotten considerable insight into that question, because while the pipeline is largely vectorizable, there are a few areas in which the conventional approaches are scalar, and where some degree of serialization is unavoidable. This article takes a close look at how we applied the Larrabee New Instructions to one of those problem areas -- rasterization -- in the process, redesigning our implementation so it was far more vectorizable so it took advantage of the strengths of Larrabee's CPU-based architecture. Needless to say, performance with the Larrabee New Instructions will vary greatly from one application to the next -- there's not much to be done with purely scalar code -- but diving into Larrabee rasterization in detail gives you a good sense of what it's like to apply the Larrabee New Instructions to at least one sort of non-obvious application.
To avoid potential confusion, let me define "rasterization": For our present purposes, it's the process of determining which pixels are inside a triangle, and nothing more. In this article, I won't discuss other aspects of the rendering pipeline, such as depth, stencil, shading, or blending, in any detail.
Rasterization Was the Problem Child -- But Less Problematic Than We Thought
When we look at applying the Larrabee New Instructions to the graphics pipeline, for the most part, it's easy to find 16-wide work for the vector unit to do. Depth, stencil, pixel shading, and blending all fall out of processing 4×4 blocks, with writemasking at the edges of triangles. Vertex shading is not quite a perfect fit, but 16 vertices can be shaded and cached in parallel, which works well because vertex usage tends to be localized. Setting up 16 triangles at a time also works pretty well; there's some efficiency loss due to culling, but it still yields more than half of maximum parallelism.
That leaves rasterization, which is definitely not easy to vectorize with enough performance to be competitive with hardware. As an aside, the Larrabee rasterizer is not, of course, the only rasterizer that could be written for Larrabee. There are other rasterizers for points and lines, and of course, there may be other future rasterizers for interesting new primitive types. In fact, I suggest you write your own -- it's an interesting problem! But as it doesn't have a more formal name, I'll just refer to the current general-purpose triangle rasterizer as "the Larrabee rasterizer" in this article.
The Larrabee rasterizer was at least the fifth major rasterizer I've written, but while the others were loosely related, the Larrabee rasterizer belongs to a whole different branch of the family -- so different, in fact, that it almost didn't get written. Why? Because, as is so often the case, I wanted to stick with the mental model I already understood and was comfortable with, rather than reexamining my assumptions. Long-time readers will recall my fondness for the phrase "Assume nothing"; unfortunately, I assumed quite a lot in this case, proving once again that it's easier to dispense wise philosophy than to put it into action!
Coming up with better solutions is all about trying out different mental models to see what you might be missing, a point perfectly illustrated by my efforts about 10 years ago to speed up a texture mapper. I had thrown a lot of tricks at it and had made it a lot faster, and figured I was done. But just to make sure, I ran it by my friend David Stafford, an excellent optimizer. David said he'd think about it and would let me know if he came up with anything.
When I got home that evening, there was a message on the answering machine from David, saying that he had gotten two cycles out of the inner loop. (It's been a long time, and my memories have faded, so it may not have been exactly two cycles, but you get the idea.) I called him back, but he wasn't home (this was before cellphones were widespread). So all that evening I tried to figure out how David could have gotten those two cycles. As I ate dinner, I wondered; as I brushed my teeth, I wondered; and as I lay there in bed not sleeping, I wondered. Finally, I had an idea -- but it only saved one cycle, not two. And no matter how much longer I racked my brain, I couldn't get that second cycle.
The next morning, I called David, admitted that I couldn't match his solution, and asked what it was. To which he replied: "Oh, sorry, it turned out my code doesn't work."
It's certainly a funny story, but more to the point, the only thing keeping my code from getting faster had been my conviction that there were no better solutions -- that my current mental model was correct and complete. After all, the only thing that changed between the original solution and the better one was that I acquired the belief that there was a better solution -- that is, that there was a better model than my current one. As soon as I had to throw away my original model, I had a breakthrough.
Software rasterization is a lot like that.
The previous experience that Tom Forsyth and I had with software rasterization was that it was much, much slower than hardware. Furthermore, there was no way we could think of to add new instructions to speed up the familiar software rasterization approaches, and no obvious way to vectorize the process, so we concluded that a hardware rasterizer would be required. However, this turned out to a fine example of the importance of reexamining assumptions.
In the early days of the project, as the vector width and the core architecture were constantly changing, one of the hardware architects, Eric Sprangle, kept asking us whether it might be possible to rasterize efficiently in software using vector processing. We kept patiently explaining that it wasn't. Then, one day, he said: "So now that we're using the P54C core, do you think software rasterization might work?" To be honest, we thought this was a pretty dumb question because the core didn't have any significant impact on vector performance; in fact, we were finally irritated enough to sit down to prove in detail -- with code -- that it wouldn't work.
And we immediately failed. As soon as we had to think hard about how the inner loop could be structured with vector instructions, and wrote down the numbers -- that is, as soon as our mental model was forced to deal with reality -- it became clear that software rasterization was in fact within the performance ballpark. After that, it was just engineering, much of which we'll see shortly.
First, though, it's worth pointing out that, in general, dedicated hardware will be able to perform any specific task more efficiently than software; this is true by definition because dedicated hardware requires no multifunction capability, so in the limit, it can be like a general-purpose core with the extraneous parts removed. However, by the same token, hardware lacks the flexibility of CPUs, and that flexibility can allow CPUs to close some or all of the performance gap. Hardware needs worst-case capacity for each component, so it often sits at least partly idle; CPUs, on the other hand, can just switch to doing something different, so ALUs are never idle. CPUs can also implement flexible, adaptive approaches, and that can make a big difference, as we'll see shortly.