Ramblings in Real Time

Floating-Point for Real-Time 3-D

Michael Abrash

Michael is the author of Zen of Graphics Programming, Second Edition, and Zen of Code Optimization. He is currently pushing the envelope of real-time 3-D on Quake at id Software. He can be contacted at [email protected].


In a crisis, sometimes it's best to go with the first solution that comes into your head--but not very often. My father's car was a Volvo sedan, only a couple of years old and easily the classiest car my family had ever owned. To the best of my recollection, as of New Year's Eve of my senior year in high school, I had never driven that car. (The car I drove was my mother's aging three-cylinder Saab: a blunt-nosed, ungainly little wagon that seated up to seven people in sardine-like comfort, as long as two passengers perched on the gas tank.) However, I was going to a New Year's party and, for reasons lost in the mists of time, I was allowed to take the Volvo. So, one clear, stunningly cold night, I picked up my passengers, who included Robin Viola, Kathy Smith, Jude Hawron, and Alan... (well, I'll omit his last name in case he wants to run for President someday).

The party was out in the middle of nowhere, and it was a good one. I heard Al Green for the first time and lots of beer was consumed (although none by me). Around 2 a.m., we piled into the Volvo, cranked up the heater, and set off for home. However, we hadn't gone more than about five miles when Alan suddenly deposited what had until recently been beer and chips into his lap.

Mind you, this wasn't just any car Alan was tossing his cookies in--it was my father's prized Volvo. Without thinking, I shouted, "Out the window! Open the window!" and Alan obligingly rolled the window down and sent some more erstwhile beer and chips on its way.

It was here that I learned that fast decisions aren't always good ones. Upon hearing a loud smacking sound as the mass splattered along the length of the car, I did what I should have done in the first place: I stopped the car so Alan could get out, and assessed the disaster. Not only was the rear half of the car covered, but the noxious substance had frozen solid. It looked like someone had melted an enormous candle on the car, or possibly covered it with cake frosting.

The next morning, my father was remarkably good-natured about the whole thing, although I don't remember ever actually driving the Volvo again. For my part, I took a hair dryer out to the unheated garage to melt and clean the gunk one small piece at a time. In addition to the intricacies of car detailing, what I learned here was that it almost always pays to take a few seconds to size up a crisis and choose an effective response.

There's a surprisingly close analogue to this in programming. Often, when faced with a problem in code, a programmer's response is to come up with a solution as quickly as possible and immediately hack it in. For all but the simplest problems, though, there are side effects and design issues involved that should be thought through before any coding is done. I try to think of bugs and other problem situations as opportunities to reexamine how my code works--chances to detect and correct structural defects I hadn't previously suspected. In fact, I'm often able to simplify code as I fix a bug, thanks to the understanding I gain in the process.

Taking this a step further, it's useful to reexamine assumptions periodically, even if no bugs are involved. You might be surprised at how quickly assumptions that once were completely valid can deteriorate.

Not Your Father's Floating-Point

For example, until last year, I had never done any serious floating-point (FP) optimization, for the good reason that FP math had never been fast enough for the code I needed to write. It was an article of faith that FP, while undeniably convenient because of its automatic support for constant precision over an enormous range of magnitudes, was just not fast enough for real-time programming. Consequently, I, like pretty much everyone else doing 3-D, expended a lot of time and effort in making fixed-point do the job.

That article of faith was true up through the 486, but for three reasons all the old assumptions are out the window on the Pentium:

  • Faster FP instructions.
  • A pipelined floating-point unit (FPU).
  • The magic of a parallel FXCH.

Together, these mean that FP addition and subtraction are nearly as fast as integer operations, and FP multiplication and division have the potential to be much faster--all with the range and precision advantages of FP. Better yet, the FPU has its own set of eight registers, so the use of floating-point can help relieve pressure on the x86's integer registers, as well.

One effect of all this is that, with the Pentium, floating-point for real-time 3-D on the x86 has gone from being irrelevant to being a key element. Quake, the game I'm currently working on, uses FP all the way down into the inner loop of the span rasterizer, performing several FP operations every 16 pixels.

FP has not only become important for real-time 3-D on the PC, but will soon become crucial. Hardware accelerators will take care of texture mapping and will increase feasible scene complexity, meaning the CPU will do less bit twiddling and will have far more vertices to transform and project, and far more motion physics, line-of-sight calculations, and the like, to handle as well.

To get you started with FP for real-time 3-D, I'll examine the basics of Pentium FP optimization, then look at how some key mathematical techniques for 3-D--dot product, cross product, transformation, and projection--can be accelerated.

Pentium Floating-Point Optimization

I'm going to assume you're already familiar with x86 FP code in general; for additional information, check out Intel's Pentium Processor User's Manual (order #241430-001; http://www.intel.com), a book that you should have if you're doing Pentium programming of any sort.

I'm going to focus on six core instructions: FLD, FST, FADD, FSUB, FMUL, and FDIV. First, let's look at cycle times for these instructions. FLD takes one cycle; the value is pushed onto the FP stack and is ready for use on the next cycle. FST takes two cycles, although when storing to memory, there's a potential extra cycle that can be lost.

FDIV is a painfully slow instruction, taking 39 cycles at full precision and 33 cycles at double precision (the default precision for Visual C++ 2.0). While FDIV executes, the FPU is occupied and can't process subsequent FP instructions until FDIV finishes. However, for all the cycles in which FDIV is executing except the one cycle in which FDIV starts, the integer unit can simultaneously execute instructions other than IMUL (which uses the FPU and can only overlap with FDIV for a few cycles). Since the integer unit can execute two instructions per cycle, this means it's possible to have three instructions--an FDIV and two integer instructions--executing at the same time. That's exactly what happens during the second cycle of Example 1(a). There's an important limitation, though; if the instruction stream following the FDIV reaches an FP instruction (or an IMUL), then that instruction and all subsequent instructions, both integer and FP, must wait to execute until FDIV has finished.

When FADD, FSUB, or FMUL is executed, it takes three cycles before the result can be used by another instruction. There's an exception: If the instruction that attempts to use the result is an FST to memory, an extra cycle is lost, making it four cycles from the start of an arithmetic instruction until an FST of that value can begin; thus, Example 1(b) takes six cycles in all. Again, it's possible to execute integer-unit instructions during the two (or three, for FST) cycles after one of these FP instructions starts. There's a more exciting possibility here, though: Given properly structured code, the FPU is capable of averaging one cycle per FADD, FSUB, or FMUL. The secret is pipelining.

Pipelining, Latency, and Throughput

The Pentium's FPU is the first pipelined x86 FPU. Pipelining means that the FPU is capable of starting an instruction every cycle, and can simultaneously handle several instructions in various stages of completion. Only certain x86 FP instructions allow another instruction to start on the next cycle, though: FADD, FSUB, and FMUL are pipelined, but FST and FDIV are not. (FLD executes in a single cycle, so pipelining is not an issue.) Thus, in Example 2(a), FADD1 can start on cycle N, FSUB can start on cycle N+1, FADD2 can start on cycle N+2, and FMUL can start on cycle N+3. At the start of cycle N+3, the result of FADD1 is available in the destination operand because it's been three cycles since the instruction started; FSUB is starting the final cycle of calculation; FADD2 is starting its second cycle, with one cycle yet to go after this; and FMUL is about to be issued. Each of the instructions takes three cycles to produce a result from the time it starts, but because they're simultaneously processed at different pipeline stages, one instruction is issued and one instruction completes every cycle. Thus, the latency of these instructions--the time until the result is available--is three cycles, but the throughput--the rate at which the FPU can start new instructions--is one cycle. An exception is that the FPU is capable of starting an FMUL only every two cycles, so between the two instructions in Example 2(b), there's a one-cycle stall, and the three instructions in Example 2(c) execute just as fast as the pair in Example 2(b).

There's a caveat here, though: An FP instruction can't be issued until its operands are available. The FPU can reach a throughput of one cycle per instruction on Example 2(d) because neither the FLD nor the FSUB needs the result from the FADD. Consider, however, Example 2(e), where the ST(0) operand to FSUB is calculated by FADD. Here, FSUB can't start until FADD has completed, so there are two stall cycles between the two instructions. When dependencies like this occur, the FPU runs at latency rather than throughput speeds, and performance can drop by as much as two-thirds.

FXCH

One piece of the puzzle is still missing. Clearly, to get maximum throughput, you need to interleave FP instructions so that at any one time, ideally, three instructions are in the pipeline at once. Further, these instructions must not depend on each other for operands. But ST(0) must always be one of the operands; worse, FLD can only push into ST(0), and FST can only store from ST(0). How, then, can you keep three independent instructions going?

The easy answer would be for Intel to change the FP registers from a stack to a set of independent registers. Since Intel couldn't do that because of compatibility issues, it did the next best thing--it made the FXCH instruction, which swaps ST(0) and any other FP register, virtually free. In general, if FXCH is both preceded and followed by FP instructions, then it takes no cycles to execute. (Application Note 500, "Optimizations for Intel's 32-bit Processors," February 1994, available at http://www.intel.com, describes all the conditions under which FXCH is free.) This allows you to move the target of a pending operation from ST(0) to another register, and at the same time bring another register into ST(0) where it can be used, all at no cost. So, for example, we can start three multiplications, then use FXCH to swap back to start adding the results of the first two multiplications without incurring any stalls, as shown in Listing One.

The Dot Product

Now we're ready to look at fast FP for common 3-D operations, starting with how to speed up the dot product. As discussed in my September/October 1995 column, the dot product is heavily used in 3-D to calculate cosines and to project points along vectors. The dot product is calculated as d = u1v1 + u2v2 + u3v3. With three loads, three multiplies, two adds, and a store, the theoretical minimum time for this calculation is ten cycles.

Listing Two shows a straightforward dot-product implementation. This version loses seven cycles to stalls. Listing Three cuts the loss to five cycles by doing all three FMULs first, then using FXCH to set the third FMUL aside to complete while the results of the first two FMULs, which have already completed, are added. Listing Three is still 50 percent slower than the theoretical minimum, but unless some other code is available to be interleaved with the dot-product code, that's all you can do to speed things up. Fortunately, dot products are often used in contexts where there's plenty of interleaving potential.

The Cross Product

When last I discussed the cross product, I said that it's handy for generating a vector that's normal to two other vectors. The cross product is calculated as [u2v3-u3v2, u3v1-u1v3, u1v2-u2v1]. The theoretical minimum cycle count for the cross product is 21 cycles. Listing Four is a straightforward implementation that calculates each component of the result separately, losing 15 cycles to stalls.

You couldn't get rid of many of the stalls in the dot-product code because, with six inputs and one output, it was impossible to interleave all the operations. However, the cross product, with three outputs, is much more amenable to optimization. In fact, three is the magic number. Because we have three calculation streams and the latency of FADD, FSUB, and FMUL is three cycles, you can eliminate almost every single stall in the cross-product calculation. Listing Five loses only one cycle to a stall--the cycle before the first FST. The relevant FSUB has just finished on the preceding cycle, so you run into the extra cycle of latency associated with FST. Listing Five is more than 60 percent faster than Listing Four, a striking illustration of the power of properly managing the Pentium's FP pipeline.

Transformation

Transforming a point (for example, from worldspace to viewspace) is one of the most heavily used FP operations in real-time 3-D. Conceptually, transformation is nothing more than three dot products and three additions, as I discussed in the September/October column. (Note that I'm talking about a subset of a general 4x4 transformation matrix, where the fourth row is always implicitly [0 0 0 1]. This limited form suffices for common transformations and does 25 percent less work than a full 4x4 transformation.) Examples 3(a) and 3(b) illustrate how you calculate transformation.

When it comes to implementation, however, transformation is quite different from three separate dot products and additions, because once again, the magic number three is involved. Three separate dot products and additions would take 60 cycles if each were calculated using the unoptimized dot-product code of Listing Two, and would take 54 cycles if done one after the other using the faster dot-product code of Listing Three, in each case followed by the final addition per dot product.

When fully interleaved, however, only a single cycle is lost (again to the extra cycle of FST latency), and the cycle count drops to 34, as in Listing Six. This means that on a 100-MHz Pentium, it's theoretically possible to do nearly 3 million transforms per second, although that's a purely hypothetical number, due to cache effects and set-up costs. Still, more than 1 million transforms per second is certainly feasible; at a frame rate of 30 Hz, that's an impressive 30,000-plus transforms per frame.

Projection

The final optimization I'll look at is projection to screenspace. Projection itself is basically nothing more than a divide (to get 1/z), followed by two multiplies (to get x/z and y/z), so there wouldn't seem to be many FP optimization possibilities there. However, remember that although FDIV has a latency of up to 39 cycles, it can overlap with integer instructions for all but one of those cycles. That means that if you can find enough independent integer work to do before you need the 1/z result, you can effectively reduce the cost of the FDIV to one cycle. Projection by itself doesn't offer much with which to overlap, but other work, such as clamping, window-relative adjustments, 2-D clipping, or read-ahead to bring the next piece of data into the processor cache could be interleaved with the FDIV for the next point.
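To make the overlap concrete, here's a minimal sketch of one way the interleaving might look. This isn't from Quake; the names (one, z, x, y, screenx, screeny, next_vert) and the particular integer work shown are hypothetical stand-ins for whatever per-point work is actually available.

; sketch: overlapping projection's 1/z FDIV with integer work
        fld     [one]       ;1.0, so the divide yields 1/z
        fdiv    [z]         ;starts 1/z; occupies the FPU for up to
                            ; 39 cycles, but after the first cycle
                            ; the integer unit keeps executing
        mov     eax,[next_vert]   ;hypothetical integer work done in
        mov     ebx,[next_vert+4] ; the FDIV's shadow--here, touching
        mov     ecx,[next_vert+8] ; the next vertex to warm the cache
        fld     st(0)       ;duplicate 1/z (waits for FDIV to finish)
        fmul    [x]         ;x * 1/z
        fxch    st(1)       ;bring the other 1/z copy to ST(0)
        fmul    [y]         ;y * 1/z
        fxch    st(1)       ;x/z back to ST(0)
        fstp    [screenx]   ;store x/z
        fstp    [screeny]   ;store y/z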

Another dramatic speed-up is possible by setting the precision of the FPU down to single via FLDCW, thereby cutting the time FDIV takes to a mere 19 cycles. Be aware that along with potentially greater performance, there are certain risks. The reduced precision, which affects FADD, FSUB, FMUL, FDIV, and FSQRT, can cause subtle differences: Be on the alert for precision-related problems, such as clipped values that vary more than you'd expect from the precise clip point, or the need for using larger epsilons in comparisons for point-on-plane tests.
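For illustration, here's a hedged sketch of the precision switch. The two 16-bit temporaries (cw_save, cw_single) are names I've made up; the bit manipulation follows the x87 control-word layout, in which bits 9-8 are the precision-control field (00 = single).

; sketch: dropping FPU precision to single for faster FDIVs
        fstcw   [cw_save]   ;capture the current control word
        mov     ax,[cw_save]
        and     ax,0fcffh   ;clear precision-control bits 9-8
                            ; (00 = single precision)
        mov     [cw_single],ax
        fldcw   [cw_single] ;FDIV now takes 19 cycles
                            ;(bulk divides and multiplies go here)
        fldcw   [cw_save]   ;restore the original precision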

Rounding Control

Another useful area that I can note only in passing is leaving the FPU in a particular rounding mode while performing bulk operations of some kind. For example, conversion to int via the FIST instruction requires that the FPU be in chop mode. Unfortunately, the FLDCW instruction must be used to get the FPU into and out of chop mode, and each FLDCW takes 7 cycles, meaning that compilers often take at least 14 cycles for each float->int conversion. In assembly, you can just set the rounding state (or the precision for faster FDIVs) once at the start of the loop, and save all those FLDCW cycles each time through the loop. This is even more true for ceil(), which many compilers implement as horrendously inefficient subroutines, even though there are rounding modes for both ceil() and floor(). Again though, be aware that results of FP calculations will be subtly different from compiler default behavior while chop, ceil, or floor mode is in effect.
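As a hedged sketch of that approach (the labels cw_save, cw_chop, fval, and ival are hypothetical): set chop mode once before the loop, convert with bare FLD/FISTP pairs inside it, and restore the control word afterward. In the x87 control word, bits 11-10 are the rounding-control field: 00 rounds to nearest, 01 rounds down (floor), 10 rounds up (ceil), and 11 chops toward zero.

; sketch: one FLDCW pair for a whole loop of float->int conversions
        fstcw   [cw_save]   ;save the control word
        mov     ax,[cw_save]
        or      ax,0c00h    ;rounding control = 11 (chop); use 01 for
                            ; floor() behavior, 10 for ceil()
        mov     [cw_chop],ax
        fldcw   [cw_chop]   ;7 cycles, paid once for the whole loop
convloop:
        fld     [fval]      ;(per-element work goes here)
        fistp   [ival]      ;truncating float->int, no FLDCWs needed
                            ;...advance pointers, loop to convloop...
        fldcw   [cw_save]   ;restore rounding state after the loop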

A final note: There are some speed-ups to be had by manipulating FP variables with integer instructions. Check out Chris Hecker's column in the February/March 1996 issue of Game Developer magazine for details.
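As one small instance of the kind of trick involved (a sketch of my own, not Hecker's code): because the sign bit of an IEEE single-precision value is its high bit, you can test a float's sign with one integer TEST instead of an FCOM/FSTSW sequence. The label fval is hypothetical, and note that this treats -0.0 as negative.

; sketch: testing a float's sign with integer code
        mov     eax,[fval]      ;load the float's 32 bits as an integer
        test    eax,80000000h   ;the IEEE sign bit is the high bit
        jnz     value_is_negative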

A Farewell to 3-D Fixed-Point

As with most optimizations, there are both benefits and hazards to FP acceleration, especially pedal-to-the-metal optimizations such as the last few I've mentioned. Nonetheless, I've found floating-point to be generally both more robust and easier to use than fixed-point, even with those maximum optimizations. Now that floating point is fast enough for real time, I don't expect to be doing a whole lot of fixed-point 3-D math from here on out.

And I won't miss it a bit.

Example 1: (a) The integer unit can execute two instructions per cycle, so it's possible to have three instructions executing at once; (b) if the instruction that attempts to use the result is an FST to memory, an extra cycle is lost.


     (a)  FDIV    ST(0),ST(1)
          ADD     EAX,ECX
          INC     EDX

     (b)  FMUL    ST(0),ST(1)
          FST     [temp]

Example 2: (a) FADD1 can start on cycle N, FSUB can start on cycle N+1, FADD2 can start on cycle N+2, and FMUL can start on cycle N+3; (b) a one-cycle stall between two FMULs; (c) these three instructions execute just as fast as the pair in (b); (d) the FPU can reach a throughput of one cycle per instruction; (e) the FSUB can't start until FADD has completed.

     
     (a)  FADD1
          FSUB
          FADD2
          FMUL
          
     (b)  FMUL    ST(1),ST(0)
          FMUL    ST(2),ST(0)
          
     (c)  FMUL    ST(1),ST(0)
          FLD     ST(4)
          FMUL    ST(0),ST(1)
          
     (d)  FADD    ST(1),ST(0)
          FLD     [temp]
          FSUB    ST(1),ST(0)
          
     (e)  FADD    ST(0),ST(2)
          FSUB    ST(0),ST(1)

Example 3: Calculating transformation.
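(The body of this example did not survive here. Reconstructed from the discussion, with placeholder element names m00 through m23 laid out to match the 16-bytes-per-row offsets Listing Six uses, the calculation is:)

     (a)  v' = Mv, where M is a 4x4 matrix whose fourth row is
          implicitly [0 0 0 1]

     (b)  v'x = m00*vx + m01*vy + m02*vz + m03
          v'y = m10*vx + m11*vy + m12*vz + m13
          v'z = m20*vx + m21*vy + m22*vz + m23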

Listing One

; use of fxch to allow addition of first two
; products to start while third multiplication finishes
        fld     [vec0+0]        ;starts & ends on cycle 0
        fmul    [vec1+0]        ;starts on cycle 1
        fld     [vec0+4]        ;starts & ends on cycle 2
        fmul    [vec1+4]        ;starts on cycle 3
        fld     [vec0+8]        ;starts & ends on cycle 4
        fmul    [vec1+8]        ;starts on cycle 5
        fxch    st(1)           ;no cost
        faddp   st(2),st(0)     ;starts on cycle 6

Listing Two

; unoptimized dot product; 17 cycles
        fld     [vec0+0]    ;starts & ends on cycle 0
        fmul    [vec1+0]    ;starts on cycle 1
        fld     [vec0+4]    ;starts & ends on cycle 2
        fmul    [vec1+4]    ;starts on cycle 3
        fld     [vec0+8]    ;starts & ends on cycle 4
        fmul    [vec1+8]    ;starts on cycle 5
                            ;stalls for cycles 6-7
        faddp   st(1),st(0) ;starts on cycle 8
                            ;stalls for cycles 9-10
        faddp   st(1),st(0) ;starts on cycle 11
                            ;stalls for cycles 12-14
        fstp    [dot]       ;starts on cycle 15,
                            ; ends on cycle 16


Listing Three

; optimized dot product; 15 cycles
        fld     [vec0+0]    ;starts & ends on cycle 0
        fmul    [vec1+0]    ;starts on cycle 1
        fld     [vec0+4]    ;starts & ends on cycle 2
        fmul    [vec1+4]    ;starts on cycle 3
        fld     [vec0+8]    ;starts & ends on cycle 4
        fmul    [vec1+8]    ;starts on cycle 5
        fxch    st(1)       ;no cost
        faddp   st(2),st(0) ;starts on cycle 6
                            ;stalls for cycles 7-8
        faddp   st(1),st(0) ;starts on cycle 9
                            ;stalls for cycles 10-12
        fstp    [dot]       ;starts on cycle 13,
                            ; ends on cycle 14

Listing Four

; unoptimized cross product; 36 cycles
        fld     [vec0+4]    ;starts & ends on cycle 0
        fmul    [vec1+8]    ;starts on cycle 1
        fld     [vec0+8]    ;starts & ends on cycle 2
        fmul    [vec1+4]    ;starts on cycle 3
                            ;stalls for cycles 4-5
        fsubrp  st(1),st(0) ;starts on cycle 6
                            ;stalls for cycles 7-9
        fstp    [vec2+0]    ;starts on cycle 10,
                            ; ends on cycle 11
        fld     [vec0+8]    ;starts & ends on cycle 12
        fmul    [vec1+0]    ;starts on cycle 13
        fld     [vec0+0]    ;starts & ends on cycle 14
        fmul    [vec1+8]    ;starts on cycle 15
                            ;stalls for cycles 16-17
        fsubrp  st(1),st(0) ;starts on cycle 18
                            ;stalls for cycles 19-21
        fstp    [vec2+4]    ;starts on cycle 22,
                            ; ends on cycle 23
        fld     [vec0+0]    ;starts & ends on cycle 24
        fmul    [vec1+4]    ;starts on cycle 25
        fld     [vec0+4]    ;starts & ends on cycle 26
        fmul    [vec1+0]    ;starts on cycle 27
                            ;stalls for cycles 28-29
        fsubrp  st(1),st(0) ;starts on cycle 30
                            ;stalls for cycles 31-33
        fstp    [vec2+8]    ;starts on cycle 34,
                            ; ends on cycle 35

Listing Five

; optimized cross product; 22 cycles
        fld     [vec0+4]    ;starts & ends on cycle 0
        fmul    [vec1+8]    ;starts on cycle 1
        fld     [vec0+8]    ;starts & ends on cycle 2
        fmul    [vec1+0]    ;starts on cycle 3
        fld     [vec0+0]    ;starts & ends on cycle 4
        fmul    [vec1+4]    ;starts on cycle 5
        fld     [vec0+8]    ;starts & ends on cycle 6
        fmul    [vec1+4]    ;starts on cycle 7
        fld     [vec0+0]    ;starts & ends on cycle 8
        fmul    [vec1+8]    ;starts on cycle 9
        fld     [vec0+4]    ;starts & ends on cycle 10
        fmul    [vec1+0]    ;starts on cycle 11
        fxch    st(2)       ;no cost
        fsubrp  st(5),st(0) ;starts on cycle 12
        fsubrp  st(3),st(0) ;starts on cycle 13
        fsubrp  st(1),st(0) ;starts on cycle 14
        fxch    st(2)       ;no cost
                            ;stalls for cycle 15
        fstp    [vec2+0]    ;starts on cycle 16,
                            ; ends on cycle 17
        fstp    [vec2+4]    ;starts on cycle 18,
                            ; ends on cycle 19
        fstp    [vec2+8]    ;starts on cycle 20,
                            ; ends on cycle 21

Listing Six

; optimized transformation: 34 cycles
        fld     [vec0+0]    ;starts & ends on cycle 0
        fmul    [matrix+0]  ;starts on cycle 1
        fld     [vec0+0]    ;starts & ends on cycle 2
        fmul    [matrix+16] ;starts on cycle 3
        fld     [vec0+0]    ;starts & ends on cycle 4
        fmul    [matrix+32] ;starts on cycle 5
        fld     [vec0+4]    ;starts & ends on cycle 6
        fmul    [matrix+4]  ;starts on cycle 7
        fld     [vec0+4]    ;starts & ends on cycle 8
        fmul    [matrix+20] ;starts on cycle 9
        fld     [vec0+4]    ;starts & ends on cycle 10
        fmul    [matrix+36] ;starts on cycle 11
        fxch    st(2)       ;no cost
        faddp   st(5),st(0) ;starts on cycle 12
        faddp   st(3),st(0) ;starts on cycle 13
        faddp   st(1),st(0) ;starts on cycle 14
        fld     [vec0+8]    ;starts & ends on cycle 15
        fmul    [matrix+8]  ;starts on cycle 16
        fld     [vec0+8]    ;starts & ends on cycle 17
        fmul    [matrix+24] ;starts on cycle 18
        fld     [vec0+8]    ;starts & ends on cycle 19
        fmul    [matrix+40] ;starts on cycle 20
        fxch    st(2)       ;no cost
        faddp   st(5),st(0) ;starts on cycle 21
        faddp   st(3),st(0) ;starts on cycle 22
        faddp   st(1),st(0) ;starts on cycle 23
        fxch    st(2)       ;no cost
        fadd    [matrix+12] ;starts on cycle 24
        fxch    st(1)       ;starts on cycle 25
        fadd    [matrix+28] ;starts on cycle 26
        fxch    st(2)       ;no cost
        fadd    [matrix+44] ;starts on cycle 27
        fxch    st(1)       ;no cost
        fstp    [vec1+0]    ;starts on cycle 28,
                            ; ends on cycle 29
        fstp    [vec1+8]    ;starts on cycle 30,
                            ; ends on cycle 31
        fstp    [vec1+4]    ;starts on cycle 32,
                            ; ends on cycle 33

