In an etymology-based world, "optimization" books would show how to make the "best" code for a given algorithm, and "best" would cover a range of concepts: clarity, portability, connectivity, and the like. In such a world, the title of Michael Abrash's book, Zen of Code Optimization would be Zen of Code Acceleration because it's about making programs go faster--period. That's a narrow way to define optimizing, but there are enough generalist books out already. Speed specialists will know that this book is for them as soon as they read the first sentence in the Preface: "This book is the diary of a personal passion, my quest for ways to write the fastest possible software for IBM-compatible computers in C, C++, and assembly language."
In fact, the approximate ratios are 5 percent C++, 25 percent C, and 70 percent assembly language. No surprise there: To really control the cycles, programmers have to use assembly language. But Zen of Code Optimization introduces this proposition in a sober fashion: Chapter 1 is an example of a C program that really can't be improved by just dropping in some inline assembly code. The first step--and Zen never stints on the warnings that this must be the first step--is improve the algorithm by timing it, timing the alternatives, stopping and thinking, deciding if any performance improvement would really be worth the effort, and then--only then--rewriting the critical routine in assembly language.
But how can you make your routine faster if you can't figure out how fast the routine is running, before and after their changes? Zen solves this problem in Chapter 3 by introducing a timing program called "the Zen Timer," which comes on the 3.5-inch diskette packaged with the book. Certainly, a profiler will do a better job of describing what routines are in use and where apparent bottlenecks lie. However, a profiler is not a precise instrument. Typically, profilers depend on the computer's system clock to interrupt them, which happens 18.2 times a second--a relatively infrequent occurrence on a computer that's doing several million instructions per second. Not only that, the profiler and the clock-interrupt-routine themselves consume cycles, so the act of measurement affects the thing being measured. To truly time something, you have to run a routine several million times with varying clock speeds in a loop.
The Zen Timer presents no such difficulties. It works by reprogramming the 8253 (or equivalent) timer chip that comes standard with every "IBM compatible" computer. The 8253 is incrementing an internal counter approximately 1,000,000 times per second. The Zen Timer retrieves the 8253's counter value, then it masks all interrupts and executes whatever routine has to be timed. As soon as that's over, it gets the 8523's current count (the 8523 keeps incrementing independently of what's happening on the CPU). It subtracts the count value that it got before entering the loop, and lo! The result is a timing of the routine's speed to the nearest microsecond, give or take a bit (some caveats apply).
As far as I can tell though, Zen of Code Optimization does not mention a potentially irritating detail: The Zen Timer does not work from Windows or from the Windows DOS Box. So the question arises, if you have to go outside Windows when you use the Zen Timer to test a routine, are the results valid when we put the same routine in a Windows application? In any ordinary situation, yes. There are a few instructions which run differently in Windows' Enhanced 386 mode: POP ES; MOV DS,AX; or anything else that causes a segment identifier to change. Still, such instructions are too rare to make the Zen Timer's results meaningless. This utility is the best thing about the book.
How fast does an instruction go on an Intel 486? My assembler's reference guide is supposed to answer that but--fascinatingly--it's often wrong.
Sometimes my guide is simply misprinted, like the TASM 3.x manual, which says that JCXZ takes one or three cycles (it should say five or eight cycles, a whopping difference). Sometimes my guide is simply incomplete: It was years before I found out that ADD mem,1 can be slower than INC mem, and I search in vain for that kind of information in the "manuals" that came with my assembler package. I do better if I go to the real information source (Intel's documents). Yet even there, some details are missing or hidden in a terse appendix. In short, nobody can answer the question, "How fast does an instruction go on an Intel 486?". I have to time it myself, and until recently, my code-timing method involved wasting a lot of my own time.
This is where Zen comes in. I've solved the timing problem by plugging in the Zen Timer, which shows with reasonable clarity where my cycle times are going. What about the problem of finding out how fast a CPU really goes? Since (to my knowledge) Zen is the only trade book that even tries to address this question, I'm sure that many people will rejoice in Zen's revelations.
Still, Zen's Chapters 12 and 13--the two chapters that address the Intel 486--need a close look.
Chapter 12 contains a section titled, "Calculate Memory Pointers Ahead of Time," which warns against loading a value into a register and then using the same register as a pointer. For example, if your first instruction is MOV BX,5, then your second instruction better not be MOV AX,[BX], because the 486's pipeline stalls when it can't figure out an address in advance. Intel's documentation says there is a penalty for doing this. Zen's contribution is to point out that the penalty is really two cycles. But Zen misses the exceptions. For example, if a two-cycle penalty always applied, then Example 1(a) would take two cycles longer than Example 1(b) (assuming BX starts equal to offset Mem)--but it does not. I timed them together, and the penalty in this case is one cycle.
A bit later you come to the section entitled, "Problems With Byte Registers," which begins with the statement, "There are two ways to lose cycles by using byte registers, and neither of them is documented by Intel, so far as I know." We then see that the first rule is a matter of using a 16-bit register as an instruction's source operand right after "loading" one half of the register. In Example 2, the two instructions together will run in three cycles--one more than you'd expect if you read that MOV is a one-cycle instruction.
Actually, Intel does warn that there will be a penalty for loading half of a register and then using the whole register as a source in the next instruction. When they say "half of a register," they mean the 16-bit half of a 32-bit register, so Zen is apparently right in saying that Intel doesn't document the effect. However, Zen's rule merely extrapolates Intel's warning to 8- and 16-bit registers.
In reality, Zen's rule is only one manifestation of a rule that's much more widely applicable, but also much more complex. To merely describe it would take a page, so I'll limit myself to two examples. In Example 3(a), CX is the destination, not the source, but there is a penalty anyway. In Example 3(b), however, there is no penalty because a penalty is already being applied. Neither of these byte-register effects fits either Zen rule. In short, you should be aware that there's more to the story than Zen implies.
Chapter 13 has four pages of personal anecdotes and a restatement of the 486's most important pipeline problem (changing a register just before using it in an address), then a couple of pages on the obscure BSWAP instruction ("BSWAP: More Useful than You Might Think"). To justify this attention, Zen says,
Unfortunately the x86 instruction set doesn't provide any way to work directly with only the upper half of a 32-bit register. The next best solution is to rotate the register to give you access in the lower 16 bits to the half you need at any particular time....
Zen dismisses using ROR for this, because "shifts and rotates are among the worst performing instructions of the 486, taking two to three cycles to execute." To the rescue comes BSWAP, which "executes in just 1 cycle." The discussion culminates in Example 4, which shows that you can use the top half of the ECX register as a loop counter and the bottom half as a "skip value," swapping the top half with the bottom half when you need to directly refer to the loop-counter half.
This would be very useful, except that BSWAP is not just a one-cycle instruction--it requires from one to three cycles. In most situations, BSWAP ECX takes three cycles; ROR ECX,16 and BSWAP ECX take the same time. Furthermore, you certainly can work directly with the upper half of a 32-bit register. If you do, your code will be faster than if you used BSWAP and (I think) quite a bit clearer; see Example 5.
Incidentally, my claim that BSWAP ECX is a three-cycle instruction may surprise some people. Doesn't Intel say that BSWAP takes one cycle? Yes, but Intel also says to add one cycle for prefixes (there are some exceptions, but that's the general rule, and it's certainly applicable in these examples). Aha, everyone thinks, we're operating on a 32-bit operand; therefore, there's a "32-bit operand" prefix (DB 66h), so BSWAP takes 1+1=2 cycles. That's right, but not everyone realizes that BSWAP, like many other instructions, always has another prefix (DB 0Fh), so BSWAP takes 1+1+1=3 cycles. Intel's documentation has a little note that 0Fh is a prefix, but nowhere have I seen them spell out the horrible implication: Most of the time on 386/486s, all instructions whose first machine opcode byte is 0FH take one cycle longer than the manual says they will. This includes many of the instructions that appeared with the introduction of the 386, including BSR, BT, BTC, BTR, BTS, CMPXCHG, IMUL <register>,<register>; Jxx <rel16>; LFS, LGS, MOVSX, MOVZX, POP FS, POP GS, PUSH FS, PUSH GS, SHLD, SHRD, SETxx, XADD, and various protected-mode instructions--and BSWAP. I'll suggest that BSWAP is not "more useful than you think."
By way of emphatic disclaimer, I am not giving Zen's advice a critical review. I cowrote an "optimizing tips" book (on a different topic) and fully expect people to find exceptions to the so-called rules in it--that's how we learn.
This is not a single book with a single unified plan--it is two books threaded together. Book 1 is for "speed freaks," who want to learn quickie fixes, speed up their tightest routines, and count cycles. Book 2 is for "Zen disciples," who want to exercise their minds, experience the thought processes of the masters, and be better programmers. If you're a speed freak but have no patience for all that Zen stuff, buy the Intel manuals. If you're a speed freak but you want to be a Zen disciple too, buy this book.
Zen of Code Optimization
Coriolis Group Books, 1994 449 pp., $39.95
Example 1: If a two-cycle penalty always applied, then (a) would take two cycles longer than (b), but it doesn't.
(a) MOV BX,offset Mem ADD word ptr [bx+4],55 (b) MOV CX,offset Mem ADD word ptr [BX+4],55
MOV AL,5 ;"Loading" AL, which is one half of AX MOV DX,AX ;Using AX, the whole register, as the source
Example 3: (a) CX is the destination; (b) there is no execution penalty because a penalty is already being applied.
(a) DEC CL is slower than DEC CL SUB CX,5 SUB AX,5 (b) MOV BL,BL is NOT slower than MOV BL,BL MOV [BX],BX MOV [BX],CX
mov cx,[InitialValue] bswap ecx ;Put skip value in upper half of ECX mov cx,64h ;Put loop count in CX looptop:... bswap ecx ;Make skip value word accessible in CX add bx,cx ;Skip BX ahead inc cx ;Set next skip value bswap ecx ;Put loop count in CX dec cx ;Count down loop jnz looptop ;The loop will repeat 64h times.
mov cx,[InitialValue] and ecx,000ffffh ;Clear the upper part of ECX to 0 or ecx,00630000h ;Put 63h directly in the upper part of ECX looptop:... add bx,cx ;Skip BX ahead inc cx ;Set next skip value sub ecx,00010000h ;Count down loop jnc looptop ;The loop will repeat 64h times.
Copyright © 1995, Dr. Dobb's Journal