80x86 Optimization

Optimizations that work well on one member of the 80x86 family sometimes aren't efficient on another. Michael shares his optimization rules for achieving maximum performance--no matter which 80x86 CPU you're programming for.


March 01, 1991
URL:http://www.drdobbs.com/parallel/80x86-optimization/184408511



Aim down the middle and pray

Michael Abrash

Michael is a contributing editor to DDJ and can be contacted at 7 Adirondack Street, South Burlington, VT 05403.


Picture this: You're an archer aiming at a target 100 feet away. A strong wind comes up and pushes each arrow to the left as it flies. Naturally, you compensate by aiming farther to the right. That's what it's like optimizing for the 8088; once you learn to compensate for the strong but steady effects of the prefetch queue and the 8-bit bus, you can continue merrily on your programming way.

Now the wind starts gusting unpredictably. There's no way to compensate, so you just aim for the bull's-eye and hope for the best. That's what it's like writing code for good performance across the entire 80x86 family, or even for the 286/386SX/386 heart of today's market. You just aim down the middle and pray.

The New World of the 80x86

In the beginning, the 8088 was king, and that was good. The optimization rules weren't obvious, but once you learned them, you could count on them serving you well on every computer out there.

Not so these days. There are four major processor types -- 8088, 80286, 80386, and 80486 -- with a bewildering array of memory architectures: cached (in several forms), page mode, static-column RAM, interleaved, and, of course, the 386SX, with its half-pint memory interface. The processors offer wildly differing instruction execution times, and memory architectures warp those times further by affecting the speed of instruction fetching and access to memory operands. Because actual performance is a complex interaction of instruction characteristics, instruction execution times, and memory access speed, the myriad processor-memory combinations out there make "exact performance" a meaningless term. A specific instruction sequence may run at a certain speed on a certain processor in a certain system, but that often says little about the performance of the same instructions on a different processor, or even on the same processor with a different memory system. The result: Precise optimization for the general PC market is a thing of the past. (We're talking about optimizing for speed here; optimizing for size is the same for all processors so long as you stick to 8088-compatible code.)

So there is no way to optimize performance ideally across the 80x86 family. An optimization that suits one processor beautifully is often a dog on another. Any 8088 programmer would instinctively replace:

  
DEC    CX   
JNZ    LOOPTOP

with:

  LOOP   LOOPTOP

because LOOP is significantly faster on the 8088. LOOP is also faster on the 286. On the 386, however, LOOP is actually two cycles slower than DEC/JNZ. The pendulum swings still further on the 486, where LOOP is about twice as slow as DEC/JNZ -- and, mind you, we're talking about what was originally perhaps the most obvious optimization in the entire 80x86 instruction set.

In short, there is no such thing as code that's truly optimized for the 80x86. Instead, code is either optimized for specific processor-memory combinations, or aimed down the middle, designed to produce good performance across a range of systems. Optimizing for the 80x86 family by aiming down the middle is quite different from optimizing for the 8088, but many PC programmers are inappropriately still applying the optimization lore they've learned over the years on the PC (or AT). The world has changed, and many of those old assumptions and tricks don't hold true anymore.

You will not love the new world of 80x86 optimization, which is less precise and offers fewer clever tricks than optimizing for the 8088 alone. Still, isn't it better to understand the forces affecting your code's performance out in the real world than to optimize for a single processor and hope for the best?

Better, yes. As much fun, no. Optimizing for the 8088 was just about as good as it gets. So it goes.

Optimization Rules for a New World

So, how do you go about writing fast code nowadays? One way is to write different versions of critical code for various processors and memory access speeds, selecting the best version at runtime. That's a great solution, but it requires an awful lot of knowledge and work.
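In outline, such a runtime dispatcher can be as simple as the following sketch. (DetectCPU, Copy8088, Copy386, and CopyVector are names invented here purely for illustration; they aren't routines from this article or any library.)

   call   DetectCPU             ;hypothetical routine: returns processor
                                ; type in AX (0=8088/8086, 2=286, 3=386...)
   mov    [CopyVector],offset Copy8088  ;assume the 8088-tuned version
   cmp    ax,3                  ;running on a 386 or better?
   jb     HaveVector            ;no, stick with the 8088-tuned version
   mov    [CopyVector],offset Copy386   ;yes, use the 386-tuned version
HaveVector:
      :
   call   word ptr [CopyVector] ;invoke whichever version was selected
      :
CopyVector   dw   ?             ;offset of the routine to call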

An alternative is to optimize for one particular processor and settle for whatever performance you get on the others. This might make sense when the 8088 is the target processor because it certainly needs the optimization more than any other processor. However, 8088 optimization works poorly at the upper end of the 80x86 family.

Nowadays, though, most of us want to optimize for the 286 and 386 systems that dominate the market, or across all 80x86 processors, and that's a tough nut to crack. The 286 and 386 come in many configurations, and you can be sure, for example, that a 386SX, an interleaved 386, and a cached 386 have markedly different performance characteristics. There are, alas, no hard and fast optimization rules that apply across all these environments.

My own approach to 80x86 optimization has been to develop a set of general rules that serve reasonably well throughout the 80x86 line, especially the 286 and 386, and to select a specific processor (in my case a cached 386, for which cycle times tend to be accurate) to serve as the tiebreaker when optimization details vary from one processor to another. (Naturally, it's only worth bothering with these optimizations in critical code.) The rules I've developed are:

  * Use the registers; avoid memory.
  * Keep instructions short; avoid large constants and addressing displacements.
  * Don't branch.
  * Align word- and doubleword-sized data, and align branch targets.
  * Don't lean on the string instructions as heavily as you would on the 8088.
  * Don't count on the clever old 8088 tricks carrying over.

Next I'll discuss each of these rules in turn in the context of 8088-compatible real mode, which is still the focus of the 80x86 world. Later, I'll touch on protected mode.

Let's start by looking at the last -- and most surprising -- rule.

Kiss Those Tricks Goodbye

To skilled assembly language programmers, the 8088 is perhaps the most wonderful processor ever created, largely because the instruction set is packed with odd instructions that are worthless to compilers but can work miracles in the hands of clever assembly programmers. Unfortunately, each new generation of the 80x86 has rendered those odd instructions and marvelous tricks less desirable. As the execution time for the commonly used instruction ADD BX, 4 has gone down from four cycles (8088) to three cycles (286) to two cycles (386) to one cycle (486), the time for the less frequently used instruction CBW has gone from two cycles (8088 and 286) up to three cycles (386 and 486)!

Consider this ancient optimization for converting a binary digit to hex ASCII:

  
  ADD     AL,90H   
  DAA   
  ADC     AL,40H   
  DAA

Now consider the standard alternative:

  
  ADD     AL,'0'   
  CMP     AL,'9'   
  JBE     HaveAscii   
  ADD     AL,'A'-('9'+1) 
HaveAscii:

As Figure 1 indicates, the standard code should be slower on an 8088 or 286, but faster on a 386 or a 486 -- and real-world tests confirm those results, as shown in Figure 2. (All "actual performance" timings in this article were performed with the Zen timer from Zen of Assembly Language; see "References" for details. The systems used for the tests were: 8088, standard 4.77 MHz PC XT; 80286, standard one-wait-state, 8 MHz PC AT; 386SX, 16 MHz noncached; 80386, 20 MHz externally cached, with all instructions and data in external cache for all tests except Listings One and Two; 80486, 25 MHz internally cached, with all instructions and data in internal cache for all tests except Listings One and Two.)

In other words, this nifty, time-tested optimization is an anti-optimization on the 386 and 486.

Why is this? On the 386, DAA -- a rarely used instruction -- takes four cycles, and on the 486 it takes two cycles, in both cases twice as long as the more common instructions CMP and ADD; in contrast, on the 8088 all three instructions are equally fast at four cycles. Also, the instruction-fetching advantage that the 1-byte DAA provides on the 8088 means nothing on a cached 386.

Nor is this an isolated example. Most oddball instructions, from AAA to XCHG, have failed to keep pace with the core instructions -- ADC, ADD, AND, CALL, CMP, DEC, INC, Jcc, JMP, LEA, MOV, OR, POP, PUSH, RET, SBB, SUB, TEST, and XOR -- during the evolution from 8088 to 486. As we saw earlier, even LOOP lags behind on the 386 and 486. Check your favorite tricks for yourself; they might or might not hold up on the 386, but will most likely be liabilities on the 486. Sorry, but I just report the news, and the news is: Kiss most of those tricks goodbye as the 386 and 486 come to dominate the market. (This means that hand-optimization in assembly language yields less of a performance boost nowadays than it did when the 8088 was king; the improvement is certainly significant, but rarely in the 200-500 percent range anymore. Sic transit gloria mundi.) Most startling of all, string instructions lose much of their allure as we move away from the 8088, hitting bottom on the 486.

The 486: All the Rules Change

The 486 represents a fundamental break with 8088-style optimization. Virtually all the old rules fail on the 486, where, incredibly, a move to or from memory often takes just one cycle, but exchanging two registers takes three cycles. The nonbranching core instructions mentioned earlier take only one cycle on the 486 when operating on registers; MOV can, under most conditions, access memory in one cycle; and CALL and JMP take only three cycles, given a cache hit. However, noncore instructions take considerably longer. XLAT takes four cycles; even STC and CLC take two cycles each. The 486's highly asymmetric execution times heavily favor core instructions and defeat most pre-486 optimizations.

Core instructions do have a weakness on the 486. While 486 MOVs involving memory are remarkably fast, accessing memory for an operand to OR, ADD, or the like costs cycles. Even with the 8K internal cache, memory is not as fast as registers, except when MOV is used (and sometimes not even then), so registers are still preferred operands. (AND [BX],1 is fast, at only three cycles, but AND BX,1 takes only one cycle -- three times as fast.)
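For instance, a summing loop might keep its running total in a register and store it just once at the end, rather than adding into a memory variable on every pass. (A sketch only; it assumes SI points to an array of CX words, and Total and SumLoop are invented names.)

   sub    dx,dx          ;running total lives in a register
SumLoop:
   add    dx,[si]        ;fetching the data itself is unavoidable...
   add    si,2           ; ...but the pointer and total stay in registers
   dec    cx
   jnz    SumLoop
   mov    [Total],dx     ;one store at the end, instead of an
                         ; ADD [Total],reg on every pass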

OUT should be avoided whenever possible on the 486, and likewise for IN. OUT takes anywhere from 10 to 31 cycles, depending on processor mode and privileges, more than an order of magnitude slower than MOV. The lousy performance of OUT -- true on the 386 as well -- has important implications for graphics applications.
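A VGA fill illustrates the principle: program the Map Mask register (Sequence Controller index 2 at ports 3C4h/3C5h) once, outside the loop, and do the per-byte work with MOVs. (A sketch only; the count and fill value are made up for illustration.)

   mov    ax,0a000h      ;point ES at the VGA display buffer
   mov    es,ax
   mov    dx,3c4h        ;Sequence Controller index port
   mov    ax,0f02h       ;index 2 (Map Mask), all four planes enabled
   out    dx,ax          ;one word-sized OUT, outside the loop
   sub    di,di          ;start of the buffer
   mov    cx,1000h       ;illustrative count of bytes to fill
   mov    al,0ffh        ;illustrative fill value
FillLoop:
   mov    es:[di],al     ;per-byte work is done with MOV, not OUT
   inc    di
   dec    cx
   jnz    FillLoop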

String instructions are so slow on the 486 that you should check cycle times before using any string instruction other than the always superior REP MOVS. For example, LODSB takes five cycles on the 486, but MOV AL,[SI]/INC SI takes only two cycles; likewise for STOSB and MOV [DI],AL/INC DI. Listing One uses LODSB/STOSB to copy a string, converting lowercase to uppercase while copying; Listing Two uses MOV/INC instead. Figure 3 summarizes the performance of the two routines on a variety of processors; note the diminishing effectiveness of string instructions on the newer processors. Think long and hard before using string instructions other than REP MOVS on the 486.

Optimization for the 486 is really a whole new ball game. When optimizing across the 80x86 family, the 486 will generally be the least of your worries because it is so much faster than the rest of the family; anything that runs adequately on any other processor will look terrific on the 486. Still, the future surely holds millions of 486s, so it wouldn't hurt to keep one eye on the 486 as you optimize.

String Instructions: Fading Stars

On the 8088, string instructions are so far superior to other instructions that it's worth going to great lengths to use them, but they lose much of that status on newer processors. One of the best things about string instructions on the 8088 is that they require little instruction fetching, because they're 1-byte instructions and because of the REP prefix; however, instruction fetching is less of a bottleneck on newer processors. String instructions also have superior cycle times on the 8088, but that advantage fades on the 286 and 386 as well.

On the 286, string instructions (when they do exactly what you need) are still clearly better than the alternatives. On the 386, however, some string instructions are, even under ideal circumstances, the best choice only by a whisker, if at all. For example, since Day One, clearing a buffer has been done with REP STOS. That's certainly faster than the looping MOV/ADD approach shown in Listing Three, but on the 386 and 486 it's no faster than the unrolled loop MOV/ADD approach of Listing Four, as shown in Figure 4. (Actually, in my tests REP STOS was a fraction of a cycle slower on the 386, and fractionally faster on the 486.) REP STOS is much easier to code and more compact, so it's still the approach of choice for buffer clearing -- but it's not necessarily fastest on a 486 or fast-memory 386. This again demonstrates just how unreliable the old optimization rules are on the newer processors.

The point is not that you shouldn't use string instructions on the 386. REP MOVS is the best way to move data, and the other string instructions are compact and usually faster, especially on uncached systems. However, on the 386 it's no longer worth going to the trouble of juggling registers and reorganizing data structures to use string instructions. Furthermore, when you truly need maximum performance on the 386, check out nonstring instructions in unrolled loops. It goes against every lesson learned in a decade of 8088 programming, but avoiding string instructions sometimes pays on the 386.

The Siren Song of Memory Accesses

Finally, here's a rule that's constant from the 8088 to the 486: Use the registers. Avoid memory.

Don't be fooled by the much faster memory access times of the 286 and 386. The effective address calculation time of the 8088 is mostly gone, so MOV AX,[BX] takes only five cycles on the 286, and ADD [SI],DX takes only seven on the 386. That's so much faster than the 17 and 29 cycles, respectively, that they take on the 8088 that you might start thinking that memory is pretty much interchangeable with registers.

Think again. MOV AX,BX is still more than twice as fast as MOV AX,[BX] on the 286, and ADD SI,DX is more than three times as fast as ADD [SI],DX on the 386. Memory operands can also reduce performance by slowing instruction fetching. Memory is fast on the 286 and 386. Registers are faster. Use them as heavily as possible.
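In practice that means loading counts and pointers into registers once, outside the loop. (A sketch; LoopCount, Table, and ProcessLoop are invented names.)

   mov    cx,[LoopCount]    ;load the count into a register once...
   mov    bx,offset Table   ; ...and keep the pointer in a register, too
ProcessLoop:
   add    ax,[bx]           ;the data access itself is unavoidable
   add    bx,2              ;register-only pointer advance
   dec    cx                ;DEC CX, rather than the far costlier
   jnz    ProcessLoop       ; DEC WORD PTR [LoopCount]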

Don't Branch

Here's another rule that stays the same across the 80x86 family: Don't branch. Branching suffers on the 8088 from lengthy cycle counts and emptying the prefetch queue. Emptying the prefetch queue is a lesser but nonetheless real problem in the post-8088 world, and the cycle counts of branches are still killers. As Figure 4 indicates, it pays to eliminate branches by unrolling loops or using repeated string instructions.

Modern-Day Instruction Fetching

Instruction fetching is the bugbear of 8088 performance; the 8088 simply can't fetch instruction bytes as quickly as it can execute them, thanks to its undersized bus. Minimizing all memory accesses, including instruction fetches, is paramount on the 8088.

Instruction fetching is less of a problem nowadays. Figure 5 shows the maximum rates at which various processors can fetch instruction bytes; clearly, matters have improved considerably since the 8088, although instructions also execute in fewer cycles on the newer processors. Fetching problems can occur on any 80x86 processor, even the 486, but the only processors other than the 8088 that face major instruction fetching problems are the one-wait-state 286 and the 386SX, although uncached 386s may also outrun memory. However, the problems here are different from and less serious than with the 8088.

Consider: An 8088 executes a register ADD in three cycles, but requires eight cycles to fetch that instruction, a fetch/execute ratio of 2.67. A one-wait-state 286 requires three cycles to fetch a register ADD and executes it in two cycles, a ratio of 1.5. A 386SX can fetch a register ADD in two cycles, matching the execution time nicely, and a cached 386 can fetch two register ADDs in the two cycles it takes to execute just one. For register-only code -- the sort of code critical loops should contain -- the 386 generally runs flat out, and the 286 and 386SX usually (not always, but usually) outrun memory by only a little at worst. Greater fetching problems can arise when working with large instructions or instruction sequences that access memory nonstop, but those are uncommon in critical code. This is a welcome change from the 8088, where small, register-only instructions tend to suffer most from inadequate instruction fetching.

Also, uncached 386 systems often use memory architectures that provide zero-wait-state performance when memory is accessed sequentially. In register-only code, instruction fetches are the only memory accesses, so fetching proceeds at full speed when the registers are used heavily.

So, is instruction fetching a problem in the post-8088 world? Should instructions be kept short?

Yes. Smaller instructions can help considerably on the one-wait-state 286 and on the 386SX. Not as much as on the 8088, but it's still worth the trouble. Even a cached 386 can suffer from fetching problems, although that's fairly uncommon. For example, when several MOV WORD PTR [MemVar],0 instructions are executed in a row, as might happen when initializing memory variables, performance tends to fall far below rated speed, as shown in Figure 6. The particular problem with MOV WORD PTR [MemVar],0 is that it executes in just two (386) or three (286) cycles, yet has both an addressing displacement field and a constant field. This eats up memory bandwidth by requiring more instruction fetching. It also accesses memory, eating up still more bandwidth. We'll see this again, and worse, when we discuss protected mode.
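One way to trim both the constant fields and the fetch load is to zero a register once and store the register. (A sketch; MemVar1 through MemVar3 are invented word variables.)

   sub    ax,ax             ;one small, register-only instruction
   mov    [MemVar1],ax      ;each store is now 3 bytes shorter than
   mov    [MemVar2],ax      ; MOV WORD PTR [MemVarN],0, with no
   mov    [MemVar3],ax      ; constant field left to fetch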

Generally, though, post-8088 processors with fast memory systems and full-width buses run most instructions at pretty near their official cycle times; for these systems, optimization consists mostly of counting cycles. Slower memory or constricted buses (as in the 386SX) require that memory accesses (both instruction fetches and operand accesses) be minimized as well. Fortunately, the same sort of code -- register only -- meets both requirements.

Use the registers. Avoid constants. Avoid displacements. Don't branch. That's the big picture. Don't sweat the details.

Alignment: The Easy Optimization

The 286, 386SX, and 386 take twice as long to access memory words at odd addresses as at even addresses. The 386 takes twice as long to access memory dwords at addresses that aren't multiples of four as those that are. You should use ALIGN 2 to word align all word-sized data, and ALIGN 4 to dword align all data that's accessed as a dword operand, as in:

      
    ALIGN     4   
MemVar   dd   ?          
       :     
  MOV EAX,[MemVar]

Alignment also applies to code; you may want to word or dword align the starts of procedures, labels that can only be reached by branching, and the tops of loops. (Code alignment matters only at branch targets, because only the first instruction fetch after a branch can suffer from nonalignment.) Dword alignment of code is optimal, and will help on the 386 even in real mode, but word alignment will produce nearly as much improvement as dword alignment without wasting nearly as many bytes.
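For example, a copy loop's top might be aligned like this (the label is invented for illustration):

   ALIGN  2                 ;word align the branch target so the first
CopyLoop:                   ; fetch after the branch is aligned
   mov    ax,[si]
   mov    [di],ax
   add    si,2
   add    di,2
   dec    cx
   jnz    CopyLoop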

Alignment improves performance on many 80x86 systems without hindering it on any. Recommended.

Protected Mode

There are two sorts of protected mode, 16-bit and 32-bit. The primary optimization characteristic of 16-bit protected mode (OS/2 1.X, Rational DOS Extender) is that it takes an ungodly long time to load a segment register (for example, MOV ES,AX takes 17 cycles on a 286) so load segment registers as infrequently as possible in 16-bit protected mode.
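The standard remedy is simply to hoist segment loads out of loops. (A sketch; BufferSeg and BUFFER_WORDS are invented names for a selector and a word count.)

   mov    ax,BufferSeg      ;selector of the buffer's segment
   mov    es,ax             ;pay the protected-mode load cost once
   sub    di,di
   mov    cx,BUFFER_WORDS
   sub    ax,ax
ClearLoop:
   mov    es:[di],ax        ;no MOV ES,reg inside the loop
   add    di,2
   dec    cx
   jnz    ClearLoop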

Optimizing for 32-bit protected mode (OS/2 2.0, SCO Unix, Phar Lap DOS Extender) is another matter entirely. Typically, no segment loads are needed because of the flat address space. However, 32-bit protected mode code can be bulky, and that can slow instruction fetching. Constants and addressing displacements can be as large as 4 bytes each, and an extra byte, the SIB byte, is required whenever two 32-bit registers are used to address an operand or scaled addressing is used. So, for example, MOV DWORD PTR [MemVar],0 is a 10-byte instruction in 32-bit protected mode. The instruction is supposed to execute in two cycles, but even a 386 needs four to six cycles to fetch it, plus another two cycles to access memory; a few such instructions in a row can empty the prefetch queue and slow performance considerably. The slowdown occurs more quickly and is more acute on a 386SX, which needs 14 cycles to perform the memory accesses for this nominally 2-cycle instruction.

Code can get even larger when 32-bit instructions are executed in 16-bit segments, adding prefix bytes. (Avoid prefix bytes if you can; they increase instruction size and can cost cycles.) Figure 7 shows actual versus nominal cycle times of multiple MOV DWORD PTR [EBX*4+MemVar],0 instructions running in a 16-bit segment. Although cache type (write-back, write-through) and main-memory write time also affect the performance of stores to memory, there is clearly a significant penalty for using several large (in this case, 13-byte) instructions in a row.

Fortunately, this is a worst case, easily avoided by keeping constants and displacements out of critical loops. For example, you should replace:

  
    ADDLOOP:
        MOV   DWORD PTR BaseTable[EDX+EBX],0
        ADD   EBX,4
        DEC   ECX
        JNZ   ADDLOOP

with:

        LEA   EBX,BaseTable[EDX+EBX]
        SUB   EAX,EAX
    ADDLOOP:
        MOV   [EBX],EAX
        ADD   EBX,4
        DEC   ECX
        JNZ   ADDLOOP

Better yet, use REP STOSD or unroll the loop!
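For reference, here is the REP STOSD version of the same clear, assuming ECX still holds the dword count, ES matches DS in the flat model, and the direction flag is clear:

        LEA   EDI,BaseTable[EDX+EBX]  ; point EDI at the area to clear
        SUB   EAX,EAX                 ; value to store
        REP   STOSD                   ; ECX dwords, with no explicit loop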

Happily, register-only instructions are no larger in 32-bit protected mode than otherwise and run at or near their rated speed in 32-bit protected mode on all processors. All in all, in protected mode it's more important than ever to avoid large constants and displacements and to use the registers as much as possible.

Conclusion

Optimization across the 80x86 family isn't as precise as 8088 optimization, and it's a lot less fun, with fewer nifty tricks and less spectacular speed-ups. Still, familiarity with the basic 80x86 optimization rules can give you a decided advantage over programmers still laboring under the delusion that the 286, 386, and 486 are merely faster 8088s.

References

Abrash, Michael. Zen of Assembly Language. Glenview, Ill.: Scott, Foresman, 1990.

Barrenechea, Mark. "Peak Performance: On to the 486." Programmer's Journal (November-December 1990).

Paterson, Tim. "Assembly Language Tricks of the Trade." Dr. Dobb's Journal (March 1990).

Turbo Assembler Quick Reference Guide. Borland International, 1990.

i486 Microprocessor Programmer's Reference Manual. Intel Corporation, 1989.

80386 Programmer's Reference Manual. Intel Corporation, 1986.

Microsystems Components Handbook: Microprocessors Volume I. Intel Corporation, 1985.


_80x86 OPTIMIZATION_
by Michael Abrash


[LISTING ONE]


; Copies one string to another string, converting all characters to
; uppercase in the process, using a loop containing LODSB and STOSB.
; Adapted from Zen of Assembly Language, by Michael Abrash; not a
; standalone program, but designed to be used with the Zen timer from
; that book via the Zen timer's PZTIME.BAT batch file: ZTimerOn starts
; the clock, ZTimerOff stops it, and the test-bed program linked in by
; PZTIME.BAT starts the program, reports the results, and ends.

   jmp   Skip      ;skip over data in CS and subroutine

SourceString   label   word      ;sample string to copy
   db   'This space intentionally left not blank',0
DestString   db   100 dup (?)   ;destination for copy

; Copies one zero-terminated string to another string,
; converting all characters to uppercase.
; Input: DS:SI = start of source string; DS:DI = start of destination buffer
; Output: none
; Registers altered: AX, BX, SI, DI, ES
; Direction flag cleared

CopyStringUpper:
   mov   ax,ds
   mov   es,ax   ;for STOS
   mov   bl,'a'   ;set up for fast register-register
   mov   bh,'z'   ; comparisons
   cld
StringUpperLoop:
   lodsb      ;get next character and point to following character
   cmp   al,bl   ;below 'a'?
   jb   IsUpper   ;yes, not lowercase
   cmp   al,bh   ;above 'z'?
   ja   IsUpper   ;yes, not lowercase
   and   al,not 20h ;is lowercase-make uppercase
IsUpper:
   stosb      ;put character into new string and point to
                        ; following location
   and   al,al   ;is this the zero that marks end of the string?
   jnz   StringUpperLoop ;no, do the next character
   ret

; Calls CopyStringUpper to copy & convert SourceString->DestString.
Skip:
   call   ZTimerOn      ;start timing
   mov   si,offset SourceString   ;point SI to the string to copy from
   mov   di,offset DestString   ;point DI to the string to copy to
   call   CopyStringUpper      ;copy & convert to uppercase
   call   ZTimerOff      ;stop timing






[LISTING TWO]


; Copies one string to another string, converting all characters to
; uppercase in the process, using no string instructions.
; Not a standalone program, but designed to be used with the Zen
; timer, as described in Listing 1.

   jmp   Skip      ;skip over data in CS and subroutine

SourceString   label   word      ;sample string to copy
   db   'This space intentionally left not blank',0
DestString   db   100 dup (?)   ;destination for copy

; Copies one zero-terminated string to another string,
; converting all characters to uppercase.
; Input: DS:SI = start of source string; DS:DI = start of destination string
; Output: none
; Registers altered: AL, BX, SI, DI

CopyStringUpper:
   mov   bl,'a'   ;set up for fast register-register
   mov   bh,'z'   ; comparisons
StringUpperLoop:
   mov   al,[si]   ;get the next character and
   inc   si   ; point to the following character
   cmp   al,bl   ;below 'a'?
   jb   IsUpper   ;yes, not lowercase
   cmp   al,bh   ;above 'z'?
   ja   IsUpper   ;yes, not lowercase
   and   al,not 20h ;is lowercase-make uppercase
IsUpper:
   mov   [di],al   ;put the character into the new string and
   inc   di   ; point to the following location
   and   al,al   ;is this the zero that marks the end of the string?
   jnz   StringUpperLoop ;no, do the next character
   ret

; Calls CopyStringUpper to copy & convert SourceString->DestString.
Skip:
   call   ZTimerOn
   mov   si,offset SourceString   ;point SI to the string to copy from
   mov   di,offset DestString   ;point DI to the string to copy to
   call   CopyStringUpper      ;copy & convert to uppercase
   call   ZTimerOff






[LISTING THREE]


; Clears a buffer using MOV/ADD in a loop.
; Not a standalone program, but designed to be used with the Zen
; timer, as described in Listing 1.

   mov   dx,2      ;repeat the test code twice, to make
            ; sure it's in the cache (if there is one)
   mov   bx,dx      ;distance from the start of one word
            ; to the start of the next
   sub   ax,ax      ;set buffer to zeroes
TestTwiceLoop:
   mov   cx,1024      ;clear 1024 words starting at address
   mov   di,8000h   ; DS:8000h (this is just unused memory
            ; past the end of the program)
   call   ZTimerOn   ;start timing (resets timer to 0)
StoreLoop:
   mov   [di],ax      ;clear the current word
   add   di,bx      ;point to the next word
   dec   cx      ;count off words to clear until none
   jnz   StoreLoop   ; remain
   call   ZTimerOff   ;stop timing
   dec   dx      ;count off passes through test code
   jz   StoreDone   ;that was the second pass; we're done
   jmp   TestTwiceLoop   ;that was first pass; do second pass with all
                                ; instructions and data in the cache
StoreDone:






[LISTING FOUR]


; Clears a buffer using MOV/ADD in an unrolled loop.
; Not a standalone program, but designed to be used with the Zen
; timer, as described in Listing 1.

   mov   dx,2      ;repeat the test code twice, to make
            ; sure it's in the cache (if there is one)
   mov   bx,dx      ;distance from the start of one word
            ; to the start of the next
   sub   ax,ax      ;set buffer to zeroes
TestTwiceLoop:
   mov   si,1024      ;clear 1024 words starting at address
   mov   di,8000h   ; DS:8000h (this is just unused memory
            ; past the end of the program)
   call   ZTimerOn   ;start timing (resets timer to 0)
   mov   cl,4      ;divide the count of words to clear by
   shr   si,cl      ; 16, because we'll clear 16 words
            ; each time through the loop
StoreLoop:
   REPT   16      ;clear 16 words in a row without looping
   mov   [di],ax      ;clear the current word
   add   di,bx      ;point to the next word
   ENDM
   dec   si      ;count off blocks of 16 words to clear
   jnz   StoreLoop   ; until none remain
   call   ZTimerOff   ;stop timing
   dec   dx      ;count off passes through test code
   jz   StoreDone   ;that was the second pass; we're done
   jmp   TestTwiceLoop   ;that was the first pass; do the second pass
                                ; with all instructions and data in the cache
StoreDone:


Copyright © 1991, Dr. Dobb's Journal
