Squeezing out more and more performance
Andrew is a member of IBM's compiler development group in Toronto. He can be reached at [email protected]
Although today's complex microprocessors provide for the execution of multiple instructions per clock cycle, the amount of instruction-level parallelism (ILP) possible on most commercially available chips is limited. Two of the primary barriers to increased ILP are mispredicted branches and relatively high latency loads from the processor's memory hierarchy. Many current processors utilize an out-of-order speculative approach to deal effectively with such impediments to increased ILP. Emerging chip designs, such as Intel's IA-64 processor and similar chips from Hewlett-Packard, eschew the out-of-order approach and instead seek to remove ILP barriers by providing the software developer with two powerful new features -- predication and speculation.
On most modern processors, a branch prediction algorithm is employed in an effort to determine whether a particular program branch will be taken. Accurate branch prediction lets the processor predict a program's flow of control and, therefore, only fetch those instructions on the program's actual path of execution. In practice, however, accurate prediction of branches can be difficult, and often requires some form of feedback-directed analysis in order to be effective. Unfortunately, poorly predicted branches can have profound implications on a program's performance -- a 9- or 10-cycle delay for a single mispredicted branch is not uncommon on some processors.
Aside from the overhead associated with mispredicted branches, the presence of a branch in the instruction stream can limit the usefulness of performance-enhancing optimizations such as instruction scheduling. For example, a potentially faulting instruction that is guarded by a branch (such as a load through a pointer that might contain an invalid address) cannot be easily moved above the branch in an effort to achieve a better code schedule. Doing so could introduce a program exception that otherwise might not have occurred.
Chip designers, like those at Hewlett-Packard and Intel, have been working on ways to eradicate the performance problems posed by branches that occur frequently in the typical program, such as those used to implement if-then-else logic. Their solution? Eliminate those branches altogether! To do so, however, requires a wholesale change in the way that test-and-branch logic may be represented in a program. The representation currently finding favor with many hardware designers stems from a technique known as "predicated execution" (often simply referred to as "predication"), support for which is included in Intel's IA-64 family of processors.
Predication is best explained by example. Consider the C if-then-else statement and its pseudoassembly language implementation in Listing One. Each path through the assembly code requires a branch so that only the true or false leg is executed. If the branch is mispredicted, the program can incur a performance penalty. Now imagine that, instead of using a branch to control execution of the statement in Listing One, you compute Boolean values, b1 and b2, and use those Boolean values to switch on/off instructions that make up the if-then-else statement. The code might look something like Listing Two. The first statement compares x with 5 and produces the two Boolean values, b1 and b2. Boolean b1 is True if x is greater than 5, and False otherwise. Boolean b2 is True if x is less than or equal to 5, and False otherwise. If Boolean b1 is True, then b2 is False, and vice versa, since b1 and b2 are logical complements. The second and third statements are the calls to printf, each qualified by one of the Boolean values, as indicated by the (b1) and (b2) notation to the left of each statement. Each of these statements is only executed if its associated Boolean value is True. So, if b1 is True, the statement that prints "x>5" is executed. Because b1 and b2 can't both be True simultaneously, b2 must be False and the statement that prints x<=5 is not executed. Similarly, if b2 is True (and, therefore, b1 is False), only the statement that prints x<=5 is executed. Therefore, using Boolean values, or predicates, to control the execution of program statements, one can transform if-then-else logic into branch-free code. This transformation is known as if-conversion. Unlike Listing One, Listing Two will not incur branch prediction penalties, nor compete for scarce, on-chip branch prediction resources.
Intel's IA-64 family contains extensive support for predicated code, as in Listing Two. The processor supports a variety of comparison instructions that can be used to set predicates to values suitable for implementing fairly complex logic without the use of costly branches. With few exceptions, nearly all IA-64 instructions may be switched on or off using a predicate with the appropriate value. That's not to say that every opportunity to turn test-and-branch logic into predicated code should be exercised. On the contrary, predicating too much code can result in an increase in code path length and, therefore, poor utilization of the processor's limited set of execution units. As with all code-improving transformations, predication should be used with care.
Again, if-converting traditional test-and-branch logic to predicated code allows for more flexible scheduling of instructions. This is because instructions that change the state of memory, or may cause an exception to be raised, cannot safely be scheduled before branches that guard their execution. Indeed, doing so could affect the correctness of the original program. With the branches removed, however, the programmer (or compiler) may more freely schedule instructions in the hopes of improving program performance.
Even in the presence of branches, modern architectures (like the IA-64) allow instructions to be safely moved earlier in the schedule using a technique called "speculation."
Computer programs spend much of their time reading from and writing to memory. Unfortunately, accessing memory is a relatively slow operation -- many times slower, in fact, than the time it takes the processor to decode and issue the accessing instruction itself. On-chip memory caches have helped speed data access considerably, but even cached data can take several cycles before it is available for subsequent processing. During that time, an in-order processor, such as Intel's IA-64, must wait, or stall, before it can execute instructions dependent on the value loaded from memory. Such stalls can have a severely negative impact on program performance.
One way to reduce the amount of time processors wait for data to be loaded from memory is to execute load instructions early enough so that the data returned is available when dependent instructions need it. Listing Three presents two pseudo-IA-64 code sequences that illustrate this. For this example, I assume that the target processor takes 1 cycle to execute all instructions, but requires 3 cycles before data loaded from the on-chip cache is available to subsequent instructions (often referred to as the latency of the load instruction). Listing Three(a) requires 11 cycles to execute, because the processor must stall for 3 cycles between the load instruction that fills register r18 and the subsequent add instruction that uses it.
Listing Three(b), however, requires only 8 cycles to execute, because the load instruction that fills register r18 and the add instruction that uses r18 have been separated by three instructions that require a total of 3 cycles to execute -- just the right amount of time to guarantee that the result of the load is available for consumption by the add instruction without any processor stalls.
In Listing Three, the load instruction was free to move earlier in the instruction sequence because it was not dependent on any of the instructions that came before it (other than the instruction that loaded r17 with the address of "x," which was moved, too, to satisfy the dependency). In many cases, however, dependencies among instructions make application of this technique difficult. The IA-64 code in Listing Four illustrates one such case.
The load instruction in this example is guarded by the branch instruction above it. Since the load instruction may cause an exception (r15 may contain a null pointer, for example), it isn't possible to move it above the branch in an effort to distance it from the add instruction that immediately follows. Doing so could result in a change in program behavior, certainly an undesirable trait of any code optimization technique.
The IA-64 architecture addresses this issue by supporting control speculation, a feature that allows load instructions to be moved above guarding branches without the possibility of modifying the program's correctness in the process. This is made possible by a new kind of load instruction known as the "speculative load," which is similar to a regular load, except it doesn't immediately signal exceptions when detected. When an exception is detected, a speculative load simply records that fact by setting a special bit in the instruction's target register, known as the "Not a Thing" (NaT) bit. A special instruction called the "speculative check" is used to check that the NaT bit for a register is set and, if so, causes execution to transfer to the address of a supplied recovery routine. Listing Five shows how the speculative load and check instructions, indicated by the ld8.s and chk.s mnemonics, respectively, can modify Listing Four so that the load instruction occurs before the branch instruction.
In this example, the original load instruction has been changed to a speculative load and moved above the branch that once guarded its execution. The speculative load instruction now executes unconditionally. If r15 contains an invalid pointer, causing an exception to be detected, the speculative load instruction defers the exception and sets the NaT bit in register r17. It's important to note that if the speculative load were to raise (rather than defer) the exception, the program might not exhibit the same behavior as in Listing Four. This is because the nonspeculative load in Listing Four might not execute at all (depending on the conditions associated with the guarding branch) and, therefore, may never cause an exception to be raised.
Depending on the conditions associated with the branch in Listing Five, it may be possible for execution to reach the add instruction that uses register r17. Since r17 was loaded using a speculative load instruction, a check must be made to ensure that the load of r17 did not result in a deferred exception. This is done by inserting a speculative check instruction before the use of r17. If the speculative check instruction finds that the NaT bit for r17 is set, it branches to the address of the supplied recovery routine, where the exception can be raised. If the chk.s instruction determines that the NaT bit is not set, control simply falls through to the next instruction. So, using IA-64's speculative load and check instructions, it is possible to move load instructions above guarding branches (thus achieving a better program schedule) without altering the way exceptions are raised or suppressed.
A similar, and equally powerful, technique for moving load instructions earlier in the program schedule is known as "data speculation." Instead of moving loads above branches, however, data speculation moves load instructions above store instructions. This poses little problem when the two instructions access different (or nonoverlapping) memory locations. For example, if registers r15 and r16 in Listing Six(a) do not contain the same address, then the store through r15 cannot affect the value at the address contained in r16. Therefore, it is safe to reorder the load and store as in Listing Six(b). If, on the other hand, r15 and r16 contain the same, or overlapping, addresses, then the store in Listing Six(a) affects the value placed in r14 by the load instruction, and the instructions cannot be reordered without affecting the correctness of the original program. Unfortunately, it is not always possible to determine if the source and target registers of loads and stores contain the same, or overlapping, addresses. Such ambiguous memory references make it difficult to perform the code transformation in Listing Six.
Just as it supports control speculation, however, the IA-64 architecture addresses the need to move loads above stores through its support of data speculative load and check instructions. Listing Seven shows how these instructions can perform the transformation in Listing Six, even if the memory references on the load and store instructions are ambiguous.
In Listing Seven, the nonspeculative load instruction from Listing Six has been moved above the store instruction and turned into a speculative, or advanced, load instruction, ld8.a. Aside from performing the actual load operation, the ld8.a instruction writes its source address to a hardware data structure, known as the "Advanced Load Address Table" (ALAT). IA-64 store instructions inspect the ALAT and look for entries that contain an address that overlaps with their target address. If one is found, the ALAT entry is removed; otherwise, the ALAT entry is left intact. To complete the transformation, the original load instruction following the store is replaced with a check load.
Check loads look identical to their corresponding advanced loads, but use the mnemonic ld8.c instead. A check load first inspects the ALAT for an entry containing its source address. If one is found, then no store instruction between the advanced load and the check load wrote to an address that overlaps with the source address of the advanced load; otherwise, the ALAT entry would have been removed by the overlapping store instruction. In this case, the advanced load deposited the correct value in its target register, so nothing more needs to be done. If the check load determines that an entry containing its source address does not exist in the ALAT, then the entry must have been removed by an overlapping store instruction between the advanced load and the check load. (An entry may be missing from the ALAT even if a speculated load and its corresponding store didn't overlap. Because the ALAT is fixed in size, the processor may remove one entry to make room for another. This could cause a check load instruction to fail even if the speculated load was successful. For this example, however, it is assumed that ALAT entries are only removed due to store instructions that overlap with speculated loads.) Overlapping load and store addresses raise the possibility that the wrong value was loaded by the advanced load instruction. To ensure program correctness, a missing ALAT entry prompts the check load instruction to perform the load from memory again.
In many cases, it is useful to speculate instructions data dependent on a load instruction along with the load itself. In Listing Eight, the add instruction dependent on the result of the load instruction has been speculated above a (possibly overlapping) store.
In such cases, the IA-64 chk.a instruction, rather than the ld.c instruction, is used to check the results of the advanced load. If the chk.a instruction determines that the speculated load failed, then it branches to the address of a recovery routine, where the load and all dependent instructions (in this case, only the add instruction) are reexecuted.
As with many code optimization techniques, judicious use of speculation is advised. Only those loads and stores that have little probability of overlapping should be chosen as candidates for speculation. Poor choices can degrade program performance, as the check load instructions may be forced to reexecute many speculated instructions. Inappropriate use of control speculation can result in the execution of loads with results that are never used, causing reduced bus bandwidth, unnecessary page faults, and pollution of the data cache.
Both predication and speculation are viable techniques for improving software performance. Predication can be used to transform if-then-else logic into code not hampered by branch mispredictions or code motion restrictions. Both control and data speculation let loads be safely moved earlier in the program schedule in an effort to hide memory latencies.
The IA-64 family, beginning with the Itanium chip, contains extensive support for both predication and speculation. Several compiler vendors (including IBM) are working on high-level language compilers to exploit these new hardware features. University compiler research groups, such as the IMPACT group at the University of Illinois, are busy looking for new ways to generate efficient code using predication and speculation. Time will tell, however, whether commercial processors incorporating these features fulfill their promise of greatly improved software performance.
if (x > 5) then printf("x > 5\n"); else printf ("x <= 5\n"); cmp x, 5 // compare x with 5 ble l1: // branch if x <= 5 call printf "x > 5\n" b l2: // unconditionally branch out of if-then-else l1: call printf "x <= 5\n" l2:
b1, b2 = cmp.gt x, 5 // compute booleans b1 and b2 based // on comparison of x with 5 (b1) call printf "x > 5\n" // only executed if b1 is true (b2) call printf "x <= 5\n" // only executed if b2 is true
(a) r16 = call rand() add r16 = r16, 2 mul r16 = r16, 4 st8 [r16]=2 ld8 r17 = addr("x") // get address of "x" ld8 r18 = [r17] // load value of variable x from memory add r18 = r18, 1 // add 1 to "x" st8 [r17] = r18 // store "x" to memory <b>(b)</b> r16 = call rand() ld8 r17 = addr("x") // get address of "x" ld8 r18 = [r17] // load value of "x" from memory add r16 = r16, 2 mul r16=r16, 4 st8 [r16], 2 add r18 = r18, 1 // add 1 to "x" st8 [r17] = r18 // store "x" to memory
(p15) br.cond l1 // guard execution of load instruction ld8 r17 = [r15] // load r17 from memory pointed to by r15 add r17 = r17, 2 // use the value loaded in r17 l1:
ld8.s r17 = [r15] // speculatively load r17 from memory pointed to by r15 (p15) br.cond l1 chk.s r17, recovery_code // check NaT bit in r17 and recover add r17 = r17, 2 // use the value loaded in r17 l1:
(a) st8 [r15] = r14 // store through pointer r15 ld8 r14 = [r16] // load through pointer r16 // r15 and r16 do not contain the same address <b>(b)</b> ld8 r14 = [r16] // load and store can be st8 [r15] = r14 // reordered since store // through r15 does not affect the value loaded // through r16
ld8.a r14 = [r16] // load moved above store and st8 [r15] = r14 // turned into advanced load instruction ld8.c r14 = [r16] // non-speculative load replaced // with check load instruction
ld8.a r14 = [r16] // load moved above store and turned into // advanced load instruction add r17 = r14, r12 // the use of r14 is speculated along with the store st8 [r15] = r14 chk.a r14, recovery_code // non-speculative load replaced // with check load instruction