JUN93: PROGRAMMING THE PENTIUM PROCESSOR

PROGRAMMING THE PENTIUM PROCESSOR

New strategies for high-performance architectures

Ramesh Subramaniam and Kiran Kundargi

Ramesh Subramaniam is a Strategic Marketing Manager in the Architecture and Software Technology Group. Kiran Kundargi is Senior Software Engineer at Intel Corporation. They can be reached at Intel, MS RN6-35, 2200 Mission College Blvd., Santa Clara, CA 95052.

The advent of graphical user interfaces, digital audio-visual, and three-dimensional graphics has truly changed end-user computing on PC platforms. Likewise, multitasking integrated environments like OS/2, in combination with the widespread adoption of local area networks, opens the PC to previously uncharted territories such as cooperative computing. However, each of these capabilities demands superior performance from the underlying hardware--specifically the microprocessor, which is at the heart of the platform. But it's not enough to introduce these features in the microprocessor alone. Equally important is the availability of software tools that enable applications to easily take advantage of these features.

Intel's Pentium microprocessor extends the performance curve of the 80386/486. The Pentium is compatible with all existing x86 architectures from the 8086 through the 80486DX/DX2/SX. The Pentium, however, provides a superscalar architecture along with multiple on-chip caches, a branch-prediction mechanism, and other capabilities that result in a significant overall performance improvement over earlier 80x86 processors. Consequently, application developers can employ one of three strategies to enhance performance: Use a faster processor with existing software, compile with a processor-aware compiler, or migrate 16-bit applications to a 32-bit environment. This article focuses on the first two possibilities: architectural enhancements that improve performance of existing software and recompilation with a processor-aware compiler.

Pentium Architecture Overview

Enhancements to the Intel 486 (when compared to the 80386) included an 8-Kbyte unified code and data cache, an improved floating-point unit (FPU) that supports a one-instruction/clock throughput for most frequently used instructions, and greater support for multiprocessing.

Enhancements to the Pentium (when compared to the 80486) include higher clock speeds (the Pentium is the first Intel processor to support 66-MHz operation at both system and internal levels), a superscalar, pipelined architecture, a pipelined FPU, Branch Target buffer, and multiple on-chip caches.

The superscalar architecture (see Figure 1) consists of two pipelined integer units (the U pipe and V pipe, respectively) for integer instructions. The two units can operate in parallel, thus executing two integer instructions concurrently. This makes it possible in most cases to obtain two results per clock cycle, breaking the one-result-per-cycle barrier of existing architectures. For some complex instructions, the internal microcode also exploits its own parallelism and executes using both pipes. As a result, these complex instructions execute in fewer cycles than in earlier 80x86 architecture-based processors. Figure 2 illustrates the parallel pipelined execution.

Pipelined Floating-point Unit

Although an individual instruction may require multiple cycles, the FPU pipeline architecture effectively results in one add result per clock or one multiply result per two clocks. In many cases, if a floating-point instruction requires multiple clocks (keeping the FPU pipe busy), an appropriate number of integer instructions can still be executed in parallel in the integer pipes. The combination of these features means that a floating-point intensive application can execute on the Pentium with improved performance.

Another subtle, but important, enhancement to the FPU contributes to substantial performance improvement in floating-point instruction execution. Intel's existing numeric coprocessors (such as the 80387) contain a set of eight floating-point registers accessible to the software. This set is used as a classic stack and is therefore sometimes referred to as a "floating-point stack." Most numeric instructions on these coprocessors implicitly use the top of the stack (called TOS) as one of the operands and replace it with the result. For example, the FCOS instruction replaces the value at the TOS by COS(TOS). If an application needs to save the result during subsequent floating-point instructions, then the XFXCH instruction is used to change the designation of the TOS. The instruction takes four cycles on the 80486. On the Pentium however, the FXCH instruction is executed in parallel while some other floating-point instruction is being executed in the FPU. Such parallel execution and the fact that the instruction takes only one cycle essentially makes it a "free" (clock-less) instruction. Therefore, appropriately scheduled instruction sequences using this mechanism can significantly improve floating-point instruction performance. The total number of clock cycles needed by individual FPU instructions have also been reduced in the Pentium microprocessor. Table 1 illustrates this by comparing the FDIV instruction on the 80486 and Pentium.

Table 1: Clock-cycle comparison for FDIV.

            Clock Cycles
            486  Pentium
  ----------------------

  Single     36     18
  Double     63     32
  Extended   73     38

Branch Target Buffer

The Pentium has a feature for dynamic branch prediction called the "branch target buffer" (BTB). In any software application, it's common to see conditional loops function calls, unconditional branches, and so on. Depending upon the result of the condition's evaluation, the application may take a branch to a nonsequential location. This requires that the current prefetched instruction stream be drained and a new instruction stream fetched from the new location. This typically results in idling the processor. The existence of an on-chip instruction cache can help to alleviate this problem, in that the target location may be available in the cache; otherwise the processor remains idle until the new instruction is available for execution.

The Pentium's BTB records an association of previously executed branch instructions (from and to addresses). With this information, the processor attempts dynamic prediction of the possible target and prefetches the instruction stream at that location. If the prediction is correct, the branch executes in a single clock, without additional delay. If the prediction fails, the processor follows the normal routine of fetching the new instruction stream either from the instruction cache or from memory. In the latter case, two to three clock cycles are needed.

Caches

Memory accesses by the microprocessor generally tend to reduce overall system performance because the processor idles during the time a memory operand is being accessed. One way to alleviate memory accesses by the microprocessor is to provide on-chip caches. The 80486 provides one 8-Kbyte cache for both instruction and data; the Pentium provides two separate 8-Kbyte caches, one each for instruction and data.

Both the instruction and data caches are two-way set associative and the cache line is 32 bytes wide. Cache misses for either instructions of data do not interfere with the other cache. Additionally, each cache has its own translation lookaside buffer (TLB); therefore, there's no conflict between corresponding virtual-address mappings. As a result of the separate caches and their corresponding TLBs, the memory accesses by the processor are significantly reduced.

Page Size

The hardware page size on the 80486 is 4 Kbytes. For applications that need large memory buffers (such as frame buffers for graphics and file I/O buffers for file servers), a page size of 4K may require accessing multiple 4K hardware pages. For example, the need for a 480K memory buffer would require 120 pages. Each hardware page corresponds to one page-table entry and therefore one slot in the TLB. As the application's data access moves from one page to another, new TLB entries corresponding to the new page need to be loaded, possibly replacing some existing TLB entry. This may result in some thrashing within the TLB resulting in performance overhead.

More Details.

The Pentium addresses this by allowing a 4-Mbyte page size. Once the application accesses such a large page, the corresponding virtual mapping for that page is set up once in the TLB, and no further TLB updates need to take place for that page as the application continues its data accesses within the 4-Mbyte memory space. This reduces memory accesses due to reduced TLB updates, resulting in improved performance. Frame buffers, file servers, database servers, and other applications that can take advantage of large memory buffers (bigger than 4 Kbytes) without paying the penalty of unnecessarily reloading TLB entries will find the large-page-size feature helpful.

Optimizing Compiler Technology

Earlier compilers generated machine code by first generating a lower-level description in a generic "intermediate form" from a high-level language. This description was then translated into machine code by a separate process. This method allowed the development of compilers for different high-level languages because these languages could share the "intermediate form." Also, translating one intermediate form into a new machine language would port several high-level languages to a new processor simultaneously. Early optimizing compilers did more by reordering instructions for optimal memory use, and performed more generic optimizations that improved compiled-code performance. However, even these compilers still used a generic intermediate format that doesn't ideally map onto the instruction set available for any specific processor.

However, because hardware architectures continue to increase in complexity and capability, less optimization is possible without processor-specific knowledge built into the compiler. Therefore, there's a trend toward optimizing processor-aware compilers that implement knowledge of the pipeline architecture, branch handling, cache size, organization, and so on to provide the optimal code for the target hardware implementation.

As Figure 2 shows, the Pentium provides two integer pipelines (U and V pipes) for parallel execution of two integer instructions. Note, however, that two instructions can execute in parallel only if there are no interdependencies between them. Consider the assembler code in Example 1(a). Instruction I2 cannot be executed until I1 is complete because they are interdependent: The eax register is written by I1 and read in I2. Therefore, I1 and I2 must execute sequentially. This instruction sequence will only benefit from the dual execution units of the Pentium if the code is re-compiled using a processor-aware optimizing compiler.

Example 1: (a) Instruction I2 is dependent on I1, so it can't be executed until I1 is complete; (b) the same instruction stream can be rescheduled by a Pentium-aware compiler, resulting in parallel execution.

  (a)

  mov  eax, source    /I1: move source contents to eax
  add  eax, ebx       /I2: add contents of ebx to eax
  instr1...           /I3: some other instruction-
  instr2...           /I4: -not dependent on eax, ebx or source
  instr3...           /I5: "

  (b)

  mov  eax, source    || instr1...
  instr2...           || instr3...
  add  eax, ebx

The processor-aware compiler recognizes such interdependencies and rearranges (or reschedules) the instruction stream so that instructions can indeed execute in parallel with minimum idle cycles. The above instruction sequence can be reshuffled, as shown in Example 1(b).

The new sequence produces exactly the same results without unnecessary idle cycles. On the Pentium, the parallel execution is slightly more complex than presented in the example just described because the two pipelines (U and V) are not exactly equivalent. Of the two pipes, one pipe (U) can execute all the instructions, whereas the second (V) only executes hard-wired instructions. The need to produce correct instruction pairing makes the instruction scheduler all the more important.

For a specific code sequence in a program, it's generally possible to produce different sequences of instruction streams that produce exactly the same result. Some compilers provide user-selectable options for time (execution performance) vs. space (code size) optimizations. A processor-aware compiler analyzes the various possibilities and produces an instruction stream which causes the least number of idle cycles on the Pentium. Figure 3shows an example of a simple loop which can result in different instruction streams, each with a different cycle count.

Figure 3: Code with different instruction streams.

<PRE>  
  Original Source

  static int a[10], b[10]
  int i;
  for (i=0; i<10; i++)
  {
      a[i] = a[i] + 1;
      b[i] = b[i] + 1;
  }

Code Sequence 1              Code Sequence 2           Code Sequence 3
----------------------------------------------------------------------
xor    eax, eax              xor  eax, eax             mov  eax, -40
Loop:                        Loop:                     Loop:
mov    edx, eax              inc  dword ptr            mov  edx,
                                [a + eax * 4]             [a + eax + 40]
sh1    edx, 2                inc  dword ptr            mov  ecx,
                                [b + eax * 4]             [b + eax + 40]
inc    dword ptr [a + edx]   inc  eax                  inc  edx
mov    edx, eax              cmp  eax, 10              inc  ecx
sh1    edx, 2                j1   Loop                 mov  [a + eax + 40],
                                                          edx
inc    dword ptr [b + edx]                             mov  [a + eax + 40],
                                                          ecx
inc    eax                                             add  eax, 4
cmp    eax, 10                                         jnz  Loop
j1     Loop
</PRE>
<P>

Floating-point Code

The majority of commonly used floating-point instructions are pipelined in the Pentium, thus providing overlapped instruction execution. The floating-point unit can accept a new instruction in each clock cycle. Such overlapped execution hides the latencies of each of the individual instructions and in many cases, effectively yields one FPU result per clock cycle. The compiler takes advantage of the floating-point stack to save intermediate results. The FXCH instruction is scheduled in the instruction stream so that the required operand "happens to be" at the TOP of the floating-point stack when needed. Since the FXCH instruction is essentially "no cost," this mechanism improves the floating-point execution performance by a large margin. Note that the integer instructions (in the U and V pipes) can execute simultaneously with the floating-point instructions. Thus, in the case of floating-point instructions that take multiple cycles (for example, fdiv), a number of (multiple) integer instructions can be scheduled in the two-integer units.

Figure 4 shows how code generated using the FXCH instruction on the Pentium can execute faster than on the 80486. Typically, floating-point code executing on the Pentium can be as much as five times faster than code executing on the 80486.

Figure 4: Floating-point code using FXCH.

  Original Source

           subroutine da (x, y, z, n)
           dimension x(n), y(n)

           do 10  i = 1, n
  10       x(i) = x(i) + y(i)*z
           return
            end

  Code sequence without FXCH           Code sequence with FXCH
  -------------------------------------------------------------------------

  fld   dword ptr [esp+8]              fld   dword  ptr [esp + 8]
  fmul  dword ptr [ebx + eax *4]       fmul  dword  ptr [ebx + eax * 4]
  fadd  dword ptr [ecx + eax * 4]      fld   dword  ptr [esp + 8]
  fstp  dword ptr [ecx + eax * 4]      fmul  dword  ptr [ebx + eax * 4 + 4]
  fld   dword ptr [esp + 8]            fxch  st(1)
  fmul  dword ptr [ebx + eax * 4 + 4]  fadd  dword  ptr [ecx + eax * 4]
  fadd  dword ptr [ecx + eax * 4 + 4]  fld   dword  ptr [esp + 8]
  fstp  dword ptr [ecx + eax * 4 + 4]  fmul  dword  ptr [ebx + eax * 4 + 8]
  fld   dword ptr [esp + 8]            fxch  st(2)
  fmul  dword ptr [ebx + eax * 4 + 8]  fadd  dword  ptr [ecx + eax * 4 + 4]
  fadd  dword ptr [ecx + eax * 4 + 8]  fxch  st(1)
  fstp  dword ptr [ecx + eax * 4 + 8]  fstp  dword  ptr [ecx + eax * 4]
  add   eax, 3                         fxch  st(1)
  cmp   eax, ebp                       fadds dword  ptr [ecx + eax * 4 + 8]
  jle   Loop                           fxch  st(1)
                                       fstp  dword  ptr [ecx + eax * 4 + 4]
                                       fstp  dword  ptr [ecx + eax * 4 + 8]
                                       add   eax, 3
                                       cmp   eax, ebp
                                       jle   Loop

Blended Code

Because the 486 and Pentium processors share common architectural features, applying Pentium optimizations to code run on either processor results in performance gain on the 486 CPU. Slightly better performance can be obtained by further tailoring optimizations to one processor or the other, but another approach to the same performance gain is to choose a set of optimizations that are balanced between the two CPUs. This is called blended optimization and uses only those optimizations that do not deteriorate performance on either processor.

Preliminary studies indicate that blend optimization provides well-balanced performance on both the 486 and Pentium processors. The data also indicate that the performance variation when optimizing for different processors is 5 percent or less in all cases. Blend optimization does increase code size, but only slightly--by 4 percent on the 486 and 8 percent with the Pentium.

The Memory-bandwidth Bottleneck

Instruction operands, if not immediately available from the on-chip cache, have to be obtained from an external source, such as the system memory. Because of the comparatively slower memory speed (the low-bus bandwidth over which the data transfers between the memory and the processor, and the additional overhead of the inherent handshaking between the memory subsystem and the processor), any memory access results in a few idle cycles in the processor. Neither the superscalar architecture nor any other performance-enhancing features designed into the microprocessor can alleviate the memory-bandwidth problem. If an operand has to be accessed from memory, a certain number of cycles are lost in obtaining it. Pentium compiler technology addresses this issue. If an operand is available in the cache when needed, it's accessed without any penalty--it's obtained at the highest possible rate. The 80486- and Pentium-aware optimizing compiler therefore provides sophisticated loop-reorganization techniques and vectorizing algorithms that maximize the reuse of cached operands. This automatically reduces the number of memory accesses and feeds the Pentium pipelines at the maximum rate possible.

Other Optimizations

In the past, the limited ability of traditional compilers to optimize instruction ordering and memory utilization meant that programmers had to optimize at the source level. Indeed, programmers often resorted to manual assembly coding to maximize application speed. However, continuing advances in compiler technology are reducing the impact of source-level optimization on application performance. With the advent of processor-aware compiler technology, programs benefit from source-level optimization on far fewer occasions. However, some understanding of the possible compiler optimizations is still necessary to configure the makefile in the best manner for any particular source file. There are also some rare situations in which source-level optimization may increase the degree of optimization that the compiler can attain. For discussion purposes, we've grouped optimizations into two categories: base-level optimizations, which reduce the need for programmers to modify source code to obtain better performance; and individually selectable optimizations, which programmers can configure for improved application performance and which are product specific. Only base-level optimizations are discussed in this article.

Base-level optimizations fall into two categories: generic and processor-specific optimizations.

Generic optimizations are available in most optimizing compilers. Some of the simpler generic optimizations, such as constant and expression evaluation, are also available in simple compilers. See Table 2(a).
Processor-specific optimizations are possible to some degree even if the compiler is not processor-aware. However, detailed knowledge of the processor pipeline architecture, caching, and other hardware features enables far greater performance boosts via these optimizations. See Table 2(b).

Table 2: (a) Generic optimizations; (b) processor-specific optimizations.

      Function        Description                        Coding Practice
                                                         Implications
  -------------------------------------------------------------------------

  (a) Constant        Replaces compiler-determinate      Program can
      Evaluation      expressions with simpler           contain the
                      expressions or numeric             most appropriate
                      constants if possible.             variables and
                                                         expressions.

      Expression      Moves code that does not           Statements that
      Evaluation      produce different results on       use no
                      each iteration of a loop outside   loop-dependent
                      the loop.                          variables can
                                                         occur
                                                         inside loops.

      Loop Invariant  Removes unnecessary code           A single file
      Code Motion     created by conditional             can be compiled to
                      compilation and other              produce various
                      optimization processes.            binaries that
                                                         are related but
                                                         not identical.

      Dead Code       Replaces compiler-determinate      Program can
      Elimination     variables wih numeric              contain the most
                      constants.                         appropriate
                                                         variables
                                                         and expressions.

  (b) Instruction     Replaces simple instructions       Performance-
      Selection       with faster more complex           critical
                      instructions.                      functions
                                                         are less likely
                                                         to require manual
                                                         assembly coding.

      Instruction     Reorders instructions to           Performance-
      Scheduling      improve pipeline throughput.       critical
                                                         functions
                                                         are less likely
                                                         to require manual
                                                         assembly coding.

      Loop            Reduces branch stalls and          Loops do not need
      Unrolling       allows scheduling of one           to be unrolled
                      iteration with instructions        manually.
                      from a subsequent iteration.

      Preloading      Uploads memory addresses to        Programs do not
                      cache before multiple writes to    need to preload
                      the same addresses (writes aren't  the cache.
                      cached without prior reads).

Taking Advantage of the New Technology

Programmers can choose how much effort to expend on targeting software applications for the 80486- and Pentium. Their decision determines the degree to which the code takes advantage of the processor's architecture to obtain maximum performance improvement.

The simplest approach is to do nothing. Existing 80x86-based applications will run on 80486- and Pentium-based systems without change and have faster performance due to increased clock speeds and other architectural enhancements. However, all the processor features in the 80486 and the Pentium won't be used. For example, the floating-point code won't make the best use of the "free" FXCHs in Pentium.

The next alternative is to take existing source code and recompile it using a processor-aware optimizing compiler. As described earlier, the compiler technology will not only provide the "classic" optimizations but also do the Pentium-specific optimizations and help alleviate the memory bottleneck where possible. Thus, recompilation will undoubtedly apply the 80486- and Pentium-specific optimizations to any software application. Numerous processor-aware compilers are available from Intel's compiler partners.

To further take advantage of the architecture, you can modify existing 16-bit code to 32-bit, then recompile it using a processor-aware optimizing compiler. This exposes the program to all the fundamental benefits of 32 bits, in addition to 80486 and Pentium advantages.

Software that process huge amounts of data (database servers and file servers, for instance) may also find it advantageous to update the software to use the Pentium's large page size. For example, a database server could use a large page to load a number of records and/or indexes, thereby avoiding TLB thrashing. Thus, a slight amount of recoding and recompilation will provide significant performance benefits at relatively low costs.

Finally, for applications written partially or entirely in assembly language, a careful analysis and restructuring of the instruction stream may yield substantial performance improvement. You'll need to recognize the various instruction-pairing rules as well as the coupling between the FPU pipe and the integer units. Taking advantage of the unused clock cycles in between existing instruction streams could improve the execution performance significantly. The exact range of improvement depends, of course, on the actual instructions used and whether the program contains integer-only code or both integer and floating-point code.

Conclusion

The Pentium features a superscalar, pipelined architecture with numerous performance improving capabilities providing existing software a significant performance boost. Still, not all performance improvements can be obtained by simply running existing code on Pentium because not all the features of Pentium or the 80486 are used by the software. By taking advantage of all processor-specific features, additional performance benefits can be achieved by recompiling using processor-aware compilers.

Making Compilers Pentium Aware

John Dahms

John is a compiler writer for Watcom. He can be contacted there at 415 Phillip Street, Waterloo, ON, Canada N2L 3X2 or at [email protected].

With a Pentium compiler writer's guide in one hand and a simulator listing in the other, we set out to make Watcom C/C++32 into a Pentium-aware compiler. At first glance, it looked to be easy, but as we got into it some interesting challenges emerged. Here is some of what we learned.

Instruction Scheduling

Because of the Pentium's multiple execution pipes, instruction scheduling, or reordering, is by far the most important optimization performed by the compiler. A three-phase approach to scheduling emerged as the strategy of choice given the framework of the code generator.

RISC-ification, or reduction of integer instructions to use the RISC subset of the instruction set. For example, we turn an add from memory into a load followed by an add from register. The two instructions are a bit larger, but no slower, and they give you more freedom to choose an optimal instruction ordering.
Scheduling, or moving data-dependent instructions so that they're not adjacent. Ideally, a result shouldn't be used until the instruction has had enough time to complete all pipeline stages. This allows your code to take full advantage of the superscalar architecture, and leads to considerable 80486 execution-speed improvement.
Re-CISC-ification, or building instructions back up to take advantage of complex instructions. Often, data dependencies do not allow scheduling in certain portions of the code. You might turn the load/add sequence created in first phase back into an add from memory, if there was not scheduling benefit.

Quirks

There are some interesting architectural issues to address which can materially affect Pentium performance. For instance, instructions with prefix bytes will take an extra clock cycle. In order to reference 16-bit items, an operand size prefix has to be used. We do our best to eliminate these prefix bytes, but it's still a good idea to stay away from short integers.

Floating-point instructions must also be scheduled to take full advantage of the float pipe. The FXCH instruction was designed to make this easier. It can execute simultaneously with other floating-point instructions. Since one operand of a floating-point instruction must be ST(0), a "free" FXCH instruction allows floating-point sequences to be scheduled for better throughput. Two data-independent sequences are interleaved, and FXCH is inserted when an intermediate result needs to be brought to the top of the stack.

This sounds great, but before you rewrite your floating-point library to take full advantage of the "free" FXCH instruction, you need to read the fine print. Certain conditions that must be met before instruction pairing occurs.

First, the FXCH must be followed by another floating-point instruction. Secondly, it must be preceded by a floating-point instruction which does not pop the stack. If these conditions are not met, your code may actually run slower, since the FXCH instructions each take one clock cycle.

Finally, the processor documentation states that "performing a floating-point operation on a memory operand instead of on a stack register adds no cycles." This is true, assuming a cache hit. However, there is a clock penalty for cache misses. Don't be fooled by the clock cycles. Keep integer and floating-point values in registers wherever possible.

80386/486 Considerations

The 80486 only has one integer pipe and one float pipe, but it's still a pipelined architecture. An optimal Pentium ordering is usually optimal, or nearly so, on the 80486 as well. The scheduling avoids pipeline stalls, greatly increasing 80486 performance.

There's one major compromise which must be made if code is to be optimized for both the Pentium and the 80386/486. In this case, you don't use the FXCH instruction to achieve parallelism. It may speed up Pentium code considerably, but will not be free on a 80386/486.

16-bit Code

Remember to optimize your 16-bit code for the Pentium as well. All the same rules apply, except that 32-bit operands now require a prefix byte. Since Watcom C uses a common code generator to produce both 16- and 32- bit code, Watcom C/C++ 16 became Pentium aware for "free". We recompiled a suite of small benchmark programs, and were amazed when we saw a 40 percent average speed increase, with improvements ranging up to a factor of two in execution speed.

Embedded Systems

Programming the Pentium Processor

PROGRAMMING THE PENTIUM PROCESSOR

New strategies for high-performance architectures

Ramesh Subramaniam and Kiran Kundargi

Pentium Architecture Overview

Pipelined Floating-point Unit

Table 1: Clock-cycle comparison for FDIV.

Branch Target Buffer

Caches

Page Size

Optimizing Compiler Technology

Example 1: (a) Instruction I2 is dependent on I1, so it can't be executed until I1 is complete; (b) the same instruction stream can be rescheduled by a Pentium-aware compiler, resulting in parallel execution.

Figure 3: Code with different instruction streams.

Floating-point Code

Figure 4: Floating-point code using FXCH.

Blended Code

The Memory-bandwidth Bottleneck

Other Optimizations

Table 2: (a) Generic optimizations; (b) processor-specific optimizations.

Taking Advantage of the New Technology

Conclusion

Making Compilers Pentium Aware

Instruction Scheduling

Quirks

80386/486 Considerations

16-bit Code

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Embedded Systems Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Embedded Systems

Programming the Pentium Processor

PROGRAMMING THE PENTIUM PROCESSOR

Ramesh Subramaniam and Kiran Kundargi

Related Reading

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Embedded Systems Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content