Microarchitecture Performance

Improved throughput in energy-efficient designs for multicore processors and other high-performance systems


June 20, 2006
URL: http://www.drdobbs.com/parallel/microarchitecture-performance/189500676

In his 11 years with Intel, Ram Ramanathan has held positions ranging from engineering to management. He has received four patents and has 10 patents pending in the areas of networking and security. Ram holds a master's degree in mathematics from Madurai Kamaraj University in India.


The computer industry continually needs smaller, faster, more efficient, and more capable servers. However, the fundamental challenge is in delivering that improved performance without increasing the system's power requirements. Today, the challenge is compounded as CMOS manufacturing processes scale toward physical, atomic limits. Complicated physics and breakthrough manufacturing processes are now required to approach the performance-power problem from both directions: raising performance while holding power in check.

The industry has tried a variety of techniques to resolve those challenges. The best implementations combine an optimized microarchitecture, better transistor technologies, more execution cores, advanced memory technologies, and faster data access.

The new implementations are taking advantage of the increased compute density being delivered through breakthrough 65-nm process technology. The additional compute density--twice as many transistors in the same physical footprint--has made it possible to take parallelism down to the level of individual execution cores. The result is a new generation of energy-efficient systems, such as servers with improved instruction throughput that can respond faster to network demands.

Performance Foundation for Microarchitecture

Contrary to popular perceptions, performance is not based solely on clock frequency or on the number of instructions executed per clock cycle (IPC). Performance is the product of both clock frequency and IPC. To increase performance, one must increase either frequency, or IPC, or both. In today's designs, manufacturers are focusing not just on overall system architecture, but on microarchitecture improvements to deliver this performance in an energy-efficient form.

Performance = Frequency x (Instructions per clock cycle)

It is not always practical to improve both the frequency and the IPC. However, increasing one and holding the other close to constant can still achieve a significantly higher level of performance over previous-generation architectures. It is also possible to increase performance by reducing the number of instructions required to execute specific tasks.
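
As a rough illustration of that tradeoff, the C sketch below compares a design that holds frequency constant while raising IPC against a baseline. The numbers are hypothetical, chosen only to show how the equation behaves, and do not describe any particular processor.

    /* Minimal sketch with hypothetical numbers: relative performance is
       frequency (GHz) times average instructions retired per clock (IPC). */
    #include <stdio.h>

    static double performance(double freq_ghz, double ipc) {
        return freq_ghz * ipc;   /* billions of instructions per second */
    }

    int main(void) {
        double baseline = performance(3.0, 1.0);  /* 3 GHz, IPC of 1.0 */
        double wider    = performance(3.0, 1.3);  /* same clock, 30% higher IPC */
        printf("baseline: %.1f  wider core: %.1f  speedup: %.2fx\n",
               baseline, wider, wider / baseline);
        return 0;
    }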

In today's markets, power consumption is a critical challenge, and can be expressed as:

Power consumption = Dynamic capacitance x Voltage^2 x Frequency

Dynamic capacitance is the ratio of the electrostatic charge on a conductor to the potential difference required to maintain that charge; here, it is the switched capacitance required to maintain IPC efficiency. Voltage is the supply voltage of the transistors and I/O buffers. Frequency is the rate, in GHz, at which the transistors and signals switch.

The challenge is in balancing IPC efficiency and dynamic capacitance with the required voltage and frequency, to optimize for performance and power efficiency. The goal is to deliver microarchitectures that have an increased compute density for a given footprint, an improved performance per watt, and yet are still energy-efficient.
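
The C sketch below, again with hypothetical numbers, shows how the two equations interact: because voltage enters the power equation squared, a design that gives up a little voltage and frequency and recovers performance through higher IPC can come out ahead on performance per watt.

    /* Minimal sketch, hypothetical values: dynamic power scales as C * V^2 * f.
       Lowering voltage and frequency slightly, then adding IPC, can improve
       performance per watt even if raw frequency drops. */
    #include <stdio.h>

    static double dynamic_power(double cap, double volts, double freq_ghz) {
        return cap * volts * volts * freq_ghz;
    }

    int main(void) {
        /* Baseline: higher voltage and frequency, narrower core (IPC 1.0). */
        double p0    = dynamic_power(1.0, 1.30, 3.6);
        double perf0 = 3.6 * 1.0;

        /* Alternative: lower voltage and frequency, wider core (IPC 1.3),
           with slightly more capacitance for the extra logic. */
        double p1    = dynamic_power(1.1, 1.15, 2.9);
        double perf1 = 2.9 * 1.3;

        printf("perf/W baseline: %.2f  alternative: %.2f\n",
               perf0 / p0, perf1 / p1);
        return 0;
    }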

Wider Execution Cores

One of the most obvious approaches to improving performance is improving IPC. Today, mainstream processors execute at most three instructions per clock cycle. The main challenge in widening the execution core further has been the complexity of the physics involved in executing more instructions per clock cycle without increasing the power requirements of the system.

One manufacturer (Intel, the company I work for) is delivering cores that execute four instructions per clock cycle--each execution core is 33 percent wider than previous-generation cores. This means each core can fetch, dispatch, execute, and retire up to four full instructions simultaneously (see Figure 1), without increasing the power consumption of the system.

Figure 1: Wide dynamic execution allows each core to execute up to four full instructions simultaneously.

The new wide dynamic execution is achieved through a unique combination of advanced techniques that improve instruction throughput. These techniques include data-flow analysis, speculative execution, out-of-order execution, enhanced arithmetic logic units, and superscalar execution. Further efficiencies include more accurate branch prediction, deeper instruction buffers for greater execution flexibility, macrofusion, and micro-op fusion.

Most critical for the future of CMOS, manufacturers expect to be able to scale this optimization and further widen the execution cores without increasing frequency (power).

Macrofusion At the Core Level

The industry typically decodes and executes each incoming program instruction as an individual instruction. However, one benefit of having more transistors available to an advanced microarchitecture is having enough computational power to apply macro techniques at the micro level.

In macrofusion, a processor combines common instruction pairs into a single internal instruction, or micro-operation (micro-op), during decoding (see Figure 2). For example, a processor could combine a compare followed by a conditional jump into one micro-op. The "fused" instruction is then executed as a single instruction. This reduces the total number of instructions that need to be executed for a given task, so that the processor can execute more instructions in a given period of time.

Figure 2: Advanced microarchitecture uses macrofusion to "fuse" common instructions and execute them as a single instruction.

An enhanced arithmetic logic unit (ALU) then optimizes the macrofusion. The ALU's single-cycle execution of combined instruction pairs further increases performance with less power consumed.
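
To make the idea concrete, the C fragment below (an illustration, not Intel-specific code) contains loop-exit and zero tests that a compiler typically turns into a compare instruction followed immediately by a conditional jump--exactly the kind of adjacent pair that macrofusion can combine. Whether fusion actually occurs depends on the compiler's output and the processor.

    /* Minimal sketch: each test below usually compiles to a compare
       instruction followed directly by a conditional jump (e.g., cmp + jne).
       That adjacent pair is a candidate for macrofusion into one micro-op. */
    #include <stddef.h>

    long sum_until_zero(const long *values, size_t n) {
        long total = 0;
        for (size_t i = 0; i < n; i++) {   /* compare i with n, then branch */
            if (values[i] == 0)            /* compare with zero, then branch */
                break;
            total += values[i];
        }
        return total;
    }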

Micro-op Fusion: An Additional Energy-saving Technique

Typical mainstream processors break down x86 program instructions (macro-ops) into small pieces--the internal instructions called "micro-ops"--before sending the instructions down the processor pipeline to be processed.

Micro-op fusion takes macrofusion down another level. In advanced micro-op fusion, the execution core "fuses" common micro-ops derived from the same macro-op. This reduces the number of micro-ops that need to be executed. The result is more efficient scheduling and better performance at lower power. In fact, studies have shown that micro-op fusion can reduce the number of micro-ops handled by the out-of-order logic by more than 10 percent.
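
As a hedged illustration, consider the C fragment below. A compiler will often turn the accumulation into a single add-from-memory instruction; that one macro-op decodes into a load micro-op and an arithmetic micro-op, which micro-op fusion can keep together through much of the front end, reducing scheduling work.

    /* Minimal sketch: "total += data[i]" is often compiled to a single
       add-from-memory instruction (a load plus an add). That instruction
       decodes into more than one micro-op, and micro-op fusion can keep
       them fused for much of the pipeline. The fusion itself is internal
       to the processor; this code merely produces the pattern. */
    long accumulate(const long *data, long n) {
        long total = 0;
        for (long i = 0; i < n; i++)
            total += data[i];
        return total;
    }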

Manage Power Intelligently

Power gating is a technique that reduces power consumption, including the runtime power consumption of a processor's execution cores. The technique is used to power up logic subsystems only if and when they are needed.

Power gating has traditionally been challenging for industry because of the energy consumed in powering the subsystem down and ramping it back up for use. Gating has also been challenging because of the need to maintain overall system responsiveness when returning the subsystem to full power.

Advanced power gating in 65-nm CMOS now allows for intelligent, ultra fine-grained logic control of the individual processor logic subsystems. Only those individual subsystems that are currently required are powered on. With a finer granularity of subsystems, power gating also minimizes the number of subsystems that require power.

In addition, some manufacturers, such as Intel, split many buses and arrays, so that the data required in some modes of operation can be put into a low power state when not needed. The result is optimized energy use in a design that delivers more performance per watt without sacrificing responsiveness.

Challenges In Optimizing Cache

One of the major advantages of moving to a 65-nm process is having enough additional transistors to increase resources and/or improve performance in many critical areas. One such area is cache.

When processors have multiple cores, parallelism must be applied at the core level, not just the processor level. This means optimizing the way the execution cores exchange and share data. Without that optimization, multiple cores will introduce data and memory contention--a traditional issue in microarchitecture design.

The industry addresses this challenge by giving each core its own L2 cache. It's simple: no core has to fight with other cores for access to its own cache. The problem with this approach is twofold: When two execution cores need the same data, each core must store that data in its own L2 cache, duplicating the work. And when one core isn't fully using its cache, other cores cannot use that underutilized cache for other tasks.

Intelligent Cache

With an increase in transistor density, manufacturers such as Intel can build significantly more cache for each core. This increases the probability that each execution core can access data from the faster, more efficient cache subsystem. Advanced parallelism in the microarchitecture then optimizes the use of that cache to reduce latency to frequently used data.

Each execution core now has a dedicated L1 cache for data specific to that core. Since more data is available locally, fewer fetches are made outside the processor, and traffic on the system bus is reduced. This reduces memory latency and accelerates data flow. All cores then share a larger L2 cache for common data, to better optimize cache resources.

The advanced parallelism takes work traditionally done in the processor architecture and performs it at the micro level--at the core level, core-to-core level, and memory level. Since this method uses fewer hardware elements in the server platform, power requirements are also reduced. The result is greater performance at an increased level of energy efficiency.
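
A small, hypothetical C sketch of why the shared cache helps: the two threads below read the same lookup table from (potentially) different cores. With a shared L2, one copy of the table can serve both cores; with private L2 caches, each core would tend to hold its own duplicate. The code only sets up the access pattern; the caching behavior is up to the hardware.

    /* Minimal sketch: two threads read the same read-only table. A shared L2
       cache can hold a single copy that serves both cores. Build with a
       pthreads-capable compiler (e.g., cc file.c -lpthread). */
    #include <pthread.h>
    #include <stdio.h>

    #define TABLE_SIZE 65536
    static int table[TABLE_SIZE];

    static void *reader(void *arg) {
        long sum = 0;
        for (int pass = 0; pass < 100; pass++)
            for (int i = 0; i < TABLE_SIZE; i++)
                sum += table[i];           /* shared, read-only accesses */
        return (void *)sum;
    }

    int main(void) {
        for (int i = 0; i < TABLE_SIZE; i++)
            table[i] = i;
        pthread_t t1, t2;
        pthread_create(&t1, NULL, reader, NULL);
        pthread_create(&t2, NULL, reader, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        puts("done");
        return 0;
    }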

Dynamic Allocation of L2 Cache

Another advanced optimization being used by Intel is dynamic allocation of the shared L2 cache, based on each core's requirements. Each core can now dynamically use up to 100 percent of available L2 cache. If one core has minimal cache requirements, the other core can dynamically increase its proportion of L2 cache (Figure 3). This helps decrease cache misses and reduce latency.

Dynamic allocation of L2 cache also allows each core to obtain data from the cache at higher throughput rates as compared to previous-generation architectures. This increases processor efficiency, increasing absolute performance, as well as performance per watt, a critical benefit for servers.

Figure 3: Dynamic allocation of L2 cache, based on each core's requirements.

Challenges and Approaches To Memory Access

No matter how much cache is put in the system, data must still be fetched from main memory to go into the cache. The industry has explored many techniques to speed up that main memory access, from designing a hardware-based memory controller into the processor, to optimizing memory access through more flexible designs and methodologies.

Each set of techniques has its benefits. However, using a single hardware-based memory technology means that a design cannot easily take advantage of newer, more advanced techniques for improving memory access. The better designs use architectures flexible enough to support multiple memory technologies, to meet any requirements in the system.

These advanced designs use intelligent memory access to optimize the use of the available data bandwidth from the memory subsystem, and to hide the latency of memory accesses. This ensures that data can be used as quickly as possible. It also helps make sure that data is located as close as possible to where it's needed. Intelligent memory access minimizes latency and significantly improves efficiency and speed of memory accesses.

Increasing the Efficiency of Out-of-order Processing

A traditional challenge in speeding up memory access is the ambiguity inherent in fetching data from memory before earlier store instructions have resolved. That ambiguity is one of the main reasons there is latency in out-of-order processing.

New, advanced memory disambiguation resolves this by providing execution cores with the built-in intelligence to speculatively load data for instructions that are about to execute--before all previous store instructions are completed.

In implementations without memory disambiguation, each load instruction that needs to read data from main memory must wait until all previous store instructions are completed before it can read that data in. Loads can't be rescheduled ahead of stores because the microprocessor doesn't know if it might violate data-location dependencies. Yet in many cases, loads don't depend on a previous store.

Memory disambiguation uses special, intelligent algorithms to evaluate whether or not a load can be executed ahead of a preceding store. If the system intelligently speculates that it can prefetch the data, then the load instructions are scheduled before the store instructions. The processor spends less time waiting and more time processing. To avoid putting additional requirements on the system, disambiguation is done during periods when the system bus and memory subsystems have spare bandwidth available.

In the rare event that a load is invalid, memory disambiguation has built-in intelligence to detect the conflict, reload the correct data, and reexecute the instruction.
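
The C fragment below is an illustrative sketch of the situation memory disambiguation targets: the load from b[i] usually does not depend on the preceding store to a[i], but the hardware cannot know that until the store address resolves. With disambiguation, the load can be issued speculatively ahead of the store, with recovery in the rare case that the two actually overlap.

    /* Minimal sketch: the store to a[i] and the load from b[i] usually touch
       different addresses, but without disambiguation the load must wait for
       the store address to resolve. With disambiguation, the hardware can
       speculate that they differ and issue the load early. */
    void scale_and_add(int *a, const int *b, int n, int k) {
        for (int i = 0; i < n; i++) {
            a[i] = k * i;     /* store */
            int x = b[i];     /* load that usually doesn't depend on the store */
            a[i] += x;        /* uses the loaded value */
        }
    }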

Memory disambiguation is a sophisticated technique that helps avoid the wait states imposed by less capable microarchitectures. The result is faster execution and more efficient use of processor resources.

Doubling the Number of Prefetchers

Microarchitectures based on the new 65-nm process are also doubling the number of advanced prefetchers available per cache. Prefetchers do just that--they "prefetch" memory contents before the data is requested, so the data can be placed in cache and readily accessed when needed. By increasing the number of loads that occur from cache as opposed to main memory, these microarchitectures reduce memory latency and improve performance.

Specifically, to ensure data is where each execution core needs it, there are now two prefetchers per L1 cache and two prefetchers per L2 cache. These prefetchers detect multiple streaming and strided access patterns simultaneously. This lets them ready data in the L1 cache for "just-in-time" execution. The prefetchers for the L2 cache analyze accesses from cores to help make sure the L2 cache holds the data which the cores may need in the future.
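
Hardware prefetchers cannot be called directly from source code; they watch for regular address sequences like those in the illustrative C fragment below, which shows a streaming pattern and a fixed-stride pattern. This is a general illustration of the access patterns involved, not Intel-specific behavior.

    /* Minimal sketch of access patterns hardware prefetchers are built to
       detect: sequential (streaming) and fixed-stride loops give the
       prefetchers a predictable address sequence to run ahead of. */
    long stream_sum(const long *data, long n) {
        long sum = 0;
        for (long i = 0; i < n; i++)       /* streaming: consecutive addresses */
            sum += data[i];
        return sum;
    }

    long column_sum(const long *matrix, long rows, long cols, long col) {
        long sum = 0;
        for (long r = 0; r < rows; r++)    /* strided: step of 'cols' elements */
            sum += matrix[r * cols + col];
        return sum;
    }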

The combination of advanced prefetchers and memory disambiguation delivers significantly improved execution throughput. The result is better performance through the highest possible instruction-level parallelism.

Doubling Throughput of Streaming SIMD Extension Instructions

Streaming SIMD extension instructions are also known as SSE, SSE2, and SSE3 instructions. They accelerate a range of applications, such as video, speech and image processing, photo processing, encryption, and financial, engineering, and scientific applications. Today, almost all servers execute these 128-bit instructions at a sustained rate of one complete instruction every two clock cycles: The lower 64 bits are executed in one clock cycle, and the upper 64 bits are executed in the next clock cycle.

However, wide dynamic execution now allows four 32-bit instructions (instead of three instructions) to be executed in a single clock cycle. This opens an opportunity for greater parallelism inside the execution core.

By moving to floating-point mathematics and improving methodology, one manufacturer is already delivering microarchitecture that executes two 64-bit instructions in a single clock cycle. This means that 128-bit instructions can be executed at a throughput rate of one full instruction per clock cycle (see Figure 4). Since floating-point mathematics can be performed faster than in previous-generation processors, this approach effectively doubles the speed of execution for SIMD-extension instructions.

Figure 4(a): Doubling throughput of SIMD extension instructions. Typical industry execution of streaming SIMD extension instructions breaks a 128-bit instruction into two 64-bit instructions and takes two clock cycles.

Figure 4(b): Doubling throughput of SIMD extension instructions. Advanced microarchitecture fully executes 128-bit streaming SIMD extension instructions at a throughput rate of one per clock cycle, doubling execution speed.
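
The C sketch below uses standard SSE intrinsics to issue 128-bit packed single-precision additions. On a core that can execute a full 128-bit SSE operation each clock, each addition occupies one cycle of the SIMD unit instead of being split into two 64-bit halves; the actual throughput depends on the processor.

    /* Minimal sketch: each _mm_add_ps adds four packed 32-bit floats with one
       128-bit SSE instruction. On hardware with full 128-bit execution, the
       instruction can sustain a throughput of one per clock. */
    #include <xmmintrin.h>

    void add_arrays(float *dst, const float *a, const float *b, int n) {
        int i;
        for (i = 0; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);   /* load four floats */
            __m128 vb = _mm_loadu_ps(&b[i]);
            _mm_storeu_ps(&dst[i], _mm_add_ps(va, vb));
        }
        for (; i < n; i++)                     /* scalar tail */
            dst[i] = a[i] + b[i];
    }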

New Standards for Energy-efficient Performance

In response to the industry's growing concern with energy efficiency, not just performance, Intel has developed and implemented advanced and unique techniques in microarchitecture. With state-of-the-art microarchitecture, desktops can now deliver greater compute performance as well as ultra-quiet, sleek, and low-power designs. Servers can deliver greater compute density, and laptops can take the increasing compute capability of multi-core to new mobile form factors. The result is a new generation of high-quality, scalable, energy-efficient platforms for the desktop, server, and mobile markets.

For More Information

To learn more about energy-efficient performance at Intel, go to http://www.intel.com/technology/eep/index.htm?ppc_cid=c98.
