Parallel

Microarchitecture Performance

By Ram Ramanathan, June 20, 2006

Improved throughput in energy-efficient designs for multicore processors and other high-performance systems

Wider Execution Cores

One of the most obvious approaches to improving performance is to increase the number of instructions executed per clock cycle (IPC). Right now, most of the industry's cores execute at most three instructions per clock cycle. The main challenge in widening the execution core further has been the complexity of the physics involved in executing more instructions per clock cycle without increasing the power requirements of the system.

One manufacturer (Intel, the company I work for) is delivering cores that execute up to four instructions per clock cycle, which makes each execution core 33 percent wider than previous-generation cores. This means each core can fetch, dispatch, execute, and retire up to four full instructions simultaneously (see Figure 1), without increasing the power consumption of the system.

Figure 1: Wide dynamic execution allows each core to execute up to four full instructions simultaneously.
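The "33 percent wider" figure follows directly from going from three to four instructions per clock. The short sketch below works through that arithmetic and the corresponding peak (best-case) throughput. The 2-GHz clock is a hypothetical value chosen only for illustration, and the function names are mine; real sustained throughput is lower than this upper bound.

```python
def peak_instructions_per_second(width, clock_hz):
    """Upper bound on instructions retired per second by one core.

    width    -- maximum instructions retired per clock cycle
    clock_hz -- clock frequency in hertz
    """
    return width * clock_hz


def relative_widening(old_width, new_width):
    """Fractional increase in core width (0.5 means 50 percent wider)."""
    return (new_width - old_width) / old_width


CLOCK_HZ = 2_000_000_000  # hypothetical 2-GHz clock, not a figure from the article

three_wide = peak_instructions_per_second(3, CLOCK_HZ)
four_wide = peak_instructions_per_second(4, CLOCK_HZ)

print(f"3-wide peak: {three_wide:,} instructions/s")
print(f"4-wide peak: {four_wide:,} instructions/s")
print(f"widening:    {relative_widening(3, 4):.0%}")
```

Because the clock is the same in both cases, the extra peak throughput comes entirely from width, not from frequency.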

The new wide dynamic execution is achieved through a unique combination of advanced techniques that improve instruction throughput. These techniques include data-flow analysis, speculative execution, out-of-order execution, superscalar execution, and enhanced arithmetic logic units. Further efficiencies come from more accurate branch prediction, deeper instruction buffers for greater execution flexibility, macrofusion, and micro-op fusion.
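To see why a wider core depends on techniques such as data-flow analysis and out-of-order execution, consider a toy scheduling model. This is a sketch under strong assumptions that are mine, not the article's: every instruction takes one cycle, the only limits are issue width and data dependencies, and the scheduler can look at the whole program at once. The function name `cycles_to_execute` is hypothetical.

```python
def cycles_to_execute(deps, width):
    """Count cycles needed by a toy out-of-order core.

    deps  -- deps[i] is the set of earlier instruction indices whose
             results instruction i needs
    width -- maximum instructions issued per cycle

    Each cycle, up to `width` instructions are issued, oldest first,
    from those whose inputs were all produced in earlier cycles.
    """
    done = set()
    remaining = set(range(len(deps)))
    cycles = 0
    while remaining:
        ready = sorted(i for i in remaining if deps[i] <= done)
        issued = ready[:width]
        if not issued:
            raise ValueError("dependencies can never be satisfied")
        done.update(issued)
        remaining.difference_update(issued)
        cycles += 1
    return cycles


# Eight mutually independent instructions: width is the only limit.
independent = [set() for _ in range(8)]

# Eight instructions in a chain: each one needs the previous result.
chain = [set()] + [{i - 1} for i in range(1, 8)]

for name, deps in (("independent", independent), ("chain", chain)):
    print(name, "3-wide:", cycles_to_execute(deps, 3),
          "4-wide:", cycles_to_execute(deps, 4))
```

In this model the independent stream finishes sooner on the four-wide core, while the dependency chain takes the same number of cycles at either width. The extra width pays off only when the core can find enough independent work, which is the job of the techniques listed above.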

Most critical for the future of CMOS, manufacturers expect to be able to scale this optimization and widen the execution cores further without increasing frequency (and therefore power).

Macrofusion At the Core Level

Processors typically decode and execute each incoming program instruction individually. However, one benefit of having more transistors available in advanced microarchitectures is having enough computational power to apply macro techniques at the micro level.

In macrofusion, a processor combines common instruction pairs into a single internal instruction, or micro-operation (micro-op), during decoding (see Figure 2). For example, a processor could combine a compare followed by a conditional jump into one micro-op. The "fused" instruction is then executed as a single instruction. This reduces the total number of instructions that need to be executed for a given task, so that the processor can execute more instructions in a given period of time.

Figure 2: Advanced microarchitecture uses macrofusion to "fuse" common instructions and execute them as a single instruction.
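The decode-time pairing described above can be sketched as a small software model. This is an illustration only, not the hardware's actual decoder: the mnemonic sets, the tuple format, and the function name `decode_with_macrofusion` are assumptions made for the example, and the model simplifies by treating every unfused instruction as exactly one micro-op.

```python
# Illustrative mnemonic sets; the pairs a real processor fuses are
# defined by its microarchitecture, not by this list.
COMPARE_OPS = {"cmp", "test"}
CONDITIONAL_JUMPS = {"je", "jne", "jl", "jle", "jg", "jge"}


def decode_with_macrofusion(instructions):
    """Turn a list of instruction tuples into a list of micro-ops.

    A compare immediately followed by a conditional jump becomes one
    fused micro-op; every other instruction becomes one micro-op.
    """
    micro_ops = []
    i = 0
    while i < len(instructions):
        current = instructions[i]
        following = instructions[i + 1] if i + 1 < len(instructions) else None
        if (following is not None
                and current[0] in COMPARE_OPS
                and following[0] in CONDITIONAL_JUMPS):
            micro_ops.append(("fused", current, following))
            i += 2  # both instructions were consumed by the fused micro-op
        else:
            micro_ops.append(("single", current))
            i += 1
    return micro_ops


program = [
    ("mov", "eax", "[counter]"),
    ("add", "eax", "1"),
    ("cmp", "eax", "10"),
    ("jne", "loop_top"),
    ("mov", "[counter]", "eax"),
]

micro_ops = decode_with_macrofusion(program)
print(len(program), "instructions ->", len(micro_ops), "micro-ops")
for op in micro_ops:
    print(op)
```

In this example the compare and the jump that follows it occupy one slot instead of two, so the same work is done with fewer micro-ops flowing through the rest of the pipeline.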

An enhanced arithmetic logic unit (ALU) then optimizes the macrofusion. The ALU's single-cycle execution of combined instruction pairs further increases performance with less power consumed.
