Wider Execution Cores
One of the most obvious ways to improve performance is to raise the number of instructions executed per clock cycle (IPC). Right now, the industry executes only three instructions per clock cycle. The main challenge in widening the execution core further has been the complexity of the physics involved in executing more instructions per clock cycle without increasing the power requirements of the system.
One manufacturer (Intel, the company I work for) is delivering cores that execute four instructions per clock cycle, which makes each execution core 33 percent wider than previous-generation cores. This means each core can fetch, dispatch, execute, and retire up to four full instructions simultaneously (see Figure 1), without increasing the power consumption of the system.
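The 33 percent figure follows directly from moving from three to four instructions per clock cycle. The short sketch below works through that arithmetic and the resulting peak throughput. The 2 GHz clock is a hypothetical value chosen only for illustration, and the function names are not from any vendor tool; real sustained throughput is usually below this peak.

```python
def width_increase_percent(old_width, new_width):
    """Relative increase in issue width, as a percentage."""
    return (new_width - old_width) / old_width * 100


def peak_instructions_per_second(issue_width, clock_hz):
    """Upper bound on throughput: instructions per cycle times cycles per second."""
    return issue_width * clock_hz


CLOCK_HZ = 2_000_000_000  # hypothetical 2 GHz clock, held constant for both cores

print(round(width_increase_percent(3, 4)))        # 33
print(peak_instructions_per_second(3, CLOCK_HZ))  # 6000000000
print(peak_instructions_per_second(4, CLOCK_HZ))  # 8000000000
```

Holding the clock constant in the comparison mirrors the point of the section: the gain comes from width, not from frequency.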
The new wide dynamic execution is achieved through a unique combination of advanced techniques that improve instruction throughput. These techniques include data-flow analysis, speculative execution, out-of-order execution, enhanced arithmetic logic units, and superscalar execution. Further efficiencies include more accurate branch prediction, deeper instruction buffers for greater execution flexibility, macrofusion, and micro-op fusion.
Most critical for the future of CMOS, manufacturers expect to be able to scale this optimization and further widen the execution cores without increasing frequency (and therefore power).
Macrofusion At the Core Level
The industry typically decodes and executes each incoming program instruction as an individual instruction. However, one benefit of having more transistors available in advanced microarchitectures is that there is enough computational power to apply macro techniques at the micro level.
In macrofusion, a processor combines common instruction pairs into a single internal instruction, or micro-operation (micro-op), during decoding (see Figure 2). For example, a processor could combine a compare followed by a conditional jump into one micro-op. The "fused" instruction is then executed as a single instruction. This reduces the total number of micro-ops that need to be executed for a given task, so the processor can complete more program instructions in a given period of time.
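To make the compare-plus-jump example concrete, here is a minimal software model of the idea. It is only a sketch: real decoders operate on encoded instruction bytes in hardware, the x86-style mnemonics and the sets of fusible instructions are assumptions made for illustration, and the model simplistically treats every unfused instruction as exactly one micro-op.

```python
# Illustrative model only; the fusible sets below are assumptions, not a vendor spec.
COMPARE_OPS = {"cmp", "test"}
CONDITIONAL_JUMPS = {"je", "jne", "jl", "jle", "jg", "jge"}


def decode_with_macrofusion(instructions):
    """Return a list of micro-ops, fusing each compare + conditional-jump pair."""
    micro_ops = []
    i = 0
    while i < len(instructions):
        current = instructions[i]
        following = instructions[i + 1] if i + 1 < len(instructions) else None
        if (following is not None
                and current.split()[0] in COMPARE_OPS
                and following.split()[0] in CONDITIONAL_JUMPS):
            # Emit the adjacent pair as one fused micro-op and skip both.
            micro_ops.append(f"{current} + {following}")
            i += 2
        else:
            # No fusible pair here: the instruction passes through unchanged.
            micro_ops.append(current)
            i += 1
    return micro_ops


program = [
    "mov eax, [rsi]",
    "cmp eax, ebx",
    "jne loop_top",
    "add ecx, 1",
]
fused = decode_with_macrofusion(program)
print(len(program), "instructions ->", len(fused), "micro-ops")
# 4 instructions -> 3 micro-ops
```

In this toy program the four instructions decode to three micro-ops, which is the reduction in work that the paragraph above describes.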
An enhanced arithmetic logic unit (ALU) then optimizes the macrofusion. The ALU's single-cycle execution of combined instruction pairs further increases performance with less power consumed.