
Performance Scaling in the Multi-Core Era



"In 1978, a commercial flight between New York and Paris cost around $900 and took seven hours. If the principles of Moore's Law had been applied to the airline industry the way they have to the semiconductor industry since 1978, that flight would now cost about a penny and take less than one second." (Source: Intel PDF 863KB)

In a 1965 paper, Gordon Moore predicted that the number of transistors that could be integrated into a single silicon chip would approximately double every 18 to 24 months. That prediction became widely known as Moore's Law, and engineers at Intel have been transforming that law into reality for more than 40 years (Figure 1). During that time, increases in transistor density have driven roughly proportional increases in processor performance and price/performance. Those gains have powered the growth of today's trillion-dollar electronics industry, put personal computers into businesses and homes throughout the world, and given rise to computing as a fundamental business enabler.


Figure 1.
For more than 40 years, Intel has turned Moore's Law into a reality, continuously increasing transistor density to deliver faster and more powerful processors.

However, those performance gains did not come about solely because of increasing transistor densities. They have also relied heavily on another physical factor that is closely related to transistor size: processor clock frequency.

In general, as transistors get smaller, they can switch faster between one state and another. This has allowed designers to continually increase processor clock frequencies at the same time they have increased the total number of transistors. In many ways, increases in clock frequency have been more important than increases in density. It takes a great deal of engineering ingenuity to make efficient use of more transistors. Frequency gains, on the other hand, deliver instant and easily realizable performance benefits. Existing software code runs faster, without requiring software engineers to revise or optimize their code. This has been a central fact of the computing industry for many years, and a great boon to business users. It has allowed them to realize increasing value from their existing code base, with relatively little effort.

Given this advantage, it's not surprising that frequency ramping has long been the primary engine behind processor performance gains. The first Intel processor was released in 1971 and ran at 400 kHz. Today, some Intel processors support clock speeds that are nearly 10,000 times as fast. Along with high-bandwidth network communications and high-density storage technologies, these gains have been instrumental in ushering in today's era of real-time, data-intensive, Internet-connected business applications.

Moore's Law Moves Forward

"Another decade is probably straightforward...There is certainly no end to creativity."

- Gordon Moore, speaking of extending Moore's Law at the International Solid-State Circuits Conference (ISSCC), February 2003.

Moore's Law continues today and can be expected to deliver increasing transistor densities for at least several more generations. However, in recent years, frequency ramping has faced mounting obstacles. Power consumption and heat generation rise steeply with clock frequency: dynamic power scales with frequency and with the square of supply voltage, and higher frequencies generally require higher voltages, so power grows roughly with the cube of clock speed. Until recently, this was not a problem, since neither power nor heat had risen to significant levels. Now both have become limiting factors in processor and system designs.
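
To make the trade-off concrete, here is a minimal back-of-the-envelope sketch, assuming the textbook dynamic-power model P = C·V²·f and a simple linear voltage-frequency relationship; the constants are arbitrary and illustrative, not measurements of any real processor:

```cpp
#include <cstdio>

// Illustrative model only: dynamic power P = C * V^2 * f, with the
// simplifying assumption that supply voltage scales linearly with
// frequency. All quantities are normalized; only the ratios matter.
int main() {
    const double C = 1.0;  // effective switched capacitance (normalized)

    for (double scale : {1.0, 0.9, 0.8}) {
        double f = scale;          // frequency relative to baseline
        double v = scale;          // assumed voltage ~ frequency
        double p = C * v * v * f;  // dynamic power relative to baseline
        std::printf("frequency x%.1f -> dynamic power x%.2f\n", scale, p);
    }
    // A 20% frequency cut (x0.8) drops dynamic power to ~0.51x, which is
    // why two slower cores can outperform one fast core per watt.
    return 0;
}
```

Under these assumptions, cutting frequency by 20% cuts dynamic power roughly in half, so two cores at 80% frequency can deliver about 1.6x the throughput of one full-speed core within the same power budget. This is the arithmetic behind the multi-core shift discussed later in this article.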

System integrators devote considerable engineering resources to optimizing power and cooling systems to avoid overheating, which can dramatically reduce component longevity. IT organizations face similar challenges. The worldwide server population has grown roughly 150-fold in less than a decade. Most of today's datacenters were simply not designed with this growth in mind, nor were they designed to meet the power and cooling demands of today's high-density servers and blades. Though today's systems deliver far better performance per watt than their predecessors, they have higher total power and cooling requirements. The cumulative cost of meeting these requirements is becoming excessive, especially in view of today's rising utility rates.

For all these reasons, increasing clock frequencies is no longer viable as the primary means for boosting processor performance. Clock frequencies will continue to rise, but only incrementally. New strategies are needed to maintain historic rates of performance and price/performance improvement.

Pushing the Traditional Envelope

Though frequency ramping has been the main source of performance gains over the past 40 years, processor architects have also found many ingenious ways to take advantage of increasing transistor counts to boost performance. The most important strategies are:

  • Larger Data Formats: Today's 64-bit processors are the result of a long evolution from the original 4-bit Intel processor. Each shift has provided faster processing of larger data elements and enabled the processor to directly address larger volumes of memory. These advances have been crucial to performance and capacity scaling, and today's data-intensive applications would not be possible without them.

    However, we are already in the midst of an industry-wide transition to 64-bit processors for mainstream applications. To put this in perspective, a 32-bit processor can directly address up to about 4 GB of memory, while a 64-bit processor can theoretically address up to about 18 million terabytes (actual amounts depend on particular implementations). This represents an enormous leap, and it will likely be some time before hardware and software developers can fully utilize the large address space they now have at their disposal.

  • Instruction Level Parallelism (ILP): With ILP, the processor dynamically evaluates software code streams to determine which instructions are independent and can be safely processed simultaneously or out of order. If one instruction is waiting for data, the processor can then execute an independent instruction to stay productive. This strategy has been increasingly important as processor speed has outstripped memory speed. Today, a processor can wait through hundreds of clock cycles if it has to retrieve data or instructions from main memory, which decreases the value of rising frequencies.

    Over the years, processor designers have invested heavily in optimizing ILP to reduce the impact of memory wait times. However, it takes a lot of complex, high-speed transistor logic to examine software code during runtime, find opportunities for ILP, and reschedule the software code accordingly. For this reason, ILP is very resource intensive, and accounts for considerable energy consumption and heat generation in today's processors. Like frequency ramping, it has reached a point of diminishing returns.

  • Hyper-Threading: In the past, a processor could execute just one stream of software instructions, or thread, at any given instant (though it could switch rapidly between multiple threads). A processor that supports Hyper-Threading can process two threads simultaneously by interleaving the code streams. Since a single thread rarely uses all of a processor's execution resources, this enables greater processing efficiency and greater total throughput for multithreaded applications.

    However, software threads must be explicitly defined in the code, so performance benefits are not automatic. Many existing applications are single-threaded, and must be thread-optimized if they are to take full advantage of Hyper-Threading or other forms of thread-level parallelism (TLP). As software vendors increase their thread-optimization efforts, the benefits of Hyper-Threading will be realized across more applications, but it will not deliver the kind of massive performance increases needed to replace frequency ramping as a primary strategy for boosting performance.

  • Larger Cache: Cache is used to store data and instructions closer to the processor (often on the same chip). Cache is several orders of magnitude faster than today's fastest memory subsystems, so it can dramatically reduce processor wait times. Cache also consumes much less power than logic circuits, so increasing cache size can be a power efficient way to keep the processor productive and increase overall performance.

    Of all the strategies discussed above, larger cache sizes can be expected to drive the most significant and cost-effective performance improvements in the coming years, but the gains will not be enough to meet growing needs. In addition, only memory-intensive applications are likely to see substantial performance benefits from larger cache configurations. Although this includes a great many applications today, and will include even more over time, larger cache sizes will never deliver the near-universal performance boost offered by frequency ramping. (A brief timing sketch after this list illustrates how strongly cache behavior can affect throughput.)

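As promised above, here is a minimal sketch of the cache effect, with arbitrary array and stride sizes; absolute timings will vary by machine and compiler settings, and only the relative difference between the two traversals is the point:

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Minimal cache-behavior demo: sum the same array sequentially
// (cache-friendly) and with a large stride (cache-hostile). Both
// traversals touch every element, so the sums match; only the
// access order differs.
int main() {
    const std::size_t n = std::size_t(1) << 24;  // ~16M ints, bigger than typical caches
    std::vector<int> data(n, 1);

    auto time_sum = [&](std::size_t step) {
        auto start = std::chrono::steady_clock::now();
        long long sum = 0;
        for (std::size_t offset = 0; offset < step; ++offset)
            for (std::size_t i = offset; i < n; i += step)
                sum += data[i];
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::steady_clock::now() - start).count();
        std::printf("step %6zu: sum=%lld, %lld ms\n", step, sum, (long long)ms);
    };

    time_sum(1);     // sequential: cache lines and hardware prefetch help
    time_sum(4096);  // strided: most accesses miss in cache and wait on memory
    return 0;
}
```

On a typical machine the strided pass runs several times slower even though it performs the same arithmetic; that gap is exactly the memory latency that larger caches aim to hide.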

In short, some of the traditional strategies used in today's processor designs will continue to deliver performance benefits, but none will provide the kinds of performance gains that have been delivered in the past by frequency ramping. A new strategy is needed to maintain a fast upward path toward greater performance.

Moving to Multi-Core

The industry's answer to today's performance challenges is to take advantage of ongoing increases in transistor density (i.e. Moore's Law) to integrate more execution cores into each processor. With multiple cores executing simultaneously, processor designers can turn down clock frequencies to contain power consumption and heat generation, while still delivering increases in total throughput for multi-threaded software. Individual threads might be processed slightly slower, due to the lower clock frequencies, but total throughput can be dramatically increased.
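
As a sketch of what multi-threaded software means in practice, the following standard C++ example (the workload and partitioning scheme are arbitrary stand-ins) divides an independent computation across however many hardware threads the machine reports; a single-threaded loop over the same range would leave the extra cores idle:

```cpp
#include <algorithm>
#include <cstdio>
#include <thread>
#include <vector>

// Sketch: split an embarrassingly parallel workload across all
// available hardware threads. Each worker writes only its own slot
// in `partial`, so no locking is required.
int main() {
    const unsigned cores = std::max(1u, std::thread::hardware_concurrency());
    const long long total = 400000000;
    std::vector<long long> partial(cores, 0);
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < cores; ++t) {
        workers.emplace_back([&partial, t, cores, total] {
            long long lo = total * t / cores;
            long long hi = total * (t + 1) / cores;
            long long local = 0;
            for (long long i = lo; i < hi; ++i)
                local += i & 1;  // stand-in for real per-item work
            partial[t] = local;  // one write per thread, no contention
        });
    }
    for (auto& w : workers) w.join();

    long long sum = 0;
    for (long long p : partial) sum += p;
    std::printf("%u threads, result=%lld\n", cores, sum);
    return 0;
}
```

Accumulating into a thread-local variable before the single write to partial[t] keeps the hot loop out of shared cache lines; this kind of explicit restructuring is precisely the thread-optimization work software vendors must do before additional cores pay off.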

In essence, a multi-core processor is similar to a multi-processor server, except that the parallel compute resources are integrated into a single chip. This approach is more cost-effective and power efficient. It also has the potential to deliver better performance through faster core-to-core communications. Intel estimates that integrating more cores into its processors will lead to performance gains as high as 10x within the next three to four years (Figure 2).


Figure 2.
The combination of more cores per processor and thread-optimized software will help to drive rapid increases in performance and power efficiency over the next few years.

Meanwhile, Intel is developing new materials, transistor structures, circuit designs and process technologies that will help to enable ongoing increases in per-core frequency without comparable increases in power consumption. Though this will no longer be a primary driver of performance gains, it will help to maintain and increase per-thread performance in future multi-core processor designs.

Getting the Most Out of Multi-Core Processors

Multi-core processors are clearly the wave of the future across virtually all computing architectures. They hold the promise of ongoing performance scaling through this decade and beyond, with dozens or even hundreds of cores being integrated into future processors. Of course, for these advances to deliver meaningful value, software and usage models must evolve accordingly.

Today's multi-core processors are ideal for well-threaded applications, which include many of today's data tier, technical, scientific and high-volume transactional applications. They are also well-suited for virtualized environments, in which each server is running multiple applications. In this case, the multiple applications constitute independent code streams that can take advantage of the multiple cores. However, the additional cores in a multi-core processor may deliver little or no benefit for a single-threaded application running on a dedicated server.

In general, organizations should look closely at workloads, usage models and per-core performance in assessing the value of multi-core processors for specific implementations. They should also consider the licensing policies of their OS and application vendors. It will take time for software vendors to optimize the full range of mainstream applications for multi-threaded throughput. As they do, the value of multi-core processors will continue to grow.

Performance Scaling in the Multi-Core Era

Part 2: The EPIC Advantage

"The [Intel] Itanium [2 processor] has time on its side and is most likely the architecture with the highest potential."

Source: "Itanium: Is there light at the end of the tunnel?", by Johan De Gelas, Nov. 9, 2005.

In the first part of this two-part series, we discussed the reasons behind today's industry-wide move to multi-core processors, and how that move is expected to drive rapid increases in processor performance over at least the next decade. In this second article, we take a look at multi-core performance scaling for the Intel Itanium 2 processor, to see how it is likely to perform as more and more cores are integrated in future versions.

In 1994, when Intel and HP began working together on a new processor to address escalating computing requirements, it was already apparent that the days of frequency ramping would not last indefinitely. The new architecture was therefore designed specifically to deliver new levels of parallelism that would enable sustainable performance ramping without relying on ever-higher clock frequencies.

This, in itself, was not a radically new idea. Other processor architectures had been designed to increase the number of instructions that could be processed in each clock cycle. However, because the Intel Itanium microarchitecture came later, its designers were able to build on those earlier developments. They combined many of the best elements of previous processor architectures and added a number of innovations of their own. The result was a new model, known as Explicitly Parallel Instruction Computing (EPIC). It was built to enable ongoing improvements in both instruction- and thread-level parallelism, in ways that are not feasible with competing architectures.

Breathing new life into instruction-level parallelism

"I'm bullish on IA-64 because a dream world of compilers that take their sweet time to build and optimize but produce mind-blowing code will surface there first."

Source: The CPU's next 20 years, by Tom Yager, Computerworld, Sept. 7, 2005.

As discussed in the first part of this series, instruction-level parallelism (ILP) is the ability of a processor to analyze and reorder linear software code so that instructions can be processed simultaneously or out of order. The main benefit of this approach is that it helps to keep the processor productive for a higher percentage of the time. If a stream of code is stalled while waiting for data to be retrieved, independent instructions can be processed during the intervening clock cycles. Useful work continues to be performed, and total throughput increases accordingly.

EPIC takes a very different approach to ILP. Instead of depending on complex logic circuits in the processor to analyze the software code during runtime, it relies on the software compiler to find and explicitly identify independent instructions. The compiler is not limited by the time and resource constraints of dynamic, hardware-based ILP optimization. Instead of looking just a few instructions deep, it can look thousands of instructions ahead to find additional opportunities for parallelism.

With this approach, the compiled code is highly optimized for parallel throughput before it ever reaches the processor. The processor is freed from the demanding task of analyzing code, finding opportunities for parallelism, reordering the code and allocating processing resources. Instead, the compiler arranges the code in bundles of independent instructions. The processor simply accepts the code bundles as they come, and devotes all its resources to executing them as fast as possible.

To take full advantage of the parallelism in the optimized code, the Intel Itanium 2 processor is equipped with an exceptionally large set of execution resources. It has 128 general purpose registers and can execute up to 6 simultaneous instructions per cycle (this could be increased in future implementations). This combination of compiler-based optimization and highly parallel processing results in more efficient ILP and higher throughput. It also provides a cost-effective pathway for delivering further increases in ILP, through ongoing compiler enhancements.

EPIC compilers make use of a number of techniques for increasing ILP, accelerating throughput and reducing latencies. Examples include predication (which eliminates branch prediction penalties), speculation (loading data from memory before it is known to be needed), data and instruction prefetching, and cache hints. As with identifying and bundling independent instructions, these strategies are made more effective by the plentiful execution resources in the Intel Itanium 2 processor. The large number of registers reduces the need to shuffle intermediate results, so that more clock cycles can be devoted to straightforward execution.
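
Predication is easiest to see at the source level. In the hedged sketch below, the first version contains a data-dependent branch that the hardware must predict, while the second computes both outcomes and selects one; a compiler may lower that selection to a conditional move, and an EPIC compiler can express it directly as predicated instructions (whether if-conversion actually happens depends on the compiler and target):

```cpp
#include <cstdio>

// Branchy form: on random data the branch is mispredicted roughly half
// the time, and each misprediction discards in-flight pipeline work.
long long sum_branchy(const int* v, int n, int threshold) {
    long long sum = 0;
    for (int i = 0; i < n; ++i)
        if (v[i] > threshold)  // data-dependent branch
            sum += v[i];
    return sum;
}

// Branch-free form: both outcomes are computed and one is selected.
// A compiler can turn the selection into a conditional move or, on an
// EPIC target, into predicated instructions, so no prediction is needed.
long long sum_predicated(const int* v, int n, int threshold) {
    long long sum = 0;
    for (int i = 0; i < n; ++i)
        sum += (v[i] > threshold) ? v[i] : 0;
    return sum;
}

int main() {
    int v[] = {5, 12, 7, 20, 3};
    std::printf("%lld %lld\n", sum_branchy(v, 5, 8), sum_predicated(v, 5, 8));
    return 0;
}
```

Both functions return the same result (here 32); the difference lies entirely in how the control dependence is expressed to the hardware.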

EPIC's potential for multi-core designs

"Although the [Intel] Itanium [2 processor] is capable of sustaining a theoretical maximum of 6 instructions and executing up to 11 instructions, and despite its massive register set, it uses fewer transistors for its core than all competitors."

Source: "Itanium: Is there light at the end of the tunnel?", by Johan De Gelas, Nov. 9, 2005.

Since EPIC relies on the compiler to optimize ILP, it doesn't need to perform that function in hardware. This reduces the need for long instruction pipelines and a lot of complex, energy-consuming logic circuits. As a result, the Intel Itanium 2 processor has a relatively small, high-performing and very power-efficient core.

This may sound surprising, since the Intel Itanium 2 processor is known for being a very large processor. However, it is the exceptionally large cache (up to 9 MB in current designs, and moving to 24 MB in the next-generation, dual-core design) that accounts for its size. The core itself is considerably smaller and more power efficient than competitive architectures. This will make it easier to integrate more cores per processor in the future, while leaving plenty of space and power for large cache configurations.

The importance of per-core performance

Multiple cores increase throughput for multithreaded applications. They also improve throughput in consolidated environments where multiple applications and background tasks can take advantage of the extra cores. However, they do not accelerate execution for individual software threads. This can be an important issue. Fast per-thread performance can be critical for many data-tier applications, especially in real-time business environments where transactions cannot complete until a particular set of data is retrieved. In today's highly integrated computing environments, this becomes even more important, since processing latencies in one application can impact many others.

To avoid slow response times, businesses need to be aware of per-core performance characteristics as they move to multi-core processors. The Intel Itanium 2 processor currently delivers very competitive per-core performance, and its ability to support increasing ILP through compiler optimization will help to enable ongoing improvements. Combined with its advantages for multi-core implementations, this provides strong potential for performance scaling.

Software compatibility going forward

"...the careful attention HP and Intel applied to the architecture definition means that you won't have to re-engineer your software two or three years down the road to run on the next whiz-bang CPU chip, because the architecture is designed to run with the code binary unchanged."

Source: Architecture Makes Itanium Processor Special, by Stephen Satchel.

One of the key advantages of frequency ramping has been the ability to continuously scale performance for existing code. EPIC offers similar advantages, since it was designed to abstract software parallelism from specific hardware implementations. Because of this, software does not have to be recompiled to take advantage of future increases in hardware resources (e.g. more registers, more instructions per clock-cycle, etc.). This will make it easier to deliver increasing performance for existing applications, with less need for software optimization.

Of course, it is important to realize that this advantage applies to individual software threads running on individual cores. Software developers will still need to thread-optimize their code to take full advantage of multi-core processor designs, as they will for all other processor architectures.

Delivering on the Promise

It's one thing to talk about the potential of EPIC architecture for rapid, multi-core performance scaling. It's quite another to realize that potential in the real world. The first real test comes with the release of the first dual-core Intel Itanium 2 processor (formerly code-named Montecito). This processor delivers up to twice the performance of its predecessor while reducing power consumption from 130 watts to about 100 watts; roughly doubling performance at about three-quarters of the power works out to the targeted 2.5x boost in power efficiency.

The reduction in overall power consumption is especially impressive when you consider that the new processor includes not only two cores but also a whopping 24 MB of on-die cache, versus only 9 MB in its predecessor. The new processor also supports Hyper-Threading (two software threads per core). The combination of larger cache and improved threading is especially beneficial for large data-tier applications, most of which are memory intensive and already optimized for multithreaded throughput.

Can Intel sustain this level of performance ramping in future processor versions? The company currently has four Intel Itanium 2 processor generations in development, including a next-generation dual-core design and a follow-on quad-core design. The current dual-core Intel Itanium 2 processor is based on a 90-nm process technology, and its successor will be based on a 65-nm process technology that Intel is already using in many of its manufacturing facilities. The company has also demonstrated a 45-nm device, using a process technology that will allow for chips with less than one-fifth the leakage power of those made today.

Certainly, these advances are not guarantees of performance and price/performance gains in future Intel Itanium 2 processors, but they do indicate that a very promising technology foundation has been established. At the very least, Intel will have plenty of fast, low-power transistors to leverage as it works to simultaneously increase per-core performance and multiply the number of cores in its future processor designs.

Software innovation

As we've already discussed, the performance gains made possible by multi-core processors often require complementary effort on the part of software developers. Though some applications are already optimized for multithreaded throughput, many others are not. The move to thread-optimized software represents a major shift in the industry, and puts new pressure on software vendors to play a role in delivering performance improvements for systems based on multi-core processors.


Figure 3.
The Intel roadmap for the Itanium 2 processor family includes a follow-on dual-core processor and two future quad-core designs, with optimized versions of each for enterprise applications, high performance computing and high density form factors (blades).

Advances in the per-core performance of Itanium-based systems will also depend partly on software innovation. Two factors come into play.

  • First, new applications are being written specifically for Intel Itanium microarchitecture, and these are generally optimized right from the start to take direct advantage of the inherent parallelism of the EPIC computing model.
  • Second, software compilers have a major impact on per-core performance for Itanium-based systems. As compiler technology improves, users can expect to see increasing parallelism in compiled code, with resulting improvements in performance. A key question that remains to be answered is how much parallelism can be extracted from a typical software application, and how much of this parallelism can be achieved automatically during compilation.

To deliver on the performance potential of EPIC architecture, Intel is investing heavily in software optimization, in addition to hardware development. More than 1,000 Intel software engineers are currently working on Intel Itanium 2 compilers and other software tools, and collaborating with software vendors and corporate developers to create, port and optimize code.

Other hardware and software vendors are involved in these efforts, particularly the members of the Itanium Solutions Alliance. The commitment of this community is reflected in its cumulative investment, pledged at $10 billion through the remainder of this decade. The investments of many smaller hardware and software vendors are adding to these efforts.

EPIC potential

The era of frequency ramping has ended and the era of multi-core processors has begun. The Intel Itanium 2 processor is well positioned for this shift, since it is based on the Explicitly Parallel Instruction Computing (EPIC) model, which was specifically designed to enable new levels of parallelism, including both instruction-level parallelism (ILP) and thread-level parallelism (TLP). Its small, power-efficient core will make it easier to integrate more cores, while keeping power consumption low and providing ample space for large, on-die cache. In addition, its compiler-based approach to extracting inherent parallelism from software code will provide another path for ongoing gains in per-core performance.

Is Intel Itanium microarchitecture the wave of the future for multi-core performance scaling? It looks promising...so stay tuned.

For more information, visit the Itanium Solutions Alliance.

