Multi-core & Multi-threaded: A New Era in Computing

Helping advance a new era in computing through new methods of thread-level parallelization


May 10, 2006
URL: http://www.drdobbs.com/architecture-and-design/multi-core-multi-threaded-a-new-era-in/187201936

Antonio Gonzalez is the founding director of the Intel-UPC Barcelona Research Center.


A new era of computing is upon us and the key word is "multi." This new era is characterized by multi-core processors enabling single-die multiprocessing of multithreaded applications. Spurring the need for this new era are new workloads and usage models that put enormous demands on computing platforms at a time when processor performance is constrained by energy consumption and power dissipation.

Multi-core processors and multithreading are fundamental to attaining this computing power. They let us capitalize on the thread-level parallelism necessary for achieving higher performance and power efficiency. Unfortunately, current tools and techniques aren't designed to maximize the parallelism in most programs, at a time when we need software to make this technological leap. Consequently, researchers are investigating compiler techniques and microarchitectures that can produce highly threaded code. In this article, I examine one such model, called "speculative threading," which has the potential for enormous processing gains with only a small increase in power requirements.

The Need for Speculative Threading

The microprocessor is on the verge of perhaps its most significant transformation since its creation. Intel's announcement in early 2005 of a microprocessor with two full processing cores ushered in this new era of computing. These dual-core processors--a single package containing two complete cores--are the first step in a massive transition to multi-core computing. Moreover, research is underway on architectures that could hold dozens or even hundreds of processors on a single die.

Continuous increases in transistor count will deliver higher performance in every new processor generation. Performance projections based on past growth point toward processing speeds of one trillion instructions per second by 2010. To get there, this new era of computing demands a thorough rethinking of basic foundations such as process technology, architecture, and software.

Design Challenges and the Move to Multi-Core

One of today's major challenges is maximizing performance within a power and thermal envelope. Increasing activity density, together with leakage currents from ever-smaller transistors (which leak even in their "off" state), is a challenge for the entire industry. This is particularly true for mobile, battery-operated devices. Multi-core and multithreaded platforms figure to play a big role in driving power and thermal efficiency: their ability to exploit thread-level parallelism typically makes them more power-efficient than counterparts that exploit instruction-level parallelism.

Additional challenges come from variation--the difference in characteristics of "identical" transistors side-by-side on a die. Quantum mechanical effects that were largely ignored in previous process technology generations have increasing influence at 45nm, 32nm, and beyond. This calls for new circuits and microarchitectures that can adapt to these variations.

Thread-level parallelism--the ability to simultaneously process multiple sequences of instructions, or threads (parts of programs that can execute independently)--can dramatically improve overall performance. Each of these threads runs on one of the multiple hardware contexts available in multi-core and multithreaded designs. These designs increase performance by letting many threads make progress at once rather than by pushing a single thread to run faster.
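
As a minimal illustration of thread-level parallelism (my own sketch in standard C++, not code from the article), two threads sum the two halves of an array concurrently; on a multi-core processor, each runs on its own core:

#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<int> data(1'000'000, 1);
    long sum_lo = 0, sum_hi = 0;
    auto mid = data.begin() + data.size() / 2;

    // Each thread runs on its own hardware context; on a multi-core
    // processor the two halves are summed at the same time.
    std::thread lo([&] { sum_lo = std::accumulate(data.begin(), mid, 0L); });
    std::thread hi([&] { sum_hi = std::accumulate(mid, data.end(), 0L); });
    lo.join();
    hi.join();

    return sum_lo + sum_hi == 1'000'000 ? 0 : 1;
}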

Handling RMS Workloads

High-end applications tend to trickle down to wider use as mainstream systems develop the performance characteristics required to run them. This is the case for recognition, mining, and synthesis (RMS)--applications now performed primarily on powerful supercomputers. As multi-core platforms take over the workplace and home, RMS will play a crucial role in enabling these computers to understand and "see" data the way humans do and help us tap all the digital wealth that surrounds us.

Traditionally, we have treated the "R," "M," and "S" components as independent application classes. For example, computer vision (recognition), data mining (mining), and 3D photorealistic graphics (synthesis) are traditionally considered independent, stand-alone applications. However, an integration of these component applications, if achieved in real time in an interactive RMS (iRMS) loop, could lead to new uses in everything from the sciences and medicine to business and entertainment.

The Promise of Speculative Threading

All these workloads require significant software innovation. Today we rely on software developers to express parallelism in the application, or depend on automatic tools (compilers) to extract this parallelism. These methods are only partially successful. To run RMS workloads and make effective use of many cores, we need applications that are highly parallel almost everywhere.

For example, compilers must generate code that works in all cases. Consequently, any time a compiler parallelizes two pieces of code, it must consider all potential dependences: it has to analyze whether one piece of code might write something to a memory location that the other piece may read. Unfortunately, in most cases the compiler has only an approximate view of the memory locations touched by each instruction, so when it looks for potential dependences, it tends to be over-conservative. If the compiler cannot prove that two instructions are independent, it presumes that they are dependent. This means that when the compiler generates code, it assumes a huge number of dependences that don't exist or rarely exist. The code ends up overly serialized and misses many opportunities for parallelization.
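
A small C++ sketch of this conservatism (my own illustration, not from the article): the compiler cannot prove that dst and src never overlap, so it must assume each iteration may depend on an earlier one and serialize accordingly.

// If dst and src alias (say, dst == src + 1), iteration i writes
// the value that iteration i+1 reads -- a true dependence. The
// compiler cannot rule that out from this code alone, so it must
// generate conservative, serialized code even though in practice
// the arrays almost never overlap.
void scale(float *dst, const float *src, int n, float k) {
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}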

Now consider how we might instead design multi-core platforms that parallelize applications in a "speculative" way. The speculation here consists of creating speculative threads--threads that do not necessarily commit and thus are not necessarily correct. At the time they are spawned, it's unknown whether they will work well, so at some point they have to be checked: they might speed up the application, or they might be incorrect and discarded. The key is that their correctness is verified down the line. If they are not correct, no damage is done, because errant threads are simply squashed and their activity discarded. There's a built-in safety net, so the processor ends up in the same state it would have been in had the threads never executed.
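
That safety net can be caricatured in ordinary C++ (a hypothetical software sketch; in real designs the buffering is done in hardware): the speculative thread writes only to private state, so squashing it simply means dropping that state.

#include <thread>
#include <vector>

int main() {
    std::vector<int> committed{1, 2, 3};
    std::vector<int> speculative = committed;   // private working copy

    // "Spawn": the speculative thread runs ahead on the private copy.
    std::thread spec([&] {
        for (int &x : speculative) x *= 2;
    });

    bool check_passed = true;   // in reality, set by a runtime check
    spec.join();

    if (check_passed)
        committed = speculative;    // commit the speculative work
    // else: squash -- the private copy is discarded and `committed`
    // is exactly as if the thread had never executed.
}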

Speculative threads could revolutionize how we parallelize applications. No longer would compilers have to be conservative. Instead of generating code for the worst case, compilers could generate code for the common case. The result would be a much higher degree of parallelism and a significant gain in performance.

Speculative Threading in Memory Dependences

One application of speculation is memory dependences. Compilers are severely limited because they cannot do an exact analysis of which memory locations will be touched by every single instruction. This makes it hard for a compiler to determine whether two instructions are dependent; unable to prove that they are independent, the compiler assumes that they are dependent. Using speculative threads, we could instead assume in many cases that dependences do not exist, and check each speculative thread at runtime. If it is detected that a speculative thread read a memory location that is later written by a thread that comes earlier in the sequential order, that memory violation would be spotted and the speculative thread squashed. In most cases, though, the speculative threads would succeed and the program would be sped up.
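
A toy version of that runtime check (my own sketch, with invented names): track the speculative thread's read set, and squash the thread if a logically earlier thread later writes any address in that set.

#include <cstdint>
#include <set>

using Addr = std::uintptr_t;

struct SpecThread {
    std::set<Addr> read_set;    // locations this speculative thread read
    bool squashed = false;
};

// Called whenever a logically earlier thread writes `addr`: if the
// speculative thread already read that location, it consumed a stale
// value -- a memory dependence violation -- and must be squashed.
void on_earlier_write(SpecThread &spec, Addr addr) {
    if (spec.read_set.count(addr) != 0)
        spec.squashed = true;
}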

Speculative Threading in Values

Another form of speculation with enormous potential is value speculation. In Figure 1(a), for instance, the rectangle represents a sequence of instructions.

Figure 1: (a) The creation of a spawning pair--a set of two points in the program (each at the beginning of a basic block). The former is called the "spawning point" (SP), marked by the spawn instruction, and it identifies when a new speculative thread is created. The latter is called the "control quasi-independent point" (CQIP) and represents where the speculative thread starts executing (after some initialization) and is identified as an operand of the spawn instruction.

In a conventional sequential processor, such instructions are executed one after another (or sometimes out of order within a small window and then retired in order). Suppose that for a multi-core platform we want to parallelize the code in red with the code in yellow so that they execute simultaneously. Performing a dependence analysis, today's approaches would find a dependence through variable A: because the yellow part reads variable A, a synchronization would be inserted to make the yellow part wait for the red part to produce this value. The same happens for variable R1. In other words, between these two sections there are two true dependences, through R1 and A. Consequently, a conventional compiler will insert a synchronization between each pair of potentially dependent instructions.

But what if there were a way to make a well-reasoned "guess" at the values these variables will take in the red part? Then you wouldn't need to wait for these values to be produced and could execute the yellow part in parallel with the red part. When the red part finishes, a check could verify whether the guesses were correct. If they were, everything is fine: the code has been parallelized. If they were wrong, the thread is squashed and the yellow part is executed after the red, so there's no gain (nor much of anything lost).
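
In sketch form, the guess-then-verify pattern looks like this (hypothetical stand-ins for the red and yellow sections; my own illustration, not Mitosis code):

#include <thread>

int red(int &A)   { A = 7; return A * 3; }  // produces A
int yellow(int A) { return A + 1; }         // consumes A

int main() {
    const int guessed_A = 7;    // the well-reasoned guess for A
    int yellow_result = 0;

    // Execute the yellow part speculatively, in parallel with red,
    // using the guessed value instead of waiting for the real one.
    std::thread spec([&] { yellow_result = yellow(guessed_A); });

    int A = 0;
    int red_result = red(A);    // the red part runs normally
    spec.join();

    if (A != guessed_A)                 // validate the guess
        yellow_result = yellow(A);      // wrong: squash and re-execute
    return red_result + yellow_result;  // right: nothing to redo
}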

The way we're proposing to guess these values is through a software technique called a "precomputation slice." The compiler inserts a spawn instruction in the red code that creates a speculative thread. The speculative thread starts by precomputing the values it will most likely need. This precomputation runs very fast, since it is not required to be correct. The objective is to generate a slice that computes the thread's input values correctly most of the time, but with much smaller overhead than a safe computation would require.
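
To make that concrete, here is a hypothetical slice written as C++ source (my illustration; the Mitosis compiler operates on compiled code, and all names here are invented). The slice keeps only the instructions that produce the value the speculative thread needs, optimistically dropping a rarely taken branch:

bool rare_condition(int a) { return a < -1000000; }  // almost never true
int  slow_fixup(int a)     { return -a; }

// Full "red" section: plenty of unrelated work, plus the code that
// actually produces the value the yellow section will consume.
int red_full(int a) {
    // ... long, unrelated computation elided ...
    if (rare_condition(a))      // taken almost never
        a = slow_fixup(a);
    return a + 10;              // the value the yellow section needs
}

// Precomputation slice: just the producing instructions, with the
// rare branch optimistically removed. It is wrong whenever the rare
// path is taken -- but then the speculative thread is squashed anyway.
int red_slice(int a) {
    return a + 10;
}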

When the red thread arrives at the beginning of the yellow one, the precomputed values are validated, and in case of misspeculation the yellow thread is squashed; see Figure 1(b).


Figure 1: (b) Precomputation slices help dismiss unlikely dependences and speculatively compute input values.

This figure also illustrates a speculation on a memory dependence. In this case, there is an ambiguous dependence through pointers P and Q. If the compiler suspects that this dependence does not occur, it can simply ignore it. At runtime, the hardware checks whether this dependence actually occurs; if this is the case, the speculative thread is squashed.

Speculative threads enable the compiler to try all kinds of unsafe, aggressive, optimistic optimizations when generating the precomputation slice. In the end, we get highly optimized code that can precompute these values quickly and is still correct most of the time; see Figure 2.


Figure 2: Through a series of "unsafe" optimizations, the compiler generates a very small slice that still computes the thread input values correctly most of the time, at a much higher speed.

The Mitosis System

I've just described how Intel's speculative threading system, codenamed "Mitosis," works. The codename draws an analogy between speculative threading and biological mitosis, in which cells split apart and work in parallel.

Mitosis relies on both hardware and software support. On the software side, the Mitosis compiler is responsible for analyzing the program and locating the sections of it that can be executed efficiently in parallel. A key component of this analysis is identifying sections of code whose corresponding precomputation slice has very low computation overhead. Other conventional aspects, such as workload balance, also have to be considered.

On the hardware side, Mitosis is built on top of a multi-core and/or multithreaded processor. The main extension required is support for buffering and multiversioning in the memory hierarchy. Buffering is needed to hold speculative state until a thread is verified and can be committed. Multiversioning is required to allow each variable to have a different value in each of the threads running in parallel. This is important because every thread executes a piece of code that originally had sequential semantics; parallelized into threads, those pieces now run simultaneously with values that would have been produced at different points in time in the sequential program. The concurrent threads share the same variables, but the values those variables contain may differ, because they represent the program's state at different points of the sequential execution. Mitosis also relies on hardware support to track which memory locations each thread reads and writes. This enables data-dependence misspeculations to be spotted quickly and the corresponding threads to be discarded.
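
A software caricature of buffering and multiversioning (my own sketch; in Mitosis this support lives in the memory hierarchy, and the types and names below are invented): each speculative thread buffers its writes privately, reads see the thread's own version before falling back to committed state, and commit or squash is just a publish or a discard of the buffer.

#include <cstdint>
#include <unordered_map>

using Addr = std::uintptr_t;

struct VersionedMemory {
    std::unordered_map<Addr, int> committed;   // architectural state

    struct SpecView {
        VersionedMemory *mem;
        std::unordered_map<Addr, int> buffer;  // this thread's versions

        void write(Addr a, int v) { buffer[a] = v; }

        int read(Addr a) {                     // own version wins ...
            auto it = buffer.find(a);
            if (it != buffer.end()) return it->second;
            return mem->committed[a];          // ... else committed state
        }

        void commit() {                        // verified: publish
            for (auto &kv : buffer) mem->committed[kv.first] = kv.second;
        }
        void squash() { buffer.clear(); }      // misspeculated: discard
    };
};

// Usage sketch: VersionedMemory m; VersionedMemory::SpecView v{&m, {}};
// v.write(...); ... then v.commit() on success or v.squash() on violation.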

The Mitosis system has been designed to optimize the trade-off between software and hardware to exploit speculative thread-level parallelism.

Results

To illustrate the performance potential of the Mitosis compiler, we use a subset of the Olden benchmark suite. The Olden benchmarks are pointer-intensive programs from which automatically parallelizing compilers can hardly extract any thread-level parallelism.

As you can see in Figure 3, the results obtained by the Mitosis compiler/architecture for this subset of the Olden benchmarks are impressive. Mitosis outperforms single-threaded execution by a factor of 2.2. Compared with a large out-of-order core, the speedup is close to 2x. We can also see that the benefits of Mitosis do not come only from reducing memory latency: it outperforms an ideal system with perfect memory by about 60 percent. Overall, this work shows that significant amounts of thread-level parallelism can be exploited in irregular codes, with rather low overhead in terms of extra (wasted) activity.


Figure 3: In this graph, the green bars show the performance improvement going from an in-order to an out-of-order core with about twice the amount of resources. The red bars show the performance of a processor with perfect memory, illustrating the potential performance improvement for any technique that targets simply reducing memory latency. The yellow bars show the performance gains of using Mitosis with a four-core processor.

Conclusion

Tomorrow's computing workloads will require exponential increases in computing power, and multi-core platforms and multithreading are paving the way to meeting these performance needs. To reach their full performance potential, though, they need a companion effort in software: accelerating the development of thread-level parallelism in our applications.

For More Information

"Mitosis Compiler: An Infrastructure for Speculative Threading Based on Pre-Computation Slices," by Carlos García Quiñones, Carlos Madriles, Jesús Sánchez, Pedro Marcuello, Antonio González, and Dean M. Tullsen; Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2005.
