Anatomy of a Multiprocessor System
A typical single-processor system is composed of a processor, a memory controller, and an I/O controller. The processor, with its cache, is connected to the memory controller over a system interface bus. The memory controller is in turn connected to the I/O controller with another bus designed for that purpose. The system interface bus carries requests from the processor to the memory controller and the I/O controller. Each request consists of a command, or action to be performed; an address of the location in memory or I/O space; and, optionally, data if the operation writes data out, as illustrated in Figure 1.
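The request format just described can be sketched as a simple data structure. This is purely illustrative; the type and field names are invented for the sketch and do not come from any real bus specification.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class Command(Enum):
    READ = auto()    # fetch data from memory or I/O
    WRITE = auto()   # store data to memory or I/O

@dataclass
class BusRequest:
    command: Command               # the action to be performed
    address: int                   # location in memory or I/O space
    data: Optional[bytes] = None   # present only when writing data out

# A read carries no data; a write carries the payload to store.
read_req = BusRequest(Command.READ, 0x1000)
write_req = BusRequest(Command.WRITE, 0x2000, data=b"\x2a")
```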
The Intel Pentium Pro multiprocessor system, introduced in the mid-1990s, connected as many as four processors on the system interface bus to the memory controller, as shown in Figure 2. This interface, dubbed the Front Side Bus, or FSB for short, added the capability to manage coherence between the processor caches using the MESI protocol, named for the Modified, Exclusive, Shared, and Invalid states a cache line may take.
The interface operates in a manner very similar to our example of the legal team described earlier. The details of a simple request for data are as follows:
A processor requests data from memory over the Front Side Bus. All the other processors observe this request go by as the FSB connects them all together. These processors check their caches to see if any one of them has a copy of the cache line being requested. This operation of checking the caches of adjacent processors is dubbed a snoop in the cache coherence protocol. The term probe is also used to describe this operation.
The snoop can produce one of several results, depending upon the status of the line in that cache. If a cache does not have a copy of the line being requested, it takes no action. If none of the caches in the system has the data, the memory controller provides the data to the requesting processor, which places it in its cache and marks it Exclusive.
If the requested line is in the Shared or Exclusive state in any cache, that cache signals as such on the FSB. If the line is in the Exclusive state, that cache changes its state to Shared. The memory controller returns the cache line data to the requesting processor, which places it in its cache in the Shared state, as one or more caches have indicated that they hold copies of that data.
If the cache line is in the Modified state in a snooped cache, that cache signals as such on the FSB. The memory controller recognizes this signal and does not return data. Instead, the cache holding the modified line places the data onto the bus and sends it to the requesting processor, then marks its own copy of the line Shared. The receiving processor puts the data in its cache and also marks it Shared. The memory controller simultaneously takes a copy of the data and writes it into system memory.
The rules of operation define similar sequences for all caches to follow when they want to modify data in a cache line, or to evict a line from the cache, writing it back to system memory if it has been updated. These rules are identical to those followed by the legal team when it needed to change the contents of a page.
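The snoop outcomes above, together with the companion rule that a writer must first invalidate all other copies of a line, can be modeled in a few lines of Python. This is a simplified sketch of MESI-style behavior under the stated rules, not the actual FSB implementation, and all names are invented for the illustration.

```python
# Simplified model of the FSB snoop sequences described above.
# States follow the MESI convention: Modified, Exclusive, Shared, Invalid.

M, E, S, I = "Modified", "Exclusive", "Shared", "Invalid"

class Cache:
    def __init__(self):
        self.lines = {}                      # address -> [state, data]

    def state(self, addr):
        return self.lines.get(addr, [I, None])[0]

def read(requester, others, memory, addr):
    """A processor reads a line; every other cache is snooped."""
    states = [c.state(addr) for c in others]
    if M in states:
        # Dirty copy elsewhere: that cache supplies the data, memory
        # is updated, and both copies end up Shared.
        owner = others[states.index(M)]
        data = owner.lines[addr][1]
        owner.lines[addr][0] = S
        memory[addr] = data
        requester.lines[addr] = [S, data]
    elif E in states or S in states:
        # Clean copy elsewhere: memory supplies the data, and any
        # Exclusive holder demotes itself to Shared.
        for c in others:
            if c.state(addr) == E:
                c.lines[addr][0] = S
        requester.lines[addr] = [S, memory[addr]]
    else:
        # No other copy: the requester caches the line Exclusive.
        requester.lines[addr] = [E, memory[addr]]
    return requester.state(addr)

def write(requester, others, memory, addr, data):
    """To modify a line, the writer first invalidates all other copies."""
    for c in others:
        if c.state(addr) == M:
            memory[addr] = c.lines[addr][1]  # write dirty data back first
        c.lines.pop(addr, None)              # that copy becomes Invalid
    requester.lines[addr] = [M, data]
    return requester.state(addr)
```

Running through the sequence: the first read of an address leaves the requester's copy Exclusive, a second processor's read of the same address demotes both copies to Shared, and a write by either processor invalidates the other's copy.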
Evolution of a Link-based System
Front Side Bus-based systems worked well and offered a simple and elegant solution for multiple-processor systems. As computer systems sped up with improvements in processor architecture and memory technology, the FSB was run faster to keep up with the data rates required for a balanced system. This approach worked well for over a decade and five generations of processor evolution at Intel, from the Pentium Pro through the Pentium 4 family of processors.
However, constantly increasing the data rates on the bus has its limitations. The bus cannot operate at data rates of 800 megatransfers per second or faster with five electrical loads. The number of loads per bus was reduced first to three, and eventually to two, as the data rates were pushed up to 1000 and eventually 1333 megatransfers per second. Figure 3 shows the configuration of the systems through these successive steps. Eventually the memory controller had to be designed with four electrically independent Front Side Buses connected to it. This made the memory controller a very expensive device, with over 1500 pins split across four FSBs of 175 signals each, and the entire system memory behind that one controller, as shown in Figure 4.
The latest generation of processors, with multiple cores on a single die, demands much higher data (and instruction) bandwidth. Multiple such processors in a system demand more memory bandwidth than a single memory controller can economically support. This has led to integrating a memory controller with the processor on the same die. Multiprocessor systems built with these dice are far more scalable and balanced, as memory bandwidth and capacity increase with the addition of each new processor to the system. Figure 5 shows a typical four-processor system built around such processors with integrated memory controllers.
The Front Side Bus was designed to handle traffic between several processors and a single memory controller. This interface is no longer suitable for systems sporting multiple integrated memory controllers distributed across the system. The interconnect must be able to handle traffic in both directions simultaneously, something the FSB cannot do efficiently: it must stop data transfer in one direction and then restart it in the opposite direction, incurring a severe time penalty to complete the turnaround. Furthermore, each processor now needs multiple links, and the FSB, with its 170 or so signals, is far too expensive a solution to replicate. Moreover, the FSB uses GTL electrical signaling technology, which is limited to 1.6 gigatransfers per second, capping the bandwidth of the link at around 12.8 gigabytes per second.
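The 12.8 gigabytes per second ceiling follows from simple arithmetic, assuming the FSB's 64-bit (8-byte) data path:

```python
# Peak FSB bandwidth = transfer rate x bytes moved per transfer.
# Assumes the FSB's 64-bit (8-byte) data bus.

transfers_per_second = 1.6e9    # GTL signaling limit: 1.6 gigatransfers/s
bytes_per_transfer = 8          # 64-bit data bus

peak_bandwidth = transfers_per_second * bytes_per_transfer
print(peak_bandwidth / 1e9)     # peak bandwidth in gigabytes per second
```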
The link must use as few signals as possible, yet provide high bandwidth and transfer data in both directions simultaneously. The interface must also support multiple memory controllers and efficiently manage the coherence of the processor caches in the system, and it must provide a robust set of mechanisms to handle errors and recover from them without shutting down the entire system. The Intel QuickPath Interconnect was designed with these objectives in mind, creating an entirely new fabric for interconnecting processors.