The invention of the microprocessor started a revolution in computing. The Intel 4004, the first commercial single-chip microprocessor, was very basic in its design and architecture, with very modest instruction and data rates. These rates matched well with the memory technology of the day, and very usable and balanced systems could be constructed by directly connecting the processor to memory. A system bus served as the interconnect mechanism and carried the data between the memory subsystem and the processor.
The architecture of the microprocessor evolved over time and the clock rates increased manyfold with improvements in the micro-architecture and advancements in silicon technology. This increase in computing power required greater data rates from memory, but corresponding improvements were made in DRAM technology and the requirements for data were readily met by the memory systems of the day. The technology of signaling also improved and the data rate of the interface kept up with the rest of the system.
Things changed in the 1980s with the advent of more sophisticated, pipelined microprocessor architectures, such as the 80486. The DRAMs could no longer keep pace with the data rates required, and cache memory was introduced. This high speed memory is built with fast SRAM technology. It has the low latency and high data rates required to keep the processor operating without stalling while waiting for data from the relatively slower DRAMs. However, cache memory has to be small in order to provide the greater speed required of it. As a result, caches store only a small portion of the data in the main system memory. They rely on the fact that typical programs are composed of small sequences of straight-line code that loop back to repeatedly execute the same code many times, and that programs tend to operate on the same set of data entities for a while before moving on to the next set.
The cache exploits these characteristics to provide a significant reduction in latency and high data throughput. It does so by fetching sequences of data from system memory beyond that which is immediately requested by the processor and storing it for fast access when it is needed. The system interconnect evolved to serve as the interface between the cache and the system memory and typically carries sequential bursts of data called cache lines. A cache line can be anywhere from 16 to 128 bytes of sequentially addressed data depending upon the design of the cache. The DRAMs of the system memory are well suited to provide such a burst of data at high speed from an open page. The signaling technology of the interconnect also evolved to run at higher data rates, keeping up with the needs of the faster processors.
The next change in computers and interconnect systems occurred in the early 1990s with the introduction of multiprocessor systems. Two or more processors were connected to a common, shared memory system over the system interconnect. These processors operated in concert to share the overall workload and provided higher performance than a single processor. The system interconnect had to evolve yet again to efficiently handle requests from all the processors in the system. This was achieved by arbitrating fairly between all the processors and pipelining their requests to memory to use the interface efficiently. The signaling technology was updated to handle the electrical loads of multiple processors on a single bus at high transfer rates. The system interconnect was also enhanced to properly share the data in the processors' caches.
The Intel Pentium Pro microprocessor was the first Intel architecture microprocessor to provide a system interconnect, the Front Side Bus (also abbreviated as the FSB), which supported symmetric multiprocessing. The Front Side Bus can connect up to four processors, a memory controller, and an I/O controller. The FSB can pipeline up to eight transactions for high throughput. GTL signaling is used to operate at clock rates of 400 MHz and data transfer rates of up to 1.6 gigatransfers per second. The FSB also provides mechanisms to ensure that all the processors' caches share data from system memory properly and do not use stale data that has been modified in another processor's cache. Sharing data coherently between caches is a key capability required of any high performance interconnect, and we will delve into it in detail.
The Front Side Bus served the needs of the Pentium Pro family of processors and was then enhanced to meet the needs of Pentium 4 processors that followed. The improvements in bandwidth were achieved by increasing the data transfer rates up to 1.6 gigabytes per second. This was achieved by reducing the number of processors connected to a single bus down to two and eventually to one. Thus a four-processor system required four Front Side Busses to connect the four processors to a single central memory controller. This proved to be a very expensive solution in that the memory controller required over a thousand pins just to accommodate the four Front Side Busses and more to handle the interface to memory. Moreover the single memory controller became a severe bottleneck in the system and limited both the memory size and bandwidth that could be built into a system.
The Intel QuickPath Interconnect starts by taking a fresh look at the architecture of the entire system and provides a complete solution to address these limitations. It is a very high performance fabric at the heart of scalable, high performance systems with low system and silicon cost. The Intel QuickPath Interconnect architecture has been designed for future generations of Intel processors and provides plenty of headroom for growth in performance and features. The Intel QuickPath Interconnect achieves these goals through the use of narrow, high speed point-to-point links that require less than half the number of signals of the Front Side Bus for lower cost, and yet provide fifty percent more bandwidth for higher performance. The Intel QuickPath Interconnect is also far more flexible and allows one to build very scalable systems around multiple processors, memory controllers, and I/O controllers. Systems based on Intel QPI can choose to integrate the memory controller and I/O controllers onto the same die as the processor and create a very scalable system that can be upgraded in a modular fashion. Every processor added to a system brings additional memory bandwidth and capacity with its integrated memory controller, providing a very cost effective upgrade that preserves a well balanced system. The Intel QuickPath Interconnect also provides efficient means to resolve cache coherence between multiple processors.
Let us look at the issues created by multiple caches in a system and the ways to keep them coherent so that all processors get the most up-to-date information. We will then describe the high performance cache coherence mechanisms in the Intel QuickPath Interconnect and how they can handle from two to as many as 128 or more caches in a system.
Solving the Cache Coherency Problem
Whenever multiple processors, each with its own cache, cooperate to access and modify data in a shared memory system, they run the risk of accessing stale or outdated data that may have been modified by one of the other processors. This cache coherency problem can be best illustrated by an example from the real world.
Let us say that Mary, an attorney in a law firm, has to draw up a legal contract. She pulls together a team of experts, Robert, Janice, Patty, and Tom, who will all help to create the final document. Mary starts with boilerplate for the contract and creates a table of contents assigning page numbers to each section. She shares this with her team. She then prints out the boilerplate document and places it in a central location accessible to the entire team.
Robert decides he needs to study pages 7 through 10, so he makes copies of those pages and takes them to his office for his own use. Similarly, Janice decides she needs pages 8 through 10 and copies and takes them. Patty and Tom likewise make copies of the pages that are of interest to them, so that all four team members have pages in their respective caches. Multiple copies of any page can exist, and each individual can refer to his or her own copy.
If Robert decides that he has no further use for page 9 he can destroy his copy of that page. This ability to cache copies of pages works well as long as no member makes any changes to the pages in his or her office. However if Robert decides to change the contents of page 10 then he must take steps to ensure that Janice is not working with obsolete information on her copy of that page. If Janice too decides to update page 10 she makes the problem even worse. This is the basis of the problem of cache coherency.
The team must institute a set of rules on how to handle updates to ensure that they have a graceful way of collaborating to produce a coherent document. This set of rules can range from something very simple but restrictive and with much overhead, to one that is more sophisticated and lets each team member work much more autonomously. Let us look at two ways of keeping the caches coherent.
Write-Through Caching. The team can follow a simple set of rules whenever anyone decides to update a page. In our example above, when Robert is ready to update page 10 he tells the other team members that he is doing so, giving them the page number. They all check to see if they have copies of the page and, if so, destroy their copies. Robert then makes the change to page 10 and places it at the central location; he can choose to keep a copy for himself if he desires. If Janice now decides to make a change to the same page she must go to the central location for the latest copy, as she destroyed her copy of the page when Robert announced his intention to update page 10. She too must announce to all that she is about to change page 10, and they all, including Robert, must destroy their copies of that page. Once she makes her update she too must place a copy of the updated page in the central location. If both Robert and Janice decide to make the change simultaneously, they can toss a coin to decide who goes first. The other will then have to fetch the updated copy from the central location to merge in his or her updates. This mechanism, referred to as write-through caching, is the simplest form of cache coherence for handling updates. It requires a simple and efficient mechanism to announce which page is being changed, and each team member must always take the time to put the most up-to-date copy of the page in the central location after every update, for all the others to use.
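The write-through rules above can be sketched in a few lines of Python. The `Cache` and `Memory` classes and their methods are hypothetical illustrations of the rules, not any real hardware interface:

```python
class Memory:
    """The 'central location': always holds the latest copy of every page."""
    def __init__(self):
        self.pages = {}

    def read(self, page):
        return self.pages.get(page, "")

    def write(self, page, data):
        self.pages[page] = data


class Cache:
    """A team member's office: holds local copies of pages."""
    def __init__(self, name, memory, peers):
        self.name = name
        self.memory = memory
        self.peers = peers          # all caches in the system, incl. self
        self.copies = {}

    def read(self, page):
        if page not in self.copies:              # miss: fetch a copy
            self.copies[page] = self.memory.read(page)
        return self.copies[page]

    def write(self, page, data):
        # 1. Announce the update so every other cache destroys its stale copy.
        for peer in self.peers:
            if peer is not self:
                peer.copies.pop(page, None)
        # 2. Update the local copy and write it through to the central location.
        self.copies[page] = data
        self.memory.write(page, data)


memory = Memory()
caches = []
robert = Cache("Robert", memory, caches)
janice = Cache("Janice", memory, caches)
caches.extend([robert, janice])

memory.write(10, "boilerplate")
robert.read(10)
janice.read(10)
robert.write(10, "Robert's edit")    # invalidates Janice's copy
assert 10 not in janice.copies       # Janice must refetch page 10
assert janice.read(10) == "Robert's edit"
```

Note that every write both broadcasts an invalidation and updates memory, which is exactly the overhead that motivates write-back caching below.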
Write-Back Caching. The team members decide that this is unnecessary overhead and that they should be able to keep the pages they have modified, as they are likely to make several more updates to them. However, they would forward a modified page to other members when requested. This is write-back caching, and it is the cache coherence mechanism used in modern microprocessors.
Let us see how our legal team would work under the rules of write-back caching. Robert starts by getting copies of pages 7 through 10. He announces to all the others that he has done so. Next Janice gets pages 8 through 10 and announces to all members that she has done so. Robert, Patty, and Tom check their copies to see if they have any of the pages Janice has fetched. Robert sees that he does and makes a notation on his copies of pages 8 through 10 that they are shared, and he also lets Janice know that he has copies of those pages. Janice now marks her copies of pages 8 through 10 as shared with someone else. Note that Robert is the only one with a copy of page 7, and it is exclusive in his cache as long as no one else fetches a copy.
When Robert is ready to make his updates and starts with page 10, he sees that it is shared with someone else. So he announces his intention to modify the page and looks for a response from the rest of the team. All the others check their copies of the pages and, since none of them has made any changes, destroy their copies. So Janice throws away her copy of page 10. Robert now makes the change to page 10 and keeps it with him. At this point he has the only copy of the most up-to-date contents of that page, so he marks it as such, as having been modified. When Janice is ready to make a change to page 10 she goes to the central location to get a copy, but also announces to all her desire to change it. Robert, seeing that he has the most up-to-date copy, informs Janice of this. He then gives her the page he modified, leaving him without a copy. Now Janice has the only copy and she can update it or store it in her cache. However she must not destroy it, as this is the only up-to-date copy of that page. If Robert and Janice had both decided to update the page simultaneously, then one of them would have gone first and then handed the modified page to the other for further updates.
Recapping, each team member can hold copies of pages as long as they keep track of the state of each page in the system. A page can be:
- Shared with one or more members of the team. The team member can destroy this page at any time if he or she no longer needs it and does not need to inform anyone of that action.
- Exclusive in only one cache. This exclusive copy of the page has not been updated but can be updated by the owner of that page without informing anyone else of the change. The owner can destroy her copy if she no longer needs it as this page is up to date in the central location.
- Modified. This page can exist in only one cache. The owner can make further changes to it at will without informing the other team members. The owner must forward the page to anyone else who needs it. If the owner no longer needs the page then she or he must put it back in the central location as it is the only up to date copy.
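The rules above can be sketched as a small state machine over the states Modified, Exclusive, Shared, and Invalid. The Python below is a minimal illustration, not a real protocol implementation: all class and method names are hypothetical, and it assumes a writer always replaces a whole page, so a write from the Invalid state need not first fetch the latest copy:

```python
MODIFIED, EXCLUSIVE, SHARED, INVALID = "M", "E", "S", "I"

class Memory:
    """The central location; may hold stale data for Modified pages."""
    def __init__(self):
        self.pages = {}

class WriteBackCache:
    def __init__(self, name, memory, peers):
        self.name = name
        self.memory = memory
        self.peers = peers               # every cache in the system
        self.lines = {}                  # page -> (state, data)

    def _others(self):
        return [p for p in self.peers if p is not self]

    def read(self, page):
        state, data = self.lines.get(page, (INVALID, None))
        if state != INVALID:
            return data                  # hit: no announcement needed
        # Miss: announce the read; a Modified owner forwards its copy.
        for peer in self._others():
            pstate, pdata = peer.lines.get(page, (INVALID, None))
            if pstate != INVALID:
                if pstate == MODIFIED:
                    self.memory.pages[page] = pdata  # write back on sharing
                peer.lines[page] = (SHARED, pdata)   # both copies now Shared
                self.lines[page] = (SHARED, pdata)
                return pdata
        # No other copies exist: read memory and hold the page Exclusive.
        data = self.memory.pages.get(page)
        self.lines[page] = (EXCLUSIVE, data)
        return data

    def write(self, page, data):
        state, _ = self.lines.get(page, (INVALID, None))
        if state in (SHARED, INVALID):
            # Announce intent to modify; everyone else destroys their copy.
            # (A Modified owner's data is simply dropped here because this
            # sketch always rewrites the whole page.)
            for peer in self._others():
                peer.lines.pop(page, None)
        self.lines[page] = (MODIFIED, data)  # Exclusive/Modified: write silently

memory = Memory()
caches = []
robert = WriteBackCache("Robert", memory, caches)
janice = WriteBackCache("Janice", memory, caches)
caches += [robert, janice]

memory.pages[10] = "boilerplate"
robert.read(10)
assert robert.lines[10][0] == EXCLUSIVE   # only copy in any cache
janice.read(10)
assert robert.lines[10][0] == SHARED      # both copies now Shared
robert.write(10, "edit 1")
assert robert.lines[10][0] == MODIFIED and 10 not in janice.lines
janice.write(10, "edit 2")
assert janice.lines[10][0] == MODIFIED and 10 not in robert.lines
```

Notice that reads and writes that hit in the Exclusive or Modified state generate no traffic at all, which is where write-back caching recovers the overhead of the write-through scheme.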
The team members must communicate with each other to properly share the pages of their document. All of the communication for this purpose is termed coherence traffic. However, the team members may also send messages to each other for other purposes. For example, if Janice decides to take a coffee break and invites Robert to join her, messages between them about breaking for coffee would have no bearing on the shared document and would be termed as non-coherent traffic in computer parlance.
Tying It All Together
Multiple processor systems operate under rules for sharing data very similar to those described in our example above. The caches, represented by the team members above, can hold multiple entities of data called cache lines. Each cache line is typically composed of 64 bytes of data and is the smallest entity handled and tracked by the cache, akin to the page in our example. The cache controller tracks the state of each cache line and marks it as Shared, Exclusive, or Modified and responds accordingly. If a cache line is no longer up to date in a cache, the controller marks it as Invalid. These, taken together, are referred to by their initials as the MESI states. The central memory in a computer system is the common repository, like the central location for the document in our example above. A typical computer system has at least one memory controller that interfaces with the banks of DRAM memory.
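The arithmetic by which a byte address selects a 64-byte cache line can be sketched as follows. The cache geometry here (a direct-mapped cache with 1024 sets, i.e. 64 KiB) is an assumption chosen for illustration; only the 64-byte line size comes from the text:

```python
LINE_SIZE = 64     # bytes per cache line, as in the text
NUM_SETS  = 1024   # assumed geometry: 1024 sets x 64 B = 64 KiB, direct-mapped

def decompose(addr):
    """Split a byte address into (tag, set index, byte offset)."""
    offset = addr % LINE_SIZE      # byte within the 64-byte line
    line   = addr // LINE_SIZE     # line-aligned address
    index  = line % NUM_SETS       # which set the line lands in
    tag    = line // NUM_SETS      # identifies the line within that set
    return tag, index, offset

tag, index, offset = decompose(0x12345678)
# 0x12345678 -> tag 0x1234, set 0x159, byte offset 0x38
assert (tag, index, offset) == (0x1234, 0x159, 0x38)
```

The cache controller stores the tag alongside the MESI state for each line, so that a later access to the same set can tell whether it is a hit on the cached line or a miss on a different one.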