The Protocol Layer
The protocol layer defines a set of rules for exchanging packets. These rules coordinate the exchange of messages across multiple links, among multiple agents, on multiple devices. In contrast, the physical, link, and routing layers exclusively address communication on a single link, between two directly connected devices. The architecture of the Intel QuickPath Interconnect prescribes thirteen different agent types; a given type of agent handles specific messages sent over an Intel QPI topology. For the purposes of the following discussion, we will concentrate on home agents and caching agents. Examples of other agent types include configuration, interrupt, and legacy IO agents. An Intel QPI message uniquely identifies its target agent by the combination of the agent type and node ID. When a link endpoint contains multiple agents, messages are directed based on their type: only a specific agent type can sink (that is, remove from the link buffer and consume) a given message type.
The protocol layer is designed to coordinate access from multiple agents to a shared memory. To make sense of multiple entities writing to a shared memory, a set of rules, called coherency, are defined. Coherency is a distributed responsibility; all agents accessing the system-wide coherent memory space must participate to maintain correct operation. The primary purpose of the protocol layer is to delineate the message flows that will maintain coherence, promote performance, and accomplish all of the desired behaviors. In the following sub-sections we will first discuss coherent flows, which comprise the bulk of the traffic. We will then turn to the complex topic of integrating non-coherent flows, and explain why they are needed.
A caching agent is logic closely associated with the caching structure of a device. It initiates transactions to coherent memory and responds to them in concert with the other functions of the device and its caching structure. A caching agent can also retain copies of memory values in a local cache, and provide these copies to other agents. The protocol assumes a write-back coherency model, in which caching agents need not immediately write changes to the main memory, but may temporarily keep cached changes out-of-sync with main memory.
A home agent acts as the interface to a given set of memory addresses. It services coherent transaction requests, handshaking with the various caching agents to manage conflicts, provide data, and clarify ownership. A home agent comprises additional logic affiliated with an integrated memory controller. A requestor labels its transactions with an agent-unique requester transaction identification number (RTID). A transaction is globally uniquely identified by its global unique transaction ID (UTID), which concatenates the home node ID, the requestor node ID, and the RTID.
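As a sketch of this identification scheme, the UTID can be modeled as a simple bit-field concatenation. The field widths below (6-bit node IDs, 8-bit RTID) are illustrative assumptions, not the architected sizes:

```python
# Sketch of the UTID composition described above: home node ID,
# requestor node ID, and RTID concatenated into one identifier.
# Field widths are illustrative assumptions.
HOME_BITS, REQ_BITS, RTID_BITS = 6, 6, 8

def make_utid(home_nid: int, req_nid: int, rtid: int) -> int:
    """Concatenate the three fields into a single integer UTID."""
    assert home_nid < (1 << HOME_BITS)
    assert req_nid < (1 << REQ_BITS)
    assert rtid < (1 << RTID_BITS)
    return (home_nid << (REQ_BITS + RTID_BITS)) | (req_nid << RTID_BITS) | rtid

def split_utid(utid: int):
    """Recover (home node ID, requestor node ID, RTID) from a UTID."""
    rtid = utid & ((1 << RTID_BITS) - 1)
    req_nid = (utid >> RTID_BITS) & ((1 << REQ_BITS) - 1)
    home_nid = utid >> (REQ_BITS + RTID_BITS)
    return home_nid, req_nid, rtid
```

Because the RTID is unique only per requestor, prefixing it with both node IDs is what makes the result globally unique across the fabric.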
The protocol uses a MESI coherence scheme (see Gregory F. Pfister's "In Search of Clusters", Prentice Hall PTR, 1998), in which a caching agent maintains the status of every memory line as:
- Modified: it has the only copy in the system, which may differ from that in main memory,
- Exclusive: it has the only copy in the system, but it is the same as memory,
- Shared: there may be multiple copies in the system, which are all the same as one another and memory, or
- Invalid: it has no copy.
The protocol extends this strategy with an additional forward state (F), yielding the MESIF protocol. This forward state allows for low-latency data return of shared data via direct cache-to-cache transfer. To eliminate ambiguity, while there can be many sharers, only one caching agent is designated the forwarder. Table 3 details the properties of each of the five states. The final column refers to the states to which the caching agent may transition the line without further coordination with the line's home agent. For example, the line cannot transition out of the modified state without arranging a write-back. Similarly, the line cannot transition out of invalid without initiating snoops to verify the line's state in other caching agents.
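The state properties from the bullets above can be summarized in a small sketch (the per-state transition rules of Table 3 are not reproduced; the field names are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LineState:
    name: str
    cached: bool                  # the agent holds a copy of the line
    only_copy: bool               # guaranteed the sole cached copy in the system
    may_differ_from_memory: bool  # copy may be out of sync with main memory
    forwarder: bool               # designated to supply shared data cache-to-cache

# The five MESIF states, encoding only the properties stated in the text.
MESIF = {
    "M": LineState("Modified",  cached=True,  only_copy=True,  may_differ_from_memory=True,  forwarder=False),
    "E": LineState("Exclusive", cached=True,  only_copy=True,  may_differ_from_memory=False, forwarder=False),
    "S": LineState("Shared",    cached=True,  only_copy=False, may_differ_from_memory=False, forwarder=False),
    "I": LineState("Invalid",   cached=False, only_copy=False, may_differ_from_memory=False, forwarder=False),
    "F": LineState("Forward",   cached=True,  only_copy=False, may_differ_from_memory=False, forwarder=True),
}
```

Note that exactly one state carries the forwarder designation, which is what removes the ambiguity among multiple sharers.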
When a request for data is sent, caching agents receive snoop requests inquiring about their state. The single forwarder provides the data immediately and informs the home agent. Since the home agent may not know that a forwarder exists, it speculatively begins fetching the data from memory as soon as it receives the initial request. When the home agent receives the forward notification from the forwarder, it sends a completion to the requestor, which lets the requestor de-allocate all resources associated with the transaction.
The protocol supports two different snooping strategies, source snooping and home snooping. In source snooping, the requestor sends snoops directly to all caching agents in the system, as well as to the home agent. The snooped agents then send their responses to the home agent; one of them may also deliver the data directly to the requestor as provisioned above. Next the home agent resolves any conflicts, returns the data if necessary, and completes the request. This two-hop protocol (not counting the completion) has the lowest latency at the cost of a large snoop bandwidth and is therefore best suited to systems with few agents.
In contrast, for home snooping the requestor only sends its request to the home agent. The home agent typically maintains a directory telling it which other agents may have a copy. It then targets its snoops to only those agents. Note that the directory may allow false positives -- while it must always include agents that have the data, it may also include agents that do not. The rest of the flow mirrors the source snooping case. The advantage of home snooping is the reduced snoop bandwidth due to the targeted snoops; the disadvantage is the three-hop latency (not counting the completion). If some nodes have longer latencies than others, however, home snoop may further have a latency advantage by eliminating snoops to that node when unnecessary. Thus home snoop is commonly used in systems with many agents, or in systems with dramatically different latencies among the agents.
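The home-snoop directory's over-approximation can be sketched as follows; `Directory` and its methods are hypothetical names, not Intel QPI structures:

```python
# Sketch of home-snoop targeting: the directory may over-approximate
# (false positives allowed), so it must include every agent that holds
# the line and may include some that do not. The home agent snoops only
# the listed agents rather than broadcasting.
class Directory:
    def __init__(self, all_agents):
        self.all_agents = set(all_agents)
        self.possible_sharers = {}   # address -> agents that *may* cache it

    def record_grant(self, addr, agent):
        """Conservatively note that `agent` may now hold `addr`."""
        self.possible_sharers.setdefault(addr, set()).add(agent)

    def snoop_targets(self, addr, requestor):
        """Agents the home must snoop for `addr` (excluding the requestor)."""
        return self.possible_sharers.get(addr, set()) - {requestor}

# Source snooping, by contrast, broadcasts: the requestor snoops every
# other caching agent, trading snoop bandwidth for one fewer hop.
def source_snoop_targets(all_agents, requestor):
    return set(all_agents) - {requestor}
```

The bandwidth trade-off is visible directly: home snoop sends at most as many snoops as the directory lists, while source snoop always sends one per caching agent.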
Beyond coherent flows, legacy issues further required the Intel QuickPath Interconnect to support a variety of non-coherent flows. These flows may have no direct relationship to coherency (some derive from products that predate caches), or may deliberately violate the normal coherency rules. In many cases legacy behavior has to do with the non-coherent spaces of the system address map.
In previous generation systems the "actual" characteristics of an address were fundamentally unknown so every address was "snooped" on the FSB. In a link-based topology each requesting agent must know the system address map, including message class, "home" node ID, and whether or not to snoop.
Many of the legacy-driven messages added to the Intel QPI protocol occur as a consequence of platform integration of the corresponding legacy features. These messages emulate behaviors previously triggered by physical pins from the platform into the socket. Intel QPI characterizes a particular set of these physical legacy pins as virtual legacy wire (VLW) transactions. These include the INTR, A20M, and SMI legacy signals.
The first feature originally implemented with pins that we will discuss is interrupts. Systems built around processors beginning with the Intel 8080 used an Intel 8259 programmable interrupt controller, directly connected to the bus, as a central place to receive all interrupts and distribute them to individual processors. When the 8259 received an interrupt, it drove the side-band INTR (interrupt request) signal directly to the affected processor. The interrupted processor responded with an INTA (interrupt acknowledge) on the bus. The 8259 latched the interrupt, keeping any newly arriving interrupts pending, and calculated the interrupt number of the highest priority pending interrupt. The interrupted processor then read this vector on the bus, jumped to the corresponding code, and executed the interrupt service routine. Once complete, it sent the 8259 an EOI (end of interrupt) on the bus, freeing the 8259 to process the next interrupt, if any.
The Pentium processor introduced a different method of handling interrupts, the Advanced Programmable Interrupt Controller (APIC). However, the new APIC requires software enabling, and it defaults to 8259 mode (sometimes referred to as legacy interrupt mode). To this day, some boot loaders rely on 8259 semantics, which means the new Intel QuickPath Interconnect must support this legacy interrupt mode. Intel QPI messaging emulates the INTR pin with a new VLW transaction flow, and the architecture defines handling of the remaining message types.
The general interrupt legacy mode flow is as follows:
- The IO subsystem issues an interrupt. This transaction is mapped to an Interrupt VLW message and directed over Intel QPI to a processor core. The processor uncore issues a Cmp (complete) message acknowledging its reception of the VLW message.
- The processor core issues an Interrupt Acknowledge transaction (effectively a read of the register indicating the desired vector). The IO subsystem completes this transaction with a noncoherent read response containing the interrupt vector.
- The processor core executes the appropriate code and issues an End of Interrupt Transaction on Intel QPI, which the IO subsystem then completes.
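The three-step flow above can be written out as a message transcript; the sequence follows the text, while the function and log format are illustrative:

```python
# Walk-through of the legacy interrupt mode flow: Interrupt VLW and its
# Cmp acknowledgement, the Interrupt Acknowledge read returning the
# vector, and the final End of Interrupt. The transcript is a sketch.
def legacy_interrupt_flow(vector: int):
    log = []
    # 1. IO subsystem raises the interrupt as a VLW message; the uncore
    #    acknowledges reception with a Cmp.
    log.append(("IO -> core", "Interrupt VLW"))
    log.append(("core -> IO", "Cmp"))
    # 2. Core reads the vector register via an Interrupt Acknowledge
    #    transaction; IO completes it with a non-coherent read response.
    log.append(("core -> IO", "Interrupt Acknowledge"))
    log.append(("IO -> core", f"non-coherent read response: vector {vector:#x}"))
    # 3. Core runs the service routine, then signals End of Interrupt,
    #    which the IO subsystem completes.
    log.append(("core -> IO", "End of Interrupt"))
    log.append(("IO -> core", "Cmp"))
    return log
```
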
The goal is not to provide the highest performing implementation of the legacy interrupt mode, but to guarantee behavior logically equivalent to previous generation systems with acceptable performance.
Interrupt compatibility looks simple next to the variety of issues associated with the layout of memory. In this arena, legacy began with the Intel 8086, which extended its 16-bit address space to 20 bits via segmentation to support a 1 MB physical memory. With segmentation, an address was defined by a 16-bit segment and a 16-bit offset, combined as (segment * 16) + offset to produce a physical address. As a side-effect of this definition, effective addresses beyond the 1 MB mark wrapped around to the beginning of memory. For example, addressing FFFF:1000 would resolve to physical address 0x00FF0.
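The segment arithmetic and its wrap-around can be checked directly; the 20-bit mask stands in for the 8086's 20 address pins:

```python
# 8086 real-mode address arithmetic: physical = segment * 16 + offset,
# truncated to 20 bits, which produces the wrap-around described above.
def phys_8086(segment: int, offset: int) -> int:
    return ((segment << 4) + offset) & 0xFFFFF   # 20-bit address bus wraps

# FFFF:1000 -> 0xFFFF0 + 0x1000 = 0x100FF0, which wraps to 0x00FF0
```
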
The initial PC built on the Intel 8088 established a convention when it partitioned this inherited 1 MB address space into conventional memory below the 640 kB line and accessible to regular programs, and the upper memory area (UMA), between 640 kB and 1 MB, which was reserved for ROM, peripheral RAM, and memory-mapped IO (MMIO, in which reads and writes to particular memory addresses actually affect locations local to an IO device instead). Specifically, graphics was memory mapped to the ASEG from addresses 0xA0000 to 0xBFFFF, utility ROM occurred from 0xC0000 to 0xDFFFF, and the boot ROM lived in 0xE0000 to 0xFFFFF.
The 8086's interrupt descriptor table (IDT), which holds interrupt vectors, was statically positioned at the beginning of the address space between addresses 0x00000 and 0x003FF. With the new UMA in the 8088, programmers began to exploit the wrapping property of 8086 segmented addressing as a work-around to access the IDT directly from upper memory with fewer instructions.
In 1982, the Intel 80286 was released, which extended the logical address space (the apparent address space from the application program perspective) from 1 MB to 1 GB. A fundamental conflict now emerged: how to allow addressing of the new high memory area (HMA) while still supporting the wrap behavior from the 8088? By default, addresses would access the new memory region. But a single logic gate was also added that, when asserted, would disconnect the 21st address bit from the chipset. Masking this so-called A20 line resulted in the legacy wrap behavior (referred to as A20M). Intel QPI includes this as A20M in its VLW messaging treatment.
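The A20 gate amounts to forcing one address bit to zero; a minimal sketch, with `resolve` as a hypothetical helper:

```python
# The A20 gate as a one-line mask: when A20M is asserted, address bit 20
# is forced to zero, reproducing the 8086 wrap; when deasserted, the same
# segment:offset pair reaches above the 1 MB mark.
A20_BIT = 1 << 20   # the 21st address bit

def resolve(segment: int, offset: int, a20m_asserted: bool) -> int:
    effective = (segment << 4) + offset          # can carry into bit 20
    return effective & ~A20_BIT if a20m_asserted else effective

# FFFF:1000 with A20 masked -> 0x00FF0 (legacy wrap)
# FFFF:1000 with A20 enabled -> 0x100FF0 (high memory area)
```
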
Yet another pin arrived with the 80386SL microprocessor, introduced in 1990. Original Equipment Manufacturers (OEMs) purchase processors from Intel and use them as components in the computers delivered directly to consumers. The OEMs wished to differentiate their platforms by offering distinct features such as multiple sleep states for extending battery life and the ability to control fans. The solution chosen to allow for such differentiation was a special system management interrupt (SMI) that could bypass the operating system. SMI required a different implementation from existing interrupts: an additional pin from the chipset or motherboard to the socket. Thus the Intel QPI VLW messaging also includes provisions for the SMI signal.
A seemingly similar pin, #LOCK, relates to the overarching concept of atomicity, and thus merits unique treatment outside of VLW in the architecture of the Intel QuickPath Interconnect.
Consider a system without caches, like the 8086. It introduced the XCHG instruction, which exchanges values between a register and a memory address. Doing so requires first a read of memory, and then a write to it. To make this an atomic mechanism -- where the entire operation occurs indivisibly -- the protocol must prevent any other memory access from taking place in between. The FSB introduced the #LOCK signal to accomplish this. When a processor drove the #LOCK signal, all other bus agents stopped generating new transactions and completed their outstanding ones. The lock requester then performed its read and write, and finally released the lock.
With the introduction of caches, and in the common case of a semaphore (atomically accessed variable) within a single cache line, a processor no longer needed to grab the bus lock: it could read the cached data and postpone any subsequent snoops to that address until after the processor finished the write. Of course uncacheable transactions still required bus locks. They were also needed for another situation: split locks. Since the first systems lacked caches, programmers had no reason to align semaphores to the 64 B cache line size. This means that the address specified in a swap might actually spill over from the end of one cache line, A, to the beginning of another, B. Suppose two processors both want to XCHG this line, but the first one already has A cached and the second already has B cached. If they both defer snoops until they get the other line, the system will deadlock. To prevent this, they instead compete to acquire the bus lock, which serializes the two transactions. Thus bus locks are used to guarantee forward progress for split locks as well.
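The split-lock condition reduces to a boundary check; a minimal sketch assuming the 64 B line size mentioned above (`lock_strategy` is an illustrative name, not an architected mechanism):

```python
# Split-lock detection: an atomic operand whose bytes are not confined
# to one cache line spills into the next, so the cache-based lock does
# not suffice and the access falls back to a bus lock.
LINE = 64   # cache line size assumed in the text

def is_split(addr: int, size: int) -> bool:
    """True if [addr, addr+size) crosses a cache-line boundary."""
    return (addr % LINE) + size > LINE

def lock_strategy(addr: int, size: int, cacheable: bool) -> str:
    if not cacheable or is_split(addr, size):
        return "bus lock"      # serialize system-wide
    return "cache lock"        # defer snoops until the write completes
```
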
Intel QPI removes the physical #LOCK pin but must still support locking functionality. It designates a single Quiesce Master for the entire system. When a processor wants to request a lock, it asks the Quiesce Master. The Quiesce Master then broadcasts the message StopReq1, telling all other agents to stop sending transactions, and waits to receive acknowledgements from all agents. If there are multiple IO agents, it must additionally issue a StopReq2 message to avoid a possible deadlock case. Since Direct Memory Access (DMA) permits IO agents to directly modify memory, the locking protocol must accommodate them in addition to processors. After the Quiesce Master receives the second round of acknowledgements, it informs the original requestor that the system is locked. The requestor performs its read and write, telling the Quiesce Master when done. Finally the Quiesce Master performs another two synchronizations, StartReq1 and StartReq2, permitting agents to begin issuing messages again. Figure 4 depicts an example of this flow.
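The two-round handshake can be sketched as a toy simulation; the message names follow the text, while the classes are illustrative:

```python
# Sketch of the quiesce flow: StopReq1 and StopReq2 each fan out from
# the Quiesce Master and complete only when every agent has acked, after
# which the requestor performs its locked read and write; StartReq1 and
# StartReq2 then release the system.
class Agent:
    def handle(self, msg):
        return True            # ack unconditionally in this sketch

class QuiesceMaster:
    def __init__(self, agents):
        self.agents = agents
        self.log = []

    def _round(self, msg):
        self.log.append(msg)
        acks = [agent.handle(msg) for agent in self.agents]
        assert all(acks), "a round completes only after every agent acks"

    def lock(self):
        self._round("StopReq1")   # agents stop issuing new transactions
        self._round("StopReq2")   # second round covers DMA-capable IO agents
        self.log.append("system locked; requestor performs read + write")

    def unlock(self):
        self._round("StartReq1")
        self._round("StartReq2")  # agents may issue messages again
```
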
This sledgehammer approach of draining all transactions from the system has severe performance implications, which only worsen as the system grows. Thus current software manuals recommend avoiding split locks. Nevertheless, Intel QPI must continue to support locks because software vendors often only remove legacy when updating code for other reasons. For example, an older boot loader that is not re-written from generation to generation will likely contain split locks.
On the other hand, sometimes legacy makes itself useful. Suppose a customer wants to dynamically update a system by hot swapping memory or adding a node. System software must update the Intel QPI fabric routing tables to take advantage of the changes. It can use this existing locking mechanism to perform a system quiesce, which stops all transactions in the system for the duration of the update.
Coherent and Non-Coherent Interactions
So far we have described coherent and non-coherent flows independently. Nevertheless, their overlap provides a point of particular interest in the architecture of the Intel QuickPath Interconnect.
The coherent flows described initially assumed normal memory addresses that can be cached on both reads and writes. In fact, the Intel 64 and IA-32 Architectures Software Developer's Manual specifies a variety of memory types with various caching behaviors. Ranges of memory addresses are defined with common caching characteristics. Uncacheable addresses cannot be cached by either reads or writes, while write-protected addresses can be cached by reads but not writes. Earlier we mentioned that Intel QPI typically assumes a write-back cache hierarchy, and we saw how permitting caching agents to keep copies of data different from that in main memory increased the protocol's workload. A fourth memory access type, write-combining, adds further complexity. This mechanism buffers groups of writes to a cache line and delivers them to the memory bus as a single lump. While faster than performing the writes singly, it has implications for memory ordering, the global sequence in which cores perceive their memory accesses to occur relative to one another.
Much as the accretion of several generations' quirks resulted in the Intel QPI A20M VLW message, many sources added over the years have also accumulated to control the typing of memory addresses. One example is Memory Type Range Registers (MTRRs), a software-defined, fixed set of typed ranges. A second is the System Address Decoder (SAD), which also resolves the actual destination of reads and writes, but from a hardware perspective. The Page Attribute Table (PAT) extended this to page-level granularity, while the Programmable Address Map (PAM) allows dynamic changes. To maintain legacy behavior, Intel QPI must support all of them, which can lead to some surprising interactions between the flows.
Consider as an example the PAM. It was introduced to speed boot-loading (BIOS shadowing) and for component debug purposes. The idea was to allow fast transfer of data from FLASH to memory, and vice-versa. To do so, the PAM defined four memory configurations (shown in Table 4) with respect to behavior on reads and writes. The first corresponds to memory as we know it: both reads and writes target memory. The second superficially seems similar to MMIO: both reads and writes target IO. The key difference from MMIO is that instead of addressing data stored in the IO device, for PAM the data is logically located in coherent memory. The third and fourth PAM configurations are even more peculiar: in the third, reads target memory while writes target IO, and in the fourth reads target IO while writes target memory. Note that these all pertain to individual addresses, so PAM allows reads from and writes to the same address to have different sources and destinations. Furthermore, the PAM is an overlay for the SAD. Each agent has an individual SAD, including the IO. Unlike the core, however, the IO does not implement a PAM and instead is configured to always target reads and writes to coherent memory.
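The four configurations can be captured as a lookup keyed by direction; the numbering here is positional (matching the order in the text), not the encoding of Table 4:

```python
# The four PAM configurations as a lookup: each maps reads and writes
# independently to memory or IO, so the same address can have a
# different source than destination.
PAM_CONFIGS = {
    1: {"read": "memory", "write": "memory"},  # ordinary memory
    2: {"read": "IO",     "write": "IO"},      # MMIO-like, but the data is
                                               # logically in coherent memory
    3: {"read": "memory", "write": "IO"},
    4: {"read": "IO",     "write": "memory"},
}

def pam_target(config: int, is_read: bool) -> str:
    return PAM_CONFIGS[config]["read" if is_read else "write"]
```
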
To understand the ramifications of this on Intel QPI, we must further detail the role of IO. In an Intel platform, PCI Express (PCIe) and Direct Media Interface (DMI) provide a common interface to the IO subsystem. Although both are defined as unordered and non-coherent, they do guarantee producer-consumer behavior, which simply means that reads see the effects of previous writes. To enforce this, writes are posted, meaning they need not complete immediately, while reads are non-posted, meaning each read returns an explicit completion; a read pushes earlier writes ahead of it, forcing them to complete before it does.
The combination of these PCIe behaviors with the PAM definition results in the following possible deadlock scenario, when the PAM targets reads to IO:
- An IO agent begins a posted write to address x, which therefore targets memory. Unlike the MMIO case, in which the IO device performs the write locally, here the transaction goes to the caches. Since memory is coherent, the IO agent must issue snoops to ensure it maintains coherency on x.
- Before its snoop from the IO write arrives, a core reads address x, which initiates a read from IO.
- Due to the producer-consumer requirement, the IO agent must complete its write to x before it can service the read request to x. The write waits pending its snoop responses.
- Since the core has already issued its request for x, it defers its response to the IO's snoop and waits for its read to complete.
- The system has deadlocked.
The Intel QPI protocol avoids this by applying the following change. If the core is waiting for an outstanding read transaction from IO space and receives a snoop from IO space to the same address, it performs the snoop as though it had not yet issued its read request. For example, consider the case where the PAM was set to act like normal memory. Suppose the core read data m from that address, and modified it. Now the PAM has changed and the core is attempting to use the same address to read from IO. Intel QPI defines that the IO write's snoop will cause the core to write back m to its original memory location and then send a write-back-invalid response to the IO, indicating that it wrote back the data and invalidated its own copy of the line. The IO completes its write and then services the core's read request. This demonstrates characteristics unique to the PAM, whereby an address may appear to exist in both memory and IO at once, and to be both coherent and non-coherent at the same time, from the perspective of different agents. In the example scenario, the core considered its read non-coherent, while the IO considered its write coherent. The reason why software would perform such an action is not straightforward, but since the situation can arise, the architecture of the Intel QuickPath Interconnect has a duty to provide a resolution.
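The rule can be condensed into a small decision function; the names and the parameterization are illustrative:

```python
# Sketch of the deadlock-avoidance rule: a core normally defers a snoop
# that races its own outstanding read to the same address, but when both
# the outstanding read and the snoop involve IO space, it services the
# snoop immediately, as though the read had not yet been issued.
def defer_snoop(outstanding_read_same_addr: bool,
                read_targets_io: bool,
                snoop_from_io: bool) -> bool:
    if outstanding_read_same_addr and read_targets_io and snoop_from_io:
        return False   # service now; this breaks the cycle above
    return outstanding_read_same_addr   # defer only when racing a read

def snoop_reply(line_state: str) -> str:
    """With a modified copy, write back and invalidate; clean copies
    simply invalidate."""
    return "write-back-invalid" if line_state == "M" else "invalidate"
```
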
In addition to this type of potentially accidental memory typing confusion, Intel QPI also allows explicit non-coherent software requests to known coherent memory. In this situation, the protocol transfers the responsibility for correctness to the software.
A few words about protocol dependencies conclude the discussion of the protocol layer. With so many message types traversing the topology, the architecture of the Intel QuickPath Interconnect defines rules to guarantee the eventual completion of every transaction. These rules characterize a transaction flow as a sequence of messages, and delimit constraints on the behavior of each message as part of the flow. Two types of dependencies exist. Flow control dependencies simply state that an agent cannot sink a message until it acquires the credits to pass it on. The definition of how messages may proceed through the enumerated message classes precludes loops. Handshake or message dependencies govern how an agent replies to a message. These rules do not affect the forward progress of in-flight messages; they pertain only to initiating new responses. Some agents must sink messages without depending on anything else; they do so via internal buffering and the credit scheme. Other agents merely delay responding until they have collected the credits they need to handle the request.
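The flow-control dependency can be sketched with a toy credit counter; the classes are illustrative, not Intel QPI structures:

```python
# Sketch of the flow-control dependency rule: an agent may not sink
# (remove and consume) a message it must pass on until it holds a credit
# for the next hop; credits return as downstream buffers free up.
class Link:
    def __init__(self, credits: int):
        self.credits = credits

    def acquire(self) -> bool:
        if self.credits > 0:
            self.credits -= 1
            return True
        return False

    def release(self):
        self.credits += 1          # receiver freed a buffer entry

def try_forward(inbound: list, outbound: Link):
    """Sink the head-of-line message only if an outbound credit is held."""
    if inbound and outbound.acquire():
        return inbound.pop(0)      # message sunk and passed on
    return None                    # stalls until a credit is released
```

Because credits only ever flow back as buffers drain, and message classes may not loop, this scheme cannot produce a circular wait, which is exactly the property the dependency rules are defined to guarantee.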