The Link Layer
The link layer abstracts the physical layer, including detecting and correcting physical layer errors, to provision higher-level layers with reliable link-level transmission and flow control. It presents an interface of multiple message classes and virtual networks, and manages access to these resources via a credit accounting scheme. In addition to supplying a reliable delivery mechanism, the link layer also provides the ability to utilize all available bandwidth while minimizing latency. It accomplishes this with special packets and command insert packets that interleave at the flit level within larger, multi-flit payload packets.
The upper part of Figure 2 shows the interleaving of special packets, which may be IDLEs, provide general debug information, or have to do with link power management. Although both examples illustrate three-flit data packet headers, the architecture itself allows for headers from one to three flits in length. Note that special packets differ from data packets: a special packet consists of only a single flit, which also doubles as a header flit. The first bit of the packet header, called the interleave/head indicator bit (IIB), identifies the beginning of a new, interleaved packet among the flits of a data packet. A special packet may be interleaved anywhere within a data packet.
The lower part illustrates the optional command insert feature. This is the interleaving of protocol messages, such as snoops, snoop responses, completions, or requests, among the data flits of a data packet. A command packet is identified as a header flit with its IIB set. It must follow a data flit. Note that interleaving command packets among the header flits of a data packet is forbidden. A command packet may comprise one or more header flits. Thus in the lower half of Figure 2, we see three command insert packets, labeled 5, 8, and 10. Packet 5 comprises two flits. Recall that special packets may be inserted anywhere; thus we see special packets 6 and 7 interleaved between the flits of command insert packet 5.
Virtualization of the Physical Link
Intel QPI agents employ a coherency protocol to correctly share data while at the same time supporting system-level features like interrupts. To do this consistently, the protocol layer must have certain guarantees about how various message types flow through the fabric. The link layer satisfies this requirement by mapping each protocol-level message to one of six message classes, detailed in Table 1. Note that the table shows a seventh message class, which is used for link management and is not visible to the protocol layer. A "No" in the "Data Content?" column indicates that the message class does not use Data Payload packets. Message classes are required at the protocol level to prevent protocol-level deadlock while also allowing some flexibility of ordering for performance. The section on the protocol layer discusses this in further detail.
The message classes are further replicated onto multiple virtual networks. Intel QPI defines three: VNA, VN0, and VN1. VN0 and VN1 each have a channel per message class and are independent of each other. This means that the ability to inject into one network does not rely on traffic draining from the other. The number of virtual networks required to guarantee deadlock and livelock avoidance depends on the number of nodes in the system, the interconnect topology, and the amount of buffering provided. VN0 and VN1 provide these guarantees for many such designs; depending on the system, one or both may not be required and may therefore go unused. VNA is added for low-cost performance in the general case; it is adaptively buffered and shared among all traffic-generating agents. In contrast, VN0 and VN1 are independently buffered. They would require far more independent buffers than VNA to achieve the same performance. If a system implements a policy to first consume all VNA credits, and then fall back on VN0 or VN1 credits as appropriate to avoid deadlock, VN0 and VN1 serve in some sense as escape channels. (See "Interconnection Networks: An Engineering Approach," by Jose Duato, et al., Morgan Kaufmann, 2002.) Note that unlike message classes, which address deadlock at the protocol level and therefore relate to the sourcing and sinking of messages, virtual networks address deadlock at the routing level, within the interconnect topology itself. They ensure that the crediting scheme described in the next section is sufficient to prevent deadlocks due to circular dependencies as messages acquire the required resources to traverse an Intel QPI link.
To achieve independence of both message classes and virtual networks, the architecture of the Intel QPI defines independent virtual channels, where a virtual channel is specified by the combination of an individual message class and virtual network. By definition, the protocol layer may treat these virtual channels as distinct physical transmission channels. However, the link layer keeps costs low while still providing this illusion by multiplexing the virtual channels onto the single physical link. Figure 3 provides a representation of the role of the link layer in performing this task and the resulting 18 virtual channels per link.
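The count of 18 follows directly from the definition of a virtual channel as one message class paired with one virtual network. The short Python sketch below simply enumerates the pairs. The six message class names are an assumption: the text states only that Table 1 lists six protocol-visible classes and mentions HOM by name, so the other five labels are placeholders drawn from common QPI usage.

```python
from itertools import product

# Assumption: class names other than HOM are not given in this text.
MESSAGE_CLASSES = ["HOM", "SNP", "NDR", "DRS", "NCB", "NCS"]
VIRTUAL_NETWORKS = ["VNA", "VN0", "VN1"]


def virtual_channels():
    """Return every (virtual network, message class) pair."""
    return list(product(VIRTUAL_NETWORKS, MESSAGE_CLASSES))


channels = virtual_channels()
print(len(channels))  # 6 message classes x 3 virtual networks
```

The link-management message class is left out because it is not visible to the protocol layer.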
The architecture of the Intel QuickPath Interconnect does not specify how an implementation should utilize resources to create virtual channels. It only requires that any implementation guarantee that messages flow through its virtual channels completely independently. One possible implementation is to provide independent buffering per virtual channel. Since VN0 and VN1 exist only to guarantee forward progress, one choice is to assign the minimal amount of buffer space to these two virtual networks. In this case, the Rx would allocate a buffer the size of the largest possible packet per message class, per virtual network, for a total of 12 packet-sized buffers. The remaining buffer space, as constrained by the area of the module, would default to VNA. To utilize VNA most efficiently when shared among packets of highly variable size, the Rx could allocate VNA buffer space at flit granularity. To guarantee deadlock freedom, a message that becomes blocked while travelling on VNA must be able to transition to VN0 or VN1 within the fabric. This can be accomplished by acquiring a VN0 or VN1 buffer in the Rx of the next node while occupying a VNA buffer in the current node. Note that the Rx need not statically assign buffers, even though it must guarantee that messages occupying them behave as though in queues. A possible way to implement such a scheme would be to create a linked list per virtual channel within a buffer pool.
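The linked-list idea can be illustrated with a minimal Python sketch: one shared array of slots, a free list, and a head and tail pointer per virtual channel so that each channel still behaves as a FIFO queue. The class name BufferPool, its methods, and the slot layout are all hypothetical; the architecture does not prescribe any of them. The sketch also omits the per-channel reservation a real design would need so that one channel cannot exhaust the pool.

```python
NIL = -1  # marks "no slot"


class BufferPool:
    """Shared slot pool with one FIFO linked list per virtual channel."""

    def __init__(self, num_slots):
        self.data = [None] * num_slots
        # Initially every slot is chained into the free list.
        self.next = list(range(1, num_slots)) + [NIL]
        self.free_head = 0 if num_slots > 0 else NIL
        self.head = {}  # channel -> oldest occupied slot
        self.tail = {}  # channel -> newest occupied slot

    def enqueue(self, channel, item):
        """Store item at the tail of channel's queue; False if pool is full."""
        if self.free_head == NIL:
            return False
        slot = self.free_head
        self.free_head = self.next[slot]
        self.data[slot] = item
        self.next[slot] = NIL
        if self.head.get(channel, NIL) == NIL:
            self.head[channel] = slot
        else:
            self.next[self.tail[channel]] = slot
        self.tail[channel] = slot
        return True

    def dequeue(self, channel):
        """Remove and return the oldest item of channel, or None if empty."""
        slot = self.head.get(channel, NIL)
        if slot == NIL:
            return None
        self.head[channel] = self.next[slot]
        item = self.data[slot]
        self.data[slot] = None
        # Return the slot to the free list.
        self.next[slot] = self.free_head
        self.free_head = slot
        return item
```

Because slots are linked rather than statically assigned, a slot freed by one channel can be reused by any other, while each channel still drains in arrival order.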
To reiterate, the link layer works in units of flits, a contraction of "flow control units". For Intel QPI a flit is always 80 bits regardless of the physical link width; the physical layer abstracts this for the link layer.
The concept behind flow control is to only send as much data as the other side can receive. Intel QPI has link layer buffers to hold data from receipt on the link until consumption. Buffers may store flits or packets, depending on the virtual network on which the data was sent. Agents use credits as proxies for available buffer space. After reset, the receiver must send credits to the sender as part of the normal credit return process. Once the sender receives credits, it may begin transmitting. The sender decrements its credit count with each outgoing transmission, and stops sending if it runs out. At the same time, the receiver sends credits corresponding to freed buffers back to the sender via an embedded flow control stream. This mechanism guarantees that the receiver will be able to accept any data sent to it.
To understand the embedded flow control stream, we must first learn a little more about how the architecture of the Intel QuickPath Interconnect defines packets. Recall that packets are the granularity at which the protocol layer works. In addition to carrying protocol-specific information, packets also carry routing information, which is used by the routing layer, and link-layer information, which is used by the link layer.
For example, suppose that sockets A and B are connected by a link pair. Each socket has enough buffers to hold two flits from VNA. On system initialization, socket A is populated with two credits to send two flits to B on VNA, and vice-versa. Socket A sends a message requiring two flits on VNA to B. B processes the message, frees two flits of VNA buffer space, and is ready to receive another message from A. How does A know when it can send another message on VNA to B?
Meanwhile, B decides to send a message to A, say a new Read Data request on the HOM message class. Every packet has a header flit that includes a three-bit field specifying the return of virtual channel credits. B sets this field to indicate the return of two VNA credits. When A's Rx receives B's request, A decodes the credit return field and credits its Tx with two VNA credits. Now A's Tx can send another message on VNA to B. The encoding of the credit return field is provisioned to return individual credits per message class to VN0 and VN1, and allows VNA credit return in groups of 2, 8, or 16 flits, for high throughput. It must also be able to specify no credit return. If the protocol layer has no packets to send, the link layer still transmits a constant stream of IDLE packets, thus guaranteeing timely credit return even when the other direction of the link pair would otherwise be unused.
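The exchange between sockets A and B can be traced with a small Python sketch of credit accounting for a single VNA flit pool. The class and method names are hypothetical, and the sketch counts plain flit credits rather than modeling the encoded credit return field or the per-message-class credits of VN0 and VN1.

```python
class CreditedSender:
    """Tx side: may only send while it holds enough credits."""

    def __init__(self):
        self.credits = 0  # nothing may be sent until credits arrive

    def receive_credits(self, count):
        self.credits += count

    def try_send(self, num_flits):
        """Spend credits and return True, or return False if short."""
        if num_flits > self.credits:
            return False
        self.credits -= num_flits
        return True


class CreditedReceiver:
    """Rx side: advertises free buffer space as credits."""

    def __init__(self, buffer_flits):
        self.free = buffer_flits
        self.pending_return = buffer_flits  # initial grant after reset

    def accept(self, num_flits):
        # The credit scheme guarantees this space exists.
        assert num_flits <= self.free
        self.free -= num_flits

    def consume(self, num_flits):
        """Free buffer space and queue the credits for return."""
        self.free += num_flits
        self.pending_return += num_flits

    def take_credit_return(self):
        """Credits to piggyback on the next packet sent the other way."""
        count, self.pending_return = self.pending_return, 0
        return count


# Socket A's Tx and socket B's Rx, with two flits of VNA buffering.
a_tx = CreditedSender()
b_rx = CreditedReceiver(buffer_flits=2)
a_tx.receive_credits(b_rx.take_credit_return())  # initial two credits
print(a_tx.try_send(2))   # A sends a two-flit message
b_rx.accept(2)
print(a_tx.try_send(2))   # A is out of credits and must wait
b_rx.consume(2)           # B frees the buffers
a_tx.receive_credits(b_rx.take_credit_return())  # rides on B's packet
print(a_tx.try_send(2))   # A may send again
```

The sender never needs to ask whether space is available: it holds credits only for space the receiver has already set aside.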
In addition to flow control, the link layer also ensures reliable transmission, that is, the guaranteed delivery of error-free data. To safeguard data integrity, Intel QPI employs a CRC to detect errors. An 80-bit flit delivers 72 bits of payload and 8 bits of CRC. The sender generates a CRC code for each flit at the Tx side, which the recipient checks at the Rx. The receiver then acknowledges groups of error-free flits. So that errors can be recovered from, the sender maintains a link-level retry buffer of unacknowledged flits. If the recipient detects an error, it asks the sender to resend flits starting with the erroneous one. The sender removes flits from its buffer as they are acknowledged. This definition of link-level retry handles both transient and burst errors, while always sending the flits in order. For error tracking and predictive maintenance, Intel QPI also reports errors up the software stack, but no higher-level intervention is required to ensure correct data transmission and receipt.
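A rough Python sketch of the two mechanisms, per-flit CRC and a retry buffer of unacknowledged flits, follows. Several details are assumptions made only for illustration: the generator polynomial 0x07 is a generic CRC-8 and not the one Intel QPI uses, flits are modeled as 9 payload bytes plus 1 CRC byte, and explicit sequence numbers stand in for whatever bookkeeping a real link uses to identify flits.

```python
from collections import deque


def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8 (assumed polynomial, not the QPI one)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def make_flit(payload: bytes) -> bytes:
    """Append 8 bits of CRC to 72 bits of payload."""
    if len(payload) != 9:
        raise ValueError("payload must be 72 bits (9 bytes)")
    return payload + bytes([crc8(payload)])


def check_flit(flit: bytes) -> bool:
    """Rx-side check: recompute the CRC and compare."""
    return len(flit) == 10 and crc8(flit[:9]) == flit[9]


class RetrySender:
    """Tx side: keeps every flit until the receiver acknowledges it."""

    def __init__(self):
        self.next_seq = 0
        self.unacked = deque()  # (sequence number, flit), oldest first

    def send(self, payload: bytes) -> bytes:
        flit = make_flit(payload)
        self.unacked.append((self.next_seq, flit))
        self.next_seq += 1
        return flit

    def ack(self, up_to_seq: int) -> None:
        """Drop all flits with sequence number <= up_to_seq."""
        while self.unacked and self.unacked[0][0] <= up_to_seq:
            self.unacked.popleft()

    def retry_from(self, seq: int) -> list:
        """Flits to resend, in order, starting with the erroneous one."""
        return [flit for s, flit in self.unacked if s >= seq]
```

Resending everything from the erroneous flit onward is what keeps flits in order: the receiver discards what followed the error and accepts the replayed stream instead.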
The receiver may process a flit as soon as it has confirmed its CRC checksum, without waiting for the whole packet to arrive. If the receiver is the final destination, this means sending the flit to the protocol layer; otherwise, it forwards it on to continue its route through the fabric.
Intel QPI additionally supports an optional rolling CRC. In this case the CRC takes into account two flits instead of one, effectively using 16 bits of CRC. This allows detection of more erroneous bits at the cost of delaying flit consumption: the receiver must wait for both flits to arrive before it can compute the checksum and use either. Thus rolling CRC incurs a latency penalty of an additional flit.
In previous, FSB-based systems, the core provided a hint regarding which eight byte chunk of a cache line it wanted to consume first. It passed this information via the FSB on address bits 5:3. This hint allowed the responder to provide the critical chunk first. Table 2 delineates the critical chunk order of a cache line. In addition to changing the placement of the critical chunk, the order of the other chunks is often scrambled as well. These sequences were originally designed for ease of implementation, and are retained for compatibility with memory and FSB products.
Even though it takes 5.6 ns to transfer a cache line across a link in a nine flit packet (full width, 6.4 GT/s), the first eight bytes of data (the critical chunk) can be sent to the core immediately upon passing the flit CRC check. To speed this check, the link layer appends the CRC at the flit granularity, allowing flit consumption and routing without waiting for the entire packet. Thus the critical chunk arrives in the second flit (the first flit after the packet header) and can be processed after only 1.25 ns.
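The two latency figures can be reproduced with a little arithmetic, sketched here in Python. The lane count of 20 for a full-width link is an assumption not stated in this section; it is chosen because it is consistent with the 5.6 ns and 1.25 ns figures quoted above. The function names are illustrative only.

```python
FLIT_BITS = 80
FULL_WIDTH_LANES = 20   # assumption: lanes in a full-width link
TRANSFER_RATE = 6.4e9   # transfers per second per lane (6.4 GT/s)


def flit_time_ns(lanes=FULL_WIDTH_LANES, rate=TRANSFER_RATE):
    """Time to move one 80-bit flit across the link, in nanoseconds."""
    transfers_per_flit = FLIT_BITS / lanes
    return transfers_per_flit / rate * 1e9


def arrival_ns(num_flits):
    """Time at which the num_flits-th flit has fully arrived."""
    return num_flits * flit_time_ns()


print(flit_time_ns())  # one flit
print(arrival_ns(9))   # whole nine-flit cache line packet
print(arrival_ns(2))   # header flit plus the critical chunk flit
```

Under these assumptions a flit takes four transfers, or 0.625 ns, so nine flits take 5.625 ns (quoted as 5.6 ns) and the second flit completes at 1.25 ns.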