Performance
Now that we have an appreciation for the architecture of the Intel QuickPath Interconnect, we must turn to practicalities and ask, "What performance did it achieve?" To make a fair evaluation, we must also put that question in context: what performance were the architects striving for?
Although lower latency and higher bandwidth always seem desirable, the interconnect need only provide enough performance that it does not become the system bottleneck. In its first wave of products, Intel QPI more than achieves this goal. The theoretical maximum bandwidth is calculated as follows. First, a full-width link is 20 lanes wide and, while handling a data packet, transfers two bytes of data payload at a time. The link is double-pumped (it executes a transfer on every edge of the forwarded clock), which means that it performs two transfers per cycle, for a total of four bytes per cycle. In the initial implementation, the underlying packetized buses comprising the link can be clocked at 3.2 GHz, which corresponds to 6.4 GT/s (giga-transfers per second) and gives 12.8 GB/s of bandwidth in a single direction. Two components connected with an Intel QPI link-pair can therefore support a raw bandwidth of 25.6 GB/s. The transfer rate on a single, uni-directional link drops to 4.8 GT/s if the components are far apart. To put this in perspective, Harpertown, with a 400 MHz FSB, provides a peak bandwidth of 12.8 GB/s on certain specialized platforms.
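The arithmetic above can be reproduced in a few lines. The following Python sketch uses only the figures quoted in the text; the function and constant names are illustrative and are not taken from any Intel specification.

```python
# Peak-bandwidth arithmetic for a full-width Intel QPI link.
# All figures come from the text; all names are illustrative assumptions.

BYTES_PER_TRANSFER = 2    # data payload bytes moved per transfer
TRANSFERS_PER_CYCLE = 2   # double-pumped: one transfer per clock edge


def transfer_rate_gt(clock_ghz):
    """Transfer rate in GT/s for a given forwarded-clock frequency in GHz."""
    return clock_ghz * TRANSFERS_PER_CYCLE


def one_way_bandwidth_gb(clock_ghz):
    """Peak payload bandwidth in GB/s for one direction of the link."""
    return transfer_rate_gt(clock_ghz) * BYTES_PER_TRANSFER


def link_pair_bandwidth_gb(clock_ghz):
    """Peak payload bandwidth in GB/s for both directions combined."""
    return 2 * one_way_bandwidth_gb(clock_ghz)


if __name__ == "__main__":
    for clock in (3.2, 2.4):  # 2.4 GHz is the clock implied by 4.8 GT/s
        print(f"{clock} GHz: {transfer_rate_gt(clock):.1f} GT/s, "
              f"{one_way_bandwidth_gb(clock):.1f} GB/s one way, "
              f"{link_pair_bandwidth_gb(clock):.1f} GB/s per link-pair")
```

At the reduced 4.8 GT/s rate, the same arithmetic gives 9.6 GB/s in each direction.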
Nonetheless, it is unlikely that an end user will observe the theoretical peak bandwidth of any architecture. Generally only synthetic workloads that artificially stress the system achieve it; current Intel sockets do not generate sufficient traffic to strain the Intel QPI fabric. Furthermore, the above calculation excludes packetization overhead: the transmitted data stream is divided into smaller packets that are labeled with header information to guide them through the topology. Within the interconnect, the information being communicated is divided into 20-lane phits. A header requires four phits (at full link width), and the typical data payload is a 64 B cache line, requiring 32 phits, for a total of 36 phits in a data packet. This comes out to a packetization overhead of about 11% (4 header phits out of 36; equivalently, only 32 of every 36 transfers carry data) and a latency of 5.6 ns to transfer a cache line (36 transfers at 6.4 GT/s).
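As a rough check of these figures, the sketch below recomputes the phit count, the overhead fraction, and the transfer time from the numbers given in the text. The names are illustrative assumptions, and the model ignores everything except the header and payload phits of a single data packet.

```python
# Packetization overhead for one data packet on a full-width link.
# Figures come from the text; the names are illustrative assumptions.

HEADER_PHITS = 4       # header size at full link width
BYTES_PER_PHIT = 2     # data payload bytes carried per phit
CACHE_LINE_BYTES = 64  # typical data payload


def packet_stats(payload_bytes=CACHE_LINE_BYTES, rate_gt=6.4):
    """Return (total_phits, overhead_fraction, latency_ns) for one packet."""
    payload_phits = payload_bytes // BYTES_PER_PHIT
    total_phits = HEADER_PHITS + payload_phits
    overhead = HEADER_PHITS / total_phits
    # One phit moves per transfer, so phits / (GT/s) gives nanoseconds.
    latency_ns = total_phits / rate_gt
    return total_phits, overhead, latency_ns


if __name__ == "__main__":
    total, overhead, latency = packet_stats()
    print(f"{total} phits, {overhead:.1%} overhead, {latency:.3f} ns")
```

Under this simple model, the header overhead reduces the one-way data rate from 12.8 GB/s to about 11.4 GB/s (12.8 × 32/36) when the link carries nothing but cache-line data packets.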
Summary
The architecture of the Intel QuickPath Interconnect revolutionizes Intel system platform topologies by replacing the FSB with packetized, point-to-point link pairs. The initial implementation supplies 25.6 GB/s of peak bandwidth per link-pair and can transfer a 64 B cache line in only 5.6 ns, all with fewer pins than its FSB predecessor. The five-layer architecture targets multiprocessor distributed shared memory systems, and as such supports coherency by extending the traditional MESI protocol with a new Forward state, which reduces latency by permitting direct cache-to-cache transfers of data. The protocol supports both source snoop, for the lowest latency in small systems, and home snoop, which allows scalability by reducing snoop bandwidth in large systems.
For design ease, the physical layer features waveform equalization, deskew circuits, polarity inversion, and lane reversal. The link layer builds on this to guarantee reliable transmission. It also implements flow control via a credit scheme and defines many reliability and availability features, including link self-healing, clock failover, link-level retry, and hot swap support. Self-healing and clock failover are based on dynamic width reduction. Finally, the link layer employs an inline 8-bit CRC for low-latency, flit-level error detection, with the option of additional error protection implemented as a rolling 16-bit CRC. Virtualization of six message classes and up to three virtual networks onto 18 virtual channels guarantees deadlock and livelock avoidance and paves the way for routing optimizations in complex topologies.
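The count of 18 virtual channels follows from pairing each of the six message classes with each of the three virtual networks. The Python sketch below illustrates that pairing with a flat index. The numbering scheme is purely hypothetical and chosen for illustration; it is not the encoding used by the Intel QPI specification.

```python
# Hypothetical flat numbering of (virtual network, message class) pairs.
# Only the counts (6 classes, 3 networks, 18 channels) come from the text;
# the index layout below is an assumption made for illustration.

MESSAGE_CLASSES = 6
VIRTUAL_NETWORKS = 3


def virtual_channel(network, message_class):
    """Map a (network, message class) pair to a channel index in 0..17."""
    if not 0 <= network < VIRTUAL_NETWORKS:
        raise ValueError(f"network must be in 0..{VIRTUAL_NETWORKS - 1}")
    if not 0 <= message_class < MESSAGE_CLASSES:
        raise ValueError(f"message class must be in 0..{MESSAGE_CLASSES - 1}")
    return network * MESSAGE_CLASSES + message_class


if __name__ == "__main__":
    channels = {virtual_channel(n, m)
                for n in range(VIRTUAL_NETWORKS)
                for m in range(MESSAGE_CLASSES)}
    print(len(channels))  # every pair gets its own channel
```

Because every pair maps to a distinct index, traffic in one message class or network never has to share a channel with another in this sketch, which is the property that channel-based deadlock avoidance relies on.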
Intel QPI provides all this while still maintaining compatibility with pre-existing Intel architectural legacy features. For instance, it supports critical chunk, has a VLW interface to mimic the effects of side-band signals such as INTR, A20M, and SMI, and provides a scheme for atomicity via locks. Intel QPI additionally defines both request interaction in the PAM regions and the results for alternative memory types such as uncacheable regions.
In conclusion, the Intel QuickPath Interconnect revolutionizes Intel interconnect technology. It provides exceptional performance over existing bus technology and a wide array of new features, all while maintaining legacy compatibility. We refer the interested reader to the public specification for further information.
Acknowledgments
We would like to thank Intel Corporation for encouraging this article describing the legacy requirements placed on the Intel QuickPath Architecture. In addition, we gratefully acknowledge the help and time of many of our colleagues who contributed to this work, in particular Malini Bhandaru, Robert Maddox, Jeffrey Gilbert, Leslie Xu, Jeff Casazza, and Gurbir Singh. Michelle also thanks Mark Hill and David Wood for giving her so many opportunities to practice writing papers.


