September 26, 2022


Doubled speeds and flexible fabrics

Although technically still the new kid on the block, the Compute Express Link (CXL) host-to-device connectivity standard has quickly taken hold in the server market. Designed to offer a rich I/O feature set built on top of the existing PCI-Express standard—most notably cache coherency between devices—CXL is poised for use in everything from better connectivity between CPUs and accelerators in servers, to attaching DRAM and non-volatile memory over what is physically still a PCIe interface. It’s an ambitious yet widely supported roadmap that, in three short years, has made CXL the de facto advanced standard for connecting devices, leading to competing standards Gen-Z, CCIX and, as of yesterday, OpenCAPI, all dropping out of the race.

And while the CXL consortium is taking a quick victory lap this week after winning the interconnect wars, there is still much work to be done by the consortium and its members. In terms of products, the first x86 processors with CXL are barely shipping—largely depending on what you want to call the state of limbo that Intel’s Sapphire Rapids chips are in—and in terms of functionality, device vendors want more bandwidth and more features than were in the original 1.x versions of CXL. Winning the interconnect wars makes CXL the king of interconnects, but in the process it means that CXL must be able to handle some of the more complex use cases that the competing standards were designed for.

To that end, at this week’s Flash Memory Summit 2022, the CXL Consortium is on the show floor to announce the next full version of the CXL standard, CXL 3.0. Following the 2.0 standard, which was released in late 2020 and introduced features such as memory pooling and CXL switches, CXL 3.0 focuses on major improvements in several critical interconnect areas. The first of these is the physical side, where CXL doubles its per-lane throughput to 64 GT/second. Meanwhile, on the logical side of things, CXL 3.0 greatly expands the protocol capabilities of the standard, enabling complex interconnect and fabric topologies, as well as more flexible modes of memory sharing and memory access within a group of CXL devices.

CXL 3.0: Built on PCI-Express 6.0

Starting with the physical aspects of CXL, the new version of the standard provides the long-awaited update to PCIe 6.0. Both previous versions of CXL, 1.x and 2.0, were built on top of PCIe 5.0, so this is the first time since CXL’s introduction in 2019 that its physical layer has been updated.

Itself a major update to the inner workings of the PCI-Express standard, PCIe 6.0 once again doubled the amount of bandwidth available over the bus to 64 GT/second, which for an x16 card works out to 128 GB/second. This was achieved by switching PCIe from binary (NRZ) signaling to four-level (PAM4) signaling and adopting a fixed-size packet, the flow control unit (FLIT), allowing it to double speeds without the disadvantages of operating at even higher frequencies. Since CXL is in turn built on top of PCIe, the standard needed to be updated to account for these PCIe operational changes.
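As a quick back-of-the-envelope check of those headline numbers (an illustration only, not a figure lifted from the spec), 64 GT/second per lane across 16 lanes, divided by eight bits per byte, works out to the 128 GB/second figure per direction before FLIT and FEC overhead:

```python
# Back-of-the-envelope bandwidth for a PCIe 6.0 / CXL 3.0 x16 link.
# These figures ignore FLIT header, CRC, and FEC overhead, so they are upper bounds.
transfers_per_sec = 64e9   # 64 GT/s per lane (PAM4 signaling)
lanes = 16                 # x16 link
bits_per_transfer = 1      # each transfer moves one bit per lane

raw_bits_per_sec = transfers_per_sec * lanes * bits_per_transfer
raw_bytes_per_sec = raw_bits_per_sec / 8

print(f"Raw x16 bandwidth: {raw_bytes_per_sec / 1e9:.0f} GB/s per direction")
# -> Raw x16 bandwidth: 128 GB/s per direction
```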

The bottom line for CXL 3.0 is that it inherits the full bandwidth improvements of PCIe 6.0 – along with all the fun stuff like forward error correction (FEC) – doubling the total bandwidth of CXL compared to CXL 2.0.

Notably, according to the CXL Consortium, all of this was achieved without increasing latency. This was one of the challenges PCI-SIG faced when designing PCIe 6.0, as the required error correction would have added latency to the process, resulting in PCI-SIG using a low-latency form of FEC. Even so, CXL 3.0 goes a step further in its efforts to keep latency down, with the result that 3.0 has the same latency as CXL 1.x/2.0.

Besides the major PCIe 6.0 update, the CXL consortium also changed the FLIT size. While CXL 1.x/2.0 used a relatively small 68-byte packet, CXL 3.0 increases it to 256 bytes. The much larger FLIT size is one of the key communication changes in CXL 3.0, as it gives the standard many more bits in the FLIT header, which in turn are needed to enable the complex topologies and fabrics introduced by the 3.0 standard. As an added feature, CXL 3.0 also offers a low-latency “variant” FLIT mode that breaks the CRC into 128-byte “sub-FLIT granular transfers”, which is designed to mitigate store-and-forward costs in the physical layer.
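To get a feel for why the 128-byte granularity helps, here is a rough, purely illustrative store-and-forward sketch; real switch and PHY latencies depend on the implementation, so treat the numbers as orders of magnitude only. A hop that must buffer an entire unit before forwarding it waits for the whole unit to arrive, so halving the unit roughly halves that serialization wait:

```python
# Rough serialization-time comparison for store-and-forward at the PHY.
# Purely illustrative; real latencies depend on the implementation.
link_bytes_per_sec = 128e9   # x16 link at 64 GT/s, ~128 GB/s raw

def serialization_ns(unit_bytes: int) -> float:
    """Time to clock one unit onto the wire, in nanoseconds."""
    return unit_bytes / link_bytes_per_sec * 1e9

print(f"Full 256-byte FLIT : {serialization_ns(256):.1f} ns")
print(f"128-byte sub-unit  : {serialization_ns(128):.1f} ns")
# -> Full 256-byte FLIT : 2.0 ns
# -> 128-byte sub-unit  : 1.0 ns
```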

It should be noted that the 256-byte FLIT size keeps CXL 3.0 compatible with PCIe 6.0, which itself uses a 256-byte FLIT. And like its underlying physical layer, CXL supports the use of the large FLIT not only at the new 64 GT/second transfer rate, but also at 32, 16 and 8 GT/second, essentially allowing the new protocol features to be used with slower transfer speeds.

Finally, CXL 3.0 is fully backwards compatible with earlier versions of CXL. So devices and hosts can be downgraded as needed to match the rest of the hardware chain, albeit losing newer features and speeds in the process.

CXL 3.0 Features: Improved Coherence, Memory Sharing, Multi-tier Topologies and Fabrics

In addition to further improving overall I/O bandwidth, the aforementioned changes to the CXL protocol have also been implemented in the service of enabling new features within the standard. CXL 1.x was born as a (relatively) simple host-to-device standard, but now that CXL is set to be the dominant device connectivity protocol for servers, it needs to expand its capabilities to handle both more advanced devices and, eventually, larger use cases.

Kicking things off at the feature level, the biggest news here is that the standard has updated the cache coherency protocol for devices with memory (Type-2 and Type-3, in CXL parlance). Enhanced coherency, as CXL calls it, allows devices to back-invalidate data that is cached by a host. This replaces the bias-based coherency approach used in earlier versions of CXL, which, to keep things short, maintained coherency not so much by sharing control of memory space, but rather by placing either the host or the device in charge of controlling access. In contrast, back invalidation is much closer to a true shared/symmetric approach, allowing CXL devices to inform a host when the device has made a change.

Enabling back invalidation also opens the door to new connectivity between devices. In CXL 3.0, devices can now directly access each other’s memory without having to go through a host, using the enhanced coherency semantics to inform each other of their state. Skipping the host is not only faster from a latency perspective, but in a setup involving a switch it means devices aren’t eating up valuable host-to-switch bandwidth with their requests. And while we’ll get into topologies a bit later, these changes go hand-in-hand with larger topologies, allowing devices to be organized into virtual hierarchies where all devices in a hierarchy share a coherency domain.
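To make the back-invalidation flow concrete, here is a deliberately simplified toy model in Python. The class and method names are invented for illustration and do not correspond to actual CXL.mem messages or any vendor API; the point is only the ordering, with the device forcing the host to drop its stale cached copy before the line is updated:

```python
# Toy model of back invalidation. The classes and calls below are invented
# for illustration and do not mirror the real CXL.mem message set.

class HostCache:
    """A host-side cache of lines that live in device-attached memory."""
    def __init__(self) -> None:
        self.lines: dict[int, int] = {}   # address -> cached value

    def back_invalidate(self, addr: int) -> None:
        """Device-initiated request to drop a cached line."""
        self.lines.pop(addr, None)

class DeviceMemory:
    """Device-attached memory that tracks which lines the host has cached."""
    def __init__(self, host: HostCache) -> None:
        self.host = host
        self.data: dict[int, int] = {}
        self.host_cached: set[int] = set()

    def host_read(self, addr: int) -> int:
        value = self.data.get(addr, 0)
        self.host.lines[addr] = value
        self.host_cached.add(addr)
        return value

    def device_write(self, addr: int, value: int) -> None:
        # The device must invalidate the host's stale copy before updating.
        if addr in self.host_cached:
            self.host.back_invalidate(addr)
            self.host_cached.discard(addr)
        self.data[addr] = value

host = HostCache()
mem = DeviceMemory(host)
mem.device_write(0x1000, 1)
mem.host_read(0x1000)          # host now caches the line
mem.device_write(0x1000, 2)    # back invalidation removes the host's copy
assert 0x1000 not in host.lines
```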

Along with the coherency improvements, CXL 3.0 also introduces some important updates to memory sharing between hosts and devices. While CXL 2.0 offered memory pooling, where multiple hosts could access the memory of a single device but each had to be given its own dedicated segment of that memory, CXL 3.0 introduces true memory sharing. Using the new enhanced coherency semantics, multiple hosts can hold a coherent copy of a shared segment, with back invalidation used to keep all of the hosts in sync if something changes at the device level.

It should be noted, however, that this is not a complete substitute for pooling. There are still use cases where CXL 2.0-style pooling would be preferable (maintaining coherency comes with trade-offs), and CXL 3.0 supports mixing and matching the two modes as needed.
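The pooling-versus-sharing distinction can be boiled down to a few lines of illustrative Python (a toy model, not an API from the standard): with pooling, each host gets an exclusive slice of the device’s capacity, while with sharing, several hosts map the very same coherent segment:

```python
# Toy contrast between CXL 2.0-style pooling and CXL 3.0-style sharing.
# Names and structures are illustrative only.

# Pooling: the device's capacity is carved into exclusive, per-host segments.
pooled_device = {
    "host-A": bytearray(64),   # host A's dedicated slice
    "host-B": bytearray(64),   # host B's dedicated slice; A never touches it
}

# Sharing: several hosts map the very same segment and rely on hardware
# coherency (back invalidation) to keep their cached copies in sync.
shared_segment = bytearray(64)
shared_device = {
    "host-A": shared_segment,
    "host-B": shared_segment,   # the same object: writes are visible to both
}

shared_device["host-A"][0] = 42
assert shared_device["host-B"][0] == 42   # both hosts observe the update
assert pooled_device["host-B"][0] == 0    # pooled slices stay independent
```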

Further extending this improved host-device functionality, CXL 3.0 removes previous limitations on the number of Type-1/Type-2 devices that can be attached downstream of a single CXL root port. While CXL 2.0 only allowed one of these processing devices to be present downstream of a root port, CXL 3.0 removes these restrictions entirely. A CXL root port can now support a full mix-and-match setup of Type-1/2/3 devices, depending on the system builder’s goals. Notably, this means being able to attach multiple accelerators to a single switch, improving density (more accelerators per host) and making the new peer-to-peer transfer features much more useful.

The other big feature change in CXL 3.0 is support for multi-level switching. This builds on CXL 2.0, which introduced support for CXL protocol switches but only allowed a single switch to reside between a host and its devices. Multi-level switching, on the other hand, allows for multiple layers of switches—that is, switches feeding other switches—which greatly increases the kinds and complexity of supported network topologies.

Even with only two layers of switches, this is enough flexibility to allow non-tree topologies such as rings, meshes, and other fabric setups. And the individual nodes can be hosts or devices, without any type restrictions.

Meanwhile, for truly exotic setups, CXL 3.0 can even support spine/leaf architectures, where traffic is routed through top-level spine nodes whose sole job is to route traffic further down to lower-level (leaf) nodes that in turn contain the actual hosts/devices.
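One minimal way to picture how this differs from CXL 2.0’s single-switch tree is to write a small fabric out as an adjacency list; the node names below are invented for illustration. A spine/leaf layout has more links than a tree on the same nodes, which is exactly what gives it redundant paths:

```python
# Illustrative spine/leaf CXL fabric written as an adjacency list.
# Node names are invented; spine switches only route, leaves attach hosts/devices.
fabric = {
    "spine-1": ["leaf-1", "leaf-2"],
    "spine-2": ["leaf-1", "leaf-2"],
    "leaf-1":  ["host-A", "accel-1"],
    "leaf-2":  ["host-B", "memory-1"],
}

edges = sum(len(links) for links in fabric.values())
nodes = len(set(fabric) | {n for links in fabric.values() for n in links})

# A tree on N nodes has exactly N - 1 edges; anything more implies redundant paths.
kind = "non-tree fabric" if edges > nodes - 1 else "tree"
print(f"{nodes} nodes, {edges} edges -> {kind}")
# -> 8 nodes, 8 edges -> non-tree fabric
```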

Finally, all of these new memory and topology/fabric capabilities can be used together in what the CXL Consortium calls Global Fabric Attached Memory (GFAM). GFAM, in short, takes the idea of a CXL memory expansion device (Type-3) to the next level by further disaggregating memory from a given host. A GFAM device, in this respect, is functionally its own shared pool of memory that hosts and devices can access as needed. And a GFAM device can contain both volatile and non-volatile memory together, such as DRAM and flash memory.
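Very loosely (and using invented names throughout), a GFAM device can be thought of as one large fabric-attached pool with heterogeneous media, from which any host or device carves out allocations on demand:

```python
# Loose sketch of a GFAM-style device: one fabric-attached pool with mixed
# media, allocated on demand. Tier names and sizes are illustrative only.
gfam = {
    "dram":  {"capacity_gb": 512,  "allocated": {}},   # volatile tier
    "flash": {"capacity_gb": 4096, "allocated": {}},   # non-volatile tier
}

def allocate(tier: str, requester: str, size_gb: int) -> bool:
    """Grant a slice of the pool if the tier still has room."""
    pool = gfam[tier]
    used = sum(pool["allocated"].values())
    if used + size_gb > pool["capacity_gb"]:
        return False
    pool["allocated"][requester] = pool["allocated"].get(requester, 0) + size_gb
    return True

allocate("dram", "host-A", 128)      # a host grabs DRAM for hot data
allocate("flash", "accel-1", 1024)   # an accelerator grabs flash for a dataset
print(gfam["dram"]["allocated"], gfam["flash"]["allocated"])
```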

GFAM, in turn, is what will allow CXL to be used to efficiently support large, multi-node setups. In one of the Consortium’s examples, GFAM gives CXL 3.0 the performance and efficiency needed to implement MapReduce across a cluster of CXL-connected machines. MapReduce is, of course, a very popular algorithm for use with accelerators, so extending CXL to better handle a workload common to clustered accelerators is an obvious (and perhaps necessary) next step for the standard. Although it does blur the lines a bit between where a local interconnect like CXL ends and a network interconnect like InfiniBand begins.

Ultimately, the biggest limitation may be the number of supported nodes. CXL’s addressing mechanism, which the Consortium calls Port Based Routing (PBR), supports up to 2^12 (4,096) devices. So a CXL setup can only scale out so far, especially as accelerators, attached memory, and other devices quickly eat up ports.
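That scaling ceiling is easy to make concrete, keeping in mind that the per-host device mix below is purely hypothetical: a 12-bit PBR ID space gives 4,096 addressable endpoints, and a fabric where every host brings its own accelerators and memory expanders consumes those IDs quickly:

```python
# How quickly 12-bit PBR IDs get consumed; the per-host device mix is hypothetical.
max_endpoints = 2 ** 12                 # 4096 addressable endpoints
devices_per_host = 1 + 8 + 4            # host port + 8 accelerators + 4 memory devices

print(f"Maximum endpoints:    {max_endpoints}")
print(f"Hosts that would fit: {max_endpoints // devices_per_host}")
# -> Maximum endpoints:    4096
# -> Hosts that would fit: 315
```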

In conclusion, the completed CXL 3.0 standard is being publicly released today, the first day of FMS 2022. Officially, the Consortium does not offer any guidance on when we should expect CXL 3.0 to show up in devices – that’s up to the equipment manufacturers – but it’s safe to say it won’t happen right away. With CXL 1.1 hosts only just shipping – never mind CXL 2.0 hosts – real-world CXL deployments trail the standards by a few years, which is typical for these major industry interconnect standards.


