An Introduction to the Compute Express Link (CXL) Interconnect
The Compute Express Link (CXL) is an open industry-standard interconnect between processors and devices
such as accelerators, memory buffers, smart network interfaces, persistent memory, and solid-state drives.
CXL offers coherency and memory semantics with bandwidth that scales with PCIe bandwidth while achiev-
ing significantly lower latency than PCIe. All major CPU vendors, device vendors, and datacenter operators
have adopted CXL as a common standard. This enables an inter-operable ecosystem that supports key com-
puting use cases including highly efficient accelerators, server memory bandwidth and capacity expansion,
multi-server resource pooling and sharing, and efficient peer-to-peer communication. This survey provides
an introduction to CXL covering the standards CXL 1.0, CXL 2.0, and CXL 3.0. We further survey CXL im-
plementations, discuss CXL’s impact on the datacenter landscape, and future directions.
CCS Concepts: • Computer systems organization → Interconnection architectures; • Hardware →
Memory and dense storage; • General and reference → Surveys and overviews;
Additional Key Words and Phrases: Compute eXpress link, CXL, DRAM, memory, memory disaggregation,
memory tiering
ACM Reference Format:
Debendra Das Sharma, Robert Blankenship, and Daniel Berger. 2024. An Introduction to the Compute Express
Link (CXL) Interconnect. ACM Comput. Surv. 56, 11, Article 290 (July 2024), 37 pages. [Link]
3669900
1 Introduction
The Compute Express Link (CXL) is an open industry standard that defines a family of intercon-
nect protocols between CPUs and devices. While the CXL specification [3] and short summaries
by news outlets are available, this tutorial seeks to provide technical details while staying accessi-
ble to a broad systems audience. The CXL protocol spans the entire compute stack and touches on
many branches of computer science. To manage this breadth, we focus on fundamental engineer-
ing aspects, with less focus on security features, security implications, and higher-level software
functions.
As a general device interconnect, CXL takes a broad definition of devices including graph-
ics processing units (GPUs), general purpose graphics processing units (GP-GPUs), field
Authors’ Contact Information: Debendra Das Sharma, Intel Corporation, Santa Clara, California, United States; e-mail:
[Link]@[Link]; Robert Blankenship, Intel Corporation, Santa Clara, California, United States; e-mail:
[Link]@[Link]; Daniel Berger, Microsoft, Redmond, Washington, United States and University of Wash-
ington, Seattle, USA; e-mail: daberg@[Link].
This work is licensed under a Creative Commons Attribution International 4.0 License.
© 2024 Copyright held by the owner/author(s).
ACM 0360-0300/2024/07-ART290
[Link]
ACM Comput. Surv., Vol. 56, No. 11, Article 290. Publication date: July 2024.
Fig. 1. CXL enables coherency and memory semantics and builds on top of PCIe’s physical subsystem.
integrity challenges. In principle, PCIe pins would be a great alternative due to their superior memory
bandwidth per pin, even with the added latency of serialization/deserialization, as discussed
later. For example, a x16 Gen5 PCIe port at 32 GT/s offers 256 GB/s with 64 signal pins. DDR5-6400
offers 50 GB/s with ∼200 signal-pins. PCIe also supports longer reach with retimers,1 which would
allow moving memory farther away from CPUs and using more than 15 W of power per DIMM,
resulting in superior performance. Unfortunately, PCIe does not support coherency, and device-
attached memory cannot be mapped to the coherent memory space. Thus, PCIe has not been able
to replace DDR.
Another scaling challenge is that DRAM memory cost per bit has recently stayed flat. While
there are multiple media types including Managed DRAM [69], ReRAM [70], and 3DXP/Optane [71],
the DDR standard relies on DRAM-specific commands for access and maintenance, which hinders
adoption of new media types.
Challenge 3: memory and compute inefficiency due to stranding. Today’s datacenters
are inefficient due to stranded resources. A resource, such as memory, is stranded when idle ca-
pacity remains, while another resource, such as compute, is fully used. The underlying cause is
tight resource coupling, where compute, memory, and I/O devices belong to only one server. As
a result, each server needs to be overprovisioned with memory and accelerators to handle work-
loads with peak capacity demands. For example, a server that hosts an application that needs more
memory (or accelerators) than available cannot borrow memory (or accelerators) from another un-
derutilized server in the same rack and must suffer the performance consequences of page misses.
However, servers where all cores are used by workloads often have memory remaining unused.
Stranding has adverse power [34], cost [18], and sustainability [33] implications and has been
the source of low resource utilization at Alibaba [27], AWS [32], Google [28], Meta [17, 31], and
Microsoft [18, 29, 30].
Challenge 4: fine-grained data sharing in distributed systems. Distributed systems fre-
quently rely on fine-grained synchronization. The underlying updates are often small and latency
sensitive, as work blocks on updates. Examples include partition/aggregate design patterns in web-
scale applications such as web search, social network content composition, and advertisement se-
lection [36, 37, 39, 40]. In these systems, query updates are often under 2 kB (e.g., a search result).
Other examples are distributed databases that rely on kB-scale pages and distributed consensus
with even smaller updates [35, 42–44]. Sharing data at such fine granularity means that the com-
munication delay in typical datacenter networks dominates the wait time for updates and slows
down these important use cases [36, 38]. For example, transmitting 4 kB at 50 GB/s (400 Gbit/s)
takes under 2 µs, but communication delays exceed 10 µs on current networks [38]. A coherent
shared-memory implementation can help cut down communication delays to sub-microseconds,
as we will see later.
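The arithmetic behind this comparison can be checked with a short sketch. The function name and the choice of 4 kB = 4096 bytes are assumptions made for illustration; the 10 µs value is the lower bound on network delay quoted above.

```python
def serialization_time_us(size_bytes: float, bandwidth_gb_per_s: float) -> float:
    """Time to push size_bytes onto a link of the given bandwidth, in microseconds."""
    bytes_per_us = bandwidth_gb_per_s * 1e9 / 1e6  # GB/s -> bytes per microsecond
    return size_bytes / bytes_per_us


transfer_us = serialization_time_us(4 * 1024, 50)  # 4 kB at 50 GB/s (400 Gbit/s)
network_delay_us = 10.0                            # lower bound quoted in the text [38]

# Fraction of the total wait that is spent actually moving the payload.
payload_share = transfer_us / (transfer_us + network_delay_us)

print(f"serialization time: {transfer_us:.3f} us")
print(f"payload share of total wait: {payload_share:.2%}")
```

Under these assumptions the payload itself needs roughly 0.08 µs on the wire, so the wait time is dominated almost entirely by the communication delay rather than by bandwidth.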
The role of CXL. CXL has been developed to address these four and other challenges. Since
its first release in 2019, CXL has evolved through three generations (see Section 2). Each gen-
eration specifies the interconnect and multiple protocols (see Section 3) while remaining fully
backward-compatible. Table 1 overviews the three current CXL generations and key use
case examples. CXL 1.0 multiplexes coherency and memory semantics on top of the PCIe
physical layer, as shown in Figure 1. CXL introduces custom link and transaction layers to achieve
low latency comparable to remote socket memory accesses. This addresses both Challenge 1 (co-
herency) and Challenge 2 (memory scaling) by enabling CXL devices to cache system memory.
1 Retimers enable extending the cable length between the host and a device beyond electrical limits defined in the spec-
ification (usually dozens of centimeters). They essentially retransmit a fresh copy of the signal, effectively doubling the
channel length.
This also standardizes a coherent interface to facilitate the broad adoption of PIM systems and
programming models. CPUs can also cache device memory, which also addresses Challenge 4 (fine-
grained distributed data sharing) for heterogeneous computing. Additionally, memory attached to
a CXL device can be mapped to the system cacheable memory space. This facilitates heteroge-
neous processing and helps with memory bandwidth and capacity expansion challenges (Chal-
lenge 2). CXL 1.0 also continues support for PCIe’s non-coherent producer-consumer semantics
[1, 2, 8].
CXL 2.0 additionally addresses Challenge 3 (resource stranding) by enabling resource pooling
across multiple hosts. We use host to refer to a single-socket or multi-socket system under the
control of a single operating system or hypervisor. Pooling overcomes resource stranding and
fragmentation by reassigning resources (e.g., memory) to different hosts over time without having
to reboot these hosts. The CXL protocol enables pooling by introducing CXL switches that build
a small network of hosts and memory devices.
CXL 3.0 addresses Challenge 3 on a larger scale with multiple levels of CXL switching. This
enables building dynamically composable systems at the rack or even the pod level. Furthermore,
CXL 3.0 addresses Challenge 4 (distributed data sharing) by enabling fine-grained memory sharing
across host boundaries.
CXL requires CPU and device support. The widespread adoption of CXL into commercial prod-
ucts by virtually all silicon vendors is a testament to the technology gaining widespread traction
due to its ability to solve real-world problems. This puts CXL on a viable path to solve key industry
challenges with broad deployment [8, 15, 16, 17, 18]. CXL 3.0 has also reached a level of maturity
that facilitates reviewing its fundamental design choices. This tutorial introduces background in
Section 2. We detail incremental releases of CXL 1.0, CXL 2.0, and CXL 3.0 in Sections 3, 4, and
5, respectively. Section 6 discusses CXL implementations and performance. Section 7 discusses
broader impacts and future directions.
orchestrate coherency varies widely across different architectures and design choices, which vary
over time. Additionally, not all use cases require coherency support, e.g., Challenge 2 (memory
scaling) and Challenge 3 (stranding).
Widespread adoption also calls for a plug-and-play ecosystem that is compatible with previous
generations and across different system architectures. Backwards compatibility has proven to be
one of the main reasons why PCIe has been so successful as a ubiquitous I/O interconnect for more than
two decades. It facilitates technology transitions independently for device and platform manufac-
turers. For example, a platform may migrate to PCIe Gen 5, whereas the SSD vendors may decide
to stay with Gen 4 and migrate at a later point in time based on the technology evolution of SSDs. It
also helps protect customer investments, as a customer may reuse some older generation device(s)
in a new platform, e.g., to reduce carbon emissions [33, 81].
At Intel, the idea of adding simplified coherency mechanisms on top of PCIe goes back to 2005,
motivated by accelerators needing to cache system memory. This resulted in transaction process-
ing hints to use the CPU’s cache hierarchy for faster device accesses and atomics semantics being
added to PCIe 3.0 while keeping PCIe non-coherent. Intel also initially pursued adding memory
semantics on top of PCIe 3.0 to enable pooling with a shared memory controller (SMC). How-
ever, PCIe 3.0 bandwidth, 8.0 GT/s, severely limited the number of servers that could pool resources
with one SMC. Multi-SMC topologies would have resulted in higher latency due to multiple hops.
Another challenge was cabling. In 2019, PCIe 5.0 at 32.0 GT/s (and progress on PCIe 6.0 at 64.0
GT/s) and strong cable support revived efforts at Intel. This resulted in Intel Accelerator Link
(IAL): a proprietary protocol with both caching and memory support on PCIe 5.0. Leveraging the
experience from developing PCI-Express over two decades, Intel donated the IAL 1.0 specification
and launched the CXL consortium with Alibaba, Cisco, Dell, Google, Huawei, Meta, Microsoft, and
HPE in March 2019. The IAL 1.0 specification was renamed the CXL 1.0 specification.
CXL adopts an asymmetric approach to coherency, backwards-compatibility, and openness to en-
able a diverse and open ecosystem that facilitates broad deployment. CXL coherence is decoupled
from host-specific coherence protocol details. The host processor is responsible for orchestrating
cache coherency, which keeps the implementation of coherency in devices simple. A device’s caching
agent enforces a simple MESI (modified, exclusive, shared, invalid) coherency protocol with a
small command set. CXL supports multiple use cases by offering multiple protocols that differ in
complexity, and devices can implement only a subset of protocols (Section 3).
CXL utilizes the PCIe physical layer and devices plug into PCIe slots. The backward compatible
evolution of CXL (like PCIe) as well as its interoperability with PCIe ensures that companies can
make their investment in CXL with guaranteed interoperability with prior-generation CXL devices
as well as any PCIe device. Building on PCIe infrastructure lowers the barrier to entry by enabling
reuse of IP building blocks, channels, and software infrastructure. From an SoC viewpoint, a multi-
protocol capable PCIe physical layer, as shown in Figure 2, helps reduce silicon area, pin count,
and power [1, 2, 8, 9].
When CXL was launched in 2019, the industry was fragmented with competing interconnect
standards such as OpenCAPI, GenZ, and CCIX. Since then, CXL membership has grown to about
250 companies, with all CPU, GPU, FPGA, networking, and IP providers actively contributing within the
consortium. After Intel donated CXL 1.0 in March 2019, the consortium published CXL 1.1 adding
compliance test mechanisms in September 2019. Following that, the consortium published CXL
2.0 and CXL 3.0 in November 2020 and August 2022, respectively, including more usage models,
while maintaining full backward compatibility. Over time, the industry has coalesced around CXL.
For example, competing standards GenZ and OpenCAPI have donated their IP and funds to CXL
to rally behind a common standard.
Fig. 2. Dynamic multiplexing of three protocols on PCIe physical layer with CXL [8].
protocols. Type 2 devices are accelerators such as GP-GPUs and FPGAs with local memory that
can be mapped in part to the cacheable system memory. These devices also cache system memory
for processing. Thus, they implement [Link], [Link], and [Link] protocols. Type 3 devices
are used for memory bandwidth and capacity expansion and can be used to connect to different
memory types, including supporting multiple memory tiers attached to the device. Thus, Type 3
devices would implement only the [Link] and [Link] protocols. CXL Type 3 devices offer a
cost-, power-, and pin-efficient alternative to adding more DDR channels to server CPUs, while offering
flexibility in system topologies due to longer trace lengths, which alleviates power
delivery as well as cooling constraints [8, 9].
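The relationship between device types and protocol subsets can be summarized in a small lookup table. This is a minimal sketch, assuming the three protocols are named CXL.io, CXL.cache, and CXL.mem as in the CXL specification; the Type 1 row (CXL.io plus CXL.cache) comes from the specification rather than from the passage above, and the helper names are hypothetical.

```python
from enum import Flag, auto


class Protocol(Flag):
    IO = auto()     # CXL.io: discovery, configuration, non-coherent I/O
    CACHE = auto()  # CXL.cache: the device caches system memory
    MEM = auto()    # CXL.mem: the host accesses device-attached memory


# Protocol subsets per device type (Type 1 is an assumption, see lead-in).
DEVICE_TYPES = {
    "Type 1": Protocol.IO | Protocol.CACHE,
    "Type 2": Protocol.IO | Protocol.CACHE | Protocol.MEM,
    "Type 3": Protocol.IO | Protocol.MEM,
}


def can_cache_system_memory(device_type: str) -> bool:
    """True if the device type implements the caching protocol."""
    return Protocol.CACHE in DEVICE_TYPES[device_type]


def exposes_device_memory(device_type: str) -> bool:
    """True if the device type maps its memory into the host address space."""
    return Protocol.MEM in DEVICE_TYPES[device_type]


for name in DEVICE_TYPES:
    print(name, can_cache_system_memory(name), exposes_device_memory(name))
```

Modeling the protocols as flags mirrors the point made in the text: a device implements only the subset of protocols its use case needs.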
CXL adopts a layered protocol approach. The physical layer is responsible for physical infor-
mation exchange, interface initialization, and maintenance. The data link layer (or link layer) is
responsible for reliable data transport services and establishing a logical connection between de-
vices. The transaction layer(s) handles the transactions associated with each protocol along with
any architectural ordering semantics, flow control, and credits. Each of these layers has an archi-
tected set of registers that software accesses for configuring, controlling, and obtaining status of
the link. Readers are referred to the respective specifications [5, 7] for details.
[Link] and [Link] accesses have low latency, similar to a native CPU-to-CPU symmetric
coherency link [1, 8, 9]. Thus, memory access latency from a CXL device would be similar to
memory access from a DDR bus in a remote socket. While this is higher than memory access from
a DDR bus in a local socket [79], in a 2-socket symmetric multi-processing system, it is acceptable
due to NUMA (non-uniform memory access) optimization and the higher bandwidth resulting
in lower latency in non-idle systems [8, 9, 10]. The Link layer and Transaction layer paths for
[Link] and [Link] have low latency, since they are natively Flit based and the choice of
64-byte payload of a Flit is identical to the cache-line sized transfers of these protocols. Muxing
at the PHY level (vs. higher level of the stack) helps deliver a low latency path for [Link] and
[Link] traffic. This eliminates the higher latency in the link and transaction layers of PCIe/
[Link] path due to their support for variable packet size, ordering rules, access rights checks,
and so on, conforming to the fundamental guiding principle of CXL specifications [1, 3]. The
physical layer disambiguates between [Link], [Link]-mem, ALMP, and NULL Flits (when
nothing is sent). Each of these four types of Flits has two cases to indicate whether the current
Flit is the end of the data stream (EDS) prior to sending Ordered Sets. These Ordered Sets are
injected by the physical layer for functionality such as periodic clock compensation or any link
recovery event. These eight encodings use 8 bits with a guaranteed Hamming distance of 4 and
are repeated twice to enable correction as well as detection [2, 3]. Each Ordered Set is one Block
(130 b) long and is used for link training as well as clock compensation. Each Lane independently
sends its Ordered Set(s).
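To make the role of the Hamming distance concrete, the following sketch checks the pairwise distance of eight 8-bit code words and decodes a 16-bit field that carries the same code word twice. The code-word values and the nearest-code-word decoder are assumptions chosen for illustration and are not quoted from the specification; only the properties (eight encodings, 8 bits, distance 4, sent twice) come from the text.

```python
from itertools import combinations
from typing import Optional

# Eight illustrative 8-bit code words (hypothetical values for this sketch).
CODES = [0xFF, 0xD2, 0x55, 0x87, 0x99, 0x4B, 0xCC, 0x1E]


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def min_distance(codes) -> int:
    """Smallest pairwise Hamming distance within the code set."""
    return min(hamming(a, b) for a, b in combinations(codes, 2))


def decode(received16: int) -> Optional[int]:
    """Map a 16-bit field (code word sent twice) to the nearest code word.

    Repeating a distance-4 code twice yields distance 8 over 16 bits, so up
    to 3 flipped bits can be corrected; anything farther away is reported
    as an uncorrectable error (None).
    """
    best = min(CODES, key=lambda c: hamming(received16, (c << 8) | c))
    distance = hamming(received16, (best << 8) | best)
    return best if distance <= 3 else None


print(min_distance(CODES))
print(decode(0x5555 ^ 0x0100))  # one flipped bit
print(decode(0x5555 ^ 0x0F00))  # four flipped bits
```

The design point this illustrates is that the repetition doubles the effective distance, which is what allows a receiver to correct some errors instead of only detecting them.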
While operating at 32 GT/s or lower, CXL uses the 128 b/130 b encoding scheme of PCIe where
a 2-bit sync header is prepended for every 128 bits of data on each Lane to distinguish between
Data blocks vs. Ordered Set blocks [7, 8]. The 128-bit data payload on each Lane is used to transmit
Flits, as shown in the two data blocks of a x16 link in Figure 4(d). A Flit can straddle across multiple
data blocks, and a data block can contain multiple Flits. Since [Link] packets (TLP and DLLP) are
not natively Flit-based, a TLP or DLLP can straddle multiple Flits and a Flit may contain up to two
packets. Characterizing these overheads leads to the payload-throughputs discussed in Section 6.3.
For latency optimization, CXL enables removal of the 2-bit sync header when all components of
the link (including Retimers, if any) advertise support for this optimization during the initial PCIe
link training process while negotiating the CXL protocol [2].
Fig. 5. PCI-Express/[Link] Ordering Table for CXL 1.1, where “Yes” indicates that a transaction (2nd Trans)
can overtake a prior transaction (1st Trans) to ensure forward progress. To ensure producer-consumer ordering,
“No” indicates that overtaking is not allowed. For performance optimization, “Y/N” applies where
ordering rules are relaxed.
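The cost of the 128 b/130 b sync header described above can be quantified with a few lines of arithmetic. This is a sketch under the assumption that each transfer carries one bit per Lane (so GT/s equals raw Gbit/s per Lane); the function names are hypothetical, and Flit-level overheads are deliberately ignored.

```python
PAYLOAD_BITS = 128      # data bits per block on each Lane
SYNC_HEADER_BITS = 2    # sync header prepended to every block


def encoding_efficiency(sync_bits: int = SYNC_HEADER_BITS) -> float:
    """Fraction of transferred bits that carry block payload."""
    return PAYLOAD_BITS / (PAYLOAD_BITS + sync_bits)


def lane_payload_gbps(raw_gt_per_s: float, sync_bits: int = SYNC_HEADER_BITS) -> float:
    """Per-Lane payload rate in Gbit/s after accounting for the sync header."""
    return raw_gt_per_s * encoding_efficiency(sync_bits)


def block_payload_bytes(lanes: int) -> int:
    """Bytes carried by one data block across all Lanes of the link."""
    return lanes * PAYLOAD_BITS // 8


print(f"efficiency with sync header: {encoding_efficiency():.4f}")
print(f"per-Lane payload at 32 GT/s: {lane_payload_gbps(32):.2f} Gbit/s")
print(f"per-Lane payload without sync header: {lane_payload_gbps(32, 0):.2f} Gbit/s")
print(f"x16 data block payload: {block_payload_bytes(16)} bytes")
```

The bandwidth gain from dropping the header is small (about 1.5%); the text motivates the optimization by latency rather than throughput.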
Figure 6(a) describes the producer-consumer ordering model that acts as the contract between
hardware and software. This is enforced across the entire system even for transactions that cross
different hierarchies and for addresses that reside in different memory locations and may have
different attributes (e.g., cacheable vs. non-cacheable). The device may be the CPU, a CXL/PCIe
device, or any entity within a switch. The basic idea is that the producer produces (memory write)
data and subsequently writes the flag; the data and flag can be in multiple memory locations across
the system. If the consumer sees the flag, then it can read the data and be assured that it obtains the
latest data written by the producer. The flag can be in a ring buffer in system memory, a memory-
mapped I/O location in a device, or an interrupt sent to the processor. The ordering rules delineated
in Figure 5 ensure the producer-consumer ordering model. Another effect of the ordering rules is
device synchronization usage, as represented in Figure 6(b). Here, the two devices may be exe-
cuting two tasks and use A and B as indicators that they completed the tasks. If the read obtains
old data, then the process in that device can be suspended and be rescheduled by the device that
completes the execution later. If (a, b) were possible, then both processes in the two devices would
be suspended forever, an undesired outcome.
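The producer-consumer contract can be illustrated with a toy model in which writes travel through a FIFO path, so a later write can never pass an earlier one. This is a simplified sketch: the class, its methods, and the address names are hypothetical, and real ordering rules cover more transaction types than the two writes modeled here.

```python
from collections import deque


class OrderedWriteChannel:
    """Toy path on which writes never pass earlier writes."""

    def __init__(self):
        self.in_flight = deque()  # writes issued but not yet visible
        self.memory = {}          # writes that have become visible

    def post_write(self, addr, value):
        self.in_flight.append((addr, value))

    def deliver_one(self):
        """Make the oldest in-flight write visible (FIFO order)."""
        if self.in_flight:
            addr, value = self.in_flight.popleft()
            self.memory[addr] = value

    def read(self, addr):
        return self.memory.get(addr, 0)


def run_producer_consumer() -> int:
    channel = OrderedWriteChannel()
    channel.post_write("data", 42)  # producer writes the payload first
    channel.post_write("flag", 1)   # ... and the flag afterwards

    while True:
        channel.deliver_one()
        if channel.read("flag") == 1:    # consumer polls the flag
            return channel.read("data")  # flag visible implies data visible


print(run_producer_consumer())
```

If the channel were allowed to deliver the flag before the data, the consumer could observe the flag and still read stale data, which is exactly the outcome the ordering rules exclude.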
While the [Link]/PCIe ordering model can be used for some types of synchronization, the synchronization
usage represented in Figure 6(c) cannot be enforced with pipelined accesses, since writes
can bypass prior reads. This limitation causes smart NICs with partitioned global address space
(PGAS) to serialize writes after reads when ordering matters. [Link] overcomes this limitation:
The device can prefetch all data out of order and complete the transactions with low latency in
the local cache conforming to the program order.
The [Link]+mem is mostly unordered but has some ordering constraints on a per cache line
basis, as will be discussed later. These constraints can be worked around in a topology that supports
multiple paths between a source-destination pair by having a routing mechanism that ensures the
transactions with dependencies for any given cache line always follow the same path. However,
with the traditional PCIe (and [Link]) the ordering requirements are across the entire memory
space. Hence, traditional PCIe (and thus CXL) must follow a tree topology. With CXL 3.0, we
relax the [Link] ordering requirements and any fabric topology can be supported, as described in
Section 5.
[Link] uses the standard PCIe DLLPs for exchanging information such as credits, reliable TLP
delivery, power management, and so on [2, 3, 7]. [Link] uses the configuration space of PCIe and
enhances it for CXL usage. This helps with using the existing device discovery mechanism. We
expect the PCIe device driver would make the necessary enhancements to take advantage of the
new capabilities such as [Link] and [Link] and the system software will program the new
set of registers associated with the new capabilities. This builds on the existing infrastructure and
makes the adoption of CXL easy for the ecosystem.
by the system. The Write category is used for the device to evict data from the device cache. These
requests can be for dirty data (M-state) or for Clean Data (E or S state). The host will indicate
“WritePull” for cases where the device needs to provide data, and it will indicate GO.
The H2D Request channel is used for the host to change coherence state in the device, which is
referred to as “Snooping” (abbreviated Snp). The device must update its cache as required by the
snoop type and in the case of cache with dirty data (M-state) it must also return that data to the
host. Figure 7 provides an example where the host sends an SnpInv X that requires the device to
invalidate the cache for Address X. The device changes cache state from E to I for Address X and
then sends a RspIHitSE on the D2H Response channel, where RspI means that its final cache state is
I-state and HitSE indicates that the cache held the line in S or E-state prior to downgrading to the final I-state.
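The device side of this snoop flow can be sketched as a small state update. This is an illustrative model, not the specification’s state machine: the `handle_snp_inv` helper and the cache layout are hypothetical, and the RspIFwdM and RspIHitI response names for the M-state and I-state cases are assumptions that follow the naming pattern explained above.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


def handle_snp_inv(cache: dict, addr: int):
    """Process an invalidating snoop; return (response, data or None).

    cache maps an address to a (State, data) tuple.
    """
    state, data = cache.get(addr, (State.I, None))
    cache[addr] = (State.I, None)  # final state is always I for SnpInv

    if state is State.M:
        return "RspIFwdM", data    # dirty data must be returned to the host
    if state in (State.E, State.S):
        return "RspIHitSE", None   # clean copy dropped, no data needed
    return "RspIHitI", None        # line was not cached


device_cache = {0x40: (State.E, b"clean line")}
print(handle_snp_inv(device_cache, 0x40))
print(device_cache[0x40][0])
```

The small number of cases reflects the design goal stated earlier: the device enforces a simple MESI protocol while the host orchestrates coherency.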
A critical component of a coherence protocol is how it handles conflicting accesses for the same
address. Two cases are important to highlight: (1) Req-to-Snoop and (2) Eviction-to-Snoop. To
resolve the Req-to-Snoop case the GO message on H2D Response channel must be ordered in
front of any future Snoop to the same cache line address to ensure the device observes the cache
state provided by the host, if any, before processing the snoop. The reason for this is illustrated
in Figure 7(a), where the GO message sent from the host must be observed before the later snoop
so the device is aware that it has Exclusive ownership of the address and can process the snoop
correctly. Figure 7(b) is the reverse order case, where the host is processing the snoop first and
will stall future requests until the snoop is completed, so the device observing snoop before GO
results in processing the snoop while the cache is in Invalid state. For the Eviction-to-Snoop case, Figure 7(c)
shows a snoop arriving while a DirtyEvict is outstanding; the device must reply to
the snoop with the current M-state data. The device must still return data for the later GO_WritePull but
includes an indication that the data is bogus (meaning it may be stale), so the host must drop the data,
as newer data may exist in other agents.
2 This contrasts with traditional DRAM technology interfaces like DDR4, which may rely on special commands for access
and maintenance such as Refresh or Bank Pre-charge before Read or write access and use local Device-specific address
space.
a “-D” modifier (HDM-D).3 HDM-H does not include any coherence protocol assumptions,
whereas HDM-D includes cache state and cache snooping attributes used in each message. HDM-
D also requires a method for communication of desired cache state of the memory from the device
to host referred to as the “Bias Flip” Flow.4
HDM-H optionally provides 2 bits of Meta Value with each cache line for host use. Possible
uses are coherence directory, security attributes, or data compression attributes. Figure 8(a) shows
an example flow with a host read that is updating the Meta Value in the device and the device
returning the prior state of the Meta Value. In this example, the device is required to change the
current Meta Value stored in the device as the value changed from 2 (old value) to 0 (new value).
Note that the Memory Media storage of Meta Value and ECC (Error Correcting Code) bits are
device-specific.
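The read-with-update behavior of the Meta Value can be sketched with a toy device model. The class name, method signature, and storage layout are assumptions for illustration (the text notes that the actual storage of the Meta Value is device-specific); only the 2-bit width and the return of the prior value come from the description above.

```python
class MetaValueMemory:
    """Toy HDM-H device that stores a 2-bit Meta Value next to each cache line."""

    LINE_BYTES = 64

    def __init__(self):
        self.lines = {}  # addr -> (data, meta)

    def read(self, addr: int, new_meta=None):
        """Return (data, prior meta value); optionally store a new meta value."""
        data, old_meta = self.lines.get(addr, (bytes(self.LINE_BYTES), 0))
        if new_meta is not None:
            if not 0 <= new_meta <= 3:
                raise ValueError("Meta Value must fit in 2 bits")
            if new_meta != old_meta:
                # The device only needs to update its media when the value changes.
                self.lines[addr] = (data, new_meta)
        return data, old_meta


memory = MetaValueMemory()
memory.lines[0x1000] = (b"\xab" * 64, 2)    # line currently tagged with value 2

data, prior = memory.read(0x1000, new_meta=0)  # host read that sets value 0
print(prior, memory.lines[0x1000][1])
```

Returning the prior value in the same response lets the host use the field as, for example, a coherence directory without issuing a separate read.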
HDM-D uses the Meta Value field differently such that it exposes host coherence state to the
device in this field, which allows the device to know what state the host is caching for each ad-
dress in the HDM-D region. The specification uses the term Device Coherence (DCOH) agent
to describe the agent in the device managing/tracking coherence between the device and the host.
Figure 8(b) shows a case of HDM-D where the host is reading the data to cache in S-state and it
requires the device to check its cache (Dev ∃) for a current copy, but in this example, the cache
does not have the data, so the data is delivered from host memory.
The HDM-D coherence model allows the device to change the state of the host using a [Link]
request. [Link] is the natural way to do this, as HDM-D is defined only for Type 2. The flow,
where the device changes host cache state, is referred to as “Bias Flip” flow. An example of this
flow is shown in Figure 9. The device sends “RdOwnNoData X” to the host. The host detects the
address X as an HDM-D address owned by the issuing device and will behave differently than it
would have otherwise in that it will directly change cache state in the host forcing all host caches
to I-state. The response to a “Bias Flip” flow is different from traditional [Link] responses in
3 The original CXL 1.0 and 1.1 specifications did not use the -D and -H terms, but instead the attribute was implied based on
device type where a Type 3 device was HDM-H and Type 2 device was HDM-D. This alignment to device types is removed
in CXL 3.0 because of new requirements, so this article’s descriptions will be using CXL 3.0 terms to improve consistency.
4 The “Bias Flip” flow refers to the device Bias-Table (or coherence directory), which tracks if the host could have a cached
copy of the address. The Bias tracking may indicate Host-S (Shared only in the host), Host-A (Any cache state in the host),
or Device (the host does not have a cached copy of the address, so referred to as “device bias”). The “Flip” is in reference
to the Bias-Table changing the state being tracked from “Host Bias” to “Device Bias.”
that the host will send a MemRdFwd response message in this example on the [Link] M2S Req
channel to indicate the host has completed the cache state change. Using the [Link] M2S
Req channel avoids race conditions with future host [Link] requests to the same
address, which requires the M2S Request channel to be ordered for accesses to the same cache line. The
ordering requirement applies only to HDM-D addresses.
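The “Bias Flip” flow can be sketched as an interaction between a device-side bias table and the host. This is a simplified model under stated assumptions: the class and method names are hypothetical, untracked addresses are conservatively treated as Host-A, and the channel ordering discussed above is not modeled.

```python
from enum import Enum


class Bias(Enum):
    HOST_S = "host may hold the line in Shared state"
    HOST_A = "host may hold the line in any state"
    DEVICE = "host holds no cached copy"


class Host:
    def __init__(self):
        self.cache_state = {}  # addr -> "M", "E", "S", or "I"

    def on_rd_own_no_data(self, addr: int) -> str:
        """Force all host caches to I-state for addr and confirm completion."""
        self.cache_state[addr] = "I"
        return "MemRdFwd"


class Device:
    def __init__(self, host: Host):
        self.host = host
        self.bias_table = {}   # addr -> Bias

    def acquire(self, addr: int) -> list:
        """Flip addr to device bias if needed; return the messages exchanged."""
        trace = []
        if self.bias_table.get(addr, Bias.HOST_A) is not Bias.DEVICE:
            trace.append("RdOwnNoData")
            trace.append(self.host.on_rd_own_no_data(addr))
            self.bias_table[addr] = Bias.DEVICE
        return trace


host = Host()
host.cache_state[0x80] = "S"
device = Device(host)
device.bias_table[0x80] = Bias.HOST_S

print(device.acquire(0x80))  # first access triggers the flip
print(device.acquire(0x80))  # already in device bias, no messages needed
```

Once an address is in device bias, the device can access its own memory without consulting the host, which is the benefit the bias table is tracking.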
other internal caches without impact on [Link], which is unaware of host cache details. The
host scales to multiple CPU sockets or caches as defined in the host specific protocol (Proprietary
CPU-to-CPU), which defines an internal Home Agent to resolve coherence between host caches
(which incorporate [Link] devices). This host Home Agent may include host-specific opti-
mizations for resolving coherence such as on-die snoop filters or in-memory directory states. The
[Link] protocol is behind the Host Home Agent logic and can support simple memory expansion
that is host-only coherent (HDM-H) or device coherent (HDM-D). With HDM-D, an additional
level of coherence resolution is included after the host Home Agent, which allows the device to be
the final arbiter of coherence for addresses owned by the device.
Figure 10(b) shows the protocol dependence graph expected in CXL. The protocol dependence
graph defines which protocol channels may depend on (or be blocked by) other protocol channels.
The dependence graph is a method to show deadlock freedom between channels, which holds if no circular
dependence is created. An example dependence from the diagram would be that L1 Req may de-
pend on L1-Snp channel to complete before a new L1 Req may be processed. This dependence
exists because the request may require a snoop to be sent and completed before the request can
be completed. The L1 protocol is [Link] showing an abstraction of the channels provided
in [Link], where L1-Req maps to D2H Req, L1-Snp maps to H2D-Req, and L1 RSP maps to
H2D/D2H RSP & Data channels that are pre-allocated to drain into host or device, enabling com-
bining these two for the dependence graph. The L2 protocol is host-specific and shown as an
example of channels that may exist for a host, but other channel choices are possible. It would also
be possible for the host to include additional levels of protocol provided the dependence graph
does not have loops. The L3 protocol is [Link], where L3 Req maps to M2S-Req, L3-RwD maps
to M2S-RwD, and L3-Rsp covers the pre-allocated S2M NDR and DRS channels. The dependence
graph is a method used to show legal relationships and dependencies in the protocol and between
protocols and quickly determine high-level deadlock freedom in CXL itself and with host or device
internal protocol choices.
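Checking a dependence graph for deadlock freedom amounts to checking a directed graph for cycles. The sketch below uses a depth-first search; apart from the dependence of L1-Req on L1-Snp, which is stated above, the edges are hypothetical placeholders and do not reproduce Figure 10(b).

```python
def has_cycle(graph: dict) -> bool:
    """Return True if the directed graph (node -> list of successors) has a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current DFS path, finished
    color = {node: WHITE for node in graph}

    def visit(node) -> bool:
        color[node] = GRAY
        for succ in graph.get(node, ()):
            succ_color = color.get(succ, WHITE)
            if succ_color == GRAY:
                return True            # back edge: circular dependence
            if succ_color == WHITE and visit(succ):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in list(graph))


# "A: [B]" reads "channel A may depend on (be blocked by) channel B".
DEPENDS_ON = {
    "L1-Req": ["L1-Snp", "L2-Req"],  # first edge from the text, second assumed
    "L1-Snp": ["L1-Rsp"],            # assumed
    "L2-Req": ["L3-Req"],            # assumed
    "L3-Req": ["L3-Rsp"],            # assumed
    "L1-Rsp": [],                    # pre-allocated: drains without blocking
    "L3-Rsp": [],                    # pre-allocated: drains without blocking
}

print(has_cycle(DEPENDS_ON))
```

Pre-allocated response channels appear as sinks in the graph, which is why they can never close a cycle.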
represents the CXL topology as a virtual hierarchy (VH), which includes the switch and virtual
bridges for the host’s port and every port with device resources. Each host sees a separate VCS
(Virtual CXL Switch) that includes bridges for the devices assigned to this host, as shown in
Figure 11. Flits are routed to devices based on the active virtual bridges in the VH. This limits CXL
2.0 to directed tree topologies with at most one path between each host and device. Furthermore,
the need to track each VH’s address maps in the switch limits scalability to a single switch level.
CXL 3.0 overcomes this limitation (Section 5.4). Configuring and changing the VH is described in
Section 4.2.
Device Pooling builds on top of the multi-host switch support by allowing devices to be dynamically
assigned to one host at a time. Standard devices are assigned to a single host at a time
and are referred to as Single-Logical-Devices (SLDs). CXL also defines a Multi-Logical-Device
(MLD), which allows a single [Link] device’s resources to be divided into logical devices (up
to a maximum of 16) that can be assigned to different hosts at the same time. Each logical device
within an MLD can be assigned to a different host. CXL 3.0 introduces an extension called Dynamic
Capacity Device that is even more flexible. Each logical device (LD) within an MLD is identified
by an identifier (LD-ID). This extension is only visible on links between a switch and device and
not visible to a host. The host leaves the LD-ID field blank. The CXL switch applies the LD-ID tag
to a host’s [Link] and [Link] transactions based on the host’s port.
As a result of switching, MLDs, and potentially different types of memory (which might not
be based on DRAM), resources in a single switch may become oversubscribed, resulting in QoS
problems. For example, low performance in one device may cause congestion for all agents using
the switch. To mitigate this problem, CXL 2.0 introduces a DevLoad field into [Link] Response
messages to inform the host of the load observed in the device that it is accessing. The host is
expected to use this load information to reduce the rate at which it sends CXL requests to that
device. The CXL specification defines a reference model such that the injection rate is reduced
at high or critical load until nominal loading is reached, and at light load the host can increase
injection rate until nominal load is reached.
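The host-side reaction to DevLoad can be sketched as a simple feedback step. The four load levels follow the wording of the text (light, nominal, high, critical); the function name, the multiplicative step sizes, and the rate bounds are hypothetical choices for illustration and are not the reference model defined in the CXL specification.

```python
def adjust_injection_rate(rate: float, dev_load: str, step: float = 0.1,
                          min_rate: float = 0.05, max_rate: float = 1.0) -> float:
    """Return the next request injection rate (fraction of the maximum rate).

    dev_load is the load level reported by the device in its response.
    The step sizes are illustrative assumptions, not values from the specification.
    """
    if dev_load == "critical":
        return max(min_rate, rate * (1 - 2 * step))  # back off harder
    if dev_load == "high":
        return max(min_rate, rate * (1 - step))      # back off
    if dev_load == "nominal":
        return rate                                  # hold steady
    if dev_load == "light":
        return min(max_rate, rate * (1 + step))      # ramp up
    raise ValueError(f"unknown DevLoad level: {dev_load!r}")
```

Calling the function once per observed response moves the rate toward the point at which the device reports nominal load.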
If there are multiple CXL request injection points (e.g., in a CXL 2.0 pooled scenario), then QoS
is provided through source throttling, which controls how much of the MLD resources can be
consumed by each source. For non-shared resources, each VH can be isolated by ensuring that
transactions for different VHs (or PBR destination) can progress without any inter-dependency.
The CXL protocol also defines a containment model: if an end-point device is
not responsive, then the impacted VH contains the error by generating an error response for
outstanding accesses within the host. This avoids a host timeout, which could otherwise bring down the
VH.
CXL also defines error reporting for memory devices, including poison data support. While the
exact error correcting code depends on the type of memory media as well as the usage model and
platform, reporting is defined for both corrected as well as detected but not corrected errors. The
reporting has been standardized for software to take necessary action [3, 5].
The FM supports 19 other commands in addition to the three described here. They include QoS
controls such as bandwidth allocations among logical devices within an MLD. These commands
also include ways to identify switch ports and devices and to query and configure their state (see
Table 205 in Reference [3]).
Fig. 12. Fabric topology with multiple-paths enabled by CXL 3.0 spanning one or many racks, enabling
composable scale-out systems with shared memory and load-store message passing among peers.
Fig. 13. Three types of CXL 3.0 Flits. A 256 B latency-optimized Flit (C) consists of two Sub-Flits of 128 B
each with common FEC across 256 B but separate CRCs.
large systems. For example, in Figure 12, the NIC is 8 hops away from memory using direct
P2P vs. 16 hops away round-trip when going through a CPU.
4. Shared coherent memory and message passing across hosts: This can be enforced
by hardware or software. Shared coherent memory enables multiple systems to share data
structures, perform synchronization, or pass messages using low-latency load-store se-
mantics. Message passing can also be done by using the load-store [Link] semantics (vs.
networking semantics with higher latency).
5. Near-memory processing to allow computation to be performed near memory for better
performance and energy efficiency.
Fig. 14. [Link] and [Link] slots for 256 B and LO Flits [8].
[Link]+mem, ALMP, Idle) along with the reliable Flit delivery management control and is
placed at the start of the Flit to pipeline the delivery of the Flit to the appropriate protocol stack
without accumulating the entire 256 B Flit, which reduces latency. An optional latency-optimized
(LO) 256 B Flit is also defined, as shown in Figure 13(c). The LO Flit is subdivided into two sub-Flits, each 128 B,
with the FEC in the odd Flit-half and the 2 B Flit Hdr in the even Flit-Half. A 6 B CRC is present
in each Flit half. This 6 B CRC is derived from the 8 B CRC for reduced gate count, as described in
References [5, 13, 14]. The first 6 B CRC protects the 2 B Flit Hdr and 120 B data in the even Flit-half
and the second 6 B CRC protects the 116 B of data in the odd Flit-half. An even Flit-half can be
processed in order if its CRC passes without waiting for the odd Flit half. The odd Flit-half can be
processed if its CRC passes, assuming the even Flit-half has been processed (i.e., no CRC error).
Since the CRC requires about 10 levels of logic gates vs. 50 levels of logic gates for FEC [14], applying
the CRC first helps reduce the latency by 2 nanoseconds on a x16 link [13]. Also, the CRC over 128 B
makes the accumulation latency slightly better than the 68-B Flit accumulation latency at 32 GT/s.
If an error is detected, then the entire Flit is accumulated, the FEC applied, and then the CRC is
applied again. Details of the Flits and the FEC and CRC mechanisms can be found in References
[5, 7, 13, 14].
The type of Flit to be used (68 B, 256 B, LO) is negotiated upfront when the CXL protocol is
negotiated with 8 b/10 b encoding. 68 B Flit support is mandatory for any CXL device. If the
CXL device supports 64.0 GT/s data rate, then it must also advertise the 256 B Flit mode, while
the advertisement for the LO Flit mode support is optional. If all components of a link (the two
devices as well as any Retimers in between) support the LO Flit mode, then LO is selected; else, if
all components support 256 B Flit, then the 256 B Flit mode is selected; else, the 68 B Flit mode is
selected. Once a Flit mode is selected during early negotiation, it is used irrespective of the
data rate of operation. Suppose we selected the LO Flit mode because all components in the link
support it along with 64.0 GT/s, but if the link is operating at 32.0 GT/s (e.g., power savings speed
down-shift or link stability issues at 64.0 GT/s), then the LO Flit mode will still be used.
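The selection rule above is a priority cascade over the capabilities of every component on the link. A minimal sketch, assuming each component (the two devices and any Retimers) is modeled as a set of supported mode names; the function name and the string labels are illustrative and not taken from the specification.

```python
from typing import Iterable, Set

def select_flit_mode(components: Iterable[Set[str]]) -> str:
    """Pick the Flit mode for a link from the modes each component supports.

    Each element of `components` is the set of modes advertised by one
    component on the link, using the labels "68B", "256B", and "LO".
    """
    caps = list(components)
    if all("LO" in c for c in caps):
        return "LO"
    if all("256B" in c for c in caps):
        return "256B"
    return "68B"  # 68 B Flit support is mandatory for any CXL device

host = {"68B", "256B", "LO"}
device = {"68B", "256B", "LO"}
retimer_without_lo = {"68B", "256B"}
```

With these example sets, the host-device link selects the LO mode, while inserting the Retimer that lacks LO support drops the link to the 256 B mode. The result does not depend on the data rate at which the link later operates.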
For [Link], the last 4 B of the “Data” will be for DLLP (the equivalent of the last 4 B of the 6 B
DLP in PCIe 6.0 [7, 14]). That leaves 236 B for TLP in 256 B Flit mode (same as PCIe 6.0 Flit) and
232 B for TLP in the LO Flit mode. For ALMP, most of the “Data” field is Reserved. For [Link]-
mem, the slot arrangements are shown in Figure 14. The H- and G-slots have the same usage
as 68-B (i.e., H-Slot or Header Slot is used for Headers only, whereas G-Slot or Generic Slot is
used for either Header or Data) with the exception that the additional 2 B in H-slot is used for
building larger topologies. The HS slot is 10 B only and used for small headers such as 2DRS
or 2NDR.
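The byte counts quoted for the two 256 B Flit layouts can be cross-checked with a short calculation. The header, CRC, data, and DLLP sizes are the ones stated in the text; the FEC size is not stated here and is derived as whatever remains in the odd Flit-half, so it should be read as an inference. The function names are illustrative.

```python
def lo_flit_budget() -> dict:
    """Byte budget of the latency-optimized Flit (two 128 B halves)."""
    half = 128
    hdr, crc = 2, 6                  # 2 B Flit Hdr, 6 B CRC per half
    even_data, odd_data = 120, 116   # data bytes protected by each CRC
    assert hdr + even_data + crc == half  # even half is fully accounted for
    fec = half - odd_data - crc      # remainder of the odd half (inferred)
    data = even_data + odd_data
    return {"fec": fec, "data": data, "tlp": data - 4}  # last 4 B carry DLLP

def standard_flit_budget() -> dict:
    """Byte budget of the standard 256 B Flit, from the quoted TLP size."""
    tlp, dllp = 236, 4
    data = tlp + dllp
    return {"data": data, "overhead": 256 - data}  # Hdr + CRC + FEC

lo = lo_flit_budget()
std = standard_flit_budget()
```

Under these assumptions the LO Flit carries 236 B of data, of which 232 B remain for TLP, and the standard Flit spends 16 B on header, CRC, and FEC.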
Fig. 15. UIO Ordering Rules. A “Y/N” indicates that the 2nd transaction may or may not bypass the 1st
transaction.
5.2 Protocol Extensions with CXL 3.0: Unordered I/O (UIO) and Back-Invalidate (BI)
for Fabric Support
The PCIe Ordering rules described in Section 3 preclude any non-tree topology, which means that
only one path exists between any two nodes (host or device) in the CXL network. We need redun-
dant and multiple paths between any source-destination pair of nodes to create large distributed
systems with good performance. Since software relies on the producer-consumer ordering model,
the CXL protocol needs to preserve that while accommodating non-tree topologies. CXL 3.0 (and
hence PCIe) solves this challenge by introducing “unordered” read/write/completion transactions
on one or more distinct virtual channels (VCs; see Section 3.1) and transferring the ordering en-
forcement to the source node only. Each VC is independent of other VCs and each VC comprises
three flow control (FC) classes: Posted (P) for transactions such as Memory Write, Non-
Posted (NP) for transactions such as Memory Read, and Completions (C) for completions to
each NP transaction, as described in Section 3.3. The basic idea is that every UIO Write gets a
completion (“UIO Write Completion” in the C FC class). Hence, UIO Writes are fundamentally
non-posted, even though they are on the P FC class, to enable the source to enforce ordering. Thus, the
producer (e.g., Device X in Figure 6(a)) must wait for the completion of all the “Data” before writing
to the “Flag” signaling that the data is available. A UIO Read is like a regular memory read in
that it gets one or more completions (“UIO Read Completion” in C FC Class) with or without data
along with the status, including errors. Thus, in UIO VC, P and NP have only one type of trans-
action each (UIO Write and UIO Read, respectively) and C has three types of transactions (UIO
Write completion—no data, UIO Read Completion with data; and UIO Read Completion—no data).
Furthermore, transactions across the FCs as well as within an FC have no ordering requirements,
as summarized in Figure 15. This enables [Link] (and PCIe) transactions to be sent on any path in
a topology with multiple paths between a source-destination pair and still enforce the producer-
consumer ordering semantics. VC0 will always be used for the traditional non-UIO Ordering for
backward compatibility following the tree topology for routing and one or more VCs in VC1-VC7
may be used for UIO traffic that can use any of the paths for any transaction.
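The source-enforced ordering can be illustrated with a toy model of a producer. The sketch assumes that data writes may land in any order because they can take different paths, that each write returns a completion to the source, and that the source issues the flag write only after collecting every completion. The function name and the use of a seeded shuffle to model path diversity are illustrative assumptions.

```python
import random
from typing import Dict, List, Tuple

def produce_with_uio(data: Dict[int, int], flag_addr: int,
                     seed: int = 0) -> List[Tuple[int, int]]:
    """Return the order in which (address, value) writes land in memory."""
    rng = random.Random(seed)
    in_flight = list(data.items())
    rng.shuffle(in_flight)          # multiple paths: arrival order is arbitrary
    landed: List[Tuple[int, int]] = []
    completions = set()
    for addr, value in in_flight:
        landed.append((addr, value))
        completions.add(addr)       # UIO Write Completion returned to the source
    # Ordering is enforced at the source: wait for all data completions first.
    assert completions == set(data)
    landed.append((flag_addr, 1))   # only now signal that the data is available
    return landed

payload = {0x100: 11, 0x140: 22, 0x180: 33}
order = produce_with_uio(payload, flag_addr=0x1000)
```

Whatever order the data writes arrive in, the flag is the last write to land, so a consumer that observes the flag also observes all of the data.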
CXL 3.0 introduces a new flow with two channels in [Link]: Back-Invalidate (BI) in the
S2M direction and the response BI-Rsp in the M2S direction, as described in detail in Section 5.2.
This gives rise to a new type of device memory: HDM-DB (Host-managed Device Memory –
Device-coherent with Back-invalidate support), supported by Type 2 and Type 3 devices.
Fig. 16. CXL 3.0 protocol enhancements with UIO and BI for latency and bandwidth optimization [9].
local memory to the HDM-DB region (as shown in Figure 17(a)), and hardware-enforced coherent
shared memory across multiple independent hosts (Figure 17(b)). The rationale for having BI in
[Link] is to ensure that there is no other dependency that is required for deadlock avoidance.
With UIO and BI, a device can directly access HDM-DB memory without going through the host
CPU. An accelerator device connected to a switch that needs to access memory connected directly
to the switch (Figure 16(a)) that is mapped to the HDM-D or HDM-H region needs to go through
the host CPU (Figure 16(b)) with CXL 2.0 flows, which involves the host getting the data from
memory and resolving coherency flows prior to completing the access from the device. However,
with CXL 3.0 flows, the device sends the same memory read and write transactions using [Link]
in the UIO VC. The switch does not route these UIO transactions to the host. UIO transactions
are serviced directly from memory if the state of the cache line is I or S for a UIO Read or I for
a UIO Write; else, the memory controller back-snoops the host processor using the BI flows, as
shown in Figure 16(c). As we will see later, even when BI is needed, this new mechanism is more
bandwidth-efficient than the existing approach of sending all requests to the host, which caused
more traffic and extra latency.
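The rule for when a UIO access can be served directly from memory reduces to a small predicate on the state the host holds for the cache line. A sketch, assuming states are written as single letters and that any state other than the ones named in the text (for example E or M) requires a back-invalidate; the function name is illustrative.

```python
def uio_needs_back_invalidate(op: str, host_state: str) -> bool:
    """Return True if the memory controller must back-snoop the host via BI.

    op is "read" or "write"; host_state is the host's cache state for the line.
    """
    if op == "read":
        return host_state not in ("I", "S")  # served from memory if I or S
    if op == "write":
        return host_state != "I"             # served from memory only if I
    raise ValueError(f"unknown UIO operation: {op!r}")
```

For example, a UIO Read of a line the host holds in S is served from memory, whereas a UIO Write to the same line first triggers the BI flow.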
UIO combined with BI enables an I/O coherent model that can scale to lots of devices. Those
devices that need to cache memory can still use [Link] semantics when they need it and use
UIO for the rest. This reduces the snooping overhead as well as the latency involved in going to
the host processor all the time. The second aspect of BI is to enable Type 2 devices to have a snoop
filter instead of having a full directory and invoking the existing bias-flows. This is demonstrated
in Figure 17(a), where the Type 2 device invokes the BI flows to evict a cache line from its snoop
filter if a new request encounters a capacity miss.
The BI-mechanism also enables the implementation of shared and hardware-enforced coherent
memory across multiple hosts, as illustrated in Figure 17(b). Here, the memory device (GFD/MLD)
maintains a directory (or a snoop filter) to track ownership of cache lines in the coherent shared
memory. Thus, when Host 1 obtains ownership of cache line X as a shared copy, the device updates the
directory from “I” state to “S” state with Host 1 as the sharer. When Host 3 asks for the same cache
line X in shared state, the device provides the data and updates the directory for X to indicate that both Host
1 and Host 3 have it Shared. When Host 4 requests an exclusive copy, the device issues a Back Invalidate to
both Host 1 and Host 3, waits for the response from both to ensure that Host 1 and Host 3 have
invalidated their copies of X, and updates its directory to mark X as “E” by Host 4, prior to sending
the data and ownership to Host 4. This memory device can be a Multi-Logical Device (MLD) or
a GFAM (Global Fabric Attached Memory) Device (GFD). Each GFD can scale to support up
to 4,096 independent nodes (vs. 16 for MLD) simultaneously for either pooled or shared memory
by not being discoverable by each node independently through configuration space.
Fig. 17. (a) Existing Bias – Flip mechanism needed HDM to be tracked fully, since device could not back
snoop the host. Back-Invalidate with CXL 3.0 enables snoop filter implementation resulting in large memory
that can be mapped to HDM. (b) The Back-Invalidate Flows of CXL 3.0 can be used to implement hardware-
based coherent shared memory across multiple hosts, each with its independent coherency domain.
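The directory walk-through for cache line X can be sketched as a small state machine kept by the memory device. The class and method names are hypothetical, and the handling of a shared request to a line held exclusively by another host (back-invalidating the owner) is an assumption, since the text describes only the I to S and S to E transitions.

```python
from typing import Dict, List, Set, Tuple

class SharedMemoryDirectory:
    """Toy directory: per cache line, a state ("I", "S", "E") and holder set."""

    def __init__(self) -> None:
        self.entries: Dict[str, Tuple[str, Set[int]]] = {}

    def request(self, line: str, host: int, exclusive: bool) -> List[int]:
        """Grant `line` to `host`; return the hosts that must be back-invalidated."""
        state, holders = self.entries.get(line, ("I", set()))
        others = holders - {host}
        if exclusive or state == "E":
            bi_targets = sorted(others)  # wait for a BI response from each
            remaining: Set[int] = set()
        else:
            bi_targets = []              # sharers may keep their copies
            remaining = others
        self.entries[line] = ("E" if exclusive else "S", remaining | {host})
        return bi_targets

directory = SharedMemoryDirectory()
bi_host1 = directory.request("X", host=1, exclusive=False)
bi_host3 = directory.request("X", host=3, exclusive=False)
state_after_sharing = directory.entries["X"]
bi_host4 = directory.request("X", host=4, exclusive=True)
state_after_exclusive = directory.entries["X"]
```

In this run, the first two requests need no back-invalidates and leave X shared by Hosts 1 and 3; the exclusive request from Host 4 back-invalidates both sharers before X is marked exclusive to Host 4.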
To support shared coherent memory across multiple independent hosts, devices can implement
an on-die snoop filter and/or a directory structure in memory. Devices can use two bits for co-
herency states (Invalid, Shared, Exclusive) followed by a sharing list of hosts. The implementation
for this list is up to the device. As an example, for invalid, the list entry is “don’t care.” For exclusive,
the list contains the host ID that has it exclusive. For sharing, one can implement a combination
of bit-vector, with each bit representing one or a group of hosts or a list, where the list can iden-
tify each host up to a certain number, beyond which it can be a coarse grain representation of
host groups. Implementation details will depend on how many directory bits are available and
the number of hosts that can access a cache line. One may also choose to maintain this snoop-
filter/directory across multiple cache lines (e.g., at a page level). The size of shared memory and
the number of hosts that can share a cache line depends on the usage model. Examples include
large in-memory applications such as very large databases, logs, machine learning, key-value store,
in-memory analytics, and so on. Hardware coherence can scale for these applications, where accesses are
mostly read-only. In some cases, software coherence may be desirable. The CXL 3.0 architecture also
enables inter-domain interrupts, semaphores using shared memory, data replication, and message
passing using special address map regions through shared memory controllers, described in detail
in References [9, 18].
Back-Invalidation creates two additional channels in the protocol and link layers: BISnp and
BIRsp. These channels are necessary to ensure the protocol deadlock freedom in the updated de-
pendence diagram shown in Figure 18 adding BISnp (S2M BISnp) compared with the CXL 1.1/CXL
2.0 Dependence graph in Figure 10(b). The BIRsp is pre-allocated, so for the dependence graph it
is combined into L3 RSP in the graph.
to 16 [Link] devices to each root port. For the host to support [Link], it must have a snoop
filter structure that will track each address that the device may be caching. The size of this tracking
in the host will limit the device’s ability to cache host memory, so sizing in the host will constrain
how much host data the device may cache. Note that the host does not limit the device’s ability to
cache its own memory for Type 2 devices. With the addition of CacheID, the host must individually
track each device behind a root port to avoid degrading to a multi-cast snooping behavior. Multi-
cast snooping would be functional but would not meet the bandwidth/performance expectations
that devices should rarely see snoops for an address that is not in the device cache. The constraints
of coherence tracking in the host may limit the number of [Link] devices in total that can be
supported. The host will advertise this limit at the host bridge granularity. Software can discover
this capability and will enforce the limit during the enumeration of CXL hierarchy.
5.5 Port Based Routing (PBR) and Extensions for CXL Fabric
To enable scaling to 4,096 endpoints (hosts or devices) and non-tree (multi-path) network topolo-
gies, CXL 3.0 adds a new optional protocol format called Port Based Routing (PBR). The idea is
that PBR is a simpler and more scalable protocol than CXL 2.0’s hierarchical routing (Section 4.1)
by focusing only on information that is needed to route messages between PBR switches. For ex-
ample, hierarchical routing would require the switch to know the virtual hierarchy and the address
mappings for every host. CXL 3.0 continues using standard CXL 2.0 messages for the link between
most endpoints and the first switch (e.g., Leaf to endpoint links in Figure 12). This first switch
port connecting to device or host is called “edge port,” because it has a special role in converting
between standard messages (HBR) and PBR messages. PBR routing itself is used for inter-switch
links (e.g., Spline to Spline and Leaf to Spline links in Figure 12). From a software point of view,
only the fabric manager needs to configure the network of switches; software running in the host
just sees a flat topology that ends at the first switch Port.
PBR addresses endpoints with a 12-bit ID referred to as PBR-ID (PID). The edge port adds this
identifier to messages as a Destination PID (DPID) and in some cases a Source PID (SPID).
The generation of the PID is done by translating/decoding the HBR message to/from PBR message
using various methods. This translation is stateless, i.e., does not require tracking any outstanding
requests. Examples of this translation include Address to PID, LD-ID to/from PID, CacheID to/from
PID, Bus Number to/from PID. For LD-ID to PID, the LD-ID offers fewer bits than the PID, which is solved by a
16-deep lookup table indexed by the 4-bit LD-ID to determine the PID
of the host. In the case of Address to PID, this is done by decoding the Host Physical Address
(HPA) with a lookup of the Fabric Address Segment Table (FAST). To improve scalability, the
FAST provides a single memory range that gives a power of 2 size to each PID, which enables
direct use of selectable address bits to look up the PID owning the region of HPA. As the translation
between HBR and PBR messages is performed transparently by the edge port, hosts and devices can
work with PBR without being directly aware of PBR. Furthermore, since address decoding has
already been performed at the edge, interior PBR switches (e.g., Spline switches in Figure 12) can
route messages purely based on the DPID and do not have to perform address decoding. Without
this simplification, intermediate links that are shared by multiple virtual hierarchies would need
to have the full address decode capability for all Virtual Hierarchies (VHs) that share the link
and thus would limit the number of VHs. PBR thus enables better scalability and reduces latency
and cost of interior switches.
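The two edge-port translations described above can be sketched as table lookups. The sketch assumes a FAST modeled as one contiguous range split into equal power-of-2 segments, one per PID, and an LD-ID table with 16 entries; the function names, the base address, and the segment size in the example are hypothetical and do not reflect the register layout of the specification.

```python
from typing import List

PID_BITS = 12  # PBR-ID width stated in the text

def fast_lookup(hpa: int, base: int, segment_log2: int, pids: List[int]) -> int:
    """Map a Host Physical Address to the PID that owns its segment.

    With a power-of-2 segment size, the index is a plain bit-field of the
    address offset, so no range comparison per PID is needed.
    """
    offset = hpa - base
    index = offset >> segment_log2
    if offset < 0 or index >= len(pids):
        raise ValueError("address outside the fabric range")
    return pids[index]

def ld_id_to_pid(table: List[int], ld_id: int) -> int:
    """Map a 4-bit LD-ID to a host PID through a 16-deep lookup table."""
    if len(table) != 16 or not 0 <= ld_id < 16:
        raise ValueError("LD-ID table must have 16 entries and LD-ID must be 4 bits")
    pid = table[ld_id]
    assert 0 <= pid < (1 << PID_BITS)
    return pid

# Example with hypothetical values: 1 GiB segments starting at 4 GiB.
FABRIC_BASE = 4 << 30
SEGMENT_PIDS = [0x010, 0x011, 0x012]
LD_TABLE = [0x100 + i for i in range(16)]
```

Both lookups are stateless, which matches the property noted in the text that the edge port does not need to track outstanding requests.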
Routing of PBR messages is centrally controlled by the Fabric Manager (FM) (Section 4.2). The
FM configures each PBR switch with a lookup table indexed by the 12-bit DPID. These tables map
DPIDs to the switch’s outgoing physical port. To enable multi-path routing, the lookup table may
provide multiple destination physical ports where unordered traffic may be routed dynamically to
target ports based on load. For [Link], the UIO VC is completely unordered and can be identified
easily by the VC in which the traffic arrives. [Link]-mem is primarily unordered with a few
limited ordering exception rules for resolving conflicts (example: In [Link] Snoops push GO
to the same address); these exceptions can be easily enforced even with multiple paths by ensuring
the ordered messages follow the same path interleaving appropriately. For example, tag bits in the
[Link] Non-Data Response channel can be used to interleave as only messages with the same
tag require ordering. Ordered traffic such as traditional [Link] has less flexibility in that the same
single path must always be selected, which conforms to the tree-based topology across all the
links. When the FM observes a link or switch failure, it can also reconfigure the routing table and
redistribute new lookup tables to PBR switches.
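A PBR switch's forwarding decision can be sketched as a lookup from DPID to a list of candidate egress ports. The sketch assumes that ordered traffic always takes the first listed port and that unordered traffic takes the least-loaded candidate; both policies, the function name, and the load metric are illustrative assumptions rather than behavior mandated by the specification.

```python
from typing import Dict, List

def select_egress_port(table: Dict[int, List[int]], dpid: int,
                       ordered: bool, port_load: Dict[int, int]) -> int:
    """Choose the outgoing physical port for a message with the given DPID."""
    ports = table[dpid]
    if ordered:
        return ports[0]  # ordered traffic must always follow the same path
    # Unordered traffic may be spread across paths based on load.
    return min(ports, key=lambda p: port_load.get(p, 0))

# Hypothetical table configured by the FM: DPID 0x02A is reachable via ports 1 and 2.
ROUTES = {0x02A: [1, 2], 0x02B: [3]}
LOAD = {1: 10, 2: 3, 3: 0}
```

With these example values, unordered traffic to DPID 0x02A leaves on port 2, while ordered traffic to the same destination stays on port 1. After a link failure, the FM would replace the entries in `ROUTES`.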
Inter-switch links (ISLs) carry the PBR message format. These links are also symmetric in
nature in that they may carry upstream and downstream traffic from different hosts in opposing
directions on the same link. Specifically, a non-PBR port supports 6 CXL channels upstream and 6 CXL
channels downstream. A PBR port must be able to carry 12 CXL channels upstream and down-
stream, respectively. Support for ISLs requires the underlying Cache and Mem link layer to supply
a fully symmetric set of channels, each with its own set of flow control credits. This contrasts with
device and host links that are either upstream or downstream links (host to switch is downstream
and device to switch is upstream).
Instead of relying on edge ports, a memory device may also directly participate in PBR. Such
an endpoint is referred to as Global Fabric-Attached-Memory Device (GFD). The GFD allows
for high-scalability memory devices that can be shared/pooled across all 4,095 other agents in the
CXL fabric where CXL 2.0 Multi-Logical Devices (MLD) can only be shared by a maximum of
16 hosts.
Fig. 19. Representative micro-architecture with typical latency of CXL micro-architecture [8].
for SmartNIC devices [50], ARM has announced CXL 2.0 support in the V2, N2, and E2 series
CPUs [55].
On the device side, IP vendors such as Synopsys, Cadence, PLDA/Rambus, Mobiveil, and others
have demonstrated interoperability with SPR CPUs [8, 21, 22]. Samsung has built a CXL 1.1 mem-
ory expansion device and publicly shared its benchmarking results [52]. Montage [53], SK Hynix
[54], Microchip [77], Micron [75], and Astera Labs [76] have announced CXL Type 3 memory
devices. Micron has prototyped a CXL 1.1 near-memory computing device [51]. Academic stud-
ies have also independently reproduced results on SPR [48] as well as implemented custom CXL
hardware and software prototypes [49].
In the next sections, we report measurements of CXL implementations. We have tested
[Link] on Intel and AMD CPUs with multiple device vendors. However, since public per-
formance numbers are primarily available for Intel CPUs, our survey primarily focuses on the
performance of the Intel implementation as a representative performance indicator of CXL
implementations.
Figure 19 is a representative micro-architecture of CXL IP, with industry standard interfaces
such as PIPE [23], LPIF [24], CPI [25], and SFI [26] for CXL endpoints as well as a host-processor
such as SPR. The memory size, caching hierarchy, and cache sizes in Figure 10(a) are meant to
provide a scale and not the exact magnitude. The last level cache (LLC) covers the lower levels
of cache in the hierarchy as well as the write caches associated with the PCIe/ [Link] stack inside
the CPU. The LLC also covers caches across all CXL devices. A snoop filter provides the home agent
with information on when to snoop a given CXL device. Memory, whether locally connected to
the CPU (e.g., DDR), or belonging to the CXL device mapped to the system address space, is under
the purview of the Home Agent (represented by blue and green colors in Figure 10(a)).
290:28 Debendra Das Sharma et al.
Fig. 20. Memory access latency as a function of bandwidth load for an equal number of four DDR channels
(a) and for local, CXL, and interleaved memory in a typical channel configuration (b). Lines are smoothed
with local regression [82].
to memory access across a CPU-CPU cache-coherent link and well within the latency operating
points of CPU [20]. For [Link], as in PCIe, the pin-to-pin round-trip latency for a memory read
to local memory to completion in Xeon is about 275 ns with an LLC miss and IOTLB hit (with a
device having its TLB mandated by CXL, we expect the latency to be like an IOTLB hit).
We perform an end-to-end measurement that covers all CXL overheads including queueing.
Since queueing latency is a function of load and system capacity [19], Figure 20 shows average
memory access latency under increasing background memory bandwidth load. This measurement
was performed on an Intel EMR system with two preproduction Astera Labs Leo CXL Smart Mem-
ory Controllers [76], with x16 CXL lanes each. All channels are populated with a single 64 GB
DDR5-5600 DIMM. The workload mix shown here comprises two reads for every write (2R1W).
We first compare CXL latency to local DDR5 and cross-socket (UPI) DDR5 accesses, where we
configure four memory channels in all three cases (Figure 20(a)). At low load, [Link] (blue line)
has about 200%–220% the latency of local DDR5 accesses (orange line) and comparable latency to
accessing remote DDR5 across UPI (green line in Figure 20(a)). Additionally, it is worth noting that
the slope of the latency curve is very similar between local DDR5 and DDR5 attached via CXL. The
memory bandwidth of the x32 combined CXL lanes from the Leo CXL Smart Memory Controllers
peaks at around 137 GB/s, which fully utilizes the CXL channels and matches four local DDR5
channels. CXL achieves about 1.5x the bandwidth of accessing remote DDR5 across UPI.
Figure 20(b) shows CXL with four channels (blue line), local DDR5 with eight channels (orange
line), and a configuration that interleaves memory accesses across local DDR5 and CXL (dark grey
line). Interleaving local and CXL offers 1.5x the peak memory bandwidth compared to just using
local DDR5, effectively combining CXL and local DDR5 bandwidth. At bandwidth loads above 200
GB/s, interleaving significantly decreases load and queueing, compared to only using local DDR5.
This shows that interleaving local and CXL memory may be attractive to bandwidth-intensive
applications. For latency sensitive applications, another approach is hardware tiering, e.g., Intel
Flat Memory Mode [80].
6.2 Projected Latency of Large System Topologies with CXL 3.0 and Beyond
Larger topologies with CXL are possible with switches as well as devices such as Shared Memory
Controller (SMC), which is a combination of CXL switch along with several memory controllers
[9, 18] (Figure 16(a)), with multi-domain capabilities and PBR routing for scale-out. A CXL switch
latency adder to round-trip is expected to be 2 x (21-25 ns) + 10 ns for internal arb/look-up + 10 ns flight
time on wire = 62-70 ns [18, 56]. The various latencies for [Link] and [Link] for various
access/system configurations are summarized in Table 2.
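The switch latency adder quoted above is a simple sum, which the following sketch reproduces. The parameter name `per_pass_ns` is a label chosen here for the 21-25 ns term that the text doubles; the text does not name that term, so the label is an assumption.

```python
def switch_round_trip_adder_ns(per_pass_ns: float, arb_lookup_ns: float = 10,
                               wire_ns: float = 10) -> float:
    """Round-trip latency added by one CXL switch, using the quoted terms."""
    return 2 * per_pass_ns + arb_lookup_ns + wire_ns

low = switch_round_trip_adder_ns(21)   # lower bound of the quoted range
high = switch_round_trip_adder_ns(25)  # upper bound of the quoted range
```

The two bounds evaluate to 62 ns and 70 ns, matching the range given in the text.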
Table 2. Estimated Latency for [Link] and [Link] for Various Topologies [10]
Table 3. [Link] Realizable Bandwidth in GB/s for Different Traffic Mixes and Payload Size for the CXL
2.0 68 B Flits and CXL 3.0 256 B and 128 B Flits for a x16 link at 32 GT/s
For 256-Byte and 128-Byte Latency-Optimized (128 B LO) Flits, for [Link] and [Link],
the overall link efficiency is 15/16 (=0.938), since there are 15 slots and one slot equivalent is spent
on FEC/CRC/Hdr, and so on [8, 9]. The slots for [Link] and [Link] are similar even after
accounting for the extra bits for scalability. As a result, as can be seen later, the bandwidth efficiency
tends to be similar across the three Flit types.
There are three common traffic mixes to benchmark performance in [Link] and PCIe: 100%
Reads, 100% Writes, and 50-50 Read-Write. Table 3 summarizes the realized [Link] bandwidth for
the various payload sizes for each of the three traffic mixes for each of the three Flit types based
on the methodology described in Reference [8]. Typical workloads include a mix of small payloads
(4 B–32 B) as well as the medium- to large-sized payloads (64 B and above). Even though commer-
cial systems deploy either a max payload size of 128 B (or 32 Double Words a.k.a. DWs) for client
CPUs and 256 B/512 B (64/128 DW) for server CPUs, we have provided the projections up to the
maximum 4 KB (1,024 DW) payload size for completeness.
[Link]: A device reading a cache line from memory using RdCurr in D2H results in an
H2D Data_Hdr plus the H2D Data. Each H2D Data_Hdr is 24 bits; 4 of these can be packed in
a slot, representing 4 cache line transfers in a slot. Each H2D Data is 64-Bytes and hence needs
4 slots. Thus, a x16 CXL device would get (16/17) *0.94*64 GB/s = 56.6 GB/s of bandwidth with
reads from the processor with 68-Byte Flit format. For 256-Byte and 128-Byte Latency-optimized
Flits, the data is the bottleneck, since each H slot can have up to 4 H2D Data_Hdr. For a 256-Byte
Flit, we only have 14 slots available for data, which can accommodate 3.5 cache lines. Hence, the
realizable bandwidth is: (14/16) *128=112 GB/s for 256-Byte Flit and (13/16) *128=104 GB/s for
128-Byte Latency-optimized Flit [8, 9].
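The read bandwidth figures above follow from multiplying a slot fraction by a link efficiency and a raw rate. The sketch below reproduces the arithmetic with the factors exactly as quoted in the text (64 and 128 GB/s raw rates, 0.94 efficiency for 68-Byte Flits); the function names are illustrative, and the factors are taken as given rather than derived.

```python
def read_bw_68b_flit() -> float:
    """Read data bandwidth in GB/s for the 68-Byte Flit format."""
    data_fraction = 16 / 17  # 16 data slots per header slot (4 cache lines)
    return data_fraction * 0.94 * 64

def read_bw_256b_flit(data_slots: int) -> float:
    """Read data bandwidth in GB/s when `data_slots` of 16 slots carry data."""
    return (data_slots / 16) * 128

bw_68 = read_bw_68b_flit()
bw_256 = read_bw_256b_flit(14)  # 256-Byte Flit
bw_lo = read_bw_256b_flit(13)   # 128-Byte Latency-optimized Flit
```

Rounded to one decimal place, the three results are 56.6, 112.0, and 104.0 GB/s, in line with the values in the text.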
For writes, the device issues a D2H Req (RdOwn). This results in an H2D response with Data.
The device then issues a D2H Dirty Evict and obtains an H2D Resp (Wr Pull), which causes it to
perform a D2H Mem Wr. This results in 3 Hdrs (D2H Req RdOwn, D2H Dirty Evict, and Data Hdr) for every
cache line worth of data in the D2H direction. For the 68-Byte Flit, the 3 Hdrs occupy two slots (e.g.,
D2H Req RdOwn + Data Hdr, D2H Dirty Evict), resulting in (4/6) * 0.94 * 64 GB/s = 40 GB/s per
direction. For the 256-Byte as well as 128-Byte Latency-optimized Flit formats, that is 4 G-slots
and 2.25 any slots (one each for the Req and Dirty Evict and 1/4 for the Data_Hdr, since we can have
4 of them in any slot) in the D2H direction, resulting in a realizable data bandwidth of (4 data slots/6.5
total slots) * (15/16) * 128 = 73.8 GB/s [8, 9].
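As a quick check, the two write-bandwidth expressions can be evaluated as written. This is a minimal sketch: the function name and argument names are hypothetical, and the slot counts, the 0.94 and 15/16 factors, and the raw bandwidth figures are simply the values that appear in the expressions above.

```python
# Sketch: evaluate the CXL.cache write-bandwidth expressions as quoted in the text.
# write_bw and its parameters are illustrative names, not specification terms.


def write_bw(data_slots, total_slots, efficiency, raw_gbps):
    """Fraction of slots carrying data, scaled by an efficiency factor and raw bandwidth."""
    return data_slots / total_slots * efficiency * raw_gbps


# 68-Byte Flit: 4 data slots plus 2 header slots, 0.94 link efficiency, 64 GB/s raw.
bw_68_write = write_bw(4, 6, 0.94, 64.0)

# 256-Byte / 128-Byte Latency-optimized Flit: (4 / 6.5) * (15 / 16) * 128, as in the text.
bw_large_write = write_bw(4, 6.5, 15 / 16, 128.0)

print(f"{bw_68_write:.1f} {bw_large_write:.1f}")
```

The script prints `40.1 73.8`, in line with the approximately 40 GB/s and 73.8 GB/s figures quoted above.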
CXL.mem: A memory read from the CPU is sent as MemRd in M2S (Section 3.5). It results in a
response that comprises an S2M DRS Hdr (MemData) and 4 slots of Data (64 B) for a Type 3
device. If the device is a Type 2 device, then an additional S2M NDR for completion (Cmp) is
needed. For writes, the CPU sends an M2S Req (MemWrite) and 4 slots of Data, generating an
S2M NDR as Cmp.
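The message flows just described can be written down as a small lookup, which makes it easy to see which direction carries the data slots for each access type. This is only a sketch: the message names follow the text, whereas the function names and the tuple layout are hypothetical, and the ordering of messages within a direction is not meant to be authoritative.

```python
# Sketch only: message names follow the text; the tuple layout and the
# function names are hypothetical and chosen for illustration.

DATA_SLOTS_PER_CACHE_LINE = 4  # a 64-Byte cache line occupies 4 slots


def cxl_mem_flow(op, device_type):
    """List (direction, message, data_slots) for one cache-line access.

    op is "read" or "write"; device_type is 2 or 3.
    """
    if device_type not in (2, 3):
        raise ValueError("device_type must be 2 or 3")
    if op == "read":
        flow = [
            ("M2S", "Req (MemRd)", 0),
            ("S2M", "DRS Hdr (MemData)", DATA_SLOTS_PER_CACHE_LINE),
        ]
        if device_type == 2:
            # A Type 2 device additionally returns a completion.
            flow.append(("S2M", "NDR (Cmp)", 0))
        return flow
    if op == "write":
        return [
            ("M2S", "Req (MemWrite)", DATA_SLOTS_PER_CACHE_LINE),
            ("S2M", "NDR (Cmp)", 0),
        ]
    raise ValueError("op must be 'read' or 'write'")


def data_slots(flow, direction):
    """Total data slots a flow places on the given direction."""
    return sum(slots for d, _, slots in flow if d == direction)


for op, dev in (("read", 3), ("read", 2), ("write", 3)):
    flow = cxl_mem_flow(op, dev)
    print(op, dev, [msg for _, msg, _ in flow],
          data_slots(flow, "M2S"), data_slots(flow, "S2M"))
```

Reads place their 4 data slots on S2M and writes place them on M2S, which is why the traffic mix determines which direction becomes the bottleneck.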
An implementation will attempt to pack multiple headers in a slot to maximize efficiency. The
CXL specification allows different packing permutations of various headers into a slot, which are
denoted using a so-called H* notation [8].
There are a few representative workloads to evaluate the memory bandwidth performance: 100%
Reads, which is represented as 1R0W; 100% Writes, which is represented as 1R1W, since every
cache line write involves reading the content of memory before updating it; and equal read and
write, which represents 2 cache line reads for every write (2R1W in the same notation). Table 4
summarizes the realizable bandwidth for CXL.mem for different traffic mixes. The detailed
derivation appeared in Reference [8].
When there are only reads with no writes, there is no real data transfer in the M2S direction (only
read requests are sent). Hence, the data bandwidth is 0 in that direction, whereas the S2M direction is
mostly data (1 slot of header for 2 DRS followed by 8 slots of data for the 2 cache lines). Hence, the
data bandwidth is 0.939 (link efficiency) × 8/9 (slot efficiency) × 64 GB/s (raw bandwidth)
= 53.5 GB/s. Other entries follow a similar approach, as detailed in Reference [8]. The scheduler
in the SPR Flit packing logic follows a greedy algorithm for CXL.mem by prioritizing H3 for the first
16 B slot in a Flit followed by H5 and populating data in the remaining Flits if less than 64 Bytes of
Data is scheduled to be sent. If 64 Bytes or more of Data is scheduled to be sent, then it schedules
an all-Data Flit. This approach results in measured bandwidth very close to the numbers in
Table 4 for the 68-Byte Flit mode.
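The 1R0W calculation and the greedy packing rule can be illustrated with a minimal Python sketch. The efficiency factors and slot counts come from the text. Everything else is an assumption made for illustration: the function names, the `h3_ready`/`h5_ready` flags, the slot labels, and our reading of the rule as "header format for the first slot, data in the remaining slots of the same Flit". It is not the actual SPR packing logic.

```python
# Sketch only. LINK_EFF, RAW_GBPS and the slot counts come from the text;
# the function names, the h3_ready/h5_ready flags, and the slot labels are
# hypothetical, and the planner is one reading of the greedy rule.

LINK_EFF = 0.939     # link efficiency quoted for the 68-Byte Flit
RAW_GBPS = 64.0      # raw bandwidth per direction used in the text
SLOTS_PER_FLIT = 4   # assumed: a 68-Byte Flit carries four 16-Byte slots
SLOT_BYTES = 16


def s2m_read_only_bw(cache_lines=2, slots_per_cache_line=4, hdr_slots=1):
    """1R0W data bandwidth in S2M: one header slot per two cache lines."""
    data = cache_lines * slots_per_cache_line
    return LINK_EFF * data / (data + hdr_slots) * RAW_GBPS


def plan_flit(pending_data_bytes, h3_ready=True, h5_ready=False):
    """Return the slot labels of the next Flit under the greedy rule."""
    if pending_data_bytes >= SLOTS_PER_FLIT * SLOT_BYTES:
        return ["DATA"] * SLOTS_PER_FLIT  # 64 Bytes or more: all-Data Flit
    if h3_ready:
        first = "H3"        # H3 is prioritized for the first slot
    elif h5_ready:
        first = "H5"        # H5 is the next choice
    else:
        first = "H-other"   # placeholder for any other header format
    n_data = min(SLOTS_PER_FLIT - 1, pending_data_bytes // SLOT_BYTES)
    n_empty = SLOTS_PER_FLIT - 1 - n_data
    return [first] + ["DATA"] * n_data + ["EMPTY"] * n_empty


bw_1r0w = s2m_read_only_bw()
print(f"{bw_1r0w:.2f}")
print(plan_flit(64))
print(plan_flit(32))
print(plan_flit(0, h3_ready=False, h5_ready=True))
```

With the rounded factors quoted above, the bandwidth expression evaluates to within about 0.1 GB/s of the 53.5 GB/s figure. The planner returns an all-Data Flit once a full cache line is pending and a header-first Flit otherwise.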
For the 256-Byte standard or 128-Byte Latency-optimized Flit, a designer needs to consider
additional constraints. For example, one may use the H/HS slot for header only. The scheduling will
ensure that no G-slot will go empty when there is data (or header) to be sent. Thus, headers that
precede data will be prioritized and opportunistically placed in H/HS slots while making sure that
no more than five cache lines’ worth of data are to be scheduled. We also ensure that other headers
make forward progress.
Table 5. Link efficiency with back-invalidate (BI) and unordered IO (UIO) Flows [9]
With the UIO/BI flows introduced in CXL 3.0 [6, 8, 9], the host processor is bypassed, and
accesses go directly between devices using the CXL.io UIO flows, both for HDM accesses and
for inter-domain messages that do not need caching. The expectation is that the vast majority of
these will not cause invocation of the BI-Snp mechanism to enforce coherency, since I/O devices
and cores are typically not accessing the same data simultaneously. This helps with link efficiency
as well as congestion on the host-processor links. In the analysis of Table 5, the efficiency gain
with this mechanism is significant, even in the pathological case where 100% of accesses cause
BI-Snp (x = 1.0 cases in Table 1b). Such simultaneous accesses from the host and device may occur
for control data structures that are used for synchronization. Typically, we would expect such data
structures to be placed in host memory. However, even if 100% of accesses cause BI-Snp, the UIO/BI
flows are better than multiple cache line transfers across links.
7 Discussion
We discuss four implications for the impact of CXL on the compute landscape and then outline an
initial set of future directions.
twice the latency). When considering loaded latency, CXL may perform better than DDR due to its
bandwidth advantage and on-package memory would significantly outperform either. A natural
expectation is that more memory will migrate to CXL and on-package memory. Eventually, CXL
will become the only external memory attach point for CPUs and accelerators.
Implication 3: CXL will grow to become a rack or cluster-level interconnect. Compared
to today’s datacenter networks based on Ethernet and InfiniBand, CXL lowers latency by an
order of magnitude. Additionally, CXL’s coherent memory sharing and fine-grained synchronization
can significantly boost distributed system performance for key workloads such as large machine
learning models and databases. However, CXL requires dedicated cabling with stringent
requirements, e.g., on cable length, adaptors, and use of retimers. This leads to significantly higher cost
and less flexibility than today’s datacenter networks. It is thus likely that CXL will be deployed
within a smaller domain, such as within racks or across a few racks (cluster or pod). For example,
current financial models point to sub-rack CXL deployments as the TCO sweet spot [56]. With
improved cost through standardization, CXL’s scope can likely grow, but it is unlikely that CXL
will replace Ethernet as the datacenter-wide networking standard.
Implication 4: CXL will enable highly composable systems. Composability means that
components and resources can be dynamically assembled at runtime and assigned to a workload
or virtual machine. We expect that both memory devices and IO devices (e.g., NICs, accelerators,
and storage) will evolve multi-host capabilities that allow dynamically assigning fractions of their
capacity to individual hosts over CXL. Converged and pooled devices lead to significantly better
resource usage due to increased multiplexing opportunities, and thus lower cost. This design also
facilitates accelerating distributed systems via shared memory, message passing, and peer-to-peer
communication via CXL. However, a composable system is still distinct from treating all resources
connected via CXL as a single big system. Workloads will still prefer locality and spanning as few
CXL links as possible to minimize coherence traffic and the blast radius of any failure.
In computer engineering, research and development needs to continue to reduce the latency
of CXL memory, accelerators, and switches to expand the applicability of CXL. There are many
opportunities, including iterating on engineering CXL building blocks, exploiting process
advancements (Section 6.1), and better packing algorithms (Section 6.3). The ecosystem also needs to
develop a rigorous approach to error containment and management of CXL’s increased blast radius.
Working around faults and congestion hot spots with load-store semantics will have different
constraints than networking applications, which can handle lost packets as well as completely
out-of-order delivery of packets. Finally, CXL can be further enhanced and deployed to expand
across multiple racks and offer high reliability with low-latency load-store access for multiple
applications with dynamic fail-over and finer-grained QoS enhancements that will be
incorporated in future revisions of the specification. With co-packaged optics enabled through Universal
Chiplet Interconnect Express (UCIe) retimers [10, 48, 72, 73, 74], we expect to realize the vision
of building composable and scale-out systems spanning the rack through the pod at the datacenter,
resulting in power-efficient performance with significant total cost-of-ownership benefits.
8 Conclusions
CXL addresses significant industry challenges while following a design philosophy based on
openness, simplicity, and backward compatibility. This has enabled CXL to gain traction as a common
standard across the industry. All of these factors make CXL a trending research area within the
academic community as well. We hope that this tutorial serves as an introduction to the standard, a
jumping-off point into its details, and a base for research ideas.
References
[1] D. Das Sharma. 2019. Compute Express Link. White paper. In Compute Express Link Consortium.
[2] Compute Express Link Consortium, Inc. 2019. Compute Express Link 1.1 specification.
[3] Compute Express Link Consortium, Inc. 2020. Compute Express Link 2.0 specification.
[4] D. Das Sharma and S. Tavallaei. 2020. Compute Express Link™ 2.0. White paper. In CXL Consortium.
[5] Compute Express Link Consortium, Inc. 2022. Compute Express Link 3.0 specification. Retrieved from www.
[Link]
[6] D. Das Sharma and I. Agarwal. 2022. Compute Express Link 3.0. White paper. In CXL Consortium.
[7] PCI-SIG. 2022. PCI Express® base specification revision 6.0, Version 1.0.
[8] D. Das Sharma. 2023. Compute Express Link®: Enabling heterogeneous data-centric computing with heterogeneous
memory hierarchy. IEEE Micro. (2023 Mar.–Apr).
[9] D. Das Sharma. 2023. Novel composable and scale-out architectures using Compute Express Link™. IEEE Micro. 29
(Mar.–Apr. 2023).
[10] D. Das Sharma. 2022. Transforming the data-centric world. In Flash Memory Summit.
[11] D. Das Sharma. 2021. Evolution of interconnects and fabrics to support future compute infrastructure. In Open Fabrics
Alliance Workshop.
[12] D. Das Sharma. 2021. The Compute Express Link (CXL)* open standard is changing the game for cloud computing.
In IEEE Hot Interconnects Conference.
[13] D. Das Sharma. 2022. A low-latency and low-power approach for coherency and memory protocols on PCI Express
6.0 PHY at 64.0 GT/s with PAM-4 signaling. IEEE Micro. (2022 Mar.–Apr).
[14] D. Das Sharma. 2021. PCI Express 6.0 specification: A low-latency, high-bandwidth, high-reliability, and cost-effective
interconnect with 64.0 GT/s PAM-4 signaling. IEEE Micro (2021 Jan.–Feb).
[15] Stephane Hauradou. 2020. Rambus Webinar. [Link]
[16] Microchip Inc. 2020. XpressConnect™ PCIe® Gen 5 and CXL™ Retimer Family. [Link]
default/files/component_datasheet/XpressConnect- PCIe- Retimers- [Link]
[17] Hasan Al Maruf, Hao Wang, Abhishek Dhanotia, Johannes Weiner, Niket Agarwal, Pallab Bhattacharya, Chris
Petersen, Mosharaf Chowdhury, Shobhit Kanaujia, and Prakash Chauhan. 2022. TPP: Transparent Page Placement for
CXL-enabled tiered memory. In (ASPLOS’23).
[18] Huaicheng Li, Daniel S. Berger, Lisa Hsu, Daniel Ernst, Pantea Zardoshti, Stanko Novakovic, Monish Shah, Samir
Rajadnya, Scott Lee, Ishwar Agarwal, Mark D. Hill, Marcus Fontoura, and Ricardo Bianchini. 2023. Pond: CXL-based
memory pooling systems for cloud platforms. In ACM International Conference on Architectural Support for
Programming Languages and Operating Systems (ASPLOS’23).
[19] Mor Harchol-Balter. 2013. Performance Modeling and Design of Computer Systems: Queueing Theory in Action.
Cambridge University Press.
[20] Kevin Lim et al. 2009. Disaggregated memory for expansion and sharing in blade servers. ACM SIGARCH Comput.
Archit. News 37, 3 (2009), 267–278.
[21] Synopsys Inc, Retrieved from [Link] 10- 13- Synopsys- Demonstrates- Industrys- First- PCI-
Express- 5- 0- IP- Interoperability- with- Intels- Future- Xeon- Scalable- Processor
[22] Mobiveil, Retrieved from [Link]
Announces- Compute- Express- Link- CXL- 2- 0- Design- IP- Successful- Completion- of- CXL- 1- 1- Validation- with-
Intel- s- CXL- Host- [Link]
[23] Intel Corporation. 2019. PHY Interface for the PCI Express, SATA, USB 3.1, Display Port, and Converged I/O
Architectures. Version 5.1. [Link]
pci- express- sata- usb30- architectures- [Link]
[24] Intel Corporation. 2019. Logical PHY Interface (LPIF) Specification. Version 1.0. [Link]
www/public/us/en/documents/technical-specifications/[Link]
[25] Intel Corporation. 2020. CXL-Cache/Mem Protocol Interface (CPI). Rev 0.7. [Link]
en/content- details/644330/compute- express- link- cxl- cache- mem- protocol- interface- cpi- [Link]
[26] Intel Corporation. 2020. Streaming Fabric Interface (SFI). Rev 0.7. [Link]
content- details/644200/streaming- fabric- interface- sfi- [Link]
[27] Jing Guo, Zihao Chang, Sa Wang, Haiyang Ding, Yihui Feng, Liang Mao, and Yungang Bao. 2019. Who limits the
resource efficiency of my datacenter: An analysis of Alibaba datacenter traces. In International Symposium on Quality
of Service.
[28] Muhammad Tirmazi, Adam Barker, Nan Deng, Md. E. Haque, Zhijing Gene Qin, Steven Hand, Mor Harchol-Balter,
and John Wilkes. 2020. Borg: the next generation. In 15th European Conference on Computer Systems.
[29] Eli Cortez, Anand Bonde, Alexandre Muzio, Mark Russinovich, Marcus Fontoura, and Ricardo Bianchini. 2017.
Resource central: Understanding and predicting workloads for improved resource management in large cloud platforms.
In 26th Symposium on Operating Systems Principles.
[30] Qizhen Zhang, Philip A. Bernstein, Daniel S. Berger, and Badrish Chandramouli. 2021. Redy: Remote dynamic
memory cache. VLDB J. 15, 4 (2021), 766–779.
[31] J. Gu, Y. Lee, Y. Zhang, M. Chowdhury, and K. G. Shin. 2017. Efficient memory disaggregation with INFINISWAP. In
USENIX Symposium on Networked Systems Design and Implementation (NSDI’17). 649–667.
[32] Huan Liu. 2011. A measurement study of server utilization in public clouds. In IEEE 9th International Conference on
Dependable, Autonomic and Secure Computing.
[33] Daniel Berger, Fiodar Kazhamiaka, Esha Choukse, Inigo Goiri, Celine Irvene, Pulkit A. Misra, Alok Kumbhare, Rodrigo
Fonseca, and Ricardo Bianchini. 2023. Research avenues towards net-zero cloud platforms. In 1st Workshop on NetZero
Carbon Computing (NetZero’23).
[34] Chaojie Zhang, Alok Kumbhare, Ioannis Manousakis, Deli Zhang, Pulkit Misra, Rod Assis, Kyle Woolcock, Nithish
Mahalingam, Brijesh Warrier, David Gauthier, Lalu Kunnath, Steve Solomon, Osvaldo Morales, Marcus Fontoura,
and Ricardo Bianchini. 2021. Flex: High-availability datacenters with zero reserved power. In ACM/IEEE 48th Annual
International Symposium on Computer Architecture (ISCA’21). IEEE.
[35] Alexandre Verbitski, Anurag Gupta, Debanjan Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh
Krishnamurthy, Sandor Maurice, Tengiz Kharatishvili, and Xiaofeng Bao. 2017. Amazon aurora: Design considerations
for high throughput cloud-native relational databases. In ACM SIGMOD Conference. 1041–1052.
[36] Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta
Sengupta, and Murari Sridharan. 2010. Data center TCP (DCTCP). In ACM SIGCOMM Conference. 63–74.
[37] Albert Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, David A.
Maltz, Parveen Patel, and Sudipta Sengupta. 2009. VL2: A scalable and flexible data center network. In ACM SIGCOMM
Conference on Data Communication. 51–62.
[38] Irene Zhang, Amanda Raybuck, Pratyush Patel, Kirk Olynyk, Jacob Nelson, Omar S. Navarro Leija, Ashlie
Martinez, Jing Liu, Anna Kornfeld Simpson, Sujay Jayakar, Pedro Henrique Penna, Max Demoulin, Piali Choudhury, and
Anirudh Badam. 2021. The demikernel datapath os architecture for microsecond-scale datacenter systems. In ACM
SIGOPS 28th Symposium on Operating Systems Principles. 195–211.
[39] Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi
Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, Michael Haselman, Scott Hauck, Stephen Heil,
Amir Hormati, Joo-Young Kim, Sitaram Lanka, James Larus, Eric Peterson, Simon Pope, Aaron Smith, Jason Thong,
Phillip Yi Xiao, and Doug Burger. 2014. A reconfigurable fabric for accelerating large-scale datacenter services. ACM
SIGARCH Comput. Archit. News 42, 3 (2014), 13–24.
[40] Daniel Firestone, Andrew Putnam, Sambhrama Mundkur, Derek Chiou, Alireza Dabagh, Mike Andrewartha, Hari
Angepat, Vivek Bhanu, Adrian Caulfield, Eric Chung, Harish Kumar Chandrappa, Somesh Chaturmohta, Matt
Humphrey, Jack Lavier, Norman Lam, Fengfen Liu, Kalin Ovtcharov, Jitu Padhye, Gautham Popuri, Shachar Raindel,
Tejas Sapre, Mark Shaw, Gabriel Silva, Madhan Sivakumar, Nisheeth Srivastava, Anshuman Verma, Qasim Zuhair,
Deepak Bansal, Doug Burger, Kushagra Vaid, David A. Maltz, and Albert Greenberg. 2018. Azure accelerated
networking: SmartNICs in the public cloud. In USENIX Symposium on Networked Systems Design and Implementation
(NSDI’18), 51–66.
[41] Tal Ben-Nun and Torsten Hoefler. 2019. Demystifying parallel and distributed deep learning: An in-depth
concurrency analysis. ACM Comput. Surv. 52, 4 (2019), 1–43.
[42] Marcos K. Aguilera, Naama Ben-David, Rachid Guerraoui, Antoine Murat, Athanasios Xygkis, and Igor Zablotchi.
2023. uBFT: Microsecond-scale BFT using disaggregated memory. In ACM International Conference on Architectural
Support for Programming Languages and Operating Systems (ASPLOS’23). 862–877.
[43] James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat,
Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li,
Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito,
Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. 2013. Spanner: Google’s globally distributed
database. ACM Trans. Comput. Syst. 31, 3 (2013), 1–22.
[44] Panagiotis Antonopoulos, Alex Budovski, Cristian Diaconu, Alejandro Hernandez Saenz, Jack Hu, Hanuma
Kodavalla, Donald Kossmann, Sandeep Lingam, Umar Farooq Minhas, Naveen Prakash, Vijendra Purohit, Hugh Qu,
Chaitanya Sreenivas Ravella, Krystyna Reisteter, Sheetal Shrotri, Dixin Tang, and Vikram Wakade. 2019. Socrates:
The new SQL server in the cloud. In ACM SIGMOD Conference. 1743–1756.
[45] Mahesh Natu and Thanu Rangarajan. 2020. Compute Express Link (CXL) Update. In UEFI 2020 Virtual Plugfest.
[46] UEFI. 2020. Coherent Device Attribute Table (CDAT) specification. Retrieved from [Link].
[47] M. S. Papamarcos and J. H. Patel. 1984. A low-overhead coherence solution for multiprocessors with private cache
memories. In 11th Annual International Symposium on Computer Architecture (ISCA’84).
[48] Yan Sun, Yifan Yuan, Zeduo Yu, Reese Kuper, Chihun Song, Jinghan Huang, Houxiang Ji, Siddharth Agarwal, Jiaqi
Lou, Ipoom Jeong, Ren Wang, Jung Ho Ahn, Tianyin Xu, and Nam Sung Kim. 2023. Demystifying CXL memory with
genuine CXL-ready systems and devices. arXiv preprint arXiv:2303.15375 (2023).
[49] D. Gouk, M. Kwon, H. Bae, S. Lee, and M. Jung. 2023. Memory pooling with CXL. IEEE Micro 43, 2 (2023).
[50] J. Dastidar, D. Riddoch, J. Moore, S. Pope, and J. Wesselkamper. 2023. AMD 400G adaptive SmartNIC SoC–Technology
preview. IEEE Micro 43, 3 (2023).
[51] D. Boles, D. Waddington, and D. A. Roberts. 2023. CXL-enabled enhanced memory functions. IEEE Micro 43, 2 (2023).
[52] Kyungsan Kim, Hyunseok Kim, Jinin So, Wonjae Lee, Junhyuk Im, Sungjoo Park, Jeonghyeon Cho, and Hoyoung
Song. 2023. SMT: Software-Defined memory tiering for heterogeneous computing systems with CXL memory
expander. IEEE Micro 43, 2 (2023), 20–29.
[53] Montage Technology. 2023. CXL Memory Expander Controller (MXC). Retrieved from [Link]
com/MXC
[54] SK Hynix Inc. 2022. SK Hynix introduces industry’s first CXL-based Computational Memory Solution (CMS) at the
OCP Global Summit. Retrieved from [Link] hynix- introduces- industrysfirst- cxl- based- cms-
at- the- ocp- global- summit/
[55] Parag Beeraka. 2023. Enabling CXL within the data center with arm solutions. In OpenCompute Summit. https://
[Link]/wp- content/uploads/2022/10/CXL- Forum- Wall- Street_arm.pdf
[56] Daniel S. Berger, Daniel Ernst, Huaicheng Li, Pantea Zardoshti, Monish Shah, Samir Rajadnya, Scott Lee, Lisa Hsu,
Ishwar Agarwal, Mark D. Hill, and Ricardo Bianchini. 2023. Design tradeoffs in CXL-based memory pools for public
cloud platforms. IEEE Micro 43, 2 (2023), 30–38.
[57] Nirav Atre, Justine Sherry, Weina Wang, and Daniel S. Berger. 2020. Caching with delayed hits. In Annual Conference
of the ACM Special Interest Group on Data Communication on the Applications, Technologies, Architectures, and Protocols
for Computer Communication. 495–513.
[58] Davy Genbrugge and Lieven Eeckhout. 2007. Memory data flow modeling in statistical simulation for the efficient
exploration of microprocessor design spaces. IEEE Trans. Comput. 57, 1 (2007), 41–54.
[59] Onur Mutlu and Lavanya Subramanian. 2014. Research problems and opportunities in memory systems. Supercomput.
Front. Innov. 1, 3 (2014), 19–55.
[60] Kevin Kai-Wei Chang, Donghyuk Lee, Zeshan Chishti, Alaa R. Alameldeen, Chris Wilkerson, Yoongu Kim, and Onur
Mutlu. 2014. Improving DRAM performance by parallelizing refreshes with accesses. In International Symposium on
High-Performance Computer Architecture (HPCA’14).
[61] Jamie Liu, Ben Jaiyen, Richard Veras, and Onur Mutlu. 2012. RAIDR: Retention-aware intelligent DRAM refresh. In
International Symposium on Computer Architecture (ISCA’12).
[62] Prashant J. Nair, Dae-Hyun Kim, and Moinuddin K. Qureshi. 2013. ArchShield: Architectural framework for assisting
DRAM scaling by tolerating high error rates. In International Symposium on Computer Architecture (ISCA’13).
[63] Gennady Pekhimenko, Todd C. Mowry, and Onur Mutlu. 2013. Linearly compressed pages: A main memory
compression framework with low complexity and low latency. In Annual IEEE/ACM International Symposium on
Microarchitecture (MICRO’13).
[64] Esha Choukse, Mattan Erez, and Alaa R. Alameldeen. 2018. Compresso: Pragmatic main memory compression. In
51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’18). IEEE.
[65] O. Mutlu, S. Ghose, J. Gómez-Luna, and R. Ausavarungnirun. 2022. A modern primer on processing in memory. In
Emerging Computing: From Devices to Systems: Looking Beyond Moore and Von Neumann. Springer Nature Singapore,
171–243.
[66] Gagandeep Singh, Lorenzo Chelini, Stefano Corda, Ahsan Javed Awan, Sander Stuijk, Roel Jordans, Henk Corporaal,
and Albert-Jan Boonstra. 2019. Near-memory computing: Past, present, and future. Microprocess. Microsyst. 71 (2019),
102868.
[67] H. Hassan, G. Pekhimenko, N. Vijaykumar, V. Seshadri, D. Lee, O. Ergin, and O. Mutlu. 2016. ChargeCache: Reducing
DRAM latency by exploiting row access locality. In IEEE International Symposium on High Performance Computer
Architecture (HPCA’16). 581–593.
[68] D. Lee, Y. Kim, G. Pekhimenko, S. Khan, V. Seshadri, K. Chang, and O. Mutlu. 2015. Adaptive-latency DRAM:
Optimizing DRAM timing for the common-case. In IEEE 21st International Symposium on High Performance Computer
Architecture (HPCA’15). 489–501.
[69] S. Lee, B. Jeon, K. Kang, D. Ka, N. Kim, Y. Kim, and J. Oh. 2019. 23.4 a 512GB 1.1 V managed DRAM solution with
16GB ODP and media controller. In IEEE International Solid-State Circuits Conference (ISSCC’19). 384–386.
[70] Y. Chen. 2020. ReRAM: History, status, and future. IEEE Trans. Electron Devices 67, 4 (2020), 1420–1433.
[71] F. T. Hady, A. Foong, B. Veal, and D. Williams. 2017. Platform storage performance with 3D XPoint technology. Proc.
IEEE 105, 9 (2017), 1822–1833.
[72] D. Das Sharma. 2022. Universal Chiplet Interconnect express (UCIe)® : Building an open chiplet ecosystem. White
paper. In UCIe Consortium.
[73] Debendra Das Sharma, Gerald Pasdast, Zhiguo Qian, and Kemal Aygun. 2022. Universal Chiplet Interconnect express
(UCIe)® : An open industry standard for innovations with chiplets at package level. IEEE Trans. Compon., Packag.,
Manuf. Technol. (2022 Oct.).
[74] D. Das Sharma. 2023. Universal Chiplet Interconnect express (UCIe)®: An open industry standard for innovations
with chiplets at package level. IEEE Micro Special Issue Mar.–Apr. (2023).
[75] Micron. 2023. Micron launches memory expansion module portfolio to accelerate CXL 2.0 adoption. Retrieved from
[Link]
portfolio- accelerate- cxl
[76] Astera Labs. 2023. Leo CXL smart memory controllers. Retrieved from [Link]
uploads/2024/05/231220_Product_Brief _Leo_CXL_Smart_Memory_Controllers.pdf
[77] Microchip. 2023. SMC 2000 Smart Memory Controllers. Retrieved from [Link]
aemDocuments/documents/DCS/ProductDocuments/Brochures/SMC- 2000- Smart- Memory- Controllers- 00004551.
pdf
[78] Nam Sung Kim. 2018. Practical challenges in supporting function in memory. In IEEE Asian Solid-State Circuits
Conference (A-SSCC’18). IEEE.
[79] Hao Wang, Chang-Jae Park, Gyungsu Byun, Jung Ho Ahn, and Nam Sung Kim. 2015. Alloy: Parallel-serial
memory channel architecture for single-chip heterogeneous processor systems. In International Symposium on
High-Performance Computer Architecture (HPCA’15).
[80] Y. Zhong, D. S. Berger, C. Waldspurger, I. Agarwal, R. Agarwal, F. Hady, K. Kumar, M. D. Hill, M. Chowdhury, and
A. Cidon. 2024. Managing memory tiers with CXL in virtualized environments. In Symposium on Operating Systems
Design and Implementation.
[81] Jaylen Wang, Daniel S. Berger, Fiodar Kazhamiaka, Celine Irvene, Chaojie Zhang, Esha Choukse, Kali Frost, Rodrigo
Fonseca, Brijesh Warrier, Chetan Bansal, Jonathan Stern, Ricardo Bianchini, and Akshitha Sriraman. 2024. Designing
cloud servers for lower carbon. In 51st Annual International Symposium on Computer Architecture.
[82] W. S. Cleveland, E. Grosse and W. M. Shyu. 1992. Local regression models. In Statistical Models in S, J. M. Chambers
and T. J. Hastie (Eds.). Wadsworth & Brooks/Cole.