
CXL 3.0: Solving New Memory Problems in Data Centres (Part 2)

TIME: 2024-12-18 15:08   SOURCE: Network   WRITER: August

As a next-generation device interconnect standard, Compute Express Link (CXL) has emerged as one of the most promising technologies in both industry and academia. It not only enables the expansion of memory capacity and bandwidth but also strengthens heterogeneous interconnection and supports the disaggregation and pooling of data centre resources. With high bandwidth, low latency, strong scalability, and hardware-maintained cache/memory coherency, CXL effectively addresses the "island" problem inherent in heterogeneous computing systems, allowing powerful compute resources to collaborate efficiently in heterogeneous environments.


Latency Issues with CXL Technology

In recent years, the introduction of CXL technology has raised high expectations within the industry. As a distributed memory technology, CXL 2.0 allows a host to use its own directly attached DRAM while also accessing external DRAM through the CXL 2.0 interface. Accessing that external DRAM, however, incurs higher latency than accessing local DRAM. Although CXL is promoted as a low-latency interconnect, a latency gap remains between CXL-attached memory and the CPU's local memory, caches, and registers.
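
This gap can be measured directly, because Linux typically exposes a CXL memory expander as a CPU-less NUMA node. The sketch below is a minimal pointer-chasing latency probe, assuming libnuma is installed and that the expander appears as NUMA node 1 (the CXL_NODE constant is an assumption and must be adjusted to the actual topology); it is an illustration, not a production benchmark.

```c
/*
 * Minimal pointer-chasing latency probe for one NUMA node.
 * Assumption: the CXL memory expander is exposed as NUMA node 1.
 * Build: gcc -O2 latency_probe.c -lnuma -o latency_probe
 */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define CXL_NODE   1            /* assumed node ID of the CXL expander */
#define N_ELEMS    (64 << 20)   /* 64 Mi pointers -> 512 MiB, defeats caches */
#define N_ACCESSES (10 * 1000 * 1000)

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma: NUMA not available\n");
        return 1;
    }

    /* Allocate the chase buffer directly on the chosen node. */
    size_t bytes = (size_t)N_ELEMS * sizeof(size_t);
    size_t *buf = numa_alloc_onnode(bytes, CXL_NODE);
    if (!buf) {
        perror("numa_alloc_onnode");
        return 1;
    }

    /* Build a single random cycle (Sattolo shuffle) so the hardware
     * prefetcher cannot hide the memory latency. */
    for (size_t i = 0; i < N_ELEMS; i++)
        buf[i] = i;
    srand(42);
    for (size_t i = N_ELEMS - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = buf[i]; buf[i] = buf[j]; buf[j] = tmp;
    }

    /* Chase pointers: each load depends on the previous one. */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t idx = 0;
    for (long i = 0; i < N_ACCESSES; i++)
        idx = buf[idx];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("node %d: %.1f ns per dependent load (last index %zu)\n",
           CXL_NODE, ns / N_ACCESSES, idx);

    numa_free(buf, bytes);
    return 0;
}
```

Running the same probe against node 0 (local DDR) and against the CXL node gives a like-for-like view of the per-load latency difference on a given platform.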

At the previous Hot Chips event, the CXL Consortium presented specific latency figures for CXL technology. The latency of CXL-attached memory, which sits outside the CPU, ranges from approximately 170 to 250 nanoseconds: higher than that of CPU-attached DRAM, but lower than the latencies of non-volatile memory (NVM), memory disaggregated over the network, solid-state drives (SSDs), and hard disk drives (HDDs), all of which likewise sit outside the CPU.

A report from Microsoft Azure indicates that CXL memory carries a latency gap of roughly 180% to 320% relative to CPU-attached memory, caches, and registers; detailed latency values and ratios for various scenarios are shown in the accompanying figure. Different applications have very different sensitivities to this added latency. The figure illustrates how the latency differences of various CXL attachment schemes affect application performance: without optimization, the overall trend is that performance degrades for a growing share of applications, although the exact degradation must be evaluated per workload.

Looking ahead, the industry expects CXL memory latency to improve significantly as PCIe 6.0/CXL 3.0 sees widespread deployment, though that is unlikely to happen before 2024-2025. In the interim, performance can still be improved through better memory management, scheduling, and monitoring software that mitigates the latency difference between local DRAM and CXL-attached memory.
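
One small building block of such management software is simply keeping per-tier memory pressure visible. The sketch below is a hedged illustration rather than a complete tool: it reads the per-node meminfo files that Linux exposes under /sys/devices/system/node/ and prints how full each NUMA node is, which a placement or scheduling daemon could use to decide when to start steering allocations or migrating pages toward a CXL-backed node.

```c
/*
 * Tiny per-NUMA-node memory monitor: prints total/free memory for each
 * node found in sysfs. On a system with a CXL expander, that expander
 * shows up as its own node, so local-DDR pressure is easy to spot.
 * Build: gcc -O2 node_meminfo.c -o node_meminfo
 */
#include <stdio.h>

int main(void)
{
    /* Probe a generous range of node IDs; nodes that do not exist
     * simply have no meminfo file and are skipped. */
    for (int node = 0; node < 64; node++) {
        char path[128];
        snprintf(path, sizeof(path),
                 "/sys/devices/system/node/node%d/meminfo", node);

        FILE *f = fopen(path, "r");
        if (!f)
            continue;

        char line[256];
        long total_kb = -1, free_kb = -1;
        while (fgets(line, sizeof(line), f)) {
            long val;
            if (sscanf(line, "Node %*d MemTotal: %ld kB", &val) == 1)
                total_kb = val;
            else if (sscanf(line, "Node %*d MemFree: %ld kB", &val) == 1)
                free_kb = val;
        }
        fclose(f);

        if (total_kb > 0)
            printf("node %d: %ld MiB total, %ld MiB free (%.1f%% free)\n",
                   node, total_kb / 1024, free_kb / 1024,
                   100.0 * free_kb / total_kb);
    }
    return 0;
}
```

On a machine where the CXL expander is a separate node, the output makes it obvious when the local DDR tier is nearly exhausted while the CXL tier still has headroom.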

 

 

[Figure] Analysis of Latency Values in Local and CXL Memory Resource Expansion

 

 

[Figure] Analysis of the Impact of Microsoft Azure CXL Memory on Applications


Solutions for Latency in CXL Technology

The advancement of CXL (Compute Express Link) technology has prompted more and more enterprises to adopt CXL-compatible heterogeneous, tiered memory systems. Meta has proposed a tiered classification of memory into three categories: "hot" memory, for critical tasks such as real-time data analysis; "warm" memory, which is accessed infrequently; and "cold" memory, used for bulk data storage. "Hot" memory resides in native DDR memory, whereas "cold" memory is placed in CXL memory. Current software, however, often cannot adequately distinguish "hot" from "cold" data: as native memory fills up, allocations spill over into CXL memory, so memory that was intended to hold "cold" data ends up holding "hot" data. A significant challenge therefore remains at the operating system and software level: effectively identifying "cold" memory pages and proactively moving them to CXL memory to free up space in native DDR memory.
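
In user space, the second half of that loop, moving pages that have been judged cold onto the CXL tier, can be expressed with the move_pages(2) system call. The sketch below is a simplified illustration under the assumption that the CXL expander is NUMA node 1 (CXL_NODE is hypothetical) and that the caller has already decided, by whatever sampling or profiling means, that the buffer is cold; it is not Meta's implementation.

```c
/*
 * Demote a "cold" buffer's pages from the local DDR node to a CXL-backed
 * NUMA node using move_pages(2). Assumption: the CXL node ID is 1.
 * Build: gcc -O2 demote_cold.c -lnuma -o demote_cold
 */
#include <numaif.h>      /* move_pages(), MPOL_MF_MOVE */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define CXL_NODE 1       /* assumed NUMA node backed by CXL memory */

/* Migrate every page of [addr, addr+len) to CXL_NODE.
 * addr must be page-aligned (mmap below guarantees this). */
static int demote_buffer(void *addr, size_t len)
{
    long page_size = sysconf(_SC_PAGESIZE);
    size_t npages = (len + page_size - 1) / page_size;

    void **pages  = malloc(npages * sizeof(void *));
    int   *nodes  = malloc(npages * sizeof(int));
    int   *status = malloc(npages * sizeof(int));
    if (!pages || !nodes || !status)
        return -1;

    for (size_t i = 0; i < npages; i++) {
        pages[i] = (char *)addr + i * page_size;
        nodes[i] = CXL_NODE;            /* target node for each page */
    }

    /* pid 0 = current process; MPOL_MF_MOVE moves only pages mapped
     * exclusively by this process. */
    long rc = move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE);
    if (rc < 0)
        perror("move_pages");
    else
        printf("demoted %zu pages, first page now on node %d\n",
               npages, status[0]);

    free(pages); free(nodes); free(status);
    return (int)rc;
}

int main(void)
{
    /* A buffer that has gone "cold": allocated, touched once, then idle. */
    size_t len = 64 << 20;              /* 64 MiB */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    memset(buf, 0, len);                /* fault pages in on the local node */

    demote_buffer(buf, len);

    munmap(buf, len);
    return 0;
}
```

The same call, with nodes[] pointing at the local DDR node, performs the opposite promotion; kernel-level solutions such as TPP, described below, do both transparently without application changes.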

To improve performance in such systems, Meta has explored an approach to memory page management that departs from conventional local-DRAM practice: tracking the access "temperature" of memory pages through a Linux kernel extension known as Transparent Page Placement (TPP). Acknowledging the latency gap between CXL memory and local memory, much as distributed storage systems apply tiered management to flash and hard disk drives, Meta introduced a multi-tier memory concept in its TPP paper. The framework estimates how much of the data resident in memory is hot, warm, or cold based on access activity, and then provides mechanisms for placing hot data in the fastest memory, cold data in the slowest memory, and warm data in the tier in between.

The TPP design space encompasses four primary domains:

1. Lightweight demotion to CXL memory,

2. Decoupling allocation and reclamation paths,

3. Promoting hot pages to the local node, and

4. Page type-aware memory allocation.

Meta's TPP is a kernel-mode solution that transparently monitors page temperature and places pages accordingly. It works alongside Meta's Chameleon memory-tracking tool, which runs in Linux user space and tracks how applications use CXL memory. Meta evaluated TPP on several production workloads, keeping "hotter" pages in local memory while moving "colder" pages out to CXL memory. The results show that TPP improves performance by approximately 18% compared with applications running on the default Linux configuration, and by 5% to 17% compared with two leading existing techniques for tiered memory management: NUMA balancing and automatic tiering.

 


CXL 3.0 Enters the Era of Heterogeneous Interconnection

The CXL Consortium has introduced the CXL 3.0 standard, which adds enhanced fabric functionality and management. This iteration brings improvements in memory pooling, coherency, and peer-to-peer communication, leading to significant advancements overall. Notably, the data transfer rate doubles to 64 GT/s while latency remains unchanged relative to CXL 2.0. CXL 3.0 also maintains backward compatibility with the previous CXL 2.0, CXL 1.1, and CXL 1.0 standards, facilitating the creation of heterogeneous and composable server architectures.
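
The doubling comes from CXL 3.0 reusing the PCIe 6.0 physical layer at 64 GT/s per lane, twice the 32 GT/s of CXL 2.0. The short calculation below shows the corresponding raw per-direction bandwidth for a x16 link; it deliberately ignores FLIT, CRC, and FEC overhead, so usable bandwidth is somewhat lower.

```c
/*
 * Back-of-the-envelope raw link bandwidth for CXL 3.0 vs CXL 2.0.
 * In PCIe terms one transfer carries one raw bit, so 64 GT/s is
 * 64 Gbit/s per lane per direction before protocol overhead.
 */
#include <stdio.h>

int main(void)
{
    const double gt_per_s = 64.0;   /* CXL 3.0 / PCIe 6.0: 64 GT/s per lane */
    const int lanes = 16;           /* a common x16 link width              */

    double gbytes_per_lane = gt_per_s / 8.0;            /*  8 GB/s per lane  */
    double per_direction   = gbytes_per_lane * lanes;   /* raw x16 bandwidth */

    printf("raw per-direction bandwidth, x%d @ %.0f GT/s: %.0f GB/s\n",
           lanes, gt_per_s, per_direction);
    printf("compare CXL 2.0 at 32 GT/s: %.0f GB/s\n", per_direction / 2.0);
    return 0;
}
```

That works out to roughly 128 GB/s per direction for a x16 CXL 3.0 link, versus about 64 GB/s for the same width at CXL 2.0 speeds.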

With CXL 3.0, switches now support a much wider range of topologies. Whereas CXL 2.0 primarily allowed simple fan-out of CXL.mem devices, with the host accessing memory resources on external devices, CXL 3.0 permits spine/leaf and other switched topologies spanning one or multiple server racks. Furthermore, the theoretical limit on the number of nodes in a CXL 3.0 fabric, counting devices, hosts, and switches, has been raised to 4,096. These developments greatly expand the potential scale of CXL networks, from individual servers to entire server-rack infrastructures.

see more: ruijienetworks.com

 

 
