
WO2025090989A1 - NVMe over CXL - Google Patents


Info

Publication number
WO2025090989A1
Authority
WO
WIPO (PCT)
Prior art keywords
memory
cxl
nvme
controller
interface
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2024/053149
Other languages
French (fr)
Other versions
WO2025090989A9 (en)
Inventor
Yu-Ming Chang
Chuen-Shen Shung
Hsiang-Ting Cheng
William GERVASI
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wolley Inc
Original Assignee
Wolley Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wolley Inc
Publication of WO2025090989A1
Publication of WO2025090989A9
Pending legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 Addressing or allocation; Relocation
    • G06F 12/0223 User address space allocation, e.g. contiguous or non contiguous base addressing
    • G06F 12/023 Free address space management
    • G06F 12/0238 Memory management in non-volatile memory, e.g. resistive RAM or ferroelectric memory
    • G06F 12/0246 Memory management in non-volatile memory, e.g. resistive RAM or ferroelectric memory in block erasable memory, e.g. flash memory
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/14 Handling requests for interconnection or transfer
    • G06F 13/16 Handling requests for interconnection or transfer for access to memory bus
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/38 Information transfer, e.g. on bus
    • G06F 13/42 Bus transfer protocol, e.g. handshake; Synchronisation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/38 Information transfer, e.g. on bus
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2212/00 Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F 2212/72 Details relating to flash memory management
    • G06F 2212/7203 Temporary buffering, e.g. using volatile buffer or dedicated buffer blocks

Definitions

  • In a virtual HDM configuration, this architecture also provides predictable memory access latency. Because the host controls the cache, it has full visibility into the status of block I/O operations. Specifically, in page fault handling, the load/store instructions are not restarted until the required data is available in the device DRAM, which ensures short and consistent memory access latency. In contrast, if cache management were handled solely on the device side (without host visibility), flash memory operations, often with microsecond or even millisecond latencies, could result in unpredictable delays or even timeouts, negatively impacting system performance.
  • NVMe Over CXL can also function as a high-performance, generic storage solution.
  • the intrinsic NVMe capabilities of the proposed architecture provide the same core functionalities as a standard NVMe SSD.
  • NVMe Over CXL extends beyond conventional storage by offering a high-speed device buffer, which enables further optimization opportunities for the software.
  • page cache management is no longer restricted to a 4KB unit, as the data transfer between the host and the device can be optimized to a smaller granularity.
  • NVMe Over CXL thus offers flexibility in how data is retrieved from flash memory, and allows software to optimize data access in more granular units. This finer control over data movement allows for better utilization of DRAM resources on the host side and improves the cache hit rate. In essence, smaller management units than the typical 4KB-based page cache can result in more efficient use of both storage and memory resources and improve traditional storage system performance.
  • Fig. 8 illustrates enhanced data integrity for DRAM based temporary storage where an energy source is used to maintain data upon power failure.
  • the energy source 810 can be added to a system based on Fig. 3.
  • Fig. 9 illustrates enhanced data integrity for temporary storage by substituting DRAM with a non-volatile byte-addressable memory media type such as MRAM, PCRAM, etc. Again, we can use the system in Fig. 3 as an example. The DRAM is replaced by non-volatile byte-addressable memory 910.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A high-speed memory controller is disclosed. The high-speed memory controller comprises a host interface based on PCIe interface coupled to a host, a first memory interface coupled to a first memory corresponding to a non-volatile memory, a second memory interface coupled to a second memory corresponding to a volatile memory, an NVMe logic to support NVMe protocols using CXL protocols, and one or more other logics. The NVMe logic, said one or more other logics or both are configured to: support CXL protocols over the PCIe interface, wherein the CXL protocols comprise CXL.io and CXL.mem; provide NVMe functionality to connect CXL.io to the first memory; provide memory access via CXL.mem to the second memory; and provide local data path for data transfer between the first memory and the second memory without a need to include the host interface in the local data path.

Description

TITLE: NVMe Over CXL
INVENTOR(S): Yu-Ming Chang, Chuen-Shen Bernard Shung, Hsiang-Ting Cheng and William Gervasi
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] The present invention is a non-Provisional Application of and claims priority to U.S. Provisional Patent Application No. 63/546,209, filed on October 28, 2023. The U.S. Provisional Patent Application is hereby incorporated by reference in its entirety.
FIELD OF THE INVENTION
[0002] The subject innovation relates generally to memory systems and storage systems for computer applications. In particular, the present invention discloses a high-speed data transfer apparatus to allow efficient data transfer between non-volatile memory and volatile memory without the need to involve the host in the data transfer.
BACKGROUND AND RELATED ART
[0003] Current generation computers distinguish between long term non-volatile storage as a block oriented input/output (I/O) function and volatile temporary storage as a byte addressable direct access function. Long term storage is commonly implemented using flash memory technology as a solid state drive (SSD), with an electrical interface using Peripheral Component Interconnect Express (PCIe) and the protocol Non-volatile Memory Express (NVMe) over PCIe. Temporary storage is commonly implemented using Dynamic Random Access Memory (DRAM) which by its nature loses all content on power failure or system resets, and the common access protocol between the host processor Central Processing Unit (CPU) and the DRAM is Double Data Rate (DDR).
[0004] Data is accessed using NVMe in blocks such as 4KB or 64KB in size and transferred to or from DRAM locations specified in the NVMe command set. Fig. 1 illustrates the process of reading a 4KB block based on the NVMe protocol, where components above line 150 correspond to the host side and components below line 150 correspond to the device side. The submission queue 120 and completion queue 130 are the primary data structures used to store NVMe commands and their responses, respectively. These two queues are typically allocated in host memory 110 by default. Each NVMe command in the submission queue contains a pointer to a “data buffer”. For block read operations, this buffer serves as the destination for the data retrieved from the underlying storage device, while for block write operations, it stores the user data to be written to the storage.
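For readers less familiar with these NVMe data structures, the sketch below shows a simplified, illustrative layout of a submission queue entry and a completion queue entry in C. The field names and the 64-byte/16-byte sizes follow the general shape of the NVMe specification, but this is a minimal teaching model rather than a complete or authoritative definition.

```c
#include <stdint.h>
#include <stdio.h>

/* Simplified NVMe submission queue entry (real entries are 64 bytes).
 * For a block read, prp1 points at the data buffer that will receive the
 * data; in a conventional system this buffer lives in host DRAM. */
struct nvme_sq_entry {
    uint8_t  opcode;        /* e.g. read or write                       */
    uint8_t  flags;
    uint16_t command_id;    /* echoed back in the completion entry      */
    uint32_t namespace_id;
    uint64_t reserved;
    uint64_t metadata_ptr;
    uint64_t prp1;          /* data buffer pointer (PRP entry 1)        */
    uint64_t prp2;          /* second PRP entry or PRP list pointer     */
    uint64_t start_lba;     /* first logical block to read or write     */
    uint16_t num_blocks;    /* zero-based block count                   */
    uint8_t  other[14];     /* remaining command-specific fields        */
};
_Static_assert(sizeof(struct nvme_sq_entry) == 64, "SQ entry is 64 bytes");

/* Simplified NVMe completion queue entry (real entries are 16 bytes). */
struct nvme_cq_entry {
    uint32_t command_specific;
    uint32_t reserved;
    uint16_t sq_head;       /* how far the device has consumed the SQ   */
    uint16_t sq_id;
    uint16_t command_id;    /* matches the submitted command            */
    uint16_t status;        /* phase bit plus status code               */
};
_Static_assert(sizeof(struct nvme_cq_entry) == 16, "CQ entry is 16 bytes");

int main(void)
{
    printf("SQ entry: %zu bytes, CQ entry: %zu bytes\n",
           sizeof(struct nvme_sq_entry), sizeof(struct nvme_cq_entry));
    return 0;
}
```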
[0005] The user or host driver initiates the process in Step 1 by preparing an NVMe command and inserting it into the submission queue. Once a set of NVMe commands is inserted, the host can notify the NVMe device 140, as shown in Step 2. Communication between the host and the NVMe device takes place using PCIe transactions, such as MWr (memory write), MRd (memory read), and CplD (completion with data).
[0006] Upon receiving the notification, the NVMe controller actively issues MRd commands to retrieve NVMe commands from the host memory in Step 3, which allows the device to begin processing the commands internally. For example, in the case of a 4KB read command for an SSD controller, the controller performs address translation and retrieves page data from the NAND flash memory. Once the data is read from the NAND flash memory and ready within the controller, the SSD may need multiple MWr transactions (each with a payload smaller than 4KB) to write back the 4KB of data to the data buffer specified in the NVMe block read command, where the data buffer resides in the host memory.
[0007] When the processing is complete, the NVMe controller prepares the response and inserts it into the completion queue using MWr (memory write) in Step 5. The host is then notified to check the newly inserted entry in Step 6. After that, the host has to digest the completion entry (Step 7) and inform the device of its consumption (Step 8). For one 4KB block read command, NVMe needs a number of PCIe transactions to finish.
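The eight steps above can be condensed into a short host-side sketch. The code below is a self-contained simulation rather than a real driver: the doorbell writes are only marked by comments, the device's PCIe activity (Steps 3 to 6) is replaced by a stub that fills the data buffer and posts a completion, and all names are illustrative.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define QUEUE_DEPTH 16

struct sq_entry { uint16_t command_id; uint64_t buffer; uint64_t lba; };
struct cq_entry { uint16_t command_id; uint16_t status; };

static struct sq_entry sq[QUEUE_DEPTH];   /* submission queue (host memory) */
static struct cq_entry cq[QUEUE_DEPTH];   /* completion queue (host memory) */
static uint32_t sq_tail, cq_head;
static uint8_t  data_buffer[4096];        /* destination of the block read  */

/* Stand-in for the device side (Steps 3-6): a real SSD would fetch the
 * entry with a PCIe MRd, write 4KB of data into the buffer with MWr
 * transactions, and then post a completion entry. */
static void fake_device_process(uint32_t slot)
{
    memset((void *)(uintptr_t)sq[slot].buffer, 0xAB, 4096);
    cq[slot].command_id = sq[slot].command_id;
    cq[slot].status     = 0;
}

int main(void)
{
    /* Step 1: prepare the NVMe read command and place it in the SQ. */
    uint32_t slot = sq_tail;
    sq[slot] = (struct sq_entry){ .command_id = 7,
                                  .buffer = (uintptr_t)data_buffer,
                                  .lba = 1234 };
    sq_tail = (sq_tail + 1) % QUEUE_DEPTH;

    /* Step 2: ring the submission doorbell (an MMIO write on real HW). */
    fake_device_process(slot);                /* Steps 3-6, device side  */

    /* Step 7: consume the completion entry. */
    struct cq_entry done = cq[cq_head];
    printf("command %u done, status %u, first byte 0x%02X\n",
           (unsigned)done.command_id, (unsigned)done.status,
           (unsigned)data_buffer[0]);

    /* Step 8: advance the completion queue head doorbell (MMIO write). */
    cq_head = (cq_head + 1) % QUEUE_DEPTH;
    return 0;
}
```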
[0008] Once a block read NVMe command is finished and the data is in the data buffer in
DRAM, the CPU accesses the DRAM directly. DRAM access may be as small as a single byte, but more typically DRAM is accessed in 64-byte chunks which reflect the size of a line of a CPU cache. Memory architectures use this 64-byte granularity in defining optimal interface parameters such as I/O data bus width and burst length. An illustrative example of this is a DDR5 Dual In-Line Memory Module (DIMM) which has a base data width of 32 bits and a burst length of 16, which provides 64 bytes per exchange packet.
[0009] In this disclosure, we introduce an architecture solution referred to here as "NVMe Over CXL". This approach combines long term storage and temporary storage in an integrated subsystem that avoids the performance limits and high power of the traditional solutions.
BRIEF SUMMARY OF THE INVENTION
[0010] In this invention, we disclose a high-speed memory controller. The high-speed memory controller comprises a host interface based on PCIe interface coupled to a host, a first memory interface coupled to a first memory corresponding to a non-volatile memory, a second memory interface coupled to a second memory corresponding to a volatile memory, an NVMe (Non-Volatile Memory Express) logic to support NVMe protocols using CXL (Compute Express Link) protocols, and one or more other logics, wherein the NVMe logic, said one or more other logics or both are configured to: support CXL protocols over the PCIe interface, wherein the CXL protocols comprise CXL.io and CXL.mem; provide NVMe functionality to connect CXL.io to the first memory; provide memory access via CXL.mem to the second memory; and provide local data path for data transfer between the first memory and the second memory without a need to include the host interface in the local data path.
[0011] In one embodiment, the high-speed memory controller further comprises a first memory control logic coupled to the first memory interface and the NVMe logic, and a second memory control logic coupled to the second memory interface. In one embodiment, the local data path for the data transfer includes a direct path between the first memory control logic and the second memory control logic. In one embodiment, the local data path for the data transfer includes a direct path between the NVMe logic and the second memory control logic.
[0012] In one embodiment, the high-speed memory controller further comprises one or more other memory interfaces coupled to one or more other memories. In one embodiment, said one or more other memory interfaces comprise MRAM (Magnetoresistive Random-Access Memory), PCM (Phase-Change Memory), ReRAM (Resistive Random-Access Memory), or a combination thereof.
[0013] In one embodiment, the high-speed memory controller further comprises NVMe command queues and/or NVMe completion queues inside the second memory, wherein neither the NVMe command queues nor the NVMe completion queues are duplicated in host memory.
[0014] In one embodiment, a virtual memory space is presented to the host and the virtual memory space comprises the non-volatile memory and the volatile memory; the volatile memory serves as a cache for the virtual memory space to allow a total virtual memory capacity to be as large as a capacity of the volatile memory alone or a combined capacity of the non-volatile memory and the volatile memory. In one embodiment, a caching algorithm for the cache is located in host software, and cache management is carried out using NVMe protocols from the host to the memory controller.
[0015] In one embodiment, the NVMe logic, said one or more other logics, or both are further configured to access fine-grain data with a block size of less than 4KB using the second memory interface.
[0016] In this invention, a sub-system based on the high-speed memory controller is also disclosed. The sub-system comprises the high-speed memory controller, a non-volatile memory and a volatile memory.
[0017] In another invention, an alternative architecture for the high-speed memory subsystem is disclosed. The sub-system comprises a CXL controller with a first CXL interface and a first memory controller interface coupled to a first memory controller, an NVMe over CXL controller with a second CXL interface and a second memory interface, wherein the second memory interface is connected to a non-volatile memory, and a CXL switch fabric. The first memory controller is coupled to a volatile memory and the second memory interface is connected to a non-volatile memory. The CXL switch fabric comprises a host interface coupled to a host, wherein the host interface supports CXL protocols, a first fabric CXL interface coupled to the CXL controller via the first CXL interface; and a second fabric CXL interface coupled to the NVMe over CXL controller via the second CXL interface. The CXL controller, NVMe over CXL controller and the CXL switch fabric are configured to: provide NVMe functionality to connect CXL.io to the non-volatile memory; provide memory access via CXL.mem to the volatile memory; and provide local data path for data transfer via the NVMe over CXL controller, the CXL controller and the CXL switch fabric.
BRIEF DESCRIPTION OF THE DRAWINGS
[0018] Fig. 1 illustrates a typical host to device command sequence for NVMe using PCIe for read block data.
[0019] Fig. 2A illustrates a typical current system implementation using PCIe for long term storage and DDR DRAM for temporary storage.
[0020] Fig. 2B illustrates a newer generation system implementation using CXL for long term storage and DDR DRAM for temporary storage.
[0021] Fig. 3 illustrates one local implementation of our proposed architecture: an NVMe Over CXL memory module with Flash and DRAM subsystems, communicating with the host CPU using CXL on PCIe.
[0022] Fig. 4 illustrates one network implementation of our proposed architecture, where an NVMe Over CXL memory module with Flash and DRAM subsystems communicates with the host CPU using CXL on PCIe and a CXL Memory Module serves as a staging buffer.
[0023] Fig. 5 illustrates the logical command transfer from the host CPU to the NVMe logic using CXL.io, and the direct memory access to the DRAM on the NVMe Over CXL module using CXL.mem. Local block transfers directly between Flash and DRAM are performed by the NVMe Over CXL controller without requiring the use of the PCIe/CXL bus.
[0024] Fig. 6 illustrates a detailed view of one embodiment of the host to device NVMe protocol using the NVMe Over CXL architecture.
[0025] Fig. 7A illustrates the appearance of the NVMe logic in the host CPU I/O space which includes command functions such as read and write commands, but also feature and discovery configuration parameters, plus the appearance of the data buffers in the host CPU memory space.
[0026] Fig. 7B illustrates the ability of the NVMe Over CXL controller to use the local attached DRAM to also cache NVMe control information including command linked lists, reordering information, or other NVMe optimizations.
[0027] Fig. 8 illustrates enhanced data integrity for DRAM based temporary storage where an energy source is used to maintain data upon power failure.
[0028] Fig. 9 illustrates enhanced data integrity for temporary storage by substituting DRAM with a non-volatile byte-addressable memory media type such as MRAM, PCRAM, etc.
[0029] DETAILED DESCRIPTION OF THE INVENTION
[0030] The following description is of the best-contemplated mode of implementing the invention. This description is made for the purpose of illustrating the general principles of the invention and should not be taken in a limiting sense. The scope of the invention is best determined by reference to the appended claims.
[0031] One emerging approach to consolidating computer data is Compute Express Link (CXL), an open standard designed for high-speed, high-capacity connections between CPUs and various devices including memory, storage, and communications. CXL builds upon the serial PCIe interface and provides three distinct protocols: CXL.io, CXL.cache, and CXL.mem.
CXL.io is the superset of traditional PCIe and manages configuration, device discovery, RDMA (Remote Direct Memory Access), and I/O operations efficiently. CXL.cache is designed to minimize latency as it allows peripheral devices to coherently access and cache the host-attached memory. CXL.mem enables host CPUs to access device memory in a coherent manner using load/store commands, where the memory in a CXL device can be either volatile memory (e.g., DRAM) or non-volatile memory (e.g., Flash memory), which is implementation specific.
[0032] NVMe can be accessed over CXL using the CXL.io protocol as all necessary functions in PCIe are also included in CXL.io.
[0033] Temporary storage over DDR continues to be typical; however, temporary memory expansion over CXL is increasingly deployed as well. This expansion uses the CXL.mem protocol over PCIe to access DRAM in remote modules in a configuration known as a CXL Memory Module (CMM).
[0034] Universal Chiplet Interconnect Express (UCIe) is a variant of CXL optimized for bare chips mounted on a shared substrate. The concepts in this invention apply to UCIe as well as CXL.
[0035] Isolating SSD storage and DRAM into separate parts of the computer architecture, interconnected through PCIe or CXL, and subsequently linked via DDR, brings about notable performance constraints and high power usage as both I/O and DDR interfaces are engaged in every data transaction. To illustrate, even for an access request involving only a small fraction of a 4KB block, the complete 4KB dataset is transferred from the NAND device to the host memory, which results in an excess of redundant data movement. This, in turn, incurs costs in terms of power consumption, latency, and the utilization of buffer memory.
[0036] In this invention, we disclose the concept and architecture of NVMe Over CXL where the data can be accessed from either non-volatile (e.g., NAND flash memory) or volatile memories (e.g., DRAM) with a minimum amount of data movement in this system.
[0037] In a conventional system, the Host CPU 230 adopts PCIe for long-term storage, such as SSDs 210, and DDR DRAM for temporary storage 220, as shown in Fig. 2A, where the SSD 210 comprises PCIe Controller 212, NVMe Controller + SSD Controller 214 and Flash Memory 216. In this configuration, the host employs NVMe over PCIe to access data blocks and utilizes host DRAM as a staging buffer for subsequent access by applications.
[0038] As CXL technology emerges, PCIe-based devices can seamlessly retain their full functionality even when transitioning from PCIe to CXL, as CXL.io encompasses all the features of PCIe, as illustrated in Fig. 2B where the SSD 240 comprises CXL Controller 242, NVMe Controller + SSD Controller 244 and Flash Memory 246. In this setup, the host still utilizes NVMe for accessing block data and employs a data buffer in DRAM to both store the block data and offer byte-level access to applications.
[0039] In this disclosure, our architecture invention can be realized in multiple ways, such as a local or a network implementation. In a local implementation, this embodiment can optimize data movement, where all data is sourced from the same long-term storage, as shown in Fig. 3. In this setup, Host CPU 320 connects to an NVMe Over CXL Module 310 via a PCIe physical interface supporting the CXL protocol, where the NVMe Over CXL Module 310 consists of an NVMe Over CXL Controller 312, long-term storage (i.e., flash memory 314), and temporary storage (i.e., DRAM 316).
[0040] The NVMe Over CXL controller 312 is utilized to provide NVMe functionality via CXL.io to long-term storage (i.e., flash memory 314) and memory access through CXL.mem to temporary storage (i.e., DRAM 316) on the same device. In contrast to traditional systems that require the PCIe/CXL bus to transfer data between Flash and DRAM on the host, data transfers can be performed within the device through the NVMe Over CXL controller. These transfers typically involve, for example, 4KB to 64KB each. In the implementation of NVMe within a CXL system, the CXL.io protocol is used to issue NVMe commands from the host CPU to the NVMe Over CXL controller, which manages the flow of block data movement and utilizes CXL.io to issue responses to the host upon completion or in case of an error.
[0041] CXL provides memory expansion for the host CPU, and the CXL.mem protocol allows this expansion to appear to the CPU as byte or cache line addressable memory much like directly attached DDR would provide. The NVMe Over CXL controller supports the CXL.mem protocol directly, allowing access to the DRAM on the module in these smaller transfer sizes, specifically with 64-byte data payload in a Flow Control Unit (FLIT), or large FLITs of 256 bytes. Masks are supported which allow byte-level transactions to occur as well.
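To make the masking idea concrete, the toy model below shows how a byte-enable mask over a 64-byte payload lets a single CXL.mem write update just a few bytes of device DRAM. This is not the actual FLIT wire format (that layout is defined by the CXL specification); the struct and function names are invented for illustration, and the device DRAM is simulated with an ordinary array.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy model of a masked 64-byte memory write: only bytes whose mask bit is
 * set are updated, which is how byte-granularity stores can ride on a
 * 64-byte payload.  Names are illustrative, not taken from the CXL spec. */
struct mem_write_flit {
    uint64_t addr;          /* 64-byte-aligned target offset            */
    uint64_t byte_mask;     /* bit i set => payload byte i is valid     */
    uint8_t  payload[64];
};

static void apply_masked_write(uint8_t *dram, const struct mem_write_flit *w)
{
    for (int i = 0; i < 64; i++)
        if (w->byte_mask & (1ULL << i))
            dram[w->addr + i] = w->payload[i];
}

int main(void)
{
    static uint8_t device_dram[4096];
    struct mem_write_flit w = { .addr = 128, .byte_mask = 0x0F };  /* 4 bytes */
    memcpy(w.payload, "CXL!", 4);
    apply_masked_write(device_dram, &w);
    printf("%.4s\n", (const char *)&device_dram[128]);
    return 0;
}
```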
[0042] This invention offers a unique capability: the allocation of data buffers for transfers between Flash and DRAM directly on the same module, mapping those data buffers into the CPU address space for direct access via the CXL.mem protocol. Because the NVMe Over CXL controller features access ports for both Flash and DRAM, block data transfers between these storage types can be executed without transferring data over the PCIe port. A key innovation lies in using DRAM on the device side to serve as a data staging area. The major operations involve directing block data to be stored in the device DRAM. Consequently, this key step allows the host to selectively retrieve the necessary bytes from the block data using CXL.mem, eliminating the need to transfer large block data between the host and the device. This efficient architectural approach significantly reduces latency and complexity.
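The host-side usage pattern this paragraph describes can be sketched as follows. Everything here is a simulation under assumed names: device_dram stands in for the CXL.mem-exposed window backed by the module's DRAM, and nvme_read_into() stands in for an ordinary NVMe read whose data buffer pointer targets that window. The point of the sketch is the final step: once the block has been staged in device DRAM, the host touches only the few bytes it actually needs.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* On a real system this region would be the device DRAM window that the
 * NVMe Over CXL module exposes via CXL.mem in the CPU address space. */
static uint8_t device_dram[BLOCK_SIZE];

/* Stand-in for an ordinary NVMe block read whose data buffer pointer is an
 * address inside the device DRAM: the controller moves the block from
 * Flash to DRAM locally, and only the command and completion cross the
 * PCIe/CXL link. */
static void nvme_read_into(uint8_t *staging, uint64_t lba)
{
    (void)lba;
    memset(staging, 0x5A, BLOCK_SIZE);   /* pretend Flash -> DRAM transfer */
}

int main(void)
{
    nvme_read_into(device_dram, 1234);

    /* The application only needs a 16-byte record at offset 512.  With the
     * block staged in device DRAM, these bytes are fetched by ordinary
     * loads over CXL.mem instead of shipping the whole 4KB to host DRAM. */
    uint8_t record[16];
    memcpy(record, device_dram + 512, sizeof record);

    printf("first byte of record: 0x%02X\n", (unsigned)record[0]);
    return 0;
}
```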
[0043] In a network implementation, as illustrated in Fig. 4, the NVMe Over CXL module 410 can be paired with different CXL memory modules to realize our architectural concept and minimize overall data movement. In this configuration, the host CPU 420, CXL memory module 430, and NVMe Over CXL module 410 are interconnected via the CXL fabric 440. The host CPU can access the NVMe Over CXL module as a standard SSD and utilize the CXL memory module as a memory extension. As before, the NVMe Over CXL module 410 comprises NVMe over CXL Controller 412 and Flash Memory 414, and the CXL Memory Module 430 comprises CXL Controller 432, DDR Controller 434 and DRAM 436.
[0044] For the application, the memory (DRAM) on the CXL memory module can serve as a staging buffer, with block data from the NVMe Over CXL module being read into this buffer. Subsequently, the host CPU can retrieve the required bytes (much less than a 4KB block) from the CXL memory module using CXL.mem, and thus avoid the need for extensive block data transfers between the host and the modules. Data movement between the NVMe Over CXL module and the CXL memory module can leverage PCIe/CXL transactions to facilitate peer-to-peer (P2P) functionality. This process does not require any CPU computation on the host.
[0045] NVMe Over CXL implementations are free to choose any DRAM interface as well, not locked into standard implementations such as DDR, LPDDR, HBM, or others. This can include direct control of light-featured interfaces such as CXL Native Memory where the DRAM die includes few or no control features and typical functions such as refresh control and error correction are concentrated in the NVMe Over CXL controller. Additional performance and power enhancements are enabled by isolating the DRAM interface behind the NVMe Over CXL controller.
[0046] Data integrity may also be enhanced by providing mechanisms for extending the data storage period of the temporary data. This may be accomplished by adding energy sources to power the DRAM on power failure, or by substituting DRAM with non-volatile memory media such as MRAM, PCRAM, or other media types, without significantly changing the NVMe Over CXL controller architecture.
[0047] In this disclosure, we propose an NVMe Over CXL architecture, with the embodiment applicable to both local and network implementations. The required DRAM for data buffers for NVMe flash accesses are co-located on the device side. This arrangement empowers the NVMe Over CXL controller to facilitate data transfers between different media types without the need for extensive data transfers between the host and the device over the system buses.
[0048] Fig. 5A and Fig. 5B illustrate the logical command transfer from the host CPU to the NVMe logic via CXL.io, as well as the direct memory access to the DRAM on the NVMe Over CXL module using CXL.mem. Local block transfers within the device can be implemented in various ways. One example, as shown in Fig. 5A, establishes a direct data-transfer channel between Flash and DRAM without utilizing the PCIe/CXL bus. Another example, as depicted in Fig. 5B, builds a data channel between NVMe and DRAM. In this case, NVMe can interpret the command and determine whether the buffer address is located in the device DRAM. If the buffer address of the NVMe command resides in the device DRAM, the controller can instruct NVMe to retrieve or store data directly in the DRAM.
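One way to picture the decision just described is a small routing helper in the controller firmware: given the data buffer address carried by an NVMe command, decide whether the transfer can stay on the local Flash-DRAM path or must be DMAed across the link with CXL.io. The address range and every name below are invented for illustration; a real controller would derive the device DRAM window from its memory configuration.

```c
#include <stdint.h>
#include <stdio.h>

#define DEVICE_DRAM_BASE  0x200000000ULL   /* window backed by device DRAM */
#define DEVICE_DRAM_SIZE  0x040000000ULL   /* 1 GiB, made up for the example */

enum data_path {
    PATH_LOCAL,     /* buffer in device DRAM: Flash <-> DRAM inside the module */
    PATH_CXL_IO     /* buffer elsewhere (e.g. host DRAM): DMA over CXL.io      */
};

static enum data_path route_buffer(uint64_t buffer_addr, uint32_t length)
{
    uint64_t end = buffer_addr + length;
    if (buffer_addr >= DEVICE_DRAM_BASE &&
        end <= DEVICE_DRAM_BASE + DEVICE_DRAM_SIZE)
        return PATH_LOCAL;      /* Fig. 5A/5B style local transfer            */
    return PATH_CXL_IO;         /* conventional NVMe behaviour over the link  */
}

int main(void)
{
    printf("%d\n", route_buffer(0x200001000ULL, 4096));  /* 0: local path    */
    printf("%d\n", route_buffer(0x000001000ULL, 4096));  /* 1: over CXL.io   */
    return 0;
}
```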
[0049] In Fig. 5A, Host CPU 540 is coupled to NVMe Over CXL Module 512 via CXL interface. On a read operation of a local implementation as illustrated in Fig. 5A, for example, data may be transferred directly from Flash 520 to DRAM 530 through the Flash I/O of NVMe Over CXL controller 510, through internal circuits (e.g. Flash Controller Logic 526 and DRAM Controller Logic 534), and through the DRAM I/O of NVMe Over CXL controller 510.
Similarly, on a write operation, data may be transferred by the NVMe Over CXL controller from the DRAM I/O, through internal circuits, and through the Flash interface I/O. The NVMe Over CXL Module is connected to Host CPU 540 via NVMe Over CXL Controller interface. The NVMe Over CXL Module also comprises CXL.io Logic 522, NVMe 524, and CXL.mem Logic.
[0050] In Fig. 5B, the local data transfer is achieved between DRAM Control Logic 534 and NVMe module 524. NVMe module 524 then communicates with Flash Control Logic 526 to pass the data to/from DRAM Control Logic 534 via NVMe module 524.
[0051] NVMe operations often require a cache of commands and data in a scatter-gather collection of linked lists. These are typically loaded in the host-side DRAM. In the NVMe Over CXL system, the local DRAM on the device side may be used for the NVMe cache as well. Fig. 6 provides a detailed view of one embodiment of the host to device NVMe command protocol within the NVMe Over CXL architecture, where components above line 630 correspond to the Host side and components below line 630 correspond to the Device side. Essentially, there are no changes to the existing NVMe protocol, and the submission queue and the completion queue remain the primary data structures for transferring NVMe commands and responses, respectively.
[0052] A significant difference in NVMe Over CXL is the destination of the data buffer pointer within the submission queue entry, which now points to an address in the device-side DRAM 622. This arrangement facilitates the reading of block data into the device DRAM, which acts as a staging buffer (refer to Step 4 in Fig. 6). Subsequent access to DRAM can utilize CXL.mem to retrieve the necessary bytes as if they were regular memory data. This access flow minimizes the total data movement within a system, especially when only a very small amount of data out of the block data is required for the application.
[0053] Additionally, there are latency optimizations that can be supported. For example, the submission queue can be allocated in the memory of an NVMe Over CXL module, leveraging a concept similar to the Controller Memory Buffer (CMB), a standard feature defined in NVMe. When the NVMe command is ready on the device side, there is no need for the device to proactively issue memory-read transactions over the PCIe/CXL fabric to retrieve the command. Such an implementation further reduces the round-trip latency in NVMe command processing. NVMe Over CXL can be viewed as “CMB done right”, which optimizes the existing concept of the CMB and enhances performance through better integration with CXL technology.
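A minimal sketch of this optimization, with all names invented and the device memory simulated by a plain array: the submission queue lives in memory belonging to the NVMe Over CXL module (CMB-style), so the host fills entries with CXL.mem stores and the controller reads them locally instead of fetching them over the fabric.

```c
#include <stdint.h>
#include <stdio.h>

#define QUEUE_DEPTH 16

struct sq_entry { uint16_t command_id; uint64_t lba; uint64_t buffer; };

/* In a real system this array would live in the memory window exported by
 * the NVMe Over CXL module, i.e. physically in the module's own DRAM. */
static struct sq_entry device_resident_sq[QUEUE_DEPTH];
static uint32_t sq_tail;

/* Host side: enqueue a command by storing directly into device memory. */
static void host_submit(uint16_t id, uint64_t lba, uint64_t buffer)
{
    device_resident_sq[sq_tail] =
        (struct sq_entry){ .command_id = id, .lba = lba, .buffer = buffer };
    sq_tail = (sq_tail + 1) % QUEUE_DEPTH;
    /* doorbell write would follow here */
}

/* Device side: the command is already local, so no MRd over the fabric. */
static void device_fetch(uint32_t index)
{
    struct sq_entry cmd = device_resident_sq[index];
    printf("device sees command %u for LBA %llu\n",
           (unsigned)cmd.command_id, (unsigned long long)cmd.lba);
}

int main(void)
{
    host_submit(42, 1234, 0x200001000ULL);
    device_fetch(0);
    return 0;
}
```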
[0054] From the host's perspective, the standard NVMe protocol is depicted in Fig. 7A. The NVMe logic within the I/O space of the host CPU 710 supports command functions such as read and write commands and remains consistent. The I/O space of the host CPU 710 is used to access non-volatile memory module 720, which includes NVMe Control 722. The allocation for data buffers 732 on device memory also does not impact the host software functions in any way. Memory space of the host CPU 710 is used to access volatile memory module 730.
[0055] In the context of the latency optimization technique described, the NVMe data structure 734 is relocated to the device side, as shown in Fig. 7B. From the viewpoint of the host, the memory allocated on the device for both data buffers and the NVMe data structure behaves like regular memory. This means that the existing NVMe software or driver can be seamlessly employed without any modification.
[0056] As the data buffers are directly addressable via the CXL.mem protocol, once a read operation has transferred data into the local DRAM, the host CPU may access the data buffer directly as a range of addresses in the CPU memory space. The data buffers may be modified directly in this CPU memory space, and NVMe write operations may then commit the results from temporary to long-term storage by transferring the data from DRAM to Flash.
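A short sketch of this read-modify-writeback flow follows, with ordinary arrays simulating the flash and the CXL.mem-mapped staging buffer, and with hypothetical helper names standing in for the NVMe read and write commands.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096

static uint8_t flash_block[BLOCK_SIZE];      /* long-term storage              */
static uint8_t device_dram[BLOCK_SIZE];      /* staging buffer, CXL.mem-mapped */

/* Stand-ins for NVMe commands moving the whole block between flash and device DRAM. */
static void nvme_read_to_dram(void)   { memcpy(device_dram, flash_block, BLOCK_SIZE); }
static void nvme_write_from_dram(void){ memcpy(flash_block, device_dram, BLOCK_SIZE); }

int main(void)
{
    memset(flash_block, 0, BLOCK_SIZE);
    nvme_read_to_dram();                     /* stage the block in device DRAM       */

    device_dram[128] = 0xEE;                 /* direct store through CXL.mem mapping */

    nvme_write_from_dram();                  /* commit the modified block to flash   */
    printf("flash byte 128 is now 0x%02X\n", flash_block[128]);
    return 0;
}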
[0057] CXL is a non-deterministic protocol, which means that read or write operations may be delayed while the CXL controller completes other tasks. With NVMe Over CXL, this enables asynchronous usage such as issuing an NVMe read command followed by a direct memory access of the target data buffer of that read. The NVMe Over CXL controller is capable of detecting whether the read operation from Flash has already loaded the data for the address in the direct memory request. If that part of the Flash data has been transferred to DRAM, the direct memory access is allowed to complete even though the rest of the NVMe read command has not completed yet. If the requested direct memory access has not yet been satisfied by the Flash-to-DRAM transfer, the direct memory access is delayed and then completes once the transfer has delivered the requested data. Support for this performance enhancement may require modification of host I/O drivers, but it is feasible because the NVMe Over CXL controller is aware of all transfer status information.
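One possible way such tracking could be organized is sketched below (the chunk size, data structures, and function names are assumptions for illustration): the controller records which portions of an in-flight flash-to-DRAM transfer have arrived, so a direct memory access into the staging range can complete as soon as its portion is present, before the whole NVMe read finishes.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE  4096
#define CHUNK_SIZE  64
#define NUM_CHUNKS  (BLOCK_SIZE / CHUNK_SIZE)

struct inflight_read {
    uint64_t dram_base;              /* where the block is being staged          */
    uint8_t  arrived[NUM_CHUNKS];    /* 1 = this chunk already copied into DRAM  */
};

/* Flash delivers chunks progressively; the controller marks them as they land. */
static void chunk_arrived(struct inflight_read *r, unsigned chunk) { r->arrived[chunk] = 1; }

/* Can a direct CXL.mem access of [addr, addr+len) complete now, or must it stall? */
static bool mem_access_ready(const struct inflight_read *r, uint64_t addr, uint64_t len)
{
    if (addr < r->dram_base || addr + len > r->dram_base + BLOCK_SIZE)
        return true;                 /* outside the in-flight range: not blocked */
    uint64_t first = (addr - r->dram_base) / CHUNK_SIZE;
    uint64_t last  = (addr + len - 1 - r->dram_base) / CHUNK_SIZE;
    for (uint64_t c = first; c <= last; c++)
        if (!r->arrived[c])
            return false;            /* delay the access until this chunk lands  */
    return true;
}

int main(void)
{
    struct inflight_read r = { .dram_base = 0x10000 };
    chunk_arrived(&r, 0);
    chunk_arrived(&r, 1);
    printf("access to first 100 bytes ready? %d\n", mem_access_ready(&r, 0x10000, 100));
    printf("access to last chunk ready?      %d\n", mem_access_ready(&r, 0x10000 + 4032, 64));
    return 0;
}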
[0058] On top of the NVMe Over CXL architecture, a variety of applications can be deployed that utilize both memory and storage as the primary hardware resources in modern software systems. The proposed architecture offers flexibility, allowing it to be adopted as either memory or storage depending on specific software implementations and requirements.
[0059] To support NVMe Over CXL as a general-purpose memory solution, the software stack, including drivers and middleware, can virtualize this architecture by unifying diverse memory media, such as DRAM and flash memory, into a single virtual memory resource for the host. When NVMe Over CXL is configured in virtual HDM mode, existing user applications can run on top of it seamlessly. Heap memory can be allocated directly from this device, requiring no modifications to the applications themselves. In this configuration, the implementation must manage the device's DRAM and flash memory collaboratively: the DRAM functions as a cache for the underlying flash memory, enabling the creation of a virtual memory system that can be significantly expanded by utilizing the lower-cost flash memory.
[0060] One implementation of virtual HDM mode in NVMe Over CXL can intercept the page fault handling process using a software driver or middleware. Page faults are typically initiated by the hardware Memory Management Unit (MMU), and the software driver is responsible for managing the cache by transferring the required data between flash memory and device DRAM. When a page fault occurs, the driver must free up space in the device DRAM (the data being evicted may still have a valid copy in flash memory) and then initiate NVMe I/O operations to retrieve the necessary data from flash. This process is naturally supported by the NVMe Over CXL architecture. Once the required portion of data from the read block is loaded into the device DRAM, the page fault handling is completed, allowing the application to access the virtual memory seamlessly via standard load/store instructions.
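The following C sketch models this fault-handling flow with a tiny direct-mapped cache and simulated memories (the eviction policy, array sizes, and helper names are assumptions for illustration, not a prescribed implementation): the driver evicts a device-DRAM page, writing it back to flash when dirty, fills the freed slot from flash with an NVMe read, and then lets the faulting access proceed.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE   4096
#define DRAM_PAGES  4            /* device DRAM cache slots        */
#define FLASH_PAGES 64           /* virtual memory backed by flash */

static uint8_t flash[FLASH_PAGES][PAGE_SIZE];
static uint8_t dram[DRAM_PAGES][PAGE_SIZE];
static int     resident_vpage[DRAM_PAGES];     /* which virtual page each slot holds */
static int     dirty[DRAM_PAGES];

/* Stand-ins for NVMe commands moving whole pages between flash and device DRAM. */
static void nvme_read_page(int vpage, int slot)  { memcpy(dram[slot], flash[vpage], PAGE_SIZE); }
static void nvme_write_page(int vpage, int slot) { memcpy(flash[vpage], dram[slot], PAGE_SIZE); }

/* Page fault handler: make vpage resident in device DRAM and return its slot. */
static int handle_fault(int vpage)
{
    int slot = vpage % DRAM_PAGES;              /* direct-mapped victim choice         */
    if (resident_vpage[slot] == vpage)
        return slot;                            /* already resident: nothing to do     */
    if (resident_vpage[slot] >= 0 && dirty[slot])
        nvme_write_page(resident_vpage[slot], slot);   /* write back the dirty victim  */
    nvme_read_page(vpage, slot);                /* fill the freed slot from flash      */
    resident_vpage[slot] = vpage;
    dirty[slot] = 0;
    return slot;                                /* faulting load/store can now restart */
}

int main(void)
{
    for (int i = 0; i < DRAM_PAGES; i++) resident_vpage[i] = -1;
    memset(flash[17], 0x42, PAGE_SIZE);

    int slot = handle_fault(17);                /* application touches virtual page 17 */
    printf("page 17 resident in slot %d, first byte 0x%02X\n", slot, dram[slot][0]);
    return 0;
}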
[0061] When used in a virtual HDM configuration, NVMe Over CXL offers notable advantages in terms of performance and latency Quality of Service (QoS). Since the device DRAM is a high-speed but limited resource, optimizing its use is crucial to achieving high performance, and effective cache management is key when the DRAM is used as a cache. The NVMe Over CXL architecture allows the host to control the movement of block data between flash memory and device DRAM using standard NVMe commands. This host-managed approach can leverage the host's greater knowledge of application demands, improve cache hit rates, and enhance overall system efficiency.
[0062] Moreover, this architecture provides predictable memory access latency. Because the host controls the cache, it has full visibility into the status of block I/O operations. Specifically, in page fault handling, the load/store instructions are not restarted until the required data is available in the device DRAM, which ensures short and consistent memory access latency. In contrast, if cache management were handled solely on the device side (without host visibility), flash memory operations, often with microsecond or even millisecond latencies, could result in unpredictable delays or even timeouts, negatively impacting system performance.
[0063] In addition to its memory applications, NVMe Over CXL can also function as a high-performance, generic storage solution. The intrinsic NVMe capabilities of the proposed architecture provide the same core functionalities as a standard NVMe SSD. However, NVMe Over CXL extends beyond conventional storage by offering a high-speed device buffer, which enables further optimization opportunities for the software.
[0064] For instance, traditional file systems typically utilize host DRAM as a page cache to store recently accessed block data. File system operations, however, are not always confined to block-level access; many file read and write operations require smaller amounts of data, not aligned to 4KB and sometimes less than 4KB. One optimization example using NVMe Over CXL is that the file system can perform the high-bandwidth block data movement between the device DRAM and flash memory locally, and return only the necessary bytes to user space directly from the device DRAM, which significantly reduces the data traffic over the PCIe interface.
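The sketch below illustrates this optimization with simulated memories and byte counters (the helper names are hypothetical): the full 4KB block moves only between flash and device DRAM, while just the requested bytes cross the host link, making the traffic reduction visible.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096

static uint8_t flash_block[BLOCK_SIZE];
static uint8_t device_dram[BLOCK_SIZE];
static uint64_t bytes_over_link;       /* host <-> device traffic        */
static uint64_t bytes_device_local;    /* flash <-> device DRAM traffic  */

/* Device-local staging: high bandwidth, never touches the host link. */
static void stage_block(void)
{
    memcpy(device_dram, flash_block, BLOCK_SIZE);
    bytes_device_local += BLOCK_SIZE;
}

/* File system returns only the bytes the user asked for, straight from device DRAM. */
static void read_small(void *dst, uint64_t offset, uint64_t len)
{
    stage_block();
    memcpy(dst, device_dram + offset, len);    /* only `len` bytes cross the link */
    bytes_over_link += len;
}

int main(void)
{
    memset(flash_block, 0x7C, BLOCK_SIZE);
    uint8_t user_buf[100];

    read_small(user_buf, 1234, sizeof user_buf);   /* 100-byte request inside a 4 KB block */
    printf("link traffic: %llu bytes, device-local traffic: %llu bytes\n",
           (unsigned long long)bytes_over_link,
           (unsigned long long)bytes_device_local);
    return 0;
}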
[0065] In this scenario, page cache management is no longer restricted to a 4KB unit, as the data transfer between the host and the device can be performed at a smaller granularity. Furthermore, only the exact bytes requested by users are transmitted over the PCIe interface, avoiding unnecessary data traffic for entire blocks and reducing bandwidth consumption. This also eliminates the need for host DRAM to temporarily store large blocks of data that may not be immediately needed.
[0066] NVMe Over CXL thus offers flexibility in how data is retrieved from flash memory and allows software to optimize data access in more granular units. This finer control over data movement enables better utilization of DRAM resources on the host side and improves the cache hit rate. In essence, management units smaller than the typical 4KB-based page cache can result in more efficient use of both storage and memory resources and improve traditional storage system performance.
[0067] Fig. 8 illustrates enhanced data integrity for DRAM-based temporary storage, where an energy source is used to maintain data upon power failure. For example, energy source 810 can be added to a system based on Fig. 3.
[0068] Fig. 9 illustrates enhanced data integrity for temporary storage by substituting the DRAM with a non-volatile byte-addressable memory media type such as MRAM, PCRAM, etc. Again, the system in Fig. 3 can be used as an example, with the DRAM memory replaced by non-volatile byte-addressable memory 910.
[0069] The above description is presented to enable a person of ordinary skill in the art to practice the present invention as provided in the context of a particular application and its requirements. Various modifications to the described embodiments will be apparent to those with skill in the art, and the general principles defined herein may be applied to other embodiments. Therefore, the present invention is not intended to be limited to the particular embodiments shown and described, but is to be accorded the widest scope consistent with the principles and novel features herein disclosed. In the above detailed description, various specific details are illustrated in order to provide a thorough understanding of the present invention. Nevertheless, it will be understood by those skilled in the art that the present invention may be practiced without such specific details.
[0070] The invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described examples are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.

Claims

1. A high-speed memory controller comprising:
a host interface based on PCIe interface coupled to a host;
a first memory interface coupled to a first memory corresponding to a non-volatile memory;
a second memory interface coupled to a second memory corresponding to a volatile memory;
an NVMe (Non-Volatile Memory Express) logic to support NVMe protocols using CXL (Compute Express Link) protocols; and
one or more other logics;
wherein the NVMe logic, said one or more other logics or both are configured to:
support CXL protocols over the PCIe interface, wherein the CXL protocols comprise CXL.io and CXL.mem;
provide NVMe functionality to connect CXL.io to the first memory;
provide memory access via CXL.mem to the second memory; and
provide local data path for data transfer between the first memory and the second memory without a need to include the host interface in the local data path.
2. The high-speed memory controller of Claim 1, further comprising a first memory control logic coupled to the first memory interface and the NVMe logic, and a second memory control logic coupled to the second memory interface.
3. The high-speed memory controller of Claim 2, wherein the local data path for the data transfer includes a direct path between the first memory control logic and the second memory control logic.
4. The high-speed memory controller of Claim 2, wherein the local data path for the data transfer includes a direct path between the NVMe logic and the second memory control logic.
5. The high-speed memory controller of Claim 1, further comprising one or more other memory interfaces coupled to one or more other memories.
6. The high-speed memory controller of Claim 5, wherein said one or more other memory interfaces comprise MRAM (Magnetoresistive Random-Access Memory), PCM (Phase-Change Memory), ReRAM (Resistive Random-Access Memory), or a combination thereof.
7. The high-speed memory controller of Claim 1, further comprising NVMe command queues and/or NVMe completion queues inside the second memory, wherein neither the NVMe command queues nor the NVMe completion queues are duplicated in host memory.
8. The high-speed memory controller of Claim 1, wherein a virtual memory space is presented to the host and the virtual memory space comprises the non-volatile memory and the volatile memory.
9. The high-speed memory controller of Claim 8, wherein the volatile memory serves as a cache for the virtual memory space to allow a total virtual memory capacity to be as large as a capacity of the volatile memory alone or a combined capacity of the non-volatile memory and the volatile memory.
10. The high-speed memory controller of Claim 9, wherein a caching algorithm for the cache is located in host software, and cache management is carried out using NVMe protocols from the host to memory controller.
11. The high-speed memory controller of Claim 1, wherein the NVMe logic, said one or more other logics, or both are further configured to access fine-grain data with a block size of less than 4KB using the second memory interface.
12. A high-speed memory sub-system comprising:
a non-volatile memory;
a volatile memory;
a host interface based on PCIe interface coupled to a host;
a first memory interface coupled to a first memory corresponding to the non-volatile memory;
a second memory interface coupled to a second memory corresponding to the volatile memory;
an NVMe (Non-Volatile Memory Express) logic to support NVMe protocols using CXL (Compute Express Link) protocols; and
one or more other logics;
wherein the NVMe logic, said one or more other logics or both are configured to:
support CXL protocols over the PCIe interface, wherein the CXL protocols comprise CXL.io and CXL.mem;
provide NVMe functionality to connect CXL.io to the first memory;
provide memory access via CXL.mem to the second memory; and
provide local data path for data transfer between the first memory and the second memory without a need to include the host interface in the local data path.
13. A high-speed memory sub-system comprising:
a CXL (Compute Express Link) controller with a first CXL interface and a first memory controller interface coupled to a first memory controller, wherein the first memory controller is coupled to a volatile memory;
an NVMe (Non-Volatile Memory Express) over CXL controller with a second CXL interface and a second memory interface, wherein the second memory interface is connected to a non-volatile memory;
a CXL switch fabric, wherein the CXL switch fabric comprises:
a host interface coupled to a host, wherein the host interface supports CXL protocols;
a first fabric CXL interface coupled to the CXL controller via the first CXL interface; and
a second fabric CXL interface coupled to the NVMe over CXL controller via the second CXL interface;
wherein the CXL controller, NVMe over CXL controller and the CXL switch fabric are configured to:
provide NVMe functionality to connect CXL.io to the non-volatile memory;
provide memory access via CXL.mem to the volatile memory; and
provide local data path for data transfer via the NVMe over CXL controller, the CXL controller and the CXL switch fabric.