CN118312337A - A remote memory access method for multi-tenant data centers - Google Patents
- Publication number
- CN118312337A (application CN202410533363.5A)
- Authority
- CN
- China
- Prior art keywords
- command
- client computer
- client
- remote
- access
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/547—Remote procedure calls [RPC]; Web services
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/5038—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/505—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/544—Buffers; Shared memory; Pipes
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L47/00—Traffic control in data switching networks
- H04L47/10—Flow control; Congestion control
- H04L47/12—Avoiding congestion; Recovering from congestion
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/133—Protocols for remote procedure calls [RPC]
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
Description
Technical Field
The present invention belongs to the field of data center networks and, more specifically, relates to a remote memory access method suitable for multi-tenant data centers.
Background
The scale, diversity, and performance requirements of modern data center applications such as search and machine learning demand networks that support high bandwidth and high operation rates while achieving low tail latency. Because one-sided reads and writes offer latency and operation-rate advantages, remote direct memory access (RDMA) is an attractive choice for such distributed systems. Since these operations do not involve the remote CPU, the performance delivered is limited only by the hardware.
Industry-standard RDMA evolved from supercomputing environments, and deploying it in commercial data centers has been a persistent challenge. RDMA assumes a low-latency, reliable, ordered network. Supercomputing fabrics meet these expectations through lossless link-level flow control enforced by switches, which allows RDMA-capable network interface cards (NICs) to implement only simple congestion control and loss recovery schemes for reacting to congestion and packet loss. These fabrics are typically single-tenant, and RDMA's solutions for authorization, access control, failure recovery, and privacy reflect single-tenant expectations.
By contrast, modern hyperscale data centers are multi-tenant: uncoordinated, large-scale distributed applications share common infrastructure. A diverse, time-varying application mix causes network traffic patterns to change rapidly, and all of this requires strong privacy and authentication.
These requirements, however, run counter to the design choices of standard RDMA. Standard RDMA provides connections in hardware, a capability that suited early RDMA applications well but that fundamentally limits scale, isolation, performance, and fault tolerance. As modern serving and storage systems grow beyond 10,000 servers, per-connection hardware resources are easily exhausted. Congestion control algorithms need continual iteration based on deployment and application conditions, yet standard RDMA NICs (RNICs) embed significant parts of the congestion response in hardware, leaving almost no opportunity for tuning after deployment.
Because applications and infrastructure do not trust each other, multi-tenancy requires line-rate encryption together with application-level support for managing the provenance of encryption keys. Although modern RNICs offer encryption, practical challenges remain: encryption is tied to connections, applications must trust lower layers of the stack to manage keys, and security-related management operations such as encryption key rotation are not supported.
Summary of the Invention
To solve the above technical problems, the present invention provides a remote memory access method suitable for multi-tenant data centers. The method retains the performance advantages of standard one-sided RDMA (high bandwidth, high operation rate, and low latency) while also providing predictable tail performance, scalability, fault tolerance, isolation, security features, and the ability to iterate rapidly after deployment, so that it better meets the constraints of consolidated multi-tenant data centers.
To achieve the above object, the present invention is realized through the following technical solution:
The present invention is a remote memory access method suitable for multi-tenant data centers, comprising the following steps:
Step 1: The client computer executes an out-of-band remote procedure call to obtain the information needed to access the relevant remote memory region on the server computer.
Step 2: After obtaining the information required for remote access, the client computer generates a client command, writes it to the client computer's NIC, and sends the client command to the server computer.
Step 3: The server computer accesses the target memory according to the client command and, after the data access, reports the command's completion status back to the client computer.
Step 4: Based on the completion status it receives, the client computer determines whether failure recovery and congestion control are needed.
A further improvement of the present invention is that, in step 1, the necessary information about the relevant remote memory region that the client computer obtains through the out-of-band remote procedure call comprises an encryption key K_d and the architectural name RegionId of the memory to be accessed.
A further improvement of the present invention is that, in step 2, the NIC comprises a registered region table, command slots and a command slot table, a request window, and first-in-first-out (FIFO) arbiters. The registered region table is the memory translation table for remote memory access. The architectural name RegionId of the memory to be accessed indexes this table to yield the memory range on the computer hosting the NIC together with all associated metadata, namely the region key K_r, the PCIe address, the bounds, and the permissions. Every access to the server computer's memory corresponds to one memory region in the registered region table. A command slot represents one operation in flight and can be reused once the operation completes. The command slot table consists of a fixed number of slots in the NIC, each uniquely identified by its command slot number. When an operation completes, the NIC encodes the slot number in the completion, which tells the client computer's operating system which command has completed, since operations may complete out of order. The capacity of the request window is shared dynamically by a pair of FIFO arbiters: for each internal remote memory access service class, one FIFO arbiter selects among the ready commands in the registered region table.
A further improvement of the present invention is that, in step 2, after obtaining the necessary information for remote access, the client computer generates the client command as follows. Using that information, the client computer writes a client command to the NIC over PCIe, as an MMIO store to a command slot on the NIC. The command slot is enqueued to await service, which initiates the required RMA operation. Initiating the RMA operation comprises the following: for the command slot to enter service, the RMA operation waits until the request window in the NIC has enough capacity for it and then sends a request to the server computer. Once the RMA operation has entered service, the NIC debits the request window by the operation's actual size and signs the request with the encryption key K_d, which is supplied in the RMA operation command. The signed request is then sent onto the network.
A further improvement of the present invention is that, in step 3, the server computer accesses the target memory according to the client command as follows. When the request reaches the server computer's NIC, the NIC consults a fixed-size on-chip table (a table integrated into the chip that can be queried for the location of the memory named by a RegionId), looks up the key information for the RegionId contained in the request, derives the encryption key K_d, and authenticates the inbound packet. After successful authentication, the NIC accesses memory over PCIe and executes the client command.
A further improvement of the present invention is that, in step 3, the server computer reports the command's completion status to the client computer after the data access as follows. The server computer's NIC returns a completion indicator to the completion region designated by the client computer that issued the command. The completion indicator contains the command slot number of the completed command slot, the time taken to execute the operation, the time taken for the request to enter service at the initiator, and an operation status code. The status codes include successful completion, NACK (remote congestion), and TIMEOUT.
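The completion indicator described above can be pictured as a small record. The following is a minimal sketch, not the patent's actual layout: the field names, the microsecond units, the numeric enum values, and the assumption that the total delay includes the issue delay are all illustrative choices.

```python
from dataclasses import dataclass
from enum import Enum


class OpStatus(Enum):
    """Status codes named in the text; the numeric values are arbitrary."""
    SUCCESS = 0
    NACK_REMOTE_CONGESTION = 1
    TIMEOUT = 2


@dataclass(frozen=True)
class Completion:
    slot_number: int        # command slot whose operation finished
    total_delay_us: float   # time taken to execute the operation
    issue_delay_us: float   # time the request waited to enter service at the initiator
    status: OpStatus

    def post_issue_delay_us(self) -> float:
        # Assumption: total_delay includes issue_delay, so the difference is
        # the time spent after the request entered service.
        return self.total_delay_us - self.issue_delay_us


c = Completion(slot_number=7, total_delay_us=12.5, issue_delay_us=2.5,
               status=OpStatus.SUCCESS)
print(c.slot_number, c.status.name, c.post_issue_delay_us())
```

Separating the two delays is what lets software tell local queuing (a large issue delay) apart from delay incurred in the network or at the remote end.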
A further improvement of the present invention is that, in step 4, the client computer determines whether failure recovery is needed as follows. Using the execution time, the time for the request to enter service at the initiator, and the operation status code in the received completion indicator, the client computer produces precise, fast feedback in the form of an explicit failure code. The failure code tells the client computer which failure mode occurred and what action is appropriate when that failure is encountered.
A further improvement of the present invention is that, in step 4, the client computer determines whether congestion control is needed as follows. Using the execution time, the time for the request to enter service at the initiator, and the failure result in the received completion indicator, the client computer makes a policy decision that adjusts the rate at which it issues operations.
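The text leaves the congestion control policy to client software and does not specify an algorithm. As an illustration only, the sketch below shows one possible policy that adjusts the issue rate from the completion fields. The class name, the thresholds, the back-off factors, and the additive step are all hypothetical.

```python
class RateController:
    """Hypothetical software pacing policy; all constants are illustrative."""

    def __init__(self, rate_ops_per_s=1000.0, min_rate=10.0, max_rate=100_000.0,
                 target_remote_delay_us=50.0, step=100.0):
        self.rate = rate_ops_per_s
        self.min_rate = min_rate
        self.max_rate = max_rate
        self.target_remote_delay_us = target_remote_delay_us
        self.step = step

    def on_completion(self, status, total_delay_us, issue_delay_us):
        # Delay after the request entered service (assumes total includes issue).
        remote_delay_us = total_delay_us - issue_delay_us
        if status in ("NACK", "TIMEOUT"):
            # Explicit failure signal: back off sharply.
            self.rate = max(self.min_rate, self.rate * 0.5)
        elif remote_delay_us > self.target_remote_delay_us:
            # Success, but delay is above target: back off gently.
            self.rate = max(self.min_rate, self.rate * 0.75)
        else:
            # Healthy completion: probe for more bandwidth.
            self.rate = min(self.max_rate, self.rate + self.step)
        return self.rate


ctl = RateController()
r1 = ctl.on_completion("SUCCESS", total_delay_us=20.0, issue_delay_us=5.0)
r2 = ctl.on_completion("NACK", total_delay_us=30.0, issue_delay_us=5.0)
r3 = ctl.on_completion("SUCCESS", total_delay_us=120.0, issue_delay_us=10.0)
print(r1, r2, r3)  # 1100.0 550.0 412.5
```

Because the policy lives in software, it can be replaced after deployment without touching the NIC, which is the property the text emphasizes.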
A further improvement of the present invention is that remote access between the client computer and the server computer is bootstrapped through an out-of-band encrypted RPC, which includes a secure exchange of keys. Specifically, at memory registration time a region is assigned an architectural name RegionId for the memory to be accessed, and an application on a client computer specifies a region key K_r that protects the corresponding memory region. K_r is a 128-bit value from which the encryption key K_d is derived. K_d forms the basis of the security of RMA operations and protects individual transfers. Neither K_r nor K_d is ever sent over the network as part of a remote memory access operation or its response.
A further improvement of the present invention is that the derived encryption key K_d is used to generate message authentication codes, sign all protocol messages, and encrypt all data associated with a transfer. It is computed as follows:
K_d = AES(Key = K_r, Contents = Address_Initiator, PID_Initiator, OperationType).
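To illustrate why K_d is bound to one initiator and is not transferable, the sketch below derives a per-initiator key from K_r and the same three inputs, then signs and verifies a message with it. Python's standard library has no AES, so this sketch substitutes HMAC-SHA-256 (truncated to 128 bits) for the AES step in the formula. That substitution, the string encoding of the inputs, and all function names are assumptions made for illustration, not the patent's construction.

```python
import hashlib
import hmac


def derive_k_d(k_r: bytes, initiator_address: str, initiator_pid: int,
               op_type: str) -> bytes:
    """Stand-in for K_d = AES(Key=K_r, Contents=...), using HMAC-SHA-256."""
    if len(k_r) != 16:
        raise ValueError("K_r must be a 128-bit value")
    contents = f"{initiator_address}|{initiator_pid}|{op_type}".encode()
    return hmac.new(k_r, contents, hashlib.sha256).digest()[:16]


def sign(k_d: bytes, message: bytes) -> bytes:
    """Message authentication code over a protocol message."""
    return hmac.new(k_d, message, hashlib.sha256).digest()


def verify(k_d: bytes, message: bytes, tag: bytes) -> bool:
    return hmac.compare_digest(sign(k_d, message), tag)


k_r = bytes(range(16))                        # region key held with the region
# Handed to the client through the out-of-band RPC:
client_k_d = derive_k_d(k_r, "10.0.0.5", 4242, "READ")
request = b"RegionId=3 offset=0 length=2048"
tag = sign(client_k_d, request)

# The server never receives K_d on the wire; it re-derives it from K_r
# and the initiator fields carried in the request.
server_k_d = derive_k_d(k_r, "10.0.0.5", 4242, "READ")
other_k_d = derive_k_d(k_r, "10.0.0.5", 9999, "READ")  # a different process
print(verify(server_k_d, request, tag), verify(other_k_d, request, tag))
```

A key derived for a different address, process, or operation type fails verification, which is the sense in which the key cannot be handed to another party and reused.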
The present invention differs fundamentally from standard remote direct memory access (RDMA). The NIC hardware in the present invention is very simple and focuses only on fast, fixed-function primitives, unlike RDMA, which presents the illusion of unlimited resources. The present invention manages explicitly finite hardware resources in the client computer. To enable rapid iteration, the client computer also implements failure recovery, congestion control, and inter-operation ordering.
The remote memory access method of the present invention (SRMA) is connectionless, so the state held in the client computer's NIC does not grow with the number of endpoint pairs. Freed from connection semantics, the NIC can treat each operation as independent of every other, leaving the client computer's operating system to handle ordering between operations when it is needed. The method assigns per-operation retry and failure recovery to the client computer's operating system and offers simpler fail-fast behavior: the NIC guarantees timely completion and delivers fast, concise failure notifications directly to applications on the client computer.
Operations in the remote memory access method of the present invention are small, and requests underlie every transfer. Each SRMA operation transfers at most 4KB of payload, which enables isolation and prioritization. In addition, every data transfer must be requested: SRMA does not initiate a transfer unless the data is guaranteed to land in the SRAM of the NIC of the present invention. This NIC-enforced request design prevents large inconsistencies from forming, and combining requests with small operations enables fast-reacting congestion control and bounds network queuing.
The NIC assists the client computer's operating system with congestion control, in contrast to existing RDMA schemes that embed the congestion control algorithm in the NIC. The present invention provides per-operation delay information and explicit, fast failure notifications for congestion events, so that software can easily distinguish local from remote congestion and take precise action to minimize congestion, timeouts, and packet loss.
The present invention allocates resources according to service priority and sets explicit limits on how much work the client computer's operating system may submit at a time, turning a finite resource pool into an advantage. Combined with small operations, these limits prevent low-priority applications from monopolizing the network. Because there is no need to present the illusion of nearly unlimited resources, scheduling operations within an explicitly finite resource pool also simplifies the NIC.
All transfers in the present invention are encrypted and signed in the client computer's NIC using line-rate AES-GCM, a design consistent with the invention's connectionless architecture. The present invention lets applications on the client computer manage encryption keys directly, without extending trust to infrastructure software, and supports frequent encryption key rotation with minimal disruption to availability.
The beneficial effects of the present invention are as follows. The remote memory access method splits large operations into 4KB operations and implements congestion control and operation management in the client computer's operating system, so only 0.5 CPU cores are needed to drive 100Gbps line rate. As a result, the method provides predictable latency even under high load and in the presence of failures, ensures that high-priority applications are not affected by low-priority ones, allows congestion control to converge almost immediately to a fair bandwidth share when competing applications are present, and shortens the period of unavailability during encryption key rotation.
Brief Description of the Drawings
FIG. 1 is a flow chart of the remote memory access method of the present invention.
FIG. 2 is a schematic diagram of the client computer operating system of the present invention.
FIG. 3 is a schematic diagram of the main NIC components used by the remote memory access method of the present invention.
FIG. 4 is a schematic diagram of the end-to-end operation of the remote memory access method of the present invention when executing a 2KB read.
FIG. 5 is a schematic diagram of the generation and distribution of security keys in the remote memory access method of the present invention.
Detailed Description
Embodiments of the present invention are disclosed below with reference to the drawings. For clarity, many practical details are described along with them. It should be understood, however, that these practical details are not intended to limit the present invention; that is, in some embodiments of the present invention these practical details are not essential.
The present invention proposes a remote memory access method suitable for multi-tenant data centers that divides responsibilities between hardware and software.
As shown in FIG. 2, the client computer operating system in the remote memory access method of the present invention provides bulk transfer abstractions and congestion management. At the lowest layer, a command entry object manages a set of command slots and a memory region configured to receive completions, giving the familiar command/completion queue structure. Because command slots are memory-mapped registers in the NIC, the command entry handles the details of mmap() needed to map these registers into the memory of the application on the client computer. Built on the command entry, the next layer of the software stack, the command executor, supports transfers of arbitrary size. These transfers are transparently split into operations of at most 4KB and are paced by the congestion control software. Application software layers above the command executor are not responsible for chunking, pacing, or congestion control. Applications may choose to manage failures according to their own needs. Because the NIC of the present invention keeps no per-destination state, each command fully encodes all the metadata the operation requires.
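The chunking performed by the command executor can be sketched as follows. This is a minimal illustration, assuming simple sequential splitting; the function name and the (offset, size) tuple representation are hypothetical, and the text does not say whether chunks are aligned to 4KB boundaries.

```python
MAX_OP_BYTES = 4096  # each operation carries at most 4KB of payload


def chunk_transfer(offset: int, length: int) -> list[tuple[int, int]]:
    """Split a transfer into (offset, size) operations of at most 4KB each."""
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    ops = []
    pos, end = offset, offset + length
    while pos < end:
        size = min(MAX_OP_BYTES, end - pos)
        ops.append((pos, size))
        pos += size
    return ops


ops = chunk_transfer(0, 10_000)
print(ops)  # [(0, 4096), (4096, 4096), (8192, 1808)]
```

Each resulting operation would then be issued into its own command slot and paced independently, so the layers above never see the 4KB limit.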
As shown in FIG. 3, the NIC of the present invention comprises a registered region table, command slots and a command slot table, a request window, and first-in-first-out (FIFO) arbiters. The registered region table is the memory translation table for remote memory access. The architectural name RegionId of the memory to be accessed indexes this table to yield the memory range on the computer hosting the NIC together with all associated metadata, namely the region key K_r, the PCIe address, the bounds, and the permissions. Every access to the server computer's memory corresponds to one memory region in the registered region table. A command slot represents one operation in flight and can be reused once the operation completes. The command slot table consists of a fixed number of slots in the NIC, each uniquely identified by its command slot number. When an operation completes, the NIC encodes the slot number in the completion, which tells the client computer's operating system which command has completed, since operations may complete out of order. The capacity of the request window is shared dynamically by a pair of FIFO arbiters: for each internal remote memory access service class, one FIFO arbiter selects among the ready commands in the registered region table.
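The behavior of the command slot table (a fixed number of slots, completion identified by slot number, reuse after completion) can be modeled in software as follows. This is a toy model under stated assumptions: the class and method names are hypothetical, and the real table lives in NIC hardware rather than in host memory.

```python
from collections import deque


class CommandSlotTable:
    """Toy model of a fixed-size command slot table."""

    def __init__(self, num_slots: int):
        self._free = deque(range(num_slots))   # slot numbers available for use
        self._in_flight = {}                   # slot number -> command

    def issue(self, command):
        """Place a command in a free slot; return its slot number or None."""
        if not self._free:
            return None                        # all slots busy: caller must wait
        slot = self._free.popleft()
        self._in_flight[slot] = command
        return slot

    def complete(self, slot: int):
        """Handle a completion carrying `slot`; the slot becomes reusable."""
        command = self._in_flight.pop(slot)
        self._free.append(slot)
        return command


table = CommandSlotTable(num_slots=2)
s_a = table.issue("read A")      # slot 0
s_b = table.issue("read B")      # slot 1
s_c = table.issue("read C")      # None: no free slot
done = table.complete(s_b)       # B finishes before A (out of order)
s_c2 = table.issue("read C")     # reuses slot 1
print(s_a, s_b, s_c, done, s_c2)
```

Because the completion names the slot, software can match it to the right command even when operations finish in a different order from the one in which they were issued.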
When an application issues an operation to a command slot, the CPU performs an MMIO write over PCIe to the corresponding hardware register. The write causes the SRMA hardware pipeline to enqueue the operation in a hardware FIFO according to its service class, where it awaits scheduling. The operation at the head of the queue waits for capacity in the request window and then sends a request to the remote end, which is identified by the address information in the command itself.
Taking a 2KB read as an example, as shown in FIG. 1, the remote memory access method of the present invention suitable for multi-tenant data centers comprises the following steps:
Step 1: The client computer executes an out-of-band remote procedure call to obtain the information needed to access the relevant remote memory region on the server computer.
Before performing an RMA operation, the initiating client computer executes an out-of-band remote procedure call (out-of-band RPC) to obtain the information needed to access the target remote memory region. This information comprises the encryption key K_d, a cryptographically secure key bound to the initiating client that is hard to guess and non-transferable, and the RegionId, the architectural name of the memory to be accessed, which is established when the memory is registered on the server. All protocol messages are signed with a message authentication code generated from K_d. Similarly, all data is encrypted with K_d.
Step 2: After obtaining the information required for remote access, the client computer generates a client command, writes it to the client computer's NIC, and sends the client command to the server computer.
After obtaining the information required for remote access, the client initiates the RMA operation it needs, here a 2KB read. In this example, the 2KB read is issued by writing the client command over PCIe as an MMIO store to a command slot on the NIC, and the command slot is enqueued to await service. This initiates the RMA operation, and the request waits to execute according to SRMA's solicitation rules, that is, the rules governing the request window. To enter service, an operation needs 4KB of free space in the on-chip request window, an SRAM buffer that holds inbound payloads. Arbitration is always for 4KB regardless of the operation's size, which prevents small operations from taking the space that large operations need and thereby starving them.
Once the RMA operation has entered service, the NIC debits the request window by the operation's actual size, here 2KB, and signs the request that the RMA operation sends to the server computer with the encryption key K_d, which is supplied in the RMA operation command. The signed request is then sent onto the network.
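The admission rule described above (arbitrate for a full 4KB, then debit only the actual size) can be sketched as a small accounting model. The class name, the method names, and the 8KB capacity used in the example are assumptions. The text does not state when capacity is returned, so crediting it back on completion is also an assumption.

```python
class RequestWindow:
    """Toy accounting model of the on-chip request window."""

    ADMISSION_BYTES = 4096  # free space required to enter service, any op size

    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.available = capacity_bytes

    def try_admit(self, op_bytes: int) -> bool:
        if not 0 < op_bytes <= self.ADMISSION_BYTES:
            raise ValueError("operations carry between 1 byte and 4KB")
        if self.available < self.ADMISSION_BYTES:
            return False               # must wait, even if op_bytes would fit
        self.available -= op_bytes     # debit only the actual size
        return True

    def release(self, op_bytes: int) -> None:
        # Assumed: capacity is credited back once the payload has drained.
        self.available = min(self.capacity, self.available + op_bytes)


win = RequestWindow(capacity_bytes=8192)
a1 = win.try_admit(2048)   # True, 6144 bytes left
a2 = win.try_admit(4096)   # True, 2048 bytes left
a3 = win.try_admit(1024)   # False: fewer than 4096 bytes free
win.release(2048)          # the 2KB read completes, 4096 bytes free
a4 = win.try_admit(1024)   # True, 3072 bytes left
print(a1, a2, a3, a4, win.available)
```

The third call shows the anti-starvation effect: a 1KB operation is held back even though it would fit, so a waiting 4KB operation is not perpetually overtaken by smaller ones.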
Step 3: The server computer accesses the target memory as directed by the client command and, after the access, reports the command's completion status to the client computer. The server computer's NIC returns a completion indicator to the completion area designated by the client computer that issued the command. The completion indicator contains the number of the completed command slot, the time taken to execute the operation, the time the request took to enter service at the initiator, and an operation status code. Status codes include successful completion, NACK (remote congestion), and TIMEOUT.
In the common case, the request arrives at the server computer's NIC. The NIC consults a fixed-size on-chip table, which maps each memory RegionId to its location, to look up the RegionId carried in the request, derives the encryption key K_d, and verifies the inbound packet. After successful verification, the NIC accesses memory over PCIe and executes the client command: it reads the requested data over PCIe and streams the read completions back as individual network responses. Each response is encrypted and signed with the previously derived key. The responses then traverse the network and, because of adaptive routing, may arrive out of order.
On arrival at the initiator, each response is individually verified and decrypted, then streamed into the initiating host's memory as PCIe writes; the offset of each response is encoded in the inbound response. To cope with out-of-order responses, the NIC tracks byte arrivals and, in the error-free case, writes a successful operation completion to the initiating software once all bytes have arrived. The completion also carries hardware latency measurements: the time taken to execute the operation (total_delay) and the time the request took to enter service at the initiator (issue_delay).
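The byte-arrival tracking described above can be illustrated with a short sketch. It assumes each response carries its own offset and that no response is duplicated; the class name `ReadTracker` and the 512-byte response size are hypothetical choices for the example, and verification and decryption are omitted.

```python
class ReadTracker:
    """Tracks byte arrival for one read whose responses may arrive in any order."""

    def __init__(self, length: int) -> None:
        self.buffer = bytearray(length)  # stands in for the initiating host's memory
        self.remaining = length          # bytes still outstanding

    def on_response(self, offset: int, payload: bytes) -> bool:
        """Place a payload at its encoded offset; return True once all bytes are in."""
        self.buffer[offset:offset + len(payload)] = payload
        self.remaining -= len(payload)   # assumes no duplicate responses
        return self.remaining == 0


# A 2 KB read answered by four 512-byte responses that arrive out of order.
data = bytes(range(256)) * 8
tracker = ReadTracker(len(data))
arrival_order = [1024, 0, 1536, 512]
done_flags = [tracker.on_response(off, data[off:off + 512]) for off in arrival_order]
```

Only the last arrival reports completion, which is the point at which the NIC in the description would write the success completion together with total_delay and issue_delay.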
The description above covers the fault-free case, but the operation can encounter various failures. For example, under severe local congestion, such as a large number of queued operations, the read may never enter service at the initiator. When a request reaches an overloaded server, for instance one with an excessively long inbound request queue, the serving NIC may drop or "NACK" it. Finally, server-to-client responses may be dropped or delayed in the network.
Step 4: Based on the completion status it receives, the client computer decides whether fault recovery and congestion control are needed.
The invention produces precise, fast feedback in the form of explicit failure codes that tell the initiating software which failure mode occurred. On encountering such a failure, application software can take appropriate action, such as retrying the operation immediately.
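As a rough illustration of how client software might act on the completion indicator, the sketch below models the fields listed in step 3 and maps each status code to a follow-up action. The field names, the units, and the mapping itself (back off on NACK, retry at once on TIMEOUT) are assumptions made for the example; the description only states that software can take appropriate action, such as an immediate retry.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    SUCCESS = "success"
    NACK = "nack"        # remote congestion
    TIMEOUT = "timeout"


@dataclass(frozen=True)
class Completion:
    slot: int              # number of the completed command slot
    total_delay_us: float  # time taken to execute the operation (assumed unit)
    issue_delay_us: float  # time to enter service at the initiator (assumed unit)
    status: Status


def next_action(c: Completion) -> str:
    """Hypothetical client policy keyed on the explicit status code."""
    if c.status is Status.SUCCESS:
        return "done"
    if c.status is Status.NACK:
        return "retry_after_backoff"  # server signalled congestion
    return "retry_now"                # timeout: request or response was lost


actions = [
    next_action(Completion(3, 12.0, 1.5, Status.SUCCESS)),
    next_action(Completion(3, 40.0, 2.0, Status.NACK)),
    next_action(Completion(3, 500.0, 2.0, Status.TIMEOUT)),
]
```

Keeping issue_delay separate from total_delay also lets software distinguish queuing at its own NIC from delay elsewhere, which is the kind of input a congestion controller in step 4 would need.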
Because the invention targets deployments with mutually untrusting endpoints and an untrusted network fabric, security is built into every fundamental aspect of the remote memory access method.
As shown in FIG. 5, the security design of the invention is as follows:
As in standard RDMA, the invention uses an encrypted out-of-band RPC to bootstrap remote access between the client computer and the server computer; the RPC includes a secure exchange of keys. Specifically, at memory registration a region is assigned a RegionId, the architectural name of the memory to be accessed, and an application on the client computer specifies a region key K_r that protects the corresponding memory region. K_r is a 128-bit value from which the encryption key K_d is derived. K_d forms the basis of RMA security and protects individual transfers. Neither K_r nor K_d is ever sent over the network as part of a remote memory access operation or its response.
The derived encryption key K_d is used to generate message authentication codes, sign all protocol messages, and encrypt all data associated with a transfer. It is computed as follows:
K_d = AES(Key = K_r, Contents = Address_Initiator, PID_Initiator, OperationType).
The key computed by this function is specific to each initiating process (identified by its host address and PID) and to the operation type, yet it is rooted in a single shared K_r that is easily stored in the per-region table in NIC memory. When an application on the client computer allocates a command slot to initiate RMA, the SRMA driver stores the associated PID_Initiator in the slot's configuration. The SRMA hardware always includes PID_Initiator and RegionId in the request packet, so the server computer's NIC can derive the applicable K_d for every inbound request. Key derivation is equally easy for the server computer's operating system: it can compute K_d for any prospective initiator and send it to that initiator in the RPC response, without K_d being retained in any host or NIC table. The invention also adds a per-NIC incrementing message counter to the encryption process to protect key integrity and prevent replay. Crucially, security-related state on the NIC does not grow with the number of communicating endpoint pairs.
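The stateless derivation above can be sketched in a few lines. Python's standard library has no AES, so this sketch substitutes HMAC-SHA-256 (truncated to 128 bits) for the AES step; that substitution, the byte layout of the contents, the example addresses, and the names `derive_kd`, `sign`, `server_verify`, and `REGION_TABLE` are all assumptions for illustration and not the patent's construction.

```python
import hashlib
import hmac


def derive_kd(k_r: bytes, initiator_addr: str, initiator_pid: int, op_type: str) -> bytes:
    """Stand-in for K_d = AES(Key=K_r, Contents=address, PID, operation type).

    HMAC-SHA-256 replaces AES here purely so the sketch runs on the stdlib.
    """
    contents = f"{initiator_addr}|{initiator_pid}|{op_type}".encode()
    return hmac.new(k_r, contents, hashlib.sha256).digest()[:16]


def sign(k_d: bytes, message: bytes) -> bytes:
    """Message authentication code over a protocol message."""
    return hmac.new(k_d, message, hashlib.sha256).digest()


# Per-region table on the server NIC: RegionId -> K_r. No per-client state.
REGION_TABLE = {7: bytes(range(16))}


def server_verify(region_id: int, initiator_addr: str, initiator_pid: int,
                  op_type: str, message: bytes, tag: bytes) -> bool:
    """Re-derive K_d from the fields carried in the request and check the MAC."""
    k_r = REGION_TABLE.get(region_id)
    if k_r is None:
        return False
    k_d = derive_kd(k_r, initiator_addr, initiator_pid, op_type)
    return hmac.compare_digest(sign(k_d, message), tag)


# The client received K_d through the out-of-band RPC and signs its request.
client_kd = derive_kd(REGION_TABLE[7], "10.0.0.5", 4242, "READ")
request = b"read region=7 offset=0 len=2048"
tag = sign(client_kd, request)

accepted = server_verify(7, "10.0.0.5", 4242, "READ", request, tag)
forged = server_verify(7, "10.0.0.5", 9999, "READ", request, tag)  # wrong PID
```

The server table holds only one K_r per region, so the amount of security state stays fixed however many initiators connect, which matches the scaling property stated above.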
Consider a malicious user attempting to access a remote memory region owned by another tenant. Such a process-level attacker can easily guess a RegionId but, absent a root-level attack, will find it hard to obtain a K_d that matches the host and process. An attacker with full access to network links and switches can observe ciphertext in transit but cannot readily decrypt it without compromising a participating host. The invention also defends against replay attacks: although the method does not prevent a replayed read request from being admitted, the replay produces freshly encrypted ciphertext, so the attacker still needs the correct key K_d to decrypt the resulting response.
The security model does not change the behavior of a host that has suffered a root-level compromise. A root-level attacker can impersonate processes on the host and can therefore authenticate as any of its users by whatever means they choose. The main defense against this threat is faster detection, for example by rotating encryption keys frequently, which forces repeated authentication steps that can be logged and audited. The invention provides explicit support for encryption key rotation to this end.
If verification of a response fails, the NIC discards the response. By contrast, an inbound request that fails authentication is answered immediately with a failure notification signed with a well-known reserved key, which surfaces at the initiator as REMOTE_AUTHENTICATION_FAILURE. Silently dropping such requests would make the remote memory access method more robust against attack, since an attacker would face a timeout rather than a prompt negative response. Doing so, however, would penalize the non-attack case: the REMOTE_AUTHENTICATION_FAILURE error code is readily recognized as a side effect of encryption key rotation, so the recovery step is obvious to client software, whereas a dropped request appears as a timeout, which gives no immediate indication that a key rotation may have occurred.
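The recovery path after a key rotation might look like the following sketch: on REMOTE_AUTHENTICATION_FAILURE the client repeats the out-of-band RPC to fetch a fresh K_d and retries. The function `read_with_rekey`, its callback parameters, the string status values, and the retry limit are hypothetical; the toy server below only compares keys and stands in for the real NIC.

```python
AUTH_FAILURE = "REMOTE_AUTHENTICATION_FAILURE"


def read_with_rekey(issue_read, fetch_key, k_d, max_rekeys=1):
    """Issue a read; on an authentication failure, refetch K_d and retry.

    issue_read(k_d) -> status string; fetch_key() -> fresh K_d obtained through
    the out-of-band RPC. Both callbacks are placeholders for this sketch.
    """
    rekeys = 0
    while True:
        status = issue_read(k_d)
        if status != AUTH_FAILURE or rekeys == max_rekeys:
            return status, k_d
        k_d = fetch_key()  # explicit error code points straight at rotation
        rekeys += 1


# Toy server whose key was rotated while the client still holds the old one.
current_key = b"new-key-after-rotation"


def toy_issue_read(k_d):
    return "SUCCESS" if k_d == current_key else AUTH_FAILURE


status, key_in_use = read_with_rekey(toy_issue_read, lambda: current_key, b"stale-key")
```

Had the server dropped the request instead, the client would see only a timeout and could not tell a rotated key from a lost packet, which is the trade-off the paragraph above describes.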
The invention ensures that high-priority applications are not affected by low-priority ones, provides predictable latency even under high load and in the presence of failures, offers congestion control that converges almost immediately to a fair bandwidth share among competing applications, and shortens the period of unavailability during encryption key rotation.
The above is merely an embodiment of the invention and is not intended to limit it. Those skilled in the art may make various modifications and changes to the invention. Any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall fall within the scope of its claims.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410533363.5A CN118312337A (en) | 2024-04-30 | 2024-04-30 | A remote memory access method for multi-tenant data centers |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118312337A true CN118312337A (en) | 2024-07-09 |
Family
ID=91729903
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410533363.5A Pending CN118312337A (en) | 2024-04-30 | 2024-04-30 | A remote memory access method for multi-tenant data centers |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118312337A (en) |
-
2024
- 2024-04-30 CN CN202410533363.5A patent/CN118312337A/en active Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||