
CN104038569A - Trunking communication model based on address mapping - Google Patents

Trunking communication model based on address mapping

Info

Publication number
CN104038569A
Authority
CN
China
Prior art keywords
address mapping
communication
memory address
data
model based
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410284909.4A
Other languages
Chinese (zh)
Inventor
王少锋
施培任
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
IEIT Systems Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd
Priority to CN201410284909.4A
Publication of CN104038569A
Legal status: Pending

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a cluster communication model based on address mapping. The model consists of multiple host systems and multiple communication modules. Each host system includes a memory address mapping device and a communication module based on memory address mapping; the memory address mapping device performs memory address mapping between different host systems. Compared with the prior art, the model implements RDMA on top of memory address mapping, encapsulates data according to a dedicated communication protocol, forwards packets directly in the communication modules, and achieves zero-copy data transmission. It makes full use of the physical links between hosts, improves transmission performance, and reduces system overhead. Memory address mapping and data transmission are based on an NTB, and data is transferred using the PCI-E 2.0 protocol, so the model has good value for wide adoption.

Description

A Cluster Communication Model Based on Address Mapping

Technical field

The invention relates to the field of computer cluster systems and storage, and in particular to a cluster communication model based on address mapping.

Background

A computer cluster is a computer system in which a group of loosely integrated computers, connected through software and/or hardware, cooperate closely to complete computing work. In a sense, the cluster can be regarded as a single computer. The individual computers in a cluster are usually called nodes and are usually connected by a local area network, although other connection methods are possible. Clusters are typically used to improve on the computing speed and/or reliability of a single computer. In general, a cluster offers a much better price/performance ratio than a single computer such as a workstation or supercomputer.

The starting point of computer cluster technology is to provide computer systems with higher availability, manageability, and scalability. A cluster contains multiple servers with shared data storage, and the servers communicate with one another over an internal LAN. When a node fails, the applications it was running are automatically taken over by other nodes. In most modes, all nodes in the cluster share a common name, and a service running on any node in the cluster can be used by all network clients.

The nodes in a cluster need to exchange large amounts of data. Cluster systems usually rely on a high-speed LAN for this; common high-speed LANs include FDDI fiber ring networks, 100BASE-T Fast Ethernet, Gigabit Ethernet, and 10 Gbit/s Ethernet. Transmission media include optical fiber, Ethernet cabling, and so on. These networks use the traditional iSCSI and TCP/IP protocols, whose encapsulation is relatively complex: transmitting data requires many steps of protocol packet conversion, encapsulation, encoding and decoding, and verification. These protocols are better suited to transmitting data over long distances and in unreliable environments.

In a large-scale cluster system, however, hosts mostly transmit data over short distances in a reliable, stable environment, and they need high-bandwidth, high-reliability, high-capacity data links. With traditional high-speed transport protocols, the host spends a large amount of computation on packet encapsulation, encoding and decoding, and verification, so that link bandwidth and node computing capacity become the bottleneck for the responsiveness and performance of the whole cluster.

Summary of the invention

The purpose of the invention is to overcome the deficiencies of the prior art and provide a cluster communication model based on address mapping that meets the need for high-speed, highly reliable, high-capacity data transmission between the nodes of a large-scale cluster system.

The technical solution of the invention is realized as follows: the model consists of multiple host systems and multiple communication modules, and each host system contains a memory address mapping device and a communication module based on memory address mapping.

The memory address mapping device performs memory address mapping between different host systems.

The communication model is used for data communication between different host systems and implements a point-to-point data transfer model: the communication module parses the address information of each packet and transmits the packet directly to the destination host system, without forwarding through switches or similar devices. This yields a high-bandwidth, low-latency transfer model suited to large-scale data transmission among cluster systems.

The host systems are connected to one another through memory address mapping devices, such as an NTB (a PCI Express non-transparent bridge).

The communication model serves the communication needs between computer hosts and provides a unified interface for applications on the computer systems. The communication module builds on the address mapping device between the computers and reads and writes data by accessing the mapped memory, thereby completing data communication.

The memory address mapping device interconnects the host systems: each host performs memory address mapping with every other host. The hosts are peers, and access among all hosts is fully symmetric.
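
To make the full-mesh mapping concrete, the following minimal sketch enumerates the directed mappings that would be needed if every host maps memory with every other host. The function name and the host labels are hypothetical; the specification only states that each host maps with all others.

```python
from itertools import permutations


def full_mesh_windows(hosts):
    """Return every (owner, peer) pair that needs a mapped window.

    Each host exposes one window per peer, so n hosts need n * (n - 1)
    directed mappings. This is an illustration, not the patent's procedure.
    """
    return list(permutations(hosts, 2))


if __name__ == "__main__":
    for owner, peer in full_mesh_windows(["A", "B", "C"]):
        print(f"{owner} exposes a window to {peer}")
```

Because the number of mappings grows quadratically with the number of hosts, the per-peer window size directly bounds how many hosts a given memory budget can serve.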

The communication module also includes a dedicated communication protocol encapsulation. Relying on the efficient, reliable, and simple communication link between hosts, it reduces the protocol-processing overhead of data communication, raises the proportion of payload data in each transfer, and improves communication efficiency.
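
The "proportion of payload data" can be read as payload bytes divided by total bytes on the link. The sketch below computes that ratio; the header sizes used in the example are illustrative assumptions only, since the specification does not give the size of its header or of any header it is compared against.

```python
def payload_ratio(payload_bytes: int, header_bytes: int) -> float:
    """Fraction of transmitted bytes that carry application data."""
    if payload_bytes < 0 or header_bytes < 0:
        raise ValueError("sizes must be non-negative")
    total = payload_bytes + header_bytes
    if total == 0:
        raise ValueError("nothing is transmitted")
    return payload_bytes / total


if __name__ == "__main__":
    # Assumed numbers for illustration: a 1024-byte payload with a small
    # 8-byte header versus a larger 54-byte stack of headers.
    print(payload_ratio(1024, 8))
    print(payload_ratio(1024, 54))
```

A smaller fixed header raises the ratio most noticeably for small payloads, which is where per-packet overhead dominates.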

The hosts are connected to one another through an interconnect bus, and each host contains a heartbeat module.

The heartbeat module resets the power supply of the peer storage controller when it detects, over the interconnect bus, that the peer storage controller is in a fault state.
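
The specification does not say how a fault is detected. One common approach, assumed here, is a timeout on periodic heartbeat messages. In this sketch the class name, the `timeout` parameter, and the `reset_peer` callback are all hypothetical stand-ins for the power-reset action.

```python
class HeartbeatMonitor:
    """Track the last heartbeat per peer and reset peers that go silent."""

    def __init__(self, timeout, reset_peer):
        self.timeout = timeout        # seconds of silence treated as a fault (assumed policy)
        self.reset_peer = reset_peer  # callback standing in for the power reset
        self.last_seen = {}

    def beat(self, peer, now):
        """Record a heartbeat received from `peer` at time `now`."""
        self.last_seen[peer] = now

    def check(self, now):
        """Reset every peer whose last heartbeat is older than the timeout."""
        failed = [peer for peer, seen in self.last_seen.items()
                  if now - seen > self.timeout]
        for peer in failed:
            self.reset_peer(peer)
            # Give the peer a fresh grace period while it restarts.
            self.last_seen[peer] = now
        return failed
```

Passing the clock value in as `now` keeps the sketch deterministic; a real module would read a monotonic clock and receive heartbeats over the interconnect bus.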

The advantages of the invention are:

Compared with the prior art, the cluster communication model based on address mapping implements RDMA on top of memory address mapping, encapsulates data according to a dedicated communication protocol, forwards packets directly in the communication module, and achieves zero-copy data transmission. It makes full use of the physical links between hosts, improves transmission performance, and reduces system overhead. The invention uses an NTB for memory address mapping and data transmission and uses the PCI-E 2.0 protocol for data transfer, and therefore has good value for wide adoption.

Description of the drawings

FIG. 1 is a schematic structural diagram of a cluster communication model based on address mapping.

Embodiments

The cluster communication model based on address mapping of the invention is described in detail below with reference to the accompanying drawing.

As shown in FIG. 1, in the high-speed communication model of the invention, the nodes of the cluster are connected to one another through memory address mapping devices. Each host sets aside a region of its own memory as a memory access area for the other nodes, used for sending and receiving data.

A memory address mapping device (such as an NTB; the rest of this document uses "NTB" to stand for the memory address mapping device) enables memory access between different host systems and supports communication through interrupt registers and scratchpad registers. Using address translation, a region of local memory is mapped into the MMIO region of the remote NTB. When the remote host copies data into that MMIO region, the effect is the same as copying data directly into the corresponding local memory region, which is how data is sent. Doorbell interrupt registers and scratchpad registers are then used for interrupt notification and configuration exchange. Together these realize data transmission between different host systems.
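
The address translation step can be illustrated with a small software model. This is a sketch under stated assumptions, not driver code: `LocalMemory`, `NtbWindow`, and `mmio_write` are hypothetical names, and a `bytearray` stands in for physical memory.

```python
class LocalMemory:
    """Stand-in for one host's physical memory."""

    def __init__(self, size: int) -> None:
        self.data = bytearray(size)


class NtbWindow:
    """Hypothetical model of one translation window.

    Offsets [0, size) as seen by the remote host in its MMIO space
    correspond to local_base + offset in the local host's memory.
    """

    def __init__(self, memory: LocalMemory, local_base: int, size: int) -> None:
        if local_base < 0 or local_base + size > len(memory.data):
            raise ValueError("window does not fit in local memory")
        self.memory = memory
        self.local_base = local_base
        self.size = size

    def translate(self, offset: int) -> int:
        """Map a window offset to a local memory address."""
        if not 0 <= offset < self.size:
            raise IndexError("offset outside the mapped window")
        return self.local_base + offset

    def mmio_write(self, offset: int, payload: bytes) -> None:
        """Remote-side write: the bytes land in the local memory region."""
        if offset < 0 or offset + len(payload) > self.size:
            raise IndexError("write crosses the window boundary")
        start = self.translate(offset)
        self.memory.data[start:start + len(payload)] = payload
```

In this model the remote side never sees `local_base`; it only writes at window offsets, which mirrors the idea that the bridge hides one host's address space from the other.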

As shown in FIG. 1, the NTB maps a region of memory for every node in the cluster (the mapping window requested is generally smaller than 1 MB). These regions form accessible memory windows that can be used to receive data. Together with the NTB's MMIO region, they form the send and receive buffers.

During system startup, when PCI devices are enumerated, a separate memory region is allocated for each host's NTB mapping range, and a one-to-one correspondence is recorded for later lookup. A separate interrupt handler is set up for each node's NTB device. The interrupt offset identifies which node an interrupt belongs to, and the node is added to a hash table.
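
The bookkeeping described for startup can be sketched with ordinary dictionaries acting as the hash table. The `PeerEntry` fields, the contiguous window layout, and the rule that the interrupt offset equals the peer's index are assumptions made for illustration; the specification does not define them.

```python
from dataclasses import dataclass


@dataclass
class PeerEntry:
    node_id: int
    window_base: int  # start of the memory reserved for this peer (assumed layout)
    window_size: int
    irq_offset: int   # interrupt offset that identifies this peer (assumed numbering)


def build_peer_table(node_ids, window_size, base=0):
    """Reserve one window per peer and index the entries two ways."""
    by_irq = {}
    by_node = {}
    for index, node_id in enumerate(sorted(node_ids)):
        entry = PeerEntry(node_id, base + index * window_size, window_size, index)
        by_irq[entry.irq_offset] = entry  # used by the interrupt handler
        by_node[node_id] = entry          # used by the send path
    return by_irq, by_node


def peer_for_interrupt(by_irq, irq_offset):
    """Resolve which node raised an interrupt from its offset."""
    return by_irq[irq_offset].node_id
```

Keeping both indexes lets the receive path go from interrupt offset to node and the send path go from node number to window with a single lookup each.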

At the end of initialization, the node sends a LINK UP signal to all registered NTB device nodes and then checks the state of the other nodes through the LINK status register. Once the corresponding node has also sent its LINK UP signal, the connection setup procedure begins: the two sides exchange data through the scratchpad registers and establish the connection.
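
The link-up sequence reads naturally as a small state machine. In the sketch below the state names, the `PeerLink` class, and the use of a dictionary for scratchpad registers are hypothetical; only the ordering (both sides signal LINK UP, then configuration is exchanged) comes from the text.

```python
from enum import Enum


class LinkState(Enum):
    DOWN = "down"
    LOCAL_UP = "local_up"    # this node signalled LINK UP, the peer has not
    READY = "ready"          # both sides signalled, setup may begin
    CONNECTED = "connected"  # configuration exchanged via scratchpad


class PeerLink:
    def __init__(self):
        self.local_up = False
        self.remote_up = False
        self.scratchpad = {}  # register index -> value, a stand-in for hardware
        self.state = LinkState.DOWN

    def _refresh(self):
        if self.state is LinkState.CONNECTED:
            return
        if self.local_up and self.remote_up:
            self.state = LinkState.READY
        elif self.local_up:
            self.state = LinkState.LOCAL_UP
        else:
            self.state = LinkState.DOWN

    def signal_local_up(self):
        self.local_up = True
        self._refresh()

    def observe_remote_up(self):
        """Called when the link status register shows the peer is up."""
        self.remote_up = True
        self._refresh()

    def exchange_config(self, register, value):
        """Write one configuration value; allowed only once both sides are up."""
        if self.state in (LinkState.DOWN, LinkState.LOCAL_UP):
            raise RuntimeError("peer has not signalled LINK UP yet")
        self.scratchpad[register] = value
        self.state = LinkState.CONNECTED
```

Refusing configuration writes before both sides are up is a design choice of this sketch; it prevents a node from treating stale scratchpad contents as a completed handshake.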

After initialization is complete, the node can exchange data with other nodes through the communication module.

After an application on the host node submits a transfer request, the data is split according to the MTU and encapsulated, with a header added to each packet. The hash table is looked up by node number, and the packet is copied into the MMIO region of the corresponding node. The send-complete flag is then set to 1, and finally the corresponding interrupt is triggered to notify the remote node to receive the data.
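
The send path above can be sketched as follows. The header layout (source node, sequence number, fragment count, payload length), the position of the flag byte, and the `peers` dictionary are assumptions for illustration; the specification does not define the packet format. A `bytearray` stands in for the mapped MMIO window and a callback stands in for the doorbell interrupt.

```python
import struct

# Assumed header: source node, sequence number, fragment count, payload length.
HEADER = struct.Struct("<HHHH")
FLAG_OFFSET = 0  # assumed location of the send-complete flag
DATA_OFFSET = 1  # packets are copied in right after the flag byte


def fragment(src_node, data, mtu):
    """Split data into packets of at most `mtu` bytes, header included."""
    if mtu <= HEADER.size:
        raise ValueError("MTU must leave room for payload")
    chunk = mtu - HEADER.size
    pieces = [data[i:i + chunk] for i in range(0, len(data), chunk)] or [b""]
    return [HEADER.pack(src_node, seq, len(pieces), len(piece)) + piece
            for seq, piece in enumerate(pieces)]


def send_message(peers, dest_node, src_node, data, mtu):
    """Copy each packet into the destination's window and ring its doorbell.

    `peers` maps a node number to (window, ring_doorbell), playing the role
    of the hash table lookup in the text.
    """
    window, ring_doorbell = peers[dest_node]
    if DATA_OFFSET + mtu > len(window):
        raise ValueError("window too small for one MTU-sized packet")
    for packet in fragment(src_node, data, mtu):
        if window[FLAG_OFFSET] != 0:
            raise BlockingIOError("receiver has not consumed the previous packet")
        window[DATA_OFFSET:DATA_OFFSET + len(packet)] = packet
        window[FLAG_OFFSET] = 1  # mark the packet as fully written
        ring_doorbell()          # notify the remote node
```

Writing the flag only after the packet body is in place is what lets the receiver trust the window contents once it sees the flag set.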

When the remote node services the interrupt, it wakes the processing thread for the corresponding memory-mapped window. The thread first checks whether the send-complete flag is set to 1; if so, the packet has been fully transferred and can be received. It parses the header, extracts the payload from each received packet, and merges the payloads. When the whole transfer is complete, the data is handed to the corresponding application layer, which marks the end of this transmission.
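
A matching receive-side sketch is given here. It assumes the same hypothetical header layout (source node, sequence number, fragment count, payload length) and flag byte position as a sender would use; none of these details are specified in the text. The `Reassembler` class and its method name are illustrative.

```python
import struct

# Assumed header: source node, sequence number, fragment count, payload length.
HEADER = struct.Struct("<HHHH")
FLAG_OFFSET = 0
DATA_OFFSET = 1


class Reassembler:
    """Collect fragments per source node and return complete messages."""

    def __init__(self):
        self.parts = {}  # source node -> {sequence number: payload}

    def on_interrupt(self, window):
        """Handle one doorbell: returns the full message or None."""
        if window[FLAG_OFFSET] != 1:
            return None  # the packet is not completely written yet
        src, seq, total, length = HEADER.unpack_from(window, DATA_OFFSET)
        start = DATA_OFFSET + HEADER.size
        payload = bytes(window[start:start + length])
        window[FLAG_OFFSET] = 0  # release the window for the next packet
        fragments = self.parts.setdefault(src, {})
        fragments[seq] = payload
        if len(fragments) < total:
            return None  # still waiting for more fragments
        del self.parts[src]
        return b"".join(fragments[i] for i in range(total))
```

The returned bytes correspond to the data that would be handed to the application layer; clearing the flag before returning lets the sender reuse the window.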

The cluster communication model based on address mapping of the invention is simple and convenient to build and can be implemented as shown in the accompanying drawing.

Apart from the technical features described in this specification, everything is known to those skilled in the art.

Claims (6)

1. A cluster communication model based on address mapping, characterized in that it consists of multiple host systems and multiple communication modules, each host system comprising a memory address mapping device and a communication module based on memory address mapping;
the memory address mapping device is used for memory address mapping between different host systems;
the communication model is used for data communication between different host systems and implements a point-to-point data transfer model, in which the communication module parses packet address information and transmits data directly to the destination host system without forwarding by switches or similar devices, realizing a high-bandwidth, low-latency data transfer model suitable for large-scale data transmission among cluster systems.
2. The cluster communication model based on address mapping according to claim 1, characterized in that the multiple host systems are connected through memory address mapping devices, such as an NTB;
the communication model serves the communication needs between hosts and provides a unified interface for applications on the computer systems, and the communication module, based on the address mapping device between the computers, reads and writes data by accessing the mapped memory, thereby completing data communication.
3. The cluster communication model based on address mapping according to claim 1, characterized in that the memory address mapping device interconnects the multiple host systems, each host performing memory address mapping with all other hosts; the hosts are peers, and access among all hosts is fully symmetric.
4. The cluster communication model based on address mapping according to claim 1, characterized in that the communication module further comprises a dedicated communication protocol encapsulation which, based on an efficient, reliable, and simple communication link between hosts, reduces the protocol-processing overhead of data communication, raises the proportion of payload data, and improves data communication efficiency.
5. The cluster communication model based on address mapping according to claim 1, characterized in that the hosts are connected to one another through an interconnect bus and each host comprises a heartbeat module.
6. The cluster communication model based on address mapping according to claim 1, characterized in that the heartbeat module is configured to reset the power supply of the peer storage controller when it detects, over the interconnect bus, that the peer storage controller is in a fault state.
CN201410284909.4A 2014-06-24 2014-06-24 Trunking communication model based on address mapping Pending CN104038569A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410284909.4A CN104038569A (en) 2014-06-24 2014-06-24 Trunking communication model based on address mapping

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410284909.4A CN104038569A (en) 2014-06-24 2014-06-24 Trunking communication model based on address mapping

Publications (1)

Publication Number Publication Date
CN104038569A true CN104038569A (en) 2014-09-10

Family

ID=51469156

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410284909.4A Pending CN104038569A (en) 2014-06-24 2014-06-24 Trunking communication model based on address mapping

Country Status (1)

Country Link
CN (1) CN104038569A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104486365A (en) * 2014-09-28 2015-04-01 浪潮(北京)电子信息产业有限公司 Communication method and system between double controls
CN104486365B (en) * 2014-09-28 2018-02-02 浪潮(北京)电子信息产业有限公司 Communication means and system between dual control
CN107852349A (en) * 2016-03-31 2018-03-27 慧与发展有限责任合伙企业 Transaction management for multi-node cluster
US10783021B2 (en) 2016-03-31 2020-09-22 Hewlett Packard Enterprise Development Lp Transaction management for multi-node clusters
CN107852349B (en) * 2016-03-31 2020-12-01 慧与发展有限责任合伙企业 System, method and storage medium for transaction management of multi-node cluster
CN107329917A (en) * 2017-06-26 2017-11-07 郑州云海信息技术有限公司 A kind of data transmission method and device
CN107480080A (en) * 2017-07-03 2017-12-15 香港红鸟科技股份有限公司 Zero-copy data stream based on RDMA
CN107480080B (en) * 2017-07-03 2021-03-23 深圳致星科技有限公司 Zero-copy data stream based on RDMA
CN119847958A (en) * 2024-12-31 2025-04-18 浪潮电子信息产业股份有限公司 Communication method, system, device, product and storage medium for computer system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140910