CN104038569A - Trunking communication model based on address mapping - Google Patents
- Publication number
- CN104038569A (application number CN201410284909.4A)
- Authority
- CN
- China
- Prior art keywords
- address mapping
- communication
- memory address
- data
- model based
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Data Exchanges In Wide-Area Networks (AREA)
Description
Technical Field
The invention relates to the field of computer cluster systems and storage, and specifically to a cluster communication model based on address mapping.
Background Art
A computer cluster is a computer system in which a group of loosely integrated computers, connected by software and/or hardware, cooperates closely to carry out computing work. In a sense, the group can be regarded as a single computer. The individual computers in a cluster are usually called nodes and are usually connected through a local area network, although other interconnects are possible. Clusters are typically used to improve on the computing speed and/or reliability of a single computer. In general, a cluster offers a much better price-performance ratio than a single computer such as a workstation or supercomputer.
The aim of computer cluster technology is to provide computer systems with higher availability, manageability, and scalability. A cluster contains multiple servers that share data storage space and communicate with one another over an internal LAN. When a node fails, the applications it was running are automatically taken over by other nodes. In most modes, all nodes in the cluster share a common name, and a service running on any node in the cluster can be used by all network clients.
The nodes in a cluster need to exchange large amounts of data. Cluster systems usually rely on a high-speed LAN for this; common high-speed LANs include FDDI fiber ring networks, 100BASE-T Fast Ethernet, Gigabit Ethernet, and 10 Gbit/s Ethernet. Transmission media include optical fiber and Ethernet. The protocols used are conventional iSCSI and TCP/IP, whose encapsulation is relatively complex: transmitting data requires many steps of packet conversion, encapsulation, encoding/decoding, and verification. These protocols are better suited to transmitting data over long distances and in unreliable environments.
In a large-scale cluster system, however, hosts mostly transfer data over short distances in a reliable, stable environment, and they need high-bandwidth, highly reliable, high-capacity data links. With conventional high-speed transport protocols, the host spends a large amount of computation on packet encapsulation, encoding/decoding, and verification, and link bandwidth and node computing capacity become the bottleneck for the responsiveness and performance of the whole cluster.
Summary of the Invention
The purpose of the invention is to overcome the deficiencies of the prior art and provide a cluster communication model based on address mapping that meets the need for high-speed, highly reliable, high-capacity data transfer between the nodes of a large-scale cluster system.
The technical solution of the invention is implemented as follows: its structure consists of multiple host systems and multiple communication modules, and each host system contains a memory address mapping device and a communication module based on memory address mapping;
The memory address mapping device maps memory addresses between different host systems;
The communication model handles data communication between different host systems and implements a point-to-point data transfer model: the communication module parses the address information in each packet and transfers the packet directly to the destination host system, without forwarding through switches or similar equipment. This yields a high-bandwidth, low-latency transfer model suited to large-scale data transfer between cluster systems.
The multiple host systems are connected through a memory address mapping device such as an NTB (PCIe non-transparent bridge);
The communication model serves the communication needs between computer hosts and provides a unified interface for applications on the computer system. The communication module relies on the address mapping device between the computers and reads and writes data by accessing the mapped memory, thereby completing data communication.
The memory address mapping device interconnects the multiple host systems, and each host maps memory addresses with every other host. The hosts are peers, and every host has fully symmetric access to every other host.
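Because every host maps memory with every other host, the number of mapping windows grows quadratically with the cluster size. The following Python sketch only enumerates the pairings to make that concrete; the host names and the `mapping_pairs` helper are hypothetical and do not come from the patent.

```python
from itertools import permutations


def mapping_pairs(hosts):
    """Return every ordered (owner, peer) pair.

    Each pair stands for one window that `owner` exposes to `peer`.
    """
    return list(permutations(hosts, 2))


hosts = ["host0", "host1", "host2", "host3"]  # hypothetical node names
pairs = mapping_pairs(hosts)

# Count how many windows each host has to expose (one per peer).
windows_per_host = {
    h: sum(1 for owner, _ in pairs if owner == h) for h in hosts
}
```

For n hosts this gives n(n-1) directed windows in total and n-1 windows per host, which is worth keeping in mind when sizing the per-peer memory regions.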
The communication module also includes a dedicated communication protocol encapsulation. Because the link between hosts is efficient, reliable, and simple, this reduces the protocol-processing overhead of data communication, raises the proportion of payload in the transmitted data, and improves communication efficiency.
The hosts are connected to one another through an interconnect bus, and each host contains a heartbeat module.
The heartbeat module resets the power supply of the peer storage controller when it detects, through the interconnect bus, that the peer storage controller is in a fault state.
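The patent does not say how the heartbeat module decides that a peer has failed. A minimal sketch, assuming a simple timeout on the last heartbeat seen, could look like the following; `HeartbeatMonitor`, the controller names, and the `reset_log` list (a placeholder for the actual power reset) are all hypothetical.

```python
class HeartbeatMonitor:
    """Toy heartbeat tracker; timestamps are passed in explicitly."""

    def __init__(self, peers, timeout):
        self.timeout = timeout
        self.last_seen = {peer: 0.0 for peer in peers}
        self.reset_log = []  # placeholder for real power-reset actions

    def beat(self, peer, now):
        """Record that a heartbeat from `peer` arrived at time `now`."""
        self.last_seen[peer] = now

    def check(self, now):
        """Return peers whose last heartbeat is older than the timeout."""
        failed = [p for p, t in self.last_seen.items() if now - t > self.timeout]
        for peer in failed:
            self.reset_log.append(peer)  # assumption: reset is issued here
            self.last_seen[peer] = now   # give the peer a fresh grace period
        return failed


monitor = HeartbeatMonitor(["ctrl_a", "ctrl_b"], timeout=3.0)
monitor.beat("ctrl_a", now=1.0)
monitor.beat("ctrl_b", now=1.0)
monitor.beat("ctrl_a", now=4.0)   # ctrl_b stays silent
failed = monitor.check(now=5.0)
```

Passing `now` explicitly keeps the sketch deterministic; a real module would read a clock and exchange heartbeats over the interconnect bus.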
The advantages of the invention are:
Compared with the prior art, the cluster communication model based on address mapping of the invention implements RDMA on top of memory address mapping, encapsulates data according to a dedicated communication protocol, forwards packets directly in the communication module, and achieves zero-copy data transfer. It makes full use of the physical links between computer hosts, improves transfer performance, and reduces system overhead. The invention uses an NTB for memory address mapping and data transfer, and uses the PCI-E 2.0 protocol for data transfer; it therefore has good value for wider adoption.
Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of a cluster communication model based on address mapping.
Detailed Description
The cluster communication model based on address mapping of the invention is described in detail below with reference to the accompanying drawings.
As shown in FIG. 1, in the high-speed communication model of the invention, the nodes of the cluster are connected to one another through memory address mapping devices. Each host sets aside a region of its own memory as a memory access area for the other nodes, used for sending and receiving data.
A memory address mapping device (such as an NTB; the rest of this description uses NTB to stand for the memory address mapping device) enables memory access between different host systems and supports communication through interrupt registers and scratchpad registers. Using address translation, a region of local memory is mapped into the MMIO region of the remote NTB. When the remote host copies data into that MMIO region, it is in effect copying data directly into the corresponding local memory region, which is how data is sent. Doorbell interrupt registers and scratchpad registers are used together for interrupt notification and configuration exchange. In this way data is transferred between different host systems.
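The mechanism described above can be imitated in a single process to show the idea. In this sketch a `memoryview` over a local `bytearray` plays the role of the remote host's MMIO window, an integer plays the doorbell register, and a list plays the scratchpad registers. `SimulatedNtb` is a hypothetical stand-in, not a real NTB driver API.

```python
class SimulatedNtb:
    """Toy in-process model of one NTB port (not a real driver interface)."""

    def __init__(self, window_size=4096, scratchpads=4):
        self.local_memory = bytearray(window_size)      # region exposed by the local host
        self.peer_mmio = memoryview(self.local_memory)  # what the remote host sees as MMIO
        self.scratchpad = [0] * scratchpads             # small shared configuration registers
        self.doorbell = 0                               # pending interrupt bits

    def ring_doorbell(self, bit):
        """Remote side: raise one doorbell bit."""
        self.doorbell |= 1 << bit

    def take_doorbell(self):
        """Local side: read and clear all pending doorbell bits."""
        pending, self.doorbell = self.doorbell, 0
        return pending


ntb = SimulatedNtb()

# Remote host: write through the MMIO view, publish the length, ring the doorbell.
message = b"hello"
ntb.peer_mmio[0:len(message)] = message
ntb.scratchpad[0] = len(message)
ntb.ring_doorbell(0)

# Local host: on the doorbell, read the data straight out of its own memory.
pending = ntb.take_doorbell()
received = bytes(ntb.local_memory[:ntb.scratchpad[0]]) if pending & 1 else b""
```

The point of the sketch is that the write through `peer_mmio` lands directly in `local_memory` with no intermediate copy or protocol stack, which mirrors the address-translation behaviour the text describes.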
As shown in FIG. 1, the NTB maps a region of memory for every node in the cluster (the mapping window requested is generally smaller than 1 MB). This forms an accessible memory window that can be used to receive data. Together with the NTB's MMIO region, it forms the send and receive buffers.
During system startup, when PCI devices are enumerated, a separate memory region is allocated for each host's NTB mapping range, and a one-to-one correspondence is recorded for later lookup. A separate interrupt handler is registered for each node's NTB device. The interrupt offset determines which node an interrupt belongs to, and the association is added to a hash table.
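The bookkeeping built during enumeration can be sketched with two dictionaries acting as the hash tables: one keyed by node number and one keyed by interrupt offset. The `PeerEntry` record, the 64 KiB window size, and the way interrupt offsets are assigned are illustrative assumptions; the patent only states that the window is generally below 1 MB.

```python
from dataclasses import dataclass


@dataclass
class PeerEntry:
    node_id: int
    window: bytearray   # memory region allocated for this peer
    irq_offset: int     # interrupt offset assigned to this peer


def build_peer_tables(peer_ids, window_size=64 * 1024, irq_base=0):
    """Allocate one window per peer and index it by node id and by IRQ offset."""
    by_node = {}
    by_irq = {}
    for index, node_id in enumerate(sorted(peer_ids)):
        entry = PeerEntry(node_id, bytearray(window_size), irq_base + index)
        by_node[node_id] = entry
        by_irq[entry.irq_offset] = entry
    return by_node, by_irq


by_node, by_irq = build_peer_tables([7, 3, 5])

# An interrupt arriving at offset 1 is attributed to the second-lowest node id.
source_of_irq_1 = by_irq[1].node_id
```

Keeping both indexes pointing at the same `PeerEntry` means the send path (lookup by node number) and the interrupt path (lookup by offset) always agree on which window belongs to which peer.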
At the end of initialization, a LINK UP signal is sent to all registered NTB device nodes, and the state of each other node is then determined from the LINK status register. Once the corresponding node has also sent a LINK UP signal, the connection setup procedure begins: data is exchanged through the scratchpad registers and the connection is established.
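The handshake can be pictured as two sides that each raise a link-up bit and publish a few parameters, with the connection proceeding only once both bits are set. The sketch below models the LINK status register and the scratchpad registers as plain Python containers; the class, the function names, and the exchanged parameters (`window_size`, `mtu`) are hypothetical.

```python
class SimulatedLink:
    """Registers shared by the two ends of one NTB link (toy model)."""

    def __init__(self):
        self.link_up = {"a": False, "b": False}  # stand-in for the LINK status register
        self.scratchpad = {}                     # stand-in for the scratchpad registers


def announce(link, side, params):
    """Raise LINK UP for one side and publish its parameters via the scratchpad."""
    link.link_up[side] = True
    link.scratchpad[side] = dict(params)


def try_connect(link, side):
    """Return the peer's parameters once both sides are up, otherwise None."""
    peer = "b" if side == "a" else "a"
    if link.link_up[side] and link.link_up[peer]:
        return link.scratchpad[peer]
    return None


link = SimulatedLink()
announce(link, "a", {"window_size": 65536, "mtu": 1024})
before = try_connect(link, "a")   # peer has not signalled yet
announce(link, "b", {"window_size": 65536, "mtu": 1024})
after = try_connect(link, "a")    # both sides are up, parameters are available
```

A real implementation would poll or be interrupted on link-state changes rather than call `try_connect` by hand.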
After initialization, the node can transfer data to and from other nodes through the communication module.
After an application on the host node submits a transfer request, the data is split into MTU-sized pieces, each piece is encapsulated with a packet header, the hash table is looked up by node number, and the packets are copied into the MMIO region of the corresponding node. The send-complete flag is then set to 1, and finally the corresponding interrupt is triggered to notify the remote node to receive the data.
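The send steps can be sketched as follows. The patent does not specify the header layout or where the send-complete flag lives, so this example assumes an 8-byte header (source node, sequence number, total packet count, payload length) and a flag in byte 0 of the window; `packetize`, `send`, and the dictionaries standing in for the hash table and doorbells are hypothetical.

```python
import struct

# Assumed header: source node, sequence number, total packets, payload length.
HEADER = struct.Struct("<HHHH")
FLAG_OFFSET = 0   # assumed position of the send-complete flag
DATA_OFFSET = 1   # packets start right after the flag byte


def packetize(src_node, data, mtu):
    """Split data into MTU-sized chunks and prepend a header to each."""
    chunks = [data[i:i + mtu] for i in range(0, len(data), mtu)] or [b""]
    return [HEADER.pack(src_node, seq, len(chunks), len(chunk)) + chunk
            for seq, chunk in enumerate(chunks)]


def send(peer_windows, doorbells, src_node, dst_node, data, mtu=8):
    """Copy packets into the destination window, set the flag, ring the doorbell."""
    window = peer_windows[dst_node]            # hash-table lookup by node number
    packets = packetize(src_node, data, mtu)
    if DATA_OFFSET + sum(len(p) for p in packets) > len(window):
        raise ValueError("transfer does not fit in the mapped window")
    offset = DATA_OFFSET
    for packet in packets:
        window[offset:offset + len(packet)] = packet
        offset += len(packet)
    window[FLAG_OFFSET] = 1                    # send-complete flag
    doorbells[dst_node] = True                 # stand-in for triggering the interrupt
    return offset


windows = {2: bytearray(256)}
doorbells = {2: False}
end = send(windows, doorbells, src_node=1, dst_node=2, data=b"0123456789abcdefXYZ")
first_header = HEADER.unpack_from(windows[2], DATA_OFFSET)
```

Setting the flag only after all packets are written, and ringing the doorbell last, follows the ordering given in the text, so the receiver never sees a partially written transfer as complete.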
When the remote node services the interrupt, it wakes the processing thread for the corresponding memory-mapped window. The thread first checks whether the send-complete flag is set to 1; if so, the packets have been fully transferred and can be received. It parses each header, extracts the payload from each received packet, and merges the payloads; once the whole transfer has arrived, the data is handed to the corresponding application layer. This completes the transfer.
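A matching receive-side sketch is shown below. It makes the same assumptions as a plausible sender would (an 8-byte header holding source node, sequence number, total packet count, and payload length, plus a send-complete flag in byte 0); none of this layout is specified in the patent, and the thread wake-up is left out so that the example stays deterministic.

```python
import struct

# Assumed header: source node, sequence number, total packets, payload length.
HEADER = struct.Struct("<HHHH")
FLAG_OFFSET, DATA_OFFSET = 0, 1


def receive(window):
    """Return (source node, reassembled data), or None if nothing is ready."""
    if window[FLAG_OFFSET] != 1:
        return None                       # sender has not finished writing
    parts = {}
    offset = DATA_OFFSET
    total = None
    while total is None or len(parts) < total:
        src, seq, total, length = HEADER.unpack_from(window, offset)
        offset += HEADER.size
        parts[seq] = bytes(window[offset:offset + length])  # extract the payload
        offset += length
    window[FLAG_OFFSET] = 0               # mark the window as consumed
    return src, b"".join(parts[i] for i in range(total))


# Build a window by hand the way a sender might have filled it.
window = bytearray(64)
blob = HEADER.pack(1, 0, 2, 5) + b"hello" + HEADER.pack(1, 1, 2, 6) + b" world"
window[DATA_OFFSET:DATA_OFFSET + len(blob)] = blob

not_ready = receive(window)   # flag still 0, so nothing is delivered
window[FLAG_OFFSET] = 1
result = receive(window)      # flag set, payloads are merged in sequence order
```

Clearing the flag after reassembly is a design choice of this sketch so the same window can be reused for the next transfer; the patent does not describe how the window is released.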
The cluster communication model based on address mapping of the invention is simple and convenient to build and can be implemented as shown in the accompanying drawings.
Apart from the technical features described in the specification, everything is known to those skilled in the art.
Claims (6)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201410284909.4A CN104038569A (en) | 2014-06-24 | 2014-06-24 | Trunking communication model based on address mapping |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201410284909.4A CN104038569A (en) | 2014-06-24 | 2014-06-24 | Trunking communication model based on address mapping |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN104038569A (en) | 2014-09-10 |
Family
ID=51469156
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201410284909.4A Pending CN104038569A (en) | 2014-06-24 | 2014-06-24 | Trunking communication model based on address mapping |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN104038569A (en) |
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN104486365A (en) * | 2014-09-28 | 2015-04-01 | 浪潮(北京)电子信息产业有限公司 | Communication method and system between double controls |
| CN107329917A (en) * | 2017-06-26 | 2017-11-07 | 郑州云海信息技术有限公司 | A kind of data transmission method and device |
| CN107480080A (en) * | 2017-07-03 | 2017-12-15 | 香港红鸟科技股份有限公司 | Zero-copy data stream based on RDMA |
| CN107852349A (en) * | 2016-03-31 | 2018-03-27 | 慧与发展有限责任合伙企业 | Transaction management for multi-node cluster |
| CN119847958A (en) * | 2024-12-31 | 2025-04-18 | 浪潮电子信息产业股份有限公司 | Communication method, system, device, product and storage medium for computer system |
Cited By (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN104486365A (en) * | 2014-09-28 | 2015-04-01 | 浪潮(北京)电子信息产业有限公司 | Communication method and system between double controls |
| CN104486365B (en) * | 2014-09-28 | 2018-02-02 | 浪潮(北京)电子信息产业有限公司 | Communication means and system between dual control |
| CN107852349A (en) * | 2016-03-31 | 2018-03-27 | 慧与发展有限责任合伙企业 | Transaction management for multi-node cluster |
| US10783021B2 (en) | 2016-03-31 | 2020-09-22 | Hewlett Packard Enterprise Development Lp | Transaction management for multi-node clusters |
| CN107852349B (en) * | 2016-03-31 | 2020-12-01 | 慧与发展有限责任合伙企业 | System, method and storage medium for transaction management of multi-node cluster |
| CN107329917A (en) * | 2017-06-26 | 2017-11-07 | 郑州云海信息技术有限公司 | A kind of data transmission method and device |
| CN107480080A (en) * | 2017-07-03 | 2017-12-15 | 香港红鸟科技股份有限公司 | Zero-copy data stream based on RDMA |
| CN107480080B (en) * | 2017-07-03 | 2021-03-23 | 深圳致星科技有限公司 | Zero-copy data stream based on RDMA |
| CN119847958A (en) * | 2024-12-31 | 2025-04-18 | 浪潮电子信息产业股份有限公司 | Communication method, system, device, product and storage medium for computer system |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| CN105516191B (en) | System based on the FPGA 10,000,000,000 net Transmission Control Protocol unloading engine TOE realized | |
| JP5539463B2 (en) | High performance Ethernet networking utilizing existing Fiber Channel fabric HBA technology | |
| CN105579987B (en) | The port general PCI EXPRESS | |
| US11949589B2 (en) | Methods and systems for service state replication using original data packets | |
| US11902184B2 (en) | Methods and systems for providing a virtualized NVMe over fabric service | |
| US11895027B2 (en) | Methods and systems for service distribution using data path state replication and intermediate device mapping | |
| US8751655B2 (en) | Collective acceleration unit tree structure | |
| US20110010522A1 (en) | Multiprocessor communication protocol bridge between scalar and vector compute nodes | |
| US11593294B2 (en) | Methods and systems for loosely coupled PCIe service proxy over an IP network | |
| CN104038569A (en) | Trunking communication model based on address mapping | |
| CN106953853A (en) | A kind of network-on-chip gigabit Ethernet resource node and its method of work | |
| CN112019450A (en) | Streaming communication between devices | |
| CN104486365B (en) | Communication means and system between dual control | |
| CN104270450A (en) | A dual-controller multi-link heartbeat monitoring method using UDP protocol | |
| KR20160033754A (en) | Storage system, method, and apparatus for processing operation request | |
| KR20170102717A (en) | Micro server based on fabric network | |
| WO2012126352A1 (en) | Method, device and system for transmitting messages on pcie bus | |
| CN205283599U (en) | 10, 000, 000, 000 net TCP agreement offload engine TOE's system based on FPGA realizes | |
| CN105704023B (en) | Message forwarding method and device of stacking system and stacking equipment | |
| US8225004B1 (en) | Method and system for processing network and storage data | |
| Wang et al. | An optimized RDMA QP communication mechanism for hyperscale AI infrastructure | |
| CN104980371A (en) | Micro server | |
| CN121070861B (en) | Embedded system based on NVME oF RDMA | |
| US8751603B2 (en) | Exploiting cluster awareness infrastructure through internet socket based applications | |
| US20250240185A1 (en) | Cross network bridging |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | C06 | Publication | |
| | PB01 | Publication | |
| | WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20140910 |