CN1305265C - Asynchronous remote mirror image method based on load selfadaption in SAN system - Google Patents

Info

Publication number
CN1305265C
Authority
CN
China
Prior art keywords
node
write
mirror
command
asynchronous
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB200310103194XA
Other languages
Chinese (zh)
Other versions
CN1543135A (en)
Inventor
舒继武
郑纬民
严瑞
姚骏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CNB200310103194XA
Publication of CN1543135A
Application granted
Publication of CN1305265C
Anticipated expiration
Expired - Fee Related

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An asynchronous remote mirroring method based on load adaptation in a SAN system, belonging to the field of network storage technology. The storage area network (SAN) consists of hosts; interconnect devices including host adapter cards, switches, and data links; an I/O node and its disk array; and a mirror I/O node (Mirror I/O node) with its own disk array, which provides network disks to the I/O node. The SCSI target simulator on the I/O node chooses between two asynchronous mirroring modes according to the size of the I/O load in order to reduce command response time. When a mirror relationship is first established, or after a failed disk has been replaced, the disks are synchronized to guarantee data consistency. Every error condition raised in a submodule of the target simulator is analyzed and handled according to its type. The method reduces the response time of write commands while preserving data safety and consistency, and the heavier the I/O load, the more pronounced the effect.

Description

Asynchronous Remote Mirroring Method Based on Load Adaptation in a SAN System

Technical field

The load-adaptive asynchronous remote mirroring method for SAN systems belongs to the field of network storage technology.

Background

Remote data mirroring is a key SAN (Storage Area Network) technology. It exploits the long-distance connectivity of the SAN's underlying network (FC or Ethernet) and the SAN's unified storage to keep important data available after a regional physical disaster such as a fire, a flood, or a large-scale power failure, and so provides strong support for disaster recovery.

At present, most remote mirroring solutions for SAN systems use synchronous mirroring. In synchronous mirroring, every write to a mirrored disk or logical volume sends the command and data to both members of the mirrored disk (or logical volume) pair at the same time, and the program that issued the write is notified of completion only after the write has completed on both members. The most notable property of synchronous mirroring is very good data consistency, but it has the following shortcomings:

1. Because the system must wait for both the local write and the remote write to complete, command response time increases. Some data indicate that under heavy I/O load the command response time grows exponentially.

2. Synchronous mirroring needs high bandwidth and has to rely on a high-speed network such as Fibre Channel, using the FCP protocol. As a result there is little choice in how synchronous mirroring is implemented, and the cost of the whole system stays high. Some vendors' products use a low-speed network as the remote mirroring channel, but they require FCP-to-iSCSI bridging, which is expensive and performs poorly.

The load-adaptive asynchronous remote mirroring method addresses these shortcomings while maintaining good data consistency.

Summary of the invention

The object of the present invention is to provide a load-adaptive asynchronous remote mirroring method for a SAN (Storage Area Network) system. The method improves data safety and, while maintaining good data consistency, dynamically chooses between two asynchronous write protocols according to how heavy the system load is, so as to reduce the impact of load on system performance. Compared with synchronous mirroring it offers a performance advantage.

The invention applies to a storage area network (SAN) composed of hosts; interconnect devices including host adapter cards, switches, and data links; an I/O node and its disk array; and a mirror I/O node (Mirror I/O node) with its disk array, which provides network disks to the I/O node. It is characterized in that the Small Computer System Interface (SCSI) target simulator of the I/O node, organized into the module structure set out below, performs the following steps in order to implement the method:

(1) The SCSI target simulator comprises:

The mirror relationship initialization submodule, which provides: the user interface for specifying mirror relationships, ProcWriteUserCommand(); the interfaces for automatic mirror pair recovery, STML_Read_Mirror_Map() and STML_Restore_Mirror_Map(); and the abnormal-exit handling interface, Handle_Abnormal_Shutdown();

The synchronization submodule, which receives requests from the mirror relationship initialization submodule and provides: the disk synchronization thread, Sync_disk_thread(); the full disk synchronization interface, Full_Sync_disks(), which receives commands from the disk synchronization thread; and the bitmap-based synchronization interface, Bitmap_Sync_disk(), which receives commands from the synchronization thread;

The real-time storage submodule, which provides: the command processing thread, STML_Handle_Cmnd_thread(); the read command interface, STML_Handle_Read_Command(), which receives commands from the command processing thread; the write command interface, STML_Handle_Write_Command(), which receives commands from the command processing thread; and the mirror write command interface, STML_Handle_Mirror_Write_Command();

The dual-protocol adaptive data replication submodule, which receives requests from the mirror write command interface of the real-time storage submodule and provides: the adaptive asynchronous protocol switching function, STML_Protocol_change(), and the data replication thread, STML_data_mover_thread(); the write command interface for asynchronous write protocol A, STML_Handle_Async_A_Command(), which receives commands from the switching function and the data replication thread; and the write command interface for asynchronous write protocol B, STML_Handle_Async_B_Command(), which receives commands from the same two sources;

The data change recording submodule, which receives requests from the dual-protocol adaptive data replication submodule and provides: the bitmap recording interface, STML_Bitmap_Write(), and the bitmap reading interface, STML_Bitmap_Read();

The error handling and analysis submodule, which receives requests in real time from the mirror relationship initialization, synchronization, real-time storage, dual-protocol adaptive data replication, and data change recording submodules, and provides: the error handling thread, STML_ERROR_Handle_thread();

The system exit submodule: if no handling command is received from the error handling and analysis submodule, the system exits normally; if such a command is received, the system saves the error context and records, and exits abnormally;

(2) The SCSI target simulator performs the following steps in order, according to the read and write commands from the host:

(2.1) The SCSI target simulator of the I/O node receives a read or write command from the host.

(2.2) The SCSI target simulator determines the command type. If it is a read command, the command is sent to the local disk for execution, and once it has completed the host is notified of completion;

Otherwise, the write command is executed as follows;

The load type is determined against a preset threshold on the length of the SCSI target simulator's write command queue, which serves as the measure of I/O load:

If the I/O load of the I/O node is below the threshold, the load is light and asynchronous write protocol A is executed: the I/O node hands the host's write command to both the local disk and the remote mirror I/O node for processing, and returns the completion status to the host as soon as the local write is confirmed complete.

If the I/O load of the I/O node is above the threshold, the load is heavy and asynchronous write protocol B is executed: the I/O node hands the host's write command to the local disk only and records the data blocks changed by the command in a bitmap; once the local write is confirmed complete, it returns the completion status to the host. The changed data blocks are later read back from the local disk automatically by the adaptive asynchronous protocol data replication thread, either on a timer or according to the I/O node's load, and sent to the mirror I/O node as asynchronous mirror writes.

(3) Asynchronous write protocol A comprises the following steps in order:

1. The host sends the write command to the I/O node;

2. The I/O node copies the write command and sends the copy to the mirror I/O node;

3. The I/O node finishes processing the local write command and returns the result to the host;

4. The write command on the mirror I/O node returns.

(4) Asynchronous write protocol B comprises the following steps in order:

1. The host sends the write command to the I/O node;

2. The I/O node finishes processing the write command and returns the result to the host;

3. The adaptive asynchronous protocol data replication thread reads the changed data blocks and writes them to the mirror I/O node;

4. The write command on the mirror I/O node returns.

Figure 7 is the program flow chart of the method.
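
The choice between the two write protocols can be pictured with a small in-memory model. This is only an illustrative sketch, not the patented implementation: the class `IONode`, its method names, and the use of Python lists as disks are all assumptions made for the example. The source also does not say what happens when the queue length equals the threshold; the sketch treats that case as heavy load.

```python
from collections import deque


class IONode:
    """Toy model of an I/O node that mirrors writes to a remote node.

    All names here are hypothetical; lists stand in for the local disk
    and for the disk behind the mirror I/O node.
    """

    def __init__(self, num_blocks, load_threshold):
        self.local = [b""] * num_blocks
        self.mirror = [b""] * num_blocks
        self.mirror_queue = deque()   # protocol A: writes in flight to the mirror
        self.dirty = set()            # protocol B: changed blocks (bitmap stand-in)
        self.load_threshold = load_threshold

    def handle_write(self, block, data, queue_length):
        """Process one host write and report which protocol handled it.

        Returning from this method models the acknowledgement to the host:
        in both protocols it happens before the mirror has the data.
        """
        if queue_length < self.load_threshold:
            # Light load, protocol A: forward a copy of the write to the mirror.
            self.mirror_queue.append((block, data))
            protocol = "A"
        else:
            # Heavy load, protocol B: only remember which block changed.
            self.dirty.add(block)
            protocol = "B"
        self.local[block] = data
        return protocol

    def deliver_mirror_writes(self):
        """Model the mirror I/O node completing the forwarded protocol A writes."""
        while self.mirror_queue:
            block, data = self.mirror_queue.popleft()
            self.mirror[block] = data

    def run_data_mover(self):
        """Model the replication thread: copy changed blocks to the mirror."""
        # Drain older forwarded writes first so they cannot overwrite newer data.
        self.deliver_mirror_writes()
        for block in sorted(self.dirty):
            self.mirror[block] = self.local[block]
        self.dirty.clear()
```

For example, with `IONode(4, load_threshold=2)`, a write issued at queue length 0 is handled by protocol A and reaches the mirror once `deliver_mirror_writes()` runs, while a write issued at queue length 5 is handled by protocol B and reaches the mirror only when `run_data_mover()` runs.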

Use has shown that, compared with the commonly used synchronous mirroring method, the load-adaptive asynchronous remote mirroring method improves command response time, and that the improvement grows as system load increases.

Description of drawings

Figure 1: Hardware structure of the TH-MSNS massive network storage system

Figure 2: I/O path of the TH-MSNS storage system based on the Fibre Channel Protocol (FCP)

Figure 3: Hardware structure for remote mirroring of I/O node data in the TH-MSNS system

Figure 4: Schematic of asynchronous write protocol A

Figure 5: Schematic of asynchronous write protocol B

Figure 6: Structure of the submodules of the invention within the target simulator module

Figure 7: Program flow chart of the dual-protocol adaptation of the invention

Detailed description

The load-adaptive asynchronous remote mirroring method proposed by the invention was designed and implemented on TH-MSNS, a massive network storage system developed independently at Tsinghua University. TH-MSNS has a SAN architecture but differs from the usual one. In a typical SAN, the storage system is built around a Fibre Channel array controller: a single-board computer (SBC) that integrates a Fibre Channel chip, a RAID (redundant array of inexpensive disks) chip, a Small Computer System Interface (SCSI) or Fibre Channel interface chip, and an embedded CPU and memory. Its front end attaches to the SAN through a Fibre Channel interface connected to a Fibre Channel switch or hub; its back end connects to SCSI disks or Fibre Channel disks (using the FC-AL interface) through a SCSI or Fibre Channel interface. The array controller generally supplies its own firmware, which handles Fibre Channel connectivity, RAID configuration, and similar functions. SAN storage devices built on array controllers suffer from poor scalability, poor compatibility, a closed design, and high price.

The most significant difference between the TH-MSNS storage system and the usual SAN structure is that it controls storage I/O operations with a software system instead of a Fibre Channel array controller. This avoids the high price of hardware control, while an efficient software storage control method maximizes performance and flexibility and works with several underlying transport protocols, such as FCP and iSCSI.

The target of the TH-MSNS storage system is a complete general-purpose server (the I/O processing node). The I/O node provides the server cluster with a dedicated storage network and is fitted with host adapter cards chosen to suit the deployment (for example an FC HBA or an Ethernet HBA). These cards receive the frames or packets, carrying encapsulated SCSI commands and user data, that hosts send over the network (Fibre Channel, Ethernet, and so on). The HBA unpacks each received frame or packet, restores the SCSI command and user data, and hands them to the SCSI target simulator (Target Simulator Module). The SCSI target simulator is a software module running on the I/O node. It processes, queues, and encapsulates the SCSI commands and user data delivered by the HBA, passes the commands to the SCSI subsystem, and returns the execution status reported by the SCSI subsystem to the host along the original path. See Figure 1.

All the submodules of the load-adaptive asynchronous remote mirroring method run inside the SCSI target simulator of the I/O node. See Figure 2.

A Mirror I/O node, similar in structure to the I/O node, is added to the Fibre Channel network and together with the I/O node forms a two-node storage cluster. The cluster provides redundant storage paths, which makes data replication possible. This design differs from an ordinary cluster of peers in that the two nodes are not at the same level: the added node provides network disks to the original single I/O node, and these disks are used for remote mirroring of the I/O node's data in the TH-MSNS system. The corresponding mirroring structure is shown in Figure 3.

In the structure diagram for remote mirroring of I/O node data in the TH-MSNS system, the front-end host is the initiator and the back-end I/O node is the target. Ordinary storage I/O requests and data follow the thin lines in that diagram: from ① or ②, through the switch and ③, into the I/O node, which submits them to the attached RAID subsystem. The newly added mirror data link follows the thick lines; on this link the I/O node acts as the initiator and the mirror I/O node operates in target mode. Mirrored data leaves the I/O node's HBA2 card, passes through ④, the switch, and ⑤, and is delivered to the mirror I/O node.

The categories and brief functions of the submodules are listed in the table below:

Module name | Module function
Mirror relationship initialization submodule | At system startup, restores mirror pairs from the state (the mirror relationship table) preserved when the system last stopped; while the system is running, provides a user interface through which the user specifies mirrored disk pairs.
Real-time storage submodule | Hands read and write operations to the SCSI subsystem and analyzes the results of command execution; if an error is found, passes the failed command to the error handling and analysis submodule.
Data change recording submodule | Records disk read and write events. When a write occurs, records it in the storage bitmap (Bitmap) to track changes to the disk media.
Dual-protocol adaptive data replication submodule | Dynamically applies one of two asynchronous write protocols (A or B), according to system load, to carry out asynchronous mirroring.
Synchronization submodule | When a mirror relationship is first established, or after a failed disk has been replaced, synchronizes the disks to guarantee data consistency.
Error handling and analysis submodule | Provides the error handling interface for all modules. Contains an error handling thread that analyzes and handles each error condition according to its type.
System exit submodule | Ensures that the data on the I/O node and the mirror I/O node are fully consistent when the system exits normally, and saves and records the system state when the system terminates abnormally.
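
The storage bitmap used by the data change recording submodule can be pictured as one bit per disk block. The sketch below is a hypothetical illustration, not the layout behind STML_Bitmap_Write() and STML_Bitmap_Read(), which the source does not describe; the class and method names are invented for the example.

```python
class BlockBitmap:
    """One bit per disk block; a set bit means the block has been changed."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        # Round up so that every block has a bit.
        self.bits = bytearray((num_blocks + 7) // 8)

    def _check(self, block):
        if not 0 <= block < self.num_blocks:
            raise IndexError(f"block {block} out of range")

    def mark(self, block):
        """Record that a block was written. Marking twice is harmless."""
        self._check(block)
        self.bits[block // 8] |= 1 << (block % 8)

    def clear(self, block):
        """Forget a block once it has been copied to the mirror."""
        self._check(block)
        self.bits[block // 8] &= ~(1 << (block % 8)) & 0xFF

    def is_marked(self, block):
        self._check(block)
        return bool(self.bits[block // 8] & (1 << (block % 8)))

    def marked_blocks(self):
        """Return the changed block numbers in ascending order."""
        return [b for b in range(self.num_blocks) if self.is_marked(b)]
```

Because only a bit is kept per block, the record stays small regardless of how many times a block is rewritten; the cost is that the mirror must re-read the block's current contents from the source disk when it is eventually copied.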

The submodules of the SCSI target simulator are described in more detail below; see Figure 7.

1. Mirror relationship initialization submodule. At system startup, this module restores mirror pairs from the state (the mirror relationship table) preserved when the system last stopped. While the system is running, the user can also specify mirrored disk pairs dynamically through the user interface the module provides. Once a mirror relationship has been established, the synchronization submodule is called automatically to perform the corresponding synchronization.

2. Real-time storage submodule and data change recording submodule. The real-time storage submodule hands read and write operations to the SCSI subsystem and analyzes the results of command execution; if an error is found, it passes the failed command to the error handling and analysis submodule. The data change recording submodule records disk read and write events. A write operation means that the disk media has changed, so the system records it in the storage bitmap (Bitmap) without interfering with the write itself. To keep the record minimal, only the addresses of the changed disk blocks are recorded.

3. Dual-protocol adaptive data replication submodule. Dual-protocol adaptive data replication operates in one of two ways, following either asynchronous write protocol A or asynchronous write protocol B. Under light system load it uses protocol A: while the real-time storage submodule performs the disk write, a copy of the command and data is sent to the mirror disk to be written there. Under heavy system load it uses protocol B, which leaves the real-time storage submodule unaffected: while real-time storage proceeds, a separate adaptive asynchronous protocol data replication thread carries out asynchronous mirroring according to protocol B (see Figure 5). This thread obtains time slices from the system CPU and, on a timer or according to system load, reads the data blocks that the Bitmap records as changed (see item 2) from the mirror source and copies them to the mirror destination disk, completing the asynchronous mirroring process.

4. Synchronization submodule. When a mirror relationship is first established after system startup, or after a failed disk has been replaced, the disks must be synchronized to guarantee data consistency. The synchronization module is a separate thread. Before mirroring proper begins, it acquires the operation lock and copies the data in full from the mirror source to the mirror destination disk. If an existing mirror relationship was interrupted and then re-established, the synchronization thread acquires the operation lock and performs a single incremental synchronization based on all the Bitmap records accumulated while the mirror relationship was inactive.

5. Error handling and analysis submodule. The error handling module has one error handling thread, and every module above has an entry point into it. When an error occurs, the error handling thread acquires the operation lock, analyzes the exception, and handles it according to its type. For a hardware fault it raises an alarm and starts exception-mode logging for use in later recovery.

6. System exit submodule. Before the system exits, the real-time storage module exits first. On a normal exit, the asynchronous mirroring module acquires the operation lock, writes every disk block marked as changed in the Bitmap to the mirror destination disk, and then stops the whole system. On an emergency exit caused by a partial failure, the in-memory Bitmap contents are saved to a configuration file on the system disk before exiting, so that the storage bitmap is preserved along with the accuracy of the data on the mirror source. An emergency-exit safety bit is set at the same time, so that when the system is rebuilt and finds this bit it performs an incremental synchronization first. If the whole system fails, no safety bit is set, and after the system is rebuilt the mirror pair must be fully synchronized.
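
The three exit cases in item 6 determine how the mirror pair is resynchronized at the next start. The sketch below illustrates that decision under stated assumptions: the function names are hypothetical, the saved state is modeled as a JSON string rather than the configuration file on the system disk, and the on-disk format is invented because the source does not specify one.

```python
import json


def normal_exit(local, mirror, dirty_blocks):
    """Normal exit: flush every block still marked as changed, then clear the marks."""
    for block in sorted(dirty_blocks):
        mirror[block] = local[block]
    dirty_blocks.clear()


def emergency_exit(dirty_blocks):
    """Emergency exit: return the text to persist before stopping.

    The safety bit records that the bitmap contents were saved successfully.
    """
    return json.dumps({"safety_bit": True, "dirty_blocks": sorted(dirty_blocks)})


def plan_recovery(saved_text):
    """Decide how to resynchronize when the system is rebuilt.

    Returns ("incremental", set_of_blocks) when a valid saved state with the
    safety bit is found, and ("full", None) otherwise, which covers a total
    failure where nothing could be saved.
    """
    if saved_text is None:
        return ("full", None)
    try:
        state = json.loads(saved_text)
    except json.JSONDecodeError:
        return ("full", None)
    if not isinstance(state, dict) or state.get("safety_bit") is not True:
        return ("full", None)
    return ("incremental", set(state.get("dirty_blocks", [])))
```

Falling back to a full synchronization whenever the saved state is missing or unreadable is the conservative choice: it costs more time but never leaves a changed block uncopied.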

The invention is a SAN asynchronous remote mirroring method that replicates on the basis of unified storage media and dynamically applies one of two asynchronous write protocols according to load. With the help of the storage bitmap (Bitmap), a single mirror monitoring program carries out asynchronous mirror storage, disaster recovery handling, and synchronization. Because the method uses asynchronous write protocols, it is suitable for some low-speed networks, so in addition to FCP it can use iSCSI, InfiniBand, and other protocols as the underlying transport. SAN-based dual-protocol adaptive asynchronous remote mirroring comprises the following steps in order:

(1) The real-time storage module provides a management interface for configuring the disk mapping table used during mirroring:

The mirror storage module acquires the operation lock from the host and performs the actual data reads and mirror writes. A read needs to complete only on the I/O node; a write needs to complete, asynchronously, on both the I/O node and the mirror I/O node;

(2) The disaster recovery module is called by the monitoring program when a mirror storage device, that is, the I/O node or the mirror I/O node, reports a read or write error. The disaster recovery module switches the mirror cluster over so that the storage device still working normally continues to provide the full data service, and logging is enabled at the same time;

(3) After the damaged device has been replaced, the resynchronization submodule acquires the operation lock and synchronizes the data in the background. When synchronization is complete, the disaster recovery module notifies the mirror monitoring program to switch the system from disaster recovery operation back to the normal mirror storage mode.

The disaster recovery operations described above belong to the prior art.

Data synchronization between disks after the mirror relationship is established is carried out in two stages:

(1) The first stage is a whole-disk copy between the mirrored disks. We first use a snapshot mechanism to create a static copy of the source disk, which serves as the source of the whole-disk copy. For writes to the source disk during the copy, we first record the data blocks that the write command will change (only the corresponding bit in the Bitmap needs to be set, marking the block as changed), and then use the COW (Copy On Write) algorithm to cache the original copy of each block modified by the write, which guarantees the consistency and continuity of the whole-disk copy.

(2) The second stage reads the modified data blocks from the source disk and writes them to the mirror disk, bringing the data fully into sync. A separate synchronization thread performs these reads and writes. Writes to the source disk that occur during this stage are likewise recorded in the Bitmap and handed to the synchronization thread to be processed in order. The asynchronous mirroring process then begins.
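The two stages can be sketched in a few lines of Python. This is only an in-memory illustration under stated assumptions, not the patent's kernel implementation: disks are modeled as lists of blocks, the snapshot is reduced to a dictionary of preserved original blocks, and the names `SourceDisk`, `full_copy`, and `bitmap_sync` are hypothetical. Because a bitmap does not retain write order, the sketch processes set bits in ascending block order, which is an assumption about what "in order" means.

```python
class SourceDisk:
    """Toy source disk with a changed-block bitmap and a copy-on-write cache."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.bitmap = [False] * len(self.blocks)  # True = changed, not yet mirrored
        self.cow = {}                             # block index -> original contents
        self.snapshot_active = False

    def write(self, index, data):
        # Record the change first, then preserve the original block once (COW).
        self.bitmap[index] = True
        if self.snapshot_active and index not in self.cow:
            self.cow[index] = self.blocks[index]
        self.blocks[index] = data

    def read_snapshot(self, index):
        # The snapshot view returns the preserved original if the block changed.
        return self.cow.get(index, self.blocks[index])


def full_copy(source, mirror, interleaved=None):
    """Stage 1: copy the snapshot view of every block to the mirror.

    `interleaved` (hypothetical test hook) maps a copy position to the host
    writes that arrive just before that block is copied.
    """
    interleaved = interleaved or {}
    source.snapshot_active = True
    source.cow.clear()
    for i in range(len(source.blocks)):
        for index, data in interleaved.get(i, []):
            source.write(index, data)
        mirror[i] = source.read_snapshot(i)
    source.snapshot_active = False
    source.cow.clear()


def bitmap_sync(source, mirror):
    """Stage 2: copy every block whose bit is set, clearing the bit afterwards."""
    copied = []
    for i, dirty in enumerate(source.bitmap):
        if dirty:
            mirror[i] = source.blocks[i]
            source.bitmap[i] = False
            copied.append(i)
    return copied


source = SourceDisk(["a", "b", "c", "d"])
mirror = [None] * 4
full_copy(source, mirror, interleaved={1: [(0, "A"), (3, "D")]})
changed = bitmap_sync(source, mirror)
```

After `full_copy` the mirror holds the state of the source at snapshot time, even though two blocks were overwritten during the copy; `bitmap_sync` then transfers only those two blocks.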

Implementation mechanism of the asynchronous mirroring process

1. Read operations during asynchronous mirroring are relatively simple and follow the same steps as in synchronous mirroring. The steps, in order, are:

(1) The HBA of the I/O node receives a protocol packet from the Host that encapsulates a SCSI command and user data (for example an FCP, iSCSI, or InfiniBand packet).

(2) The HBA parses the protocol, unpacks the packet, and extracts the SCSI command and user data.

(3) The HBA passes the SCSI command and user data to the target simulator module (Target Simulator Module) for processing.

(4) The target simulator module sends the SCSI command directly to the SCSI mid-layer, allocates a data buffer for the command, and coordinates the interaction between modules and commands.

(5) The SCSI mid-layer issues the SCSI command to the SCSI RAID (redundant array of inexpensive disks) subsystem, which performs the actual read;

(6) Whether or not the request succeeds, the result of the I/O request is returned to the Host along the same path.

2. Write operations during asynchronous mirroring are handled in one of two ways, depending on how heavy the I/O load on the I/O node is. Note: the I/O load of the I/O node can be gauged from the length of the write command queue in the target simulator module.
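A minimal Python sketch of this load check follows. The claims describe the decision as a comparison of the write command queue length against a preset threshold; the function name `choose_write_protocol` and the treatment of a queue length exactly equal to the threshold are assumptions made for illustration, since the text only specifies the "less than" and "greater than" cases.

```python
from collections import deque

PROTOCOL_A = "A"  # light load: forward the write to the mirror node right away
PROTOCOL_B = "B"  # heavy load: write locally, mark the bitmap, replicate later


def choose_write_protocol(write_queue, threshold):
    """Pick the asynchronous write protocol from the pending write queue length."""
    # Assumption: a queue length equal to the threshold is treated as heavy load.
    if len(write_queue) < threshold:
        return PROTOCOL_A
    return PROTOCOL_B


light = deque(["w1", "w2"])
heavy = deque(f"w{i}" for i in range(10))
light_choice = choose_write_protocol(light, threshold=4)
heavy_choice = choose_write_protocol(heavy, threshold=4)
```

Evaluating the check per write command lets the node move between the two protocols as the queue grows and drains.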

When the I/O load on the I/O node is light, asynchronous write protocol A is used (see Figure 4). Under this protocol the I/O node hands the write command issued by the host to both the local disk and the remote mirror I/O node for processing. Unlike the synchronous write protocol, the I/O node reports completion to the host as soon as it receives confirmation that the local write has finished, without waiting for the write on the mirror disk to finish. The steps are as follows (see Figure 5):

(1) The HBA of the I/O node receives a protocol packet from the Host that encapsulates a SCSI command and user data (for example an FCP, iSCSI, or InfiniBand packet).

(2) The HBA parses the protocol, unpacks the packet, and extracts the SCSI command and user data.

(3) The HBA passes the SCSI command and user data to the target simulator module (Target Simulator Module) for processing.

(4) The target simulator module duplicates the received SCSI command and data, and stores the command in a history command queue for use by the error handling module if an error occurs;

(5) One copy of the SCSI command and data is sent through the SCSI mid-layer to the local SCSI-RAID (SCSI redundant disk array) subsystem, which performs the write;

(6) The other copy of the SCSI command and data is used for mirroring. It is sent through the SCSI mid-layer to the mirror I/O node, whose processing flow is similar to that of the I/O node.

(7) When the local write command finishes, the execution result is returned to the Host along the same path, without waiting for the write on the mirror I/O node to finish.

(8) When the write on the mirror I/O node finishes, the result is returned to the I/O node, which acts on it as follows. If the operation succeeded, the command queued earlier is removed from the queue, indicating that the command has completed. If the operation failed, the error handling module is triggered to retry or perform other error handling.
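The bookkeeping in steps (4) to (8) can be sketched as follows. This Python model is an illustration under stated assumptions: the class `ProtocolANode`, the callable `mirror_write`, and the retry count are hypothetical, and the transfer to the mirror node, which in the described system runs concurrently with the local write, is modeled as a queue that is drained later.

```python
from collections import deque


class ProtocolANode:
    """Toy I/O node using asynchronous write protocol A."""

    def __init__(self, mirror_write):
        self.local = {}                   # block -> data on the local disk
        self.history = deque()            # commands awaiting mirror confirmation
        self.pending = deque()            # copies queued for the mirror node
        self.mirror_write = mirror_write  # callable(block, data) -> bool

    def handle_write(self, block, data):
        cmd = (block, data)
        self.history.append(cmd)   # step 4: keep the command for error handling
        self.local[block] = data   # step 5: local write
        self.pending.append(cmd)   # step 6: second copy goes to the mirror node
        return "GOOD"              # step 7: answer the host without waiting

    def drain_mirror(self, max_retries=1):
        """Step 8: dequeue a command on success, retry it on failure."""
        failed = []
        while self.pending:
            cmd = self.pending.popleft()
            ok = False
            for _ in range(1 + max_retries):
                if self.mirror_write(*cmd):
                    ok = True
                    break
            if ok:
                self.history.remove(cmd)  # mirrored: command fully completed
            else:
                failed.append(cmd)        # stays in history for later handling
        return failed


mirror = {}


def mirror_write(block, data):
    mirror[block] = data
    return True


node = ProtocolANode(mirror_write)
status = node.handle_write(7, b"x")   # host sees completion here
leftover = node.drain_mirror()        # mirror catches up afterwards
```

The host-visible latency is that of `handle_write` alone; a command leaves the history queue only once the mirror node has confirmed it.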

When the I/O load on the I/O node is heavy, asynchronous write protocol B is used (see Figure 3). Under this protocol the I/O node hands the write command issued by the host to the local disk only, and uses the Bitmap to record the data blocks changed by the command. The I/O node reports completion to the host as soon as it receives confirmation that the local write has finished. The changed data blocks are later read from the local disk by the asynchronous data replication process, either when system load permits or on a timer, and sent to the mirror I/O node for asynchronous mirroring (see Figure 5).

The steps are as follows:

(1) The HBA of the I/O node receives a protocol packet from the Host that encapsulates a SCSI command and user data (for example an FCP, iSCSI, or InfiniBand packet);

(2) The HBA parses the protocol, unpacks the packet, and extracts the SCSI command and user data.

(3) The HBA passes the SCSI command and user data to the target simulator module (Target Simulator Module) for processing.

(4) The Target Simulator Module analyzes the SCSI request and records the data blocks changed by the command in the storage bitmap (Bitmap); the SCSI command and data are then sent through the SCSI mid-layer to the local SCSI-RAID (SCSI redundant disk array) subsystem, which performs the write;

(5) When the local write command finishes, the execution result is returned directly to the Host along the same path.

(6) Depending on system load, or on a timer, the asynchronous data replication process automatically reads, in the original order, the data blocks whose bits are set in the Bitmap (the blocks changed by write commands) from the local disk and sends them to the mirror I/O node for asynchronous mirroring.

(7) When the write on the mirror I/O node finishes, the result is returned to the I/O node, which acts on it as follows. If the operation succeeded, the corresponding bit in the Bitmap is cleared, indicating that the data block has been mirrored. If the operation failed, the error handling module is triggered to retry or perform other error handling.
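Steps (4) to (7) of protocol B can be sketched in Python as below. The sketch rests on assumptions: `ProtocolBNode` and `mirror_write` are hypothetical names, the disk is a list of blocks, and set bits are visited in ascending block order because the bitmap alone does not retain the order in which writes arrived.

```python
class ProtocolBNode:
    """Toy I/O node using asynchronous write protocol B."""

    def __init__(self, num_blocks):
        self.local = [None] * num_blocks
        self.bitmap = [False] * num_blocks  # True = written locally, not mirrored

    def handle_write(self, block, data):
        self.bitmap[block] = True   # step 4: record the changed block
        self.local[block] = data    # step 4: local write
        return "GOOD"               # step 5: answer the host immediately

    def replicate(self, mirror_write):
        """Steps 6-7: send set blocks to the mirror; clear each bit on success."""
        failed = []
        for block, dirty in enumerate(self.bitmap):
            if not dirty:
                continue
            if mirror_write(block, self.local[block]):
                self.bitmap[block] = False
            else:
                failed.append(block)  # bit stays set, so the block is retried
        return failed


node_b = ProtocolBNode(4)
node_b.handle_write(2, "v1")
node_b.handle_write(2, "v2")   # overwrites block 2 before replication runs
node_b.handle_write(0, "z")

sent = []


def flaky_mirror_write(block, data):
    sent.append((block, data))
    return block != 0          # pretend the write of block 0 fails


failed_blocks = node_b.replicate(flaky_mirror_write)
```

In this sketch, repeated writes to the same block before a replication pass collapse into one transfer of the latest contents, and a failed block keeps its bit set so the next pass picks it up again.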

In disaster-recovery handling, when a physical disk device belonging to the I/O node is damaged (the SCSI_RAID driver either detects a hot-plug event or, for a failure that is not a hot-plug event, determines it from the return value of a read or write operation), the I/O node uses the data structure that maps local disks to network disks to set the state of the affected disk to failed (Defunct). Under the control of the I/O node, operation requests for the failed disk are redirected to the corresponding mirror disk, and the storage bitmap (Bitmap) records the subsequent requests directed at the failed disk so that the data can be resynchronized once the original physical disk has been recovered.
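The redirect-and-record behavior for a failed local disk can be sketched in Python as follows. This is an illustration only: `MirroredDisk`, `DiskState`, and `recover` are hypothetical names, mirroring in the healthy state is simplified to an immediate copy, and the sketch assumes the original disk returns with its earlier contents intact so that only the blocks recorded in the bitmap need to be copied back.

```python
from enum import Enum


class DiskState(Enum):
    ACTIVE = "active"
    DEFUNCT = "defunct"


class MirroredDisk:
    """Toy local disk paired with a mirror disk on the mirror I/O node."""

    def __init__(self, num_blocks):
        self.state = DiskState.ACTIVE
        self.local = [None] * num_blocks
        self.mirror = [None] * num_blocks
        self.resync_bitmap = [False] * num_blocks

    def mark_failed(self):
        # Called when the RAID driver reports the disk as failed.
        self.state = DiskState.DEFUNCT

    def write(self, block, data):
        if self.state is DiskState.DEFUNCT:
            self.mirror[block] = data          # redirect to the mirror disk
            self.resync_bitmap[block] = True   # remember the block for resync
        else:
            self.local[block] = data
            self.mirror[block] = data          # simplified: mirrored immediately

    def read(self, block):
        if self.state is DiskState.DEFUNCT:
            return self.mirror[block]
        return self.local[block]

    def recover(self):
        """Copy back the blocks changed while the disk was defunct."""
        restored = []
        for block, dirty in enumerate(self.resync_bitmap):
            if dirty:
                self.local[block] = self.mirror[block]
                self.resync_bitmap[block] = False
                restored.append(block)
        self.state = DiskState.ACTIVE
        return restored


disk = MirroredDisk(3)
disk.write(0, "a")
disk.mark_failed()
disk.write(1, "b")          # served by the mirror disk
restored = disk.recover()   # only block 1 is copied back
```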

In disaster-recovery handling, when the I/O node itself fails, the Fibre Channel switch sends a failure alert to the management node, which starts the disaster-recovery operation: it lifts the software masking that hides the mirror node from the hosts so that the mirror node becomes visible to them, hands all requests addressed to the I/O node over to the mirror node, and starts a log so that data synchronization is faster once the failed I/O node rejoins. As soon as the I/O node rejoins the two-node cluster storage system, the management node initiates the process of re-establishing data and services.

In disaster-recovery handling, when the mirror I/O node fails, the original two-node architecture must be switched to a single-node architecture in which the I/O node continues to provide data service. The active I/O node also records changed data blocks in the storage bitmap (Bitmap) so that data synchronization is faster once the failed mirror I/O node rejoins.

Resynchronization runs entirely in the background, and the system switches automatically from single-node storage service to the two-node working mode.

The system proposed according to the method of the present invention is characterized in that it comprises the following equipment:

Host (Host): used to build the cluster system, providing highly available network services or high-performance computing capability to network users;

Interconnect devices: host adapter cards (such as FC HBA, iSCSI HBA), switches (such as FC Switch, iSCSI Switch), and data links;

I/O node (I/O node): provides unified network storage service to the host cluster system;

Mirror I/O node (Mirror I/O node): provides the I/O node with storage space for data mirroring and provides uninterrupted storage service to the cluster system when a storage disaster occurs on the I/O node.

Test Results

The average command response time is a very important measure of a system's service quality and performance. Here we test both the load-adaptive asynchronous remote mirroring method and the commonly used synchronous mirroring method and compare their command response times. The test hardware is TH-MSNS, the mass network storage system of Tsinghua University. The I/O node is a 32-bit Itanium 2.4 GHz dual-CPU server with 2 GB of memory running Linux (Kernel 2.4.18-5). The storage subsystem uses an Adaptec 3410 SCSI RAID card and a disk enclosure of 14 Seagate 73 GB 10,000 rpm SCSI disks. The underlying protocol is FCP over 2 Gb/s Fibre Channel. The test tool is Intel's Iometer, and the access pattern is sequential. Mirroring applies only to write commands; read commands are executed locally and are not affected by mirroring. The test therefore measures the response time of the two methods with 100% write commands only. The test data are shown in the table below:

Average response time

Block size    Asynchronous mirroring (ms)    Synchronous mirroring (ms)
64KB          1.718                          1.795
128KB         3.482                          3.573
192KB         5.357                          5.335
256KB         7.083                          7.202
512KB         13.573                         14.469
1024KB        24.797                         28.610
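To make the comparison easier to read, the relative reduction in response time can be computed from the test data above. The short Python calculation below is an added illustration; the function name `improvement_percent` is hypothetical, and the percentages are derived from the published figures rather than reported in the source.

```python
# Average response times in ms taken from the test data: (asynchronous, synchronous).
results = {
    64: (1.718, 1.795),
    128: (3.482, 3.573),
    192: (5.357, 5.335),
    256: (7.083, 7.202),
    512: (13.573, 14.469),
    1024: (24.797, 28.610),
}


def improvement_percent(async_ms, sync_ms):
    """Reduction in response time relative to synchronous mirroring, in percent."""
    return (sync_ms - async_ms) / sync_ms * 100.0


for size_kb, (async_ms, sync_ms) in results.items():
    print(f"{size_kb:>5} KB: {improvement_percent(async_ms, sync_ms):+.2f}%")
```

A positive value means the asynchronous method responded faster. The value for 192KB comes out slightly negative, and the largest reduction, roughly 13%, is at 1024KB.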

The test data show that the load-adaptive asynchronous remote mirroring method improves command response time over the commonly used synchronous mirroring method in five of the six block sizes tested; at 192KB the two are nearly equal, with the synchronous method marginally faster. It should be noted that the performance gain grows as the system load grows: the heavier the load, the larger the advantage of the load-adaptive asynchronous method over synchronous mirroring. In the table, the reduction reaches about 6% at 512KB and about 13% at 1024KB. In addition, these data were obtained on a Fibre Channel network, where transmission delay is very small; on other underlying transport networks (such as Ethernet), the command response time of this method could improve considerably more relative to synchronous mirroring.

The present invention has the following features:

(1) It uses asynchronous remote mirroring based on unified storage to achieve data security and high availability. While maintaining good data consistency, it dynamically selects between different asynchronous write protocols according to the load on the I/O node, which effectively shortens the system's command response time and gives a sizable performance advantage over synchronous mirroring, especially under high load.

(2) It makes full use of the long-distance connectivity and unified storage of a SAN, and provides mirroring through a two-level structure built on a two-node cluster. It exploits the flexibility of software control on the I/O nodes, continues to provide data service when either node fails, and provides a background data resynchronization mechanism after the system is rebuilt, that is, a disaster recovery function;

(3) The asynchronous remote mirroring method can be applied to a variety of underlying transport protocols such as FCP, iSCSI, and InfiniBand, which makes it flexible; when implemented over iSCSI it can make full use of existing resources and reduce system cost.

(4) All software runs in the kernel mode of the operating system and is implemented as kernel modules, which reduces memory copies between user mode and kernel mode and improves efficiency;

(5) Asynchronous remote mirroring can be used together with synchronous remote mirroring (on different disks or logical volumes) to meet the differing requirements of different applications.

Claims (3)

1. A load-adaptive asynchronous remote mirroring method in a SAN system, using an asynchronous mirroring mode, characterized in that: in a storage area network (SAN) composed of hosts; interconnect devices comprising host adapters, switches, and data links; an I/O node and its disk array; and a mirror I/O node (Mirror I/O node) and its disk array, which provides network disks to said I/O node, the small computer system interface (SCSI) target simulator of the I/O node carries out the method of the invention by performing the following steps in order according to a predefined module structure:
(1) the SCSI target simulator is provided with:
a mirror initialization submodule, provided with: mirror-pair automatic restoration interfaces STML_Read_Mirror_Map() and STML_Restore_Mirror_Map(); a user-specified mirror relationship interface ProcWriteUserCommand(); and an abnormal-exit handling interface Handle_Abnormal_Shutdown();
a synchronization submodule, which receives requests from the mirror initialization submodule, provided with: a disk synchronization thread Sync_disk_thread(); a full disk synchronization interface Full_Sync_disks(), which receives commands from the disk synchronization thread; and a bitmap-based synchronization interface Bitmap_Sync_disk(), which receives commands from the synchronization thread;
a real-time storage submodule, provided with: a command processing thread STML_Handle_Cmnd_thread(); a read command handling interface STML_Handle_Read_Command(), which receives commands from the command processing thread; a write command handling interface STML_Handle_Write_Command(), which receives commands from the command processing thread; and a mirror write command handling interface STML_Handle_Mirror_Write_Command();
a dual-protocol adaptive data replication submodule, which receives requests from the mirror write command handling interface of the real-time storage submodule, provided with: an adaptive asynchronous protocol switching function and a data replication thread, STML_Protocol_change() and STML_data_mover_thread(); a write command interface using asynchronous write protocol A, STML_Handle_Async_A_Command(), which receives commands from said switching function and data replication thread; and a write command interface using asynchronous write protocol B, STML_Handle_Async_B_Command(), which receives commands from said switching function and data replication thread;
a data change recording submodule, which receives requests from the dual-protocol adaptive data replication submodule, provided with: a bitmap record interface STML_Bitmap_Write() and a bitmap read interface STML_Bitmap_Read();
an error handling and analysis submodule, which receives in real time the requests of each of said mirror initialization submodule, synchronization submodule, real-time storage submodule, dual-protocol adaptive data replication submodule, and data change recording submodule, provided with: an error handling thread STML_ERROR_Handle_thread();
a system exit submodule, responsible for ensuring that the data on the I/O node and the mirror I/O node are fully consistent when the system exits normally, and for saving the current state and logging when the system terminates abnormally;
(2) the SCSI target simulator performs the following steps in order on read and write commands from the host:
(2.1) the SCSI target simulator of the I/O node receives a read or write command from the host;
(2.2) the SCSI target simulator determines the command type: if it is a read command, the command is sent to the local disk for execution, and once the command has completed the host is notified that execution is finished;
otherwise, the write command is executed according to the following steps;
the load on the I/O node is judged against a preset I/O load threshold, expressed as the write command queue length of the SCSI target simulator:
if the I/O load of the I/O node is less than the load threshold, the I/O node is lightly loaded and asynchronous write protocol A is executed: the I/O node hands the write command issued by the host to both the local disk and the remote mirror I/O node for processing, and, as soon as it receives confirmation that the local write command has completed, returns a completion indication to the host;
if the I/O load of the I/O node is greater than the load threshold, the I/O node is heavily loaded and asynchronous write protocol B is executed: the I/O node hands the write command issued by the host to the local disk only and records the data blocks changed by the command in the bitmap; as soon as it receives confirmation that the local write command has completed, the I/O node returns a completion indication to the host; the changed data blocks are then read from the local disk by the adaptive asynchronous protocol data replication thread, either according to the load on the I/O node or on a timer, and sent to the mirror I/O node to perform the asynchronous mirror write.
2. The load-adaptive asynchronous remote mirroring method in a SAN system according to claim 1, characterized in that said asynchronous write protocol A comprises the following steps in order:
(1) the host sends a write command to the I/O node;
(2) the I/O node duplicates the write command and sends the copy to the mirror I/O node;
(3) the I/O node finishes processing the local write command and returns the result to the host;
(4) the write command on the mirror I/O node returns.
3. The load-adaptive asynchronous remote mirroring method in a SAN system according to claim 1, characterized in that said asynchronous write protocol B comprises the following steps in order:
(1) the host sends a write command to the I/O node;
(2) the I/O node finishes processing the write command and returns the result to the host;
(3) the adaptive asynchronous protocol data replication thread reads the changed data blocks and writes them to the mirror I/O node;
(4) the write command on the mirror I/O node returns.
CNB200310103194XA 2003-11-07 2003-11-07 Asynchronous remote mirror image method based on load selfadaption in SAN system Expired - Fee Related CN1305265C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB200310103194XA CN1305265C (en) 2003-11-07 2003-11-07 Asynchronous remote mirror image method based on load selfadaption in SAN system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB200310103194XA CN1305265C (en) 2003-11-07 2003-11-07 Asynchronous remote mirror image method based on load selfadaption in SAN system

Publications (2)

Publication Number Publication Date
CN1543135A CN1543135A (en) 2004-11-03
CN1305265C true CN1305265C (en) 2007-03-14

Family

ID=34333234

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB200310103194XA Expired - Fee Related CN1305265C (en) 2003-11-07 2003-11-07 Asynchronous remote mirror image method based on load selfadaption in SAN system

Country Status (1)

Country Link
CN (1) CN1305265C (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20070314

Termination date: 20111107