CN102868754B

CN102868754B - Method, node device and system for realizing high availability of cluster storage

Info

Publication number: CN102868754B
Application number: CN201210363576.5A
Authority: CN
Inventors: 刘爱贵
Original assignee: BEIJING LIANCHUANG XINAN TECHNOLOGY CO LTD
Current assignee: BEIJING LIANCHUANG XINAN TECHNOLOGY CO LTD
Priority date: 2012-09-26
Filing date: 2012-09-26
Publication date: 2016-08-03
Anticipated expiration: 2032-09-26
Also published as: CN102868754A

Abstract

The invention discloses a method, a node device and a system for realizing high availability of cluster storage. The method includes: triggering a takeover event according to received node state information; acquiring the volume information, storage device information and SAN storage information of the failed node and generating local configuration information; and mounting the storage device in the SAN cluster and starting a distributed file system service routine, whereupon the volumes on the storage device are restored to use. During resource switching between nodes, the invention takes over not only the IP and service process resources of the failed node but also its storage software service processes and physical storage resources. It supports the NFS/CIFS/HTTP/FTP/iSCSI protocols and the PanaFS protocol and, by using a TCP/IP connection re-establishment technique, achieves transparent takeover of the failed node without service interruption during the takeover.

Description

Method, node device and system for realizing high availability of cluster storage

Technical Field

The present invention relates to the technical field of network storage systems, and in particular, to a method, a node apparatus, and a system for implementing high availability of cluster storage.

Background

Against the background of cloud storage and big data, data volumes are growing explosively. According to research, the digital universe is expected to reach 35.2 ZB in 2020, a 44-fold increase over the 0.8 ZB of 2009, and more than 80% of that data is unstructured. The surge of data driven by data-intensive applications such as high-performance computing, medical imaging, oil and gas exploration, digital media and the social Web continually poses new and serious challenges for storage. Cluster storage is a scale-out storage architecture whose capacity and performance expand linearly, and it has been widely accepted in the global market. Besides high performance and high scalability, cluster storage also offers high availability, which is particularly critical for enterprise core business systems because it ensures the continuity of key services.

Prior-art cluster storage solutions address availability mainly through redundancy, including replication, erasure coding, and active/standby or fully active high availability (HA) technology. Replication effectively improves data availability by keeping additional copies, but the storage utilization is low (the reciprocal of the number of copies) and data management becomes more complex. Erasure coding improves storage availability through redundant coding; it has lower space overhead and data redundancy and therefore a high storage utilization, but the coding is complex, requires a large amount of computation, reduces service performance, and is suited to clusters with a large number of nodes. Active/standby HA also obtains high availability through redundancy, but it wastes storage resources severely. Fully active HA monitors the nodes and switches the resources of a failed node (IP, service processes, service data, etc.) to normal nodes, so that the whole system continues to provide services without interruption. It improves availability, provides load balancing and has a high resource utilization. The main problem with HA technology is that service may be interrupted during resource switching, and usually only the IP and service process resources are taken over, while service data or physical storage resources must be managed by an external system.
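As a non-limiting illustration of the utilization figures discussed above, the following sketch computes the usable fraction of raw capacity for replication and for erasure coding. The function names and the 8+2 layout are assumptions chosen for the example and do not appear in the original text; the formula k/(k+m) is the standard utilization of an erasure code with k data blocks and m parity blocks.

```python
def replica_utilization(copies: int) -> float:
    """Usable fraction of raw capacity when every block is stored `copies` times."""
    if copies < 1:
        raise ValueError("copies must be >= 1")
    return 1.0 / copies


def erasure_utilization(data_blocks: int, parity_blocks: int) -> float:
    """Usable fraction for a k+m erasure code (k data blocks, m parity blocks)."""
    if data_blocks < 1 or parity_blocks < 0:
        raise ValueError("need data_blocks >= 1 and parity_blocks >= 0")
    return data_blocks / (data_blocks + parity_blocks)


if __name__ == "__main__":
    # Illustrative layouts only; neither is prescribed by the description.
    print(f"3 replicas: {replica_utilization(3):.0%}")
    print(f"8+2 erasure code: {erasure_utilization(8, 2):.0%}")
```

With three copies only about a third of the raw capacity is usable, whereas the hypothetical 8+2 code keeps 80% usable at the cost of encoding work.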

Disclosure of Invention

The invention aims to provide a method, a node device and a system for realizing high availability of cluster storage, which can take over not only IP and service process resources of a fault node but also storage software service processes and physical storage resources of the fault node during resource switching between nodes.

In order to achieve the purpose, the invention adopts the following technical scheme:

a method of implementing cluster storage high availability, the method comprising:

triggering a take-over event according to the received node state information;

acquiring volume information, storage equipment information and Storage Area Network (SAN) storage information of a fault node and generating local configuration information;

and mounting the storage device in the SAN cluster and starting a distributed file system service routine, wherein the volume on the storage device is recovered to be used.

Before triggering the takeover event according to the received node state information, the method further includes:

and responding to the backup request of the fault node, and receiving the state information of the fault node.

Before the failed node sends out the backup request, the method further includes:

receiving monitoring information through a regularly triggered monitoring event, and judging the state of local service;

when the service state is abnormal, determining a backup node in the SAN cluster according to a round-robin scheduling algorithm;

and sending the local state information to other nodes and sending a backup request to the backup node.

The method further comprises the following steps:

and sending a message to other nodes in the SAN cluster to update the volume information and the storage device information.

If the failed node is repaired, the method further comprises:

taking the resources that were taken over by the backup node back to the local node;

sending a message to other nodes in the SAN cluster, and updating the volume information and the storage device information;

and sending a resource releasing message to the backup node.

When the node takes over the resource, the method further comprises the following steps:

acquiring the connection information from before the takeover, constructing an acknowledgement (ACK) request message whose auto-incrementing sequence number is zero (sequence = 0), and sending the ACK request message to a previously connected client;

receiving an ACK response message of the client, wherein sequence = N in the ACK response message;

sending a reset connection (RST) request to the client, and informing the client to reestablish the TCP connection;

and establishing a TCP connection with the client, which has re-established its transport port.

A node device for realizing high availability of cluster storage comprises: the system comprises a timer module, an event processing module, a monitoring module and a communication module; wherein,

the timer module is used for periodically triggering monitoring events to the event processing module, so that the resources, services and the like of the node are monitored at regular intervals;

the monitoring module is used for monitoring the state of the specified service and returning the state information of the service to the event processing module;

the communication module is used for information transmission and data synchronization among all node devices;

and the event processing module is used for receiving the return information of each module in the device and carrying out the next operation scheduling according to the return information.

Further comprising: a take-over module and a release module; wherein,

the takeover module is used for taking over the resources of the failed node when the node device acts as a backup node, wherein the resources comprise SAN storage devices, volumes and the corresponding distributed file system services;

and the release module is used for releasing each resource taken over by the takeover module after the failed node recovers.

And after the node device is used as a backup node to take over the resources of the fault node, the TCP connection is reestablished with the client connected with the fault node.

A system for realizing high availability of cluster storage comprises at least two node devices and storage equipment in SAN.

By adopting the technical scheme of the invention, not only the IP and service process resources of the failed node but also its storage software service processes and physical storage resources can be taken over during resource switching between nodes. The NFS/CIFS/HTTP/FTP/iSCSI protocols and the PanaFS protocol are supported, and by using a TCP/IP connection re-establishment technique the failed node is taken over transparently, without service interruption during the takeover.

Drawings

Fig. 1 is a schematic structural diagram of a node apparatus for implementing high availability of cluster storage according to an embodiment of the present invention.

Fig. 2 is a schematic structural diagram of a node apparatus for implementing high availability of cluster storage as a standby node according to an embodiment of the present invention.

Fig. 3 is a schematic structural diagram of a system for implementing high availability of cluster storage according to an embodiment of the present invention.

Fig. 4 is a flowchart of a method for implementing high availability of cluster storage according to an embodiment of the present invention.

Fig. 5 is a flowchart of a backup request issued by a failed node in an embodiment of the present invention.

Fig. 6 is a schematic diagram of an information interaction process in the system after the failed node is repaired in the embodiment of the present invention.

Fig. 7 is a schematic process diagram of a node taking over service resources and a client reestablishing a TCP connection in an embodiment of the present invention.

Detailed Description

The technical scheme of the invention is further explained by the specific implementation mode in combination with the attached drawings.

Fig. 1 is a schematic structural diagram of a node apparatus for implementing high availability of cluster storage according to an embodiment of the present invention. The node apparatus includes: the system comprises a timer module, an event processing module, a monitoring module and a communication module; wherein,

The timer module is used for periodically triggering monitoring events to the event processing module, so that the resources, services and the like of the node are monitored at regular intervals.

The HA system triggers events through a timer and monitors the resources, services and the like of the node at regular intervals; if a state changes, it synchronizes the information with the other nodes and triggers the corresponding event for processing. The timer module triggers defined events at set time intervals, including periodically triggered monitoring events, periodically triggered keepalive events and the like.
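The timer behaviour described above can be sketched as follows. This is only an illustrative model under stated assumptions: the class name `TimerModule`, the event names and the intervals are hypothetical, and the current time is passed in explicitly so that the example runs deterministically rather than relying on real timers.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TimerModule:
    """Fires named events at fixed intervals; time is supplied by the caller."""
    intervals: dict[str, float]  # event name -> period in seconds (assumed values)
    _next_due: dict[str, float] = field(default_factory=dict, init=False)

    def start(self, now: float) -> None:
        # Schedule the first occurrence of every event one period from `now`.
        self._next_due = {name: now + period for name, period in self.intervals.items()}

    def poll(self, now: float) -> list[str]:
        """Return the events that are due at `now` and reschedule them."""
        due = []
        for name, deadline in sorted(self._next_due.items()):
            if now >= deadline:
                due.append(name)
                self._next_due[name] = now + self.intervals[name]
        return due


if __name__ == "__main__":
    timer = TimerModule({"monitor": 5.0, "keepalive": 2.0})
    timer.start(0.0)
    for t in (1.0, 2.0, 5.0):
        print(t, timer.poll(t))
```

Each name returned by `poll` would be handed to the event processing module, which decides what to run next.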

The monitoring module is used for monitoring the state of the specified service and returning the state information of the service to the event processing module.

The monitoring module periodically monitors the state of the specified service and returns the state information of the service to the event processing module as a return value: 0 if the service is normal, and a non-zero value otherwise.
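The return-value convention above can be illustrated with a minimal sketch. The probe callable, the constant names and the choice of 1 as the non-zero value are assumptions for the example; the description only requires 0 for a normal service and a non-zero value otherwise.

```python
from typing import Callable

SERVICE_OK = 0      # return value for a normal service, as in the description
SERVICE_FAILED = 1  # any non-zero value signals an abnormal service (value assumed)


def monitor_service(probe: Callable[[], bool]) -> int:
    """Run one health probe and map its outcome to the 0 / non-zero convention."""
    try:
        return SERVICE_OK if probe() else SERVICE_FAILED
    except Exception:
        # A probe that raises is treated as an abnormal service.
        return SERVICE_FAILED


if __name__ == "__main__":
    print(monitor_service(lambda: True))
    print(monitor_service(lambda: False))
```

In practice the probe might check a process or a port; that detail is left open here because the description does not specify how a service is probed.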

The communication module is used for information transmission and data synchronization among all node devices, including periodic heartbeats and data synchronization.

When the node shown in fig. 1 is used as a backup node, the node apparatus further includes a take-over module and a release module, as shown in fig. 2, wherein,

The takeover module is used for taking over the resources of the failed node when the node device acts as a backup node, wherein the resources comprise SAN storage devices, volumes and the corresponding distributed file system services.

The takeover module is responsible for taking over the resources of the failed node, mainly the SAN storage and the corresponding distributed file system service. It first acquires the SAN storage information, storage device information and volume information managed by the failed node, then generates complete configuration information on the local node, mounts the storage devices in the SAN cluster and starts the distributed file system service, after which the volumes established on those storage devices are restored to use.
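The takeover sequence just described can be sketched as follows. All names here (`FailedNodeInfo`, `build_local_config`, `takeover_plan`) and the layout of the configuration are hypothetical, and the plan entries are descriptive strings rather than real commands; the sketch only shows the order of operations: collect the three kinds of information, generate a local configuration, mount, start the service, recover volumes.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FailedNodeInfo:
    node: str
    san_targets: list[str]         # SAN storage information
    devices: dict[str, str]        # storage device -> mount point
    volumes: dict[str, list[str]]  # volume name -> devices backing it


def build_local_config(info: FailedNodeInfo) -> dict:
    """Merge the three kinds of information into one local configuration."""
    referenced = {dev for devs in info.volumes.values() for dev in devs}
    missing = referenced - set(info.devices)
    if missing:
        raise ValueError(f"volumes reference unknown devices: {sorted(missing)}")
    return {
        "taken_over_from": info.node,
        "san_targets": list(info.san_targets),
        "mounts": dict(info.devices),
        "volumes": {name: list(devs) for name, devs in info.volumes.items()},
    }


def takeover_plan(config: dict) -> list[str]:
    """Ordered actions: mount every device, start the service, recover volumes."""
    plan = [f"mount {dev} {path}" for dev, path in sorted(config["mounts"].items())]
    plan.append("start distributed-file-system service")
    plan += [f"recover volume {name}" for name in sorted(config["volumes"])]
    return plan


if __name__ == "__main__":
    info = FailedNodeInfo("node2", ["san-a"], {"/dev/sdb": "/mnt/d1"}, {"vol0": ["/dev/sdb"]})
    for step in takeover_plan(build_local_config(info)):
        print(step)
```

Validating that every volume refers to a known device before mounting is a design choice of this sketch, intended to fail early if the information obtained from the failed node is incomplete.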

The distributed file system service can be provided by a distributed storage system such as Lustre, Panasas, Ceph, pNFS or PanaFS. The embodiment of the invention preferably uses the PanaFS distributed file system, which accesses data through the private PanaFS protocol; this protocol is the basis of access protocols such as CIFS/NFS/FTP/HTTP and provides a unified shared storage space. PanaFS is a protocol compatible with the POSIX (Portable Operating System Interface) standard and offers high scalability, high availability, high performance, high efficiency and the like.

The event processing module of the node device is similar to a scheduler: it receives the return information of the different modules and decides the next operation on that basis. It mainly performs the following steps:

1. Receive the timer message and call the monitoring module.

2. Receive the message of the monitoring module. If the return value of the monitoring module is 0, the service is normal. The result is compared with the previous monitoring result: if the previous state was abnormal, the current node has now been repaired and the resource recovery process needs to be executed, as shown in fig. 3. If a non-zero value is returned, the service of the node is abnormal and the following steps are executed:

(1) the node selects one node as the backup node using a round-robin scheduling algorithm;

(2) the communication module is called to send the state information of the node to the other nodes;

(3) a message is sent to the backup node, which runs the takeover module after receiving the message.

3. Receive messages sent by other nodes. If the node is selected as the backup node, run the takeover module. If a message that the failed node has been repaired is received, run the release module.
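A minimal sketch of this dispatching logic is given here for illustration. The class `EventProcessor`, the message strings and the callables passed to it are all hypothetical. The sketch also assumes that the backup request is sent only on the transition from normal to abnormal, which the description does not state explicitly.

```python
from typing import Callable


class EventProcessor:
    """Routes module return values to the next action (a scheduler-like role)."""

    def __init__(self, monitor: Callable[[], int], pick_backup: Callable[[], str],
                 send: Callable[[str, str], None]) -> None:
        self.monitor = monitor          # returns 0 when the service is normal
        self.pick_backup = pick_backup  # e.g. a round-robin selector
        self.send = send                # send(destination, message)
        self.last_status = 0
        self.actions: list = []         # record of local actions, for inspection

    def on_timer(self) -> None:
        # Step 1: a timer message triggers the monitoring module.
        self.on_monitor_result(self.monitor())

    def on_monitor_result(self, status: int) -> None:
        # Step 2: compare with the previous result to detect repair or failure.
        if status == 0 and self.last_status != 0:
            self.actions.append("recover_resources")
        elif status != 0 and self.last_status == 0:
            # Assumption: act once per failure, not on every abnormal result.
            backup = self.pick_backup()
            self.send("all", "state:abnormal")
            self.send(backup, "backup_request")
        self.last_status = status

    def on_peer_message(self, message: str) -> None:
        # Step 3: messages from other nodes run the takeover or release module.
        if message == "backup_request":
            self.actions.append("run_takeover_module")
        elif message == "node_repaired":
            self.actions.append("run_release_module")


if __name__ == "__main__":
    sent = []
    results = iter([0, 1, 0])
    ep = EventProcessor(lambda: next(results), lambda: "node3",
                        lambda dest, msg: sent.append((dest, msg)))
    for _ in range(3):
        ep.on_timer()
    print(sent, ep.actions)
```

Passing the monitor, selector and sender in as callables keeps the scheduler independent of how the other modules are implemented.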

In a cluster storage system based on a SAN architecture, the back-end storage uses mid-range to high-end disk array subsystems that support different RAID (redundant array of independent disks) levels such as 0, 1, 5, 6 and 10 and are connected to each cluster node through Fibre Channel (FC) interfaces. The SAN disk array protects data through the different RAID levels and provides high availability through a redundancy mechanism, which already reduces storage utilization to a certain degree. Under this architecture, providing high availability of the cluster service through replication or erasure coding would further reduce storage utilization or greatly reduce the performance of the cluster system. When a cluster node server fails, the back-end SAN storage is still working normally, and the data stored on it is complete and consistent. Therefore, a node can be selected from the other normally working cluster nodes to take over the resources and services of the failed node and continue to provide data services externally, which ensures service continuity.

The HA method of the present invention differs from existing HA solutions in that it is oriented to cluster storage systems based on a SAN architecture and adopts a fully active HA architecture: it takes over not only the IP and service process resources of the failed node but also its storage software service processes and physical storage resources, and it supports the NFS/CIFS/HTTP/FTP/iSCSI protocols and the PanaFS protocol. By using a TCP/IP connection re-establishment technique, the failed node is taken over transparently and no service interruption occurs during the takeover. The method leaves the storage utilization and system performance of the cluster storage system unaffected, transparently takes over the complete system resources and provides higher system availability. Its main design principles are as follows:

1. When a node goes down or its system becomes abnormal so that it can no longer provide data storage services to upper-layer applications, the backup node needs to take over the SAN storage devices connected to that node and start the corresponding services, so that front-end applications can still perform data storage operations normally.

2. In order to balance the load of the nodes in the system and avoid overloading the backup node, the SAN storage that was taken over needs to be handed back once the failed node is repaired.

3. During the above takeover and recovery processes, there must be no noticeable impact on front-end data storage and the distributed file system service must not be interrupted, so that takeover and recovery are transparent.

4. The backup node is selected from the nodes that are currently working normally by Round-Robin polling.
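Round-Robin selection among the normally working nodes, as in principle 4, might look like the following sketch. The class name, the node names and the way health is supplied (a set of healthy node names) are assumptions made for the example.

```python
class RoundRobinSelector:
    """Cycles through cluster nodes, skipping any that are not healthy."""

    def __init__(self, nodes: list) -> None:
        if not nodes:
            raise ValueError("cluster must contain at least one node")
        self.nodes = list(nodes)
        self._cursor = 0  # index of the next candidate to consider

    def next_backup(self, healthy: set) -> str:
        """Return the next healthy node after the previously chosen one."""
        for _ in range(len(self.nodes)):
            candidate = self.nodes[self._cursor]
            self._cursor = (self._cursor + 1) % len(self.nodes)
            if candidate in healthy:
                return candidate
        raise RuntimeError("no healthy node available to act as backup")


if __name__ == "__main__":
    selector = RoundRobinSelector(["n1", "n2", "n3", "n4"])
    healthy = {"n1", "n3", "n4"}  # n2 is the failed node in this example
    print([selector.next_backup(healthy) for _ in range(4)])
```

Because the cursor advances on every call, successive failures are spread over different backup nodes, which matches the load-balancing intent of principle 2.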

When a node in the cluster fails, the HA system uses a polling algorithm to select one of the nodes in the OK state to take over the IP address of the failed node. A takeip event is therefore triggered on the selected node, and the node on which the takeip event occurs acts as the backup node and executes the failure takeover operation. After the failed node recovers, the IP address that drifted away can drift back; a takeip event is triggered at the same time and the failure recovery operation is executed. Existing HA solutions often handle only the availability of the services on the cluster nodes, such as CIFS, NFS, HTTP and FTP, but do not handle the TCP/IP connections or sessions that were established with the node before the failure, which may interrupt service during the takeover and prevents transparent takeover. This method uses a TCP spoofing technique: the takeover node actively re-establishes the connection with each client that had an established connection, which makes the takeover transparent. The procedure for re-establishing the connection is as follows:

1. The new node (the takeover node) acquires the previous connection information from the shared storage, constructs an ACK request message whose auto-incrementing sequence number is zero (sequence = 0), and sends the ACK request message to the client.

2. After receiving the request, the client sends an ACK response carrying the correct sequence number (sequence = N).

3. After the new node obtains the correct sequence number, it sends an RST request, informing the client to re-establish the TCP connection.

4. The client re-establishes its transport port, and the TCP connection is re-established.
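The four steps above can be followed in a small in-memory simulation. This is not a real TCP stack and sends no packets: `Segment`, `SimulatedClient` and `reestablish` are hypothetical names, the client behaviour is a simplified model, and N is assumed to be non-zero so that the probe with sequence = 0 is recognisably out of sequence.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Segment:
    flags: str  # "ACK", "RST" or "SYN"
    seq: int
    port: int   # client-side port the segment refers to


class SimulatedClient:
    """Toy client holding one established connection (not a real TCP stack)."""

    def __init__(self, port: int, expected_seq: int) -> None:
        self.port = port
        self.expected_seq = expected_seq  # the value called N in the description
        self.connected = True

    def receive(self, seg: Segment) -> Segment | None:
        if seg.flags == "ACK" and seg.seq != self.expected_seq:
            # Step 2: answer an out-of-sequence ACK with the correct sequence.
            return Segment("ACK", self.expected_seq, self.port)
        if seg.flags == "RST" and seg.seq == self.expected_seq:
            # Steps 3-4: accept the reset, then reconnect from a new port.
            self.connected = False
            self.port += 1
            self.expected_seq = 0
            return Segment("SYN", 0, self.port)
        return None


def reestablish(client: SimulatedClient, saved_port: int) -> int:
    """Run the procedure from the takeover node's side; return the new client port."""
    reply = client.receive(Segment("ACK", 0, saved_port))        # step 1
    if reply is None or reply.flags != "ACK":
        raise RuntimeError("client did not disclose its sequence number")
    syn = client.receive(Segment("RST", reply.seq, saved_port))  # step 3
    if syn is None or syn.flags != "SYN":
        raise RuntimeError("client did not try to reconnect")
    client.connected = True  # step 4: the new handshake is assumed to complete
    return syn.port


if __name__ == "__main__":
    client = SimulatedClient(port=50000, expected_seq=4242)
    print(reestablish(client, saved_port=50000), client.connected)
```

The point of the probe with sequence = 0 is that the takeover node does not know N; it learns N from the client's reply and can then issue a reset that the client will accept.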

Fig. 4 is a flowchart of a method for implementing high availability of cluster storage according to an embodiment of the present invention, where the method includes:

S401, triggering a takeover event according to the received node state information. The backup node receives the state information of the failed node through the communication module, passes it to the event processing module and triggers a takeover event.

S402, acquiring the volume information, the storage device information and the SAN storage information of the failed node and generating local configuration information. The local configuration is generated from the volume information, storage device information and SAN information used by the failed node, and is used for mounting the corresponding storage resources during the takeover.

S403, mounting the storage device in the SAN cluster and starting a PanaFS service routine, whereupon the volumes on the storage device are restored to use.

In step S401, before the backup node triggers the takeover event according to the received node state information, it responds to the backup request of the failed node and receives the state information of the failed node.

Before the failed node sends out the backup request, as shown in fig. 5, the method further includes the following steps:

S501, receiving monitoring information through a periodically triggered monitoring event, and judging the state of the local service;

S502, when the service state is abnormal, determining a backup node in the SAN cluster according to a round-robin scheduling algorithm;

S503, the failed node sends its local state information to the other nodes and sends a backup request to the backup node.

After the backup node has successfully taken over the storage resources, it sends a message to the other nodes in the SAN cluster to update the volume information and the storage device information.

When the failed node has been repaired, as shown in fig. 6, it takes the resources that the backup node had taken over back to the local node, sends a message to the other nodes in the SAN cluster to update the volume information and the storage device information, and sends a resource release message to the backup node. After receiving the resource release message, the backup node releases the resources it previously took over.
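The message order of this fail-back process can be illustrated with a short sketch. The function name, the message strings and the tuple layout are hypothetical; only the order (reclaim locally, notify the other nodes, then ask the backup node to release) is taken from the description.

```python
def failback_messages(repaired: str, backup: str, cluster: list,
                      resources: list) -> list:
    """Return (sender, receiver, message) tuples in the order they are sent."""
    messages = []
    # 1. The repaired node takes its resources back locally (no message needed).
    reclaimed = ",".join(sorted(resources))
    # 2. Every other node is told to update its volume and storage device information.
    for node in cluster:
        if node != repaired:
            messages.append((repaired, node, f"update_volume_and_device_info:{reclaimed}"))
    # 3. The backup node is asked to release what it took over.
    messages.append((repaired, backup, "release_resources"))
    return messages


if __name__ == "__main__":
    for msg in failback_messages("n2", "n3", ["n1", "n2", "n3"], ["vol0"]):
        print(msg)
```

Sending the release message last means the backup node gives up the resources only after the rest of the cluster has been told where the volumes now live.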

Whether the backup node takes over the storage resources of the failed node, or the failed node recovers and takes its original storage resources back from the backup node, the TCP connection with each client that has an existing service connection needs to be re-established. The process of re-establishing the TCP connection is shown in fig. 7 and includes:

1. Acquiring the connection information from before the takeover, constructing an ACK request message whose auto-incrementing sequence number is zero (sequence = 0), and sending the ACK request message to a previously connected client.

2. Receiving an ACK response message from the client, in which sequence = N.

3. Sending an RST request to the client, informing the client to re-establish the TCP connection.

4. Establishing a TCP connection with the client, which has re-established its transport port.

The method uses a TCP spoofing technique in which the takeover node actively re-establishes the connection with each client that had an established connection, so that the takeover is transparent and service is not interrupted. Realizing transparent takeover requires modifying the communication protocol of the cluster storage system and restructuring the communication flow between the server and client software modules. Transparent failure takeover is critical for key business systems: the client experiences only a brief stall, without connection interruption, abnormal exit or similar phenomena, so that data consistency and service continuity are guaranteed and higher cluster availability is achieved.

The above description is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for realizing high availability of cluster storage is characterized in that the method comprises the following steps:

receiving monitoring information through a regularly triggered monitoring event, and judging a local service state;

when the service state is abnormal, determining a backup node in a SAN (storage area network) cluster according to a round-robin scheduling algorithm;

sending local service state information to other nodes and sending backup requests to the backup nodes;

responding to a backup request of a fault node, and receiving service state information of the fault node;

triggering a takeover event according to the received service state information of the fault node;

acquiring volume information, storage device information and SAN storage information of a fault node and generating local configuration information;

mounting a storage device in the SAN cluster and starting a distributed file system service routine, wherein a volume on the storage device is recovered to be used;

wherein if the failed node has been repaired, the method further comprises:

sending a resource releasing message to the backup node;

wherein, when the node takes over the resource, the method further comprises:

acquiring the connection information from before the takeover, constructing an acknowledgement (ACK) request message whose auto-incrementing sequence number is zero (sequence = 0), and sending the ACK request message to a previously connected client;

receiving an ACK response message of the client, wherein the sequence number in the ACK response message is N;

and the client connected with the fault node reestablishes the TCP connection.

2. A node apparatus for realizing high availability of cluster storage, comprising: the system comprises a timer module, an event processing module, a monitoring module and a communication module; wherein,

the timer module is used for triggering a monitoring event to the event processing module and monitoring the resources and services of the node at regular time;

the monitoring module is used for monitoring the specified service state and returning service state information to the event processing module;

the event processing module is used for determining a backup node in the SAN cluster according to a round-robin scheduling algorithm when the service state is abnormal, sending local state information to other nodes, sending a backup request to the backup node, responding to the backup request of the fault node, and receiving the service state information of the fault node;

wherein, the node apparatus further comprises: a take-over module and a release module; wherein,

a takeover module, configured to take over various resources of a failed node when the node apparatus is used as a backup node, where the resources include an SAN storage device, a volume, and a corresponding distributed file system service, generate local configuration information, mount a storage device in the SAN cluster, and start a distributed file system service routine, where a volume on the storage device is recovered for use;

the releasing module is used for releasing each resource taken over by the taking-over module after the failed node is recovered;

after the node device, acting as a backup node, takes over the resources of a failed node, it acquires the connection information from before the takeover, constructs an acknowledgement (ACK) request message whose auto-incrementing sequence number is zero (sequence = 0), and sends the ACK request message to a previously connected client; receives an ACK response message of the client, wherein the sequence number in the ACK response message is N; and sends a reset connection (RST) request to the client and re-establishes the TCP connection with the client that was connected to the failed node.

3. A system for implementing high availability of cluster storage, comprising the node apparatus of claim 2 and a storage device in a SAN cluster.