CN102110154B

CN102110154B - File redundancy storage method in cluster file system

Info

Publication number: CN102110154B
Application number: CN 201110042143
Authority: CN
Inventors: 刘振晗
Original assignee: Tianjin Zhongke Bluewhale Information Technology Co ltd; Institute of Computing Technology of CAS
Current assignee: Tianjin Zhongke Bluewhale Information Technology Co ltd; Institute of Computing Technology of CAS
Priority date: 2011-02-21
Filing date: 2011-02-21
Publication date: 2012-12-26
Anticipated expiration: 2031-02-21
Also published as: CN102110154A

Abstract

The invention provides a file redundancy storage method for a cluster file system. While preserving data reliability and availability, data can switch dynamically between storage states. When system storage resources are tight, data can be converted from mirrored storage to redundancy check storage, which reduces the space occupied by redundancy and improves space utilization. Data is written and updated in mirrored form: a client write request does not need to perform the redundancy calculation and redundancy update that redundancy check mode would require, because the newly written data block is simply mirrored to the redundant storage management device. Write performance is therefore substantially higher than with the redundancy check method. In addition, when a storage node fails, mirroring and redundancy checking can be combined during recovery to reduce the amount of calculation and improve the efficiency of data recovery.

Description

A file redundant storage method in a cluster file system

Technical field

The invention relates to mechanisms for safeguarding data in shared storage systems, and in particular to a redundant storage method for files in a cluster system.

Background

In the data-centric information age, protecting data properly and effectively is one of the core problems of storage systems. People can tolerate an unexpected machine crash, the restart of every application, or even hardware damage, but they demand that information never be lost. The most important task of a storage system is to ensure that stored information is not lost whatever failure occurs, and to provide high-quality data service with as little interruption as possible. Corruption and loss of data not only affect an enterprise's business continuity but can seriously threaten an organization's survival.

To protect data stored on disks, those skilled in the art proposed Redundant Array of Independent Disks (RAID) technology. RAID combines multiple disks into a disk array and stores redundant information about the other disks on each disk, so that when one disk in the array fails, its data can be recovered from the redundant information stored on the other disks. RAID is divided into levels according to implementation principle, denoted RAID0 through RAID7. The working modes of the levels differ considerably; the most representative are RAID1, which uses mirroring, and RAID5, which uses a redundancy check.

A RAID-like technique can also be used in a network storage system, with multiple storage nodes forming a network RAID system. With a RAID1-like mirrored mode, read and write performance is high, but space utilization is only 50%, so the cost-effectiveness of the whole system is low. With a RAID5-like redundancy check mode, space utilization is higher, but write performance, especially for small writes, is low.
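The space trade-off described above can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes two copies per slice for mirroring and a single check slice per redundancy group, as in RAID5.

```python
def mirror_utilization(copies: int = 2) -> float:
    """Usable fraction of raw capacity when each slice is stored `copies` times."""
    return 1 / copies


def parity_utilization(data_slices: int, check_slices: int = 1) -> float:
    """Usable fraction when `check_slices` check slices protect `data_slices` data slices."""
    return data_slices / (data_slices + check_slices)


print(mirror_utilization())    # two copies of everything
print(parity_utilization(3))   # three data slices plus one check slice
```

With three data slices per group, mirroring leaves half of the raw capacity for user data, while the redundancy check layout leaves three quarters.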

How to give a high-performance, large-capacity cluster file system built on network storage an efficient mechanism for high data reliability and availability, one that maintains good performance while reducing the space consumed by redundancy, has become an active research topic. PanFS, the parallel file system from cluster storage vendor Panasas, implements a file-level redundant storage mechanism: small files are stored in a RAID1-like mirrored mode and large files in a RAID5-like redundancy check mode. PanFS nevertheless does not solve the problems described above, which traditional mirroring and redundancy check techniques cannot solve.

Summary of the invention

The purpose of the present invention is to overcome the defects of the prior art described above and to provide a high-performance, large-capacity cluster file system built on network storage with an efficient mechanism for high data reliability and availability, reducing the space consumed by redundancy while maintaining good performance.

The purpose of the present invention is achieved through the following technical solutions:

The present invention provides a file redundancy storage method in a cluster file system, wherein files in the file system are stored on network storage nodes as data slices and have the following storage states: state 1, in which data slices are stored in mirrored mode; and state 2, in which data slices are stored in redundancy check mode. The method includes the following steps:

Read each data slice of a file in state 1;

Perform the redundancy calculation to generate a redundancy check slice;

Write the redundancy check slice to the file;

Release the storage space of each mirrored data slice of the file;

Change state 1 to state 2.

In the file redundancy storage method according to a preferred embodiment of the present invention, the redundancy check used in state 2 is a parity check.

In the file redundancy storage method according to a preferred embodiment of the present invention, the steps of the method are executed by an application server acting as a client of the file system.

In the file redundancy storage method according to a preferred embodiment of the present invention, the storage states of a file further include state 3, in which original-version data slices are stored in redundancy check mode and new-version data slices generated by write updates are stored in mirrored mode;

The method further includes the following steps:

Keep the original-version data slice being updated, which is in state 2, unchanged, and generate a new-version data slice;

Write the new-version data slice to the corresponding storage node, and at the same time mirror it to another storage node;

Change state 2 to state 3.

The file redundancy storage method according to a preferred embodiment of the present invention further includes the following write update steps:

For a data slice stored in mirrored mode, modify both the data slice and its mirror, which are stored on two storage nodes;

For a data slice stored in redundancy check mode, keep the original-version data slice unchanged and generate a new-version data slice; write the new-version data slice to the corresponding storage node, and at the same time mirror it to another storage node.

The file redundancy storage method according to a preferred embodiment of the present invention further includes the following steps:

Traverse the mirrored data slices of a file in state 3 in order; where a hole is encountered, fill the new-version data slice and the mirrored data slice with the original-version data slice;

Replace the original-version data slice with the new-version data slice;

Release the storage space of the new-version data slice and the redundancy check slice;

Change state 3 to state 1.

The file redundancy storage method according to a preferred embodiment of the present invention further includes the following recovery steps when a data slice fails:

In state 1, restore the failed data slice from its mirror on the corresponding node;

In state 2, perform the redundancy calculation to restore the failed data slice;

In state 3, if the failed slice is an original-version data slice, perform the redundancy calculation to restore it; if the failed slice is a new-version data slice or its mirror, restore it from the surviving copy on the corresponding node.

Compared with the prior art, the file redundancy storage method provided by the specific embodiments of the present invention lets data switch dynamically between storage states. Client writes update data in mirrored mode: a write request does not need to deal with redundancy calculation or redundancy updates, and the newly written data block is mirrored to the redundant storage management device. This avoids the poor write performance that small-write updates cause in redundancy check mode, so write performance is substantially higher than with existing redundancy check methods. Meanwhile, data that is not being actively written can be converted to redundancy check storage when system storage resources are tight, which reduces the space occupied by redundancy and improves space utilization while preserving data reliability and availability. In addition, when a storage node fails, mirroring and redundancy checking can be combined to reduce the amount of calculation and improve the efficiency of data recovery, safeguarding data reliability and availability.

Description of drawings

Embodiments of the present invention are further described below with reference to the accompanying drawings, in which:

FIG. 1 is a schematic diagram of the initial state, in which no data has been written to any storage node, according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of state 1, in which data slices are stored in mirrored mode, according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of state 2, in which data slices are stored in redundancy check mode, according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of state 3, in which original-version data slices are stored in redundancy check mode and new-version data slices generated by write updates are stored redundantly in mirrored mode, according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a write update to a data slice in state 1 according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of state 3 after a new-version data slice has been updated again, according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of a write update to an original-version data slice in state 3 according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of a failed redundancy check slice in state 2 according to an embodiment of the present invention;

FIG. 9 is a schematic diagram of a failed data slice in state 2 according to an embodiment of the present invention;

FIG. 10 is a schematic diagram of a failed redundancy check slice in state 3 according to an embodiment of the present invention;

FIG. 11 is a schematic diagram of a failed original-version data slice in state 3 according to an embodiment of the present invention;

FIG. 12 is a schematic diagram of a failed mirror of a new-version data slice in state 3 according to an embodiment of the present invention;

FIG. 13 is a schematic diagram of state 3 in which both an original-version data slice and its new-version data slice have failed, according to an embodiment of the present invention.

Detailed description

To make the purpose, technical solution and advantages of the present invention clearer, the invention is described in further detail below through specific embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described here only explain the present invention and do not limit it.

In an embodiment of the present invention, the file system includes an application server, a metadata server and multiple storage nodes; the application server acts as the client of the file system and performs file read and write operations. FIG. 1 shows the initial state according to one embodiment, in which no data has been written to any storage node (SN). When a file is first saved, it is stored as data slices on the corresponding network storage nodes, and the data slices of each storage node are mirrored to the other nodes, forming mirrored redundant storage that improves data reliability. The storage state of the file at this point is called state 1, shown in FIG. 2, in which data slices are stored in mirrored mode. This embodiment uses 3 storage nodes and 3 mirror nodes only as an example; in other embodiments the number of storage nodes is not limited and can be any number.

In an embodiment of the present invention, the storage states of a file in the file system further include state 2, shown in FIG. 3, in which data slices are stored in redundancy check mode. This embodiment uses a RAID5-like parity check as the redundancy check; other embodiments may use other redundancy checks known to those skilled in the art, such as a Hamming code check.

In an embodiment of the present invention, file storage state 1 can be converted to state 2 through the following steps, which include:

Write the redundancy check slice to the file;

Change state 1 to state 2.

The above steps may also be called the downgrade operation. When a large number of files are in state 1 and stored in mirrored mode, the space redundancy of those files reaches 100%. If the remaining storage resources of the system are tight at that point, files can be downgraded from state 1 to state 2 and stored with a redundancy check, which frees storage resources and reduces space usage without lowering system reliability.
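A minimal sketch of the downgrade operation follows, tracing the five steps listed in the summary (read the slices, compute the check slice, write it, release the mirrors, change the state). It is an in-memory model rather than the patented implementation: `StoredFile`, `xor_parity` and `downgrade` are hypothetical names, slices are assumed to be of equal length, and byte-wise XOR stands in for the RAID5-like parity check of this embodiment.

```python
from dataclasses import dataclass
from functools import reduce
from typing import List, Optional


def xor_parity(slices: List[bytes]) -> bytes:
    """Byte-wise XOR of equal-length slices (RAID5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*slices))


@dataclass
class StoredFile:
    data: List[bytes]                # primary data slices, one per storage node
    mirrors: List[Optional[bytes]]   # mirror copies held on other nodes (state 1)
    parity: Optional[bytes] = None   # redundancy check slice (state 2)
    state: int = 1


def downgrade(f: StoredFile) -> None:
    """Convert a file from state 1 (mirrored) to state 2 (redundancy check)."""
    if f.state != 1:
        raise ValueError("downgrade applies only to files in state 1")
    slices = list(f.data)               # 1. read each data slice
    parity = xor_parity(slices)         # 2. redundancy calculation
    f.parity = parity                   # 3. write the check slice
    f.mirrors = [None] * len(f.data)    # 4. release the mirrored slices
    f.state = 2                         # 5. change state 1 to state 2


slices = [b"\x01\x02", b"\x04\x08", b"\x10\x20"]
f = StoredFile(data=list(slices), mirrors=list(slices))
downgrade(f)
```

After the call, the three mirror copies are gone and a single check slice protects the group, which is where the space saving comes from.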

In this embodiment of the present invention, the downgrade operation is performed by the application server acting as a client of the file system, which avoids the performance bottleneck of a single, centrally managed control node. The downgrade operation can be invoked by a background downgrade daemon (DegradeDeamon) to reclaim space and release resources when resources are tight. In some embodiments the downgrade operation may instead be performed by a storage management server; in other embodiments it may be performed by the server side of the file system.

In this embodiment of the present invention, the storage states of a file in the file system further include state 3, shown in FIG. 4, in which original-version data slices are stored in redundancy check mode and new-version data slices generated by write updates are stored in mirrored mode. When a write update is applied to a file in state 2, the storage state of the file changes from state 2 to state 3, through steps that include:

Change state 2 to state 3.

The above steps may be referred to collectively as the write operation on state 2. As shown in FIG. 4, a write update is applied to a file in state 2. Taking a write update to data slice D₁₀ as an example, the old slice D₁₀ is kept unchanged and a new slice D₁₁ is generated; at the same time, a mirror D₁₁′ of the new slice D₁₁ is saved to another storage node as a mirrored backup to ensure data reliability. The storage state of the file at this point is called state 3.

In an embodiment of the present invention, a write update to a file performs different operations depending on how the data slice is stored, including the following steps:

The above steps may be referred to collectively as the write update operation. FIG. 5 is a schematic diagram of a write update to a data slice in state 1 in this embodiment. In state 1, data slices are stored in mirrored mode, so a write update must modify both the data slice and its mirror, which are stored on two storage nodes; the modification may be synchronous or asynchronous. Taking a write update to data slice D₁ as an example, D₁ is updated to D₁₂ and its mirror D₁′ on another storage node is updated to D₁₂′. The storage state remains state 1; these steps may also be called the write update operation on state 1.

A write update to a file in state 3 must distinguish how the data slice is stored: original-version data slices are stored in redundancy check mode, while new-version data slices generated by write updates are stored in mirrored mode. FIG. 6 is a schematic diagram of state 3 after a new-version data slice has been updated, according to an embodiment of the present invention; the new-version data slice is stored in mirrored mode. A write update is applied to the file in state 3 shown in FIG. 4. Taking a write update to D₁₁ as an example, the original-version data slice D₁₀, stored in redundancy check mode, is kept unchanged; D₁₁ is updated to D₁₂, and its mirror D₁₁′ on another storage node is updated to D₁₂′ as a mirrored backup. The storage state remains state 3.

FIG. 7 is a schematic diagram of a write update to an original-version data slice in state 3 according to an embodiment of the present invention; the original-version data slice is stored in redundancy check mode. A write update is applied to the file in state 3 shown in FIG. 6. Taking a write update to D₃₀ as an example, the original-version data slice D₃₀ is kept unchanged and a new-version data slice D₃₁ is generated; at the same time, a mirror D₃₁′ of the new-version data slice D₃₁ is saved to another storage node as a mirrored backup to ensure data reliability. The storage state remains state 3.
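The write update cases of FIGS. 4 to 7 can be summarised in one dispatch routine. The sketch below is an in-memory model under stated assumptions: `SliceSlot`, `ClusterFile` and `write_update` are hypothetical names, each slice position keeps its original copy plus an optional newest version, and the mirror is written synchronously.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SliceSlot:
    original: bytes                           # mirrored in state 1, check-protected in states 2 and 3
    original_mirror: Optional[bytes] = None   # mirror of the original (state 1 only)
    new: Optional[bytes] = None               # newest version written in state 2 or 3
    new_mirror: Optional[bytes] = None        # mirror of the newest version on another node


@dataclass
class ClusterFile:
    state: int
    slots: List[SliceSlot]


def write_update(f: ClusterFile, index: int, payload: bytes) -> None:
    """Apply a write update to one slice position, following the per-state rules."""
    slot = f.slots[index]
    if f.state == 1:
        # Mirrored slice: modify the slice and its mirror; the state stays 1.
        slot.original = payload
        slot.original_mirror = payload
    elif f.state in (2, 3):
        # Check-protected slice: leave the original untouched so the check slice
        # stays valid, and store the new version together with its mirror.
        # An existing new version (FIG. 6) is simply overwritten along with its mirror.
        slot.new = payload
        slot.new_mirror = payload
        f.state = 3
    else:
        raise ValueError(f"unknown storage state {f.state}")


mirrored = ClusterFile(state=1, slots=[SliceSlot(b"D1", b"D1")])
write_update(mirrored, 0, b"D12")

checked = ClusterFile(state=2, slots=[SliceSlot(b"D10"), SliceSlot(b"D30")])
write_update(checked, 0, b"D11")   # FIG. 4: state 2 becomes state 3
write_update(checked, 0, b"D12")   # FIG. 6: the new version is updated again
write_update(checked, 1, b"D31")   # FIG. 7: an original slice is updated
```

In none of the branches is the check slice read or recomputed, which is the property the method relies on for write performance.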

Read and write operations

In this embodiment of the present invention, the application server performs read and write operations as a client.

A read by the client does not affect the consistency state of the file system or the creation and maintenance of snapshots, so it follows the normal flow, shown in steps (1), (2) and (3):

(1) The client requests the file layout information (the mapping from logical file addresses to physical addresses) from the server;

(2) The server queries the relevant metadata and returns the corresponding layout;

(3) The client reads the data on the storage devices using the returned layout.

The application server performs a write operation as a client as shown in steps (4), (5) and (6):

(4) The client requests layout information from the server;

(5) The server queries the relevant metadata, returns the corresponding read and write layouts, and reserves the corresponding resources. The client reads and writes the devices according to the returned layouts: it reads the contents of the storage devices through the read-only layout and writes the modified contents to the storage devices that correspond to the writable layout;

(6) The client submits the corresponding metadata information for the layout, and the reserved resource information is added to the metadata organization.

For a read request, the client obtains the resource mapping addresses from the metadata server and then requests data directly from the storage nodes, so performance is not affected by redundant storage. For a write request, the client does not need to deal with redundancy updates; it only uses remote mirroring to mirror the newly written data block to the redundant storage management device, so write performance is substantially higher than in redundancy check mode. In some embodiments the client may further optimize write performance with techniques such as delayed writes, write aggregation and caching.
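The layout-based flow of steps (1) to (6) can be sketched as follows. This is a toy model, not the protocol of any real file system: `MetadataServer`, `client_write`, `client_read`, the `(node, slice id)` address format and the round-robin placement are all assumptions made for illustration, and mirroring of the written slices is omitted to keep the example short.

```python
from typing import Dict, List, Tuple

Location = Tuple[str, int]   # hypothetical physical address: (storage node, slice id)


class MetadataServer:
    def __init__(self) -> None:
        self.layouts: Dict[str, List[Location]] = {}    # committed layouts
        self.reserved: Dict[str, List[Location]] = {}   # resources reserved for pending writes
        self._next_id = 0

    def get_read_layout(self, name: str) -> List[Location]:
        """Steps (1)-(2): look up metadata and return the layout."""
        return list(self.layouts[name])

    def get_write_layout(self, name: str, n_slices: int) -> List[Location]:
        """Steps (4)-(5): return a writable layout and reserve its resources."""
        locs = [(f"SN{i % 3 + 1}", self._next_id + i) for i in range(n_slices)]
        self._next_id += n_slices
        self.reserved[name] = locs
        return locs

    def commit(self, name: str) -> None:
        """Step (6): fold the reserved resources into the metadata."""
        self.layouts[name] = self.reserved.pop(name)


def client_write(mds: MetadataServer, nodes: Dict[Location, bytes],
                 name: str, slices: List[bytes]) -> None:
    layout = mds.get_write_layout(name, len(slices))
    for loc, payload in zip(layout, slices):
        nodes[loc] = payload          # the client writes to the storage nodes directly
    mds.commit(name)


def client_read(mds: MetadataServer, nodes: Dict[Location, bytes], name: str) -> List[bytes]:
    """Step (3): read the storage nodes directly through the returned layout."""
    return [nodes[loc] for loc in mds.get_read_layout(name)]


mds = MetadataServer()
nodes: Dict[Location, bytes] = {}
client_write(mds, nodes, "report.dat", [b"slice-a", b"slice-b"])
data = client_read(mds, nodes, "report.dat")
```

The point of the sketch is that the metadata server only hands out and commits layouts; the data path runs between the client and the storage nodes.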

In an embodiment of the present invention, the storage state of a file can be converted from state 3 to state 1 through the following steps:

Change state 3 to state 1.

The above steps may also be called the upgrade operation. In this embodiment the application server, acting as the client, proceeds as follows. It traverses the mirrored data slices block′ of a file in state 3 in order. A hole means that no valid data has been written to that slice, which in turn means that the original-version data slice block₀ has not been modified and no new-version data slice block_i has been generated; in that case the original-version data slice is written into the new-version data slice block_i and the mirrored data slice block′ to fill them;

The original-version data slice block₀ is replaced with the new-version data slice block_i, because the file is being converted from storage state 3 to state 1 and only the data slice and its mirror are retained;

The information associated with the new-version data slice block_i and the redundancy check slice P is deleted, cleared and reclaimed; at the same time, state 3 is changed to state 1;

Performing the upgrade operation on each file in state 3 in turn completes the conversion from state 3 to state 1.
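The upgrade steps above can be sketched as follows. The model is hypothetical: `Slot`, `MixedFile` and `upgrade` are illustrative names, and a hole is represented by `None` in the mirror position.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slot:
    original: bytes                  # block_0, protected by the check slice
    new: Optional[bytes] = None      # block_i, newest version if one was written
    mirror: Optional[bytes] = None   # block', mirror of the newest version; None is a hole


@dataclass
class MixedFile:
    slots: List[Slot]
    parity: Optional[bytes]
    state: int = 3


def upgrade(f: MixedFile) -> None:
    """Convert a file from state 3 back to state 1 (fully mirrored)."""
    if f.state != 3:
        raise ValueError("upgrade applies only to files in state 3")
    for slot in f.slots:
        if slot.mirror is None:
            # Hole: the original was never updated, so fill the new version
            # and the mirror with the original content.
            slot.new = slot.original
            slot.mirror = slot.original
        slot.original = slot.new   # replace the original with the new version
        slot.new = None            # release the new-version slice
    f.parity = None                # release the redundancy check slice
    f.state = 1


f3 = MixedFile(
    slots=[Slot(b"D10", b"D12", b"D12"), Slot(b"D20"), Slot(b"D30", b"D31", b"D31")],
    parity=b"P0",
)
upgrade(f3)
```

After the call every slice position holds its latest content plus one mirror, which is exactly the layout of state 1.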

In this embodiment, the downgrade operation, the write operation on state 2, the write update operation and the upgrade operation are all performed by the application server acting as a client of the file system. In some other embodiments, these operations may instead be performed by a storage management server or by a server in the file system.

Some embodiments of the present invention provide a method for dynamically converting file storage states in a cluster file system, wherein files in the file system have the following storage states:

State 1: data slices are stored in mirrored mode;

State 2: data slices are stored in redundancy check mode;

State 3: original-version data slices are stored in redundancy check mode, and new-version data slices generated by write updates are stored in mirrored mode;

The dynamic conversion method includes the following steps:

State 1 remains state 1 after a write update operation;

State 1 enters state 2 through the downgrade operation;

State 2 enters state 3 through the write operation on state 2;

State 3 remains state 3 after a write update operation;

State 3 enters state 1 through the upgrade operation.
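The five transitions listed above form a small state machine, which can be written down as a lookup table. The enum and function names below are illustrative and not taken from the patent.

```python
from enum import Enum


class State(Enum):
    MIRRORED = 1   # state 1
    CHECKED = 2    # state 2
    MIXED = 3      # state 3


class Op(Enum):
    WRITE_UPDATE = "write update"
    DOWNGRADE = "downgrade"
    WRITE_ON_STATE_2 = "write operation on state 2"
    UPGRADE = "upgrade"


TRANSITIONS = {
    (State.MIRRORED, Op.WRITE_UPDATE): State.MIRRORED,
    (State.MIRRORED, Op.DOWNGRADE): State.CHECKED,
    (State.CHECKED, Op.WRITE_ON_STATE_2): State.MIXED,
    (State.MIXED, Op.WRITE_UPDATE): State.MIXED,
    (State.MIXED, Op.UPGRADE): State.MIRRORED,
}


def next_state(state: State, op: Op) -> State:
    """Return the state reached by applying `op`, or raise if the pair is not listed."""
    try:
        return TRANSITIONS[(state, op)]
    except KeyError:
        raise ValueError(f"{op.value} is not defined for state {state.value}") from None


s = State.MIRRORED
for op in (Op.WRITE_UPDATE, Op.DOWNGRADE, Op.WRITE_ON_STATE_2, Op.WRITE_UPDATE, Op.UPGRADE):
    s = next_state(s, op)
```

Walking the operations in the listed order takes a file through all three states and back to state 1.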

In state 1, the downgrade operation changes data slices stored in mirrored mode to redundancy check storage and moves the file to state 2. This operation usually takes place when system storage space is tight; it reduces the space the data occupies and frees storage space.

State 2 moves to state 3 through the write operation on state 2. In state 3, the data slices already stored in redundancy check mode are kept unchanged; when a write update is needed, a new-version data slice is generated directly and mirrored. No write update therefore requires a redundancy calculation or a redundancy update, which improves write performance.
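To see what is being avoided, the sketch below contrasts a conventional in-place small write under single parity with the mirrored write used in state 3. The read-modify-write rule (new parity = old parity XOR old data XOR new data) is the standard RAID5 technique and is brought in here as background, not quoted from the patent; the function names are illustrative.

```python
from typing import Tuple


def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity_small_write(old_data: bytes, old_parity: bytes, new_data: bytes) -> Tuple[bytes, bytes]:
    """In-place update under parity: read old data and old parity, write new data and new parity."""
    new_parity = xor(xor(old_parity, old_data), new_data)
    return new_data, new_parity


def mirrored_small_write(new_data: bytes) -> Tuple[bytes, bytes]:
    """State 3 style update: write the new version and its mirror, with no reads and no parity work."""
    return new_data, bytes(new_data)


d1, d2, d3 = b"\x0f", b"\x33", b"\x55"
parity = xor(xor(d1, d2), d3)
updated, updated_parity = parity_small_write(d2, parity, b"\xaa")
copy_a, copy_b = mirrored_small_write(b"\xaa")
```

The parity path needs two reads, two writes and two XOR passes per small write, while the mirrored path needs only two writes.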

When a file is in storage state 3 and all or the vast majority of its data slices have undergone write updates, all or the vast majority of the data slices D_ij already have a mirror D_ij′ on another storage node, so the data is in effect already stored in mirrored mode. The upgrade operation can be performed at this point; once it completes, the storage state returns to state 1.

When a storage node (SN) stops serving because of a node failure (such as a power failure or an operating system error) or a network failure (such as a broken network connection) and data is lost, the faulty device must be repaired or replaced. After the repair or replacement, the client uses the recovery operation to restore the lost data and ensure data reliability. In an embodiment of the present invention, the following steps are taken when a data slice fails:

The above steps may also be called the recovery operation. The recovery operation is described in detail below with reference to FIGS. 8, 9, 10, 11, 12 and 13.

In this embodiment of the present invention, a small file in state 1 may have only one data slice, while a large file may have several data slices distributed over different nodes; every data slice has a mirrored backup on another node. Recovering a file in state 1 only requires restoring the failed data slices of the faulty node from the mirrors on the corresponding nodes.

In an embodiment of the present invention, erroneous data slices in state 2 are classified and recovered as follows:

As shown in Figure 8, in state 2, recovery when the redundancy check slice P is in error requires reading the other data slices D₁, D₂ and D₃ of the file and performing redundancy calculation to generate a new redundancy check slice P. As shown in Figure 9, in state 2, recovery when the data slice D₃ is in error requires reading the other data slices D₁ and D₂ in the redundancy group together with the redundancy check slice P, and performing redundancy calculation to generate a new data slice D₃.
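The two state-2 cases can be illustrated with XOR parity, which is one possible redundancy calculation (claim 2 names parity checking). The slice contents and the helper name `xor_slices` below are hypothetical; this is a sketch under the assumption that all slices in a redundancy group have the same length.

```python
from functools import reduce

def xor_slices(slices):
    """XOR equal-length byte strings together, position by position."""
    if not slices:
        raise ValueError("need at least one slice")
    length = len(slices[0])
    if any(len(s) != length for s in slices):
        raise ValueError("all slices in a redundancy group must have equal length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*slices))

# Hypothetical redundancy group of three data slices.
d1, d2, d3 = b"\x01\x02\x03", b"\x10\x20\x30", b"\xaa\xbb\xcc"

# Figure 8 case: the check slice P is lost, so it is recomputed from D1, D2, D3.
p = xor_slices([d1, d2, d3])

# Figure 9 case: D3 is lost, so it is rebuilt from D1, D2 and P.
recovered_d3 = xor_slices([d1, d2, p])
```

Because XOR is its own inverse, the same routine serves both cases: any one missing member of the group equals the XOR of all the others.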

In an embodiment of the present invention, erroneous data in state 3 is classified and recovered as follows:

As shown in Figure 10, in state 3, recovery when the redundancy check slice P₀ is in error requires reading the other original-version data slices D₁₀, D₂₀ and D₃₀, which are stored in redundancy check mode, and performing redundancy calculation to restore P₀. As shown in Figure 11, in state 3, recovery when the original-version data slice D₃₀ is in error requires reading the other original-version data slices D₁₀ and D₂₀ together with the redundancy check slice P, and performing redundancy calculation to restore D₃₀. As shown in Figure 12, in state 3, when the mirror data slice D₁₁′ is in error, it is restored on the failed node from the data slice D₁₁ on a node that is not in error. As shown in Figure 13, in state 3, recovery when both the original-version data slice D₁₀ and the new-version data slice D₁₁ are in error proceeds in two steps: first, the other original-version data slices of the file, stored in redundancy check mode on the nodes that are not in error, are read and redundancy calculation is performed to restore the original-version data slice D₁₀; then the data slice D₁₁ on the failed node is restored from the mirrored data slice D₁₁′.
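The Figure 13 case can be sketched as below. Two things are assumptions rather than statements from the text: the redundancy calculation is taken to be XOR parity, and step one is taken to read the check slice P₀ along with the surviving original-version slices (XOR parity needs it to rebuild a data slice). Slice contents and names are hypothetical.

```python
def xor_bytes(*slices: bytes) -> bytes:
    """XOR equal-length byte strings position by position."""
    out = bytearray(len(slices[0]))
    for s in slices:
        for i, b in enumerate(s):
            out[i] ^= b
    return bytes(out)

# Original-version slices in redundancy check mode, plus their check slice.
d10, d20, d30 = b"AAAA", b"BBBB", b"CCCC"
p0 = xor_bytes(d10, d20, d30)

# A write update produced new-version slice D11 and its mirror D11'.
d11 = b"NEW1"
d11_mirror = d11

# The node holding D10 and D11 fails; these slices survive on other nodes.
surviving = {"D20": d20, "D30": d30, "P0": p0, "D11'": d11_mirror}

# Step 1: rebuild the original-version slice by redundancy calculation.
recovered_d10 = xor_bytes(surviving["D20"], surviving["D30"], surviving["P0"])

# Step 2: rebuild the new-version slice by copying its mirror.
recovered_d11 = surviving["D11'"]
```

Only the original-version slice needs a redundancy calculation; the new-version slice is restored by a plain copy.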

It can be seen that, when a storage node fails, combining mirroring with redundancy check also reduces the amount of redundancy calculation and redundancy updating and improves the efficiency of data recovery.

In an embodiment of the present invention, the above recovery operation is performed by an application server acting as a client of the file system. In some other embodiments, the operation may instead be performed by a storage management server or by a server in the file system.

In some embodiments of the present invention, the storage states of a file in the file system may be: state 1, data slices are stored in RAID1 mode; state 2, data slices are stored redundantly in RAID5 mode; state 3, original-version data slices are stored redundantly in RAID5 mode, and new-version data slices generated by write updates are stored in RAID1 mode.
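The three states and the transitions described in this document can be summarized as a small state machine. The enum member names and the event names (`convert_to_parity`, `write_update`, `upgrade`) are labels invented for this sketch; only the transitions themselves come from the text.

```python
from enum import Enum

class StorageState(Enum):
    MIRRORED = 1   # state 1: data slices in RAID1 (mirror) mode
    PARITY = 2     # state 2: data slices in RAID5 (redundancy check) mode
    MIXED = 3      # state 3: originals in RAID5 mode, new versions in RAID1 mode

# (current state, event) -> next state
TRANSITIONS = {
    (StorageState.MIRRORED, "convert_to_parity"): StorageState.PARITY,
    (StorageState.PARITY, "write_update"): StorageState.MIXED,
    (StorageState.MIXED, "upgrade"): StorageState.MIRRORED,
}

def next_state(state: StorageState, event: str) -> StorageState:
    # A write to a file in state 1 or state 3 does not change its state.
    if event == "write_update" and state is not StorageState.PARITY:
        return state
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not defined for {state.name}") from None

s = StorageState.MIRRORED
s = next_state(s, "convert_to_parity")   # storage is tight: 1 -> 2
s = next_state(s, "write_update")        # first write update: 2 -> 3
s = next_state(s, "upgrade")             # most slices rewritten: 3 -> 1
```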

The specific embodiments of the present invention described above provide an efficient mechanism for guaranteeing high reliability and high availability in a cluster file system based on network storage, using a data redundancy mechanism that combines mirroring with redundancy check to ensure the reliability and availability of file system data. The file redundancy storage method provided by the embodiments of the present invention converts the storage state of data dynamically. For client write operations, data is written and updated in mirror mode, so a client write request does not need to deal with the redundancy calculation and redundancy update required in redundancy check mode; the newly written data block is mirrored to the redundant storage management device. This avoids the poor write performance caused by small-write updates in redundancy check schemes such as RAID5, and write performance is greatly improved. At the same time, when system storage resources are tight, data that is not actively written can be converted to redundancy check storage, which reduces the space occupied by redundancy and improves space utilization while still ensuring data reliability and availability.
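The space and write-cost trade-off mentioned above can be made concrete with textbook figures. These numbers are not taken from this document: they assume equal-sized slices, a single check slice per group of n data slices, and the standard read-modify-write sequence for a RAID5 small write. The function names are hypothetical.

```python
def mirror_utilization() -> float:
    """Fraction of raw capacity holding user data when every slice has one mirror."""
    return 1 / 2

def parity_utilization(n_data: int) -> float:
    """Fraction of raw capacity holding user data with n data slices per check slice."""
    if n_data < 1:
        raise ValueError("a redundancy group needs at least one data slice")
    return n_data / (n_data + 1)

# Device I/Os needed to update one slice in place.
SMALL_WRITE_IOS = {
    # read old data, read old parity, write new data, write new parity
    "raid5_read_modify_write": 4,
    # write the slice and write its mirror
    "mirror": 2,
}

util_3_plus_1 = parity_utilization(3)
```

With three data slices per check slice, redundancy check storage uses 75% of raw capacity for user data against 50% for mirroring, while a mirrored write needs half the I/Os of a RAID5 small write and no parity reads.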

It can be seen that the file redundancy storage method according to the embodiments of the present invention achieves good space utilization, effectively reduces the overhead of redundancy calculation, and improves system performance. In addition, when a storage node fails, mirroring and redundancy check can be combined to reduce the amount of calculation and improve the efficiency of data recovery, safeguarding the reliability and availability of data.

Although the present invention has been described by way of preferred embodiments, it is not limited to the embodiments described herein, and includes various changes and variations made without departing from the scope of the present invention.

Claims

1. A file redundancy storage method in a cluster file system, wherein files of the file system are stored on network storage nodes in the form of data slices, with the following storage states:

State 1, in which data slices are stored in mirror mode;

State 2, in which data slices are stored in redundancy check mode;

State 3, in which original-version data slices are stored in redundancy check mode, and new-version data slices generated by write updates are stored in mirror mode;

the method comprising the following steps:

reading each data slice of a file in state 1;

performing redundancy calculation to generate a redundancy check slice;

writing the redundancy check slice to the file;

releasing the storage space of each mirror data slice of the file;

changing state 1 to state 2;

for a write update to a file in state 2, keeping the original-version data slice unchanged and generating a new-version data slice;

writing the new-version data slice to the corresponding storage node and, at the same time, making a mirror backup on another storage node;

changing state 2 to state 3.

2. The file redundancy storage method in a cluster file system according to claim 1, characterized in that the redundancy check mode is parity checking.

3. The file redundancy storage method in a cluster file system according to claim 1, characterized in that the method is performed by an application server acting as a client in the file system.

4. The file redundancy storage method in a cluster file system according to claim 1, characterized in that it further comprises the following write update steps:

for a data slice stored in mirror mode, modifying the data slice and its mirror, which are stored on two storage nodes, at the same time;

for a data slice stored in redundancy check mode, keeping the original-version data slice stored in redundancy check mode unchanged and generating a new-version data slice; writing the new-version data slice to the corresponding storage node and, at the same time, making a mirror backup on another storage node.

5. The file redundancy storage method in a cluster file system according to claim 1, characterized in that the method further comprises the following steps:

traversing in turn the mirror data slices of a file in state 3 and, on encountering a hole, filling the original-version data slice into the new-version data slice and the mirror data slice;

replacing the original-version data slices with the new-version data slices;

releasing the storage space of the new-version data slices and the redundancy check data slices;

changing state 3 to state 1.

6. The file redundancy storage method in a cluster file system according to claim 1, 4 or 5, characterized in that it further comprises the following recovery steps when a data slice is in error:

in state 1, restoring the erroneous data slice from the data slice mirror on the corresponding node;

in state 2, performing redundancy calculation to restore the erroneous data slice;

in state 3, if the erroneous slice is an original-version data slice, performing redundancy calculation to restore the erroneous data slice; if the erroneous slice is a new-version data slice or its mirror data slice, restoring the erroneous data slice from the data slice mirror on the corresponding node.