CN106874383B - Decoupling distribution method of metadata of distributed file system

Info

Publication number: CN106874383B
Application number: CN201710016284.7A
Authority: CN
Inventors: 陆游游; 舒继武; 李思阳
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2017-01-10
Filing date: 2017-01-10
Publication date: 2019-12-20
Anticipated expiration: 2037-01-10
Also published as: CN106874383A

Abstract

The invention discloses a method for decoupled distribution of distributed file system metadata, comprising: separating the metadata of a distributed file system into directory metadata, directory entry metadata, and file metadata; storing the directory metadata centrally on a directory metadata index node, without pointers to directory entries, and performing directory operations on that directory index node; and partitioning the directory entry metadata so that each entry is stored on the same node as the related file metadata, with a reverse index pointing to the directory metadata. The invention has the following advantages: it reduces the information exchanged between nodes when the distributed file system accesses metadata and lowers metadata access latency; at the same time, by separating out directory contents, it decouples the strong association between files and directories and can achieve high throughput, thereby improving the metadata processing efficiency of the distributed file system.

Description

A Method for Decoupled Distribution of Distributed File System Metadata

Technical Field

The invention relates to the field of computers, and in particular to a method for decoupled distribution of distributed file system metadata.

Background

A distributed file system is a storage system that supports massive data storage and is widely used in data centers, supercomputing centers, and public cloud platforms. Compared with traditional centralized storage, it has several advantages. It scales horizontally: adding storage nodes expands capacity dynamically while access throughput grows in step. It also offers flexible fault tolerance, using replication and erasure coding across nodes. In addition, a large storage cluster can be built from cheaper storage and computing devices while still serving large volumes of data. However, constrained by the file system access standard (POSIX), metadata access often becomes the performance bottleneck of a distributed file system. Metadata access frequently fails to meet high-throughput and low-latency requirements, even though in real systems more than half of all data accesses must go through the metadata nodes. To address the scalability of distributed file system metadata, existing techniques fall mainly into the following three types.

The first is a distributed metadata scaling method based on a dynamic directory tree. It divides the namespace of the distributed file system into subtrees along subdirectory boundaries, stores each subtree independently on one node, and dynamically adjusts which node stores it according to the access load. Its advantage is that access locations can be adjusted dynamically as the load changes. It cannot, however, solve the path traversal problem of file access: accessing a file requires accessing every directory along the path, and because these directories are often stored on different nodes, access latency is often high.

The second is a hash-based metadata scaling method, which distributes the metadata of the files in a directory across different nodes by hashing. Its advantage is that it reduces the file access load when a directory contains a large number of files. It does not, however, solve the scalability problem for directories.

The third stores file metadata in a key-value database, exploiting the fast access and low latency of key-value stores. This method still suffers from the same path lookup problem as the first method, and so it still cannot solve the problem of high access latency.

To reduce path lookup latency, these methods often cache metadata on the client, but this introduces substantial overhead for keeping the caches consistent, and so does not solve the problem at its root.

Summary of the Invention

The present invention aims to solve at least one of the above technical problems.

To this end, an object of the invention is to provide a method for decoupled distribution of distributed file system metadata that addresses the limited metadata scalability, low throughput, and high latency of distributed file systems.

To achieve the above object, an embodiment of the invention discloses a method for decoupled distribution of distributed file system metadata, comprising the following steps: S1: separating the metadata of the distributed file system into metadata of directory index nodes, metadata of directory entries, and metadata of files; S2: placing the directory metadata on the directory index node; S3: partitioning each directory's entries according to how the files are distributed, storing each directory entry on the node that stores the related file, and establishing a reverse index pointing to the directory metadata.

Further, the directory operations include creating a directory, deleting a directory, reading a directory, obtaining all metadata of a directory, changing the user group of a directory, and changing the owning user of a directory.

Further, the method also comprises: providing an identifier that uniquely determines a file globally; computing a hash of the global identifier of the file to be accessed; and locating the node that stores the metadata according to the hash value.

Further, the identifier is the full path of the file.

Further, the method also comprises: when a file or directory is created, creating, on the node where it is created, all directory entries along the parent directory path of that file or directory; if all or some of those directory entries already exist on that node, only the remaining ones are created.

Further, the method also comprises: when a file is deleted, deleting the file's metadata on the node where the file resides, together with the item pointing to that file in the corresponding directory entry metadata on that node.

Further, the method also comprises: when a directory is read or deleted, accessing all metadata nodes to obtain all directory entries under that directory.

Further, the method also comprises: providing a client cache, wherein the cached directory metadata is used, when the client creates a file, to determine whether the client has permission to create it; when accessing a file's metadata, the client first accesses the directory's metadata to obtain the access permissions; and when the client has access permission, it accesses the file's metadata.

Further, the method also comprises: changing a directory's permissions and deleting a directory only once the client caches of that directory's metadata have expired.

With the decoupled metadata distribution method according to embodiments of the invention, every metadata operation on a file accesses at most two nodes, and when the directory metadata is cached, only one. In that case the latency is a single round-trip time (RTT). Because a key-value store is used, the time to fetch the metadata itself is very low, negligible next to the Ethernet RTT. The method therefore effectively reduces the information exchanged between nodes when the distributed file system accesses metadata and lowers metadata access latency. At the same time, by separating out directory contents, it decouples the strong association between files and directories and can achieve high throughput, thereby improving the metadata processing efficiency of the distributed file system.

Additional aspects and advantages of the invention are set forth in part in the description that follows; in part they will become apparent from that description or be learned through practice of the invention.

Brief Description of the Drawings

The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the following description of embodiments taken together with the drawings, in which:

Fig. 1 is a flowchart of the method for decoupled distribution of distributed file system metadata according to an embodiment of the invention;

Fig. 2 is an overall system architecture diagram of an embodiment of the invention;

Fig. 3 is a schematic diagram of directory partitioning and decoupling according to an embodiment of the invention.

Detailed Description

Embodiments of the invention are described in detail below, and examples of the embodiments are shown in the drawings, in which the same or similar reference numerals denote the same or similar elements, or elements with the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the invention and are not to be construed as limiting it.

These and other aspects of embodiments of the invention will become clear with reference to the following description and drawings. The description and drawings disclose some specific implementations to show some of the ways in which the principles of the embodiments may be carried out, but it should be understood that the scope of the embodiments is not limited by them. On the contrary, the embodiments of the invention include all changes, modifications, and equivalents that fall within the spirit and scope of the appended claims.

The invention is described below with reference to the drawings.

Fig. 1 is a flowchart of the method for decoupled distribution of distributed file system metadata according to an embodiment of the invention. As shown in Fig. 1, the method comprises the following steps:

S1: Separate the metadata of the distributed file system into directory metadata, directory entry metadata, and file metadata.

S2: Place the directory metadata on the directory index node.

S3: Partition each directory's entries according to how the files are distributed, store each directory entry on the node that stores the related file, and establish a reverse index pointing to the directory metadata.

In one embodiment of the invention, the directory metadata of the file system is stored centrally on one node. In this arrangement, the metadata of a directory index node contains no address pointing to directory entry metadata. Only the basic directory data is kept, including but not limited to the directory's creation time, permission flags, group identifier, and owning user identifier. On this basis, most metadata operations that involve the directory index node's metadata but not the directory entry metadata are performed only on the node that stores the directory index metadata. Directory operations include creating a directory, deleting a directory, reading a directory, obtaining all metadata of a directory, changing the user group of a directory, and changing the owning user of a directory.

In one embodiment of the invention, the method also includes a hash-based distributed storage mechanism for file metadata. This mechanism spreads the storage of and access to file metadata across multiple nodes in order to balance system load. The algorithm uses an identifier that uniquely determines a file globally: when a client performs a metadata operation on a file, it computes the hash of the globally unique identifier of the file to be accessed, locates the node on which the file is stored, and operates on the metadata at that node. This guarantees that any file metadata operation modifies information on at most one metadata node.

In one embodiment of the invention, the identifier is the full path of the file.
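The hash-based placement described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `locate_node`, the choice of SHA-256, and the modulo mapping onto a fixed node count are all assumptions; the text only states that the hash of the globally unique identifier (here, the full path) selects the node.

```python
import hashlib


def locate_node(full_path: str, num_nodes: int) -> int:
    """Map a file's full path to the index of a file metadata node.

    Assumption: SHA-256 plus modulo is used purely for illustration;
    the source does not name a hash function.
    """
    digest = hashlib.sha256(full_path.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_nodes
```

Because the mapping depends only on the path, any client can compute the target node locally without consulting another node first.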

In one embodiment of the invention, the method also includes reverse storage of directory entries. The directory entry metadata is partitioned and distributed across multiple nodes, so that creating and deleting files incurs no additional node accesses. Specifically, when a file or directory is created, no metadata needs to be modified on any node other than the one where the file is created; instead, that node creates all directory entries along the parent directory path of the file or directory. If all or some of these entries already exist on that node, only the remaining ones are created. Reverse storage of directory entries thus ensures that creating or deleting a file never modifies multiple metadata nodes, which reduces the distributed synchronization overhead of accessing multiple nodes.
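A minimal sketch of this node-local bookkeeping is given below. It reflects one reading of the text, namely that the node records an entry for each component along the parent path. The class `FileMetadataNode`, its in-memory dictionaries, and the method names are hypothetical; a real node would persist this state in its key-value store.

```python
import posixpath


class FileMetadataNode:
    """Hypothetical in-memory model of one file metadata node."""

    def __init__(self):
        self.inodes = {}    # full path -> file metadata
        self.dentries = {}  # directory path -> child names known on THIS node

    @staticmethod
    def _path_entries(path):
        """Yield (parent_dir, child_name) pairs from the root down to `path`."""
        parent = "/"
        for name in (p for p in path.split("/") if p):
            yield parent, name
            parent = posixpath.join(parent, name)

    def create(self, path, meta):
        if path in self.inodes:
            raise FileExistsError(path)
        self.inodes[path] = meta
        # Create only the directory entries that are still missing locally;
        # no other node is contacted.
        for parent, name in self._path_entries(path):
            self.dentries.setdefault(parent, set()).add(name)

    def delete(self, path):
        # Remove the file metadata and the local entry pointing to it.
        del self.inodes[path]
        parent, name = posixpath.split(path)
        self.dentries[parent].discard(name)
```

For example, creating `/a/b/f.txt` on an empty node records the entries `/` → `a`, `/a` → `b`, and `/a/b` → `f.txt` on that node only. Cleaning up ancestor entries after the last file is deleted is left out of this sketch, since the text does not describe it.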

In one embodiment of the invention, when a file is deleted, the file's metadata on the node where the file resides is deleted, together with the item pointing to that file in the corresponding directory entry metadata on that node.

In one embodiment of the invention, the method also comprises: when a directory is read or deleted, accessing all metadata nodes to obtain all directory entries under that directory.

In one embodiment of the invention, the method also comprises: providing a client cache, wherein the cached directory metadata is used, when the client creates a file, to determine whether the client has permission to create it; when accessing a file's metadata, the client first accesses the directory's metadata to obtain the access permissions; and when the client has access permission, it accesses the file's metadata.

In one embodiment of the invention, the method also comprises: changing a directory's permissions and deleting a directory only once the client caches of that directory's metadata have expired.

To help those skilled in the art understand the invention further, it is described in detail through the following example.

As shown in Fig. 2, a metadata node for storing directories is provided. This node handles all requests concerning directory metadata in the distributed file system. In the implementation, it uses an Ethernet-based RPC protocol and receives client requests through a standard POSIX access interface. It is divided into four modules, among them an access interface that executes metadata requests and a key-value store that persists metadata to disk. There is also a directory content module that caches the directory contents of each node. When the metadata node starts, it scans every value in the key-value store to build the directory contents.

The directory metadata comprises the directory's creation time, permission flags, group identifier, owning user identifier, and globally unique identifier. Of these, the directory path is a variable-length string, and each of the remaining items is a fixed-length 8-byte field. The node supports the directory operations: creating a directory, deleting a directory, reading a directory, obtaining all metadata of a directory, changing the user group of a directory, and changing the owning user of a directory.
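One way to lay out such a record for a key-value store is sketched below. The layout is an assumption made for illustration: the variable-length path serves as the key, and the four fixed-length 8-byte fields are packed little-endian into the value. The names `DirInode`, `encode_dir`, and `decode_dir` are hypothetical.

```python
import struct
from typing import NamedTuple, Tuple

# Four unsigned 64-bit fields; little-endian byte order is an assumption.
_FIXED = struct.Struct("<4Q")


class DirInode(NamedTuple):
    ctime: int  # creation time
    mode: int   # permission flags
    gid: int    # group identifier
    uid: int    # owning user identifier


def encode_dir(path: str, inode: DirInode) -> Tuple[bytes, bytes]:
    """Return (key, value): the variable-length path and a 32-byte record."""
    return path.encode("utf-8"), _FIXED.pack(*inode)


def decode_dir(key: bytes, value: bytes) -> Tuple[str, DirInode]:
    """Inverse of encode_dir."""
    return key.decode("utf-8"), DirInode(*_FIXED.unpack(value))
```

Keeping the fixed-length fields in one packed value means a single key-value lookup returns everything a permission check needs.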

Directory creation begins when the node receives a create-directory request from a client. The request contains the path of the directory to create, its permission mode, and the user group identifier and user identifier under which it is created. The permission mode covers the read and write permissions of the directory's owner, of users in its group, and of other users. On receiving the request, the node first checks and confirms the permissions on the parent directory of the directory to be created; once the creation conditions are met, it writes the directory's metadata into the storage of the directory metadata node. The creation conditions are that the specified parent directory exists and that the client has permission to access it.

For deleting a directory, reading all of a directory's metadata, changing a directory's user group, and changing a directory's attributes, the node first checks, on receiving the client's request, that the directory exists and that the client has access permission. Once both are confirmed, it performs the requested operation. Deletion is a special case: before a directory is deleted, it must be confirmed that its subdirectories and the files in them have already been deleted.

In the overall architecture there is logically only one directory metadata node, and the metadata of all directories is stored on it. For reliability, synchronous backup may be used, but distributed scale-out may not.

Fig. 2 also shows a node that stores file metadata. This node handles all file-related metadata requests in the distributed file system and stores the corresponding metadata. It comprises a standard POSIX access interface for file operations, a metadata management layer that stores metadata by category, and a key-value store that persists metadata to disk.

The metadata of each file comprises the globally unique identifier of its parent directory as described in the first aspect, the file name, the file's access time, the file's permissions, the file's user group identifier, the file's modification time, the access time of the file contents, the file size, the file block size, the global identifier of the metadata node that created the file, and the metadata node on which the file was created. The parent directory's globally unique identifier and the file name together form a variable-length string; the other metadata fields are fixed-length 8-byte fields. The node supports the file metadata operations: creating a file, deleting a file, obtaining a file's metadata, changing a file's user group, changing a file's owning user, changing a file's permissions, reading and writing a file, and changing a file's size.

File creation proceeds in two phases. In the first phase, the client accesses the directory in which the file is to be created: it contacts the directory metadata node described in the first aspect to determine whether it has permission to create a file in that directory. Once the request is found to be legal, the second phase begins. The client sends the request to the file metadata node, which looks up the string formed by joining the parent directory's globally unique identifier and the file name. If the node finds no such file, the file is created successfully.
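The two phases can be sketched as follows. Everything here is a simplified model under stated assumptions: `DirNode`, `FileNode`, and `create_file` are hypothetical names, the permission check looks only at the POSIX write bits, the lookup key is the full path (parent identifier joined with the file name), and direct method calls stand in for RPCs.

```python
import hashlib
import posixpath


class DirNode:
    """Hypothetical directory metadata node: path -> mode/uid/gid."""

    def __init__(self):
        self.dirs = {"/": {"mode": 0o755, "uid": 0, "gid": 0}}

    def may_write(self, path, uid, gid):
        d = self.dirs.get(path)
        if d is None:  # parent directory must exist
            return False
        if uid == d["uid"]:
            return bool(d["mode"] & 0o200)
        if gid == d["gid"]:
            return bool(d["mode"] & 0o020)
        return bool(d["mode"] & 0o002)


class FileNode:
    """Hypothetical file metadata node: key -> file metadata."""

    def __init__(self):
        self.files = {}

    def create(self, key, meta):
        if key in self.files:  # file already present on this node
            return False
        self.files[key] = meta
        return True


def create_file(path, uid, gid, dir_node, file_nodes):
    parent, name = posixpath.split(path)
    # Phase 1: permission check against the parent directory's metadata.
    if not dir_node.may_write(parent, uid, gid):
        raise PermissionError(parent)
    # Phase 2: hash the joined identifier to pick the node, then create there.
    key = posixpath.join(parent, name)
    h = int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")
    index = h % len(file_nodes)
    if not file_nodes[index].create(key, {"uid": uid, "gid": gid}):
        raise FileExistsError(path)
    return index
```

With a warm client-side directory cache, phase 1 would be answered locally, leaving a single round trip for phase 2.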

For the other file metadata operations, the node first locates the file, then determines whether the client has access permission, and, if so, applies the corresponding change. Among these changes, an update caused by a file read or write can complete only after the file's data has been written to the back-end storage system or successfully read from it; this process must be defined as a transaction.

A metadata node is a network node that communicates over RPC. On initialization it opens the corresponding service port and starts the database that stores the metadata. Metadata operation requests are processed concurrently by multiple threads.

As shown in Fig. 2, there are logically multiple file metadata nodes in the system. The node on which a file's metadata is stored is determined by hashing the string formed from the unique identifier of its parent directory and the file name. The computation uses consistent hashing, which scales well and also supports adding file metadata nodes dynamically.
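A small consistent-hash ring illustrates why nodes can be added dynamically. This is a generic textbook construction, not the patent's specific algorithm: the class name `HashRing`, the use of SHA-256, and the virtual-node count are assumptions.

```python
import bisect
import hashlib


def _hash(text: str) -> int:
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._points = []  # sorted list of (hash, node_name)
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each node owns several points on the ring to even out the load.
        for i in range(self.vnodes):
            bisect.insort(self._points, (_hash(f"{node}#{i}"), node))

    def locate(self, key: str) -> str:
        # First ring point at or after the key's hash, wrapping around.
        i = bisect.bisect_left(self._points, (_hash(key), ""))
        return self._points[i % len(self._points)][1]
```

When a node is added, a key either stays where it was or moves to the new node, so only a fraction of the file metadata has to migrate.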

As shown in Fig. 2, the architecture also includes a distributed file system client with a cache. On this client, applications access the distributed storage by calling the library provided by the distributed file system directly. On Linux, the library can be used through Filesystem in Userspace (FUSE), which allows the file system to be mounted directly on the client. When the client starts, it creates a directory cache; this cache lives in the access library.

When accessing the system, the client uses the RPC protocol and applies a different access strategy to each kind of operation. Directory metadata operations include mkdir (create a directory), getattr (get the attributes of a file or directory; here, only directory metadata), chmod (set file or directory permissions), and chown (change the owner of a file or directory). For these, the client accesses the directory metadata node directly, supplying different parameters for different operations; the node processes the request on receipt and returns the result to the client.

File metadata operations include open, read, write, getattr, truncate, and so on. For these, the client first consults the directory cache to check whether access through the parent directory is permitted. If the directory cache holds no relevant information, the client first contacts the directory metadata node and stores the parent directory's metadata in its directory cache. It then checks the permissions and, once they are confirmed, sends the metadata operation request to the file metadata node.

Two operations need special mention. For readdir (read directory contents), the client sends a read-directory request to all file metadata nodes and to the directory metadata node; each node returns the directory contents it stores, and the client organizes them into an ordered list that is returned to the user. For rmdir (delete a directory), the client requests the directory's contents from all file metadata nodes and the directory metadata node; only when every response is empty is deleting the directory confirmed to be legal, and only then can the delete-directory request be sent to the file metadata node.
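The fan-out behavior of readdir and rmdir can be sketched as below. Each node is modeled as a plain dictionary from directory path to the set of names it stores for that directory, standing in for an RPC to that node; the function names and the `directories` set are hypothetical.

```python
import errno
import os


def readdir(path, nodes):
    """Merge the partial directory contents held by every node."""
    names = set()
    for node in nodes:  # in a real client, one RPC per node
        names |= node.get(path, set())
    return sorted(names)  # ordered list returned to the caller


def rmdir(path, directories, nodes):
    """Delete `path` only if every node reports it as empty."""
    if any(node.get(path) for node in nodes):
        raise OSError(errno.ENOTEMPTY, os.strerror(errno.ENOTEMPTY), path)
    directories.discard(path)
    for node in nodes:
        node.pop(path, None)  # drop the now-empty local entry list
```

The cost of both operations grows with the number of metadata nodes, which is the trade-off this design accepts in exchange for single-node file creation and deletion.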

The client cache must also handle the following situation. Because some clients modify directories while the system is running, cached directory metadata must carry an expiry time, and a change to a directory's permissions or the deletion of a directory must wait until every client's cached copy of that directory's metadata has expired. When a client sends either of these two requests, the request is therefore not executed immediately; the directory is updated or deleted only after the cache expiry time set by the directory metadata node has elapsed.
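The deferred update can be sketched with an explicit clock, as below. The class `DirMetadataNode` and its methods are hypothetical. One detail goes beyond the text and is an assumption of this sketch: while a change is pending, newly issued cache entries are capped to expire no later than the moment the change takes effect, so that no client can hold a stale copy afterwards.

```python
class DirMetadataNode:
    """Hypothetical directory node that defers chmod until caches expire."""

    def __init__(self, ttl):
        self.ttl = ttl      # cache expiry time handed to clients
        self.modes = {}     # directory path -> permission mode
        self._pending = {}  # directory path -> (apply_at, new_mode)

    def _apply_due(self, now):
        for path, (apply_at, mode) in list(self._pending.items()):
            if apply_at <= now:
                self.modes[path] = mode
                del self._pending[path]

    def lookup(self, path, now):
        """Return (mode, expires_at) for a client to cache."""
        self._apply_due(now)
        expires_at = now + self.ttl
        if path in self._pending:
            # Assumption: cap the expiry at the pending change's apply time.
            expires_at = min(expires_at, self._pending[path][0])
        return self.modes[path], expires_at

    def chmod(self, path, mode, now):
        """Queue the change; it takes effect one expiry period later."""
        self._apply_due(now)
        self._pending[path] = (now + self.ttl, mode)
```

With `ttl=10`, a chmod issued at time 5 takes effect at time 15, by which point every cache entry handed out earlier has expired.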

如图2所示的客户端还需要一个配置文件来存放全局的地图。全局地图包括了目录元数据节点和文件元数据节点所在的IP地址，每个文件元数据节点的全局唯一编号，基于这个全局地图和编号，客户端才可以通过哈希算法计算确定文件节点的位置，同时能够通过网络与相应的元数据节点进行信息交换。在客户端初始化的过程中，客户端会将全局地图读入，并目录元数据节点和文件元数据节点建立心跳链接，每个一段时间将确定目录和文件元数据节点是否还在正常的工作。The client as shown in Figure 2 also needs a configuration file to store the global map. The global map includes the IP address of the directory metadata node and the file metadata node, and the globally unique number of each file metadata node. Based on this global map and number, the client can determine the location of the file node through the hash algorithm , and can exchange information with corresponding metadata nodes through the network. During the initialization process of the client, the client will read in the global map and establish a heartbeat link with the directory metadata node and the file metadata node, and will determine whether the directory and file metadata nodes are still working normally every time.

As shown in Figure 3, organizing the metadata requires partitioning the metadata of files. For this purpose, this embodiment defines four metadata structures: the metadata of a directory (d-inode), the content of a directory (d-entry), the metadata of a file (f-inode), and the content of a file (f-content). In a traditional file system, a d-inode indexes an f-inode through a d-entry, and the f-inode indexes the actual content of the file. This embodiment proposes a different metadata organization, in which each node organizes its own share of the directory content. As shown in Figure 2, through decoupling the original d-entry is distributed to each d-inode, which removes the strong coupling that the d-entry imposes between files and directories. In the implementation, creating a file first reads the parent directory to determine whether the creation is allowed; as explained for the method above, the parent directory's metadata can be obtained from the client's directory cache. The client then assigns the file to one file metadata node for storage, using consistent hashing. When the file metadata is stored, the file is also added to the directory content cache of the node on which it resides, which avoids interaction with other nodes and reduces access latency. When the directory content is needed, visiting every node is enough to aggregate the decoupled directory content into one complete listing, which is sent to the application.
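The file-creation path described above (permission check against the cached parent d-inode, consistent-hash placement, and a directory entry written on the same node as the f-inode) can be sketched as follows. The class and field names, the `writable` flag, the key format `parent_id/name`, and the number of virtual points on the ring are illustrative assumptions only.

```python
import bisect
import hashlib
from typing import Dict, List, Set, Tuple


def _h(text: str) -> int:
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")


class FileNode:
    """Hypothetical file metadata node: f-inodes plus its own slice of d-entries."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.f_inodes: Dict[Tuple[str, str], dict] = {}
        self.d_entries: Dict[str, Set[str]] = {}   # parent dir id -> file names


class Ring:
    """Minimal consistent-hash ring with several virtual points per node."""

    def __init__(self, nodes: List[FileNode], vnodes: int = 64) -> None:
        points = [(_h(f"{n.name}#{i}"), n) for n in nodes for i in range(vnodes)]
        points.sort(key=lambda p: p[0])
        self._hashes = [p[0] for p in points]
        self._nodes = [p[1] for p in points]

    def lookup(self, key: str) -> FileNode:
        # First ring point clockwise from the key's hash, wrapping at the end.
        i = bisect.bisect_right(self._hashes, _h(key)) % len(self._hashes)
        return self._nodes[i]


def create_file(ring: Ring, dir_cache: Dict[str, dict],
                parent_id: str, name: str) -> FileNode:
    """Check the cached parent d-inode, then write f-inode and d-entry on one node."""
    if not dir_cache[parent_id]["writable"]:
        raise PermissionError(parent_id)
    node = ring.lookup(f"{parent_id}/{name}")
    node.f_inodes[(parent_id, name)] = {"size": 0}
    node.d_entries.setdefault(parent_id, set()).add(name)  # same node, no extra hop
    return node
```

The point of the sketch is that `create_file` touches exactly one node: the directory entry is recorded where the file metadata lands, so no second node has to be updated.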

As shown in Figure 2, metadata must also be persisted through a key-value database while the embodiment runs. This embodiment stores metadata in a key-value database as follows: the name or the globally unique identifier of a file or directory is used as the key, and the metadata of that file or directory is used as the value.

In the directory metadata node, the key-value database uses the path as the search key and stores the remaining metadata as the value.

In a file metadata node, the search key is the string formed from the globally unique identifier of the parent directory and the file name, and the remaining metadata is stored as the value.
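The two key layouts can be illustrated with a short sketch in which a plain dictionary stands in for the key-value database. The function names, the `/` separator between the parent identifier and the file name, and the sample identifiers are assumptions; the source only says that the two parts form one string.

```python
def directory_key(path: str) -> bytes:
    """Key on the directory metadata node: the path itself."""
    return path.encode("utf-8")


def file_key(parent_dir_id: str, file_name: str) -> bytes:
    """Key on a file metadata node: parent directory identifier plus file name."""
    # "/" cannot appear in a file name, so it separates the two parts
    # unambiguously as long as the identifier does not contain it either.
    return f"{parent_dir_id}/{file_name}".encode("utf-8")


# A dict stands in for the key-value database in this sketch.
store = {}
store[directory_key("/home/alice")] = b"<d-inode fields>"
store[file_key("dir-42", "notes.txt")] = b"<f-inode fields>"
```

Keying files by the parent's identifier rather than by the full path means that a file's key does not change when a directory higher up the path is renamed.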

In the client of the distributed file system, metadata is cached using the same key-value scheme as on the directory metadata node. Specifically, when the metadata of a directory is stored, the path of the directory is used as the key and the other metadata of the directory is stored as the value; in addition, a globally unique identifier of the directory is appended to the end of the key-value entry. This identifier is managed by the directory metadata node. When a file is created, the unique identifier of its parent directory together with the file name is used as the key, and the rest of the file's metadata is stored as the value. Because files are distributed across the nodes, having a single node create and manage the unique identifiers of all files would inevitably turn that node into a performance bottleneck. For this reason, this embodiment combines the identifier of the node on which the file is created with an identifier that this metadata node assigns to the file and that is unique within the node; together they form a globally unique file identifier. This identifier does not change when the file is later modified, for example renamed or moved to another path. When the file's metadata is saved, the identifier is stored at the end of the key-value entry.
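The composition of a globally unique file identifier from a node identifier and a node-local identifier can be sketched as follows. The `FileId` and `IdAllocator` names and the use of a simple increasing counter for the node-local part are assumptions; the source only requires that the local part be unique on its node.

```python
import itertools
from typing import NamedTuple


class FileId(NamedTuple):
    node_id: int    # globally unique number of the creating metadata node
    local_id: int   # unique only within that node


class IdAllocator:
    """Per-node allocator: no coordination with other nodes is needed."""

    def __init__(self, node_id: int) -> None:
        self.node_id = node_id
        self._counter = itertools.count(1)

    def next_id(self) -> FileId:
        return FileId(self.node_id, next(self._counter))
```

Two nodes may hand out the same local number, but the pair stays unique because the node numbers differ, so no single node sits on the file-creation path of the whole cluster.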

As for the internal data structure of the key-value store, this embodiment uses a different key-value store for each kind of data service. The directory metadata node uses a B-tree-based key-value database, the file metadata nodes use a hash-based key-value database, and the client of the distributed file system uses an in-memory database to hold its directory metadata cache.

To optimize metadata storage, this embodiment uses variable-length keys and fixed-length values. This approach requires the set of fields in each metadata record to be fixed. Every value in the metadata fields of directory metadata has a fixed length, and so does every value in the metadata fields on a file metadata node. When stored, the fixed-length fields are written directly into the value of the key-value store, without serialization or deserialization. In addition, this approach needs no extra in-memory data structure to cache metadata; the metadata is cached directly in the cache of the key-value database.
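The benefit of fixed-length values is that each field sits at a known byte offset, so a single field can be read without decoding the whole record. The sketch below uses Python's `struct` module as a stand-in for writing a fixed-layout record directly; the field list (mode, uid, gid, size, mtime) and their widths are hypothetical and are not specified by the source.

```python
import struct

# Hypothetical f-inode layout: mode, uid, gid (uint32 each), size, mtime (uint64 each).
# "<" selects little-endian with no padding, so the record is always 28 bytes.
F_INODE = struct.Struct("<IIIQQ")
SIZE_OFFSET = 12  # three 4-byte fields precede the size field


def pack_f_inode(mode: int, uid: int, gid: int, size: int, mtime: int) -> bytes:
    """Build the fixed-length value that is stored under the file's key."""
    return F_INODE.pack(mode, uid, gid, size, mtime)


def read_size(value: bytes) -> int:
    """Read one field at its fixed offset without decoding the other fields."""
    return struct.unpack_from("<Q", value, SIZE_OFFSET)[0]
```

With a layout like this, an update to one field can overwrite a known byte range of the stored value, which is the kind of saving the fixed-length design is aiming at.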

As for the results achieved by the present invention, every metadata operation on a file accesses nodes at most twice, and when the directory metadata is cached, only one node access is needed, so the latency is that of a single round trip (one RTT). Because a key-value store is used, the time spent fetching the metadata itself is very low and is negligible compared with the Ethernet RTT. The method therefore effectively reduces the information exchanged between nodes when the distributed file system accesses metadata and lowers metadata access latency. At the same time, separating the directory content decouples the strong association between files and directories, which allows high throughput and thus improves the efficiency with which the distributed file system processes metadata.

In addition, other components and functions of the method for decoupling and distributing distributed file system metadata according to the embodiment of the present invention are known to those skilled in the art and, to reduce redundancy, are not described again.

In the description of this specification, a description that refers to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principle and spirit of the present invention. The scope of the invention is defined by the claims and their equivalents.

Claims

1. A method for decoupling and distributing metadata of a distributed file system, characterized by comprising the following steps:

S1: separating metadata of the distributed file system to obtain metadata of a directory, metadata of a directory entry and metadata of a file;

S2: setting the metadata of the directory at a directory inode;

S3: dividing the directory entries according to the distribution of the files, storing each directory entry on the node where its related file is stored, and establishing a reverse index pointing to the metadata of the directory;

further comprising: when a file or a directory is created, creating all directory entries containing a parent directory path of the file or the directory at the node where the file or the directory is created; if all or some of these directory entries have already been created at that node, only the remaining directory entries are created.

2. The method of claim 1, wherein the directory operations include creating a directory, deleting a directory, reading a directory, obtaining all metadata of a directory, changing a user group in which the directory is located, and changing a user to which the directory belongs.

3. The method for decoupled distribution of distributed file system metadata according to claim 1, further comprising:

providing a globally unique identifier for each file;

calculating the hash value of the global identifier of the file to be accessed;

and locating the node on which the metadata is stored according to the hash value.

4. The method of claim 3, wherein the identifier is a complete path of the file.

5. The method for decoupled distribution of distributed file system metadata according to claim 1, further comprising:

when a file is deleted, deleting the metadata of the file on the node where the file is located, and deleting, from the directory entry metadata on that node, the entry that points to the file.

6. The method for decoupled distribution of distributed file system metadata according to claim 1, further comprising:

when a read directory or delete directory operation is performed, all metadata nodes are accessed to obtain all directory entries under the directory being read or deleted.

7. The method for decoupled distribution of distributed file system metadata according to claim 1, further comprising:

providing a client cache, wherein the directory metadata cached by the client is used for determining whether the client has the authority of creating the file when the client creates the file;

when the client accesses the metadata of the file, the client accesses the metadata of the directory to acquire access authority;

when the client has access rights, metadata of the file is accessed.

8. The method for decoupled distribution of distributed file system metadata according to claim 7, further comprising:

and changing the permissions of the directory metadata or deleting the directory only after the directory metadata cached by the clients has expired.