CN111367926A

CN111367926A - Data processing method and device for distributed system

Info

Publication number: CN111367926A
Application number: CN202010125232.5A
Authority: CN
Inventors: 肖永玲; 赵岩; 刘名欣; 张旭明; 王豪迈; 胥昕
Original assignee: Xsky Beijing Data Technology Corp ltd
Current assignee: Xsky Beijing Data Technology Corp ltd
Priority date: 2020-02-27
Filing date: 2020-02-27
Publication date: 2020-07-03

Abstract

The invention discloses a data processing method and device for a distributed system. The method includes: receiving a data storage instruction; acquiring data to be stored, and segmenting the data to be stored according to a specified size; and separately calculating the verification data of the data to be stored in each segment, and storing the verification data of the data to be stored in each segment as one verification record. The invention solves the technical problem in the prior art that a distributed storage system consumes too much system performance when verifying data.

Description

Data processing method and device for distributed system

Technical Field

The present invention relates to the field of data storage, and in particular to a data processing method and device for a distributed system.

Background

In a distributed storage system, data consistency is very important. Silent errors of the disk itself, hardware anomalies, or software bugs can all cause user data to be lost or become inconsistent, bringing huge losses to users. Therefore, a distributed storage system performs data verification through MD5 or CRC and records the size of each data block in the storage system; when data is read, the consistency of the replica data is judged by computing the checksum of the data and comparing it with the stored checksum.

Prior-art data verification devices need to perform data verification when reading data from a distributed storage system. This process has a considerable impact on system performance, which is a defect of current distributed storage systems.

No effective solution has yet been proposed for the problem in the prior art that a distributed storage system consumes too much system performance when verifying data.

Summary of the Invention

Embodiments of the present invention provide a data processing method and device for a distributed system, so as to at least solve the technical problem in the prior art that a distributed storage system consumes too much system performance when verifying data.

According to one aspect of the embodiments of the present invention, a data processing method for a distributed system is provided, including: receiving a data storage instruction; acquiring data to be stored, and segmenting the data to be stored according to a specified size; and separately calculating the verification data of the data to be stored in each segment, and storing the verification data of the data to be stored in each segment as one verification record.

Further, the above method also includes: receiving a data read instruction; reading data according to the data read instruction, and calculating current verification data of the read data; comparing the current verification data with the original verification data that the read data had when it was written; and, if the comparison succeeds, determining that the read data is the data to be read indicated by the data read instruction.

Further, the above method also includes: receiving a data zero-write instruction or a data clear instruction; and clearing the verification data corresponding to the data to be zeroed or the data to be cleared, and updating the verification data of the segment in which the data to be zeroed or the data to be cleared is located.

Further, a preset verification data cache space stores verification data, and before the current verification data is compared with the original verification data that the read data had when it was written, the method further includes: extracting the original verification data from the verification data cache space.

Further, extracting the original verification data from the verification data cache space includes: if the original verification data exists in the verification data cache space, extracting the original verification data from the verification data cache space; if the original verification data does not exist in the verification data cache space, extracting the original verification data from disk, and writing the original verification data extracted from disk into the verification data cache space.

Further, comparing the current verification data with the original verification data that the read data had when it was written includes: calling an asynchronous read-back module, wherein the current thread completes after the asynchronous read-back module submits the read request, so as to support asynchronous reading; and comparing, by the asynchronous read-back module, the current verification data with the original verification data that the read data had when it was written.

According to one aspect of the embodiments of the present invention, a data processing apparatus for a distributed system is provided, including: a receiving module, configured to receive a data storage instruction; an acquiring module, configured to acquire data to be stored and segment the data to be stored according to a specified size; and a first calculation module, configured to separately calculate the verification data of the data to be stored in each segment and store the verification data of the data to be stored in each segment as one verification record.

Further, the above apparatus also includes: a receiving module, configured to receive a data read instruction; a second calculation module, configured to read data according to the data read instruction and calculate current verification data of the read data; a comparison module, configured to compare the current verification data with the original verification data that the read data had when it was written; and a determination module, configured to determine, if the comparison succeeds, that the read data is the data to be read indicated by the data read instruction.

According to one aspect of the embodiments of the present invention, a storage medium is provided. The storage medium includes a stored program, wherein, when the program runs, the device in which the storage medium is located is controlled to execute the above data processing method for a distributed system.

According to one aspect of the embodiments of the present invention, a processor is provided. The processor is configured to run a program, wherein the program, when run, executes the above data processing method for a distributed system.

In the embodiments of the present invention, a data storage instruction is received; data to be stored is acquired and segmented according to a specified size; and the verification data of the data to be stored in each segment is calculated separately and stored as one verification record. In a Ceph-based distributed storage system, the minimum IO size of each operation is 4 KiB, and every 4 KiB of data is encoded with the crc32 algorithm to obtain a 4-byte data check value. If the data check values of a whole object were kept as a single verification data record, every 4 KiB IO update would have to operate on the record for the entire object. With the segmentation described above, each operation loads, updates, or clears only the verification data record of the segment in which the operated data is located, and then persists or deletes the corresponding RocksDB record. This reduces the performance impact of enabling data verification and solves the technical problem in the prior art that a distributed storage system consumes too much system performance when verifying data.

Description of the Drawings

The drawings described here are provided for further understanding of the present invention and form part of this application. The exemplary embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of it. In the drawings:

FIG. 1 is a flowchart of a data processing method for a distributed system according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of segmented storage of verification data according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of storing verification data based on a crc cache according to an embodiment of the present invention;

FIG. 4 is a flowchart of reading data according to an embodiment of the present invention; and

FIG. 5 is a schematic diagram of a data processing apparatus for a distributed system according to an embodiment of the present invention.

To facilitate understanding of this embodiment, the technical terms that appear in it are explained below:

Distributed storage system: in simple terms, data is stored in a distributed manner across multiple storage servers, and these scattered storage resources are combined into one virtual storage device that provides data storage services.

PG: Placement Group. A placement group (PG) gathers a series of objects into one group and maps that group to a series of OSDs.

OSD: OSD (Object Storage Device) refers to a process responsible for writing data to disk.

Silent error: Silent Data Corruption. The core mission of a hard disk is to store data correctly, read data correctly, and raise an exception alert promptly when an error occurs. Disk anomalies may include hardware errors, firmware or software bugs, power supply problems, media damage, and so on. Ordinarily these problems are caught and reported as exceptions; the most dangerous case is when all data processing appears normal and the data is found to be wrong or corrupted only when it is used. This is a silent error.

Detailed Description

To enable those skilled in the art to better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.

It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings of the present invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments of the present invention described here can be implemented in orders other than those illustrated or described here. In addition, the terms "comprising" and "having", and any variations of them, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.

Embodiment 1

According to an embodiment of the present invention, an embodiment of a data processing method for a distributed system is provided. It should be noted that the steps shown in the flowcharts of the drawings may be executed in a computer system, such as one running a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.

FIG. 1 is a flowchart of a data processing method for a distributed system according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:

Step S102: receive a data storage instruction.

Specifically, the data storage instruction is used to store data in the distributed storage system.

Step S104: acquire data to be stored, and segment the data to be stored according to a specified size.

In a Ceph distributed storage system, the size of each object is 4 MiB. In the above solution, each object is segmented, yielding multiple segments for one object. In an optional embodiment, a 4 MiB object can be divided into multiple segments; for example, if it is segmented in units of 256 KiB, one object has 16 segments (4 MiB / 256 KiB = 16).

Step S106: separately calculate the verification data of the data to be stored in each segment, and store the verification data of the data to be stored in each segment as one verification record.

Specifically, the verification data is the check value of the data. The verification data of the data to be stored in each segment can be determined by MD5 (Message Digest Algorithm 5) or CRC (Cyclic Redundancy Check). In the above embodiment, if segmentation is performed in units of 256 KiB, one object has 16 verification data records.

FIG. 2 is a schematic diagram of segmented storage of verification data according to an embodiment of the present invention. Referring to FIG. 2, a 4 MiB object is divided into 16 segments, each with data_chunk_size = 256 KiB. Every 4 KiB of data is encoded with the crc32 algorithm to obtain a 4-byte data check value, and the data check values of each 256 KiB are stored as one <key, value> crc_chunk record (a 4 KiB IO has a 4-byte check value, so the check values of 256 KiB of data are 64 4-byte values, that is, 256 bytes, and these 256 bytes of check values are stored as one entry in the database), where the key indicates the identifier of the segment and the value holds the data check values. The records are stored in the back-end RocksDB, giving 16 verification data records, each with crc_chunk_size = 256 B, for a total of 4 KiB of crc values.
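The layout of FIG. 2 can be sketched as follows. This is a minimal illustration rather than the patented implementation: Python's `zlib.crc32` stands in for the crc32 algorithm named in the text, a plain dict stands in for RocksDB, and the key format and the function name `build_crc_records` are hypothetical.

```python
import zlib

BLOCK_SIZE = 4 * 1024          # minimum IO size: 4 KiB
DATA_CHUNK_SIZE = 256 * 1024   # data_chunk_size: 256 KiB per segment
CRC_SIZE = 4                   # one crc32 check value is 4 bytes


def build_crc_records(object_id: str, data: bytes) -> dict:
    """Return one <key, value> check record per 256 KiB segment."""
    records = {}
    for seg_start in range(0, len(data), DATA_CHUNK_SIZE):
        segment = data[seg_start:seg_start + DATA_CHUNK_SIZE]
        # Concatenate the 4-byte check value of every 4 KiB block in the segment.
        value = b"".join(
            zlib.crc32(segment[off:off + BLOCK_SIZE]).to_bytes(CRC_SIZE, "little")
            for off in range(0, len(segment), BLOCK_SIZE)
        )
        # Hypothetical key layout: object id plus segment index.
        key = f"{object_id}/crc_chunk/{seg_start // DATA_CHUNK_SIZE:04d}"
        records[key] = value
    return records


obj = bytes(4 * 1024 * 1024)            # a 4 MiB object
records = build_crc_records("obj1", obj)
```

For a 4 MiB object this yields 16 records of 256 bytes each, 4 KiB of check values in total, matching the figures given in the text.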

As can be seen from the above, the foregoing embodiment of this application receives a data storage instruction; acquires data to be stored and segments it according to a specified size; and separately calculates the verification data of the data to be stored in each segment and stores it as one verification record. In a Ceph-based distributed storage system, the minimum IO size of each operation is 4 KiB, and every 4 KiB of data is encoded with the crc32 algorithm to obtain a 4-byte data check value. If the data check values of a whole object were kept as a single verification data record, every 4 KiB IO update would have to operate on the record for the entire object. With the segmentation described above, each operation loads, updates, or clears only the verification data record of the segment in which the operated data is located, and then persists or deletes the corresponding RocksDB record. This reduces the performance impact of enabling data verification and solves the technical problem in the prior art that a distributed storage system consumes too much system performance when verifying data.

Once data verification is enabled, the data must remain consistent with the stored data check values after each of the four operations write, read, zero, and truncate. For an operation that writes data (write), the verification data of the written data must be calculated, the verification data record of the segment to which the data belongs must be updated, and the verification data must be persisted (stored in RocksDB), so that the data and the check values stay consistent. The other three operations are described separately below.
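The write path described above can be illustrated with a short sketch in which a 4 KiB write touches only the record of its own segment. The dict `kv` stands in for RocksDB, `zlib.crc32` stands in for the crc32 algorithm, and `write_block` and the tuple key are hypothetical names. Using an all-zero record for a segment with no prior check values is a simplification made for this example.

```python
import zlib

BLOCK_SIZE = 4 * 1024
DATA_CHUNK_SIZE = 256 * 1024
CRC_SIZE = 4
CRC_CHUNK_SIZE = DATA_CHUNK_SIZE // BLOCK_SIZE * CRC_SIZE  # 64 * 4 = 256 bytes


def write_block(kv: dict, object_id: str, offset: int, block: bytes) -> tuple:
    """Update the check value of one 4 KiB block; only one record is touched."""
    assert offset % BLOCK_SIZE == 0 and len(block) == BLOCK_SIZE
    key = (object_id, offset // DATA_CHUNK_SIZE)            # segment that owns the block
    record = bytearray(kv.get(key, bytes(CRC_CHUNK_SIZE)))  # load just that record
    slot = (offset % DATA_CHUNK_SIZE) // BLOCK_SIZE * CRC_SIZE
    record[slot:slot + CRC_SIZE] = zlib.crc32(block).to_bytes(CRC_SIZE, "little")
    kv[key] = bytes(record)                                 # persist just that record
    return key


kv = {}
block = b"\x01" * BLOCK_SIZE
key = write_block(kv, "obj1", 260 * 1024, block)  # falls in segment 1, block slot 1
```

The design point is that the amount of check data read and rewritten per 4 KiB write is 256 bytes, not the 4 KiB that a single per-object record would require.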

As an optional embodiment, the above method further includes: receiving a data read instruction; reading data according to the data read instruction, and calculating current verification data of the read data; comparing the current verification data with the original verification data that the read data had when it was written; and, if the comparison succeeds, determining that the read data is the data to be read indicated by the data read instruction.

In the above solution, data verification is required for a read operation (read). The verification is performed by calculating the current verification data of the data read according to the data read instruction, extracting the original verification data recorded when the data was stored, and comparing the two. If they are the same, the read data has not been tampered with; if they differ, the read data may have been tampered with. This ensures that the data read is consistent with the data written.
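The read-time comparison can be sketched in a few lines. `zlib.crc32` is used as the checksum for illustration, and `verify_read` is a hypothetical helper name.

```python
import zlib


def verify_read(data: bytes, original_crc: int) -> bool:
    """Compare the check value of the data just read with the one stored at write time."""
    current_crc = zlib.crc32(data)
    return current_crc == original_crc


written = b"x" * 4096
original_crc = zlib.crc32(written)   # recorded when the block was written
corrupted = b"y" + written[1:]       # simulate silent corruption of one byte
```

Here `verify_read(written, original_crc)` succeeds, while `verify_read(corrupted, original_crc)` fails, which is the condition under which the read data would be treated as inconsistent.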

As an optional embodiment, the above method further includes: receiving a data zero-write instruction or a data clear instruction; and clearing the verification data corresponding to the data to be zeroed or the data to be cleared, and updating the verification data of the segment in which the data to be zeroed or the data to be cleared is located.

In the above solution, when a data zero-write operation (zero) or a data clear operation (truncate) is performed, the corresponding data check values must be cleared. The data check values of the data affected by the zero-write or clear operation are cleared, and the verification record of the segment in which that data is located must also be updated.
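A minimal sketch of clearing check values for a zeroed or truncated range follows. For readability each segment record is modelled here as a dict from block index to check value instead of a packed 256-byte value; that representation, the function name `clear_checksums`, and the choice to delete a record once it is empty are assumptions of this example. The range is assumed to be 4 KiB aligned and non-empty.

```python
BLOCK_SIZE = 4 * 1024
DATA_CHUNK_SIZE = 256 * 1024
BLOCKS_PER_SEGMENT = DATA_CHUNK_SIZE // BLOCK_SIZE  # 64


def clear_checksums(kv: dict, object_id: str, offset: int, length: int) -> set:
    """Drop the check values covering [offset, offset + length) and return the touched keys."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    touched = set()
    for blk in range(first, last + 1):
        key = (object_id, blk // BLOCKS_PER_SEGMENT)
        record = kv.get(key)
        if record is None:
            continue                      # no check values stored for this segment
        record.pop(blk % BLOCKS_PER_SEGMENT, None)
        touched.add(key)
        if not record:
            del kv[key]                   # segment record is now empty: delete it
    return touched


kv = {("o", 0): {0: 1, 1: 2, 63: 3}, ("o", 1): {0: 9}}
# Zero the last block of segment 0 and the first block of segment 1.
touched = clear_checksums(kv, "o", 63 * BLOCK_SIZE, 2 * BLOCK_SIZE)
```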

As an optional embodiment, a preset verification data cache space stores verification data, and before the current verification data is compared with the original verification data that the read data had when it was written, the above method further includes: extracting the original verification data from the verification data cache space.

The verification data cache space (crc cache) uses a caching mechanism: verification data from RocksDB is stored in the verification data cache space, so that the original verification data can be read directly from the cache space without being fetched from the back-end RocksDB.

In an optional embodiment, the verification data cache space includes multiple levels. A lookup first searches the highest level; if the entry is not there, the next level is searched, and so on until the last level of the verification data cache space has been searched. FIG. 3 is a schematic diagram of storing verification data based on a crc cache according to an embodiment of the present invention. Referring to FIG. 3, the designated storage space XStore includes multiple cache shards, and each cache shard has 8 levels of crc cache (level#1 of crc cache, level#2 of crc cache, ..., level#8 of crc cache), which cover all data in pg.a to pg.z. A lookup starts at level#1 of crc cache; if the verification data being looked up is not in level#1 of crc cache, level#2 of crc cache is searched, and so on until the verification data is found or level#8 of crc cache has been searched. In this example, one level is 128M of memory, and 128M of memory can cache the data check values of 128G of data.
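The level-by-level lookup and the 128M-per-128G sizing can be checked with a small sketch. Each level is modelled as a plain dict, and `lookup` and `crc_bytes_for` are hypothetical helper names; eviction and promotion between levels are not described in the text and are not modelled.

```python
from typing import Optional

BLOCK_SIZE = 4 * 1024
CRC_SIZE = 4


def crc_bytes_for(data_bytes: int) -> int:
    """Memory needed for the check values of data_bytes of data (4 bytes per 4 KiB)."""
    return data_bytes // BLOCK_SIZE * CRC_SIZE


def lookup(levels: list, key) -> Optional[bytes]:
    """Search level 1 first, then each following level, up to the last one."""
    for level in levels:
        if key in level:
            return level[key]
    return None  # missed in every level


levels = [{} for _ in range(8)]          # 8 levels of crc cache in one cache shard
levels[2][("obj1", 0)] = b"\x00" * 256   # entry that happens to live in level 3
hit = lookup(levels, ("obj1", 0))
miss = lookup(levels, ("obj2", 0))
```

Since each 4 KiB block needs a 4-byte check value, the ratio of cache memory to covered data is 1:1024, which is why 128 MiB of memory covers 128 GiB of data.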

As an optional embodiment, extracting the original verification data from the verification data cache space includes: if the original verification data exists in the verification data cache space, extracting the original verification data from the verification data cache space; if the original verification data does not exist in the verification data cache space, extracting the original verification data from disk, and writing the original verification data extracted from disk into the verification data cache space.

In the above solution, once the crc cache is enabled, it is used to cache check values. When data is read, the checksum of the data is calculated and then compared with the check value of that data in the cache. Comparing against check values held in the cache gives a faster response than reading them from RocksDB. If there is no hit in the crc cache, the check value is read from RocksDB and then cached in the crc cache.
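The hit-or-fall-back behaviour can be sketched as a read-through cache. Both `cache` and `backend` are plain dicts here, with `backend` standing in for RocksDB; `get_original_crc` is a hypothetical name.

```python
def get_original_crc(cache: dict, backend: dict, key):
    """Return (value, hit). On a miss, read from the backend and populate the cache."""
    value = cache.get(key)
    if value is not None:
        return value, True
    value = backend[key]   # miss: fetch from the persistent store (raises KeyError if absent)
    cache[key] = value     # cache it so the next read of this segment is a hit
    return value, False


backend = {("obj1", 0): b"\xaa" * 256}
cache = {}
first = get_original_crc(cache, backend, ("obj1", 0))
second = get_original_crc(cache, backend, ("obj1", 0))
```

The first call misses and fills the cache; the second call is served from the cache without touching the backend.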

FIG. 4 is a flowchart of reading data according to an embodiment of the present invention. First, a condition check is performed to determine whether l2_cache is enabled. Here l2_cache refers to the second-level cache in xstore (Ceph's local storage engine, similar to a file system, and the component that interacts directly with block devices). This cache layer holds caches of two important kinds of data structure, onode and crc. The onode cache records which objects have an onode cache; an onode contains all metadata of the object, such as the physical offsets of the data blocks used by the object, extended attributes, and object size, similar to a file system inode. The crc cache records which objects have a crc cache, which data segments of the object have data check values, and the data check values themselves. Enabling l2_cache means enabling the onode and crc caches; once enabled, the onode and crc caches must be updated whenever an object read or write operation is involved. If l2_cache is enabled, the flow proceeds to the next step; if not, it exits directly.

If l2_cache is enabled, a validity check is performed. Cases such as meta_pg, temporary objects, and non-block objects do not need to be cached, so the flow simply exits; otherwise it proceeds to the next step.

It is then determined whether only_cache_crc=true holds (only_cache_crc=true means that only the crc cache is updated). If so, the onode is not cached (the onode cache update is performed elsewhere in the flow; only the crc cache is updated here to avoid a duplicate update and improve efficiency); otherwise the onode is cached, and the flow proceeds to the next step.

A condition check is performed again. If the onode does not record that data verification is enabled, the flow exits; if the crc cache switch is not on, the flow exits; otherwise it proceeds to the next step. Specifically, l2_cache contains the crc cache, but l2_cache being enabled does not mean that the crc cache is enabled, whereas the crc cache being enabled means that l2_cache must be enabled. It is therefore necessary to check here whether the crc cache is enabled; likewise, if the object itself does not have data verification enabled (this attribute is recorded in the onode), there is no need to update the crc cache.
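The sequence of checks in FIG. 4 up to this point can be condensed into one decision function. This is a reading of the prose, not the actual xstore code: the function name, the boolean parameters, and the action strings are hypothetical, and the sketch assumes that the flow continues to the crc checks on both branches of the only_cache_crc test.

```python
def cache_actions(l2_cache_enabled: bool,
                  cacheable_object: bool,
                  only_cache_crc: bool,
                  object_checksum_enabled: bool,
                  crc_cache_enabled: bool) -> list:
    """Return the cache updates that the checks of FIG. 4 would allow."""
    actions = []
    if not l2_cache_enabled:
        return actions                      # l2_cache is off: exit directly
    if not cacheable_object:
        return actions                      # meta_pg, temporary or non-block object: exit
    if not only_cache_crc:
        actions.append("cache_onode")       # onode is cached unless only_cache_crc=true
    if object_checksum_enabled and crc_cache_enabled:
        actions.append("update_crc_cache")  # both switches must be on
    return actions


full = cache_actions(True, True, False, True, True)
crc_only = cache_actions(True, True, True, True, True)
no_crc = cache_actions(True, True, False, True, False)
disabled = cache_actions(False, True, False, True, True)
```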

The crc cache is then checked to see whether the object is cached in it. If it is not, all records in crc_chunk_map (which records all data check values held for the object at the current runtime) are written into the crc cache; otherwise, only the segment data check values that were updated, marked as deleted, or newly read from RocksDB are written into the crc cache. Specifically, an object is a basic operation unit of Ceph, similar to a file in a file system; an object has metadata (onode) and data (the content actually written by the user). The crc cache has a capacity limit and cannot hold the crc cache of every object in the Ceph cluster. Therefore, before the crc cache is updated, it must first be determined whether the object is in the crc cache. If it is, the part changed by the current operation is written to the cache; if it is not, all data check values held for the object at the current runtime (recorded in crc_chunk_map) are written to the crc cache. The segment data that was updated, marked as deleted, or read from RocksDB is the changed part mentioned above. Updated means that the crc cache holds an old value that must be overwritten with the new value from the current operation, for example when part of the object is overwritten and its crc value changes. Marked as deleted means that the data segment has been deleted, so the data check values in the crc cache must be cleared. Segment data read from RocksDB means that the crc cache does not have these data check values, so they must be added to the crc cache.
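The update rule for the crc cache can be sketched as follows. The `ChunkEntry` class, its flag names, and the nested-dict cache layout are hypothetical. Skipping entries marked as deleted when an uncached object is first inserted is an assumption of this sketch, since the text only says that all records are written.

```python
from dataclasses import dataclass


@dataclass
class ChunkEntry:
    value: bytes                   # check values of one segment
    updated: bool = False          # changed by the current operation
    deleted: bool = False          # segment was deleted by the current operation
    loaded_from_db: bool = False   # newly read from the persistent store


def update_crc_cache(crc_cache: dict, object_id: str, crc_chunk_map: dict) -> None:
    cached = crc_cache.get(object_id)
    if cached is None:
        # Object not cached yet: write everything held in crc_chunk_map.
        crc_cache[object_id] = {
            seg: e.value for seg, e in crc_chunk_map.items() if not e.deleted
        }
        return
    # Object already cached: apply only the part that changed.
    for seg, e in crc_chunk_map.items():
        if e.deleted:
            cached.pop(seg, None)          # clear the stale check values
        elif e.updated or e.loaded_from_db:
            cached[seg] = e.value          # overwrite old value or add missing one


crc_cache = {"obj1": {0: b"old0", 1: b"old1", 2: b"old2"}}
chunk_map = {
    0: ChunkEntry(b"new0", updated=True),
    1: ChunkEntry(b"old1"),                     # unchanged, left alone
    2: ChunkEntry(b"", deleted=True),
    3: ChunkEntry(b"db3", loaded_from_db=True),
}
update_crc_cache(crc_cache, "obj1", chunk_map)
update_crc_cache(crc_cache, "obj2", {0: ChunkEntry(b"a")})
```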

作为一种可选的实施例，将所述当前校验数据与写入所述读取的数据时所述读取的数据的原校验数据进行比对，包括：调用异步读回模块，其中，所述异步读回模块提交读请求后完成当前线程，以支持异步读取；通过所述异步读回模块将所述当前校验数据与写入所述读取的数据时所述读取的数据的原校验数据进行比对。As an optional embodiment, comparing the current verification data with the original verification data of the read data when the read data was written includes: calling an asynchronous readback module, wherein , the asynchronous readback module completes the current thread after submitting the read request to support asynchronous reading; through the asynchronous readback module, the current verification data is compared with the read data when the read data is written. The original verification data of the data is compared.

Reads in a ceph-based distributed storage system are either synchronous or asynchronous; this is the semantics of the read request issued by the PG. For a synchronous read, the osd worker thread blocks until the data is returned from disk or cache. While the disk performs IO, the thread cannot schedule other IO, so performance is poor. Asynchronous reads are therefore used in most cases, and only a few operations, such as scrub read requests, use synchronous reads.

However, if data verification is enabled, the data crc must be checked on read. Because the implementation is then forced to wait for the read to return, the thread keeps holding the pg lock, other IO within the pg cannot be processed, and the asynchronous read effectively degenerates into a synchronous read.

To solve this problem, data verification is added to the asynchronous read callback module, which performs the verification after an asynchronous read returns from the lower layer and records the object if the data is inconsistent. Because the verification is carried out by the asynchronous read callback module, the asynchronous read semantics are preserved: once the request has been submitted to the cache, the thread finishes and releases the pg lock, which greatly reduces the performance impact.
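The idea of verifying inside the read callback can be sketched as follows. This is a simplified illustration under stated assumptions: `AsyncVerifyingReader` is a hypothetical name, a thread pool stands in for the lower storage layer, an in-memory dictionary stands in for the disk, and CRC32 from `zlib` is used as the check value. It is not ceph's actual callback interface.

```python
import zlib
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict, List


class AsyncVerifyingReader:
    """Submits reads without blocking and verifies data in the completion callback."""

    def __init__(self, store: Dict[str, bytes], original_crcs: Dict[str, int]) -> None:
        self._store = store                   # object id -> data, stands in for the disk
        self._original_crcs = original_crcs   # object id -> check value recorded at write time
        self.inconsistent: List[str] = []     # objects whose verification failed
        self._pool = ThreadPoolExecutor(max_workers=2)

    def read_async(self, oid: str) -> Future:
        # The caller returns as soon as the request is submitted, so any lock
        # it holds can be released before the data arrives.
        verified = Future()  # constructed directly only to keep the sketch short
        pending = self._pool.submit(self._store.__getitem__, oid)
        pending.add_done_callback(lambda done: self._on_read_done(oid, done, verified))
        return verified

    def _on_read_done(self, oid: str, done: Future, verified: Future) -> None:
        # Runs after the read has returned from the lower layer.
        try:
            data = done.result()
        except Exception as exc:
            verified.set_exception(exc)
            return
        if zlib.crc32(data) != self._original_crcs.get(oid):
            self.inconsistent.append(oid)  # record the inconsistent object
            verified.set_exception(ValueError("check value mismatch for " + oid))
        else:
            verified.set_result(data)
```

A caller would keep the returned future and collect the verified data later, for example with `reader.read_async("obj").result()`.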

Embodiment 2

According to an embodiment of the present invention, an embodiment of a data processing apparatus for a distributed system is provided. FIG. 5 is a schematic diagram of a data processing apparatus for a distributed system according to an embodiment of the present invention. As shown in FIG. 5, the apparatus includes the following modules:

The receiving module 50 is configured to receive a data storage instruction.

The obtaining module 52 is configured to obtain the data to be stored and to segment the data to be stored according to a specified size.

The first calculation module 54 is configured to separately calculate the verification data of the data to be stored in each segment, and to store the verification data of the data to be stored in each segment as one verification record.
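What the obtaining and first calculation modules do together can be sketched as follows. This is a minimal illustration: the function names are hypothetical, and CRC32 from `zlib` is assumed as the check value (the background section mentions MD5 or CRC without fixing the choice here).

```python
import zlib
from typing import Dict, List


def build_check_records(data: bytes, segment_size: int) -> Dict[int, int]:
    """Split data into fixed-size segments and return {segment offset: crc32}."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    return {
        offset: zlib.crc32(data[offset:offset + segment_size])
        for offset in range(0, len(data), segment_size)
    }


def find_bad_segments(data: bytes, segment_size: int, records: Dict[int, int]) -> List[int]:
    """Return the offsets of segments whose current check value differs from the stored one."""
    current = build_check_records(data, segment_size)
    return [offset for offset in sorted(records) if current.get(offset) != records[offset]]
```

Storing one record per segment means a later read or overwrite only needs to recompute the check values of the segments it touches.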

As an optional embodiment, the apparatus further includes: a receiving module configured to receive a data read instruction; a second calculation module configured to read data according to the data read instruction and to calculate the current verification data of the read data; a comparison module configured to compare the current verification data with the original verification data recorded when the read data was written; and a determination module configured to determine, when the comparison succeeds, that the read data is the data to be read indicated by the data read instruction.

As an optional embodiment, the apparatus further includes: a second receiving module configured to receive a data write-zero instruction or a data clear instruction; and a clearing module configured to clear the verification data corresponding to the data to be zeroed or the data to be cleared, and to update the verification data of the segment in which that data is located.
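One possible reading of the clearing module is sketched below. The interpretation is an assumption: check records of segments that are entirely zeroed are removed, and the record of a segment that is only partly zeroed is recomputed. The function name `zero_range` and the offset-keyed record layout are hypothetical, and CRC32 is assumed as the check value.

```python
import zlib
from typing import Dict


def zero_range(data: bytearray, records: Dict[int, int], segment_size: int,
               offset: int, length: int) -> None:
    """Zero data[offset:offset+length] and bring the per-segment check records up to date."""
    end = min(offset + length, len(data))
    if offset >= end:
        return
    data[offset:end] = bytes(end - offset)
    first_segment = (offset // segment_size) * segment_size
    for seg in range(first_segment, end, segment_size):
        seg_end = min(seg + segment_size, len(data))
        if offset <= seg and seg_end <= end:
            # The whole segment was zeroed: clear its check record.
            records.pop(seg, None)
        else:
            # Only part of the segment was zeroed: update its check record.
            records[seg] = zlib.crc32(data[seg:seg_end])
```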

As an optional embodiment, a preset verification data cache space stores verification data, and the apparatus further includes: an extraction module configured to extract the original verification data from the verification data cache space before the current verification data is compared with the original verification data recorded when the read data was written.

As an optional embodiment, the extraction module includes: a first extraction sub-module configured to extract the original verification data from the verification data cache space if the original verification data exists there; and a second extraction sub-module configured to, if the original verification data does not exist in the verification data cache space, extract the original verification data from disk and write the original verification data extracted from disk into the verification data cache space.
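The two extraction sub-modules together behave like a read-through cache, which can be sketched as follows. The class name `CheckDataCache` and the `disk_lookup` callable are hypothetical stand-ins for the verification data cache space and the on-disk lookup; capacity limits are left out to keep the sketch short.

```python
from typing import Callable, Dict, Hashable


class CheckDataCache:
    """Read-through cache for the original verification data."""

    def __init__(self, disk_lookup: Callable[[Hashable], int]) -> None:
        self._disk_lookup = disk_lookup  # fetches a check value from disk on a miss
        self._cache: Dict[Hashable, int] = {}
        self.disk_reads = 0              # counts misses, for illustration only

    def get(self, key: Hashable) -> int:
        if key in self._cache:
            # First sub-module: the value is present in the cache space.
            return self._cache[key]
        # Second sub-module: fetch from disk, then populate the cache space.
        value = self._disk_lookup(key)
        self.disk_reads += 1
        self._cache[key] = value
        return value
```

Only the first lookup of a given key goes to disk; later lookups of the same key are served from memory.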

As an optional embodiment, the comparison module includes: a calling sub-module configured to call the asynchronous readback module, wherein the current thread completes once the asynchronous readback module has submitted the read request, so as to support asynchronous reading; and a comparison sub-module configured to compare, through the asynchronous readback module, the current verification data with the original verification data recorded when the read data was written.

Embodiment 3

According to an embodiment of the present invention, a storage medium is provided. The storage medium includes a stored program, wherein, when the program runs, the device on which the storage medium is located is controlled to execute the data processing method for a distributed system described in Embodiment 1.

Embodiment 4

According to an embodiment of the present invention, a processor is provided. The processor is configured to run a program, wherein the program, when run, executes the data processing method for a distributed system described in Embodiment 1.

The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.

In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, units, or modules, and may be electrical or take other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in hardware or as a software functional unit.

If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.

The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present invention, and such improvements and refinements shall also be regarded as falling within the scope of protection of the present invention.

Claims

1. A data processing method for a distributed system, characterized by comprising:

receiving a data storage instruction;

acquiring data to be stored, and segmenting the data to be stored according to a specified size; and

separately calculating the verification data of the data to be stored in each segment, and storing the verification data of the data to be stored in each segment as one verification record.

2. The method according to claim 1, wherein the method further comprises:

receiving a data read instruction;

reading data according to the data read instruction, and calculating the current verification data of the read data;

comparing the current verification data with the original verification data recorded when the read data was written; and

if the comparison succeeds, determining that the read data is the data to be read indicated by the data read instruction.

3. The method according to claim 1, wherein the method further comprises:

receiving a data write-zero instruction or a data clear instruction; and

clearing the verification data corresponding to the data to be zeroed or the data to be cleared, and updating the verification data of the segment in which the data to be zeroed or the data to be cleared is located.

4. The method according to claim 2, wherein a preset verification data cache space stores verification data, and before comparing the current verification data with the original verification data recorded when the read data was written, the method further comprises: extracting the original verification data from the verification data cache space.

5. The method according to claim 4, wherein extracting the original verification data from the verification data cache space comprises:

if the original verification data exists in the verification data cache space, extracting the original verification data from the verification data cache space; and

if the original verification data does not exist in the verification data cache space, extracting the original verification data from disk, and writing the original verification data extracted from disk into the verification data cache space.

6. The method according to claim 2, wherein comparing the current verification data with the original verification data recorded when the read data was written comprises:

calling an asynchronous readback module, wherein the current thread completes once the asynchronous readback module has submitted the read request, so as to support asynchronous reading; and

comparing, through the asynchronous readback module, the current verification data with the original verification data recorded when the read data was written.

7. A data processing apparatus for a distributed system, characterized by comprising:

a receiving module configured to receive a data storage instruction;

an acquisition module configured to acquire data to be stored and to segment the data to be stored according to a specified size; and

a first calculation module configured to separately calculate the verification data of the data to be stored in each segment, and to store the verification data of the data to be stored in each segment as one verification record.

8. The apparatus according to claim 7, wherein the apparatus further comprises:

a receiving module configured to receive a data read instruction;

a second calculation module configured to read data according to the data read instruction and to calculate the current verification data of the read data;

a comparison module configured to compare the current verification data with the original verification data recorded when the read data was written; and

a determination module configured to determine, when the comparison succeeds, that the read data is the data to be read indicated by the data read instruction.

9. A storage medium, characterized in that the storage medium comprises a stored program, wherein, when the program runs, the device on which the storage medium is located is controlled to execute the data processing method for a distributed system according to any one of claims 1 to 6.

10. A processor, characterized in that the processor is configured to run a program, wherein the program, when run, executes the data processing method for a distributed system according to any one of claims 1 to 6.