CN106055277A

CN106055277A - Decentralized distributed heterogeneous storage system data distribution method

Info

Publication number: CN106055277A
Application number: CN201610376033.5A
Authority: CN
Inventors: 沙行勉; 诸葛晴凤; 吴林
Original assignee: Chongqing University
Current assignee: Chongqing University
Priority date: 2016-05-31
Filing date: 2016-05-31
Publication date: 2016-10-26
Also published as: CN109196459A; WO2017206649A1; CN109196459B

Abstract

The invention discloses a data distribution method for a decentralized distributed heterogeneous storage system, comprising the following steps: 1) classifying data objects; 2) classifying storage devices; 3) dividing the stored data into different "placement group clusters", wherein each type of storage device corresponds to one class of "placement group cluster"; 4) calculating, for each type of data object to be stored, the proportion to be placed in each type of "placement group cluster"; 5) using a hash algorithm to determine which "placement group" in the "placement group cluster" a data object to be stored belongs to; and 6) using the data distribution algorithm of the storage system to store the data objects of each "placement group" on multiple corresponding storage devices. The method preserves the performance, load balancing and scalability of the storage system while reducing the write frequency of solid-state drives and thereby prolonging their service life.

Description

A decentralized distributed heterogeneous storage system data distribution method

Technical field

The invention belongs to the technical field of distributed computer storage, and in particular relates to a data distribution method for a decentralized distributed heterogeneous storage system.

Background art

In big data applications, scientific computing and cloud computing platforms, a reliable and scalable storage system is vital to overall system performance. As data volumes grow to the petabyte level, the data distribution strategy of the storage system must guarantee both performance and scalability. Decentralized data distribution strategies, such as that of Ceph, use the processing power of the storage devices themselves to provide a reliable object storage system. Solid-state drives (SSDs) offer better read and write performance than traditional mechanical hard disk drives (HDDs) and are increasingly deployed in storage systems, producing large-scale distributed heterogeneous storage systems. However, because excessive write operations accelerate the wear of the SSD storage medium, the data distribution strategy must take the write endurance of SSDs into account while still ensuring the scalability and load balancing of the system.

Much existing research addresses data distribution and task scheduling in workflow systems. In scientific computing, for example, a "workflow management system" allocates computing tasks according to the storage resources and computing power of the computing sites. From the task dependencies in the workflow model, the amount of data each task requires can be determined, and the tasks of different stages are then assigned to different computing sites; such allocation schemes mainly aim to reduce the overhead of remote access and transfer between sites. Ceph uses the communication capabilities of the storage devices themselves and introduces a two-step data distribution method. The first step uses a hash algorithm to map data objects to "placement groups": the input of the hash function is the globally unique identifier of the data object, and data objects with the same hash output are placed in the same "placement group". The second step uses a pseudo-random hash algorithm to distribute each "placement group" across multiple storage devices. This method does not consider the heterogeneity of the storage system, which leads to intensive write operations on SSDs. Other work uses SSDs to improve the performance of centralized storage, but a centralized data distribution strategy does not scale and is unsuitable for very large data applications.

Summary of the invention

In view of the deficiencies of the prior art, the technical problem to be solved by the present invention is to provide a data distribution method for a decentralized distributed heterogeneous storage system which, by analyzing the access patterns of data objects, maintains the performance, load balancing and scalability of the storage system while reducing write operations to SSDs.

The technical problem is solved by a technical solution comprising the following steps:

Step 1. During program execution, count the number of times each data object is read and written, and convert these counts into a weight that serves as the access pattern of the data; classify the data objects according to their access patterns.

Step 2. Classify the storage devices according to their capacity and read/write performance.

Step 3. Divide the stored data into different "placement group clusters". A "placement group cluster" contains multiple "placement groups", and each type of storage device corresponds to one class of "placement group cluster".

Step 4. According to the load balancing target and performance index of the storage system, calculate the proportion of each type of data object to be stored that should be placed in each type of "placement group cluster".

Step 5. Use a hash algorithm to determine which "placement group" in the "placement group cluster" a data object to be stored belongs to.

Step 6. Use the data distribution algorithm of the storage system to store the data objects of each "placement group" on multiple corresponding storage devices. "Placement groups" belonging to SSDs are assigned to SSDs, and "placement groups" belonging to HDDs are assigned to HDDs.

Technical effect of the present invention:

The invention distributes different classes of data to different "placement group clusters" according to the access patterns of the data objects. To do so, it calculates the proportion of each type of data object to be stored that goes to each "placement group cluster"; this proportion controls the load balance between "placement group clusters". Once the "placement group cluster" of a data object has been determined, a hash algorithm calculates the corresponding "placement group", and the data objects of that "placement group" are then distributed to the storage devices. Data is thus spread evenly over the storage devices without any centralized storage structure, which maintains the performance, load balancing and scalability of the storage system while reducing the number of write operations to SSDs and prolonging their life.

Description of the drawings

The drawings of the present invention are as follows:

Figure 1 is a flowchart of the algorithm that calculates the proportion of each type of data object to be stored that goes to each type of "placement group cluster";

Figure 2 is a diagram of the data storage process of the present invention;

Figure 3 is a schematic diagram of mapping a read-intensive data object to a "placement group";

Figure 4 is a schematic diagram of mapping a write-intensive data object to a "placement group".

Detailed description

The present invention is further described below with reference to the drawings and embodiments.

The present invention comprises the following steps:

Step 1. During program execution, count the number of times each data object is read and written, and convert these counts into a weight that serves as the access pattern of the data. Classify the data objects according to their access patterns, for example as read-intensive, write-intensive and mixed. The common K-Means clustering algorithm can be used for the classification. Each class of data object has an attribute giving the average number of writes for objects of that class.
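As an illustration of Step 1, the sketch below converts per-object read and write counts into a single weight and clusters the weights with a small one-dimensional K-Means. The weight formula (the fraction of accesses that are writes) is an assumption, since the text does not fix it; the function names, the evenly spaced initial centroids and the sample counts are likewise hypothetical.

```python
def write_weight(reads, writes):
    """Weight of one object: fraction of its accesses that are writes (assumed formula)."""
    total = reads + writes
    return writes / total if total else 0.0


def kmeans_1d(values, k=3, iterations=20):
    """Plain 1-D K-Means; returns (centroids, label per value)."""
    lo, hi = min(values), max(values)
    # Deterministic start: centroids spread evenly over the value range.
    centroids = [lo + (hi - lo) * (c + 0.5) / k for c in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: mean of each cluster (an empty cluster keeps its centroid).
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, labels


# (reads, writes) counted for six hypothetical objects during a program run.
counts = [(100, 10), (90, 12), (40, 30), (45, 35), (30, 80), (20, 90)]
weights = [write_weight(r, w) for r, w in counts]
centroids, labels = kmeans_1d(weights, k=3)

# Centroids start, and in one dimension remain, in ascending order of write share.
names = ["read-intensive", "mixed", "write-intensive"]
classes = [names[lab] for lab in labels]

# Per-class attribute used later by the allocation: average number of writes.
avg_writes = {}
for name in names:
    ws = [w for (_, w), cls in zip(counts, classes) if cls == name]
    avg_writes[name] = sum(ws) / len(ws) if ws else 0.0
```

A production system would more likely cluster on several features with an existing K-Means library; the one-dimensional version is used here only to keep the example self-contained.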

Step 2. Classify the storage devices according to their capacity and read/write performance, for example as high-speed SSDs, low-speed SSDs, high-speed HDDs and low-speed HDDs. Each kind of storage device has its own performance parameters, such as average read and write latency, and its own capacity.

Step 3. Divide the stored data into different "placement group clusters". A "placement group cluster" contains multiple "placement groups", and each type of storage device corresponds to one class of "placement group cluster". A "placement group cluster" groups together data objects with similar read/write characteristics. It is a logical concept, used mainly to aggregate data objects, but it also has capacity and read/write performance attributes: its capacity is the total capacity of all disks that correspond to it, and its read/write performance is the average read/write latency of those disks.
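The aggregate attributes described in Step 3 can be sketched as a small data structure. The class and field names below are hypothetical; the text only states that the capacity of a cluster is the capacity of all its disks and that its performance is their average read/write latency.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Disk:
    capacity: int              # number of data objects the disk can hold
    read_latency_ms: float
    write_latency_ms: float


@dataclass
class PlacementGroupCluster:
    device_type: str                                       # e.g. "high-speed SSD"
    disks: list = field(default_factory=list)
    placement_groups: list = field(default_factory=list)   # PG ids in this cluster

    @property
    def capacity(self):
        """Total capacity of all disks backing this cluster."""
        return sum(d.capacity for d in self.disks)

    @property
    def read_latency_ms(self):
        """Average read latency over the cluster's disks."""
        return mean(d.read_latency_ms for d in self.disks)

    @property
    def write_latency_ms(self):
        """Average write latency over the cluster's disks."""
        return mean(d.write_latency_ms for d in self.disks)


ssd_cluster = PlacementGroupCluster(
    device_type="high-speed SSD",
    disks=[Disk(400, 0.10, 0.20), Disk(600, 0.14, 0.24)],
    placement_groups=list(range(1, 21)),
)
```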

Step 4. According to the load balancing target and performance index of the storage system, calculate the proportion of each type of data object to be stored that should be placed in each type of "placement group cluster".

For example, suppose the system has 3 "placement group clusters". Of the read-intensive data, 20% is placed in the first "placement group cluster", 30% in the second and 50% in the third. Each proportion is the number of objects of a class placed in a given "placement group cluster" divided by the total number of objects of that class.

The performance index of the storage system is set according to the read/write performance of the storage devices; for example, it may require that over all data objects the average read latency be 0.2 ms and the average write latency be 0.5 ms. The purpose of setting these proportions is to keep the data evenly distributed among the "placement group clusters". Consider the extreme case in which all data objects are write-intensive. To reduce writes to SSDs, write-intensive objects should go to HDDs; but if every object were write-intensive, all of them would be assigned to the "placement group clusters" corresponding to HDDs and the SSDs would hold no data. To avoid this, objects of the same class are spread over different "placement group clusters", and the proportions control the load balance between "placement group clusters".

Step 5. Use a hash algorithm to determine which "placement group" in the "placement group cluster" a data object to be stored belongs to, since one "placement group cluster" contains multiple "placement groups".

Step 6. Use the data distribution algorithm of the storage system to store the data objects of each "placement group" on multiple corresponding storage devices. The "placement groups" in a "placement group cluster" that corresponds to SSDs are assigned to SSDs, and the "placement groups" in a "placement group cluster" that corresponds to HDDs are assigned to HDDs.

A "placement group" is stored on multiple storage devices so that several replicas of the same data are kept. The number of replicas is set when the system is initialized. Because several storage devices correspond to the same "placement group", a mapping algorithm is needed to determine which storage devices each "placement group" is placed on.
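The mapping of one "placement group" to several devices of the matching type can be sketched as below. This is not CRUSH: it uses rendezvous (highest-random-weight) hashing as a simple stand-in that is also decentralized and deterministic. The device names, the `replicas` parameter and the function names are hypothetical.

```python
import hashlib


def score(pg_id, device):
    """Deterministic pseudo-random score for a (placement group, device) pair."""
    digest = hashlib.sha256(f"{pg_id}:{device}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def map_pg_to_devices(pg_id, devices_by_type, cluster_type, replicas=3):
    """Pick `replicas` distinct devices, only among devices of the cluster's type."""
    candidates = devices_by_type[cluster_type]
    if replicas > len(candidates):
        raise ValueError("not enough devices of this type for the replica count")
    # Every node can compute the same ranking locally, so no central table is needed.
    ranked = sorted(candidates, key=lambda dev: score(pg_id, dev), reverse=True)
    return ranked[:replicas]


devices_by_type = {
    "ssd": ["ssd.0", "ssd.1", "ssd.2", "ssd.3"],
    "hdd": ["hdd.0", "hdd.1", "hdd.2", "hdd.3", "hdd.4"],
}

# A placement group from an SSD-backed cluster only ever lands on SSDs.
targets = map_pg_to_devices(pg_id=13, devices_by_type=devices_by_type, cluster_type="ssd")
```

Restricting the candidate list by device type is what keeps SSD "placement groups" on SSDs and HDD "placement groups" on HDDs, as Step 6 requires.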

The algorithm used in Step 4 to calculate the proportion of each type of data object to be stored that goes to each type of "placement group cluster" is shown in the flowchart of Figure 1.

The flow starts at step 801, and then:

In step 802, calculate the total number of data objects to be stored, that is, the sum over all types of data object;

In step 803, calculate the total number of existing data objects, that is, the number of data objects already stored on all storage devices in the initial state;

In step 804, according to the load balancing condition, calculate the maximum number of data objects that each "placement group cluster" can store, that is, determine the capacity of each "placement group cluster";

Load balancing is a configuration parameter of the system. Relative to a perfectly even distribution of all data objects, a deviation of up to 5% above or below the share implied by each storage device's capacity is still considered balanced. For example, if a "placement group cluster" would hold 100 data objects under a perfectly even distribution and the balance condition allows a 5% deviation, that "placement group cluster" can store at most 100+100*0.05=105 data objects;

In step 805, sort the classes of data objects to be stored in ascending order of average number of writes; the average number of writes is an attribute of each class of data object;

Suppose the data objects to be stored are divided into 3 classes, read-intensive, write-intensive and mixed, where the average number of writes is 10 for read-intensive data, 80 for write-intensive data and 50 for mixed data.

In step 806, sort all "placement group clusters" in descending order of performance, where the performance of a "placement group cluster" is the read/write performance of its corresponding storage devices; SSDs have better read/write performance than HDDs;

In step 807, initialize the variable i=0, which is used to scan the classes of data objects to be stored;

If the data objects to be stored are divided into 3 classes, i indexes the first, second and third class in turn. This is a loop that scans each class of data object to be stored;

In step 808, initialize the variable j=0, which is used to scan the classes of "placement group cluster";

If the "placement group clusters" are divided into 4 classes, j indexes the first to fourth class in turn;

In step 809, assign data objects of class i to "placement group cluster" j;

This step follows the orders established in steps 805 and 806 and fills each "placement group cluster", up to the capacity calculated in step 804, with objects of each class in turn;

In step 810, record the number of class-i data objects stored in "placement group cluster" j; this is used to calculate the storage proportion of each class of data object;

The total number of data objects to be stored in each class is known. Recording the number of objects of each class placed in each "placement group cluster" and dividing it by the total number of objects of that class gives the proportion.

In step 811, determine whether "placement group cluster" j has reached its maximum number of stored objects; if so, go to step 812, otherwise go to step 813;

In step 813, determine whether all data objects to be stored have been processed; if so, go to step 816, otherwise go to step 814;

In step 814, move the pointer i, which scans the array of data object classes, to the next position, that is, process the next class of data objects to be stored, and go to step 809;

In step 812, move the pointer j, which scans the array of "placement group clusters", to the next position, that is, process the next "placement group cluster";

In step 815, determine whether all "placement group clusters" have been processed; if so, go to step 816, otherwise go to step 809;

In step 816, from the number of objects of each class stored in each "placement group cluster", as recorded in step 810, calculate the proportion of each class of data object assigned to each "placement group cluster";

In step 817, the algorithm that assigns each class of data to be stored to the "placement group clusters" ends.
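Read as code, the flow of Figure 1 is a greedy fill: classes are taken in ascending order of average writes, clusters in descending order of performance, and each cluster is filled up to its maximum before the next one is used. The sketch below is one possible reading, not a reference implementation of the patent; the function name, parameter names and sample figures are hypothetical.

```python
def distribution_ratios(class_counts, cluster_free):
    """
    class_counts: objects to store per class, already sorted by ascending
                  average writes (step 805).
    cluster_free: remaining room per cluster (maximum minus current load),
                  already sorted by descending performance (steps 804, 806).
    Returns (placed, ratios), where placed[i][j] is the number of class-i
    objects put in cluster j and ratios[i][j] is that number divided by the
    class total (step 816).
    """
    placed = [[0] * len(cluster_free) for _ in class_counts]
    remaining = list(class_counts)
    free = [max(0, f) for f in cluster_free]    # an over-full cluster has no room
    i = j = 0                                   # steps 807, 808
    while i < len(remaining) and j < len(free):
        moved = min(remaining[i], free[j])      # step 809
        placed[i][j] += moved                   # step 810
        remaining[i] -= moved
        free[j] -= moved
        if free[j] == 0:                        # steps 811, 812: cluster is full
            j += 1
        if remaining[i] == 0:                   # steps 813, 814: class is done
            i += 1
    ratios = [[n / total if total else 0.0 for n in row]
              for row, total in zip(placed, class_counts)]
    return placed, ratios


# Two classes (60 read-intensive and 40 write-intensive objects) and
# two clusters with room for 50 objects each.
placed, ratios = distribution_ratios([60, 40], [50, 50])
# placed == [[50, 10], [0, 40]]
```

Because the fastest clusters are filled first with the classes that write least, write-heavy classes drift toward the slower (HDD-backed) clusters, while the per-cluster maximum keeps any one cluster from being overloaded.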

The data storage process of Steps 5 and 6 is shown in Figure 2. The "placement groups" of the storage system are divided into different "placement group clusters", each of which contains multiple "placement groups". When a data object is stored, its "placement group cluster" is determined first, from the class of the object and the proportions in which objects of that class are assigned to the "placement group clusters". These proportions are calculated with the flow of Figure 1 and control the load balance between "placement group clusters". A hash algorithm then determines which "placement group" within that "placement group cluster" the data object belongs to. Step 6 uses a pseudo-random hash algorithm (CRUSH) to map this "placement group" to different storage devices.

(1) Embodiment of the flowchart shown in Figure 1

Suppose the storage system has 5 types of storage device, each corresponding to one "placement group cluster", so that the system has 5 "placement group clusters". All placement group clusters have already been sorted from high to low performance (corresponding to step 806), as shown in Table 1.

Table 1. Properties of the system's storage devices

| Placement group cluster | Type | Capacity | Load | Average read latency (ms) | Average write latency (ms) |
|---|---|---|---|---|---|
| 1 | SSD | 1000 | 60 | 0.12 | 0.22 |
| 2 | SSD | 1500 | 260 | 0.66 | 0.35 |
| 3 | HDD | 2000 | 300 | 5.20 | 7.84 |
| 4 | HDD | 2500 | 530 | 6.58 | 8.70 |
| 5 | HDD | 3000 | 700 | 8.30 | 9.20 |

The total capacity of the storage system is: 1000+1500+2000+2500+3000=10000

Suppose the data objects to be stored are divided into 3 classes; the average numbers of reads and writes and the number of objects in each class are shown in Table 2. The classes have already been sorted by number of writes (corresponding to step 805).

Table 2. Properties of the data objects to be stored

| Object type | Average reads | Average writes | Quantity |
|---|---|---|---|
| A | 100 | 10 | 350 |
| B | 40 | 30 | 150 |
| C | 30 | 50 | 200 |

Following the flow shown in Figure 1, the algorithm runs as follows:

In step 802, the total number of data objects to be stored is 350+150+200=700;

In step 803, the total number of existing data objects is 60+260+300+530+700=1850;

The total number of data objects is therefore 700+1850=2550;

In step 804, assuming a load balancing factor e=0.001, the maximum number of objects RMAX that each "placement group cluster" can hold is calculated as follows (results rounded to the nearest integer):

"Placement group cluster" 1: RMAX.1 = (1+0.001)*(1000*(700+1850))/10000 ≈ 255;

"Placement group cluster" 2: RMAX.2 = (1+0.001)*(1500*(700+1850))/10000 ≈ 383;

"Placement group cluster" 3: RMAX.3 = (1+0.001)*(2000*(700+1850))/10000 ≈ 511;

"Placement group cluster" 4: RMAX.4 = (1+0.001)*(2500*(700+1850))/10000 ≈ 638;

"Placement group cluster" 5: RMAX.5 = (1+0.001)*(3000*(700+1850))/10000 ≈ 766;

Therefore, relative to a perfectly even distribution of the data, the maximum capacities RMAX of the five "placement group clusters" are 255, 383, 511, 638 and 766 respectively.
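The five values above follow a single formula: RMAX for a cluster is (1 + e) times the cluster's capacity times the total number of objects, divided by the total capacity. The snippet below recomputes them from the Table 1 figures. The variable names are illustrative, and rounding to the nearest integer is inferred from the printed results rather than stated in the text.

```python
capacities = [1000, 1500, 2000, 2500, 3000]   # Table 1, per placement group cluster
loads = [60, 260, 300, 530, 700]              # objects already stored (step 803)
to_store = 350 + 150 + 200                    # objects to be stored (step 802)
e = 0.001                                     # load balancing factor

total_objects = to_store + sum(loads)         # 700 + 1850 = 2550
total_capacity = sum(capacities)              # 10000

# Capacity-proportional share of all objects, widened by the balance factor.
rmax = [round((1 + e) * c * total_objects / total_capacity) for c in capacities]

# Room left in each cluster once its existing load is subtracted.
free = [limit - load for limit, load in zip(rmax, loads)]

print(rmax)   # [255, 383, 511, 638, 766]
print(free)   # [195, 123, 211, 108, 66]
```

The `free` list is the room available to the new objects in each cluster; it is what the greedy fill of steps 809 to 815 consumes.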

In step 807, i is initialized to 0 and is used to scan the three classes A, B and C of data objects to be stored.

In step 808, j is initialized to 0 and is used to scan "placement group clusters" 1, 2, 3, 4 and 5.

The allocation in step 809 and the recording in step 810 proceed as follows:

When the three classes of data objects are allocated, class A, which has the fewest average writes and the most reads, is allocated first to OSD.1 (placement group cluster 1), which has low write latency and low read latency.

1. Placement group cluster 1 already has a load of 60 and a calculated maximum of 255, so it has room for 255-60=195 objects.

195 data objects of type A are therefore allocated to placement group cluster 1, leaving 350-195=155 objects of type A.

Placement group cluster 1 is full.

| | Load | A | B | C | RMAX |
|---|---|---|---|---|---|
| Placement group cluster 1 | 60 | 195 | 0 | 0 | 255 |
| Placement group cluster 2 | 260 | 0 | 0 | 0 | 383 |
| Placement group cluster 3 | 300 | 0 | 0 | 0 | 511 |
| Placement group cluster 4 | 530 | 0 | 0 | 0 | 638 |
| Placement group cluster 5 | 700 | 0 | 0 | 0 | 766 |

2. Placement group cluster 2 already has a load of 260 and a calculated maximum of 383, so it has room for 383-260=123 objects.

123 more data objects of type A are allocated to placement group cluster 2, leaving 155-123=32 objects of type A.

Placement group cluster 2 is full.

| | Load | A | B | C | RMAX |
|---|---|---|---|---|---|
| Placement group cluster 1 | 60 | 195 | 0 | 0 | 255 |
| Placement group cluster 2 | 260 | 123 | 0 | 0 | 383 |
| Placement group cluster 3 | 300 | 0 | 0 | 0 | 511 |
| Placement group cluster 4 | 530 | 0 | 0 | 0 | 638 |
| Placement group cluster 5 | 700 | 0 | 0 | 0 | 766 |

3. Placement group cluster 3 already has a load of 300 and a calculated maximum of 511, so it has room for 511-300=211 objects;

The remaining 32 data objects of type A are allocated to placement group cluster 3; type A is now fully allocated, with 0 remaining;

The remaining room in placement group cluster 3 is 211-32=179;

Type B is allocated next, preferentially to placement group cluster 3, whose read and write latencies are comparatively low;

All 150 data objects of type B are allocated to placement group cluster 3, whose remaining room is then 179-150=29;

Type C is allocated next, again preferentially to placement group cluster 3;

29 data objects of type C are allocated to placement group cluster 3, leaving 200-29=171 objects of type C.

Placement group cluster 3 is full.

| | Load | A | B | C | RMAX |
|---|---|---|---|---|---|
| Placement group cluster 1 | 60 | 195 | 0 | 0 | 255 |
| Placement group cluster 2 | 260 | 123 | 0 | 0 | 383 |
| Placement group cluster 3 | 300 | 32 | 150 | 29 | 511 |
| Placement group cluster 4 | 530 | 0 | 0 | 0 | 638 |
| Placement group cluster 5 | 700 | 0 | 0 | 0 | 766 |

4. Placement group cluster 4 already has a load of 530 and a calculated maximum of 638, so it has room for 638-530=108 objects;

108 data objects of type C are allocated to placement group cluster 4, leaving 171-108=63 objects of type C.

Placement group cluster 4 is full.

| | Load | A | B | C | RMAX |
|---|---|---|---|---|---|
| Placement group cluster 1 | 60 | 195 | 0 | 0 | 255 |
| Placement group cluster 2 | 260 | 123 | 0 | 0 | 383 |
| Placement group cluster 3 | 300 | 32 | 150 | 29 | 511 |
| Placement group cluster 4 | 530 | 0 | 0 | 108 | 638 |
| Placement group cluster 5 | 700 | 0 | 0 | 0 | 766 |

5. Placement group cluster 5 already has a load of 700 and a calculated maximum of 766, so it has room for 766-700=66 objects;

The remaining 63 data objects of type C are allocated to placement group cluster 5; type C is now fully allocated, with 0 remaining.

Placement group cluster 5 still has room for 66-63=3 objects.

| | Load | A | B | C | RMAX |
|---|---|---|---|---|---|
| Placement group cluster 1 | 60 | 195 | 0 | 0 | 255 |
| Placement group cluster 2 | 260 | 123 | 0 | 0 | 383 |
| Placement group cluster 3 | 300 | 32 | 150 | 29 | 511 |
| Placement group cluster 4 | 530 | 0 | 0 | 108 | 638 |
| Placement group cluster 5 | 700 | 0 | 0 | 63 | 766 |

In step 816, from the final result, calculate the proportion of each class of data object to be stored that is assigned to each "placement group cluster":

| | A | B | C |
|---|---|---|---|
| Placement group cluster 1 | 195/350=0.56 | 0/150=0 | 0/200=0 |
| Placement group cluster 2 | 123/350=0.35 | 0/150=0 | 0/200=0 |
| Placement group cluster 3 | 32/350=0.09 | 150/150=1 | 29/200=0.145 |
| Placement group cluster 4 | 0/350=0 | 0/150=0 | 108/200=0.54 |
| Placement group cluster 5 | 0/350=0 | 0/150=0 | 63/200=0.315 |
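Step 816 is a row-wise division of the recorded counts by the class totals. A short check, using the counts from the final allocation table (the variable names are illustrative):

```python
# placed[class][cluster]: objects of each class recorded in step 810,
# listed for placement group clusters 1 to 5.
placed = {
    "A": [195, 123, 32, 0, 0],
    "B": [0, 0, 150, 0, 0],
    "C": [0, 0, 29, 108, 63],
}

# Each count divided by the total number of objects of its class.
ratios = {
    cls: [count / sum(row) for count in row]
    for cls, row in placed.items()
}

for cls, row in ratios.items():
    print(cls, [round(r, 3) for r in row])
```

Each row of `ratios` sums to 1, because every object of a class ends up in exactly one "placement group cluster".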

（二）、下面说明本发明的步骤5如何将不同类型的数据对象映射到不同的“放置组”。(2). The following describes how step 5 of the present invention maps different types of data objects to different "placement groups".

本实施例中，假设系统有100个“放置组”，编号从1到100。根据系统存储设备类型，这些“放置组”被分成了3个“放置组集群”： 1-20号为第一个“放置组集群”，21-50号为第二个“放置组集群”，51-100号为第三个“放置组集群”。In this embodiment, it is assumed that the system has 100 "placement groups", numbered from 1 to 100. According to the type of system storage device, these "placement groups" are divided into three "placement group clusters": 1-20 is the first "placement group cluster", 21-50 is the second "placement group cluster", No. 51-100 is the third "placement group cluster".

如图3所示，将一个读密集型的数据对象映射到“放置组”13。假设通过如图1的流程算法得出读密集型数据对象在三个“放置组集群”中的分布比例(Distribution Ratio)为6：2：2，也就是：1-20号“放置组”为第一个“放置组集群”，有60%的读密集型的数据属于第一个“放置组集群”，21-50号“放置组”为第二个“放置组集群”，有20%的读密集型数据属于第二个“放置组集群”，51-100号“放置组”为第三个“放置组集群”，有20%的读密集型数据属于第三个“放置组集群”。由于当前这个读密集型数据对象的标识经过哈希函数得到的结果是50，在第一个“放置组集群”的范围之内，再利用哈希算法，计算出此该数据对象的目标“放置组”为13。As shown in FIG. 3 , a read-intensive data object is mapped to a “placement group” 13 . Assume that the distribution ratio (Distribution Ratio) of read-intensive data objects in the three "placement group clusters" is 6:2:2 through the process algorithm shown in Figure 1, that is, the "placement groups" of numbers 1-20 are For the first "placement group cluster", 60% of the read-intensive data belongs to the first "placement group cluster", and "placement group" No. 21-50 is the second "placement group cluster", with 20% Read-intensive data belongs to the second "placement group cluster", No. 51-100 "placement group" belongs to the third "placement group cluster", and 20% of the read-intensive data belongs to the third "placement group cluster". Since the result of the current read-intensive data object’s identification through the hash function is 50, within the scope of the first “placement group cluster”, the hash algorithm is used to calculate the target “placement” of the data object. group" is 13.

As shown in FIG. 4, a write-intensive data object is mapped to "placement group" 62. Assume that the distribution ratio of write-intensive data objects across the three "placement group clusters" is 1:3:6. Hashing the identifier of this data object also yields 50, but here 50 belongs to the third "placement group cluster", because read-intensive data and write-intensive data are placed into each "placement group cluster" in different proportions. The hash-value row in the middle of FIG. 4 shows the placement proportions of the three "placement group clusters": data objects with hash values 1-10 are placed in the first "placement group cluster", those with hash values 11-40 in the second, and those with hash values 41-100 in the third. The object is therefore finally mapped to "placement group" 62.
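The two-level mapping of this embodiment can be illustrated with a minimal Python sketch. It is not the implementation of the invention: the function names `select_cluster`, `stable_hash`, and `place`, the use of SHA-256, and the salted second hash are assumptions made for illustration. Only the "placement group" ranges, the ratios 6:2:2 and 1:3:6, and the hash range 1-100 are taken from the embodiment. Because the actual hash function is not specified here, the sketch does not reproduce the specific "placement group" numbers 13 and 62.

```python
import hashlib

# Inclusive "placement group" ID range of each cluster (from the embodiment).
CLUSTER_PG_RANGES = [(1, 20), (21, 50), (51, 100)]

# Distribution ratio of each data class over the three clusters (from the embodiment).
RATIOS = {"read_intensive": (6, 2, 2), "write_intensive": (1, 3, 6)}


def select_cluster(hash_value, ratio, buckets=100):
    """Return the 0-based index of the cluster whose share of 1..buckets holds hash_value."""
    if not 1 <= hash_value <= buckets:
        raise ValueError("hash_value must lie in 1..buckets")
    total = sum(ratio)
    cumulative = 0
    for index, share in enumerate(ratio):
        cumulative += share
        # Upper bound of this cluster's hash interval, e.g. 60, 80, 100 for 6:2:2.
        if hash_value <= buckets * cumulative // total:
            return index
    raise ValueError("ratio does not cover the hash range")


def stable_hash(key, modulus):
    """Assumed hash for illustration: SHA-256 reduced to the range 1..modulus."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % modulus + 1


def place(object_id, data_class):
    """Map an object to (cluster index, placement group ID) in two hashing steps."""
    first_hash = stable_hash(object_id, 100)
    cluster = select_cluster(first_hash, RATIOS[data_class])
    first_pg, last_pg = CLUSTER_PG_RANGES[cluster]
    # Second hash (salted so it differs from the first) picks a group inside the cluster.
    offset = stable_hash("pg:" + object_id, last_pg - first_pg + 1) - 1
    return cluster, first_pg + offset
```

With the hash value 50 used in the embodiment, `select_cluster(50, RATIOS["read_intensive"])` selects the first cluster (index 0), while `select_cluster(50, RATIOS["write_intensive"])` selects the third (index 2), matching FIG. 3 and FIG. 4.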

Claims

1. A decentralized distributed heterogeneous storage system data distribution method, characterized in that it comprises the following steps:

Step 1. During the execution of the program, count the number of times each data object is read/written, convert the number of reads and writes into a weight, and use it as the access mode of the data; classify the data objects according to the access mode of the data;

Step 2. Classify the storage devices according to their capacity and read/write performance;

Step 3. Divide the stored data into different "placement group clusters", where a "placement group cluster" includes multiple "placement groups", and each type of storage device corresponds to a type of "placement group cluster";

Step 4. According to the load balancing target and performance index of the storage system, calculate the proportion of each data object to be stored that should be placed in different types of "placement group clusters";

Step 5. Use the hash algorithm to determine which "placement group" in the "placement group cluster" the data object to be stored belongs to;

Step 6. Use the data distribution algorithm of the storage system to store the data objects in each "placement group" in multiple corresponding storage devices.

2. A decentralized distributed heterogeneous storage system data distribution method according to claim 1, characterized in that, in said step 4, the steps for calculating the proportion of each type of data object to be stored that is placed in each "placement group cluster" include:

Step 802, calculating the total number of all data objects to be stored;

Step 803, calculating the total number of existing data objects;

Step 804, according to the load balancing condition, calculate the maximum value of data objects that each "placement group cluster" can store;

Step 805, arrange all the data objects to be stored in ascending order according to the average write times;

Step 806, arrange all "placement group clusters" in descending order of performance;

Step 807, initialize the variable i=0, which is used to scan the category of data objects to be stored;

Step 808, initialize the variable j=0, which is used to scan the "placement group cluster" category;

Step 809, assign the i-th type of data object to be stored to the j-th type of "placement group cluster";

Step 810, recording the number of type-i data objects to be stored that are placed in "placement group cluster" j;

Step 811, judging whether the "placement group cluster" j has reached the maximum storage number, if yes, go to step 812, otherwise go to step 813;

Step 813, judging whether all data objects to be stored have been processed, if yes, execute step 816, otherwise, execute step 814;

Step 814, process the next type of data object to be stored, and execute step 809;

Step 812, process the next "placement group cluster";

Step 815, judging whether all the "placement group clusters" have been processed, if yes, execute step 816, otherwise, execute step 809;

Step 816, according to the number of each type of data object to be stored stored in each "placement group cluster" recorded in step 810, calculate the proportion of each type of data object to be stored in each type of "placement group cluster".

3. A decentralized distributed heterogeneous storage system data distribution method according to claim 2, characterized in that, in the step 809, the method of assigning the i-th type of data object to be stored to the j-th type of "placement group cluster" is: following the orders arranged in step 805 and step 806, and according to the capacity of each "placement group cluster" calculated in step 804, the data objects to be stored of each type are filled in turn.

4. A decentralized distributed heterogeneous storage system data distribution method according to claim 1, characterized in that: in step 6, a pseudo-random hash algorithm is used to map each "placement group" to different storage devices.
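For illustration only, the ratio calculation recited in claims 2 and 3 (steps 805 to 816) can be sketched in Python as a greedy fill. The function name `distribution_ratios`, its argument layout, and the numbers in the usage note are hypothetical. The sketch assumes the per-cluster maximum of step 804 has already been derived from the load balancing condition and is passed in as an input, since how that maximum is computed is not restated here.

```python
def distribution_ratios(class_counts, cluster_capacity):
    """Greedy fill of data classes into "placement group clusters".

    class_counts: list of (class name, object count), already sorted in
        ascending order of average write count (step 805).
    cluster_capacity: maximum object count of each cluster (step 804),
        already sorted in descending order of performance (step 806).
    Returns {class name: [fraction of that class placed in each cluster]}.
    """
    placed = {name: [0] * len(cluster_capacity) for name, _ in class_counts}
    remaining = list(cluster_capacity)
    j = 0  # index of the cluster currently being filled (step 808)
    for name, count in class_counts:  # scan the data classes (steps 807, 814)
        left = count
        while left > 0 and j < len(remaining):
            take = min(left, remaining[j])  # assign class i to cluster j (step 809)
            placed[name][j] += take         # record the placed count (step 810)
            remaining[j] -= take
            left -= take
            if remaining[j] == 0:           # cluster j is full (steps 811, 812)
                j += 1
        if left > 0:
            raise ValueError("total cluster capacity is too small")
    # Convert the recorded counts into per-class proportions (step 816).
    return {
        name: [n / count for n in placed[name]]
        for name, count in class_counts
        if count > 0
    }
```

As a hypothetical usage example, 100 read-intensive and 100 write-intensive objects filled into clusters with maxima of 60, 40, and 100 objects give the read-intensive class the proportions 0.6, 0.4, and 0.0, and place the write-intensive class entirely in the third cluster. Classes with fewer writes reach the higher-performance clusters first, which is the effect the ascending and descending orderings of steps 805 and 806 are meant to produce.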