CN112000467A

CN112000467A - Data skew processing method and device, terminal device and storage medium

Info

Publication number: CN112000467A
Application number: CN202010728649.0A
Authority: CN
Inventors: 魏文国; 黄子纯; 廖信海; 谢桂园
Original assignee: Guangdong Polytechnic Normal University; Guangdong University of Foreign Studies
Current assignee: Guangdong Polytechnic Normal University; Guangdong University of Foreign Studies
Priority date: 2020-07-24
Filing date: 2020-07-24
Publication date: 2020-11-27

Abstract

The invention discloses a data skew processing method, device, terminal device, and storage medium. The method includes: sampling data based on a preset sampling algorithm to obtain equal-probability sample data, and obtaining the size of the space occupied by each value through cumulative calculation over the data; dividing the sample data into skewed data and non-skewed data using a data skew detection model; and allocating the non-skewed data to a preset Hash partition while dynamically allocating the skewed data to the storage partitions based on a dynamic allocation algorithm, so as to balance the Spark load. In embodiments of the invention, a variable weight is added to predict the partition size; after the data is sampled, the data skew detection model classifies it into skewed data and non-skewed data; the non-skewed data is used to predict the size of the Reduce partitions; and the skewed data is distributed evenly across the partitions, which makes the Spark load more balanced.

Description

A data skew processing method, device, terminal device and storage medium

Technical field

The invention relates to the technical field of computer applications, and in particular to a data skew processing method.

Background

With the arrival of the big data era and the rise of various network technologies, the volume of information data has expanded rapidly. Traditional processing and storage systems can no longer cope with massive data, and on currently popular big data analysis platforms such as Hadoop and Spark, data skew seriously degrades performance. Most existing research on the data skew problem targets the Hadoop platform, and comparatively little addresses data skew in Spark. In Spark, the stage before the Shuffle is called the Map stage and the stage after it is called the Reduce stage. When the data is unevenly distributed, the default Spark partitioning algorithm produces data skew after the Shuffle operation. Existing partitioning algorithms distribute overloaded tasks onto additionally split or merged partitions, and these extra operations in turn hinder system performance.

Summary of the invention

The present invention provides a data skew processing method to solve the technical problem that existing partitioning algorithms introduce extra operations that hinder system performance. A variable weight is added to predict the partition size; after the data is sampled, a data skew detection model classifies it into skewed data and non-skewed data; the non-skewed data is used to predict the size of the Reduce partitions; and the skewed data is distributed evenly across the partitions, which makes the Spark load more balanced.

To solve the above technical problem, in a first aspect, an embodiment of the present invention provides a data skew processing method, including the following steps:

sampling data based on a preset sampling algorithm to obtain equal-probability sample data, and obtaining the size of the space occupied by each value through cumulative calculation over the data;

dividing the sample data into skewed data and non-skewed data using a data skew detection model;

allocating the non-skewed data to a preset Hash partition, and dynamically allocating the skewed data to the storage partitions based on a dynamic allocation algorithm, so as to balance the Spark load.

In one embodiment of the present invention, the preset sampling algorithm includes:

extracting K sample data from intermediate data with a total amount of A, where K is less than A;

putting the 1st to K-th extracted sample data into a sample array;

traversing the intermediate data in sequence from the (K+1)-th item onward, such that the x-th item is selected with probability K/x and, when selected, replaces an element of the sample array chosen with equal probability 1/K.
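The equal-probability property of this procedure can be checked with a short derivation. It assumes the standard reservoir sampling reading of the steps above (the x-th item is accepted with probability K/x and overwrites one of the K slots uniformly at random); the symbol P and the index y are introduced here only for illustration.

```latex
% At a later step y, a given slot is overwritten with probability
% (K/y) * (1/K) = 1/y, so an item already in the sample survives
% step y with probability 1 - 1/y = (y-1)/y.
\begin{align}
P(\text{item } x \text{ in final sample})
  &= \frac{K}{x}\prod_{y=x+1}^{A}\left(1-\frac{1}{y}\right)
   = \frac{K}{x}\cdot\frac{x}{A}
   = \frac{K}{A}, \qquad x > K,\\
P(\text{item } x \text{ in final sample})
  &= \prod_{y=K+1}^{A}\left(1-\frac{1}{y}\right)
   = \frac{K}{A}, \qquad x \le K.
\end{align}
```

Under these assumptions every one of the A items ends up in the sample with the same probability K/A, and A does not need to be known before the traversal starts.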

In one embodiment of the present invention, dynamically allocating the skewed data to the storage partitions based on the dynamic allocation algorithm is specifically:

after the skewed data is identified using the data skew detection model, assigning the skewed data to a map task;

splitting the data and generating intermediate output data through the map task process;

allocating the intermediate output data to at least one reduce task based on the dynamic allocation algorithm.

In one embodiment of the present invention, the dynamic allocation algorithm is used to allocate the skewed data detected by the data skew detection module to the Reducer with the smallest current load.

The keys in the storage partition are sorted in descending order, the largest key is marked as 1, and the remaining keys are marked as 0.

In one embodiment of the present invention, dividing the sample data into skewed data and non-skewed data using the data skew detection model is specifically as follows.

Given the load size of each Reducer and the j-th value of each key, the model computes, in order: the load average; the standard deviation std, which measures the overall load balance level; the coefficient of deviation FoS, which measures the load balance; the standard deviation that reflects the data skew range of the initial cluster set; and the metric FoD, which measures the overall deviation of all cluster sets.

The degree of skew of the data is judged by comparing FoD with a preset intermediate value w, and the sample data is thereby divided into skewed data and non-skewed data.
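One plausible reading of these quantities is sketched below. It assumes the conventional definitions of mean, population standard deviation, and coefficient of variation. The symbols m (number of Reducers), L_i (load of Reducer i), n (number of clusters), and C_t (size of cluster t) are introduced here for illustration only, and the exact forms of FoS and FoD are assumptions rather than the formulas of the original filing.

```latex
\begin{align}
\bar{L} &= \frac{1}{m}\sum_{i=1}^{m} L_i, &
\mathrm{std} &= \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(L_i-\bar{L}\right)^{2}}, &
\mathrm{FoS} &= \frac{\mathrm{std}}{\bar{L}},\\
\bar{C} &= \frac{1}{n}\sum_{t=1}^{n} C_t, &
\mathrm{std}_C &= \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(C_t-\bar{C}\right)^{2}}, &
\mathrm{FoD} &= \frac{\mathrm{std}_C}{\bar{C}}.
\end{align}
```

With this reading, FoS and FoD are dimensionless, so a single threshold w can be compared against FoD regardless of the absolute data volume.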

In a second aspect, an embodiment of the present invention further provides a data skew processing device, including:

a data sampling module, configured to sample data based on a preset sampling algorithm to obtain equal-probability sample data, and to obtain the size of the space occupied by each value through cumulative calculation over the data;

a data division module, configured to divide the sample data into skewed data and non-skewed data using a data skew detection model;

a data allocation module, configured to allocate the non-skewed data to a preset Hash partition, and to dynamically allocate the skewed data to the storage partitions based on a dynamic allocation algorithm, so as to balance the Spark load.

In one embodiment of the present invention, the data allocation module is further configured to perform the dynamic allocation of the skewed data described in the first aspect.

In one embodiment of the present invention, the data division module is configured to apply the data skew detection model described in the first aspect: given the load size of each Reducer and the j-th value of each key, it computes the load average as in the first aspect.

In a third aspect, the present invention further provides a terminal device, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, where the processor implements the data skew processing method described above when executing the computer program.

In a fourth aspect, the present invention further provides a computer-readable storage medium including a stored computer program, where, when the computer program runs, the device on which the computer-readable storage medium is located is controlled to execute the data skew processing method described above.

Compared with the prior art, the embodiments of the present invention provide a data skew processing method, device, terminal device, and storage medium, one embodiment of which has the following beneficial effects:

The preset sampling algorithm predicts the size of the storage partitions; after the data is sampled, the data skew detection model classifies it into skewed data and non-skewed data; the non-skewed data is used to predict the size of the Reduce partitions; and the skewed data is distributed evenly across the partitions. This mechanism makes the Spark load more balanced.

Description of drawings

To illustrate the technical solutions of the present invention more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow chart of the steps of the data skew processing method in an embodiment of the present invention;

FIG. 2 is an overall framework diagram of SP-IRS in the data skew processing method in an embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.

With the rapid development of informatization and digitization, the scale of data is growing exponentially. Gartner accordingly characterized big data by the 3Vs: Volume, Variety, and Velocity. On this basis, International Data Corporation (IDC) extended the characterization to 4Vs, holding that big data should also have Value, so the processing of big data has great research value.

In theory, the data generated in many application scenarios is skewed, following the pattern of the "80/20 rule" proposed by Pareto: in any set of things, the most important items make up only a small part, about 20%, while the rest make up as much as 80%. The mass generation of data and the large amount of hot-spot data lead to uneven data distribution, which causes data skew.

An existing algorithm called SCID (a splitting and combination algorithm for skewed intermediate data blocks) is used to balance the load on the Reducers in Spark. In this method, a sampling job based on reservoir sampling must be run first; the cluster sizes are then estimated from the sample and the heavy clusters are identified; in the next step, the algorithm attempts to split the heavy clusters. The reservoir sampling job adds extra steps that increase the running time on the Spark platform, and the extra splitting and aggregation waste cluster resources.

In view of this, the present invention takes the data skew problem in Spark as its subject and focuses on how to reduce the total completion time of an application through partition load balancing. FIG. 2 shows the execution framework of the whole model, which consists of two independent jobs. First, the original input data is sampled using an improved reservoir sampling algorithm. The output of this stage is a matrix that records the size and key-value distribution of the current input data and is used to determine the allocation of reduce tasks. Before the tasks run, the reduce tasks are allocated according to this matrix. In this way, most of the data processed by a reduce task is kept as local as possible, which saves traffic cost and improves the performance of the reduce tasks.

First, the input data is loaded into one or more distributed file systems (such as HDFS), in which each file is divided into smaller blocks called input splits. The data is sampled with the reservoir sampling algorithm; the improved data skew detection model then identifies the skewed data, which is assigned to a map task; the map task process splits the data and generates intermediate output, which is then allocated to one or more reduce tasks by the dynamic partitioning algorithm. Specifically, as shown in FIG. 1, an embodiment of the present invention provides a data skew processing method, including the following steps:

S1. Sampling data based on a preset sampling algorithm to obtain equal-probability sample data, and obtaining the size of the space occupied by each value through cumulative calculation over the data, which makes it easy to determine which keys have the most values;

In this embodiment, the preset sampling algorithm includes:

As an example, the load balancing mechanism is based on an improved reservoir sampling algorithm: SP-IRS (Spark load balancing mechanism based on Improved Reservoir Sampling algorithm). A variable weight is added to the traditional reservoir sampling algorithm to predict the partition size; the data skew detection model then classifies the data into skewed data and non-skewed data; finally, the skewed data is dynamically and evenly allocated to the partitions.

The idea of reservoir sampling is as follows. To draw k samples from intermediate data with a total amount of A, the first k items are put into an array; the data is then traversed in sequence from the (k+1)-th item onward, and during the traversal the x-th item is selected with probability k/x and, when selected, replaces an element of the sample array chosen with probability 1/k. Reservoir sampling can therefore draw samples with equal probability even when the total amount of data is unknown. The improved reservoir sampling algorithm additionally accumulates the amount of intermediate data while sampling. By adding a variable weight to the reservoir sampling algorithm above, this embodiment can predict the partition size.

The improved reservoir sampling algorithm of this embodiment computes the size of the space occupied by each value during sampling. This added step both reduces the amount of data to be processed and saves work needed for the space allocation in the next step.
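The sampling step described above can be sketched as follows. This is a minimal illustration, not the listing from the filing: the function name `reservoir_sample_with_sizes`, the use of `len(value)` as the space occupied by a value, and the per-key tally standing in for the added weight variable are all assumptions.

```python
import random
from collections import defaultdict


def reservoir_sample_with_sizes(records, k, rng=None):
    """Draw k (key, value) records uniformly while accumulating value sizes.

    Returns (sample, size_by_key, total_size). The per-key size tally is a
    hypothetical stand-in for the "variable weight" used to predict
    partition sizes; len(value) is an assumed measure of occupied space.
    """
    rng = rng or random.Random()
    sample = []
    size_by_key = defaultdict(int)
    total_size = 0
    for x, (key, value) in enumerate(records, start=1):
        # Accumulate sizes for every record, sampled or not.
        size = len(value)
        size_by_key[key] += size
        total_size += size
        if x <= k:
            # The first k records fill the reservoir directly.
            sample.append((key, value))
        elif rng.random() < k / x:
            # Accept the x-th record with probability k/x and overwrite
            # one of the k slots chosen uniformly (probability 1/k each).
            sample[rng.randrange(k)] = (key, value)
    return sample, dict(size_by_key), total_size
```

Because the sizes are accumulated in the same pass as the sampling, no second scan of the intermediate data is needed to estimate how much space each key will occupy.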

S2. Dividing the sample data into skewed data and non-skewed data using a data skew detection model;

In this embodiment, the improved data skew detection model divides the data obtained by the improved reservoir sampling algorithm into two parts. For example, the original RDD_1 is divided into RDD_11 and RDD_12, where RDD_11 holds the skewed keys and RDD_12 holds the keys that are not skewed. The data skew model of this embodiment is as follows.

Consider the j-th value of key k_i, the load size of the Reducer that receives it, and the set of all values corresponding to that key.

The number of <key, value> pairs contained in a Reducer is then given by formula (1).

From this, the load average of each Reducer can be calculated by formula (2).

After the load average of each Reducer is calculated, the differences from this average need to be quantified uniformly. The standard deviation std, which measures the overall load balance level of the Reducers, is defined by formula (3).

Since the standard deviation std is usually used to reflect the range of fluctuation of a sequence, the metric FoS (coefficient of deviation) can be used to measure the load balance of all Reducers, as given by formula (4).

On the basis of FoS, and for the same reason, a standard deviation is used to reflect the data skew range of the initial cluster set, as given by formula (5).

The FoD metric is used to measure the overall deviation of all cluster sets, as given by formula (6).

The intermediate value w is determined afterwards by experiment. When FoD ≤ w, the data is only slightly skewed; when FoD ≥ w, the data is heavily skewed.
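The threshold test can be illustrated with a small sketch. It makes several assumptions that go beyond the text: the coefficient of variation (standard deviation divided by mean) stands in for FoD, the comparison is taken as strict (FoD > w means skewed), and keys are peeled off largest-first until the remaining keys pass the test. The function names are hypothetical.

```python
import math


def coefficient_of_variation(sizes):
    """Population standard deviation divided by the mean (0.0 if undefined)."""
    n = len(sizes)
    if n == 0:
        return 0.0
    mean = sum(sizes) / n
    if mean == 0:
        return 0.0
    std = math.sqrt(sum((s - mean) ** 2 for s in sizes) / n)
    return std / mean


def split_keys_by_skew(size_by_key, w):
    """Split keys into (skewed, non_skewed) using threshold w.

    Assumed heuristic: while the deviation of the remaining key sizes
    exceeds w, move the largest remaining key to the skewed group.
    """
    remaining = sorted(size_by_key.items(), key=lambda kv: kv[1], reverse=True)
    skewed = []
    while len(remaining) > 1 and \
            coefficient_of_variation([s for _, s in remaining]) > w:
        skewed.append(remaining.pop(0)[0])
    return skewed, [key for key, _ in remaining]
```

The two returned groups correspond to the roles of RDD_11 (skewed keys) and RDD_12 (non-skewed keys); how the filing itself derives the groups from FoD is not reproduced here.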

Next, so that the amount of data in each Reducer is roughly the same, a dynamic allocation algorithm is proposed. Its main idea is to allocate the skewed data found by the skew detection algorithm to the Reducer with the smallest current load. The keys in a partition are sorted in descending order, the largest key is marked as 1, and the remaining keys are marked as 0. After the allocation of each cluster is completed, the partitions are sorted in descending order again according to the partition sizes predicted by the improved reservoir sampling algorithm, and the cluster allocation process is repeated. The details are given in step S3.

S3. Allocating the non-skewed data to a preset Hash partition, and dynamically allocating the skewed data to the storage partitions based on a dynamic allocation algorithm, so as to balance the Spark load.
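The two-way routing in step S3 can be sketched in plain Python. The sketch assumes string keys and uses `zlib.crc32` as a deterministic stand-in for the key hash; the names `hash_partition`, `route`, and `skewed_assignment` are hypothetical and not part of any Spark API.

```python
import zlib


def hash_partition(key, num_partitions):
    """Map a non-skewed key to a partition by hashing (assumed scheme)."""
    # crc32 is used instead of Python's built-in hash() because the latter
    # is randomized per process for strings.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(key, num_partitions, skewed_assignment):
    """Return the partition for a key.

    Skewed keys use the partition chosen by the dynamic allocation step
    (passed in as the dict skewed_assignment); all other keys fall back
    to plain hash partitioning.
    """
    if key in skewed_assignment:
        return skewed_assignment[key]
    return hash_partition(key, num_partitions)
```

Keeping the non-skewed keys on ordinary hash partitioning means that only the small set of skewed keys needs an explicit lookup table.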

In this embodiment, dynamically allocating the skewed data to the storage partitions based on the dynamic allocation algorithm is specifically:

The data statistics are obtained from the improved reservoir sampling algorithm. The main idea of the dynamic allocation algorithm is to allocate the skewed data found by the skew detection algorithm to the Reducer with the smallest current load. The keys in a partition are sorted in descending order, the largest key is marked as 1, and the remaining keys are marked as 0. After the allocation of each cluster is completed, the Reducers are sorted in descending order again according to their current remaining capacity, and the cluster allocation process is repeated.
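The least-loaded-first idea can be sketched as a greedy assignment. This is an illustrative sketch under stated assumptions, not the listing from the filing: the function name `assign_skewed_keys`, the `initial_loads` argument (for example, predicted loads from the non-skewed data), and the use of a min-heap are assumptions, and the 1/0 marking of keys is omitted.

```python
import heapq


def assign_skewed_keys(skewed_sizes, initial_loads):
    """Greedily assign skewed keys to the currently least-loaded Reducer.

    skewed_sizes: dict mapping key -> estimated size of that key's data.
    initial_loads: list of starting loads, one entry per Reducer.
    Returns (assignment, final_loads).
    """
    # Min-heap of (load, reducer_index); the smallest load is popped first.
    heap = [(load, r) for r, load in enumerate(initial_loads)]
    heapq.heapify(heap)
    assignment = {}
    # Place the largest clusters first (descending order by size).
    for key, size in sorted(skewed_sizes.items(),
                            key=lambda kv: kv[1], reverse=True):
        load, r = heapq.heappop(heap)
        assignment[key] = r
        heapq.heappush(heap, (load + size, r))
    final_loads = [0] * len(initial_loads)
    for load, r in heap:
        final_loads[r] = load
    return assignment, final_loads
```

For example, with sizes `{"a": 50, "b": 30, "c": 20}` and two empty Reducers, key `a` goes to Reducer 0 and keys `b` and `c` go to Reducer 1, so both Reducers end with a load of 50.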

In summary, the embodiments of the present invention provide a data skew processing method. First, building on data skew models proposed by other researchers, an improved data skew model is defined to measure the degree of skew of the intermediate data, so that the degree of skew can be characterized more precisely and the job running time of the application reduced. Second, a reservoir sampling algorithm is proposed which, compared with existing mechanisms, adds a variable weight to the traditional reservoir sampling algorithm to predict the partition size. After the data is sampled, the data skew detection model classifies it into skewed data and non-skewed data; the non-skewed data is used to predict the size of the Reduce partitions; and the skewed data is distributed evenly across the partitions. This mechanism makes the Spark load more balanced.

The present invention also provides a data skew processing device, including:

a data sampling module, configured to sample data based on a preset sampling algorithm to obtain equal-probability sample data;

a data allocation module, configured to predict the size of the storage partitions from the non-skewed data, and to dynamically allocate the skewed data to the storage partitions based on a dynamic allocation algorithm, so as to balance the Spark load.

Given the load size of each Reducer and the j-th value of each key, the load average is calculated by formula (2).

The present invention also provides a terminal device, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, where the processor implements the data skew processing method described above when executing the computer program.

Exemplarily, the computer program may be divided into one or more modules/units, which are stored in the memory and executed by the processor to carry out the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments describe the execution process of the computer program in the terminal device.

The terminal device may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a smart tablet. The terminal device may include, but is not limited to, a processor and a memory. Those skilled in the art will understand that these components are only examples and do not limit the terminal device, which may include more or fewer components than listed, combine certain components, or use different components; for example, the terminal device may also include input and output devices, network access devices, buses, and the like.

The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the terminal device and connects the various parts of the entire terminal device through various interfaces and lines.

The memory may be used to store the computer program and/or modules, and the processor implements the various functions of the terminal device by running or executing the computer program and/or modules stored in the memory and by calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required for at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created through use of the mobile phone (such as audio data or a phone book). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.

If the modules/units integrated in the terminal device are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. On this understanding, all or part of the processes in the methods of the above embodiments may also be completed by a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the method embodiments above. The computer program includes computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so on. It should be noted that the content covered by the computer-readable medium may be expanded or reduced as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.

需说明的是，以上所描述的装置实施例仅仅是示意性的，其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的，作为单元显示的部件可以是或者也可以不是物理单元，即可以位于一个地方，或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。另外，本发明提供的装置实施例附图中，模块之间的连接关系表示它们之间具有通信连接，具体可以实现为一条或多条通信总线或信号线。本领域普通技术人员在不付出创造性劳动的情况下，即可以理解并实施。It should be noted that the device embodiments described above are only schematic, wherein the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical unit, that is, it can be located in one place, or it can be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment. In addition, in the drawings of the apparatus embodiments provided by the present invention, the connection relationship between the modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement it without creative effort.

The above are preferred embodiments of the present invention. It should be pointed out that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present invention, and such improvements and refinements are also regarded as falling within the protection scope of the present invention.

Claims

1. A data skew processing method, comprising the steps of:

sampling data based on a preset sampling algorithm to obtain equal-probability sample data, and obtaining the size of the space occupied by each value through cumulative calculation over the data;

dividing the sample data into skewed data and non-skewed data by using a data skew detection model; and

allocating the non-skewed data to preset Hash partitions, and dynamically allocating the skewed data to each storage partition based on a dynamic allocation algorithm, so as to balance the Spark load.

2. The data skew processing method of claim 1, wherein the preset sampling algorithm comprises:

extracting K sample data from intermediate data having a total amount of A, wherein K is less than A;

putting the 1st to the K-th extracted data into a sample array; and

sequentially traversing the intermediate data starting from the (K+1)-th data, such that the x-th data is selected into the sample array with probability K/x and replaces an element of the sample array with equal probability 1/K.
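
The sampling steps recited above match the classic reservoir sampling scheme. The following is a minimal Python sketch under that assumption; the function name `reservoir_sample` and the `rng` parameter are illustrative and do not appear in the source.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Draw k items uniformly at random from an iterable of unknown length."""
    rng = rng or random.Random()
    sample = []
    for x, item in enumerate(stream, start=1):
        if x <= k:
            # The first k items fill the sample array directly.
            sample.append(item)
        else:
            # j is uniform in [0, x), so j < k happens with probability k/x.
            j = rng.randrange(x)
            if j < k:
                # Given j < k, j is uniform over the k slots (1/k each).
                sample[j] = item
    return sample
```

Because a single random index decides both whether the x-th item is kept and which slot it overwrites, every item of the stream ends up in the sample with the same probability without the total amount A being known in advance.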

3. The data skew processing method of claim 1, wherein the dynamically allocating the skewed data to each storage partition based on a dynamic allocation algorithm specifically comprises:

after identifying the skewed data using the data skew detection model, assigning the skewed data to a map task;

splitting the skewed data and generating intermediate output data through the map task process; and

allocating the intermediate output data to at least one reduce task based on the dynamic allocation algorithm.

4. The data skew processing method of claim 1 or 3, wherein the dynamic allocation algorithm is configured to allocate the skewed data detected by the data skew detection model to the Reducer with the minimum current load.
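
The allocation rule can be illustrated with a small Python sketch: non-skewed keys go to a hash partition, and each skewed key goes to the partition whose current load is smallest. This is a plain greedy illustration, not the patented implementation. The function name `assign_partitions`, the use of CRC32 as the hash, and the largest-first ordering of skewed keys are all assumptions introduced here.

```python
import zlib

def assign_partitions(key_sizes, skewed_keys, num_partitions):
    """Return (assignment, loads) for a set of keys.

    key_sizes:   dict of key -> estimated size (hypothetical unit, e.g. records)
    skewed_keys: set of keys already flagged as skewed by a detection step
    """
    loads = [0] * num_partitions
    assignment = {}
    # Non-skewed keys: ordinary hash partitioning. CRC32 is an illustrative
    # choice, used only because it is deterministic across runs.
    for key, size in key_sizes.items():
        if key not in skewed_keys:
            p = zlib.crc32(str(key).encode("utf-8")) % num_partitions
            assignment[key] = p
            loads[p] += size
    # Skewed keys: largest first (an assumption), each sent to the
    # partition with the minimum current load.
    for key in sorted(skewed_keys, key=lambda k: key_sizes[k], reverse=True):
        p = min(range(num_partitions), key=loads.__getitem__)
        assignment[key] = p
        loads[p] += key_sizes[key]
    return assignment, loads
```

For example, with two partitions and two large skewed keys, the second skewed key lands on the partition that did not receive the first one, so the heavy keys are spread out instead of colliding in one hash bucket.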

5. The data skew processing method of claim 1, wherein the dividing the sample data into skewed data and non-skewed data by using the data skew detection model specifically comprises:

letting [symbol not reproduced in the source text] denote the magnitude of the load and [symbol not reproduced in the source text] denote the j-th value, then:

the average load value is: [formula not reproduced in the source text];

the standard deviation measuring the overall load-balancing level is: [formula not reproduced in the source text];

the deviation coefficient measuring load balancing is: [formula not reproduced in the source text];

the standard deviation reflecting the data skew range of the initial cluster set is: [formula not reproduced in the source text];

the overall deviation of all cluster sets, FoD, is: [formula not reproduced in the source text]; and

judging the degree of data skew according to the result of comparing w with FoD, thereby dividing the sample data into skewed data and non-skewed data, wherein w is a preset intermediate value.
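
The formulas of this claim are not available in this text, so the following Python sketch only illustrates the kind of statistics the claim names: an average load, a standard deviation, and a deviation coefficient compared against a threshold. It assumes the population standard deviation and uses the coefficient of variation (standard deviation divided by mean) as a stand-in for FoD. The names `load_statistics` and `is_skewed` are hypothetical.

```python
import statistics

def load_statistics(loads):
    """Return (mean, population standard deviation, coefficient of variation)."""
    mean = statistics.fmean(loads)
    std = statistics.pstdev(loads)
    # Coefficient of variation; defined as 0.0 here when the mean is zero.
    cv = std / mean if mean else 0.0
    return mean, std, cv

def is_skewed(loads, w):
    """Flag a load distribution as skewed when its coefficient of variation
    exceeds the preset threshold w (a stand-in for the FoD comparison)."""
    return load_statistics(loads)[2] > w
```

A perfectly even distribution has a coefficient of variation of zero, while a distribution dominated by one large value yields a coefficient well above 1, so a threshold w between those extremes separates the two cases.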

6. A data skew processing apparatus, comprising:

a data sampling module, configured to sample data based on a preset sampling algorithm to obtain equal-probability sample data, and to obtain the size of the space occupied by each value through cumulative calculation over the data;

a data dividing module, configured to divide the sample data into skewed data and non-skewed data by using a data skew detection model; and

a data allocation module, configured to allocate the non-skewed data to preset Hash partitions, and to dynamically allocate the skewed data to each storage partition based on a dynamic allocation algorithm, so as to balance the Spark load.

7. The data skew processing apparatus of claim 6, wherein the preset sampling algorithm comprises:

8. The data skew processing apparatus of claim 6, wherein the data allocation module is further configured to:

split the skewed data and generate intermediate output data through the map task process;

9. A terminal device comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor implements the data skew processing method according to any one of claims 1 to 5 when executing the computer program.

10. A computer-readable storage medium comprising a stored computer program, wherein the computer program, when executed, controls a device in which the computer-readable storage medium is located to perform the data skew processing method according to any one of claims 1 to 5.