CN117473325A

CN117473325A - Sample equalization method, device, electronic equipment and computer-readable storage medium

Info

Publication number: CN117473325A
Application number: CN202311548163.9A
Authority: CN
Inventors: 石建伟; 陈明; 肖勃飞; 戈汉权; 何兴凤; 杜培良
Original assignee: Zhongdian Jinxin Digital Technology Group Co ltd
Current assignee: Zhongdian Jinxin Digital Technology Group Co ltd
Priority date: 2023-11-17
Filing date: 2023-11-17
Publication date: 2024-01-30

Abstract

The application provides a sample equalization method, a sample equalization device, electronic equipment and a computer-readable storage medium. The method comprises: predicting, for each training sample in a training sample set, a default probability of the training sample; screening out from the training sample set, according to the default probability of each training sample, the training samples whose default probability is greater than a first threshold or less than a second threshold, constructing a first training sample set from the screened training samples, and constructing a second training sample set from the remaining training samples, the first threshold being greater than the second threshold; performing sample equalization processing on the first training sample set to obtain a third training sample set, the third training sample set comprising each training sample in the first training sample set and the new training samples generated by sample expansion; and merging the third training sample set and the second training sample set to obtain a target training sample set. The method improves the accuracy of the prediction model.

Description

Sample equalization method, device, electronic device and computer readable storage medium

Technical Field

The present application relates to the field of computer technology, and in particular to a sample equalization method, device, electronic device, and computer-readable storage medium.

Background Art

In risk assessment, a borrower's default probability is predicted in order to assess the borrower's risk accurately and formulate appropriate risk control strategies. At present, default probability prediction relies mainly on traditional prediction models. When such a prediction model is trained, default samples are few and non-default samples are many, so the training samples are unbalanced, and a prediction model trained on them has low accuracy.

In the prior art, the SMOTE technique is usually applied directly to the training samples to perform sample balancing and thereby expand the number of default samples. However, balancing the training samples directly tends to produce expanded default samples that are not accurate enough. For example, the training samples may include a sample with a default probability of 0.5 (that is, a sample whose default status is not clear enough); if that training sample is expanded, the default status of the resulting new samples will also be insufficiently accurate. New samples of low accuracy affect the training of the prediction model, increase the probability of misjudgment by the trained prediction model, and reduce its accuracy.

Summary of the Invention

In view of this, the purpose of the present application is to provide a sample balancing method, device, electronic device and computer-readable storage medium, so as to improve the accuracy of the prediction model.

In a first aspect, an embodiment of the present application provides a sample equalization method, including:

for each training sample in a training sample set, predicting the default probability of the training sample; wherein the training sample set includes a plurality of training samples; the sample types of the training samples include default samples and non-default samples; and the number of default samples is less than the number of non-default samples;

according to the default probability of each training sample, screening out from the training sample set the training samples whose default probability is greater than a first threshold or less than a second threshold, so as to construct a first training sample set using the screened training samples and a second training sample set using the remaining training samples; the first threshold being greater than the second threshold;

performing sample balancing processing on the first training sample set, so as to expand the training samples in the first training sample set whose sample type is default sample, to obtain a third training sample set; the third training sample set including each training sample in the first training sample set and the new training samples generated by the sample expansion;

merging the third training sample set and the second training sample set to obtain a target training sample set.

In combination with the first aspect, an embodiment of the present application provides a first possible implementation of the first aspect, wherein predicting, for each training sample in the training sample set, the default probability of the training sample includes:

for each training sample in the training sample set, using a trained first default probability prediction model to predict a first default probability of the training sample, and using a trained second default probability prediction model to predict a second default probability of the training sample; the default probability of the training sample including the first default probability and the second default probability of the training sample.

In combination with the first possible implementation of the first aspect, an embodiment of the present application provides a second possible implementation of the first aspect, wherein the first default probability prediction model is trained in the following manner:

for each training sample in the training sample set, inputting the training sample into a first initial default probability prediction model, predicting the default status of the training sample through the first initial default probability prediction model, and outputting the first default probability of the training sample;

inputting the default label of the training sample and the first default probability into a loss function, and calculating the loss value of the first initial default probability prediction model;

when the loss value is greater than a preset loss value, updating the trainable parameters in the first initial default probability prediction model according to the loss value; when the loss value is not greater than the preset loss value, taking the current first initial default probability prediction model as the first default probability prediction model.
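
The training rule above (keep updating the trainable parameters while the loss exceeds a preset value) can be illustrated with a minimal sketch. The text does not name the model type or the loss function at this point, so the logistic-regression model, the binary cross-entropy loss, the mean over the whole sample set, the learning rate and the `max_epochs` safety cap are all assumptions made only for illustration; `train_first_model` is a hypothetical name.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_first_model(samples, labels, preset_loss=0.4, lr=0.5, max_epochs=5000):
    """Update parameters until the mean loss is not greater than preset_loss.

    Assumed model: logistic regression. Assumed loss: binary cross-entropy.
    max_epochs is a safety cap that the source text does not mention.
    """
    weights = [0.0] * len(samples[0])
    bias = 0.0
    loss = float("inf")
    n = len(samples)
    eps = 1e-12  # avoids log(0)
    for _ in range(max_epochs):
        # Predict a default probability for every training sample.
        probs = [sigmoid(sum(w * x for w, x in zip(weights, s)) + bias) for s in samples]
        # Feed labels and predicted probabilities into the loss function.
        loss = -sum(
            y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
            for p, y in zip(probs, labels)
        ) / n
        # Loss not greater than the preset value: keep the current model.
        if loss <= preset_loss:
            break
        # Otherwise update the trainable parameters by gradient descent.
        errors = [p - y for p, y in zip(probs, labels)]
        for j in range(len(weights)):
            weights[j] -= lr * sum(e * s[j] for e, s in zip(errors, samples)) / n
        bias -= lr * sum(errors) / n
    return weights, bias, loss


# Toy data: one feature, label 1 means "default sample".
samples = [[0.0], [0.2], [0.8], [1.0]]
labels = [0, 0, 1, 1]
weights, bias, final_loss = train_first_model(samples, labels)
```

On this toy data the loop stops as soon as the mean loss drops to the preset value, which mirrors the stopping condition described above.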

In combination with the first possible implementation of the first aspect, an embodiment of the present application provides a third possible implementation of the first aspect, wherein the second default probability prediction model includes multiple decision trees, each decision tree including multiple classification nodes; and the second default probability prediction model is trained in the following manner:

for each initial decision tree in a second initial default probability prediction model, inputting each training sample in the training sample set into the initial decision tree to obtain a node parameter corresponding to each classification node in the initial decision tree; the node parameter being the ratio of the number of samples on the classification node whose sample type is default sample to the total number of training samples on the classification node;

calculating the impurity of the initial decision tree according to the node parameter corresponding to each classification node in the initial decision tree;

when the impurity is greater than a preset impurity, optimizing the structure of the initial decision tree; when the impurity is not greater than the preset impurity, determining the current initial decision tree as the trained decision tree;

when all initial decision trees in the second initial default probability prediction model have been trained, obtaining the second default probability prediction model.
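
As a rough illustration of the node parameter and the tree impurity described above, the sketch below computes the default ratio on each classification node and combines the nodes into one impurity value. The text here does not give the impurity formula, so the Gini impurity and the weighting of each node by its sample count are assumptions; the function names are hypothetical.

```python
def node_parameter(node_labels):
    """Ratio of default samples (label 1) to all training samples on a node."""
    return sum(node_labels) / len(node_labels)


def gini(p):
    """Gini impurity of a two-class node with default ratio p (assumed measure)."""
    return 1.0 - p ** 2 - (1.0 - p) ** 2


def tree_impurity(nodes):
    """Sample-weighted average of node impurities (assumed aggregation)."""
    total = sum(len(labels) for labels in nodes)
    return sum(len(labels) / total * gini(node_parameter(labels)) for labels in nodes)


# Two classification nodes: one mixed, one containing only non-default samples.
nodes = [[1, 1, 1, 0], [0, 0, 0, 0]]
impurity = tree_impurity(nodes)
needs_restructuring = impurity > 0.1  # 0.1 is a hypothetical preset impurity
```

A node parameter close to 0 or 1 gives an impurity close to 0, so a tree whose nodes separate default and non-default samples well passes the preset-impurity check.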

In combination with the first possible implementation of the first aspect, an embodiment of the present application provides a fourth possible implementation of the first aspect, wherein screening out from the training sample set, according to the default probability of each training sample, the training samples whose default probability is greater than the first threshold or less than the second threshold, so as to construct the first training sample set using the screened training samples and the second training sample set using the remaining training samples, includes:

determining, according to the first default probability of each training sample, the first threshold and the second threshold corresponding to the first default probability prediction model;

screening out from the training sample set the training samples whose first default probability is greater than that first threshold or less than that second threshold, so as to construct a fourth training sample set using the screened training samples and a fifth training sample set using the remaining training samples;

determining, according to the second default probability of each training sample, the first threshold and the second threshold corresponding to the second default probability prediction model;

screening out from the training sample set the training samples whose second default probability is greater than that first threshold or less than that second threshold, so as to construct a sixth training sample set using the screened training samples and a seventh training sample set using the remaining training samples; wherein the second training sample set includes the fifth training sample set and the seventh training sample set;

taking the intersection of the fourth training sample set and the sixth training sample set as the first training sample set.
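
The two-model screening above can be sketched with plain sets. This is only an illustration: the sample identifiers, the probability values and the threshold pairs are made up, and the way each model's thresholds are derived from its predicted probabilities is not shown because the text here does not specify it.

```python
def screen(probabilities, first_threshold, second_threshold):
    """Split sample ids into (screened, remaining) for one model."""
    screened = {
        sid for sid, p in probabilities.items()
        if p > first_threshold or p < second_threshold
    }
    return screened, set(probabilities) - screened


def dual_model_split(first_probs, second_probs, first_model_thresholds, second_model_thresholds):
    """Return (first training sample set, second training sample set) as id sets."""
    fourth, fifth = screen(first_probs, *first_model_thresholds)
    sixth, seventh = screen(second_probs, *second_model_thresholds)
    first_set = fourth & sixth    # both models are confident about the sample
    second_set = fifth | seventh  # at least one model is not confident
    return first_set, second_set


# Hypothetical predictions of the two models for four samples.
first_probs = {"a": 0.95, "b": 0.05, "c": 0.50, "d": 0.90}
second_probs = {"a": 0.90, "b": 0.10, "c": 0.92, "d": 0.55}
first_set, second_set = dual_model_split(first_probs, second_probs, (0.8, 0.2), (0.8, 0.2))
```

Because the first set is an intersection and the second set is the union of the two remainders, every training sample lands in exactly one of the two sets.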

In combination with the fourth possible implementation of the first aspect, an embodiment of the present application provides a fifth possible implementation of the first aspect, wherein performing sample balancing processing on the first training sample set, so as to expand the training samples in the first training sample set whose sample type is default sample, to obtain the third training sample set, includes:

for each training sample in the first training sample set whose sample type is default sample, calculating the Euclidean distance between the training sample and each other training sample; the other training samples being the training samples, other than the training sample itself, whose sample type is default sample;

according to the Euclidean distances between the training sample and the other training samples, selecting from the other training samples, in ascending order of Euclidean distance, the first preset number of other training samples as the similar training samples of the training sample;

selecting one target similar training sample from the similar training samples, and generating a synthetic training sample of the training sample by using the target similar training sample and the training sample;

constructing the third training sample set from the synthetic training samples corresponding to the training samples in the first training sample set whose sample type is default sample, together with each training sample in the first training sample set.
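
A minimal sketch of the expansion steps above follows. The text says only that the synthetic sample is generated from the training sample and the target similar training sample; the linear interpolation with a random factor used here is the standard SMOTE rule and is an assumption, as are the random choice of the target, the function name and the default `k`.

```python
import math
import random


def synthesize_default_samples(default_samples, k=2, seed=0):
    """Generate one synthetic sample per default sample (SMOTE-style sketch)."""
    rng = random.Random(seed)
    synthetic = []
    for i, sample in enumerate(default_samples):
        # Other default samples, excluding the sample itself.
        others = [s for j, s in enumerate(default_samples) if j != i]
        if not others:
            continue
        # The k nearest other default samples by Euclidean distance.
        similar = sorted(others, key=lambda s: math.dist(sample, s))[:k]
        # Pick one target similar sample (random choice is an assumption).
        target = rng.choice(similar)
        # Interpolate between the sample and the target (assumed SMOTE rule).
        gap = rng.random()
        synthetic.append([x + gap * (t - x) for x, t in zip(sample, target)])
    return synthetic


# Feature vectors of the default samples in the first training sample set.
default_samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
new_samples = synthesize_default_samples(default_samples)
```

Each synthetic sample lies on the line segment between a default sample and one of its nearest default neighbours, so it stays inside the region already occupied by default samples.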

In combination with any one of the first to fifth possible implementations of the first aspect, an embodiment of the present application provides a sixth possible implementation of the first aspect, wherein the method further includes:

training an initial default probability prediction model using the target training sample set, to obtain a trained target default probability prediction model.

In a second aspect, an embodiment of the present application further provides a sample equalization device, including:

a prediction module, configured to predict, for each training sample in a training sample set, the default probability of the training sample; wherein the training sample set includes a plurality of training samples; the sample types of the training samples include default samples and non-default samples; and the number of default samples is less than the number of non-default samples;

a screening module, configured to screen out from the training sample set, according to the default probability of each training sample, the training samples whose default probability is greater than a first threshold or less than a second threshold, so as to construct a first training sample set using the screened training samples and a second training sample set using the remaining training samples; the first threshold being greater than the second threshold;

a sample balancing module, configured to perform sample balancing processing on the first training sample set, so as to expand the training samples in the first training sample set whose sample type is default sample, to obtain a third training sample set; the third training sample set including each training sample in the first training sample set and the new training samples generated by the sample expansion;

a merging module, configured to merge the third training sample set and the second training sample set to obtain a target training sample set.

In combination with the second aspect, an embodiment of the present application provides a first possible implementation of the second aspect, wherein the prediction module, when predicting the default probability of each training sample in the training sample set, is specifically configured to:

In combination with the first possible implementation of the second aspect, an embodiment of the present application provides a second possible implementation of the second aspect, wherein the device further includes a first training module; the first training module is configured to train the first default probability prediction model in the following manner:

In combination with the first possible implementation of the second aspect, an embodiment of the present application provides a third possible implementation of the second aspect, wherein the second default probability prediction model includes multiple decision trees, each decision tree including multiple classification nodes; the device further includes a second training module; the second training module is configured to train the second default probability prediction model in the following manner:

In combination with the first possible implementation of the second aspect, an embodiment of the present application provides a fourth possible implementation of the second aspect, wherein the screening module, when screening out from the training sample set, according to the default probability of each training sample, the training samples whose default probability is greater than the first threshold or less than the second threshold, so as to construct the first training sample set using the screened training samples and the second training sample set using the remaining training samples, is specifically configured to:

In combination with the fourth possible implementation of the second aspect, an embodiment of the present application provides a fifth possible implementation of the second aspect, wherein the sample balancing module, when performing sample balancing processing on the first training sample set, so as to expand the training samples in the first training sample set whose sample type is default sample, to obtain the third training sample set, is specifically configured to:

In combination with any one of the first to fourth possible implementations of the second aspect, an embodiment of the present application provides a sixth possible implementation of the second aspect, wherein the device further includes:

a third training module, configured to train an initial default probability prediction model using the target training sample set, to obtain a trained target default probability prediction model.

In a third aspect, an embodiment of the present application further provides an electronic device, including a processor, a memory and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the electronic device is running, the processor and the memory communicate via the bus, and the machine-readable instructions, when executed by the processor, perform the steps in any possible implementation of the first aspect above.

In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps in any possible implementation of the first aspect above are performed.

In the sample balancing method, device, electronic device and computer-readable storage medium provided by the embodiments of the present application, the training samples whose default probability is greater than the first threshold or less than the second threshold are screened out from the training sample set according to the default probability of each training sample; the screened training samples are used to construct the first training sample set, and the remaining training samples are used to construct the second training sample set. Since the first threshold is greater than the second threshold, the first training sample set contains high-probability default training samples and low-probability default training samples. In this embodiment, performing sample balancing processing on the first training sample set, that is, on the high-probability and low-probability default training samples, makes the default status of the newly generated training samples clearer, that is, makes the newly generated training samples more accurate. Therefore, the sample balancing method of this embodiment helps make the newly generated training samples more accurate, which in turn helps reduce the misjudgment probability of the prediction model and improve its accuracy when those training samples are used to train the prediction model.

To make the above objects, features and advantages of the present application more apparent and easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.

Brief Description of the Drawings

To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the embodiments are briefly introduced below. It should be understood that the following drawings show only certain embodiments of the present application and therefore should not be regarded as limiting its scope. Those of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.

FIG. 1 shows a flow chart of a sample balancing method provided in an embodiment of the present application;

FIG. 2 shows a schematic diagram of some classification nodes in the initial decision tree provided in an embodiment of the present application;

FIG. 3 shows a schematic structural diagram of a sample equalization device provided in an embodiment of the present application;

FIG. 4 shows a schematic structural diagram of an electronic device provided in an embodiment of the present application.

Detailed Description

To make the purposes, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. The components of the embodiments generally described and shown in the drawings here can be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments provided in the drawings is not intended to limit the scope of the claimed application, but merely represents selected embodiments of the present application. All other embodiments obtained by those skilled in the art on the basis of the embodiments of the present application without creative effort fall within the scope of protection of the present application.

Performing sample balancing directly on the training samples in the original training sample set tends to produce expanded default samples that are not accurate enough. Based on this, the embodiments of the present application provide a sample balancing method, device, electronic device and computer-readable storage medium that improve the accuracy of the newly generated training samples, thereby reducing the misjudgment probability of the prediction model and improving its accuracy. These are described below through embodiments.

Embodiment 1:

To facilitate understanding of this embodiment, a sample balancing method disclosed in the embodiments of the present application is first introduced in detail. FIG. 1 shows a flow chart of a sample balancing method provided in an embodiment of the present application. As shown in FIG. 1, the method includes the following steps S101-S104:

S101: For each training sample in a training sample set, predict the default probability of the training sample; wherein the training sample set includes multiple training samples; the sample types of the training samples include default samples and non-default samples; and the number of default samples is less than the number of non-default samples.

In this embodiment, a training sample includes basic customer information, product information, customer behavior information, and historical default information. The basic customer information includes, for example, the customer's ID number, age and gender. The product may be a loan product taken out by the customer, such as a provident fund loan. The product information includes, for example, the balance of the loan product, the deposit base and the withdrawal amount. The customer behavior information includes, for example, the customer's cash withdrawal records, consumption records, transfer records and other loan-related operations.

Historical default information indicates whether the customer has ever had a default event in the past. For example, when a customer has taken out multiple loan products, or has taken out one loan product multiple times, then as long as the customer has had at least one default event (such as failing to repay on time), the customer is a default customer and the corresponding training sample is a default sample; if the customer has never had a default event, the customer is a non-default customer and the corresponding training sample is a non-default sample. Each customer corresponds to one training sample.
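
The labeling rule above (one training sample per customer, marked as a default sample if any loan ever had a default event) can be written as a short sketch. The record layout, including the `defaulted` field and the product names, is hypothetical and chosen only for illustration.

```python
def sample_type(loan_history):
    """Return the sample type for one customer.

    loan_history: list of dicts, one per loan, each with a boolean
    'defaulted' field (hypothetical field name).
    """
    return "default" if any(loan["defaulted"] for loan in loan_history) else "non-default"


# A customer with two loans, one of which was not repaid on time.
customer_a = [
    {"product": "provident fund loan", "defaulted": False},
    {"product": "provident fund loan", "defaulted": True},
]
# A customer who has never had a default event.
customer_b = [{"product": "provident fund loan", "defaulted": False}]

types = [sample_type(customer_a), sample_type(customer_b)]
```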

The default probability is the probability that the customer corresponding to the training sample has had a default event. The greater the default probability, the more likely it is that the customer in the training sample has had a default event; the smaller the default probability, the less likely it is.

The training sample set includes multiple default samples (training samples whose sample type is default sample) and multiple non-default samples (training samples whose sample type is non-default sample). Because the number of default samples is less than the number of non-default samples, the training samples in the training sample set are unbalanced.

S102: According to the default probability of each training sample, screen out from the training sample set the training samples whose default probability is greater than a first threshold or less than a second threshold, so as to construct a first training sample set using the screened training samples and a second training sample set using the remaining training samples; the first threshold is greater than the second threshold.

In this embodiment, each training sample corresponds to one default probability. According to these default probabilities, the training samples whose default probability is greater than the first threshold and the training samples whose default probability is less than the second threshold are screened out from the training sample set. Since the first threshold is greater than the second threshold, the screened training samples are high-probability default training samples and low-probability default training samples. The first training sample set is constructed from the screened training samples, so it contains only high-probability default training samples and low-probability default training samples. The second training sample set is constructed from the remaining training samples and contains the medium-probability default training samples, for example a training sample with a default probability of 0.5.
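
Step S102 amounts to a single pass over the training sample set, as in the sketch below. The threshold values 0.8 and 0.2, the dictionary layout and the function name are assumptions for illustration only; the text requires only that the first threshold be greater than the second.

```python
def split_by_default_probability(samples, first_threshold=0.8, second_threshold=0.2):
    """Split samples into (first set, second set) by predicted default probability."""
    if first_threshold <= second_threshold:
        raise ValueError("the first threshold must be greater than the second threshold")
    first_set, second_set = [], []
    for sample in samples:
        p = sample["default_probability"]
        if p > first_threshold or p < second_threshold:
            first_set.append(sample)   # high- or low-probability default sample
        else:
            second_set.append(sample)  # medium-probability default sample
    return first_set, second_set


samples = [
    {"id": 1, "default_probability": 0.93},
    {"id": 2, "default_probability": 0.50},
    {"id": 3, "default_probability": 0.07},
    {"id": 4, "default_probability": 0.35},
]
first_set, second_set = split_by_default_probability(samples)
```

Samples whose probability equals a threshold exactly fall into the second set here, because the comparisons in the text are strict.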

S103: Perform sample equalization processing on the first training sample set so as to expand the training samples in the first training sample set whose sample type is default sample, obtaining a third training sample set; the third training sample set contains every training sample in the first training sample set together with the new training samples generated by the sample expansion.

In this embodiment, sample equalization is performed only on the first training sample set. Specifically, the training samples in the first training sample set whose sample type is default sample are expanded to obtain new training samples whose sample type is default sample. The third training sample set is constructed from the newly generated training samples and every training sample in the first training sample set; it is the first training sample set after sample equalization.

S104: Merge the third training sample set and the second training sample set to obtain a target training sample set.

The training samples in the third training sample set are merged with the training samples in the second training sample set to obtain the target training sample set. Because the third training sample set contains newly generated training samples whose sample type is default sample, the target training sample set contains more default samples than the training sample set. In other words, the difference between the number of default samples and the number of non-default samples is smaller in the target training sample set than in the training sample set.

In a possible implementation, the following steps may also be performed before step S101:

S1001: Obtain an original data set from a MySQL database; the original data set contains multiple original training samples whose sample type is default sample and multiple original training samples whose sample type is non-default sample; the number of default samples is smaller than the number of non-default samples; each original training sample contains basic customer information, product information, customer behavior information, and historical default information.

S1002: For each original training sample in the original data set, preprocess the sample to convert its text fields into a numeric representation that machine learning can process, obtaining a preprocessed original training sample.

S1003: Perform feature engineering on the preprocessed original training samples to obtain feature-engineered training samples; the feature engineering includes any one or more of the following: missing value filling, outlier processing, indicator derivation, variable screening, and variable correlation analysis.

For missing value filling, when a training sample has a missing value, the variable type of the missing value is determined first, and the missing value is then filled with the median or mode of that variable's values in the other training samples. For example, when age is missing in training sample A, the variable type of the missing value is determined to be age; the median or mode of the ages of the other training samples is then calculated and used to fill the missing value in training sample A.
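The median or mode filling described above can be sketched with Python's standard `statistics` module. The dictionary layout of a sample, the `fill_missing` helper, and the ages are hypothetical choices made only for illustration.

```python
from statistics import median, mode


def fill_missing(samples, variable, strategy="median"):
    """Return copies of the samples with missing values of `variable` filled.

    The fill value is the median (or mode) of the values observed in the
    samples where `variable` is present.
    """
    observed = [s[variable] for s in samples if s.get(variable) is not None]
    fill_value = median(observed) if strategy == "median" else mode(observed)
    return [
        {**s, variable: fill_value} if s.get(variable) is None else dict(s)
        for s in samples
    ]


# Hypothetical samples; training sample A has no age.
samples = [
    {"id": "A", "age": None},
    {"id": "B", "age": 30},
    {"id": "C", "age": 42},
    {"id": "D", "age": 30},
    {"id": "E", "age": 50},
]
filled_median = fill_missing(samples, "age", "median")
filled_mode = fill_missing(samples, "age", "mode")
print(filled_median[0]["age"])  # 36.0
print(filled_mode[0]["age"])    # 30
```

The median suits continuous variables such as age, while the mode is the natural choice for categorical variables; the text leaves the choice open.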

For outlier processing, for each variable type in a training sample, it is determined whether the corresponding variable value is greater than a first variable threshold or less than a second variable threshold, where the first variable threshold is greater than the second variable threshold. If the value of at least one variable type in the training sample is greater than the first variable threshold or less than the second variable threshold, that value is regarded as an outlier of the training sample; a training sample that contains an outlier is deleted from the original data set.
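A minimal sketch of this outlier rule follows. It assumes that each variable type has its own pair of thresholds, which the text does not state explicitly; the helper name, variable names, and threshold values are hypothetical.

```python
def remove_outlier_samples(samples, variable_thresholds):
    """Drop every sample that has at least one out-of-range variable value.

    variable_thresholds maps a variable name to the pair
    (first_variable_threshold, second_variable_threshold), i.e. (upper, lower).
    """
    kept = []
    for sample in samples:
        has_outlier = any(
            sample[name] > upper or sample[name] < lower
            for name, (upper, lower) in variable_thresholds.items()
        )
        if not has_outlier:
            kept.append(sample)
    return kept


# Hypothetical variables and thresholds.
thresholds = {"age": (100, 18), "monthly_income": (1_000_000, 0)}
samples = [
    {"id": "A", "age": 35, "monthly_income": 8_000},
    {"id": "B", "age": 230, "monthly_income": 6_500},  # age above the upper threshold
    {"id": "C", "age": 41, "monthly_income": -50},     # income below the lower threshold
]
cleaned = remove_outlier_samples(samples, thresholds)
print([s["id"] for s in cleaned])  # ['A']
```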

For indicator derivation, at least one new variable type is derived from at least two existing variable types, so that every training sample in the original data set gains the new variable type. For example, a new variable type "phone bill difference" can be derived from the variable type "this month's phone bill" and the variable type "last month's phone bill".
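The phone bill example can be written as a one-line derivation. The field names below are hypothetical, and the direction of the subtraction (this month minus last month) is an assumption, since the text does not specify it.

```python
def add_bill_difference(sample):
    """Return a copy of the sample with a derived 'bill_difference' variable."""
    derived = dict(sample)
    # Assumed direction: this month's bill minus last month's bill.
    derived["bill_difference"] = sample["this_month_bill"] - sample["last_month_bill"]
    return derived


sample = {"this_month_bill": 120, "last_month_bill": 95}
derived_sample = add_bill_difference(sample)
print(derived_sample["bill_difference"])  # 25
```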

For variable screening, every training sample in the original data set contains multiple variable types. According to the importance of each variable type, the variable types whose importance is greater than a preset importance threshold are retained, and the variable types whose importance is not greater than the preset importance threshold are deleted from every training sample in the original data set.

For variable correlation analysis, for any two variable types, a correlation value between them is calculated from the values of those two variable types across the training samples. When the correlation value is greater than a preset correlation threshold, either one of the two variable types is deleted. Removing one of two highly correlated variable types helps keep the correlation from degrading the accuracy of the model during training.
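The correlation-based pruning can be sketched as follows. The text does not name a correlation measure; this sketch assumes the Pearson coefficient and compares its absolute value with the threshold, and it keeps the variable that appears first. The helper names, column names, and data are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length value lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def drop_correlated(columns, threshold):
    """Keep a variable only if it is not highly correlated with one already kept."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical columns: the two bill variables move almost together.
columns = {
    "this_month_bill": [10, 20, 30, 40],
    "last_month_bill": [11, 19, 31, 41],
    "age": [50, 23, 41, 30],
}
kept_variables = drop_correlated(columns, threshold=0.9)
print(kept_variables)  # ['this_month_bill', 'age']
```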

S1004: After every original training sample in the original data set has been preprocessed and feature-engineered, an original sample set containing the preprocessed and feature-engineered training samples is obtained.

S1005: Randomly split the training samples in the original sample set according to a preset ratio to obtain a training sample set and a test sample set. For example, the preset ratio is 7:3.
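A random 7:3 split as in step S1005 might look like the sketch below. The fixed seed and the rounding of the cut point are assumptions added for reproducibility; the helper name is hypothetical.

```python
import random


def split_train_test(samples, train_ratio=0.7, seed=0):
    """Shuffle a copy of the samples and cut it at train_ratio."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # local generator, global state untouched
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


original_sample_set = list(range(10))  # stand-ins for ten processed samples
training_set, test_set = split_train_test(original_sample_set)
print(len(training_set), len(test_set))  # 7 3
```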

In a possible implementation, when step S101 is performed to predict the default probability of each training sample in the training sample set, the following step may specifically be performed:

S1011: For each training sample in the training sample set, predict a first default probability of the training sample using a trained first default probability prediction model, and predict a second default probability of the training sample using a trained second default probability prediction model; the default probability of the training sample includes its first default probability and its second default probability.

In this embodiment, the first default probability prediction model may specifically be an XGBoost model trained on the training samples in the training sample set; the second default probability prediction model may specifically be an RF (random forest) model trained on the training samples in the training sample set.

In a possible implementation, the first default probability prediction model is trained through the following steps S1012-S1014:

S1012: For each training sample in the training sample set, input the training sample into a first initial default probability prediction model, predict the default situation of the training sample with the first initial default probability prediction model, and output the first default probability of the training sample.

In this embodiment, steps S1012-S1014 are performed after step S1005 and before step S1011. That is, once the training sample set has been split off, its training samples are used to train the first initial default probability prediction model, yielding the trained first default probability prediction model, which is then used to predict the first default probabilities.

The first initial default probability prediction model may be an initial XGBoost model. In this embodiment, the initial XGBoost model may be characterized by the following function:

$$Obj^{(t)} \simeq \sum_{i=1}^{n}\left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i)\right] + \Omega(f_t)$$

Here, Obj^(t) represents the predicted first default probability; n is the number of training samples; x_i is the i-th training sample input into the first initial default probability prediction model; g_i and h_i are trainable parameters; the XGBoost model, and hence the first initial default probability prediction model, is composed of multiple decision trees, and f_t denotes the t-th decision tree; Ω is a regularization term that suppresses model complexity.

The first default probability is the probability that the training sample is a default sample.

S1013: Input the default label and the first default probability of the training sample into a loss function to calculate the loss value of the first initial default probability prediction model.

In this embodiment, the loss function is as follows:

$$L = \sum_{i=1}^{n} l(y_i, \hat{y}_i)$$

Here, L is the loss value; y_i is the default label of training sample i, which takes the value 0 (no default) or 1 (default); ŷ_i is the first default probability of training sample i.

In this embodiment, g_i is the first-order derivative of l, and h_i is the second-order derivative of l.
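To make the roles of g_i and h_i concrete, the sketch below assumes that the per-sample loss l is the binary log loss, a common choice for 0/1 labels that the text does not confirm. Under that assumption, and taking the derivatives with respect to the raw pre-sigmoid score as is usual in gradient boosting, the first derivative is p - y and the second is p(1 - p). The function names are hypothetical.

```python
from math import log


def log_loss(label, probability, eps=1e-15):
    """Binary log loss of one sample (assumed form of the per-sample loss l)."""
    probability = min(max(probability, eps), 1 - eps)  # avoid log(0)
    return -(label * log(probability) + (1 - label) * log(1 - probability))


def gradient_and_hessian(label, probability):
    """First and second derivatives of the log loss w.r.t. the raw score."""
    g = probability - label
    h = probability * (1 - probability)
    return g, h


# A default sample (label 1) predicted with probability 0.8.
loss = log_loss(1, 0.8)
g, h = gradient_and_hessian(1, 0.8)
print(round(loss, 4), round(g, 2), round(h, 2))  # 0.2231 -0.2 0.16
```

A negative g for a default sample indicates that raising the raw score would reduce the loss.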

S1014: When the loss value is greater than a preset loss value, update the trainable parameters of the first initial default probability prediction model according to the loss value; when the loss value is not greater than the preset loss value, take the current first initial default probability prediction model as the first default probability prediction model.

In this embodiment, it is determined whether the loss value is greater than the preset loss value. When it is, the trainable parameters g_i and h_i of the first initial default probability prediction model are updated according to the loss value to obtain a new first initial default probability prediction model, and steps S1012-S1014 are executed again. When the loss value is no longer greater than the preset loss value, training stops and the first initial default probability prediction model at that moment is taken as the first default probability prediction model.

In a possible implementation, the second default probability prediction model includes multiple decision trees, and each decision tree contains multiple classification nodes; the second default probability prediction model is trained in the following manner:

S1015: For each initial decision tree in a second initial default probability prediction model, input the training samples in the training sample set into the initial decision tree to obtain the node parameter of each classification node in the initial decision tree; the node parameter is the ratio of the number of samples on the classification node whose sample type is default sample to the total number of training samples on the classification node.

In this embodiment, steps S1015-S1018 are performed after step S1005 and before step S1011. That is, once the training sample set has been split off, its training samples are used to train the second initial default probability prediction model, yielding the trained second default probability prediction model, which is then used to predict the second default probabilities.

The second initial default probability prediction model may be an initial RF (random forest) model. It contains multiple initial decision trees to be trained, and each initial decision tree contains multiple classification nodes. The classification components on the classification nodes are determined by the variable types in the training samples. For example, when the variable type is gender, the classification components are male and female, so the corresponding classification nodes are the node "male" and the node "female". Figure 2 shows a schematic diagram of some of the classification nodes in an initial decision tree provided in an embodiment of the present application.

As shown in Figure 2, taking the classification node "male" as an example, after the training samples in the training sample set are input into the initial decision tree, the tree classifies them according to its classification nodes: male training samples are assigned to the node "male" and female training samples to the node "female". Suppose the node "male" holds 20 training samples. If the child nodes of "male" are the node "age range 20-40" and the node "age range 40-60", the 20 training samples on "male" are further classified by age, for example 15 training samples on the node "age range 20-40" and 5 training samples on the node "age range 40-60".

For the classification node "male", the total number of training samples on the node is 20. If 5 of them are default samples, the node parameter of the node "male" is 5÷20 = 0.25.

S1016: Calculate the impurity of the initial decision tree according to the node parameter of each classification node in the initial decision tree.

In this embodiment, the RF model consists of multiple decision trees and performs classification (or regression) tasks by voting or averaging. A random forest model does not directly use a single objective function or loss function; instead, the task is optimized through the construction and ensembling of decision trees. During training, each decision tree reduces impurity as far as possible to achieve a better classification (or regression) result. Impurity is usually measured with indicators such as Gini impurity or entropy.

In this embodiment, when Gini impurity is used as the impurity indicator, the impurity of the initial decision tree can be calculated by the following formula:

$$Gini(P) = 1 - \sum_{j=1}^{k} P_j^{2}$$

Here, Gini(P) is the Gini impurity of the initial decision tree, j indexes the j-th classification node in the initial decision tree, k is the number of classification nodes in the initial decision tree, and P_j is the node parameter of the j-th classification node.

When entropy is used as the impurity indicator, the impurity of the initial decision tree can be calculated by the following formula:

$$Entropy(P) = -\sum_{j=1}^{k} P_j \log_2 P_j$$

Here, Entropy(P) is the entropy of the initial decision tree, j indexes the j-th classification node in the initial decision tree, k is the number of classification nodes in the initial decision tree, and P_j is the node parameter of the j-th classification node.
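The two impurity indicators can be evaluated with a short sketch. It assumes the common textbook forms Gini(P) = 1 - Σ P_j² and Entropy(P) = -Σ P_j log2 P_j, applied to a list of values P_1..P_k; the base of the logarithm and the example values are assumptions, and the function names are hypothetical.

```python
from math import log2


def gini_impurity(node_parameters):
    """Gini impurity: 1 minus the sum of squared P_j."""
    return 1 - sum(p * p for p in node_parameters)


def entropy(node_parameters):
    """Entropy in bits: minus the sum of P_j * log2(P_j), skipping P_j = 0."""
    return -sum(p * log2(p) for p in node_parameters if p > 0)


# Illustrative values only.
even = [0.5, 0.5]
skewed = [0.25, 0.75]
print(gini_impurity(even), entropy(even))  # 0.5 1.0
print(gini_impurity(skewed))               # 0.375
```

Both indicators are largest for the evenly mixed case and shrink as the values become more one-sided, which is why a lower impurity is treated as the training goal.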

S1017: When the impurity is greater than a preset impurity, optimize the structure of the initial decision tree; when the impurity is not greater than the preset impurity, determine the current initial decision tree to be a trained decision tree.

In this embodiment, it is determined whether the impurity of the initial decision tree is greater than the preset impurity. If it is, the initial decision tree needs further optimization: its structure is optimized, the optimized tree is taken as the new initial decision tree, and steps S1015-S1017 are executed again. When the impurity of the initial decision tree is no longer greater than the preset impurity, the current initial decision tree is determined to be a trained decision tree.

S1018: When every initial decision tree in the second initial default probability prediction model has been trained, the second default probability prediction model is obtained.

In this embodiment, the second default probability prediction model contains multiple trained decision trees.

In a possible implementation, when step S102 is performed (screening out of the training sample set, according to the default probability of each training sample, the training samples whose default probability is greater than the first threshold or less than the second threshold, constructing the first training sample set from the screened-out samples, and constructing the second training sample set from the remaining samples), the following steps S1021-S1025 may specifically be performed:

S1021: Determine the first threshold and the second threshold corresponding to the first default probability prediction model according to the first default probability of each training sample.

In this embodiment, the second threshold corresponding to the first default probability prediction model is calculated as follows: the first default probabilities of the training samples in the training sample set are sorted in ascending order, and the first-quartile position Q1 is calculated:

Q1 = (m+1)/4

where m is the number of first default probabilities.

When Q1 is an integer, the Q1-th first default probability is taken as the second threshold corresponding to the first default probability prediction model. For example, when m is 11, Q1 is 3, so the third first default probability is taken as the second threshold.

When Q1 is not an integer, the second threshold corresponding to the first default probability prediction model can be calculated by interpolation. Specifically, with z denoting the integer part of Q1, Q1 lies between z and z+1, so the second threshold lies between the first default probability at position z and the first default probability at position z+1. The interpolation formula is as follows:

C_Q1 = (1-frac(Q1))×data(z) + frac(Q1)×data(z+1)

where C_Q1 is the second threshold corresponding to the first default probability prediction model, frac(Q1) is the fractional part of Q1, data(z) is the first default probability at position z, and data(z+1) is the first default probability at position z+1.

For example, when m = 10, Q1 = 2.75, so z = 2, z+1 = 3, and frac(Q1) = 0.75. Assuming data(z) = 0.2 and data(z+1) = 0.3, the formula gives C_Q1 = 0.25×0.2 + 0.75×0.3 = 0.275, so the second threshold is 0.275.

In this embodiment, the first threshold corresponding to the first default probability prediction model is calculated as follows: the first default probabilities of the training samples in the training sample set are sorted in ascending order, and the third-quartile position Q2 is calculated:

Q2 = 3(m+1)/4

When Q2 is an integer, the Q2-th first default probability is taken as the first threshold corresponding to the first default probability prediction model. For example, when m is 11, Q2 is 9, so the ninth first default probability is taken as the first threshold.

When Q2 is not an integer, the first threshold corresponding to the first default probability prediction model can be calculated by interpolation. Specifically, with v denoting the integer part of Q2, Q2 lies between v and v+1, so the first threshold lies between the first default probability at position v and the first default probability at position v+1. The interpolation formula is as follows:

C_Q2 = (1-frac(Q2))×data(v) + frac(Q2)×data(v+1)

where C_Q2 is the first threshold corresponding to the first default probability prediction model, frac(Q2) is the fractional part of Q2, data(v) is the first default probability at position v, and data(v+1) is the first default probability at position v+1.

For example, when m = 10, Q2 = 8.25, so v = 8, v+1 = 9, and frac(Q2) = 0.25. Assuming data(v) = 0.7 and data(v+1) = 0.8, the formula gives C_Q2 = 0.75×0.7 + 0.25×0.8 = 0.725, so the first threshold is 0.725.
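The position-and-interpolation rule for both thresholds can be sketched as a single function, taking the lower position as (m+1)/4 and the upper position as 3(m+1)/4, which is what the worked values m = 11 and m = 10 in the text imply. The function name, the clamping at the ends of the list, and the probability values are assumptions for illustration.

```python
def quartile_threshold(probabilities, quarter):
    """Threshold at 1-based position quarter*(m+1)/4 of the sorted list.

    quarter=1 yields the second (lower) threshold and quarter=3 the first
    (upper) threshold; non-integer positions are linearly interpolated.
    """
    data = sorted(probabilities)
    m = len(data)
    position = quarter * (m + 1) / 4
    z = int(position)      # integer part of the position
    frac = position - z    # fractional part of the position
    # Assumption: clamp to the ends when the position falls outside 1..m.
    if z < 1:
        return data[0]
    if z >= m:
        return data[-1]
    if frac == 0:
        return data[z - 1]
    return (1 - frac) * data[z - 1] + frac * data[z]


# Ten hypothetical probabilities with 0.2, 0.3 at positions 2, 3
# and 0.7, 0.8 at positions 8, 9, as in the worked examples.
ten = [0.05, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.8, 0.95]
second_threshold = quartile_threshold(ten, 1)
first_threshold = quartile_threshold(ten, 3)
print(round(second_threshold, 3), round(first_threshold, 3))  # 0.275 0.725

# Eleven values: positions 3 and 9 are integers, so no interpolation is needed.
eleven = [i / 10 for i in range(11)]
print(quartile_threshold(eleven, 1), quartile_threshold(eleven, 3))  # 0.2 0.8
```

The same function applies unchanged to the second default probabilities, since the thresholds of the second model are defined by the same positions.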

S1022: Screen out of the training sample set the training samples whose first default probability is greater than this first threshold or less than this second threshold, construct a fourth training sample set from the screened-out training samples, and construct a fifth training sample set from the remaining training samples.

In this embodiment, according to the first default probability of each training sample in the training sample set, the training samples whose first default probability is greater than the first threshold corresponding to the first default probability prediction model, and those whose first default probability is less than the second threshold corresponding to the first default probability prediction model, are screened out of the training sample set. The samples screened out in this pass are used to construct the fourth training sample set, and the samples remaining in this pass are used to construct the fifth training sample set.

S1023: Determine the first threshold and the second threshold corresponding to the second default probability prediction model according to the second default probability of each training sample.

In this embodiment, the second threshold corresponding to the second default probability prediction model is calculated as follows: the second default probabilities of the training samples in the training sample set are sorted in ascending order, and the first-quartile position Q3 is calculated:

Q3 = (r+1)/4

where r is the number of second default probabilities. In this embodiment, r = m.

When Q3 is an integer, the Q3-th second default probability is taken as the second threshold corresponding to the second default probability prediction model.

When Q3 is not an integer, the second threshold corresponding to the second default probability prediction model can be calculated by interpolation. Specifically, with u denoting the integer part of Q3, Q3 lies between u and u+1, so the second threshold lies between the second default probability at position u and the second default probability at position u+1. The interpolation formula is as follows:

C_Q3 = (1-frac(Q3))×data(u) + frac(Q3)×data(u+1)

where C_Q3 is the second threshold corresponding to the second default probability prediction model, frac(Q3) is the fractional part of Q3, data(u) is the second default probability at position u, and data(u+1) is the second default probability at position u+1.

In this embodiment, the first threshold corresponding to the second default probability prediction model is calculated as follows: the second default probabilities of the training samples in the training sample set are sorted in ascending order, and the third-quartile position Q4 is calculated:

Q4 = 3(r+1)/4

where r is the number of second default probabilities.

When Q4 is an integer, the Q4-th second default probability is taken as the first threshold corresponding to the second default probability prediction model.

When Q4 is not an integer, the first threshold corresponding to the second default probability prediction model can be calculated by interpolation. Specifically, with d denoting the integer part of Q4, Q4 lies between d and d+1, so the first threshold lies between the second default probability at position d and the second default probability at position d+1. The interpolation formula is as follows:

C_Q4 = (1-frac(Q4))×data(d) + frac(Q4)×data(d+1)

where C_Q4 is the first threshold corresponding to the second default probability prediction model, frac(Q4) is the fractional part of Q4, data(d) is the second default probability at position d, and data(d+1) is the second default probability at position d+1.

S1024: Screen out from the training sample set the training samples whose second default probability is greater than the first threshold or less than the second threshold, construct a sixth training sample set from the screened-out training samples, and construct a seventh training sample set from the remaining training samples; the second training sample set includes the fifth training sample set and the seventh training sample set.

In this embodiment, according to the second default probability of each training sample in the training sample set, the training samples whose second default probability is greater than the first threshold corresponding to the second default probability prediction model, and the training samples whose second default probability is less than the second threshold corresponding to the second default probability prediction model, are screened out from the training sample set. The sixth training sample set is constructed from the training samples screened out in this step, and the seventh training sample set is constructed from the training samples remaining after this step.

The second training sample set is constructed from the fifth training sample set and the seventh training sample set.

S1025: Take the intersection of the fourth training sample set and the sixth training sample set as the first training sample set.
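The two screening passes and the set operations of steps S1024 and S1025 can be sketched as follows. The sketch assumes that each training sample is identified by a hashable ID and that the second training sample set is the union of the fifth and seventh sets; the function names and the example thresholds are hypothetical.

```python
def split_by_thresholds(probabilities, first_threshold, second_threshold):
    """Split sample IDs into (screened_out, remaining).

    probabilities maps a sample ID to the default probability predicted by
    one model; a sample is screened out when its probability is above the
    first threshold or below the second threshold.
    """
    screened = {
        sample_id
        for sample_id, p in probabilities.items()
        if p > first_threshold or p < second_threshold
    }
    remaining = set(probabilities) - screened
    return screened, remaining


def build_first_and_second_sets(first_probs, second_probs,
                                first_model_thresholds, second_model_thresholds):
    """Return (first_set, second_set) of sample IDs."""
    fourth, fifth = split_by_thresholds(first_probs, *first_model_thresholds)
    sixth, seventh = split_by_thresholds(second_probs, *second_model_thresholds)
    first_set = fourth & sixth    # screened out by both models (S1025)
    second_set = fifth | seventh  # ASSUMPTION: "includes" is read as a union
    return first_set, second_set


first_probs = {"a": 0.90, "b": 0.50, "c": 0.05, "d": 0.95}
second_probs = {"a": 0.85, "b": 0.10, "c": 0.50, "d": 0.02}
first_set, second_set = build_first_and_second_sets(
    first_probs, second_probs, (0.8, 0.2), (0.8, 0.2)
)
```

Under the union reading, the first and second training sample sets are disjoint and together cover the whole training sample set, because the union of the two "remaining" sets is the complement of the intersection of the two "screened-out" sets.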

In a possible implementation, step S103, in which sample equalization processing is performed on the first training sample set so as to expand the training samples whose sample type is default sample and obtain the third training sample set, may be carried out as follows:

S1031: For each training sample in the first training sample set whose sample type is default sample, calculate the Euclidean distance between that training sample and each of the other training samples; the other training samples are the training samples, other than that training sample, whose sample type is default sample.

In this embodiment, the training samples in the first training sample set are divided by sample type: those whose sample type is default sample are assigned to a positive sample set, and those whose sample type is non-default sample are assigned to a negative sample set.

For each training sample in the positive sample set, the Euclidean distance between that training sample and each of the other training samples in the positive sample set is calculated.

In this embodiment, the Euclidean distance between training sample B(b1, b2, b3, …, bw) and training sample E(e1, e2, e3, …, ew) can be calculated by the following formula:

distance(B,E) = sqrt((b1 - e1)^2 + (b2 - e2)^2 + … + (bw - ew)^2)

where b1, b2, b3, …, bw are the values of the variables in training sample B, e1, e2, e3, …, ew are the values of the variables in training sample E, and distance(B,E) denotes the Euclidean distance between training sample B and training sample E.

S1032: According to the Euclidean distances between the training sample and the other training samples, sort the other training samples by Euclidean distance in ascending order and select the first preset number of them as the similar training samples of the training sample.

In this embodiment, after the Euclidean distances between training sample B and each of the other training samples in the positive sample set have been calculated, the other training samples are sorted by Euclidean distance in ascending order and the first preset number of them are selected as the similar training samples of training sample B. The preset number is, for example, 3 or 5.
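Steps S1031 and S1032 amount to a nearest-neighbour search within the positive sample set. A minimal sketch is given below, assuming that each training sample is a list of numeric variable values of equal length; `math.dist` (Python 3.8 or later) computes the Euclidean distance, and the function name `similar_samples` is hypothetical.

```python
import math


def similar_samples(sample, other_samples, preset_number=3):
    """Return the preset_number samples closest to `sample`.

    other_samples must not contain `sample` itself; distances are Euclidean
    and ties keep their original order because sorted() is stable.
    """
    ranked = sorted(other_samples, key=lambda other: math.dist(sample, other))
    return ranked[:preset_number]


positive_set = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
b = positive_set[0]
others = positive_set[1:]
neighbors = similar_samples(b, others, preset_number=2)
```

If fewer than `preset_number` other default samples exist, the slice simply returns all of them. In practice the variables would usually be scaled to comparable ranges first so that no single variable dominates the distance; that preprocessing is outside this sketch.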

S1033: Select one target similar training sample from the similar training samples, and use the target similar training sample and the training sample to generate a synthetic training sample for the training sample.

In this embodiment, one similar training sample is selected at random from the similar training samples of training sample B as the target similar training sample of training sample B, denoted B_cc.

In this embodiment, the synthetic training sample of training sample B can be generated by the following formula:

B_newsample = B + random(0,1) × (B_cc - B)

where B_newsample is the synthetic training sample of training sample B, B_cc is the target similar training sample of training sample B, and random(0,1) denotes a random value between 0 and 1.

In this embodiment, one training sample may generate one synthetic training sample or multiple synthetic training samples.
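The synthesis formula can be sketched as a linear interpolation between B and a randomly chosen neighbour. The sketch assumes that random(0,1) is a single scalar applied to every variable, so that the synthetic sample lies on the line segment between B and B_cc; the function name and the `rng` parameter are hypothetical.

```python
import random


def synthesize(sample, similar, rng=None):
    """Generate one synthetic sample: B + random(0, 1) * (B_cc - B)."""
    rng = rng or random.Random()
    target = rng.choice(similar)  # target similar training sample B_cc
    gap = rng.random()            # one scalar in [0, 1) for all variables
    return [b + gap * (c - b) for b, c in zip(sample, target)]


rng = random.Random(42)  # seeded only to make the example repeatable
b = [1.0, 2.0]
similar = [[3.0, 6.0]]
new_sample = synthesize(b, similar, rng)
```

Calling `synthesize` several times for the same training sample yields several synthetic samples, which corresponds to the case in which one training sample generates multiple synthetic training samples.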

S1034: Construct the third training sample set from the synthetic training samples corresponding to each training sample in the first training sample set whose sample type is default sample, together with all the training samples in the first training sample set.

In this embodiment, the third training sample set is constructed from the synthetic training samples corresponding to each training sample in the positive sample set, together with all the training samples in the first training sample set.
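Steps S1031 to S1034 can be combined into one routine that returns the third training sample set. This is a sketch under stated assumptions: samples are lists of numeric values, the label 1 marks a default sample and 0 a non-default sample, and the parameters `per_sample` and `seed` are hypothetical additions that control how many synthetic samples each default sample produces and make the run repeatable.

```python
import math
import random


def build_third_set(samples, labels, preset_number=3, per_sample=1, seed=0):
    """Return (third_samples, third_labels) for the first training sample set.

    The result keeps every original sample and appends the synthetic default
    samples generated from the positive sample set.
    """
    rng = random.Random(seed)
    positives = [s for s, y in zip(samples, labels) if y == 1]
    synthetic = []
    for i, sample in enumerate(positives):
        others = positives[:i] + positives[i + 1:]
        if not others:
            continue  # a lone default sample has no neighbour to interpolate with
        similar = sorted(others, key=lambda o: math.dist(sample, o))[:preset_number]
        for _ in range(per_sample):
            target = rng.choice(similar)
            gap = rng.random()
            synthetic.append([b + gap * (c - b) for b, c in zip(sample, target)])
    return samples + synthetic, labels + [1] * len(synthetic)


samples = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0], [7.0, 7.0]]
labels = [1, 1, 0, 0, 0]
third_samples, third_labels = build_third_set(samples, labels)
```

In this toy input two default samples face three non-default samples; after expansion the third set holds four default samples and three non-default samples. How close to parity the classes should be brought is a design choice that the description leaves open.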

In a possible implementation, after step S104 has been performed to obtain the target training sample set, the following step may also be performed:

The target training sample set is used to train an initial default probability prediction model, yielding a trained target default probability prediction model.

In this embodiment, the initial default probability prediction model may specifically be an initial XGBoost model. Training this initial model on the target training sample set obtained after sample equalization processing yields the trained target default probability prediction model, that is, an XGBoost model trained on the target training sample set.

After the target default probability prediction model has been trained, the test sample set obtained in step S1005 can be used to verify the accuracy of its default probability predictions. When the accuracy is greater than a preset accuracy, training of the target default probability prediction model is considered complete. From then on, when the default probability of a target customer needs to be predicted, the basic customer information, product information, customer behavior information, and historical default information of that target customer can be obtained and input into the target default probability prediction model, which predicts the default probability of the target customer.

When the accuracy is not greater than the preset accuracy, the target default probability prediction model needs further training; training stops once the accuracy of its default probability predictions exceeds the preset accuracy.
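The train-and-verify loop described above can be sketched independently of any particular model library. In the sketch below, `train_once` and `predict` are hypothetical callables standing in for one round of model training and for default probability prediction; the 0.5 cutoff used to turn a probability into a class and the `max_rounds` safeguard are assumptions added for illustration and do not come from the application.

```python
def accuracy(predict, test_samples, test_labels, cutoff=0.5):
    """Share of test samples whose predicted class matches the label."""
    hits = sum(
        (predict(x) >= cutoff) == bool(y)
        for x, y in zip(test_samples, test_labels)
    )
    return hits / len(test_labels)


def train_until_accurate(train_once, predict, test_samples, test_labels,
                         preset_accuracy, max_rounds=100):
    """Train until accuracy exceeds preset_accuracy.

    Returns the number of rounds used, or None if max_rounds is reached
    without exceeding the preset accuracy.
    """
    for round_index in range(1, max_rounds + 1):
        train_once()
        if accuracy(predict, test_samples, test_labels) > preset_accuracy:
            return round_index
    return None


# Toy stand-in model: a single weight that grows by 0.25 per training round.
model = {"w": 0.0}


def toy_train_once():
    model["w"] += 0.25


def toy_predict(x):
    return min(1.0, model["w"] * x)


rounds = train_until_accurate(toy_train_once, toy_predict, [1.0, 0.0], [1, 0], 0.9)
```

With a real gradient-boosted model, `train_once` would typically add boosting rounds or refit with adjusted hyperparameters; which of these the application intends is not stated.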

Embodiment 2:

Based on the same technical concept, the present application also provides a sample equalization device. FIG. 3 shows a schematic structural diagram of a sample equalization device provided in an embodiment of the present application. As shown in FIG. 3, the device includes:

a prediction module 301, configured to predict, for each training sample in a training sample set, the default probability of the training sample; wherein the training sample set contains multiple training samples; the sample types of the training samples include default samples and non-default samples; and the number of default samples is less than the number of non-default samples;

a screening module 302, configured to screen out from the training sample set, according to the default probability of each training sample, the training samples whose default probability is greater than a first threshold or less than a second threshold, so as to construct a first training sample set from the screened-out training samples and a second training sample set from the remaining training samples; the first threshold is greater than the second threshold;

a sample equalization module 303, configured to perform sample equalization processing on the first training sample set, so as to expand the training samples in the first training sample set whose sample type is default sample and obtain a third training sample set; the third training sample set contains each training sample in the first training sample set and the new training samples generated by the sample expansion;

a merging module 304, configured to merge the third training sample set and the second training sample set to obtain a target training sample set.
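The cooperation of modules 301 to 304 can be sketched as a small class. This is an illustrative arrangement only: the class name, the representation of a sample as a pair of variable values and label, and the injected `predict` and `equalize` callables are all hypothetical, and a single pair of thresholds is used for simplicity.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Sample = Tuple[List[float], int]  # (variable values, label); label 1 = default


@dataclass
class SampleEqualizationDevice:
    predict: Callable[[Sample], float]                # prediction module 301
    equalize: Callable[[List[Sample]], List[Sample]]  # equalization module 303
    first_threshold: float
    second_threshold: float

    def __post_init__(self):
        if not self.first_threshold > self.second_threshold:
            raise ValueError("first threshold must be greater than second threshold")

    def screen(self, samples):
        """Screening module 302: split into first and second training sample sets."""
        first_set, second_set = [], []
        for sample in samples:
            p = self.predict(sample)
            if p > self.first_threshold or p < self.second_threshold:
                first_set.append(sample)
            else:
                second_set.append(sample)
        return first_set, second_set

    def run(self, samples):
        """Merging module 304: third set plus second set gives the target set."""
        first_set, second_set = self.screen(samples)
        third_set = self.equalize(first_set)
        return third_set + second_set


# Placeholder equalizer that duplicates default samples; a real one would
# generate synthetic samples by interpolation instead.
device = SampleEqualizationDevice(
    predict=lambda s: s[0][0],
    equalize=lambda subset: subset + [s for s in subset if s[1] == 1],
    first_threshold=0.8,
    second_threshold=0.2,
)
target_set = device.run([([0.9], 1), ([0.5], 0), ([0.1], 0)])
```

Passing the predictor and the equalizer in as callables mirrors the statement that the division into modules is only a logical one: either part can be replaced without changing the screening and merging logic.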

Optionally, when predicting, for each training sample in the training sample set, the default probability of the training sample, the prediction module 301 is specifically configured to:

Optionally, the device further includes a first training module; the first training module is configured to train the first default probability prediction model in the following manner:

Optionally, the second default probability prediction model includes multiple decision trees, and each decision tree contains multiple classification nodes; the device further includes a second training module; the second training module is configured to train the second default probability prediction model in the following manner:

Optionally, when screening out from the training sample set, according to the default probability of each training sample, the training samples whose default probability is greater than the first threshold or less than the second threshold, so as to construct the first training sample set from the screened-out training samples and the second training sample set from the remaining training samples, the screening module 302 is specifically configured to:

Optionally, when performing sample equalization processing on the first training sample set, so as to expand the training samples in the first training sample set whose sample type is default sample and obtain the third training sample set, the sample equalization module 303 is specifically configured to:

according to the Euclidean distances between the training sample and the other training samples, sort the other training samples by Euclidean distance in ascending order and select the first preset number of them as the similar training samples of the training sample;

Optionally, the device further includes:

Embodiment 3:

FIG. 4 is a schematic structural diagram of an electronic device provided in an embodiment of the present application. The electronic device includes a processor 401, a memory 402, and a bus 403. The memory 402 stores machine-readable instructions executable by the processor 401. When the electronic device runs the method described above, the processor 401 communicates with the memory 402 through the bus 403, and the processor 401 executes the machine-readable instructions to perform the method steps described in Embodiment 1.

Embodiment 4:

An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored. When the computer program is run by a processor, the method steps described in Embodiment 1 are performed.

Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the device, electronic device, and computer-readable storage medium described above can be found in the corresponding processes of the foregoing method embodiment and are not repeated here.

In the embodiments provided in the present application, it should be understood that the disclosed method, device, electronic device, and computer-readable storage medium may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into modules is only a division by logical function, and other divisions are possible in an actual implementation; as a further example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through communication interfaces, devices, or units, and may be electrical, mechanical, or of another form.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.

If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a processor-executable, non-volatile computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disc.

Finally, it should be noted that the embodiments described above are only specific implementations of the present application, intended to illustrate its technical solutions rather than to limit them, and the protection scope of the present application is not limited to them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that anyone familiar with this technical field may, within the technical scope disclosed in the present application, still modify the technical solutions recorded in the foregoing embodiments, readily conceive of variations, or make equivalent replacements of some of their technical features. Such modifications, variations, or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and all of them shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be determined by the protection scope of the claims.

Claims

1. A sample equalization method, characterized by including:

For each training sample in a training sample set, predict the default probability of the training sample; wherein the training sample set contains multiple training samples; the sample types of the training samples include default samples and non-default samples; the number of default samples is less than the number of non-default samples;

According to the default probability of each training sample, screen out from the training sample set the training samples whose default probability is greater than a first threshold or less than a second threshold, so as to construct a first training sample set using the screened-out training samples and construct a second training sample set using the remaining training samples; the first threshold is greater than the second threshold;

Perform sample equalization processing on the first training sample set, so as to expand the training samples whose sample type is the default sample in the first training sample set and obtain a third training sample set; the third training sample set contains each training sample in the first training sample set and the new training samples generated by the sample expansion;

The third training sample set and the second training sample set are combined to obtain a target training sample set.

2. The method according to claim 1, characterized in that, for each training sample in the training sample set, predicting the default probability of the training sample includes:

For each training sample in the training sample set, use the trained first default probability prediction model to predict the first default probability of the training sample, and use the trained second default probability prediction model to predict the second default probability of the training sample; the default probability of the training sample includes the first default probability and the second default probability of the training sample.

3. The method according to claim 2, characterized in that the first default probability prediction model is trained in the following manner:

For each training sample in the training sample set, input the training sample into the first initial default probability prediction model, predict the default situation of the training sample through the first initial default probability prediction model, and output the first default probability;

Input the default label of the training sample and the first default probability into the loss function, and calculate the loss value of the first initial default probability prediction model;

When the loss value is greater than the preset loss value, the trainable parameters in the first initial default probability prediction model are updated according to the loss value; when the loss value is not greater than the preset loss value, the current first initial default probability prediction model is used as the first default probability prediction model.

4. The method according to claim 2, characterized in that the second default probability prediction model includes multiple decision trees, and each decision tree contains multiple classification nodes; the second default probability prediction model is obtained by training in the following manner:

For each initial decision tree in the second initial default probability prediction model, input each training sample in the training sample set into the initial decision tree to obtain the node parameter corresponding to each classification node in the initial decision tree; the node parameter is the ratio of the number of training samples on the classification node whose sample type is default sample to the total number of training samples on the classification node;

Calculate the impurity of the initial decision tree according to the node parameters corresponding to each classification node in the initial decision tree;

When the impurity is greater than the preset impurity, optimize the structure of the initial decision tree; when the impurity is not greater than the preset impurity, determine the current initial decision tree as the decision tree after training is completed;

When each initial decision tree in the second initial default probability prediction model is trained, the second default probability prediction model is obtained.

5. The method according to claim 2, characterized in that screening out from the training sample set, according to the default probability of each training sample, the training samples whose default probability is greater than a first threshold or less than a second threshold, so as to construct the first training sample set using the screened-out training samples and construct the second training sample set using the remaining training samples, includes:

Determine the first threshold and the second threshold corresponding to the first default probability prediction model according to the first default probability of each training sample;

Screen out from the training sample set the training samples whose first default probability is greater than the first threshold or less than the second threshold, use the screened-out training samples to construct a fourth training sample set, and use the remaining training samples to construct a fifth training sample set;

Determine the first threshold and the second threshold corresponding to the second default probability prediction model according to the second default probability of each training sample;

Screen out from the training sample set the training samples whose second default probability is greater than the first threshold or less than the second threshold, use the screened-out training samples to construct a sixth training sample set, and use the remaining training samples to construct a seventh training sample set; wherein the second training sample set includes the fifth training sample set and the seventh training sample set;

The intersection of the fourth training sample set and the sixth training sample set is taken as the first training sample set.

6. The method according to claim 5, characterized in that performing sample equalization processing on the first training sample set, so as to expand the training samples whose sample type is the default sample in the first training sample set and obtain the third training sample set, includes:

For each training sample in the first training sample set whose sample type is the default sample, calculate the Euclidean distance between the training sample and other training samples; the other training samples are the training samples, other than the training sample, whose sample type is the default sample;

According to the Euclidean distance between the training sample and the other training samples, and in the order of the Euclidean distance from small to large, select the first preset number of other training samples as the similar training samples of the training sample;

Select a target similar training sample from the similar training samples, and use the target similar training sample and the training sample to generate a synthetic training sample of the training sample;

The third training sample set is constructed based on the synthetic training samples corresponding to each training sample whose sample type is the default sample in the first training sample set, and each training sample in the first training sample set.

7. The method according to any one of claims 1-6, characterized in that the method further includes:

The target training sample set is used to train an initial default probability prediction model, and a target default probability prediction model after training is obtained.

8. A sample equalization device, characterized in that it includes:

A prediction module, configured to predict the default probability of each training sample in the training sample set; wherein the training sample set contains multiple training samples; the sample types of the training samples include default samples and non-default samples; the number of default samples is less than the number of non-default samples;

A screening module, configured to screen out from the training sample set, according to the default probability of each of the training samples, the training samples whose default probability is greater than the first threshold or less than the second threshold, so as to construct the first training sample set using the screened-out training samples and construct the second training sample set using the remaining training samples; the first threshold is greater than the second threshold;

A sample equalization module, configured to perform sample equalization processing on the first training sample set to expand the training samples whose sample type is the default sample in the first training sample set to obtain a third training sample set; The third training sample set includes each training sample in the first training sample set and new training samples generated after sample expansion;

A merging module, configured to merge the third training sample set and the second training sample set to obtain a target training sample set.

9. An electronic device, characterized in that it includes: a processor, a memory and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the electronic device is running, the processor and the memory communicate with each other through the bus; and when the machine-readable instructions are executed by the processor, the steps of the method according to any one of claims 1 to 7 are performed.

10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when the computer program is run by a processor, the steps of the method according to any one of claims 1 to 7 are executed.