
CN111275203A - Decision tree construction method, device, equipment and storage medium based on column storage - Google Patents

Info

Publication number
CN111275203A
CN111275203A
Authority
CN
China
Prior art keywords
feature
node
decision tree
column
data
Prior art date
Legal status
Granted
Application number
CN202010087103.1A
Other languages
Chinese (zh)
Other versions
CN111275203B (en)
Inventor
李诗琦
黄启军
黄铭毅
刘玉德
Current Assignee
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date
Filing date
Publication date
Application filed by WeBank Co Ltd filed Critical WeBank Co Ltd
Priority to CN202010087103.1A
Publication of CN111275203A
Application granted
Publication of CN111275203B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/243 Classification techniques relating to the number of classes
    • G06F 18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q 40/03 Credit; Loans; Processing thereof

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Accounting & Taxation (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Finance (AREA)
  • Artificial Intelligence (AREA)
  • Economics (AREA)
  • Technology Law (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Development Economics (AREA)
  • Evolutionary Biology (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Business, Economics & Management (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a column-storage-based decision tree construction method, apparatus, device and storage medium, relating to the technical field of machine learning. The method comprises the following steps: reading the corresponding column data according to the first split feature of each node in the current layer, and dividing the samples among the nodes of the current layer according to the column data; obtaining the grouped samples in each node, and computing the feature distribution of each feature in the grouped samples; generating the next layer of nodes according to the feature distribution, and splitting the next-layer nodes until convergence to obtain a decision tree. Constructing the decision tree based on column storage thus shortens data-reading time and improves work efficiency.

Description

Column-storage-based decision tree construction method, apparatus, device and storage medium

Technical field

The present invention relates to the technical field of machine learning, and in particular to a column-storage-based decision tree construction method, apparatus, device and storage medium.

Background

With the development of computer technology, more and more technologies (big data, distributed computing, blockchain, artificial intelligence, etc.) are being applied in the financial field, and the traditional financial industry is gradually transforming into financial technology (Fintech). The security and real-time requirements of the financial industry also place higher demands on these technologies.

Decision trees are widely used in the financial field, but current decision tree construction mainly reads data in a row-storage manner. When a decision tree is constructed over a large volume of data, reading the data takes a great deal of time, so work efficiency is low.

Summary of the invention

The present invention provides a column-storage-based decision tree construction method, apparatus, device and storage medium, aiming to save decision tree training time and improve work efficiency.

To achieve the above object, the present invention provides a column-storage-based decision tree construction method, the method comprising:

reading the corresponding column data according to the first split feature of each node in the current layer, and dividing the samples among the nodes of the current layer according to the column data;

obtaining the grouped samples in each node, and computing the feature distribution of each feature in the grouped samples;

generating the next layer of nodes according to the feature distribution;

splitting the next layer of nodes until convergence, thereby obtaining a decision tree.

Preferably, each node corresponds to a different first split feature, and the step of dividing the samples among the nodes of the current layer according to the column data comprises:

obtaining the first feature of each sample from the column data, the first feature corresponding to the first split feature;

sending the first split feature of each node to one or more executors, which divide the samples among the nodes of the current layer according to the first feature.

Preferably, the step of generating the next layer of nodes according to the feature distribution comprises:

sending the feature distribution to one or more executors, which solve in parallel to obtain an optimal value;

using the optimal value as the second split feature to generate the next layer of nodes.

Preferably, the step of reading the corresponding column data according to the first split feature of each node in the current layer comprises:

obtaining the first split feature of each node in the current layer, and reading the column data corresponding to the first split feature from the data table.

Preferably, before the step of reading the corresponding column data according to the first split feature of each node in the current layer, the method further comprises:

scanning the sample data, performing feature-consistency processing on the sample data, and storing the sample data by column according to the features.

Preferably, the step of obtaining the grouped samples in each node and computing the feature distribution of each feature in the grouped samples comprises:

recording the samples divided into the same node as grouped samples, and obtaining the grouped samples corresponding to each node;

computing statistics over the grouped samples according to a preset feature statistics rule to obtain the feature distribution of each feature.

Preferably, the first split features of the nodes in the current layer are education, deposit and age respectively, and the step of reading the corresponding column data according to the first split feature of each node in the current layer and dividing the samples among the nodes of the current layer according to the column data comprises:

reading the education column data, deposit column data and age column data corresponding to education, deposit and age from the data table;

sending the education column data, deposit column data and age column data to three executors respectively, which divide the samples among the nodes of the current layer according to the corresponding column data.

In addition, to achieve the above object, the present invention further provides a column-storage-based decision tree construction apparatus, the apparatus comprising:

a dividing module, configured to read the corresponding column data according to the first split feature of each node in the current layer, and to divide the samples among the nodes of the current layer according to the column data;

a statistics module, configured to obtain the grouped samples in each node and to compute the feature distribution of each feature in the grouped samples;

a generating module, configured to generate the next layer of nodes according to the feature distribution;

an obtaining module, configured to split the next layer of nodes until convergence, thereby obtaining a decision tree.

In addition, to achieve the above object, the present invention further provides a column-storage-based decision tree construction device, the device comprising a processor, a memory, and a column-storage-based decision tree construction program stored in the memory, wherein the program, when run by the processor, implements the steps of the column-storage-based decision tree construction method described above.

In addition, to achieve the above object, the present invention further provides a computer storage medium storing a column-storage-based decision tree construction program, wherein the program, when run by a processor, implements the steps of the column-storage-based decision tree construction method described above.

Compared with the prior art, the present invention provides a column-storage-based decision tree construction method, apparatus, device and storage medium, which: reads the corresponding column data according to the first split feature of each node in the current layer, and divides the samples among the nodes of the current layer according to the column data; obtains the grouped samples in each node and computes the feature distribution of each feature in the grouped samples; generates the next layer of nodes according to the feature distribution; and splits the next layer of nodes until convergence, thereby obtaining a decision tree. Constructing the decision tree based on column storage thus shortens data-reading time and improves work efficiency.

Brief description of the drawings

FIG. 1 is a schematic diagram of the hardware structure of the column-storage-based decision tree construction device involved in the embodiments of the present invention;

FIG. 2 is a schematic flowchart of a first embodiment of the column-storage-based decision tree construction method of the present invention;

FIG. 3 is a schematic diagram of the functional modules of a first embodiment of the column-storage-based decision tree construction apparatus of the present invention.

The realization of the objects, functional characteristics and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.

Detailed description of the embodiments

It should be understood that the specific embodiments described herein are intended only to explain the present invention, not to limit it.

The column-storage-based decision tree construction device involved in the embodiments of the present invention refers to a device capable of network connection, such as a server or a cloud platform.

Referring to FIG. 1, FIG. 1 is a schematic diagram of the hardware structure of the column-storage-based decision tree construction device involved in the embodiments of the present invention. In the embodiments of the present invention, the device may include a processor 1001 (e.g. a central processing unit, CPU), a communication bus 1002, an input port 1003, an output port 1004 and a memory 1005. The communication bus 1002 is used to implement connection and communication between these components; the input port 1003 is used for data input; the output port 1004 is used for data output; and the memory 1005 may be a high-speed RAM memory or a stable non-volatile memory such as a disk memory. Optionally, the memory 1005 may also be a storage device independent of the aforementioned processor 1001. Those skilled in the art will understand that the hardware structure shown in FIG. 1 does not constitute a limitation of the present invention, and may include more or fewer components than shown, a combination of certain components, or a different arrangement of components.

Still referring to FIG. 1, the memory 1005, as a readable storage medium, may include an operating system, a network communication module, an application program module and a column-storage-based decision tree construction program. In FIG. 1, the network communication module is mainly used to connect to a server and exchange data with it, while the processor 1001 is used to call the column-storage-based decision tree construction program stored in the memory 1005 and perform the following operations:

reading the corresponding column data according to the first split feature of each node in the current layer, and dividing the samples among the nodes of the current layer according to the column data;

obtaining the grouped samples in each node, and computing the feature distribution of each feature in the grouped samples;

generating the next layer of nodes according to the feature distribution;

splitting the next layer of nodes until convergence, thereby obtaining a decision tree.

The processor 1001 is further configured to call the column-storage-based decision tree construction program stored in the memory 1005 and perform the following operations:

obtaining the first feature of each sample from the column data, the first feature corresponding to the first split feature;

sending the first split feature of each node to one or more executors, which divide the samples among the nodes of the current layer according to the first feature.

The processor 1001 is further configured to call the column-storage-based decision tree construction program stored in the memory 1005 and perform the following operations:

sending the feature distribution to one or more executors, which solve in parallel to obtain an optimal value;

using the optimal value as the second split feature to generate the next layer of nodes.

The processor 1001 is further configured to call the column-storage-based decision tree construction program stored in the memory 1005 and perform the following operations:

obtaining the first split feature of each node in the current layer, and reading the column data corresponding to the first split feature from the data table.

The processor 1001 is further configured to call the column-storage-based decision tree construction program stored in the memory 1005 and perform the following operations:

scanning the sample data, performing feature-consistency processing on the sample data, and storing the sample data by column according to the features.

The processor 1001 is further configured to call the column-storage-based decision tree construction program stored in the memory 1005 and perform the following operations:

recording the samples divided into the same node as grouped samples, and obtaining the grouped samples corresponding to each node;

computing statistics over the grouped samples according to a preset feature statistics rule to obtain the feature distribution of each feature.

The processor 1001 is further configured to call the column-storage-based decision tree construction program stored in the memory 1005 and perform the following operations:

reading the education column data, deposit column data and age column data corresponding to education, deposit and age from the data table;

sending the education column data, deposit column data and age column data to three executors respectively, which divide the samples among the nodes of the current layer according to the corresponding column data.

A decision tree is a decision-analysis method that, based on the known probabilities of various situations, constructs a tree to obtain the probability that the expected net present value is greater than or equal to zero, thereby evaluating project risk and judging feasibility; it is a graphical method that applies probability analysis intuitively. In machine learning, a decision tree is a predictive model that represents a mapping between object attributes and object values.

The embodiments of the present invention provide a column-storage-based decision tree construction method. In SQL (Structured Query Language) Server, a page is the basic unit of data storage, and a data row is the storage unit of the actual data; data is stored on a page sequentially, starting after the page header. This way of storing records row by row on a page is row storage. When data is stored contiguously by single column rather than by row, it is column storage. Column storage is inferior to row storage in write efficiency and in guaranteeing data integrity, but its advantage is that no redundant data is read during reading, which is especially important in big-data processing, where data-integrity requirements are less strict.
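The contrast between the two layouts can be sketched as follows. This is an illustrative Python toy, not the patent's implementation; the feature names and values are assumptions:

```python
# Row storage: each record is stored as one complete row.
rows = [
    {"age": 20, "deposit": 5000, "education": "BSc"},
    {"age": 35, "deposit": 12000, "education": "MSc"},
    {"age": 28, "deposit": 800, "education": "BSc"},
]

# Column storage: each feature is stored as one contiguous column.
columns = {
    "age": [20, 35, 28],
    "deposit": [5000, 12000, 800],
    "education": ["BSc", "MSc", "BSc"],
}

def read_feature_row_store(rows, feature):
    """Row storage must touch every complete record to extract one feature."""
    return [record[feature] for record in rows]

def read_feature_column_store(columns, feature):
    """Column storage returns the needed column directly, with no redundant reads."""
    return columns[feature]

assert read_feature_row_store(rows, "age") == read_feature_column_store(columns, "age")
```

Reading one feature from the row store touches every record, while the column store hands back exactly the values needed; this is the saving the method below exploits.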

Referring to FIG. 2, FIG. 2 is a schematic flowchart of a first embodiment of the column-storage-based decision tree construction method of the present invention.

In this embodiment, the column-storage-based decision tree construction method is applied to a column-storage-based decision tree construction device, and the method comprises:

Step S101: reading the corresponding column data according to the first split feature of each node in the current layer, and dividing the samples among the nodes of the current layer according to the column data.

Each node in each layer of the decision tree has its own split feature; a node splits according to its split feature to obtain its child nodes. In this embodiment, the split feature of the current layer is recorded as the first split feature. After the first split feature is obtained, the corresponding column data is read according to it: the first split feature of each node in the current layer is obtained, and the column data corresponding to that feature is read from the data table.

In this embodiment, each node corresponds to a different first split feature, and the step of dividing the samples among the nodes of the current layer according to the column data comprises:

obtaining the first feature of each sample from the column data, the first feature corresponding to the first split feature. According to the column data, the value of the first split feature is recorded as the first feature of each sample. For example, if the first split feature is age and the age of a sample in the read column data is 20, then the first feature of that sample is 20.

The first split feature of each node is then sent to one or more executors, which divide the samples among the nodes of the current layer according to the first feature. Since each node's first split feature is different, the first split features of the nodes can be sent to multiple executors for parallel processing. For example, if the current layer has four nodes whose first split features are feature A, feature B, feature C and feature D, the first split features of these four nodes can be sent to executor 1, executor 2, executor 3 and executor 4 respectively for separate processing. Multi-threaded processing is thereby achieved, improving processing efficiency.
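The per-node division described above can be sketched as follows; the feature columns, thresholds and node assignments are assumptions for illustration, and in the parallel scheme each `partition_node` call would be handed to a separate executor:

```python
# Column-stored features for four samples.
columns = {
    "age": [20, 35, 28, 52],
    "deposit": [5000, 12000, 800, 30000],
}
# Which current-layer node each sample currently sits in (two nodes: 0 and 1).
sample_node = [0, 0, 1, 1]
# Each current-layer node's first split feature and an assumed threshold.
node_splits = {0: ("age", 30), 1: ("deposit", 10000)}

def partition_node(node_id):
    """One executor's task: route one node's samples to its left/right
    children using only that node's feature column."""
    feature, threshold = node_splits[node_id]
    column = columns[feature]          # read just the needed column
    left, right = [], []
    for i, nid in enumerate(sample_node):
        if nid != node_id:
            continue
        (left if column[i] <= threshold else right).append(i)
    return left, right

# The loop below could be a pool.map over executors; it runs serially here.
partitions = {nid: partition_node(nid) for nid in node_splits}
```

Because the nodes use different columns, the tasks share no data and can run fully in parallel.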

Further, before the step of reading the corresponding column data according to the first split feature of each node in the current layer, the method further comprises:

scanning the sample data, performing feature-consistency processing on the sample data, and storing the sample data by column according to the features.

Column storage requires a complete record to be split into multiple single columns for saving, and during reading the corresponding columns are likewise read individually. Since column-stored data is split before storage, feature-consistency processing can be applied to the sample data to be column-stored. Specifically, each record of sample data is scanned, the target feature values corresponding to the target features are obtained and stored as target sample data according to a preset logic, and the other values in the record are ignored. If one or more target feature values are missing from a record, the record is either discarded or the missing values are filled in a preset manner. After consistency processing, unified sample data is obtained.
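A hedged sketch of this consistency pass: scan each record, keep only the target features, fill missing values by a preset rule (here a per-feature default, one of the options mentioned above), and store the result by column. The feature list and defaults are assumptions:

```python
target_features = ["age", "deposit"]
defaults = {"age": 0, "deposit": 0}      # assumed preset fill rule
raw_samples = [
    {"age": 20, "deposit": 5000, "city": "SZ"},  # extra field is ignored
    {"age": 35},                                  # missing deposit is filled
    {"deposit": 800, "age": 28},
]

def to_column_store(samples, features, fill):
    """Scan records, enforce a consistent feature set, and lay out by column."""
    columns = {f: [] for f in features}
    for record in samples:
        for f in features:
            columns[f].append(record.get(f, fill[f]))
    return columns

column_store = to_column_store(raw_samples, target_features, defaults)
```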

Further, the sample data is stored by column according to the target features, with each column containing one target feature.

Step S102: obtaining the grouped samples in each node, and computing the feature distribution of each feature in the grouped samples.

Specifically, this step comprises:

Step S102a: recording the samples divided into the same node as grouped samples, and obtaining the grouped samples corresponding to each node.

After the samples are divided among the nodes of the current layer, each node contains a subset of the samples. The samples divided into the same node are recorded as that node's grouped samples, so every node has corresponding grouped samples, and the grouped samples of each node are obtained respectively.

Step S102b: computing statistics over the grouped samples according to a preset feature statistics rule to obtain the feature distribution of each feature.

The grouped samples are grouped at the current layer by the first split feature. Understandably, if the other features of the grouped samples have not yet been used for grouping in the layers above the current layer, they need to be used for grouping in some layer below it.

In this embodiment, a preset feature statistics rule is configured. The rule may summarize each feature with a normal distribution, or by its mean, maximum, minimum, median and the like. Statistics are then computed over the grouped samples according to the preset rule to obtain the feature distribution of each feature.
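One possible form of the preset rule, sketched with the standard-library `statistics` module: summarize every feature of one node's grouped samples by mean, minimum, maximum and median. The data and the choice of summaries are illustrative assumptions:

```python
import statistics

def feature_distribution(columns, sample_ids):
    """Apply an assumed preset statistics rule to one node's grouped samples."""
    stats = {}
    for feature, values in columns.items():
        group = [values[i] for i in sample_ids]   # this node's grouped samples
        stats[feature] = {
            "mean": statistics.mean(group),
            "min": min(group),
            "max": max(group),
            "median": statistics.median(group),
        }
    return stats

columns = {"age": [20, 35, 28, 52], "deposit": [5000, 12000, 800, 30000]}
dist = feature_distribution(columns, [0, 1, 2])   # grouped samples of one node
```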

Step S103: generating the next layer of nodes according to the feature distribution.

After the feature distribution is obtained, the optimal value can be obtained, and the next node is generated by splitting according to it; the decision tree thus keeps splitting until convergence.

Specifically, the step of generating the next layer of nodes according to the feature distribution comprises:

Step S103a: sending the feature distribution to one or more executors, which solve in parallel to obtain the optimal value.

Specifically, the feature distribution of each node is sent to one or more executors, which solve over the feature distribution according to a preset algorithm, for example by computing entropy or the Gini coefficient (impurity) to obtain the optimal value of the feature distribution. For instance, the feature with the highest information gain can be selected as the test feature; using it to divide the node's samples into subsets minimizes the mixing of samples of different classes within each subset, so the information (entropy) required to divide the samples in each subset is minimal, and the optimal value is thereby obtained.
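As one concrete instance of such a preset algorithm, the sketch below scores candidate thresholds by weighted Gini impurity and keeps the lowest. The data, labels and candidate thresholds are assumptions, and each (feature, threshold) evaluation could be farmed out to a separate executor:

```python
def gini(labels):
    """Gini impurity of a set of binary labels (0/1)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)

def best_split(columns, labels, candidates):
    """Return (weighted impurity, feature, threshold) of the best candidate."""
    best = None
    for feature, thresholds in candidates.items():
        for t in thresholds:
            left = [labels[i] for i, v in enumerate(columns[feature]) if v <= t]
            right = [labels[i] for i, v in enumerate(columns[feature]) if v > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, feature, t)
    return best

columns = {"age": [20, 35, 28, 52]}
labels = [0, 1, 0, 1]
score, feature, threshold = best_split(columns, labels, {"age": [25, 30, 40]})
```

Here age 30 separates the classes perfectly, so its weighted impurity is zero and it is chosen as the split.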

本实施例中,可以由多个所述执行机同时进行求解,实现了并行计算,大大提高了工作效率。In this embodiment, the solution can be solved by a plurality of the execution machines at the same time, which realizes parallel computing and greatly improves the work efficiency.

步骤S103b,将所述最优值作为第二拆分特征生成下一层节点。Step S103b, using the optimal value as the second split feature to generate the next layer of nodes.

将所述最优值作为该节点的第二拆分特征，并基于所述第二拆分特征生成下一层节点。The optimal value is taken as the second splitting feature of the node, and the next layer of nodes is generated based on the second splitting feature.

步骤S104:对所述下一层节点进行分裂,直至收敛,获得决策树。Step S104: Split the next layer node until convergence, and obtain a decision tree.

具体地，对所述下一层节点内的节点样本进行统计，获得第二特征分布情况，并基于所述第二特征分布情况由执行机计算最优值，并继续生成节点。不断重复上述步骤，直到达到决策树收敛条件，并在剪枝后获得最终的决策树。在决策树充分生长后，修剪掉多余的分支。Specifically, the node samples in the nodes of the next layer are counted to obtain a second feature distribution, the execution machine calculates an optimal value based on the second feature distribution, and nodes continue to be generated. The above steps are repeated until the convergence condition of the decision tree is reached, and the final decision tree is obtained after pruning. After the decision tree has grown sufficiently, the excess branches are pruned off.

本实施例可以通过后剪枝的方式进行剪枝。具体地，根据每个分支的分类错误率及每个分支的权重，计算该节点不修剪时预期分类错误率；对于每个非叶节点，计算该节点被修剪后的分类错误率，如果修剪后分类错误率变大，即放弃修剪；否则将该节点强制为叶节点，并标记类别。产生一系列修剪过的决策树候选之后，利用测试数据（未参与建模的数据）对各候选决策树的分类准确性进行评价，保留分类错误率最小的决策树。In this embodiment, pruning can be performed by post-pruning. Specifically, the expected classification error rate of a node when it is not pruned is calculated according to the classification error rate and the weight of each branch; for each non-leaf node, the classification error rate after the node is pruned is calculated, and if the classification error rate becomes larger after pruning, pruning is abandoned; otherwise, the node is forced to be a leaf node and its category is marked. After a series of pruned candidate decision trees are generated, test data (data not involved in modeling) are used to evaluate the classification accuracy of each candidate decision tree, and the decision tree with the smallest classification error rate is retained.
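The post-pruning rule above can be sketched roughly as follows; the tree encoding, the error estimator, and every name are illustrative assumptions rather than structures defined by the embodiment.

```python
def prune(node, error_rate):
    """Bottom-up post-pruning.

    A leaf is a dict with a "label"; an internal node has "children"
    and a "majority" class used if the node is collapsed into a leaf.
    `error_rate` estimates the classification error of a (sub)tree.
    """
    if "label" in node:
        return node  # leaves cannot be pruned further
    node["children"] = [prune(child, error_rate) for child in node["children"]]
    kept_error = error_rate(node)              # expected error if not pruned
    collapsed = {"label": node["majority"]}    # node forced to a leaf
    if error_rate(collapsed) <= kept_error:    # pruning does not hurt accuracy
        return collapsed
    return node                                # pruning hurts: abandon it

# Toy tree and a toy error estimate (equal branch weights, illustrative)
tree = {"majority": "approve",
        "children": [{"label": "approve"}, {"label": "approve"}]}
leaf_error = {"approve": 0.1}
def err(node):
    if "label" in node:
        return leaf_error[node["label"]]
    return sum(err(c) for c in node["children"]) / len(node["children"])

pruned = prune(tree, err)
```

On a real candidate set, `error_rate` would be evaluated on held-out test data, and the candidate tree with the smallest error would be retained, as the embodiment describes.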

决策树模型经常应用于金融领域,例如在银行的信贷业务中,可以构建信贷风险决策树。在构建信贷风险决策树时,获取大量用户数据作为样本,所述样本包括多个信贷特征,所述信贷特征包括学历、存款、职业、年龄、历史贷款记录、家庭情况、固定资产等。The decision tree model is often used in the financial field, for example, in the bank's credit business, a credit risk decision tree can be constructed. When constructing the credit risk decision tree, a large amount of user data is obtained as a sample, and the sample includes multiple credit features, and the credit features include education, deposit, occupation, age, historical loan records, family situation, fixed assets, and the like.

在构建信贷风险决策树的过程中，设所述当前层中的各个节点的第一拆分特征分别是学历、存款和年龄，所述根据当前层中各个节点的第一拆分特征读取对应的列数据，并根据所述列数据将各个样本划分至所述当前层中的各个节点的步骤包括：从数据表中读取与学历、存款和年龄对应的学历列数据、存款列数据和年龄列数据；将所述学历列数据、存款列数据和年龄列数据分别发送至三个执行机，由所述三个执行机分别根据对应的列数据将所述各个样本划分至所述当前层中的各个节点，所述各个节点分别是：学历节点、存款节点和年龄节点。In the process of constructing the credit risk decision tree, suppose the first splitting features of the nodes in the current layer are education, deposit and age respectively. The step of reading the corresponding column data according to the first splitting feature of each node in the current layer and dividing each sample into each node in the current layer according to the column data includes: reading the education column data, deposit column data and age column data corresponding to education, deposit and age from the data table; and sending the education column data, deposit column data and age column data to three execution machines respectively, where the three execution machines divide the samples into the nodes of the current layer according to the corresponding column data. The nodes are respectively an education node, a deposit node and an age node.

进一步地，将所述学历节点、所述存款节点和所述年龄节点中的样本标记为分组样本：样本A、样本B以及样本C，分别统计各个分组样本中各个特征的分布情况。对于样本A，则获取除学历特征之外的其它特征的分布情况；对于样本B，则获取除存款特征之外的其它特征的分布情况；对于样本C，则获取除年龄特征之外的其它特征的分布情况。可以理解地，所述其它特征可以根据需要具体选择，例如还可以剔除所述当前层节点的上层节点的一个或多个特征。当获取到所述学历节点、所述存款节点和所述年龄节点中样本的特征分布情况后，则根据所述特征分布情况，生成下一层节点。例如，对于学历节点，则可以根据职业特征进行分裂，获得下一层节点。不断重复上述步骤，直到满足收敛条件，获得信贷风险决策树。Further, the samples in the education node, the deposit node and the age node are marked as grouped samples: sample A, sample B and sample C, and the distribution of each feature in each grouped sample is counted respectively. For sample A, the distributions of the features other than the education feature are obtained; for sample B, the distributions of the features other than the deposit feature are obtained; for sample C, the distributions of the features other than the age feature are obtained. Understandably, the other features can be selected as needed; for example, one or more features of the upper-layer nodes of the current-layer node can also be excluded. After the feature distributions of the samples in the education node, the deposit node and the age node are obtained, the next layer of nodes is generated according to the feature distributions. For example, the education node can be split according to the occupation feature to obtain the next layer of nodes. The above steps are repeated until the convergence condition is satisfied, and the credit risk decision tree is obtained.
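The column read and node partition in this example can be sketched as below, assuming a minimal dict-of-lists columnar table; feature names, thresholds, and values are illustrative, and in the embodiment each execution machine would receive one such column.

```python
def read_column(table, feature):
    """Read one feature column from a column-oriented data table.

    Only the requested column is touched, which is what saves I/O
    compared with scanning whole rows.
    """
    return table[feature]

def partition_by_threshold(column, threshold):
    """Split sample indices into two child nodes by one column."""
    left = [i for i, v in enumerate(column) if v <= threshold]
    right = [i for i, v in enumerate(column) if v > threshold]
    return left, right

# Toy columnar table: one list per credit feature (values illustrative)
table = {
    "education": [12, 16, 19, 16],     # years of schooling
    "deposit": [500, 12000, 300, 8000],
    "age": [23, 35, 52, 41],
}
left, right = partition_by_threshold(read_column(table, "deposit"), 1000)
```

Here samples 0 and 2 would fall into one child node and samples 1 and 3 into the other, mirroring how the three execution machines partition samples by their respective columns.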

本实施例通过上述方案,根据当前层中各个节点的第一拆分特征读取对应的列数据,并根据所述列数据将各个样本划分至所述当前层中的各个节点;获取所述各个节点中的分组样本,并统计所述分组样本中各个特征的特征分布情况;根据所述特征分布情况,生成下一层节点,对所述下一层节点进行分裂,直至收敛,获得决策树。由此,基于列存储构造决策树,缩减了数据读取的时间,提升了工作效率。In this embodiment, through the above solution, the corresponding column data is read according to the first splitting feature of each node in the current layer, and each sample is divided into each node in the current layer according to the column data; The grouped samples in the nodes are collected, and the feature distribution of each feature in the grouped samples is counted; according to the feature distribution, the next layer of nodes is generated, and the next layer of nodes is split until convergence to obtain a decision tree. As a result, the decision tree is constructed based on column storage, which reduces the time for data reading and improves work efficiency.

此外,本实施例还提供一种基于列存储的决策树构造装置。参照图3,图3为本发明基于列存储的决策树构造装置第一实施例的功能模块示意图。In addition, this embodiment also provides an apparatus for constructing a decision tree based on column storage. Referring to FIG. 3 , FIG. 3 is a schematic diagram of the functional modules of the first embodiment of the apparatus for constructing a decision tree based on column storage according to the present invention.

本实施例中，所述列存储的决策树构造装置为虚拟装置，存储于图1所示的列存储的决策树构造设备的存储器1005中，以实现列存储的决策树构造程序的所有功能：用于根据当前层中各个节点的第一拆分特征读取对应的列数据，并根据所述列数据将各个样本划分至所述当前层中的各个节点；用于获取所述各个节点中的分组样本，并统计所述分组样本中各个特征的特征分布情况；用于根据所述特征分布情况，生成下一层节点；用于对所述下一层节点进行分裂，直至收敛，获得决策树。In this embodiment, the column storage-based decision tree construction apparatus is a virtual apparatus stored in the memory 1005 of the column storage-based decision tree construction device shown in FIG. 1, so as to realize all the functions of the column storage-based decision tree construction program: reading the corresponding column data according to the first splitting feature of each node in the current layer, and dividing each sample into each node in the current layer according to the column data; obtaining the grouped samples in the respective nodes, and counting the feature distribution of each feature in the grouped samples; generating the next layer of nodes according to the feature distribution; and splitting the next layer of nodes until convergence to obtain a decision tree.

具体地,所述基于列存储的决策树构造装置包括:Specifically, the column storage-based decision tree construction device includes:

划分模块10,用于根据当前层中各个节点的第一拆分特征读取对应的列数据,并根据所述列数据将各个样本划分至所述当前层中的各个节点;A division module 10, configured to read the corresponding column data according to the first splitting feature of each node in the current layer, and divide each sample into each node in the current layer according to the column data;

统计模块20,用于获取所述各个节点中的分组样本,并统计所述分组样本中各个特征的特征分布情况;A statistical module 20, configured to obtain the grouped samples in the respective nodes, and to count the feature distribution of each feature in the grouped samples;

生成模块30,用于根据所述特征分布情况,生成下一层节点;A generating module 30, configured to generate the next layer of nodes according to the feature distribution;

获得模块40,用于对所述下一层节点进行分裂,直至收敛,获得决策树。The obtaining module 40 is used for splitting the nodes of the next layer until convergence, and obtaining a decision tree.

进一步地,所述生成模块包括:Further, the generation module includes:

发送单元，用于将所述特征分布情况发送至一个或多个执行机，由所述一个或多个执行机进行并行求解获得最优值；a sending unit, configured to send the feature distribution to one or more execution machines, and the one or more execution machines perform parallel solving to obtain an optimal value;

生成单元,用于将所述最优值作为第二拆分特征生成下一层节点。a generating unit, configured to use the optimal value as the second splitting feature to generate a node of the next layer.

进一步地,所述划分模块包括:Further, the dividing module includes:

读取单元,用于获取所述第一拆分特征,从数据表中读取与所述第一拆分特征对应的列数据。The reading unit is configured to acquire the first splitting feature, and read column data corresponding to the first splitting feature from the data table.

进一步地，所述划分模块还包括：Further, the division module further includes:

扫描单元,用于扫描样本数据,对所述样本数据进行特征一致性处理,并根据特征将所述样本数据按列存储。The scanning unit is configured to scan the sample data, perform feature consistency processing on the sample data, and store the sample data in columns according to the features.

进一步地,所述统计模块包括:Further, the statistical module includes:

获取单元,用于将划分至同一个节点的所述样本记为所述分组样本,获取所述各个节点对应的分组样本;an obtaining unit, configured to record the samples divided into the same node as the grouped samples, and obtain the grouped samples corresponding to the respective nodes;

统计单元,用于根据预设特征统计规则对所述分组样本进行统计,获得各个特征的特征分布情况。A statistical unit, configured to perform statistics on the grouped samples according to a preset feature statistical rule to obtain the feature distribution of each feature.

此外，本发明实施例还提供一种计算机存储介质，所述计算机存储介质上存储有基于列存储的决策树构造程序，所述基于列存储的决策树构造程序被处理器运行时实现如上所述基于列存储的决策树构造方法的步骤，此处不再赘述。In addition, an embodiment of the present invention further provides a computer storage medium, on which a column storage-based decision tree construction program is stored. When the column storage-based decision tree construction program is run by a processor, the steps of the column storage-based decision tree construction method described above are implemented, which will not be repeated here.

相比现有技术，本发明提出的一种基于列存储的决策树构造方法、装置、设备及存储介质，涉及机器学习技术领域，该方法包括：根据当前层中各个节点的第一拆分特征读取对应的列数据，并根据所述列数据将各个样本划分至所述当前层中的各个节点；获取所述各个节点中的分组样本，并统计所述分组样本中各个特征的特征分布情况；根据所述特征分布情况，生成下一层节点；对所述下一层节点进行分裂，直至收敛，获得决策树。由此，基于列存储构造决策树，缩减了数据读取的时间，提升了工作效率。Compared with the prior art, the present invention provides a method, apparatus, device and storage medium for constructing a decision tree based on column storage, relating to the technical field of machine learning. The method includes: reading the corresponding column data according to the first splitting feature of each node in the current layer, and dividing each sample into each node in the current layer according to the column data; obtaining the grouped samples in each node, and counting the feature distribution of each feature in the grouped samples; generating the next layer of nodes according to the feature distribution; and splitting the next layer of nodes until convergence to obtain a decision tree. As a result, constructing the decision tree based on column storage reduces data-reading time and improves work efficiency.

需要说明的是，在本文中，术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者系统不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、物品或者系统所固有的要素。在没有更多限制的情况下，由语句“包括一个……”限定的要素，并不排除在包括该要素的过程、方法、物品或者系统中还存在另外的相同要素。It should be noted that, herein, the terms "include", "comprise" or any other variation thereof are intended to cover non-exclusive inclusion, so that a process, method, article or system comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or system. Without further limitation, an element qualified by the phrase "comprising a..." does not preclude the presence of additional identical elements in the process, method, article or system that includes the element.

上述本发明实施例序号仅仅为了描述,不代表实施例的优劣。The above-mentioned serial numbers of the embodiments of the present invention are only for description, and do not represent the advantages or disadvantages of the embodiments.

通过以上的实施方式的描述，本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现，当然也可以通过硬件，但很多情况下前者是更佳的实施方式。基于这样的理解，本发明的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品存储在如上所述的一个存储介质(如ROM/RAM、磁碟、光盘)中，包括若干指令用以使得一台终端设备执行本发明各个实施例所述的方法。From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course by hardware as well, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as a ROM/RAM, magnetic disk, or optical disk) and includes several instructions to cause a terminal device to execute the methods described in the embodiments of the present invention.

以上所述仅为本发明的优选实施例，并非因此限制本发明的专利范围，凡是利用本发明说明书及附图内容所作的等效结构或流程变换，或直接或间接运用在其它相关的技术领域，均同理包括在本发明的专利保护范围内。The above descriptions are only preferred embodiments of the present invention and are not intended to limit the patent scope of the present invention. Any equivalent structural or process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (10)

1.一种基于列存储的决策树构造方法，其特征在于，该方法包括：1. A decision tree construction method based on column storage, characterized in that the method comprises:
根据当前层中各个节点的第一拆分特征读取对应的列数据，并根据所述列数据将各个样本划分至所述当前层中的各个节点；reading the corresponding column data according to the first splitting feature of each node in the current layer, and dividing each sample into each node in the current layer according to the column data;
获取所述各个节点中的分组样本，并统计所述分组样本中各个特征的特征分布情况；obtaining the grouped samples in the respective nodes, and counting the feature distribution of each feature in the grouped samples;
根据所述特征分布情况，生成下一层节点；generating the next layer of nodes according to the feature distribution; and
对所述下一层节点进行分裂，直至收敛，获得决策树。splitting the next layer of nodes until convergence to obtain a decision tree.

2.根据权利要求1所述的方法，其特征在于，所述各个节点对应不同的第一拆分特征，所述根据所述列数据将各个样本划分至所述当前层中的各个节点的步骤包括：2. The method according to claim 1, wherein each node corresponds to a different first splitting feature, and the step of dividing each sample into each node in the current layer according to the column data comprises:
从所述列数据中获取所述各个样本的第一特征，所述第一特征与所述第一拆分特征对应；obtaining the first feature of each sample from the column data, where the first feature corresponds to the first splitting feature; and
将各个节点的所述第一拆分特征发送至一个或多个执行机，由所述一个或多个执行机根据所述第一特征将所述各个样本划分至所述当前层中的各个节点。sending the first splitting feature of each node to one or more execution machines, and the one or more execution machines dividing each sample into each node in the current layer according to the first feature.

3.根据权利要求1所述的方法，其特征在于，所述根据所述特征分布情况，生成下一层节点的步骤包括：3. The method according to claim 1, wherein the step of generating the next layer of nodes according to the feature distribution comprises:
将所述特征分布情况发送至一个或多个执行机，由所述一个或多个执行机进行并行求解获得最优值；sending the feature distribution to one or more execution machines, and the one or more execution machines performing parallel solving to obtain an optimal value; and
将所述最优值作为第二拆分特征生成下一层节点。using the optimal value as the second splitting feature to generate the next layer of nodes.

4.根据权利要求1所述的方法，其特征在于，所述根据当前层中各个节点的第一拆分特征读取对应的列数据的步骤包括：4. The method according to claim 1, wherein the step of reading the corresponding column data according to the first splitting feature of each node in the current layer comprises:
获取当前层中各个节点的所述第一拆分特征，从数据表中读取与所述第一拆分特征对应的列数据。obtaining the first splitting feature of each node in the current layer, and reading the column data corresponding to the first splitting feature from the data table.

5.根据权利要求1所述的方法，其特征在于，所述根据当前层中各个节点的第一拆分特征读取对应的列数据的步骤之前还包括：5. The method according to claim 1, wherein before the step of reading the corresponding column data according to the first splitting feature of each node in the current layer, the method further comprises:
扫描样本数据，对所述样本数据进行特征一致性处理，并根据特征将所述样本数据按列存储。scanning the sample data, performing feature consistency processing on the sample data, and storing the sample data in columns according to the features.

6.根据权利要求1所述的方法，其特征在于，所述获取所述各个节点中的分组样本，并统计所述分组样本中各个特征的特征分布情况的步骤包括：6. The method according to claim 1, wherein the step of obtaining the grouped samples in the respective nodes and counting the feature distribution of each feature in the grouped samples comprises:
将划分至同一个节点的所述样本记为所述分组样本，获取所述各个节点对应的分组样本；recording the samples divided into the same node as the grouped samples, and obtaining the grouped samples corresponding to each node; and
根据预设特征统计规则对所述分组样本进行统计，获得各个特征的特征分布情况。performing statistics on the grouped samples according to a preset feature statistical rule to obtain the feature distribution of each feature.

7.根据权利要求1所述的方法，其特征在于，所述当前层中的各个节点的第一拆分特征分别是学历、存款和年龄，所述根据当前层中各个节点的第一拆分特征读取对应的列数据，并根据所述列数据将各个样本划分至所述当前层中的各个节点的步骤包括：7. The method according to claim 1, wherein the first splitting features of the nodes in the current layer are education, deposit and age respectively, and the step of reading the corresponding column data according to the first splitting feature of each node in the current layer and dividing each sample into each node in the current layer according to the column data comprises:
从数据表中读取与学历、存款和年龄对应的学历列数据、存款列数据和年龄列数据；reading the education column data, deposit column data and age column data corresponding to education, deposit and age from the data table; and
将所述学历列数据、存款列数据和年龄列数据分别发送至三个执行机，由所述三个执行机分别根据对应的列数据将所述各个样本划分至所述当前层中的各个节点。sending the education column data, deposit column data and age column data to three execution machines respectively, and the three execution machines dividing the samples into each node in the current layer according to the corresponding column data.

8.一种基于列存储的决策树构造装置，其特征在于，所述基于列存储的决策树构造装置包括：8. A decision tree construction apparatus based on column storage, characterized in that the apparatus comprises:
划分模块，用于根据当前层中各个节点的第一拆分特征读取对应的列数据，并根据所述列数据将各个样本划分至所述当前层中的各个节点；a division module, configured to read the corresponding column data according to the first splitting feature of each node in the current layer, and divide each sample into each node in the current layer according to the column data;
统计模块，用于获取所述各个节点中的分组样本，并统计所述分组样本中各个特征的特征分布情况；a statistics module, configured to obtain the grouped samples in the respective nodes, and count the feature distribution of each feature in the grouped samples;
生成模块，用于根据所述特征分布情况，生成下一层节点；a generation module, configured to generate the next layer of nodes according to the feature distribution; and
获得模块，用于对所述下一层节点进行分裂，直至收敛，获得决策树。an obtaining module, configured to split the next layer of nodes until convergence to obtain a decision tree.

9.一种基于列存储的决策树构造设备，其特征在于，所述基于列存储的决策树构造设备包括处理器，存储器以及存储在所述存储器中的基于列存储的决策树构造程序，所述基于列存储的决策树构造程序被所述处理器运行时，实现如权利要求1-7中任一项所述的基于列存储的决策树构造方法的步骤。9. A decision tree construction device based on column storage, characterized in that the device comprises a processor, a memory, and a column storage-based decision tree construction program stored in the memory; when the column storage-based decision tree construction program is run by the processor, the steps of the column storage-based decision tree construction method according to any one of claims 1-7 are implemented.

10.一种计算机存储介质，其特征在于，所述计算机存储介质上存储有基于列存储的决策树构造程序，所述基于列存储的决策树构造程序被处理器运行时实现如权利要求1-7中任一项所述基于列存储的决策树构造方法的步骤。10. A computer storage medium, characterized in that a column storage-based decision tree construction program is stored on the computer storage medium, and when the column storage-based decision tree construction program is run by a processor, the steps of the column storage-based decision tree construction method according to any one of claims 1-7 are implemented.
CN202010087103.1A 2020-02-11 2020-02-11 Decision tree construction method, device, equipment and storage medium based on column storage Active CN111275203B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010087103.1A CN111275203B (en) 2020-02-11 2020-02-11 Decision tree construction method, device, equipment and storage medium based on column storage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010087103.1A CN111275203B (en) 2020-02-11 2020-02-11 Decision tree construction method, device, equipment and storage medium based on column storage

Publications (2)

Publication Number Publication Date
CN111275203A true CN111275203A (en) 2020-06-12
CN111275203B CN111275203B (en) 2024-10-29

Family

ID=71003769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010087103.1A Active CN111275203B (en) 2020-02-11 2020-02-11 Decision tree construction method, device, equipment and storage medium based on column storage

Country Status (1)

Country Link
CN (1) CN111275203B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115378958A (en) * 2022-06-29 2022-11-22 马上消费金融股份有限公司 Data processing method, system, electronic device and computer readable storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609490A (en) * 2012-01-20 2012-07-25 东华大学 Column-storage-oriented B+ tree index method for DWMS (data warehouse management system)
US20150379426A1 (en) * 2014-06-30 2015-12-31 Amazon Technologies, Inc. Optimized decision tree based models
CN106250523A (en) * 2016-08-04 2016-12-21 北京国电通网络技术有限公司 A kind of method of distributed column storage system index
CN107895277A (en) * 2017-09-30 2018-04-10 平安科技(深圳)有限公司 Method, electronic installation and the medium of push loan advertisement in the application
CN109522957A (en) * 2018-11-16 2019-03-26 上海海事大学 The method of harbour gantry crane machine work status fault classification based on decision Tree algorithms
US20190251468A1 (en) * 2018-02-09 2019-08-15 Google Llc Systems and Methods for Distributed Generation of Decision Tree-Based Models
CN110728317A (en) * 2019-09-30 2020-01-24 腾讯科技(深圳)有限公司 Training method and system of decision tree model, storage medium and prediction method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DAS ARIYAM,: "Declarative Frameworks and Optimization Techniques for Developing Scalable Advanced Analytics over Databases and Data Streams", 《UNIVERSITY OF CALIFORNIA》, 31 December 2019 (2019-12-31), pages 1 - 24 *
孟祥福 等,: "基于改进决策树算法的Web数据库查询结果自动分类方法", 《计算机研究与发展》, vol. 49, no. 12, 31 December 2012 (2012-12-31), pages 2656 - 2670 *

Also Published As

Publication number Publication date
CN111275203B (en) 2024-10-29

Similar Documents

Publication Publication Date Title
CN115145906B (en) Preprocessing and completion method for structured data
US20220342875A1 (en) Data preparation context navigation
CN107622103B (en) Managing data queries
CN111125199B (en) Database access method and device and electronic equipment
US10268749B1 (en) Clustering sparse high dimensional data using sketches
US20120136824A1 (en) Systems and methods for generating interpolated input data sets using reducted input source objects
CN111368096A (en) Information analysis method, device, equipment and storage medium based on knowledge graph
CN113918605A (en) Data query method, device, equipment and computer storage medium
CN111581877A (en) Sample model training method, sample generation method, device, equipment and medium
CN118245574A (en) A document processing method, system, device and storage medium
CN118606438A (en) Data analysis method, device, computer equipment, readable storage medium and program product
US20220101190A1 (en) System and method of operationalizing automated feature engineering
CN111275203B (en) Decision tree construction method, device, equipment and storage medium based on column storage
CN112286995B (en) Data analysis method, device, server, system and storage medium
CN108776707A (en) For the efficient methods of sampling of exploration inquiry
CN119396859A (en) Large language model data analysis method, device, computer equipment and storage medium
CN118689996A (en) Method, device, computer equipment, readable storage medium and program product for responding to business search requests
CN118152374A (en) Index model establishment method, apparatus, device, storage medium and program product
CN113919318B (en) Enterprise index data processing method, device, computer equipment and storage medium
CN116737753A (en) Business data processing methods, devices, computer equipment and storage media
CN116611915A (en) Salary prediction method and device based on statistical reasoning
CN117251430A (en) Data table verification method, device, equipment, storage medium and program product
CN116431366A (en) Behavior path analysis method, system, storage end, server end and client end
CN107302451A (en) A kind of method of information communication operation active aid decision
CN110032445B (en) Big data aggregation calculation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant