CN112182982B

CN112182982B - Multiparty joint modeling method, device, equipment and storage medium

Info

Publication number: CN112182982B
Application number: CN202011165475.8A
Authority: CN
Inventors: 宋传园; 冯智; 吕亮亮
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2020-10-27
Filing date: 2020-10-27
Publication date: 2024-03-01
Anticipated expiration: 2040-10-27
Also published as: CN112182982A

Abstract

The disclosure provides a multiparty joint modeling method based on a distributed system, and relates to the fields of machine learning, safe computing and the like. The multiparty joint modeling method comprises the following steps: intersection is carried out on sample identifications included in each of the plurality of clusters to obtain intersection sample identifications and cluster sample data corresponding to the intersection sample identifications and included in each of the plurality of clusters, wherein the sample identifications and the cluster sample data included in each of the plurality of clusters are distributed and stored in a plurality of clients of the corresponding cluster; respectively carrying out barrel classification on cluster sample data of each cluster in the plurality of clusters to obtain cluster barrel classification data of each cluster in the plurality of clusters; constructing a global information gain histogram based on the sample tag and cluster barrel data of each of the plurality of clusters; and constructing a decision tree model based on the global information gain histogram.

Description

Multi-party joint modeling method, device, equipment and storage medium

技术领域Technical field

本公开涉及机器学习、安全计算等领域，更具体地，涉及一种多方联合建模方法、装置、设备及存储介质。The present disclosure relates to fields such as machine learning and secure computing, and more specifically, to a multi-party joint modeling method, device, equipment and storage medium.

背景技术Background technique

随着算法和大数据的发展，算法和算力已经不再是阻碍AI发展的瓶颈了。各个领域内真实有效的数据源才是最宝贵的资源。同时数据源之间存在着难以打破的壁垒，在大多数行业中，数据是以孤岛的形式存在的，由于行业竞争、隐私安全、行政手续复杂等问题，即使是在同一个公司的不同部门之间实现数据整合也面临着重重阻力，在现实中想要将分散在各地、各个机构的数据进行整合几乎是不可能的，或者说所需的成本是巨大的。With the development of algorithms and big data, algorithms and computing power are no longer bottlenecks hindering the development of AI. Real and effective data sources in various fields are the most valuable resources. At the same time, there are barriers between data sources that are difficult to break. In most industries, data exists in the form of islands. Due to industry competition, privacy security, complex administrative procedures and other issues, even among different departments of the same company, There are also many obstacles to realizing data integration among different places. In reality, it is almost impossible to integrate data scattered across various places and institutions, or the cost required is huge.

在此部分中描述的方法不一定是之前已经设想到或采用的方法。除非另有指明，否则不应假定此部分中描述的任何方法仅因其包括在此部分中就被认为是现有技术。类似地，除非另有指明，否则此部分中提及的问题不应认为在任何现有技术中已被公认。The approaches described in this section are not necessarily those that have been previously envisioned or employed. Unless otherwise indicated, it should not be assumed that any method described in this section is prior art merely by virtue of its inclusion in this section. Similarly, unless otherwise indicated, the issues mentioned in this section should not be considered to be recognized in any prior art.

发明内容Contents of the invention

根据本公开的一方面，提供了一种多方联合建模方法，包括：对多个集群中的每一个集群包括的样本标识求交，得到交集样本标识以及多个集群中的每一个集群包括的与交集样本标识对应的集群样本数据，其中，多个集群中的每一个集群包括的样本标识和集群样本数据均分布保存在该相应集群的多个客户端中；对多个集群中的每一个集群的集群样本数据分别进行分桶，以得到多个集群中的每一个集群的集群分桶数据；基于样本标签和多个集群中的每一个集群的集群分桶数据，构建全局信息增益直方图，其中，样本标签为每个样本的真实值，并且样本标签保存在多个集群中的一个特定集群；以及基于全局信息增益直方图，构建决策树模型。According to one aspect of the present disclosure, a multi-party joint modeling method is provided, including: intersecting sample identifiers included in each cluster in multiple clusters, and obtaining the intersection sample identifier and the identifier included in each cluster in the multiple clusters. Cluster sample data corresponding to the intersection sample ID, where the sample ID and cluster sample data included in each cluster in the multiple clusters are distributed and stored in multiple clients of the corresponding cluster; for each of the multiple clusters The cluster sample data of the cluster are bucketed separately to obtain the cluster bucket data of each cluster in multiple clusters; based on the sample label and the cluster bucket data of each cluster in the multiple clusters, a global information gain histogram is constructed , where the sample label is the true value of each sample, and the sample label is saved in a specific cluster among multiple clusters; and based on the global information gain histogram, a decision tree model is constructed.

根据本公开的另一方面，还提供了一种基于分布式系统的多方联合预测方法，包括：将预测样本输入决策树模型；针对决策树模型的每一棵子决策树，获取根节点的所属集群；与集群通信，获取根节点的特征；将预测样本的所述根节点的特征的特征数据发送至节点所属集群，获取子节点的所属集群；迭代上述过程，获取所述每一棵子决策树对预测样本的预测值；以及将每一棵子决策树对预测样本的预测值求和，得到预测样本的预测值。According to another aspect of the present disclosure, a multi-party joint prediction method based on a distributed system is also provided, including: inputting prediction samples into a decision tree model; for each sub-decision tree of the decision tree model, obtaining the cluster to which the root node belongs ; Communicate with the cluster to obtain the characteristics of the root node; send the characteristic data of the characteristics of the root node of the prediction sample to the cluster to which the node belongs, and obtain the cluster to which the child node belongs; iterate the above process to obtain each of the sub-decision tree pairs The predicted value of the predicted sample; and the predicted value of the predicted sample is summed by each sub-decision tree to obtain the predicted value of the predicted sample.

根据本公开的一方面，提供了一种基于分布式系统的多方联合建模装置，包括：求交模块，被配置用于将所述多个集群包括的数据进行求交，使所述多个集群中的每一个得到相应的集群样本数据；分桶模块，被配置用于对所述多个集群中每一个的集群样本数据分别进行分桶，得到所述多个集群中每一个的集群分桶数据；第一构建模块，被配置用于基于样本标签和所述多个集群分桶数据，构建全局信息增益直方图；以及第二构建模块，被配置用于基于所述全局信息增益直方图，构建决策树模型。According to an aspect of the present disclosure, a multi-party joint modeling device based on a distributed system is provided, including: an intersection module configured to intersect data included in the plurality of clusters, so that the plurality of Each of the clusters obtains corresponding cluster sample data; the bucketing module is configured to bucket the cluster sample data of each of the plurality of clusters to obtain a cluster score of each of the plurality of clusters. bucket data; a first building module configured to construct a global information gain histogram based on sample labels and the plurality of cluster bucket data; and a second building module configured to construct a global information gain histogram based on the global information gain histogram , build a decision tree model.

根据本公开的另一方面，还提供了一种电子设备，包括：处理器；以及存储程序的存储器，上述程序包括指令，并且指令在由处理器执行时使处理器执行根据上述的多方联合建模方法和/或根据上述的多方联合预测方法。According to another aspect of the present disclosure, an electronic device is also provided, including: a processor; and a memory storing a program. The program includes instructions, and when executed by the processor, the instructions cause the processor to perform the multi-party joint construction according to the above. modular method and/or a multi-party joint prediction method according to the above.

根据本公开的另一方面，还提供了一种存储程序的计算机可读存储介质，上述程序包括指令，并且指令在由电子设备的处理器执行时，致使电子设备执行根据上述的多方联合建模方法和/或根据上述的多方联合预测方法。According to another aspect of the present disclosure, a computer-readable storage medium storing a program is also provided. The program includes instructions, and when the instructions are executed by a processor of an electronic device, the instructions cause the electronic device to perform the multi-party joint modeling according to the above. method and/or based on the multi-party joint prediction method described above.

根据本公开的另一方面，提供了一种计算机程序产品，包括计算机程序，其中，所述计算机程序在被处理器执行时实现如上所述的多方联合建模方法和/或如上所述的多方联合预测方法。According to another aspect of the present disclosure, a computer program product is provided, including a computer program, wherein the computer program, when executed by a processor, implements the multi-party joint modeling method as described above and/or the multi-party joint modeling method as described above. Joint forecasting method.

本公开的技术方案通过对分布式数据进行分桶以及对分布式数据构建信息增益直方图，实现了基于分布式系统的多方联合建模方法，从而提升多方联合建模的速度，并且能够在数据量大的场景下完成建模。The technical solution of the present disclosure implements a multi-party joint modeling method based on distributed systems by bucketing distributed data and constructing an information gain histogram for distributed data, thereby improving the speed of multi-party joint modeling and enabling data processing Complete modeling in large-volume scenarios.

附图说明Description of drawings

附图示例性地示出了实施例并且构成说明书的一部分，与说明书的文字描述一起用于讲解实施例的示例性实施方式。所示出的实施例仅出于例示的目的，并不限制权利要求的范围。在所有附图中，相同的附图标记指代类似但不一定相同的要素。The drawings illustrate exemplary embodiments and constitute a part of the specification, and together with the written description, serve to explain exemplary implementations of the embodiments. The embodiments shown are for illustrative purposes only and do not limit the scope of the claims. Throughout the drawings, the same reference numbers refer to similar, but not necessarily identical, elements.

图1-图2是示出根据示例性实施例的多方联合建模方法的流程图；1-2 are flowcharts illustrating a multi-party joint modeling method according to an exemplary embodiment;

图3是示出根据示例性实施例的分布式系统的组成框图；Figure 3 is a block diagram illustrating a distributed system according to an exemplary embodiment;

图4是示出根据示例性实施例的对多个集群中的每一个集群的集群样本数据分别进行分桶的流程图；4 is a flowchart illustrating separately bucketing cluster sample data for each cluster in a plurality of clusters according to an exemplary embodiment;

图5是示出根据示例性实施例的分桶过程的示意图；Figure 5 is a schematic diagram illustrating a bucketing process according to an exemplary embodiment;

图6是示出根据示例性实施例的基于当前特征的特征数据生成至少一个数据桶的流程图；6 is a flowchart illustrating generation of at least one data bucket based on feature data of a current feature according to an exemplary embodiment;

图7是示出根据示例性实施例的构建全局信息增益直方图的流程图；7 is a flowchart illustrating construction of a global information gain histogram according to an exemplary embodiment;

图8是示出根据示例性实施例的构建第一信息增益直方图的流程图；8 is a flowchart illustrating construction of a first information gain histogram according to an exemplary embodiment;

图9是示出根据示例性实施例的得到待分裂节点的特征的第一信息增益子直方图或第一候选分裂增益的流程图；Figure 9 is a flowchart illustrating obtaining a first information gain sub-histogram or a first candidate split gain of features of a node to be split according to an exemplary embodiment;

图10是示出根据示例性实施例的构建第一信息增益子直方图的流程图；10 is a flowchart illustrating construction of a first information gain sub-histogram according to an exemplary embodiment;

图11是示出根据示例性实施例的多方联合预测方法的流程图；Figure 11 is a flowchart illustrating a multi-party joint prediction method according to an exemplary embodiment;

图12是示出根据示例性实施例的多方联合建模装置的组成框图；Figure 12 is a block diagram illustrating a multi-party joint modeling device according to an exemplary embodiment;

图13是示出能够应用于示例性实施例的示例性计算设备的结构框图。13 is a structural block diagram illustrating an exemplary computing device applicable to exemplary embodiments.

具体实施方式Detailed ways

在本公开中，除非另有说明，否则使用术语“第一”、“第二”等来描述各种要素不意图限定这些要素的位置关系、时序关系或重要性关系，这种术语只是用于将一个元件与另一元件区分开。在一些示例中，第一要素和第二要素可以指向该要素的同一实例，而在某些情况下，基于上下文的描述，它们也可以指代不同实例。In this disclosure, unless otherwise stated, the use of the terms “first”, “second”, etc. to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of these elements. Such terms are only used for Distinguish one element from another. In some examples, the first element and the second element may refer to the same instance of the element, and in some cases, based on contextual description, they may refer to different instances.

在本公开中对各种所述示例的描述中所使用的术语只是为了描述特定示例的目的，而并非旨在进行限制。除非上下文另外明确地表明，如果不特意限定要素的数量，则该要素可以是一个也可以是多个。此外，本公开中所使用的术语“和/或”涵盖所列出的项目中的任何一个以及全部可能的组合方式。The terminology used in the description of the various described examples in this disclosure is for the purpose of describing the particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, the element may be one or more. Furthermore, the term "and/or" as used in this disclosure encompasses any and all possible combinations of the listed items.

相关技术中，现有的多方联合建模方法速度较慢，尤其在数据量大的场景下，受限于设备性能、存储容量等因素，无法进行多方联合建模，从而在实际应用中具有很大的局限性。In related technologies, the existing multi-party joint modeling method is slow. Especially in scenarios with large amounts of data, multi-party joint modeling cannot be performed due to factors such as device performance and storage capacity, which makes it very difficult to implement in practical applications. big limitations.

为了解决上述技术问题，本公开提供一种基于分布式系统的多方联合建模方法：对多个集群间的样本标识求交，得到每个集群的集群样本数据；对每个集群的集群样本数据进行分桶，得到集群分桶数据；基于样本标签和每一个集群的集群分桶数据，构建全局信息增益直方图；基于全局信息增益直方图，构建决策树模型。由此，通过对分布式数据进行分桶以及对分布式数据构建信息增益直方图，实现了基于分布式系统的多方联合建模方法，从而提升多方联合建模的速度，并且能够在数据量大的场景下完成建模。In order to solve the above technical problems, the present disclosure provides a multi-party joint modeling method based on a distributed system: intersect the sample identifiers between multiple clusters to obtain the cluster sample data of each cluster; Perform bucketing to obtain cluster bucketed data; build a global information gain histogram based on sample labels and cluster bucketed data of each cluster; build a decision tree model based on the global information gain histogram. As a result, by bucketing distributed data and constructing information gain histograms for distributed data, a multi-party joint modeling method based on distributed systems is realized, thereby improving the speed of multi-party joint modeling and being able to handle large amounts of data. Complete modeling under the scenario.

以下将结合附图对本公开的多方联合建模方法进行进一步描述。The multi-party joint modeling method of the present disclosure will be further described below with reference to the accompanying drawings.

图1是示出根据本公开示例性实施例的基于分布式系统的多方联合建模方法的流程图。如图1所示，多方联合建模方法可以包括：步骤S101、对多个集群中的每一个集群包括的样本标识求交，得到交集样本标识以及多个集群中的每一个集群包括的与交集样本标识对应的集群样本数据；步骤S102、对多个集群中的每一个集群的集群样本数据分别进行分桶，以得到多个集群中的每一个集群的集群分桶数据；步骤S103、基于样本标签和多个集群中的每一个集群的集群分桶数据，构建全局信息增益直方图；以及步骤S104、基于全局信息增益直方图，构建决策树模型。由此，通过建立分布式系统，将集群的数据分布存放在集群的多个客户端，并使用客户端对分布式数据进行初步分桶以及构建信息增益子直方图，能够大幅提升多方联合建模的速度，使得模型能够支持在更丰富的场景下进行快速多方联合建模。FIG. 1 is a flowchart illustrating a multi-party joint modeling method based on a distributed system according to an exemplary embodiment of the present disclosure. As shown in Figure 1, the multi-party joint modeling method may include: step S101, intersect the sample identifiers included in each cluster in the multiple clusters, and obtain the intersection sample identifier and the intersection of the sample identifiers included in each cluster in the multiple clusters. Cluster sample data corresponding to the sample identifier; Step S102, bucket the cluster sample data of each cluster in the multiple clusters to obtain cluster bucket data of each cluster in the multiple clusters; Step S103, based on the sample Tags and cluster bucket data of each cluster in multiple clusters are used to construct a global information gain histogram; and step S104 is to construct a decision tree model based on the global information gain histogram. Therefore, by establishing a distributed system, distributing and storing the cluster's data in multiple clients of the cluster, and using the clients to preliminary bucket the distributed data and construct information gain sub-histograms, multi-party joint modeling can be greatly improved. The speed enables the model to support rapid multi-party joint modeling in richer scenarios.

根据一些实施例，分布式系统包括多个集群，每一个集群包括一个服务器和多个客户端。在一个示例性实施例中，如图3所示，分布式系统3000包括集群3100、集群3200和集群3300，每个集群内包括一个服务器和三个客户端，例如集群3100包括服务器3110和客户端3101，集群3200包括服务器3210和客户端3201，集群3300包括服务器3310和客户端3301。服务器的主要功能可以包括协调本集群内的多个客户端、指示客户端完成分桶和构建直方图等任务、整合客户端上传的信息、下发信息至客户端、完成一些计算功能以及与其他集群的服务器进行通信等。在一个示例性实施例中，集群间的通信可以使用Paillier密文进行加密。客户端的主要功能可以包括存储数据、接收服务器的指令完成分桶和构建直方图等任务、上传信息至服务器等。集群内的通信可以没有隐私需求。在一个示例性实施例中，客户端只与本集群的服务器通信。According to some embodiments, a distributed system includes multiple clusters, each cluster including a server and multiple clients. In an exemplary embodiment, as shown in Figure 3, distributed system 3000 includes cluster 3100, cluster 3200 and cluster 3300. Each cluster includes one server and three clients. For example, cluster 3100 includes server 3110 and clients. 3101. Cluster 3200 includes server 3210 and client 3201. Cluster 3300 includes server 3310 and client 3301. The main functions of the server can include coordinating multiple clients in the cluster, instructing clients to complete tasks such as bucketing and building histograms, integrating information uploaded by clients, sending information to clients, completing some computing functions, and cooperating with other clients. The servers in the cluster communicate, etc. In an exemplary embodiment, inter-cluster communications may be encrypted using Paillier ciphertext. The main functions of the client can include storing data, receiving instructions from the server to complete tasks such as bucketing and building histograms, uploading information to the server, etc. Communication within the cluster can occur without privacy requirements. In an exemplary embodiment, the client communicates only with servers of its own cluster.

每一个集群可以包括若干分布存放在集群的多个客户端的原始样本数据，样本数据包括样本标识。对每一个集群包括的全部样本标识求交，得到共有样本标识。每一个集群选取集群内的原始样本数据中与共有样本标识重合的样本作为该集群的集群样本数据。Each cluster may include a number of original sample data distributed across multiple clients in the cluster, and the sample data includes sample identifiers. Intersect all sample identifiers included in each cluster to obtain the common sample identifier. Each cluster selects the samples that coincide with the common sample identifier among the original sample data in the cluster as the cluster sample data of the cluster.

在一个示例性实施例中，步骤S101可以包括：每一个集群的服务器汇总本集群内的全部样本标识；基于OT安全协议实现集群间的样本标识求交，每一个集群得到相同的共有样本标识；每一个集群的服务器将共有样本标识发送至本集群的多个客户端，指示客户端将共有样本标识和客户端包括的原始样本数据的样本标识求交，得到每一个客户端的客户端样本数据。集群样本数据例如可以包括保存在相应的多个客户端的多个客户端的样本数据。In an exemplary embodiment, step S101 may include: the server of each cluster summarizes all sample identifiers in the cluster; implements the intersection of sample identifiers between clusters based on the OT security protocol, and each cluster obtains the same common sample identifier; The server of each cluster sends the common sample identifier to multiple clients in the cluster, instructing the client to intersect the common sample identifier with the sample identifier of the original sample data included in the client to obtain the client sample data of each client. The cluster sample data may include, for example, sample data stored in multiple clients corresponding to multiple clients.

根据一些实施例，集群样本数据和客户端样本数据均可以包括样本标识和至少一个特征，如图4所示，步骤S102、对多个集群中的每一个集群的集群样本数据分别进行分桶，以得到多个集群中的每一个集群的集群分桶数据可以包括：步骤S10201、针对多个集群中的每一个集群，遍历该集群的集群样本数据的至少一个特征；步骤S10202、基于当前特征的特征数据，生成至少一个数据桶；以及步骤S10203、整合所有特征对应的数据桶，得到该集群的集群分桶数据。由此，通过对集群样本数据进行分桶，能够减少需要计算的分裂点及其相应的信息增益的数量，从而大大提升建模的速度；同时，由于分桶会抹去桶内所有样本的特征数据，因此同样可以作为在隐私要求下的多方联合建模的基础。According to some embodiments, both the cluster sample data and the client sample data may include a sample identifier and at least one feature. As shown in Figure 4, step S102 is to bucket the cluster sample data of each cluster in multiple clusters. Obtaining the cluster bucket data of each cluster in multiple clusters may include: step S10201, for each cluster in multiple clusters, traverse at least one feature of the cluster sample data of the cluster; step S10202, based on the current features Feature data, generate at least one data bucket; and step S10203, integrate the data buckets corresponding to all features to obtain cluster bucket data of the cluster. Therefore, by bucketing the cluster sample data, the number of split points and their corresponding information gains that need to be calculated can be reduced, thereby greatly improving the speed of modeling; at the same time, because bucketing will erase the characteristics of all samples in the bucket The data can therefore also serve as the basis for multi-party joint modeling under privacy requirements.

分桶是基于特征信息对特征数据进行特征离散化处理的过程。根据一些实施例，数据桶可以包括样本标识、桶的值、所属客户端、所属特征中的至少一者。在一个示例性实施例中，如图5所示，数据501包括样本标识1-15和一个选定特征的特征数据。分桶过程例如可以包括：将选定特征的特征数据排序，生成排序后的特征数据502；基于预设分桶规则，将特征数据分为若干个数据桶5001，得到分桶数据503。每个数据桶的所属客户端可以包括被分到同一个数据桶的每一个特征数据所在客户端、每个数据桶的所述特征可以为分桶过程基于的特征、每个数据桶的样本标识可以包括分到被同一个数据桶的每一个特征数据对应的样本标识、每个数据桶的值例如可以是分到同一个数据桶的全部特征数据的平均值、中位数、最小值、最大值或者通过其他计算方法得到的值，在此不做限定。在一个示例性实施例中，如图5所示，每个数据桶5001的值可以是分到同一个数据桶的全部特征数据的中位数。Bucketing is the process of discretizing feature data based on feature information. According to some embodiments, the data bucket may include at least one of a sample identifier, a bucket value, an associated client, and an associated characteristic. In an exemplary embodiment, as shown in Figure 5, data 501 includes sample identifications 1-15 and characteristic data of a selected characteristic. The bucketing process may include, for example: sorting the feature data of the selected features to generate sorted feature data 502; and dividing the feature data into several data buckets 5001 based on preset bucketing rules to obtain bucketed data 503. The client to which each data bucket belongs may include the client where each characteristic data assigned to the same data bucket is located. The characteristics of each data bucket may be the characteristics based on the bucketing process, and the sample identifier of each data bucket. It can include the sample identifier corresponding to each feature data assigned to the same data bucket. The value of each data bucket can be, for example, the average, median, minimum value, and maximum of all the feature data assigned to the same data bucket. Values or values obtained through other calculation methods are not limited here. In an exemplary embodiment, as shown in Figure 5, the value of each data bucket 5001 may be the median of all feature data classified into the same data bucket.

根据一些实施例，待合并数据桶可以包括样本标识、桶的值、所属客户端、所属特征中的至少一者，如图6所示，步骤S10202、基于当前特征的特征数据，生成至少一个数据桶可以包括：步骤S602、判断当前特征的特征数据是否分布在同一个客户端；步骤S603、响应于当前特征的特征数据分布在多个客户端，指示多个客户端中每一个客户端对各自客户端样本数据包括的当前特征的特征数据进行分桶，生成当前特征的至少一个待合并数据桶，并将至少一个待合并数据桶上传至多个客户端对应的服务器；以及步骤S604、将接收到的多个客户端上传的至少一个待合并数据桶进行合并，生成至少一个数据桶。由此，在当前特征的特征数据分布在多个客户端的情况下，通过指示每个客户端对该客户端包括的样本数据进行预分桶，再由服务器将部分桶的值相同或相近的桶合并为同一个桶，实现了在当前特征的特征数据分布在多个客户端的情况下对分布式数据的分桶。相比于将多个客户端的当前特征的全部特征数据传送至服务器，由服务器对全部特征数据进行排序、分桶，上述方法能够显著加快分桶效率，从而大幅提升建模速度。According to some embodiments, the data bucket to be merged may include at least one of a sample identifier, a bucket value, an associated client, and an associated feature. As shown in Figure 6, step S10202: Generate at least one data based on the feature data of the current feature. The bucket may include: step S602, determining whether the feature data of the current feature is distributed on the same client; step S603, responding to the feature data of the current feature being distributed on multiple clients, indicating that each of the multiple clients is responsible for their respective The feature data of the current feature included in the client sample data is divided into buckets, generating at least one data bucket to be merged of the current feature, and uploading at least one data bucket to be merged to the servers corresponding to multiple clients; and step S604, receiving the Merge at least one data bucket to be merged uploaded by multiple clients to generate at least one data bucket. Therefore, when the feature data of the current feature is distributed among multiple clients, each client is instructed to pre-bucket the sample data included in the client, and then the server divides some of the buckets into buckets with the same or similar values. Merging into the same bucket realizes bucketing of distributed data when the feature data of the current feature is distributed on multiple clients. Compared with transmitting all the feature data of the current features of multiple clients to the server, and having the server sort and bucket all the feature data, the above method can significantly speed up the bucketing efficiency, thus greatly improving the modeling speed.

根据一些实施例，步骤S604、将接收到的多个客户端上传的至少一个待合并数据桶进行合并，生成至少一个数据桶例如可以包括：将当前特征的所有待合并数据桶按桶的值进行排序；将连续的多个桶的值相同或相近的一个或多个待合并数据桶合并为一个数据桶，合并后的数据桶包括的样本标签可以为被合并的一个或多个待合并数据桶包括的全部标签，合并后的数据桶所属客户端可以包括被合并的一个或多个待合并数据桶的所属客户端，合并后的数据桶所述特征可以为当前特征，合并后数据桶的值例如可以为被合并的一个或多个待合并数据桶的值的平均值、中位数、最小值、最大值或者通过其他计算方法得到的值，在此不做限定。According to some embodiments, step S604: Merge at least one data bucket to be merged received from multiple clients. Generating at least one data bucket may include, for example: merging all data buckets of the current feature to be merged according to the bucket value. Sorting; merge one or more data buckets to be merged with the same or similar values in consecutive buckets into one data bucket. The sample labels included in the merged data bucket can be the merged one or more data buckets to be merged. All tags included. The client to which the merged data bucket belongs may include the clients to which one or more merged data buckets belong. The characteristics of the merged data bucket may be the current characteristics. The value of the merged data bucket For example, it can be the average, median, minimum value, maximum value of the merged values of one or more data buckets to be merged, or a value obtained through other calculation methods, which is not limited here.

根据一些实施例，如图6所示，步骤S10202、基于当前特征的特征数据，生成至少一个数据桶可以包括：步骤S605、将合并而成的至少一个数据桶中的每一个数据桶发送至该数据桶的所属客户端。该所属客户端的客户端分桶数据包括被合并的至少一个数据桶。According to some embodiments, as shown in Figure 6, step S10202, generating at least one data bucket based on the feature data of the current feature may include: step S605, sending each data bucket in the merged at least one data bucket to the The client to which the data bucket belongs. The client bucket data of the client includes at least one merged data bucket.

根据一些实施例，如图6所示，步骤S10202、基于当前特征的特征数据，生成至少一个数据桶可以包括：步骤S606、响应于当前特征的特征数据分布在同一个客户端，指示同一个客户端对当前特征的特征数据进行分桶，生成至少一个数据桶，并将至少一个数据桶上传至与同一个客户端相应的服务器。图6中步骤S601、步骤S607分别与图4中步骤S10201、S10203类似。执行完步骤S605、步骤S606后，可以执行步骤S607。由此，在当前特征的全部特征数据分布在同一个客户端的情况下，直接将该客户端的分桶结果作为最终结果，从而可以减少由服务器进行分桶而带来的工作量，进而提升建模速度。According to some embodiments, as shown in Figure 6, step S10202, generating at least one data bucket based on the characteristic data of the current characteristic may include: step S606, responding to the characteristic data of the current characteristic being distributed on the same client, indicating that the same client The client buckets the feature data of the current feature, generates at least one data bucket, and uploads at least one data bucket to the server corresponding to the same client. Steps S601 and S607 in Figure 6 are respectively similar to steps S10201 and S10203 in Figure 4 . After executing steps S605 and S606, step S607 can be executed. Therefore, when all the feature data of the current feature is distributed on the same client, the bucketing result of the client is directly used as the final result, which can reduce the workload caused by bucketing by the server and thereby improve modeling. speed.

上述技术方案通过指示客户端对其客户端样本数据进行分桶，生成待合并数据桶或数据桶，再由服务器将待合并数据桶进行合并，并结合其他数据桶得到集群分桶数据和各个客户端的客户端分桶数据，实现了对分布式数据的快速分桶，从而能够大幅提升建模速度，同时使得模型能够支持更丰富的场景。The above technical solution instructs the client to bucket its client sample data to generate data buckets or data buckets to be merged, and then the server merges the data buckets to be merged and combines them with other data buckets to obtain cluster bucket data and each client The client-side bucketing data realizes fast bucketing of distributed data, which can greatly improve the modeling speed and enable the model to support richer scenarios.

根据一些实施例，多个集群包括第一集群和至少一个第二集群，如图2所示，多方联合建模方法还可以包括：步骤S202、针对第一集群的服务器，生成公钥、私钥对；以及步骤S203、将公钥与建模参数发送至至少一个第二集群中每一个第二集群的服务器。图2中的步骤S201、步骤S204与图1中的步骤S101、步骤S102类似。由此，通过使用同态加密，能够使得第二集群使用加密的数据进行运算得到的加密信息经第一集群并解密后得到的结果与未经加密的运算结果相同，从而能够作为在隐私要求下的多方联合建模的基础。According to some embodiments, the multiple clusters include a first cluster and at least one second cluster. As shown in Figure 2, the multi-party joint modeling method may also include: step S202, generating a public key and a private key for the server of the first cluster. Right; and step S203, sending the public key and modeling parameters to the server of each second cluster in at least one second cluster. Steps S201 and S204 in Figure 2 are similar to steps S101 and S102 in Figure 1 . Therefore, by using homomorphic encryption, the encrypted information obtained by the second cluster using the encrypted data can be decrypted by the first cluster and the result obtained will be the same as the unencrypted calculation result, which can be used as a data set under privacy requirements. The basis of multi-party joint modeling.

建模参数例如可以包括最大迭代次数、学习率、停止分裂条件、模型收敛条件等。建模参数对于参与建模的每一个集群都是公开的，不需要加密。Modeling parameters may include, for example, the maximum number of iterations, learning rate, stop splitting conditions, model convergence conditions, etc. Modeling parameters are public to every cluster participating in the modeling and do not need to be encrypted.

根据一些实施例，第一集群的集群样本数据还包括样本标签，如图7所示，步骤S103、基于样本标签和多个集群中的每一个集群的集群分桶数据，构建全局信息增益直方图包括：步骤S10301、获取当前模型对与第一集群的集群分桶数据的每一个样本标识对应的每一个样本的预测值；步骤S10302、基于预测值和样本标签，计算一阶梯度数据和二阶梯度数据；以及步骤S10303、基于一阶梯度数据、二阶梯度数据和多个集群中的每一个集群的集群分桶数据，构建全局信息增益直方图。由此，通过计算一阶梯度数据和二阶梯度数据，结合每个集群的集群特征数据，能够构建全局信息增益直方图，从而后续能够基于全局信息增益直方图确定最优分裂点，以构建决策树。同时，使用信息增益直方图能够减少需要计算的分裂点及其分裂阈值的数量，并且可以进行直方图差加速，从而提升建模速度。According to some embodiments, the cluster sample data of the first cluster also includes sample labels. As shown in Figure 7, step S103 is to construct a global information gain histogram based on the sample labels and the cluster bucket data of each cluster in the plurality of clusters. It includes: step S10301, obtaining the predicted value of each sample corresponding to each sample identifier of the cluster bucket data of the first cluster by the current model; step S10302, calculating the first-order gradient data and second-order gradient data based on the predicted value and sample label. degree data; and step S10303, construct a global information gain histogram based on the first-order gradient data, the second-order gradient data and the cluster bucket data of each cluster in the multiple clusters. Therefore, by calculating the first-order gradient data and second-order gradient data, combined with the cluster characteristic data of each cluster, a global information gain histogram can be constructed, so that the optimal split point can be subsequently determined based on the global information gain histogram to construct decisions. Tree. At the same time, using the information gain histogram can reduce the number of split points and their split thresholds that need to be calculated, and can accelerate the histogram difference, thereby improving the modeling speed.

根据一些实施例，当前模型包括一棵或多棵子决策树。本公开构建的决策树模型例如可以是梯度提升决策树模型，XGBoost模型，LightGBM模型，也可以是其他模型，在此不作限定。当前模型每个叶子结点包括至少一个样本标识，表明被分到这个节点的样本标识。当前模型包括的一棵或多棵子决策树的结构、所有节点的所属集群、所有叶子节点包括的样本标识可以对全部集群公开。样本预测值的计算方法可以是将每一个子决策树对样本的预测值求和，得到模型对样本的预测值。According to some embodiments, the current model includes one or more sub-decision trees. The decision tree model constructed by this disclosure can be, for example, a gradient boosting decision tree model, an XGBoost model, a LightGBM model, or other models, which are not limited here. Each leaf node of the current model includes at least one sample identifier, indicating the sample identifier assigned to this node. The structure of one or more sub-decision trees included in the current model, the clusters to which all nodes belong, and the sample identifiers included in all leaf nodes can be disclosed to all clusters. The calculation method of the sample prediction value can be to sum the prediction value of each sub-decision tree for the sample to obtain the model's prediction value for the sample.

一阶梯度数据和二阶梯度数据可以是求出模型设定的目标函数的一阶梯度和二阶梯度，将样本的预测值和样本标签带入一阶梯度和二阶梯度，从而得到样本的一阶梯度数据和二阶梯度数据。The first-order gradient data and the second-order gradient data can be used to find the first-order gradient and the second-order gradient of the objective function set by the model, and bring the predicted value and sample label of the sample into the first-order gradient and the second-order gradient, thereby obtaining the sample's First-order gradient data and second-order gradient data.

信息增益直方图可以包括多个与数据桶一一对应的直方图桶，每一个直方图桶可以用于表示对应的数据桶的信息增益。每一个直方图桶中包括对应的数据桶的全部样本的一阶梯度和、二阶梯度和以及数据桶包括的样本数量。The information gain histogram may include multiple histogram buckets that correspond one-to-one to the data buckets, and each histogram bucket may be used to represent the information gain of the corresponding data bucket. Each histogram bucket includes the first-order gradient sum, the second-order gradient sum of all samples in the corresponding data bucket, and the number of samples included in the data bucket.

根据一些实施例，如图2所示，步骤S10303、基于一阶梯度数据、二阶梯度数据和多个集群中的每一个集群的集群分桶数据，构建全局信息增益直方图包括：步骤S207、将一阶梯度数据和二阶梯度数据加密后发送至至少一个第二集群中的每一个第二集群的服务器；步骤S208、获取当前模型的至少一个待分裂节点，待分裂节点包括至少一个样本标识；步骤S209、基于一阶梯度数据、二阶梯度数据、第一集群的集群分桶数据和至少一个待分裂节点，构建第一信息增益直方图；步骤S211、接收来自至少一个第二集群中每一个的服务器的至少一个密文信息增益直方图；步骤S212、将至少一个密文信息增益直方图解密，得到与至少一个密文信息增益直方图一一对应的至少一个第二信息增益直方图；以及步骤S213、将第一信息增益直方图和至少一个第二信息增益直方图合并，得到全局信息增益直方图。图2中的步骤S205-步骤S206分别与图7中的步骤S10301-步骤S10302类似。由此，通过将加密后的梯度发送至至少一个第二集群，接收并解密来自至少一个第二集群发送的密文信息增益直方图，使得双方可以在不获取对方的梯度数据和样本数据的情况下得到相应的至少一个第二信息增益直方图。结合第一集群的第一信息增益直方图和至少一个第二信息增益直方图，实现在隐私要求下构建全局信息增益直方图。According to some embodiments, as shown in Figure 2, step S10303, building a global information gain histogram based on the first-order gradient data, the second-order gradient data and the cluster bucket data of each cluster in the plurality of clusters includes: step S207. The first-order gradient data and the second-order gradient data are encrypted and sent to the server of each second cluster in at least one second cluster; step S208, obtain at least one node to be split of the current model, and the node to be split includes at least one sample identifier. ; Step S209, construct a first information gain histogram based on the first-order gradient data, the second-order gradient data, the cluster bucket data of the first cluster and at least one node to be split; Step S211, receive each node from at least one second cluster At least one ciphertext information gain histogram of a server; step S212, decrypt at least one ciphertext information gain histogram to obtain at least one second information gain histogram corresponding to at least one ciphertext information gain histogram; And step S213, combine the first information gain histogram and at least one second information gain histogram to obtain a global information gain histogram. Steps S205 to S206 in Figure 2 are respectively similar to steps S10301 to S10302 in Figure 7 . Thus, by sending the encrypted gradient to at least one second cluster, receiving and decrypting the ciphertext information gain histogram sent from at least one second cluster, the two parties can obtain the other party's gradient data and sample data without Corresponding at least one second information gain histogram is obtained. Combining the first information gain histogram of the first cluster and at least one second information gain histogram, a global information gain histogram is constructed under privacy requirements.

待分裂节点例如可以是最后一棵子决策树的满足允许分裂条件的部分叶子结点。允许分裂条件例如可以是叶子结点包括的样本标识的数量小于预设值、叶子结点的深度小于预设深度等，在此不作限定。作为一种叶子结点，待分裂节点可以包括多个样本标识，表示被模型分到这个节点的样本。The node to be split may be, for example, some of the leaf nodes of the last sub-decision tree that meet the conditions for allowing splitting. The conditions for allowing splitting may be, for example, that the number of sample identifiers included in a leaf node is less than a preset value, the depth of a leaf node is less than a preset depth, etc., which are not limited here. As a leaf node, the node to be split can include multiple sample identifiers, indicating the samples assigned to this node by the model.

根据一些实施例，如图8所示，步骤S209、基于一阶梯度数据、二阶梯度数据、第一集群的集群分桶数据和至少一个待分裂节点，构建第一信息增益直方图包括：步骤S20901、针对至少一个待分裂节点中的每一个待分裂节点，遍历第一集群的集群分桶数据的至少一个特征；步骤S20902、基于该待分裂节点和当前特征的特征数据，得到该待分裂节点的当前特征的第一信息增益子直方图或第一候选分裂增益；以及步骤S20903、将每一个待分裂节点的第一集群的集群分桶数据的至少一个特征中每一个特征的第一信息增益子直方图或第一候选分裂增益合并，得到第一信息增益直方图。由此，对每一个待分裂节点与每一个第一集群的特征构建第一信息增益子直方图或第一候选分裂增益，能够得到第一信息增益直方图，从而后续能够得到全局信息增益直方图以构建决策树。According to some embodiments, as shown in Figure 8, step S209, building the first information gain histogram based on the first-order gradient data, the second-order gradient data, the cluster bucket data of the first cluster and at least one node to be split includes: steps S20901. For each node to be split in at least one node to be split, traverse at least one feature of the cluster bucket data of the first cluster; Step S20902. Based on the feature data of the node to be split and the current feature, obtain the node to be split. The first information gain sub-histogram or the first candidate split gain of the current feature; and step S20903, the first information gain of each feature in at least one feature of the cluster bucket data of the first cluster of each node to be split The sub-histograms or first candidate split gains are combined to obtain the first information gain histogram. Therefore, by constructing the first information gain sub-histogram or the first candidate split gain for each node to be split and the characteristics of each first cluster, the first information gain histogram can be obtained, and the global information gain histogram can subsequently be obtained. to build a decision tree.

根据一些实施例，如图9所示，步骤S20902、基于该待分裂节点和当前特征的特征数据，得到待分裂节点的当前特征的第一信息增益子直方图或第一候选分裂增益：步骤S902、判断当前特征的特征数据是否分布在同一个客户端；步骤S903、响应于当前特征的特征数据分布在多个客户端，指示多个客户端中的每一个客户端基于一阶梯度数据、二阶梯度数据、该客户端的客户端分桶数据和该待分裂节点构建待合并第一信息增益子直方图，并将待合并第一信息增益子直方图上传至多个客户端对应的服务器；以及步骤S904、将接收到的多个客户端上传的多个待合并第一信息增益子直方图合并，构建第一信息增益子直方图。由此，当前特征的特征数据分布在多个客户端的情况下，通过指示每个客户端生成待合并第一信息增益子直方图，再由服务器将待合并子直方图合并，实现了在基于当前待分裂节点的第一集群的当前特征的特征数据分布在多个客户端的情况下对当前待分裂节点与当前特征构建第一信息增益子直方图。相比于由第一集群的服务器直接构建第一信息增益子直方图，上述方法能够使得多个客户端并行构建子直方图，从而提升了建模速度。According to some embodiments, as shown in Figure 9, step S20902: Based on the feature data of the node to be split and the current feature, obtain the first information gain sub-histogram or the first candidate split gain of the current feature of the node to be split: step S902 , determine whether the feature data of the current feature is distributed on the same client; step S903. In response to the feature data of the current feature being distributed on multiple clients, instruct each of the multiple clients to base on the first-level gradient data, the second-level gradient data, and the second-level gradient data. The step gradient data, the client's client bucket data and the node to be split construct a first information gain sub-histogram to be merged, and upload the first information gain sub-histogram to be merged to the servers corresponding to multiple clients; and steps S904. Combine multiple first information gain sub-histograms to be merged received from multiple clients to construct a first information gain sub-histogram. Therefore, when the feature data of the current feature is distributed among multiple clients, by instructing each client to generate the first information gain sub-histogram to be merged, and then the server merges the sub-histograms to be merged, it is realized based on the current When the feature data of the current feature of the first cluster of the node to be split is distributed among multiple clients, a first information gain sub-histogram is constructed for the current node to be split and the current feature. Compared with directly constructing the first information gain sub-histogram by the server of the first cluster, the above method enables multiple clients to construct the sub-histogram in parallel, thus improving the modeling speed.

根据一些实施例，第一信息增益子直方图包括至少一个直方图桶，至少一个直方图桶与所有所属特征为特征的数据桶一一对应，待合并第一信息增益子直方图包括至少一个待合并直方图桶，至少一个待合并直方图桶与所有所属特征为特征的数据桶一一对应，直方图桶和待合并直方图桶均包括一阶梯度和、二阶梯度和中的至少一者，如图10所示，步骤S904、将接收到的多个客户端上传的多个待合并第一信息增益子直方图合并，构建第一信息增益子直方图包括：步骤S90401、将接收到的多个客户端上传的多个待合并第一信息增益子直方图中的至少一个待合并直方图桶进行合并，生成至少一个直方图桶；以及步骤S90402、基于至少一个直方图桶，构建第一信息增益子直方图。由此，通过将与同一个数据桶对应的待合并直方图桶合并为一个直方图桶，能够构建第一信息增益子直方图，实现了分布式构建第一信息增益子直方图，从而能够后续与其他特征的第一信息增益子直方图合并，得到第一信息增益直方图。According to some embodiments, the first information gain sub-histogram includes at least one histogram bucket, and the at least one histogram bucket corresponds one-to-one to all data buckets whose features are characteristics, and the first information gain sub-histogram to be merged includes at least one to be merged. Merging histogram buckets, at least one histogram bucket to be merged corresponds one-to-one with all data buckets whose features are features, and both the histogram buckets and the histogram buckets to be merged include at least one of the first-order gradient sum and the second-order gradient sum. , as shown in Figure 10, step S904, merge the multiple first information gain sub-histograms to be merged received from multiple clients, and constructing the first information gain sub-histogram includes: step S90401, combine the received Merge at least one histogram bucket to be merged in multiple first information gain sub-histograms uploaded by multiple clients to generate at least one histogram bucket; and step S90402, build the first histogram bucket based on at least one histogram bucket. Information gain sub-histogram. Therefore, by merging the histogram buckets to be merged corresponding to the same data bucket into one histogram bucket, the first information gain sub-histogram can be constructed, realizing the distributed construction of the first information gain sub-histogram, thereby enabling subsequent Combined with the first information gain sub-histograms of other features, a first information gain histogram is obtained.

直方图桶与待合并直方图桶均可以包括一阶梯度和、二阶梯度和以及直方图桶或待合并直方图桶对应的数据桶包括的样本标识与当前待分裂节点包括的样本标识的交集部分的样本标识数量。Both the histogram bucket and the histogram bucket to be merged may include the first-order gradient sum, the second-order gradient sum, and the intersection of the sample identifier included in the histogram bucket or the data bucket corresponding to the histogram bucket to be merged and the sample identifier included in the current node to be split. The number of sample identifiers for the part.

根据一些实施例，步骤S90401、将接收到的多个客户端上传的多个待合并第一信息增益子直方图中的至少一个待合并直方图桶进行合并，生成至少一个直方图桶可以包括：将与每一个当前特征的数据桶对应的一个或多个待合并直方图桶合并，得到与上述数据桶对应的直方图桶。该直方图桶的一阶梯度和可以为被合并的一个或多个待合并直方图桶的一阶梯度和的总和，该直方图桶的二阶梯度和可以为被合并的一个或多个待合并直方图桶的二阶梯度和的总和，该直方图桶的样本标识数量可以为被合并的一个或多个待合并直方图桶的样本标识数量的总和。According to some embodiments, step S90401 is to merge at least one histogram bucket to be merged among the first information gain sub-histograms to be merged received from multiple clients. Generating at least one histogram bucket may include: Merge one or more histogram buckets to be merged corresponding to the data bucket of each current feature to obtain a histogram bucket corresponding to the above data bucket. The first-order gradient sum of the histogram bucket may be the sum of the first-order gradient sums of one or more histogram buckets to be merged, and the second-order gradient sum of the histogram bucket may be the sum of the first-order gradients of the merged one or more histogram buckets to be merged. The sum of the second-order gradient sums of the merged histogram buckets, and the number of sample identifiers of the histogram bucket may be the sum of the number of sample identifiers of one or more merged histogram buckets to be merged.

根据一些实施例，如图9所示，步骤S20902、基于该待分裂节点和当前特征的特征数据，得到待分裂节点的当前特征的第一信息增益子直方图或第一候选分裂增益可包括：步骤S905、响应于当前特征的全部特征数据分布在同一客户端，指示同一客户端基于一阶梯度数据、二阶梯度数据、同一客户端的客户端分桶数据和待分裂节点构建第一信息增益子直方图，基于第一信息增益子直方图计算第一候选分裂增益，并将第一候选分裂增益上传至第一集群的服务器。图9中的步骤S901、步骤S906分别与图8中的步骤S20901、步骤S20903类似。在执行完步骤S904、步骤S905后，可以执行步骤S906。由此，在当前特征的全部特征数据分布在同一客户端的情况下，直接构建当前待分裂节点的当前特征的第一信息增益子直方图，并基于第一信息增益子直方图计算第一候选分裂增益，从而可以减少由服务器构建直方图而带来的工作量，进而提升建模速度。According to some embodiments, as shown in Figure 9, step S20902, based on the feature data of the node to be split and the current feature, obtaining the first information gain sub-histogram or the first candidate split gain of the current feature of the node to be split may include: Step S905: In response to all feature data of the current feature being distributed on the same client, instruct the same client to construct the first information gain sub-system based on the first-order gradient data, the second-order gradient data, the client bucket data of the same client and the node to be split. histogram, calculate the first candidate split gain based on the first information gain sub-histogram, and upload the first candidate split gain to the server of the first cluster. Steps S901 and S906 in FIG. 9 are respectively similar to steps S20901 and S20903 in FIG. 8 . After executing steps S904 and S905, step S906 can be executed. Therefore, when all feature data of the current feature is distributed on the same client, the first information gain sub-histogram of the current feature of the current node to be split is directly constructed, and the first candidate split is calculated based on the first information gain sub-histogram. Gain, thereby reducing the workload caused by building histograms on the server, thereby increasing modeling speed.

第一候选分裂增益可以是当前待分裂节点的当前特征的最大增益，可以通过计算第一信息增益子直方图的每一个直方图桶对应的分裂增益，并选取其中的最大值而得到。直方图桶对应的分裂增益可以通过如下方法计算得到：在当前待分裂节点与当前特征下，获取全部直方图桶的一阶梯度数据总和、二阶梯度数据总和以及样本标识数量总和，这三个总和在之前的分裂增益计算过程中已经得到，称为父节点信息增益原始数据；计算比直方图桶对应的数据桶的值小的所有数据桶对应的直方图桶的上述三个总和，称为左子节点信息增益原始数据；基于父节点信息增益原始数据与左子节点信息增益原始数据，通过作差可以得到右子节点信息增益原始数据；针对三个节点，分别基于信息增益原始数据计算信息增益；以及，使用左子节点信息增益加右子节点信息增益减父节点信息增益，得到分裂增益。信息增益的计算方法例如可以是一阶梯度总和的平方除以二阶梯度总和和调和参数的和，可以是一阶梯度总和的平方除以样本标识数量总和，也可以是其他计算方法，在此不作限定。The first candidate splitting gain may be the maximum gain of the current feature of the current node to be split, which can be obtained by calculating the splitting gain corresponding to each histogram bucket of the first information gain sub-histogram and selecting the maximum value among them. The splitting gain corresponding to the histogram bucket can be calculated as follows: under the current node to be split and the current feature, obtain the sum of first-order gradient data, the sum of second-order gradient data and the sum of the number of sample identifiers for all histogram buckets. The sum has been obtained during the previous split gain calculation process and is called the parent node information gain original data; calculate the above three sums of the histogram buckets corresponding to all data buckets that are smaller than the value of the data bucket corresponding to the histogram bucket, which is called The original information gain data of the left child node; based on the original information gain data of the parent node and the original information gain data of the left child node, the original information gain data of the right child node can be obtained by making a difference; for the three nodes, the information is calculated based on the original information gain data. Gain; and, use the information gain of the left child node plus the information gain of the right child node minus the information gain of the parent node to obtain the split gain. The calculation method of the information gain can be, for example, the square of the sum of the first-order gradients divided by the sum of the second-order gradients and the harmonic parameter, the square of the sum of the first-order gradients divided by the sum of the number of sample identifiers, or other calculation methods. Here Not limited.

上述技术方案通过指示客户端构建待合并第一信息增益子直方图或直接计算第一候选分裂增益，再由服务器将待合并第一信息增益子直方图合并，并结合第一候选分裂增益得到第一信息增益直方图，实现了快速构建第一信息增益直方图，从而能够提升建模速度，同时使得模型能够支持更丰富的场景。The above technical solution instructs the client to construct the first information gain sub-histogram to be merged or directly calculate the first candidate split gain, and then the server merges the first information gain sub-histogram to be merged, and combines it with the first candidate split gain to obtain the first split gain. An information gain histogram enables the rapid construction of the first information gain histogram, thereby improving the modeling speed and enabling the model to support richer scenarios.

根据一些实施例，如图2所示，步骤S10303、基于一阶梯度数据、二阶梯度数据和多个集群中的每一个集群的集群分桶数据，构建全局信息增益直方图可包括：步骤S210、基于密文一阶梯度数据、密文二阶梯度数据、第二集群的集群分桶数据和至少一个待分裂节点构建密文信息增益直方图。在一个示例性实施例中，可以采用类似上述步骤S901-步骤S906的方法，例如可以分别使用密文一阶梯度数据和密文二阶梯度数据代替一阶梯度数据和二阶梯度数据，得到相应的密文信息增益直方图。According to some embodiments, as shown in Figure 2, step S10303, building a global information gain histogram based on the first-order gradient data, the second-order gradient data and the cluster bucket data of each cluster in the plurality of clusters may include: step S210 , construct a ciphertext information gain histogram based on the ciphertext first-order gradient data, the ciphertext second-order gradient data, the cluster bucket data of the second cluster and at least one node to be split. In an exemplary embodiment, a method similar to the above steps S901 to S906 may be adopted. For example, the first-order ciphertext gradient data and the second-order ciphertext gradient data may be used respectively to replace the first-order gradient data and the second-order gradient data to obtain the corresponding The ciphertext information gain histogram.

根据一些实施例，如图2所示，步骤S104、基于全局信息增益直方图，构建决策树模型包括：步骤S214、基于全局新信息增益直方图，确定最优分裂点；步骤S215、指示最优分裂点所在的客户端分裂最优分裂点；步骤S216、对分裂过程进行迭代，直至达到终止分裂条件，生成子决策树；以及步骤S217、对子决策树生成过程进行迭代，直到达到终止迭代条件，得到决策树模型。由此，通过确定并分裂最优分裂点，得到新的叶子结点，并通过重复迭代上述步骤，得到决策树模型。According to some embodiments, as shown in Figure 2, step S104, building a decision tree model based on the global information gain histogram includes: step S214, determining the optimal split point based on the global new information gain histogram; step S215, indicating the optimal The client where the split point is located splits the optimal split point; step S216, iterates the splitting process until the terminating splitting condition is reached, and generates a sub-decision tree; and step S217, iterates the sub-decision tree generation process until the terminating iteration condition is reached. , get the decision tree model. From this, by determining and splitting the optimal split point, new leaf nodes are obtained, and by repeating the above steps, the decision tree model is obtained.

根据一些实施例，步骤S215、指示最优分裂点所在的客户端分裂最优分裂点包括：指示客户端基于最优分裂点和客户端的客户端分桶数据计算分裂阈值，得到分裂出的叶子节点包括的样本标识，并将叶子节点上传至服务器；以及将最优分裂点所在节点的所属集群和叶子节点同步到除本集群外的多个集群。由此，通过分裂最优分裂点，得到新的叶子结点，并将最优分裂点所在节点的所属集群可新的叶子节点同步到每一个集群，实现集群间模型的共享。According to some embodiments, step S215, instructing the client to split the optimal split point where the optimal split point is located includes: instructing the client to calculate the split threshold based on the optimal split point and the client's client bucket data to obtain the split leaf node. Include the sample identification and upload the leaf nodes to the server; and synchronize the cluster and leaf nodes of the node where the optimal split point is located to multiple clusters except this cluster. Therefore, by splitting the optimal split point, a new leaf node is obtained, and the new leaf node of the cluster to which the node where the optimal split point belongs is synchronized to each cluster to realize the sharing of models between clusters.

最优分裂点可以是全部待分裂节点与全部特征中的分裂增益最大的直方图桶。计算分裂阈值可以基于如下方法计算得到：对最优分裂点对应的数据桶的样本标识与最优分裂点对应的待分裂节点的样本标识的交集部分对应的特征数据进行排序；以排序后的每两个相邻样本的特征数据的平均作为分裂阈值，计算分裂增益；选取分裂增益最大的分裂阈值，作为最优分裂点的分裂阈值。每个节点的所属集群是公开的，而分裂阈值保存在其所属集群，从而在预测阶段，可以通过多方共享的模型中包括的节点的所属集群找到相应的集群，再将数据发送至该集群，获取下一个节点，直至得到最终的预测值。The optimal split point can be the histogram bucket with the largest split gain among all nodes to be split and all features. The split threshold can be calculated based on the following method: sort the feature data corresponding to the intersection of the sample identifier of the data bucket corresponding to the optimal split point and the sample identifier of the node to be split corresponding to the optimal split point; The average of the characteristic data of two adjacent samples is used as the split threshold to calculate the split gain; the split threshold with the largest split gain is selected as the split threshold of the optimal split point. The cluster to which each node belongs is public, and the split threshold is saved in the cluster to which it belongs. Therefore, in the prediction phase, the corresponding cluster can be found through the cluster to which the node included in the multi-party shared model belongs, and then the data is sent to the cluster. Get the next node until the final predicted value is obtained.

根据本公开的另一方面，还提供一种基于分布式系统的多方联合预测方法，如图11所示，多方联合预测方法可以包括：步骤S1101、将预测样本输入决策树模型；步骤S1102、针对决策树模型的每一棵子决策树，获取根节点的所属集群；步骤S1103、与该所属集群通信，获取根节点的特征；步骤S1104、将预测样本的根节点的特征的特征数据发送至该所属集群，获取子节点的所属集群；步骤S1105、迭代上述过程，获取每一棵子决策树对预测样本的预测值；以及步骤S1106、将每一棵子决策树对预测样本的预测值求和，得到预测样本的预测值。由此，通过不断与当前子决策树的当前节点的所属集群通信并返回下一个节点，能够计算出每一个子决策树对样本的预测值，从而得到模型对样本的预测值。由于决策树模型由多个集群共享而分裂阈值仅保存在节点所属集群，因此模型的完整内容对任一集群都是不公开的，从而需要结合每一个集群保存在集群内的信息实现对样本的预测，实现了支持隐私场景的多方联合建模。According to another aspect of the present disclosure, a multi-party joint prediction method based on a distributed system is also provided. As shown in Figure 11, the multi-party joint prediction method may include: step S1101, input the prediction sample into the decision tree model; step S1102, for For each sub-decision tree of the decision tree model, obtain the cluster to which the root node belongs; Step S1103, communicate with the cluster to which it belongs, and obtain the characteristics of the root node; Step S1104, send the characteristic data of the characteristics of the root node of the predicted sample to the cluster to which it belongs. Cluster, obtain the cluster to which the child node belongs; step S1105, iterate the above process to obtain the predicted value of each sub-decision tree for the prediction sample; and step S1106, sum the predicted value of each sub-decision tree for the prediction sample to obtain the prediction The predicted value of the sample. Therefore, by continuously communicating with the cluster to which the current node of the current sub-decision tree belongs and returning to the next node, the predicted value of each sub-decision tree for the sample can be calculated, thereby obtaining the predicted value of the sample by the model. Since the decision tree model is shared by multiple clusters and the split threshold is only saved in the cluster to which the node belongs, the complete content of the model is not public to any cluster. Therefore, it is necessary to combine the information saved in the cluster for each cluster to implement sample analysis. Prediction, realizing multi-party joint modeling that supports privacy scenarios.

根据本公开的另一方面，还提供一种多方联合建模装置。如图12所示，多方联合建模装置1200可以包括：求交模块1201，被配置用于对多个集群中的每一个集群包括的样本标识求交，得到交集样本标识以及多个集群中的每一个集群包括的与交集样本标识对应的集群样本数据；分桶模块1202，被配置用于对多个集群中每一个的集群样本数据分别进行分桶，得到多个集群中每一个的集群分桶数据；第一构建模块1203，被配置用于基于样本标签和多个集群分桶数据，构建全局信息增益直方图；以及第二构建模块1204，被配置用于基于全局信息增益直方图，构建决策树模型。According to another aspect of the present disclosure, a multi-party joint modeling device is also provided. As shown in Figure 12, the multi-party joint modeling device 1200 may include: an intersection module 1201, configured to intersect the sample identifiers included in each cluster in multiple clusters, and obtain the intersection sample identifier and the sample identifiers in the multiple clusters. Each cluster includes cluster sample data corresponding to the intersection sample identifier; the bucketing module 1202 is configured to bucket the cluster sample data of each of the multiple clusters to obtain the cluster score of each of the multiple clusters. bucket data; the first building module 1203 is configured to construct a global information gain histogram based on sample labels and multiple cluster bucket data; and the second building module 1204 is configured to construct a global information gain histogram based on the global information gain histogram. Decision tree model.

根据本公开的另一方面，还提供一种电子设备，可以包括：处理器；以及存储程序的存储器，所述程序包括指令，所述指令在由所述处理器执行时使所述处理器执行根据上述的多方联合建模方法。According to another aspect of the present disclosure, there is also provided an electronic device, which may include: a processor; and a memory storing a program, the program including instructions that, when executed by the processor, cause the processor to execute According to the above-mentioned multi-party joint modeling method.

根据本公开的另一方面，还提供一种存储程序的计算机可读存储介质，所述程序包括指令，所述指令在由电子设备的处理器执行时，致使所述电子设备执行根据上述的多方联合建模方法。According to another aspect of the present disclosure, there is also provided a computer-readable storage medium storing a program, the program including instructions that, when executed by a processor of an electronic device, cause the electronic device to execute the method according to the above-mentioned methods. Joint modeling approach.

参见图13所示，现将描述计算设备13000，其是可以应用于本公开的各方面的硬件设备(电子设备)的示例。计算设备13000可以是被配置为执行处理和/或计算的任何机器，可以是但不限于工作站、服务器、台式计算机、膝上型计算机、平板计算机、个人数字助理、机器人、智能电话、车载计算机或其任何组合。上述多方联合建模方法可以全部或至少部分地由计算设备13000或类似设备或系统实现。Referring to FIG. 13 , a computing device 13000 will now be described, which is an example of a hardware device (electronic device) to which aspects of the present disclosure may be applied. Computing device 13000 may be any machine configured to perform processing and/or calculations, and may be, but is not limited to, a workstation, a server, a desktop computer, a laptop, a tablet, a personal digital assistant, a robot, a smartphone, a vehicle-mounted computer, or any combination thereof. The above-mentioned multi-party joint modeling method may be implemented in whole or at least in part by the computing device 13000 or a similar device or system.

计算设备13000可以包括(可能经由一个或多个接口)与总线13002连接或与总线13002通信的元件。例如，计算设备13000可以包括总线13002、一个或多个处理器13004、一个或多个输入设备13006以及一个或多个输出设备13008。一个或多个处理器13004可以是任何类型的处理器，并且可以包括但不限于一个或多个通用处理器和/或一个或多个专用处理器(例如特殊处理芯片)。输入设备13006可以是能向计算设备13000输入信息的任何类型的设备，并且可以包括但不限于鼠标、键盘、触摸屏、麦克风和/或遥控器。输出设备13008可以是能呈现信息的任何类型的设备，并且可以包括但不限于显示器、扬声器、视频/音频输出终端、振动器和/或打印机。计算设备13000还可以包括非暂时性存储设备13010或者与非暂时性存储设备13010连接，非暂时性存储设备可以是非暂时性的并且可以实现数据存储的任何存储设备，并且可以包括但不限于磁盘驱动器、光学存储设备、固态存储器、软盘、柔性盘、硬盘、磁带或任何其他磁介质，光盘或任何其他光学介质、ROM(只读存储器)、RAM(随机存取存储器)、高速缓冲存储器和/或任何其他存储器芯片或盒、和/或计算机可从其读取数据、指令和/或代码的任何其他介质。非暂时性存储设备13010可以从接口拆卸。非暂时性存储设备13010可以具有用于实现上述方法和步骤的数据/程序(包括指令)/代码。计算设备13000还可以包括通信设备13012。通信设备13012可以是使得能够与外部设备和/或与网络通信的任何类型的设备或系统，并且可以包括但不限于调制解调器、网卡、红外通信设备、无线通信设备和/或芯片组，例如蓝牙TM设备、1302.11设备、WiFi设备、WiMax设备、蜂窝通信设备和/或类似物。Computing device 13000 may include elements coupled to or in communication with bus 13002 (perhaps via one or more interfaces). For example, computing device 13000 may include a bus 13002, one or more processors 13004, one or more input devices 13006, and one or more output devices 13008. The one or more processors 13004 may be any type of processor and may include, but are not limited to, one or more general purpose processors and/or one or more special purpose processors (eg, special processing chips). Input device 13006 may be any type of device capable of inputting information to computing device 13000 and may include, but is not limited to, a mouse, a keyboard, a touch screen, a microphone, and/or a remote control. Output device 13008 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminal, vibrator, and/or printer. Computing device 13000 may also include or be connected to non-transitory storage device 13010 , which may be any storage device that is non-transitory and that enables data storage, and may include, but is not limited to, a disk drive , optical storage device, solid state memory, floppy disk, flexible disk, hard disk, magnetic tape or any other magnetic media, optical disk or any other optical media, ROM (read only memory), RAM (random access memory), cache memory and/or Any other memory chip or cartridge, and/or any other medium from which a computer can read data, instructions and/or code. Non-transitory storage device 13010 may be detached from the interface. The non-transitory storage device 13010 may have data/programs (including instructions)/code for implementing the above methods and steps. Computing device 13000 may also include a communications device 13012. Communication device 13012 may be any type of device or system that enables communication with external devices and/or with a network, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication device, and/or a chipset, such as Bluetooth™ equipment, 1302.11 equipment, WiFi equipment, WiMax equipment, cellular communications equipment and/or the like.

计算设备13000还可以包括工作存储器13014，其可以是可以存储对处理器13004的工作有用的程序(包括指令)和/或数据的任何类型的工作存储器，并且可以包括但不限于随机存取存储器和/或只读存储器设备。Computing device 13000 may also include working memory 13014, which may be any type of working memory that may store programs (including instructions) and/or data useful for the operation of processor 13004, and may include, but is not limited to, random access memory and /or read-only memory device.

软件要素(程序)可以位于工作存储器13014中，包括但不限于操作系统13016、一个或多个应用程序13018、驱动程序和/或其他数据和代码。用于执行上述方法和步骤的指令可以被包括在一个或多个应用程序13018中，并且上述多方联合建模方法可以通过由处理器13004读取和执行一个或多个应用程序13018的指令来实现。更具体地，上述多方联合建模方法中，步骤S101-步骤S104可以例如通过处理器13004执行具有步骤S101-步骤S104的指令的应用程序13018而实现。此外，上述多方联合建模方法中的其它步骤可以例如通过处理器13004执行具有执行相应步骤中的指令的应用程序13018而实现。软件要素(程序)的指令的可执行代码或源代码可以存储在非暂时性计算机可读存储介质(例如上述存储设备13010)中，并且在执行时可以被存入工作存储器13014中(可能被编译和/或安装)。软件要素(程序)的指令的可执行代码或源代码也可以从远程位置下载。Software elements (programs) may be located in working memory 13014, including, but not limited to, operating system 13016, one or more applications 13018, drivers, and/or other data and code. Instructions for performing the above methods and steps may be included in one or more application programs 13018, and the above multi-party joint modeling method may be implemented by reading and executing instructions of the one or more application programs 13018 by the processor 13004 . More specifically, in the above-mentioned multi-party joint modeling method, steps S101 to S104 can be implemented, for example, by the processor 13004 executing the application program 13018 having the instructions of steps S101 to S104. In addition, other steps in the above-mentioned multi-party joint modeling method may be implemented, for example, by the processor 13004 executing an application program 13018 having instructions for executing corresponding steps. The executable code or source code of the instructions of the software element (program) may be stored in a non-transitory computer-readable storage medium (such as the storage device 13010 described above), and when executed, may be stored in the working memory 13014 (possibly compiled and/or installation). Executable code or source code for the instructions of the software element (program) may also be downloaded from a remote location.

还应该理解，可以根据具体要求而进行各种变型。例如，也可以使用定制硬件，和/或可以用硬件、软件、固件、中间件、微代码，硬件描述语言或其任何组合来实现特定元件。例如，所公开的方法和设备中的一些或全部可以通过使用根据本公开的逻辑和算法，用汇编语言或硬件编程语言(诸如VERILOG，VHDL，C++)对硬件(例如，包括现场可编程门阵列(FPGA)和/或可编程逻辑阵列(PLA)的可编程逻辑电路)进行编程来实现。It should also be understood that variations may be made depending on specific requirements. For example, custom hardware may also be used, and/or particular elements may be implemented in hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. For example, some or all of the disclosed methods and apparatus may be implemented on hardware (e.g., including field programmable gate arrays) in assembly language or a hardware programming language (such as VERILOG, VHDL, C++) by using logic and algorithms in accordance with the present disclosure. Programmable logic circuits (FPGA) and/or programmable logic array (PLA)) are implemented by programming.

还应该理解，前述方法可以通过服务器-客户端模式来实现。例如，客户端可以接收用户输入的数据并将所述数据发送到服务器。客户端也可以接收用户输入的数据，进行前述方法中的一部分处理，并将处理所得到的数据发送到服务器。服务器可以接收来自客户端的数据，并且执行前述方法或前述方法中的另一部分，并将执行结果返回给客户端。客户端可以从服务器接收到方法的执行结果，并例如可以通过输出设备呈现给用户。It should also be understood that the aforementioned methods can be implemented through a server-client model. For example, a client can receive user-entered data and send that data to a server. The client can also receive data input by the user, perform part of the processing in the aforementioned methods, and send the processed data to the server. The server can receive data from the client, execute the foregoing method or another part of the foregoing method, and return the execution result to the client. The client can receive the execution result of the method from the server and can present it to the user through an output device, for example.

还应该理解，计算设备13000的组件可以分布在网络上。例如，可以使用一个处理器执行一些处理，而同时可以由远离该一个处理器的另一个处理器执行其他处理。计算系统13000的其他组件也可以类似地分布。这样，计算设备13000可以被解释为在多个位置执行处理的分布式计算系统。It should also be understood that components of computing device 13000 may be distributed over a network. For example, one processor may be used to perform some processing while other processing may be performed by another processor remote from the one processor. Other components of computing system 13000 may be similarly distributed. As such, computing device 13000 may be interpreted as a distributed computing system that performs processing at multiple locations.

虽然已经参照附图描述了本公开的实施例或示例，但应理解，上述的方法、系统和设备仅仅是示例性的实施例或示例，本发明的范围并不由这些实施例或示例限制，而是仅由授权后的权利要求书及其等同范围来限定。实施例或示例中的各种要素可以被省略或者可由其等同要素替代。此外，可以通过不同于本公开中描述的次序来执行各步骤。进一步地，可以以各种方式组合实施例或示例中的各种要素。重要的是随着技术的演进，在此描述的很多要素可以由本公开之后出现的等同要素进行替换。Although the embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it should be understood that the above-mentioned methods, systems and devices are only exemplary embodiments or examples, and the scope of the present invention is not limited by these embodiments or examples. It is limited only by the granted claims and their equivalent scope. Various elements in the embodiments or examples may be omitted or replaced by equivalent elements thereof. Furthermore, the steps may be performed in a different order than described in this disclosure. Further, various elements in the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many elements described herein may be replaced by equivalent elements appearing after this disclosure.

Claims

1. A multi-party joint modeling method based on a distributed system, wherein the distributed system comprises a plurality of clusters, each cluster of the plurality of clusters comprising a server and a plurality of clients, the method comprising:

intersection is carried out on sample identifications included in each of the plurality of clusters to obtain an intersection sample identification and cluster sample data corresponding to the intersection sample identification, wherein the sample identifications and the cluster sample data included in each of the plurality of clusters are distributed and stored in a plurality of clients of the corresponding cluster, the cluster sample data comprises client sample data stored in the corresponding plurality of clients, and the cluster sample data and the client sample data comprise sample identifications and at least one feature;

respectively barrelling cluster sample data of each cluster in the plurality of clusters to obtain cluster barrelled data of each cluster in the plurality of clusters, wherein the method comprises the following steps:

traversing, for each of the plurality of clusters, the at least one feature of the cluster sample data of that cluster;

Generating at least one data bucket based on the feature data of the current feature; and

integrating the data buckets corresponding to all the features to obtain cluster sub-bucket data of the cluster, wherein the cluster sub-bucket data of the cluster comprises at least one feature of cluster sample data of the cluster and one or more data buckets corresponding to each feature of the at least one feature of the cluster sample data of the cluster;

constructing a global information gain histogram based on a sample tag and cluster barrel data of each of the plurality of clusters, wherein the sample tag is a true value of each sample and the sample tag is stored in a particular cluster of the plurality of clusters; and

and constructing a decision tree model based on the global information gain histogram.

2. The multi-party joint modeling method of claim 1, wherein the generating at least one data bucket based on the feature data of the current feature comprises:

judging whether the characteristic data of the current characteristic are distributed on the same client;

responding to the characteristic data of the current characteristic to be distributed on a plurality of clients, indicating each client in the plurality of clients to divide the characteristic data of the current characteristic included in the sample data of the respective client into barrels, generating at least one data barrel to be combined of the current characteristic, and uploading the at least one data barrel to be combined to a server corresponding to the plurality of clients; and

And merging the at least one to-be-merged data bucket uploaded by the plurality of clients to generate the at least one data bucket, wherein each data bucket in the at least one data bucket is formed by merging one or more to-be-merged data buckets with the same or similar values, the sample identification of each data bucket in the at least one data bucket comprises all sample identifications of the one or more to-be-merged data buckets, and the client of each data bucket in the at least one data bucket comprises the client of the one or more to-be-merged data buckets.

3. The multi-party joint modeling method of claim 2, wherein generating at least one data bucket based on the feature data of the current feature further comprises:

and sending each data bucket in the at least one data bucket to the client of the data bucket, wherein the client sub-bucket data of the client comprises the at least one data bucket.

4. The multi-party joint modeling method of claim 3, wherein generating at least one data bucket based on the feature data of the current feature further comprises:

And responding to the characteristic data of the current characteristic to be distributed on the same client, indicating the same client to sub-tank the characteristic data of the current characteristic, generating at least one data tank, and uploading the at least one data tank to a server corresponding to the same client, wherein the client sub-tank data of the same client comprises the at least one data tank.

5. The multi-party joint modeling method of claim 4, wherein the plurality of clusters includes a first cluster and at least one second cluster, the cluster sample data of the first cluster further including the sample tag,

wherein constructing a global information gain histogram based on the sample tags and the cluster barrel data for each of the plurality of clusters comprises:

obtaining a predicted value of a current model on each sample corresponding to each sample identifier of cluster sub-bucket data of the first cluster;

calculating first-order gradient data and second-order gradient data based on the predicted value and the sample tag; and

the global information gain histogram is constructed based on the first order gradient data, the second order gradient data, and cluster bucket data for each of the plurality of clusters.

6. The multi-party joint modeling method of claim 5, wherein the constructing the global information gain histogram based on the first order gradient data, the second order gradient data, and cluster bucket data for each of the plurality of clusters comprises:

encrypting the first-order gradient data and the second-order gradient data and then sending the encrypted first-order gradient data and the encrypted second-order gradient data to a server of each second cluster in the at least one second cluster;

acquiring at least one node to be split of the current model, wherein the node to be split comprises at least one sample identifier;

constructing a first information gain histogram based on the first-order gradient data, the second-order gradient data, cluster bucket data of the first cluster and the at least one node to be split;

receiving at least one ciphertext information gain histogram from a server of each of the at least one second cluster;

decrypting the at least one ciphertext information gain histogram to obtain at least one second information gain histogram corresponding to the at least one ciphertext information gain histogram one by one; and

and combining the first information gain histogram and the at least one second information gain histogram to obtain the global information gain histogram.

7. The multi-party joint modeling method of claim 6, wherein the constructing a first information gain histogram based on the first order gradient data, the second order gradient data, cluster bucket data of the first cluster, and the at least one node to be split comprises:

traversing at least one feature of cluster sub-bucket data of the first cluster for each of the at least one node to be split;

based on the feature data of the node to be split and the current feature, obtaining a first information gain sub-histogram or a first candidate splitting gain of the current feature of the node to be split; and

and merging the first information gain sub-histogram or the first candidate splitting gain of each feature in at least one feature of the cluster barrel data of the first cluster of each node to be split to obtain the first information gain histogram.

8. The multiparty joint modeling method according to claim 7, wherein said deriving a first information gain sub-histogram or a first candidate split gain for the current feature of the node to be split based on the feature data for the node to be split and the current feature comprises:

Judging whether feature data of the current feature are distributed on the same client;

responding to the characteristic data of the current characteristic to be distributed on a plurality of clients, indicating each client in the plurality of clients to construct a first information gain sub-histogram to be combined based on the first-order gradient data, the second-order gradient data, the client barrel data of the client and the node to be split, and uploading the first information gain sub-histogram to be combined to a server corresponding to the plurality of clients; and

and merging the received first information gain sub-histograms to be merged, which are uploaded by the clients, and constructing the first information gain sub-histograms.

9. The multi-party joint modeling method of claim 8, wherein the first information gain sub-histogram includes at least one histogram bin that corresponds one-to-one to all data bins belonging to the feature, the first information gain sub-histogram to be combined includes at least one histogram bin to be combined that corresponds one-to-one to all data bins belonging to the feature, the histogram bin and the histogram bin to be combined each include at least one of a first order gradient and a second order gradient and,

The merging the received multiple to-be-merged first information gain sub-histograms uploaded by the multiple clients, and constructing the first information gain sub-histogram includes:

combining at least one to-be-combined histogram bucket in the plurality of to-be-combined first information gain sub-histograms uploaded by the plurality of clients to generate at least one histogram bucket, wherein each histogram bucket in the at least one histogram bucket is formed by combining one or a plurality of to-be-combined histogram buckets corresponding to the same data bucket from different to-be-combined first information gain sub-histograms, the first-order gradient sum of each histogram bucket in the at least one histogram bucket is the sum of the first-order gradient sums of the one or more to-be-combined histogram buckets, and the second-order gradient sum of each histogram bucket in the at least one histogram bucket is the sum of the second-order gradient sums of the one or more to-be-combined histogram buckets; and

the first information gain sub-histogram is constructed based on the at least one histogram bin.

10. The multi-party joint modeling method of claim 9, wherein the obtaining a first information gain sub-histogram or a first candidate split gain for the current feature of the node to be split based on the feature data for the node to be split and the current feature further comprises:

And responding to the fact that all feature data of the current feature are distributed on the same client, indicating the same client to construct the first information gain sub-histogram based on the first-order gradient data, the second-order gradient data, the client barrel data of the same client and the node to be split, calculating the first candidate split gain based on the first information gain sub-histogram, and uploading the first candidate split gain to a server of the first cluster.

11. The multiparty joint modeling method according to claim 6, wherein said ciphertext information gain histogram is constructed based on ciphertext first order gradient data, ciphertext second order gradient data, cluster bucket data of said second cluster, and said at least one node to be split.

12. The multi-party joint modeling method of claim 5, wherein the current model includes one or more sub-decision trees, each of the at least one node to be split being a part of a leaf node of a last sub-decision tree of the current model.

13. The multi-party joint modeling method of claim 12, wherein the constructing a decision tree model based on the global information gain histogram includes:

Determining an optimal splitting point based on the global new information gain histogram;

indicating a client where the optimal splitting point is located to split the optimal splitting point;

iterating the splitting process until the splitting termination condition is reached, and generating the sub-decision tree; and

and iterating the sub-decision tree generation process until reaching the iteration termination condition to obtain the decision tree model.

14. The multi-party joint modeling method of claim 13, wherein the current model includes a structure of one or more sub-decision trees, the clusters to which all nodes belong are disclosed for the plurality of clusters,

wherein the indicating the client where the optimal splitting point is located to split the optimal splitting point includes:

the client is instructed to calculate a splitting threshold based on the optimal splitting point and the client barrel splitting data of the client, a sample identifier included in the split leaf node is obtained, and the leaf node is uploaded to a server; and

synchronizing the cluster to which the node of the optimal splitting point belongs and the leaf nodes to the clusters except the cluster.

15. A multi-party joint prediction method based on a decision tree model built according to the method of any one of claims 1-14, comprising:

Inputting a prediction sample into the decision tree model;

aiming at each sub-decision tree of the decision tree model, acquiring a cluster to which a root node belongs;

communicating with the affiliated cluster to acquire the characteristics of the root node;

the feature data of the features of the root node of the prediction sample are sent to the affiliated cluster, and the affiliated cluster of the child node is obtained;

iterating the process to obtain a predicted value of each sub-decision tree on the predicted sample; and

and summing the predicted values of the predicted samples by each sub-decision tree to obtain the predicted values of the predicted samples.

16. A multi-party joint modeling apparatus based on a distributed system, wherein the distributed system comprises a plurality of clusters, each cluster of the plurality of clusters comprising a server and a plurality of clients, the apparatus comprising:

an intersection module configured to intersect a sample identifier included in each of the plurality of clusters to obtain an intersection sample identifier and cluster sample data corresponding to the intersection sample identifier included in each of the plurality of clusters, where the sample identifier and the cluster sample data included in each of the plurality of clusters are distributed and stored in a plurality of clients of the corresponding cluster, and the cluster sample data includes client sample data stored in the corresponding plurality of clients, and the cluster sample data and the client sample data each include a sample identifier and at least one feature;

The barreling module is configured to respectively barrel the cluster sample data of each of the plurality of clusters to obtain cluster barreled data of each of the plurality of clusters, and comprises the following steps:

a first construction module configured to construct a global information gain histogram based on a sample tag and the plurality of cluster barrel data, wherein the sample tag is a true value for each sample and the sample tag is stored in a particular cluster of the plurality of clusters; and

and a second construction module configured to construct a decision tree model based on the global information gain histogram.

17. An electronic device, comprising:

a processor; and

a memory storing a program comprising instructions that when executed by the processor cause the processor to perform the method of any one of claims 1-15.

18. A computer readable storage medium storing a program, the program comprising instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the method of any one of claims 1-15.

19. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method of any of claims 1-15.