CN115271272A

CN115271272A - Click-through rate prediction method and system for multi-order feature optimization and hybrid knowledge distillation

Info

Publication number: CN115271272A
Application number: CN202211200198.9A
Authority: CN
Inventors: 李广丽; 许广鑫; 吴光庭; 李传秀; 叶艺源; 张红斌
Original assignee: East China Jiaotong University
Current assignee: East China Jiaotong University
Priority date: 2022-09-29
Filing date: 2022-09-29
Publication date: 2022-11-01
Anticipated expiration: 2042-09-29
Also published as: CN115271272B

Abstract

The invention provides a click-through rate prediction method and system based on multi-order feature optimization and hybrid knowledge distillation. User behavior data and the advertisement data clicked by users are analyzed to construct embedded feature vectors of the user behavior data and the advertisement data; around these embedded feature vectors, a SENET network, domain feature interaction, a CIN model and a DNN model are combined to realize multi-order feature optimization and to generate features that accurately describe user interest. A hybrid knowledge distillation framework is then designed, and a lightweight click-through rate prediction model with stronger real-time inference capability and good recommendation accuracy is output based on this framework, so as to achieve efficient, high-quality advertisement click prediction, improve the user's recommendation experience, and create economic and social benefits for Internet companies.

Description

Click-through rate prediction method and system based on multi-order feature optimization and hybrid knowledge distillation

Technical Field

The present invention relates to the technical field of advertisement recommendation, and in particular to a click-through rate prediction method and system based on multi-order feature optimization and hybrid knowledge distillation.

Background Art

With the sheer volume of information on the Internet, the problem of information overload is becoming increasingly serious. Recommendation systems can effectively alleviate this problem. Based on historical data of interactions between users and items, a recommendation system analyzes user characteristics such as habits, interests and preferences; it also analyzes item characteristics from the properties of the items themselves, and finally establishes a connection between users and candidate items so as to recommend items that users may be interested in.

Click-through rate (CTR) prediction estimates the probability that a user clicks on an Internet advertisement or online product. It is an important part of a recommendation system and plays a very important role on Internet commerce platforms. Internet advertising carries huge economic value, and an advertisement click implies a potential purchase, so click-through rate prediction plays a vital role in promoting social and economic development. Accurate advertisement recommendation therefore both improves the user experience and brings substantial revenue to Internet companies.

However, existing standard techniques for advertisement click-through rate prediction have the following problems: (1) the feature representation is limited: only explicit features or only implicit features are used, without exploiting the complementarity between the two; (2) the feature optimization method is simple and does not consider multi-order feature optimization. As a result, the final features are not strongly discriminative, which severely limits click-through rate prediction accuracy. In addition, existing click-through rate prediction techniques mostly adopt very complex and large prediction models, such as DIFM and AutoInt, whose real-time inference efficiency is low; this seriously affects the user's recommendation experience and hinders practical deployment of the models.

Summary of the Invention

In view of the above, the main purpose of the present invention is to provide a click-through rate prediction method and system based on multi-order feature optimization and hybrid knowledge distillation, so as to solve the problems of simple feature optimization methods, low click-through rate prediction accuracy and low real-time inference efficiency in the prior art.

The present invention provides a click-through rate prediction method based on multi-order feature optimization and hybrid knowledge distillation, wherein the method comprises the following steps:

Step 1, data preprocessing:

Feature extraction is performed on the acquired raw user behavior data and clicked advertisement data, followed by one-hot encoding, to obtain user behavior feature embedding vectors and advertisement feature embedding vectors respectively;

Step 2, model training:

The user behavior feature embedding vectors and the advertisement feature embedding vectors are input into a SENET network, and feature optimization based on channel attention is performed to generate first-order features;

A domain feature interaction network is constructed, and feature optimization based on domain-pair symmetric matrix embeddings is performed on the first-order features to generate second-order features;

The first-order features are input into a compressed interaction network to obtain explicit high-order features, the second-order features are input into a deep neural network to obtain implicit high-order features, the explicit and implicit high-order features are concatenated with weights and fused into third-order features, and a click-through rate prediction model is generated based on the third-order features;

Step 3, click-through rate prediction:

The click-through rate prediction model, an AutoInt model and a DIFM model are pre-trained, each is then self-distilled separately, and the three are combined to construct a teacher network;

A DNN model and an FM model are pre-trained, distilled mutually, and then combined to construct a student network;

A gating network is designed. In the teacher network, the gating network computes teacher model knowledge weights; based on these weights, the teacher network guides the click-through rate prediction of each student model in the student network, thereby realizing hybrid knowledge distillation. Here, a teacher model knowledge weight denotes the weight of the knowledge with which a teacher model guides each student model in the student network;

Step 4, advertisement recommendation:

The student network output by the hybrid knowledge distillation is deployed online to obtain multiple prediction values, which are sorted in descending order; a preset number of advertisements with the highest prediction values are selected and recommended to the user, completing the click-through rate prediction.
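The ranking in Step 4 can be sketched as follows. This is a minimal illustration, assuming the deployed student network has already produced one predicted click probability per candidate advertisement; the function name `recommend_top_k`, the advertisement identifiers and the scores are all hypothetical.

```python
import numpy as np

def recommend_top_k(ad_ids, scores, k):
    """Return the k ad ids with the highest predicted click probability.

    ad_ids and scores are parallel sequences; k is the preset number of ads.
    """
    scores = np.asarray(scores, dtype=float)
    # Sort indices by score in descending order (stable for ties).
    order = np.argsort(-scores, kind="stable")
    return [ad_ids[i] for i in order[:k]]

# Hypothetical predictions from the deployed student network.
ads = ["ad_a", "ad_b", "ad_c", "ad_d"]
preds = [0.12, 0.87, 0.45, 0.66]
top2 = recommend_top_k(ads, preds, k=2)
print(top2)
```

Sorting the full candidate list is the simplest choice; for very large candidate sets a partial selection such as `np.argpartition` may be preferable.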

The present invention also provides a click-through rate prediction system based on multi-order feature optimization and hybrid knowledge distillation, wherein the system comprises:

a data preprocessing module, used for:

a model training module, used for:

a click-through rate prediction module, used for:

an advertisement recommendation module, used for:

Compared with the prior art, the beneficial effects achieved by the present invention are as follows:

The present invention provides a click-through rate prediction method based on multi-order feature optimization and hybrid knowledge distillation. On the one hand, by analyzing user behavior data and the advertisement data clicked by users, embedded feature vectors of the user behavior data and the advertisement data are constructed; around these embedded feature vectors, the SENET network, domain feature interaction, the CIN model and the DNN model are combined to realize multi-order feature optimization and to generate features that accurately describe user interest;

On the other hand, a hybrid knowledge distillation framework is designed. Based on this framework, a lightweight click-through rate prediction model with stronger real-time inference capability and good recommendation accuracy is output, achieving efficient, high-quality advertisement click prediction, improving the user's recommendation experience, and creating economic and social benefits for Internet companies.

Additional aspects and advantages of the invention will be set forth in part in the description that follows, and in part will become apparent from that description or be learned through practice of the invention.

Brief Description of the Drawings

FIG. 1 is a flowchart of the click-through rate prediction method based on multi-order feature optimization and hybrid knowledge distillation provided by the present invention;

FIG. 2 is a flowchart of the click-through rate prediction model (Se-xDeepFEFM) of the present invention;

FIG. 3 is a flowchart of the hybrid knowledge distillation framework of the present invention;

FIG. 4 is a structural diagram of the click-through rate prediction system based on multi-order feature optimization and hybrid knowledge distillation provided by the present invention.

Detailed Description

Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the drawings, in which the same or similar reference numerals throughout denote the same or similar elements, or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the present invention and should not be construed as limiting it.

These and other aspects of the embodiments of the invention will become clear with reference to the following description and drawings. The description and drawings disclose some specific implementations of the embodiments to show some ways of implementing their principles, but it should be understood that the scope of the embodiments is not limited thereby. On the contrary, the embodiments of the present invention include all changes, modifications and equivalents falling within the spirit and scope of the appended claims.

Referring to FIG. 1 to FIG. 3, the present invention provides a click-through rate prediction method based on multi-order feature optimization and hybrid knowledge distillation, wherein the method comprises the following steps:

S101, data preprocessing:

Feature extraction is performed on the acquired raw user behavior data and clicked advertisement data, followed by one-hot encoding, to obtain user behavior feature embedding vectors and advertisement feature embedding vectors respectively.

In step S101, obtaining the user behavior feature embedding vectors and the advertisement feature embedding vectors in this way comprises the following steps:

S1011. Both the user behavior data and the clicked advertisement data are preprocessed, where the preprocessing includes:

Extracting the corresponding discrete features from the fields related to age, gender and user type, and processing the discrete features with an embedding method so that semantically similar features cluster at nearby positions in the feature space;

Extracting the corresponding continuous features from the fields related to price and time, and normalizing the continuous features so that the feature values are compressed into [0,1].

S1012. User behavior feature embedding vectors are generated from the preprocessed user behavior data, and advertisement feature embedding vectors are generated from the preprocessed clicked advertisement data.

Here, the user behavior feature embedding vectors and the advertisement feature embedding vectors are collectively denoted as the feature embedding vector.
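The preprocessing in S1011 can be illustrated with a short sketch. It assumes min-max scaling as the normalization (the text only states that values are compressed into [0,1]) and a plain lookup table as the embedding method; the helper names, the table size and the sample values are hypothetical.

```python
import numpy as np

def min_max_normalize(values):
    """Compress a continuous feature column into [0, 1]."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:  # constant column: avoid division by zero
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)

def embed_discrete(indices, table):
    """Look up embeddings; equivalent to one_hot(indices) @ table."""
    return table[np.asarray(indices)]

rng = np.random.default_rng(0)
gender_table = rng.normal(size=(3, 4))   # 3 categories, embedding size 4 (assumed)
gender_ids = [0, 2, 1]
gender_emb = embed_discrete(gender_ids, gender_table)

# The same result written as an explicit one-hot encoding followed by a projection.
one_hot = np.eye(3)[gender_ids]
prices = min_max_normalize([10.0, 25.0, 40.0])
```

The lookup form and the one-hot form give identical vectors; the lookup simply avoids materializing the sparse one-hot matrix. In a trained model the table entries would be learned rather than sampled at random.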

S102, model training:

S1021. The user behavior feature embedding vectors and the advertisement feature embedding vectors are input into the SENET network, and feature optimization based on channel attention is performed to generate first-order features;

S1022. A domain feature interaction network is constructed, and feature optimization based on domain-pair symmetric matrix embeddings is performed on the first-order features to generate second-order features;

S1023. The first-order features are input into the compressed interaction network (CIN) to obtain explicit high-order features, the second-order features are input into the deep neural network to obtain implicit high-order features, the explicit and implicit high-order features are concatenated with weights and fused into third-order features, and a click-through rate prediction model is generated based on the third-order features.

Specifically, in step S102, inputting the user behavior feature embedding vectors and the advertisement feature embedding vectors into the SENET network and performing feature optimization based on channel attention to generate the first-order features comprises the following steps:

S1021a. The SENET network compresses the feature embedding vector by an average pooling operation to compute a statistical vector;

S1021b. Based on the statistical vector, two fully connected layers are designed to compute the attention weights;

S1021c. The feature embedding vector is weighted by the attention weights to generate the first-order features.

The first-order features are expressed as follows (the formula image is not reproduced in this text; the symbols are assumed here and chosen to match the definitions that follow):

$$V = F_{ReWeight}(A, E) = [a_1 \cdot e_1, \dots, a_i \cdot e_i, \dots, a_j \cdot e_j, \dots] = [v_1, \dots, v_i, \dots, v_j, \dots]$$

$$A = F_{ex}(Z) = \sigma_2\left(W_2 \cdot \sigma_1(W_1 \cdot Z)\right)$$

$$Z = [z_1, \dots, z_i, \dots], \qquad z_i = F_{sq}(e_i) = \frac{1}{k} \sum_{t=1}^{k} e_i^{(t)}$$

where $V$ denotes the first-order features; $F_{ReWeight}$ denotes attention weighting of the feature embedding vector $E$; $A$ denotes the attention weights; $E$ denotes the feature embedding vector; $e_i$ and $e_j$ denote the $i$-th and $j$-th feature embedding vectors in $E$; $a_i$ and $a_j$ denote the attention weights of $e_i$ and $e_j$; $v_i$ and $v_j$ denote the $i$-th and $j$-th feature values of the first-order features; $F_{ex}$ denotes the function that computes the attention weights; $\sigma_1$ and $\sigma_2$ denote the first and second activation functions of the fully connected layers; $W_1$ and $W_2$ denote the first and second parameters of the fully connected layers; $Z$ denotes the statistical vector; $z_i$ denotes the statistic computed for the $i$-th feature embedding vector; $F_{sq}$ denotes the function that computes the statistic; $k$ denotes the dimension of the feature embedding vector $e_i$; and $t$ runs from dimension 1 to $k$.
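Steps S1021a to S1021c (average pooling, two fully connected layers, re-weighting) can be sketched in NumPy as below. This is a minimal illustration, not the patented implementation: ReLU is assumed for both activation functions, the weights are random rather than trained, and the sizes `f`, `k` and `r` are hypothetical.

```python
import numpy as np

def senet_reweight(E, W1, W2):
    """SENET-style re-weighting of feature embeddings.

    E:  (f, k) array, one k-dimensional embedding per feature.
    W1: (r, f) and W2: (f, r) weights of the two fully connected layers.
    """
    Z = E.mean(axis=1)                  # squeeze: average pooling per embedding -> (f,)
    hidden = np.maximum(W1 @ Z, 0.0)    # first fully connected layer + ReLU (assumed)
    A = np.maximum(W2 @ hidden, 0.0)    # second fully connected layer + ReLU (assumed)
    V = A[:, None] * E                  # scale each embedding by its attention weight
    return V, A

rng = np.random.default_rng(0)
f, k, r = 5, 8, 2                       # features, embedding size, reduced size (assumed)
E = rng.normal(size=(f, k))
W1 = rng.normal(size=(r, f))
W2 = rng.normal(size=(f, r))
V, A = senet_reweight(E, W1, W2)
```

Each row of `V` is the corresponding row of `E` scaled by a single scalar weight, which is how important features are emphasized and minor ones suppressed.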

In addition, because the first-order features have been weighted by attention, important features are highlighted and minor features are suppressed, which lays a solid foundation for the subsequent extraction of the second-order and third-order features and for click-through rate prediction (see FIG. 2 for the principle).

Further, a domain feature interaction network is constructed, and feature optimization based on domain-pair symmetric matrix embeddings is performed on the first-order features. The corresponding formula (image not reproduced in this text; symbols assumed here to match the definitions that follow) is:

$$\phi = w_0 + \sum_{i=1}^{m} w_i x_i + \sum_{i=1}^{m} \sum_{j=i+1}^{m} v_i^{\top} W_{F(i),F(j)} \, v_j \, x_i x_j$$

where $\phi$ denotes the output of the domain feature interaction network; $W_{F(i),F(j)}$ denotes a $k \times k$ symmetric matrix; $w_0$ denotes the learnable base weighting parameter of the domain feature interaction network; $w_i$ denotes the learnable weighting parameter of the $i$-th feature embedding vector; $m$ denotes the number of features; $x_i$ and $x_j$ denote the values of the $i$-th and $j$-th feature embedding vectors; $F(i)$ and $F(j)$ denote the domain features of the $i$-th and $j$-th fields; and $v_i$ denotes the $i$-th feature value of the first-order features.
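The pairwise interaction through a domain-pair symmetric matrix can be sketched as follows. This is a minimal illustration assuming the FEFM-style form "bias + linear terms + pairwise terms"; the function name, the dictionary layout of the pair matrices and all numeric values are hypothetical.

```python
import numpy as np

def fefm_interactions(V, x, field_of, W_pair, w0, w):
    """Bias + linear terms + domain-pair interactions.

    V: (m, k) embeddings; x: (m,) feature values; field_of: length-m field ids;
    W_pair: dict mapping (field_a, field_b) with field_a <= field_b
            to a (k, k) symmetric matrix.
    Returns the scalar output and the list of pairwise interaction terms.
    """
    m = V.shape[0]
    terms = []
    for i in range(m):
        for j in range(i + 1, m):
            key = tuple(sorted((field_of[i], field_of[j])))
            terms.append(V[i] @ W_pair[key] @ V[j] * x[i] * x[j])
    return w0 + w @ x + sum(terms), terms

rng = np.random.default_rng(0)
m, k = 3, 4
V = rng.normal(size=(m, k))
x = np.ones(m)
field_of = [0, 1, 2]
W_pair = {}
for a in range(3):
    for b in range(a, 3):
        M = rng.normal(size=(k, k))
        W_pair[(a, b)] = (M + M.T) / 2     # enforce symmetry
out, terms = fefm_interactions(V, x, field_of, W_pair, w0=0.1, w=np.zeros(m))
```

Because each pair matrix is symmetric, the interaction of features `i` and `j` is the same in either order, so one matrix per unordered pair of fields is enough.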

Further, the second-order features are expressed as follows (formula image not reproduced in this text; symbols assumed here to match the definitions that follow):

$$S = F_{concat}(\phi_E, \phi_V) = [s_1, \dots, s_p, \dots, s_q]$$

where $S$ denotes the second-order features; $F_{concat}$ denotes the concatenation operation; $\phi_E$ denotes the output obtained by feeding the original feature embedding vectors into the domain feature interaction network; $\phi_V$ denotes the output obtained by feeding the first-order features into the domain feature interaction network; $s_p$ denotes the $p$-th interaction feature vector after concatenation; and $q$ denotes the number of interaction feature vectors generated by the domain feature interaction network.

It should be noted that, because the second-order features fuse the high-order representations of both the feature embedding vectors and the first-order features, they contain richer semantic information, which helps to improve click prediction accuracy.

Further, the first-order features are input into the compressed interaction network (CIN) to obtain explicit high-order features. The explicit high-order features are generated as follows (formula image not reproduced in this text; symbols assumed here to match the definitions that follow):

$$X^{k}_{h,*} = \sum_{i=1}^{H_{k-1}} \sum_{j=1}^{m} W^{k,h}_{ij} \left( X^{k-1}_{i,*} \circ X^{0}_{j,*} \right), \qquad 1 \le h \le H_k$$

$$p^{k}_{i} = \sum_{j=1}^{D} X^{k}_{i,j}$$

$$p^{+} = [p^{1}, p^{2}, \dots, p^{T}]$$

where $X^{k}_{h,*}$ denotes the $h$-th high-order feature vector in the $k$-th layer high-order matrix; $X^{k-1}_{i,*}$ denotes the $i$-th high-order feature vector in the $(k-1)$-th layer high-order matrix; $X^{0}_{j,*}$ denotes the $j$-th feature value of the first-order features; $W^{k,h}$ denotes the parameter matrix with which the first-order features generate the $h$-th high-order feature of the $k$-th layer; $m$ denotes the number of feature embedding vectors in layer 0; $H_{k-1}$ denotes the number of feature embedding vectors in layer $k-1$; $p^{k}_{i}$ denotes the $i$-th feature in the $k$-th layer high-order feature vector; $X^{k}_{i,j}$ denotes the $j$-th dimension of the $i$-th feature in the $k$-th layer, with $D$ the embedding dimension; $p^{+}$ denotes the final explicit high-order features; $T$ denotes the total number of layers of explicit high-order features; and $\circ$ denotes the Hadamard product.
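The compressed interaction network step can be sketched in NumPy as below. This follows the commonly published CIN formulation (Hadamard products of every pair of rows, a weighted sum per output feature map, then sum pooling); it is an assumption that the patent uses exactly this form, and the layer sizes and random weights are hypothetical.

```python
import numpy as np

def cin_layer(X_prev, X0, W):
    """One CIN layer.

    X_prev: (H_prev, D) feature maps of the previous layer.
    X0:     (m, D) input feature matrix (here standing in for the first-order features).
    W:      (H_next, H_prev, m) parameters, one (H_prev, m) matrix per output map.
    """
    # Hadamard product of every pair of rows: shape (H_prev, m, D).
    pairwise = X_prev[:, None, :] * X0[None, :, :]
    # Weighted sum over all (i, j) pairs for each output feature map.
    return np.einsum("hij,ijd->hd", W, pairwise)

def cin(X0, weights):
    """Stack CIN layers and sum-pool each feature map over the embedding dimension."""
    X, pooled = X0, []
    for W in weights:
        X = cin_layer(X, X0, W)
        pooled.append(X.sum(axis=1))
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
m, D = 4, 3
X0 = rng.normal(size=(m, D))
weights = [rng.normal(size=(5, m, m)), rng.normal(size=(2, 5, m))]
p_plus = cin(X0, weights)               # explicit high-order features
```

With layer widths 5 and 2, the pooled output has 5 + 2 = 7 entries, one per feature map across all layers.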

Further, the second-order features are input into the deep neural network (DNN) to obtain implicit high-order features. The implicit high-order features are generated as follows (formula image not reproduced in this text; symbols assumed here to match the definitions that follow):

$$a^{(l)} = \sigma\left(W^{(l)} a^{(l-1)} + b^{(l)}\right), \qquad l = 1, \dots, L$$

where $a^{(l)}$ denotes the output of the $l$-th layer of the deep neural network, with the second-order features as the input $a^{(0)}$; $\sigma$ denotes the activation function; $W^{(l)}$ denotes the weights of the $l$-th layer; $b^{(l)}$ denotes the bias of the $l$-th layer; and $L$ denotes the number of layers of the deep neural network.
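The layer-by-layer computation of the deep neural network can be sketched as a simple forward pass. ReLU is an assumed choice of activation function, and the layer sizes and random weights are hypothetical.

```python
import numpy as np

def dnn_forward(x, layers):
    """Apply a = relu(W a_prev + b) for each (W, b) pair in layers."""
    a = x
    for W, b in layers:
        a = np.maximum(W @ a + b, 0.0)   # ReLU is an assumed activation
    return a

rng = np.random.default_rng(0)
second_order = rng.normal(size=6)        # stand-in for flattened second-order features
layers = [
    (rng.normal(size=(4, 6)), np.zeros(4)),
    (rng.normal(size=(3, 4)), np.zeros(3)),
]
implicit = dnn_forward(second_order, layers)
```

The output of the last layer plays the role of the implicit high-order features.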

The explicit high-order features output by the CIN and the implicit high-order features output by the DNN are combined to complete the feature fusion and generate the third-order features. The third-order features make full use of the complementarity between implicit and explicit high-order features, which helps to improve feature discriminability and the final click prediction accuracy.

The click-through rate prediction model generated from the third-order features is expressed as follows (formula image not reproduced in this text; symbols assumed here to match the definitions that follow):

$$\hat{y} = \sigma\left(w_{cin}^{\top} p^{+} + w_{dnn}^{\top} a^{(L)} + b\right)$$

where $\hat{y}$ denotes the predicted click-through rate; $\sigma$ denotes the sigmoid function; $p^{+}$ denotes the explicit high-order features output by the CIN; $a^{(L)}$ denotes the implicit high-order features output by the DNN; and $w_{cin}$, $w_{dnn}$ and $b$ all denote parameters of the click-through rate prediction model.
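The weighted fusion and the final sigmoid can be sketched as follows. This assumes the common form in which a weight vector is applied to the concatenated explicit and implicit features, which is the same as summing two separate dot products; the parameter names and values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_ctr(p_explicit, a_implicit, w_cin, w_dnn, b):
    """Weighted fusion of explicit and implicit features followed by a sigmoid."""
    logit = w_cin @ p_explicit + w_dnn @ a_implicit + b
    return sigmoid(logit)

p_explicit = np.array([0.2, -0.1, 0.4])   # stand-in for CIN output
a_implicit = np.array([0.5, 0.3])         # stand-in for DNN output
w_cin = np.array([1.0, 0.5, -0.5])
w_dnn = np.array([0.3, -0.2])
y_hat = predict_ctr(p_explicit, a_implicit, w_cin, w_dnn, b=0.0)

# Equivalent view: one weight vector over the concatenated (third-order) features.
fused = np.concatenate([p_explicit, a_implicit])
y_hat_concat = sigmoid(np.concatenate([w_cin, w_dnn]) @ fused)
```

The sigmoid maps the fused score to a value strictly between 0 and 1, which is read as the click probability.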

S103, click-through rate prediction:

S1031. The click-through rate prediction model, the AutoInt model and the DIFM model are pre-trained, each is then self-distilled separately, and the three are combined to construct the teacher network.

S1032. The DNN model and the FM model are pre-trained, distilled mutually, and then combined to construct the student network.

Here, the lightweight DNN model (student model 1 in FIG. 3) and the FM model (student model 2 in FIG. 3) are pre-trained and used as student models to build the student network. Mutual distillation between the DNN model and the FM model helps to fuse the diverse information held by the individual student models and improves the click prediction accuracy of each of them.

S1033. A gating network is designed. In the teacher network, the gating network computes teacher model knowledge weights; based on these weights, the teacher network guides the click-through rate prediction of each student model in the student network, thereby realizing hybrid knowledge distillation. Here, a teacher model knowledge weight denotes the weight of the knowledge with which a teacher model guides each student model in the student network.

The mutual distillation between the DNN model and the FM model proceeds as follows (formula images not reproduced in this text; symbols assumed here to match the definitions that follow):

$$L_{FM} = L_{CE}(y, \hat{y}_{FM}) + \alpha \, L_{KL}(\hat{y}_{FM}, \hat{y}_{DNN})$$

$$L_{DNN} = L_{CE}(y, \hat{y}_{DNN}) + \beta \, L_{KL}(\hat{y}_{DNN}, \hat{y}_{FM})$$

where $L_{FM}$ denotes the loss function of the FM model in the student network; $y$ denotes the true label; $\hat{y}_{FM}$ denotes the output of the FM model in the student network; $L_{CE}(y, \hat{y}_{FM})$ denotes the FM model fitting the true labels; $L_{KL}(\hat{y}_{FM}, \hat{y}_{DNN})$ denotes the KL loss of the FM model relative to the DNN model; and $\alpha$ denotes the weight of $L_{KL}(\hat{y}_{FM}, \hat{y}_{DNN})$;

$L_{DNN}$ denotes the loss function of the DNN model in the student network; $\hat{y}_{DNN}$ denotes the output of the DNN model in the student network; $L_{CE}(y, \hat{y}_{DNN})$ denotes the DNN model fitting the true labels; $L_{KL}(\hat{y}_{DNN}, \hat{y}_{FM})$ denotes the KL loss of the DNN model relative to the FM model; and $\beta$ denotes the weight of $L_{KL}(\hat{y}_{DNN}, \hat{y}_{FM})$.
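The two mutual-distillation losses can be sketched as below. This is a minimal illustration assuming binary cross-entropy for the label-fitting term and a Bernoulli KL divergence with the peer's prediction as the reference distribution; the direction of the KL term, the weights `alpha` and `beta`, and the sample predictions are assumptions rather than details taken from the text.

```python
import numpy as np

EPS = 1e-7

def bce(y, p):
    """Binary cross-entropy between labels y and predicted probabilities p."""
    p = np.clip(p, EPS, 1 - EPS)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

def bernoulli_kl(p, q):
    """Mean KL(p || q) between Bernoulli distributions with click probabilities p and q."""
    p = np.clip(p, EPS, 1 - EPS)
    q = np.clip(q, EPS, 1 - EPS)
    return float(np.mean(p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))))

def mutual_distillation_losses(y, p_fm, p_dnn, alpha=0.5, beta=0.5):
    """Each student fits the labels and is pulled toward its peer's prediction."""
    loss_fm = bce(y, p_fm) + alpha * bernoulli_kl(p_dnn, p_fm)
    loss_dnn = bce(y, p_dnn) + beta * bernoulli_kl(p_fm, p_dnn)
    return loss_fm, loss_dnn

y = np.array([1.0, 0.0, 1.0])
p_fm = np.array([0.8, 0.3, 0.6])     # hypothetical FM predictions
p_dnn = np.array([0.7, 0.2, 0.9])    # hypothetical DNN predictions
loss_fm, loss_dnn = mutual_distillation_losses(y, p_fm, p_dnn)
```

When the two students agree exactly, both KL terms vanish and each loss reduces to its own cross-entropy term.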

Further, the DIFM model (teacher model 1 in FIG. 3), the AutoInt model (teacher model 2 in FIG. 3) and the Se-xDeepFEFM model (teacher model 3 in FIG. 3) are pre-trained; the three pre-trained models are each self-distilled and then combined into the teacher network. Because the teacher models in the teacher network are heterogeneous, they can provide the student models with more diverse knowledge, which promotes the click prediction accuracy of the student models. A GATE mechanism is further designed to adaptively adjust the knowledge weight that each teacher model in the teacher network assigns to each student model in the student network. The larger the knowledge weight, the more valuable knowledge the corresponding teacher model provides to the student model during knowledge distillation, which improves that student model's click-through rate prediction accuracy.

Specifically, the self-distillation of the click-through rate prediction model (Se-xDeepFEFM model), the AutoInt model and the DIFM model is expressed as follows (formula images not reproduced in this text; the symbols are assumed here, and $L_{c}$ stands for the weighted term between the two outputs, whose exact form cannot be recovered from this text):

$$L_{DIFM} = L_{CE}(y, \hat{y}_{D}) + L_{CE}(y, \hat{y}'_{D}) + \lambda_{D} \, L_{c}(\hat{y}_{D}, \hat{y}'_{D})$$

$$L_{AutoInt} = L_{CE}(y, \hat{y}_{A}) + L_{CE}(y, \hat{y}'_{A}) + \lambda_{A} \, L_{c}(\hat{y}_{A}, \hat{y}'_{A})$$

$$L_{Se} = L_{CE}(y, \hat{y}_{S}) + L_{CE}(y, \hat{y}'_{S}) + \lambda_{S} \, L_{c}(\hat{y}_{S}, \hat{y}'_{S})$$

where $L_{DIFM}$ denotes the loss function of the DIFM model; $\hat{y}_{D}$ denotes the output of the DIFM model in the teacher network for unaugmented samples; $\hat{y}'_{D}$ denotes the output of the DIFM model in the teacher network for augmented samples; $\lambda_{D}$ denotes the weight of $L_{c}(\hat{y}_{D}, \hat{y}'_{D})$; $L_{CE}(y, \hat{y}_{D})$ denotes the DIFM model fitting the true labels on unaugmented samples; and $L_{CE}(y, \hat{y}'_{D})$ denotes the DIFM model fitting the true labels on augmented samples;

$L_{AutoInt}$ denotes the loss function of the AutoInt model; $\hat{y}_{A}$ denotes the output of the AutoInt model in the teacher network for unaugmented samples; $\hat{y}'_{A}$ denotes the output of the AutoInt model in the teacher network for augmented samples; $\lambda_{A}$ denotes the weight of $L_{c}(\hat{y}_{A}, \hat{y}'_{A})$; $L_{CE}(y, \hat{y}_{A})$ denotes the AutoInt model fitting the true labels on unaugmented samples; and $L_{CE}(y, \hat{y}'_{A})$ denotes the AutoInt model fitting the true labels on augmented samples;

$L_{Se}$ denotes the loss function of the click-through rate prediction model; $\hat{y}_{S}$ denotes the output of the click-through rate prediction model in the teacher network for unaugmented samples; $\hat{y}'_{S}$ denotes the output of the click-through rate prediction model in the teacher network for augmented samples; $\lambda_{S}$ denotes the weight of $L_{c}(\hat{y}_{S}, \hat{y}'_{S})$; $L_{CE}(y, \hat{y}_{S})$ denotes the click-through rate prediction model fitting the true labels on unaugmented samples; and $L_{CE}(y, \hat{y}'_{S})$ denotes the click-through rate prediction model fitting the true labels on augmented samples.

In the present invention, the Se-xDeepFEFM model completes self-distillation through sample diversity. Self-distillation can compress the scale of the teacher model and helps to narrow the "generation gap" between the teacher models and the student models, so that the hybrid knowledge distillation framework can be trained better.
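A self-distillation loss over unaugmented and augmented samples can be sketched as follows. The text does not state how the two outputs are tied together, so the mean squared difference used here as the consistency term is an assumption, as are the weight `lam` and the sample predictions.

```python
import numpy as np

EPS = 1e-7

def bce(y, p):
    """Binary cross-entropy between labels y and predicted probabilities p."""
    p = np.clip(p, EPS, 1 - EPS)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

def self_distillation_loss(y, p_plain, p_aug, lam=0.5):
    """Fit the labels on both views and penalize disagreement between the two outputs."""
    consistency = float(np.mean((p_plain - p_aug) ** 2))   # assumed consistency term
    return bce(y, p_plain) + bce(y, p_aug) + lam * consistency

y = np.array([1.0, 0.0, 1.0, 0.0])
p_plain = np.array([0.9, 0.2, 0.7, 0.1])   # hypothetical outputs on unaugmented samples
p_aug = np.array([0.8, 0.3, 0.6, 0.2])     # hypothetical outputs on augmented samples
loss = self_distillation_loss(y, p_plain, p_aug)
```

If the model gives the same output on both views, the consistency term is zero and the loss is simply the two label-fitting terms added together.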

The total loss function of the hybrid knowledge distillation is expressed as follows (formula image not reproduced in this text; symbols assumed here to match the definitions that follow):

$$L_{total} = \sum_{i=1}^{n} \alpha_i \, L_{KD}(T_i, S), \qquad T = \{T_1, \dots, T_n\}$$

where $L_{total}$ denotes the total loss function of the hybrid knowledge distillation; $T_i$ denotes the $i$-th teacher model in the teacher network; $T$ denotes the teacher network; $S$ denotes the student network; $n$ denotes the number of teacher models in the teacher network; and $\alpha_i$ denotes the knowledge weight of the $i$-th teacher model in the teacher network.
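The gated combination of teacher knowledge can be sketched as a weighted sum of per-teacher distillation losses. The text does not specify how the gating network produces its scores or which distillation loss is used; the softmax over raw gate scores and the mean squared error below are stand-in assumptions, and all names and values are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def gated_distillation_loss(teacher_preds, student_pred, gate_logits):
    """Weighted sum of per-teacher distillation losses.

    teacher_preds: (n, batch) click probabilities from n teacher models.
    student_pred:  (batch,) click probabilities from one student model.
    gate_logits:   (n,) raw gate scores, turned into knowledge weights by softmax.
    """
    weights = softmax(gate_logits)
    # Mean squared error as a stand-in distillation loss (assumed).
    per_teacher = np.mean((teacher_preds - student_pred[None, :]) ** 2, axis=1)
    return float(weights @ per_teacher), weights

teacher_preds = np.array([[0.9, 0.1], [0.8, 0.2], [0.6, 0.4]])   # three teachers
student_pred = np.array([0.7, 0.3])
loss, weights = gated_distillation_loss(teacher_preds, student_pred, np.zeros(3))
```

With equal gate scores every teacher receives the same weight; raising one score shifts weight toward that teacher, so its knowledge contributes more to the student's loss.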

S104, advertisement recommendation:

The teacher network and the student network are trained jointly: through GATE, the teacher models in the teacher network transfer knowledge to the student models in the student network, realizing hybrid knowledge distillation. The hybrid knowledge distillation framework outputs a lightweight student model, which is used to compute the click prediction values. This improves real-time prediction efficiency while maintaining prediction accuracy, and strengthens the real-time inference capability of the click-through rate prediction model.

Referring to FIG. 4, the present invention provides a click-through rate prediction system based on multi-order feature optimization and hybrid knowledge distillation, wherein the system comprises:

a data preprocessing module, used for:

a model training module, used for:

an advertisement recommendation module, used for:

应当理解的，本发明的各部分可以用硬件、软件、固件或它们的组合来实现。在上述实施方式中，多个步骤或方法可以用存储在存储器中且由合适的指令执行系统执行的软件或固件来实现。例如，如果用硬件来实现，和在另一实施方式中一样，可用本领域公知的下列技术中的任一项或他们的组合来实现：具有用于对数据信号实现逻辑功能的逻辑门电路的离散逻辑电路，具有合适的组合逻辑门电路的专用集成电路，可编程门阵列（PGA），现场可编程门阵列（FPGA）等。It should be understood that each part of the present invention can be realized by hardware, software, firmware or their combination. In the embodiments described above, various steps or methods may be implemented by software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, it can be implemented by any one or combination of the following techniques known in the art: Discrete logic circuits, ASICs with suitable combinational logic gates, Programmable Gate Arrays (PGAs), Field Programmable Gate Arrays (FPGAs), etc.

In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

The embodiments described above express only several implementations of the present invention. Their description is relatively specific and detailed, but it should not be construed as limiting the patent scope of the present invention. It should be noted that those of ordinary skill in the art can make several modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be determined by the appended claims.

Claims

1. A click-through rate prediction method for multi-order feature optimization and hybrid knowledge distillation, characterized by comprising the following steps:

step one, data preprocessing:

performing feature extraction on the obtained original user behavior data and clicked advertisement data, and performing one-hot encoding to obtain a user behavior feature embedding vector and an advertisement feature embedding vector, respectively;

step two, model training:

inputting the user behavior feature embedding vector and the advertisement feature embedding vector into a SENET network, and performing feature optimization based on channel attention to generate first-order features;

constructing a domain feature interaction network, and performing feature optimization based on domain symmetric matrix embedding on the acquired first-order features to generate second-order features;

inputting the first-order features into a compressed interaction network (CIN) to obtain explicit high-order features, inputting the second-order features into a deep neural network (DNN) to obtain implicit high-order features, performing weighted concatenation of the explicit high-order features and the implicit high-order features to fuse them into third-order features, and generating a click-through rate prediction model based on the third-order features;

step three, click-through rate prediction:

pre-training the click-through rate prediction model, an AutoInt model and a DIFM model, performing self-distillation on each of them, and then combining them to construct a teacher network;

pre-training a DNN model and an FM model, performing mutual distillation between them, and then combining them to construct a student network;

designing a gating network, calculating teacher-model knowledge weights for the teacher network through the gating network, and having the teacher network guide the click-through rate prediction of each student model in the student network based on the teacher-model knowledge weights, so as to realize hybrid knowledge distillation; wherein a teacher-model knowledge weight represents the weight of the knowledge that a teacher model imparts to each student model in the student network;

step four, advertisement recommendation:

deploying online the student network output by the hybrid knowledge distillation to obtain a plurality of predicted values, sorting the predicted values in descending order, selecting a preset number of advertisements with the highest predicted values, and recommending them to the user, thereby completing the click-through rate prediction.
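Step four amounts to a descending sort followed by a top-k selection. The following is a minimal sketch of that selection only; the function name `recommend_top_k`, the ad identifiers and the scores are hypothetical and are not taken from the patent.

```python
def recommend_top_k(predictions, k):
    """Return the k ad ids with the highest predicted click-through rate.

    predictions: dict mapping a (hypothetical) ad id to a predicted value.
    """
    # Sort in descending order of predicted value, as step four describes.
    ranked = sorted(predictions.items(), key=lambda item: item[1], reverse=True)
    # Keep only the preset number of highest-scoring advertisements.
    return [ad_id for ad_id, _ in ranked[:k]]


# Hypothetical student-model outputs for four candidate advertisements.
scores = {"ad_a": 0.12, "ad_b": 0.47, "ad_c": 0.05, "ad_d": 0.31}
top_two = recommend_top_k(scores, 2)
```

Here `k` plays the role of the "preset number" in the claim.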

2. The click-through rate prediction method for multi-order feature optimization and hybrid knowledge distillation according to claim 1, wherein in step one, the step of performing feature extraction on the obtained original user behavior data and clicked advertisement data and performing one-hot encoding to obtain the user behavior feature embedding vector and the advertisement feature embedding vector, respectively, comprises:

preprocessing the user behavior data and the clicked advertisement data, wherein the preprocessing comprises:

extracting the corresponding discrete features from the fields relating to age, gender and user type, and processing the discrete features with an embedding method so that semantically similar features are gathered at nearby positions in the feature space;

the preprocessing further comprises:

extracting the corresponding continuous features from the fields relating to price and time, and normalizing the continuous features so that the feature values are compressed to [0,1];

generating the user behavior feature embedding vector from the preprocessed user behavior data, and generating the advertisement feature embedding vector from the preprocessed clicked advertisement data; wherein the user behavior feature embedding vector and the advertisement feature embedding vector are both denoted as feature embedding vectors.
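The preprocessing in this claim can be illustrated with a small sketch: min-max normalization for continuous fields, and one-hot encoding followed by an embedding-table lookup for discrete fields. The vocabulary, the embedding dimension of 4 and the random table values are assumptions for illustration; in a trained model the table would be learned.

```python
import random


def min_max_normalize(values):
    """Scale continuous values (e.g. price, time) into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map them to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(value, vocabulary):
    """One-hot encode a discrete value against a fixed vocabulary."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def embed(one_hot_vector, table):
    """Multiply a one-hot vector by an embedding table (a row lookup)."""
    dim = len(table[0])
    return [sum(one_hot_vector[r] * table[r][c] for r in range(len(table)))
            for c in range(dim)]


random.seed(0)
genders = ["female", "male", "unknown"]  # hypothetical vocabulary
# Hypothetical embedding table: one 4-dimensional row per vocabulary entry.
table = [[random.uniform(-0.1, 0.1) for _ in range(4)] for _ in genders]

prices = min_max_normalize([10.0, 25.0, 40.0])
gender_vec = embed(one_hot("male", genders), table)
```

Multiplying a one-hot vector by the table simply selects one row, which is why embedding layers are usually implemented as index lookups.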

3. The click-through rate prediction method for multi-order feature optimization and hybrid knowledge distillation according to claim 2, wherein in step two, the step of inputting the user behavior feature embedding vector and the advertisement feature embedding vector into the SENET network and performing feature optimization based on channel attention to generate the first-order features comprises:

compressing the feature embedding vectors by an average pooling operation of the SENET network to calculate a statistical vector;

designing two fully connected layers that operate on the statistical vector to calculate attention weights;

weighting the feature embedding vectors according to the attention weights to generate the first-order features;

the first-order features are expressed by a formula, wherein the terms of the formula denote, in order:

the first-order features;

the operation of applying attention weighting to the feature embedding vectors;

the attention weights;

the feature embedding vectors;

two indexed feature embedding vectors among the feature embedding vectors;

the attention weights of those two feature embedding vectors;

the two corresponding feature values of the first-order features;

the function for calculating the attention weights;

the first activation function of the fully connected layers;

the second activation function of the fully connected layers;

the first parameter of the fully connected layers;

the second parameter of the fully connected layers;

the statistical vector;

the statistical information value calculated for an indexed feature embedding vector;

the function for calculating the statistical information value;

the dimension of the feature embedding vectors;

and a calculation running from dimension 1 up to that dimension.
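The three steps of this claim (average pooling, two fully connected layers, re-weighting) follow the usual squeeze-and-excitation pattern and can be sketched as follows. This is an illustrative sketch, not the patent's formula: ReLU is assumed for both activation functions, biases are omitted, and the names `w1`, `w2` and the toy values are hypothetical.

```python
def relu(x):
    return max(0.0, x)


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def senet_first_order(embeddings, w1, w2):
    """embeddings: f vectors of dimension k; w1: r x f; w2: f x r."""
    # Squeeze: average-pool each embedding vector into one statistic.
    stats = [sum(e) / len(e) for e in embeddings]
    # Excitation: two fully connected layers (ReLU assumed for both).
    hidden = [relu(h) for h in matvec(w1, stats)]
    attention = [relu(a) for a in matvec(w2, hidden)]
    # Re-weight: scale every embedding vector by its attention weight.
    first_order = [[a * x for x in e] for a, e in zip(attention, embeddings)]
    return stats, attention, first_order


# Toy input: three feature embedding vectors of dimension 2.
embeddings = [[1.0, 3.0], [2.0, 2.0], [0.0, 4.0]]
w1 = [[0.5, 0.5, 0.0]]        # hypothetical first-layer parameters (1 x 3)
w2 = [[1.0], [0.5], [0.0]]    # hypothetical second-layer parameters (3 x 1)
stats, attention, first_order = senet_first_order(embeddings, w1, w2)
```

With these toy parameters the third embedding receives an attention weight of zero, so it is suppressed entirely in the first-order features.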

4. The click-through rate prediction method for multi-order feature optimization and hybrid knowledge distillation according to claim 3, wherein in step two, in the step of constructing the domain feature interaction network and performing feature optimization based on domain symmetric matrix embedding on the acquired first-order features, the following formula is applied:

wherein the terms of the formula denote, in order:

the output of the domain feature interaction network;

a symmetric matrix;

the learnable base weight parameter of the domain feature interaction network;

the learnable weight parameter of an indexed feature embedding vector in the domain feature interaction network;

the number of features;

the values of two indexed feature embedding vectors;

the domain features of the two corresponding indexed fields;

and an indexed feature value of the first-order features.
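The terms listed in this claim (a base weight, per-feature weights, pairs of embedding vectors, and a symmetric matrix tied to a pair of fields) match the general shape of a factorization machine whose pairwise term is mediated by a field-pair matrix. The sketch below assumes that form; it is an illustration, not the patent's exact formula, and every name and value in it is hypothetical.

```python
def field_interaction(x, v, fields, w0, w, field_matrices):
    """Assumed form: w0 + sum_i w[i]*x[i]
    + sum_{i<j} x[i]*x[j] * v[i]^T M[(field_i, field_j)] v[j]."""
    n = len(x)
    # Base weight plus the linear term.
    out = w0 + sum(w[i] * x[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            m = field_matrices[(fields[i], fields[j])]
            # M v_j, then the inner product with v_i.
            mv = [sum(m[r][c] * v[j][c] for c in range(len(v[j])))
                  for r in range(len(m))]
            out += x[i] * x[j] * sum(v[i][r] * mv[r] for r in range(len(mv)))
    return out


x = [1.0, 1.0]                       # feature values
v = [[1.0, 0.0], [0.0, 1.0]]         # feature embedding vectors
fields = ["user", "ad"]              # field (domain) of each feature
# One symmetric matrix per field pair.
field_matrices = {("user", "ad"): [[1.0, 0.5], [0.5, 2.0]]}
result = field_interaction(x, v, fields, 0.1, [0.2, 0.3], field_matrices)
```

Because the matrix is symmetric, the interaction score is the same whichever feature of the pair is treated as the left operand.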

5. The method according to claim 4, wherein the second-order features are expressed by the following formula:

wherein the terms of the formula denote, in order:

the second-order features;

the concatenation operation;

the output obtained by inputting the initial feature embedding vectors into the domain feature interaction network;

the output obtained by inputting the first-order features into the domain feature interaction network;

an indexed interaction feature vector after concatenation;

and the number of interaction feature vectors generated by the domain feature interaction network.

6. The method according to claim 5, wherein in step two, in the step of inputting the first-order features into the compressed interaction network to obtain the explicit high-order features, the explicit high-order features are generated according to the following formula:

wherein the terms of the formula denote, in order:

an indexed high-order feature vector in the high-order matrix of an indexed layer;

an indexed high-order feature vector in the high-order matrix of another indexed layer;

an indexed feature value of the first-order features;

the parameter matrix with which the first-order features generate an indexed high-order feature of the high-order feature vectors of an indexed layer;

the number of layer-0 feature embedding vectors;

the number of feature embedding vectors of an indexed layer;

an indexed feature in the high-order feature vectors of an indexed layer;

the feature vector of an indexed dimension of that feature;

the explicit high-order features that are finally generated;

the total number of layers of the explicit high-order features;

and the Hadamard product;

in the step of inputting the second-order features into the deep neural network to obtain the implicit high-order features, the implicit high-order features are generated according to the following formula:

wherein the terms of the formula denote, in order:

the output of an indexed layer of the deep neural network;

the activation function;

the weight of an indexed layer of the deep neural network;

the bias of an indexed layer of the deep neural network;

and the number of layers of the deep neural network.
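Both branches of this claim can be sketched in a few lines. The first function follows the commonly published form of a compressed interaction network layer, in which each output vector is a weighted sum of Hadamard products between vectors of the previous layer and vectors of layer 0. The second is a single fully connected layer with a ReLU activation, which is an assumption. All parameter values are hypothetical.

```python
def cin_layer(x_prev, x0, weights):
    """One CIN layer.

    x_prev: vectors of the previous layer; x0: layer-0 vectors;
    weights[h][i][j]: parameter for output vector h and input pair (i, j).
    """
    dim = len(x0[0])
    feature_maps = []
    for w_h in weights:
        vec = [0.0] * dim
        for i, xi in enumerate(x_prev):
            for j, xj in enumerate(x0):
                for d in range(dim):
                    # Element-wise (Hadamard) product, weighted and summed.
                    vec[d] += w_h[i][j] * xi[d] * xj[d]
        feature_maps.append(vec)
    return feature_maps


def sum_pool(feature_maps):
    """Sum each output vector over its dimensions (explicit features)."""
    return [sum(vec) for vec in feature_maps]


def dense_layer(inputs, weights, biases):
    """One DNN layer: activation(W a + b), with ReLU assumed."""
    return [max(0.0, sum(w * a for w, a in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


x0 = [[1.0, 2.0], [3.0, 4.0]]
layer1 = cin_layer(x0, x0, [[[1.0, 0.0], [0.0, 1.0]]])
explicit = sum_pool(layer1)
implicit = dense_layer([1.0, -2.0], [[1.0, 1.0], [2.0, 0.5]], [0.5, 0.0])
```

Stacking `cin_layer` calls, each fed with the previous output and the same `x0`, raises the interaction order by one per layer, while stacked `dense_layer` calls model interactions implicitly.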

7. The method according to claim 6, wherein in step two, the click-through rate prediction model generated based on the third-order features is expressed by the following formula:

wherein the terms of the formula denote, in order:

the predicted click-through rate;

the sigmoid function;

and the parameters of the click-through rate prediction model.

8. The method according to claim 7, wherein the formula corresponding to the hybrid knowledge distillation in step three is expressed as follows:

wherein the terms of the formula denote, in order:

the loss function of the FM model in the student network;

the real label;

the output of the FM model in the student network;

the term by which the FM model in the student network fits the real labels;

the KL loss of the FM model relative to the DNN model in the student network;

the weight of a term in the loss function of the FM model;

the loss function of the DNN model in the student network;

the output of the DNN model in the student network;

the term by which the DNN model in the student network fits the real labels;

the KL loss of the DNN model relative to the FM model in the student network;

and the weight of a term in the loss function of the DNN model.
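The mutual distillation between the two student models can be sketched as each model minimizing its own label-fitting loss plus a weighted KL loss toward the other model's prediction. The sketch assumes binary cross-entropy as the label-fitting term, treats each prediction as a Bernoulli distribution, and assumes the direction of each KL term; the function names and weights are hypothetical.

```python
import math


def clip(p, eps=1e-7):
    """Keep probabilities away from 0 and 1 so the logarithms stay finite."""
    return min(max(p, eps), 1.0 - eps)


def bce(y, p):
    """Binary cross-entropy between a label y and a predicted probability p."""
    p = clip(p)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def kl_bernoulli(p, q):
    """KL divergence KL(p || q) between two Bernoulli distributions."""
    p, q = clip(p), clip(q)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def mutual_distillation_losses(y, p_fm, p_dnn, weight_fm, weight_dnn):
    # Each student fits the real label and is also pulled toward its peer.
    loss_fm = bce(y, p_fm) + weight_fm * kl_bernoulli(p_dnn, p_fm)
    loss_dnn = bce(y, p_dnn) + weight_dnn * kl_bernoulli(p_fm, p_dnn)
    return loss_fm, loss_dnn


# Hypothetical predictions from the FM and DNN students for one clicked ad.
loss_fm, loss_dnn = mutual_distillation_losses(1.0, 0.6, 0.8, 0.5, 0.5)
```

When the two students already agree, both KL terms vanish and each loss reduces to the plain label-fitting term.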

9. The method according to claim 8, wherein the formulas for the self-distillation of the click-through rate prediction model, the AutoInt model and the DIFM model are expressed as follows:

wherein the terms of the formulas denote, in order:

the loss function of the DIFM model;

the output of the DIFM model in the teacher network for the non-enhanced samples;

the output of the DIFM model in the teacher network for the enhanced samples;

the weight of a term in the loss function of the DIFM model;

the term by which the DIFM model in the teacher network fits the real labels on the non-enhanced samples;

the term by which the DIFM model in the teacher network fits the real labels on the enhanced samples;

the loss function of the AutoInt model;

the output of the AutoInt model in the teacher network for the non-enhanced samples;

the output of the AutoInt model in the teacher network for the enhanced samples;

the weight of a term in the loss function of the AutoInt model;

the term by which the AutoInt model in the teacher network fits the real labels on the non-enhanced samples;

the term by which the AutoInt model in the teacher network fits the real labels on the enhanced samples;

the loss function of the click-through rate prediction model;

the output of the click-through rate prediction model in the teacher network for the non-enhanced samples;

the output of the click-through rate prediction model in the teacher network for the enhanced samples;

the weight of a term in the loss function of the click-through rate prediction model;

the term by which the click-through rate prediction model in the teacher network fits the real labels on the non-enhanced samples;

and the term by which the click-through rate prediction model in the teacher network fits the real labels on the enhanced samples;

the total loss function of the hybrid knowledge distillation is expressed as:

wherein the terms of the formula denote, in order:

the total loss function corresponding to the hybrid knowledge distillation;

an indexed teacher model in the teacher network;

the teacher network;

the student network;

the number of teacher models in the teacher network;

and the knowledge weight of an indexed teacher model in the teacher network.
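The role of the knowledge weights in the total loss can be sketched as follows: a gating step turns one raw score per teacher into weights that sum to one, and the student is penalized by the weighted sum of its KL divergences from the teachers. The softmax gate, the Bernoulli KL term, and all names and numbers are assumptions for illustration; the patent's exact total loss is not reproduced here.

```python
import math


def softmax(scores):
    """Turn raw gate scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_bernoulli(p, q, eps=1e-7):
    """KL(p || q) between two Bernoulli distributions."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def gated_teacher_loss(gate_scores, teacher_preds, student_pred):
    """Weighted sum of teacher-to-student KL losses."""
    knowledge_weights = softmax(gate_scores)
    loss = sum(a * kl_bernoulli(t, student_pred)
               for a, t in zip(knowledge_weights, teacher_preds))
    return knowledge_weights, loss


# Hypothetical gate scores and predictions for three teacher models.
weights, loss = gated_teacher_loss([2.0, 0.0, 0.0], [0.9, 0.6, 0.7], 0.5)
```

A teacher with a higher gate score contributes more to the guidance the student receives, which is the purpose the claim attributes to the knowledge weights.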

10. A click-through rate prediction system for multi-order feature optimization and hybrid knowledge distillation, characterized in that the system comprises:

a data preprocessing module, configured to:

perform feature extraction on the obtained original user behavior data and clicked advertisement data, and perform one-hot encoding to obtain a user behavior feature embedding vector and an advertisement feature embedding vector, respectively;

a model training module, configured to:

construct a domain feature interaction network, and perform feature optimization based on domain symmetric matrix embedding on the acquired first-order features to generate second-order features;

input the first-order features into a compressed interaction network (CIN) to obtain explicit high-order features, input the second-order features into a deep neural network (DNN) to obtain implicit high-order features, perform weighted concatenation of the explicit high-order features and the implicit high-order features to fuse them into third-order features, and generate a click-through rate prediction model based on the third-order features;

a click-through rate prediction module, configured to perform click-through rate prediction;

an advertisement recommendation module, configured to:

deploy online the student network output by the hybrid knowledge distillation to obtain a plurality of predicted values, sort the predicted values in descending order, select a preset number of advertisements with the highest predicted values, and recommend them to the user, thereby completing the click-through rate prediction.