CN103530321B

CN103530321B - A kind of ordering system based on machine learning

Info

Publication number: CN103530321B
Application number: CN201310429873.XA
Authority: CN
Inventors: 蔡文彬; 张娅
Original assignee: Shanghai Jiao Tong University
Current assignee: Shanghai Media Intelligence Co ltd
Priority date: 2013-09-18
Filing date: 2013-09-18
Publication date: 2016-09-07
Anticipated expiration: 2033-09-18
Also published as: CN103530321A

Abstract

The invention provides a sorting system based on machine learning, wherein: the data collection and data preprocessing module collects the data required for sorting learning; the training set construction module includes: in the initial stage, from the stored massive unlabeled query-text data, Randomly select some samples and manually mark them according to their correlations to build an initial training set; in the iterative stage, call the ranking learning module and execute the sampling algorithm based on the built ranking model to select the most informative samples from the unlabeled data for correlation label. Store the marked data in the storage module; the ranking learning module builds the ranking model; the training set building module and the ranking learning module interact iteratively; the predictive ranking module invokes the sorting model to retrieve the corresponding text prediction correlation for the query input by the user , and sorted according to the magnitude of the correlation. The invention can improve the sorting accuracy of the retrieval results by the sorting system, better display the retrieval results, and meet the needs of users.

Description

A sorting system based on machine learning

技术领域technical field

本发明涉及机器学习领域和信息检索领域，具体地是一种基于机器学习的排序系统。The present invention relates to the fields of machine learning and information retrieval, in particular to a sorting system based on machine learning.

背景技术Background technique

当今社会是信息化的社会，随着现代科学技术的发展，信息出现爆炸性的增长。如何从海量的信息中快速准确的找到自己所需要的信息，是信息检索技术所需要解决的核心问题。信息检索能否更好的符合用户的需求，直接关系到海量信息是否被充分利用，对经济社会发展具有十分重要的意义。Today's society is an information society. With the development of modern science and technology, information has grown explosively. How to quickly and accurately find the information you need from the massive amount of information is the core problem that information retrieval technology needs to solve. Whether information retrieval can better meet the needs of users is directly related to whether the massive information is fully utilized, which is of great significance to economic and social development.

排序作为信息检索领域中的核心技术问题，已广泛应用于网页搜索，推荐，在线广告等信息检索问题。以网页搜索为例，如Google,百度，Bing,Yahoo等，排序系统的任务是建立一个排序模型，并对搜索出的网页按照预测相关性进行排序。在基于机器学习的排序系统中，训练排序模型是排序系统的核心部分。而排序模型的性能与训练集的质量高度相关，由于数据标注的代价巨大，因此，无法对收集到的海量数据进行人工标注。目前排序系统广泛采用的方法是从海量数据中随机挑选一部分数据进行标注，以此保持数据分布的特性。但是，不足之处在于，由于忽略了排序模型和训练集之间的相互关系，所选取的训练集很难保证排序模型的性能，从而导致排序系统的排序准确度不高，难以满足用户的需求。As a core technical problem in the field of information retrieval, ranking has been widely used in information retrieval problems such as web search, recommendation, and online advertisement. Taking web search as an example, such as Google, Baidu, Bing, Yahoo, etc., the task of the ranking system is to build a ranking model and sort the searched web pages according to the predicted relevance. In the ranking system based on machine learning, training the ranking model is the core part of the ranking system. The performance of the ranking model is highly related to the quality of the training set. Due to the huge cost of data labeling, it is impossible to manually label the massive data collected. At present, the method widely used in sorting systems is to randomly select a part of data from massive data for labeling, so as to maintain the characteristics of data distribution. However, the disadvantage is that due to ignoring the relationship between the ranking model and the training set, the selected training set is difficult to guarantee the performance of the ranking model, which leads to the low ranking accuracy of the ranking system and is difficult to meet the needs of users. .

发明内容Contents of the invention

针对现有技术的不足，本发明的目的在于提供一种基于机器学习的排序系统，旨在充分利用排序模型和训练集之间的关系，提高排序系统对检索结果的排序准确性，更好的展现检索结果，满足用户的需求。Aiming at the deficiencies of the prior art, the object of the present invention is to provide a sorting system based on machine learning, which aims to make full use of the relationship between the sorting model and the training set, improve the sorting accuracy of the sorting system for retrieval results, and better Display search results to meet user needs.

为实现上述目的，本发明采用了以下技术方案：To achieve the above object, the present invention adopts the following technical solutions:

本发明提供一种基于机器学习的排序系统，该系统包括：数据收集和数据预处理模块、训练集构建模块，排序学习模块，预测排序模块及数据存储模块，其中：The invention provides a sorting system based on machine learning, the system includes: data collection and data preprocessing module, training set construction module, sorting learning module, predictive sorting module and data storage module, wherein:

所述数据收集和数据预处理模块，收集所需排序学习的数据，获取查询及其所属的文本，该文本为未标注文本，并将收集到的查询-文本数据，进行特征处理和数据归一化操作，存于数据存储模块；The data collection and data preprocessing module collects the data required for sorting learning, obtains the query and its text, which is unlabeled text, and performs feature processing and data normalization on the collected query-text data Operation, stored in the data storage module;

所述训练集构建模块，训练集构建包括两个阶段：初始阶段，从存储的海量未标注查询-文本数据，随机挑选部分数据按照相关性人工标注，构建初始训练集；迭代阶段，调用排序学习模块并依据已经构建的排序模型，执行选样(Sampling)算法，从未标注数据中选择最有信息的样本，进行查询-文本相关性标注，对已有的训练集进行扩展；将标注后的数据存于数据存储模块；The training set construction module, the training set construction includes two stages: the initial stage, from the stored massive unlabeled query-text data, randomly select some data according to the correlation manual labeling, construct the initial training set; the iterative stage, call the sorting learning The module executes the Sampling algorithm based on the sorting model that has been built, selects the most informative samples from the unlabeled data, performs query-text correlation labeling, and expands the existing training set; the labeled The data is stored in the data storage module;

所述排序学习模块，调用数据存储模块中的有标注的数据，进行排序模型的训练，构建排序模型；The sorting learning module calls the marked data in the data storage module to train the sorting model and build the sorting model;

所述的训练集构建模块和排序学习模块，交互迭代进行，从而充分利用训练集和排序模型之间的关系，提高排序系统对检索结果的排序准确性。迭代终止条件包括多种：人工设置迭代次数，排序系统性能满足用户需求等；所述选样算法先对存储的未标注数据添加高斯噪声，再调用排序模型进行相关性预测获取预测分数分布，然后将预测分数分布转换为预测排序分布，最后选择排序预测最不确定的样本数据。The training set building module and the sorting learning module are carried out interactively and iteratively, so as to make full use of the relationship between the training set and the sorting model, and improve the sorting accuracy of the sorting system for the retrieval results. Iteration termination conditions include a variety of: manually set the number of iterations, the performance of the sorting system meets user needs, etc.; the sampling algorithm first adds Gaussian noise to the stored unlabeled data, and then calls the sorting model to perform correlation prediction to obtain the predicted score distribution, and then Transform the predicted score distribution into a predicted rank distribution, and finally select the sample data with the least certain rank predictions.

所述预测排序模块，迭代结束后，调用排序学习模块建立的排序模型，对用户搜索的查询，检索其对应的文本进行相关性预测，并根据相关性大小进行排序，展示给用户；The predictive sorting module, after the iteration, invokes the sorting model established by the sorting learning module, searches for the user's query, retrieves its corresponding text, performs correlation prediction, sorts according to the correlation size, and presents it to the user;

所述数据存储模块，存储两部分数据：一部分存储未标注的数据，另一部分存储有标注的数据。其中，未标注数据被训练集构建模块调用，用于选样，当被选中标注后，转存为有标注数据。有标注数据被排序学习模块调用，用于训练排序模型。The data storage module stores two parts of data: one part stores unmarked data, and the other part stores marked data. Among them, the unlabeled data is called by the training set building module for sample selection, and when it is selected and labeled, it is transferred to labeled data. The labeled data is called by the ranking learning module to train the ranking model.

与现有技术相比，本发明的基于机器学习的排序系统，充分利用了排序模型和训练集之间的相互关系，能够选择出最有信息的样本，从而训练出高性能的排序模型，实现排序系统精确排序的目的，能更好的满足用户的需求，具有重要的应用价值。Compared with the prior art, the sorting system based on machine learning of the present invention makes full use of the relationship between the sorting model and the training set, and can select the most informative samples, thereby training a high-performance sorting model and realizing The purpose of accurate sorting in the sorting system can better meet the needs of users and has important application value.

附图说明Description of drawings

通过阅读参照以下附图对非限制性实施例所作的详细描述，本发明的其它特征、目的和优点将会变得更明显：Other characteristics, objects and advantages of the present invention will become more apparent by reading the detailed description of non-limiting embodiments made with reference to the following drawings:

图1是本发明中基于机器学习的排序系统框架图。Fig. 1 is a frame diagram of a sorting system based on machine learning in the present invention.

图2是本发明中选样算法的流程图。Fig. 2 is a flow chart of the sampling algorithm in the present invention.

图3是本发明中基于机器学习的排序系统与现有技术的性能比较图。Fig. 3 is a performance comparison diagram between the sorting system based on machine learning in the present invention and the prior art.

具体实施方式detailed description

下面结合具体实施例对本发明进行详细说明。以下实施例将有助于本领域的技术人员进一步理解本发明，但不以任何形式限制本发明。应当指出的是，对本领域的普通技术人员来说，在不脱离本发明构思的前提下，还可以做出若干变形和改进。这些都属于本发明的保护范围。The present invention will be described in detail below in conjunction with specific embodiments. The following examples will help those skilled in the art to further understand the present invention, but do not limit the present invention in any form. It should be noted that those skilled in the art can make several modifications and improvements without departing from the concept of the present invention. These all belong to the protection scope of the present invention.

本实施例构建了一个基于机器学习的排序系统，并使用百度所提供的商业搜索排序数据进行测试。本实施例选取当前信息检索领域中度量排序准确度的最重要的二个评价标准，DCG@10和MAP（Mean Average Precision平均准确度）进行效果评价，并且与已有的具有代表性的技术进行了比较。本发明具有更准确的排序准确度，能够更好的将搜索结果展示给用户。In this embodiment, a ranking system based on machine learning is constructed, and the commercial search ranking data provided by Baidu is used for testing. In this embodiment, the two most important evaluation criteria for measuring sorting accuracy in the current information retrieval field, DCG@10 and MAP (Mean Average Precision), are selected for effect evaluation, and compared with existing representative technologies compared. The invention has more accurate sorting accuracy and can better display search results to users.

本实施例所述系统包括：The system described in this embodiment includes:

数据收集和数据预处理模块：负责收集排序学习的数据。在排序系统中，收集查询及其所对应的文本，一个查询一般可对应多个文本。例如：查询词为新浪，对应的文本有：新浪首页，新浪邮箱，新浪百度百科等等。之后，为每一个查询-文本对数据提取相应的特征，排序数据有3种特征：查询特征，文本特征，及查询-文本特征。例如：查询词在文本出现的频率，文本PageRank值等。提取特征之后，每一个查询-文本对数据可用一个特征向量表示。对每一个查询所包含的所有文本特征都进行归一化操作，并存储于存储模块。Data collection and data preprocessing module: responsible for collecting data for ranking learning. In a ranking system, queries and their corresponding texts are collected, and a query generally corresponds to multiple texts. For example: the query word is Sina, and the corresponding texts include: Sina Homepage, Sina Mailbox, Sina Baidu Encyclopedia, etc. Afterwards, corresponding features are extracted for each query-text pair data, and there are three types of features for ranking data: query features, text features, and query-text features. For example: the frequency of query words appearing in the text, the PageRank value of the text, etc. After feature extraction, each query-text pair data can be represented by a feature vector. All text features contained in each query are normalized and stored in the storage module.

训练集构建模块：负责构建排序学习所需要的训练数据。由于对数据进行人工标注代价巨大，因此无法对存储的海量数据全部进行标注。本发明构建训练集包括以下2个阶段。在初始阶段，即没有任何训练数据的情况下，从所存储的海量数据中随机挑选部分数据进行人工标注，作为初始训练集，并调用排序学习模块开始训练初始的排序模型。之后，训练集迭代构建。在迭代阶段，执行选样算法，本实施例中所述选样算法包括以下步骤：Training set building block: responsible for building the training data required for ranking learning. Due to the huge cost of manual labeling of data, it is impossible to label all the massive data stored. The construction of the training set in the present invention includes the following two stages. In the initial stage, that is, without any training data, some data is randomly selected from the stored massive data for manual labeling as the initial training set, and the ranking learning module is called to start training the initial ranking model. After that, the training set is constructed iteratively. In the iterative phase, the sample selection algorithm is executed, and the sample selection algorithm described in this embodiment includes the following steps:

1)对存储的未标注数据添加高斯噪声：为存储的每一个查询-文本对数据样本，对其特征独立的添加高斯噪声。添加之后，每一个样本会拥有m个噪声样本，这些噪声样本在特征空间中环绕在原始样本的周围，并且具有高斯分布。1) Add Gaussian noise to stored unlabeled data: For each stored query-text pair data sample, add Gaussian noise independently to its features. After adding, each sample will have m noise samples, which surround the original sample in the feature space and have a Gaussian distribution.

2）调用排序模型进行相关性预测获取预测分数分布：得噪声样本之后，调用排序学习模块里的排序模型，对噪声样本进行相关性预测。并以原始样本周围的m个噪声样本的预测分数作为原始样本在当前排序模型下的预测分数分布。2) Call the ranking model for correlation prediction and obtain the distribution of prediction scores: After obtaining the noise samples, call the ranking model in the ranking learning module to predict the correlation of the noise samples. And the predicted scores of m noise samples around the original sample are used as the predicted score distribution of the original sample under the current ranking model.

3)将预测分数分布转换为预测排序分布：根据查询-文本结构，将所得的预测分数分布转换为预测排序分布，假设一个查询包含有n个文本。3) Convert the predicted score distribution to the predicted ranking distribution: According to the query-text structure, convert the obtained predicted score distribution into the predicted ranking distribution, assuming that a query contains n texts.

■查询的预测排序分布转换过程为：■The conversion process of the predicted ranking distribution of the query is:

(a)根据查询所包含的n个文本的预测分数分布，为每一个文本随机抽样一个预测分数，可以获得n个预测分数；(a) Randomly sample a prediction score for each text according to the prediction score distribution of n texts contained in the query, and n prediction scores can be obtained;

(b)对这n个预测分数按照大小进行排序,可以获得一个预测排序；(b) Sort the n prediction scores according to their size, and a prediction ranking can be obtained;

(c)重复以上(a)(b)步骤多次，即可以得到查询在当前排序模型下的预测排序分布。(c) Repeat the steps (a) and (b) above several times to obtain the predicted ranking distribution of the query under the current ranking model.

■文本的预测排序分布转换过程为：■The conversion process of the predicted ranking distribution of the text is:

(a)固定该文本所属查询的其余n-1个文本的预测分数为原始样本的预测值；(a) Fix the prediction scores of the remaining n-1 texts of the query to which the text belongs to the prediction value of the original sample;

(b)从当前文本的预测分数分布下随机抽取一个分数，加上其余n-1个预测分数，可以获得n个预测分数；(b) Randomly select a score from the predicted score distribution of the current text, and add the remaining n-1 predicted scores to obtain n predicted scores;

(c)对这n个预测分数按照大小进行排序，可以获得一个预测排序；(c) Sorting the n prediction scores according to size can obtain a prediction ranking;

(d)重复以上(b)(c)步骤多次，即可以得到文本的预测排序分布。(d) Repeat the above steps (b) and (c) multiple times to obtain the predicted ranking distribution of the text.

4）选择排序预测最不确定的样本数据：得到查询和文本的预测排序分布后，从排序分布中进行抽样，用DCG-like gain函数计算每一个抽样排序的gain值：4) Select the most uncertain sample data for sorting prediction: After obtaining the predicted sorting distribution of the query and text, sample from the sorting distribution, and use the DCG-like gain function to calculate the gain value of each sampling sort:

$g g ((r r)) = = {Σ Σ}_{i i = = 11}^{n no} ((22^{s the s (({r r}_{i i}))} - - 11)) / / {log log}_{22} ((11 + + {r r}_{i i}))$

其中，s(r_i)表示在位置r_i的文本与查询的预测相关性。r表示一个排序。从排序分布中抽样N个排序，相应的，可以计算出N个gain值，并计算出方差。根据以上算法步骤可知，当前排序模型下，排序预测不确定的样本，相应的N个gain值变化较大，导致方差变大。根据计算所得方差值，主动选择构建排序模型的训练样本，根据查询-文本结构，共有三种选样方式。where s( _{ri) denotes the predicted relevance of the text at position r i} _to the query. r indicates a sort. N sorts are sampled from the sort distribution, correspondingly, N gain values can be calculated, and the variance can be calculated. According to the above algorithm steps, under the current ranking model, for samples with uncertain ranking predictions, the corresponding N gain values change greatly, resulting in larger variance. According to the calculated variance value, the training samples for constructing the ranking model are actively selected. According to the query-text structure, there are three sample selection methods.

■基于查询：从存储的未标注数据中选择方差最大的K个查询。■ Query-based: Select K queries with the largest variance from the stored unlabeled data.

■基于文本：从存储的未标注数据中选择方差最大的K个文本。■ Text-based: Select K texts with the largest variance from the stored unlabeled data.

■基于查询-文本两阶段：从存储的未标注数据中首先选择方差最大的K₁个查询，随后在选定的查询中选择方差最大的K₂个文本。■Based on query-text two-stage: first select K ₁ queries with the largest variance from the stored unlabeled data, and then select K ₂ texts with the largest variance among the selected queries.

执行完以上选样算法后，对选择出的样本数据进行标注。在排序学习的数据中，需要按照查询-文本相关性进行标注，查询-文本的相关性分为5个等级：{非常相关，比较相关，相关，不相关，极不相关}，分别标注为{4,3,2,1,0}。表1给出了一个标注示例：After executing the above sampling algorithm, mark the selected sample data. In the data of ranking learning, it needs to be marked according to the query-text correlation. The query-text correlation is divided into 5 levels: {very relevant, relatively relevant, relevant, irrelevant, and extremely irrelevant}, which are respectively marked as { 4,3,2,1,0}. Table 1 gives an example of labeling:

表1查询-文本相关性标注示例Table 1 Query-Text Relevance Annotation Example

对数据进行相关性标注之后，存储于数据存储模块。After the data is marked with relevance, it is stored in the data storage module.

排序学习模块：调用数据存储模块中的有标注的训练数据，训练排序模型，本发明中使用了目前具有代表性的排序模型：梯度提升决策树，作为排序方程：Sorting learning module: call the marked training data in the data storage module, and train the sorting model. In the present invention, a representative sorting model is used: gradient lifting decision tree, as the sorting equation:

$f f ((x x)) = = {Σ Σ}_{t t = = 11}^{n no} {λ λ}_{t t} {T T}_{t t} ((x x))$

其中，每一个树模型T_t(x)通过训练数据对特征空间划分进行构建。参数λ_t通过训练集中样本数据的预测误差的函数梯度方向进行搜索获取最优值。在本实施例中，共构建了100个树模型，T=100。Wherein, each tree model T _t (x) is constructed by dividing the feature space by training data. The parameter λ _t is searched through the gradient direction of the function gradient of the prediction error of the sample data in the training set to obtain the optimal value. In this embodiment, a total of 100 tree models are constructed, T=100.

以上所述的训练集构建模块和排序学习模块，交互迭代进行，从而可以充分利用训练集和排序模型之间的关系，实现排序系统准确排序的目的。迭代终止条件可以有以下几种方式：人工设置迭代次数，排序系统性能满足用户需求等。本实施例中，迭代终止条件为：人工设置迭代次数：10次。The above-mentioned training set construction module and ranking learning module are carried out interactively and iteratively, so that the relationship between the training set and the ranking model can be fully utilized to achieve the purpose of accurate ranking by the ranking system. The iteration termination condition can be in the following ways: manually setting the number of iterations, sorting system performance to meet user needs, and so on. In this embodiment, the iteration termination condition is: the number of iterations is manually set: 10 times.

预测排序模块：以上迭代过程结束之后，对于用户新输入的查询，系统检索其相对应的文本，调用排序学习模块中训练好的排序模型对查询-文本的相关性进行预测，并按照预测相关分数值的大小排序，最后将排序好的结果返回给用户。Predictive ranking module: After the above iterative process is over, for the user's newly input query, the system retrieves the corresponding text, calls the sorting model trained in the ranking learning module to predict the query-text correlation, and predicts the relevance score according to the prediction The values are sorted by size, and finally the sorted results are returned to the user.

数据存储模块：存储两部分数据，根据查询-文本结构，一部分顺序存储未标注的查询-文本对数据，另一部分顺序存储查询-文本对数据及其相关性标注。其中，未标注数据被训练集构建模块调用，用于执行初始阶段的选样，以及迭代阶段的选样算法。当被选中标注相关性后，转存为有标注数据。有标注数据被排序学习模块调用，用于训练排序模型。Data storage module: store two parts of data, according to the query-text structure, one part sequentially stores unmarked query-text pair data, and the other part sequentially stores query-text pair data and their correlation annotations. Among them, the unlabeled data is called by the training set building block to perform the sampling in the initial stage and the sampling algorithm in the iterative stage. When it is selected to mark the correlation, it will be saved as marked data. The labeled data is called by the ranking learning module to train the ranking model.

实施效果Implementation Effect

依据上述技术方案，使用百度提供的商业搜索排序数据。在排序系统中，选取信息检索领域中度量排序准确度的最重要的二个评价标准，DCG@10和MAP，并与现有的具有代表性的技术进行了比较。According to the above technical solution, the commercial search ranking data provided by Baidu is used. In the ranking system, DCG@10 and MAP, two most important evaluation criteria for measuring ranking accuracy in the field of information retrieval, are selected, and compared with the existing representative technologies.

为了充分测试本发明带来的技术效果，减少系统的随机误差。排序系统的每一项性能比较测试，都独立实验10次并计算平均结果，并以此平均值作为最终的性能指标。其中，RBSS是本发明的排序系统，RAND是目前商业排序系统广泛使用的技术。为叙述方便，后缀标记-Q表示在排序系统训练集构建模块中执行基于查询的选样算法，-D表示在排序系统训练集构建模块中执行基于文本的选样算法，-QD表示在排序系统训练集构建模块中执行基于查询-文本两阶段的选样算法。In order to fully test the technical effect brought by the present invention and reduce the random error of the system. Each performance comparison test of the sorting system is independently tested 10 times and the average result is calculated, and the average value is used as the final performance index. Among them, RBSS is the sorting system of the present invention, and RAND is a technique widely used in commercial sorting systems at present. For the convenience of description, the suffix mark -Q indicates that the query-based sampling algorithm is implemented in the training set building block of the ranking system, -D indicates that the text-based sampling algorithm is implemented in the training set building block of the ranking system, and -QD indicates that the sampling algorithm is implemented in the ranking system training set The query-text two-stage sampling algorithm is implemented in the training set building block.

表2给出了性能比较结果。可以看出，在DCG@10和MAP这2个性能评价指标上，本发明排序系统的性能都明显好于已有的技术RAND。Table 2 shows the performance comparison results. It can be seen that the performance of the sorting system of the present invention is significantly better than that of the existing technology RAND in terms of the two performance evaluation indexes of DCG@10 and MAP.

表2排序系统性能比较Table 2 Sorting system performance comparison

为了进一步验证本发明带来的性能提升，在训练集构建模块的选样算法的执行迭代过程中，每一步迭代后均进行了排序系统性能对比测试，图3给出了迭代过程中，系统性能的对比结果。横坐标表示排序系统中选样算法的迭代次数，纵坐标是排序系统的性能评价指标，分值越高，表明排序系统的排序越准确。可以看出，在迭代构建训练集过程中，本发明的排序系统性均能持续好于目前广泛使用的技术RAND。In order to further verify the performance improvement brought by the present invention, in the iterative process of executing the sample selection algorithm of the training set building block, a sorting system performance comparison test was carried out after each step of iteration. Figure 3 shows the system performance during the iterative process. comparison results. The abscissa indicates the number of iterations of the sampling algorithm in the sorting system, and the ordinate is the performance evaluation index of the sorting system. The higher the score, the more accurate the sorting system is. It can be seen that, in the process of iteratively constructing the training set, the sorting system of the present invention is consistently better than the currently widely used technology RAND.

从以上测试可以看出，本发明的基于机器学习的排序系统，能够有效的提高排序准确度，从而更好的展现检索结果，满足用户的需求。本发明在百度商业搜索排序中带来了明显的技术效果，具有重要的应用价值。It can be seen from the above tests that the machine learning-based sorting system of the present invention can effectively improve the sorting accuracy, thereby better displaying retrieval results and meeting user needs. The invention brings obvious technical effects in Baidu commercial search sorting, and has important application value.

以上对本发明的具体实施例进行了描述。需要理解的是，本发明并不局限于上述特定实施方式，本领域技术人员可以在权利要求的范围内做出各种变形或修改，这并不影响本发明的实质内容。Specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to the specific embodiments described above, and those skilled in the art may make various changes or modifications within the scope of the claims, which do not affect the essence of the present invention.

Claims

1. A sorting system based on machine learning, characterized in that the system includes: data collection and data preprocessing module, training set building module, sorting learning module, predictive sorting module and data storage module, wherein:

The data collection and data preprocessing module collects the data required for sorting learning, obtains the query and its text, which is unlabeled text, and performs feature processing and data normalization on the collected query-text data Operation, stored in the data storage module;

The training set construction module, the training set construction includes two stages: the initial stage, from the stored massive unlabeled query-text data, randomly select some data according to the correlation manual labeling, construct the initial training set; the iterative stage, call the sorting learning The module executes the sample selection algorithm based on the sorting model that has been built, selects the most informative samples from the unlabeled data, performs query-text correlation labeling, and expands the existing training set; stores the labeled data in data storage module;

The sorting learning module calls the marked data in the data storage module to train the sorting model and build the sorting model;

The above-mentioned training set construction module and the sorting learning module are carried out interactively and iteratively, and the iteration termination conditions include manually setting the number of iterations or the performance of the sorting system to meet user needs; the sampling algorithm first adds Gaussian noise to the stored unlabeled data, Then call the sorting model to perform correlation prediction to obtain the predicted score distribution, then convert the predicted score distribution into a predicted ranking distribution, and finally select the sample data with the most uncertain sorting prediction; the selection of the most uncertain sample data for sorting prediction refers to: get After querying and predicting the sorting distribution of text, sample from the sorting distribution, use the DCG-likegain function to calculate the gain value of each sampling sort, calculate N gain values, and calculate the variance, and use the calculated variance to measure the sorting prediction. Determine, select the sample data, where N is the number of samples for predicting sorting;

The predictive sorting module, after the iteration, invokes the sorting model established by the sorting learning module, searches for the user's query, retrieves its corresponding text, performs correlation prediction, sorts according to the correlation size, and presents it to the user;

The data storage module stores two parts of data: one part stores unlabeled data, and the other part stores labeled data, wherein the unlabeled data is called by the training set construction module for sample selection, and when selected and labeled, transfer to It is stored as labeled data, and the labeled data is called by the ranking learning module to train the ranking model.

2. The sorting system based on machine learning according to claim 1, characterized in that, said data collection and data preprocessing module: responsible for collecting data for sorting learning, collecting queries and their corresponding texts, and one query can correspond to Multiple texts, and then extract corresponding features for each query-text pair data. There are three types of features for sorting data: query features, text features, and query-text features. After extracting features, each query-text pair data is used A feature vector representation, all text features contained in each query are normalized and stored in the storage module.

3. The sorting system based on machine learning according to claim 1 or 2, wherein the training set construction module executes a sample selection algorithm in an iterative phase, and the sample selection algorithm comprises the following steps:

1) Add Gaussian noise to the stored unlabeled data: For each stored query-text pair data sample, add Gaussian noise independently to its features. After adding, each sample will have m noise samples. These noise samples are in The feature space surrounds the original sample and has a Gaussian distribution;

2) Call the ranking model for correlation prediction and obtain the prediction score distribution: After obtaining the noise samples, call the ranking model in the ranking learning module to predict the correlation of the noise samples, and use the prediction scores of m noise samples around the original sample as The predicted score distribution of the original sample under the current ranking model;

3) Transform the predicted score distribution into a predicted ranking distribution: according to the query-text structure, convert the resulting predicted score distribution into a predicted ranking distribution;

4) Select the most uncertain sample data for sorting prediction: After obtaining the predicted sorting distribution of the query and text, sample from the sorting distribution, and use the DCG-like gain function to calculate the gain value of each sampling sort:

g g ((r r)) = = {Σ Σ}_{i i = = 11}^{n no} ((22^{s the s (({r r}_{i i}))} - - 11)) / / {log log}_{22} ((11 + + {r r}_{i i}))

Among them, n is the number of texts; s(r _i ) represents the predicted correlation between the text at position r _i and the query, and r represents a ranking; N rankings are sampled from the ranking distribution, and N gain values are calculated accordingly , and calculate the variance; actively select the training samples for constructing the ranking model according to the calculated variance; after executing the above sampling algorithm, mark the selected sample data, and in the data for sorting learning, it is necessary to follow the query-text correlation For labeling, the query-text correlation is divided into five levels: {very relevant, relatively relevant, relevant, irrelevant, and extremely irrelevant}, which are marked as {4, 3, 2, 1, 0} respectively.

4. The sorting system based on machine learning according to claim 3, wherein the sorting learning module calls the marked training data in the data storage module, trains the sorting model, and uses the gradient boosting decision tree model as the sorting equation :

f f ((x x)) = = {Σ Σ}_{t t = = 11}^{T T} {λ λ}_{t t} {T T}_{t t} ((x x))

Among them, each tree model T _t (x) is constructed by dividing the feature space from the training data, and the parameter λ _t is searched to obtain the optimal value through the function gradient direction of the prediction error of the sample data in the training set, and T represents the number of tree models.