
CN115563284B - A deep multi-instance weakly supervised text classification method based on semantics - Google Patents

A deep multi-instance weakly supervised text classification method based on semantics

Info

Publication number
CN115563284B
CN115563284B (application number CN202211301646.4A)
Authority
CN
China
Prior art keywords
text
instance
package
topic
positive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211301646.4A
Other languages
Chinese (zh)
Other versions
CN115563284A (en)
Inventor
刘小洋
尹娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Wanzhida Technology Transfer Center Co ltd
Original Assignee
Chongqing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Technology
Priority to CN202211301646.4A
Publication of CN115563284A
Application granted
Publication of CN115563284B
Legal status: Active


Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/953 - Querying, e.g. by the use of web search engines
    • G06F16/9536 - Search customisation based on social or collaborative filtering
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G06F40/30 - Semantic analysis
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a semantic-based deep multi-instance weakly supervised text classification method comprising the following steps: S1, organize multiple comment texts under the same social content into a text bag and assign a label to the bag, thereby obtaining topic-related bags; S2, extract keywords representing the topic from the topic-related bags and construct a topic-related vector from these keywords; S3, feed the topic-related vector and the word vectors into a dual-branch neural network as vector pairs, and predict each text instance with the dual-branch network to obtain the category of the text instance and the category of the bag. The method can classify text effectively even though social media text data changes rapidly, is difficult to annotate, and labeled data is severely scarce.

Description

A semantic-based deep multi-instance weakly supervised text classification method

Technical Field

The present invention relates to the technical field of natural language processing, and in particular to a semantic-based deep multi-instance weakly supervised text classification method.

Background Art

With the development of the Internet and social media, a huge amount of text data is generated every day, while data analysts usually only care about data related to their own field or a specific topic, so field- or topic-related data must be filtered out of this mass of data. During data crawling, the crawling rules are set relatively loosely in order to obtain as much data as possible; this ensures the richness of the data but also introduces a lot of data that is irrelevant to the topic. Before analysis, the data truly related to the topic must be filtered out to guarantee accurate results. This task can be regarded as a binary classification problem: is the text related to the topic or not? Given a corresponding binary-labeled dataset, it is just a simple supervised text classification problem. In the Internet era, however, many new online expressions appear every year and natural language changes much faster, which means the timeliness of labeled data degrades quickly. At the same time, topics on social media shift with events and content is updated frequently; unless a large amount of labeled data is continuously refreshed, the labeled data may differ greatly from the data produced by new events. The prior art so far offers no text classification method for this scenario, in which social media text data changes rapidly, annotation is difficult, and labeled data is severely scarce.

Summary of the Invention

The present invention aims to solve at least the technical problems existing in the prior art, and in particular innovatively proposes a semantic-based deep multi-instance weakly supervised text classification method.

To achieve the above object, the present invention provides a semantic-based deep multi-instance weakly supervised text classification method comprising the following steps:

S1: organize multiple comment texts under the same social content into a text bag, and exploit the inherent hierarchical structure and topic relevance of social media data to automatically assign labels to text bags, thereby obtaining topic-related bags;

S2: extract keywords representing the topic from the topic-related bags and construct a topic-related vector from these keywords. Constructing the topic-related vector avoids data imbalance and reduces collection and computation costs.

S3: feed the topic-related vector and the word vectors into a dual-branch neural network as vector pairs; the dual-branch network predicts each text instance, yielding the category of the text instance and the category of the bag.

Further, S2 comprises the following steps:

S2-1: cluster the topic-related bags into several topics with the LDA algorithm and extract topic keywords;

S2-2: embed each keyword of the topic with the fastText model, and take the average of the vectors of the strongly topic-related keywords as the topic-related vector.

Denote the vector of topic keyword $c^T_k$ by $v^T_{c_k}$. The topic-related vector is then:

$$V_T = \frac{1}{K}\sum_{k=1}^{K} v^T_{c_k}$$

where $V_T$ is the topic-related vector;
$K$ is the total number of strongly topic-related keywords.
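As an illustration of this averaging step, the following minimal sketch builds $V_T$ from fastText keyword embeddings. The keyword list and model path are hypothetical, not taken from the patent.

```python
import numpy as np
import fasttext  # assumes the `fasttext` package and a pretrained .bin model

# Hypothetical strongly topic-related keywords (the K keywords of topic T).
keywords = ["necklace", "bracelet", "earring", "pendant", "gemstone"]

model = fasttext.load_model("cc.en.300.bin")  # model path is an assumption

# V_T = (1/K) * sum_k v_{c_k}: the mean of the keyword embeddings.
V_T = np.mean([model.get_word_vector(w) for w in keywords], axis=0)
print(V_T.shape)  # (300,)
```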

Further, the method also comprises converting the vector pair into a dense vector that is input to the dual-branch neural network.

The dense vector is obtained by taking the element-wise product of the word vector $v_{w_i}$ with the topic-related vector $V_T$ and superimposing it on the word vector:

$$\tilde{v}_{w_i} = [\,v_{w_i},\; v_{w_i} \times V_T\,]$$

where $\tilde{v}_{w_i}$ is the superimposed word vector, the input of the dual-branch neural network;
$[\cdot,\cdot]$ denotes the concatenation of two vectors;
$v_{w_i}$ denotes the word vector;
$v_{w_i} \times V_T$ is the element-wise product of $v_{w_i}$ and $V_T$;
$\times$ denotes element-wise multiplication;
$V_T$ denotes the topic-related vector.

The input of the dual-branch neural network can therefore be expressed as:

$$x_{ij} = [\,\tilde{v}_{w_1}, \tilde{v}_{w_2}, \ldots, \tilde{v}_{w_L}\,]$$

where $x_{ij}$ is the $j$-th text in the $i$-th bag, the input of the dual-branch neural network;
$\tilde{v}_{w_1}$ is the first superimposed word vector, $\tilde{v}_{w_2}$ the second, and $\tilde{v}_{w_L}$ the $L$-th;
$L$ is the number of words the text contains;
$[\cdot,\cdot,\ldots,\cdot]$ denotes a sequence of vectors.

Practice shows that taking the element-wise product of the word vectors $v_{w_i}$ in the text with $V_T$ and superimposing the result on the word vectors is more helpful for feature extraction and classification by the neural network.
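A minimal sketch of this input construction, assuming plain NumPy arrays (names are illustrative): each word vector is concatenated with its element-wise product with $V_T$, doubling the dimension.

```python
import numpy as np

def superimpose(word_vecs: np.ndarray, V_T: np.ndarray) -> np.ndarray:
    """word_vecs: (L, D) word vectors of one text; V_T: (D,) topic vector.
    Returns the (L, 2D) network input [v_w, v_w x V_T] per word."""
    interaction = word_vecs * V_T  # element-wise product, broadcast over L words
    return np.concatenate([word_vecs, interaction], axis=1)

# A 50-word text with 300-d embeddings yields a (50, 600) input x_ij.
x_ij = superimpose(np.random.randn(50, 300), np.random.randn(300))
print(x_ij.shape)  # (50, 600)
```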

Further, the following operations are performed in the dual-branch neural network:

A latent variable $Z = \{z_{ij}\}$ is introduced to characterize the relationship between text instances and bags, where $z_{ij}$ is the degree to which the $j$-th instance of the $i$-th bag contributes to bag $i$ being a positive bag, $0 \le z_{ij} \le 1$. If $Z$ follows a distribution $p(z)$, the probability that the $i$-th bag is positive can be expressed as:

$$p(Y_i = 1 \mid X_i) = f_{j\in\{1,\ldots,N\}}\big\{\,p_\theta(y_{ij}=1 \mid x_{ij}, z_{ij}) \cdot [z_{ij} - \gamma]\,\big\} \qquad (7)$$

where $X_i$ is the $i$-th bag;
$Y_i$ is the label of the $i$-th bag;
$f$ is the operator mapping from text instances to the bag;
$N$ is the number of bags;
$p_\theta(y_{ij}=1 \mid x_{ij}, z_{ij})$ is the probability that instance $x_{ij}$ is predicted to be 1;
$y_{ij}$ is the label of the $j$-th text in the $i$-th bag;
$x_{ij}$ is the $j$-th text in the $i$-th bag;
$z_{ij}$ is the degree to which the $j$-th instance of the $i$-th bag contributes to bag $i$ being positive;
$\gamma$ is the average proportion of positive instances in a bag.

This links the category of the bag to the categories of the text instances, so that the categories of the text instances themselves can be learned through the category of the bag.

Further, $f$ is the mean operator. In the problem scenario addressed by this invention, the positive instances contained in a positive bag are not sparse; using the maximum or an attention mechanism as the mapping operator easily predicts false-positive bags and reduces precision, so the mean operator is adopted.
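As a sketch of formula (7) with the mean operator as $f$ (inputs are assumed to come from the two branches; this is illustrative, not the patent's code):

```python
import numpy as np

def bag_positive_score(p_pos: np.ndarray, z: np.ndarray, gamma: float) -> float:
    """p_pos: per-instance probabilities p_theta(y_ij = 1 | x_ij, z_ij);
    z: per-instance contributions z_ij in [0, 1];
    gamma: average proportion of positive instances in a bag.
    Implements formula (7) with f chosen as the mean operator."""
    return float(np.mean(p_pos * (z - gamma)))

# Two confident positive instances and one negative instance, gamma = 0.5:
score = bag_positive_score(np.array([0.9, 0.8, 0.1]),
                           np.array([0.95, 0.90, 0.05]), 0.5)
```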

Further, the following operations are also performed in the dual-branch neural network:

In multi-instance text classification, the learning objective is to minimize the cross entropy of the bags:

$$L_i = -\big[\,Y_i' \log p(Y_i \mid X_i) + (1 - Y_i')\log\big(1 - p(Y_i \mid X_i)\big)\,\big] \qquad (8)$$

where $L_i$ is the cross entropy of the $i$-th bag;
$p(Y_i \mid X_i)$ is the probability that bag $X_i$ is predicted as $Y_i$, the output of branch one;
$X_i$ is the input features of the $i$-th text bag, the input of branch one;
$Y_i$ is the predicted value of the $i$-th text bag;
$Y_i'$ is the label of the $i$-th text bag.

For a positive bag, $Y_i' = 1$ and $1 - Y_i' = 0$, so $L_i$ becomes:

$$L_i = -\log p(Y_i = 1 \mid X_i) \qquad (9)$$

For a negative bag, $Y_i' = 0$, so $L_i$ becomes:

$$L_i = -\log\big(1 - p(Y_i = 1 \mid X_i)\big) \qquad (10)$$

All text instances in a negative bag are negative, and when all $p_\theta(y_{ij} \mid x_{ij}, z_{ij})$ and $z_{ij}$ are negative, $L_i$ is 0, reaching its minimum.

For a positive bag, minimizing $L_i$ is equivalent to maximizing the likelihood $p(Y_i \mid X_i)$. Substituting formula (7) gives:

$$\log p(Y_i = 1 \mid X_i) = \log f_{j\in\{1,\ldots,N\}}\big\{\,p_\theta(y_{ij}=1 \mid x_{ij}, z_{ij}) \cdot [z_{ij} - \gamma]\,\big\} \qquad (11)$$

Variational inference is then introduced into formula (11):

$$\log p_\theta(y_{ij} \mid x_{ij}) = \log \mathbb{E}_{z\sim p(z)}\big[p_\theta(y_{ij} \mid x_{ij}, z)\big] \;\ge\; \mathbb{E}_{Z\sim q}\big[\log p_\theta(y_{ij} \mid x_{ij}, z > \gamma)\big] - \mathrm{KL}\big(q(z)\,\|\,p(z)\big) \qquad (12)$$

where $x_{ij}$ is the $j$-th text in the $i$-th bag;
$y_{ij}$ is the label of the $j$-th text in the $i$-th bag;
$z_{ij}$ is the contribution of the $j$-th instance of the $i$-th bag to bag $i$ being positive;
$\gamma$ is the average proportion of positive instances in a bag;
$p_\theta(y_{ij} \mid x_{ij})$ is the probability that instance $x_{ij}$ is predicted as $y_{ij}$;
$p(z)$ is the distribution $p$ of the contribution $z$;
$p_\theta(y_{ij} \mid x_{ij}, z)$ is the probability that instance $x_{ij}$ is predicted as $y_{ij}$ given that the contribution of $x_{ij}$ is $z$;
$p_\theta(y_{ij} \mid x_{ij}, z > \gamma)$ is the probability that instance $x_{ij}$ is predicted as $y_{ij}$ given that the contribution of $x_{ij}$ satisfies $z > \gamma$;
$q(z)$ is the distribution $q$ of the contribution $z$;
$\mathbb{E}_{Z\sim q}[\cdot]$ is the expectation under the condition that $Z$ follows the distribution $q$.

Following the idea of variational inference, the invention approximates $q(z)$ with $q_\phi(z \mid x, Y)$.

Further, the neural network is any one of TextCNN, LSTM, and Transformer.

Further, the method also comprises: S4, optimizing the network parameters of the dual-branch neural network:

S4-1: the E-step takes KL minimization as its objective and optimizes the parameters $\phi$. The objective function is:

$$L_E = \mathrm{KL}\big(q_\phi(z \mid x, Y)\ \big\|\ p_\theta(z \mid x, Y)\big) \qquad (15)$$

where $\mathrm{KL}(q_\phi(z \mid x, Y)\,\|\,p_\theta(z \mid x, Y))$ denotes KL minimization between $q_\phi(z \mid x, Y)$ and $p_\theta(z \mid x, Y)$;

$q_\phi(z \mid x, Y)$ is the output of branch one of the dual-branch neural network, the category of the text instance;
$p_\theta(z \mid x, Y)$ is the output of branch two of the dual-branch neural network, the relationship between the text instance and the bag;
$z$ is the contribution;
$Y$ generically denotes the category of a bag;
$x$ generically denotes a text instance in that bag;
$\theta$ and $\phi$ are the parameters of the two branches.

Formula (15) computes the difference between the distributions $q_\phi(z \mid x, Y)$ and $p_\theta(z \mid x, Y)$; minimizing $L_E$ makes distribution $q$ approach distribution $p$ step by step, achieving the goal of narrowing the gap between the lower bound and the evidence.

In the neural network determined by parameter $\theta$, the true distribution that $z$ follows is analogous to the posterior distribution $p_\theta(y \mid x)$:

$$L_E = \mathrm{KL}\big(q_\phi(z \mid x, Y=1)\ \big\|\ p_\theta(y \mid x)\big) \qquad (16)$$

where $\mathrm{KL}(q_\phi(z \mid x, Y=1)\,\|\,p_\theta(y \mid x))$ denotes KL minimization between $q_\phi(z \mid x, Y=1)$ and $p_\theta(y \mid x)$;
$q_\phi(z \mid x, Y=1)$ is the output of branch one of the dual-branch neural network under the condition $Y = 1$, the category of the text instance;
$p_\theta(y \mid x)$ is the value computed by the neural network determined by parameter $\theta$ with $\theta$ fixed; for a negative bag, $p_\theta(y \mid x)$ of every instance is 0.

Formula (16) is a further evolution of formula (15). Since a negative bag contains only irrelevant text, i.e., in a negative bag the category of every text instance is 0 and the contribution of every instance to the bag being positive is 0, negative bags can be treated as supervised learning. What remains are the positive bags, i.e., $Y = 1$. Under this condition, approximating the true distribution $p_\theta(z \mid x, Y)$ that $z$ follows by $p_\theta(y \mid x)$ yields formula (16).

Therefore

$$p' = \begin{cases} \log p_\theta(y_{ij} \mid x_{ij}), & Y_i = 1 \\ 0, & Y_i = 0 \end{cases}$$

$$L_E = \mathrm{KL}\big(q_\phi(z_{ij} \mid x_{ij}, Y_i = 1)\ \big\|\ p'\big) \qquad (17)$$

where $\mathrm{KL}(q_\phi(z_{ij} \mid x_{ij}, Y_i=1)\,\|\,p')$ denotes KL minimization between $q_\phi(z_{ij} \mid x_{ij}, Y_i=1)$ and $p'$;
$q_\phi(z_{ij} \mid x_{ij}, Y_i=1)$ is the output of branch one of the dual-branch neural network under the condition $Y = 1$, the category of the text instance;
$Y_i = 1$ means the $i$-th bag is positive;
$x_{ij}$ is the $j$-th text in the $i$-th bag;
$y_{ij}$ is the contribution of the $j$-th instance in the $i$-th bag to the bag being positive;
$p' = p_\theta(y \mid x)$ is the value computed by the neural network determined by parameter $\theta$ with $\theta$ fixed; for a negative bag, $p_\theta(y \mid x)$ of every instance is 0.

In formula (17), $p'$ replaces $p_\theta(y \mid x)$. Since $\log p_\theta(y_{ij} \mid x_{ij})$ increases and decreases consistently with $p_\theta(y \mid x)$, $\log p_\theta(y_{ij} \mid x_{ij})$ is used in place of $p_\theta(y \mid x)$ to speed up convergence. When the bag is negative, i.e., $Y = 0$, all probability values of the distribution are set to 0, which gives formula (17).

S4-2: the M-step fixes the parameters $\phi$ so that, for the same text, the KL divergence between $q_\phi(z \mid x, Y)$ and $p_\theta(z \mid x, Y)$ stays unchanged, and then maximizes the expectation by optimizing the parameter $\theta$. The expectation of the log-likelihood is expressed as:

$$L_M = \mathbb{E}_{Z\sim q}\big[\log p_\theta(y_{ij} \mid x_{ij}, z > \gamma)\big] \qquad (18)$$

where $L_M$ is the expectation of the log-likelihood;
$\mathbb{E}_{Z\sim q}[\cdot]$ is the expectation under the condition that $Z$ follows the distribution $q$;
$p_\theta(y_{ij} \mid x_{ij}, z > \gamma)$ is the probability that instance $j$ in the text bag is predicted as positive text by the $\theta$ branch when $z > \gamma$;
$z$ is the contribution;
$\gamma$ is a hyperparameter, the average proportion of positive text instances over all positive bags.

According to the definition in formula (7), $L_M$ can be split into two parts at the boundary $z = \gamma$: for $z > \gamma$ only $y_{ij} = 1$ is meaningful, while for $z < \gamma$ only $y_{ij} = 0$ is meaningful. The M-step cost function $L_M$ can therefore be decomposed as:

$$L_M = \mathbb{E}_{Z\sim q}\big[\,\mathbb{1}(z > \gamma)\,\log p_\theta(y_{ij}=1 \mid x_{ij}) + \mathbb{1}(z < \gamma)\,\log p_\theta(y_{ij}=0 \mid x_{ij})\,\big] \qquad (19)$$

where $\gamma$ is a hyperparameter, the average proportion of positive text instances over all positive bags;
$p_\theta(y_{ij}=1 \mid x_{ij})$ is the probability that text instance $j$ in bag $i$ is positive text;
$p_\theta(y_{ij}=0 \mid x_{ij})$ is the probability that text instance $j$ in bag $i$ is negative text;
$y_{ij} = 1$ means text instance $j$ in bag $i$ is positive;
$y_{ij} = 0$ means text instance $j$ in bag $i$ is negative.

Formula (19) is obtained by splitting formula (18) at the boundary $z = \gamma$.

Formula (19) can be transformed into a cross entropy:

$$L_M = -\big[\,y'_{ij}\log p_\theta(y_{ij} \mid x_{ij}) + (1 - y'_{ij})\log\big(1 - p_\theta(y_{ij} \mid x_{ij})\big)\,\big] \qquad (20)$$

$p_\theta(y_{ij} \mid x_{ij})$ in formula (20) has the same meaning as $p_\theta(y_{ij}=1 \mid x_{ij})$ in formula (19): both are the probability that the text is positive. $1 - p_\theta(y_{ij} \mid x_{ij})$ is the probability that the text is negative, consistent with $p_\theta(y_{ij}=0 \mid x_{ij})$. Formula (20) is the discretized form of formula (19), discretized into a cross entropy.

where $y'_{ij}$ is the pseudo label of $y_{ij}$; in a positive bag it is determined by $z$, and in a negative bag all pseudo labels are 0:

$$y'_{ij} = \begin{cases} 1, & z_{ij} > \gamma \\ 0, & z_{ij} \le \gamma \end{cases}, \qquad \gamma = \mathrm{mean}(z)$$

where $\mathrm{mean}(\cdot)$ denotes averaging;
$\gamma$ is the average proportion of positive instances in a bag.
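A sketch of this pseudo-labeling rule (thresholding $z_{ij}$ at $\gamma = \mathrm{mean}(z)$ is an assumption consistent with the text above, not the patent's exact code):

```python
import numpy as np

def pseudo_labels(z: np.ndarray, bag_label: int) -> np.ndarray:
    """z: contributions z_ij of one bag's instances; bag_label: 1 or 0.
    In a positive bag y'_ij is determined by z; in a negative bag all are 0."""
    if bag_label == 0:
        return np.zeros_like(z)
    gamma = z.mean()  # gamma = mean(z), per the formula above
    return (z > gamma).astype(np.float32)
```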

The parameter optimization of the invention differs from the traditional EM algorithm: because variational inference is introduced, the E-step computes the expectation by narrowing the difference between the evidence and the variational lower bound, optimizing the parameters $\phi$ while narrowing the lower bound, and the M-step maximizes the expectation by optimizing the parameters $\theta$.
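The alternating schedule can be sketched as follows (a PyTorch-style sketch under stated assumptions: `q_branch` and `p_branch` output per-instance probabilities, a Bernoulli KL stands in for $L_E$, and pseudo labels follow the rule sketched above; the real SDMIC details may differ):

```python
import torch
import torch.nn.functional as F

def bernoulli_kl(q, p, eps=1e-8):
    """Element-wise KL(Bern(q) || Bern(p)) between per-instance probabilities."""
    return q * torch.log((q + eps) / (p + eps)) + \
           (1 - q) * torch.log((1 - q + eps) / (1 - p + eps))

def train_step(p_branch, q_branch, opt_p, opt_q, x, bag_label):
    # E-step: fix the p branch, minimize L_E = KL(q || p') to update phi.
    with torch.no_grad():
        p_target = p_branch(x)            # p', frozen during the E-step
    loss_E = bernoulli_kl(q_branch(x), p_target).mean()
    opt_q.zero_grad(); loss_E.backward(); opt_q.step()

    # M-step: fix the q branch, minimize the cross entropy L_M against
    # pseudo labels y'_ij derived from the contributions z_ij.
    with torch.no_grad():
        z = q_branch(x)
        y_pseudo = torch.zeros_like(z) if bag_label == 0 else (z > z.mean()).float()
    loss_M = F.binary_cross_entropy(p_branch(x), y_pseudo)
    opt_p.zero_grad(); loss_M.backward(); opt_p.step()
    return loss_E.item(), loss_M.item()
```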

In summary, owing to the above technical solution, the present invention has the following advantages:

1) By introducing latent variables and variational inference, the idea of applying a dual-branch deep network to traditional multi-instance learning effectively improves instance-level classification in multi-instance text classification tasks.

2) The proposed weakly supervised text classification method, SDMIC, exploits the characteristics of social media text data itself, taking the tags created by social media users or the categories assigned by the platform as weak supervision information, and uses this weak supervision to optimize the model effectively, thereby addressing the pain points that social media text data changes rapidly, is difficult to annotate, and labeled data is severely scarce.

Additional aspects and advantages of the present invention will be given in part in the following description, will in part become apparent from it, or will be learned through practice of the present invention.

Brief Description of the Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and easy to understand from the description of the embodiments in conjunction with the following drawings, in which:

FIG. 1 is a schematic diagram of the process by which the invention extracts topic keywords with the LDA model and computes the topic-related vector.

FIG. 2 is a schematic diagram of the topic relevance learning model and the contribution learning model of an embodiment of the invention.

FIG. 3 shows the learning speed trends of SDMIC and supervised learning: FIG. 3(a) is the trend of the prediction accuracy (Acc) on the test set, and FIG. 3(b) is the trend of the F1 score.

Detailed Description

Embodiments of the present invention are described in detail below; examples of the embodiments are shown in the accompanying drawings, in which identical or similar reference numerals throughout denote identical or similar elements or elements with identical or similar functions. The embodiments described below with reference to the drawings are exemplary, intended only to explain the present invention, and are not to be construed as limiting it.

Given the facts described in the Background Art, purely supervised text classification applied to extracting topic data from social media carries high time and labor costs; unsupervised or weakly supervised classification that requires no labeled data is more practical.

Moreover, although text data on social media is voluminous and fast-changing, it arises from user behavior: the data are interrelated and have a specific hierarchical structure, which offers clues for data filtering. For example, in Tieba, users usually go to the board related to a topic to open threads and discuss; most of the text produced in a board relates to the board's theme, but much non-topical content appears as well. The theme of the whole board is easy to obtain, but it is hard to determine which individual posts in the internal discussion are topic-related and which are not. If the data of one board is treated as a bag whose topic is known, the texts in the bag that relate to the topic are positive texts and the rest are negative texts; a bag contains positive texts and may also contain negative texts. This problem matches inexact supervision in weakly supervised learning, and given the structure of the data it can be further framed as multi-instance learning (MIL) under inexact supervision.

Multi-instance learning splits into two branches according to the focus of classification: classifying bags, and classifying the instances inside bags. In filtering topic-related text data, the goal is to extract the positive texts from each bag and filter out the negative texts using all available clues; the emphasis of the problem is therefore how to improve instance-level classification.

1 Related Technologies

1.1 Multiple Instance Learning

Multiple instance learning is a typical form of inexact learning, in which label granularity is coarser than the granularity of the actual task; it was introduced by Dietterich et al. in the mid-1990s. Dietterich proposed the multi-instance idea, applied it to drug activity prediction, and proposed the APR learning rules. The APR rules have three candidate algorithms, the noise-tolerant standard algorithm, the "outside-in" algorithm, and the "inside-out" algorithm, all of which essentially search for an axis-parallel rectangle boundary enclosing the positive instances. Early on, MIL, like other machine learning algorithms, was mainly implemented with traditional statistical learning methods. MIL also has two different prediction focuses: the relatively easy prediction of the bag itself, and the harder prediction of the instances inside the bag. For example, the EM-DD algorithm proposed by Zhang, Q. et al. combines the EM algorithm with diverse density and turns MIL toward predicting the instances in the bag. After that, support vector machine (SVM) methods were introduced into MIL; typically, the MISVM algorithm proposed by S. Andrews et al. treats the MIL problem as a maximum-margin problem, and extending the SVM learning method yields a mixed-integer quadratic program that can be solved heuristically. Traditional MIL also includes sbMIL and stMIL proposed by Bunescu et al. and MICA proposed by Mangasarian et al.; some of these methods focus on predicting the instances inside bags, but are accurate only for the few instances in positive bags with obvious features.

With the development of deep learning, MIL began to incorporate deep neural networks. A typical approach feeds one bag as one batch: the texts in the bag pass through the network to extract features, instance-level prediction probabilities are computed, and an operator then combines the probabilities of all texts in the bag into a bag probability, which is supervised with the bag label to optimize the network. Ilse, M. et al. proposed a gated attention mechanism as the operator mapping instance-level predictions to bag-level predictions and applied the method to image recognition. Wang, Y., Li et al. compared five operators (max pooling, mean pooling, linear softmax, exponential softmax, and attention) for object localization. Shi, X. et al. built on Ilse, M. et al. and integrated attention into the loss. Attention usually concentrates on salient features, so these methods work well for bags with sparse positive instances but perform poorly when a bag contains many positive instances. The optimization process of the present invention operates at the instance level, so prediction works well whether the positive instances in a bag are sparse or dense.

Luo, Z. et al. introduced a dual-branch neural network into MIL and optimized the two networks with the EM algorithm, but their method targets action localization and optimizes both branches with cross-entropy losses, which is not effective for text classification. Li, B. et al. applied a dual-branch MIL network to lesion localization in medical images: one branch extracts features of the large image (the bag), the other extracts features of the small images (the instances), and the two are fused to classify the large image. The present invention likewise introduces a dual-branch neural network into multi-instance learning, but applies it to text classification; moreover, it introduces variational inference, converting the bag classification problem into instance classification prediction plus latent-variable distribution prediction, and optimizes the dual-branch network with cross entropy and KL divergence.

1.2 Weakly Supervised Text Classification

The mainstream approach to text classification is still supervised learning, but as the Internet generates ever more data, weakly supervised methods keep being explored. Hingmire, S. et al. proposed pre-assigning one or more labels to the topics extracted by LDA using prior knowledge from corpus statistics, then classifying documents by the resulting topic proportions; with this method the classification process is entirely constrained by the extraction rules and the scope of the training data, and the classification effect is poor. Meng, Y. et al. proposed generating labeled pseudo-documents from seed information, training one neural network on the pseudo-documents while self-training another on real data, with the two models sharing parameters for text classification. This method resembles that of Hingmire, S. et al.: generating pseudo-labeled data with rules introduces noise, and by the "garbage in, garbage out" principle of model training, the classification effect is likewise limited. Li, C. et al. built a topic semantic space by extracting topic-related keywords and trained a model on existing topic-labeled data to judge the relevance between text and the topic space; when a topic without labeled data appears, only its keywords need be extracted, and the model predicts the topic relevance of the text. This method is more advanced than the previous two, but it only solves effective prediction for topics that may arise in the future, and it still needs the support of a large amount of training data covering many known topics, which is entirely different from the problem this invention addresses. The problem this patent solves is classifying the topic relevance of text with no precisely labeled data at all; it therefore proposes using tags provided by the original posters on social platforms, or the platform's own community category labels, as weak supervision signals, and introduces the multi-instance idea to classify every text accurately even with no precisely labeled training data.

2 Proposed Method

This invention uses deep learning to implement end-to-end multi-instance text classification and performs binary topic-relevance classification on social media text data. In the definition of multi-instance learning, the instance is the smallest unit; here an instance is a single text, and a bag contains multiple instances. There are two kinds of bags, positive and negative: a positive bag contains at least one positive text, while a negative bag must not contain any positive text. In this invention, several related texts from social media are assembled into bags (for example, all comments under one Bilibili video can form a text bag, as can all replies to one Tieba post). To classify a given topic, content related to the topic is searched on social media and all of it is extracted to form multiple positive bags. When collecting data, the tag given by the social media platform serves as the topic label of the bag (for example, a Tieba board's category, or the category tag of a user-uploaded video). Following the definition of multi-instance learning, to complete the weakly supervised instance-level classification task, text data from several other topics must also be collected to form negative bags alongside the positive bags collected under the topic, enabling contrastive learning. Since the goal of the task is binary relevance classification of the text instances inside positive bags, the bag label is a weak supervision signal, and instance classification is completed in weakly supervised form.
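As a concrete illustration of this bag-building scheme (a sketch with hypothetical field names, not the patent's data pipeline), the comments under one post form one bag and the platform-assigned category supplies its weak label:

```python
from collections import defaultdict

def build_bags(records, topic):
    """records: dicts like {"post_id": ..., "category": ..., "text": ...}
    (hypothetical field names). One bag per post; label 1 if the platform
    category matches the target topic (positive bag), else 0 (negative bag)."""
    bags = defaultdict(lambda: {"texts": [], "label": 0})
    for r in records:
        bag = bags[r["post_id"]]
        bag["texts"].append(r["text"])
        bag["label"] = 1 if r["category"] == topic else 0
    return list(bags.values())
```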

To introduce topic features more effectively, beyond end-to-end deep multi-instance text classification, this invention also uses a statistical learning method to extract topic keywords before learning and embeds them, obtaining the topic's key features (the keywords) and feeding these features into deep learning (DL). In natural language, every topic has some uniquely representative words; because these representative keywords are similar in context and semantics, their positions in the dense vector space are close, and the small local region of the vector space containing all keyword positions can represent the topic's vector space. Exploiting this, the invention tokenizes the texts of all bags under the same topic, extracts strongly topic-related keywords with the seeded Latent Dirichlet Allocation algorithm (hereinafter Guided-LDA) combined with manual selection of topic-related clusters, and uses them as topic representative words to build the strongly related vector of topic T. The construction process of the topic-related vector is shown in FIG. 1.

The reference vector of topic T and the dense vector of the text instance, vectorized in the same space, form an input pair fed into a neural network model with two branches for prediction: one branch predicts the degree to which the text instance contributes to the bag being topic-related, and the other predicts whether the text instance is topic-related.

The forward-propagation architecture of SDMIC is shown in FIG. 2. All texts of one bag form one batch; after vectorization and the inner product with the topic-related vector, the texts are assembled into input data and fed into the two neural network branches. The branches may be convolutional networks or other neural network layers; this invention chose convolutional neural networks in its experiments and practical application. The outputs of the different layers are fused and flattened into a one-dimensional vector, converted into category predictions by a fully connected layer, and turned into prediction probabilities by softmax. The above is the forward computation of the whole network. In actual classification only the forward pass is computed; in training, the forward pass is computed first and then the network's prediction loss (cost) is computed. The whole training process is divided into an E-step and an M-step: in the E-step, the parameters of the p branch are fixed and the q branch is optimized with the KL divergence as the cost function; in the M-step, the parameters of the q branch are fixed and the p branch is optimized with the cross entropy as the cost function. KLD is the abbreviation of Kullback-Leibler divergence, denoted $L_E$; CE is the abbreviation of cross entropy, denoted $L_M$.
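A minimal PyTorch sketch of one such branch (a single convolution with max pooling; the patent fuses the outputs of several layers, which is simplified away here):

```python
import torch
import torch.nn as nn

class Branch(nn.Module):
    """One TextCNN-style branch: convolve the (L, 2D) input, pool, classify."""
    def __init__(self, in_dim=600, n_filters=128, kernel=3):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, n_filters, kernel)
        self.fc = nn.Linear(n_filters, 1)

    def forward(self, x):                             # x: (bag size, L, 2D)
        h = torch.relu(self.conv(x.transpose(1, 2)))  # (bag size, filters, L')
        h = h.max(dim=2).values                       # max-pool over positions
        return torch.sigmoid(self.fc(h)).squeeze(-1)  # per-instance probability

p_branch = Branch()  # predicts instance topic relevance, p_theta(y | x)
q_branch = Branch()  # predicts instance contribution to the bag, z
```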

To optimize the parameters of the word-embedding vector space and of the two network branches, the invention takes the Kullback-Leibler divergence (hereinafter KL divergence) and the lower bound (hereinafter LB) as the loss functions of the two branches, adjusting the network parameters by minimizing the KL divergence and maximizing the LB.

To better achieve binary topic-relevance classification of text instances, keywords representing the topic are first extracted from the topic-related bags and used to build a topic-related vector; the topic-related vector and the text-instance vector are then fed into the neural network as a vector pair, the dual-branch network makes relevance predictions for the instances, and the instance categories and bag categories are obtained from the instance predictions.

2.1 Topic-Related Vector Construction

A hallmark of the MIL task is that label supervision is very weak. In practice, the training set is built from topic-related text bags and topic-unrelated text bags; because supervision is weak and the neural network's learning process is weakly controllable, if the topic is not constrained, the final instance classifier tends to predict as topic-related any text that appears in positive bags but not in negative bags. If the negative bags of the training set were large enough to cover all non-topical text this would not matter, but too much negative-bag data easily causes data imbalance while greatly increasing collection and computation costs. To avoid this problem, the invention adds the construction of a topic-related vector.

The invention extracts topic keywords with the LDA algorithm: all text instances in the positive bags of the training data are clustered into several topics by LDA, and a set of topics related to topic T, $\{T_1, \ldots, T_l\}$, and a set of unrelated topics, $\{\bar{T}_1, \ldots, \bar{T}_m\}$, are screened out; to minimize manual intervention, $l$ and $m$ should be as small as possible. The keywords of the related topic set are taken and ranked by their weights in related versus unrelated topics, yielding the $K$ keywords most relevant to topic T:

$$C = \{\,c^T_1, c^T_2, \ldots, c^T_K\,\}$$

where $C$ is the set of strongly related keywords of topic T;
$c^T_1$ is the top-ranked topic keyword.
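The keyword-ranking step can be sketched with plain LDA from gensim standing in for Guided-LDA (an assumption; the seeding and the manual screening of related topic clusters are reduced here to a list of topic indices):

```python
from gensim import corpora
from gensim.models import LdaModel

def topic_keywords(tokenized_texts, related_ids, num_topics=10, top_k=20):
    """tokenized_texts: token lists from all texts in the positive bags.
    related_ids: indices of LDA topics manually judged related to topic T."""
    dictionary = corpora.Dictionary(tokenized_texts)
    corpus = [dictionary.doc2bow(t) for t in tokenized_texts]
    lda = LdaModel(corpus, num_topics=num_topics, id2word=dictionary)
    scores = {}
    for tid in range(num_topics):
        for word, w in lda.show_topic(tid, topn=top_k):
            # weight in related topics counts for a word, in unrelated against it
            scores[word] = scores.get(word, 0.0) + (w if tid in related_ids else -w)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```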

Table 1 shows keywords extracted for the "Jewelry" and "Beauty" topics on the Amazon Review dataset, and Table 2 shows keywords for the "Car" and "Sports" topics on the Toutiao News dataset.

Table 1. Keywords of the "Jewelry" and "Beauty" topics extracted from the Amazon dataset (table provided as an image in the original).

Table 2. Keywords of the "Car" and "Sports" topics extracted from the Toutiao News dataset (table provided as an image in the original).

The invention embeds each keyword of the topic with the fastText model proposed by Joulin, A. et al., and takes the average of the word vectors of the strongly topic-related keywords as the topic-related vector. With topic keyword $c^T_k$ vectorized by word embedding as $v^T_{c_k}$, the topic-related vector can be expressed as:

$$V_T = \frac{1}{K}\sum_{k=1}^{K} v^T_{c_k}$$

where $K$ is the total number of strongly topic-related keywords;
$V_T$ is the topic-related vector.

As is well known, dense word embeddings not only map words into a vector space but also encode syntactic and semantic information; hidden correlations between two words can be obtained by simple operations on their vectors such as subtraction, inner product, or Euclidean distance. Mikolov, T. et al. showed that subtracting two word vectors captures the relation between two words, and Li, C.L., Zhou et al. proposed using both vector subtraction and the vector inner product to represent the interaction between two words. In this task, practice shows that taking the element-wise product of the word vector of the $i$-th word in the text (the text-instance vector $v_{w_i}$) with the topic-related vector $V_T$ and superimposing it on the word vector is more helpful for feature extraction and classification by the neural network. Text instances are therefore embedded as follows:

$$\tilde{v}_{w_i} = [\,v_{w_i},\; v_{w_i} \times V_T\,]$$

where $\tilde{v}_{w_i}$ is the superimposed word vector, the input of the dual-branch neural network; $[\cdot,\cdot]$ denotes the concatenation of two vectors, so if $v_{w_i}$ is $D$-dimensional then $\tilde{v}_{w_i}$ is $2D$-dimensional; and $v_{w_i} \times V_T$ is the element-wise multiplication of $v_{w_i}$ and $V_T$, with $\times$ denoting element-wise multiplication.

According to the above computation, the input of the dual-branch neural network can be expressed as:

$$x_{ij} = [\,\tilde{v}_{w_1}, \tilde{v}_{w_2}, \ldots, \tilde{v}_{w_L}\,]$$

where $L$ is the number of words the text contains, and $[\cdot,\cdot,\ldots,\cdot]$ denotes a sequence of vectors (a matrix).

3.2 Dual-Branch Prediction Neural Network

MIL is weakly supervised learning; the whole task has two levels, bag classification and instance classification. Because every bag in the training data carries a label, bag classification can be counted as supervised learning, while instances are unlabeled and can only be weakly supervised through bag labels. Different tasks emphasize the two levels differently; Maron, O. et al. pointed out that if the goal is instance classification and the positive instances in positive bags are not too sparse, the training process should depend on bag labels as little as possible. The task addressed by this invention emphasizes instance classification, and in the vast majority of bags the proportion of positive instances exceeds 50%, so the invention focuses on characterizing each instance.

Assume there are 2N text bags. The first N bags, related to topic T, are positive bags, denoted $[X_1, X_2, \ldots, X_N]$; the remaining N bags, collected from other topics and unrelated to topic T, are denoted $[X_{N+1}, X_{N+2}, \ldots, X_{2N}]$; each bag contains n text instances. The label of the $i$-th bag is $Y_i$, and the $j$-th text in the $i$-th bag, after processing, is denoted $x_{ij}$ and serves as the input feature of the neural network. The invention links the category of the bag to the categories of the text instances, so that the categories of the text instances themselves are learned through the bag category. A latent variable $Z = \{z_{ij}\}$ is introduced to characterize the relationship between text instances and bags, where $z_{ij}$ is the degree to which the $j$-th instance of the $i$-th bag contributes to bag $i$ being positive, $0 \le z_{ij} \le 1$, and $Z$ is assumed to follow the distribution $p(z)$:

$$z_{ij} \sim p(z) \qquad (6)$$

Then the probability that the $i$-th bag $Y_i$ is positive can be expressed as:

$$p(Y_i = 1 \mid X_i) = f_{j\in\{1,\ldots,N\}}\big\{\,p_\theta(y_{ij}=1 \mid x_{ij}, z_{ij}) \cdot [z_{ij} - \gamma]\,\big\} \qquad (7)$$

where $p_\theta(y_{ij}=1 \mid x_{ij}, z_{ij})$ is the probability that instance $x_{ij}$ is predicted to be 1 (i.e., positive), $\gamma$ is the average proportion of positive instances in a bag, and $f$ is the operator mapping from text instances to the bag. Common choices for $f$ include the maximum, the mean, and attention mechanisms; since in the problem scenario of this invention the positive instances in a positive bag are not sparse, the maximum and attention operators easily predict false-positive bags and reduce precision, so the mean operator is adopted.

3.2.1 Variational inference

In the multi-instance text classification addressed by this invention, the learning objective is to minimize the bag-level cross entropy:

L_i = −[Y_i′ log p(Y_i | X_i) + (1 − Y_i′) log(1 − p(Y_i | X_i))]    (8)

where L_i is the cross entropy of the i-th bag;

p(Y_i | X_i) is the probability that bag X_i is predicted as Y_i, the output of branch one;

X_i is the input features of the i-th text bag, the input of branch one;

Y_i is the predicted value of the i-th text bag;

Y_i′ is the label of the i-th text bag.

For a positive bag, Y_i′ = 1 and 1 − Y_i′ = 0, so L_i reduces to:

L_i = −log p(Y_i = 1 | X_i)    (9)

For a negative bag, Y_i′ = 0, so L_i reduces to:

L_i = −log(1 − p(Y_i = 1 | X_i))    (10)

By definition, all text instances in a negative bag are negative. When every p_θ(y_ij | x_ij, z_ij) and every z_ij is 0, that is, when every instance is predicted negative, p(Y_i = 1 | X_i) = 0 and L_i = −log(1 − 0) = 0, its minimum. Negative bags can therefore be learned with ordinary supervised logic.

For a positive bag, minimizing L_i = −log p(Y_i = 1 | X_i) is equivalent to maximizing the likelihood p(Y_i | X_i). Substituting formula (7) and marginalizing over the latent contribution z gives

log p(Y_i = 1 | X_i) = log ∫ f_{j∈{1,…,n}} { p_θ(y_ij = 1 | x_ij, z) · [z − γ] } p(z) dz    (11)

Variational inference is then introduced into formula (11). With a variational distribution q(z) in place of the intractable posterior, the log-likelihood decomposes as

log p(Y_i | X_i) = E_{Z~q}[ log p_θ(y_ij | x_ij, z) ] + KL( q(z|x) ‖ p(z|x, Y_i) )    (12)

Following the idea of variational inference, this invention approximates q(z) with q_φ(z|x), which is represented by one neural network, while p_θ(y_ij | x_ij) is represented by another. The method therefore introduces a dual-branch neural network: one branch predicts the category of each text instance, and the other predicts the relationship between a text instance and its bag.

The task has two prediction targets, p_θ(y_ij = 1 | x_ij) and z_ij, and this invention completes both with the dual-branch neural network. The backbone can be TextCNN, LSTM, or the currently popular Transformer; these architectures are well documented in the literature and are not repeated here. We simply write p_θ(y_ij = 1 | x_ij) and q_φ(z_ij | x_ij) for the outputs of the two branches, where θ and φ are the parameters of the two branches.

3.3 Network parameter optimization

As the description of the network structure in the preceding section shows, for every positive bag the input is X_i and the output p_θ(Y_i | X_i) is supervised by the bag label; the learning objective of the whole network is to maximize the log-likelihood:

L = log p(Y_i | X_i)    (13)

Combining Jensen's inequality with formula (12) further yields:

log p(Y_i | X_i) ≥ E_{Z~q}[ log p_θ(y_ij = 1 | x_ij) ]    (14)

By the definition of the evidence lower bound (ELBO) in statistics, log p(Y|X) can be regarded as the evidence, E_{Z~q}[log p_θ(y_ij = 1 | x_ij)] as the variational lower bound, and the gap between the two is KL(q(z|x) ‖ p(z|x, Y)).

The log-likelihood of p(Y_i | X_i) can be maximized with the EM algorithm. In the traditional EM algorithm, the E step computes the expectation, that is, the variational lower bound, and the M step finds the parameters that maximize that expectation. This invention introduces variational inference, so the E step computes the expectation by narrowing the gap between the evidence and the variational lower bound, optimizing the parameters φ while the bound is tightened, and the M step maximizes the expectation by optimizing the parameters θ.

3.3.1 E step: narrowing the gap

The goal of the E step is to narrow the gap between the lower bound and the evidence so that the bound approaches the evidence. From formula (14), the smaller the KL divergence, the closer the lower bound is to the evidence log p(Y|X). The E step therefore takes KL minimization as its objective and optimizes the parameters φ; the objective function is

L_E = KL( q_φ(z|x) ‖ p(z|x, Y) )    (15)

Both bag and instance classification are binary tasks, and in this task z and y_ij are related; for example, the contribution z of a negative text instance to a bag being judged positive can be regarded as 0. This invention therefore assumes that, in the neural network determined by the parameters θ, the true distribution that z follows is analogous to the posterior p_θ(y|x):

p(z | x, Y) ≈ p_θ(y | x)    (16)

For a text instance in a positive bag, p_θ(y = 1 | x) is the value computed by the θ-network with θ held fixed, while for a negative bag p_θ(y|x) of every instance should be 0. Hence the target distribution p′ is

p′ = p_θ(y = 1 | x) for instances of positive bags, and p′ = 0 for instances of negative bags,

and the E-step objective becomes

L_E = KL( q_φ(z|x) ‖ p′ )    (17)

In the sense of formula (17), L_E states that the contribution of a text instance to the bag being predicted positive and the relevance of that instance to the topic should follow the same distribution as closely as possible.
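A sketch of this E-step objective for a single bag, treating each z_ij and its target as a Bernoulli distribution and summing the elementwise KL terms of equation (17); PyTorch is assumed and the names are illustrative:

    import torch

    def e_step_loss(z, p_y_frozen, bag_label, eps=1e-8):
        # z: q_phi(z_ij|x_ij) outputs; p_y_frozen: p_theta(y_ij=1|x_ij) with theta fixed
        p_prime = p_y_frozen if bag_label == 1 else torch.zeros_like(z)
        p_prime = p_prime.clamp(eps, 1 - eps)      # target distribution p' of equation (17)
        z = z.clamp(eps, 1 - eps)
        kl = z * torch.log(z / p_prime) + (1 - z) * torch.log((1 - z) / (1 - p_prime))
        return kl.sum()                            # L_E = KL(q_phi || p') over the bag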

3.3.2 M step: maximizing the lower bound to optimize θ

By the definition of formula (14), the cost function of the whole problem comprises two parts: the expectation of the log-likelihood of the predicted probabilities, and the KL divergence between the latent variable and its true distribution. The E step minimizes the KL divergence so that the expectation passively approaches the maximum likelihood, holding θ fixed while adjusting the parameters φ. The M step holds φ fixed, so that for the same text the KL divergence between q_φ(z|x) and p_θ(z|x, Y) is unchanged, and then maximizes the expected log-likelihood by optimizing θ. The expected log-likelihood L_M is expressed as:

L_M = E_{Z~q}[ log p_θ(y_ij | x_ij, z > γ) ]    (18)

By the definition of formula (7), L_M can be split into two parts at z = γ: z > γ is meaningful only for y_ij = 1, while z < γ is meaningful only for y_ij = 0. The M-step cost function L_M therefore decomposes into

L_M = Σ_{z_ij>γ} log p_θ(y_ij = 1 | x_ij) + Σ_{z_ij<γ} log p_θ(y_ij = 0 | x_ij)    (19)

The first sum is the log-likelihood for z > γ and the second the log-likelihood for z < γ. Here γ is a hyperparameter that measures how large a contribution counts as an effective contribution; in practice it is set to the average proportion of positive text instances across all positive bags.

Formula (19) can be transformed into a cross entropy:

L_M = y′_ij log p_θ(y_ij | x_ij) + (1 − y′_ij) log(1 − p_θ(y_ij | x_ij))    (20)

where y′_ij is the pseudo-label of y_ij; in a positive bag it is determined by z, and in a negative bag it is all 0:

y′_ij = 1 if z_ij ≥ γ · mean(z_i), and y′_ij = 0 otherwise    (21)

γ is the average proportion of positive instances in a bag, in the range (0, 1). It is an empirical value determined by the density of positive instances in the data set; it has no bearing on whether a bag is judged positive, but it is closely tied to the recall of positive instances. mean(·) denotes averaging. According to formula (19), γ is the breakpoint that splits y_ij; the mean of the latent variable is introduced here to normalize γ so that it lies in the same numerical range as z_ij.

The same logic as the optimization of φ applies: in a negative bag no instance contributes to the bag's topic relevance, so the pseudo-label is set to 0; in a positive bag, an instance with a low topic-relevance probability cannot have a high contribution, so its contribution pseudo-label is likewise set to 0.

[Formulas (22) and (23), which formalize this assignment of contribution pseudo-labels, appear only as images in the source document.]
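For illustration, the pseudo-label rule of formulas (20) and (21) can be sketched as follows; NumPy is assumed and the helper name is hypothetical:

    import numpy as np

    def m_step_pseudo_labels(z, bag_label, gamma):
        # Equation (21): inside a positive bag, an instance is pseudo-labeled 1 when its
        # contribution exceeds gamma normalized by the bag's mean contribution;
        # every instance of a negative bag is pseudo-labeled 0.
        z = np.asarray(z, dtype=float)
        if bag_label == 0:
            return np.zeros_like(z)
        return (z >= gamma * z.mean()).astype(float)

    print(m_step_pseudo_labels([0.9, 0.7, 0.2, 0.05], bag_label=1, gamma=0.4))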

4 Experimental results analysis

4.1 Datasets

This invention uses three datasets, AG News, Amazon Reviews, and Toutiao News, to compare SDMIC against other methods. The topics of AG News and Toutiao News are domain categories; their news texts are officially published, so format and content are relatively standardized. Amazon Reviews corresponds to products in a niche domain and consists of users' subjective evaluations of the products; its topics are finer-grained, and the text is user-generated, with looser format and grammar, comparable to user-generated content on social media. AG News and Amazon Reviews are English text and Toutiao News is Chinese data, which tests SDMIC's adaptability to different languages.

Table 3. Description of the experimental datasets

[Table 3 appears only as an image in the source document.]

AG News: AG News contains four categories, Business, Sci_Tech, Sports, and World; the training set contains 30,000 texts per category and the test set 1,900 per category. Each category of the training and test sets is processed separately: the texts of that category serve as topic-relevant positive instances, and the other three categories supply topic-irrelevant instances. Each bag contains 50 texts, with the ratio of relevant to irrelevant instances chosen at random between 1:2 and 4:1 to form the positive bags; negative bags consist entirely of negative instances, and the texts in negative bags do not overlap the negative texts inside positive bags. This simulates the structure and topic-relevance profile of text on social media.
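A sketch of this bag-construction procedure follows; the pool names and helper functions are illustrative. A relevant-to-irrelevant ratio of 1:2 corresponds to a positive share of 1/3, and 4:1 to a share of 4/5:

    import random

    def build_positive_bag(pos_pool, neg_pool, bag_size=50, rng=random):
        share = rng.uniform(1/3, 4/5)              # positive share between 1:2 and 4:1
        n_pos = round(bag_size * share)
        bag = rng.sample(pos_pool, n_pos) + rng.sample(neg_pool, bag_size - n_pos)
        rng.shuffle(bag)
        return bag, 1                              # bag label Y_i = 1

    def build_negative_bag(neg_pool, bag_size=50, rng=random):
        return rng.sample(neg_pool, bag_size), 0   # all instances topic-irrelevant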

Amazon Reviews: Amazon Reviews exists in multiple versions; this invention adopts the classic 2013 version, which covers 24 product groups. Four products with a moderate volume of reviews are selected as potential positive topics; the reviews of each product serve as the positive texts of that topic, and the reviews of other products as negative text instances. Bags are assembled in the same way as for AG News, and the bags of each topic are split 4:1 into training and test sets.

Toutiao News: Toutiao News contains 12 topics. This experiment selects the 4 topics with the largest text volume as potential positive topics; the texts of each topic serve as its topic-relevant positive texts, and data randomly drawn from the other 11 topics serve as irrelevant texts. Bags are assembled in the same way as for AG News, and the bags of each topic are split 4:1 into training and test sets.

After this data preparation, as shown in Table 3, the three datasets yield 12 topic bag collections in total.

4.2 Experimental process and parameter settings

To demonstrate that the proposed SDMIC really is advantageous under equal conditions, the experiments run not only the proposed method on the datasets above but also other unsupervised and weakly supervised methods for comparison; in addition, to measure the gap between SDMIC and supervised methods that use manually labeled data, supervised text classification is run with the same neural network structure and parameters.

4.2.1 Comparison algorithms

Guilded-LDA: Guilded-LDA extends LDA by attaching seed keywords to each topic, so that the seeds partially constrain the direction of clustering. In the experiments, the topic keywords extracted for each topic seed part of the clusters, which become the topic-relevant clusters, while the remaining clusters with empty seed lists act as topic-irrelevant clusters, thereby producing a classification.

MISVM and SbMIL: The MISVM algorithm treats the MIL problem as a maximum-margin problem and extends SVM learning to solving a mixed-integer quadratic program. It solves for the maximum margin between positive and negative bags, treats the bag margin as the instance margin, and predicts instance polarity. The method was originally evaluated on the MUSK dataset; here the input feature extraction is modified to apply it to text classification. SbMIL, like MISVM, builds on the SVM algorithm.

Weighted-MIL: A multi-instance regression method. After vectorizing every instance in a bag, it estimates a per-category weight for each instance vector, computes the weighted average of all instances in the bag, and passes that weighted average through an operator to obtain the bag category.

Attention base and Gated attention base: A bag is fed in as one batch; after the texts in the bag pass through the neural network for feature extraction, instance-level predicted probabilities are computed, an operator combines the probabilities of all texts in the bag into a bag probability, and the network is optimized against the bag label. Attention base uses an attention mechanism as the operator to integrate text probabilities into the bag probability; Gated attention base adds a gating mechanism on top of the attention mechanism.

CNN-supervised: A generic convolutional text classification algorithm with words as the embedding unit.

4.2.2 Experimental parameter settings

In the experiments, LDA first clusters the texts of the positive bags into topics. Since the combined data of all positive bags is set to a 3:2 ratio of positive to negative instances and the negative instances come from more than 10 topic categories, the number of clustered topics is set to 20. The relevant topics are determined by manual confirmation; the top 50 keywords of all topics are then extracted, the ratio of each keyword's topic weight to its non-topic weight is computed, and the top 20 topic keywords are selected by that ratio ranking.
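This keyword-selection step can be sketched with scikit-learn as follows; positive_bag_texts and the manually confirmed topic ids are placeholders, not values from the patent:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    import numpy as np

    vec = CountVectorizer(max_features=20000)
    X = vec.fit_transform(positive_bag_texts)            # all texts from the positive bags
    lda = LatentDirichletAllocation(n_components=20, random_state=0).fit(X)
    vocab = np.array(vec.get_feature_names_out())

    tw = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # topic-word weights
    relevant = [3, 7]                                    # topic ids confirmed manually
    other = [t for t in range(20) if t not in relevant]

    cand = np.argsort(-tw[relevant].max(axis=0))[:50]    # top-50 words of the relevant topics
    ratio = tw[relevant].max(axis=0)[cand] / (tw[other].mean(axis=0)[cand] + 1e-12)
    keywords = vocab[cand[np.argsort(-ratio)][:20]]      # top-20 by topic/non-topic weight ratio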

A dictionary is generated for each dataset to fix the size of the network's word embedding layer, and the embedding layer is then initialized with the fasttext English and Chinese pre-trained models.
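A sketch of this embedding initialization, assuming the official fasttext Python package and its pre-trained vectors, with `vocabulary` standing for the dictionary generated above:

    import numpy as np
    import fasttext
    import fasttext.util

    fasttext.util.download_model('en', if_exists='ignore')        # fetches cc.en.300.bin
    ft = fasttext.load_model('cc.en.300.bin')
    init = np.stack([ft.get_word_vector(w) for w in vocabulary])  # one row per dictionary word
    # copy `init` into the network's embedding layer before training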

Small neural network structures such as TextCNN or BiLSTM can serve as the feature extractor; the network parameters φ and θ are trained in the manner described above. In practical applications, pre-trained large language models such as BERT or RoBERTa can also supply the initial parameters, after which the model is fine-tuned with a very small learning rate. The goal of this experiment is to compare SDMIC against unsupervised methods and other weakly supervised methods and to measure the gap to supervised classification, not to explore how different classical neural networks perform on this task; since two network branches must be trained, experimenting with large pre-trained models would be inefficient, so throughout the experiments every method is built on exactly the same TextCNN network structure.

The hyperparameter γ measures the proportion of positive instances within a positive bag. In the news datasets used here the bags are constructed, and the average share of positive instances exceeds that of negative instances, so γ is set to 0.4 accordingly. In real social media applications, γ is set according to the average proportion of topic-relevant text.

The experiments run on a single Tesla T4 GPU. During training, both SDMIC and the supervised binary classifier use a learning rate of 1e-5, with a maximum of 200 iterations and over-fitting detection for early stopping: SDMIC stops when the test-set loss fails to decrease for 20 consecutive iterations, and the supervised binary classifier stops when its test-set loss fails to decrease for 5 consecutive iterations.
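The early-stopping rule described above can be sketched as follows (patience 20 for SDMIC, 5 for the supervised baseline); the class is illustrative:

    class EarlyStopper:
        def __init__(self, patience=20):
            self.patience, self.best, self.bad = patience, float('inf'), 0

        def should_stop(self, test_loss):
            # reset the counter on improvement, otherwise count a stale iteration
            if test_loss < self.best:
                self.best, self.bad = test_loss, 0
            else:
                self.bad += 1
            return self.bad >= self.patience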

For the other methods, the traditional multi-instance algorithms take CWB-type features as input; in this experiment the CountVectorizer and TfidfTransformer modules of the sklearn library extract tf-idf features for each dataset, which are fed into the algorithms for training to predict every text instance in the test-set bags. The other deep learning methods use the same neural network structure and hyperparameter settings as SDMIC; the weakly supervised methods take the same text bags as input and train on the bag labels as supervision, while CNN-supervised trains on the label of each individual text.
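The tf-idf feature extraction for the traditional multi-instance baselines follows the stated sklearn modules, for example (train_texts is a placeholder for a dataset's texts):

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    counts = CountVectorizer().fit_transform(train_texts)   # raw term counts per text instance
    X_train = TfidfTransformer().fit_transform(counts)      # tf-idf features fed to MISVM, SbMIL, etc.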

4.3 Experimental results

4.3.1 Performance analysis

(1) Evaluation metrics

Common model evaluation metrics include Accuracy (hereafter Acc), Precision, Recall, and the F1 value. Without loss of generality, this invention evaluates the algorithm models with the two metrics Acc and F1.

Acc measures the average prediction accuracy of the model over all test text instances:

Acc = (TP + TN) / (TP + TN + FP + FN)

Precision is the share of true positives among the instances predicted positive, i.e., the precision of positive text instances:

Precision = TP / (TP + FP)

Recall is the share of correctly predicted instances among all truly positive instances, i.e., the recall of positive text instances:

Recall = TP / (TP + FN)

The F1 value jointly considers the precision and recall of positive text instances; F1 is high only when both Precision and Recall are high:

F1 = 2 × Precision × Recall / (Precision + Recall)

This invention therefore evaluates model performance with Acc and F1 together.
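Both metrics can equivalently be computed with scikit-learn, where y_true and y_pred are instance-level labels and predictions:

    from sklearn.metrics import accuracy_score, f1_score

    acc = accuracy_score(y_true, y_pred)   # Acc over all test text instances
    f1 = f1_score(y_true, y_pred)          # F1 of the positive (topic-relevant) class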

(2) Performance comparison analysis

Experiments on the three datasets AG News, Toutiao News, and Amazon Reviews show that SDMIC improves substantially over unsupervised topic clustering and the other weakly supervised text classification methods in both Acc and F1, while the gap to supervised classification that relies entirely on labeled data is very small.

Table 4. Comparison of accuracy and F1 values on AG News

[Table 4 appears only as an image in the source document.]

Table 5. Comparison of accuracy and F1 values on Toutiao News

[Table 5 appears only as an image in the source document.]

Table 6. Comparison of accuracy and F1 values on Amazon Reviews

[Table 6 appears only as an image in the source document.]

Tables 4 to 6 show that SDMIC performs markedly better than the other unsupervised and weakly supervised methods on all three datasets. In principle, through the dual-branch neural network and the EM algorithm, it converts traditional multi-instance learning into prediction over individual texts, computing the loss function and optimizing the network at the level of single texts, and the experimental results confirm exactly this. Although the other methods also aim at instance-level prediction, they ultimately optimize the model at the bag level, so their instance-level classification is naturally hard to control.

Meanwhile, across topics the test-set accuracy is on average only 3.219% below that of supervised classification, and the F1 value only 2.602% below. This shows that SDMIC can learn to a large extent from the bag labels and the semantic features of the text.

In addition, the results across the different datasets indicate that SDMIC is applicable to relevance filtering of data in different languages and for topics of different granularity.

4.3.2 Training speed analysis

Besides prediction accuracy, this invention also compares how fast the different methods learn during training and how the recall of positive texts evolves. Figure 3 shows, with the Car topic of Toutiao short news as training and test data, the trends of the test-set accuracy Acc and the F1 score during training for SDMIC and for the supervised classification method.

Figure 3(a) plots the test-set accuracy against the number of training iterations. As a weakly supervised method whose architecture contains two deep networks optimized in turn in E-M fashion, SDMIC indeed learns much more slowly than supervised learning under identical conditions. Figure 3(b) plots the change in recall: the supervised method's recall responds more sharply, but SDMIC's recall converges more stably.

Although the supervised method does have an advantage in learning speed, in practical engineering its complete reliance on manually labeled data is a major drawback for social media text mining; large and continuous labeling efforts cost enormous time and manpower, and next to that cost the slower learning speed during training is almost negligible.

In summary, the proposed SDMIC was comparatively analyzed across languages (Chinese and English), text types, and topics; the new method was tested and evaluated on multiple topics of the AG News, Toutiao News, and Amazon Reviews datasets and achieved the desired weakly supervised text classification performance in every case.

Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will appreciate that various changes, modifications, substitutions and variations can be made to these embodiments without departing from the principles and spirit of the present invention; the scope of the present invention is defined by the claims and their equivalents.

Claims (5)

1. A semantics-based deep multi-instance weakly supervised text classification method, characterized by comprising the following steps:

S1: organizing multiple comment texts under the same social content into text bags and assigning labels to the bags, thereby obtaining topic-relevant bags;

S2: extracting keywords that represent the topic from the topic-relevant bags, and constructing a topic-relevance vector from the keywords;

S3: feeding the topic-relevance vector and the word vectors as vector pairs into a dual-branch neural network, and predicting the text instances through the dual-branch neural network to obtain the category of each text instance and of each bag;

wherein the following operations are performed in the dual-branch neural network:

a latent variable Z = {z_ij} is introduced to characterize the relationship between text instances and bags, where z_ij denotes the contribution of the j-th instance of the i-th bag to bag i being positive, 0 ≤ z_ij ≤ 1; if Z follows the distribution p(z), the probability that the i-th bag is positive can be expressed as:

p(Y_i = 1 | X_i) = f_{j∈{1,…,N}} { p_θ(y_ij = 1 | x_ij, z_ij) · [z_ij − γ] }    (7)

where f is the operator mapping from text instances to bags, and f is the mean operator; N denotes the number of bags; p_θ(y_ij = 1 | x_ij, z_ij) denotes the probability that instance x_ij is predicted to be 1;

wherein the following operations are also performed in the dual-branch neural network:

in multi-instance text classification, the learning objective is to minimize the bag cross entropy:

L_i = −[Y_i′ log p(Y_i | X_i) + (1 − Y_i′) log(1 − p(Y_i | X_i))]    (8)

where L_i denotes the cross entropy of the i-th bag; p(Y_i | X_i) denotes the probability that bag X_i is predicted as Y_i, the output of branch one; X_i denotes the input features of the i-th text bag, the input of branch one; Y_i denotes the predicted value of the i-th text bag; Y_i′ denotes the label of the i-th text bag;

for a positive bag, Y_i′ = 1 and 1 − Y_i′ = 0, so L_i is expressed as:

L_i = −log p(Y_i = 1 | X_i)    (9)

for a negative bag, Y_i′ = 0, so L_i is expressed as:

L_i = −log(1 − p(Y_i = 1 | X_i))    (10)

all text instances in a negative bag are negative, and when every p_θ(y_ij | x_ij, z_ij) and every z_ij is 0, L_i is 0 and reaches its minimum;

for a positive bag, minimizing L_i is equivalent to maximizing the likelihood p(Y_i | X_i); substituting formula (7) gives

log p(Y_i = 1 | X_i) = log ∫ f_{j∈{1,…,N}} { p_θ(y_ij = 1 | x_ij, z) · [z − γ] } p(z) dz    (11)

variational inference is then introduced into formula (11):

log p(Y_i | X_i) = E_{Z~q}[ log p_θ(y_ij | x_ij, z) ] + KL( q(z|x) ‖ p(z|x, Y_i) )    (12)

where x_ij denotes the j-th text in the i-th bag; y_ij denotes the label of the j-th text in the i-th bag; z_ij denotes the contribution of the j-th instance of the i-th bag to bag i being positive; γ is the average proportion of positive instances in a bag; p_θ(y_ij | x_ij) denotes the probability that instance x_ij is predicted as y_ij; p(z) denotes the p-distribution of the contribution z; p_θ(y_ij | x_ij, z) denotes the probability that instance x_ij is predicted as y_ij given contribution z; p_θ(y_ij | x_ij, z > γ) denotes the probability that instance x_ij is predicted as y_ij given contribution z > γ; q(z) denotes the q-distribution of the contribution z; E_{Z~q}[·] denotes the mean under the condition that Z follows the q-distribution.
2. The semantics-based deep multi-instance weakly supervised text classification method according to claim 1, characterized in that S2 comprises the following steps:

S2-1: clustering the topic-relevant bags into several topics using the LDA algorithm and extracting topic keywords;

S2-2: embedding each keyword of a topic with the fasttext model, and taking the average vector of the keywords strongly related to the topic as the topic-relevance vector;

denoting the embedding of topic keyword t_k as v_{t_k}, the topic-relevance vector is expressed as:

V_T = (1/K) Σ_{k=1…K} v_{t_k}

where V_T denotes the topic-relevance vector and K denotes the total number of keywords strongly related to the topic.
3. The semantics-based deep multi-instance weakly supervised text classification method according to claim 1, characterized by further comprising: converting the vector pairs into dense vectors that are input into the dual-branch neural network;

the dense vector is obtained by taking the inner product of the word vector w and the topic-relevance vector V_T and superimposing the result onto the word vector:

w̃ = [w, w × V_T]

where w̃ is the superimposed word vector, the input of the dual-branch neural network; [·,·] denotes the concatenation of two vectors; w denotes the word vector; × denotes element-wise multiplication; V_T denotes the topic-relevance vector;

the input of the dual-branch neural network can therefore be expressed as:

x_ij = [w̃_1, w̃_2, …, w̃_L]

where x_ij denotes the j-th text in the i-th bag, the input of the dual-branch neural network; w̃_1, w̃_2 and w̃_L denote the first, second and L-th superimposed word vectors; L denotes the number of words contained in the text; [·,·,…,·] denotes a set of vectors.
4. The semantics-based deep multi-instance weakly supervised text classification method according to claim 1, characterized in that the neural network is any one of TextCNN, LSTM and Transformer.

5. The semantics-based deep multi-instance weakly supervised text classification method according to claim 1, characterized by further comprising: S4, optimizing the network parameters of the dual-branch neural network:

S4-1: the E step takes KL minimization as its objective and optimizes the parameters φ; the objective function is:

L_E = KL( q_φ(z|x) ‖ p′ )    (15)

where KL(·‖·) denotes the KL divergence being minimized between q_φ(z|x) and p′; p_θ(y = 1 | x) under the condition Y_i = 1 denotes the output of branch one of the dual-branch neural network, i.e., the category of the text instance; Y_i = 1 means the i-th bag is positive; p_θ(y|x) denotes the value computed by the neural network determined by the parameters θ with θ fixed, and for a negative bag p_θ(y|x) of every instance is 0;

S4-2: the M step fixes the parameters φ so that, for the same text, the KL divergence between q_φ(z|x) and p_θ(z|x, Y) is unchanged, where q_φ(z|x) denotes the output of branch one of the dual-branch neural network, the category of the text instance, and p_θ(z|x, Y) denotes the output of branch two, the relationship between the text instance and the bag;

the expectation is then maximized by optimizing the parameters θ; the expected log-likelihood is expressed as:

L_M = E_{Z~q}[ log p_θ(y_ij | x_ij, z > γ) ]    (18)

where L_M denotes the expected log-likelihood; E_{Z~q}[·] denotes the mean under the condition that Z follows the q-distribution; p_θ(y_ij | x_ij, z > γ) denotes the probability that instance j of bag i, after passing through the θ branch, is predicted as positive text when z > γ; z denotes the contribution; γ is a hyperparameter;

according to the definition of formula (7), L_M can be split into two parts at z = γ: z > γ is meaningful only for y_ij = 1, while z < γ is meaningful only for y_ij = 0; the M-step cost function L_M can therefore be further decomposed into

L_M = Σ_{z_ij>γ} log p_θ(y_ij = 1 | x_ij) + Σ_{z_ij<γ} log p_θ(y_ij = 0 | x_ij)    (19)

where γ is a hyperparameter; p_θ(y_ij = 1 | x_ij) denotes the probability that text instance j in bag i is positive text; p_θ(y_ij = 0 | x_ij) denotes the probability that text instance j in bag i is negative text; y_ij = 1 means text instance j in bag i is positive; y_ij = 0 means text instance j in bag i is negative;

formula (19) can be transformed into a cross entropy:

L_M = y′_ij log p_θ(y_ij | x_ij) + (1 − y′_ij) log(1 − p_θ(y_ij | x_ij))    (20)

where y′_ij is the pseudo-label of y_ij, determined by z in positive bags and all 0 in negative bags;

y′_ij = 1 if z_ij ≥ γ · mean(z_i), and y′_ij = 0 otherwise    (21)

where mean(·) denotes averaging; γ is the average proportion of positive instances in a bag.