CN115935953A - False news detection method, device, electronic device and storage medium - Google Patents
- Publication number
- CN115935953A (application number CN202310036408.3A)
- Authority
- CN
- China
- Prior art keywords
- news
- event
- false
- data
- news data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Technical field
The invention relates to the technical field of network information dissemination, and in particular to a false news detection method, device, electronic device, and storage medium.
Background
The popularity of social networking services has rapidly expanded their user base and, with it, the volume of information. Online social networking platforms allow users to publish information freely. These large user groups publish massive amounts of information every day, and a considerable amount of it is inaccurate or false. False information can spread quickly through social networks, so identifying false news quickly and efficiently is a problem of great significance.
Summary of the invention
In view of this, it is necessary to provide a false news detection method, device, electronic device, and storage medium to solve the technical problem that the prior art cannot identify false news quickly and efficiently.
To achieve the above object, the present invention provides a false news detection method, comprising:
obtaining news data to be tested, and performing feature extraction on the news data to obtain corresponding text features;
acquiring events from the news data based on the text features, and determining, based on the events, whether the news data is first-type news or second-type news;
when the news data is determined to be first-type news, inputting the news data into a first event discriminator, which retrieves the historical events corresponding to the news data and determines whether the news data is false news;
when the news data is determined to be second-type news, inputting the news data into a second event discriminator, which determines whether the news data is false news;
wherein the second event discriminator comprises a false event detector and an event feature extractor; the false event detector is configured to identify the text features to obtain the probability that the corresponding event is a false event; and the event feature extractor is configured to classify the news data based on the probability to determine whether the news data is false news.
Further, the false event detector is obtained by transfer learning on false events based on a generative adversarial network.
Further, performing feature extraction on the news data to obtain the corresponding text features comprises:
performing word segmentation and part-of-speech tagging on the news data to obtain tagged words;
learning the word vectors corresponding to the tagged words with a pre-trained word embedding model;
performing dimensionality reduction on the sentence vectors corresponding to the news data based on the word vectors to obtain word embedding vectors;
obtaining the text features based on the word embedding vectors.
Further, obtaining the text features based on the word embedding vectors comprises:
inputting the word embedding vectors into convolution filters to obtain a feature vector for each sentence in the news data;
performing max pooling on the feature vectors to obtain the text features.
Further, acquiring events from the news data based on the text features, and determining based on the events whether the news data is first-type news or second-type news, comprises:
searching for keywords in the news data based on the text features;
retrieving a similar-news set from a keyword-news inverted index table based on the keywords;
determining the cosine similarity between different news items in the similar-news set, and clustering the news items in the set based on these cosine similarities to obtain the event set corresponding to the keywords;
determining, based on the event set, whether the news data is first-type news or second-type news;
wherein the keyword-news inverted index table comprises a plurality of preset keywords and, for each preset keyword, a corresponding inverted list of news items.
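The clustering step above can be sketched as a single-pass (incremental) procedure. This is a minimal illustration under stated assumptions, not the patented implementation: the names `cosine_similarity` and `cluster_into_events`, the centroid update, and the threshold value are all hypothetical, and each news item is assumed to be already represented as a numeric vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_into_events(news_vectors, threshold=0.8):
    """Single-pass clustering of news items into events.

    news_vectors: dict mapping news id -> vector.
    Each item joins the most similar existing cluster if the similarity to
    that cluster's centroid is at least `threshold` (an assumed parameter);
    otherwise it starts a new cluster. Returns a list of member-id lists.
    """
    clusters = []  # each cluster: {"members": [...], "centroid": [...]}
    for news_id, vec in news_vectors.items():
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine_similarity(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"members": [news_id], "centroid": list(vec)})
        else:
            m = len(best["members"])
            # Running mean keeps the centroid equal to the average member vector.
            best["centroid"] = [(c * m + v) / (m + 1)
                                for c, v in zip(best["centroid"], vec)]
            best["members"].append(news_id)
    return [cluster["members"] for cluster in clusters]
```

A single pass means new items can be assigned as they arrive without re-clustering earlier ones, at the cost of the result depending on arrival order.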
Further, determining based on the event set whether the news data is first-type news or second-type news comprises:
performing entropy filtering on the event set to obtain filtered events;
filtering the filtered events with the LCS (longest common subsequence) algorithm, and determining from the LCS filtering result whether the filtered events are first-type news or second-type news.
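The two quantities these filters rely on can be sketched as follows. The text does not say how the entropy is computed or which thresholds are applied, so this sketch only assumes Shannon entropy over a token distribution and the classic dynamic-programming LCS length; the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Shannon entropy (in bits) of the empirical token distribution."""
    total = len(tokens)
    if total == 0:
        return 0.0
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences.

    Uses the standard dynamic program with one rolling row, so memory is
    proportional to len(b).
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

Both functions accept any sequence, so they can be applied to characters or to segmented words.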
Further, retrieving the similar-news set from the keyword-news inverted index table based on the keywords comprises:
determining the cosine similarity of each news item that the keywords map to in the keyword-news inverted index table;
constructing the similar-news set from the news items whose cosine similarity is greater than a preset threshold.
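The retrieval step can be sketched as below. This is an illustration under assumptions: the cosine similarity is taken between the vector of the news item under test and each candidate's vector, and the names `build_inverted_index` and `retrieve_similar_news`, the data layout, and the default threshold are hypothetical.

```python
import math
from collections import defaultdict


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_inverted_index(news_keywords):
    """Map each keyword to the list of news ids that contain it."""
    index = defaultdict(list)
    for news_id, keywords in news_keywords.items():
        for keyword in set(keywords):
            index[keyword].append(news_id)
    return index


def retrieve_similar_news(query_keywords, query_vec, index, news_vectors,
                          threshold=0.5):
    """Collect candidates via the inverted index, then keep those whose
    cosine similarity to the query vector exceeds the threshold."""
    candidates = set()
    for keyword in query_keywords:
        candidates.update(index.get(keyword, []))
    return sorted(
        news_id for news_id in candidates
        if cosine_similarity(query_vec, news_vectors[news_id]) > threshold
    )
```

The index limits similarity computation to news items that share at least one keyword with the query, which is what keeps the lookup cheap as the archive grows.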
The present invention also provides a false news detection device, comprising:
an extraction module, configured to obtain news data to be tested and perform feature extraction on the news data to obtain corresponding text features;
a first judging module, configured to acquire events from the news data based on the text features and determine, based on the events, whether the news data is first-type news or second-type news;
a second judging module, configured to, when the news data is determined to be first-type news, input the news data into a first event discriminator, which retrieves the historical events corresponding to the news data and determines whether the news data is false news;
a third judging module, configured to, when the news data is determined to be second-type news, input the news data into a second event discriminator, which determines whether the news data is false news;
wherein the second event discriminator comprises a false event detector and an event feature extractor; the false event detector is configured to identify the text features to obtain the probability that the corresponding event is a false event; and the event feature extractor is configured to classify the news data based on the probability to determine whether the news data is false news.
The present invention also provides an electronic device comprising a memory and a processor, wherein
the memory is configured to store a program;
the processor is coupled to the memory and configured to execute the program stored in the memory so as to implement the steps of the false news detection method described in any one of the above.
The present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the false news detection method described in any one of the above.
The beneficial effects of the above implementation are as follows. The false news detection method, device, electronic device, and storage medium provided by the present invention acquire events from the news data through the text features of the news data to be tested, and determine based on the events whether the news data is first-type news or second-type news. When the news data is first-type news, the first event discriminator retrieves the historical events corresponding to the news data and determines whether the news data is false news. When the news data is second-type news, the second event discriminator determines whether the news data is false news. The second event discriminator comprises a false event detector and an event feature extractor: the false event detector identifies the text features to obtain the probability that the corresponding event is a false event, and the event feature extractor classifies the news data based on that probability to determine whether it is false news. By dividing news data into first-type and second-type news and feeding each type to a different event discriminator, the invention detects false news quickly and efficiently based on the event correlation between different news items.
Description of drawings
To explain the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of an embodiment of the false news detection method provided by the present invention;
Fig. 2 is a schematic diagram of the keyword-news inverted index provided by the present invention;
Fig. 3 is a schematic diagram of the keyword-event-ID inverted index provided by the present invention;
Fig. 4 is a schematic structural diagram of the false news detection device provided by the present invention;
Fig. 5 is a schematic structural diagram of an embodiment of the electronic device provided by the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
In the description of the embodiments of the present application, unless otherwise specified, "a plurality of" means two or more.
The terms "comprising" and "having" and any variations thereof in the embodiments of the present invention are intended to cover non-exclusive inclusion; for example, a process, method, apparatus, product, or device that comprises a series of steps or modules is not necessarily limited to the steps or modules expressly listed, but may include other steps or modules that are not expressly listed or that are inherent to the process, method, product, or device.
The naming or numbering of steps in the embodiments of the present invention does not mean that the steps of the method must be executed in the temporal or logical order indicated by the names or numbers. The execution order of named or numbered steps may be changed according to the technical purpose to be achieved, as long as the same or a similar technical effect is obtained.
Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present invention. Occurrences of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor do they refer to separate or alternative embodiments that are mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
The present invention provides a false news detection method, device, electronic device, and storage medium, each of which is described below.
As shown in Fig. 1, the present invention provides a false news detection method, comprising:
Step 110: obtain news data to be tested, and perform feature extraction on the news data to obtain corresponding text features.
It can be understood that the method provided by the present invention may be implemented on the ECNNet architecture, which integrates a word vector model, a convolutional neural network, a generative adversarial network, and an incremental clustering algorithm. Four modules (a feature extractor, an event mapper, a familiar-event discriminator, and an unknown-event discriminator) implement false news detection for breaking events on social networks.
A time range T is determined, and a sufficient amount of raw news data is collected by web crawlers from mainstream social networks such as Sina Weibo, Toutiao, and Tencent News; the data is then screened to determine the research scope S of events. The news data is input to the feature extractor, which extracts informative features from the text content of the news data. A convolutional neural network (CNN) may be used as the core module of the feature extractor.
Step 120: acquire events from the news data based on the text features, and determine based on the events whether the news data is first-type news or second-type news.
It can be understood that the text features are input to the event mapper, which uses the text features produced by the feature extractor to collect events from the news data and divides the news into familiar news (that is, first-type news) and unknown news (that is, second-type news).
Step 130: when the news data is determined to be first-type news, input the news data into the first event discriminator, which retrieves the historical events corresponding to the news data and determines whether the news data is false news.
It can be understood that the first event discriminator is the familiar-event discriminator. It receives news from the event mapper and predicts whether the news is real or false by retrieving the corresponding historical events.
For each news item judged to be "familiar news", the corresponding set of historical events is retrieved, and the event correlation between the familiar news and the retrieved event set is computed. The event correlation is defined in the present invention as follows:
Suppose there is a historical event 1 and an event 2, where event 2 consists of n news items $news_1, news_2, \ldots, news_n$ whose cosine similarities to historical event 1 are $sim_1, sim_2, \ldots, sim_n$, respectively. The event correlation S between historical event 1 and event 2 is defined as follows:
For each historical event in the retrieved set, a label of 1 indicates that the event is false and a label of 0 indicates that it is real. The news authenticity R is then defined as follows:
where $l_i$ denotes the label value of the i-th news item. If the computed news authenticity R is greater than 0, the familiar news is predicted to be false; otherwise it is considered real.
Step 140: when the news data is determined to be second-type news, input the news data into the second event discriminator, which determines whether the news data is false news;
wherein the second event discriminator comprises a false event detector and an event feature extractor; the false event detector is configured to identify the text features to obtain the probability that the corresponding event is a false event; and the event feature extractor is configured to classify the news data based on the probability to determine whether the news data is false news.
It can be understood that the second event discriminator is the unknown-event discriminator. It uses a generative adversarial model to perform transfer learning on false news and thereby determines whether unknown news is false. It also integrates a convolutional neural network layer for event feature learning.
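The control flow of steps 110 to 140 can be summarized in a short sketch. The function `detect_false_news` and its four callable parameters are hypothetical placeholders for the feature extractor, the event mapper, and the two discriminators; their signatures are assumptions made for illustration only.

```python
from typing import Callable, Sequence


def detect_false_news(
    news_text: str,
    extract_features: Callable[[str], Sequence[float]],
    is_familiar: Callable[[str, Sequence[float]], bool],
    familiar_discriminator: Callable[[str], bool],
    unknown_discriminator: Callable[[str, Sequence[float]], bool],
) -> bool:
    """Route one news item to the matching discriminator.

    Returns True when the item is judged to be false news.
    """
    features = extract_features(news_text)               # step 110
    if is_familiar(news_text, features):                  # step 120
        # First-type (familiar) news: decide from retrieved historical events.
        return familiar_discriminator(news_text)          # step 130
    # Second-type (unknown) news: decide from the learned text features.
    return unknown_discriminator(news_text, features)     # step 140
```

Passing the components as callables keeps the routing logic independent of how each module is implemented.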
An event is usually a collection of news reports. News items reporting on the same event are therefore event-correlated with one another, and each news item is likewise event-correlated with the event itself. For this reason, a generative adversarial network can be used to build the unknown-event discriminator of ECNNet. The unknown-event discriminator consists of two modules: a false event detector and an event feature extractor.
In some embodiments, the false event detector is obtained by transfer learning on false events based on a generative adversarial network.
It can be understood that the false event detector uses a generative adversarial network to perform transfer learning on false events, and the event feature extractor uses a two-layer fully connected neural network to extract event features.
The false event detector uses a generative adversarial network to perform transfer learning on false events. It takes the text feature representation $R_T$ produced by the feature extractor as input and outputs the probability that the i-th event, denoted $m_i$, is a false event. The detector applies a fully connected layer with softmax to predict whether the news is false. The false event detector is denoted $G_d(\cdot;\theta_d)$, where $\theta_d$ denotes all of its parameters. With the text feature $R_T$ of the i-th news item as input, the output is denoted $P_\theta(m_i)$ and is computed as:
$P_\theta(m_i) = G_d(G_f(m_i;\theta_f);\theta_d)$
The goal of the false event detector is to identify whether unknown news is false. Let $Y_d$ denote the manually labeled training set; the loss function is computed with cross-entropy, as given by the following formula:
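For reference, the standard binary cross-entropy written in the symbols defined above would take the form below. This is a sketch under assumptions rather than the formula of the filing: labels $y \in \{0, 1\}$ with $y = 1$ marking a false event, and the symbol $M$ for the set of labeled news items, are both introduced here for illustration.

```latex
L_d(\theta_f, \theta_d) =
  -\,\mathbb{E}_{(m,\,y) \sim (M,\,Y_d)}
  \Big[\, y \log P_\theta(m) + (1 - y) \log\big(1 - P_\theta(m)\big) \Big]
```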
Next, the optimal parameters $\theta_f$ and $\theta_d$ are sought so that the loss function $L_d(\theta_f,\theta_d)$ reaches its minimum. This process can be expressed as:
$(\hat{\theta}_f, \hat{\theta}_d) = \arg\min_{\theta_f,\theta_d} L_d(\theta_f,\theta_d)$
However, the false event detector can only learn event-specific features and cannot produce a generalized representation, which hinders the detection of events not included in the training data set. The model therefore needs to learn transferable feature representations for newly emerging events, that is, more general feature representations that characterize news items sharing the same event correlation. To learn the features common to events, the following event feature extractor is added to the model.
The event feature extractor is essentially a two-way adversarial transfer learning process that makes full use of the data features of both the source domain and the target domain.
The event feature extractor is denoted $G_e(R_F;\theta_e)$, where $\theta_e$ denotes all parameters of $G_e$. It deploys a convolutional neural network layer to correctly classify the input domains, including news and events. Let $Y_e$ be the set of domain labels; the loss of the event discriminator can then be expressed as:
Since the discriminator seeks to recognize the input domain, from the discriminator's point of view the loss function should be minimized to find the optimal parameters:
$\hat{\theta}_e = \arg\min_{\theta_e} L_e(\theta_f,\theta_e)$
The event feature extractor, however, aims to fool the false event detector in order to learn the features common to events. A larger loss means that more common features are being learned, because a larger loss indicates that it is harder to classify events correctly into clusters. The above loss therefore needs to be maximized to find the optimal parameters.
Before the training phase of the unknown-event discriminator, all historical events are extracted as the training set. During training, on the one hand, the generative adversarial network $G_f(\cdot;\theta_f)$ is trained on the data set, and to improve false event detection the event feature extractor cooperates with the false event detector to minimize the loss $L_d(\theta_f,\theta_d)$. On the other hand, there is a minimax game between the event feature extractor and the false event detector. The final loss can therefore be expressed as:
$L_{final}(\theta_f,\theta_d,\theta_e) = L_d(\theta_f,\theta_d) - L_e(\theta_f,\theta_e)$
To find the optimal parameters, $L_{final}$ must be minimized while taking the minimax game into account. The optimal parameters therefore correspond to the point at which the process reaches equilibrium, and they can be expressed as:
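An equilibrium of this kind is commonly written as a saddle point, as sketched below. The hat notation is an assumption introduced here, and the sketch further assumes that $L_e$ enters $L_{final}$ with a negative sign, so that maximizing $L_{final}$ over $\theta_e$ is the same as minimizing $L_e$.

```latex
(\hat{\theta}_f, \hat{\theta}_d)
  = \arg\min_{\theta_f,\, \theta_d} L_{final}(\theta_f, \theta_d, \hat{\theta}_e),
\qquad
\hat{\theta}_e
  = \arg\max_{\theta_e} L_{final}(\hat{\theta}_f, \hat{\theta}_d, \theta_e)
```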
In some embodiments, performing feature extraction on the news data to obtain the corresponding text features comprises:
performing word segmentation and part-of-speech tagging on the news data to obtain tagged words;
learning the word vectors corresponding to the tagged words with a pre-trained word embedding model;
performing dimensionality reduction on the sentence vectors corresponding to the news data based on the word vectors to obtain word embedding vectors;
obtaining the text features based on the word embedding vectors.
It can be understood that, to extract rich text features from news, the input news text (that is, the news data) first undergoes word segmentation and then part-of-speech tagging. The word vector of each word is then learned with a pre-trained word embedding model, and a PCA (principal component analysis) model is used to reduce the dimensionality of the sentence vector of each news item. The resulting ordered list of words (that is, the word embedding vectors) is the input to the text feature extractor.
To better extract informative features from the news text, a convolutional neural network is used as the core module of the text feature extractor. A modified convolutional neural network model, Text-CNN, is added to the text feature extractor. The Text-CNN architecture uses multiple filters with different window sizes to capture text features at different granularities.
In some embodiments, obtaining the text features based on the word embedding vectors comprises:
inputting the word embedding vectors into convolution filters to obtain a feature vector for each sentence in the news data;
performing max pooling on the feature vectors to obtain the text features.
It can be understood that each word in the news text is vectorized and represented as a word embedding vector. The embedding vector of each word or phrase is initialized with a word embedding model pre-trained on the given dataset. For the i-th word in a sentence, the corresponding k-dimensional word embedding vector is denoted $T_i \in \mathbb{R}^k$, so a sentence with n words can be expressed as:
$T_{1:n} = T_1 \oplus T_2 \oplus \cdots \oplus T_n$
where $\oplus$ denotes the concatenation operator. A convolution filter with window size h takes a contiguous sequence of h words in the text as input and outputs one feature. To illustrate the process, take the contiguous sequence of h words starting at the i-th word as an example; the filtering operation can be expressed as:
$t_i = \sigma(W_c \cdot T_{i:i+h-1})$
Here $\sigma(\cdot)$ is the ReLU activation function and $W_c$ denotes the weights of the convolution filter. The same filter is applied to the remaining words, which yields a feature vector for the sentence:
$t = [t_1, t_2, \ldots, t_{n-h+1}]$
For each feature vector, a max pooling operation takes the maximum value so as to extract the most important information in the text. This yields the feature corresponding to one particular filter. The process is repeated until the features of all filters have been obtained. To extract text features at different granularities, different window sizes are applied. For each window size there are $n_h$ different filters.
Thus, assuming there are c possible window sizes, there are $c \cdot n_h$ filters in total. After the max pooling operation, a fully connected layer with weight matrix $W_{tf}$ maps the pooled text features to the final text feature representation, which has p dimensions.
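The convolution and pooling steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: a single filter, no bias term, and made-up toy values; the function names `conv_feature_map` and `max_over_time` are hypothetical and do not come from the patent.

```python
from typing import List

Vector = List[float]


def relu(x: float) -> float:
    """ReLU activation, sigma(x) = max(0, x)."""
    return x if x > 0.0 else 0.0


def conv_feature_map(sentence: List[Vector], w_c: Vector, h: int) -> Vector:
    """Slide one filter of window size h over a sentence.

    sentence: n word embeddings, each of dimension k.
    w_c: flat filter weights of length h * k (bias omitted, an assumption).
    Returns t = [t_1, ..., t_{n-h+1}] with t_i = ReLU(W_c . T_{i:i+h-1}).
    """
    features = []
    for i in range(len(sentence) - h + 1):
        # Concatenate the h embeddings in the window into one flat vector.
        window = [x for word in sentence[i:i + h] for x in word]
        features.append(relu(sum(w * x for w, x in zip(w_c, window))))
    return features


def max_over_time(features: Vector) -> float:
    """Max pooling: keep only the strongest response of the filter."""
    return max(features)


# Toy input: n = 4 words, k = 2 dimensions, window size h = 2.
sentence = [[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0], [3.0, -1.0]]
w_c = [1.0, 0.0, 0.0, 1.0]
t = conv_feature_map(sentence, w_c, h=2)
pooled = max_over_time(t)
```

In a full Text-CNN, this would be repeated for each of the $c \cdot n_h$ filters, and the pooled values would be concatenated before the fully connected layer.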
In some embodiments, acquiring events from the news data based on the text features, and judging based on the events whether the news data is first-type news or second-type news, includes:
searching for keywords in the news data based on the text features;
retrieving a similar news set from a keyword-news inverted index table based on the keywords;
determining the cosine similarity between different news items in the similar news set and, based on that cosine similarity, clustering the news items in the similar news set to obtain the event set corresponding to the keywords;
judging, based on the event set, whether the news data is first-type news or second-type news;
wherein the keyword-news inverted index table includes a plurality of preset keywords and, for each preset keyword, a corresponding news posting list.
It can be understood that the steps in this embodiment are implemented by the event mapper, which has three parts: a keyword-news mapper, a keyword-event mapper, and a filter.
The keyword-news mapper quickly checks whether existing news contains items similar to the input news; it is essentially a dynamically updated inverted index table. To reduce the time needed to search for previously seen news similar to the input news d while keeping time and space requirements constant, a keyword-news inverted index table maintained within a time window t can be used, as shown in FIG. 2. The set M is continuously updated by replacing the oldest news with the newest input news. This keeps the memory requirement of the keyword-news inverted index table constant, since the number of keywords may otherwise grow very large owing to the unrestricted vocabulary of the news stream. Each entry of the keyword-news inverted index table contains a keyword and a bounded set Q, which holds the most recent news items in which the keyword appears. When the number of news items exceeds the size limit of Q, the oldest item is replaced by the newest news containing that keyword.
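One possible way to realize such a bounded inverted index is sketched below. It is a minimal illustration, not the patent's implementation: the class name `KeywordNewsIndex` and the parameter `q_limit` are hypothetical, and the time window t is not modeled; only the "replace the oldest item when the limit is exceeded" behavior is shown.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, List


class KeywordNewsIndex:
    """Keyword -> bounded list of the most recent news IDs (hypothetical sketch)."""

    def __init__(self, q_limit: int) -> None:
        self.q_limit = q_limit
        # A deque with maxlen drops its oldest element automatically on overflow.
        self._table: Dict[str, Deque[int]] = defaultdict(
            lambda: deque(maxlen=q_limit)
        )

    def add(self, news_id: int, keywords: Iterable[str]) -> None:
        """Register a news item under each of its keywords."""
        for keyword in keywords:
            self._table[keyword].append(news_id)

    def lookup(self, keyword: str) -> List[int]:
        """Return the stored news IDs for a keyword, oldest first."""
        return list(self._table.get(keyword, ()))


index = KeywordNewsIndex(q_limit=3)
for news_id in [3, 5, 7, 15]:
    index.add(news_id, ["City A"])
recent = index.lookup("City A")
```

Because every posting list is capped at `q_limit` entries, the memory used per keyword stays constant regardless of how many news items arrive.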
The top k keywords are selected from the news using the TF-IDF (term frequency-inverse document frequency) method. A set of potentially similar news is then retrieved by computing the cosine similarity between the input news and each news item listed under each of these k keywords in the keyword-news inverted index table.
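The keyword selection step can be sketched as follows. The patent does not state which TF-IDF variant is used, so the smoothed IDF below is an assumption, and the function name `top_k_keywords` and the toy corpus are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def top_k_keywords(doc: List[str], corpus: List[List[str]], k: int) -> List[str]:
    """Return the k terms of doc with the highest TF-IDF weight.

    Assumed variant: tf = count / len(doc),
    idf = ln((1 + N) / (1 + df)) + 1, where N is the corpus size.
    Ties are broken alphabetically so the result is deterministic.
    """
    term_counts = Counter(doc)
    n_docs = len(corpus)
    scores: Dict[str, float] = {}
    for term, count in term_counts.items():
        doc_freq = sum(1 for other in corpus if term in other)
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0
        scores[term] = (count / len(doc)) * idf
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


# Toy, already-segmented corpus (hypothetical tokens).
corpus = [
    ["city_a", "reports", "cases"],
    ["city_b", "reports", "weather"],
    ["city_a", "c_dynamic", "cases", "cases"],
]
keywords = top_k_keywords(corpus[2], corpus, k=2)
```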
For example, consider the input news: "City A press conference briefing: from 16:00 to 22:00 on April 22, City A added 4 new cases of C dynamic personnel (2 in District E, 2 in District F), plus 1 case of Y dynamic personnel." The top three TF-IDF weighted keywords are "City A", "C dynamic", and "Y dynamic". Each keyword is looked up in the keyword-news inverted index table, which retrieves the news items with IDs 3, 5, 7, 15, 18, 21, and 25. Finally, the similarity between two news vectors a and b is computed as their cosine similarity:
$\cos(a, b) = \dfrac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$
If no news item has a cosine similarity above the threshold $t_{si}$, no similar news has appeared before. In that case a new event is created, the news d is assigned to it, and the news is sent to the keyword-event mapper. If some news item has a similarity above $t_{si}$, the news d is sent directly to the keyword-event mapper. Finally, the keyword-news mapper is updated by adding the news to the news sets of its k keywords.
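The threshold test on cosine similarity can be illustrated with a short sketch. The vectors and the threshold value 0.7 are made up for the example, and `similar_news` is a hypothetical helper name; an empty result corresponds to the "create a new event" branch described above.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """cos(a, b) = (a . b) / (|a| |b|); defined as 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_news(
    query: Sequence[float],
    candidates: Dict[int, Sequence[float]],
    t_si: float,
) -> List[int]:
    """IDs of candidate news whose similarity to the query exceeds t_si."""
    return [
        news_id
        for news_id, vector in candidates.items()
        if cosine_similarity(query, vector) > t_si
    ]


candidates = {3: [1.0, 0.0], 5: [0.0, 1.0], 7: [1.0, 1.0]}
matches = similar_news([1.0, 0.0], candidates, t_si=0.7)
is_new_event = len(similar_news([1.0, 0.0], {5: [0.0, 1.0]}, t_si=0.7)) == 0
```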
The keyword-event mapper detects events and divides them into familiar events and unknown events. In its initial state, the keyword-event mapper contains only a dynamically updated keyword-historical-event inverted index table, as shown in FIG. 3. As in the keyword-news mapper, each row holds a keyword and a bounded set in which the number of historical events does not exceed Q. When the number exceeds the limit, the oldest event is replaced by the newest one.
In some embodiments, judging based on the event set whether the news data is first-type news or second-type news includes:
performing entropy filtering on the event set to obtain filtered events;
performing LCS (longest common subsequence) filtering on the filtered events, and judging, based on the result of the LCS filtering, whether the filtered events are first-type news or second-type news.
It can be understood that the keyword-event mapper uses clustering to group news into events, and that clustering produces some small events, referred to as fragment events. Fragment events reduce both the accuracy and the speed of the model.
To improve the accuracy and speed of the model, a filter removes fragment events and meaningless candidate events. The filter has two parts: entropy filtering and LCS filtering.
Entropy filtering uses the entropy information of candidate event clusters. The entropy of each candidate event cluster is computed and compared with a preset entropy threshold $t_{ent}$. If the entropy of a candidate event cluster is below $t_{ent}$, the cluster is considered to carry less than the required minimum amount of information; it is judged to be a fragment cluster and discarded.
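A rough sketch of entropy filtering is given below. The patent does not specify how cluster entropy is computed; the sketch assumes Shannon entropy (in bits) over the word distribution of all news in a cluster, and the names `cluster_entropy` and `entropy_filter` are hypothetical.

```python
import math
from collections import Counter
from typing import List

Cluster = List[List[str]]  # a cluster is a list of tokenized news items


def cluster_entropy(cluster: Cluster) -> float:
    """Shannon entropy (bits) of the word distribution in a cluster.

    Assumption: entropy is taken over word frequencies; the patent only
    says that the entropy of each candidate cluster is computed.
    """
    counts = Counter(word for news in cluster for word in news)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_filter(clusters: List[Cluster], t_ent: float) -> List[Cluster]:
    """Discard clusters whose entropy is below the threshold t_ent."""
    return [c for c in clusters if cluster_entropy(c) >= t_ent]


fragment = [["a", "a"]]            # one repeated word: no information
rich = [["a", "b"], ["c", "d"]]    # four equally likely words: 2 bits
kept = entropy_filter([fragment, rich], t_ent=1.0)
```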
LCS filtering relies on the observation that news items within an event usually share a similar sentence structure. The LCS algorithm is applied to each candidate event and the length of the maximum LCS is recorded. Events whose maximum LCS is below the threshold $t_{lcs}$ are discarded; an event above $t_{lcs}$ is represented by the news item with the maximum LCS. Finally, the remaining news belonging to familiar events is sent to the familiar event judger, while the other category, unknown news, is sent to the unknown event judger.
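The LCS step can be sketched with the standard dynamic-programming algorithm. Taking the maximum over all pairs of news items in an event is an assumption, since the patent does not say between which sequences the LCS is computed; `max_lcs` and `lcs_filter` are hypothetical names.

```python
from itertools import combinations
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, O(len(a) * len(b)) time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def max_lcs(event: List[Sequence[str]]) -> int:
    """Largest pairwise LCS length among the news items of one event."""
    return max(
        (lcs_length(a, b) for a, b in combinations(event, 2)),
        default=0,
    )


def lcs_filter(events: List[List[Sequence[str]]], t_lcs: int) -> List[List[Sequence[str]]]:
    """Discard events whose maximum LCS is below the threshold t_lcs."""
    return [event for event in events if max_lcs(event) >= t_lcs]


coherent = [
    ["city", "a", "reports", "4", "cases"],
    ["city", "a", "reports", "2", "cases"],
]
unrelated = [["storm", "hits"], ["market", "falls"]]
kept_events = lcs_filter([coherent, unrelated], t_lcs=3)
```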
In some embodiments, retrieving the similar news set from the keyword-news inverted index table based on the keywords includes:
determining the cosine similarity for each news item listed under the keyword in the keyword-news inverted index table;
constructing the similar news set from the news items whose cosine similarity is greater than a preset threshold.
It can be understood that, continuing the example above, the top three keywords "City A", "C dynamic", and "Y dynamic" are found, and incremental clustering is performed with the cosine similarity between news vectors as the metric. News related to the keyword "City A" is thereby clustered into the events with IDs 2, 5, 10, and 11, as shown in FIG. 3. Afterwards, when an input news item contains the keyword "City A", the keyword-event mapper quickly retrieves the event set with IDs 2, 5, 10, and 11 and computes the cosine similarity between the input news and each event in that set. If the cosine similarity is above the threshold $t_{es}$, the news is added to the corresponding event. Otherwise, a new event is created, the news is added to it, and the new event is inserted into the keyword-event mapper.
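The incremental clustering described above might look like the following sketch. Representing each event by the running mean (centroid) of its news vectors and assigning a news item only to its best-matching event are both assumptions; the class `Event` and function `assign_to_event` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class Event:
    """An event summarized by the centroid of its member news vectors."""

    def __init__(self, event_id: int, vector: Sequence[float]) -> None:
        self.event_id = event_id
        self.members = 1
        self.centroid = list(vector)

    def add(self, vector: Sequence[float]) -> None:
        """Update the centroid as a running mean."""
        self.members += 1
        self.centroid = [
            c + (x - c) / self.members for c, x in zip(self.centroid, vector)
        ]


def assign_to_event(vector: Sequence[float], events: List[Event], t_es: float) -> Event:
    """Add the news to its most similar event if above t_es, else open a new one."""
    best = max(events, key=lambda e: cosine(vector, e.centroid), default=None)
    if best is not None and cosine(vector, best.centroid) > t_es:
        best.add(vector)
        return best
    new_event = Event(event_id=len(events), vector=vector)
    events.append(new_event)
    return new_event


events: List[Event] = []
first = assign_to_event([1.0, 0.0], events, t_es=0.8)   # opens event 0
second = assign_to_event([0.9, 0.1], events, t_es=0.8)  # joins event 0
third = assign_to_event([0.0, 1.0], events, t_es=0.8)   # opens event 1
```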
In summary, the ECNNet architecture provided by the present invention includes four key components: a feature extractor, an event mapper, a familiar event judger, and an unknown event judger. In the training phase, the training set $X_1$ is first fed to the feature extractor to train the Text-CNN model and obtain the text feature representation of the news. The text feature representation is then input to the event mapper, which divides the input news into two categories: familiar news $N_f$ and unknown news $N_u$. Familiar news then forms a historical event set in the familiar event judger, which is later used to quickly predict whether familiar news is true or false. Unknown news is input to the unknown event judger for generative adversarial training, which extracts event feature representations.
In the testing phase, the test set $X_u$ is fed to the Text-CNN model to obtain the text feature representation of the unlabeled news set. This representation is then input to the event mapper, which divides the news into familiar news and unknown news. In the familiar event judger, the historical event set is retrieved to compute the news credibility $R_\theta(u_i)$, which is used to judge whether the news is fake. In the unknown event judger, the fake news detector predicts whether the news is fake, and the computed label set $Y_u$ is output.
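The control flow of the testing phase can be summarized in a small routing sketch. Every name here (`route_and_judge` and the four callables) is hypothetical, and the stand-in lambdas are placeholders rather than real models; the sketch shows only how a news item is dispatched to one of the two judgers.

```python
from typing import Callable, Sequence

Features = Sequence[float]


def route_and_judge(
    news_text: str,
    extract_features: Callable[[str], Features],
    is_familiar: Callable[[Features], bool],
    familiar_judger: Callable[[Features], bool],
    unknown_judger: Callable[[Features], bool],
) -> bool:
    """Return True if the news is judged fake (label 1), else False (label 0)."""
    features = extract_features(news_text)   # feature extractor (e.g. Text-CNN)
    if is_familiar(features):                # event mapper decision
        return familiar_judger(features)     # lookup in historical events
    return unknown_judger(features)          # learned fake news detector


# Placeholder components, for illustration only.
def extract(text: str) -> Features:
    return [float(len(text))]


verdict_short = route_and_judge(
    "short", extract, lambda f: f[0] < 10, lambda f: False, lambda f: True
)
verdict_long = route_and_judge(
    "a much longer story", extract, lambda f: f[0] < 10, lambda f: False, lambda f: True
)
```

Passing the components as callables mirrors the modular structure of the four components: any of them could be swapped without touching the routing logic.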
In the field of fake news detection on social networks, there is no internationally recognized standard test corpus, nor any closely related corpus. To evaluate the performance of the ECNNet model fairly, the data for this experiment were collected by web crawler from major social platforms and portals on the Simplified Chinese internet, such as Weibo, Zhihu, Xiaohongshu, Tencent News, and Toutiao, yielding 666 news items dated from December 1, 2021 to December 31, 2021.
These 666 news items were then labeled manually: 0 for real news and 1 for fake news. Of these, 446 are real news and 220 are fake news.
The statistics of the dataset are shown in Table 1:
Table 1
In conventional fake news detection evaluation, precision, recall, and F-score are three important metrics. Precision is defined with respect to the predictions: it is the fraction of samples predicted positive that are truly positive, where TP counts positive samples predicted as positive and FP counts negative samples predicted as positive. Recall is defined with respect to the original samples: it is the fraction of positive samples that are predicted correctly, where FN counts positive samples predicted as negative. The F1 score is the harmonic mean of precision and recall.
The evaluation metrics used here are given by the following formulas:
$\text{Precision} = \dfrac{TP}{TP + FP}$
$\text{Recall} = \dfrac{TP}{TP + FN}$
$F_1 = \dfrac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
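As a quick check of how these metrics follow from the TP, FP, and FN counts, the sketch below computes them for a made-up set of labels, using the document's convention that 1 means fake news (the positive class). The function name `precision_recall_f1` and the example labels are illustrative only and are not results from the experiment.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float, float]:
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```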
To verify the effectiveness of the proposed model, baseline methods were chosen from two families: traditional machine learning models and neural network deep learning models.
The present invention uses the following four baseline methods:
1. Support Vector Machine (SVM). An SVM model is trained on the standardized text feature representations and the ground-truth label set. The parameter C is set to 50 and the kernel function is set to RBF.
2. Random Forest (RF). A random forest model is trained on the normalized text feature representations and the ground-truth label set. The parameter n_estimators is set to 50.
3. Logistic Regression (LR). A logistic regression model is trained on the normalized text feature representations and the ground-truth label set. The solver parameter is set to lbfgs.
4. Long Short-Term Memory (LSTM) neural network. The LSTM baseline uses a fully connected layer as the text feature extractor, with the text vector representations coming from the text feature evaluator. The model has a hidden size of 256; the input of the fully connected layer is the text features and its output is the probability that the news is real.
Experimental steps:
The experiment was implemented with the machine learning libraries Scikit-Learn and PyTorch, using Python 3.7.2, Scikit-Learn 0.21.2, and PyTorch 1.10.0. The data were split into a training set and a test set at a ratio of 8:2. The training set is used to optimize the parameter settings of ECNNet, and the test set is used to evaluate the performance of the model. The recommended parameter settings of the ECNNet model are shown in Table 2:
Table 2
In the feature extractor, the word embedding dimension is set to 512, the window size ranges from 1 to 4, and the fully connected layer of the feature extractor has size 32. The fully connected layer of the event detector has size 64. The remaining parameter settings are listed in Table 2. All baseline methods and the proposed ECNNet model use the same batch size of 100 and 100 training epochs.
For all baselines, the recommended or best parameter settings were used. To obtain a fair evaluation, the same preprocessing was applied to all methods in the experiment.
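For illustration, an 8:2 split of 666 items can be produced as in the sketch below. The patent does not say whether the split was random, stratified, or seeded, so the seeded random shuffle and the helper name `split_8_2` are assumptions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_8_2(items: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle with a fixed seed and split into 80% train / 20% test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 8 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


news_ids = list(range(666))  # stand-ins for the 666 collected news items
train_ids, test_ids = split_8_2(news_ids)
```

With 666 items this gives 532 training items and 134 test items.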
Results and analysis:
The above method was verified on the dataset provided by the present invention and compared with several currently popular fake news detection methods. The results are shown in Table 3:
Table 3
Compared with the other methods, the proposed ECNNet algorithm achieves good results in precision, recall, and F1 score. ECNNet's precision is higher than that of the other baseline methods, 0.02 above RF, the strongest previous baseline. ECNNet's recall is higher than that of most of the baseline methods and only 0.06 below that of LSTM, the best-performing baseline on this metric. Taking recall and precision together, ECNNet achieves the best F1 score, indicating that the model has an advantage in fake news detection.
The overall experimental results show that the proposed fake news detection algorithm based on event correlation can effectively improve the recall of the algorithm and achieve accurate detection of fake news on social networks.
The present invention proposes ECNNet, an event-correlation-based fake news detector for social network emergencies, which detects fake news quickly and efficiently based on the event correlation between different news items.
The ECNNet model proposed by the present invention uses an event clusterer to measure the differences between events and to cluster them. The keyword-event ID inverted index table maintained by the event clusterer enables fast detection of familiar news.
The ECNNet model proposed by the present invention is a general framework for fake news detection whose internal modules have high cohesion and low coupling. Users can modify or extend it according to their actual needs.
Experiments show that the proposed ECNNet model detects fake news effectively and performs well in precision and recall.
In the fake news detection method provided by the present invention, events are acquired from the news data based on the text features of the news data under test, and the news data is judged, based on the events, to be first-type news or second-type news. When the news data is determined to be first-type news, the first event discriminator retrieves the historical events corresponding to the news data and judges whether the news data is fake news. When the news data is determined to be second-type news, the second event discriminator judges whether the news data is fake news. The second event discriminator includes a false event detector and an event feature extractor: the false event detector identifies the text features to obtain the probability that the corresponding event is a false event, and the event feature extractor classifies the news data based on that probability to determine whether the news data is fake news. The present invention thus divides news data into first-type news and second-type news, feeds them to different event discriminators, and detects fake news quickly and efficiently based on the event correlation between different news items.
As shown in FIG. 4, the present invention also provides a fake news detection device 400, including:
an extraction module 410, configured to acquire news data under test and perform feature extraction on the news data to obtain corresponding text features;
a first judging module 420, configured to acquire events from the news data based on the text features, and to judge, based on the events, whether the news data is first-type news or second-type news;
a second judging module 430, configured to, when the news data is determined to be first-type news, input the news data into a first event discriminator, so that the first event discriminator retrieves the historical events corresponding to the news data and judges whether the news data is fake news;
a third judging module 440, configured to, when the news data is determined to be second-type news, input the news data into a second event discriminator, so that the second event discriminator judges whether the news data is fake news;
wherein the second event discriminator includes a false event detector and an event feature extractor; the false event detector is used to identify the text features and obtain the probability that the corresponding event is a false event; and the event feature extractor is used to classify the news data based on the probability and determine whether the news data is fake news.
The fake news detection device provided in the above embodiment can implement the technical solutions described in the above embodiments of the fake news detection method. For the principles by which each module or unit is implemented, refer to the corresponding content in those method embodiments; details are not repeated here.
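As shown in FIG. 5, the present invention correspondingly provides an electronic device 500. The electronic device 500 includes a processor 501, a memory 502, and a display 503. FIG. 5 shows only some components of the electronic device 500; it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
In some embodiments, the memory 502 may be an internal storage unit of the electronic device 500, such as a hard disk or internal memory of the electronic device 500. In other embodiments, the memory 502 may be an external storage device of the electronic device 500, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the electronic device 500.
Further, the memory 502 may include both an internal storage unit and an external storage device of the electronic device 500. The memory 502 stores the application software installed on the electronic device 500 and various kinds of data.
In some embodiments, the processor 501 may be a central processing unit (CPU), a microprocessor, or another data processing chip, used to run the program code or process the data stored in the memory 502, for example to execute the fake news detection method of the present invention.
In some embodiments, the display 503 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display 503 displays information on the electronic device 500 and presents a visual user interface. The components 501-503 of the electronic device 500 communicate with one another over a system bus.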
In some embodiments of the present invention, when the processor 501 executes the fake news detection program in the memory 502, the following steps can be implemented:
acquiring news data under test, and performing feature extraction on the news data to obtain corresponding text features;
acquiring events from the news data based on the text features, and judging, based on the events, whether the news data is first-type news or second-type news;
when the news data is determined to be first-type news, inputting the news data into a first event discriminator, so that the first event discriminator retrieves the historical events corresponding to the news data and judges whether the news data is fake news;
when the news data is determined to be second-type news, inputting the news data into a second event discriminator, so that the second event discriminator judges whether the news data is fake news;
wherein the second event discriminator includes a false event detector and an event feature extractor; the false event detector is used to identify the text features and obtain the probability that the corresponding event is a false event; and the event feature extractor is used to classify the news data based on the probability and determine whether the news data is fake news.
It should be understood that, when executing the fake news detection program in the memory 502, the processor 501 can implement other functions in addition to those above; for details, refer to the description of the corresponding method embodiments.
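Further, the embodiments of the present invention do not specifically limit the type of the electronic device 500. The electronic device 500 may be a portable electronic device such as a mobile phone, a tablet computer, a personal digital assistant (PDA), a wearable device, or a laptop computer. Exemplary embodiments of portable electronic devices include, but are not limited to, portable electronic devices running iOS, Android, Microsoft, or other operating systems. The portable electronic device may also be another portable electronic device, such as a laptop computer with a touch-sensitive surface (for example, a touch panel). It should also be understood that, in some other embodiments of the present invention, the electronic device 500 may not be a portable electronic device but a desktop computer with a touch-sensitive surface (for example, a touch panel).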
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the fake news detection method provided by the methods above, the method including:
acquiring news data under test, and performing feature extraction on the news data to obtain corresponding text features;
acquiring events from the news data based on the text features, and judging, based on the events, whether the news data is first-type news or second-type news;
when the news data is determined to be first-type news, inputting the news data into a first event discriminator, so that the first event discriminator retrieves the historical events corresponding to the news data and judges whether the news data is fake news;
when the news data is determined to be second-type news, inputting the news data into a second event discriminator, so that the second event discriminator judges whether the news data is fake news;
wherein the second event discriminator includes a false event detector and an event feature extractor; the false event detector is used to identify the text features and obtain the probability that the corresponding event is a false event; and the event feature extractor is used to classify the news data based on the probability and determine whether the news data is fake news.
Those skilled in the art will understand that all or part of the processes of the methods in the above embodiments can be carried out by a computer program instructing the relevant hardware, and that the program can be stored in a computer-readable storage medium. The computer-readable storage medium may be a magnetic disk, an optical disk, a read-only memory, a random access memory, or the like.
The fake news detection method, device, electronic device, and storage medium provided by the present invention have been described in detail above. Specific examples have been used to explain the principles and implementations of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. Those skilled in the art may, following the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310036408.3A CN115935953A (en) | 2023-01-09 | 2023-01-09 | False news detection method, device, electronic device and storage medium |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310036408.3A CN115935953A (en) | 2023-01-09 | 2023-01-09 | False news detection method, device, electronic device and storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN115935953A true CN115935953A (en) | 2023-04-07 |
Family
ID=86650873
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310036408.3A Pending CN115935953A (en) | 2023-01-09 | 2023-01-09 | False news detection method, device, electronic device and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN115935953A (en) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117034905A (en) * | 2023-08-07 | 2023-11-10 | 重庆邮电大学 | Internet false news identification method based on big data |
| CN117668158A (en) * | 2023-12-06 | 2024-03-08 | 济南大学 | Real-time false news event detection method and system based on multi-source social data |
-
2023
- 2023-01-09 CN CN202310036408.3A patent/CN115935953A/en active Pending
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117034905A (en) * | 2023-08-07 | 2023-11-10 | 重庆邮电大学 | Internet false news identification method based on big data |
| CN117034905B (en) * | 2023-08-07 | 2024-05-14 | 重庆邮电大学 | Internet false news identification method based on big data |
| CN117668158A (en) * | 2023-12-06 | 2024-03-08 | 济南大学 | Real-time false news event detection method and system based on multi-source social data |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN108629043B (en) | Webpage target information extraction method, device and storage medium | |
| CN107992596B (en) | Text clustering method, text clustering device, server and storage medium | |
| Hua et al. | Clickage: Towards bridging semantic and intent gaps via mining click logs of search engines | |
| CN110929125B (en) | Search recall method, device, equipment and storage medium thereof | |
| CN105824959B (en) | Public opinion monitoring method and system | |
| CN107657048B (en) | User identification method and device | |
| CN111797214A (en) | Question screening method, device, computer equipment and medium based on FAQ database | |
| US20130198192A1 (en) | Author disambiguation | |
| CN107844533A (en) | A kind of intelligent Answer System and analysis method | |
| CN111538903B (en) | Method and device for determining search recommended word, electronic equipment and computer readable medium | |
| US20160378847A1 (en) | Distributional alignment of sets | |
| CN113761125B (en) | Dynamic summary determination method and device, computing device and computer storage medium | |
| CN106708929A (en) | Video program searching method and device | |
| De Boom et al. | Semantics-driven event clustering in Twitter feeds | |
| Singh et al. | Burst: real-time events burst detection in social text stream | |
| CN111708942A (en) | Multimedia resource pushing method, device, server and storage medium | |
| CN113722484A (en) | Rumor detection method, device, equipment and storage medium based on deep learning | |
| CN106570196A (en) | Video program searching method and device | |
| CN114416998B (en) | Text label identification method and device, electronic equipment and storage medium | |
| CN115935953A (en) | False news detection method, device, electronic device and storage medium | |
| CN111338692A (en) | Vulnerability classification method, device and electronic device based on vulnerability code | |
| CN113704462B (en) | Text processing method, device, computer equipment and storage medium | |
| WO2015084757A1 (en) | Systems and methods for processing data stored in a database | |
| CN114461783A (en) | Keyword generating method, apparatus, computer equipment, storage medium and product | |
| CN112818206A (en) | Data classification method, device, terminal and storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |