CN116244446A - Method and system for detecting social media cognitive threats - Google Patents
- Publication number
- CN116244446A; CN202211732859.2A
- Authority
- CN
- China
- Prior art keywords
- cognitive
- threat
- text
- topic
- cognitive threat
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G06Q10/40—
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Evolutionary Computation (AREA)
- Probability & Statistics with Applications (AREA)
- Biophysics (AREA)
- Animal Behavior & Ethology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Technical Field
The invention belongs to the field of network security technology and relates in particular to a method and system for detecting cognitive threats on social media.
Background
Cognition refers to the process by which people acquire or apply knowledge, or the process of information processing; it includes sensation, perception, memory, thinking, imagination, and language. A cognitive threat works by feeding an individual information that is purposeful, inflammatory, covert, directional, and untrue, and by continuously influencing and entrenching the individual's cognitive process, so that the individual forms distorted, unconventional, or inverted and negative cognition, or so that the individual's existing normal cognitive system is altered and departs from society's core value system. The rise of self-media platforms and social platforms has provided a breeding ground for the emergence and spread of cognitive threats, and cyberspace has become the main battlefield of cognitive threat confrontation. Social networks are characterized by anonymous identities, "free" speech, high timeliness, and fast propagation; most users are young people who are not sensitive to social issues and are easily subject to cognitive infiltration. Cognitive threats are highly covert, difficult to trace, and hard to supervise across platforms. These problems need urgent solutions, and identifying, tracing, and countering cognitive threat information by technical means is pressing. How to contain cognitive threats technically has therefore become an urgent need for purifying cyberspace.
Summary of the Invention
To this end, the invention provides a method and system for detecting cognitive threats on social media. For topic texts related to specific themes and sensitive events, it uses the sentiment orientation behind the text to identify cognitive threats. Compared with traditional manual evidence gathering, it greatly shortens the identification cycle and increases both the volume of information examined and the detection efficiency.
According to the design provided by the invention, a method for detecting cognitive threats on social media is provided, comprising:
Collecting sensitive-topic text data from network platforms and preprocessing the data;
For the preprocessed sensitive-topic text data, obtaining cognitive threat topic texts through multi-level cognitive threat detection, where the multi-level detection comprises: a primary detection that divides the sensitive-topic text data into cognitive threat topic texts and initially suspected cognitive threat topic texts; an intermediate detection that classifies the initially suspected texts into cognitive threat topic texts, suspected cognitive threat topic texts, and non-cognitive-threat topic texts; and a final detection that obtains cognitive threat topic texts from the suspected texts through manual annotation;
Constructing a cognitive threat propagation knowledge graph through named entity recognition and entity relation extraction on the cognitive threat topic texts;
Performing user tracing, event tracing, and organization tracing of the propagation of cognitive threat topic texts based on the cognitive threat propagation knowledge graph.
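The tracing step can be pictured with a small sketch. It assumes the knowledge graph is stored as (head, relation, tail) triples and that propagation is recorded with a repost relation; the relation names, account names, and the rule of following the first upstream edge are all hypothetical and are not taken from the patent.

```python
from collections import defaultdict

# Hypothetical triples (head, relation, tail); all names are illustrative only.
TRIPLES = [
    ("user_c", "reposted_from", "user_b"),
    ("user_b", "reposted_from", "user_a"),
    ("user_d", "reposted_from", "user_a"),
    ("user_a", "member_of", "org_x"),
    ("user_a", "posted_about", "event_1"),
]


def build_graph(triples):
    """Index triples as graph[head][relation] -> list of tails."""
    graph = defaultdict(lambda: defaultdict(list))
    for head, relation, tail in triples:
        graph[head][relation].append(tail)
    return graph


def trace_user_source(graph, user, relation="reposted_from"):
    """Walk repost edges upstream until an account with no upstream edge is reached."""
    path, seen = [user], {user}
    while graph[user][relation]:
        user = graph[user][relation][0]  # simplification: follow the first upstream edge
        if user in seen:  # guard against cycles in the graph
            break
        seen.add(user)
        path.append(user)
    return path


graph = build_graph(TRIPLES)
path = trace_user_source(graph, "user_c")
origin = path[-1]
# User tracing gives the path; organization and event tracing read the origin's other edges.
print(path, graph[origin]["member_of"], graph[origin]["posted_about"])
```

A production system would more likely use a graph database and consider every upstream edge; the dictionary index here only shows how user, organization, and event tracing can share one graph.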
As the social media cognitive threat detection method of the invention, further, collecting sensitive-topic text data from network platforms and preprocessing the data comprises:
First, collecting sensitive-topic text information and related user data from network platforms in a distributed manner according to a user authorization information base;
Then, for the collected text information, merging the title with the body, removing redundant information with a redundancy detection algorithm, deduplicating the related comments, cleaning and converting noisy text data, and segmenting the text into words with a word segmentation system.
As the social media cognitive threat detection method of the invention, further, in the primary detection a sentiment analysis method is used to divide the sensitive-topic text data into cognitive threat topic texts and initially suspected cognitive threat topic texts, where the division comprises:
First, constructing a basic sentiment dictionary from known sentiment dictionaries using word frequency statistics, and expanding the dictionary by computing correlation statistics between words in the text data and entries in the basic sentiment dictionary;
Next, taking each text in the sensitive-topic text data as a unit and using sentiment words as delimiters, computing sentiment weights for the clauses between delimiters, and judging the sentiment polarity of the text from the share of negative sentiment weight in the weight of all sentiment words;
Then, dividing the sensitive-topic text data into cognitive threat topic texts and initially suspected cognitive threat topic texts according to the sentiment polarity of each text.
As the social media cognitive threat detection method of the invention, further, constructing the basic sentiment dictionary from known sentiment dictionaries using word frequency statistics comprises:
First, selecting a series of sentiment words from the known sentiment dictionaries, ranking them by their search-engine click counts, and selecting a number of sentiment words according to click popularity;
Next, selecting the sentiment vocabulary most relevant to the topic on the basis of word frequency statistics, and forming the basic sentiment dictionary from the selected sentiment words together with this vocabulary;
Then, expanding the basic sentiment dictionary with synonyms and with candidate words that carry a sentiment orientation.
As the social media cognitive threat detection method of the invention, further, computing sentiment weights for the clauses between delimiters and judging the sentiment polarity of the text from the share of negative sentiment weight in the weight of all sentiment words comprises:
First, for each clause between delimiters, tallying the sentiment orientation through sentiment word analysis, negation word analysis, adverb analysis, fixed collocation analysis, transition word analysis, and exclamatory sentence analysis;
Then, computing, over all clauses of the text, the sum of the negative sentiment orientation values and the sum of the absolute values of all sentiment weights, and judging the sentiment polarity of the text from the share of negative sentiment word weight in the total weight of all sentiment words in the text.
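The ratio described above can be written out as a short sketch. The clause scoring rule (negation flips the sign, a degree adverb scales the weight) and the 0.5 decision threshold are assumptions made for illustration; the patent names the analyses but gives no formulas or threshold values here.

```python
def clause_score(word_weight, negation_count=0, degree=1.0):
    """Score one clause: each negation flips the sign, a degree adverb scales the weight.

    This rule is an assumed simplification of the negation and adverb analyses.
    """
    sign = -1.0 if negation_count % 2 else 1.0
    return sign * degree * word_weight


def negative_ratio(clause_scores):
    """Share of negative sentiment weight in the total absolute sentiment weight."""
    negative_total = sum(-score for score in clause_scores if score < 0)
    absolute_total = sum(abs(score) for score in clause_scores)
    if absolute_total == 0:
        return 0.0  # no sentiment words were found in the text
    return negative_total / absolute_total


def polarity(clause_scores, threshold=0.5):
    """Label a text as negative when the negative share exceeds a hypothetical threshold."""
    return "negative" if negative_ratio(clause_scores) > threshold else "non-negative"


# Three clauses: a negated, intensified positive word, a positive word, a negative word.
scores = [clause_score(1.0, negation_count=1, degree=2.0), 1.0, -1.0]
print(negative_ratio(scores), polarity(scores))
```

Normalizing by the absolute total keeps the ratio between 0 and 1 regardless of text length, which is what allows one threshold to be applied to both short posts and long articles.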
As the social media cognitive threat detection method of the invention, further, in the intermediate detection a deep learning method is used to classify the initially suspected cognitive threat topic texts into cognitive threat topic texts, suspected cognitive threat topic texts, and non-cognitive-threat topic texts, where the classification comprises:
Building a deep learning model and pre-training it on a labeled training data set, where the deep learning model comprises a BERT model that produces word vector representations of the input and a BiLSTM model that performs cognitive threat detection on those word vectors;
Feeding the initially suspected cognitive threat topic texts into the pre-trained deep learning model, using the model to obtain a cognitive threat probability value, and using that probability value to determine which of the initially suspected texts are cognitive threat topic texts, suspected cognitive threat topic texts, and non-cognitive-threat topic texts.
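How a single probability value might yield three classes is sketched below. The two cutoffs (0.8 and 0.3) and the class names are hypothetical; the patent states only that the probability value determines the class. The BERT and BiLSTM model itself is outside the sketch, which starts from an already computed probability.

```python
def triage(probability, high=0.8, low=0.3):
    """Map a model probability to one of three classes using two assumed cutoffs."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if probability >= high:
        return "cognitive_threat"
    if probability <= low:
        return "non_threat"
    return "suspected"  # left for manual annotation in the final detection


def route(scored_texts, high=0.8, low=0.3):
    """Group (text_id, probability) pairs by the class assigned to each text."""
    buckets = {"cognitive_threat": [], "suspected": [], "non_threat": []}
    for text_id, probability in scored_texts:
        buckets[triage(probability, high, low)].append(text_id)
    return buckets


buckets = route([("t1", 0.93), ("t2", 0.55), ("t3", 0.08)])
print(buckets)
```

Moving the two cutoffs apart sends more texts to manual annotation and fewer to automatic decisions, so in practice they would be tuned on labeled data.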
As the social media cognitive threat detection method of the invention, further, obtaining cognitive threat topic texts from the preprocessed sensitive-topic text data through multi-level cognitive threat detection also comprises: applying the sentiment analysis method to the comment section of a cognitive threat topic text and assessing the degree of impact of the cognitive threat from the overall sentiment orientation of the comments.
As the social media cognitive threat detection method of the invention, further, constructing the cognitive threat propagation knowledge graph through named entity recognition and entity relation extraction on the cognitive threat topic texts comprises:
Building a named entity extraction model and optimizing it with adversarial training, where the named entity extraction model comprises: an encoder that maps input characters into a real-valued space and mines latent semantics; a BiLSTM neural network layer that extracts contextual semantic information by capturing forward and backward features in the encoder's output vectors; and a CRF (conditional random field) layer that takes the bidirectional features extracted by the BiLSTM layer as input and, following the BIOES tagging scheme, generates the label for each character;
Feeding the cognitive threat topic texts into the optimized named entity extraction model and using it to identify the entity categories and relations in those texts.
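To make the BIOES tagging scheme concrete, the sketch below converts a per-character tag sequence, such as a CRF layer would emit, into entity spans. The entity types (PER, ORG) and the sample sentence are hypothetical; the patent does not list its entity categories in this passage.

```python
def decode_bioes(chars, tags):
    """Convert per-character BIOES tags into (entity_text, entity_type) pairs.

    B = begin, I = inside, O = outside, E = end, S = single-character entity.
    Malformed sequences (for example an E without a matching B) are skipped.
    """
    entities = []
    start, label = None, None
    for index, tag in enumerate(tags):
        prefix, _, tag_label = tag.partition("-")
        if prefix == "S":
            entities.append((chars[index], tag_label))
            start, label = None, None
        elif prefix == "B":
            start, label = index, tag_label
        elif prefix == "I":
            if start is None or tag_label != label:
                start, label = None, None  # broken span, discard it
        elif prefix == "E":
            if start is not None and tag_label == label:
                entities.append(("".join(chars[start:index + 1]), label))
            start, label = None, None
        else:  # "O" or an unknown tag closes any open span
            start, label = None, None
    return entities


chars = list("张三在微博发帖")
tags = ["B-PER", "E-PER", "O", "B-ORG", "E-ORG", "O", "O"]
entities = decode_bioes(chars, tags)
print(entities)
```

Compared with plain BIO tagging, the explicit E and S tags mark entity boundaries directly, which is one reason the scheme is commonly paired with a CRF layer.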
As the social media cognitive threat detection method of the invention, further, in constructing the cognitive threat propagation knowledge graph through named entity recognition and entity relation extraction on the cognitive threat topic texts, two named entity extraction models are connected as a pipeline: the first model treats entity identification in the cognitive threat topic texts as a single-label multi-class task, and the second model treats relation identification as a multi-label multi-class task, taking the input of the first named entity extraction model as its input to identify the relations between entities.
Further, the invention also provides a social media cognitive threat detection system, comprising a data collection server, multiple cognitive threat identification servers, a knowledge graph server, and a web server, where:
The data collection server collects sensitive-topic text data from network platforms and preprocesses the data;
The multiple cognitive threat identification servers obtain cognitive threat topic texts from the preprocessed sensitive-topic text data through multi-level cognitive threat detection, and specifically comprise: a primary identification server that divides the sensitive-topic text data into cognitive threat topic texts and initially suspected cognitive threat topic texts; an intermediate identification server that classifies the initially suspected texts into cognitive threat topic texts, suspected cognitive threat topic texts, and non-cognitive-threat topic texts; and a final identification server that obtains cognitive threat topic texts from the suspected texts through manual annotation;
The knowledge graph server constructs the cognitive threat propagation knowledge graph through named entity recognition and entity relation extraction on the cognitive threat topic texts;
The web server performs user tracing, event tracing, and organization tracing of the propagation of cognitive threat topic texts based on the cognitive threat propagation knowledge graph, through a web interactive interface.
Beneficial effects of the invention:
The invention can draw on social networking platforms such as Weibo, Zhihu, and WeChat official accounts. It performs multi-dimensional sentiment analysis on crawled sensitive-topic texts and their comments to detect cognitive threat topic texts, and it identifies named entities related to cognitive threats and extracts the relations between them to construct a cognitive threat propagation knowledge graph. The knowledge graph supports visual presentation of user tracing, event tracing, and organization tracing of cognitive threat propagation. By uncovering implicit relations, the invention predicts cognitive threat propagation, monitors key accounts, groups, organizations, and users in real time, provides analysis of cognitive countermeasure strategies, and blocks the spread of cognitive threats. This strengthens the supervision of online cognitive threats, deters related illegal online behavior, and effectively purifies cyberspace.
Description of the drawings:
Figure 1 is a schematic of the social media cognitive threat detection flow in an embodiment;
Figure 2 is a schematic of the cognitive threat detection and measurement flow in an embodiment;
Figure 3 is a schematic of the adversarial training flow in an embodiment;
Figure 4 is a schematic of the layers of knowledge graph visualization construction in an embodiment.
Detailed description:
To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and the technical solution.
A well-developed network environment makes cognitive-domain threats simpler to carry out; they can be carried out separately or jointly across multiple dimensions and levels, thereby affecting the value system of society as a whole. This embodiment, referring to Figure 1, provides a method for detecting cognitive threats on social media, comprising:
S101. Collecting sensitive-topic text data from network platforms and preprocessing the data;
S102. For the preprocessed sensitive-topic text data, obtaining cognitive threat topic texts through multi-level cognitive threat detection, where the multi-level detection comprises: a primary detection that divides the sensitive-topic text data into cognitive threat topic texts and initially suspected cognitive threat topic texts; an intermediate detection that classifies the initially suspected texts into cognitive threat topic texts, suspected cognitive threat topic texts, and non-cognitive-threat topic texts; and a final detection that obtains cognitive threat topic texts from the suspected texts through manual annotation;
S103. Constructing a cognitive threat propagation knowledge graph through named entity recognition and entity relation extraction on the cognitive threat topic texts;
S104. Performing user tracing, event tracing, and organization tracing of the propagation of cognitive threat topic texts based on the cognitive threat propagation knowledge graph.
Drawing on social networking platforms such as Weibo, Zhihu, and WeChat official accounts, the method performs multi-dimensional sentiment analysis on crawled sensitive-topic texts and their comments to detect cognitive threat topic texts, identifies named entities related to cognitive threats and extracts the relations between them to construct a cognitive threat propagation knowledge graph, and uses the knowledge graph to trace the users who propagate cognitive threats. Building the cognitive threat knowledge graph makes it possible to predict propagation paths and to provide analysis of cognitive countermeasure strategies. The method is intended to identify threats accurately in the short-text setting of social media platforms and to maintain a relatively high accuracy on long media texts as well, so that news media are held accountable for what they publish and unscrupulous outlets are deterred.
As a preferred embodiment, further, collecting sensitive-topic text data from network platforms and preprocessing the data comprises:
First, collecting sensitive-topic text information and related user data from network platforms in a distributed manner according to a user authorization information base;
Then, for the collected text information, merging the title with the body, removing redundant information with a redundancy detection algorithm, deduplicating the related comments, cleaning and converting noisy text data, and segmenting the text into words with a word segmentation system.
Large volumes of sensitive-topic text data can be obtained from social platforms such as Weibo and WeChat official accounts through API interfaces; each record contains ten fields, including the article title, body, and comments. To ease further processing, the data are preprocessed as follows: the title and body are merged, redundant information is removed with a redundancy detection algorithm, comments are deduplicated, noisy data are cleaned and converted, and the text is segmented into words with ICTCLAS. The redundancy detection algorithm can be designed with the following steps:
Step 1: Split the text into sentences at punctuation marks;
Step 2: Take the first 5 sentences of the split article; delete any sentence containing phrases such as "关注我们" ("follow us") or "点击**字体" ("click the ** text"), and keep the rest;
Step 3: Take the first 10 sentences of the split article; delete any sentence containing text such as "编辑" ("editor"), "初审" ("first review"), or "点击在看" ("tap 'Wow'"), and keep the rest;
Step 4: Merge the retained sentences back into a single text.
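The four steps can be sketched as follows. Several details are assumptions rather than statements from the patent: the set of sentence-ending punctuation, reading the "**" in "点击**字体" as a wildcard of up to four characters, and counting sentence positions before any deletion.

```python
import re

# Assumed sentence-ending punctuation (full-width and ASCII variants).
SENTENCE = re.compile(r"[^。!?!?]+[。!?!?]*")

# Step 2 patterns, checked in the first 5 sentences; the wildcard width is a guess.
HEAD_PATTERNS = [re.compile("关注我们"), re.compile(r"点击.{0,4}字体")]
# Step 3 patterns, checked in the first 10 sentences.
CREDIT_PATTERNS = [re.compile("编辑"), re.compile("初审"), re.compile("点击在看")]


def remove_redundancy(text):
    """Drop promotional and credit sentences near the start of an article."""
    sentences = SENTENCE.findall(text)  # Step 1: split at punctuation
    kept = []
    for index, sentence in enumerate(sentences):
        if index < 5 and any(p.search(sentence) for p in HEAD_PATTERNS):
            continue  # Step 2: promotional phrase in the first 5 sentences
        if index < 10 and any(p.search(sentence) for p in CREDIT_PATTERNS):
            continue  # Step 3: credit line in the first 10 sentences
        kept.append(sentence)
    return "".join(kept)  # Step 4: merge what is left


raw = "关注我们。点击蓝色字体订阅!编辑:小王。这是正文第一句。这是正文第二句。"
cleaned = remove_redundancy(raw)
print(cleaned)
```

Restricting each pattern to the opening sentences keeps a body sentence that merely mentions, say, an editor from being deleted further into the article.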
Note that the data collected in this embodiment can be sensitive-topic text data from any social networking platform. Collection can go through the Weibo platform and can also sample social platforms such as Zhihu and WeChat official accounts at random. The main flow of Weibo data collection can include user authorization, acquisition of newly published posts, post information updates, and acquisition of user information. User authorization is completed through OAuth2; newly published posts, post information updates, and user information are obtained by automated calls to the API that Weibo officially publishes.
In the primary detection, a sentiment analysis method can be used to divide the sensitive-topic text data into cognitive threat topic texts and initially suspected cognitive threat topic texts, where the division comprises:
First, constructing a basic sentiment dictionary from known sentiment dictionaries using word frequency statistics, and expanding the dictionary by computing correlation statistics between words in the text data and entries in the basic sentiment dictionary;
Next, taking each text in the sensitive-topic text data as a unit and using sentiment words as delimiters, computing sentiment weights for the clauses between delimiters, and judging the sentiment polarity of the text from the share of negative sentiment weight in the weight of all sentiment words;
Then, dividing the sensitive-topic text data into cognitive threat topic texts and initially suspected cognitive threat topic texts according to the sentiment polarity of each text.
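One possible reading of "using sentiment words as delimiters" is sketched below: the token sequence is cut after each sentiment word, so that every clause carries one sentiment word together with the modifiers that precede it. This interpretation, the sample tokens, and the sample lexicon are assumptions; the patent does not spell out the splitting rule.

```python
def split_on_sentiment_words(tokens, lexicon):
    """Cut a token list after each sentiment word found in the lexicon."""
    segments, current = [], []
    for token in tokens:
        current.append(token)
        if token in lexicon:
            segments.append(current)  # close the clause at the sentiment word
            current = []
    if current:
        segments.append(current)  # trailing tokens that contain no sentiment word
    return segments


# Hypothetical lexicon: word -> signed sentiment weight.
lexicon = {"糟糕": -1.0, "好": 1.0}
tokens = ["这个", "政策", "非常", "糟糕", "但是", "结果", "不", "好"]
segments = split_on_sentiment_words(tokens, lexicon)
print(segments)
```

Under this reading, the degree adverb "非常" and the negation "不" each end up in the same clause as the sentiment word they modify, which is what the later per-clause analyses would need.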
Multi-dimensional sentiment analysis is the process of using natural language processing and text mining to analyze and process subjective, emotionally colored text along several dimensions: sentiment polarity, sentiment intensity, and sentiment category. Sentiment analysis is an important research direction in NLP. Correct and effective sentiment analysis can quickly recover the positive or negative emotions that people express in text, which helps uncover the sentiment orientation behind the text and in turn separate out the political threats and the cognitive threats of a culturally infiltrating nature that are hidden in massive amounts of information. By granularity, sentiment analysis tasks can be divided into document level, sentence level, and word or phrase level; by the kind of text processed, into text-based and comment-based sentiment analysis; and by task type, into sub-problems such as sentiment classification, sentiment retrieval, and sentiment extraction. In this embodiment, Figure 2 shows the basic flow of cognitive threat identification and measurement based on dynamic expansion of the sentiment dictionary and deep learning.
Sentiment classification methods fall broadly into dictionary-based methods and deep-learning-based methods; each has its own characteristics and shortcomings. A dictionary-based method quantifies the sentiment polarity of a text using a sentiment dictionary annotated with polarity, classifying by means of a set of rules together with the dictionary: words in the dictionary are first matched against words in the text under analysis, the sentiment value of each sentence is then computed, and that value finally serves as the basis for classifying the sentence's sentiment orientation. Although the accuracy of this approach is relatively high, building a sentiment dictionary is costly, and the approach ignores the relations between words in the text and so lacks word-sense information. A deep-learning-based method treats sentiment classification as a special kind of text classification and classifies text using manual annotation and machine learning. It relies on manually labeled data and labels, on which a learning method is then applied to analyze the sentiment of the text; commonly used machine learning methods include Naive Bayes (NB), decision trees, and support vector machines (SVM). The effectiveness of this approach depends mainly on the quantity and quality of the manually labeled data, so it is strongly affected by human subjectivity and consumes a great deal of labor.
In view of the characteristics of the two approaches, the dictionary-based and deep-learning-based methods are combined and optimized, yielding a multidimensional sentiment analysis method that couples a dynamically expanded sentiment dictionary with deep learning. This overcomes the respective shortcomings of the two approaches and achieves higher accuracy.
Here, building the basic sentiment dictionary from known sentiment dictionaries and word-frequency statistics comprises:
First, selecting a series of sentiment words from a known sentiment dictionary, ranking them by their search-engine hit counts, and selecting a number of sentiment words according to that popularity;
Next, selecting the sentiment vocabulary most relevant to the topic based on word-frequency statistics, and combining it with the selected sentiment words to form the basic sentiment dictionary;
Then, expanding the basic sentiment dictionary with synonyms and with candidate words that carry a sentiment orientation.
Further, computing sentiment weights for the clause between each pair of delimiters, and judging the sentiment polarity of the text from the share of negative sentiment weight among all sentiment word weights, comprises:
First, for each clause between delimiters, measuring the sentiment orientation through sentiment word analysis, negation word analysis, adverb analysis, fixed collocation analysis, transition word analysis, and exclamatory sentence analysis;
Then, computing the sum of the negative sentiment orientation values of all clauses in the text and the sum of the absolute values of all sentiment weights, and using the share of negative sentiment word weights among all sentiment word weights in the text to judge the text's sentiment polarity.
A series of sentiment words can be selected from HowNet and entered into a search engine one by one. The words are ranked by the hit counts (hits values) the search engine returns, and the words with the highest hit counts are taken as basic sentiment words. In addition, a word-frequency method is used to select, semi-automatically, basic sentiment vocabulary that is more relevant to the topic; together these form the basic sentiment dictionary. Because most words with an emotional component are adjectives, verbs, and some nouns, after preprocessing it suffices to compute word frequencies over a text collection with sufficiently many entries and then, among the high-frequency words, select the 20 most frequent positive sentiment words and the 20 most frequent negative sentiment words. These, together with the general basic sentiment vocabulary, constitute the basic sentiment dictionary.
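The word-frequency selection step can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `top_frequent_words`, the toy tokens, and the candidate sets are assumptions, and the texts are assumed to be already segmented into tokens.

```python
from collections import Counter

def top_frequent_words(tokenized_texts, candidate_words, k=20):
    """Return the k most frequent candidate words across all tokenized texts."""
    counts = Counter(
        tok
        for text in tokenized_texts
        for tok in text
        if tok in candidate_words
    )
    return [word for word, _ in counts.most_common(k)]

# Toy corpus (hypothetical); real input would be segmented microblog posts.
texts = [["good", "service", "bad"], ["good", "awful"], ["good", "bad"]]
positive_candidates = {"good", "great"}
negative_candidates = {"bad", "awful"}

pos_seed = top_frequent_words(texts, positive_candidates, k=20)
neg_seed = top_frequent_words(texts, negative_candidates, k=20)
```

With k=20 as in the text, the two lists would then be merged with the general basic sentiment vocabulary.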
Because the words in the basic sentiment dictionary express a strong sentiment orientation, the negative sentiment words in it can be assigned an orientation value of -1. The basic dictionary is small and cannot contain every sentiment-bearing word that appears in the text set, so it must be expanded into a relatively complete sentiment dictionary. This can be done by adding synonyms and by adding candidate words that carry a sentiment orientation.
Adding synonyms helps recognize sentiment vocabulary more broadly; an existing thesaurus is used to extend the basic sentiment dictionary with synonyms. To keep the sentiment orientation computation efficient, however, the commonly used synonyms still need to be screened manually. After expansion the sentiment dictionary grows to 256 words, and the orientation value of synonyms of negative sentiment words can be set to -1.
Building a sentiment dictionary with no omissions at all is very difficult. However, by analyzing the correlation between each word in the text set and the words in the sentiment dictionary, and adding the highly correlated words to the dictionary, a sentiment dictionary with broader coverage can be built effectively.
Pointwise mutual information (PMI) can be used to compute the correlation between a candidate word and the sentiment words in the dictionary, and so to decide whether the candidate should be added to the sentiment dictionary. PMI measures the correlation between words based on mutual information theory. The basic idea is to measure the probability that two words wordi and wordj co-occur in the text: the higher the co-occurrence probability, the stronger the correlation between the two words. The formula is:
$$\mathrm{PMI}(word_i, word_j) = \log \frac{p(word_i \wedge word_j)}{P(word_i)\,P(word_j)}$$
where p(wordi ∧ wordj) is the probability that wordi and wordj co-occur in the text, computed as:
$$p(word_i \wedge word_j) = \frac{numSentence(word_i, word_j)}{n}$$
Here n is the total number of clauses in the text and numSentence(wordi, wordj) is the number of clauses that contain both wordi and wordj. P(wordi) and P(wordj) are the proportions of clauses that contain wordi and wordj, respectively, computed as:
$$P(word_i) = \frac{numSentence(word_i)}{n}$$
where numSentence(wordi) is the number of clauses that contain wordi. PMI(wordi, wordj) expresses how much information about one of the two words is gained when the other is observed, and so captures the statistical correlation between them. When PMI is greater than 0 the two words are correlated, and the larger the value, the stronger the correlation. When PMI equals 0 the two words are statistically independent. When PMI is less than 0 the two words tend to exclude each other, that is, they co-occur less often than independence would predict.
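The clause-level PMI computation described above can be sketched as follows. This is an illustrative sketch only: each clause is assumed to be a set of tokens, the logarithm base (2) is an assumption because the text does not state it, and returning `None` when the words never co-occur is a choice made here, not something the source specifies.

```python
import math

def pmi(clauses, word_i, word_j):
    """Clause-level PMI; `clauses` is a list of token sets, one per clause."""
    n = len(clauses)
    n_i = sum(1 for c in clauses if word_i in c)    # numSentence(word_i)
    n_j = sum(1 for c in clauses if word_j in c)    # numSentence(word_j)
    n_ij = sum(1 for c in clauses if word_i in c and word_j in c)
    if n_ij == 0:
        return None  # PMI is undefined (log of zero) without co-occurrence
    return math.log2((n_ij / n) / ((n_i / n) * (n_j / n)))

# Two words that always appear together in 2 of 4 clauses.
correlated = [{"a", "b"}, {"a", "b"}, {"c"}, {"d"}]
# Two words whose co-occurrence matches independence exactly.
independent = [{"a", "b"}, {"a"}, {"b"}, {"c"}]
```

In the first toy corpus the ratio is 0.5 / (0.5 × 0.5) = 2, giving a PMI of 1 bit; in the second it is 0.25 / (0.5 × 0.5) = 1, giving 0.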
After word segmentation with the ICTCLAS system, the part-of-speech property of each word is obtained. SO-PMI values are then computed only for the two kinds of candidate words defined by word.propertyal∈{a,d,an,ag,al} and word.propertyal∈{vn,vd,vi,vg,vl}; words with any other part of speech are treated directly as neutral. This restriction addresses a problem that arises when adding related words: some words that carry no sentiment orientation of their own co-occur very often with positive or negative sentiment words and would otherwise be wrongly added to the sentiment dictionary, causing needless overhead in sentiment classification and lowering accuracy. Restricting the candidates also makes the dictionary expansion algorithm more efficient. The SO-PMI value of a candidate word is computed as follows: compute the PMI between the candidate and the words of the positive basic dictionary, compute the PMI between the candidate and the negative words, and subtract the latter from the former. The formula is:
$$\text{SO-PMI}(word) = \sum_{posWord \in posWords} \mathrm{PMI}(word, posWord) - \sum_{negWord \in negWords} \mathrm{PMI}(word, negWord)$$
The relationship between the SO-PMI value and sentiment orientation can be adjusted so that a word is treated as positive when 1.36 < SO-PMI(word) < 23 and as negative when -16 < SO-PMI(word) < -1.
In summary, the sentiment dictionary expansion method is as follows:
posWords:
If word is a positive word in the basic sentiment dictionary, then word is added to posWords;
If word is a synonym of a positive word in the basic sentiment dictionary, then word is added to posWords;
If word satisfies word.propertyal∈{a,d,an,ag,al} or word.propertyal∈{vn,vd,vi,vg,vl}, and 1.36<SO-PMI(word)<23, then word is added to posWords.
Similarly, negWords:
If word is a negative word in the basic sentiment dictionary, then word is added to negWords;
If word is a synonym of a negative word in the basic sentiment dictionary, then word is added to negWords;
If word satisfies word.propertyal∈{a,d,an,ag,al} or word.propertyal∈{vn,vd,vi,vg,vl}, and -16<SO-PMI(word)<-1, then word is added to negWords.
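The third rule of each list (part-of-speech filter plus SO-PMI threshold) can be sketched as below. The thresholds and tag sets come from the text; everything else is an assumption made for illustration: the function names, the base-2 logarithm, the treatment of never-co-occurring pairs as contributing 0, and the toy corpus.

```python
import math

ADJ_TAGS = {"a", "d", "an", "ag", "al"}     # first candidate class in the text
VERB_TAGS = {"vn", "vd", "vi", "vg", "vl"}  # second candidate class in the text

def pmi(clauses, w1, w2):
    """Clause-level PMI; pairs that never co-occur contribute 0 (assumption)."""
    n = len(clauses)
    n1 = sum(1 for c in clauses if w1 in c)
    n2 = sum(1 for c in clauses if w2 in c)
    n12 = sum(1 for c in clauses if w1 in c and w2 in c)
    if n12 == 0:
        return 0.0
    return math.log2((n12 / n) / ((n1 / n) * (n2 / n)))

def so_pmi(clauses, word, pos_words, neg_words):
    """Sum of PMI with positive seeds minus sum of PMI with negative seeds."""
    pos = sum(pmi(clauses, word, p) for p in pos_words)
    neg = sum(pmi(clauses, word, q) for q in neg_words)
    return pos - neg

def classify_candidate(word, tag, clauses, pos_words, neg_words):
    """Apply the part-of-speech filter, then the SO-PMI thresholds."""
    if tag not in ADJ_TAGS | VERB_TAGS:
        return "neutral"
    score = so_pmi(clauses, word, pos_words, neg_words)
    if 1.36 < score < 23:
        return "positive"
    if -16 < score < -1:
        return "negative"
    return "neutral"

# Toy corpus of 8 clauses (hypothetical tokens).
clauses = [{"terrible", "bad"}] * 2 + [{"good", "nice"}] * 2 + [{"weather"}] * 4
pos_words, neg_words = {"good"}, {"bad"}
```

In this toy corpus "nice" co-occurs only with the positive seed and "terrible" only with the negative seed, so they land in posWords and negWords respectively, while a noun such as "weather" is filtered out before any PMI is computed.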
On the basis of the sentiment dictionary, each text sentence S is taken as a unit and each sentiment word WS in the sentence is used as a delimiter. A sentiment weight is computed for the clause phrase(WSi-1, WSi) between two delimiters, where the clause contains the word WSi but not the word WSi-1. The model consists of the following modules: sentiment word analysis, negation word analysis, adverb analysis, fixed collocation analysis, transition word analysis, and exclamatory sentence analysis.
Sentiment word analysis: for each word in the text to be analyzed, the sentiment dictionary is scanned to determine whether the word is present. If it is, the word is treated as a sentiment word, and its sentiment orientation value is read from the negative sentiment dictionary and returned; if not, the word is treated as neutral and 0 is returned. This loop continues until every word in the text set has been judged. Computing an orientation value for every word yields the effective sentiment words (those whose weight is not 0) and filters out sentiment words that play no emotional role in a particular sentence (those whose weight is 0).
Negation word analysis: when a sentiment word WSi occurs in a sentence, the number of negation words negNum(WSi-1, WSi) between WSi and the preceding delimiter WSi-1 (that is, within one clause) is counted. If negNum is odd, the clause's sentiment value is the negated orientation value of the sentiment word; otherwise the original orientation value is kept.
Adverb analysis: each word is checked against the adverb dictionary. If it is present, the adverb's sentiment intensity is read from the adverb dictionary and multiplied by the clause's current sentiment orientation value to give the clause's sentiment weight.
Transition word analysis: starting from the current sentiment word WSi, the text is scanned onward to find the next sentiment word WSi+1. If a transition word is encountered during this scan, weight(phrase(WSi-1, WSi)) is negated, so that the orientation of phrase(WSi-1, WSi) leans toward that of the clause phrase(WSi, WSi+1) following the transition word.
Exclamatory sentence analysis: the exclamation mark "!" is taken as the marker of an exclamatory sentence and is denoted exc. Its sentiment weight is computed as follows: when an exclamation mark is encountered, the text is searched backward for the sentiment word WSi-1 nearest to the exclamation mark, and the orientation value of WSi-1 is used as the weight of exc.
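The sentiment word, negation, and adverb modules can be combined into a per-clause weight roughly as follows. This is a simplified sketch under stated assumptions: the function name, the dictionary contents, and the intensity value 2.0 are hypothetical, and the fixed collocation, transition word, and exclamation modules are left out for brevity.

```python
def clause_weight(tokens, sentiment_values, negation_words, adverb_strength):
    """Weight of one clause: orientation value x adverb intensity, sign-flipped
    when the clause contains an odd number of negation words."""
    base = 0.0        # stays 0 when the clause has no sentiment word (neutral)
    intensity = 1.0
    neg_count = 0
    for tok in tokens:
        if tok in negation_words:
            neg_count += 1
        elif tok in adverb_strength:
            intensity *= adverb_strength[tok]
        elif tok in sentiment_values:
            base = sentiment_values[tok]
    weight = base * intensity
    if neg_count % 2 == 1:
        weight = -weight
    return weight

# Hypothetical dictionaries for illustration only.
sentiment_values = {"bad": -1.0, "good": 1.0}
negation_words = {"not"}
adverb_strength = {"very": 2.0}

w_plain = clause_weight(["very", "bad"], sentiment_values, negation_words, adverb_strength)
w_negated = clause_weight(["not", "very", "bad"], sentiment_values, negation_words, adverb_strength)
w_double = clause_weight(["not", "not", "good"], sentiment_values, negation_words, adverb_strength)
```

A single negation flips the intensified negative weight to positive, while a double negation leaves the original orientation unchanged, matching the odd/even rule in the negation module.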
For a text S, two quantities are computed: weight(S), the sum of the negative sentiment orientation values of all clauses in S, and total(S), the sum of the absolute values of all sentiment weights. The share of negative sentiment word weights among all sentiment word weights in the text is scale(S) = weight(S)/total(S). The sentiment polarity of S is judged from scale(S), and a preliminary judgment of the text's cognitive threat nature is made from that polarity. A text with scale(S) in the interval [0.68, 1] is regarded as very likely a cognitive threat; a text with scale(S) in [0, 0.68) is regarded as a suspected cognitive threat topic text. This completes the first-stage classification of a text's cognitive threat nature.
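A minimal sketch of the first-stage decision follows. The 0.68 threshold is from the text; the function names and label strings are assumptions, and weight(S) is taken here as the magnitude of the summed negative clause weights so that scale(S) falls in [0, 1] as the stated intervals require.

```python
def negative_scale(clause_weights):
    """scale(S): share of negative weight among all sentiment weight in a text."""
    total = sum(abs(w) for w in clause_weights)
    if total == 0:
        return 0.0  # no sentiment words at all; treated as scale 0 (assumption)
    negative = sum(-w for w in clause_weights if w < 0)
    return negative / total

def first_stage_label(scale):
    """[0.68, 1] -> very likely a threat; [0, 0.68) -> suspected, sent to stage two."""
    return "likely_threat" if scale >= 0.68 else "suspected"

mostly_negative = [-2.0, -1.0, 1.0]   # scale = 3 / 4
mixed = [-1.0, 1.0, 1.0]              # scale = 1 / 3
```

Texts labeled "suspected" here are the ones passed on to the deep learning stage.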
As a preferred embodiment, further, in the intermediate-level detection a deep learning method is used to classify the initial suspected cognitive threat topic texts into cognitive threat topic texts, suspected cognitive threat topic texts, and non-cognitive-threat topic texts. The classification process comprises:
Building a deep learning model and training it on a labeled training data set, where the deep learning model contains a BERT model that produces word vector representations of the input and a BiLSTM model that performs cognitive threat detection on those word vectors;
Feeding the initial suspected cognitive threat topic texts into the trained deep learning model, using the model to obtain a cognitive threat probability value, and using that probability value to determine which of the initial suspected texts are cognitive threat topic texts, suspected cognitive threat topic texts, or non-cognitive-threat topic texts.
In this embodiment, in view of the characteristics of the two approaches, the dictionary-based and deep-learning-based methods are combined and optimized, yielding a multidimensional sentiment analysis method that couples a dynamically expanded sentiment dictionary with deep learning, which overcomes the respective shortcomings of the two approaches and achieves higher accuracy.
Cognitive threat identification based on sentiment analysis analyzes text in two stages. In the first stage, existing resources such as the HowNet sentiment dictionary and the BosonNLP sentiment dictionary can be consulted, and a basic sentiment dictionary is built using word-frequency statistics. The statistical correlation between candidate words and the words in the basic dictionary is computed to judge the candidates' sentiment orientation, which realizes dynamic expansion of the sentiment dictionary. On the basis of the sentiment dictionary, the negation dictionary, and the degree adverb dictionary, each text S is taken as a unit and each sentiment word WS in the sentence is used as a delimiter. For the clauses phrase(WSi-1, WSi) between delimiters, the sum of negative sentiment weights weight(S) and the sum of absolute values of sentiment word weights total(S) are computed, and scale(S) is defined as the share of negative sentiment weight among all sentiment word weights. The sentiment polarity of S is judged from scale(S), and a preliminary judgment of the text's cognitive threat nature is made from that polarity, completing the initial identification of cognitive threats. Statistics over a large number of collected experimental texts show that a text whose negative sentiment weight scale(S) lies in [0.68, 1] is very likely a cognitive threat, while a text with scale(S) in [0, 0.68) is a suspected cognitive threat topic text. Texts in the [0, 0.68) range are provisionally classified as suspected cognitive threats and passed to the second-stage discrimination.
The second stage can use a BERT+BiLSTM deep learning model as its core to analyze the text's sentiment orientation further and complete the re-identification of cognitive threats. Word vectors pre-trained with BERT (Bidirectional Encoder Representations from Transformers) replace conventionally trained word vectors and convert the segmented text into multidimensional word vectors. A bidirectional long short-term memory network (BiLSTM), which can handle both short-term and long-term dependencies, forms the core of the sentiment orientation analysis in this stage.
The manually labeled cognitive threat topic text set and known cognitive threat topic texts on the same topic are used as the training set for the BERT+BiLSTM model. The trained model then performs further sentiment analysis on the suspected cognitive threat topic texts from the first stage, and a Softmax classifier divides the texts into a confirmed cognitive threat topic text set, a suspected cognitive threat topic text set, and a non-cognitive-threat topic text set.
Texts that the first stage, based on the dynamically expanded sentiment dictionary, marks as suspected cognitive threats undergo second-stage discrimination, with the BERT+BiLSTM deep learning model as the core of further identification. The model is trained first, as follows: the training data set is manually labeled to indicate whether each text has a cognitive threat nature; after word segmentation, the BERT model produces word vector representations; and the resulting vectors are fed into the BiLSTM neural network. A BiLSTM model covering the cognitive threat samples is trained from those samples. The texts marked as suspected in the first stage are then vectorized with BERT and the vectors are passed into the cognitive threat model, which outputs a cognitive threat probability value. Experiments on a large number of texts show that a text with a probability in (0.68, 1] can be confirmed as a cognitive threat, a probability in (0.32, 0.68] indicates a suspected cognitive threat that requires manual review, and a probability in [0, 0.32] indicates a non-cognitive-threat text.
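The mapping from the model's probability output to the three second-stage classes can be sketched as below. The interval boundaries are taken from the text; the function name and label strings are assumptions, and the probability is assumed to come from the trained BERT+BiLSTM model, which is not shown here.

```python
def second_stage_label(probability):
    """Map a cognitive threat probability to one of three classes."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if probability > 0.68:        # (0.68, 1]
        return "cognitive_threat"
    if probability > 0.32:        # (0.32, 0.68]
        return "suspected"        # needs manual review
    return "non_threat"           # [0, 0.32]

labels = [second_stage_label(p) for p in (0.9, 0.68, 0.5, 0.32, 0.0)]
```

Note that the boundaries are half-open on the left, so a probability of exactly 0.68 still falls in the manual-review band.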
Experimental results show that, on a data set of nearly 5,000 microblog texts, cognitive threat identification using sentiment orientation analysis based purely on deep learning and purely on the sentiment dictionary achieves accuracies of 67.9% and 83.27%, respectively, whereas the combined cognitive threat identification based on multidimensional sentiment analysis in this case achieves 89.9%, which is comparatively better.
Further, in this embodiment, obtaining cognitive threat topic texts from the preprocessed sensitive topic text data through multi-level cognitive threat detection also includes using sentiment analysis on the comment section of a cognitive threat topic text to assess the cognitive threat's degree of influence from the overall sentiment orientation.
The degree of influence of a cognitive threat is defined by the overall sentiment orientation of the comment section under a text already identified as a cognitive threat. All comment texts under such a text are merged, and after data preprocessing and word segmentation the dictionary-based cognitive threat identification method described above can be used to judge the overall sentiment orientation of the comments. The direction of the comments provoked by the cognitive threat text serves as the basis for judging the threat degree, and the text's threat degree is rated from the comment sentiment analysis result. The threat degree can be divided, from high to low, into levels one, two, and three. The cognitive threat degree is level one when the share of negative sentiment weight among all sentiment word weights in the merged comments lies in [0.68, 1], level two when it lies in [0.32, 0.68), and level three when it lies in [0, 0.32). The analysis results can provide an important reference for responding to cognitive threats.
As a preferred embodiment, further, building the cognitive threat propagation knowledge graph through named entity recognition and entity relation extraction on the cognitive threat topic texts comprises:
Building a named entity extraction model and optimizing it with adversarial training, where the named entity recognition model contains an encoder that maps input characters into a real-valued space and mines their latent semantics, a BiLSTM neural network layer that extracts contextual semantic information by capturing forward and backward features of the encoder's output vectors, and a conditional random field (CRF) layer that takes the bidirectional features extracted by the BiLSTM layer as input and, together with the BIOES tagging scheme, generates the tag for each character;
Feeding the cognitive threat topic texts into the optimized named entity extraction model and using the model to identify the entity categories and relations in the texts.
Named entity recognition is at present commonly implemented with methods based on statistical machine learning. The named entity extraction model in this embodiment adopts the BERT-BiLSTM-CRF model, an end-to-end deep learning model developed from the BiLSTM-CRF model that requires no hand-crafted features and can meet the needs of current Chinese address parsing and address element tagging tasks. From the bottom up, the model consists of an encoder (Transformer), a BiLSTM neural network layer, and a conditional random field (CRF) layer. The Transformer encoder is a character-level Chinese BERT model that maps the input Chinese address characters into a low-dimensional dense real-valued space and mines the latent semantics of the various address elements. The BiLSTM layer takes the character vectors produced by the encoder as input and captures the forward (left-to-right) and backward (right-to-left) features of the Chinese address sequence, so that the semantic information of the context is fully captured. The CRF layer is a probabilistic graphical model that takes the bidirectional features extracted by the upstream BiLSTM as input and, together with the BIOES tagging scheme, generates the tag for each character in the address, so that the Chinese address is parsed into its address elements according to the tags. Because the computation takes the sequence structure into account, it can greatly improve named entity recognition.
There is as yet no established standard for entities in the cognitive threat domain, and most existing online named entity recognition tasks target only online public opinion identification. In this embodiment, the crawled data is analyzed and, according to the needs of cognitive threat identification, six entity types can be defined for the cognitive threat domain: user, time, address, platform, organization, and hot event, as shown in Table 1:
Table 1 Cognitive threat entity types
Entity tagging is the most important issue in a named entity recognition task and the foundation of model training. The two commonly used tagging schemes are BIO and BIOES. Although BIOES carries more information, it requires more labels to be predicted, and because the data set built in this case is limited in size, BIOES tagging might perform worse. In the BIO scheme, "B" marks the beginning of an entity, "I" marks the inside of an entity, and "O" marks a non-entity. Each entity type has both a "beginning" and an "inside" label, so the named entity recognition data set built in this case can be given 13 labels (2 for each of the 6 entity types, plus "O").
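The label count and the BIO scheme can be illustrated with a short sketch. The six entity types are from the text, but the tag names (`USER`, `TIME`, and so on), the helper functions, and the example sentence with its spans are assumptions made for illustration.

```python
ENTITY_TYPES = ["USER", "TIME", "ADDRESS", "PLATFORM", "ORG", "EVENT"]

def build_bio_labels(entity_types):
    """One B- and one I- label per entity type, plus the single O label."""
    labels = ["O"]
    for entity_type in entity_types:
        labels += [f"B-{entity_type}", f"I-{entity_type}"]
    return labels

def bio_tag(chars, spans):
    """Tag characters given (start, end_exclusive, entity_type) spans."""
    tags = ["O"] * len(chars)
    for start, end, entity_type in spans:
        tags[start] = f"B-{entity_type}"
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"
    return tags

labels = build_bio_labels(ENTITY_TYPES)

# Hypothetical character-level example: "微博" is a platform, "张三" a user.
chars = list("微博用户张三")
tags = bio_tag(chars, [(0, 2, "PLATFORM"), (4, 6, "USER")])
```

With six entity types this yields 2 × 6 + 1 = 13 labels, matching the count stated above; a BIOES scheme would need 4 × 6 + 1 = 25.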
The entities in the cognitive threat user knowledge graph have complex relationships, so when extracting knowledge from cognitive threat topic texts, the dependencies between adjacent tags must also be taken into account. BiLSTM is good at handling long-distance text information but cannot model the dependencies between adjacent tags. Therefore, in knowledge extraction for the cognitive threat user knowledge graph, a CRF (Conditional Random Field) can be added on top of the preliminary per-word tag predictions output by the BiLSTM; it corrects the output scores using the relationships between neighboring tags and obtains an optimal predicted sequence.
CRF层将上一层BILSTM的输出得分作为输入,输出符合标注转移约束条件的、最大可能的预测标注序列。对于任一个序列X=(x1,x2,…,xn);在此假定P是BiLSTM的输出得分矩阵,P的大小为n\times k,其中n为词的个数,k为标签个数,Pij表示第i个词第j个标签的分数,对预测序列Y=(y1,y2,…,yn)而言,得到它的分数函数为:The CRF layer takes the output score of the previous layer BILSTM as input, and outputs the most likely sequence of predicted annotations that meet the annotation transfer constraints. For any sequence X=(x1,x2,...,xn); here it is assumed that P is the output score matrix of BiLSTM, and the size of P is n\times k, where n is the number of words, k is the number of labels, Pij represents the score of the j-th label of the i-th word. For the prediction sequence Y=(y1,y2,...,yn), its score function is:
A表示转移分数矩阵,Aij代表标签i转移为标签j的分数,A的大小为k+2。预测序列Y产生的概率为:A represents the transfer score matrix, Aij represents the score of label i transferred to label j, and the size of A is k+2. The probability of predicting sequence Y produced is:
Taking the logarithm of both sides gives the log-likelihood of the predicted sequence:
Here $\widetilde{Y}$ denotes the true label sequence and $Y_X$ denotes the set of all possible label sequences. Decoding yields the output sequence with the maximum score:
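The four expressions referred to above (score function, sequence probability, log-likelihood and decoding) can be written as follows, assuming the standard BiLSTM-CRF formulation, which agrees with the definitions of $P$, $A$, $\widetilde{Y}$ and $Y_X$ given in the text. The start and end labels $y_0$ and $y_{n+1}$ and the summation variable $Y'$ are notation introduced here for illustration.

```latex
\begin{align}
s(X, Y) &= \sum_{i=0}^{n} A_{y_i,\,y_{i+1}} + \sum_{i=1}^{n} P_{i,\,y_i} \\
p(Y \mid X) &= \frac{\exp\big(s(X, Y)\big)}{\sum_{Y' \in Y_X} \exp\big(s(X, Y')\big)} \\
\log p(\widetilde{Y} \mid X) &= s(X, \widetilde{Y}) - \log \sum_{Y' \in Y_X} \exp\big(s(X, Y')\big) \\
Y^{*} &= \arg\max_{Y' \in Y_X} s(X, Y')
\end{align}
```

Training maximizes the log-likelihood of the true sequence $\widetilde{Y}$, and prediction returns $Y^{*}$.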
The CRF layer outputs the optimal label sequence for the cognitive threat topic text. Of interest are the words that correspond to labels such as forwarding user, forwarding time, forwarding location, forwarding platform, text summary information and text topic. These words are the basis for building the cognitive threat user knowledge graph, for tracing forwarding users, the forwarding process, forwarding times and forwarding platforms, and for inferring relationships among the users who forward cognitive threats.
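The decoding step can be illustrated with a small Viterbi routine in plain Python. This is a sketch under stated assumptions, not the patented implementation: instead of a transition matrix of order k+2, the start and end transitions are passed as two separate length-k vectors (`start`, `end`), which is an equivalent layout chosen here for readability.

```python
def sequence_score(P, A, start, end, y):
    """Score of label path y: emission scores P[i][y_i] plus all transition scores."""
    score = start[y[0]] + end[y[-1]]
    for i, label in enumerate(y):
        score += P[i][label]
        if i > 0:
            score += A[y[i - 1]][label]
    return score


def viterbi_decode(P, A, start, end):
    """Return (best_path, best_score) maximizing sequence_score over all label paths."""
    n, k = len(P), len(P[0])
    # best[j]: best score of any partial path that ends in label j at the current word
    best = [start[j] + P[0][j] for j in range(k)]
    backptr = []
    for i in range(1, n):
        new_best, ptr = [], []
        for j in range(k):
            prev = max(range(k), key=lambda p: best[p] + A[p][j])
            new_best.append(best[prev] + A[prev][j] + P[i][j])
            ptr.append(prev)
        best = new_best
        backptr.append(ptr)
    last = max(range(k), key=lambda j: best[j] + end[j])
    best_score = best[last] + end[last]
    # Follow the back-pointers from the last word to the first.
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, best_score
```

The routine runs in O(n k^2) time, whereas enumerating every sequence in Y_X would take O(k^n).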
When BERT or one of its variants is used, its parameters have already been pre-trained to a good level, so a low learning rate should be used to preserve them. The downstream task layers, by contrast, have not been pre-trained; with a low learning rate they train slowly and struggle to keep pace with BERT. The embodiment can therefore use a layer-wise learning-rate strategy: a smaller learning rate for the upstream pre-trained BERT layers and a larger learning rate for the downstream layers.
During training, once the loss starts to level off, a learning rate that is still large makes the model oscillate around the optimum instead of converging to it. To keep the loss within a narrow range of the optimum and moving steadily toward it, a learning-rate decay strategy is needed, that is, the step size of the parameter updates is reduced. The embodiment can use the following decay strategy: when model performance stops improving during training, reduce the learning rate, which can effectively improve model accuracy.
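The two learning-rate ideas above can be sketched in plain Python. Everything concrete here is an assumption for illustration: the parameter-name prefix `bert.`, the rates 2e-5 and 1e-3, the decay factor 0.5 and the patience of 2 evaluations are not given in the source. Deep learning frameworks provide equivalent facilities (per-group learning rates and reduce-on-plateau schedulers).

```python
def layerwise_learning_rates(param_names, bert_lr=2e-5, head_lr=1e-3, bert_prefix="bert."):
    """Map each parameter name to a rate: small for BERT, larger for downstream layers."""
    return {
        name: (bert_lr if name.startswith(bert_prefix) else head_lr)
        for name in param_names
    }


class PlateauDecay:
    """Multiply all rates by `factor` after `patience` evaluations without improvement."""

    def __init__(self, rates, factor=0.5, patience=2):
        self.rates = dict(rates)
        self.factor = factor
        self.patience = patience
        self.best = None
        self.bad_evals = 0

    def step(self, metric):
        """Record a validation metric (higher is better); decay the rates if it has stalled."""
        if self.best is None or metric > self.best:
            self.best = metric
            self.bad_evals = 0
        else:
            self.bad_evals += 1
            if self.bad_evals >= self.patience:
                self.rates = {n: lr * self.factor for n, lr in self.rates.items()}
                self.bad_evals = 0
        return self.rates
```

Calling `step` once per validation pass with, say, the entity-level F1 score keeps the BERT rate below the downstream rate throughout, since both are scaled by the same factor.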
BERT-BiLSTM-CRF serves as the named entity recognition model. Because neural networks are locally unstable, even a small perturbation of the input can cause a large error in the model output. The embodiment therefore optimizes the model with adversarial training, which improves robustness by feeding small perturbations into the model and thereby mitigates this local instability; see Figure 3. During training, BERT first generates initial vectors for the input text. Perturbations are then added to these vectors to produce adversarial examples, which are variants of the original samples that easily mislead the model. The initial vectors and the adversarial examples are fed into the BiLSTM together, and the network learns more robust parameters that resist adversarial-example attacks.
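The source does not state how the perturbation is computed. One common choice, assumed here purely for illustration, is the Fast Gradient Method (FGM), which moves each embedding a fixed distance `epsilon` along the direction of the loss gradient. The sketch below works on plain Python lists; the function names and the default `epsilon` are hypothetical.

```python
import math


def fgm_perturbation(grad, epsilon=1.0):
    """Return epsilon * grad / ||grad||_2, or a zero vector if the gradient vanishes."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0 for _ in grad]
    return [epsilon * g / norm for g in grad]


def adversarial_example(embedding, grad, epsilon=1.0):
    """Add the FGM perturbation to an embedding vector to form the adversarial input."""
    delta = fgm_perturbation(grad, epsilon)
    return [e + d for e, d in zip(embedding, delta)]
```

In a training loop, the loss would be computed on both the original and the perturbed embeddings, and the gradients of both passes accumulated before the parameter update.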
As a preferred embodiment, when the cognitive threat propagation knowledge graph is built through named entity recognition and entity relation extraction on the cognitive threat topic text, two named entity extraction models can be connected in a pipeline. The first model identifies the entities in the cognitive threat topic text as a single-label multi-class classification task. The second model takes the output of the first model as its input and identifies the relationships between entities as a multi-label multi-class classification task.
Knowledge fusion is an important task in building a domain knowledge graph: it aligns, associates and merges multiple related entities into a coherent whole. The work consists of two parts, entity unification and entity disambiguation. Because cognitive threat topic texts are political and aggressive in nature, the entities produced by named entity recognition are not consistent with one another, so both entity unification and entity disambiguation are required. Entity unification merges different entity mentions that have the same meaning.
Named entity ambiguity means that one entity mention can correspond to several real-world entities. Because Chinese semantics are rich and complex, the same word can carry different meanings in different contexts, so entity disambiguation is required. The embodiment can use a link-based entity disambiguation method, which links each entity mention to the corresponding entity in the knowledge base. After entity unification, the final set of valid entities is obtained.
Graph databases are good at handling large volumes of complex, interconnected, loosely structured data that changes rapidly and must be queried frequently. In a relational database such queries lead to many table joins and therefore to performance problems. To avoid the query-time performance degradation seen in traditional RDBMSs, the embodiment can use Neo4j, a persistence engine with full transaction support. It is highly scalable: a single machine can handle graphs with billions of nodes, relationships and properties, and the database can be extended to run in parallel across multiple machines. Because the data is modeled around the graph, Neo4j traverses nodes and edges at a constant speed that does not depend on the total amount of data in the graph.
As shown in Figure 4, when the knowledge graph is visualized with Neo4j, the set of graph elements to be displayed can be defined in the Neo4j program. The schema for cognitive threat propagation is expressed mainly through types and properties. Users, events, addresses, platforms, hot events, sentiment labels and threat intents are defined as entities. The following relationships can be defined: text-hot event, user-text, forwarding, user-organization, text-sentiment label, text-threat intent, and so on. As triples these can be written as <text, text-hot event, hot event>, <user, user-cognitive threat topic text, text>, <user, forwarding, user>, <user, user-organization, user>, <text, text-threat intent, threat intent>, and so on.
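To make the triple schema concrete, the sketch below turns schema triples into parameterized Cypher `MERGE` statements that could be sent to Neo4j. The node labels, relationship type names and the `name` property are hypothetical English renderings chosen for this example; the source defines the schema only in prose.

```python
def triple_to_cypher(head_label, rel_type, tail_label):
    """Build a parameterized Cypher MERGE statement for one (head, relation, tail) pattern."""
    return (
        f"MERGE (h:{head_label} {{name: $head}}) "
        f"MERGE (t:{tail_label} {{name: $tail}}) "
        f"MERGE (h)-[:{rel_type}]->(t)"
    )


# Hypothetical labels and relationship types mirroring some of the triples in the text.
SCHEMA = [
    ("Text", "TEXT_HOT_EVENT", "HotEvent"),
    ("User", "USER_TEXT", "Text"),
    ("User", "FORWARDS", "User"),
    ("Text", "TEXT_THREAT_INTENT", "ThreatIntent"),
]

STATEMENTS = [triple_to_cypher(*triple) for triple in SCHEMA]
```

Using `MERGE` rather than `CREATE` keeps repeated extractions of the same user or text from producing duplicate nodes, and the `$head` and `$tail` parameters are supplied at execution time rather than interpolated into the query string.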
Building the cognitive threat propagation knowledge graph makes it possible to trace cognitive threat propagation back to users, events and organizations. Mining implicit relationships and monitoring key accounts, groups, organizations and users in real time provides a basis for precisely locating cognitive threats and blocking them in a targeted way.
Further, based on the above method, an embodiment of the present invention also provides a social media cognitive threat detection system comprising a data collection server, multiple cognitive threat identification servers, a knowledge graph server and a web server, wherein:
the data collection server collects text data on sensitive topics from network platforms and preprocesses the data;
the multiple cognitive threat identification servers obtain cognitive threat topic texts from the preprocessed sensitive-topic text data through multi-level cognitive threat detection, and specifically comprise: a primary identification server, which divides the sensitive-topic text data into cognitive threat topic texts and initially suspected cognitive threat topic texts; an intermediate identification server, which classifies the initially suspected texts into cognitive threat topic texts, suspected cognitive threat topic texts and non-cognitive-threat topic texts; and a final identification server, which obtains cognitive threat topic texts from the suspected texts through manual annotation;
the knowledge graph server builds the cognitive threat propagation knowledge graph through named entity recognition and entity relation extraction on the cognitive threat topic texts;
the web server uses the cognitive threat propagation knowledge graph and a web interface to trace the propagation of cognitive threat topic texts back to users, events and organizations.
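The three-level identification described in the server list above can be sketched as a simple cascade. This is an illustrative outline only: the classifier callables, their string return values and the function name are assumptions, and the source does not specify the models behind each level.

```python
def multi_level_detect(text, primary, intermediate, manual_review):
    """Return True if `text` is judged a cognitive threat topic text by the cascade."""
    # Primary level returns "threat" or "initial_suspect".
    if primary(text) == "threat":
        return True
    # Intermediate level returns "threat", "suspect" or "non_threat".
    verdict = intermediate(text)
    if verdict == "threat":
        return True
    if verdict == "non_threat":
        return False
    # Final level: manual annotation decides the remaining suspected texts.
    return bool(manual_review(text))
```

Only texts that the first two levels cannot settle reach the manual step, which keeps the amount of human annotation small.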
The front end can be written in JavaScript and use ECharts for data visualization. The multiple cognitive threat identification servers can be arranged as a decentralized, distributed network in which they share responsibility for detecting and measuring cognitive threat information: a text is judged to be a cognitive threat topic text only when multiple servers agree that it is a cognitive threat. Once a text has been judged a cognitive threat topic text, the identification server uploads the text and its cognitive threat attribute measurements to the distributed network. The distributed structure improves both the accuracy of cognitive threat detection and the resilience of the system, since the failure of one server does not affect the operation of the system as a whole.
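The agreement rule among identification servers can be sketched as a quorum vote. The source says only that multiple servers must agree; the simple-majority default below and the function name are assumptions made for illustration.

```python
def consensus_is_threat(votes, quorum=None):
    """Return True if at least `quorum` servers voted that the text is a threat."""
    if not votes:
        return False
    if quorum is None:
        quorum = len(votes) // 2 + 1  # simple majority, an assumed default
    return sum(1 for v in votes if v) >= quorum
```

For example, with three servers and the default quorum, two positive votes are enough, so the loss of a single server does not by itself block a decision.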
The knowledge graph server corresponds to the cognitive threat knowledge extraction module and the cognitive threat propagation knowledge graph construction module. It accesses the distributed network automatically through smart contracts, retrieves the text information of texts already detected as cognitive threat topic texts, performs named entity recognition and relation extraction on cognitive threat entities such as users, times, addresses, organizations, forwarding platforms and hot events related to the texts, and builds the cognitive threat propagation knowledge graph in Neo4j.
The human-computer interface can use the web as the bridge between users and data, and use the ECharts visualization tool to turn abstract data and relationships into intuitive charts.
In this embodiment the system offers good security and feasibility, and its modular design keeps operational complexity low and makes it easy to maintain. Validation on experimental data shows a threat-determination accuracy of 93% for topic texts of up to 100 characters. The scheme also has broad application scenarios, including news media supervision, online public opinion supervision and combating illegal activity.
Unless specifically stated otherwise, the relative steps, numerical expressions and numerical values of the components and steps set forth in these embodiments do not limit the scope of the present invention.
The embodiments in this specification are described progressively: each embodiment focuses on its differences from the others, and the parts that are the same or similar can be cross-referenced. Because the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, it is described only briefly; for related details, refer to the description of the method.
The units and method steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To make the interchangeability of hardware and software clear, the composition and steps of each example have been described above in general terms of their functions. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. A person of ordinary skill in the art may implement the described functions in different ways for each specific application, and such implementations should not be regarded as going beyond the scope of the present invention.
A person of ordinary skill in the art will understand that all or some of the steps of the above method can be carried out by a program that instructs the relevant hardware, and that the program can be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk or an optical disc. Optionally, all or some of the steps of the above embodiments can also be implemented with one or more integrated circuits. Accordingly, each module or unit in the above embodiments can be implemented in hardware or as a software function module. The present invention is not limited to any specific combination of hardware and software.
Finally, it should be noted that the embodiments described above are only specific implementations of the present invention, intended to illustrate its technical solutions rather than to limit them, and the scope of protection of the present invention is not limited to them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that anyone familiar with the technical field may, within the technical scope disclosed by the present invention, still modify the technical solutions described in the foregoing embodiments, readily conceive of variations, or make equivalent substitutions for some of the technical features. Such modifications, variations or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of protection of the claims.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211732859.2A CN116244446B (en) | 2022-12-30 | 2022-12-30 | Social media cognitive threat detection method and system |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN116244446A (en) | 2023-06-09 |
| CN116244446B (en) | 2025-06-20 |
Family
ID=86628873
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202211732859.2A Active CN116244446B (en) | 2022-12-30 | 2022-12-30 | Social media cognitive threat detection method and system |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN116244446B (en) |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117874755A (en) * | 2024-03-13 | 2024-04-12 | 中国电子科技集团公司第三十研究所 | System and method for identifying hidden network threat users |
| CN117910567A (en) * | 2024-03-20 | 2024-04-19 | 道普信息技术有限公司 | Vulnerability knowledge graph construction method based on safety dictionary and deep learning network |
| CN121302289A (en) * | 2025-12-11 | 2026-01-09 | 成都天奥集团有限公司 | Public opinion tracing method based on multi-source heterogeneous data |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107633044A (en) * | 2017-09-14 | 2018-01-26 | 国家计算机网络与信息安全管理中心 | A kind of public sentiment knowledge mapping construction method based on focus incident |
| US20180159876A1 (en) * | 2016-12-05 | 2018-06-07 | International Business Machines Corporation | Consolidating structured and unstructured security and threat intelligence with knowledge graphs |
| CN110717049A (en) * | 2019-08-29 | 2020-01-21 | 四川大学 | Text data-oriented threat information knowledge graph construction method |
| CN113919351A (en) * | 2021-09-29 | 2022-01-11 | 中国科学院软件研究所 | Network security named entity and relationship joint extraction method and device based on transfer learning |
-
2022
- 2022-12-30 CN CN202211732859.2A patent/CN116244446B/en active Active
Non-Patent Citations (1)
| Title |
|---|
| 谢博;申国伟;郭春;周燕;于淼;: "基于残差空洞卷积神经网络的网络安全实体识别方法", 网络与信息安全学报, no. 05, 13 October 2020 (2020-10-13) * |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117874755A (en) * | 2024-03-13 | 2024-04-12 | 中国电子科技集团公司第三十研究所 | System and method for identifying hidden network threat users |
| CN117874755B (en) * | 2024-03-13 | 2024-05-10 | 中国电子科技集团公司第三十研究所 | System and method for identifying hidden network threat users |
| CN117910567A (en) * | 2024-03-20 | 2024-04-19 | 道普信息技术有限公司 | Vulnerability knowledge graph construction method based on safety dictionary and deep learning network |
| CN121302289A (en) * | 2025-12-11 | 2026-01-09 | 成都天奥集团有限公司 | Public opinion tracing method based on multi-source heterogeneous data |
Also Published As
| Publication number | Publication date |
|---|---|
| CN116244446B (en) | 2025-06-20 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | CB02 | Change of applicant information | Country or region after: China. Address after: 450000 Science Avenue 62, Zhengzhou High-tech Zone, Henan Province. Applicant after: Information Engineering University of the Chinese People's Liberation Army Cyberspace Force. Address before: No. 62 Science Avenue, High tech Zone, Zhengzhou City, Henan Province. Applicant before: Information Engineering University of Strategic Support Force,PLA. Country or region before: China |
| | GR01 | Patent grant | |