
CN116049343A - Method and system for document-level threat intelligence relationship extraction based on feature enhancement - Google Patents

Method and system for document-level threat intelligence relationship extraction based on feature enhancement

Info

Publication number
CN116049343A
CN116049343A
Authority
CN
China
Prior art keywords
entity
model
mention
embedding vector
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211416432.1A
Other languages
Chinese (zh)
Other versions
CN116049343B (en)
Inventor
李勇飞
郭渊博
方晨
常雅静
刘盈泽
邱俊博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
PLA Information Engineering University
Original Assignee
PLA Information Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by PLA Information Engineering University filed Critical PLA Information Engineering University
Priority to CN202211416432.1A priority Critical patent/CN116049343B/en
Publication of CN116049343A publication Critical patent/CN116049343A/en
Application granted granted Critical
Publication of CN116049343B publication Critical patent/CN116049343B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/313Selection or weighting of terms for indexing
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/16Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/20Network architectures or network communication protocols for network security for managing network security; network security policies in general
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computer Security & Cryptography (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computer Hardware Design (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of cyberspace security, and in particular relates to a document-level threat intelligence relation extraction method and system based on feature enhancement. The method constructs, trains, and optimizes an entity information extraction model comprising a BERT model, a fusion processing unit, and an entity relation extraction model. The text of a threat intelligence document to be processed is fed into the trained and optimized model: the BERT model produces context embedding vectors for the entity mentions in the text, the fusion processing unit fuses part-of-speech embedding vectors and entity-width information with those mention embeddings, and the entity relation extraction model outputs the relation of each target entity pair. By combining part-of-speech sequences, mention width, and other additional document information to enhance entity representations, the invention improves the accuracy of entity relation extraction and provides technical support for threat modeling, risk analysis, and reasoning in cyberspace security.

Description

Document-level threat intelligence relation extraction method and system based on feature enhancement

Technical Field

The present invention belongs to the field of cyberspace security technology, and in particular relates to a document-level threat intelligence relation extraction method and system based on feature enhancement.

Background Art

With the rapid development of network and information technology, new types of threat attacks continue to grow. Increasingly complex attack strategies and ever-changing attack scenarios make it difficult for traditional network defenses such as firewalls and signature databases to resist these new attacks. To better understand the threat landscape and coordinate responses to unknown threats, security experts have proposed using Cyber Threat Intelligence (CTI) for network defense. In 2013, Gartner first defined threat intelligence as knowledge about existing or emerging threats to assets, including scenarios, mechanisms, indicators, implications, and actionable advice; this knowledge provides defenders with strategies for responding to threats.

Threat intelligence knowledge comes from security analysis reports, blogs, social networks, vulnerability databases, threat intelligence repositories, and the like, and can provide strong data support for situational awareness and active defense. However, most threat intelligence exists as natural-language text containing large amounts of unstructured data, making it difficult to visualize the internal connections between attack elements. To help researchers quickly understand the semantic associations among attack elements, corresponding algorithms must be designed to mine entities and their relations from large-scale threat intelligence documents and construct a threat intelligence knowledge graph.

Relation extraction aims to identify the relations between entities in a given text. Although relation extraction in general domains has achieved good results, the following problems remain in the network security domain: 1) publicly available threat intelligence datasets are lacking; 2) threat intelligence contains a large number of specialized terms such as vulnerability names, malware, and APT organizations, causing a serious OOV (Out of Vocabulary) problem; 3) threat intelligence documents have complex structures, relatively long sentences, low entity and relation frequencies, and severely imbalanced label distributions. In addition, existing work mainly focuses on sentence-level text mining; in real scenarios, however, the same entity may have multiple mentions, and relations between entities usually must be inferred across multiple sentences. For these reasons, a relation extraction scheme capable of analyzing and processing threat intelligence text with complex document structure is urgently needed.

Summary of the Invention

To this end, the present invention provides a document-level threat intelligence relation extraction method and system based on feature enhancement, which combines part-of-speech sequences, mention widths, and other additional document information to enhance entity representation features, thereby improving the accuracy of entity relation extraction and providing technical support for threat modeling, risk analysis, and attack reasoning in cyberspace security.

According to the design scheme provided by the present invention, a document-level threat intelligence relation extraction method based on feature enhancement is provided, comprising the following:

constructing an entity information extraction model and performing training optimization, wherein the entity information extraction model comprises: a BERT model that encodes the input text to obtain context embedding vectors for the entity mentions in the text; a fusion processing unit that enhances those mention context embedding vectors by fusing in part-of-speech embedding vectors and entity-width information, and obtains a global representation of each entity; and an entity relation extraction model that fuses the local context embedding of a given entity pair with the entities' global representations, entity-type information, and inter-entity distance information, and applies a nonlinear activation function to obtain the relation probability of the given entity pair;

inputting the text of the threat intelligence document to be processed into the trained and optimized entity information extraction model; obtaining the context embedding vectors of entity mentions in the text through the BERT model; fusing the part-of-speech embedding vectors and entity-width information with the mention context embedding vectors through the fusion processing unit; and obtaining the entity relations of the target entity pairs in the text through the entity relation extraction model.

As the document-level threat intelligence relation extraction method based on feature enhancement of the present invention, further, the entity information extraction model is trained and optimized based on knowledge distillation. The training optimization process comprises: collecting labeled sample data and training a teacher model on it, updating the teacher model's parameters to obtain the teacher model; then performing knowledge distillation from the teacher model using the labeled sample data to obtain a student model, which serves as the trained and optimized entity information extraction model.

As the document-level threat intelligence relation extraction method based on feature enhancement of the present invention, further, the objective loss function in training optimization is expressed as L_RE = α1·L_AFL + α2·L_KD, where L_RE is the total loss, L_AFL is the adaptive focal loss that treats relation extraction as a multi-label classification problem, L_KD is the knowledge distillation loss, and α1 and α2 are the weights of the adaptive loss and the knowledge distillation loss, respectively.

As the document-level threat intelligence relation extraction method based on feature enhancement of the present invention, further, obtaining the context embedding vectors of entity mentions in the text through the BERT model comprises: first, tokenizing the input document text with a tokenizer to obtain the set of word entities, and marking entity mentions with a preset mention symbol; then, encoding the word entities in the text with the pre-trained BERT model as the encoder, generating the entity mention context embedding vectors.

As the document-level threat intelligence relation extraction method based on feature enhancement of the present invention, further, in fusing the part-of-speech embedding vectors and entity-width information with the mention context embedding vectors through the fusion processing unit: first, part-of-speech tags are obtained with a natural language processing tool and used to generate part-of-speech embedding vectors for the word entities in the text, which are fused with the context embedding vectors to generate part-of-speech-enhanced vector representations; next, the vector representations are further enhanced by fusing in the width information of the entity mentions; finally, for each entity, a pooling operation yields the global representation of that entity.

As the document-level threat intelligence relation extraction method based on feature enhancement of the present invention, further, for the mention context embedding vectors fused with the part-of-speech embedding vectors, the attention score of each mention element is obtained based on the multi-head attention mechanism and taken as the attention of the corresponding entity mention; the mention attentions are then averaged to obtain an entity-level attention matrix, which represents the attention scores from the corresponding entity to all entity mentions.

As the document-level threat intelligence relation extraction method based on feature enhancement of the present invention, further, obtaining the entity relations of the target entity pairs in the text through the entity relation extraction model comprises: first, locating the key context of a given entity pair via the entity-level attention matrix and obtaining the pair's local context embedding vector from that context; next, fusing the local context embedding vector with the entities' global representations, entity-type information, and inter-entity distance information to obtain the context embedding representation of the target entity pair; then, obtaining the relation probability of the given target entity pair with a nonlinear activation function.

As the document-level threat intelligence relation extraction method based on feature enhancement of the present invention, further, in obtaining the relation probability of a given entity pair with a nonlinear activation function, the context embedding representation of the target entity pair is first grouped and feature-fused to obtain the entity-pair representation, after which the nonlinear sigmoid activation function yields the relation probability of the given target entity pair.
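The grouping, feature fusion, and sigmoid step above can be sketched as follows. This is an illustrative sketch only: the function name, the group count, and the use of a grouped bilinear form for the fusion are our assumptions, since the exact formula is not given here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relation_probs(z_h, z_t, W, num_groups=4):
    """Hypothetical grouped-bilinear scoring followed by sigmoid.

    z_h, z_t : head/tail entity-pair representations, shape (d,)
    W        : relation weights, shape (num_rel, num_groups, d//g, d//g)
    """
    d = z_h.shape[0]
    k = d // num_groups
    zh = z_h.reshape(num_groups, k)   # split representation into groups
    zt = z_t.reshape(num_groups, k)
    # logit_r = sum over groups g of zh_g^T · W[r, g] · zt_g
    logits = np.einsum('gi,rgij,gj->r', zh, W, zt)
    return sigmoid(logits)           # independent probability per relation

rng = np.random.default_rng(0)
d, num_rel, g = 8, 3, 4
p = relation_probs(rng.normal(size=d), rng.normal(size=d),
                   rng.normal(size=(num_rel, g, d // g, d // g)))
# p: one probability in (0, 1) per candidate relation type
```

Because the output is a vector of independent sigmoid probabilities rather than a softmax distribution, an entity pair can hold several relations at once, matching the multi-label setting.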

Furthermore, the present invention also provides a document-level threat intelligence relation extraction system based on feature enhancement, comprising a model building module and a relation extraction module, wherein:

the model building module is used to build the entity information extraction model and perform training optimization, wherein the entity information extraction model comprises: a BERT model that encodes the input text to obtain context embedding vectors for entity mentions; a fusion processing unit that enhances the mention context embedding vectors by fusing in part-of-speech embedding vectors and entity-width information, and obtains entity global representations; and an entity relation extraction model that fuses the local context embedding of a given entity pair with the entity global representations, entity-type information, and inter-entity distance information, and applies a nonlinear activation function to obtain the relation probability of the given entity pair;

the relation extraction module is used to input the text of the threat intelligence document to be processed into the trained and optimized entity information extraction model, obtain the mention context embedding vectors through the BERT model, fuse the part-of-speech embedding vectors and entity-width information with the mention context embedding vectors through the fusion processing unit, and obtain the entity relations of the target entity pairs through the entity relation extraction model.

Beneficial Effects of the Present Invention:

To address the OOV problem in threat intelligence, the invention uses the pre-trained BERT model as the encoder to generate word embeddings, and fuses in additional features such as entity width, entity distance, and entity type, making full use of the textual information in the document and effectively improving the accuracy of relation extraction. Further, to address the low frequency of entity relations and the severe imbalance of data labels in threat intelligence, a teacher-student model is introduced: soft labels capture the useful statistical information of the dataset while retaining inter-class correlation information and discarding some invalid redundant information, thereby realizing knowledge distillation and improving the performance of the relation extraction model. This facilitates the processing of threat intelligence documents with complex structures and supports data analysis and knowledge reasoning for threat modeling, risk analysis, and attack reasoning in cyberspace security, giving the invention good application prospects.

Description of the Drawings:

FIG. 1 is a schematic diagram of the document-level threat intelligence relation extraction process based on feature enhancement in an embodiment;

FIG. 2 is a schematic diagram of the entity information extraction model architecture in an embodiment;

FIG. 3 is a schematic diagram of the teacher-student model in an embodiment;

FIG. 4 is a schematic diagram of the threat intelligence ontology in an embodiment;

FIG. 5 is a schematic diagram of part of the threat intelligence knowledge graph in an embodiment.

Detailed Description of Embodiments:

To make the purpose, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below in conjunction with the accompanying drawings and technical solutions.

In this embodiment, as shown in FIG. 1, a document-level threat intelligence relation extraction method based on feature enhancement is provided, comprising:

S101: constructing an entity information extraction model and performing training optimization, wherein the entity information extraction model comprises: a BERT model that encodes the input text to obtain context embedding vectors for entity mentions; a fusion processing unit that enhances the mention context embedding vectors by fusing in part-of-speech embedding vectors and entity-width information, and obtains entity global representations; and an entity relation extraction model that fuses the local context embedding of a given entity pair with the entity global representations, entity-type information, and inter-entity distance information, and applies a nonlinear activation function to obtain the relation probability of the given entity pair;

S102: inputting the text of the threat intelligence document to be processed into the trained and optimized entity information extraction model, obtaining the mention context embedding vectors through the BERT model, fusing the part-of-speech embedding vectors and entity-width information with the mention context embedding vectors through the fusion processing unit, and obtaining the entity relations of the target entity pairs through the entity relation extraction model.

In this embodiment, as shown in FIGS. 2 and 3, a threat intelligence ontology is constructed and features such as entity width, entity distance, and entity type are fused, making full use of the information in the document and improving the accuracy of relation extraction; the pre-trained BERT model is used to generate word embeddings, effectively mitigating the OOV problem.

As a preferred embodiment, further, the entity information extraction model is trained and optimized based on knowledge distillation. The training optimization process comprises: collecting labeled sample data and training a teacher model on it, updating the teacher model's parameters to obtain the teacher model; then performing knowledge distillation from the teacher model using the labeled sample data to obtain a student model, which serves as the trained and optimized entity information extraction model.

The same batch of labeled samples is fed into both models simultaneously. The teacher model's predicted outputs serve as soft labels and the ground-truth labels as hard labels; the student model's two losses are computed separately, and their weighted sum, as the final loss, is used to update the network parameters. At prediction time, only the student model is used.

Further, the objective loss function in training optimization is expressed as L_RE = α1·L_AFL + α2·L_KD, where L_RE is the total loss, L_AFL is the adaptive focal loss that treats relation extraction as a multi-label classification problem, L_KD is the knowledge distillation loss, and α1 and α2 are the weights of the adaptive loss and the knowledge distillation loss, respectively. The mean squared error loss measures the difference between the logits generated by the student model and the soft labels generated by the teacher model; its weighted combination with the adaptive focal loss serves as the model's overall loss function, further improving model performance.
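The combined objective L_RE = α1·L_AFL + α2·L_KD can be sketched numerically as follows. This is a minimal sketch: the function name is hypothetical, and plain binary cross-entropy stands in for the adaptive focal loss, whose exact form is not given here; the MSE term matches the description above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def distillation_loss(student_logits, teacher_logits, labels,
                      alpha1=1.0, alpha2=1.0):
    """Sketch of L_RE = a1*L_AFL + a2*L_KD (BCE stands in for L_AFL)."""
    # L_KD: mean squared error between student logits and teacher soft labels
    l_kd = np.mean((student_logits - teacher_logits) ** 2)
    # hard-label term: binary cross-entropy against ground-truth labels
    p = sigmoid(student_logits)
    eps = 1e-12
    l_afl = -np.mean(labels * np.log(p + eps)
                     + (1 - labels) * np.log(1 - p + eps))
    return alpha1 * l_afl + alpha2 * l_kd

s = np.array([2.0, -1.0])      # student logits for two relation labels
t = np.array([1.5, -0.5])      # teacher soft labels (logits)
y = np.array([1.0, 0.0])       # ground-truth hard labels
loss = distillation_loss(s, t, y)
```

When the student matches the teacher exactly, the L_KD term vanishes and only the hard-label term remains, which is why only the student model is needed at prediction time.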

As a preferred embodiment, further, obtaining the context embedding vectors of entity mentions in the text through the BERT model comprises: first, tokenizing the input document text with a tokenizer to obtain the set of word entities, and marking entity mentions with a preset mention symbol; then, encoding the word entities in the text with the pre-trained BERT model as the encoder, generating the entity mention context embedding vectors.

The pre-trained BERT model is used as the document encoder. For a document of length l, X = [x1, ..., xl], where xt denotes the word at position t in the document. Entity mentions are marked with a special symbol: "*" is added before and after each entity mention. BERT then encodes the document to generate the context embedding H; the encoding process can be expressed as:

H = Bert([x1, ..., xl]) = [h1, ..., hl]

where ht ∈ R^(d1), and d1 is the hidden layer dimension of the pre-trained model.
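The mention-marking step can be sketched as follows. This is a minimal illustration: the helper name is hypothetical, spans are assumed non-overlapping, and a real pipeline would pass the marked tokens to BERT's own tokenizer and encoder.

```python
def mark_mentions(tokens, spans):
    """Insert '*' before and after each entity mention.

    spans: (start, end) token offsets, end exclusive, non-overlapping.
    """
    out, pos = [], 0
    for s, e in sorted(spans):
        out.extend(tokens[pos:s])
        out.append('*')              # opening marker
        out.extend(tokens[s:e])
        out.append('*')              # closing marker
        pos = e
    out.extend(tokens[pos:])
    return out

tokens = "APT28 used Zebrocy against government targets".split()
marked = mark_mentions(tokens, [(0, 1), (2, 3)])
# marked == ['*', 'APT28', '*', 'used', '*', 'Zebrocy', '*',
#            'against', 'government', 'targets']
```

The embedding of the opening "*" later serves as the embedding of the whole mention, so the marker positions must be tracked through tokenization.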

As a preferred embodiment, further, in fusing the part-of-speech embedding vectors and entity-width information with the mention context embedding vectors through the fusion processing unit: first, part-of-speech tags are obtained with a natural language processing tool and used to generate part-of-speech embedding vectors for the word entities in the text, which are fused with the context embedding vectors to generate part-of-speech-enhanced vector representations; next, the vector representations are further enhanced by fusing in the width information of the entity mentions; finally, for each entity, a pooling operation yields the global representation of that entity.

The Nltk library in Python is used to produce part-of-speech tags and generate the part-of-speech embedding matrix P; the process can be expressed as:

P = Pos([x1, ..., xl]) = [p1, ..., pl]

where pt ∈ R^(d2), and d2 is the dimension of the part-of-speech embeddings.

P is fused with the context embedding H to generate part-of-speech-enhanced token representations:

C = [h1|p1, ..., hl|pl] = [c1, ..., cl]

where ct ∈ R^(d1+d2), and [·|·] denotes the concatenation operation.
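The concatenation ct = [ht|pt] can be sketched as follows. The toy lookup table stands in for a learned part-of-speech embedding matrix; the tag names follow NLTK's Penn Treebank tagset, and the helper name is hypothetical.

```python
import numpy as np

# Toy POS embedding table (d2 = 3); a real system would learn these.
POS_EMB = {'NNP': np.array([1., 0., 0.]),
           'VBD': np.array([0., 1., 0.]),
           'NN':  np.array([0., 0., 1.])}

def fuse_pos(H, tags):
    """c_t = [h_t | p_t]: concatenate each token's context embedding
    with its part-of-speech embedding."""
    P = np.stack([POS_EMB[t] for t in tags])
    return np.concatenate([H, P], axis=1)

H = np.zeros((3, 4))                    # l = 3 tokens, d1 = 4 (dummy BERT output)
C = fuse_pos(H, ['NNP', 'VBD', 'NN'])
# C.shape == (3, 7), i.e. (l, d1 + d2)
```

In practice the tags would come from `nltk.pos_tag` over the document's tokens, aligned to BERT's subword positions.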

A pre-generated width embedding matrix W and distance embedding matrix D are used to fuse in the width information of entity mentions and the distance information between entities, where d3 and d4 are the dimensions of the width embedding and the distance embedding, respectively.
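The width and distance embedding matrices can be sketched as plain lookup tables. The dimensions d3 = d4 = 8 and the caps on mention width and inter-entity distance are assumptions for the example; the text does not specify them.

```python
import numpy as np

d3, d4 = 8, 8            # assumed width / distance embedding dimensions
MAX_W, MAX_D = 16, 512   # assumed caps on mention width and entity distance

rng = np.random.default_rng(0)
W = rng.normal(size=(MAX_W, d3))  # width embedding matrix: row i <-> width i
D = rng.normal(size=(MAX_D, d4))  # distance embedding matrix: row i <-> distance i

def width_emb(width):
    """Look up the embedding for a mention spanning `width` tokens (clipped)."""
    return W[min(width, MAX_W - 1)]

def dist_emb(dist):
    """Look up the embedding for an inter-entity distance (clipped, unsigned)."""
    return D[min(abs(dist), MAX_D - 1)]
```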

The embedding vector of the "*" marker at the start of an entity mention is taken as the embedding of that mention, denoted h_m. It is fused with the width embedding, and the fusion can be expressed as a concatenation:

m = [h_m | w_m]

where w_m is the row of the width embedding matrix W indexed by the token width of the mention.

For an entity e_i containing N_i mentions {m_j | j = 1, ..., N_i}, a logsumexp pooling operation is used to obtain the global representation of the entity:

h_{e_i} = log Σ_{j=1}^{N_i} exp(m_j)
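The logsumexp pooling over an entity's mentions can be sketched as follows; the max-shifted, numerically stable form is an implementation choice, not taken from the text.

```python
import numpy as np

def entity_global_rep(mention_embs):
    """logsumexp pooling over an entity's mention embeddings (one per row)."""
    M = np.stack(mention_embs)            # (num_mentions, dim)
    m = M.max(axis=0)                     # per-dimension stabilizer
    return m + np.log(np.exp(M - m).sum(axis=0))

# An entity with two 4-dimensional mention embeddings (all zeros for clarity):
reps = entity_global_rep([np.zeros(4), np.zeros(4)])
# Each coordinate is log(exp(0) + exp(0)) = log 2.
```

Unlike max or mean pooling, logsumexp gives a smooth maximum that still lets every mention contribute to the entity's global representation.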

As a preferred embodiment, for the word/entity-mention context embedding vectors fused with the part-of-speech embedding vectors, a multi-head attention mechanism is used to obtain the attention score of each mention element; the attention score of each mention element is taken as the attention of the corresponding entity mention, and an entity-level attention matrix is obtained by averaging the entity-mention attentions. This entity-level attention matrix represents the attention scores of the corresponding entity over all tokens.

A pre-trained multi-head attention matrix A ∈ ℝ^{H×l×l} is used, where A_{ijk} denotes the attention score from token j to token k in the i-th attention head, i.e., the mention-level attention. The mention-level attentions are averaged to obtain the entity-level attention matrix A^E, where A^E_i represents the attention scores of the i-th entity over all tokens.
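One hedged reading of this averaging step is sketched below with toy mention positions; the exact mention-to-entity bookkeeping (which token index represents each mention) is not spelled out in the text and is assumed here.

```python
import numpy as np

n_heads, l = 12, 6
A = np.random.rand(n_heads, l, l)   # token-to-token attention, one slice per head

# Average over heads to get token-level attention, then take the rows at each
# entity's mention positions and average them -> entity-level attention A^E.
tok_att = A.mean(axis=0)                    # (l, l)
mention_positions = {0: [1, 4], 1: [2]}     # entity id -> mention marker indices (toy)
AE = np.stack([tok_att[pos].mean(axis=0) for pos in mention_positions.values()])
# AE[i] holds entity i's attention scores over all l tokens.
```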

As a preferred embodiment, obtaining the entity relations of target entity pairs in the text data through the entity relation extraction model comprises: first, locating the key context of a given entity pair through the entity-level attention matrix and obtaining the local context embedding vector of the pair from that context; next, fusing the local context embedding vector with the global entity representations, the entity type information, and the inter-entity distance information to obtain the context embedding representation of the target entity pair; then, using a nonlinear activation function to obtain the relation probability of the given target entity pair.

An entity type embedding matrix T is generated to fuse in the type information of the entities; the type embedding of an entity is the row t ∈ ℝ^{d5} of T corresponding to its type, where d5 is the dimension of the type embedding.

For a given entity pair (e_s, e_o), the attention matrix is used to locate its important context and to compute a local context embedding specific to the pair. The process can be expressed as:

q^(s,o) = A^E_s ∘ A^E_o

a^(s,o) = q^(s,o) / (1^T q^(s,o))

c^(s,o) = H a^(s,o)

where ∘ denotes the element-wise product.
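The localized context pooling for a given pair translates directly into array operations; the sketch below uses toy dimensions and random values in place of real encoder outputs.

```python
import numpy as np

l, d1 = 6, 16
H = np.random.rand(l, d1)     # token embeddings from the encoder
AE = np.random.rand(2, l)     # entity-level attention rows for s (0) and o (1)

q = AE[0] * AE[1]             # q^(s,o): element-wise product of the two rows
a = q / q.sum()               # a^(s,o) = q / (1^T q): normalize to sum to 1
c = H.T @ a                   # c^(s,o): attention-weighted mix of token embeddings
```

The element-wise product keeps only tokens that both entities attend to strongly, which is what lets the pooled vector act as pair-specific context.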

The local context embedding is fused with the global entity representations, the type embeddings, and the distance embeddings to obtain representations of the two entities in the pair:

z_s = tanh(W_s [h_{e_s} | c^(s,o) | t_s | D_{d_so}])

z_o = tanh(W_o [h_{e_o} | c^(s,o) | t_o | D_{d_so}])

where d_so denotes the relative distance between the first mentions of entity s and entity o.
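A sketch of one plausible reading of this fusion for the subject entity; the concatenation order, the tanh nonlinearity, the random projection matrix, and all dimensions are assumptions for illustration.

```python
import numpy as np

d1, d5, d4 = 16, 4, 4
rng = np.random.default_rng(0)
h_es = rng.normal(size=d1)    # global representation of entity s
c_so = rng.normal(size=d1)    # local context embedding c^(s,o)
t_es = rng.normal(size=d5)    # type embedding of entity s
d_so = rng.normal(size=d4)    # distance embedding for the pair

in_dim, out_dim = d1 + d1 + d5 + d4, 32
Ws = rng.normal(size=(out_dim, in_dim)) * 0.1  # learned projection (toy values)

# z_s = tanh(W_s [h_es | c_so | t_es | d_so]) -- one hedged reading of the fusion
z_s = np.tanh(Ws @ np.concatenate([h_es, c_so, t_es, d_so]))
```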

To reduce the number of parameters and the computational complexity, the entity representations are split evenly into k groups and fused group-wise to obtain the entity-pair representation:

z_s = [z_s^1; ...; z_s^k],  z_o = [z_o^1; ...; z_o^k]

logit_r = Σ_{i=1}^{k} z_s^{i⊤} W_r^i z_o^i + b_r

The nonlinear activation function sigmoid is then used to obtain the relation probability of the given entity pair:

P(r | e_s, e_o) = sigmoid(logit_r)
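The grouped scoring followed by the sigmoid can be sketched as follows; the group count k = 4, the per-group bilinear form, and all dimensions are assumptions consistent with the description of splitting the representation into k groups.

```python
import numpy as np

k, out_dim, R = 4, 32, 5         # groups, rep dim (divisible by k), relation classes
g = out_dim // k
rng = np.random.default_rng(0)
z_s = rng.normal(size=out_dim)   # subject-entity representation
z_o = rng.normal(size=out_dim)   # object-entity representation
Wr = rng.normal(size=(R, k, g, g)) * 0.1   # one small bilinear map per relation/group
br = np.zeros(R)

zs, zo = z_s.reshape(k, g), z_o.reshape(k, g)
# logit_r = sum_i z_s^i W_r^i z_o^i + b_r, then sigmoid per relation.
logits = np.array([sum(zs[i] @ Wr[r, i] @ zo[i] for i in range(k))
                   for r in range(R)]) + br
probs = 1.0 / (1.0 + np.exp(-logits))
```

Grouping replaces one large d×d bilinear matrix per relation with k small (d/k)×(d/k) blocks, cutting the parameter count by roughly a factor of k.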

Relation extraction can be viewed as a multi-label classification problem. Traditional baselines usually adopt binary cross-entropy as the loss function and fix a single global threshold as the criterion for whether a relation label holds. However, the model may have different confidence levels for different entity pairs, so a single global threshold is insufficient. To address this, a learnable adaptive threshold is introduced, which effectively reduces decision errors during inference: for each entity pair (e_s, e_o), classes whose scores exceed the threshold class are predicted as positive, and the rest as negative. On this basis, to handle long-tailed classes, the adaptive Focal Loss from prior work is adopted as the classification loss L_AFL.
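The adaptive-threshold decision rule described above can be sketched as follows; the threshold-class index and the scores are toy values chosen for the example.

```python
import numpy as np

# Scores for one entity pair over relation classes plus a learned threshold
# class TH (index 0). Labels scoring above the TH score are predicted positive;
# everything else is negative -- the adaptive-threshold decision rule.
scores = np.array([0.3, 0.9, 0.1, 0.5])   # [TH, r1, r2, r3] -- toy values
th = scores[0]
predicted = [r for r, s in enumerate(scores[1:], start=1) if s > th]
```

Because TH is scored per entity pair, the effective decision boundary adapts to the model's confidence on that pair instead of using one global cutoff.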

Further, based on the above method, an embodiment of the present invention also provides a feature-enhancement-based document-level threat intelligence relation extraction system, comprising a model building module and a relation extraction module, wherein:

the model building module is used to build an entity information extraction model and to train and optimize it, the model comprising: a BERT model for encoding the input text data to obtain context embedding vectors of the word/entity mentions in the text; a fusion processing unit for enhancing the mention context embedding vectors by fusing in part-of-speech embedding vectors and entity-width information and for obtaining global entity representations; and an entity relation extraction model for fusing the local context embedding of a given entity pair with the global entity representations, entity type information, and inter-entity distance information, and obtaining the relation probability of the pair through a nonlinear activation function;

the relation extraction module is used to feed the text data of a threat intelligence document to be processed into the trained entity information extraction model, obtain the mention context embedding vectors of the text data through the BERT model, fuse the part-of-speech embedding vectors and entity-width information with the mention context embedding vectors through the fusion processing unit, and obtain the entity relations of target entity pairs in the text data through the entity relation extraction model.

To verify the effectiveness of the scheme, an explanation is given below in conjunction with specific data:

To address the lack of public datasets in the cybersecurity domain, 227 threat intelligence reports were collected and manually annotated based on a custom ontology; 151 of them were selected as the training set and the remaining 76 as the test set.

Using the scheme of this case, the pre-trained model BERT is adopted as the document encoder; the Nltk library in Python is then used to produce part-of-speech tags, from which part-of-speech embeddings are generated and fused with the context embeddings produced by the encoder. Width embeddings and distance embeddings are generated separately to fuse in the width information of entity mentions and the distance information between entities. A logsumexp pooling operation is used to obtain the global representation of each entity. Entity type embeddings are generated to fuse in the type information of the entities. The attention matrix is used to locate the important context and to compute the local context embedding of a specific entity pair. The local context embedding is then fused with the global entity representation, type embedding, and distance embedding to obtain the embedding representation of the entity pair. Finally, a classifier is used to obtain the relation probability of the entity pair. The threat intelligence ontology is shown in Figure 4. Threat intelligence is fed into the entity information extraction model, the relations between all entity pairs in the text are predicted and filled into a knowledge graph, and the result is presented with the Neo4j graph database, as shown in Figure 5. Figure 5 further verifies that the scheme is applicable to the analysis and processing of threat intelligence with complex structure and can provide strong data support for situation awareness and active defense.

Unless otherwise specifically stated, the relative arrangement of the components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present invention.

The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar the embodiments may refer to one another. Since the system disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief, and the relevant details can be found in the description of the method.

The units and method steps of the examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. A person of ordinary skill in the art may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present invention.

A person of ordinary skill in the art will understand that all or part of the steps in the above method can be completed by a program instructing the relevant hardware, and the program can be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disc. Optionally, all or part of the steps of the above embodiments can also be implemented using one or more integrated circuits; accordingly, each module/unit in the above embodiments can be implemented in the form of hardware or in the form of a software functional module. The present invention is not limited to any specific combination of hardware and software.

Finally, it should be noted that the embodiments described above are merely specific implementations of the present invention, intended to illustrate rather than limit its technical solutions, and the scope of protection of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that anyone familiar with the technical field can still, within the technical scope disclosed by the present invention, modify the technical solutions recorded in the foregoing embodiments, readily conceive of changes, or make equivalent substitutions of some of the technical features therein; such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and they shall all fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of protection of the claims.

Claims (10)

1. A feature-enhancement-based document-level threat intelligence relation extraction method, characterized by comprising:

constructing an entity information extraction model and performing training and optimization, wherein the entity information extraction model comprises: a BERT model for encoding input text data to obtain context embedding vectors of the word/entity mentions in the text data; a fusion processing unit for enhancing the mention context embedding vectors by fusing part-of-speech embedding vectors and entity-width information and for obtaining global entity representations; and an entity relation extraction model for fusing the local context embedding of a given entity pair with the global entity representations, entity type information, and inter-entity distance information, and obtaining the relation probability of the given entity pair through a nonlinear activation function;

inputting the text data of a threat intelligence document to be processed into the trained and optimized entity information extraction model, obtaining the mention context embedding vectors of the text data through the BERT model, fusing the part-of-speech embedding vectors and entity-width information with the mention context embedding vectors through the fusion processing unit, and obtaining the entity relations of target entity pairs in the text data through the entity relation extraction model.

2. The method according to claim 1, characterized in that, in the training and optimization of the entity information extraction model, model training is optimized based on knowledge distillation, the process comprising: collecting labeled sample data, training a teacher model with the labeled sample data, and obtaining the teacher model by updating its parameters; then performing knowledge distillation on the teacher model with the labeled sample data to obtain a student model, and using the student model as the trained and optimized entity information extraction model.

3. The method according to claim 2, characterized in that the objective loss function in the training and optimization is expressed as L_RE = α1·L_AFL + α2·L_KD, where L_RE denotes the total loss, L_AFL denotes the adaptive loss that treats relation extraction as a multi-label classification problem, L_KD denotes the knowledge distillation loss, and α1 and α2 are the weights of the adaptive loss and the knowledge distillation loss, respectively.

4. The method according to claim 1, characterized in that obtaining the mention context embedding vectors of the text data through the BERT model comprises: first, tokenizing the input document text data with a tokenizer to obtain the set of word entities, and marking entity mentions with preset mention symbols; then, encoding the word entities in the text data with the pre-trained BERT model as the encoder to generate the mention context embedding vectors.

5. The method according to claim 1 or 4, characterized in that, in fusing the part-of-speech embedding vectors and entity-width information with the mention context embedding vectors through the fusion processing unit: first, a natural-language-processing tool is used to obtain part-of-speech tags, which are used to generate part-of-speech embedding vectors for the word entities in the text data, and these are fused with the context embedding vectors to generate part-of-speech-enhanced vector representations; next, the vector representations are enhanced by fusing in the entity-mention width information; then, for each entity, a pooling operation is applied to obtain the global representation of that entity.

6. The method according to claim 1, characterized in that, for the mention context embedding vectors fused with the part-of-speech embedding vectors, a multi-head attention mechanism is used to obtain the attention score of each mention element; the attention score of each mention element is taken as the attention of the corresponding entity mention, and an entity-level attention matrix is obtained by averaging the entity-mention attentions; this entity-level attention matrix represents the attention scores of the corresponding entity over all entity mentions.

7. The method according to claim 6, characterized in that obtaining the entity relations of target entity pairs in the text data through the entity relation extraction model comprises: first, locating the key context of a given entity pair through the entity-level attention matrix, and obtaining the local context embedding vector of the given entity pair from that context; next, fusing the local context embedding vector with the global entity representations, entity type information, and inter-entity distance information to obtain the context embedding representation of the target entity pair; then, using a nonlinear activation function to obtain the relation probability of the given target entity pair.

8. The method according to claim 7, characterized in that, in using the nonlinear activation function to obtain the relation probability of a given entity pair, the context embedding representations of the target entity pair are first grouped and feature-fused to obtain the entity-pair representation, and the nonlinear activation function sigmoid is then used to obtain the relation probability of the given target entity pair.

9. A feature-enhancement-based document-level threat intelligence relation extraction system, characterized by comprising a model building module and a relation extraction module, wherein:

the model building module is used to build an entity information extraction model and to train and optimize it, the model comprising: a BERT model for encoding input text data to obtain context embedding vectors of the word/entity mentions in the text data; a fusion processing unit for enhancing the mention context embedding vectors by fusing part-of-speech embedding vectors and entity-width information and for obtaining global entity representations; and an entity relation extraction model for fusing the local context embedding of a given entity pair with the global entity representations, entity type information, and inter-entity distance information, and obtaining the relation probability of the given entity pair through a nonlinear activation function;

the relation extraction module is used to input the text data of a threat intelligence document to be processed into the trained and optimized entity information extraction model, obtain the mention context embedding vectors of the text data through the BERT model, fuse the part-of-speech embedding vectors and entity-width information with the mention context embedding vectors through the fusion processing unit, and obtain the entity relations of target entity pairs in the text data through the entity relation extraction model.

10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the method steps of any one of claims 1 to 8 are implemented.
CN202211416432.1A 2022-11-12 2022-11-12 Document-level threat intelligence relationship extraction method and system based on feature enhancement Active CN116049343B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211416432.1A CN116049343B (en) 2022-11-12 2022-11-12 Document-level threat intelligence relationship extraction method and system based on feature enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211416432.1A CN116049343B (en) 2022-11-12 2022-11-12 Document-level threat intelligence relationship extraction method and system based on feature enhancement

Publications (2)

Publication Number Publication Date
CN116049343A true CN116049343A (en) 2023-05-02
CN116049343B CN116049343B (en) 2025-06-06

Family

ID=86120889

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211416432.1A Active CN116049343B (en) 2022-11-12 2022-11-12 Document-level threat intelligence relationship extraction method and system based on feature enhancement

Country Status (1)

Country Link
CN (1) CN116049343B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117195075A (en) * 2023-09-14 2023-12-08 北京工商大学 Document-level relation extraction based on span negative samples and enhanced context representation
CN118428470A (en) * 2024-05-10 2024-08-02 北京邮电大学 Relation extraction method and related equipment
CN118733697A (en) * 2024-06-27 2024-10-01 南方电网人工智能科技有限公司 Relationship extraction model construction method, device, computer equipment, storage medium and computer program product
CN119003785A (en) * 2024-07-26 2024-11-22 广州大学 Attention context mapping and relationship matching-based network threat intelligence relationship extraction method
CN119441790A (en) * 2025-01-10 2025-02-14 国网电商科技有限公司 Threat intelligence entity relationship identification method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021184311A1 (en) * 2020-03-19 2021-09-23 中山大学 Method and apparatus for automatically generating inference questions and answers
CN114861645A (en) * 2022-04-28 2022-08-05 浙江大学 Document level relation extraction method based on long-tail data distribution
CN115080735A (en) * 2022-05-20 2022-09-20 郑州大学产业技术研究院有限公司 Relation extraction model optimization method and device and electronic equipment
CN116049419A (en) * 2022-11-12 2023-05-02 中国人民解放军战略支援部队信息工程大学 Threat information extraction method and system integrating multiple models


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG Han et al., "Domain named entity recognition combining GAN and BiLSTM-Attention-CRF", Journal of Computer Research and Development, vol. 56, no. 9, 30 September 2019 (2019-09-30) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117195075A (en) * 2023-09-14 2023-12-08 北京工商大学 Document-level relation extraction based on span negative samples and enhanced context representation
CN118428470A (en) * 2024-05-10 2024-08-02 北京邮电大学 Relation extraction method and related equipment
CN118733697A (en) * 2024-06-27 2024-10-01 南方电网人工智能科技有限公司 Relationship extraction model construction method, device, computer equipment, storage medium and computer program product
CN119003785A (en) * 2024-07-26 2024-11-22 广州大学 Attention context mapping and relationship matching-based network threat intelligence relationship extraction method
CN119003785B (en) * 2024-07-26 2025-09-23 广州大学 A network threat intelligence relation extraction method based on attention context mapping and relation matching
CN119441790A (en) * 2025-01-10 2025-02-14 国网电商科技有限公司 Threat intelligence entity relationship identification method and system

Also Published As

Publication number Publication date
CN116049343B (en) 2025-06-06

Similar Documents

Publication Publication Date Title
CN113505244B (en) Knowledge graph construction method, system, equipment and medium based on deep learning
Yin et al. Apply transfer learning to cybersecurity: Predicting exploitability of vulnerabilities by description
Lu et al. Machine learning for synthetic data generation: a review
Yang et al. Aspect-based sentiment analysis with alternating coattention networks
CN116049343A (en) Method and system for document-level threat intelligence relationship extraction based on feature enhancement
JP7059368B2 (en) Protecting the cognitive system from gradient-based attacks through the use of deceptive gradients
CN112905868B (en) Event extraction method, device, equipment and storage medium
Ahmed et al. CyberEntRel: Joint extraction of cyber entities and relations using deep learning
US12488068B2 (en) Performance-adaptive sampling strategy towards fast and accurate graph neural networks
WO2023023379A1 (en) Semantic map generation from natural-language text documents
WO2022222037A1 (en) Interpretable recommendation method based on graph neural network inference
CN113779225B (en) Training method of entity link model, entity link method and device
WO2024193382A1 (en) Knowledge injection and training methods and systems for knowledge-enhanced pre-trained language model
JP7800230B2 (en) Method and device for presenting hint information, and computer program
CN116049419A (en) Threat information extraction method and system integrating multiple models
CN115712732B (en) A method, system, equipment, and medium for constructing a knowledge graph ontology for power equipment.
US12165014B2 (en) Dynamic ontology classification system
CN113792144A (en) A Text Classification Method Based on Semi-Supervised Graph Convolutional Neural Networks
JP2023517518A (en) Vector embedding model for relational tables with null or equivalent values
Dong et al. Relational distance and document-level contrastive pre-training based relation extraction model
CN114996479B (en) A method and system for tracking conversation status based on enhanced technology
Chen et al. Quality assessment of cyber threat intelligence knowledge graph based on adaptive joining of embedding model
Zhang et al. Aspect-dependent heterogeneous graph convolutional network for aspect-level sentiment analysis
CN118316662B (en) A method for constructing a knowledge representation learning model based on the knowledge graph of smart device vulnerabilities
CN119293266A (en) Enterprise knowledge graph construction method, system, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Country or region after: China

Address after: 450000 Science Avenue 62, Zhengzhou High-tech Zone, Henan Province

Applicant after: Information Engineering University of the Chinese People's Liberation Army Cyberspace Force

Address before: No. 62 Science Avenue, High tech Zone, Zhengzhou City, Henan Province

Applicant before: Information Engineering University of Strategic Support Force,PLA

Country or region before: China

GR01 Patent grant
GR01 Patent grant