
CN116028812A - A Construction Method of Pipeline Multi-event Extraction Model - Google Patents

A Construction Method of Pipeline Multi-event Extraction Model

Publication number
CN116028812A
Authority
CN
China
Prior art keywords
event
type
model
trigger
arg
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211733205.1A
Other languages
Chinese (zh)
Other versions
CN116028812B (en)
Inventor
迟雨桐
冯少辉
张建业
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Iplus Teck Co ltd
Original Assignee
Beijing Iplus Teck Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Iplus Teck Co ltd
Priority to CN202211733205.1A
Publication of CN116028812A
Application granted
Publication of CN116028812B
Legal status: Active
Anticipated expiration

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to a method for constructing a pipeline multi-event extraction model and belongs to the technical field of natural language processing. It addresses the low accuracy of existing event extraction models, which are prone to missed detections and mismatched event elements when a corpus contains many events or overlapping events. The method constructs event feature data sets from an original data set, builds a training set containing positive and negative samples of event types and event elements, and trains a T5 model on that training set, so that the model learns the internal relations among event types, event roles, event elements and trigger words and, in particular, improves its ability to understand and predict multiple events.

Description

A method for constructing a pipeline multi-event extraction model

Technical Field

The present invention relates to the technical field of natural language processing, and in particular to a method for constructing a pipeline multi-event extraction model.

Background Art

Event extraction (EE) is one of the important tasks in natural language processing (NLP). Its purpose is to identify the event type, event trigger word (trigger), event element (argument) and element role (argument role) contained in a given corpus. Event extraction is now applied in a wide range of scenarios: it can efficiently extract useful information from massive amounts of text and provides strong data support for building knowledge graphs.

The extraction methods of current mainstream event extraction models include sequence labeling, pointer-based discrimination and generative methods. Sequence labeling is essentially a multi-label, multi-class classification method that predicts the possible labels of each token. Pointer-based discrimination extracts events by predicting the start and end positions of the text span corresponding to each label. The generative method is an end-to-end approach that extracts contextual information through a deeper network and directly outputs event information in text form. These three methods perform well on single-event corpora and on non-overlapping multi-event corpora that contain few events. When a corpus contains many events, especially when one or more elements overlap, they are prone to missed detections, recognition errors and mismatched event elements, which results in very low accuracy. Because overlapping multi-events are common in real corpora, a better method for constructing a multi-event extraction model is needed to solve the low extraction accuracy that prior-art event extraction models show on multi-event and overlapping multi-event extraction tasks due to missed detections and mismatched event elements.

Summary of the Invention

In view of the above analysis, embodiments of the present invention aim to provide a method for constructing a pipeline multi-event extraction model, to solve the low extraction accuracy that prior-art event extraction models show on multi-event and overlapping multi-event extraction tasks due to missed detections and mismatched event elements.

In one aspect, an embodiment of the present invention provides a method for constructing a pipeline multi-event extraction model, comprising the following steps:

obtaining annotated text data as the original data set;

obtaining event feature data sets based on the original data set, further constructing an event type positive sample data set D+1, an event element positive sample data set D+2, an event type all-negative sample data set D-1 and an event element random negative sample data set D-2, and finally obtaining the model training data set D_all;

training a T5 model with the training data set D_all to obtain the trained pipeline multi-event extraction model M_trained;

during multi-event extraction, constructing the prediction sample set of each step progressively, where the trained model M_trained produces the prediction result of each step from that step's prediction sample set and the results are integrated into the final extraction result.

Further, obtaining the annotated text data comprises:

obtaining original text data;

annotating the original text data, where the annotation comprises: determining the event types contained in the sentences of the text data; extracting the trigger words, the event elements and their positions according to the event type; and attaching the appropriate event role label to each event element.

Further, the event feature data sets comprise:

the correspondence schema between event types and all of their event roles, the correspondence set S_type_role between event types and single event roles, the set S_type of all event types, the set S_trigger of all trigger words and the set S_argument of all event elements. schema records all event types in the original data set and all event roles corresponding to each of them. S_type_role is derived from schema and contains every pairwise combination of an event type in schema with one of its event roles, together with the ordinal position of that event role in schema. S_type records all event types. S_trigger records all trigger words that appear in the original data set. S_argument records all event elements contained in the original data set.
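The relationship between schema, S_type and S_type_role can be illustrated with a minimal Python sketch. The "listing" event type and its roles are hypothetical, the tuple layout `(event_type, role, position)` is an assumption, and the 1-based position is one reading of "ordinal position"; the source does not fix a data layout.

```python
# Hypothetical schema: event type -> ordered list of its event roles.
schema = {
    "acquisition": ["acquisition time", "acquirer", "acquired party"],
    "listing": ["listing time", "listed company"],  # hypothetical second type
}


def build_type_role_set(schema):
    """Pair every event type with each of its roles and the role's position.

    Positions are 1-based here, which is an assumption about the source.
    """
    return {
        (etype, role, pos)
        for etype, roles in schema.items()
        for pos, role in enumerate(roles, start=1)
    }


S_type = set(schema)                       # all event types
S_type_role = build_type_role_set(schema)  # (type, role, position) triples
```

Keeping the roles of each type in a list preserves their order, which the later prompt construction relies on.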

Further, the model training data set D_all is constructed by the following steps:

summarizing the annotation information of the original data set to obtain three event feature data sets: the correspondence schema between event types and all of their event roles, the correspondence set S_type_role between event types and single event roles, and the set S_type of all event types;

using the original data set and schema to construct the event type positive sample data set D+1 and the event element positive sample data set D+2, as well as two further event feature data sets: the set S_trigger of all trigger words and the set S_argument of all event elements that appear in the original data set;

using the event type positive sample data set D+1 and the event type set S_type to construct the event type all-negative sample data set D-1;

using the event element positive sample data set D+2, the trigger word set S_trigger, the event element set S_argument and the correspondence set S_type_role to construct the event element random negative sample data set D-2;

mixing and shuffling D+1, D+2, D-1 and D-2 to finally obtain the model training data set D_all.
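The final mixing step can be sketched in a few lines of Python. The function name, the sample representation (dictionaries with `input` and `output` keys) and the fixed seed are assumptions made for illustration only.

```python
import random


def build_training_set(d_pos_type, d_pos_arg, d_neg_type, d_neg_arg, seed=0):
    """Concatenate the four sample sets and shuffle them into D_all."""
    d_all = [*d_pos_type, *d_pos_arg, *d_neg_type, *d_neg_arg]
    random.Random(seed).shuffle(d_all)  # seeded so the result is reproducible
    return d_all


# Tiny placeholder samples; real samples would carry text and prompts.
d_all = build_training_set(
    [{"input": "a", "output": "x"}],
    [{"input": "b", "output": "y"}],
    [{"input": "c", "output": ""}],
    [{"input": "d", "output": ""}],
)
```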

Further, the event type positive sample data set D+1 and the event element positive sample data set D+2 are constructed by the following steps:

A1. For an event contained in the text data text_p of the original data set, extract its event type e_type, trigger word w_trigger, event roles e_role_1 to e_role_n and the corresponding event elements w_arg_1 to w_arg_n (n is the number of event roles in the event, which equals the number of event elements). The event type positive sample of the event has the input text_p + e_type + "trigger word" (the literal string "触发词" in the Chinese original) and the output w_trigger. The event element positive sample of the event has the input text_p + prompt_arg and the output w_arg_1 to w_arg_n, where the event element prompt prompt_arg is obtained by the following formula:

[Formula image BDA0004032242400000041: definition of prompt_arg; not recoverable from the extracted text]

A2. Apply the method of A1 to every event in the text data text_p to construct event type positive samples and event element positive samples, obtaining the event type positive sample data set D+1 and the event element positive sample data set D+2.
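A rough Python sketch of positive sample construction follows. The formula for prompt_arg is an image that did not survive extraction, so the sketch assumes it mirrors the random-prompt format stated later in the text (event type, trigger word, previously extracted elements, then the role to extract). The "-" separator between all parts, the dictionary layout and the function name are likewise assumptions.

```python
def build_positive_samples(text, etype, trigger, roles, args, sep="-"):
    """Return (event type positive sample, list of event element positive samples).

    Assumed prompt format for role i:
    e_type + w_trigger + w_arg_1 .. w_arg_(i-1) + e_role_i.
    """
    type_sample = {
        "input": sep.join([text, etype, "trigger word"]),
        "output": trigger,
    }
    arg_samples = []
    for i, (role, arg) in enumerate(zip(roles, args)):
        prompt = sep.join([etype, trigger, *args[:i], role])
        arg_samples.append({"input": sep.join([text, prompt]), "output": arg})
    return type_sample, arg_samples


# Hypothetical annotated event.
type_sample, arg_samples = build_positive_samples(
    text="A acquired B in 2020",
    etype="acquisition",
    trigger="acquired",
    roles=["acquisition time", "acquirer", "acquired party"],
    args=["2020", "A", "B"],
)
```

Each role gets its own sample, so an event with n roles contributes one event type sample and n event element samples.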

Further, the event type all-negative sample data set D-1 is constructed by the following steps:

B1. Replace the e_type of an event type positive sample, in turn, with each of the other event types in the event type set S_type, with an empty target output in every case, to obtain the event type all-negative samples of that event;

B2. Apply the method of B1 to every event in the event type positive sample data set D+1 to construct the event type all-negative sample data set D-1.
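Steps B1 and B2 can be sketched as follows. The input format (parts joined by "-", ending in the literal "trigger word") and the function name are assumptions carried over for illustration.

```python
def build_type_negatives(text, true_etype, all_types, sep="-"):
    """One negative sample per event type other than the true one (step B1)."""
    return [
        {"input": sep.join([text, other, "trigger word"]), "output": ""}
        for other in sorted(all_types)  # sorted only to make the order stable
        if other != true_etype
    ]


# Hypothetical type inventory with m = 3 event types.
S_type = {"acquisition", "listing", "resignation"}
negatives = build_type_negatives("A acquired B in 2020", "acquisition", S_type)
```

With m event types, each event yields m-1 negatives. If a text contains several events of different types, a type that is positive for another event in the same text would probably need to be excluded as well; the source does not address this case and the sketch does not handle it.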

Further, the event element random negative sample data set D-2 is constructed by the following steps:

(1) Find all event element positive samples of an event in D+2 and collect every event element prompt prompt_arg from them to form the set S_prompt;

(2) Randomly select a trigger word from S_trigger to obtain w_trigger_random; randomly select an element from S_type_role to obtain an event type e_type_random, an event role e_role_random and the position p of that event role;

(3) Randomly select p event elements from the event element set S_argument to obtain w_arg_r_1 to w_arg_r_p, and combine them in the following format to obtain the random event element prompt prompt_arg_random:

prompt_arg_random = e_type_random + w_trigger_random + w_arg_r_1 + … + w_arg_r_p + e_role_random

(4) Check whether prompt_arg_random already exists in S_prompt. If it does, repeat steps (2), (3) and (4). If it does not, use prompt_arg_random to construct a negative sample and add prompt_arg_random to S_prompt.

(5) Repeat steps (1) to (4) until 5n random negative samples of event elements have been obtained.

(6) Apply the method of (1) to (5) to every event sample in D+2 to construct the event element random negative sample data set D-2.
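Steps (1) to (5) amount to rejection sampling against the set of known prompts. The sketch below is an illustration under assumptions: parts are joined by "-", the p elements are drawn with replacement, exactly p elements are drawn as the formula states (the source does not say how p relates to the ordinal position), and the `max_tries` guard is an addition that is not in the source.

```python
import random


def build_random_negatives(text, positive_prompts, s_trigger, s_argument,
                           s_type_role, n_positive, ratio=5, sep="-",
                           seed=0, max_tries=10_000):
    """Sample ratio * n_positive prompts that do not collide with S_prompt."""
    rng = random.Random(seed)
    seen = set(positive_prompts)  # S_prompt, extended as negatives are added
    triggers = sorted(s_trigger)
    arguments = sorted(s_argument)
    type_roles = sorted(s_type_role)
    negatives, tries = [], 0
    while len(negatives) < ratio * n_positive and tries < max_tries:
        tries += 1
        trigger = rng.choice(triggers)                      # step (2)
        etype, role, pos = rng.choice(type_roles)           # step (2)
        args = [rng.choice(arguments) for _ in range(pos)]  # step (3)
        prompt = sep.join([etype, trigger, *args, role])
        if prompt in seen:                                  # step (4): resample
            continue
        seen.add(prompt)
        negatives.append({"input": sep.join([text, prompt]), "output": ""})
    return negatives


# Hypothetical inputs for one event with n = 3 positive samples.
text = "A acquired B in 2020"
positive_prompts = {
    "acquisition-acquired-acquisition time",
    "acquisition-acquired-2020-acquirer",
    "acquisition-acquired-2020-A-acquired party",
}
negatives = build_random_negatives(
    text, positive_prompts,
    s_trigger={"acquired", "listed"},
    s_argument={"2020", "A", "B"},
    s_type_role={("acquisition", "acquisition time", 1),
                 ("acquisition", "acquirer", 2),
                 ("acquisition", "acquired party", 3)},
    n_positive=3,
)
```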

Further, training the T5 model comprises:

dividing the model training data set D_all in a given ratio into a training set D_train, a validation set D_eval and a test set D_test;

fine-tuning the T5 model on the training set D_train for n rounds, validating on the validation set D_eval at the end of each round, taking the model from the round with the best validation result as the final model, and testing it on the test set D_test to obtain the trained model M_trained.
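The data split and the choice of the best round can be sketched without any training framework. The 80/10/10 ratio and both function names are assumptions; the source only says "a given ratio" and does not name a validation metric.

```python
def split_dataset(d_all, percents=(80, 10, 10)):
    """Split an already shuffled D_all into train, validation and test parts."""
    n = len(d_all)
    n_train = n * percents[0] // 100
    n_eval = n * percents[1] // 100
    return (d_all[:n_train],
            d_all[n_train:n_train + n_eval],
            d_all[n_train + n_eval:])


def select_best_round(eval_scores):
    """Index of the round with the highest validation score (earliest on ties)."""
    return max(range(len(eval_scores)), key=eval_scores.__getitem__)


d_train, d_eval, d_test = split_dataset(list(range(100)))
best = select_best_round([0.5, 0.9, 0.9, 0.7])  # hypothetical per-round scores
```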

During training, the model loss is computed with the following formula and used to update the parameters:

Loss = CrossEntropy(x_pred, x_gold)

where x_pred is the prediction result and x_gold is the target output.
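For a sequence-to-sequence model, the cross entropy is typically computed per output token from the predicted logits. The pure-Python sketch below assumes that x_pred is a list of per-position logit vectors, x_gold is a list of target token ids, and the loss is averaged over positions; the source does not specify these details.

```python
import math


def cross_entropy(logits, gold_ids):
    """Mean negative log-likelihood of the gold token at each position."""
    total = 0.0
    for row, gold in zip(logits, gold_ids):
        m = max(row)  # subtract the maximum for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[gold]
    return total / len(gold_ids)


# Two positions over a two-token vocabulary with uniform logits.
loss = cross_entropy([[0.0, 0.0], [0.0, 0.0]], [0, 1])
```

With uniform logits over two tokens the loss equals log 2, the entropy of a fair coin, which is a convenient sanity check.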

Further, constructing the prediction sample set of each step progressively comprises:

constructing the first-step prediction sample set D_step_1 based on the text to be extracted (text) and the event feature data sets;

constructing the prompt information prompt based on the text to be extracted, the event feature data sets and the prediction result of the model M_trained in the previous step, and constructing the prediction sample set of the next step with the structure text + prompt, so that the prediction sample sets D_step_2 to D_step_(n+1) of steps 2 to n+1 are constructed one step at a time.

Further, the trained model M_trained producing the prediction result of each step from that step's prediction sample set, and the results being integrated into the final extraction result, comprises:

inputting D_step_1 into the model M_trained to obtain the first-step prediction result, namely all trigger words p_trigger contained in the text;

constructing the prediction sample sets D_step_2 to D_step_(n+1) of steps 2 to n+1 in the format text + prompt_x, and inputting D_step_x into the model M_trained to obtain, for each trigger word, the (x-1)-th event element corresponding to the (x-1)-th event role of the trigger word's event type, where x ∈ [2, n+1] (in the source, the symbols for the trigger word, event type, event role and event element are the inline images BDA0004032242400000051, BDA0004032242400000055, BDA0004032242400000053 and BDA0004032242400000054); prompt_x is expressed as:

[Formula image BDA0004032242400000061: definition of prompt_x; not recoverable from the extracted text]

combining the prompt information of the last step with the extraction result to obtain the complete event.
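The stepwise prediction loop can be sketched with a stand-in for M_trained. Everything specific here is an assumption: the `model` callable returning a list of strings, the "-" separator, a prompt format identical to the one assumed for training, and the branching on multiple outputs. The formula for prompt_x is an image that did not survive extraction.

```python
def extract_events(text, schema, model, sep="-"):
    """Step 1 finds trigger words per event type; steps 2..n+1 fill the roles."""
    events = []
    for etype, roles in schema.items():
        for trigger in model(sep.join([text, etype, "trigger word"])):
            paths = [[]]  # each path holds the elements extracted so far
            for role in roles:
                next_paths = []
                for args in paths:
                    prompt = sep.join([etype, trigger, *args, role])
                    for arg in model(sep.join([text, prompt])):
                        next_paths.append(args + [arg])  # one branch per output
                paths = next_paths  # a path with no output is dropped here
            for args in paths:
                events.append({"type": etype, "trigger": trigger,
                               "arguments": dict(zip(roles, args))})
    return events


# Stub standing in for the trained model: a lookup table of canned answers.
answers = {
    "A acquired B in 2020-acquisition-trigger word": ["acquired"],
    "A acquired B in 2020-acquisition-acquired-acquisition time": ["2020"],
    "A acquired B in 2020-acquisition-acquired-2020-acquirer": ["A"],
    "A acquired B in 2020-acquisition-acquired-2020-A-acquired party": ["B"],
}
schema = {
    "acquisition": ["acquisition time", "acquirer", "acquired party"],
    "listing": ["listing time"],  # hypothetical type with no match in the text
}
events = extract_events("A acquired B in 2020", schema,
                        lambda s: answers.get(s, []))
```

Dropping a path when a role yields no output is a simplification; a real system would need a policy for events with missing roles, which the extracted text does not describe.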

Compared with the prior art, the present invention can achieve at least one of the following beneficial effects:

1. Event feature data sets are constructed from the original data set, and a training set containing positive and negative samples of event types and event elements is built from them. Training the T5 model on this training set lets the model learn the internal relations among event types, event roles, event elements and trigger words and, in particular, improves its ability to understand and predict multiple events. The whole training process uses prompt information, which helps ensure extraction accuracy and fidelity, yielding an event extraction model with a high recognition rate on event text.

2. When events are extracted from text with the trained model, prompts are used to extract events progressively, layer by layer. Every event type is first used as prompt information to extract the corresponding trigger words. The trigger word and the element roles to be extracted are then added to the prompt step by step to extract the event elements. Once all event elements of the event type have been extracted, the prompt information of the last step is combined with the extraction result to obtain the complete event. This pipeline extraction method provides a separate extraction path for each possible event, targets the problems of missed detections and mismatched event elements in multi-event and overlapping multi-event extraction, and greatly improves extraction accuracy.

In the present invention, the above technical solutions may also be combined with one another to obtain further preferred combinations. Other features and advantages of the present invention are set out in the description that follows; some of them will be apparent from the description or may be learned by practicing the invention. The objectives and other advantages of the present invention can be realized and obtained through what is particularly pointed out in the description and the drawings.

Brief Description of the Drawings

The drawings serve only to illustrate specific embodiments and are not to be regarded as limiting the present invention. Throughout the drawings, the same reference symbols denote the same components.

FIG. 1 is a schematic flow chart of the method for constructing a pipeline multi-event extraction model according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the overall implementation flow, including actual prediction, of the method for constructing a pipeline multi-event extraction model according to an embodiment of the present invention;

FIG. 3 is a schematic flow chart of constructing the training data according to an embodiment of the present invention;

FIG. 4 is a schematic flow chart of obtaining prediction results according to an embodiment of the present invention.

Detailed Description

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings. The drawings form part of this application and, together with the embodiments, serve to explain the principles of the invention; they are not intended to limit its scope.

A specific embodiment of the present invention discloses a method for constructing a pipeline multi-event extraction model, as shown in FIG. 1, comprising the following steps:

Step S110: obtaining annotated text data as the original data set;

Step S120: obtaining event feature data sets based on the original data set, further constructing the event type positive sample data set D+1, the event element positive sample data set D+2, the event type all-negative sample data set D-1 and the event element random negative sample data set D-2, and finally obtaining the model training data set D_all;

Step S130: training a T5 model with the training data set D_all to obtain the trained pipeline multi-event extraction model M_trained;

during multi-event extraction, the prediction sample set of each step is constructed progressively; the trained model M_trained produces the prediction result of each step from that step's prediction sample set, and the results are integrated into the final extraction result.

The embodiment of the present invention trains a T5 model on a training set containing positive and negative samples of event types and event elements to obtain a multi-event extraction model. Event feature data sets are constructed from the original data set, and the training set is built from them. Training the T5 model on this training set lets the model learn the internal relations among event types, event roles, event elements and trigger words and, in particular, improves its ability to understand and predict multiple events. The whole training process uses prompt information, which helps ensure extraction accuracy and fidelity, yielding an event extraction model with a high recognition rate on event text.

On the basis of the above embodiment, the annotated text data in step S110 is obtained by either of the following methods:

directly using Baidu's event extraction data set;

annotating the original text data oneself, where the annotation method is: determine the event types contained in the sentences of the text data; extract the trigger words, the event elements and their positions according to the event type; and attach the appropriate event role label to each event element.

Specifically, step S120 can be refined into the following steps:

Step S210: summarizing the annotation information of the original data set to obtain three event feature data sets: the correspondence schema between event types and all of their event roles, the correspondence set S_type_role between event types and single event roles, and the set S_type of all event types;

Specifically, all event types and event roles in the original data set are summarized to construct schema, S_type_role and S_type. schema records all event types in the original data set and all event roles corresponding to each of them. S_type_role is derived from schema and contains every pairwise combination of an event type in schema with one of its event roles, together with the ordinal position of that event role in schema. S_type records all event types. Preferably, schema is stored in a JSON file, and S_type_role and S_type are both stored as sets.

As an example, for an event whose event type is "acquisition" and whose event roles are "acquisition time, acquirer, acquired party", its records in schema, S_type_role and S_type are shown in Table 1.

Table 1. Example records of an event of type "acquisition" in schema, S_type_role and S_type

[Table image BDA0004032242400000091: contents not recoverable from the extracted text]
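Because Table 1 is only available as an image, the following Python sketch shows one plausible way to keep schema in a JSON file and the two derived collections as sets, using the "acquisition" example from the text. The file name `schema.json`, the tuple layout and the 1-based position are assumptions.

```python
import json
import os
import tempfile

schema = {"acquisition": ["acquisition time", "acquirer", "acquired party"]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "schema.json")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese labels readable in the file.
        json.dump(schema, f, ensure_ascii=False, indent=2)
    with open(path, encoding="utf-8") as f:
        restored = json.load(f)

# JSON arrays keep their order, so role positions survive the round trip.
S_type = set(restored)
S_type_role = {
    (etype, role, pos)
    for etype, roles in restored.items()
    for pos, role in enumerate(roles, start=1)
}
```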

Step S220: using the original data set and schema to construct the event type positive sample data set D+1 and the event element positive sample data set D+2, as well as two further event feature data sets: the set S_trigger of all trigger words and the set S_argument of all event elements that appear in the original data set;

Specifically, constructing D+1, D+2, S_trigger and S_argument comprises:

(1) For an event contained in the text data text_p of the original data set, extract its event type e_type, trigger word w_trigger, event roles e_role_1 to e_role_n and the corresponding event elements w_arg_1 to w_arg_n (n is the number of event roles in the event, which equals the number of event elements). The event type positive sample of the event has the input text_p + e_type + "trigger word" (the literal string "触发词" in the Chinese original) and the output w_trigger. The event element positive sample of the event has the input text_p + prompt_arg and the output w_arg_1 to w_arg_n, where the event element prompt prompt_arg is obtained by the following formula:

[Formula image BDA0004032242400000101: definition of prompt_arg; not recoverable from the extracted text]

(2) Apply the method of (1) to every event in the text data text_p to construct event type positive samples and event element positive samples, obtaining the event type positive sample data set D+1 and the event element positive sample data set D+2;

(3) Save the trigger words w_trigger of all events in the text data text_p in the trigger word set S_trigger, and save all event elements w_arg_1 to w_arg_n in the event element set S_argument, obtaining the trigger word set S_trigger and the event element set S_argument.
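Step (3) is a plain accumulation into two sets, sketched below. The annotation layout (dictionaries with `trigger` and `arguments` keys) and the sample events are hypothetical.

```python
def collect_vocab(events):
    """Gather every trigger word and every event element seen in the data."""
    s_trigger, s_argument = set(), set()
    for event in events:
        s_trigger.add(event["trigger"])
        s_argument.update(event["arguments"])
    return s_trigger, s_argument


# Hypothetical annotated events from two texts.
annotated = [
    {"trigger": "acquired", "arguments": ["2020", "A", "B"]},
    {"trigger": "listed", "arguments": ["2021", "A"]},
]
S_trigger, S_argument = collect_vocab(annotated)
```

Using sets removes duplicates, so an element such as "A" that appears in several events is stored once and is not over-represented when negatives are sampled later.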

As an example, for an event of type "acquisition", the constructed event type positive sample and event element positive samples, together with the entries saved in S_trigger and S_argument, are shown in Table 2.

Table 2. Example event type positive sample and event element positive samples for an event of type "acquisition", with the entries saved in S_trigger and S_argument

[Table images BDA0004032242400000102 and BDA0004032242400000111: contents not recoverable from the extracted text]

It should be noted that in this example the elements of the prompt information are separated by "-"; other symbols or spaces may also be used. When event element positive samples are constructed, the order in which event elements appear in prompt_arg must match the order recorded in schema.

For events in complex situations, the output may consist of several event elements, and an event element prompt must be constructed separately for each of them when building the input. As an example, Table 3 shows event element positive samples for multiple events that share a trigger word:

Table 3. Example event element positive samples for multiple events sharing a trigger word

[Table image BDA0004032242400000112: contents not recoverable from the extracted text]

Step S230: using the event type positive sample data set D+1 and the event type set S_type to construct the event type all-negative sample data set D-1;

Specifically, constructing the event type all-negative sample data set D-1 comprises:

(1) Replace the e_type of an event type positive sample, in turn, with each of the other event types in the event type set S_type, with an empty target output in every case, to obtain the event type all-negative samples of that event;

(2) Apply the method of (1) to every event in the event type positive sample data set D+1 to construct the event type all-negative sample data set D-1.

As an example, if the event type set S_type contains m event types, each event has m-1 event type all-negative samples.

For model training, a positive sample is a sample with a target output and a negative sample is a sample whose target output is empty. Adding negative samples during training can effectively improve the recognition accuracy of the model.

Step S240: construct the event element random negative sample dataset D-2 using the event element positive sample dataset D+2, the trigger word set S_trigger, the event element set S_argument, and the set S_type_role of correspondences between event types and single event roles;

The input format of an event element random negative sample is the same as that of an event element positive sample; the difference is that its prompt is different and its output is always empty. For a given event, the recommended number of event element random negative samples is generally 5 times the number of event element positive samples;

Specifically, the steps for constructing the event element random negative sample dataset D-2 are as follows:

(1) In D+2, find all event element positive samples of a given event, and collect every event element prompt prompt_arg from them to form the set S_prompt;

(2) Randomly select a trigger word from S_trigger to obtain w_trigger_random; randomly select an element from S_type_role to obtain an event type e_type_random, an event role e_role_random, and the position p of that event role;

(3) Randomly select p event elements from the event element set S_argument to obtain w_arg_r_1 ~ w_arg_r_p, and combine them in the following format to obtain the random event element prompt prompt_arg_random:

prompt_arg_random = e_type_random + w_trigger_random + w_arg_r_1 + … + w_arg_r_p + e_role_random

(4) Check whether prompt_arg_random already exists in S_prompt. If it does, repeat steps (2) to (4); if it does not, use prompt_arg_random to construct a negative sample and add prompt_arg_random to S_prompt;

(5) Repeat steps (1) to (4) until 5n event element random negative samples are obtained, where, consistent with the fivefold ratio recommended above, n denotes the number of event element positive samples of that event;

(6) Apply the methods of (1) to (5) to all event samples in D+2 to construct the event element random negative sample dataset D-2;
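The sampling loop of steps (1) to (5) can be sketched as below. This is a hedged illustration under several assumptions that the text does not fix: the position p is taken to be the number of event elements placed before the role in the prompt, the p elements are drawn with replacement, "-" is the separator, and `max_tries` is an added safety bound so the loop terminates when too few distinct prompts exist. All names are hypothetical.

```python
import random
from typing import List, Set, Tuple


def sample_random_negative_prompts(
    positive_prompts: Set[str],               # prompts of the event's positive samples
    triggers: List[str],                      # S_trigger
    arguments: List[str],                     # S_argument
    type_roles: List[Tuple[str, str, int]],   # S_type_role: (event type, role, position p)
    n_negatives: int,                         # e.g. 5 * number of positive samples
    rng: random.Random,
    sep: str = "-",
    max_tries: int = 10_000,
) -> List[str]:
    """Draw random prompts that collide with no positive or earlier negative prompt."""
    seen = set(positive_prompts)  # S_prompt, grown as negatives are accepted
    negatives: List[str] = []
    tries = 0
    while len(negatives) < n_negatives and tries < max_tries:
        tries += 1
        trigger = rng.choice(triggers)
        event_type, role, p = rng.choice(type_roles)
        args = rng.choices(arguments, k=p)  # assumption: sampling with replacement
        prompt = sep.join([event_type, trigger, *args, role])
        if prompt in seen:
            continue  # step (4): already present, so resample
        seen.add(prompt)
        negatives.append(prompt)
    return negatives
```

Each returned prompt would then be paired with the event's text as input and an empty string as target output.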

Step S250: mix and shuffle D+1, D+2, D-1, and D-2 to obtain the final model training dataset D_all;

Specifically, training the T5 model in step S130 above includes:

Divide the model training dataset D_all in a given ratio into a training set D_train, a validation set D_eval, and a test set D_test; preferably, the ratio is 8:1:1. Fine-tune the T5 model on the training set D_train for n rounds, validating on D_eval at the end of each round; take the model from the round with the best validation result as the final model and test it on the test set D_test, finally obtaining the trained model M_trained; preferably, the number of training rounds n is 20;
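The mixing in step S250 and the 8:1:1 split can be sketched as follows. The function name, the fixed seed, and the choice to give any rounding remainder to the test set are assumptions made for this illustration.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def shuffle_and_split(
    datasets: Sequence[Sequence[T]],
    ratios: Tuple[int, int, int] = (8, 1, 1),
    seed: int = 42,
) -> Tuple[List[T], List[T], List[T]]:
    """Merge the sample sets, shuffle them, and split into train/eval/test."""
    merged = [sample for dataset in datasets for sample in dataset]  # D_all
    random.Random(seed).shuffle(merged)
    total = sum(ratios)
    n_train = len(merged) * ratios[0] // total
    n_eval = len(merged) * ratios[1] // total
    train = merged[:n_train]
    eval_ = merged[n_train:n_train + n_eval]
    test = merged[n_train + n_eval:]  # remainder goes to the test set
    return train, eval_, test
```

Calling `shuffle_and_split([d_pos_type, d_pos_arg, d_neg_type, d_neg_arg])` with the four sample lists yields the three subsets; a fixed seed keeps the split reproducible between runs.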

Further, the following formula is used during training to compute the model loss and update the parameters:

Loss = CrossEntropy(x_pred, x_gold)

where x_pred is the prediction result and x_gold is the annotated label.
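To make the loss concrete, the sketch below computes a token-level cross-entropy in plain Python. The text only names CrossEntropy; representing x_pred as per-token logits, x_gold as target token ids, and averaging over target tokens are assumptions that follow the usual convention for sequence-to-sequence models.

```python
import math
from typing import List


def cross_entropy(logits: List[List[float]], gold_ids: List[int]) -> float:
    """Mean negative log-likelihood of the gold tokens.

    `logits[t]` holds the unnormalised scores over the vocabulary at output
    position t, and `gold_ids[t]` is the index of the annotated token.
    """
    total = 0.0
    for step_logits, gold in zip(logits, gold_ids):
        # Numerically stable log-sum-exp for the softmax normaliser.
        peak = max(step_logits)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in step_logits))
        total += log_z - step_logits[gold]
    return total / len(gold_ids)
```

For uniform scores over four candidate tokens the loss is log 4, and it approaches 0 as the score of the gold token dominates the others.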

Furthermore, using the trained model M_trained to extract events from actual event text includes the following steps:

Step S310: obtain the text to be extracted, text; the text to be extracted may be news text data crawled from a website;

Step S320: based on the text to be extracted, the event feature data set obtained from the original dataset, and the prediction result of the model M_trained at the previous step, construct the prediction sample sets D_step_1 ~ D_step_(n+1) for steps 1 to n+1 step by step using the text+prompt structure, and input D_step_1 ~ D_step_(n+1) into the model M_trained step by step to obtain its prediction results for steps 1 to n+1; n is the number of event roles of the event type corresponding to the first-step prediction result;

Specifically, constructing the prediction sample sets D_step_1 ~ D_step_(n+1) for steps 1 to n+1 and obtaining the prediction results of the model M_trained for steps 1 to n+1 includes the following steps:

(A) Traverse all event types e_type in S_type in turn. For each event type [Figure BDA0004032242400000141], add the sample [Figure BDA0004032242400000142] to the first-step prediction sample set D_step_1. After the traversal, D_step_1 contains m samples (m is the number of event types, k∈[1,m]);

(B) Input the first-step prediction sample set D_step_1 into M_trained. When a sample produces an output, that output is the trigger word [Figure BDA0004032242400000144] of event type [Figure BDA0004032242400000143] in the text to be extracted, recorded as [Figure BDA0004032242400000145]. Look up in the schema the first event role [Figure BDA0004032242400000147] corresponding to [Figure BDA0004032242400000146], and add the output result, in the format text+prompt_2, to the next-step prediction sample set D_step_2, where [Figure BDA0004032242400000148] [Figure BDA0004032242400000149];

A sample with no output indicates that the text contains no trigger word of the input event type [Figure BDA00040322424000001410], that is, the text contains no event of type [Figure BDA00040322424000001411].

(C) Input D_step_2 into M_trained to predict, for each trigger word [Figure BDA00040322424000001412], the event element of the corresponding first event role [Figure BDA00040322424000001413], recorded as [Figure BDA00040322424000001414]. Check the schema to determine whether the event type [Figure BDA00040322424000001415] has other event roles; if it does not, proceed to step S330;

If the event type [Figure BDA00040322424000001416] has other event roles [Figure BDA00040322424000001417] in the schema, then for those other event roles [Figure BDA00040322424000001419] of the event type [Figure BDA00040322424000001418], construct the next-step prediction sample sets D_step_3 ~ D_step_(n+1) step by step, and input D_step_3 ~ D_step_(n+1) into the model M_trained step by step to extract the event elements [Figure BDA00040322424000001420], until the event elements corresponding to all event roles of [Figure BDA00040322424000001421] have been extracted by M_trained; then proceed to step S330;

More specifically, the next-step prediction sample sets D_step_3 ~ D_step_(n+1) are constructed as follows:

Construct a sample in the format text+prompt_x and add it to the next-step prediction sample set D_step_(x), where [Figure BDA00040322424000001422] is obtained from prompt_(x-1) by replacing [Figure BDA0004032242400000151] with [Figure BDA0004032242400000152] and appending [Figure BDA0004032242400000153] at the end; x∈[3,n+1], and n is the number of event roles that the schema records for this event type;
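The replace-and-append rule can be sketched as below. The formula images are not recoverable here, so this follows one reading of the rule that is consistent with the prompt layout used for the training samples: the role queried at the previous step is replaced by the event element predicted for it, and the next role is appended. The prompt is kept as a list of elements (a representation assumed for this sketch) so that elements containing the separator cause no ambiguity.

```python
from typing import List


def next_prompt(prev_prompt: List[str], predicted_arg: str, next_role: str) -> List[str]:
    """Derive prompt_x from prompt_(x-1).

    `prev_prompt` ends with the role queried at the previous step; that role is
    replaced by the element predicted for it, then the next role is appended.
    """
    return [*prev_prompt[:-1], predicted_arg, next_role]


def render(text: str, prompt: List[str], sep: str = "-") -> str:
    """Join text and prompt elements into the text+prompt model input."""
    return sep.join([text, *prompt])
```

For a hypothetical event type with roles buyer and target, `next_prompt(["acquisition", "bought", "buyer"], "A", "target")` gives `["acquisition", "bought", "A", "target"]`.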

The event elements used in the prompt include [Figure BDA0004032242400000154], which are determined as follows:

Figure BDA0004032242400000155

where j∈[1,n-1] and n is the number of event roles that the schema records for this event type; if [Figure BDA0004032242400000156] contains multiple prediction results, a separate prediction sample must be constructed for each result in the format of this step.
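Steps (A) to (C) can be sketched end to end with a stand-in for the model. Here `predict` is a hypothetical callable that maps a text+prompt string to zero or more outputs; it stands in for M_trained and is not a real library API. The schema layout, the "-" separator, and the literal "trigger word" suffix are likewise assumptions. One further assumption is labelled in the code: when the model returns nothing for a role, an empty string is carried forward so that the remaining roles can still be queried.

```python
from typing import Callable, Dict, List

Schema = Dict[str, List[str]]         # event type -> ordered event roles
Predict = Callable[[str], List[str]]  # model input -> zero or more outputs


def extract_events(text: str, schema: Schema, predict: Predict, sep: str = "-") -> List[dict]:
    """Run the step-wise extraction, one path per candidate event."""
    events: List[dict] = []
    for event_type, roles in schema.items():
        # Step 1: ask for the trigger word(s) of this event type.
        for trigger in predict(sep.join([text, event_type, "trigger word"])):
            paths: List[List[str]] = [[]]  # arguments extracted so far, per path
            # Steps 2..n+1: query one role per step, in schema order.
            for role in roles:
                next_paths: List[List[str]] = []
                for args in paths:
                    prompt = sep.join([event_type, trigger, *args, role])
                    # Assumption: an empty prediction is kept as "".
                    predictions = predict(sep.join([text, prompt])) or [""]
                    # Several predictions fork into separate samples.
                    for arg in predictions:
                        next_paths.append(args + [arg])
                paths = next_paths
            for args in paths:
                events.append({
                    "event_type": event_type,
                    "trigger": trigger,
                    "arguments": dict(zip(roles, args)),
                })
    return events
```

Because every prediction forks its own path, two events that share a trigger word end up with separately matched arguments instead of one merged list.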

Step S330: integrate the prediction sample set D_step_(n+1) of step n+1 and the prediction result of the model M_trained at step n+1 to obtain the final recognition result;

Specifically, integrating to obtain the final recognition result includes:

From D_step_(n+1) and the predicted n-th event element, assemble the event extraction result as:

Event type: [Figure BDA0004032242400000157]

Trigger word: [Figure BDA0004032242400000158]

Event roles/event elements (role/argument):

Figure BDA0004032242400000159

By way of example, the event extraction results may be integrated in the format shown in Table 4.

Table 4. Example of integrated event extraction results

Figure BDA00040322424000001510
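The integration in step S330 can be sketched as follows: the last-step prompt already carries the event type, the trigger word, and the first n-1 event elements, so pairing it with the last prediction and the schema's role order yields the complete event. The list representation of the prompt and the output keys are assumptions made for this sketch.

```python
from typing import Dict, List


def integrate(last_prompt: List[str], last_prediction: str, schema: Dict[str, List[str]]) -> dict:
    """Combine the last-step prompt with its prediction into one event record.

    `last_prompt` is [event type, trigger, arg_1, ..., arg_(n-1), role_n].
    """
    event_type, trigger, *earlier_args, last_role = last_prompt
    roles = schema[event_type]
    if len(earlier_args) != len(roles) - 1 or roles[-1] != last_role:
        raise ValueError("prompt does not match the schema for this event type")
    arguments = dict(zip(roles, [*earlier_args, last_prediction]))
    return {"event_type": event_type, "trigger": trigger, "arguments": arguments}
```

The check against the schema guards against pairing a prompt with the wrong role list, which would silently mislabel the event elements.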

In summary, the beneficial effects of this embodiment are as follows:

Compared with the prior art, the present invention can achieve at least one of the following beneficial effects:

1. An event feature data set is constructed from the original dataset, and a training set containing positive and negative samples of event types and event elements is further constructed from it. Training the T5 model on this training set lets the model effectively learn the intrinsic connections among event types, event roles, event elements, and trigger words, and in particular improves its ability to understand and predict multiple events. The whole training process uses prompts, which to a certain extent safeguards extraction accuracy and faithfulness, yielding an event extraction model with a high recognition rate on event text.

2. Event text is extracted with the trained model by using prompts in a step-by-step, progressive manner: every event type is first used as a prompt to extract the corresponding trigger word; the trigger word and the element role to be extracted are then added to the prompt step by step to extract the event elements; once all event elements of the event type have been extracted, the prompt of the last step is combined with the extraction result to obtain the complete event. This pipeline extraction method gives every possible event its own extraction path, specifically addresses missed detections and unmatched event elements when extracting multiple and overlapping events, and greatly improves extraction accuracy.

Those skilled in the art will understand that all or part of the processes of the above embodiment methods can be implemented by a computer program instructing the relevant hardware, and that the program can be stored in a computer-readable storage medium. The computer-readable storage medium may be a magnetic disk, an optical disc, a read-only memory, a random access memory, or the like.

The above is only a preferred specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A method for constructing a pipeline multi-event extraction model, characterized by comprising the following steps:
acquiring annotated text data as an original dataset;
obtaining an event feature data set based on the original dataset, and further constructing an event type positive sample dataset D+1, an event element positive sample dataset D+2, an event type all-negative sample dataset D-1, and an event element random negative sample dataset D-2, to finally obtain a model training dataset D_all;
training a T5 model using the training dataset D_all to obtain a trained pipeline multi-event extraction model M_trained;
wherein, during multi-event extraction, a prediction sample set is constructed step by step for each step, and the trained model M_trained is used to obtain the prediction result of each step from that step's prediction sample set, the results being integrated to obtain the final extraction result.
2. The method of claim 1, wherein acquiring the annotated text data comprises:
acquiring original text data;
annotating the original text data, wherein the annotation includes: determining the event types contained in the sentences of the text data; extracting the trigger words, the event elements, and their positions according to the event type; and labeling the event elements with the appropriate event roles.
3. The method of claim 1, wherein the event feature data set comprises:
the correspondence schema between event types and all event roles, the set S_type_role of correspondences between event types and single event roles, the set S_type of all event types, the set S_trigger of all trigger words, and the set S_argument of all event elements; wherein the schema records all event types in the original dataset and all event roles corresponding to each of them; S_type_role is derived from the schema and contains, for each event in the schema, every combination of its event type with one of its event roles, together with the position of that event role in the schema; S_type records all event types; S_trigger records all trigger words appearing in the original dataset; and S_argument records all event elements contained in the original dataset.
4. The method according to claim 1, characterized in that the model training dataset D_all is constructed by the following steps:
summarizing and organizing the annotation information of the original dataset to obtain three event feature data sets: the correspondence schema between event types and all event roles, the set S_type_role of correspondences between event types and single event roles, and the set S_type of all event types;
constructing, using the original dataset and the schema, the event type positive sample dataset D+1 and the event element positive sample dataset D+2, as well as two event feature data sets: the set S_trigger of all trigger words and the set S_argument of all event elements appearing in the original dataset;
constructing the event type all-negative sample dataset D-1 using the event type positive sample dataset D+1 and the event type dataset S_type;
constructing the event element random negative sample dataset D-2 using the event element positive sample dataset D+2, the trigger word set S_trigger, the event element set S_argument, and the set S_type_role of correspondences between event types and single event roles;
mixing and shuffling D+1, D+2, D-1, and D-2 to finally obtain the model training dataset D_all.
5. The method of claim 4, wherein the event type positive sample dataset D+1 and the event element positive sample dataset D+2 are constructed by the following steps:
A1. extracting, for an event contained in the text data text_p of the original dataset, its event type e_type, its trigger word w_trigger, and the event elements w_arg_1 ~ w_arg_n corresponding to the event roles e_role_1 ~ e_role_n (n is the number of event roles the event contains, which also equals the number of event elements); constructing the event type positive sample of the event with input text_p + e_type + "trigger word" and output w_trigger; constructing the event element positive sample of the event with input text_p + prompt_arg and output w_arg_1 ~ w_arg_n; wherein the event element prompt prompt_arg is given by the following formula:
Figure FDA0004032242390000021
A2. applying the method of A1 to every event in the text data text_p to construct event type positive samples and event element positive samples, obtaining the event type positive sample dataset D+1 and the event element positive sample dataset D+2.
6. The method of claim 4, wherein the event type all-negative sample dataset D-1 is constructed by the following steps:
B1. replacing the e_type of an event type positive sample, in turn, with each of the other event types in the event type dataset S_type, with the target output empty in every case, to obtain the event type all-negative samples of that event;
B2. applying the method of B1 to all events in the event type positive sample dataset D+1 to construct the event type all-negative sample dataset D-1.
7. The method of claim 4, wherein the event element random negative sample dataset D-2 is constructed by the following steps:
(1) in D+2, finding all event element positive samples of a given event, and collecting every event element prompt prompt_arg from them to form the set S_prompt;
(2) randomly selecting a trigger word from S_trigger to obtain w_trigger_random; randomly selecting an element from S_type_role to obtain an event type e_type_random, an event role e_role_random, and the position p of that event role;
(3) randomly selecting p event elements from the event element set S_argument to obtain w_arg_r_1 ~ w_arg_r_p, and combining them in the following format to obtain the random event element prompt prompt_arg_random:
prompt_arg_random = e_type_random + w_trigger_random + w_arg_r_1 + … + w_arg_r_p + e_role_random
(4) checking whether prompt_arg_random exists in S_prompt; if it does, repeating steps (2) to (4); if it does not, using prompt_arg_random to construct a negative sample and adding prompt_arg_random to S_prompt;
(5) repeating steps (1) to (4) until 5n event element random negative samples are obtained;
(6) applying the methods of (1) to (5) to all event samples in D+2 to construct the event element random negative sample dataset D-2.
8. The method of claim 1, wherein training the T5 model comprises:
dividing the model training dataset D_all in a given ratio to obtain a training set D_train, a validation set D_eval, and a test set D_test;
fine-tuning the T5 model on the training set D_train for n rounds, validating on the validation set D_eval at the end of each round, taking the model from the round with the best validation result as the final model, and testing it on the test set D_test to finally obtain the trained model M_trained;
wherein the model loss is calculated and the parameters are updated during training using the following formula:
Loss = CrossEntropy(x_pred, x_gold)
where x_pred is the prediction result and x_gold is the target output.
9. The method of claim 1, wherein constructing the prediction sample set of each step step by step comprises:
constructing the first-step prediction sample set D_step_1 based on the text to be extracted and the event feature data set;
constructing the prompt based on the text to be extracted, the event feature data set, and the prediction result of the model M_trained at the previous step, and constructing the prediction sample set of the next step in the text+prompt structure, so as to construct the prediction sample sets D_step_2 ~ D_step_(n+1) of steps 2 to n+1 in sequence.
10. The method according to claim 9, characterized in that using the trained model M_trained to obtain the prediction result of each step from that step's prediction sample set and integrating the results to obtain the final extraction result comprises:
inputting D_step_1 into the model M_trained to obtain the first-step prediction result, namely all trigger words p_trigger contained in the text;
constructing the prediction sample sets D_step_2 ~ D_step_(n+1) of steps 2 to n+1 in the format text+prompt_x, and inputting D_step_x into the model M_trained to obtain, for each trigger word [Figure FDA0004032242390000041], the (x-1)-th event element [Figure FDA0004032242390000044] corresponding to the (x-1)-th event role [Figure FDA0004032242390000043] of the corresponding event type [Figure FDA0004032242390000042], where x∈[2,n+1]; wherein prompt_x is expressed as:
Figure FDA0004032242390000051
and combining the prompt of the last step with the extraction result to obtain the complete event.
CN202211733205.1A 2022-12-30 2022-12-30 A method for constructing a pipeline multi-event extraction model Active CN116028812B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211733205.1A CN116028812B (en) 2022-12-30 2022-12-30 A method for constructing a pipeline multi-event extraction model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211733205.1A CN116028812B (en) 2022-12-30 2022-12-30 A method for constructing a pipeline multi-event extraction model

Publications (2)

Publication Number Publication Date
CN116028812A true CN116028812A (en) 2023-04-28
CN116028812B CN116028812B (en) 2025-07-01

Family

ID=86070463

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211733205.1A Active CN116028812B (en) 2022-12-30 2022-12-30 A method for constructing a pipeline multi-event extraction model

Country Status (1)

Country Link
CN (1) CN116028812B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130013535A1 (en) * 2011-07-07 2013-01-10 Kunal Punera Method for Summarizing Event-Related Texts To Answer Search Queries
CN111597302A (en) * 2020-04-28 2020-08-28 北京中科智加科技有限公司 Text event acquisition method and device, electronic equipment and storage medium
CN112559747A (en) * 2020-12-15 2021-03-26 北京百度网讯科技有限公司 Event classification processing method and device, electronic equipment and storage medium
CN115099235A (en) * 2022-05-13 2022-09-23 清华大学 Text generation method based on entity description
CN115238045A (en) * 2022-09-21 2022-10-25 北京澜舟科技有限公司 Method, system and storage medium for extracting generation type event argument

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
马春明: "《事件抽取综述》", 《计算机应用》, 16 March 2022 (2022-03-16) *

Also Published As

Publication number Publication date
CN116028812B (en) 2025-07-01

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant