CN116028812A

CN116028812A - A Construction Method of Pipeline Multi-event Extraction Model

Info

Publication number: CN116028812A
Application number: CN202211733205.1A
Authority: CN
Inventors: 迟雨桐; 冯少辉; 张建业
Original assignee: Beijing Iplus Teck Co ltd
Current assignee: Beijing Iplus Teck Co ltd
Priority date: 2022-12-30
Filing date: 2022-12-30
Publication date: 2023-04-28
Anticipated expiration: 2042-12-30
Also published as: CN116028812B

Abstract

The invention relates to a method for constructing a pipeline multi-event extraction model and belongs to the technical field of natural language processing. It addresses the low accuracy of existing event extraction models, which are prone to missed detections and unmatched event elements when a corpus contains many events or overlapping events. The method constructs event feature data sets from an original data set, builds from them a training set containing positive and negative samples of event types and event elements, and trains a T5 model on that training set. The model thereby learns the internal relations among event types, event roles, event elements and trigger words, which in particular improves its ability to understand and predict multiple events.

Description

A method for constructing a pipeline multi-event extraction model

Technical Field

The present invention relates to the technical field of natural language processing, and in particular to a method for constructing a pipeline multi-event extraction model.

Background Art

Event extraction (EE) is one of the important tasks in natural language processing (NLP). Its purpose is to identify the event type, event trigger, event elements (arguments) and element roles (argument roles) contained in a given corpus. Event extraction is now applied in a wide range of scenarios: it efficiently extracts useful information from massive amounts of text and provides strong data support for building knowledge graphs.

Existing mainstream event extraction models use sequence labeling, pointer discrimination or generative methods. Sequence labeling is essentially a multi-label, multi-class classification method that predicts the possible labels of each token. Pointer discrimination extracts events by predicting the start and end positions of the text span corresponding to each label. The generative method is an end-to-end (end2end) method that extracts contextual information through a deeper network and directly outputs event information in text form. All three methods perform well on single-event corpora and on non-overlapping multi-event corpora that contain few events. When a corpus contains many events, however, and especially when one or more elements are shared between events, these methods are prone to missed detections, misidentification and unmatched event elements, which makes accuracy very low. Because overlapping multi-events are ubiquitous in real corpora, a better method for constructing a multi-event extraction model is urgently needed to solve the low extraction accuracy that prior-art event extraction models suffer in multi-event and overlapping multi-event extraction tasks because of missed detections and unmatched event elements.

Summary of the Invention

In view of the above analysis, embodiments of the present invention aim to provide a method for constructing a pipeline multi-event extraction model, so as to solve the low extraction accuracy of prior-art event extraction models in multi-event and overlapping multi-event extraction tasks caused by missed detections and unmatched event elements.

In one aspect, an embodiment of the present invention provides a method for constructing a pipeline multi-event extraction model, comprising the following steps:

obtaining annotated text data as the original data set;

obtaining event feature data sets based on the original data set, further constructing an event type positive sample data set D₊₁, an event element positive sample data set D₊₂, an event type all-negative sample data set D_-1 and an event element random negative sample data set D_-2, and finally obtaining a model training data set D_all;

training a T5 model with the training data set D_all to obtain a trained pipeline multi-event extraction model M_trained;

wherein, during multi-event extraction, a prediction sample set is constructed step by step for each step, the trained model M_trained is used to obtain the prediction result of each step from that step's prediction sample set, and the results are integrated into the final extraction result.

Further, obtaining the annotated text data includes:

obtaining original text data;

annotating the original text data, wherein the annotation includes: determining the event types contained in the sentences of the text data; extracting trigger words, event elements and their positions according to the event type; and labeling each event element with the appropriate event role.

Further, the event feature data sets include:

the correspondence schema between event types and all their event roles, the set S_{type_role} of correspondences between an event type and a single event role, the set S_type of all event types, the set S_trigger of all trigger words, and the set S_argument of all event elements. The schema records every event type in the original data set together with all event roles corresponding to it. S_{type_role} is derived from the schema and contains each pairwise combination of an event type in the schema with one of its event roles, together with the ordinal position of that event role within the schema. S_type records all event types. S_trigger records all trigger words that appear in the original data set. S_argument records all event elements contained in the original data set.

Further, the model training data set D_all is constructed by the following steps:

summarizing the annotation information of the original data set to obtain three event feature data sets: the correspondence schema between event types and all their event roles, the set S_{type_role} of correspondences between an event type and a single event role, and the set S_type of all event types;

using the original data set and the schema to construct the event type positive sample data set D₊₁ and the event element positive sample data set D₊₂, as well as two further event feature data sets: the set S_trigger of all trigger words and the set S_argument of all event elements that appear in the original data set;

using the event type positive sample data set D₊₁ and the event type set S_type to construct the event type all-negative sample data set D_-1;

using the event element positive sample data set D₊₂, the trigger word set S_trigger, the event element set S_argument and the set S_{type_role} to construct the event element random negative sample data set D_-2;

mixing and shuffling D₊₁, D₊₂, D_-1 and D_-2 to finally obtain the model training data set D_all.

Furthermore, the event type positive sample data set D₊₁ and the event element positive sample data set D₊₂ are constructed by the following steps:

A1. For an event contained in the text data text_p of the original data set, extract the event type e_type, the trigger word w_trigger, the event roles e_{role_1}～e_{role_n} and the corresponding event elements w_{arg_1}～w_{arg_n} (n is the number of event roles of the event, which equals the number of event elements). The event type positive sample of the event has the input text_p+e_type+"trigger word" and the output w_trigger. The event element positive samples of the event have the input text_p+prompt_arg and the outputs w_{arg_1}～w_{arg_n}, where the event element prompt prompt_arg is obtained by the following formula:

A2. For each event in the text data text_p, construct event type positive samples and event element positive samples by the method of A1, obtaining the event type positive sample data set D₊₁ and the event element positive sample data set D₊₂;

Furthermore, the event type all-negative sample data set D_-1 is constructed by the following steps:

B1. Replace the e_type of an event's event type positive sample, in turn, with each of the other event types in the event type set S_type, setting the target output to empty every time; this yields all event type negative samples of that event;

B2. Apply the method of B1 to every event in the event type positive sample data set D₊₁ to obtain the event type all-negative sample data set D_-1.

Furthermore, the event element random negative sample data set D_-2 is constructed by the following steps:

(1) Find all event element positive samples of an event in D₊₂, and collect every event element prompt prompt_arg from them into a set S_prompt;

(2) Randomly select a trigger word from S_trigger to obtain w_{trigger_random}; randomly select an element from S_{type_role} to obtain an event type e_{type_random}, an event role e_{role_random} and the position p of that event role;

(3) Randomly select p event elements from the event element set S_argument to obtain w_{arg_r_1}～w_{arg_r_p}, and combine them in the following format to obtain the random event element prompt prompt_{arg_random}:

prompt_{arg_random}=e_{type_random}+w_{trigger_random}+w_{arg_r_1}+…+w_{arg_r_p}+e_{role_random}

(4) Check whether prompt_{arg_random} already exists in S_prompt. If it does, repeat steps (2), (3) and (4); if it does not, use prompt_{arg_random} to construct a negative sample and add prompt_{arg_random} to S_prompt;

(5) Repeat steps (1)～(4) until 5n random event element negative samples are obtained;

(6) Apply the method of steps (1)～(5) to every event sample in D₊₂ to obtain the event element random negative sample data set D_-2.
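Steps (1)～(5) amount to rejection sampling against the set of prompts already seen. The following is a minimal sketch of that loop for one event, not the patented implementation. The function name `sample_random_negatives`, the "-" separator, sampling elements without replacement, the `max_tries` guard and the demo data are all assumptions added for illustration. The position p is used exactly as it is stored in S_{type_role}.

```python
import random


def sample_random_negatives(text, positive_prompts, s_trigger, s_argument,
                            s_type_role, ratio=5, sep="-", rng=None,
                            max_tries=10_000):
    """Draw ratio * n random negative samples for one event (n = number of positive prompts)."""
    rng = rng or random.Random()
    s_prompt = set(positive_prompts)          # step (1): prompts already in use
    target = ratio * len(positive_prompts)    # 5n in the source text
    triggers = sorted(s_trigger)              # sorted lists give reproducible draws
    elements = sorted(s_argument)
    type_roles = sorted(s_type_role)
    negatives = []
    for _ in range(max_tries):                # guard against an endless loop (assumption)
        if len(negatives) >= target:
            break
        trigger = rng.choice(triggers)                         # step (2)
        event_type, role, p = rng.choice(type_roles)
        picked = rng.sample(elements, min(p, len(elements)))   # step (3)
        prompt = sep.join([event_type, trigger, *picked, role])
        if prompt in s_prompt:                                 # step (4): reject duplicates
            continue
        s_prompt.add(prompt)
        negatives.append({"input": sep.join([text, prompt]), "output": ""})
    return negatives


# Hypothetical demo data.
text = "In 2022, Company A acquired Company B."
positive_prompts = [
    "acquisition-acquired-acquisition time",
    "acquisition-acquired-2022-acquirer",
    "acquisition-acquired-2022-Company A-acquired party",
]
s_trigger = {"acquired", "invested", "resigned"}
s_argument = {"2021", "2022", "Company A", "Company B", "Mr. Li"}
s_type_role = {
    ("acquisition", "acquisition time", 1),
    ("acquisition", "acquirer", 2),
    ("acquisition", "acquired party", 3),
    ("resignation", "person", 1),
}
negatives = sample_random_negatives(text, positive_prompts, s_trigger,
                                    s_argument, s_type_role,
                                    rng=random.Random(0))
```

Adding each accepted prompt to `s_prompt` keeps the negatives distinct from the positives and from each other, as step (4) requires.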

Further, training the T5 model includes:

dividing the model training data set D_all in a given ratio into a training set D_train, a validation set D_eval and a test set D_test;

fine-tuning the T5 model on the training set D_train for n rounds, validating on the validation set D_eval at the end of each round, taking the model from the round with the best validation result as the final model, and testing it on the test set D_test to finally obtain the trained model M_trained;

during training, the model loss is computed with the following formula and the parameters are updated:

Loss=CrossEntropy(x_pred, x_gold)

where x_pred is the prediction result and x_gold is the target output.

Further, constructing the prediction sample set of each step includes:

constructing the first-step prediction sample set D_{step_1} based on the text to be extracted, text, and the event feature data sets;

constructing the prompt information prompt based on the text to be extracted, the event feature data sets and the prediction result of model M_trained in the previous step, and constructing the next step's prediction sample set with the structure text+prompt, so that the prediction sample sets D_{step_2}～D_{step_(n+1)} of steps 2 to n+1 are constructed step by step.

Furthermore, using the trained model M_trained to obtain the prediction result of each step from that step's prediction sample set and integrating the results into the final extraction result includes:

inputting D_{step_1} into model M_trained to obtain the first-step prediction result, namely all trigger words p_trigger contained in the text;

constructing the prediction sample sets D_{step_2}～D_{step_(n+1)} of steps 2 to n+1 in the format text+prompt_x, and inputting D_{step_x} into model M_trained to obtain, for each trigger word and its corresponding event type, the (x-1)-th event element corresponding to the (x-1)-th event role, where x∈[2,n+1]; prompt_x is expressed as:

combining the prompt information of the last step with the extraction result to obtain the complete event.
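The step-by-step prediction described above can be sketched as a loop that grows the prompt one role at a time. This is an illustrative sketch only. The exact layout of prompt_x is not reproduced in this text, so the layout used here (event type, trigger word, elements found so far, next role, joined by "-") is an assumption. Other assumptions are the function name `extract_events`, a model callable that returns a list of answer strings, the rule that an empty answer ends a path, and the canned stand-in model.

```python
def extract_events(text, model, schema, sep="-"):
    """Pipeline extraction: triggers first, then one event role per step."""
    events = []
    for event_type, roles in schema.items():
        # Step 1: ask for the trigger words of this event type.
        triggers = model(sep.join([text, event_type, "trigger word"]))
        for trigger in triggers:
            paths = [[]]  # each path holds the elements found so far
            for role in roles:  # steps 2 .. n+1, one role per step
                next_paths = []
                for found in paths:
                    prompt = sep.join([event_type, trigger, *found, role])
                    # Every answer opens its own extraction path, so
                    # overlapping events are kept apart.
                    for answer in model(sep.join([text, prompt])):
                        next_paths.append(found + [answer])
                paths = next_paths
            for found in paths:
                events.append({"event_type": event_type,
                               "trigger": trigger,
                               "arguments": dict(zip(roles, found))})
    return events


# Hypothetical stand-in for M_trained: canned answers keyed by input string.
text = "In 2022, Company A acquired Company B and Company C."
canned = {
    text + "-acquisition-trigger word": ["acquired"],
    text + "-acquisition-acquired-acquisition time": ["2022"],
    text + "-acquisition-acquired-2022-acquirer": ["Company A"],
    text + "-acquisition-acquired-2022-Company A-acquired party":
        ["Company B", "Company C"],
}
schema = {
    "acquisition": ["acquisition time", "acquirer", "acquired party"],
    "investment": ["investor"],
}
events = extract_events(text, lambda s: canned.get(s, []), schema)
```

In this toy run the two acquisitions share a trigger word, a time and an acquirer, and the final step splits them into two complete events.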

Compared with the prior art, the present invention can achieve at least one of the following beneficial effects:

1. Event feature data sets are constructed from the original data set, a training set containing positive and negative samples of event types and event elements is built from them, and the T5 model is trained on that training set. The model thereby learns the intrinsic connections among event types, event roles, event elements and trigger words, which in particular improves its ability to understand and predict multiple events. The whole training process uses prompt information (prompts), which to a certain extent ensures extraction accuracy and faithfulness and yields an event extraction model with a high recognition rate on event text.

2. When event text is extracted with the trained model, events are extracted progressively, layer by layer, by means of prompt information. All event types are first used as prompt information to extract the corresponding trigger words. The trigger words and the element roles to be extracted are then added to the prompt step by step to extract the event elements. Once all event elements of the event type have been extracted, the prompt information of the last step is combined with the extraction result to obtain the complete event. This pipeline extraction method gives every possible event its own extraction path, which specifically addresses missed detections and unmatched event elements in multi-event and overlapping multi-event extraction and greatly improves extraction accuracy.

In the present invention, the above technical solutions can also be combined with one another to realize further preferred combinations. Other features and advantages of the present invention are set forth in the description that follows; some of them will become apparent from the description or can be understood by practicing the invention. The objectives and other advantages of the present invention can be realized and obtained through what is particularly pointed out in the description and the drawings.

Brief Description of the Drawings

The drawings serve only to illustrate specific embodiments and are not to be considered limiting of the present invention. Throughout the drawings, like reference symbols denote like components.

FIG. 1 is a schematic flow chart of the method for constructing a pipeline multi-event extraction model according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the overall implementation flow, including actual prediction, of the method for constructing a pipeline multi-event extraction model according to an embodiment of the present invention;

FIG. 3 is a schematic flow chart of constructing the training data according to an embodiment of the present invention;

FIG. 4 is a schematic flow chart of obtaining prediction results according to an embodiment of the present invention.

Detailed Description

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, which form part of this application and, together with the embodiments, serve to explain the principles of the invention without limiting its scope.

A specific embodiment of the present invention discloses a method for constructing a pipeline multi-event extraction model which, as shown in FIG. 1, comprises the following steps:

Step S110: obtain annotated text data as the original data set;

Step S120: obtain event feature data sets based on the original data set, further construct an event type positive sample data set D₊₁, an event element positive sample data set D₊₂, an event type all-negative sample data set D_-1 and an event element random negative sample data set D_-2, and finally obtain a model training data set D_all;

Step S130: train a T5 model with the training data set D_all to obtain a trained pipeline multi-event extraction model M_trained;

In this embodiment, the T5 model is trained on a training set containing positive and negative samples of event types and event elements to obtain the multi-event extraction model. Event feature data sets are constructed from the original data set, the training set is built from them, and the T5 model is trained on it. The model thereby learns the intrinsic connections among event types, event roles, event elements and trigger words, which in particular improves its ability to understand and predict multiple events. The whole training process uses prompt information (prompts), which to a certain extent ensures extraction accuracy and faithfulness and yields an event extraction model with a high recognition rate on event text.

On the basis of the above embodiment, the annotated text data in step S110 is specifically obtained by one of the following methods:

using Baidu's event extraction data set directly;

annotating original text data oneself, where the annotation procedure is: determine the event types contained in the sentences of the text data; extract trigger words, event elements and their positions according to the event type; and label each event element with the appropriate event role;

Specifically, step S120 can be refined into the following steps:

Step S210: summarize the annotation information of the original data set to obtain three event feature data sets: the correspondence schema between event types and all their event roles, the set S_{type_role} of correspondences between an event type and a single event role, and the set S_type of all event types;

Specifically, all event types and event roles in the original data set are summarized to construct schema, S_{type_role} and S_type. The schema records every event type in the original data set together with all event roles corresponding to it. S_{type_role} is derived from the schema and contains each pairwise combination of an event type in the schema with one of its event roles, together with the ordinal position of that event role within the schema. S_type records all event types. Preferably, the schema is stored in a JSON file, and S_{type_role} and S_type are both stored as sets.

For example, for an event whose event type is "acquisition" and whose event roles are "acquisition time, acquirer, acquired party", the records in schema, S_{type_role} and S_type are shown in Table 1.

Table 1. Example records in schema, S_{type_role} and S_type for an event of type "acquisition"
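As a rough illustration of step S210, the sketch below derives schema, S_{type_role} and S_type from annotated records. The record layout (`events`, `event_type`, `trigger`, `arguments`), the function name `build_feature_sets`, the example sentence, taking role order from first appearance and counting positions from 1 are all assumptions made for this example. The text above specifies only what each collection records.

```python
import json


def build_feature_sets(records):
    """Derive schema, S_{type_role} and S_type from annotated records."""
    schema = {}  # event type -> ordered list of its event roles
    for record in records:
        for event in record["events"]:
            roles = schema.setdefault(event["event_type"], [])
            for role, _element in event["arguments"]:
                if role not in roles:
                    roles.append(role)
    s_type = set(schema)
    # One entry per (event type, event role, ordinal position of the role).
    # Counting positions from 1 is an assumption of this sketch.
    s_type_role = {
        (event_type, role, position)
        for event_type, roles in schema.items()
        for position, role in enumerate(roles, start=1)
    }
    return schema, s_type_role, s_type


# Hypothetical annotated record for an "acquisition" event.
records = [{
    "text": "In 2022, Company A acquired Company B.",
    "events": [{
        "event_type": "acquisition",
        "trigger": "acquired",
        "arguments": [("acquisition time", "2022"),
                      ("acquirer", "Company A"),
                      ("acquired party", "Company B")],
    }],
}]
schema, s_type_role, s_type = build_feature_sets(records)
schema_json = json.dumps(schema, ensure_ascii=False)  # the schema is kept as JSON
```

Keeping the schema as an ordered mapping preserves the role order that later steps rely on, while the two sets only need membership and random selection.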

Step S220: use the original data set and the schema to construct the event type positive sample data set D₊₁ and the event element positive sample data set D₊₂, as well as two further event feature data sets: the set S_trigger of all trigger words and the set S_argument of all event elements that appear in the original data set;

Specifically, constructing D₊₁, D₊₂, S_trigger and S_argument includes:

(1) For an event contained in the text data text_p of the original data set, extract the event type e_type, the trigger word w_trigger, the event roles e_{role_1}～e_{role_n} and the corresponding event elements w_{arg_1}～w_{arg_n} (n is the number of event roles of the event, which equals the number of event elements). The event type positive sample of the event has the input text_p+e_type+"trigger word" and the output w_trigger. The event element positive samples of the event have the input text_p+prompt_arg and the outputs w_{arg_1}～w_{arg_n}, where the event element prompt prompt_arg is obtained by the following formula:

(2) For each event in the text data text_p, construct event type positive samples and event element positive samples by the method of (1), obtaining the event type positive sample data set D₊₁ and the event element positive sample data set D₊₂;

(3) Save the trigger words w_trigger of all events in the text data text_p into the trigger word set S_trigger and all event elements w_{arg_1}～w_{arg_n} into the event element set S_argument, obtaining the trigger word set S_trigger and the event element set S_argument.

For example, for an event of type "acquisition", the constructed event type positive sample and event element positive samples, and the entries saved in S_trigger and S_argument, are shown in Table 2.

Table 2. Example event type positive sample and event element positive samples for an event of type "acquisition", with the entries saved in S_trigger and S_argument

Note that in this example the elements of the prompt information are separated by "-"; other symbols or spaces may be used in practice. When event element positive samples are constructed, the order in which event elements appear in prompt_arg must match the order recorded in the schema.

For complex events the output may consist of several event elements, and an event element prompt must be constructed separately for each input. For example, Table 3 shows event element positive samples for multiple events that share a trigger word:

Table 3. Example event element positive samples for multiple events sharing a trigger word
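To make the sample layout concrete, here is a minimal sketch of building the positive samples for one event. The formula for prompt_arg is not reproduced in this text, so the layout used here (event type, trigger word, the elements of earlier roles in schema order, then the role being asked for) is an assumption, modelled on the format given for random negative prompts. The function name `build_positive_samples`, joining the text and the prompt with the same "-" separator and the demo data are also assumptions. Every role in the schema is assumed to be annotated for the event.

```python
SEP = "-"  # separator between prompt elements, as in the example above


def build_positive_samples(text, event, schema):
    """Return the event type positive sample and the event element positive samples."""
    event_type, trigger = event["event_type"], event["trigger"]
    elements = dict(event["arguments"])  # event role -> event element
    type_sample = {
        "input": SEP.join([text, event_type, "trigger word"]),
        "output": trigger,
    }
    element_samples = []
    found = []  # elements of earlier roles, in schema order
    for role in schema[event_type]:
        # Assumed prompt layout: type, trigger, earlier elements, role asked for.
        prompt = SEP.join([event_type, trigger, *found, role])
        element_samples.append({
            "input": SEP.join([text, prompt]),
            "output": elements[role],
        })
        found.append(elements[role])
    return type_sample, element_samples


# Hypothetical demo data.
schema = {"acquisition": ["acquisition time", "acquirer", "acquired party"]}
text = "In 2022, Company A acquired Company B."
event = {
    "event_type": "acquisition",
    "trigger": "acquired",
    "arguments": [("acquired party", "Company B"),
                  ("acquisition time", "2022"),
                  ("acquirer", "Company A")],
}
type_sample, element_samples = build_positive_samples(text, event, schema)
```

Iterating over `schema[event_type]` rather than over the annotation order is what keeps the element order in the prompt consistent with the schema.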

Step S230: use the event type positive sample data set D₊₁ and the event type set S_type to construct the event type all-negative sample data set D_-1;

Specifically, constructing the event type all-negative sample data set D_-1 includes:

(1) Replace the e_type of an event's event type positive sample, in turn, with each of the other event types in the event type set S_type, setting the target output to empty every time; this yields all event type negative samples of that event;

(2) Apply the method of (1) to every event in the event type positive sample data set D₊₁ to obtain the event type all-negative sample data set D_-1.

For example, if the event type set S_type contains m event types, each event has m-1 event type negative samples;

For model training, a positive sample is a sample that has a target output and a negative sample is a sample whose target output is empty. Adding negative samples during training effectively improves the recognition accuracy of the model.
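The m-1 count can be checked with a short sketch that swaps the true event type for every other type in S_type. The function name `build_type_negatives`, the "-" separator between text and prompt and the demo event types are assumptions for illustration.

```python
def build_type_negatives(text, true_type, s_type, sep="-"):
    """One negative sample per event type other than the true one, each with empty output."""
    return [
        {"input": sep.join([text, other, "trigger word"]), "output": ""}
        for other in sorted(s_type)   # sorted only to make the order reproducible
        if other != true_type
    ]


# Hypothetical demo data: m = 4 event types.
s_type = {"acquisition", "investment", "resignation", "listing"}
text = "In 2022, Company A acquired Company B."
type_negatives = build_type_negatives(text, "acquisition", s_type)
```

With four event types in the set, the event yields three negative samples, which matches m-1.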

Step S240: use the event element positive sample data set D₊₂, the trigger word set S_trigger, the event element set S_argument and the set S_{type_role} of correspondences between an event type and a single event role to construct the event element random negative sample data set D_-2;

The input format of an event element random negative sample is the same as that of an event element positive sample; the differences are that the prompt information differs and that the output is always empty. For a given event, the recommended number of event element random negative samples is generally 5 times the number of event element positive samples;

Specifically, D_-2 is constructed by steps (1)～(5) as set out in the Summary of the Invention above, followed by:

(6) Apply the method of steps (1)～(5) to every event sample in D₊₂ to obtain the event element random negative sample data set D_-2;

Step S250: mix and shuffle D₊₁, D₊₂, D_-1 and D_-2 to finally obtain the model training data set D_all;

具体的，上述步骤S130中对T5模型进行训练包括：Specifically, the training of the T5 model in the above step S130 includes:

将模型训练数据集D_all按一定比例划分得到训练集D_train、验证集D_eval和测试集D_test；优选的，所述比例为8：1：1；使用训练集D_train对T5模型进行微调训练n轮，每轮训练结束使用验证集D_eval进行验证，取验证集结果最好的一轮模型作为最终模型，并用测试集D_test进行测试，最终得到训练好的模型M_trained；优选的，所述训练论述n为20；The model training data set D _all is divided into a training set D _train , a validation set _Deval and a test set D _test according to a certain ratio; preferably, the ratio is 8:1:1; the T5 model is fine-tuned and trained for n rounds using the training set D _train , and the validation set _Deval is used for validation after each round of training, and the model with the best validation set result is taken as the final model, and the model is tested with the test set D _test to finally obtain a trained model M _trained ; preferably, the training set n is 20;
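The mixing and the 8:1:1 split can be sketched in a few lines. The function name `shuffle_and_split`, the fixed seed and the integer stand-ins for samples are assumptions for illustration. Any remainder from integer division goes to the test set here, which is one possible choice among several.

```python
import random


def shuffle_and_split(datasets, ratios=(8, 1, 1), seed=0):
    """Merge the sample sets, shuffle them, and split into train/eval/test."""
    d_all = [sample for dataset in datasets for sample in dataset]
    random.Random(seed).shuffle(d_all)   # seeded so that the split is reproducible
    total = sum(ratios)
    n_train = len(d_all) * ratios[0] // total
    n_eval = len(d_all) * ratios[1] // total
    d_train = d_all[:n_train]
    d_eval = d_all[n_train:n_train + n_eval]
    d_test = d_all[n_train + n_eval:]    # remainder goes to the test set
    return d_train, d_eval, d_test


# Integers stand in for the samples of D₊₁, D₊₂, D_-1 and D_-2.
d_pos_type = list(range(0, 40))
d_pos_elem = list(range(40, 70))
d_neg_type = list(range(70, 90))
d_neg_elem = list(range(90, 100))
d_train, d_eval, d_test = shuffle_and_split(
    [d_pos_type, d_pos_elem, d_neg_type, d_neg_elem])
```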

Further, during training, the model loss is computed with the following formula and used to update the parameters:

Loss = CrossEntropy(x_pred, x_gold)

where x_pred is the prediction result and x_gold is the label.
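As a worked illustration of this loss, the sketch below computes the cross entropy between per-position predicted distributions and labeled token indices. The function name, the input layout, and the averaging over output positions are assumptions; the source only states Loss = CrossEntropy(x_pred, x_gold).

```python
import math
from typing import Sequence


def cross_entropy(x_pred: Sequence[Sequence[float]], x_gold: Sequence[int]) -> float:
    """Mean negative log-probability assigned to the labeled tokens.

    x_pred[t] is the predicted probability distribution over the vocabulary
    at output position t; x_gold[t] is the index of the labeled token.
    """
    if len(x_pred) != len(x_gold):
        raise ValueError("x_pred and x_gold must have the same length")
    losses = [-math.log(dist[gold]) for dist, gold in zip(x_pred, x_gold)]
    return sum(losses) / len(losses)
```

A prediction that puts all probability on the labeled token gives a loss of 0; the loss grows as probability mass moves away from the label.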

Furthermore, extracting events from actual event text with the trained model M_trained includes the following steps:

Step S310: obtain the text to be extracted, text; the text to be extracted may be news text data crawled from websites;

Step S320: based on the text to be extracted, the event feature data sets obtained from the original dataset, and the prediction result of the model M_trained in the previous step, construct the prediction sample sets D_{step_1} to D_{step_(n+1)} for steps 1 to n+1 step by step in the text+prompt structure, and input D_{step_1} to D_{step_(n+1)} into M_trained step by step to obtain its prediction results for steps 1 to n+1; n is the number of event roles of the event type corresponding to the first-step prediction result;

Specifically, constructing the prediction sample sets D_{step_1} to D_{step_(n+1)} for steps 1 to n+1 and obtaining the prediction results of M_trained for steps 1 to n+1 includes the following steps:

(A) Traverse all event types e_type in S_type in turn. For the k-th event type, add one sample to the first-step prediction sample set D_{step_1}. After the traversal, D_{step_1} contains m samples, where m is the number of event types and k ∈ [1, m];

(B) Input the first-step prediction sample set D_{step_1} into M_trained. When a sample produces an output, that output is the trigger word, in the text to be extracted, of the event type given in that sample. Look up the first event role of that event type in the schema, and add the output to the next-step prediction sample set D_{step_2} in the format text+prompt_2;

For a sample with no output, the text contains no trigger word for the input event type; that is, the text contains no event of that type.

(C) Input D_{step_2} into M_trained to predict, for each trigger word, the event element of the corresponding first event role. Then check the schema to determine whether the event type has other event roles; if not, proceed to step S330;

If the event type has other event roles in the schema, construct the next-step prediction sample sets D_{step_3} to D_{step_(n+1)} for those event roles step by step, and input D_{step_3} to D_{step_(n+1)} into M_trained in turn to extract the corresponding event elements, until the event elements for all event roles of the event type have been extracted by M_trained; then proceed to step S330;

More specifically, the next-step prediction sample sets D_{step_3} to D_{step_(n+1)} are constructed as follows:

Construct samples in the format text+prompt_x and add them to the next-step prediction sample set D_{step_x}, where prompt_x is obtained from prompt_(x-1) by replacing […] with […] and appending […] at the end; x ∈ [3, n+1], and n is the number of event roles that the event type has in the schema;

The event elements used in the prompt information include […], determined as follows:

[…]

where j ∈ [1, n-1] and n is the number of event roles that the event type has in the schema. If […] contains multiple prediction results, a separate prediction sample must be constructed for each result, following the format in this step.
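The step-by-step procedure in (A) to (C) can be sketched as the loop below. Several details are assumptions rather than statements of the source: the prompt layout (event type, trigger word, previously extracted event elements, then the next event role) is inferred from the prompt_{arg_random} format given in claim 7; `predict` is a hypothetical stand-in for M_trained that returns a list of output strings; the separator `SEP`, the English literal "trigger word", and the handling of empty outputs are also illustrative choices.

```python
from typing import Callable, Dict, List, Tuple

Predict = Callable[[str, str], List[str]]  # hypothetical stand-in for M_trained
SEP = " "  # assumed separator between prompt parts


def extract_events(text: str, schema: Dict[str, List[str]], predict: Predict) -> List[dict]:
    """Run steps 1..n+1 for every event type in the schema."""
    events = []
    for e_type, roles in schema.items():
        # Step 1: one sample per event type; the output is the trigger word(s).
        for trigger in predict(text, SEP.join([e_type, "trigger word"])):
            # Each entry holds the event elements extracted so far for one candidate event.
            partial: List[Tuple[str, ...]] = [()]
            # Steps 2..n+1: one step per event role of this event type.
            for role in roles:
                next_partial = []
                for args in partial:
                    prompt = SEP.join([e_type, trigger, *args, role])
                    results = predict(text, prompt)
                    # Multiple results are split into separate samples for the next step;
                    # an empty output is recorded as an empty element (assumption).
                    for arg in results or [""]:
                        next_partial.append(args + (arg,))
                partial = next_partial
            for args in partial:
                events.append({"type": e_type, "trigger": trigger,
                               "arguments": dict(zip(roles, args))})
    return events
```

Event types whose first-step sample yields no output are skipped, matching the case in (B) where the text contains no event of that type.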

Step S330: based on the prediction sample set D_{step_(n+1)} of step n+1 and the prediction result of M_trained at step n+1, integrate to obtain the final recognition result;

Specifically, integrating to obtain the final recognition result includes:

Based on D_{step_(n+1)} and the predicted n-th event element, the event extraction result is organized as:

Event type:

Trigger word:

Event role/event element (role/argument):
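A minimal sketch of this integration follows. It assumes that the last-step prompt already carries the event type, the trigger word, and the event elements of the first n-1 roles, so that only the n-th event element comes from the last prediction. The function name and the dictionary keys are hypothetical.

```python
from typing import Dict, List


def integrate_event(e_type: str, trigger: str, prior_arguments: List[str],
                    roles: List[str], last_argument: str) -> Dict[str, object]:
    """Combine the parts of the last-step prompt with the last prediction."""
    arguments = prior_arguments + [last_argument]
    if len(arguments) != len(roles):
        raise ValueError("expected exactly one event element per event role")
    return {
        "event_type": e_type,
        "trigger": trigger,
        "role_argument": dict(zip(roles, arguments)),  # role -> event element
    }
```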

For example, the event extraction results may be integrated in the format shown in Table 4.

Table 4 Example of event extraction result integration

In summary, the beneficial effects of this embodiment are as follows:

Those skilled in the art will understand that all or part of the processes of the methods in the above embodiments can be carried out by a computer program instructing the relevant hardware, and that the program can be stored in a computer-readable storage medium. The computer-readable storage medium may be a magnetic disk, an optical disc, a read-only memory, a random access memory, or the like.

The above is only a preferred specific embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for constructing a pipeline multi-event extraction model, characterized by comprising the following steps:

acquiring annotated text data as an original data set;

obtaining event feature data sets based on the original data set, further constructing an event type positive sample dataset D₊₁, an event element positive sample dataset D₊₂, an event type full negative sample dataset D_-1, and an event element random negative sample dataset D_-2, and finally obtaining a model training dataset D_all;

training a T5 model using the training dataset D_all to obtain a trained pipeline multi-event extraction model M_trained;

wherein, during multi-event extraction, a prediction sample set is constructed step by step for each step, and the trained model M_trained is used to obtain the prediction result of each step based on that step's prediction sample set, the results being integrated to obtain the final extraction result.

2. The method of claim 1, wherein acquiring the annotated text data comprises:

acquiring original text data;

labeling the original text data; wherein the labeling includes: determining the event types contained in the sentences of the text data; extracting trigger words, event elements, and their positions according to the event types; and labeling the event elements with the corresponding event roles.

3. The method of claim 1, wherein the event feature data sets comprise:

the correspondence schema between event types and all of their event roles, the set S_{type_role} of correspondences between event types and single event roles, the set S_type of all event types, the set S_trigger of all trigger words, and the set S_argument of all event elements; wherein the schema records all event types in the original data set and all event roles corresponding to each event type; S_{type_role} is obtained from the schema and includes, for each event type in the schema, its combination with each of its event roles, together with the position of that event role among the event roles of the event type in the schema; S_type records all event types; S_trigger records all trigger words appearing in the original data set; and S_argument records all event elements contained in the original data set.
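The five event feature data sets can be derived from annotated records roughly as sketched below. The record layout (a list of dictionaries with `events`, `type`, `trigger`, and `arguments` keys) is a hypothetical annotation format, the function name is illustrative, and the role position is taken as a 0-based index, which is an assumption chosen so that the position equals the number of roles preceding it.

```python
from typing import Dict, List, Set, Tuple


def build_feature_sets(dataset: List[dict]):
    """Derive schema, S_type_role, S_type, S_trigger and S_argument from labels."""
    schema: Dict[str, List[str]] = {}
    s_trigger: Set[str] = set()
    s_argument: Set[str] = set()
    for record in dataset:
        for event in record["events"]:
            roles = schema.setdefault(event["type"], [])
            s_trigger.add(event["trigger"])
            for role, element in event["arguments"]:
                if role not in roles:
                    roles.append(role)  # keep the first-seen order of roles
                s_argument.add(element)
    # One entry per (event type, event role, position of the role in the schema).
    s_type_role: List[Tuple[str, str, int]] = [
        (e_type, role, position)
        for e_type, roles in schema.items()
        for position, role in enumerate(roles)
    ]
    s_type = list(schema)
    return schema, s_type_role, s_type, s_trigger, s_argument
```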

4. The method according to claim 1, characterized in that the model training dataset D_all is constructed by the following steps:

summarizing and organizing the annotation information of the original data set to obtain three event feature data sets: the correspondence schema between event types and all of their event roles, the set S_{type_role} of correspondences between event types and single event roles, and the set S_type of all event types;

constructing, using the original data set and the schema, the event type positive sample dataset D₊₁ and the event element positive sample dataset D₊₂, together with two event feature data sets: the set S_trigger of all trigger words appearing in the original data set and the set S_argument of all event elements;

constructing the event type full negative sample dataset D_-1 using the event type positive sample dataset D₊₁ and the event type set S_type;

constructing the event element random negative sample dataset D_-2 using the event element positive sample dataset D₊₂, the trigger word set S_trigger, the event element set S_argument, and the set S_{type_role} of correspondences between event types and single event roles;

mixing and shuffling D₊₁, D₊₂, D_-1, and D_-2 to finally obtain the model training dataset D_all.

5. The method of claim 4, wherein the event type positive sample dataset D₊₁ and the event element positive sample dataset D₊₂ are constructed by the following steps:

A1. for an event contained in text data text_p of the original data set, extracting the corresponding event type e_type, trigger word w_trigger, event roles e_{role_1}～e_{role_n}, and corresponding event elements w_{arg_1}～w_{arg_n} (n is the number of event roles the event contains, which also equals the number of event elements); constructing the event type positive sample of the event with input text_p + e_type + "trigger word" and output w_trigger; constructing the event element positive samples of the event with input text_p + prompt_arg and outputs w_{arg_1}～w_{arg_n}; wherein the event element prompt information prompt_arg is obtained by the following formula: […];

A2. applying the method in A1 to every event in the text data text_p to construct event type positive samples and event element positive samples, obtaining the event type positive sample dataset D₊₁ and the event element positive sample dataset D₊₂.
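Steps A1 and A2 can be sketched for a single event as follows. The formula for prompt_arg is not reproduced in this text, so the layout used here (event type, trigger word, the event elements of the preceding roles, then the current role) is an assumption modeled on the prompt_{arg_random} format in claim 7. The function name, the separator `SEP`, and the English literal "trigger word" are likewise illustrative.

```python
from typing import List, Tuple

SEP = " "  # assumed separator between the text and the prompt parts


def build_positive_samples(text_p: str, e_type: str, w_trigger: str,
                           arguments: List[Tuple[str, str]]):
    """Return (event type positive sample, event element positive samples).

    arguments is a list of (event role, event element) pairs in schema order.
    Each sample is an (input, output) pair.
    """
    type_sample = (SEP.join([text_p, e_type, "trigger word"]), w_trigger)
    arg_samples = []
    for i, (role, element) in enumerate(arguments):
        preceding = [elem for _, elem in arguments[:i]]
        prompt_arg = SEP.join([e_type, w_trigger, *preceding, role])  # assumed layout
        arg_samples.append((SEP.join([text_p, prompt_arg]), element))
    return type_sample, arg_samples
```

Applying this function to every annotated event and collecting the results would give D₊₁ (the type samples) and D₊₂ (the element samples).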

6. The method of claim 4, wherein the event type full negative sample dataset D_-1 is constructed by the following steps:

B1. replacing e_type in the positive sample of an event type, in turn, with each of the other event types in the event type set S_type, with the target output set to empty, to obtain the event type full negative samples of that event;

B2. applying the method in B1 to all events in the event type positive sample dataset D₊₁ to construct the event type full negative sample dataset D_-1.
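Step B1 amounts to one empty-output sample per other event type, as in the sketch below. The function name, the separator, and the English literal "trigger word" are assumptions made for illustration.

```python
from typing import List, Tuple

SEP = " "  # assumed separator between the text and the prompt parts


def build_type_negatives(text_p: str, e_type: str, s_type: List[str]) -> List[Tuple[str, str]]:
    """Swap e_type for every other event type in S_type; the target output is empty."""
    return [
        (SEP.join([text_p, other, "trigger word"]), "")
        for other in s_type
        if other != e_type
    ]
```

Running this for every event in D₊₁ and collecting the results would give D_-1, with m-1 negatives per positive sample when S_type holds m event types.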

7. The method of claim 4, wherein the event element random negative sample dataset D_-2 is constructed by the following steps:

(1) finding all event element positive samples of an event in D₊₂, and collecting the prompt_arg of all these event element positive samples into a set S_prompt;

(2) randomly selecting a trigger word from S_trigger to obtain w_{trigger_random}; randomly selecting an element from S_{type_role} to obtain an event type e_{type_random}, an event role e_{role_random}, and the position p of that event role;

(3) randomly selecting p event elements from the event element set S_argument to obtain w_{arg_r_1}～w_{arg_r_p}, and combining them in the following format to obtain the event element random prompt information prompt_{arg_random}:

prompt_{arg_random} = e_{type_random} + w_{trigger_random} + w_{arg_r_1} + … + w_{arg_r_p} + e_{role_random}

(4) determining whether prompt_{arg_random} exists in S_prompt; if so, repeating steps (2), (3), and (4); if not, constructing a negative sample using prompt_{arg_random} and adding prompt_{arg_random} to S_prompt;

(5) repeating steps (1) to (4) until 5n event element random negative samples are obtained;

(6) applying the methods in (1) to (5) to all event samples in D₊₂ to obtain the event element random negative sample dataset D_-2.
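Steps (1) to (5) for a single event can be sketched as a rejection-sampling loop. The function and parameter names are hypothetical. Sampling the p event elements with replacement, the `max_tries` safeguard against an exhausted prompt space, and the separator are assumptions that the claim does not specify.

```python
import random
from typing import List, Set, Tuple

SEP = " "  # assumed separator between the text and the prompt parts


def build_random_negatives(text_p: str, positive_prompts: Set[str],
                           s_trigger: List[str], s_argument: List[str],
                           s_type_role: List[Tuple[str, str, int]],
                           n_negatives: int, rng: random.Random,
                           max_tries: int = 10_000) -> List[Tuple[str, str]]:
    """Draw random prompts that are not positive prompts; outputs are empty."""
    s_prompt = set(positive_prompts)                          # step (1)
    negatives: List[Tuple[str, str]] = []
    tries = 0
    while len(negatives) < n_negatives:                       # step (5)
        tries += 1
        if tries > max_tries:                                 # safeguard, not in the claim
            raise RuntimeError("could not find enough unused random prompts")
        w_trigger = rng.choice(s_trigger)                     # step (2)
        e_type, e_role, p = rng.choice(s_type_role)
        w_args = [rng.choice(s_argument) for _ in range(p)]   # step (3)
        prompt = SEP.join([e_type, w_trigger, *w_args, e_role])
        if prompt in s_prompt:                                # step (4): reject and redraw
            continue
        s_prompt.add(prompt)
        negatives.append((SEP.join([text_p, prompt]), ""))
    return negatives
```

With n event element positive samples for the event, `n_negatives` would be set to 5n to follow the claim.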

8. The method of claim 1, wherein training the T5 model comprises:

dividing the model training dataset D_all by a given ratio to obtain a training set D_train, a validation set D_eval, and a test set D_test;

fine-tuning the T5 model on the training set D_train for n rounds, validating with the validation set D_eval at the end of each round, taking the model of the round with the best validation result as the final model, and testing it with the test set D_test to finally obtain the trained model M_trained;

calculating the model loss and updating the parameters during training using the following formula:

Loss = CrossEntropy(x_pred, x_gold)

wherein x_pred is the prediction result and x_gold is the target output.

9. The method of claim 1, wherein constructing the prediction sample set of each step step by step comprises:

constructing the first-step prediction sample set D_{step_1} based on the text to be extracted and the event feature data sets;

constructing prompt information prompt based on the text to be extracted, the event feature data sets, and the prediction result of the model M_trained in the previous step, and constructing the prediction sample set for the model's next step in the text+prompt structure, thereby constructing the prediction sample sets D_{step_2}～D_{step_(n+1)} of steps 2 to n+1 in sequence.

10. The method according to claim 9, characterized in that using the trained model M_trained to obtain the prediction result of each step based on that step's prediction sample set, and integrating to obtain the final extraction result, comprises:

inputting D_{step_1} into the model M_trained to obtain the first-step prediction result, namely all trigger words p_trigger contained in the text;

constructing the prediction sample sets D_{step_2}～D_{step_(n+1)} of steps 2 to n+1 in the format text + prompt_X, and inputting D_{step_x} into the model M_trained to obtain, for each trigger word, the (x-1)-th event element corresponding to the (x-1)-th event role of the corresponding event type, where x ∈ [2, n+1]; wherein prompt_X is expressed as: […];

combining the prompt information of the last step with its extraction result to obtain a complete event.