CN116028812B

CN116028812B - A method for constructing a pipeline multi-event extraction model

Info

Publication number: CN116028812B
Application number: CN202211733205.1A
Authority: CN
Inventors: 迟雨桐; 冯少辉; 张建业
Original assignee: Beijing Iplus Teck Co ltd
Current assignee: Beijing Iplus Teck Co ltd
Priority date: 2022-12-30
Filing date: 2022-12-30
Publication date: 2025-07-01
Anticipated expiration: 2042-12-30
Also published as: CN116028812A

Abstract

The present invention relates to a method for constructing a pipeline multi-event extraction model and belongs to the technical field of natural language processing. It addresses the low accuracy of existing event extraction models, which tend to miss events and fail to match event elements when a corpus contains many events or overlapping events. An event feature data set is constructed from an original data set, and a training set of positive and negative samples covering event types and event elements is then built from it. A T5 model is trained on this training set so that it learns the internal connections among event types, event roles, event elements and trigger words, which in particular improves its understanding and prediction of multiple events. The whole training process uses prompt information (prompts), which helps ensure extraction accuracy and fidelity to a certain extent, and yields an event extraction model with a high recognition rate for event text.

Description

Construction method of pipeline type multi-event extraction model

Technical Field

The invention relates to the technical field of natural language processing, in particular to a method for constructing a pipeline type multi-event extraction model.

Background

Event extraction (EE) is one of the important tasks in the field of natural language processing (NLP). Its purpose is to identify the event types, event trigger words (triggers), event elements (arguments) and element roles (argument roles) contained in a given corpus. Event extraction is now applied in a wide range of scenarios: it efficiently extracts useful information from large volumes of text and provides strong data support for knowledge graph construction.

Mainstream event extraction models currently use one of three approaches: sequence labeling, pointer discrimination, or generation. The sequence labeling method is essentially a multi-label, multi-class classification method that predicts the possible labels of each token. The pointer discrimination method extracts events by predicting the start and end positions of the text span corresponding to each label. The generation method is an end-to-end (end2end) method that extracts context information with a deeper network and directly outputs event information in text form. All three perform well on corpora containing a single event or a small number of non-overlapping events. When a corpus contains many events, however, and especially when one or more elements overlap between events, they are prone to missed recognitions, recognition errors and failure to match event elements, so accuracy is very low. Because overlapping multiple events are ubiquitous in real corpora, a better method for constructing a multi-event extraction model is needed to solve the low extraction accuracy caused by missed recognitions and unmatched event elements in multi-event and overlapping multi-event extraction tasks.

Disclosure of Invention

In view of the above analysis, embodiments of the invention aim to provide a method for constructing a pipeline multi-event extraction model, so as to solve the low extraction accuracy caused in the prior art by missed recognitions and unmatched event elements in multi-event and overlapping multi-event extraction tasks.

In one aspect, an embodiment of the present invention provides a method for constructing a pipeline multi-event extraction model, including the following steps:

Acquiring marked text data as an original data set;

Obtaining an event feature data set based on the original data set, and further constructing an event type positive sample data set D_{+1}, an event element positive sample data set D_{+2}, an event type full negative sample data set D_{-1} and an event element random negative sample data set D_{-2}, to finally obtain a model training data set D_all;

Training a T5 model with the training data set D_all to obtain a trained pipeline multi-event extraction model M_trained;

When multiple events are extracted, constructing the prediction sample set of each step in turn, using the trained model M_trained to obtain the prediction result of each step from that step's prediction sample set, and integrating the results to obtain the final extraction result.

Further, the obtaining the noted text data includes:

acquiring original text data;

Annotating the original text data, wherein the annotation comprises determining the event types contained in the sentences of the text data, extracting the trigger words and event elements together with their positions according to the event types, and labeling the event elements with the appropriate event role labels.

Further, the event feature data set includes:

A correspondence schema between event types and all of their event roles, a set S_{type_role} of correspondences between event types and single event roles, a set S_type of all event types, a set S_trigger of all trigger words, and a set S_argument of all event elements. The schema records every event type in the original data set together with all event roles corresponding to that event type. S_{type_role} is derived from the schema and contains the pairwise combinations of each event type in the schema with each of its event roles, where every event role belongs to the event roles recorded for that type in the schema. S_type records all event types, S_trigger records all trigger words appearing in the original data set, and S_argument records all event elements contained in the original data set.

Further, the model training dataset D _all is constructed by the following steps:

Summarizing and organizing the annotation information of the original data set to obtain three event feature data sets: the correspondence schema between event types and all event roles, the set S_{type_role} of correspondences between event types and single event roles, and the set S_type of all event types;

Constructing the event type positive sample data set D_{+1} and the event element positive sample data set D_{+2} from the original data set and the schema, together with two further event feature data sets: the set S_trigger of all trigger words and the set S_argument of all event elements occurring in the original data set;

Constructing the event type full negative sample data set D_{-1} from the event type positive sample data set D_{+1} and the event type set S_type;

Constructing the event element random negative sample data set D_{-2} from the event element positive sample data set D_{+2}, the trigger word set S_trigger, the event element set S_argument and the set S_{type_role} of correspondences between event types and single event roles;

Mixing and shuffling D_{+1}, D_{+2}, D_{-1} and D_{-2} to finally obtain the model training data set D_all.

Further, the event type positive sample data set D ₊₁ and the event element positive sample data set D ₊₂ are constructed by the following steps:

A1. Extracting, for an event contained in text data text_p of the original data set, its event type e_type, trigger word w_trigger, event roles e_{role_1}～e_{role_n} and corresponding event elements w_{arg_1}～w_{arg_n} (n is the number of event roles contained in the event and equals the number of event elements). The event type positive sample of the event has input text_p + e_type + "trigger word" and output w_trigger. The event element positive samples of the event have input text_p + prompt_arg and output w_{arg_1}～w_{arg_n}, where the event element prompt prompt_arg can be obtained by the following formula:

A2. Constructing an event type positive sample and event element positive samples for each event in the text data text_p using the method in A1, to obtain the event type positive sample data set D_{+1} and the event element positive sample data set D_{+2};

Further, the event type full negative sample dataset D _-1 is constructed by the following steps:

B1. Replacing e_type in the event type positive sample of an event, in turn, with each of the other event types in the event type set S_type, and setting the target output to null, to obtain the event type full negative samples of the event;

B2. Applying the method in B1 to all events in the event type positive sample data set D_{+1} to construct the event type full negative sample data set D_{-1}.

Further, the event element random negative sample dataset D _-2 is constructed by:

(1) Finding all event element positive samples of an event in D_{+2}, collecting all event element prompts prompt_arg from them, and forming a set S_prompt;

(2) Randomly selecting a trigger word from S_trigger to obtain w_{trigger_random}, and randomly selecting an element from S_{type_role} to obtain an event type e_{type_random}, an event role e_{role_random} and the position p of that event role;

(3) Randomly selecting p event elements from the event element set S_argument to obtain w_{arg_r_1}～w_{arg_r_p}, and combining them in the following format to obtain the random event element prompt prompt_{arg_random}:

prompt_{arg_random} = e_{type_random} + w_{trigger_random} + w_{arg_r_1} + … + w_{arg_r_p} + e_{role_random}

(4) Checking whether prompt_{arg_random} already exists in S_prompt; if so, repeating steps (2), (3) and (4); if not, constructing a negative sample from prompt_{arg_random} and adding prompt_{arg_random} to S_prompt;

(5) Repeating steps (1)-(4) until 5n random negative samples of the event elements are obtained;

(6) Applying steps (1)-(5) to all event samples in D_{+2} to construct the event element random negative sample data set D_{-2}.
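The random negative sampling for a single event can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the elements of S_{type_role} are represented as hypothetical (event type, role, position) tuples, the position is assumed to count the roles that precede the given role in the schema, the text and the prompt are joined with a space, a null output is represented by an empty string, and the `max_tries` guard is added only so the sketch terminates on very small pools. All sample values are made up.

```python
import random

SEP = "-"  # prompt separator; the description's examples use "-"


def build_random_negatives(text, positive_prompts, s_trigger, s_argument,
                           s_type_role, ratio=5, rng=None, max_tries=10_000):
    """Sketch of steps (1)-(5) for one event.

    s_type_role holds hypothetical (event_type, role, position) tuples, where
    position is assumed to be the number of roles preceding `role`.
    """
    rng = rng or random.Random()
    s_prompt = set(positive_prompts)          # step (1): S_prompt
    target = ratio * len(positive_prompts)    # 5n negatives for n positives
    triggers = sorted(s_trigger)
    arguments = sorted(s_argument)
    type_roles = sorted(s_type_role)
    negatives = []
    for _ in range(max_tries):                # guard added for this sketch only
        if len(negatives) >= target:
            break
        trigger = rng.choice(triggers)                    # step (2)
        etype, role, position = rng.choice(type_roles)
        count = min(position, len(arguments))
        picked = rng.sample(arguments, count)             # step (3)
        prompt = SEP.join([etype, trigger, *picked, role])
        if prompt in s_prompt:                            # step (4): retry
            continue
        s_prompt.add(prompt)
        negatives.append({"input": f"{text} {prompt}", "output": ""})
    return negatives


# Illustrative, made-up data.
positives = ["acquisition-acquired-acquisition time",
             "acquisition-acquired-2021-acquirer"]
negatives = build_random_negatives(
    "Company A acquired Company B in 2021.", positives,
    s_trigger={"acquired", "merged", "listed", "resigned"},
    s_argument={"2021", "Company A", "Company B"},
    s_type_role={("acquisition", "acquisition time", 0),
                 ("acquisition", "acquirer", 1)},
    rng=random.Random(0),
)
```

Keeping every generated prompt in S_prompt prevents both collisions with positive prompts and duplicate negatives.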

Further, the training of the T5 model includes:

Dividing the model training data set D_all in a certain proportion to obtain a training set D_train, a validation set D_eval and a test set D_test;

Fine-tuning the T5 model on the training set D_train for n rounds, validating on the validation set D_eval after each round, taking the model from the round with the best validation result as the final model, and testing it on the test set D_test to finally obtain the trained model M_trained;
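The "keep the best round" selection can be sketched independently of any training framework. In this minimal sketch, `train_one_round` and `evaluate` are hypothetical callables standing in for one fine-tuning pass over D_train and one validation pass over D_eval, and a higher validation score is assumed to be better; the toy model and scores are made up.

```python
import copy


def fine_tune_select_best(model, train_one_round, evaluate, n_rounds):
    """Train for n_rounds and return a snapshot of the best-validated round."""
    best_score, best_model = float("-inf"), None
    for _ in range(n_rounds):
        train_one_round(model)       # one pass over D_train (hypothetical)
        score = evaluate(model)      # validation score on D_eval (hypothetical)
        if score > best_score:       # assumes higher is better
            best_score, best_model = score, copy.deepcopy(model)
    return best_model, best_score


# Toy stand-ins so the sketch runs without a real model.
toy_model = {"round": 0}
toy_scores = [0.5, 0.8, 0.7]


def toy_train(m):
    m["round"] += 1


def toy_eval(m):
    return toy_scores[m["round"] - 1]


best_model, best_score = fine_tune_select_best(
    toy_model, toy_train, toy_eval, n_rounds=3)
```

The snapshot is a deep copy because the model object keeps changing in later rounds.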

Model loss is calculated and parameters are updated during training using the following formula:

Loss=CrossEntropy(x_pred,x_gold)

where x_pred is the prediction result and x_gold is the target output.

Further, the step-by-step construction of the prediction sample set of each step includes:

Constructing the first-step prediction sample set D_{step_1} based on the text to be extracted and the event feature data set;

Constructing prompt information (prompt) based on the text to be extracted, the event feature data set and the prediction result of the model M_trained in the previous step, and building the prediction sample set of the next step in the structure text + prompt, so that the prediction sample sets D_{step_2}～D_{step_(n+1)} of steps 2 through n+1 are constructed in sequence.

Further, the trained model M _trained is configured to obtain a prediction result of each step based on the prediction sample set of each step, and integrate to obtain a final extraction result, which includes:

Inputting D_{step_1} into the model M_trained to obtain the first-step prediction result p_trigger, i.e., all trigger words contained in the text;

Constructing the prediction sample sets D_{step_2}～D_{step_(n+1)} of steps 2 through n+1 in the format text + prompt_x, and inputting D_{step_x} into the model M_trained to obtain, for each trigger word, the (x-1)-th event element, i.e., the element corresponding to the (x-1)-th event role of the event type corresponding to that trigger word, where x ∈ [2, n+1] and prompt_x is expressed as:

Combining the prompt information of the last step with the extraction result to obtain a complete event.

Compared with the prior art, the invention has at least one of the following beneficial effects:

1. A training set containing positive and negative samples of event types and event elements is constructed from the original data set, and the T5 model is trained on it, so that the model effectively learns the internal relations among event types, event roles, event elements and trigger words; in particular, the model's ability to understand and predict multiple events is improved. The whole training process uses prompt information (prompts), which helps ensure extraction accuracy and fidelity to a certain extent and yields an event extraction model with a higher recognition rate on event text.

2. Event text is extracted with the trained model, using prompt information (prompts) layer by layer. All event types are first used as prompt information to extract the corresponding trigger words; the trigger words and the element roles to be extracted are then added to the prompt step by step to extract the event elements; and once all event elements of the event type have been extracted, the prompt information of the last step is combined with the extraction result to obtain a complete event.

In the invention, the technical schemes can be mutually combined to realize more preferable combination schemes. Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description and drawings.

Drawings

The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, like reference numerals being used to refer to like parts throughout the several views.

FIG. 1 is a schematic flow chart of a method for constructing a pipeline multi-event extraction model according to an embodiment of the invention;

FIG. 2 is a schematic diagram of an overall implementation flow of actual prediction in a method for constructing a pipeline multi-event extraction model according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a training data constructing process according to an embodiment of the present invention;

Fig. 4 is a schematic flow chart of obtaining a prediction result according to an embodiment of the present invention.

Detailed Description

The following detailed description of preferred embodiments of the application is made in connection with the accompanying drawings, which form a part hereof, and together with the description of the embodiments of the application, are used to explain the principles of the application and are not intended to limit the scope of the application.

The invention discloses a method for constructing a pipeline type multi-event extraction model, which is shown in fig. 1 and comprises the following steps:

Step S110, obtaining marked text data as an original data set;

Step S120, obtaining an event feature data set based on the original data set, and further constructing an event type positive sample data set D_{+1}, an event element positive sample data set D_{+2}, an event type full negative sample data set D_{-1} and an event element random negative sample data set D_{-2}, to finally obtain a model training data set D_all;

Step S130, training a T5 model with the training data set D_all to obtain a trained pipeline multi-event extraction model M_trained;

The embodiment of the invention trains the T5 model with a training set containing positive and negative samples of event types and event elements to obtain a multi-event extraction model. Because this training set is constructed from the original data set, the model effectively learns the internal relations among event types, event roles, event elements and trigger words; in particular, its ability to understand and predict multiple events is improved. The whole training process uses prompt information (prompts), which helps ensure extraction accuracy and fidelity to a certain extent and yields an event extraction model with a higher recognition rate on event text.

On the basis of the above embodiment, specifically, the noted text data in the above step S110 is obtained by the following method:

Directly using the Baidu event extraction data set; or

Annotating original text data, wherein the annotation comprises determining the event types contained in the sentences of the text data, extracting the trigger words and event elements together with their positions according to the event types, and labeling the event elements with the appropriate event role labels;

Specifically, the step S120 may be further optimized as the following steps:

Step S210, summarizing and organizing the annotation information of the original data set to obtain three event feature data sets: the correspondence schema between event types and all event roles, the set S_{type_role} of correspondences between event types and single event roles, and the set S_type of all event types;

Specifically, all event types and event roles in the original data set are summarized and organized to construct the schema, S_{type_role} and S_type. The schema records every event type in the original data set together with all event roles corresponding to that event type. S_{type_role} is derived from the schema and contains the pairwise combinations of each event type with each of its event roles, where every event role belongs to the event roles recorded for that type in the schema. S_type records all event types. Preferably, the schema is stored in a JSON file, and both S_{type_role} and S_type are stored as sets (set).
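A minimal sketch of how the three feature data sets of step S210 might be derived is given below. The annotation layout (keys "text", "events", "type", "trigger", "arguments") is a hypothetical format chosen for illustration, the sample sentence is made up, and storing the role position inside each S_{type_role} element is an assumption motivated by the later random negative sampling, which draws a role together with its position.

```python
import json


def build_feature_sets(annotated):
    """Derive schema, S_type_role and S_type from annotated samples."""
    schema = {}                                   # event type -> ordered roles
    for sample in annotated:
        for event in sample["events"]:
            roles = schema.setdefault(event["type"], [])
            for role, _element in event["arguments"]:
                if role not in roles:
                    roles.append(role)
    s_type = set(schema)
    # One (type, role, position) tuple per type/role pair; position is the
    # index of the role in the schema order (assumption of this sketch).
    s_type_role = {(etype, role, i)
                   for etype, roles in schema.items()
                   for i, role in enumerate(roles)}
    return schema, s_type_role, s_type


# Illustrative, made-up annotation.
annotated = [{
    "text": "Company A acquired Company B in 2021.",
    "events": [{
        "type": "acquisition",
        "trigger": "acquired",
        "arguments": [("acquisition time", "2021"), ("acquirer", "Company A")],
    }],
}]
schema, s_type_role, s_type = build_feature_sets(annotated)
schema_json = json.dumps(schema, ensure_ascii=False)  # schema kept as JSON text
```

A list is used for the roles of each type because the order of roles in the schema matters when prompts are built later.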

For example, for an event of type "acquisition" whose event roles include "acquisition time" and "acquirer", the records in the schema, S_{type_role} and S_type are shown in Table 1.

Table 1. Example records of an event of type "acquisition" in the schema, S_{type_role} and S_type

Step S220, constructing the event type positive sample data set D_{+1} and the event element positive sample data set D_{+2} from the original data set and the schema, together with two further event feature data sets: the set S_trigger of all trigger words and the set S_argument of all event elements occurring in the original data set;

Specifically, constructing the event type positive sample data set D_{+1}, the event element positive sample data set D_{+2}, the trigger word set S_trigger and the event element set S_argument includes:

(1) Extracting, for an event contained in text data text_p of the original data set, its event type e_type, trigger word w_trigger, event roles e_{role_1}～e_{role_n} and corresponding event elements w_{arg_1}～w_{arg_n} (n is the number of event roles contained in the event and equals the number of event elements). The event type positive sample of the event has input text_p + e_type + "trigger word" and output w_trigger. The event element positive samples of the event have input text_p + prompt_arg and output w_{arg_1}～w_{arg_n}, where the event element prompt prompt_arg can be obtained by the following formula:

(2) Constructing an event type positive sample and event element positive samples for each event in the text data text_p using the method in (1), to obtain the event type positive sample data set D_{+1} and the event element positive sample data set D_{+2};

(3) Storing the trigger words w_trigger of all events in the text data text_p in the trigger word set S_trigger, and all event elements w_{arg_1}～w_{arg_n} in the event element set S_argument, to obtain the trigger word data set S_trigger and the event element data set S_argument.
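The positive-sample construction in (1)-(3) can be sketched as follows. The formula for prompt_arg is not reproduced in this text, so the sketch assumes it mirrors the format stated for the random prompt (event type, trigger word, the elements of the preceding roles, then the role to extract), with one sample per role. The "-" separator follows the description's example; joining the text and the prompt with a space, the dictionary layout and the sample values are illustrative assumptions.

```python
SEP = "-"  # separator between prompt parts, as in the description's example


def build_positive_samples(text, event, schema, s_trigger, s_argument):
    """Build one event type positive sample and one element sample per role."""
    etype, trigger = event["type"], event["trigger"]
    elements = event["arguments"]                 # role -> event element
    type_sample = {"input": f"{text} {etype}{SEP}trigger word",
                   "output": trigger}
    element_samples, previous = [], []
    for role in schema[etype]:                    # roles in schema order
        # Assumed prompt format: type, trigger, earlier elements, current role.
        prompt = SEP.join([etype, trigger, *previous, role])
        element_samples.append({"input": f"{text} {prompt}",
                                "output": elements[role]})
        previous.append(elements[role])
    s_trigger.add(trigger)                        # step (3): record trigger
    s_argument.update(elements.values())          # step (3): record elements
    return type_sample, element_samples


# Illustrative, made-up data.
schema = {"acquisition": ["acquisition time", "acquirer"]}
event = {"type": "acquisition", "trigger": "acquired",
         "arguments": {"acquisition time": "2021", "acquirer": "Company A"}}
s_trigger, s_argument = set(), set()
type_sample, element_samples = build_positive_samples(
    "Company A acquired Company B in 2021.", event, schema,
    s_trigger, s_argument)
```

Iterating over the roles in schema order keeps the element order in each prompt consistent with the schema.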

Illustratively, for an event of type "acquisition", the constructed event type positive sample and event element positive samples, and the entries saved in S_trigger and S_argument, are shown in Table 2.

Table 2. Event type positive sample and event element positive sample examples for an event of type "acquisition", with the entries saved in S_trigger and S_argument

In this example, the elements of the prompt information are separated by "-"; other symbols or spaces may also be used. When event element positive samples are constructed, the order in which event elements appear in prompt_arg must be kept consistent with the order recorded in the schema.

For complex events the output may contain multiple event elements; when the input is constructed, an event element prompt must be built separately for each. Table 3 shows an example of event element positive samples for multiple events that share a trigger word:

Table 3. Event element positive sample example for multiple events sharing a trigger word

Step S230, constructing the event type full negative sample data set D_{-1} from the event type positive sample data set D_{+1} and the event type set S_type;

Specifically, constructing the event type full negative sample data set D_{-1} includes:

(1) Replacing e_type in the event type positive sample of an event, in turn, with each of the other event types in the event type set S_type, and setting the target output to null, to obtain the event type full negative samples of the event;

(2) Applying the method in (1) to all events in the event type positive sample data set D_{+1} to construct the event type full negative sample data set D_{-1}.

For example, if the event type set S_type contains m event types, each event has m-1 event type full negative samples;
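The m-1 count can be checked with a short sketch. The sample layout, the space between text and prompt, the "-" separator and the empty string used for a null target are illustrative assumptions; the event types other than "acquisition" are made up.

```python
SEP = "-"


def build_type_negatives(text, true_type, s_type):
    """One negative per event type in S_type other than the event's own."""
    return [{"input": f"{text} {other}{SEP}trigger word", "output": ""}
            for other in sorted(s_type) if other != true_type]


# Illustrative, made-up event types.
s_type = {"acquisition", "merger", "listing", "resignation"}
type_negatives = build_type_negatives(
    "Company A acquired Company B in 2021.", "acquisition", s_type)
```

With m = 4 event types in S_type, the event receives 3 full negative samples, each asking for the trigger word of a type the event does not have.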

In model training, a positive sample is a sample that has a target output and a negative sample is a sample whose target output is empty; adding negative samples during training can effectively improve the recognition accuracy of the model.

Step S240, constructing the event element random negative sample data set D_{-2} from the event element positive sample data set D_{+2}, the trigger word set S_trigger, the event element set S_argument and the set S_{type_role} of correspondences between event types and single event roles;

The input format of an event element random negative sample is identical to that of an event element positive sample; the difference is that its prompt information is different and its output is always null. For an event, the recommended number of event element random negative samples is generally 5 times the number of event element positive samples;

Specifically, the event element random negative sample data set D_{-2} is constructed as follows:

(6) Applying steps (1)-(5) to all event samples in D_{+2} to obtain the event element random negative sample data set D_{-2};

Step S250, mixing and shuffling D_{+1}, D_{+2}, D_{-1} and D_{-2} to finally obtain the model training data set D_all;

specifically, training the T5 model in step S130 includes:

Dividing the model training data set D_all in a certain proportion, preferably 8:1:1, to obtain a training set D_train, a validation set D_eval and a test set D_test; fine-tuning the T5 model on the training set D_train for n rounds, validating on the validation set D_eval after each round, taking the model from the round with the best validation result as the final model, and testing it on the test set D_test to finally obtain the trained model M_trained; the preferred number of training rounds n is 20;
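Mixing the four sample sets and splitting them 8:1:1 can be sketched with the standard library alone. The fixed seed and the integer-division rounding are choices made for this sketch; the description only states the preferred ratio. The integer lists stand in for real samples.

```python
import random


def shuffle_and_split(d_pos1, d_pos2, d_neg1, d_neg2,
                      ratios=(8, 1, 1), seed=0):
    """Mix the four sample sets into D_all, shuffle, and split by ratio."""
    d_all = [*d_pos1, *d_pos2, *d_neg1, *d_neg2]
    random.Random(seed).shuffle(d_all)
    total = sum(ratios)
    n_train = len(d_all) * ratios[0] // total
    n_eval = len(d_all) * ratios[1] // total
    d_train = d_all[:n_train]
    d_eval = d_all[n_train:n_train + n_eval]
    d_test = d_all[n_train + n_eval:]      # remainder goes to the test set
    return d_train, d_eval, d_test


# Integers stand in for samples here.
d_train, d_eval, d_test = shuffle_and_split(
    list(range(0, 40)), list(range(40, 70)),
    list(range(70, 90)), list(range(90, 100)))
```

Shuffling before the split keeps positive and negative samples of both kinds present in all three subsets with high probability.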

further, model loss is calculated and parameters are updated during training using the following formula:

Loss=CrossEntropy(x_pred,x_gold)

where x_pred is the prediction result and x_gold is the label.
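As a worked illustration of the loss, the sketch below computes a token-level cross-entropy from raw scores. The description only states Loss = CrossEntropy(x_pred, x_gold); treating x_pred as per-position vocabulary scores, treating x_gold as target token indices, and averaging over positions are assumptions that follow common practice for sequence-to-sequence models.

```python
import math


def cross_entropy(logits, gold_ids):
    """Mean cross-entropy over target positions.

    logits: one list of vocabulary scores per target position (x_pred).
    gold_ids: the index of the gold token at each position (x_gold).
    """
    total = 0.0
    for scores, gold in zip(logits, gold_ids):
        peak = max(scores)                       # subtract max for stability
        log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_z - scores[gold]            # -log softmax(scores)[gold]
    return total / len(gold_ids)


uniform_loss = cross_entropy([[0.0, 0.0]], [0])      # two equally likely tokens
confident_loss = cross_entropy([[100.0, 0.0]], [0])  # gold token dominates
```

A uniform prediction over two tokens gives a loss of ln 2, and the loss approaches 0 as the score of the gold token dominates.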

Furthermore, extracting events from actual text with the trained model M_trained comprises the following steps:

Step S310, obtaining the text to be extracted, which may be, for example, news text data crawled from a website;

Step S320, based on the text to be extracted, the event feature data set obtained from the original data set and the prediction result of the model M_trained in the previous step, constructing the prediction sample sets D_{step_1}～D_{step_(n+1)} of steps 1 through n+1 step by step in the structure text + prompt, and inputting D_{step_1}～D_{step_(n+1)} step by step into the model M_trained to obtain the prediction results of steps 1 through n+1, where n is the number of event roles of the event type corresponding to the first-step prediction result;

Specifically, constructing the prediction sample sets D_{step_1}～D_{step_(n+1)} of steps 1 through n+1 and obtaining the corresponding prediction results of the model M_trained includes the following steps:

(A) All event types in S_type are traversed in turn, and for each event type e_{type_k} a sample containing the text and that event type is added to the first-step prediction sample set D_{step_1}. After the traversal, D_{step_1} contains m samples (m is the number of event types, k ∈ [1, m]);

(B) The first-step prediction sample set D_{step_1} is input into M_trained. When a sample has an output result, that output is the trigger word of the sample's event type in the text to be extracted. The first event role corresponding to that event type is looked up in the schema, and a sample is added to the next prediction sample set D_{step_2} in the format text + prompt_2, where prompt_2 is built from the event type, the extracted trigger word and the first event role.

For a sample with no output result, the text contains no trigger word of the input event type, i.e., the text does not contain an event of that type.

(C) D_{step_2} is input into M_trained to predict, for each trigger word, the event element of the corresponding first event role. The schema is then consulted to determine whether the event type has other event roles; if not, step S330 is performed;

If the event type has other event roles in the schema, the next prediction sample sets D_{step_3}～D_{step_(n+1)} are constructed step by step for those event roles and input step by step into the model M_trained to extract the corresponding event elements, until M_trained has extracted the event elements corresponding to all event roles of the event type; step S330 is then performed;

More specifically, the next prediction sample sets D_{step_3}～D_{step_(n+1)} are constructed as follows:

A sample is built in the format text + prompt_x and added to the next prediction sample set D_{step_x}, where prompt_x is obtained from prompt_{x-1} by replacing its final event role with the event element predicted for that role and then appending the next event role; x ∈ [3, n+1], and n is the number of event roles contained in the event type in the schema;

The event elements used in the prompt information are determined as follows:

where j ∈ [1, n-1] and n is the number of event roles contained in the event type in the schema. If the prediction for an event element contains multiple results, a prediction sample is constructed separately for each result in the format of this step.
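The step-by-step prediction in (A)-(C) can be sketched as a loop over event types, trigger words and roles. This is an illustration under assumptions, not the patented implementation: `predict` is a hypothetical callable that wraps the trained model and returns a list of output strings (empty when the model outputs nothing), prompts are joined with "-" and appended to the text after a space, and a role with no output is recorded as an empty string. The fake model outputs are made up.

```python
SEP = "-"


def extract_events(text, schema, predict):
    """Run the stepwise prompts; predict(input_str) -> list of outputs."""
    events = []
    for etype, roles in schema.items():           # step 1: one sample per type
        for trigger in predict(f"{text} {etype}{SEP}trigger word"):
            # Each branch is (prompt parts so far, elements found so far).
            branches = [([etype, trigger], {})]
            for role in roles:                    # steps 2 .. n+1
                next_branches = []
                for parts, found in branches:
                    prompt = SEP.join([*parts, role])
                    outputs = predict(f"{text} {prompt}") or [""]
                    for element in outputs:       # one branch per result
                        next_branches.append(
                            ([*parts, element], {**found, role: element}))
                branches = next_branches
            for _parts, found in branches:
                events.append({"type": etype, "trigger": trigger,
                               "arguments": found})
    return events


# A fake model backed by a lookup table, so the sketch runs without T5.
fake_outputs = {
    "T acquisition-trigger word": ["acquired"],
    "T acquisition-acquired-acquisition time": ["2021"],
    "T acquisition-acquired-2021-acquirer": ["Company A"],
}
schema = {"acquisition": ["acquisition time", "acquirer"], "merger": ["time"]}
events = extract_events("T", schema, lambda s: fake_outputs.get(s, []))
```

Branching on every returned result is how the sketch handles a role whose prediction contains several elements: each result continues as its own prompt chain and yields its own event.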

Step S330, integrating the (n+1)-th prediction sample set D_{step_(n+1)} with the prediction result of the model M_trained in step n+1 to obtain the final recognition result;

Specifically, integrating to obtain the final recognition result includes:

Arranging D_{step_(n+1)} and the predicted n-th event element into the event extraction result:

Event type:

trigger words:

Event role/event element (role/event):

For example, event extraction results may be integrated using a format as in Table 4.

Table 4. Event extraction result integration example
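Combining the last prompt with the last prediction can be sketched as below. The sketch assumes the last prompt has the form type-trigger-element_1-…-element_{n-1}-role_n and that no part contains the "-" separator; the output keys and the sample values are illustrative, not the layout of Table 4.

```python
def integrate(last_prompt, last_output, schema, sep="-"):
    """Rebuild a complete event from the final prompt and final prediction."""
    etype, trigger, *rest = last_prompt.split(sep)
    elements = rest[:-1] + [last_output]   # drop role_n, append its element
    roles = schema[etype]                  # roles in schema order
    return {"event type": etype,
            "trigger word": trigger,
            "roles": dict(zip(roles, elements))}


# Illustrative, made-up data.
schema = {"acquisition": ["acquisition time", "acquirer"]}
event = integrate("acquisition-acquired-2021-acquirer", "Company A", schema)
```

Because the prompt already carries every earlier element in schema order, only the final output has to be appended to recover the full role-to-element mapping.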

In summary, the beneficial effects of the embodiment are as follows:

Those skilled in the art will appreciate that all or part of the flow of the methods of the embodiments described above may be accomplished by way of a computer program to instruct associated hardware, where the program may be stored on a computer readable storage medium. Wherein the computer readable storage medium is a magnetic disk, an optical disk, a read-only memory or a random access memory, etc.

The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention.

Claims

1. The method for constructing the pipeline type multi-event extraction model is characterized by comprising the following steps of:

Acquiring marked text data as an original data set;

When multiple events are extracted, constructing the prediction sample set of each step in turn, using the trained model M_trained to obtain the prediction result of each step from that step's prediction sample set, and integrating the prediction results to obtain the final extraction result;

The model training data set D_all is constructed by the following steps:

Mixing and shuffling D_{+1}, D_{+2}, D_{-1} and D_{-2} to finally obtain the model training data set D_all;

The event type positive sample data set D_{+1} and the event element positive sample data set D_{+2} are constructed by the following steps:

A1. Extracting, for an event contained in text data text_p of the original data set, its event type e_type, trigger word w_trigger, event roles e_{role_1}～e_{role_n} and corresponding event elements w_{arg_1}～w_{arg_n}, where n is the number of event roles contained in the event; the event type positive sample of the event has input text_p + e_type + "trigger word" and output w_trigger; the event element positive samples of the event have input text_p + prompt_arg and output w_{arg_1}～w_{arg_n}, where the event element prompt prompt_arg can be obtained by the following formula:

The event type full negative sample data set D_{-1} is constructed by the following steps:

B2. Applying the method in B1 to all events in the event type positive sample data set D_{+1} to construct the event type full negative sample data set D_{-1};

The event element random negative sample data set D_{-2} is constructed by the following steps:

(5) Repeating steps (1)-(4) until 5n random negative samples of the event elements are obtained;

The step-by-step construction of the prediction sample set of each step comprises the following steps:

Constructing prompt information (prompt) based on the text to be extracted, the event feature data set and the prediction result of the model M_trained in the previous step, building the prediction sample set of the next step in the structure text + prompt, and constructing the prediction sample sets D_{step_2}～D_{step_(n+1)} of steps 2 through n+1 in sequence;

Using the trained model M_trained to obtain the prediction result of each step from that step's prediction sample set, and integrating the results to obtain the final extraction result, includes:

Constructing the prediction sample sets D_{step_2}～D_{step_(n+1)} of steps 2 through n+1 in the format text + prompt_x, and inputting D_{step_x} into the model M_trained to obtain, for each trigger word, the (x-1)-th event element, i.e., the element corresponding to the (x-1)-th event role of the event type corresponding to that trigger word, where x ∈ [2, n+1] and prompt_x is expressed as:

2. The method of claim 1, wherein the obtaining the annotated text data comprises:

acquiring original text data;

3. The method of claim 1, wherein the set of event feature data comprises:

4. The method of claim 1, wherein training the T5 model comprises:

Loss=CrossEntropy(x_pred,x_gold)

where x_pred is the prediction result and x_gold is the target output.