Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a relation extraction method fusing entity type characterization and relation characterization. Starting from semantic characterization, entity type characterization and relation characterization, the method utilizes a text-subject-object weakly correlated semantic characterization mechanism and a relation feature fusion mechanism to provide a novel relation extraction model, so that the context information of sentences can be effectively captured, entity pairs and relations in unstructured text can be extracted, and the problems mentioned in the background art are solved.
In order to achieve the purpose, the invention provides a relation extraction method for fusing entity type characterization and relation characterization, which comprises the following specific steps:
Step S10, for a natural language text input to the system, encoding the semantic information, entity type information and relation information of the text based on the WordPiece tokenization method, and outputting a WordPiece semantic representation, an entity type representation and a relation representation;
Step S20, further extracting the subject and the object in the text by utilizing BERT and a binary tagging method based on the output WordPiece semantic representation;
Step S30, replacing the extracted word-sense representations of the subject and the object with the output entity type representations to weaken the subject-object semantic association information, constructing a weakly correlated semantic representation mechanism for the subject and the object in the text, and generating a new text with weak semantic association between the subject and the object;
Step S40, constructing a relation encoder based on the BERT representation model, encoding the new text with weak semantic association, extracting high-level abstract semantic information in the text, and outputting a text-subject-object weakly correlated contextual semantic vector representation by combining the bidirectional context information;
and Step S50, constructing a fusion mechanism of the text-subject-object weakly correlated contextual semantic information and the relation information, wherein the fused characterization vector is used to capture the subject-relation-object triples.
Preferably, the specific steps of the step S10 are as follows:
Step S101, the natural language text input to the system is a word sequence s = {w_1, ..., w_l}, wherein w_i, i ∈ {1, 2, ..., l}, represents the i-th word in the sentence and l is the number of words contained in the sentence to be extracted; a WordPiece representation model based on BPE (byte-pair encoding) is constructed to represent the words in a vector space, each word in the input sentence is divided into fine-grained subwords, and a subword representation sequence {t_1, ..., t_L} is output, wherein t_i, i ∈ {1, 2, ..., L}, represents the i-th subword in the sentence and L is the subword length of the sentence to be extracted after WordPiece division;
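As an illustration of step S101, the following minimal sketch performs the WordPiece segmentation with the HuggingFace transformers library; the library choice and the bert-base-cased vocabulary are assumptions of this sketch, not requirements of the invention:

```python
# Sketch of step S101: WordPiece subword segmentation of an input sentence.
# The "transformers" package and the checkpoint name are illustrative choices.
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

sentence = "Marie Curie was born in Warsaw"   # word sequence s = {w_1, ..., w_l}
encoding = tokenizer(sentence)

# subword sequence {t_1, ..., t_L}, e.g. ['[CLS]', 'Marie', 'Cu', '##rie', ...]
subwords = tokenizer.convert_ids_to_tokens(encoding["input_ids"])
print(subwords)
```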
Step S102, vector characterization is carried out on the entity types and relation types pre-input to the system, wherein ℰ is the set of entity types and R is the set of relation types; for any entity type e ∈ ℰ and any relation type r ∈ R input to the system, an entity type characterization model and a relation characterization model based on a multi-layer perceptron are respectively constructed, and the discrete entity type symbols and relation type symbols are converted into continuous high-dimensional characterization vectors e(e) and e(r), so as to output fine-grained semantic information of the entity types and relation types.
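To make step S102 concrete, the sketch below maps discrete type symbols to continuous characterization vectors with an embedding layer followed by a small multi-layer perceptron; the layer sizes, the example type inventories and the embedding-then-MLP layout are assumptions for illustration:

```python
# Sketch of step S102: a multi-layer perceptron that turns discrete entity-type
# and relation-type symbols into continuous characterization vectors e(.).
import torch
import torch.nn as nn

class TypeEncoder(nn.Module):
    def __init__(self, num_types: int, hidden: int = 768):
        super().__init__()
        self.embed = nn.Embedding(num_types, hidden)   # discrete symbol -> initial vector
        self.mlp = nn.Sequential(                      # refine into the characterization vector
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, hidden))

    def forward(self, type_ids: torch.Tensor) -> torch.Tensor:
        return self.mlp(self.embed(type_ids))

entity_types = ["PER", "LOC", "ORG"]                   # illustrative entity-type set
relation_types = ["born_in", "works_for"]              # illustrative relation-type set
entity_encoder = TypeEncoder(len(entity_types))        # yields e(e) for each entity type
relation_encoder = TypeEncoder(len(relation_types))    # yields e(r) for each relation type
e_r = relation_encoder(torch.tensor([0]))              # e(born_in), shape (1, 768)
```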
Preferably, the specific steps of the step S20 are as follows:
Step S201, constructing a named entity encoder based on the BERT neural network representation model, and taking the subword sequence {t_1, ..., t_L} as the input of the system encoder; the bidirectional context information of each token is deeply encoded by N Transformer encoder blocks in sequence through the fine-tuned parameters, and a deep bidirectional language representation vector sequence is output:
h_α = Trans(h_{α-1}), α ∈ [1, N]
wherein Trans represents a Transformer encoder block and h_{α-1} represents the encoding result of the previous Transformer encoder block;
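A minimal sketch of step S201, assuming a pre-trained BERT from the HuggingFace transformers library stands in for the named entity encoder (the checkpoint name is illustrative):

```python
# Sketch of step S201: encode the subword sequence with BERT's N stacked
# Transformer blocks; last_hidden_state corresponds to h_N.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
bert = BertModel.from_pretrained("bert-base-cased")    # N Transformer encoder blocks

enc = tokenizer("Marie Curie was born in Warsaw", return_tensors="pt")
with torch.no_grad():
    h_N = bert(**enc).last_hidden_state                # (1, L, 768) deep bidirectional representation
```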
Step S202, establishing a named entity subject decoder and object decoder based on a fully connected neural network to extract the candidate subjects and candidate objects in the subword sequence, and taking the output h_N of the final block of the encoder as the input of the decoders; for each token i in the subword sequence, the probabilities that the token is a subject span start point, a subject span end point, an object span start point and an object span end point are calculated, with the formulas respectively as follows:
p_i^{start_s} = σ(W_{start_s} h_i + b_{start_s})
p_i^{end_s} = σ(W_{end_s} h_i + b_{end_s})
p_i^{start_o} = σ(W_{start_o} h_i + b_{start_o})
p_i^{end_o} = σ(W_{end_o} h_i + b_{end_o})
wherein W_τ and b_τ, τ ∈ {start_s, end_s, start_o, end_o}, represent the learnable weight parameters and bias parameters in the fully connected neural network, and σ is the sigmoid activation function;
the calculated probability values p_i^τ, τ ∈ {start_s, end_s, start_o, end_o}, are compared against a preset threshold of 0.5 (the threshold is a hyperparameter set manually by combining prior knowledge with hyperparameter experiments); the system judges whether the token carries the tag corresponding to each type according to whether the output probability value exceeds the threshold: if so, the corresponding tag d_i^τ, τ ∈ {start_s, end_s, start_o, end_o}, is assigned 1, otherwise the tag is assigned 0;
according to the tags determined above, the corresponding judgment sequences of subject span start points, subject span end points, object span start points and object span end points are output: d_{start_s}, d_{end_s}, d_{start_o} and d_{end_o};
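The four start/end probability heads of step S202 can be sketched as sigmoid-activated linear layers over each token encoding, thresholded at 0.5; the module layout and hidden size are assumptions of this sketch:

```python
# Sketch of step S202: four linear heads score each token encoding h_i as a
# subject/object span start or end; tags d are 1 where the probability > 0.5.
import torch
import torch.nn as nn

class SpanTagger(nn.Module):
    def __init__(self, hidden: int = 768):
        super().__init__()
        self.heads = nn.ModuleDict({
            tag: nn.Linear(hidden, 1)
            for tag in ("start_s", "end_s", "start_o", "end_o")})

    def forward(self, h: torch.Tensor, threshold: float = 0.5):
        # h: (L, hidden) token encodings from the named entity encoder
        probs = {t: torch.sigmoid(head(h)).squeeze(-1) for t, head in self.heads.items()}
        tags = {t: (p > threshold).long() for t, p in probs.items()}  # d_start_s, d_end_s, ...
        return probs, tags

tagger = SpanTagger()
probs, tags = tagger(torch.randn(9, 768))   # stand-in encodings for a 9-token sentence
```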
Step S203, for each 1 tag in the subject start point judgment sequence d_{start_s}, searching for the nearest 1 tag to its right in the subject end point judgment sequence d_{end_s} to form a potential subject span sub_i;
the above operation is carried out on the 1 tags in all subject and object start point judgment sequences, and a potential subject span sequence H_sub = (sub_1, ..., sub_m) and a potential object span sequence H_obj = (obj_1, ..., obj_n) are respectively output; pairwise combination forms the potential subject-object span pair sequence H = ((sub_1, obj_1), ..., (sub_{m×n}, obj_{m×n}));
wherein m and n are respectively the number of potential subjects and the number of potential objects extracted from the subword sequence.
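Step S203's nearest-end matching and pairwise combination can be sketched as follows; the 0/1 judgment sequences here are illustrative stand-ins for the tagger output:

```python
# Sketch of step S203: match each start tag to the nearest end tag on its
# right, then combine every subject span with every object span (m x n pairs).
def decode_spans(d_start, d_end):
    """d_start, d_end: 0/1 judgment sequences; returns inclusive (start, end) spans."""
    ends = [i for i, v in enumerate(d_end) if v == 1]
    spans = []
    for s in (i for i, v in enumerate(d_start) if v == 1):
        right = [e for e in ends if e >= s]      # nearest 1 tag to the right
        if right:
            spans.append((s, right[0]))
    return spans

d_start_s, d_end_s = [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0]   # illustrative tags
d_start_o, d_end_o = [0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1]

subjects = decode_spans(d_start_s, d_end_s)    # H_sub = (sub_1, ..., sub_m)
objects = decode_spans(d_start_o, d_end_o)     # H_obj = (obj_1, ..., obj_n)
pairs = [(s, o) for s in subjects for o in objects]   # H: all m x n subject-object pairs
print(pairs)   # [((1, 2), (4, 5))]
```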
Preferably, the specific steps of the step S30 are as follows:
Step S301, constructing a text-subject-object weakly correlated semantic characterization mechanism, and inputting entity type information to weaken the subject-object semantic association information; for a given subject-object span pair (sub_i, obj_j), i ≠ j, the corresponding entity type characterization vectors e(sub_i) and e(obj_j) are used to replace the representation vectors of the corresponding spans in the subword sequence {t_1, ..., t_L}, so as to weaken the subject-object semantic association information, and a new text representation sequence T = {t_1, ..., t_{L_2}} is output, wherein L_2 is the sequence length of the subwords after replacement; meanwhile, the positions of the type characterization vectors e(sub_i), e(obj_j) in the new sequence T are output, wherein (s_1, ..., s_m) represents the subject replacement position sequence, m is the subject replacement length, (o_1, ..., o_n) represents the object replacement position sequence, and n is the object replacement length.
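One possible reading of the replacement in step S301, sketched below, collapses each selected span into a single entity-type vector (whether the type vector fills one position or the whole span is an assumption of this sketch):

```python
# Sketch of step S301: replace the subword vectors of one subject span and one
# object span with the type characterization vectors e(sub_i), e(obj_j),
# producing the weakly associated sequence T plus the replacement positions.
import torch

def replace_spans(h, sub_span, obj_span, e_sub, e_obj):
    """h: (L, d) subword encodings; spans are inclusive (start, end) pairs;
    e_sub, e_obj: (d,) entity-type characterization vectors."""
    pieces, sub_pos, obj_pos = [], [], []
    i = 0
    while i < h.size(0):
        if i == sub_span[0]:
            sub_pos.append(len(pieces)); pieces.append(e_sub); i = sub_span[1] + 1
        elif i == obj_span[0]:
            obj_pos.append(len(pieces)); pieces.append(e_obj); i = obj_span[1] + 1
        else:
            pieces.append(h[i]); i += 1
    return torch.stack(pieces), sub_pos, obj_pos   # T with length L_2, (s_...), (o_...)

h = torch.randn(9, 768)                            # stand-in subword encodings
T, sub_pos, obj_pos = replace_spans(h, (1, 2), (4, 5), torch.randn(768), torch.randn(768))
print(T.shape, sub_pos, obj_pos)                   # torch.Size([7, 768]) [1] [3]
```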
Preferably, the specific steps of the step S40 are as follows:
Step S401, for a subject-object pair (sub_i, obj_j), i ≠ j, constructing a relation encoder based on the BERT neural network representation model, and taking the new text characterization sequence T = {t_1, ..., t_{L_2}} as the input of the system encoder; the bidirectional context information of each token is deeply encoded by N Transformer encoder blocks in sequence through the fine-tuned parameters, and a deep bidirectional language representation vector sequence h_α = Trans(h_{α-1}), α ∈ [1, N], is output, wherein Trans represents a Transformer encoder block and h_{α-1} represents the encoding result of the previous Transformer encoder block; the output of the relation encoder is the encoding result of the last Transformer encoder block, namely the text-subject-object weakly correlated contextual semantic representation h_N = {h_1, ..., h_{L_2}}, wherein h_i, i ∈ {1, 2, ..., L_2}, is the context encoding result of token t_i of the subword sequence.
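Because the replaced sequence T consists of vectors rather than token ids, a sketch of step S401 can feed it to a second BERT instance through the inputs_embeds argument of the HuggingFace transformers API (using inputs_embeds for this purpose is an implementation choice of the sketch):

```python
# Sketch of step S401: re-encode the weakly associated sequence T with a
# relation encoder; BERT accepts pre-built vectors via inputs_embeds.
import torch
from transformers import BertModel

relation_bert = BertModel.from_pretrained("bert-base-cased")  # illustrative checkpoint

T = torch.randn(1, 7, 768)              # stand-in for the replaced sequence (batch, L_2, d)
with torch.no_grad():
    out = relation_bert(inputs_embeds=T)
h = out.last_hidden_state               # h_N = {h_1, ..., h_{L_2}}: weakly correlated context
```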
Preferably, the specific steps of the step S50 are as follows:
Step S501, constructing a relation decoder based on a fully connected linear neural network, and calculating, for a subject-object pair (sub_i, obj_j), i ≠ j, the probability p_{i,j,k} of outputting the relation r_k when the natural language text sentence input to the system is s = {w_1, ..., w_l}; the formulas are as follows:
H_sub = MaxPooling(h_{s_1}, ..., h_{s_m})
H_obj = MaxPooling(h_{o_1}, ..., h_{o_n})
H = H_sub + H_obj
p_{i,j,k} = σ(W(H; e(r_k)) + b)
wherein e(r_k) is the characterization vector of the relation r_k; h_{s_1}, ..., h_{s_m} and h_{o_1}, ..., h_{o_n} are respectively the values of the semantic representation output by the encoder at the positions (s_1, ..., s_m) and (o_1, ..., o_n); MaxPooling represents the max pooling layer operation; the output subject representation H_sub and object representation H_obj are added to form the overall entity representation H; W and b are respectively the learnable weight parameter and bias parameter in the fully connected linear neural network; and σ is the sigmoid activation function;
if the calculated probability value p_{i,j,k} exceeds a preset threshold of 0.6 (the threshold is a hyperparameter set manually by combining prior knowledge with hyperparameter experiments), the system judges that the triple holds, i.e., the subject-object pair (sub_i, obj_j), i ≠ j, is considered to have the relation r_k when the natural language text sentence is s = {w_1, ..., w_l}; the occurrence probability is calculated for every entity pair in H = ((sub_1, obj_1), ..., (sub_{m×n}, obj_{m×n})) and every relation r_k ∈ R, and the final output is the extraction result formed by all triples whose probability exceeds the preset threshold, namely the relation extraction result Result = ((sub_1, r_1, obj_1), ..., (sub_n, r_n, obj_n)) of the natural language text sentence s = {w_1, ..., w_l}, wherein n is the number of extracted triples.
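A minimal sketch of the fusion in step S501, reading the semicolon in (H; e(r_k)) as vector concatenation (an assumption consistent with the dimensions of a single linear layer):

```python
# Sketch of step S501: max-pool the encoder outputs at the subject and object
# replacement positions, add them into H, concatenate with e(r_k), and score
# the triple with a sigmoid-activated linear layer against the 0.6 threshold.
import torch
import torch.nn as nn

class RelationDecoder(nn.Module):
    def __init__(self, hidden: int = 768):
        super().__init__()
        self.linear = nn.Linear(2 * hidden, 1)   # learnable W, b over (H; e(r_k))

    def forward(self, h, sub_pos, obj_pos, e_r):
        H_sub = h[sub_pos].max(dim=0).values     # MaxPooling over subject positions
        H_obj = h[obj_pos].max(dim=0).values     # MaxPooling over object positions
        H = H_sub + H_obj                        # overall entity representation
        return torch.sigmoid(self.linear(torch.cat([H, e_r], dim=-1)))

decoder = RelationDecoder()
h = torch.randn(7, 768)                          # relation-encoder output for one pair
p = decoder(h, [1], [3], torch.randn(768))       # p_{i,j,k} for relation r_k
print(p.item() > 0.6)                            # keep the triple if above the threshold
```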
The beneficial effects of the invention are as follows:
1) The invention takes natural language text as the research object and provides a relation extraction method fusing entity type representation and relation representation. Starting from semantic representation, entity type representation and relation representation, a novel relation extraction model is provided by utilizing a text-subject-object weakly correlated semantic representation mechanism and a relation feature fusion mechanism, so that the context information of sentences can be effectively captured and entity pairs and relations in unstructured text can be extracted.
2) The method outputs entity type characterization and relation characterization in addition to semantic characterization; it extracts subjects and objects simultaneously using a BERT-based neural network model, pairs the extracted subjects and objects, replaces the encoded information of the subject and object with their type information according to the pairing result to obtain new semantic information, re-encodes the new semantic information with a BERT-based neural network model, and predicts whether a relation exists using max pooling and a multi-layer perceptron. The invention designs a text-subject-object weakly correlated semantic characterization mechanism, which replaces entity semantic information by introducing entity type information, thereby reducing the dependence of the extraction model on subject-object semantic association.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to FIG. 1, the invention provides a technical scheme of a relation extraction method for fusing weak correlation semantic representation and relation representation, which comprises the following specific steps:
Step 1, for a natural language text input to the system, the semantic information, entity type information and relation information of the text are encoded based on the WordPiece tokenization method, and a WordPiece semantic representation, an entity type representation and a relation representation are output. In step 1-1, the natural language text input to the system is a word sequence s = {w_1, ..., w_l}, wherein w_i, i ∈ {1, 2, ..., l}, represents the i-th word in the sentence and l is the number of words contained in the sentence to be extracted. The system constructs a WordPiece representation model based on BPE (byte-pair encoding) to represent the words in a vector space, divides each word in the input sentence into fine-grained subwords, and outputs a subword representation sequence {t_1, ..., t_L}, wherein t_i, i ∈ {1, 2, ..., L}, represents the i-th subword in the sentence and L is the subword length of the sentence to be extracted after WordPiece division;
Step 1-2, vector characterization is carried out on the entity types and relation types pre-input to the system, wherein ℰ is the set of entity types and R is the set of relation types; for any entity type e ∈ ℰ and any relation type r ∈ R input to the system, an entity type characterization model and a relation characterization model based on a multi-layer perceptron are respectively constructed, and the discrete entity type symbols and relation type symbols are converted into continuous high-dimensional characterization vectors e(e) and e(r), so as to output fine-grained semantic information of the entity types and relation types.
Step 2, further extracting the subject and the object in the text by utilizing BERT and a binary tagging method based on the output WordPiece semantic representation;
Step 2-1, a named entity encoder based on the BERT neural network representation model is constructed. Taking the subword sequence {t_1, ..., t_L} as the input of the system encoder, the bidirectional context information of each token is deeply encoded by N Transformer encoder blocks in sequence through the fine-tuned parameters, and a deep bidirectional language representation vector sequence h_α = Trans(h_{α-1}), α ∈ [1, N], is output, wherein Trans represents a Transformer encoder block and h_{α-1} represents the encoding result of the previous Transformer encoder block;
Step 2-2, a named entity subject decoder and object decoder based on the fully connected neural network are established to extract the candidate subjects and candidate objects in the subword sequence. Taking the output h_N of the final block of the encoder as the input of the decoders, for each token i in the subword sequence, the probabilities that the token is a subject span start point, a subject span end point, an object span start point and an object span end point are calculated, with the formulas respectively as follows:
p_i^{start_s} = σ(W_{start_s} h_i + b_{start_s})
p_i^{end_s} = σ(W_{end_s} h_i + b_{end_s})
p_i^{start_o} = σ(W_{start_o} h_i + b_{start_o})
p_i^{end_o} = σ(W_{end_o} h_i + b_{end_o})
wherein W_τ and b_τ, τ ∈ {start_s, end_s, start_o, end_o}, represent the learnable weight parameters and bias parameters in the fully connected neural network, and σ is the sigmoid activation function.
The calculated probability values p_i^τ, τ ∈ {start_s, end_s, start_o, end_o}, are compared against a preset threshold of 0.5 (the threshold is a hyperparameter set manually by combining prior knowledge with hyperparameter experiments); the system judges whether the token carries the tag corresponding to each type according to whether the output probability value exceeds the threshold: if so, the corresponding tag d_i^τ, τ ∈ {start_s, end_s, start_o, end_o}, is assigned 1, otherwise the tag is assigned 0;
according to the tags determined above, the system outputs the corresponding judgment sequences of subject span start points, subject span end points, object span start points and object span end points: d_{start_s}, d_{end_s}, d_{start_o} and d_{end_o};
In step 2-3, for each 1 tag in the subject start point judgment sequence d_{start_s}, the system searches for the nearest 1 tag to its right in the subject end point judgment sequence d_{end_s} to form a potential subject span sub_i, and performs the same operation on the object judgment sequences to output potential object spans obj_i. The above operation is performed on the 1 tags in all subject and object start point judgment sequences, and a potential subject span sequence H_sub = (sub_1, ..., sub_m) and a potential object span sequence H_obj = (obj_1, ..., obj_n) are output respectively. Pairwise combination forms the potential subject-object span pair sequence H = ((sub_1, obj_1), ..., (sub_{m×n}, obj_{m×n})), wherein m and n are respectively the number of potential subjects and the number of potential objects extracted from the subword sequence;
Step 3, replacing the extracted word-sense representations of the subject and the object with the output entity type representations to weaken the subject-object semantic association information, constructing a weakly correlated semantic representation mechanism for the subject and the object in the text, and generating a new text with weak semantic association between the subject and the object;
Step 3-1, a text-subject-object weakly correlated semantic characterization mechanism is constructed, and additional entity type information is input to weaken the subject-object semantic association information. For a given subject-object span pair (sub_i, obj_j), i ≠ j, the system uses the corresponding entity type characterization vectors e(sub_i) and e(obj_j) to replace the representation vectors of the corresponding spans in the subword sequence {t_1, ..., t_L} in order to weaken the subject-object semantic association information, and outputs a new text representation sequence T = {t_1, ..., t_{L_2}}, wherein L_2 is the sequence length of the subwords after replacement. Meanwhile, the positions of the type characterization vectors e(sub_i), e(obj_j) in the new sequence T are output, wherein (s_1, ..., s_m) represents the subject replacement position sequence, m is the subject replacement length, (o_1, ..., o_n) represents the object replacement position sequence, and n is the object replacement length;
Step 4, constructing a relation encoder based on the BERT representation model, encoding the new text with weak semantic association, extracting high-level abstract semantic information in the text, and outputting a text-subject-object weakly correlated contextual semantic vector representation by combining the bidirectional context information;
Step 4-1, for the subject-object pair (sub_i, obj_j), i ≠ j, a relation encoder based on the BERT neural network representation model is constructed. Taking the new text characterization sequence T = {t_1, ..., t_{L_2}} as the input of the system encoder, the bidirectional context information of each token is deeply encoded by N Transformer encoder blocks in sequence through the fine-tuned parameters, and a deep bidirectional language representation vector sequence h_α = Trans(h_{α-1}), α ∈ [1, N], is output, wherein Trans represents a Transformer encoder block and h_{α-1} represents the encoding result of the previous Transformer encoder block. The output of the relation encoder is the encoding result of the last Transformer encoder block, namely the text-subject-object weakly correlated contextual semantic representation h_N = {h_1, ..., h_{L_2}}, wherein h_i, i ∈ {1, 2, ..., L_2}, is the context encoding result of token t_i of the subword sequence;
Step 5, constructing a fusion mechanism of the text-subject-object weakly correlated contextual semantic information and the relation information, wherein the fused characterization vector is used to capture the subject-relation-object triples;
Step 5-1, a relation decoder based on a fully connected linear neural network is constructed, and the probability p_{i,j,k} of outputting the relation r_k for a subject-object pair (sub_i, obj_j), i ≠ j, when the natural language text sentence input to the system is s = {w_1, ..., w_l} is calculated; the formulas are as follows:
H_sub = MaxPooling(h_{s_1}, ..., h_{s_m})
H_obj = MaxPooling(h_{o_1}, ..., h_{o_n})
H = H_sub + H_obj
p_{i,j,k} = σ(W(H; e(r_k)) + b)
wherein e(r_k) is the characterization vector of the relation r_k; h_{s_1}, ..., h_{s_m} and h_{o_1}, ..., h_{o_n} are respectively the values of the semantic representation output by the encoder at the positions (s_1, ..., s_m) and (o_1, ..., o_n); MaxPooling represents the max pooling layer operation; the output subject representation H_sub and object representation H_obj are added to form the overall entity representation H; W and b are respectively the learnable weight parameter and bias parameter in the fully connected linear neural network; and σ is the sigmoid activation function;
If the calculated probability value p_{i,j,k} exceeds the preset threshold of 0.6 (the threshold is a hyperparameter set manually by combining prior knowledge with hyperparameter experiments), the system judges that the triple holds, i.e., the subject-object pair (sub_i, obj_j), i ≠ j, is considered to have the relation r_k when the natural language text sentence is s = {w_1, ..., w_l}. The occurrence probability is calculated for every entity pair in H = ((sub_1, obj_1), ..., (sub_{m×n}, obj_{m×n})) and every relation r_k ∈ R, and the final output of the system is the extraction result formed by all triples whose probability exceeds the preset threshold of 0.6, namely the relation extraction result Result = ((sub_1, r_1, obj_1), ..., (sub_n, r_n, obj_n)) of the natural language text sentence s = {w_1, ..., w_l}, wherein n is the number of extracted triples.
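To illustrate how step 5 assembles the final output, the short sketch below filters a stand-in probability tensor p[i][j][k] over all candidate pairs and relation types against the 0.6 threshold; all names and scores are illustrative:

```python
# Sketch of the final assembly: keep every (subject, relation, object) triple
# whose probability p[i][j][k] exceeds the preset 0.6 threshold.
import torch

subjects = [("Marie Curie", (1, 3))]              # candidate subject spans (illustrative)
objects = [("Warsaw", (7, 7))]                    # candidate object spans (illustrative)
relations = ["born_in", "works_for"]
p = torch.tensor([[[0.93, 0.04]]])                # stand-in scores, shape (m, n, |R|)

result = [(subjects[i][0], relations[k], objects[j][0])
          for i in range(len(subjects))
          for j in range(len(objects))
          for k in range(len(relations))
          if p[i, j, k] > 0.6]
print(result)   # [('Marie Curie', 'born_in', 'Warsaw')]
```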
The method takes natural language text as the research object and designs a relation extraction method fusing entity type representation and relation representation. Starting from semantic representation, entity type representation and relation representation, a novel relation extraction model is provided by utilizing a text-subject-object weakly correlated semantic representation mechanism and a relation feature fusion mechanism, which can effectively capture the context information of sentences and realize the extraction of entity pairs and relations in unstructured text.
Although the present invention has been described with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the described embodiments or that equivalents may be substituted for elements thereof; any modifications, equivalents, improvements and changes made without departing from the spirit and principles of the present invention are intended to fall within the scope of the invention.