Background
Today, with the explosion of the mobile internet, people are constantly publishing information on social media, and this information constitutes a huge amount of social media data. Compared with traditional newswire text, social media data are more timely and contain rich information, and they have gradually become a potential information source for a variety of applications, such as news hotspot tracking, public opinion analysis and early warning of potential violent incidents. Therefore, how to mine potential information from social media data has become an important task. Entity extraction is a basic task of information extraction; a powerful entity extraction system is indispensable to building the above applications and has extremely high social and economic value.
In recent years, with the rise of deep neural network models, end-to-end neural-network-based models have become the mainstream approach to named entity recognition. These methods can be broadly classified into the following categories: word-based representations, character-based representations, phrase-based representations, or any combination of the foregoing. Although these methods have achieved good performance on news text, their performance drops dramatically when confronted with social media data, owing to the inherent features of social media such as informal expressions, irregular abbreviations, ungrammatical expressions and a larger number of unknown words.
Applicants have discovered that affixes, as morphemes carrying certain semantics, can to some extent assist in identifying whether a word is part of an entity. For simplicity, only the most common affixes, namely prefixes and suffixes, are considered. Introducing affix feature representations brings two benefits. First, words with the same affix tend to have similar meanings, so introducing affix representations can enrich the semantic representations of words; for example, words with the same prefix "auto-", such as "autopen" and "automat", all carry the meaning of "automatic". Second, some affixes themselves carry the semantics of named entities; for example, the suffix "-ie" is derived from Old English and is commonly found in nicknames, personal names, children's words and colloquial language, so a word ending in "-ie" is likely to be a person's name.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a social media named entity recognition method based on affix perception. The method combines word embeddings, word-level character representations, and word prefix and suffix feature representations, capturing the affix feature representations of words with a bidirectional recurrent neural network. Introducing the affix feature representations enriches the semantic representation of words, alleviates the unknown-word problem in social media data, and improves the effect of named entity recognition. The method generalizes to some extent and is also suitable for named entity recognition in fields such as news.
The purpose of the invention can be realized by the following technical scheme:
a social media named entity recognition method based on affix perception comprises the following steps:
collecting a social media data set marked with named entities, wherein each piece of data comprises an original text and is marked with the named entities;
preprocessing texts in the data set, and constructing index vector representation of the texts at a word level and index vector representation of the texts at a character level;
capturing the word embedding representation, the character-level representation and the affix feature representation of each word by adopting a recurrent neural network and a word embedding technique, and fusing the word embedding representation, the character-level representation and the affix feature representation as the final representation of the word;
inputting the final representation of the obtained words into a bidirectional recurrent neural network and a conditional random field, predicting a tag sequence and calculating a loss value;
training the model by adopting a stochastic gradient descent algorithm according to the obtained loss value;
and inputting the text into the trained model, and identifying the named entity in the text.
Compared with the prior art, the invention has the following beneficial effects:
The affix-perception-based social media named entity recognition method introduces affix feature representations of words on top of word embeddings and character-level representations of the words, which enriches the semantic representation of words, alleviates the unknown-word problem in social media data and improves the effect of named entity recognition. The method generalizes to some extent and is also suitable for named entity recognition in fields such as news.
Examples
The embodiment provides a method for identifying a social media named entity based on affix perception, and a flowchart of the method is shown in fig. 1, and the method comprises the following steps:
(1) Collecting a social media data set marked with named entities, wherein each piece of data comprises original text and is marked with the named entities.
The collected social media data set is used as a training set.
(2) Preprocessing the texts in the data set, and constructing the index vector representation of the texts at the word level and at the character level.
Specifically, the preprocessing comprises:
replacing all lower case letters in the text with the corresponding upper case letters;
replacing all digits in the text with 0.
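As an illustrative, non-limiting sketch of the two preprocessing rules above (the function name and the use of Python's re module are choices of this example, not mandated by the method):

```python
import re

def preprocess(text: str) -> str:
    """Normalize a text as in step (2): unify letter case and mask digits."""
    text = text.upper()               # replace lower case letters with upper case
    return re.sub(r"\d", "0", text)   # replace every digit with 0
```

For example, preprocess("He won 42 games") yields "HE WON 00 GAMES".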
Specifically, constructing the index representations of the text at the word level and the character level comprises:
(2-1) traversing all texts of the social media data set, and constructing a word dictionary and a character dictionary.
Specifically, the word dictionary is constructed by traversing each word of each text in the data set, adding each previously unseen word to a word list, and assigning an index to each word according to the order of addition, with index values 0, 1, 2 and so on. The word list obtained after the traversal is the word dictionary.
The character dictionary is constructed in the same way, except that each character of each word of each text is traversed.
(2-2) Using the word dictionary and the character dictionary obtained in step (2-1), serializing the text at the word level and the character level.
Serializing the text at the word level and the character level means one-hot encoding each word in each sentence and forming the corresponding index vectors at the word level and the character level.
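Steps (2-1) and (2-2) can be sketched as follows; this is a minimal illustration assuming whitespace tokenization (the actual tokenizer used for social media text is not specified by the method):

```python
def build_dicts(texts):
    """Step (2-1): assign indices 0, 1, 2, ... to words and characters in order of first appearance."""
    word2idx, char2idx = {}, {}
    for text in texts:
        for word in text.split():
            if word not in word2idx:
                word2idx[word] = len(word2idx)
            for ch in word:
                if ch not in char2idx:
                    char2idx[ch] = len(char2idx)
    return word2idx, char2idx

def serialize(text, word2idx, char2idx):
    """Step (2-2): map a sentence to word-level and character-level index vectors."""
    words = text.split()
    word_ids = [word2idx[w] for w in words]
    char_ids = [[char2idx[c] for c in w] for w in words]
    return word_ids, char_ids
```

The index vectors are the positions of the ones in the corresponding one-hot encodings.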
(3) Capturing the word embedding representation, the character-level representation and the affix feature representation of each word by adopting a recurrent neural network and a word embedding technique, and fusing the word embedding representation, the character-level representation and the affix feature representation as the final representation of the word, which comprises the following steps:
(3-1) The word-level serialization of a text is denoted s = {w_1, w_2, ..., w_n}, where n denotes the number of words in the text and w_i ∈ R^v denotes the one-hot encoding of the i-th word of the sentence, v being the number of words in the word dictionary. Inputting s into the word embedding layer yields the corresponding word embedding representation e_i = W_e · w_i, where W_e ∈ R^{d×v} denotes the trainable parameter matrix of the word embedding layer and d denotes the dimension of the word embedding vector.
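The embedding lookup of step (3-1) can be written as a matrix-vector product, as sketched below with numpy (the toy sizes v and d are illustrative only; in practice W_e is learned during training):

```python
import numpy as np

rng = np.random.default_rng(0)
v, d = 5, 4                        # toy vocabulary size and embedding dimension
W_e = rng.standard_normal((d, v))  # trainable word-embedding matrix, shape d x v

def embed(word_index: int) -> np.ndarray:
    """e_i = W_e . w_i, where w_i is the one-hot encoding of the word."""
    one_hot = np.zeros(v)
    one_hot[word_index] = 1.0
    return W_e @ one_hot           # equivalent to selecting column word_index of W_e
```

Because w_i is one-hot, the product simply selects one column of W_e, which is how embedding layers are implemented efficiently in practice.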
(3-2) Let the character-level serialization of the i-th word in the text be denoted w_i = {c_{i,1}, c_{i,2}, ..., c_{i,m}}, where m denotes the number of characters contained in the i-th word, c_{i,j} ∈ R^{v_c} denotes the one-hot encoding of the j-th character of the i-th word in the sentence, and v_c is the number of characters in the character dictionary. A bidirectional recurrent neural network is used to extract the character-level representation of the word. First, each character of the word is input into the character embedding layer to obtain the corresponding character embedding e^c_{i,j} = W_c · c_{i,j}, where W_c denotes the parameter matrix of the character embedding layer, with size d_c × v_c. The character embeddings are then input into a bidirectional recurrent neural network to obtain, for each character, the forward hidden state vector h^f_{i,j} and the backward hidden state vector h^b_{i,j}. Finally, the last hidden state vector of the forward recurrent network and the last hidden state vector of the backward recurrent network are concatenated as the character-level representation of the word, h^c_i = [h^f_{i,m}; h^b_{i,1}].
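The bidirectional pass of step (3-2) is sketched below with a minimal vanilla tanh RNN cell in numpy; this is an illustrative stand-in, as an actual embodiment would typically use trained LSTM or GRU cells:

```python
import numpy as np

class BiRNN:
    """Minimal vanilla bidirectional RNN, a stand-in for the character-level
    recurrent network of step (3-2)."""

    def __init__(self, d_in, d_hid, seed=0):
        rng = np.random.default_rng(seed)
        # one weight matrix per direction over [input; previous hidden state]
        self.Wf = rng.standard_normal((d_hid, d_in + d_hid)) * 0.1
        self.Wb = rng.standard_normal((d_hid, d_in + d_hid)) * 0.1
        self.d_hid = d_hid

    def _run(self, xs, W):
        h, states = np.zeros(self.d_hid), []
        for x in xs:
            h = np.tanh(W @ np.concatenate([x, h]))
            states.append(h)
        return states

    def char_repr(self, char_embeddings):
        fwd = self._run(char_embeddings, self.Wf)        # left-to-right pass
        bwd = self._run(char_embeddings[::-1], self.Wb)  # right-to-left pass
        # concatenate the last hidden state of each direction
        return np.concatenate([fwd[-1], bwd[-1]])
```

For a word of any length m, the resulting vector has the fixed dimension 2 × d_hid.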
(3-3) For simplicity, the first t characters of each word are considered the prefix of the word and, similarly, the last t characters of each word are considered its suffix, t being a hyperparameter. As in step (3-2), the character serialization of the i-th word is w_i = {c_{i,1}, c_{i,2}, ..., c_{i,m}}, where c_{i,j} ∈ R^{v_c} is the one-hot encoding of the j-th character and v_c is the character dictionary size; each character is input into the character embedding layer (parameter matrix W_c of size d_c × v_c), and the character embeddings are input into a bidirectional recurrent neural network to obtain the forward and backward hidden state vectors of each character. Then the hidden state vectors of the first t characters are concatenated to obtain a matrix of dimension d_v × t; this matrix contains the prefix information. In particular, if the length of a word is less than t, the hidden state vectors of all time steps are concatenated, and the resulting matrix has dimension d_v × m. To ensure that the prefix features of all words have a consistent dimension, an averaging operation is performed on the hidden state matrix, i.e., the matrix is averaged over its second dimension, finally yielding the prefix feature representation p_i ∈ R^{d_v}. Similarly, the hidden state vectors corresponding to the last t characters of the word are processed in the same way to obtain the suffix feature representation q_i of the word.
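The averaging of step (3-3) can be sketched as follows; numpy slicing naturally handles the special case of words shorter than t, since H[:t] returns all m rows when m < t:

```python
import numpy as np

def affix_features(hidden_states, t=3):
    """Step (3-3): average the hidden states of the first / last t characters
    to obtain fixed-size prefix and suffix feature vectors."""
    H = np.asarray(hidden_states, dtype=float)  # shape (m, d_v), one row per character
    prefix = H[:t].mean(axis=0)                 # first t rows (all m rows if m < t)
    suffix = H[-t:].mean(axis=0)                # last t rows (all m rows if m < t)
    return prefix, suffix
```

Both outputs have dimension d_v regardless of the word length, which is what allows them to be concatenated with the other word representations in step (3-4).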
(3-4) The word embedding representation, character-level representation, prefix feature representation and suffix feature representation of the word obtained in steps (3-1)-(3-3) are concatenated to obtain the final representation of the word, x_i = [e_i; h^c_i; p_i; q_i], where e_i, h^c_i, p_i and q_i denote the word embedding, character-level, prefix feature and suffix feature representations, respectively.
(4) Inputting the final representation of the obtained words into a bidirectional recurrent neural network and a conditional random field, predicting a label sequence and calculating a loss value, which comprises the following steps:
(4-1) The final representation of each word obtained in step (3) is input into a bidirectional recurrent neural network, and the resulting forward and backward hidden states are concatenated to obtain the word sequence representation h_i.
(4-2) The word sequence representation obtained in step (4-1) is input into a fully connected layer to obtain the score of each word over all labels: P_i = W h_i + b, where W and b are trainable parameters;
(4-3) Let y = {y_1, y_2, ..., y_n} denote the predicted tag sequence corresponding to the input text s, and let Y(s) denote the set of all possible tag sequences for the input sentence s. The scores P_i obtained in step (4-2) are input into the conditional random field, and the score of each possible sequence is calculated according to the following formula:
score(s, y) = Σ_{i=1}^{n-1} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}
wherein A denotes the state transition score matrix of size k × k, A_{i,j} represents the score of a transition from label i to label j, so that A_{y_i, y_{i+1}} indicates the score (likelihood) of the tag y_i being followed by the tag y_{i+1} in the predicted sequence, y_i represents the i-th tag in the predicted tag sequence y, and P_{i, y_i} represents the score (likelihood) that the label of the i-th word of the input text s is y_i.
The highest-scoring tag sequence is taken as the final prediction:
y* = argmax_{y' ∈ Y(s)} score(s, y')
Finally, the loss value is calculated as the negative log-likelihood of the gold tag sequence:
loss = log Σ_{y' ∈ Y(s)} exp(score(s, y')) − score(s, y)
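The two formulas of step (4-3) can be checked with the sketch below. For readability it enumerates all k^n tag sequences by brute force, which is only feasible at toy sizes; a practical embodiment would compute the partition function with the forward algorithm and the best sequence with Viterbi decoding:

```python
import numpy as np
from itertools import product

def sequence_score(P, A, y):
    """score(s, y) = sum_i P[i, y_i] + sum_i A[y_i, y_{i+1}]."""
    emit = sum(P[i, yi] for i, yi in enumerate(y))
    trans = sum(A[y[i], y[i + 1]] for i in range(len(y) - 1))
    return emit + trans

def crf_nll(P, A, gold):
    """Negative log-likelihood: log-sum-exp over all sequences minus the gold score."""
    n, k = P.shape
    scores = [sequence_score(P, A, y) for y in product(range(k), repeat=n)]
    log_Z = np.log(np.sum(np.exp(scores)))   # partition function over Y(s)
    return log_Z - sequence_score(P, A, gold)
```

With all scores equal, every sequence is equally likely and the loss reduces to log of the number of sequences, a useful sanity check.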
(5) Training the model by adopting a stochastic gradient descent algorithm according to the loss value obtained in step (4) to obtain a trained model.
When the loss value of the model no longer decreases, training is complete.
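The stopping criterion of step (5) can be sketched as a generic early-stopping loop; the callback model_step (one stochastic gradient descent pass returning the epoch loss) and the patience threshold are hypothetical details of this example, not specified by the method:

```python
def train(model_step, max_epochs=100, patience=3):
    """Run SGD epochs until the loss no longer decreases (step (5)).
    model_step() performs one training pass and returns the epoch loss."""
    best, wait = float("inf"), 0
    for _ in range(max_epochs):
        loss = model_step()
        if loss < best - 1e-6:       # loss still decreasing: record and reset
            best, wait = loss, 0
        else:                        # no improvement this epoch
            wait += 1
            if wait >= patience:     # stopped decreasing: training is complete
                break
    return best
```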
(6) The text is input into the trained model obtained in step (5), and the named entities in the text are identified.
Fig. 2 is a schematic diagram of a model used for extracting the affix feature representation.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.