Named entity recognition method based on network classification
Technical Field
The invention relates to the fields of natural language processing technology and named entity recognition, in particular to a named entity recognition method based on network classification.
Background
Named entity recognition (NER), also known as proper name recognition, refers to recognizing entities in text that carry a specific meaning, mainly person names, place names, organization names, and other proper nouns. It generally comprises two parts: (1) identifying entity boundaries; (2) determining the entity class (person name, place name, organization name, or other). NER is a fundamental task in natural language processing (NLP). Within the NLP pipeline, NER can be regarded as a form of unregistered-word recognition in lexical analysis; among unregistered words, named entities are the most numerous, the hardest to recognize, and have the greatest influence on word segmentation quality. NER is also the basis of many NLP tasks such as relation extraction, event extraction, knowledge graphs, machine translation, and question-answering systems.
Named entity recognition is a core information extraction task and is urgently needed in practice, but named entities are unbounded in number, flexible in word formation, and fuzzy in category, which makes them difficult to recognize. Conventional classification algorithms consider only the physical characteristics of the data (e.g., similarity, distance, distribution) and ignore its semantic characteristics (e.g., the contextual semantic information present in text).
Traditional classification learning methods, such as SVM and some network-based classification algorithms, use all of the training data in practice, and the noise present in large volumes of data can reduce the efficiency of named entity recognition.
Disclosure of Invention
The invention provides a named entity recognition method based on network classification. It constructs a classification network from a selected subset of named entity samples and uses that network to recognize the named entity samples under test, thereby improving the efficiency of named entity recognition and providing technical support for information extraction, question-answering systems, syntactic analysis, machine translation, and the like.
In order to achieve the above purpose, the invention adopts the following technical scheme:
The invention relates to a named entity recognition method based on network classification, characterized by comprising the following steps:
Step one: named entity classification model training:
Step 1.1: obtain text data of T named entity samples and convert it into vector data $\psi = ((x_1, y_1), (x_2, y_2), \ldots, (x_t, y_t), \ldots, (x_T, y_T))$ using the Word2Vec natural language processing tool, where $(x_t, y_t)$ represents the vector data of the t-th named entity sample; $x_t$ represents the attribute features of the t-th named entity sample, and $x_t^d$ represents the d-th attribute feature of the named entity in the t-th named entity sample; $y_t$ represents the label of the t-th named entity sample, $t = 1, 2, \ldots, T$;
Step 1.2: standardize the attribute features $x_t$ of the t-th named entity sample to obtain the feature vector $\hat{x}_t$ of the t-th named entity sample, where $\hat{x}_t^d$ represents the d-th feature of the named entity in the t-th named entity sample;
Step 1.3: construct two objective functions $f_1$ and $f_2$ using formulas (1) and (2), respectively:
$\min f_1 = Rr(V_s)$ (1)
In formula (1), $V_s$ is the vector data selected from the T vector data $\psi$, and $Rr(V_s)$ is the proportion of the T vector data $\psi$ accounted for by the selected vector data $V_s$;
In formula (2), the objective $f_2$ is expressed in terms of the classification network constructed from the selected vector data $V_s$ and the classification accuracy of that network;
Step 1.4: take S candidate sets of vector data of named entity samples as the initial population $P = \{p_1, \ldots, p_S\}$, where $p_S$ denotes the S-th candidate set of vector data and is treated as one individual;
Encode the initial population P using binary codes of length T: if the t-th bit in the binary code of individual $p_S$ is 1, the attribute feature $x_t$ of the t-th named entity sample is selected and used to construct the classification network;
Step 1.5: define the current iteration number as n and the maximum iteration number as N; initialize n = 1; take the initial population P as the parent population $P_n$ of the n-th iteration;
Step 1.6: by binary tournament, randomly select two individuals $p_x$ and $p_y$ from the parent population $P_n$ of the n-th iteration and construct a classification network from each. If the network constructed from $p_x$ is more accurate than the network constructed from $p_y$, obtain from $P_n$ all individuals of higher accuracy and randomly select one of them, $p_z$; apply crossover and mutation to $p_y$ and $p_z$ to obtain the mutated individuals $p'_y$ and $p'_z$; from $p_y$, $p'_y$, and $p'_z$, select the individual whose classification network has the highest accuracy to replace $p_y$; finally, apply crossover and mutation to the replaced $p_y$ and $p_x$ to generate the offspring $P'_n$ of the n-th iteration;
Step 1.7: merge the parent population $P_n$ of the n-th iteration with the offspring $P'_n$ of the n-th iteration to obtain the merged population of the n-th iteration, and use formula (3) to obtain the importance $IMP(p_n)$ of any individual $p_n$ in the merged population:
$IMP(p_n) = \alpha \times Acc(p_n) + (1 - \alpha) \times (-Red(p_n))$ (3)
In formula (3), $\alpha$ is a compromise factor, $Acc(p_n)$ is the accuracy of individual $p_n$, and $Red(p_n)$ is the redundancy of individual $p_n$, given by:
$Red(p_n) = (a_1 \times b_1 + a_2 \times b_2 + \ldots + a_i \times b_i + \ldots + a_m \times b_m) / m$ (4)
In formula (4), m is the number of individuals in the merged population of the n-th iteration other than $p_n$; $a_i$ is the redundancy in the source space between $p_n$ and the i-th of those other individuals, obtained by dividing the number of named entity samples selected by both $p_n$ and the i-th individual by T, $i \in \{1, \ldots, m\}$; $b_i$ is the redundancy in the accuracy target space between $p_n$ and the i-th individual, obtained by formula (5):
In formula (5), $Acc(i)$ is the accuracy of the classification network constructed from the i-th individual, and $Acc(p_n)$ is the accuracy of the classification network constructed from individual $p_n$;
Step 1.8: obtain the importance of every individual in the merged population of the n-th iteration according to formula (3), rank the individuals in descending order of importance, and select the first S individuals as the parent population $P_{n+1}$ of the (n+1)-th iteration;
Step 1.9: assign n+1 to n and check whether n > N; if so, select the vector data of the named entity samples corresponding to the individual with the highest classification network accuracy in the parent population of the n-th iteration, use it to construct the optimal network classifier, and proceed to step two; otherwise, return to step 1.6;
Step two: named entity recognition:
Step 2.1: input the text data of the named entity sample to be recognized and process it according to steps 1.1 and 1.2 to obtain the feature vector of the sample to be recognized;
Step 2.2: classify the feature vector of the sample to be recognized using the optimal network classifier; the resulting label indicates the named entity corresponding to the sample.
The named entity recognition method based on network classification is further characterized in that the classification network is constructed as a k-associated optimal graph based on Euclidean distance, as follows:
For the feature vectors $\hat{x}_t$, use formula (6) to obtain the Euclidean distance $d_{ti}$ between the feature vector of the t-th named entity sample and the feature vector of the i-th named entity sample, and connect each sample to its k nearest named entities of the same category, thereby forming the classification network:
$d_{ti} = \sqrt{\sum_{d} (\hat{x}_t^d - \hat{x}_i^d)^2}$ (6)
In formula (6), $\hat{x}_t^d$ represents the d-th feature of the named entity in the t-th named entity sample.
Compared with the prior art, the invention has the beneficial effects that:
1. Unlike traditional classification methods, the invention provides a named entity recognition method based on network classification that considers both the physical and the semantic characteristics of named entity sample data. The classification network is constructed from screened training samples, so that noise points are removed and named entities can be recognized more efficiently.
2. The invention defines a two-objective optimization problem: the number of samples in the selected named entity sample set, and the classification accuracy of the network constructed from it. Optimizing both objectives selects high-quality named entity sample data, yields a classification network with a better classification effect, and improves the performance and accuracy of named entity recognition.
3. During iteration, the method adopts an accuracy-preference strategy: sample sets of low accuracy are guided by sample sets of higher accuracy to produce better offspring. This effectively improves the quality of the classification network to be constructed, so that the classifier finally used for named entity recognition has a better classification effect and higher recognition accuracy.
4. When selecting the next generation of named entity sample sets, the method adopts an importance-based selection strategy: all sample sets are ranked by importance so that the better ones enter the next generation. This ensures continuous optimization across iterations, so that the classifier finally used for named entity recognition has a better classification effect and better performance.
Drawings
Fig. 1 is a flow chart of the method of the present invention.
Detailed Description
In this embodiment, a named entity recognition method based on network classification includes a named entity classification model training step and a named entity recognition step. As shown in Fig. 1, the method proceeds as follows:
Step one: named entity classification model training:
Step 1.1: taking person-name entity recognition as an example, obtain text data of T named entity samples and convert it into vector data $\psi = ((x_1, y_1), (x_2, y_2), \ldots, (x_t, y_t), \ldots, (x_T, y_T))$ using the Word2Vec natural language processing tool, where $(x_t, y_t)$ represents the vector data of the t-th named entity sample; $x_t$ represents the attribute features of the t-th named entity sample, and $x_t^d$ represents the d-th attribute feature of the named entity in the t-th named entity sample, i.e., an attribute describing the t-th person name; common attributes include birth date, native place, height, weight, nickname, main contributions, and the like; $y_t$ represents the label of the t-th named entity sample, i.e., the category to which the named entity belongs, which here is a person's name. The named entity recognition problem is thereby converted into a multi-class classification problem, in which $y_t$ is the person name given by the label of the t-th named entity sample, $t = 1, 2, \ldots, T$;
Step 1.2: standardize the attribute features $x_t$ of the t-th named entity sample to obtain the feature vector $\hat{x}_t$ of the t-th named entity sample, where $\hat{x}_t^d$ represents the d-th feature of the named entity in the t-th named entity sample;
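The standardization in step 1.2 is not specified further. A minimal sketch, assuming column-wise z-score normalization (the function name `standardize` and the toy feature values are illustrative, not taken from the method):

```python
from statistics import mean, pstdev

def standardize(samples):
    """Column-wise z-score: (value - column mean) / column standard deviation."""
    columns = list(zip(*samples))
    mus = [mean(c) for c in columns]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by zero.
    sigmas = [pstdev(c) or 1.0 for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, mus, sigmas)] for row in samples]

# Three toy samples with two attribute features each (assumed values).
features = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
normalized = standardize(features)
```

Other schemes, such as min-max scaling, would fit the same step; the choice here is only an assumption.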
Step 1.3: construct two objective functions $f_1$ and $f_2$ using formulas (1) and (2), respectively, both of which are to be minimized:
$\min f_1 = Rr(V_s)$ (1)
In formula (1), $V_s$ is the vector data selected from the T vector data $\psi$, and $Rr(V_s)$ is the proportion of the T vector data $\psi$ accounted for by the selected vector data $V_s$;
In formula (2), the objective $f_2$ is expressed in terms of the classification network constructed from the selected vector data $V_s$ and the classification accuracy of that network;
Step 1.4: take S candidate sets of vector data of named entity samples as the initial population $P = \{p_1, \ldots, p_S\}$, where $p_S$ denotes the S-th candidate set of vector data and is treated as one individual;
Encode the initial population P using binary codes of length T: if the t-th bit in the binary code of individual $p_S$ is 1, the attribute feature $x_t$ of the t-th named entity sample is selected and used to construct the classification network. For example, with 10 named entity samples in total, if bits 3, 5, 8, and 9 of $p_S$ are 1, then the named entity sample set selected by $p_S$ is $(x_3, x_5, x_8, x_9)$;
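The binary encoding of step 1.4 and the first objective $Rr(V_s)$ of formula (1) can be sketched as follows. The helper names `decode` and `selection_ratio` are hypothetical; the bit pattern reproduces the 10-sample example given in the text.

```python
def decode(bits):
    """Return the 1-based indices t of the samples whose bit is 1."""
    return [t for t, bit in enumerate(bits, start=1) if bit == 1]

def selection_ratio(bits):
    """Rr(V_s): share of the T samples that the individual selects."""
    return sum(bits) / len(bits)

# T = 10 samples; bits 3, 5, 8 and 9 are set, as in the example in the text.
p_s = [0, 0, 1, 0, 1, 0, 0, 1, 1, 0]
selected = decode(p_s)
ratio = selection_ratio(p_s)
```

Here `selected` lists the samples $(x_3, x_5, x_8, x_9)$ and `ratio` is the value that objective $f_1$ seeks to minimize.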
Step 1.5: define the current iteration number as n and the maximum iteration number as N; initialize n = 1; take the initial population P as the parent population $P_n$ of the n-th iteration;
Step 1.6: by binary tournament, randomly select two individuals $p_x$ and $p_y$ from the parent population $P_n$ of the n-th iteration, construct a classification network from each, and classify with the constructed networks. If the network constructed from $p_x$ is more accurate than the network constructed from $p_y$, obtain from $P_n$ all individuals of higher accuracy and randomly select one of them, $p_z$; apply crossover and mutation to $p_y$ and $p_z$ to obtain the mutated individuals $p'_y$ and $p'_z$; from $p_y$, $p'_y$, and $p'_z$, select the individual whose classification network has the highest accuracy to replace $p_y$. In this way the weaker of the two individuals is guided and a better guided individual is obtained; finally, apply crossover and mutation to the replaced $p_y$ and $p_x$ to generate the offspring $P'_n$ of the n-th iteration;
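A rough sketch of the guided binary tournament in step 1.6 follows. Several details are assumptions rather than part of the method as stated: single-point crossover with bit-flip mutation stands in for the unspecified crossover and mutation operators, the pair is swapped so that $p_y$ is always the weaker individual, the guide $p_z$ is drawn from individuals more accurate than $p_y$, and `accuracy` is a caller-supplied function that would build the classification network for an individual and return its accuracy.

```python
import random

def cross_mutate(a, b, rng, rate=0.1):
    """Single-point crossover followed by bit-flip mutation (assumed operators)."""
    cut = rng.randrange(1, len(a))
    children = [a[:cut] + b[cut:], b[:cut] + a[cut:]]
    return [[1 - g if rng.random() < rate else g for g in c] for c in children]

def guided_offspring(population, accuracy, rng):
    """One pass of the guided binary tournament over a parent population."""
    p_x, p_y = rng.sample(population, 2)
    if accuracy(p_x) < accuracy(p_y):
        p_x, p_y = p_y, p_x  # assumption: keep p_y as the weaker of the pair
    better = [p for p in population if accuracy(p) > accuracy(p_y)]
    if better:
        p_z = rng.choice(better)
        candidates = [p_y] + cross_mutate(p_y, p_z, rng)
        p_y = max(candidates, key=accuracy)  # the replacement is never worse than p_y
    return cross_mutate(p_y, p_x, rng)

rng = random.Random(0)
population = [[rng.randint(0, 1) for _ in range(8)] for _ in range(6)]
toy_accuracy = lambda bits: sum(bits) / len(bits)  # stand-in for network accuracy
children = guided_offspring(population, toy_accuracy, rng)
```

Because $p_y$ itself is among the candidates, the guidance step cannot lower the accuracy of the individual that goes into the final crossover.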
Step 1.7: merge the parent population $P_n$ of the n-th iteration with the offspring $P'_n$ of the n-th iteration to obtain the merged population of the n-th iteration, and use formula (3) to obtain the importance $IMP(p_n)$ of any individual $p_n$ in the merged population:
$IMP(p_n) = \alpha \times Acc(p_n) + (1 - \alpha) \times (-Red(p_n))$ (3)
In formula (3), $\alpha$ is a compromise factor, usually set to 0.8; $Acc(p_n)$ is the accuracy of individual $p_n$, and $Red(p_n)$ is the redundancy of individual $p_n$. Combining accuracy and redundancy into a single importance value gives a more balanced evaluation of each individual. The redundancy is given by:
$Red(p_n) = (a_1 \times b_1 + a_2 \times b_2 + \ldots + a_i \times b_i + \ldots + a_m \times b_m) / m$ (4)
In formula (4), m is the number of individuals in the merged population of the n-th iteration other than $p_n$; $a_i$ is the redundancy in the source space between $p_n$ and the i-th of those other individuals, obtained by dividing the number of named entity samples selected by both $p_n$ and the i-th individual by T, $i \in \{1, \ldots, m\}$; the larger $a_i$, the higher the redundancy in the source space between $p_n$ and individual i; $b_i$ is the redundancy in the accuracy target space between $p_n$ and the i-th individual. Combining redundancy in the source space with redundancy in the accuracy target space makes the redundancy analysis of each individual clearer and more reasonable and gives it greater weight in the subsequent importance judgment; $b_i$ is obtained by formula (5):
In formula (5), $Acc(i)$ is the accuracy of the classification network constructed from the i-th individual, and $Acc(p_n)$ is the accuracy of the classification network constructed from individual $p_n$; the larger $b_i$, the higher the redundancy in the accuracy target space between $p_n$ and individual i;
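The importance computation of formulas (3) and (4) can be illustrated with a small worked example. Formula (5) is not reproduced in the text, so `accuracy_redundancy` below is only a stand-in (an assumption) in which closer accuracies give a larger $b_i$; the population and accuracy values are likewise invented for illustration.

```python
def source_redundancy(p, q):
    """a_i: number of samples selected by both individuals, divided by T."""
    return sum(1 for u, v in zip(p, q) if u == 1 and v == 1) / len(p)

def accuracy_redundancy(acc_p, acc_q):
    """b_i stand-in (assumption, not formula (5)): closer accuracies, higher value."""
    return 1.0 - abs(acc_p - acc_q)

def importance(index, population, accs, alpha=0.8):
    """IMP = alpha * Acc + (1 - alpha) * (-Red), with Red as in formula (4)."""
    others = [i for i in range(len(population)) if i != index]
    red = sum(
        source_redundancy(population[index], population[i])
        * accuracy_redundancy(accs[index], accs[i])
        for i in others
    ) / len(others)
    return alpha * accs[index] - (1 - alpha) * red

# Three toy individuals over T = 4 samples, with assumed network accuracies.
population = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 1]]
accs = [0.9, 0.7, 0.9]
scores = [importance(i, population, accs) for i in range(len(population))]
```

With these toy values the second individual scores lowest: it is both less accurate and overlaps with each of the other two, so it carries the highest redundancy.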
Step 1.8: obtain the importance of every individual in the merged population of the n-th iteration according to formula (3), rank the individuals in descending order of importance, and select the first S individuals as the parent population $P_{n+1}$ of the (n+1)-th iteration;
Step 1.9: assign n+1 to n and check whether n > N; if so, select the vector data of the named entity samples corresponding to the individual with the highest classification network accuracy in the parent population of the n-th iteration, use it to construct the optimal network classifier, and proceed to step two; otherwise, return to step 1.6;
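The iteration described in steps 1.5 to 1.9 amounts to a breed, merge, rank, and truncate loop. The sketch below shows only that control flow; `make_offspring`, `score`, and `accuracy` are hypothetical callables standing in for step 1.6, formula (3), and the network accuracy, and the toy operators at the bottom are assumptions chosen so the result is easy to follow.

```python
def evolve(population, make_offspring, score, accuracy, max_iter):
    """Run max_iter generations and return the most accurate surviving individual."""
    size = len(population)  # S individuals are kept in every generation
    for _ in range(max_iter):
        merged = population + make_offspring(population)
        # sorted() leaves `merged` intact, so score() may inspect the whole merged population.
        ranked = sorted(merged, key=lambda p: score(p, merged), reverse=True)
        population = ranked[:size]
    return max(population, key=accuracy)

def add_one_bit(population):
    """Toy offspring operator (assumption): switch on the first 0 bit of each parent."""
    offspring = []
    for p in population:
        child = list(p)
        if 0 in child:
            child[child.index(0)] = 1
        offspring.append(child)
    return offspring

toy_accuracy = lambda bits: sum(bits) / len(bits)
toy_score = lambda p, merged: toy_accuracy(p)  # ignores redundancy in this toy run
best = evolve([[0, 0, 0], [1, 0, 0]], add_one_bit, toy_score, toy_accuracy, max_iter=3)
```

In the method itself, the individual returned at the end determines which named entity samples are used to build the optimal network classifier.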
Step two: named entity recognition, in which the sample to be recognized is classified using the optimal network classifier obtained in step one:
Step 2.1: input the text data of the named entity sample to be recognized and process it according to steps 1.1 and 1.2 to obtain the feature vector of the sample to be recognized; common features include birth date, native place, height, weight, nickname, main contributions, and the like;
Step 2.2: classify the feature vector of the sample to be recognized using the optimal network classifier; the resulting label indicates the named entity corresponding to the sample.
In this embodiment, the classification network is constructed as a k-associated optimal graph based on Euclidean distance, as follows:
For the feature vectors $\hat{x}_t$, use formula (6) to obtain the Euclidean distance $d_{ti}$ between the feature vector of the t-th named entity sample and the feature vector of the i-th named entity sample, and connect each sample to its k nearest named entities of the same category, thereby forming the classification network:
$d_{ti} = \sqrt{\sum_{d} (\hat{x}_t^d - \hat{x}_i^d)^2}$ (6)
In formula (6), $\hat{x}_t^d$ represents the d-th feature of the named entity in the t-th named entity sample.
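The network construction described above can be sketched as follows. This is a simplified reading in which each sample is linked to its k nearest same-label samples under Euclidean distance; any further refinement that a full k-associated optimal graph construction may involve is not covered, and the function names and toy data are assumptions.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def build_class_graph(vectors, labels, k):
    """Link every sample to its k nearest samples that share its label."""
    edges = set()
    for t, (x_t, y_t) in enumerate(zip(vectors, labels)):
        same_class = [
            (euclidean(x_t, vectors[i]), i)
            for i in range(len(vectors))
            if i != t and labels[i] == y_t
        ]
        for _, i in sorted(same_class)[:k]:
            edges.add((min(t, i), max(t, i)))  # store each undirected edge once
    return edges

# Five toy samples in two classes (assumed values).
vectors = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [0.0, 3.0]]
labels = ["A", "A", "B", "B", "A"]
graph = build_class_graph(vectors, labels, k=1)
```

With k = 1 the three class-A samples form a chain and the two class-B samples form a single edge, so each class ends up as its own connected component.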
The method was tested and verified using collected real-world data.
1) Obtain text data of named entity samples related to person names, i.e., collect sentences or paragraphs concerning person names from the literature. The real-world text data is converted with the Word2Vec tool into vector data that a computer can process. The processed data set is divided into training samples and test samples; the best training samples are selected by ten-fold cross-validation to construct the classification network, which is then used to recognize the named entities in the test samples.
2) Evaluation index;
Classification accuracy is used as the evaluation index in this example to evaluate named entity recognition performance. The higher the accuracy, the better the classification effect and the more accurate the recognition.
3) Experiments on the data set;
The validity of the invention is verified by experimental results on the data set. In today's highly diversified information environment, it is important to recognize named entities in text accurately and efficiently and to analyze them. The experiments indicate that the method can quickly and effectively extract the key attributes of named entities from massive text and identify their categories, improving the efficiency of named entity recognition and providing a basis for information extraction, question-answering systems, syntactic analysis, machine translation, and the like.
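The evaluation protocol described in items 1) and 2) relies on a ten-fold split and on classification accuracy. A minimal sketch of both, assuming an interleaved fold assignment (the text does not say how folds are formed) and using invented labels for the accuracy call:

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train, held_out) index lists; fold j holds indices j, j + k, j + 2k, ..."""
    folds = [list(range(j, n_samples, k)) for j in range(k)]
    for j, held_out in enumerate(folds):
        train = sorted(i for m, fold in enumerate(folds) if m != j for i in fold)
        yield train, held_out

def classification_accuracy(predicted, actual):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

splits = list(k_fold_indices(20, k=10))
acc = classification_accuracy(["A", "B", "A", "B"], ["A", "B", "B", "B"])
```

Each of the ten splits holds out a different tenth of the samples, and `acc` is the share of correctly labelled samples for one set of predictions.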