CN104899304B - Named entity recognition method and device - Google Patents
Named entity recognition method and device
- Publication number
- CN104899304B CN104899304B CN201510321448.8A CN201510321448A CN104899304B CN 104899304 B CN104899304 B CN 104899304B CN 201510321448 A CN201510321448 A CN 201510321448A CN 104899304 B CN104899304 B CN 104899304B
- Authority
- CN
- China
- Prior art keywords
- word
- measured
- sample
- vector
- entity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Databases & Information Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Machine Translation (AREA)
Abstract
The present invention provides a named entity recognition method and device that can accurately identify named entities, in particular named entities in the e-commerce domain. The method includes: obtaining a vector library; segmenting a training corpus text string into multiple sample words; querying the vector library for each sample word in order to build a first feature vector, the first feature vector comprising the word vector and part-of-speech vector of the sample word and the entity-tag vector of the sample word's previous word; training a neural network named entity recognition model with all first feature vectors as a whole as the input; segmenting a text string to be predicted into multiple test words; querying the vector library for each test word in order to build a second feature vector, the second feature vector comprising the word vector and part-of-speech vector of the test word and the entity-tag vector of the test word's previous word; and inputting the second feature vector of each test word into the model to output the entity tag of that test word.
Description
Technical field
The present invention relates to the field of natural language processing, and in particular to a named entity recognition method and device.
Background technology
With the rapid development of Internet technology, information services have become increasingly popular. Named entity recognition is a fundamental task underlying information-service applications such as information extraction, question answering, syntactic analysis, machine translation, and Internet metadata annotation. A named entity (entity for short) is a name, organization name, place name, or any other entity identified by a name; in a broader sense, named entities also include numbers, dates, currencies, addresses, and so on.
The prior art includes techniques that train named entity recognizers with neural networks. Existing methods have at least the following shortcomings: (1) they rely mainly on the word itself as the input feature, so the model's features are one-sided and the dependencies between consecutive entity tags are not introduced directly, which leads to low recognition accuracy, especially when identifying named entities in the e-commerce domain; (2) because the network's initial values are generated randomly, the final parameter optimization may not be good enough, and the long training time makes development inefficient; (3) the distribution of the training data is not fully taken into account, so the model fits different entities unevenly.
Named entities in the e-commerce domain, such as product names (Nokia 1020, 14-inch ThinkPad E431 laptop), prices, and product attributes, are usually composed of one or more consecutive words in a sentence, and their parts of speech often follow patterns such as "noun + numeral". In short, named entities in the e-commerce domain have salient features, and a recognition method or device tailored to them is urgently needed.
Summary of the invention
In view of this, the present invention provides a named entity recognition method and device that can accurately identify named entities, in particular named entities in the e-commerce domain.
To achieve the above object, according to one aspect of the invention, a named entity recognition method is provided, including: obtaining a vector library, the vector library including word vectors corresponding to multiple words, part-of-speech vectors corresponding to multiple parts of speech, and entity-tag vectors corresponding to multiple entity tags; segmenting a training corpus text string into an ordered sequence of sample words; querying the vector library for each sample word in order to build a first feature vector, the first feature vector comprising the word vector of the sample word, the part-of-speech vector of the sample word, and the entity-tag vector of the sample word's previous word; taking the first feature vectors of all sample words as a whole as the training input of a neural network and solving the network parameters with the backpropagation (BP) neural network algorithm to obtain a neural network named entity recognition model; segmenting a text string to be predicted into an ordered sequence of test words; querying the vector library for each test word in order to build a second feature vector, the second feature vector comprising the word vector of the test word, the part-of-speech vector of the test word, and the entity-tag vector of the test word's previous word; and inputting the second feature vector of each test word into the neural network named entity recognition model to output the entity tag of that test word.
Optionally, the first feature vector further includes the word vectors and part-of-speech vectors of the sample word's neighboring words, and the second feature vector further includes the word vectors and part-of-speech vectors of the test word's neighboring words.
Optionally, when the first feature vector is built for the first sample word in the ordered sequence of sample words, the previous word of the first sample word is a preset character string, and when the second feature vector is built for the first test word in the ordered sequence of test words, the previous word of the first test word is a preset character string.
Optionally, the training input of the neural network also includes negative-example samples.
To achieve the above object, according to another aspect of the invention, a named entity recognition device is provided, including: a vector library acquisition module for obtaining a vector library, the vector library including word vectors corresponding to multiple words, part-of-speech vectors corresponding to multiple parts of speech, and entity-tag vectors corresponding to multiple entity tags; a first segmentation module for segmenting a training corpus text string into an ordered sequence of sample words; a first construction module for querying the vector library for each sample word in order to build a first feature vector, the first feature vector comprising the word vector of the sample word, the part-of-speech vector of the sample word, and the entity-tag vector of the sample word's previous word; a training module for taking the first feature vectors of all sample words as a whole as the training input of a neural network and solving the network parameters with the BP neural network algorithm to obtain a neural network named entity recognition model; a second segmentation module for segmenting a text string to be predicted into an ordered sequence of test words; a second construction module for querying the vector library for each test word in order to build a second feature vector, the second feature vector comprising the word vector of the test word, the part-of-speech vector of the test word, and the entity-tag vector of the test word's previous word; and a prediction module for inputting the second feature vector of each test word into the neural network named entity recognition model to output the entity tag of that test word.
Optionally, the first feature vector further includes the word vectors and part-of-speech vectors of the sample word's neighboring words, and the second feature vector further includes the word vectors and part-of-speech vectors of the test word's neighboring words.
Optionally, the first construction module is further configured to use a preset character string as the previous word of the first sample word when building the first feature vector for the first sample word in the ordered sequence of sample words, and the second construction module is further configured to use a preset character string as the previous word of the first test word when building the second feature vector for the first test word in the ordered sequence of test words.
Optionally, in the training module, the training input of the neural network also includes negative-example samples.
According to the technical scheme of the invention, a more reasonable feature vector is used to train the model and to make predictions with it. The feature vector contains not only the feature of the current word itself, but also the part-of-speech feature of the current word and the entity-tag feature of the current word's previous word. Compared with existing techniques that consider only the word itself, the information considered is more comprehensive, so the final recognition result is more accurate, and the accuracy is especially high when identifying entities in the e-commerce domain.
Brief description of the drawings
The accompanying drawings are provided for a better understanding of the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flow chart of the main steps of the named entity recognition method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the main components of the named entity recognition device according to an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present invention are explained below with reference to the accompanying drawings. The various details of the embodiments included here aid understanding and should be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the invention. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted below.
For a better understanding by those skilled in the art, the relevant terms are briefly introduced first.
Word: the word itself.
Word vector: a vectorized representation of a word; each word is represented by a multi-dimensional vector.
Part of speech: the grammatical category of a word. Words are generally divided into two classes and 12 parts of speech. One class is content words: nouns, verbs, adjectives, numerals, adverbs, onomatopoeia, measure words, and pronouns. The other class is function words: prepositions, conjunctions, auxiliary words, and interjections.
Part-of-speech vector: a vectorized representation of a part of speech; each part of speech is represented by a multi-dimensional vector, preferably in discrete form.
Entity tag: each entity tag represents an entity type; for example, WID represents a product ID, WB the first word of a product name, WI a middle word of a product name, WE the last word of a product name, and O any other word. For example: Xiaomi (WB) 2s (WI) red (WI) phone (WE) how-about (O).
Entity-tag vector: a vectorized representation of an entity tag; each entity tag is represented by a multi-dimensional vector, preferably in discrete form.
It should be noted that the dimensions of the word vectors, part-of-speech vectors, and entity-tag vectors need not be consistent with one another and can be set flexibly as needed.
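As a concrete illustration of the "discrete form" mentioned above, the part-of-speech and entity-tag vectors can be sketched as one-hot vectors. The tag names below come from the text; the dimensions and the Python rendering are illustrative assumptions, not part of the patent.

```python
# Hypothetical one-hot rendering of the discrete part-of-speech and
# entity-tag vectors; tag names follow the text, dimensions are illustrative.
POS_TAGS = ["noun", "verb", "adjective", "numeral"]  # subset of the 12 parts of speech
ENTITY_TAGS = ["WID", "WB", "WI", "WE", "O"]         # tag set named in the text

def one_hot(index, size):
    """Discrete multi-dimensional vector: all zeros except a 1 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

pos_vectors = {t: one_hot(i, len(POS_TAGS)) for i, t in enumerate(POS_TAGS)}
tag_vectors = {t: one_hot(i, len(ENTITY_TAGS)) for i, t in enumerate(ENTITY_TAGS)}

# The three vector types need not share a dimension:
print(len(pos_vectors["noun"]), len(tag_vectors["WB"]))  # 4 5
```

Note that the word vectors would not be one-hot: the text later obtains them from word2vec, so they are dense and of a dimension set independently of these two.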
Fig. 1 is a flow chart of the main steps of the named entity recognition method according to an embodiment of the present invention. As shown in Fig. 1, the named entity recognition method can include steps A to G.
Step A: obtain a vector library. The vector library includes word vectors corresponding to multiple words, part-of-speech vectors corresponding to multiple parts of speech, and entity-tag vectors corresponding to multiple entity tags.
In one embodiment of the invention, for a given corpus, word2vec can be used to determine the word vector of each word in the corpus. Word2vec is a tool open-sourced by Google in 2013 that represents words as real-valued vectors; it can map words into a K-dimensional vector space, and vector operations between words can even correspond to semantic relations. Precomputing the word vectors with word2vec therefore saves time, improves efficiency, and can improve accuracy. The part-of-speech vectors and entity-tag vectors can be obtained as random vectors by random initialization. The word vectors, part-of-speech vectors, and entity-tag vectors obtained in this process are stored in the vector library for later use.
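A minimal sketch of step A follows, under two stated assumptions: the word vectors are represented by random placeholders (in practice they would be precomputed by word2vec), and all dimensions and the dictionary layout are illustrative.

```python
import random

random.seed(42)

def random_vector(dim):
    """Random initialization, as the text describes for POS and entity-tag vectors."""
    return [random.uniform(-0.5, 0.5) for _ in range(dim)]

# In practice the word vectors would come from a word2vec model trained on the
# corpus; random placeholders keep this sketch self-contained.
words = ["iphone", "price", "Nokia", "white", "$BEGIN"]
vector_library = {
    "word": {w: random_vector(50) for w in words},
    "pos": {p: random_vector(8) for p in ["noun", "verb", "numeral"]},
    "entity": {t: random_vector(6) for t in ["WID", "WB", "WI", "WE", "O", "$BEGIN"]},
}

print(len(vector_library["word"]["iphone"]))  # 50
```

Including "$BEGIN" in both the word and entity-tag tables anticipates the padding word introduced in step C below.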
Step B: segment the training corpus text string into an ordered sequence of sample words.
In embodiments of the present invention, training corpus text strings can be extracted from the data of an e-commerce website and then segmented to obtain ordered sample words, as shown in Table 1:

Table 1. Training corpus text strings and sample words

| Training corpus text string | Ordered sample words |
| --- | --- |
| "iphone price" | "iphone" "price" |
| "Huawei Honor 6" | "Huawei" "Honor" "6" |
| "Xiaomi 1s red phone" | "Xiaomi" "1s" "red" "phone" |
| …… | …… |
Step C: query the vector library for each sample word in order to build a first feature vector. The first feature vector comprises the word vector of the sample word, the part-of-speech vector of the sample word, and the entity-tag vector of the sample word's previous word. Besides the word information and part-of-speech information of the sample word itself, the first feature vector therefore contains the entity-tag information of the sample word's previous word. The method of the present invention trains the model on these first feature vectors; compared with prior art that trains only on the information of the word itself, the information considered is more comprehensive, so the final recognition result is more accurate.
It should be noted that "the first feature vector comprises the word vector of the sample word, the part-of-speech vector of the sample word, and the entity-tag vector of the sample word's previous word" means that the first feature vector is formed by concatenating the following three vectors, for example: first feature vector = [word vector of the sample word, part-of-speech vector of the sample word, entity-tag vector of the sample word's previous word]. The present invention does not prescribe the concatenation order, and different orders do not affect its principle. However, once the order is determined it must not change within the whole method, so that all first feature vectors have a consistent format.
The detailed process of step C is illustrated as follows. Suppose the ordered sample words "sample word 1 + sample word 2 + sample word 3 + sample word 4 …" have been obtained; then a first feature vector is built in order for sample word 1, sample word 2, sample word 3, sample word 4, and so on. Set the word window width to 0. When building the first feature vector for sample word 1 (i.e., the first sample word), since no word precedes sample word 1, the preset character string "$BEGIN" must be added artificially as the previous word of sample word 1. The entity-tag vector of the preset character string "$BEGIN" already exists in the vector library and is usually a randomly initialized vector. Now, for sample word 1, suppose the word vector of sample word 1 looked up in the vector library is denoted X1, the part-of-speech vector of sample word 1 is denoted Z1, and the entity-tag vector of "$BEGIN" is denoted T0; then the first feature vector of sample word 1 = [X1, Z1, T0]. Next, for sample word 2, suppose the word vector of sample word 2 looked up in the vector library is denoted X2, the part-of-speech vector of sample word 2 is Z2, and the entity-tag vector of the previous word of sample word 2 (i.e., sample word 1) is denoted T1; then the first feature vector of sample word 2 = [X2, Z2, T1]. By analogy, the first feature vectors of all sample words can be obtained.
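The concatenation just described can be sketched as follows; the vectors are toy values, not real embeddings, and the function name is illustrative:

```python
def build_first_feature(word_vec, pos_vec, prev_tag_vec):
    """First feature vector = [word vector, POS vector, previous word's
    entity-tag vector]; the concatenation order, once chosen, is fixed."""
    return word_vec + pos_vec + prev_tag_vec

X1 = [0.1, 0.2, 0.3]  # word vector of sample word 1 (toy values)
Z1 = [1.0, 0.0]       # part-of-speech vector of sample word 1
T0 = [0.0, 0.0, 1.0]  # entity-tag vector of "$BEGIN"

f1 = build_first_feature(X1, Z1, T0)
print(f1)  # [0.1, 0.2, 0.3, 1.0, 0.0, 0.0, 0.0, 1.0]
```

The resulting flat list is what later enters the network's input layer; its length is the sum of the three component dimensions.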
In embodiments of the present invention, the first feature vector can also include the word vectors and part-of-speech vectors of the sample word's neighboring words. "Also include" here means "also concatenated from the following vectors". A "neighboring word" of a sample word is a sample word before or after the current sample word whose distance from it is no greater than the word window width. For example, suppose the word window width is 1; then the neighboring words of a sample word are the one word before and the one word after the current sample word. The first feature vector of the current sample word can be written as [word vector of the previous word, word vector of the current sample word, word vector of the next word, part-of-speech vector of the previous word, part-of-speech vector of the current sample word, part-of-speech vector of the next word, entity-tag vector of the previous word]. Other values of the word window width follow by analogy and are not repeated here. It should be noted that the present invention does not prescribe the value of the word window width; it can be set flexibly as needed, but once determined it must not change, so that all first feature vectors have a consistent format. It should also be noted that when the word window width increases, preset character strings can be added before the first sample word to serve as its preceding neighboring words, and preset character strings can likewise be added after the last sample word to serve as its following neighboring words; those skilled in the art can derive the specific practice from the above, and it is not repeated here. In this embodiment, the first feature vector further considers the word information and part-of-speech information of the sample word's neighboring words; the information considered is more comprehensive, so the final recognition result is more accurate.
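A sketch of this windowed layout with word window width 1. The "$END" padding name is an assumption (the text only names "$BEGIN" for the front), and the toy vector dictionaries are illustrative:

```python
def build_windowed_feature(i, words, word_vecs, pos_vecs, prev_tag_vec, window=1):
    """Window-width-1 layout from the text: word vectors of [prev, cur, next],
    then their POS vectors, then the previous word's entity-tag vector."""
    padded = ["$BEGIN"] * window + words + ["$END"] * window
    j = i + window                            # current word's index in `padded`
    ctx = padded[j - window : j + window + 1]
    feature = []
    for w in ctx:
        feature += word_vecs[w]               # word vectors of the window
    for w in ctx:
        feature += pos_vecs[w]                # POS vectors of the window
    return feature + prev_tag_vec             # entity tag of the previous word

word_vecs = {"$BEGIN": [0.0], "iphone": [0.5], "price": [0.7], "$END": [0.0]}
pos_vecs = {"$BEGIN": [0, 0], "iphone": [1, 0], "price": [1, 0], "$END": [0, 0]}
t_begin = [0.0, 1.0]                          # entity-tag vector of "$BEGIN"

f = build_windowed_feature(0, ["iphone", "price"], word_vecs, pos_vecs, t_begin)
print(f)  # [0.0, 0.5, 0.7, 0, 0, 1, 0, 1, 0, 0.0, 1.0]
```

For the first word, the "$BEGIN" padding simultaneously supplies the missing left neighbor and the previous word whose entity tag is needed.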
Step D: take the first feature vectors of all sample words as a whole as the training input of the neural network, and solve the network parameters with the BP (backpropagation) neural network algorithm to obtain the neural network named entity recognition model. Specifically, the overall objective function of the model can be constructed from the squared error, and the parameters of the neural network can be solved by stochastic gradient descent to obtain the final neural network named entity recognition model.
In embodiments of the present invention, the training input of the neural network can also include negative-example samples. Because the entity tags in real training corpus text strings are usually unevenly distributed, the model may fit some named entities poorly. To address this, during model training, negative-example sampling of the data can be performed randomly in proportion according to the distribution of the entity tags, so that the distribution is as uniform as possible, thereby ensuring that the model fits all named entity tags relatively accurately.
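One way to realize the proportional random down-sampling described here; the (feature, tag) pair format and the function name are illustrative assumptions:

```python
import random

def balance_by_tag(samples, seed=0):
    """Randomly down-sample over-represented tags so every entity tag
    contributes the same number of training pairs."""
    rng = random.Random(seed)
    by_tag = {}
    for feature, tag in samples:
        by_tag.setdefault(tag, []).append((feature, tag))
    target = min(len(group) for group in by_tag.values())
    balanced = []
    for group in by_tag.values():
        balanced += rng.sample(group, target)   # keep `target` per tag
    rng.shuffle(balanced)
    return balanced

# Toy input: the "other" tag O dominates 4:2, as it tends to in real corpora.
samples = [([0.1], "O"), ([0.2], "O"), ([0.3], "O"), ([0.4], "O"),
           ([0.5], "W"), ([0.6], "W")]
balanced = balance_by_tag(samples)
counts = {tag: sum(1 for _, t in balanced if t == tag) for tag in {"O", "W"}}
print(sorted(counts.items()))  # [('O', 2), ('W', 2)]
```

Equalizing to the smallest class is only one policy; sampling each class in proportion to a target distribution would fit the text equally well.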
Step E: segment the text string to be predicted into an ordered sequence of test words.
In embodiments of the present invention, the text string to be predicted can be obtained from a sentence input by the user and then segmented to obtain ordered test words.
Step F: query the vector library for each test word in order to build a second feature vector. The second feature vector comprises the word vector of the test word, the part-of-speech vector of the test word, and the entity-tag vector of the test word's previous word.
It should be noted that when building the second feature vector for the first test word in the ordered sequence of test words, the preset character string "$BEGIN" can be added before the first test word as its previous word. This operation is similar to adding the preset character string before the first sample word, described above.
It should also be noted that the format of the second feature vector of a test word must be consistent with that of the first feature vector of a sample word. This means that the kinds of component vectors in the second feature vector and their concatenation order must match the first feature vector. For example, when the first feature vector also includes the word vectors and part-of-speech vectors of the sample word's neighboring words, the second feature vector correspondingly also includes the word vectors and part-of-speech vectors of the test word's neighboring words.
Step G: input the second feature vector of each test word into the neural network named entity recognition model and output the entity tag of the test word.
For a better understanding by those skilled in the art, a specific implementation of the named entity recognition method is given as follows.
(1) Obtain the vector library using the word2vec tool.
(2) Suppose a training corpus text string is "iphone price"; segmentation yields two sample words, "iphone" and "price". The part of speech of "iphone" is noun n and its entity tag is the product entity tag W. The part of speech of "price" is noun n and its entity tag is the other entity tag O.
(3) First build the first feature vector of "iphone". Because "iphone" is the first sample word, "$BEGIN" (whose word vector, part-of-speech vector, and entity-tag vector are all randomly initialized) must be added before it. Suppose the word window width in this embodiment is 1. Query the vector library and denote the word vectors of the three words — the previous word "$BEGIN", the current sample word "iphone", and the next word "price" — as Xi-1, Xi, Xi+1, and the part-of-speech vectors of these three words as Zi-1, Zi, Zi+1; in addition, the entity-tag vector of "$BEGIN" is denoted Ti-1. These seven vectors are concatenated in order to form the first feature vector of "iphone" = [Xi-1, Xi, Xi+1, Zi-1, Zi, Zi+1, Ti-1].
(4) Input the first feature vector into the input layer of the neural network to obtain the output h(X). In this embodiment the entity tags W/O are converted into a discrete 1/0 representation. Because the entity tag of "iphone" is known to be "W", the expected output here is 1. Parameter optimization is carried out with the gradient descent algorithm so that the error is minimized. After all training corpus text strings go through the above training process, the final neural network named entity recognition model is obtained.
(5) Suppose a text string to be predicted is "Nokia white"; the segmentation result is two test words, "Nokia" and "white", and the parts of speech of "Nokia" and "white" are both known to be noun n.
(6) The process of building the second feature vector of "Nokia" is as follows. Add "$BEGIN" before "Nokia". Query the vector library to obtain the word vectors of "$BEGIN", "Nokia", and "white", then the part-of-speech vectors of "$BEGIN", "Nokia", and "white", and finally the entity-tag vector of "$BEGIN". These seven vectors are concatenated in order to obtain the second feature vector of "Nokia".
(7) Input the second feature vector of "Nokia" into the neural network named entity recognition model obtained in step (4) to predict the entity tag of "Nokia". If the model outputs h(X) = 0.8, the value is greater than the midpoint 0.5, so "Nokia" is labeled W (product entity). If the model outputs h(X) = 0.2, the value is less than the midpoint 0.5, so "Nokia" is labeled O (other entity).
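The decision in step (7) reduces to a comparison against the midpoint 0.5; a minimal sketch, with the function name as an illustrative assumption:

```python
def label_from_output(h):
    """With W/O encoded as 1/0, compare the network output h(X) with the
    midpoint 0.5 to choose the entity tag."""
    return "W" if h > 0.5 else "O"

print(label_from_output(0.8))  # W -> product entity
print(label_from_output(0.2))  # O -> other entity
```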
Fig. 2 is a schematic diagram of the main components of the named entity recognition device according to an embodiment of the present invention. As shown in Fig. 2, the named entity recognition device 20 can include: a vector library acquisition module 21, a first segmentation module 22, a first construction module 23, a training module 24, a second segmentation module 25, a second construction module 26, and a prediction module 27.
The vector library acquisition module 21 is used to obtain the vector library, which includes word vectors corresponding to multiple words, part-of-speech vectors corresponding to multiple parts of speech, and entity-tag vectors corresponding to multiple entity tags. Optionally, word2vec is used to determine the word vectors of the multiple words; precomputing with word2vec saves training time.
The first segmentation module 22 is used to segment the training corpus text string into an ordered sequence of sample words.
The first construction module 23 is used to query the vector library for each sample word in order to build a first feature vector, which comprises the word vector of the sample word, the part-of-speech vector of the sample word, and the entity-tag vector of the sample word's previous word.
The training module 24 is used to take the first feature vectors of all sample words as a whole as the training input of the neural network and to solve the network parameters with the BP neural network algorithm, obtaining the neural network named entity recognition model.
The second segmentation module 25 is used to segment the text string to be predicted into an ordered sequence of test words.
The second construction module 26 is used to query the vector library for each test word in order to build a second feature vector, which comprises the word vector of the test word, the part-of-speech vector of the test word, and the entity-tag vector of the test word's previous word.
The prediction module 27 is used to input the second feature vector of each test word into the neural network named entity recognition model and to output the entity tag of the test word.
In embodiments of the present invention, can also be included in first eigenvector:Sample word adjacent to word corresponding to word to
Amount and sample word are vectorial adjacent to part of speech corresponding to word, and, it can also be included in second feature vector:Word to be measured is adjacent to word pair
The term vector and word to be measured answered are adjacent to part of speech vector corresponding to word.In the embodiment, first eigenvector and second feature
Vector has further contemplated the word information and part-of-speech information of neighbouring word, and the information of consideration is more comprehensive, causes what is finally given
Recognition result is more accurate.
In an embodiment of the present invention, the first construction module 23 may be further configured so that, when constructing the first feature vector for the first of the ordered sample words, the word preceding the first sample word is a predetermined character string; likewise, the second construction module 26 may be further configured so that, when constructing the second feature vector for the first of the ordered words to be predicted, the word preceding the first word to be predicted is a predetermined character string. This solves the problem that no word precedes the first sample word or the first word to be predicted.
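The boundary handling above can be sketched in a few lines: the first word has no predecessor, so a reserved placeholder stands in as its previous word when looking up the entity tag vector. The `<BOS>` name and the toy vectors are illustrative choices, not mandated by the patent, which only requires some predetermined character string.

```python
# Toy entity tag vectors; "<BOS>" is a hypothetical name for the
# predetermined character string used before the first word.
TAG_VECS = {"O": [1.0, 0.0], "PROD": [0.0, 1.0], "<BOS>": [0.0, 0.0]}

def previous_tag_vector(tags, i):
    """Entity tag vector of word i-1, falling back to the placeholder
    when word i is the first word of the sequence."""
    prev = tags[i - 1] if i > 0 else "<BOS>"
    return TAG_VECS[prev]

print(previous_tag_vector(["O", "PROD"], 0))  # [0.0, 0.0] -- placeholder
print(previous_tag_vector(["O", "PROD"], 1))  # [1.0, 0.0] -- tag of word 0
```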
In an embodiment of the present invention, the training input of the neural network in training module 24 also includes negative samples. Introducing negative samples keeps the sample distribution as uniform as possible, thereby ensuring that the model fits all entity tags more accurately.
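The balancing rationale can be illustrated with a small check of tag frequencies in the training set; the tag names and the sample words below are made up for illustration only.

```python
from collections import Counter

def tag_fractions(labelled_words):
    """labelled_words: list of (word, entity_tag) pairs -> tag fraction map."""
    counts = Counter(tag for _, tag in labelled_words)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}

positives = [("iphone", "PROD"), ("galaxy", "PROD"), ("mate", "PROD")]
negatives = [("buy", "O"), ("a", "O"), ("cheap", "O")]  # negative samples

# With negative samples added, no tag dominates the training input.
print(tag_fractions(positives + negatives))  # {'PROD': 0.5, 'O': 0.5}
```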
In summary, the named entity recognition method and device of the present invention use a more rational feature vector to train the model and to make predictions with it. This feature vector contains not only features of the current word itself but also the part-of-speech feature of the current word and the entity tag feature of the preceding word. Compared with existing techniques that consider only the word itself, the information considered is more comprehensive, so the final recognition result is more accurate; the accuracy is particularly high when recognizing entities in the e-commerce domain.
The above embodiments do not limit the scope of protection of the present invention. Those skilled in the art should understand that, depending on design requirements and other factors, various modifications, combinations, sub-combinations, and substitutions may occur. Any modifications, equivalent replacements, and improvements made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.
Claims (8)
1. A named entity recognition method, characterized by comprising:
obtaining a vector library, the vector library comprising word vectors corresponding respectively to multiple words, part-of-speech vectors corresponding respectively to multiple classes of parts of speech, and entity tag vectors corresponding respectively to multiple classes of entity tags;
segmenting a training corpus text string to obtain multiple ordered sample words;
querying the vector library for each sample word, in order, to construct a first feature vector, the first feature vector comprising the word vector corresponding to the sample word, the part-of-speech vector corresponding to the sample word, and the entity tag vector corresponding to the word preceding the sample word;
taking the first feature vectors corresponding to all sample words, as a whole, as the training input of a neural network, solving the network parameters using the neural network BP algorithm, and obtaining a neural network named entity recognition model;
segmenting a text string to be predicted to obtain multiple ordered words to be predicted;
querying the vector library for each word to be predicted, in order, to construct a second feature vector, the second feature vector comprising the word vector corresponding to the word to be predicted, the part-of-speech vector corresponding to the word to be predicted, and the entity tag vector corresponding to the preceding word;
inputting the second feature vector corresponding to each word to be predicted into the neural network named entity recognition model, and outputting the entity tag of the word to be predicted.
2. The method according to claim 1, characterized in that:
the first feature vector further comprises the word vectors corresponding to the words adjacent to the sample word and the part-of-speech vectors corresponding to the words adjacent to the sample word; and
the second feature vector further comprises the word vectors corresponding to the words adjacent to the word to be predicted and the part-of-speech vectors corresponding to the words adjacent to the word to be predicted.
3. The method according to claim 1, characterized in that:
when the first feature vector is constructed for the first of the ordered sample words, the word preceding the first sample word is a predetermined character string; and
when the second feature vector is constructed for the first of the ordered words to be predicted, the word preceding the first word to be predicted is a predetermined character string.
4. The method according to claim 1, characterized in that the training input of the neural network further comprises negative samples.
5. A named entity recognition device, characterized by comprising:
a vector library acquisition module, for obtaining a vector library, the vector library comprising word vectors corresponding respectively to multiple words, part-of-speech vectors corresponding respectively to multiple classes of parts of speech, and entity tag vectors corresponding respectively to multiple classes of entity tags;
a first segmentation module, for segmenting a training corpus text string to obtain multiple ordered sample words;
a first construction module, for querying the vector library for each sample word, in order, to construct a first feature vector, the first feature vector comprising the word vector corresponding to the sample word, the part-of-speech vector corresponding to the sample word, and the entity tag vector corresponding to the word preceding the sample word;
a training module, for taking the first feature vectors corresponding to all sample words, as a whole, as the training input of a neural network, solving the network parameters using the neural network BP algorithm, and obtaining a neural network named entity recognition model;
a second segmentation module, for segmenting a text string to be predicted to obtain multiple ordered words to be predicted;
a second construction module, for querying the vector library for each word to be predicted, in order, to construct a second feature vector, the second feature vector comprising the word vector corresponding to the word to be predicted, the part-of-speech vector corresponding to the word to be predicted, and the entity tag vector corresponding to the preceding word;
a prediction module, for inputting the second feature vector corresponding to each word to be predicted into the neural network named entity recognition model, and outputting the entity tag of the word to be predicted.
6. The device according to claim 5, characterized in that:
the first feature vector further comprises the word vectors corresponding to the words adjacent to the sample word and the part-of-speech vectors corresponding to the words adjacent to the sample word; and
the second feature vector further comprises the word vectors corresponding to the words adjacent to the word to be predicted and the part-of-speech vectors corresponding to the words adjacent to the word to be predicted.
7. The device according to claim 5, characterized in that:
the first construction module is further configured to use a predetermined character string as the word preceding the first sample word when constructing the first feature vector for the first of the ordered sample words; and
the second construction module is further configured to use a predetermined character string as the word preceding the first word to be predicted when constructing the second feature vector for the first of the ordered words to be predicted.
8. The device according to claim 5, characterized in that, in the training module, the training input of the neural network further comprises negative samples.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510321448.8A CN104899304B (en) | 2015-06-12 | 2015-06-12 | Name entity recognition method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510321448.8A CN104899304B (en) | 2015-06-12 | 2015-06-12 | Name entity recognition method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104899304A CN104899304A (en) | 2015-09-09 |
CN104899304B true CN104899304B (en) | 2018-02-16 |
Family
ID=54031966
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510321448.8A Active CN104899304B (en) | 2015-06-12 | 2015-06-12 | Name entity recognition method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104899304B (en) |
Families Citing this family (46)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106294313A (en) * | 2015-06-26 | 2017-01-04 | 微软技术许可有限责任公司 | Study embeds for entity and the word of entity disambiguation |
CN106815194A (en) * | 2015-11-27 | 2017-06-09 | 北京国双科技有限公司 | Model training method and device and keyword recognition method and device |
CN106815193A (en) * | 2015-11-27 | 2017-06-09 | 北京国双科技有限公司 | Model training method and device and wrong word recognition methods and device |
CN105550227B (en) * | 2015-12-07 | 2020-05-22 | 中国建设银行股份有限公司 | Named entity identification method and device |
CN105468780B (en) * | 2015-12-18 | 2019-01-29 | 北京理工大学 | The normalization method and device of ProductName entity in a kind of microblogging text |
CN105701075B (en) * | 2016-01-13 | 2018-04-13 | 夏峰 | A kind of document associated detecting method and system |
CN105550172B (en) * | 2016-01-13 | 2018-06-01 | 夏峰 | A kind of distributed text detection method and system |
CN105701213B (en) * | 2016-01-13 | 2018-12-28 | 夏峰 | A kind of document control methods and system |
CN105701087B (en) * | 2016-01-13 | 2018-03-16 | 夏峰 | A kind of formula plagiarizes detection method and system |
CN105701077B (en) * | 2016-01-13 | 2018-04-13 | 夏峰 | A kind of multilingual literature detection method and system |
CN105701086B (en) * | 2016-01-13 | 2018-06-01 | 夏峰 | A kind of sliding window document detection method and system |
CN107195296B (en) * | 2016-03-15 | 2021-05-04 | 阿里巴巴集团控股有限公司 | Voice recognition method, device, terminal and system |
CN107506345A (en) * | 2016-06-14 | 2017-12-22 | 科大讯飞股份有限公司 | Method and device for constructing language model |
CN106095988A (en) * | 2016-06-21 | 2016-11-09 | 上海智臻智能网络科技股份有限公司 | Automatic question-answering method and device |
CN106202255A (en) * | 2016-06-30 | 2016-12-07 | 昆明理工大学 | Merge the Vietnamese name entity recognition method of physical characteristics |
CN106202054B (en) * | 2016-07-25 | 2018-12-14 | 哈尔滨工业大学 | A kind of name entity recognition method towards medical field based on deep learning |
CN106557462A (en) * | 2016-11-02 | 2017-04-05 | 数库(上海)科技有限公司 | Name entity recognition method and system |
CN106570170A (en) * | 2016-11-09 | 2017-04-19 | 武汉泰迪智慧科技有限公司 | Text classification and naming entity recognition integrated method and system based on depth cyclic neural network |
CN108074565A (en) * | 2016-11-11 | 2018-05-25 | 上海诺悦智能科技有限公司 | Phonetic order redirects the method and system performed with detailed instructions |
TWI645303B (en) * | 2016-12-21 | 2018-12-21 | 財團法人工業技術研究院 | String verification method, string expansion method and verification model training method |
CN106682220A (en) * | 2017-01-04 | 2017-05-17 | 华南理工大学 | Online traditional Chinese medicine text named entity identifying method based on deep learning |
CN108428137A (en) * | 2017-02-14 | 2018-08-21 | 阿里巴巴集团控股有限公司 | Generate the method and device of abbreviation, verification electronic banking rightness of business |
CN106844351B (en) * | 2017-02-24 | 2020-02-21 | 易保互联医疗信息科技(北京)有限公司 | A multi-data source-oriented medical institution organization entity identification method and device |
CN106933803B (en) * | 2017-02-24 | 2020-02-21 | 黑龙江特士信息技术有限公司 | A multi-data source-oriented medical device entity recognition method and device |
CN106933802B (en) * | 2017-02-24 | 2020-02-21 | 黑龙江特士信息技术有限公司 | A multi-data source-oriented social security entity identification method and device |
CN107122582B (en) * | 2017-02-24 | 2019-12-06 | 黑龙江特士信息技术有限公司 | Multi-data source-oriented diagnosis and treatment entity recognition method and device |
CN107291693B (en) * | 2017-06-15 | 2021-01-12 | 广州赫炎大数据科技有限公司 | Semantic calculation method for improved word vector model |
CN107818080A (en) * | 2017-09-22 | 2018-03-20 | 新译信息科技(北京)有限公司 | Term recognition methods and device |
CN107908614A (en) * | 2017-10-12 | 2018-04-13 | 北京知道未来信息技术有限公司 | A kind of name entity recognition method based on Bi LSTM |
CN107885721A (en) * | 2017-10-12 | 2018-04-06 | 北京知道未来信息技术有限公司 | A kind of name entity recognition method based on LSTM |
CN107967251A (en) * | 2017-10-12 | 2018-04-27 | 北京知道未来信息技术有限公司 | A kind of name entity recognition method based on Bi-LSTM-CNN |
CN107832289A (en) * | 2017-10-12 | 2018-03-23 | 北京知道未来信息技术有限公司 | A kind of name entity recognition method based on LSTM CNN |
CN107766559B (en) * | 2017-11-06 | 2019-12-13 | 第四范式(北京)技术有限公司 | training method, training device, dialogue method and dialogue system for dialogue model |
CN107886943A (en) * | 2017-11-21 | 2018-04-06 | 广州势必可赢网络科技有限公司 | Voiceprint recognition method and device |
CN110083820B (en) * | 2018-01-26 | 2023-06-27 | 普天信息技术有限公司 | An improved method and device for a benchmark word segmentation model |
CN110276066B (en) * | 2018-03-16 | 2021-07-27 | 北京国双科技有限公司 | Entity association relation analysis method and related device |
CN108920457B (en) * | 2018-06-15 | 2022-01-04 | 腾讯大地通途(北京)科技有限公司 | Address recognition method and device and storage medium |
RU2699687C1 (en) * | 2018-06-18 | 2019-09-09 | Общество с ограниченной ответственностью "Аби Продакшн" | Detecting text fields using neural networks |
CN109101481B (en) * | 2018-06-25 | 2022-07-22 | 北京奇艺世纪科技有限公司 | Named entity identification method and device and electronic equipment |
CN109657230B (en) * | 2018-11-06 | 2023-07-28 | 众安信息技术服务有限公司 | Named entity recognition method and device integrating word vector and part-of-speech vector |
CN110162772B (en) * | 2018-12-13 | 2020-06-26 | 北京三快在线科技有限公司 | Named entity identification method and device |
CN110309515B (en) * | 2019-07-10 | 2023-08-11 | 北京奇艺世纪科技有限公司 | Entity identification method and device |
CN111079418B (en) * | 2019-11-06 | 2023-12-05 | 科大讯飞股份有限公司 | Named entity recognition method, device, electronic equipment and storage medium |
CN111444720A (en) * | 2020-03-30 | 2020-07-24 | 华南理工大学 | Named entity recognition method for English text |
US11675978B2 (en) | 2021-01-06 | 2023-06-13 | International Business Machines Corporation | Entity recognition based on multi-task learning and self-consistent verification |
CN113408273B (en) * | 2021-06-30 | 2022-08-23 | 北京百度网讯科技有限公司 | Training method and device of text entity recognition model and text entity recognition method and device |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7171350B2 (en) * | 2002-05-03 | 2007-01-30 | Industrial Technology Research Institute | Method for named-entity recognition and verification |
CN101075228A (en) * | 2006-05-15 | 2007-11-21 | 松下电器产业株式会社 | Method and apparatus for named entity recognition in natural language |
CN101576910A (en) * | 2009-05-31 | 2009-11-11 | 北京学之途网络科技有限公司 | Method and device for identifying product naming entity automatically |
CN104615589A (en) * | 2015-02-15 | 2015-05-13 | 百度在线网络技术(北京)有限公司 | Named-entity recognition model training method and named-entity recognition method and device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7478033B2 (en) * | 2004-03-16 | 2009-01-13 | Google Inc. | Systems and methods for translating Chinese pinyin to Chinese characters |
-
2015
- 2015-06-12 CN CN201510321448.8A patent/CN104899304B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7171350B2 (en) * | 2002-05-03 | 2007-01-30 | Industrial Technology Research Institute | Method for named-entity recognition and verification |
CN101075228A (en) * | 2006-05-15 | 2007-11-21 | 松下电器产业株式会社 | Method and apparatus for named entity recognition in natural language |
CN101576910A (en) * | 2009-05-31 | 2009-11-11 | 北京学之途网络科技有限公司 | Method and device for identifying product naming entity automatically |
CN104615589A (en) * | 2015-02-15 | 2015-05-13 | 百度在线网络技术(北京)有限公司 | Named-entity recognition model training method and named-entity recognition method and device |
Non-Patent Citations (2)
Title |
---|
基于语义与SVM的中文实体关系抽取 (Chinese Entity Relation Extraction Based on Semantics and SVM); Bi Haibin et al.; Proceedings of the 18th National Conference on Information Storage Technology; 2013-10-29; full text *
词边界字向量的中文命名实体识别 (Chinese Named Entity Recognition with Word-Boundary Character Vectors); Yao Lin et al.; CAAI Transactions on Intelligent Systems; 2016-02-28; Vol. 11, No. 1; full text *
Also Published As
Publication number | Publication date |
---|---|
CN104899304A (en) | 2015-09-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104899304B (en) | Name entity recognition method and device | |
CN109408526B (en) | SQL sentence generation method, device, computer equipment and storage medium | |
CN107330011B (en) | The recognition methods of the name entity of more strategy fusions and device | |
CN110427623A (en) | Semi-structured document Knowledge Extraction Method, device, electronic equipment and storage medium | |
CN110489523B (en) | Fine-grained emotion analysis method based on online shopping evaluation | |
CN111666427A (en) | Entity relationship joint extraction method, device, equipment and medium | |
CN112328761B (en) | Method and device for setting intention label, computer equipment and storage medium | |
CN109508458A (en) | The recognition methods of legal entity and device | |
CN113627797B (en) | Method, device, computer equipment and storage medium for generating staff member portrait | |
CN105677857B (en) | method and device for accurately matching keywords with marketing landing pages | |
CN104715063B (en) | search ordering method and device | |
CN107102993A (en) | A kind of user's demand analysis method and device | |
CN106980620A (en) | A kind of method and device matched to Chinese character string | |
CN114240672B (en) | Method for identifying duty ratio of green asset and related product | |
CN113723077B (en) | Sentence vector generation method and device based on bidirectional characterization model and computer equipment | |
CN109783640A (en) | One type case recommended method, system and device | |
CN111160041A (en) | Semantic understanding method and device, electronic equipment and storage medium | |
CN114637831A (en) | Data query method and related equipment based on semantic analysis | |
CN113902569A (en) | Method for identifying the proportion of green assets in digital assets and related products | |
CN116644148A (en) | Keyword recognition method and device, electronic equipment and storage medium | |
CN112801425A (en) | Method and device for determining information click rate, computer equipment and storage medium | |
CN116402630A (en) | Financial risk prediction method and system based on characterization learning | |
CN107193806A (en) | A kind of vocabulary justice former automatic prediction method and device | |
CN110826315B (en) | Method for identifying timeliness of short text by using neural network system | |
CN107861950A (en) | The detection method and device of abnormal text |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |