Specific embodiments
To make the purpose, technical solutions, and advantages of the present invention clearer, the invention is described in more detail below in conjunction with specific embodiments and with reference to the accompanying drawings, including a detailed introduction to the sentiment analysis algorithm and the model it uses.
Fig. 1 shows the architecture diagram of the fine-grained sentiment analysis system for product information according to the present invention.
With reference to Fig. 1, the sentiment analysis system of the present invention comprises a user interface, a product review crawler module, a product review training sample database, a dictionary database, a text preprocessing module, a dictionary loading module, a feature extraction module, a sentiment analysis model training module, a sentiment analysis module, an administrator interface, a system management module, and a database interface.
The user interface handles all communication between the sentiment analysis system and the user. It obtains the product review text entered by the user and passes it to the text preprocessing module; it returns the final result produced by the sentiment analysis module to the user; and, if the user considers the result wrong, it passes the user's corrected result to the system management module for review by the administrator.
The product review crawler module crawls, at fixed time intervals, web pages on large shopping sites such as Jingdong (JD.com) and Amazon that carry sentiment polarity annotations such as star ratings. It extracts the product review information from these pages, sorts it into positive and negative samples, connects to the product review training sample database through the database interface, and stores the formatted data in that database.
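The labeling step performed by the crawler module can be sketched as follows. This is a minimal illustration only: the five-star scale, the thresholds, the record layout, and the names `star_to_label` and `build_samples` are assumptions and are not specified in this description.

```python
from typing import Optional


def star_to_label(stars: int) -> Optional[str]:
    """Map a star rating to a coarse polarity label.

    Assumption: a 1-5 star scale, with 4-5 treated as positive and
    1-2 as negative. These thresholds are illustrative only.
    """
    if not 1 <= stars <= 5:
        raise ValueError("stars must be between 1 and 5")
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return None  # a middle rating is ambiguous, so it gets no label


def build_samples(reviews):
    """Turn crawled (text, stars) pairs into formatted training records."""
    samples = []
    for text, stars in reviews:
        label = star_to_label(stars)
        cleaned = text.strip()
        if label is not None and cleaned:
            samples.append({"text": cleaned, "label": label})
    return samples
```

Calling `build_samples` on a list of crawled `(text, stars)` pairs yields records that could then be written to the training sample database through the database interface.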
The text preprocessing module connects to the product review training sample database through the database interface to obtain training sample data, and obtains the text entered by the user from the user interface. It preprocesses this text (word segmentation, part-of-speech (POS) tagging, stop-word handling, and syntactic analysis) and passes the processed data to the dictionary loading module.
The dictionary loading module connects to the dictionary database through the database interface and obtains dictionary data such as the sentiment dictionary, the collocation dictionary, and the negation word dictionary, which the feature extraction module uses for feature extraction.
The feature extraction module uses the dictionary data loaded by the dictionary loading module to extract predefined features from the preprocessed data. It vectorizes the text, converting it into a form that the sentiment analysis model training module can process, and passes the result to that module.
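A minimal sketch of dictionary-assisted feature extraction and vectorization is given below. The concrete feature set, the sparse index representation, and the function names are assumptions made for illustration; the description does not enumerate the predefined features.

```python
def extract_features(tokens, pos_tags, sentiment_lexicon, negation_words):
    """Return one feature dict per token (hypothetical feature set)."""
    features = []
    for i, (tok, pos) in enumerate(zip(tokens, pos_tags)):
        prev_tok = tokens[i - 1] if i > 0 else "<BOS>"
        features.append({
            "word": tok,
            "pos": pos,
            "in_sentiment_lexicon": tok in sentiment_lexicon,
            "is_negation": tok in negation_words,
            "prev_is_negation": prev_tok in negation_words,
        })
    return features


def vectorize(feature_dicts):
    """Map each token's active features to a sorted list of integer indices."""
    index = {}
    rows = []
    for fd in feature_dicts:
        row = set()
        for name, value in fd.items():
            if value is False:
                continue  # inactive boolean features are not stored
            key = f"{name}={value}"
            row.add(index.setdefault(key, len(index)))
        rows.append(sorted(row))
    return rows, index
```

Each row lists the indices of the features that fire for one token, which is one common sparse input format for sequence models.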
The sentiment analysis model training module trains the sentiment analysis model, which is the core of the system, at fixed time intervals. It obtains training data in the required format from the feature extraction module and uses the L-BFGS algorithm to train the Two-Level DCRF model constructed from that data. The Two-Level DCRF model used in the present invention is developed on the basis of the Linear CRF (linear-chain conditional random field) model; it is a kind of CRF (conditional random field) model and is applied here to the field of sentiment analysis for the first time. Earlier methods generally treated the two tasks involved, namely identifying evaluation objects and sentiment words and judging sentiment polarity, as independent and ignored the connection between them. By building a two-layer structure, this model performs both tasks jointly within a single model, lets information flow between the two layers, and introduces the connection between them, which helps improve the final accuracy. This module passes the trained model to the sentiment analysis module.
The sentiment analysis module loads the trained sentiment analysis model and performs fine-grained sentiment analysis on the format-converted user input text, that is, it judges the sentiment polarity toward each specific evaluation object. For example, for the sentence "The touchscreen is very cool, the sound is also very clear, but the battery is not durable", the sentiment analysis result is (touchscreen, positive), (sound, positive), and (battery, negative), and this result is provided to the user through the user interface. At the same time, identified sentiment words and collocations that do not exist in the dictionaries are passed to the system management module for review by the administrator. In the example above, the identified sentiment words are (cool, positive), (clear, positive), and (not durable, negative), and the identified collocations are (touchscreen, cool, positive), (sound, clear, positive), and (battery, not durable, negative).
The administrator interface is used by the system administrator to manually review and confirm the new sentiment words and collocations identified by the sentiment analysis module, as well as the incorrect analysis results reported by users.
The system management module is used by the system administrator to connect to the databases through the database interface and update them. If a new sentiment word or collocation is correct, it is stored in the corresponding dictionary database; otherwise it is discarded. Likewise, if a result corrected by user feedback is correct, it is stored in the training sample database; otherwise it is discarded.
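The routing rule above can be sketched as follows. The item layout (`kind`, `value`), the in-memory stand-ins for the two databases, and the function name `apply_review` are hypothetical and serve only to illustrate the accept-or-discard logic.

```python
def apply_review(item, approved, dictionary_db, training_db):
    """Route one administrator-reviewed item to the right store, or discard it.

    dictionary_db: dict mapping dictionary name -> list of entries (stand-in).
    training_db:   list of training samples (stand-in).
    """
    if not approved:
        return "discarded"
    kind = item["kind"]
    if kind in ("sentiment_word", "collocation"):
        dictionary_db.setdefault(kind, []).append(item["value"])
        return "dictionary"
    if kind == "user_correction":
        training_db.append(item["value"])
        return "training"
    raise ValueError(f"unknown item kind: {kind}")
```

In the system described here, both writes would go through the database interface rather than to in-memory containers.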
The database interface provides a unified interface and access control for database operations such as reading, writing, and updating training sample data and dictionary data.
The product review training sample database stores the formatted training sample data delivered by the product review crawler module.
The dictionary database stores dictionary data such as the sentiment word dictionary and the collocation dictionary.
In summary, the sentiment analysis system of the present invention consists of the user interface, product review crawler module, product review training sample database, dictionary database, text preprocessing module, dictionary loading module, feature extraction module, sentiment analysis model training module, sentiment analysis module, administrator interface, system management module, and database interface. Together, these modules perform four functions: fine-grained sentiment analysis, collection of user feedback, automatic collection of training text data, and real-time updating of the dictionary database. In the fine-grained sentiment analysis function, the system uses the Two-Level CRF model trained by the training module to identify, in the sentiment analysis module, the evaluation objects and evaluation words in the text entered by the user and to judge the user's sentiment polarity toward each evaluation object; the result is provided to the user through the user interface. In the user feedback collection function, the user's feedback is passed to the system management module, where the administrator manually reviews it and then stores it in the training sample database through the database interface as a new training sample. In the automatic training text collection function, the crawler module collects product reviews on the network that carry star ratings or similar annotations, formats them, and stores them in the training sample database through the database interface. In the real-time dictionary update function, the administrator manually reviews, in the system management module, the sentiment words and collocations output by the sentiment analysis module that the dictionary database does not yet contain, and then stores them in the dictionary database through the database interface.
The present invention uses a supervised algorithm to perform fine-grained analysis of text sentiment. A supervised algorithm needs a large amount of labeled data as training samples, and manual labeling consumes considerable labor and time and introduces subjective bias during annotation; this is the main obstacle to using supervised algorithms in practice. By automatically collecting and extracting star-rated review information as a training corpus, the present system reduces manual intervention and cost and can update the training data regularly and effectively.
The present invention introduces a feedback mechanism to learn from incorrectly analyzed cases. Existing methods generally do nothing with misclassified results, yet this feedback contains a large amount of useful information, and making full use of it is the key to a system that learns by itself. The feedback mechanism lets the model learn again from the results it analyzed incorrectly, so the system becomes more accurate the more it is used.
Fig. 2 is the flowchart of the fine-grained sentiment analysis method proposed by the present invention.
With reference to Fig. 2, the method comprises the following steps:
1. The product review crawler module responds to a crawl request, periodically obtains product review information from the network and extracts the relevant information, obtains user feedback, and stores this information in the training sample database by connecting to it.
2. In response to a model training request, connect to the training sample database, obtain the training data, and preprocess it (sentence splitting, word segmentation, and part-of-speech tagging).
3. Perform feature extraction on the preprocessed data through the dictionary loading module and the feature extraction module, converting it into vectorized data.
4. Use the extracted feature data to train the sentiment analysis model proposed by the present invention, the Two-Level CRF model.
5. Obtain the product review information entered by the user for analysis, and perform the same preprocessing and feature extraction as in steps 2 and 3.
6. Load the trained sentiment analysis model and perform sentiment analysis on the feature data.
7. Connect to the user interface and output the sentiment analysis result to the user.
In summary, the method mainly comprises a training sample data collection step, a sentiment analysis model training step, and a product review sentiment analysis step.
Fig. 3 is a schematic diagram of the implementation of the training sample data collection step. With reference to Fig. 3, this step obtains product review training samples from two sources, the network and user feedback; these samples enter the product review training sample database after being processed by the product review crawler module and the system management module, respectively. In supervised classification algorithms, the training data has a large influence on the final performance of the model, and the traditional approach of manually labeling training data requires so much labor and time that it is not very feasible in real-world applications. In this step, therefore, the product review crawler module on the one hand crawls and extracts reviews on the network that carry star ratings or positive/negative rating annotations; on the other hand, the administrator checks the correctness of user feedback in the system management module, and valid feedback is stored as training sample data. Through these two lines of work, training data is collected automatically, comprehensively, and effectively.
Fig. 4 is a schematic diagram of the implementation of the sentiment analysis model training step. With reference to Fig. 4, in this step the system first extracts the training sample data of the most recent period from the training sample database. The preprocessing module, dictionary loading module, and feature extraction module then process this data to obtain vectorized feature data, which is output to the model training module to train the sentiment analysis model.
After analyzing a large amount of product review information, the present invention constructs, based on characteristics such as the compositional structure of reviews, a DCRF model with a Two-Level structure and uses it for the first time to perform fine-grained, entity-level sentiment analysis of review information. This model is the key to the fine-grained sentiment analysis method of the present invention; the structure, principle, and advantages of the Two-Level CRF model are therefore introduced in detail below.
The purpose of fine-grained, entity-level sentiment analysis is to determine the sentiment polarity expressed in a text toward specific objects. It therefore necessarily involves two tasks: entity identification and sentiment polarity analysis. Earlier entity-level sentiment analysis usually treated these two tasks as independent: the entities in a sentence were identified first, and the sentiment toward each entity was analyzed afterwards, ignoring the connection between the two. The Two-Level CRF not only models the structure among the words of a sentence but also ties the two tasks together and performs them simultaneously, improving the final joint accuracy through the exchange of information between them.
The Two-Level CRF is a special kind of CRF model. A CRF is an undirected graphical model that models the conditional probability distribution of a label sequence given a set of features. Taking the most basic Linear-CRF as an example, given an observation sequence $x$, the conditional probability of a label sequence $y$ can be formally described as:
$$p(y \mid x) = \frac{1}{Z(x)} \prod_{i=1}^{I-1} \psi_i(y_i, y_{i+1}, x)$$
where $\psi_i$ is the potential function in the sense of undirected graphical models, and
$$Z(x) = \sum_{y'} \prod_{i=1}^{I-1} \psi_i(y'_i, y'_{i+1}, x)$$
is the normalization factor, summed over all possible label sequences of length $I$. The potential function $\psi_i$ can be decomposed into the following form, where $f_k$ are the defined feature functions:
$$\psi_i(y_i, y_{i+1}, x) = \exp\Big\{\sum_k \lambda_k f_k(y_i, y_{i+1}, x, i)\Big\}$$
The corresponding graphical model structure is shown in Fig. 5. Taking the named entity recognition task as an example, the preprocessed text is taken as input and its Linear-CRF model is built. Unlike traditional classification methods such as naive Bayes and logistic regression, the Linear-CRF treats classification as a sequence labeling problem. It can use the features that traditional classifiers use and, by making a suitable Markov assumption, it can also bring in positional information about labels of different classes; in this example, for instance, named entities usually occur near sentiment words. Such relationships between labels of different classes are difficult for traditional classification models to express. In addition, the Linear-CRF models the conditional probability of the label sequence directly. Unlike directed graphical models such as the HMM (hidden Markov model), it can incorporate rich features without making independence assumptions among them. It can also be viewed as a globally normalized MEMM (maximum entropy Markov model), which avoids the label bias problem of the MEMM. Compared with both traditional classification models and directed graphical models, the Linear-CRF therefore achieves better results on sequence labeling problems such as named entity recognition.
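To make the Linear-CRF conditional probability concrete, the following sketch computes it by brute-force enumeration on a very short sentence. It is illustrative only: the indicator features ("trans", "emit"), the weight dictionary, and the label set are assumptions, and exhaustive enumeration is used purely for clarity where a practical implementation would use the forward algorithm.

```python
import itertools
import math

LABELS = ["T", "S", "O"]  # hypothetical label set: entity, sentiment word, other


def log_potential(y_i, y_next, x, i, weights):
    """log psi_i = sum_k lambda_k * f_k(y_i, y_next, x, i) with indicator features."""
    active = [("trans", y_i, y_next), ("emit", x[i], y_i)]
    if i == len(x) - 2:
        # the last pairwise factor also carries the final word's feature
        active.append(("emit", x[i + 1], y_next))
    return sum(weights.get(f, 0.0) for f in active)


def sequence_score(y, x, weights):
    """Unnormalized log score: sum of log potentials over adjacent label pairs."""
    return sum(log_potential(y[i], y[i + 1], x, i, weights)
               for i in range(len(x) - 1))


def conditional_probability(y, x, weights):
    """p(y | x), with Z(x) computed by summing over every label sequence."""
    z = sum(math.exp(sequence_score(c, x, weights))
            for c in itertools.product(LABELS, repeat=len(x)))
    return math.exp(sequence_score(tuple(y), x, weights)) / z
```

For example, with `x = ["screen", "cool"]` and weights that reward tagging "screen" as T and "cool" as S, `conditional_probability(("T", "S"), x, weights)` is larger than for any other labeling, and the probabilities of all nine labelings sum to one.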
The Two-Level CRF can be regarded as a combination of two Linear-CRFs. As the graphical model of the Two-Level CRF in Fig. 6 shows, its structure contains two linear chains of labels and one observation sequence, and the label nodes of different layers at the same time step are connected to each other. In the example of Fig. 6, given a preprocessed product review, we take each word as a node and unroll the review into the corresponding Two-Level CRF. The first-layer label sequence identifies the product entities and sentiment words in the sentence; it uses three labels, T, S, and O, which denote product entity names, sentiment words, and other words, respectively. The second-layer label sequence analyzes the sentiment polarity of the entities and of the sentiment words; it uses three labels, P, N, and O, which denote positive sentiment, negative sentiment, and no sentiment, respectively. As can be seen, the Two-Level CRF not only has the properties of the Linear-CRF but also merges the two labeling tasks and introduces the connection between them, which existing methods find difficult to achieve.
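The two label layers can be read back into (entity, polarity) and (sentiment word, polarity) results as sketched below. The token boundaries and the label sequences in the usage note are assumed for illustration and are not taken from Fig. 6; the function name `decode_layers` is hypothetical.

```python
POLARITY = {"P": "positive", "N": "negative"}


def decode_layers(tokens, layer1, layer2):
    """Combine first-layer roles (T/S/O) with second-layer polarities (P/N/O)."""
    entities, sentiment_words = [], []
    for tok, role, pol in zip(tokens, layer1, layer2):
        if pol not in POLARITY:
            continue  # label O in the second layer: no sentiment attached
        if role == "T":
            entities.append((tok, POLARITY[pol]))
        elif role == "S":
            sentiment_words.append((tok, POLARITY[pol]))
    return entities, sentiment_words
```

With tokens `["touchscreen", "very", "cool", "battery", "not_durable"]`, first-layer labels `["T", "O", "S", "T", "S"]`, and second-layer labels `["P", "O", "P", "N", "N"]`, the function returns the entity results (touchscreen, positive) and (battery, negative) together with the sentiment words (cool, positive) and (not_durable, negative).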
The Two-Level CRF is formally described as follows:
$$p(y \mid x) = \frac{1}{Z(x)} \left( \prod_{t=1}^{T-1} \prod_{l=1}^{L} \Psi_l(y_{l,t}, y_{l,t+1}, x, t) \right) \left( \prod_{t=1}^{T} \prod_{l=1}^{L-1} \phi_l(y_{l,t}, y_{l+1,t}, x, t) \right)$$
where $\Psi_l$ denotes the potential function within a single label sequence and $\phi_l$ denotes the potential function between two label sequences. $T$ is the number of nodes in one label sequence and $L$ is the number of label sequences; in the model of the present invention, $L = 2$. The potential functions can likewise be expressed in the following two forms, where $f_k(y_{l,t}, y_{l,t+1}, x, t)$ and $f_k(y_{l,t}, y_{l+1,t}, x, t)$ are the feature functions defined within a single label sequence and between different label sequences, respectively:
$$\Psi_l(y_{l,t}, y_{l,t+1}, x, t) = \exp\Big\{\sum_k \lambda_k f_k(y_{l,t}, y_{l,t+1}, x, t)\Big\}$$
$$\phi_l(y_{l,t}, y_{l+1,t}, x, t) = \exp\Big\{\sum_k \lambda_k f_k(y_{l,t}, y_{l+1,t}, x, t)\Big\}$$
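The interplay of the within-layer potential and the cross-layer potential can be illustrated with a small brute-force sketch. The indicator features, their names, and the weights are assumptions chosen for illustration, and exhaustive enumeration stands in for the approximate inference that a model of realistic size would require.

```python
import itertools
import math

# Layer 0: T/S/O roles; layer 1: P/N/O polarities (L = 2).
LAYER_LABELS = [["T", "S", "O"], ["P", "N", "O"]]


def chain_log_potential(l, t, y, x, weights):
    """log Psi_l(y[l][t], y[l][t+1], x, t): features within one label sequence."""
    keys = [("trans", l, y[l][t], y[l][t + 1]), ("word", l, x[t], y[l][t])]
    return sum(weights.get(k, 0.0) for k in keys)


def cross_log_potential(l, t, y, x, weights):
    """log phi_l(y[l][t], y[l+1][t], x, t): features linking the two layers."""
    keys = [("cross", y[l][t], y[l + 1][t]),
            ("cross_word", x[t], y[l][t], y[l + 1][t])]
    return sum(weights.get(k, 0.0) for k in keys)


def log_score(y, x, weights):
    """Unnormalized log score of a joint labeling y = (layer0, layer1)."""
    n_layers, n_steps = len(y), len(x)
    chain = sum(chain_log_potential(l, t, y, x, weights)
                for l in range(n_layers) for t in range(n_steps - 1))
    cross = sum(cross_log_potential(l, t, y, x, weights)
                for l in range(n_layers - 1) for t in range(n_steps))
    return chain + cross


def all_labelings(n_steps):
    """Enumerate every joint labeling of both layers for a sentence of n_steps words."""
    per_layer = [list(itertools.product(labels, repeat=n_steps))
                 for labels in LAYER_LABELS]
    return itertools.product(*per_layer)


def probability(y, x, weights):
    """p(y | x) with Z(x) obtained by exhaustive enumeration."""
    z = sum(math.exp(log_score(c, x, weights)) for c in all_labelings(len(x)))
    return math.exp(log_score(y, x, weights)) / z
```

A positive weight on a cross-layer feature such as `("cross", "T", "N")` raises the probability of labelings in which an entity token also carries a negative polarity label, which is how information passes between the two tasks in this sketch.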
Comparing the formal descriptions of the Linear-CRF and the Two-Level CRF shows that the difference between them is the introduction of the potential function $\phi_l(y_{l,t}, y_{l+1,t}, x, t)$ in the Two-Level CRF. This potential function is the formal description of the connection between the two tasks of entity identification and sentiment polarity analysis. Just as the Linear-CRF makes a Markov assumption about labels at different positions within the same layer, the Two-Level CRF, building on the Linear-CRF, additionally makes a Markov assumption about the labels at the same time step in different layers. This introduces the relationship between different labels on different label sequences, which, following the maximum entropy criterion, is formalized as the potential function $\phi_l(y_{l,t}, y_{l+1,t}, x, t)$: a weighted combination of feature functions defined across layers at the same time step. This is precisely the key to the Two-Level CRF. By introducing $\phi_l(y_{l,t}, y_{l+1,t}, x, t)$, it establishes information exchange between the label sequences and effectively links two label sequences that earlier research treated as mutually independent and without any exchange of information. The Two-Level CRF therefore not only retains the advantages of the Linear-CRF described above but also adds richer features between label sequences that earlier research ignored or found difficult to exploit. For sequence labeling problems that require two labeling passes in particular, such as the fine-grained sentiment analysis problem addressed by the present invention, it can ultimately achieve better results than the Linear-CRF.
In the model training module, we use the L-BFGS algorithm to train the unrolled Two-Level CRF model and learn the parameters $\lambda_k$ of the model.
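For reference, the quantity that a gradient-based optimizer such as L-BFGS needs can be written out as below. This is the standard conditional log-likelihood for CRF training, not a formula given in this description; the training pairs $(x^{(n)}, y^{(n)})$, the aggregated feature count $F_k$, and the Gaussian prior with variance $\sigma^2$ are assumptions introduced here for illustration.

```latex
% F_k(y, x): f_k summed over every factor position at which it is defined.
\ell(\lambda) = \sum_{n} \log p\big(y^{(n)} \mid x^{(n)}; \lambda\big)
                - \sum_{k} \frac{\lambda_k^{2}}{2\sigma^{2}}

\frac{\partial \ell}{\partial \lambda_k}
  = \sum_{n} \Big( F_k\big(y^{(n)}, x^{(n)}\big)
      - \mathbb{E}_{y \sim p(\cdot \mid x^{(n)}; \lambda)}
        \big[ F_k\big(y, x^{(n)}\big) \big] \Big)
    - \frac{\lambda_k}{\sigma^{2}}
```

The gradient is the difference between the observed and the expected feature counts. In a model with cross-layer connections, the expectation generally has to be approximated by an inference procedure.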
Fig. 7 is a schematic diagram of the implementation of the product review sentiment analysis step. With reference to Fig. 7, in this step the user interface passes on the product review information entered by the user. After processing by the preprocessing module, dictionary loading module, and feature extraction module, the vectorized feature data is output to the sentiment analysis module, which calls the trained sentiment analysis model to analyze the entities and sentiment in the data and provides the result to the user through the user interface. In this module, we use the TRP algorithm to infer the labels in the label sequences. New sentiment words identified during sentiment word identification are reviewed by the administrator in the system management module and then stored in the dictionary database through the database interface, which keeps the dictionary resources up to date in real time.
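The inference step can be illustrated with a deliberately simple stand-in. The sketch below does not implement TRP; it finds the highest-scoring joint labeling by exhaustive search, which is feasible only for very short inputs. The scoring function `toy_log_score` and its hand-set table are hypothetical and take the place of a trained model.

```python
import itertools


def map_decode(x, layer_labels, log_score):
    """Return the joint labeling (one tuple per layer) with the highest score."""
    n_steps = len(x)
    per_layer = [list(itertools.product(labels, repeat=n_steps))
                 for labels in layer_labels]
    return max(itertools.product(*per_layer), key=lambda y: log_score(y, x))


def toy_log_score(y, x):
    """Hypothetical score: +1 for each word whose (role, polarity) pair matches a table."""
    table = {"screen": ("T", "P"), "cool": ("S", "P")}
    roles, polarities = y
    return sum(
        1.0 for word, role, pol in zip(x, roles, polarities)
        if (role, pol) == table.get(word, ("O", "O"))
    )
```

For example, `map_decode(["screen", "is", "cool"], [["T", "S", "O"], ["P", "N", "O"]], toy_log_score)` returns the role sequence T, O, S and the polarity sequence P, O, P. The number of joint labelings grows exponentially with sentence length, which is why an approximate inference algorithm is needed in practice.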
The specific embodiments described above further explain the purpose, technical solutions, and beneficial effects of the present invention. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.