CN110110080A - Text classification model training method, device, computer equipment and storage medium - Google Patents
Text classification model training method, device, computer equipment and storage medium
- Publication number
- CN110110080A (application number CN201910247846.8A)
- Authority
- CN
- China
- Prior art keywords
- sample data
- sample
- data
- classification model
- default
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Creation or modification of classes or clusters
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/192—Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
- G06V30/194—References adjustable by an adaptive method, e.g. learning
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Multimedia (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a text classification model training method, apparatus, computer device and storage medium. The method comprises: obtaining, from a preset sample database, first sample data that carry category labels and second sample data that do not; establishing a preliminary classification model according to the first sample data; calculating an information entropy value and a relevance value for each item of second sample data; according to a preset category labeling method, performing category labeling on the second sample data whose information entropy value and relevance value satisfy preset conditions, to obtain third sample data; training the preliminary classification model with the third sample data to obtain an intermediate classification model; and training the intermediate classification model with the first sample data and the third sample data to obtain the text classification model. The technical solution of the present invention addresses the problems of large training-sample scale and long training time in text classification model training.
Description
Technical field
The present invention relates to the field of information processing, and more particularly to a text classification model training method and apparatus, a computer device, and a storage medium.
Background technique
Text classification is an important application direction in the field of natural language processing research. Text classification refers to using a classifier to classify data files that contain text, so as to determine the category to which each document belongs and allow users to conveniently retrieve the documents they need.
A classifier, also called a classification model, is obtained by training classification criteria or model parameters on a large amount of sample data that carry category labels. The trained classifier is used to identify text data of unknown category, thereby achieving automatic classification of large-scale text data. The quality of the classification model therefore directly determines the final classification effect.
However, in real large-scale text classification problems the sample data that carry category labels are very limited, and most samples carry no category label. As a result, when building a classification model one has to rely on manual labeling by domain experts. This approach consumes a great deal of manpower, money and time, and because the training samples are so large in scale, the training process also takes a long time.
Summary of the invention
Embodiments of the present invention provide a text classification model training method, apparatus, computer device and storage medium, to solve the problems of large training-sample scale and long training time in text classification model training.
A text classification model training method, comprising:
obtaining first sample data with category labels from a preset sample database, and establishing a preliminary classification model according to the first sample data;
obtaining second sample data without the category labels from the preset sample database;
calculating the information entropy of each item of second sample data to obtain an information entropy value for each item of second sample data;
calculating a relevance value for each item of second sample data according to the number of identical phrases contained in the second sample data;
selecting, as data to be labeled, the second sample data whose information entropy value exceeds a preset information entropy threshold and whose relevance value is below a preset relevance threshold;
performing category labeling on the data to be labeled according to a preset category labeling method, to obtain third sample data;
training the preliminary classification model with the third sample data according to a preset model training method, to obtain an intermediate classification model;
training the intermediate classification model with the first sample data and the third sample data according to the preset model training method, to obtain a text classification model.
A text classification model training apparatus, comprising:
a preliminary model establishing module, configured to obtain first sample data with category labels from a preset sample database and establish a preliminary classification model according to the first sample data;
a sample data obtaining module, configured to obtain second sample data without the category labels from the preset sample database;
an information entropy calculating module, configured to calculate the information entropy of each item of second sample data to obtain an information entropy value for each item of second sample data;
a relevance calculating module, configured to calculate a relevance value for each item of second sample data according to the number of identical phrases contained in the second sample data;
a to-be-labeled data selecting module, configured to select, as data to be labeled, the second sample data whose information entropy value exceeds a preset information entropy threshold and whose relevance value is below a preset relevance threshold;
a labeling module, configured to perform category labeling on the data to be labeled according to a preset category labeling method, to obtain third sample data;
a first model training module, configured to train the preliminary classification model with the third sample data according to a preset model training method, to obtain an intermediate classification model;
a second model training module, configured to train the intermediate classification model with the first sample data and the third sample data according to the preset model training method, to obtain a text classification model.
A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the above text classification model training method when executing the computer program.
A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the above text classification model training method.
In the above text classification model training method, apparatus, computer device and storage medium, first sample data with category labels are obtained from a preset sample database and a preliminary classification model is established according to them; that is, only a small portion of labeled sample data is used for training to obtain the preliminary classification model, which reduces the demand for labeled sample data and saves training cost. Second sample data without category labels are obtained from the preset sample database; the information entropy value and relevance value of the second sample data are calculated, and category labeling is performed on the second sample data whose information entropy value and relevance value satisfy preset conditions. The preliminary classification model is then trained with the labeled third sample data according to a preset model training method to obtain an intermediate classification model; in this way the characteristics of the third sample data (large information entropy, low mutual relevance, and the presence of category labels) are exploited to improve the classification precision of the preliminary classification model. Finally, the intermediate classification model is trained with the first sample data and the third sample data to obtain the text classification model, i.e. the final text classification model is obtained through step-by-step iterative optimization. A method is thus proposed for training a text classification model with only a small amount of labeled sample data, so that a classification model with good performance can be obtained by training on fewer samples, saving labor cost and improving training speed.
Detailed description of the invention
In order to explain the technical solutions in the embodiments of the present invention more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an application environment of the text classification model training method in an embodiment of the present invention;
Fig. 2 is a flowchart of the text classification model training method in an embodiment of the present invention;
Fig. 3 is a flowchart of step S1 of the text classification model training method in an embodiment of the present invention;
Fig. 4 is a flowchart of step S4 of the text classification model training method in an embodiment of the present invention;
Fig. 5 is a flowchart of step S5 of the text classification model training method in an embodiment of the present invention;
Fig. 6 is a schematic diagram of the text classification model training apparatus in an embodiment of the present invention;
Fig. 7 is a schematic diagram of the computer device in an embodiment of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The text classification model training method provided by the present invention can be applied in the application environment shown in Fig. 1, in which the server side is a computer device that performs text classification model training and can be a server or a server cluster; the preset sample database is the database that supplies training sample data and can be any of various relational or non-relational databases, such as MS-SQL, Oracle, MySQL, Sybase, DB2, Redis, MongoDB, HBase, etc.; the server side and the preset sample database are connected over a network, which can be wired or wireless. The text classification model training method provided by the embodiments of the present invention is applied at the server side.
In one embodiment, as shown in Fig. 2, a text classification model training method is provided, and its implementation flow includes the following steps:
S1: obtain first sample data with category labels from a preset sample database, and establish a preliminary classification model according to the first sample data.
The preset sample database is the database that supplies training sample data. It can be deployed locally at the server side or connected to the server side over a network.
The first sample data are text data that carry category labels. Text data include documents containing text information, such as text files, web pages, news articles and e-mail bodies on the Internet; a category label is a classification mark attached to a text data item that defines the category of that text data.
For example, if the category label of an article is "emotion", the content of the article is related to "emotion". It can be understood that category labels further include, but are not limited to, labels such as "popular science", "sports", "inspirational" and "poetry and prose" that indicate the category to which text data belong.
Specifically, in the preset sample database, category labels and text data are stored in association, and each text data item has a field indicating whether it carries a category label. The server side can obtain the text data that carry category labels as the first sample data through an SQL query statement.
The preliminary classification model is a classification tool constructed from the first sample data. The established preliminary classification model can perform coarse classification on sample data that carry category labels.
Specifically, the server side can perform feature analysis on the first sample data with category labels to obtain text feature information of the first sample data, and then store the category labels in association with the text feature information as the preliminary classification model. For example, the server side can perform word segmentation on the text of the first sample data and take the high-frequency segmented words as the text feature information. Word segmentation here means cutting the words out of the text during text information processing to obtain individual words; as a text processing technique, it is widely used in fields such as full-text search and text content mining.
Alternatively, the server side can obtain the preliminary classification model from the first sample data using a neural-network-based training method.
S2: obtain second sample data without category labels from the preset sample database.
The second sample data are text data without category labels. That is, compared with the first sample data, the second sample data carry no category label, and without manual labeling the server side does not know the text category or meaning of the second sample data.
Specifically, the server side can obtain the second sample data from the preset sample database through an SQL query statement.
S3: calculate the information entropy of each item of second sample data to obtain an information entropy value for each item of second sample data.
Information entropy is the concept proposed by Shannon to measure the amount of information; it is a quantitative metric of how much information is present. The larger the information entropy, the richer the information contained in the sample data, and the greater the uncertainty of the information it represents. The information entropy value is the concrete quantified value of the information entropy.
The server side can determine the information entropy value according to the amount of text contained in the second sample data, for example by using the word count of the text in the second sample data as the information entropy value. It can be understood that the amount of information contained in a 5000-word article is greater than that contained in an e-mail body of only 20 words. Specifically, the server side counts the number of words of text in each item of second sample data and takes this word count as the information entropy value of that item.
Alternatively, the server side can take the number of segmented words remaining in the second sample data after removing modal particles as the information entropy value of the second sample data. Modal particles include, but are not limited to, filler words such as "uh". Specifically, the server side performs word segmentation on the second sample data to obtain a segmented word set, removes the modal particles from that set, and takes the number of remaining segmented words as the information entropy value of the second sample data.
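The two word-count variants just described can be illustrated with a short Python sketch. This is only an illustration under assumptions not stated in the text: a simple regex tokenizer stands in for the unspecified word segmenter, and the list of modal particles is a hypothetical example.

```python
import re

# Hypothetical list of modal particles / filler words to remove (example only).
MODAL_PARTICLES = {"uh", "um", "ah", "oh"}

def entropy_value_by_word_count(text: str) -> int:
    """Variant 1: use the raw word count of the text as the information entropy value."""
    return len(re.findall(r"\w+", text))

def entropy_value_without_particles(text: str) -> int:
    """Variant 2: segment the text, drop modal particles, and count the remaining words."""
    tokens = re.findall(r"\w+", text.lower())
    return sum(1 for token in tokens if token not in MODAL_PARTICLES)
```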
S4: calculate a relevance value for each item of second sample data according to the number of identical phrases contained in the second sample data.
The relevance value of the second sample data reflects whether the information provided by the second sample data is repetitive and redundant. The higher the relevance value, the more repetitive and redundant the information that the second sample data provide with respect to each other; the lower the relevance value, the greater the difference between the information that the second sample data provide.
The server side determines the relevance value according to the number of identical phrases contained in the second sample data.
For example, suppose second sample data A contains the phrases "culture", "civilization" and "history"; second sample data B contains the phrases "culture", "country" and "history"; and second sample data C contains the phrases "travel", "mountains and rivers" and "country". A and B share the phrases "culture" and "history", so the relevance value between A and B is 2; likewise, the relevance value between A and C is 0, and between B and C it is 1. The relevance value of each item of second sample data can then be determined as the sum of its relevance values with every other item of second sample data, i.e. the relevance value of A is 2, of B is 3, and of C is 1.
S5: select, as data to be labeled, the second sample data whose information entropy value exceeds a preset information entropy threshold and whose relevance value is below a preset relevance threshold.
The preset information entropy threshold and the preset relevance threshold are the conditions for screening the second sample data that carry no category label. The data to be labeled are the data obtained after screening the second sample data according to the preset information entropy threshold and the preset relevance threshold.
Second sample data whose information entropy value exceeds the preset information entropy threshold and whose relevance value is below the preset relevance threshold represent content with high uncertainty and large differences between information items, and are therefore the preferred data for training the model.
Specifically, if the preset information entropy threshold is 1000 and the preset relevance threshold is 100, the server side selects, according to the information entropy value and relevance value of each item of second sample data, the second sample data whose information entropy value is greater than 1000 and whose relevance value is below 100 as the data to be labeled.
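As a minimal sketch of this selection rule (the thresholds 1000 and 100 are the example values from the text; the dictionary keys used to carry each sample's entropy and relevance values are assumptions made for illustration):

```python
INFO_ENTROPY_THRESHOLD = 1000  # preset information entropy threshold (example value from the text)
RELEVANCE_THRESHOLD = 100      # preset relevance threshold (example value from the text)

def select_data_to_label(second_samples):
    """second_samples: iterable of dicts with 'entropy' and 'relevance' keys (assumed layout).
    Returns the items whose entropy exceeds the threshold and whose relevance is below it."""
    return [
        sample for sample in second_samples
        if sample["entropy"] > INFO_ENTROPY_THRESHOLD
        and sample["relevance"] < RELEVANCE_THRESHOLD
    ]
```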
S6: perform category labeling on the data to be labeled according to a preset category labeling method, to obtain third sample data.
Category labeling is the process of labeling second sample data that carry no category label so that they carry the corresponding category labels. For example, category labeling of an article attaches labels such as "novel" or "suspense" that reflect its subject matter. The data obtained after category labeling are the third sample data.
The preset category labeling method means that the server side can perform category labeling on the second sample data in any of several ways.
For example, the server side can extract keywords from the second sample data, e.g. take the five words with the highest word frequency as keywords; then compare the keywords with the target keywords in a preset category label dictionary for consistency, and if a keyword matches a target keyword, label the second sample data with that target keyword, thereby obtaining the third sample data.
Alternatively, the server side can directly call a third-party expert system for labeling; for example, it can submit the second sample data through an API (Application Programming Interface) provided by the third-party expert system and obtain the category labels corresponding to the second sample data, thereby obtaining the third sample data.
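A minimal sketch of the keyword-dictionary labeling variant described above; the top-five-by-frequency rule follows the example in the text, while the tokenizer, dictionary contents and function name are illustrative assumptions.

```python
import re
from collections import Counter

# Hypothetical category label dictionary: target keyword -> category label.
CATEGORY_DICTIONARY = {
    "bank": "finance",
    "lending": "finance",
    "travel": "tourism",
}

def label_by_keywords(text: str) -> set:
    """Take the five most frequent words as keywords, compare them against the
    category label dictionary, and return the matched category labels."""
    tokens = re.findall(r"\w+", text.lower())
    keywords = [word for word, _ in Counter(tokens).most_common(5)]
    return {CATEGORY_DICTIONARY[k] for k in keywords if k in CATEGORY_DICTIONARY}
```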
S7: train the preliminary classification model with the third sample data according to a preset model training method, to obtain an intermediate classification model.
The intermediate classification model is the classification model obtained after training the preliminary classification model with the third sample data. The difference between the intermediate classification model and the preliminary classification model is that the training set of the intermediate classification model consists of the third sample data, which carry category labels and whose information entropy values and relevance values satisfy the specified conditions.
The preset model training method means that the server side uses the third sample data as training data and trains the preliminary classification model with any of several frameworks or algorithms; for example, the server side can use existing machine learning frameworks or tools such as Scikit-Learn and TensorFlow.
Scikit-Learn, abbreviated sklearn, is an open-source, Python-based machine learning tool library with built-in classification algorithms such as naive Bayes, decision trees and random forests; with sklearn, common machine learning tasks such as data preprocessing, classification, regression, dimensionality reduction and model selection can be implemented.
TensorFlow is an open-source software library for numerical computation originally developed by researchers and engineers of the Google Brain team (part of Google's machine intelligence research organization); it can be used for research on machine learning and deep neural networks, but the generality of the system also makes it widely applicable to other computing fields.
Specifically, taking sklearn as an example, the server side uses the third sample data as input data and calls the built-in training method of sklearn until the model converges, at which point the intermediate classification model is obtained.
S8: train the intermediate classification model with the first sample data and the third sample data according to the preset model training method, to obtain a text classification model.
The text classification model is the final classification model obtained after retraining the intermediate classification model.
The preset model training method used by the server side is the same as in the training process of step S7 and is not repeated here. The difference from the training process of step S7 is that the intermediate classification model is trained with the first sample data and the third sample data at the same time, i.e. the intermediate classification model is iteratively trained with labeled sample data to improve its classification precision.
Specifically, taking sklearn as an example, the server side uses the first sample data and the third sample data as input data and calls the built-in training method of sklearn until the model converges, at which point the text classification model is obtained.
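The two sklearn training passes of steps S7 and S8 could be sketched roughly as follows, assuming a bag-of-words multinomial naive Bayes pipeline like the one mentioned later for the preliminary model; since sklearn's fit() retrains from scratch, each stage is expressed here as a refit on the corresponding training set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_text_classification_model(first_texts, first_labels, third_texts, third_labels):
    # Step S7: train on the newly labeled third sample data to obtain the intermediate model.
    intermediate_model = make_pipeline(CountVectorizer(), MultinomialNB())
    intermediate_model.fit(third_texts, third_labels)

    # Step S8: retrain on the first and third sample data together to obtain the final model.
    text_classification_model = make_pipeline(CountVectorizer(), MultinomialNB())
    text_classification_model.fit(first_texts + third_texts, first_labels + third_labels)

    return intermediate_model, text_classification_model
```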
In this embodiment, first sample data with category labels are obtained from a preset sample database and a preliminary classification model is established according to them, i.e. only a small portion of labeled sample data is used for training to obtain the preliminary classification model, which reduces the demand for labeled sample data and saves training cost; second sample data without category labels are obtained from the preset sample database; the information entropy value and relevance value of the second sample data are calculated, and category labeling is performed on the second sample data whose information entropy value and relevance value satisfy preset conditions; the preliminary classification model is trained with the labeled third sample data according to a preset model training method to obtain an intermediate classification model, so that the characteristics of the third sample data (large information entropy, low mutual relevance, and the presence of category labels) are exploited to improve the classification precision of the preliminary classification model; finally, the intermediate classification model is trained with the first sample data and the third sample data to obtain the text classification model, i.e. the final text classification model is obtained through step-by-step iterative optimization. A method is thus proposed for training a text classification model with only a small amount of labeled sample data, so that a classification model with good performance can be obtained by training on fewer samples, saving labor cost and improving training speed.
Further, in one embodiment, as shown in Fig. 3, step S1, i.e. obtaining first sample data with category labels from the preset sample database and establishing a preliminary classification model according to the first sample data, specifically includes the following steps:
S11: select first sample data with category labels from the preset sample database according to a preset sample selection method.
The preset sample selection method selects, from the preset sample database, a certain number of representative first sample data with category labels. The number should be as small as possible to reduce the demand for sample data, while the selected first samples should cover the text data categories as fully as possible. For example, when selecting news text data, categories such as "politics", "business", "sports" and "entertainment" should be covered as far as possible.
Specifically, if the preset sample database holds 100,000 articles of which 3,000 carry category labels, the server side can select 30% of the 3,000 articles, i.e. 900 articles, and from these 900 articles select 5 articles for each text data category as the first sample data.
S12: establish the preliminary classification model by combining the first sample data with category labels and a preset training algorithm.
The preset training algorithm includes the various algorithms used in machine learning to train models. The process in which the server side establishes the preliminary classification model with the labeled first sample data is a supervised learning process. Supervised learning trains an optimal model from existing training samples, i.e. known data and their corresponding outputs; this model belongs to some set of functions, and "optimal" means optimal under some evaluation criterion.
Specifically, taking the naive Bayes classification algorithm as an example, the server side can import the naive Bayes function from the sklearn library and then call MultinomialNB().fit() for training.
When training is completed, the server side can use the Joblib library to save the training result. Joblib is part of the SciPy ecosystem and provides tools for pipelining Python jobs. Alternatively, the server side can call functions of the pickle library to save the preliminary classification model.
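A minimal sketch of the training and saving calls mentioned above. MultinomialNB().fit() operates on numeric feature vectors, so a CountVectorizer step is assumed here in addition to the calls named in the text; the file name and placeholder data are also assumptions.

```python
import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# First sample data with category labels (toy placeholders; real data come from the sample database).
first_sample_texts = ["example labeled document one", "example labeled document two"]
first_sample_labels = ["science", "sports"]

# Turn the raw text into count features, then train the naive Bayes classifier.
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(first_sample_texts)
preliminary_model = MultinomialNB().fit(features, first_sample_labels)

# Persist the preliminary classification model (pickle could be used in the same way).
joblib.dump((vectorizer, preliminary_model), "preliminary_model.joblib")
```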
In this embodiment, the server side selects, according to the preset sample selection method, first sample data whose number is as small as possible and whose sample types cover the categories as widely as possible, and then establishes the preliminary classification model in combination with the preset training algorithm. The demand for sample data is thereby kept as small as possible, which further reduces training cost; at the same time, because the first sample data have wide coverage, the recognition range of the preliminary classification model is wider.
Further, in one embodiment, step S3, i.e. calculating the information entropy of each item of second sample data to obtain an information entropy value for each item of second sample data, specifically includes the following steps:
The information entropy of each item of second sample data is calculated according to the following formula:

H = -∑ p(x) log p(x)

where H represents the information entropy value of the second sample data, x represents a phrase in the second sample data, the sum is taken over all phrases x, and p(x) represents the frequency with which the phrase occurs.
A phrase in the second sample data is a word obtained after the server side performs word segmentation on the second sample data; the frequency of a phrase is the number of times the phrase occurs in the second sample data.
Specifically, the server side first performs word segmentation on each item of second sample data to obtain a segmented word set, and then substitutes the frequencies of all segmented words in the set into the formula to obtain the information entropy value of that item of second sample data.
In this embodiment, the server side calculates the information entropy of the second sample data according to the Shannon formula and the word frequencies of the phrases in the second sample data, so that the quantification of the amount of information contained in the sample data is more accurate.
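A small Python sketch of this entropy calculation, assuming a simple regex tokenizer (the text does not fix a particular word segmenter) and using relative word frequencies as p(x):

```python
import math
import re
from collections import Counter

def information_entropy(text: str) -> float:
    """Compute H = -sum(p(x) * log p(x)) over the segmented words of one text item."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((count / total) * math.log(count / total) for count in counts.values())
```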
Further, in one embodiment, as shown in Fig. 4, step S4, i.e. calculating the relevance value of each item of second sample data according to the number of identical phrases contained in the second sample data, specifically includes the following steps:
S41: perform word segmentation on each item of second sample data to obtain N segmented word sets, where N is the number of items of second sample data.
Specifically, the server side can perform word segmentation in various ways, for example by using regular expressions to cut the second sample data into a set of several segmented words, i.e. the segmented word set. It can be understood that the items of second sample data and the segmented word sets correspond one to one.
A regular expression (Regular Expression) is a processing method for retrieving or replacing target text within a context.
Specifically, the server side can use the regular expression engine built into Perl or Python to cut the second sample data; alternatively, the server side can use the grep tool provided in Unix systems to cut the second sample data and obtain the set of segmented words. grep (Globally search a Regular Expression and Print) is a powerful text search tool.
S42: for each item of second sample data, calculate the intersection between the segmented word set of that item and the segmented word sets of the other N-1 items of second sample data, and determine, according to the number of phrases contained in each intersection, the local relevance values between that item and the other N-1 items, obtaining the N-1 local relevance values corresponding to that item of second sample data.
To calculate the intersection between segmented word sets, different segmented word sets are compared; the intersection consists of the identical phrases.
The local relevance value represents the degree of relevance between one item of second sample data and another item of second sample data.
For example, if segmented word set a is {"people", "interest", "bank", "lending"} and segmented word set b is {"bank", "lending", "income"}, then the intersection of a and b is {"bank", "lending"}, the number of phrases in the intersection is 2, and the local relevance value of a and b is 2. Similarly, if segmented word set c is {"meeting", "report", "income"}, then the local relevance value of a and c is 0, and the local relevance value of b and c is 1.
S43: calculate the average of the N-1 local relevance values corresponding to each item of second sample data, and take the average as the relevance value of that item of second sample data.
Still taking segmented word sets a, b and c of step S42 as an example, the relevance value of the second sample data corresponding to set a is the average of the local relevance values of a with b and of a with c, which is 1; similarly, the relevance values of the second sample data corresponding to sets b and c are 1.5 and 0.5 respectively.
In this embodiment, the server side performs word segmentation on the second sample data, determines the local relevance values between items of second sample data from the intersections between the segmented word sets, and averages the local relevance values to obtain the relevance value of each item of second sample data, so that the relevance value more accurately reflects the degree of association between the second sample data.
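Steps S41 to S43 could be sketched as follows; the regex tokenizer is again an assumption, and the local relevance value counts shared unique phrases via set intersection, as in the worked example above.

```python
import re

def relevance_values(second_sample_texts):
    """For each text, average its local relevance (shared-phrase count) with all other texts."""
    # S41: one segmented word set per item of second sample data.
    word_sets = [set(re.findall(r"\w+", text.lower())) for text in second_sample_texts]
    n = len(word_sets)
    values = []
    for i in range(n):
        # S42: local relevance value with each of the other N-1 items = size of the set intersection.
        local_values = [len(word_sets[i] & word_sets[j]) for j in range(n) if j != i]
        # S43: relevance value = average of the N-1 local relevance values.
        values.append(sum(local_values) / len(local_values) if local_values else 0.0)
    return values
```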
Further, in one embodiment, as shown in Fig. 5, step S5, i.e. selecting, as data to be labeled, the second sample data whose information entropy value exceeds the preset information entropy threshold and whose relevance value is below the preset relevance threshold, specifically includes the following steps:
S51: select, as candidate sample data, the second sample data whose information entropy value exceeds the preset information entropy threshold and whose relevance value is below the preset relevance threshold.
The server side screens again the second sample data that satisfy the specified conditions, which both reduces the number of training samples and picks out the sample data that ordinary classifiers find difficult to recognize. The specified conditions are that the information entropy value exceeds the preset information entropy threshold and the relevance value is below the preset relevance threshold.
S52: classify the candidate sample data with at least two preset sample classifiers to obtain classification results.
A preset sample classifier is a text classification model, for example the commonly used FastText or Text-CNN models. FastText is a word-vector and text classification tool open-sourced by Facebook whose typical application scenario is supervised text classification; it provides a simple and efficient method for text classification and representation learning, with performance comparable to deep learning methods and much faster speed. TextCNN is an algorithm that classifies text with convolutional neural networks; owing to its simple structure and good effect, it is widely used in the field of text classification.
Different preset sample classifiers may classify the same sample data differently, i.e. the same sample data may be assigned to different categories after being classified by different classification models such as FastText and Text-CNN. The classification results include the category to which each item of candidate sample data belongs.
S53: select, from the classification results, the candidate sample data that are assigned to different categories at the same time as the data to be labeled.
Candidate sample data that are assigned to different categories at the same time are those for which different preset classifiers give different recognition results. For example, an article may be recognized as "history" by FastText and as "literature and art" by Text-CNN, which indicates that the article is difficult to recognize or cannot simply be assigned to one category.
Specifically, the server side determines from the categories of the candidate sample data in the classification results whether an item belongs to different categories at the same time.
In this embodiment, the server side screens the second sample data that satisfy the specified conditions with different preset classifiers and selects the second sample data that are difficult to recognize as the data to be labeled. This removes sample data that are easily recognized, further reducing the number of training samples and the training time and improving training efficiency; at the same time, selecting sample data that are not easily recognized as the data to be labeled means that, once they are labeled, they help improve the precision of model training.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an execution order; the execution order of the processes should be determined by their functions and internal logic, and does not constitute any limitation on the implementation of the embodiments of the present invention.
In one embodiment, a text classification model training apparatus is provided, which corresponds to the text classification model training method in the above embodiments. As shown in Fig. 6, the text classification model training apparatus includes a preliminary model establishing module 61, a sample data obtaining module 62, an information entropy calculating module 63, a relevance calculating module 64, a to-be-labeled data selecting module 65, a labeling module 66, a first model training module 67 and a second model training module 68. The functional modules are described in detail as follows:
The preliminary model establishing module 61 is configured to obtain first sample data with category labels from the preset sample database and establish a preliminary classification model according to the first sample data.
The sample data obtaining module 62 is configured to obtain second sample data without category labels from the preset sample database.
The information entropy calculating module 63 is configured to calculate the information entropy of each item of second sample data to obtain an information entropy value for each item of second sample data.
The relevance calculating module 64 is configured to calculate a relevance value for each item of second sample data according to the number of identical phrases contained in the second sample data.
The to-be-labeled data selecting module 65 is configured to select, as data to be labeled, the second sample data whose information entropy value exceeds the preset information entropy threshold and whose relevance value is below the preset relevance threshold.
The labeling module 66 is configured to perform category labeling on the data to be labeled according to the preset category labeling method, to obtain third sample data.
The first model training module 67 is configured to train the preliminary classification model with the third sample data according to the preset model training method, to obtain an intermediate classification model.
The second model training module 68 is configured to train the intermediate classification model with the first sample data and the third sample data according to the preset model training method, to obtain a text classification model.
Further, the preliminary model establishing module 61 includes:
a selecting submodule 611, configured to select first sample data with category labels from the preset sample database according to the preset sample selection method;
a training submodule 612, configured to establish the preliminary classification model by combining the first sample data with category labels and the preset training algorithm.
Further, the information entropy calculating module 63 includes:
an information entropy calculating submodule 631, configured to calculate the information entropy of each item of second sample data according to the following formula:

H = -∑ p(x) log p(x)

where H represents the information entropy value of the second sample data, x represents a phrase in the second sample data, and p(x) represents the frequency with which the phrase occurs.
Further, the relevance calculating module 64 includes:
a word segmentation submodule 641, configured to perform word segmentation on each item of second sample data to obtain N segmented word sets, where N is the number of items of second sample data;
a local relevance calculating submodule 642, configured to, for each item of second sample data, calculate the intersection between the segmented word set of that item and the segmented word sets of the other N-1 items of second sample data, and determine, according to the number of phrases contained in each intersection, the local relevance values between that item and the other N-1 items, obtaining the N-1 local relevance values corresponding to that item;
a mean calculating submodule 643, configured to calculate the average of the N-1 local relevance values corresponding to each item of second sample data and take the average as the relevance value of that item of second sample data.
Further, the to-be-labeled data selecting module 65 includes:
a candidate sample selecting submodule 651, configured to select, as candidate sample data, the second sample data whose information entropy value exceeds the preset information entropy threshold and whose relevance value is below the preset relevance threshold;
a classification submodule 652, configured to classify the candidate sample data with at least two preset sample classifiers to obtain classification results;
a labeling submodule 653, configured to select, from the classification results, the candidate sample data that are assigned to different categories at the same time as the data to be labeled.
For specific limitations on the text classification model training apparatus, reference may be made to the above limitations on the text classification model training method, which are not repeated here. Each module of the above text classification model training apparatus may be implemented wholly or partly by software, hardware or a combination thereof. The above modules may be embedded in or independent of a processor of a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in Fig. 7. The computer device includes a processor, a memory, a network interface and a database connected through a system bus. The processor of the computer device provides computing and control capability. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device communicates with an external terminal through a network connection. The computer program, when executed by the processor, implements a text classification model training method.
In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When executing the computer program, the processor implements the steps of the text classification model training method in the above embodiments, such as steps S1 to S8 shown in Fig. 2; alternatively, when executing the computer program, the processor implements the functions of the modules/units of the text classification model training apparatus in the above embodiments, such as the functions of modules 61 to 68 shown in Fig. 6. To avoid repetition, details are not repeated here.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the text classification model training method in the above method embodiments; alternatively, when executed by a processor, the computer program implements the functions of the modules/units of the text classification model training apparatus in the above apparatus embodiments. To avoid repetition, details are not repeated here.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be completed by instructing related hardware through a computer program. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed it may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database or other media used in the embodiments provided by the present invention may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It will be clear to those skilled in the art that, for convenience and brevity of description, only the division of the above functional units and modules is used as an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, i.e. the internal structure of the apparatus can be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements of some of the technical features therein; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all be included within the protection scope of the present invention.
Claims (10)
1. A text classification model training method, characterized in that the text classification model training method comprises:
obtaining first sample data with category labels from a preset sample database, and establishing a preliminary classification model according to the first sample data;
obtaining second sample data without the category labels from the preset sample database;
calculating the information entropy of each item of the second sample data to obtain an information entropy value for each item of the second sample data;
calculating a relevance value for each item of the second sample data according to the number of identical phrases contained in the second sample data;
selecting, as data to be labeled, the second sample data whose information entropy value exceeds a preset information entropy threshold and whose relevance value is below a preset relevance threshold;
performing category labeling on the data to be labeled according to a preset category labeling method, to obtain third sample data;
training the preliminary classification model with the third sample data according to a preset model training method, to obtain an intermediate classification model;
training the intermediate classification model with the first sample data and the third sample data according to the preset model training method, to obtain a text classification model.
2. The text classification model training method according to claim 1, characterized in that the obtaining first sample data with category labels from a preset sample database and establishing a preliminary classification model according to the first sample data comprises:
selecting the first sample data with category labels from the preset sample database according to a preset sample selection method;
establishing the preliminary classification model by combining the first sample data with category labels and a preset training algorithm.
3. The text classification model training method according to claim 1, characterized in that the calculating the information entropy of each item of the second sample data to obtain an information entropy value for each item of the second sample data comprises:
calculating the information entropy of each item of the second sample data according to the following formula:

H = -∑ p(x) log p(x)

where H represents the information entropy value of the second sample data, x represents a phrase in the second sample data, and p(x) represents the frequency with which the phrase occurs.
4. The text classification model training method according to claim 1, characterized in that the calculating a relevance value for each item of the second sample data according to the number of identical phrases contained in the second sample data comprises:
performing word segmentation on each item of the second sample data to obtain N segmented word sets, where N is the number of items of the second sample data;
for each item of the second sample data, calculating the intersection between the segmented word set of that item and the segmented word sets of the other N-1 items of the second sample data, and determining, according to the number of phrases contained in each intersection, the local relevance values between that item and the other N-1 items of the second sample data, to obtain the N-1 local relevance values corresponding to that item;
calculating the average of the N-1 local relevance values corresponding to each item of the second sample data, and taking the average as the relevance value of each item of the second sample data.
5. The text classification model training method according to claim 1, characterized in that the selecting, as data to be labeled, the second sample data whose information entropy value exceeds a preset information entropy threshold and whose relevance value is below the preset relevance threshold comprises:
selecting, as candidate sample data, the second sample data whose information entropy value exceeds the preset information entropy threshold and whose relevance value is below the preset relevance threshold;
classifying the candidate sample data with at least two preset sample classifiers to obtain classification results;
selecting, from the classification results, the candidate sample data that are assigned to different categories at the same time as the data to be labeled.
6. A text classification model training apparatus, characterized in that the text classification model training apparatus comprises:
a preliminary model establishing module, configured to obtain first sample data with category labels from a preset sample database and establish a preliminary classification model according to the first sample data;
a sample data obtaining module, configured to obtain second sample data without the category labels from the preset sample database;
an information entropy calculating module, configured to calculate the information entropy of each item of the second sample data to obtain an information entropy value for each item of the second sample data;
a relevance calculating module, configured to calculate a relevance value for each item of the second sample data according to the number of identical phrases contained in the second sample data;
a to-be-labeled data selecting module, configured to select, as data to be labeled, the second sample data whose information entropy value exceeds a preset information entropy threshold and whose relevance value is below a preset relevance threshold;
a labeling module, configured to perform category labeling on the data to be labeled according to a preset category labeling method, to obtain third sample data;
a first model training module, configured to train the preliminary classification model with the third sample data according to a preset model training method, to obtain an intermediate classification model;
a second model training module, configured to train the intermediate classification model with the first sample data and the third sample data according to the preset model training method, to obtain a text classification model.
7. The text classification model training apparatus according to claim 6, characterized in that the preliminary model establishing module comprises:
a selecting submodule, configured to select the first sample data with category labels from the preset sample database according to a preset sample selection method;
a training submodule, configured to establish the preliminary classification model by combining the first sample data with category labels and a preset training algorithm.
8. The text classification model training apparatus according to claim 6, characterized in that the information entropy calculating module comprises:
an information entropy calculating submodule, configured to calculate the information entropy of each item of the second sample data according to the following formula:

H = -∑ p(x) log p(x)

where H represents the information entropy value of the second sample data, x represents a phrase in the second sample data, and p(x) represents the frequency with which the phrase occurs.
9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the text classification model training method according to any one of claims 1 to 5.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the text classification model training method according to any one of claims 1 to 5.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910247846.8A CN110110080A (en) | 2019-03-29 | 2019-03-29 | Textual classification model training method, device, computer equipment and storage medium |
PCT/CN2019/117095 WO2020199591A1 (en) | 2019-03-29 | 2019-11-11 | Text categorization model training method, apparatus, computer device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910247846.8A CN110110080A (en) | 2019-03-29 | 2019-03-29 | Textual classification model training method, device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110110080A true CN110110080A (en) | 2019-08-09 |
Family
ID=67484695
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910247846.8A Pending CN110110080A (en) | 2019-03-29 | 2019-03-29 | Textual classification model training method, device, computer equipment and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110110080A (en) |
WO (1) | WO2020199591A1 (en) |
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111026851A (en) * | 2019-10-18 | 2020-04-17 | 平安科技(深圳)有限公司 | Model prediction capability optimization method, device, equipment and readable storage medium |
CN111081221A (en) * | 2019-12-23 | 2020-04-28 | 合肥讯飞数码科技有限公司 | Training data selection method, device, electronic device and computer storage medium |
CN111143568A (en) * | 2019-12-31 | 2020-05-12 | 郑州工程技术学院 | Buffering method, device, equipment and storage medium for paper classification |
CN111159396A (en) * | 2019-12-04 | 2020-05-15 | 中国电子科技集团公司第三十研究所 | A method for establishing a text data classification and grading model for data sharing and exchange |
CN111368515A (en) * | 2020-03-02 | 2020-07-03 | 中国农业科学院农业信息研究所 | Method and system for generating industry dynamic interactive report based on PDF document fragmentation |
CN111382268A (en) * | 2020-02-25 | 2020-07-07 | 北京小米松果电子有限公司 | Text training data processing method and device and storage medium |
WO2020199591A1 (en) * | 2019-03-29 | 2020-10-08 | 平安科技(深圳)有限公司 | Text categorization model training method, apparatus, computer device, and storage medium |
CN111767400A (en) * | 2020-06-30 | 2020-10-13 | 平安国际智慧城市科技股份有限公司 | Training method and device of text classification model, computer equipment and storage medium |
CN111881295A (en) * | 2020-07-31 | 2020-11-03 | 中国光大银行股份有限公司 | Text classification model training method and device and text labeling method and device |
CN111914061A (en) * | 2020-07-13 | 2020-11-10 | 上海乐言信息科技有限公司 | Radius-based uncertainty sampling method and system for text classification active learning |
CN112036166A (en) * | 2020-07-22 | 2020-12-04 | 大箴(杭州)科技有限公司 | Data labeling method and device, storage medium and computer equipment |
CN112069293A (en) * | 2020-09-14 | 2020-12-11 | 上海明略人工智能(集团)有限公司 | Data annotation method and device, electronic equipment and computer readable medium |
CN112434736A (en) * | 2020-11-24 | 2021-03-02 | 成都潜在人工智能科技有限公司 | Deep active learning text classification method based on pre-training model |
CN112633344A (en) * | 2020-12-16 | 2021-04-09 | 中国平安财产保险股份有限公司 | Quality inspection model training method, quality inspection model training device, quality inspection model training equipment and readable storage medium |
CN112651211A (en) * | 2020-12-11 | 2021-04-13 | 北京大米科技有限公司 | Label information determination method, device, server and storage medium |
CN112711940A (en) * | 2019-10-08 | 2021-04-27 | 台达电子工业股份有限公司 | Information processing system, information processing method, and non-transitory computer-readable recording medium |
WO2021139279A1 (en) * | 2020-07-30 | 2021-07-15 | 平安科技(深圳)有限公司 | Data processing method and apparatus based on classification model, and electronic device and medium |
CN113239128A (en) * | 2021-06-01 | 2021-08-10 | 平安科技(深圳)有限公司 | Data pair classification method, device, equipment and storage medium based on implicit characteristics |
CN113590822A (en) * | 2021-07-28 | 2021-11-02 | 北京百度网讯科技有限公司 | Document title processing method, device, equipment, storage medium and program product |
CN113761034A (en) * | 2021-09-15 | 2021-12-07 | 深圳信息职业技术学院 | A data processing method and device thereof |
CN114117043A (en) * | 2021-11-24 | 2022-03-01 | 阿里巴巴(中国)有限公司 | Model training method and device and computer storage medium |
CN114417882A (en) * | 2022-01-04 | 2022-04-29 | 马上消费金融股份有限公司 | Data labeling method and device, electronic equipment and readable storage medium |
CN114548074A (en) * | 2022-02-15 | 2022-05-27 | 中电云脑(天津)科技有限公司 | Method and device for determining medical data to be annotated |
CN114637843A (en) * | 2020-12-15 | 2022-06-17 | 阿里巴巴集团控股有限公司 | Data processing method and device, electronic equipment and storage medium |
WO2024021526A1 (en) * | 2022-07-29 | 2024-02-01 | 上海智臻智能网络科技股份有限公司 | Method and apparatus for generating training samples, device, and storage medium |
CN119513321A (en) * | 2025-01-16 | 2025-02-25 | 煤炭科学研究总院有限公司 | Classification method of coal industry vocabulary |
Families Citing this family (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112348203A (en) * | 2020-11-05 | 2021-02-09 | 中国平安人寿保险股份有限公司 | Model training method, device, terminal device and storage medium |
CN112528022A (en) * | 2020-12-09 | 2021-03-19 | 广州摩翼信息科技有限公司 | Method for extracting characteristic words corresponding to theme categories and identifying text theme categories |
CN112632219B (en) * | 2020-12-17 | 2022-10-04 | 中国联合网络通信集团有限公司 | Method and device for intercepting spam short messages |
CN112651447B (en) * | 2020-12-29 | 2023-09-26 | 广东电网有限责任公司电力调度控制中心 | Ontology-based resource classification labeling method and system |
CN112541595B (en) * | 2020-12-30 | 2024-12-06 | 中国建设银行股份有限公司 | Model building method and device, storage medium and electronic device |
CN112446441B (en) * | 2021-02-01 | 2021-08-20 | 北京世纪好未来教育科技有限公司 | Model training data screening method, device, equipment and storage medium |
CN113793191B (en) * | 2021-02-09 | 2024-05-24 | 京东科技控股股份有限公司 | Commodity matching method and device and electronic equipment |
CN113704393B (en) * | 2021-04-13 | 2025-07-15 | 腾讯科技(深圳)有限公司 | Keyword extraction method, device, equipment and medium |
CN113190154B (en) * | 2021-04-29 | 2023-10-13 | 北京百度网讯科技有限公司 | Model training and entry classification methods, apparatuses, devices, storage medium and program |
CN113343695B (en) * | 2021-05-27 | 2022-02-01 | 镁佳(北京)科技有限公司 | Text labeling noise detection method and device, storage medium and electronic equipment |
CN114169539A (en) * | 2022-02-11 | 2022-03-11 | 阿里巴巴(中国)有限公司 | Model training method, training device, electronic device, and computer-readable medium |
CN114648980B (en) * | 2022-03-03 | 2025-02-28 | 科大讯飞股份有限公司 | Data classification and speech recognition method, device, electronic device and storage medium |
CN115129872A (en) * | 2022-06-21 | 2022-09-30 | 浙江大学 | Active learning-based small sample text labeling method and device |
CN115994225B (en) * | 2023-03-20 | 2023-06-27 | 北京百分点科技集团股份有限公司 | Text classification method, device, storage medium and electronic equipment |
CN116304058B (en) * | 2023-04-27 | 2023-08-08 | 云账户技术(天津)有限公司 | Method and device for identifying negative information of enterprise, electronic equipment and storage medium |
CN117783377B (en) * | 2024-02-27 | 2024-08-30 | 南昌怀特科技有限公司 | Component analysis method and system for tooth paste production |
CN117973522B (en) * | 2024-04-02 | 2024-06-04 | 成都派沃特科技股份有限公司 | Knowledge data training technology-based application model construction method and system |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060095521A1 (en) * | 2004-11-04 | 2006-05-04 | Seth Patinkin | Method, apparatus, and system for clustering and classification |
US20140172754A1 (en) * | 2012-12-14 | 2014-06-19 | International Business Machines Corporation | Semi-supervised data integration model for named entity classification |
CN106131613A (en) * | 2016-07-26 | 2016-11-16 | 深圳Tcl新技术有限公司 | Intelligent television video sharing method and video sharing system |
CN107025218A (en) * | 2017-04-07 | 2017-08-08 | 腾讯科技(深圳)有限公司 | A kind of text De-weight method and device |
CN107506793A (en) * | 2017-08-21 | 2017-12-22 | 中国科学院重庆绿色智能技术研究院 | Clothes recognition methods and system based on weak mark image |
CN108304427A (en) * | 2017-04-28 | 2018-07-20 | 腾讯科技(深圳)有限公司 | A kind of user visitor's heap sort method and apparatus |
CN108665158A (en) * | 2018-05-08 | 2018-10-16 | 阿里巴巴集团控股有限公司 | A kind of method, apparatus and equipment of trained air control model |
CN109101997A (en) * | 2018-07-11 | 2018-12-28 | 浙江理工大学 | A kind of source tracing method sampling limited Active Learning |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102063642A (en) * | 2010-12-30 | 2011-05-18 | 上海电机学院 | Selection method for fuzzy neural network sample on basis of active learning |
US11100420B2 (en) * | 2014-06-30 | 2021-08-24 | Amazon Technologies, Inc. | Input processing for machine learning |
CN104166706B (en) * | 2014-08-08 | 2017-11-03 | 苏州大学 | Multi-tag grader construction method based on cost-sensitive Active Learning |
CN108090231A (en) * | 2018-01-12 | 2018-05-29 | 北京理工大学 | A kind of topic model optimization method based on comentropy |
CN110110080A (en) * | 2019-03-29 | 2019-08-09 | 平安科技(深圳)有限公司 | Textual classification model training method, device, computer equipment and storage medium |
2019
- 2019-03-29 CN CN201910247846.8A patent/CN110110080A/en active Pending
- 2019-11-11 WO PCT/CN2019/117095 patent/WO2020199591A1/en active Application Filing
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060095521A1 (en) * | 2004-11-04 | 2006-05-04 | Seth Patinkin | Method, apparatus, and system for clustering and classification |
US20140172754A1 (en) * | 2012-12-14 | 2014-06-19 | International Business Machines Corporation | Semi-supervised data integration model for named entity classification |
CN106131613A (en) * | 2016-07-26 | 2016-11-16 | 深圳Tcl新技术有限公司 | Intelligent television video sharing method and video sharing system |
CN107025218A (en) * | 2017-04-07 | 2017-08-08 | 腾讯科技(深圳)有限公司 | A kind of text De-weight method and device |
CN108304427A (en) * | 2017-04-28 | 2018-07-20 | 腾讯科技(深圳)有限公司 | A kind of user visitor's heap sort method and apparatus |
CN107506793A (en) * | 2017-08-21 | 2017-12-22 | 中国科学院重庆绿色智能技术研究院 | Clothes recognition methods and system based on weak mark image |
CN108665158A (en) * | 2018-05-08 | 2018-10-16 | 阿里巴巴集团控股有限公司 | A kind of method, apparatus and equipment of trained air control model |
CN109101997A (en) * | 2018-07-11 | 2018-12-28 | 浙江理工大学 | A kind of source tracing method sampling limited Active Learning |
Non-Patent Citations (4)
Title |
---|
关雅夫: "Research on Sentiment Analysis Methods for Microblogs Based on Active Learning", China Master's Theses Full-text Database (Information Science and Technology), 15 October 2017 (2017-10-15), pages 138-297 *
胡正平; 高文涛; 万春艳: "Research on a Controllable Active Learning Algorithm Combining Sample Uncertainty and Representativeness", Journal of Yanshan University, no. 04, pages 341-346 *
龙军; 殷建平; 祝恩; 赵文涛: "A Cost-Sensitive Active Learning Algorithm for Intrusion Detection", Journal of Nanjing University (Natural Science), no. 05, pages 527-535 *
Cited By (39)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020199591A1 (en) * | 2019-03-29 | 2020-10-08 | 平安科技(深圳)有限公司 | Text categorization model training method, apparatus, computer device, and storage medium |
CN112711940B (en) * | 2019-10-08 | 2024-06-11 | 台达电子工业股份有限公司 | Information processing system, information processing method, and non-transitory computer-readable recording medium |
CN112711940A (en) * | 2019-10-08 | 2021-04-27 | 台达电子工业股份有限公司 | Information processing system, information processing method, and non-transitory computer-readable recording medium |
CN111026851A (en) * | 2019-10-18 | 2020-04-17 | 平安科技(深圳)有限公司 | Model prediction capability optimization method, device, equipment and readable storage medium |
WO2021073408A1 (en) * | 2019-10-18 | 2021-04-22 | 平安科技(深圳)有限公司 | Model prediction capability optimization method, apparatus and device, and readable storage medium |
CN111026851B (en) * | 2019-10-18 | 2023-09-15 | 平安科技(深圳)有限公司 | Model prediction capability optimization method, device, equipment and readable storage medium |
CN111159396A (en) * | 2019-12-04 | 2020-05-15 | 中国电子科技集团公司第三十研究所 | A method for establishing a text data classification and grading model for data sharing and exchange |
CN111159396B (en) * | 2019-12-04 | 2022-04-22 | 中国电子科技集团公司第三十研究所 | A method for establishing a text data classification and grading model for data sharing and exchange |
CN111081221A (en) * | 2019-12-23 | 2020-04-28 | 合肥讯飞数码科技有限公司 | Training data selection method, device, electronic device and computer storage medium |
CN111081221B (en) * | 2019-12-23 | 2022-10-14 | 合肥讯飞数码科技有限公司 | Training data selection method and device, electronic equipment and computer storage medium |
CN111143568A (en) * | 2019-12-31 | 2020-05-12 | 郑州工程技术学院 | Buffering method, device, equipment and storage medium for paper classification |
CN111382268A (en) * | 2020-02-25 | 2020-07-07 | 北京小米松果电子有限公司 | Text training data processing method and device and storage medium |
CN111382268B (en) * | 2020-02-25 | 2023-12-01 | 北京小米松果电子有限公司 | Text training data processing method, device and storage medium |
CN111368515A (en) * | 2020-03-02 | 2020-07-03 | 中国农业科学院农业信息研究所 | Method and system for generating industry dynamic interactive report based on PDF document fragmentation |
CN111368515B (en) * | 2020-03-02 | 2021-01-26 | 中国农业科学院农业信息研究所 | Method and system for generating industry dynamic interactive report based on PDF document fragmentation |
CN111767400B (en) * | 2020-06-30 | 2024-04-26 | 平安国际智慧城市科技股份有限公司 | Training method and device for text classification model, computer equipment and storage medium |
CN111767400A (en) * | 2020-06-30 | 2020-10-13 | 平安国际智慧城市科技股份有限公司 | Training method and device of text classification model, computer equipment and storage medium |
CN111914061A (en) * | 2020-07-13 | 2020-11-10 | 上海乐言信息科技有限公司 | Radius-based uncertainty sampling method and system for text classification active learning |
CN111914061B (en) * | 2020-07-13 | 2021-04-16 | 上海乐言科技股份有限公司 | Radius-based uncertainty sampling method and system for text classification active learning |
CN112036166A (en) * | 2020-07-22 | 2020-12-04 | 大箴(杭州)科技有限公司 | Data labeling method and device, storage medium and computer equipment |
WO2021139279A1 (en) * | 2020-07-30 | 2021-07-15 | 平安科技(深圳)有限公司 | Data processing method and apparatus based on classification model, and electronic device and medium |
CN111881295A (en) * | 2020-07-31 | 2020-11-03 | 中国光大银行股份有限公司 | Text classification model training method and device and text labeling method and device |
CN112069293B (en) * | 2020-09-14 | 2024-04-19 | 上海明略人工智能(集团)有限公司 | Data labeling method, device, electronic equipment and computer readable medium |
CN112069293A (en) * | 2020-09-14 | 2020-12-11 | 上海明略人工智能(集团)有限公司 | Data annotation method and device, electronic equipment and computer readable medium |
CN112434736A (en) * | 2020-11-24 | 2021-03-02 | 成都潜在人工智能科技有限公司 | Deep active learning text classification method based on pre-training model |
CN112651211A (en) * | 2020-12-11 | 2021-04-13 | 北京大米科技有限公司 | Label information determination method, device, server and storage medium |
CN114637843A (en) * | 2020-12-15 | 2022-06-17 | 阿里巴巴集团控股有限公司 | Data processing method and device, electronic equipment and storage medium |
CN112633344A (en) * | 2020-12-16 | 2021-04-09 | 中国平安财产保险股份有限公司 | Quality inspection model training method, quality inspection model training device, quality inspection model training equipment and readable storage medium |
CN113239128B (en) * | 2021-06-01 | 2022-03-18 | 平安科技(深圳)有限公司 | Data pair classification method, device, equipment and storage medium based on implicit characteristics |
CN113239128A (en) * | 2021-06-01 | 2021-08-10 | 平安科技(深圳)有限公司 | Data pair classification method, device, equipment and storage medium based on implicit characteristics |
CN113590822A (en) * | 2021-07-28 | 2021-11-02 | 北京百度网讯科技有限公司 | Document title processing method, device, equipment, storage medium and program product |
CN113590822B (en) * | 2021-07-28 | 2023-08-08 | 北京百度网讯科技有限公司 | Processing method, device, device, storage medium and program product of document title |
CN113761034B (en) * | 2021-09-15 | 2022-06-17 | 深圳信息职业技术学院 | A data processing method and device thereof |
CN113761034A (en) * | 2021-09-15 | 2021-12-07 | 深圳信息职业技术学院 | A data processing method and device thereof |
CN114117043A (en) * | 2021-11-24 | 2022-03-01 | 阿里巴巴(中国)有限公司 | Model training method and device and computer storage medium |
CN114417882A (en) * | 2022-01-04 | 2022-04-29 | 马上消费金融股份有限公司 | Data labeling method and device, electronic equipment and readable storage medium |
CN114548074A (en) * | 2022-02-15 | 2022-05-27 | 中电云脑(天津)科技有限公司 | Method and device for determining medical data to be annotated |
WO2024021526A1 (en) * | 2022-07-29 | 2024-02-01 | 上海智臻智能网络科技股份有限公司 | Method and apparatus for generating training samples, device, and storage medium |
CN119513321A (en) * | 2025-01-16 | 2025-02-25 | 煤炭科学研究总院有限公司 | Classification method of coal industry vocabulary |
Also Published As
Publication number | Publication date |
---|---|
WO2020199591A1 (en) | 2020-10-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110110080A (en) | Textual classification model training method, device, computer equipment and storage medium | |
Kukačka et al. | Regularization for deep learning: A taxonomy | |
CN115311687B (en) | Natural language pedestrian retrieval method and system with joint token and feature alignment | |
CN109840322B (en) | Complete shape filling type reading understanding analysis model and method based on reinforcement learning | |
CN110969020A (en) | CNN and attention mechanism-based Chinese named entity identification method, system and medium | |
CN104966105A (en) | Robust machine error retrieving method and system | |
CN114220086B (en) | A cost-effective scene text detection method and system | |
Wu et al. | Optimized deep learning framework for water distribution data-driven modeling | |
CN111309918A (en) | Multi-label text classification method based on label relevance | |
CN110188195A (en) | A kind of text intension recognizing method, device and equipment based on deep learning | |
CN117409206B (en) | Small sample image segmentation method based on self-adaptive prototype aggregation network | |
CN113836896A (en) | Patent text abstract generation method and device based on deep learning | |
CN111145914A (en) | Method and device for determining lung cancer clinical disease library text entity | |
Liu et al. | Hybrid neural network text classification combining tcn and gru | |
Chatterjee et al. | ImageNet classification using wordnet hierarchy | |
Feng et al. | Enhancing fitness evaluation in genetic algorithm-based architecture search for AI-Aided financial regulation | |
Safdari et al. | A hierarchical feature learning for isolated Farsi handwritten digit recognition using sparse autoencoder | |
CN115223189A (en) | Method and system for recognizing secondary drawings of substation, and retrieval method and system | |
CN119003769A (en) | Netizen view analysis method based on double large models | |
Wang et al. | Efficient deep convolutional model compression with an active stepwise pruning approach | |
Passalis et al. | Deep temporal logistic bag-of-features for forecasting high frequency limit order book time series | |
CN118606469A (en) | Multi-classification prediction method for intangible cultural heritage text based on multi-head attention and semantic features | |
Asfaw | Deep learning hyperparameter’s impact on potato disease detection | |
Shanmugasundaram et al. | Detection accuracy improvement on one-stage object detection using ap-loss-based ranking module and resnet-152 backbone | |
CN115455162B (en) | Answer sentence selection method and device based on hierarchical capsule and multi-view information fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190809 |