CN110110080A - Text classification model training method, device, computer equipment and storage medium - Google Patents
Text classification model training method, device, computer equipment and storage medium
- Publication number
- CN110110080A (application number CN201910247846.8A)
- Authority
- CN
- China
- Prior art keywords
- sample data
- sample
- data
- classification model
- default
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Creation or modification of classes or clusters
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/192—Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
- G06V30/194—References adjustable by an adaptive method, e.g. learning
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Multimedia (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a text classification model training method, apparatus, computer device and storage medium. The method comprises: obtaining, from a preset sample database, first sample data that carry category labels and second sample data that do not; establishing a preliminary classification model according to the first sample data; calculating an information entropy value and a relevance value for each item of second sample data; according to a preset category labeling method, performing category labeling on the second sample data whose information entropy value and relevance value satisfy preset conditions, to obtain third sample data; training the preliminary classification model with the third sample data to obtain an intermediate classification model; and training the intermediate classification model with the first sample data and the third sample data to obtain the text classification model. The technical solution of the present invention addresses the problems of large training-sample scale and long training time in text classification model training.
Description
Technical field
The present invention relates to the field of information processing, and more particularly to a text classification model training method and apparatus, a computer device, and a storage medium.
Background technique
Text classification is an important application direction in the field of natural language processing research. Text classification refers to using a classifier to classify data files that contain text, so as to determine the category to which each document belongs and allow users to conveniently retrieve the documents they need.
A classifier, also called a classification model, is obtained by training classification criteria or model parameters on a large amount of sample data that carry category labels. The trained classifier is used to identify text data of unknown category, thereby achieving automatic classification of large-scale text data. The quality of the classification model therefore directly determines the final classification effect.
However, in real large-scale text classification problems the sample data that carry category labels are very limited, and most samples carry no category label. As a result, when building a classification model one has to rely on manual labeling by domain experts. This approach consumes a great deal of manpower, money and time, and because the training samples are so large in scale, the training process also takes a long time.
Summary of the invention
Embodiments of the present invention provide a text classification model training method, apparatus, computer device and storage medium, to solve the problems of large training-sample scale and long training time in text classification model training.
A text classification model training method, comprising:
obtaining first sample data with category labels from a preset sample database, and establishing a preliminary classification model according to the first sample data;
obtaining second sample data without the category labels from the preset sample database;
calculating the information entropy of each item of second sample data to obtain an information entropy value for each item of second sample data;
calculating a relevance value for each item of second sample data according to the number of identical phrases contained in the second sample data;
selecting, as data to be labeled, the second sample data whose information entropy value exceeds a preset information entropy threshold and whose relevance value is below a preset relevance threshold;
performing category labeling on the data to be labeled according to a preset category labeling method, to obtain third sample data;
training the preliminary classification model with the third sample data according to a preset model training method, to obtain an intermediate classification model;
training the intermediate classification model with the first sample data and the third sample data according to the preset model training method, to obtain a text classification model.
A text classification model training apparatus, comprising:
a preliminary model establishing module, configured to obtain first sample data with category labels from a preset sample database and establish a preliminary classification model according to the first sample data;
a sample data obtaining module, configured to obtain second sample data without the category labels from the preset sample database;
an information entropy calculating module, configured to calculate the information entropy of each item of second sample data to obtain an information entropy value for each item of second sample data;
a relevance calculating module, configured to calculate a relevance value for each item of second sample data according to the number of identical phrases contained in the second sample data;
a to-be-labeled data selecting module, configured to select, as data to be labeled, the second sample data whose information entropy value exceeds a preset information entropy threshold and whose relevance value is below a preset relevance threshold;
a labeling module, configured to perform category labeling on the data to be labeled according to a preset category labeling method, to obtain third sample data;
a first model training module, configured to train the preliminary classification model with the third sample data according to a preset model training method, to obtain an intermediate classification model;
a second model training module, configured to train the intermediate classification model with the first sample data and the third sample data according to the preset model training method, to obtain a text classification model.
A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the above text classification model training method when executing the computer program.
A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the above text classification model training method.
In the above text classification model training method, apparatus, computer device and storage medium, first sample data with category labels are obtained from a preset sample database and a preliminary classification model is established according to them; that is, only a small portion of labeled sample data is used for training to obtain the preliminary classification model, which reduces the demand for labeled sample data and saves training cost. Second sample data without category labels are obtained from the preset sample database; the information entropy value and relevance value of the second sample data are calculated, and category labeling is performed on the second sample data whose information entropy value and relevance value satisfy preset conditions. The preliminary classification model is then trained with the labeled third sample data according to a preset model training method to obtain an intermediate classification model; in this way the characteristics of the third sample data (large information entropy, low mutual relevance, and the presence of category labels) are exploited to improve the classification precision of the preliminary classification model. Finally, the intermediate classification model is trained with the first sample data and the third sample data to obtain the text classification model, i.e. the final text classification model is obtained through step-by-step iterative optimization. A method is thus proposed for training a text classification model with only a small amount of labeled sample data, so that a classification model with good performance can be obtained by training on fewer samples, saving labor cost and improving training speed.
Detailed description of the invention
In order to explain the technical solutions in the embodiments of the present invention more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an application environment of the text classification model training method in an embodiment of the present invention;
Fig. 2 is a flowchart of the text classification model training method in an embodiment of the present invention;
Fig. 3 is a flowchart of step S1 of the text classification model training method in an embodiment of the present invention;
Fig. 4 is a flowchart of step S4 of the text classification model training method in an embodiment of the present invention;
Fig. 5 is a flowchart of step S5 of the text classification model training method in an embodiment of the present invention;
Fig. 6 is a schematic diagram of the text classification model training apparatus in an embodiment of the present invention;
Fig. 7 is a schematic diagram of the computer device in an embodiment of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The text classification model training method provided by the present invention can be applied in the application environment shown in Fig. 1, in which the server side is a computer device that performs text classification model training and can be a server or a server cluster; the preset sample database is the database that supplies training sample data and can be any of various relational or non-relational databases, such as MS-SQL, Oracle, MySQL, Sybase, DB2, Redis, MongoDB, HBase, etc.; the server side and the preset sample database are connected over a network, which can be wired or wireless. The text classification model training method provided by the embodiments of the present invention is applied at the server side.
In one embodiment, as shown in Fig. 2, a text classification model training method is provided, and its implementation flow includes the following steps:
S1: obtain first sample data with category labels from a preset sample database, and establish a preliminary classification model according to the first sample data.
The preset sample database is the database that supplies training sample data. It can be deployed locally at the server side or connected to the server side over a network.
The first sample data are text data that carry category labels. Text data include documents containing text information, such as text files, web pages, news articles and e-mail bodies on the Internet; a category label is a classification mark attached to a text data item that defines the category of that text data.
For example, if the category label of an article is "emotion", the content of the article is related to "emotion". It can be understood that category labels further include, but are not limited to, labels such as "popular science", "sports", "inspirational" and "poetry and prose" that indicate the category to which text data belong.
Specifically, in the preset sample database, category labels and text data are stored in association, and each text data item has a field indicating whether it carries a category label. The server side can obtain the text data that carry category labels as the first sample data through an SQL query statement.
The preliminary classification model is a classification tool constructed from the first sample data. The established preliminary classification model can perform coarse classification on sample data that carry category labels.
Specifically, the server side can perform feature analysis on the first sample data with category labels to obtain text feature information of the first sample data, and then store the category labels in association with the text feature information as the preliminary classification model. For example, the server side can perform word segmentation on the text of the first sample data and take the high-frequency segmented words as the text feature information. Word segmentation here means cutting the words out of the text during text information processing to obtain individual words; as a text processing technique, it is widely used in fields such as full-text search and text content mining.
Alternatively, the server side can obtain the preliminary classification model from the first sample data using a neural-network-based training method.
S2: obtain second sample data without category labels from the preset sample database.
The second sample data are text data without category labels. That is, compared with the first sample data, the second sample data carry no category label, and without manual labeling the server side does not know the text category or meaning of the second sample data.
Specifically, the server side can obtain the second sample data from the preset sample database through an SQL query statement.
S3: calculate the information entropy of each item of second sample data to obtain an information entropy value for each item of second sample data.
Information entropy is the concept proposed by Shannon to measure the amount of information; it is a quantitative metric of how much information is present. The larger the information entropy, the richer the information contained in the sample data, and the greater the uncertainty of the information it represents. The information entropy value is the concrete quantified value of the information entropy.
The server side can determine the information entropy value according to the amount of text contained in the second sample data, for example by using the word count of the text in the second sample data as the information entropy value. It can be understood that the amount of information contained in a 5000-word article is greater than that contained in an e-mail body of only 20 words. Specifically, the server side counts the number of words of text in each item of second sample data and takes this word count as the information entropy value of that item.
Alternatively, the server side can take the number of segmented words remaining in the second sample data after removing modal particles as the information entropy value of the second sample data. Modal particles include, but are not limited to, filler words such as "uh". Specifically, the server side performs word segmentation on the second sample data to obtain a segmented word set, removes the modal particles from that set, and takes the number of remaining segmented words as the information entropy value of the second sample data.
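The two word-count variants just described can be illustrated with a short Python sketch. This is only an illustration under assumptions not stated in the text: a simple regex tokenizer stands in for the unspecified word segmenter, and the list of modal particles is a hypothetical example.

```python
import re

# Hypothetical list of modal particles / filler words to remove (example only).
MODAL_PARTICLES = {"uh", "um", "ah", "oh"}

def entropy_value_by_word_count(text: str) -> int:
    """Variant 1: use the raw word count of the text as the information entropy value."""
    return len(re.findall(r"\w+", text))

def entropy_value_without_particles(text: str) -> int:
    """Variant 2: segment the text, drop modal particles, and count the remaining words."""
    tokens = re.findall(r"\w+", text.lower())
    return sum(1 for token in tokens if token not in MODAL_PARTICLES)
```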
S4: calculate a relevance value for each item of second sample data according to the number of identical phrases contained in the second sample data.
The relevance value of the second sample data reflects whether the information provided by the second sample data is repetitive and redundant. The higher the relevance value, the more repetitive and redundant the information that the second sample data provide with respect to each other; the lower the relevance value, the greater the difference between the information that the second sample data provide.
The server side determines the relevance value according to the number of identical phrases contained in the second sample data.
For example, suppose second sample data A contains the phrases "culture", "civilization" and "history"; second sample data B contains the phrases "culture", "country" and "history"; and second sample data C contains the phrases "travel", "mountains and rivers" and "country". A and B share the phrases "culture" and "history", so the relevance value between A and B is 2; likewise, the relevance value between A and C is 0, and between B and C it is 1. The relevance value of each item of second sample data can then be determined as the sum of its relevance values with every other item of second sample data, i.e. the relevance value of A is 2, of B is 3, and of C is 1.
S5: select, as data to be labeled, the second sample data whose information entropy value exceeds a preset information entropy threshold and whose relevance value is below a preset relevance threshold.
The preset information entropy threshold and the preset relevance threshold are the conditions for screening the second sample data that carry no category label. The data to be labeled are the data obtained after screening the second sample data according to the preset information entropy threshold and the preset relevance threshold.
Second sample data whose information entropy value exceeds the preset information entropy threshold and whose relevance value is below the preset relevance threshold represent content with high uncertainty and large differences between information items, and are therefore the preferred data for training the model.
Specifically, if the preset information entropy threshold is 1000 and the preset relevance threshold is 100, the server side selects, according to the information entropy value and relevance value of each item of second sample data, the second sample data whose information entropy value is greater than 1000 and whose relevance value is below 100 as the data to be labeled.
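As a minimal sketch of this selection rule (the thresholds 1000 and 100 are the example values from the text; the dictionary keys used to carry each sample's entropy and relevance values are assumptions made for illustration):

```python
INFO_ENTROPY_THRESHOLD = 1000  # preset information entropy threshold (example value from the text)
RELEVANCE_THRESHOLD = 100      # preset relevance threshold (example value from the text)

def select_data_to_label(second_samples):
    """second_samples: iterable of dicts with 'entropy' and 'relevance' keys (assumed layout).
    Returns the items whose entropy exceeds the threshold and whose relevance is below it."""
    return [
        sample for sample in second_samples
        if sample["entropy"] > INFO_ENTROPY_THRESHOLD
        and sample["relevance"] < RELEVANCE_THRESHOLD
    ]
```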
S6: perform category labeling on the data to be labeled according to a preset category labeling method, to obtain third sample data.
Category labeling is the process of labeling second sample data that carry no category label so that they carry the corresponding category labels. For example, category labeling of an article attaches labels such as "novel" or "suspense" that reflect its subject matter. The data obtained after category labeling are the third sample data.
The preset category labeling method means that the server side can perform category labeling on the second sample data in any of several ways.
For example, the server side can extract keywords from the second sample data, e.g. take the five words with the highest word frequency as keywords; then compare the keywords with the target keywords in a preset category label dictionary for consistency, and if a keyword matches a target keyword, label the second sample data with that target keyword, thereby obtaining the third sample data.
Alternatively, the server side can directly call a third-party expert system for labeling; for example, it can submit the second sample data through an API (Application Programming Interface) provided by the third-party expert system and obtain the category labels corresponding to the second sample data, thereby obtaining the third sample data.
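A minimal sketch of the keyword-dictionary labeling variant described above; the top-five-by-frequency rule follows the example in the text, while the tokenizer, dictionary contents and function name are illustrative assumptions.

```python
import re
from collections import Counter

# Hypothetical category label dictionary: target keyword -> category label.
CATEGORY_DICTIONARY = {
    "bank": "finance",
    "lending": "finance",
    "travel": "tourism",
}

def label_by_keywords(text: str) -> set:
    """Take the five most frequent words as keywords, compare them against the
    category label dictionary, and return the matched category labels."""
    tokens = re.findall(r"\w+", text.lower())
    keywords = [word for word, _ in Counter(tokens).most_common(5)]
    return {CATEGORY_DICTIONARY[k] for k in keywords if k in CATEGORY_DICTIONARY}
```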
S7: train the preliminary classification model with the third sample data according to a preset model training method, to obtain an intermediate classification model.
The intermediate classification model is the classification model obtained after training the preliminary classification model with the third sample data. The difference between the intermediate classification model and the preliminary classification model is that the training set of the intermediate classification model consists of the third sample data, which carry category labels and whose information entropy values and relevance values satisfy the specified conditions.
The preset model training method means that the server side uses the third sample data as training data and trains the preliminary classification model with any of several frameworks or algorithms; for example, the server side can use existing machine learning frameworks or tools such as Scikit-Learn and TensorFlow.
Scikit-Learn, abbreviated sklearn, is an open-source, Python-based machine learning tool library with built-in classification algorithms such as naive Bayes, decision trees and random forests; with sklearn, common machine learning tasks such as data preprocessing, classification, regression, dimensionality reduction and model selection can be implemented.
TensorFlow is an open-source software library for numerical computation originally developed by researchers and engineers of the Google Brain team (part of Google's machine intelligence research organization); it can be used for research on machine learning and deep neural networks, but the generality of the system also makes it widely applicable to other computing fields.
Specifically, taking sklearn as an example, the server side uses the third sample data as input data and calls the built-in training method of sklearn until the model converges, at which point the intermediate classification model is obtained.
S8: train the intermediate classification model with the first sample data and the third sample data according to the preset model training method, to obtain a text classification model.
The text classification model is the final classification model obtained after retraining the intermediate classification model.
The preset model training method used by the server side is the same as in the training process of step S7 and is not repeated here. The difference from the training process of step S7 is that the intermediate classification model is trained with the first sample data and the third sample data at the same time, i.e. the intermediate classification model is iteratively trained with labeled sample data to improve its classification precision.
Specifically, taking sklearn as an example, the server side uses the first sample data and the third sample data as input data and calls the built-in training method of sklearn until the model converges, at which point the text classification model is obtained.
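The two sklearn training passes of steps S7 and S8 could be sketched roughly as follows, assuming a bag-of-words multinomial naive Bayes pipeline like the one mentioned later for the preliminary model; since sklearn's fit() retrains from scratch, each stage is expressed here as a refit on the corresponding training set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_text_classification_model(first_texts, first_labels, third_texts, third_labels):
    # Step S7: train on the newly labeled third sample data to obtain the intermediate model.
    intermediate_model = make_pipeline(CountVectorizer(), MultinomialNB())
    intermediate_model.fit(third_texts, third_labels)

    # Step S8: retrain on the first and third sample data together to obtain the final model.
    text_classification_model = make_pipeline(CountVectorizer(), MultinomialNB())
    text_classification_model.fit(first_texts + third_texts, first_labels + third_labels)

    return intermediate_model, text_classification_model
```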
In this embodiment, first sample data with category labels are obtained from a preset sample database and a preliminary classification model is established according to them, i.e. only a small portion of labeled sample data is used for training to obtain the preliminary classification model, which reduces the demand for labeled sample data and saves training cost; second sample data without category labels are obtained from the preset sample database; the information entropy value and relevance value of the second sample data are calculated, and category labeling is performed on the second sample data whose information entropy value and relevance value satisfy preset conditions; the preliminary classification model is trained with the labeled third sample data according to a preset model training method to obtain an intermediate classification model, so that the characteristics of the third sample data (large information entropy, low mutual relevance, and the presence of category labels) are exploited to improve the classification precision of the preliminary classification model; finally, the intermediate classification model is trained with the first sample data and the third sample data to obtain the text classification model, i.e. the final text classification model is obtained through step-by-step iterative optimization. A method is thus proposed for training a text classification model with only a small amount of labeled sample data, so that a classification model with good performance can be obtained by training on fewer samples, saving labor cost and improving training speed.
Further, in one embodiment, as shown in Fig. 3, step S1, i.e. obtaining first sample data with category labels from the preset sample database and establishing a preliminary classification model according to the first sample data, specifically includes the following steps:
S11: select first sample data with category labels from the preset sample database according to a preset sample selection method.
The preset sample selection method selects, from the preset sample database, a certain number of representative first sample data with category labels. The number should be as small as possible to reduce the demand for sample data, while the selected first samples should cover the text data categories as fully as possible. For example, when selecting news text data, categories such as "politics", "business", "sports" and "entertainment" should be covered as far as possible.
Specifically, if the preset sample database holds 100,000 articles of which 3,000 carry category labels, the server side can select 30% of the 3,000 articles, i.e. 900 articles, and from these 900 articles select 5 articles for each text data category as the first sample data.
S12: establish the preliminary classification model by combining the first sample data with category labels and a preset training algorithm.
The preset training algorithm includes the various algorithms used in machine learning to train models. The process in which the server side establishes the preliminary classification model with the labeled first sample data is a supervised learning process. Supervised learning trains an optimal model from existing training samples, i.e. known data and their corresponding outputs; this model belongs to some set of functions, and "optimal" means optimal under some evaluation criterion.
Specifically, taking the naive Bayes classification algorithm as an example, the server side can import the naive Bayes function from the sklearn library and then call MultinomialNB().fit() for training.
When training is completed, the server side can use the Joblib library to save the training result. Joblib is part of the SciPy ecosystem and provides tools for pipelining Python jobs. Alternatively, the server side can call functions of the pickle library to save the preliminary classification model.
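A minimal sketch of the training and saving calls mentioned above. MultinomialNB().fit() operates on numeric feature vectors, so a CountVectorizer step is assumed here in addition to the calls named in the text; the file name and placeholder data are also assumptions.

```python
import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# First sample data with category labels (toy placeholders; real data come from the sample database).
first_sample_texts = ["example labeled document one", "example labeled document two"]
first_sample_labels = ["science", "sports"]

# Turn the raw text into count features, then train the naive Bayes classifier.
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(first_sample_texts)
preliminary_model = MultinomialNB().fit(features, first_sample_labels)

# Persist the preliminary classification model (pickle could be used in the same way).
joblib.dump((vectorizer, preliminary_model), "preliminary_model.joblib")
```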
In this embodiment, the server side selects, according to the preset sample selection method, first sample data whose number is as small as possible and whose sample types cover the categories as widely as possible, and then establishes the preliminary classification model in combination with the preset training algorithm. The demand for sample data is thereby kept as small as possible, which further reduces training cost; at the same time, because the first sample data have wide coverage, the recognition range of the preliminary classification model is wider.
Further, in one embodiment, step S3, i.e. calculating the information entropy of each item of second sample data to obtain an information entropy value for each item of second sample data, specifically includes the following steps:
The information entropy of each item of second sample data is calculated according to the following formula:

H = -∑ p(x) log p(x)

where H represents the information entropy value of the second sample data, x represents a phrase in the second sample data, the sum is taken over all phrases x, and p(x) represents the frequency with which the phrase occurs.
A phrase in the second sample data is a word obtained after the server side performs word segmentation on the second sample data; the frequency of a phrase is the number of times the phrase occurs in the second sample data.
Specifically, the server side first performs word segmentation on each item of second sample data to obtain a segmented word set, and then substitutes the frequencies of all segmented words in the set into the formula to obtain the information entropy value of that item of second sample data.
In this embodiment, the server side calculates the information entropy of the second sample data according to the Shannon formula and the word frequencies of the phrases in the second sample data, so that the quantification of the amount of information contained in the sample data is more accurate.
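A small Python sketch of this entropy calculation, assuming a simple regex tokenizer (the text does not fix a particular word segmenter) and using relative word frequencies as p(x):

```python
import math
import re
from collections import Counter

def information_entropy(text: str) -> float:
    """Compute H = -sum(p(x) * log p(x)) over the segmented words of one text item."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((count / total) * math.log(count / total) for count in counts.values())
```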
Further, in one embodiment, as shown in Fig. 4, step S4, i.e. calculating the relevance value of each item of second sample data according to the number of identical phrases contained in the second sample data, specifically includes the following steps:
S41: perform word segmentation on each item of second sample data to obtain N segmented word sets, where N is the number of items of second sample data.
Specifically, the server side can perform word segmentation in various ways, for example by using regular expressions to cut the second sample data into a set of several segmented words, i.e. the segmented word set. It can be understood that the items of second sample data and the segmented word sets correspond one to one.
A regular expression (Regular Expression) is a processing method for retrieving or replacing target text within a context.
Specifically, the server side can use the regular expression engine built into Perl or Python to cut the second sample data; alternatively, the server side can use the grep tool provided in Unix systems to cut the second sample data and obtain the set of segmented words. grep (Globally search a Regular Expression and Print) is a powerful text search tool.
S42: for each item of second sample data, calculate the intersection between the segmented word set of that item and the segmented word sets of the other N-1 items of second sample data, and determine, according to the number of phrases contained in each intersection, the local relevance values between that item and the other N-1 items, obtaining the N-1 local relevance values corresponding to that item of second sample data.
To calculate the intersection between segmented word sets, different segmented word sets are compared; the intersection consists of the identical phrases.
The local relevance value represents the degree of relevance between one item of second sample data and another item of second sample data.
For example, if segmented word set a is {"people", "interest", "bank", "lending"} and segmented word set b is {"bank", "lending", "income"}, then the intersection of a and b is {"bank", "lending"}, the number of phrases in the intersection is 2, and the local relevance value of a and b is 2. Similarly, if segmented word set c is {"meeting", "report", "income"}, then the local relevance value of a and c is 0, and the local relevance value of b and c is 1.
S43: calculate the average of the N-1 local relevance values corresponding to each item of second sample data, and take the average as the relevance value of that item of second sample data.
Still taking segmented word sets a, b and c of step S42 as an example, the relevance value of the second sample data corresponding to set a is the average of the local relevance values of a with b and of a with c, which is 1; similarly, the relevance values of the second sample data corresponding to sets b and c are 1.5 and 0.5 respectively.
In this embodiment, the server side performs word segmentation on the second sample data, determines the local relevance values between items of second sample data from the intersections between the segmented word sets, and averages the local relevance values to obtain the relevance value of each item of second sample data, so that the relevance value more accurately reflects the degree of association between the second sample data.
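Steps S41 to S43 could be sketched as follows; the regex tokenizer is again an assumption, and the local relevance value counts shared unique phrases via set intersection, as in the worked example above.

```python
import re

def relevance_values(second_sample_texts):
    """For each text, average its local relevance (shared-phrase count) with all other texts."""
    # S41: one segmented word set per item of second sample data.
    word_sets = [set(re.findall(r"\w+", text.lower())) for text in second_sample_texts]
    n = len(word_sets)
    values = []
    for i in range(n):
        # S42: local relevance value with each of the other N-1 items = size of the set intersection.
        local_values = [len(word_sets[i] & word_sets[j]) for j in range(n) if j != i]
        # S43: relevance value = average of the N-1 local relevance values.
        values.append(sum(local_values) / len(local_values) if local_values else 0.0)
    return values
```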
Further, in one embodiment, as shown in Fig. 5, step S5, i.e. selecting, as data to be labeled, the second sample data whose information entropy value exceeds the preset information entropy threshold and whose relevance value is below the preset relevance threshold, specifically includes the following steps:
S51: select, as candidate sample data, the second sample data whose information entropy value exceeds the preset information entropy threshold and whose relevance value is below the preset relevance threshold.
The server side screens again the second sample data that satisfy the specified conditions, which both reduces the number of training samples and picks out the sample data that ordinary classifiers find difficult to recognize. The specified conditions are that the information entropy value exceeds the preset information entropy threshold and the relevance value is below the preset relevance threshold.
S52: classify the candidate sample data with at least two preset sample classifiers to obtain classification results.
A preset sample classifier is a text classification model, for example the commonly used FastText or Text-CNN models. FastText is a word-vector and text classification tool open-sourced by Facebook whose typical application scenario is supervised text classification; it provides a simple and efficient method for text classification and representation learning, with performance comparable to deep learning methods and much faster speed. TextCNN is an algorithm that classifies text with convolutional neural networks; owing to its simple structure and good effect, it is widely used in the field of text classification.
Different preset sample classifiers may classify the same sample data differently, i.e. the same sample data may be assigned to different categories after being classified by different classification models such as FastText and Text-CNN. The classification results include the category to which each item of candidate sample data belongs.
S53: select, from the classification results, the candidate sample data that are assigned to different categories at the same time as the data to be labeled.
Candidate sample data that are assigned to different categories at the same time are those for which different preset classifiers give different recognition results. For example, an article may be recognized as "history" by FastText and as "literature and art" by Text-CNN, which indicates that the article is difficult to recognize or cannot simply be assigned to one category.
Specifically, the server side determines from the categories of the candidate sample data in the classification results whether an item belongs to different categories at the same time.
In this embodiment, the server side screens the second sample data that satisfy the specified conditions with different preset classifiers and selects the second sample data that are difficult to recognize as the data to be labeled. This removes sample data that are easily recognized, further reducing the number of training samples and the training time and improving training efficiency; at the same time, selecting sample data that are not easily recognized as the data to be labeled means that, once they are labeled, they help improve the precision of model training.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an execution order; the execution order of the processes should be determined by their functions and internal logic, and does not constitute any limitation on the implementation of the embodiments of the present invention.
In one embodiment, a text classification model training apparatus is provided, which corresponds to the text classification model training method in the above embodiments. As shown in Fig. 6, the text classification model training apparatus includes a preliminary model establishing module 61, a sample data obtaining module 62, an information entropy calculating module 63, a relevance calculating module 64, a to-be-labeled data selecting module 65, a labeling module 66, a first model training module 67 and a second model training module 68. The functional modules are described in detail as follows:
The preliminary model establishing module 61 is configured to obtain first sample data with category labels from the preset sample database and establish a preliminary classification model according to the first sample data.
The sample data obtaining module 62 is configured to obtain second sample data without category labels from the preset sample database.
The information entropy calculating module 63 is configured to calculate the information entropy of each item of second sample data to obtain an information entropy value for each item of second sample data.
The relevance calculating module 64 is configured to calculate a relevance value for each item of second sample data according to the number of identical phrases contained in the second sample data.
The to-be-labeled data selecting module 65 is configured to select, as data to be labeled, the second sample data whose information entropy value exceeds the preset information entropy threshold and whose relevance value is below the preset relevance threshold.
The labeling module 66 is configured to perform category labeling on the data to be labeled according to the preset category labeling method, to obtain third sample data.
The first model training module 67 is configured to train the preliminary classification model with the third sample data according to the preset model training method, to obtain an intermediate classification model.
The second model training module 68 is configured to train the intermediate classification model with the first sample data and the third sample data according to the preset model training method, to obtain a text classification model.
Further, the preliminary model establishing module 61 includes:
a selecting submodule 611, configured to select first sample data with category labels from the preset sample database according to the preset sample selection method;
a training submodule 612, configured to establish the preliminary classification model by combining the first sample data with category labels and the preset training algorithm.
Further, the information entropy calculating module 63 includes:
an information entropy calculating submodule 631, configured to calculate the information entropy of each item of second sample data according to the following formula:

H = -∑ p(x) log p(x)

where H represents the information entropy value of the second sample data, x represents a phrase in the second sample data, and p(x) represents the frequency with which the phrase occurs.
Further, the relevance calculating module 64 includes:
a word segmentation submodule 641, configured to perform word segmentation on each item of second sample data to obtain N segmented word sets, where N is the number of items of second sample data;
a local relevance calculating submodule 642, configured to, for each item of second sample data, calculate the intersection between the segmented word set of that item and the segmented word sets of the other N-1 items of second sample data, and determine, according to the number of phrases contained in each intersection, the local relevance values between that item and the other N-1 items, obtaining the N-1 local relevance values corresponding to that item;
a mean calculating submodule 643, configured to calculate the average of the N-1 local relevance values corresponding to each item of second sample data and take the average as the relevance value of that item of second sample data.
Further, the to-be-labeled data selecting module 65 includes:
a candidate sample selecting submodule 651, configured to select, as candidate sample data, the second sample data whose information entropy value exceeds the preset information entropy threshold and whose relevance value is below the preset relevance threshold;
a classification submodule 652, configured to classify the candidate sample data with at least two preset sample classifiers to obtain classification results;
a labeling submodule 653, configured to select, from the classification results, the candidate sample data that are assigned to different categories at the same time as the data to be labeled.
For specific limitations on the text classification model training apparatus, reference may be made to the above limitations on the text classification model training method, which are not repeated here. Each module of the above text classification model training apparatus may be implemented wholly or partly by software, hardware or a combination thereof. The above modules may be embedded in or independent of a processor of a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in Fig. 7. The computer device includes a processor, a memory, a network interface and a database connected through a system bus. The processor of the computer device provides computing and control capability. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device communicates with an external terminal through a network connection. The computer program, when executed by the processor, implements a text classification model training method.
In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When executing the computer program, the processor implements the steps of the text classification model training method in the above embodiments, such as steps S1 to S8 shown in Fig. 2; alternatively, when executing the computer program, the processor implements the functions of the modules/units of the text classification model training apparatus in the above embodiments, such as the functions of modules 61 to 68 shown in Fig. 6. To avoid repetition, details are not repeated here.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the text classification model training method in the above method embodiments; alternatively, when executed by a processor, the computer program implements the functions of the modules/units of the text classification model training apparatus in the above apparatus embodiments. To avoid repetition, details are not repeated here.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be completed by instructing related hardware through a computer program. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed it may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database or other media used in the embodiments provided by the present invention may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It will be clear to those skilled in the art that, for convenience and brevity of description, only the division of the above functional units and modules is used as an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, i.e. the internal structure of the apparatus can be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements of some of the technical features therein; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all be included within the protection scope of the present invention.
Claims (10)
1. A text classification model training method, characterized in that the text classification model training method comprises:
obtaining first sample data with category labels from a preset sample database, and establishing a preliminary classification model according to the first sample data;
obtaining second sample data without the category labels from the preset sample database;
calculating the information entropy of each item of the second sample data to obtain an information entropy value for each item of the second sample data;
calculating a relevance value for each item of the second sample data according to the number of identical phrases contained in the second sample data;
selecting, as data to be labeled, the second sample data whose information entropy value exceeds a preset information entropy threshold and whose relevance value is below a preset relevance threshold;
performing category labeling on the data to be labeled according to a preset category labeling method, to obtain third sample data;
training the preliminary classification model with the third sample data according to a preset model training method, to obtain an intermediate classification model;
training the intermediate classification model with the first sample data and the third sample data according to the preset model training method, to obtain a text classification model.
2. The text classification model training method according to claim 1, characterized in that the obtaining first sample data with category labels from a preset sample database and establishing a preliminary classification model according to the first sample data comprises:
selecting the first sample data with category labels from the preset sample database according to a preset sample selection method;
establishing the preliminary classification model by combining the first sample data with category labels and a preset training algorithm.
3. The text classification model training method according to claim 1, characterized in that the calculating the information entropy of each item of the second sample data to obtain an information entropy value for each item of the second sample data comprises:
calculating the information entropy of each item of the second sample data according to the following formula:

H = -∑ p(x) log p(x)

where H represents the information entropy value of the second sample data, x represents a phrase in the second sample data, and p(x) represents the frequency with which the phrase occurs.
4. The text classification model training method according to claim 1, characterized in that the calculating a relevance value for each item of the second sample data according to the number of identical phrases contained in the second sample data comprises:
performing word segmentation on each item of the second sample data to obtain N segmented word sets, where N is the number of items of the second sample data;
for each item of the second sample data, calculating the intersection between the segmented word set of that item and the segmented word sets of the other N-1 items of the second sample data, and determining, according to the number of phrases contained in each intersection, the local relevance values between that item and the other N-1 items of the second sample data, to obtain the N-1 local relevance values corresponding to that item;
calculating the average of the N-1 local relevance values corresponding to each item of the second sample data, and taking the average as the relevance value of each item of the second sample data.
5. The text classification model training method according to claim 1, characterized in that the selecting, as data to be labeled, the second sample data whose information entropy value exceeds a preset information entropy threshold and whose relevance value is below the preset relevance threshold comprises:
selecting, as candidate sample data, the second sample data whose information entropy value exceeds the preset information entropy threshold and whose relevance value is below the preset relevance threshold;
classifying the candidate sample data with at least two preset sample classifiers to obtain classification results;
selecting, from the classification results, the candidate sample data that are assigned to different categories at the same time as the data to be labeled.
6. A text classification model training apparatus, characterized in that the text classification model training apparatus comprises:
a preliminary model establishing module, configured to obtain first sample data with category labels from a preset sample database and establish a preliminary classification model according to the first sample data;
a sample data obtaining module, configured to obtain second sample data without the category labels from the preset sample database;
an information entropy calculating module, configured to calculate the information entropy of each item of the second sample data to obtain an information entropy value for each item of the second sample data;
a relevance calculating module, configured to calculate a relevance value for each item of the second sample data according to the number of identical phrases contained in the second sample data;
a to-be-labeled data selecting module, configured to select, as data to be labeled, the second sample data whose information entropy value exceeds a preset information entropy threshold and whose relevance value is below a preset relevance threshold;
a labeling module, configured to perform category labeling on the data to be labeled according to a preset category labeling method, to obtain third sample data;
a first model training module, configured to train the preliminary classification model with the third sample data according to a preset model training method, to obtain an intermediate classification model;
a second model training module, configured to train the intermediate classification model with the first sample data and the third sample data according to the preset model training method, to obtain a text classification model.
7. The text classification model training apparatus according to claim 6, characterized in that the preliminary model establishing module comprises:
a selecting submodule, configured to select the first sample data with category labels from the preset sample database according to a preset sample selection method;
a training submodule, configured to establish the preliminary classification model by combining the first sample data with category labels and a preset training algorithm.
8. The text classification model training apparatus according to claim 6, characterized in that the information entropy calculating module comprises:
an information entropy calculating submodule, configured to calculate the information entropy of each item of the second sample data according to the following formula:

H = -∑ p(x) log p(x)

where H represents the information entropy value of the second sample data, x represents a phrase in the second sample data, and p(x) represents the frequency with which the phrase occurs.
9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the text classification model training method according to any one of claims 1 to 5.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the text classification model training method according to any one of claims 1 to 5.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910247846.8A CN110110080A (en) | 2019-03-29 | 2019-03-29 | Textual classification model training method, device, computer equipment and storage medium |
PCT/CN2019/117095 WO2020199591A1 (en) | 2019-03-29 | 2019-11-11 | Text categorization model training method, apparatus, computer device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910247846.8A CN110110080A (en) | 2019-03-29 | 2019-03-29 | Textual classification model training method, device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110110080A true CN110110080A (en) | 2019-08-09 |
Family
ID=67484695
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910247846.8A Pending CN110110080A (en) | 2019-03-29 | 2019-03-29 | Textual classification model training method, device, computer equipment and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110110080A (en) |
WO (1) | WO2020199591A1 (en) |
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111026851A (en) * | 2019-10-18 | 2020-04-17 | 平安科技(深圳)有限公司 | Model prediction capability optimization method, device, equipment and readable storage medium |
CN111081221A (en) * | 2019-12-23 | 2020-04-28 | 合肥讯飞数码科技有限公司 | Training data selection method, device, electronic device and computer storage medium |
CN111143568A (en) * | 2019-12-31 | 2020-05-12 | 郑州工程技术学院 | Buffering method, device, equipment and storage medium for paper classification |
CN111159396A (en) * | 2019-12-04 | 2020-05-15 | 中国电子科技集团公司第三十研究所 | A method for establishing a text data classification and grading model for data sharing and exchange |
CN111368515A (en) * | 2020-03-02 | 2020-07-03 | 中国农业科学院农业信息研究所 | Method and system for generating industry dynamic interactive report based on PDF document fragmentation |
CN111382268A (en) * | 2020-02-25 | 2020-07-07 | 北京小米松果电子有限公司 | Text training data processing method and device and storage medium |
WO2020199591A1 (en) * | 2019-03-29 | 2020-10-08 | 平安科技(深圳)有限公司 | Text categorization model training method, apparatus, computer device, and storage medium |
CN111767400A (en) * | 2020-06-30 | 2020-10-13 | 平安国际智慧城市科技股份有限公司 | Training method and device of text classification model, computer equipment and storage medium |
CN111881295A (en) * | 2020-07-31 | 2020-11-03 | 中国光大银行股份有限公司 | Text classification model training method and device and text labeling method and device |
CN111914061A (en) * | 2020-07-13 | 2020-11-10 | 上海乐言信息科技有限公司 | Radius-based uncertainty sampling method and system for text classification active learning |
CN112036166A (en) * | 2020-07-22 | 2020-12-04 | 大箴(杭州)科技有限公司 | Data labeling method and device, storage medium and computer equipment |
CN112069293A (en) * | 2020-09-14 | 2020-12-11 | 上海明略人工智能(集团)有限公司 | Data annotation method and device, electronic equipment and computer readable medium |
CN112434736A (en) * | 2020-11-24 | 2021-03-02 | 成都潜在人工智能科技有限公司 | Deep active learning text classification method based on pre-training model |
CN112633344A (en) * | 2020-12-16 | 2021-04-09 | 中国平安财产保险股份有限公司 | Quality inspection model training method, quality inspection model training device, quality inspection model training equipment and readable storage medium |
CN112651211A (en) * | 2020-12-11 | 2021-04-13 | 北京大米科技有限公司 | Label information determination method, device, server and storage medium |
CN112711940A (en) * | 2019-10-08 | 2021-04-27 | 台达电子工业股份有限公司 | Information processing system, information processing method, and non-transitory computer-readable recording medium |
WO2021139279A1 (en) * | 2020-07-30 | 2021-07-15 | 平安科技(深圳)有限公司 | Data processing method and apparatus based on classification model, and electronic device and medium |
CN113239128A (en) * | 2021-06-01 | 2021-08-10 | 平安科技(深圳)有限公司 | Data pair classification method, device, equipment and storage medium based on implicit characteristics |
CN113590822A (en) * | 2021-07-28 | 2021-11-02 | 北京百度网讯科技有限公司 | Document title processing method, device, equipment, storage medium and program product |
CN113761034A (en) * | 2021-09-15 | 2021-12-07 | 深圳信息职业技术学院 | A data processing method and device thereof |
CN114117043A (en) * | 2021-11-24 | 2022-03-01 | 阿里巴巴(中国)有限公司 | Model training method and device and computer storage medium |
CN114417882A (en) * | 2022-01-04 | 2022-04-29 | 马上消费金融股份有限公司 | Data labeling method and device, electronic equipment and readable storage medium |
CN114548074A (en) * | 2022-02-15 | 2022-05-27 | 中电云脑(天津)科技有限公司 | Method and device for determining medical data to be annotated |
CN114637843A (en) * | 2020-12-15 | 2022-06-17 | 阿里巴巴集团控股有限公司 | Data processing method and device, electronic equipment and storage medium |
WO2024021526A1 (en) * | 2022-07-29 | 2024-02-01 | 上海智臻智能网络科技股份有限公司 | Method and apparatus for generating training samples, device, and storage medium |
CN119513321A (en) * | 2025-01-16 | 2025-02-25 | 煤炭科学研究总院有限公司 | Classification method of coal industry vocabulary |
Families Citing this family (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112348203A (en) * | 2020-11-05 | 2021-02-09 | 中国平安人寿保险股份有限公司 | Model training method, device, terminal device and storage medium |
CN112528022A (en) * | 2020-12-09 | 2021-03-19 | 广州摩翼信息科技有限公司 | Method for extracting characteristic words corresponding to theme categories and identifying text theme categories |
CN112632219B (en) * | 2020-12-17 | 2022-10-04 | 中国联合网络通信集团有限公司 | Method and device for intercepting spam short messages |
CN112651447B (en) * | 2020-12-29 | 2023-09-26 | 广东电网有限责任公司电力调度控制中心 | Ontology-based resource classification labeling method and system |
CN112541595B (en) * | 2020-12-30 | 2024-12-06 | 中国建设银行股份有限公司 | Model building method and device, storage medium and electronic device |
CN112446441B (en) * | 2021-02-01 | 2021-08-20 | 北京世纪好未来教育科技有限公司 | Model training data screening method, device, equipment and storage medium |
CN113793191B (en) * | 2021-02-09 | 2024-05-24 | 京东科技控股股份有限公司 | Commodity matching method and device and electronic equipment |
CN113704393B (en) * | 2021-04-13 | 2025-07-15 | 腾讯科技(深圳)有限公司 | Keyword extraction method, device, equipment and medium |
CN113190154B (en) * | 2021-04-29 | 2023-10-13 | 北京百度网讯科技有限公司 | Model training and entry classification methods, apparatuses, devices, storage medium and program |
CN113343695B (en) * | 2021-05-27 | 2022-02-01 | 镁佳(北京)科技有限公司 | Text labeling noise detection method and device, storage medium and electronic equipment |
CN114169539A (en) * | 2022-02-11 | 2022-03-11 | 阿里巴巴(中国)有限公司 | Model training method, training device, electronic device, and computer-readable medium |
CN114648980B (en) * | 2022-03-03 | 2025-02-28 | 科大讯飞股份有限公司 | Data classification and speech recognition method, device, electronic device and storage medium |
CN115129872A (en) * | 2022-06-21 | 2022-09-30 | 浙江大学 | Active learning-based small sample text labeling method and device |
CN115994225B (en) * | 2023-03-20 | 2023-06-27 | 北京百分点科技集团股份有限公司 | Text classification method, device, storage medium and electronic equipment |
CN116304058B (en) * | 2023-04-27 | 2023-08-08 | 云账户技术(天津)有限公司 | Method and device for identifying negative information of enterprise, electronic equipment and storage medium |
CN117783377B (en) * | 2024-02-27 | 2024-08-30 | 南昌怀特科技有限公司 | Component analysis method and system for tooth paste production |
CN117973522B (en) * | 2024-04-02 | 2024-06-04 | 成都派沃特科技股份有限公司 | Knowledge data training technology-based application model construction method and system |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060095521A1 (en) * | 2004-11-04 | 2006-05-04 | Seth Patinkin | Method, apparatus, and system for clustering and classification |
US20140172754A1 (en) * | 2012-12-14 | 2014-06-19 | International Business Machines Corporation | Semi-supervised data integration model for named entity classification |
CN106131613A (en) * | 2016-07-26 | 2016-11-16 | 深圳Tcl新技术有限公司 | Intelligent television video sharing method and video sharing system |
CN107025218A (en) * | 2017-04-07 | 2017-08-08 | 腾讯科技(深圳)有限公司 | A kind of text De-weight method and device |
CN107506793A (en) * | 2017-08-21 | 2017-12-22 | 中国科学院重庆绿色智能技术研究院 | Clothes recognition methods and system based on weak mark image |
CN108304427A (en) * | 2017-04-28 | 2018-07-20 | 腾讯科技(深圳)有限公司 | A kind of user visitor's heap sort method and apparatus |
CN108665158A (en) * | 2018-05-08 | 2018-10-16 | 阿里巴巴集团控股有限公司 | A kind of method, apparatus and equipment of trained air control model |
CN109101997A (en) * | 2018-07-11 | 2018-12-28 | 浙江理工大学 | A kind of source tracing method sampling limited Active Learning |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102063642A (en) * | 2010-12-30 | 2011-05-18 | 上海电机学院 | Selection method for fuzzy neural network sample on basis of active learning |
US11100420B2 (en) * | 2014-06-30 | 2021-08-24 | Amazon Technologies, Inc. | Input processing for machine learning |
CN104166706B (en) * | 2014-08-08 | 2017-11-03 | 苏州大学 | Multi-tag grader construction method based on cost-sensitive Active Learning |
CN108090231A (en) * | 2018-01-12 | 2018-05-29 | 北京理工大学 | A kind of topic model optimization method based on comentropy |
CN110110080A (en) * | 2019-03-29 | 2019-08-09 | 平安科技(深圳)有限公司 | Textual classification model training method, device, computer equipment and storage medium |
2019
- 2019-03-29 CN CN201910247846.8A patent/CN110110080A/en active Pending
- 2019-11-11 WO PCT/CN2019/117095 patent/WO2020199591A1/en active Application Filing
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060095521A1 (en) * | 2004-11-04 | 2006-05-04 | Seth Patinkin | Method, apparatus, and system for clustering and classification |
US20140172754A1 (en) * | 2012-12-14 | 2014-06-19 | International Business Machines Corporation | Semi-supervised data integration model for named entity classification |
CN106131613A (en) * | 2016-07-26 | 2016-11-16 | 深圳Tcl新技术有限公司 | Intelligent television video sharing method and video sharing system |
CN107025218A (en) * | 2017-04-07 | 2017-08-08 | 腾讯科技(深圳)有限公司 | A kind of text De-weight method and device |
CN108304427A (en) * | 2017-04-28 | 2018-07-20 | 腾讯科技(深圳)有限公司 | A kind of user visitor's heap sort method and apparatus |
CN107506793A (en) * | 2017-08-21 | 2017-12-22 | 中国科学院重庆绿色智能技术研究院 | Clothes recognition methods and system based on weak mark image |
CN108665158A (en) * | 2018-05-08 | 2018-10-16 | 阿里巴巴集团控股有限公司 | A kind of method, apparatus and equipment of trained air control model |
CN109101997A (en) * | 2018-07-11 | 2018-12-28 | 浙江理工大学 | A kind of source tracing method sampling limited Active Learning |
Non-Patent Citations (4)
Title |
---|
关雅夫: "Research on Sentiment Analysis Methods for Microblogs Based on Active Learning", China Master's Theses Full-text Database (Information Science and Technology), 15 October 2017 (2017-10-15), pages 138-297 *
胡正平; 高文涛; 万春艳: "Research on a Controllable Active Learning Algorithm Combining Sample Uncertainty and Representativeness", Journal of Yanshan University, no. 04, pages 341-346 *
龙军; 殷建平; 祝恩; 赵文涛: "A Cost-Sensitive Active Learning Algorithm for Intrusion Detection", Journal of Nanjing University (Natural Science), no. 05, pages 527-535 *
Cited By (39)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020199591A1 (en) * | 2019-03-29 | 2020-10-08 | 平安科技(深圳)有限公司 | Text categorization model training method, apparatus, computer device, and storage medium |
CN112711940B (en) * | 2019-10-08 | 2024-06-11 | 台达电子工业股份有限公司 | Information processing system, information processing method, and non-transitory computer-readable recording medium |
CN112711940A (en) * | 2019-10-08 | 2021-04-27 | 台达电子工业股份有限公司 | Information processing system, information processing method, and non-transitory computer-readable recording medium |
CN111026851A (en) * | 2019-10-18 | 2020-04-17 | 平安科技(深圳)有限公司 | Model prediction capability optimization method, device, equipment and readable storage medium |
WO2021073408A1 (en) * | 2019-10-18 | 2021-04-22 | 平安科技(深圳)有限公司 | Model prediction capability optimization method, apparatus and device, and readable storage medium |
CN111026851B (en) * | 2019-10-18 | 2023-09-15 | 平安科技(深圳)有限公司 | Model prediction capability optimization method, device, equipment and readable storage medium |
CN111159396A (en) * | 2019-12-04 | 2020-05-15 | 中国电子科技集团公司第三十研究所 | A method for establishing a text data classification and grading model for data sharing and exchange |
CN111159396B (en) * | 2019-12-04 | 2022-04-22 | 中国电子科技集团公司第三十研究所 | A method for establishing a text data classification and grading model for data sharing and exchange |
CN111081221A (en) * | 2019-12-23 | 2020-04-28 | 合肥讯飞数码科技有限公司 | Training data selection method, device, electronic device and computer storage medium |
CN111081221B (en) * | 2019-12-23 | 2022-10-14 | 合肥讯飞数码科技有限公司 | Training data selection method and device, electronic equipment and computer storage medium |
CN111143568A (en) * | 2019-12-31 | 2020-05-12 | 郑州工程技术学院 | Buffering method, device, equipment and storage medium for paper classification |
CN111382268A (en) * | 2020-02-25 | 2020-07-07 | 北京小米松果电子有限公司 | Text training data processing method and device and storage medium |
CN111382268B (en) * | 2020-02-25 | 2023-12-01 | 北京小米松果电子有限公司 | Text training data processing method, device and storage medium |
CN111368515A (en) * | 2020-03-02 | 2020-07-03 | 中国农业科学院农业信息研究所 | Method and system for generating industry dynamic interactive report based on PDF document fragmentation |
CN111368515B (en) * | 2020-03-02 | 2021-01-26 | 中国农业科学院农业信息研究所 | Method and system for generating industry dynamic interactive report based on PDF document fragmentation |
CN111767400B (en) * | 2020-06-30 | 2024-04-26 | 平安国际智慧城市科技股份有限公司 | Training method and device for text classification model, computer equipment and storage medium |
CN111767400A (en) * | 2020-06-30 | 2020-10-13 | 平安国际智慧城市科技股份有限公司 | Training method and device of text classification model, computer equipment and storage medium |
CN111914061A (en) * | 2020-07-13 | 2020-11-10 | 上海乐言信息科技有限公司 | Radius-based uncertainty sampling method and system for text classification active learning |
CN111914061B (en) * | 2020-07-13 | 2021-04-16 | 上海乐言科技股份有限公司 | Radius-based uncertainty sampling method and system for text classification active learning |
CN112036166A (en) * | 2020-07-22 | 2020-12-04 | 大箴(杭州)科技有限公司 | Data labeling method and device, storage medium and computer equipment |
WO2021139279A1 (en) * | 2020-07-30 | 2021-07-15 | 平安科技(深圳)有限公司 | Data processing method and apparatus based on classification model, and electronic device and medium |
CN111881295A (en) * | 2020-07-31 | 2020-11-03 | 中国光大银行股份有限公司 | Text classification model training method and device and text labeling method and device |
CN112069293B (en) * | 2020-09-14 | 2024-04-19 | 上海明略人工智能(集团)有限公司 | Data labeling method, device, electronic equipment and computer readable medium |
CN112069293A (en) * | 2020-09-14 | 2020-12-11 | 上海明略人工智能(集团)有限公司 | Data annotation method and device, electronic equipment and computer readable medium |
CN112434736A (en) * | 2020-11-24 | 2021-03-02 | 成都潜在人工智能科技有限公司 | Deep active learning text classification method based on pre-training model |
CN112651211A (en) * | 2020-12-11 | 2021-04-13 | 北京大米科技有限公司 | Label information determination method, device, server and storage medium |
CN114637843A (en) * | 2020-12-15 | 2022-06-17 | 阿里巴巴集团控股有限公司 | Data processing method and device, electronic equipment and storage medium |
CN112633344A (en) * | 2020-12-16 | 2021-04-09 | 中国平安财产保险股份有限公司 | Quality inspection model training method, quality inspection model training device, quality inspection model training equipment and readable storage medium |
CN113239128B (en) * | 2021-06-01 | 2022-03-18 | 平安科技(深圳)有限公司 | Data pair classification method, device, equipment and storage medium based on implicit characteristics |
CN113239128A (en) * | 2021-06-01 | 2021-08-10 | 平安科技(深圳)有限公司 | Data pair classification method, device, equipment and storage medium based on implicit characteristics |
CN113590822A (en) * | 2021-07-28 | 2021-11-02 | 北京百度网讯科技有限公司 | Document title processing method, device, equipment, storage medium and program product |
CN113590822B (en) * | 2021-07-28 | 2023-08-08 | 北京百度网讯科技有限公司 | Processing method, device, device, storage medium and program product of document title |
CN113761034B (en) * | 2021-09-15 | 2022-06-17 | 深圳信息职业技术学院 | A data processing method and device thereof |
CN113761034A (en) * | 2021-09-15 | 2021-12-07 | 深圳信息职业技术学院 | A data processing method and device thereof |
CN114117043A (en) * | 2021-11-24 | 2022-03-01 | 阿里巴巴(中国)有限公司 | Model training method and device and computer storage medium |
CN114417882A (en) * | 2022-01-04 | 2022-04-29 | 马上消费金融股份有限公司 | Data labeling method and device, electronic equipment and readable storage medium |
CN114548074A (en) * | 2022-02-15 | 2022-05-27 | 中电云脑(天津)科技有限公司 | Method and device for determining medical data to be annotated |
WO2024021526A1 (en) * | 2022-07-29 | 2024-02-01 | 上海智臻智能网络科技股份有限公司 | Method and apparatus for generating training samples, device, and storage medium |
CN119513321A (en) * | 2025-01-16 | 2025-02-25 | 煤炭科学研究总院有限公司 | Classification method of coal industry vocabulary |
Also Published As
Publication number | Publication date |
---|---|
WO2020199591A1 (en) | 2020-10-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110110080A (en) | Textual classification model training method, device, computer equipment and storage medium | |
Kukačka et al. | Regularization for deep learning: A taxonomy | |
CN115311687B (en) | Natural language pedestrian retrieval method and system with joint token and feature alignment | |
CN109840322B (en) | Complete shape filling type reading understanding analysis model and method based on reinforcement learning | |
CN110969020A (en) | CNN and attention mechanism-based Chinese named entity identification method, system and medium | |
CN104966105A (en) | Robust machine error retrieving method and system | |
CN114220086B (en) | A cost-effective scene text detection method and system | |
Wu et al. | Optimized deep learning framework for water distribution data-driven modeling | |
CN111309918A (en) | Multi-label text classification method based on label relevance | |
CN110188195A (en) | A kind of text intension recognizing method, device and equipment based on deep learning | |
CN117409206B (en) | Small sample image segmentation method based on self-adaptive prototype aggregation network | |
CN113836896A (en) | Patent text abstract generation method and device based on deep learning | |
CN111145914A (en) | Method and device for determining lung cancer clinical disease library text entity | |
Liu et al. | Hybrid neural network text classification combining tcn and gru | |
Chatterjee et al. | ImageNet classification using wordnet hierarchy | |
Feng et al. | Enhancing fitness evaluation in genetic algorithm-based architecture search for AI-Aided financial regulation | |
Safdari et al. | A hierarchical feature learning for isolated Farsi handwritten digit recognition using sparse autoencoder | |
CN115223189A (en) | Method and system for recognizing secondary drawings of substation, and retrieval method and system | |
CN119003769A (en) | Netizen view analysis method based on double large models | |
Wang et al. | Efficient deep convolutional model compression with an active stepwise pruning approach | |
Passalis et al. | Deep temporal logistic bag-of-features for forecasting high frequency limit order book time series | |
CN118606469A (en) | Multi-classification prediction method for intangible cultural heritage text based on multi-head attention and semantic features | |
Asfaw | Deep learning hyperparameter’s impact on potato disease detection | |
Shanmugasundaram et al. | Detection accuracy improvement on one-stage object detection using ap-loss-based ranking module and resnet-152 backbone | |
CN115455162B (en) | Answer sentence selection method and device based on hierarchical capsule and multi-view information fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190809 |