
CN110110080A - Text classification model training method, apparatus, computer device and storage medium - Google Patents

Text classification model training method, apparatus, computer device and storage medium

Info

Publication number
CN110110080A
CN110110080A (application CN201910247846.8A)
Authority
CN
China
Prior art keywords
sample data
sample
data
classification model
default
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910247846.8A
Other languages
Chinese (zh)
Inventor
金戈
徐亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910247846.8A priority Critical patent/CN110110080A/en
Publication of CN110110080A publication Critical patent/CN110110080A/en
Priority to PCT/CN2019/117095 priority patent/WO2020199591A1/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Creation or modification of classes or clusters
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/192Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
    • G06V30/194References adjustable by an adaptive method, e.g. learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text classification model training method and apparatus, a computer device, and a storage medium. The method includes: obtaining, from a preset sample database, first sample data that carry category labels and second sample data that carry no category label; establishing a preliminary classification model according to the first sample data; calculating an information entropy value and a relevance value for each item of second sample data; performing, according to a preset labeling method, category labeling on the second sample data whose information entropy value and relevance value satisfy preset conditions, to obtain third sample data; training the preliminary classification model with the third sample data to obtain an intermediate classification model; and training the intermediate classification model with the first sample data and the third sample data to obtain the text classification model. The technical solution of the invention addresses the large training-sample scale and long training time encountered when training text classification models.

Description

Text classification model training method, apparatus, computer device and storage medium
Technical field
The present invention relates to the field of information processing, and more particularly to a text classification model training method and apparatus, a computer device, and a storage medium.
Background technique
Text classification is an important application direction in natural language processing research. Text classification refers to using a classifier to classify data files that contain text, thereby determining the category to which each document belongs, so that users can conveniently obtain the documents they need.
The classifier, also called a classification model, is obtained by training classification criteria or model parameters with a large amount of sample data that carry category labels. The trained classifier recognizes text data of unknown category, thereby classifying large-scale text data automatically. The quality of the classification model therefore directly affects the final classification result.
However, in real large-scale text classification problems, sample data with category labels are very limited, and most samples carry no category label. During construction of the classification model, this forces manual labeling by domain experts. Such labeling consumes a great deal of manpower, money, and time, and because the training samples are so numerous, the training process itself also takes a long time.
Summary of the invention
The embodiments of the present invention provide a text classification model training method and apparatus, a computer device, and a storage medium, to address the large training-sample scale and long training time in text classification model training.
A text classification model training method, comprising:
obtaining first sample data with category labels from a preset sample database, and establishing a preliminary classification model according to the first sample data;
obtaining second sample data without the category labels from the preset sample database;
calculating the information entropy of each item of second sample data to obtain an information entropy value for each item of second sample data;
calculating a relevance value for each item of second sample data according to the number of identical phrases contained in the second sample data;
selecting, as data to be labeled, the second sample data whose information entropy value exceeds a preset information entropy threshold and whose relevance value is below a preset relevance threshold;
performing category labeling on the data to be labeled according to a preset labeling method, to obtain third sample data;
training the preliminary classification model with the third sample data according to a preset model training method, to obtain an intermediate classification model;
training the intermediate classification model with the first sample data and the third sample data according to the preset model training method, to obtain a text classification model.
A text classification model training apparatus, comprising:
a preliminary model establishing module, configured to obtain first sample data with category labels from a preset sample database and establish a preliminary classification model according to the first sample data;
a sample data obtaining module, configured to obtain second sample data without the category labels from the preset sample database;
an information entropy calculating module, configured to calculate the information entropy of each item of second sample data to obtain an information entropy value for each item of second sample data;
a relevance calculating module, configured to calculate a relevance value for each item of second sample data according to the number of identical phrases contained in the second sample data;
a to-be-labeled data selecting module, configured to select, as data to be labeled, the second sample data whose information entropy value exceeds a preset information entropy threshold and whose relevance value is below a preset relevance threshold;
a labeling module, configured to perform category labeling on the data to be labeled according to a preset labeling method, to obtain third sample data;
a first model training module, configured to train the preliminary classification model with the third sample data according to a preset model training method, to obtain an intermediate classification model;
a second model training module, configured to train the intermediate classification model with the first sample data and the third sample data according to the preset model training method, to obtain a text classification model.
A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the above text classification model training method when executing the computer program.
A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the above text classification model training method.
In the above text classification model training method and apparatus, computer device, and storage medium, first sample data with category labels are obtained from a preset sample database and a preliminary classification model is established from them; that is, only a small fraction of labeled sample data is used for the initial training, which reduces the demand for labeled samples and saves training cost. Second sample data without category labels are then obtained from the preset sample database; the information entropy value and relevance value of each item of second sample data are calculated, and category labeling is performed on the second sample data whose information entropy value and relevance value satisfy preset conditions. The preliminary classification model is trained with the labeled third sample data according to a preset model training method to obtain an intermediate classification model, exploiting the fact that the third sample data have high information entropy, low mutual relevance, and category labels, which improves the classification precision of the preliminary classification model. Finally, the intermediate classification model is trained with the first sample data and the third sample data to obtain the text classification model, i.e. the final model is obtained through stepwise iterative optimization. A method is thus proposed for training a text classification model with only a small amount of labeled sample data, so that a well-performing classification model is obtained from fewer training samples, saving labor cost and improving training speed.
Detailed description of the invention
To describe the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings required by the description of the embodiments are briefly introduced below. Apparently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an application environment of the text classification model training method in an embodiment of the present invention;
Fig. 2 is a flowchart of the text classification model training method in an embodiment of the present invention;
Fig. 3 is a flowchart of step S1 of the text classification model training method in an embodiment of the present invention;
Fig. 4 is a flowchart of step S4 of the text classification model training method in an embodiment of the present invention;
Fig. 5 is a flowchart of step S5 of the text classification model training method in an embodiment of the present invention;
Fig. 6 is a schematic diagram of the text classification model training apparatus in an embodiment of the present invention;
Fig. 7 is a schematic diagram of the computer device in an embodiment of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Apparently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
The text classification model training method provided by the present invention can be applied in the application environment of Fig. 1, in which the server side is the computer device that performs the text classification model training and may be a server or a server cluster; the preset sample database is the database that provides the training sample data and may be any relational or non-relational database, such as MS-SQL, Oracle, MySQL, Sybase, DB2, Redis, MongoDB, HBase, and the like; the server side and the preset sample database are connected through a network, which may be wired or wireless. The text classification model training method provided by the embodiments of the present invention is applied to the server side.
In an embodiment, as shown in Fig. 2, a text classification model training method is provided, and its implementation flow includes the following steps:
S1: obtain first sample data with category labels from a preset sample database, and establish a preliminary classification model according to the first sample data.
The preset sample database is the database that provides the training sample data. It may be deployed locally at the server side or connected to the server side through a network.
The first sample data are text data with category labels. The text data include text documents containing textual information, web pages on the Internet, e-mail bodies, and the like; a category label is a label attached to text data that defines the category of the text data.
For example, if the category label of an article is "emotion", the content of the article relates to "emotion". It can be understood that category labels further include, without limitation, labels such as "popular science", "sports", "inspirational", and "poetry and prose" that indicate the category to which the text data belong.
Specifically, in the preset sample database, category labels and text data are stored in association, and each item of text data has a field indicating whether it carries a category label. The server side can obtain the text data with category labels through an SQL query statement and use them as the first sample data.
The preliminary classification model is a classification tool constructed according to the first sample data. The established preliminary classification model can perform a coarse classification of sample data.
Specifically, the server side can obtain the text feature information of the first sample data by performing feature analysis on the first sample data with category labels, and then store the category labels and the text feature information in association as the preliminary classification model. For example, the server side may perform word segmentation on the text in the first sample data and take the high-frequency segmented words as the text feature information. Word segmentation cuts the words in a text out of the whole passage, producing individual words; as a text-processing technique, it is widely used in fields such as full-text search and text content mining.
Alternatively, the server side may obtain the preliminary classification model from the first sample data using a neural-network-based training method.
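By way of illustration only, the following Python sketch shows one possible realization of step S1; the sample texts, labels, and the choice of scikit-learn components are assumptions made for the example rather than requirements of the embodiment.

```python
# Minimal sketch of step S1, assuming scikit-learn is available; the sample
# texts, labels, and feature extractor are illustrative assumptions only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# First sample data: a small set of texts that already carry category labels.
first_texts = ["stock prices rose sharply today", "the team won the final match"]
first_labels = ["business", "sport"]

# Build the preliminary classification model: word features + Naive Bayes.
preliminary_model = make_pipeline(
    TfidfVectorizer(),   # word segmentation / feature extraction
    MultinomialNB(),     # simple supervised classifier
)
preliminary_model.fit(first_texts, first_labels)
```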
S2: obtain second sample data without category labels from the preset sample database.
The second sample data are text data without category labels. That is, unlike the first sample data, the second sample data carry no category label; unless it labels them manually, the server side does not know the text category to which the second sample data belong or the meaning they express.
Specifically, the server side can obtain the second sample data from the preset sample database through an SQL query statement.
S3: calculate the information entropy of each item of second sample data to obtain an information entropy value for each item of second sample data.
Information entropy, a concept proposed by Shannon to measure the amount of information, is a quantitative metric of how much information there is. The larger the information entropy, the richer the information contained in the sample data and the greater the uncertainty the information represents.
The information entropy value is the concrete quantified value of the information entropy.
The server side may determine the information entropy value according to the amount of text contained in the second sample data, for example by using the number of characters in the second sample data as the information entropy value. It can be understood that a 5000-character article contains more information than an e-mail body of only 20 characters.
Specifically, the server side counts the characters in each item of second sample data and uses the character count as the information entropy value of that item.
Alternatively, the server side uses the number of segmented words remaining after modal particles are removed from the second sample data as its information entropy value. Modal particles include, without limitation, filler words such as "uh".
Specifically, the server side performs word segmentation on the second sample data to obtain a set of segmented words, removes the modal particles from the set, and uses the number of remaining segmented words as the information entropy value of the second sample data.
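A minimal sketch of this token-count variant is given below; the particle list and the whitespace tokenizer are stand-in assumptions for a real word segmenter.

```python
# Sketch of the token-count variant of step S3; the stop-word list and the
# whitespace tokenizer are illustrative assumptions.
MODAL_PARTICLES = {"uh", "um", "ah"}   # hypothetical particle list

def entropy_value_by_count(text: str) -> int:
    tokens = text.split()              # stand-in for a real word segmenter
    content_tokens = [t for t in tokens if t not in MODAL_PARTICLES]
    return len(content_tokens)
```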
S4: calculate a relevance value for each item of second sample data according to the number of identical phrases contained in the second sample data.
The relevance value of the second sample data reflects whether the information provided by the second sample data is repetitive and redundant. The higher the relevance value, the more repetitive and redundant the information the second sample data provide to one another; the lower the relevance value, the greater the difference between the information they provide.
The server side determines the relevance value according to the number of identical phrases contained in the second sample data.
For example, if second sample data A contains the phrases "culture", "civilization", and "history", second sample data B contains the phrases "culture", "country", and "history", and second sample data C contains the phrases "travel", "mountains and rivers", and "country", then A and B share the phrases "culture" and "history", so the relevance value between A and B is 2; similarly, the relevance value between A and C is 0 and the relevance value between B and C is 1. The relevance value of each item of second sample data can then be determined by summing its relevance values with every other item of second sample data, so the relevance value of A is 2, that of B is 3, and that of C is 1.
S5: select, as data to be labeled, the second sample data whose information entropy value exceeds a preset information entropy threshold and whose relevance value is below a preset relevance threshold.
The preset information entropy threshold and the preset relevance threshold are the conditions used to screen the second sample data that carry no category label.
The data to be labeled are the data obtained after screening the second sample data with the preset information entropy threshold and the preset relevance threshold.
Second sample data whose information entropy value exceeds the preset information entropy threshold and whose relevance value is below the preset relevance threshold carry a large amount of uncertain information with large differences between items, and are therefore the preferred data for training the model.
Specifically, if the preset information entropy threshold is 1000 and the preset relevance threshold is 100, the server side selects, according to the information entropy value and relevance value of each item of second sample data, the second sample data whose information entropy value is greater than 1000 and whose relevance value is below 100 as the data to be labeled.
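As an illustration under the same assumed thresholds, the screening could be expressed as follows; the dictionary field names are assumptions made for the example.

```python
# Sketch of the screening in step S5, assuming each second sample is a dict
# carrying the precomputed values; the thresholds match the example above.
ENTROPY_THRESHOLD = 1000
RELEVANCE_THRESHOLD = 100

def select_to_label(second_samples):
    return [
        s for s in second_samples
        if s["entropy_value"] > ENTROPY_THRESHOLD
        and s["relevance_value"] < RELEVANCE_THRESHOLD
    ]
```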
S6: perform category labeling on the data to be labeled according to a preset labeling method, to obtain third sample data.
Category labeling is the process of labeling second sample data that carry no category label so that they carry the corresponding category labels. For example, labeling an article with labels such as "novel" and "suspense" reflects its subject matter. The data obtained after category labeling are the third sample data.
The preset labeling method means that the server side may perform category labeling on the second sample data in any of several ways.
For example, the server side may extract keywords from the second sample data, taking the five words with the highest word frequency as keywords; the keywords are then compared against the target keywords in a preset category-label dictionary, and if a keyword matches a target keyword, the target keyword is attached to the second sample data as its label, yielding the third sample data.
Alternatively, the server side may directly call a third-party expert system for labeling. For example, the second sample data are submitted through the API (Application Programming Interface) provided by the third-party expert system, and the category label corresponding to the second sample data is returned, yielding the third sample data.
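A minimal sketch of the keyword-matching variant described above follows; the label dictionary and the whitespace tokenizer are illustrative assumptions, not part of the embodiment.

```python
# Sketch of the keyword-matching labeling in step S6; the label dictionary
# and whitespace tokenizer are illustrative assumptions.
from collections import Counter

LABEL_DICTIONARY = {"culture": "humanities", "match": "sport"}  # hypothetical

def label_by_keywords(text: str):
    tokens = text.split()
    top_words = [w for w, _ in Counter(tokens).most_common(5)]
    for word in top_words:
        if word in LABEL_DICTIONARY:
            return LABEL_DICTIONARY[word]   # category label for the sample
    return None                             # no match: leave unlabeled
```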
S7: train the preliminary classification model with the third sample data according to a preset model training method, to obtain an intermediate classification model.
The intermediate classification model is the classification model obtained by training the preliminary classification model with the third sample data. It differs from the preliminary classification model in that its training set is the third sample data, which carry category labels and whose information entropy values and relevance values satisfy the specified conditions.
The preset model training method means that the server side uses the third sample data as training data and trains the preliminary classification model with any of several frameworks or algorithms. For example, the server side may use existing machine learning frameworks or tools such as Scikit-Learn and TensorFlow.
Scikit-Learn, abbreviated sklearn, is an open-source machine learning library based on Python with built-in classification algorithms such as naive Bayes, decision trees, and random forests; with sklearn, common machine learning tasks such as data preprocessing, classification, regression, dimensionality reduction, and model selection can be implemented. TensorFlow is an open-source software library for numerical computation originally developed by researchers and engineers of the Google Brain team (part of Google's machine intelligence research organization); it can be used for research on machine learning and deep neural networks, and the generality of the system also makes it applicable to many other computing fields.
Specifically, taking sklearn as an example, the server side uses the third sample data as input data and calls the built-in training methods of sklearn until the model converges, obtaining the intermediate classification model.
S8: train the intermediate classification model with the first sample data and the third sample data according to the preset model training method, to obtain a text classification model.
The text classification model is the final classification model obtained after retraining the intermediate classification model.
The preset model training method used by the server side is the same as in step S7 and is not repeated here. The difference from the training in step S7 is that the intermediate classification model is trained with the first sample data and the third sample data together, i.e. iterative training is performed on the intermediate classification model with sample data that carry category labels, to improve the classification precision of the intermediate classification model.
Specifically, taking sklearn as an example, the server side uses the first sample data and the third sample data as input data and calls the built-in training methods of sklearn until the model converges, obtaining the text classification model.
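By way of illustration only, the following sketch shows one way steps S1, S7, and S8 could be chained with sklearn's incremental training; the texts, labels, label set, and the HashingVectorizer choice are assumptions made for the example, not requirements of the embodiment.

```python
# Illustrative sketch of chaining steps S1, S7 and S8 with incremental
# training in sklearn; all texts, labels and component choices are assumptions.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

first_texts = ["stock prices rose sharply", "the team won the final"]
first_labels = ["business", "sport"]
third_texts = ["the central bank cut interest rates"]   # newly labeled data
third_labels = ["business"]
classes = ["business", "sport"]

vectorizer = HashingVectorizer(alternate_sign=False)     # stateless features
model = MultinomialNB()

# S1: preliminary classification model from the small labeled first sample data.
model.partial_fit(vectorizer.transform(first_texts), first_labels, classes=classes)

# S7: intermediate classification model, trained with the third sample data.
model.partial_fit(vectorizer.transform(third_texts), third_labels)

# S8: text classification model, trained with first and third sample data.
model.partial_fit(vectorizer.transform(first_texts + third_texts),
                  first_labels + third_labels)
```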
In this embodiment, first sample data with category labels are obtained from the preset sample database and a preliminary classification model is established from them, i.e. only a small fraction of labeled sample data is used for training, which reduces the demand for labeled samples and saves training cost; second sample data without category labels are obtained from the preset sample database; the information entropy value and relevance value of the second sample data are calculated, and category labeling is performed on the second sample data whose information entropy value and relevance value satisfy the preset conditions; the preliminary classification model is trained with the labeled third sample data according to the preset model training method to obtain an intermediate classification model, exploiting the fact that the third sample data have high information entropy, low mutual relevance, and category labels, which improves the classification precision of the preliminary classification model; finally, the intermediate classification model is trained with the first sample data and the third sample data to obtain the text classification model, i.e. the final text classification model is obtained through stepwise iterative optimization. A method is thus proposed for training a text classification model with only a small amount of labeled sample data, so that a well-performing classification model is obtained from fewer training samples, saving labor cost and improving training speed.
Further, in an embodiment, as shown in Fig. 3, step S1, i.e. obtaining the first sample data with category labels from the preset sample database and establishing the preliminary classification model according to the first sample data, specifically includes the following steps:
S11: select the first sample data with category labels from the preset sample database according to a preset sample selection method.
The preset sample selection method selects a certain number of representative first sample data with category labels from the preset sample database. The number should be as small as possible, to reduce the demand for sample data, while the selected first samples should cover as many text data categories as possible. For example, when selecting news text data, categories such as "politics", "business", "sport", and "entertainment" should be covered as far as possible.
Specifically, if the preset sample database holds 100,000 articles, of which 3,000 carry category labels, the server side may select 30% of the 3,000 articles, i.e. 900 articles, and from those 900 articles select 5 articles for each text data category as the first sample data.
S12: establish the preliminary classification model by combining the first sample data with category labels and a preset training algorithm.
The preset training algorithm includes the various algorithms used to train models in machine learning. The process in which the server side establishes the preliminary classification model from the first sample data with category labels is a supervised learning process. Supervised learning trains an optimal model from existing training samples, i.e. known data and their corresponding outputs; the model belongs to some set of functions, and "optimal" means optimal under some evaluation criterion.
Specifically, taking the naive Bayes classification algorithm as an example, the server side may import the naive Bayes functions from the sklearn library and then call MultinomialNB().fit() for training.
When the training is finished, the server side may use the Joblib library to save the training result. Joblib is part of the SciPy ecosystem and provides tools for pipelining Python jobs. Alternatively, the server side may call the functions of the pickle library to save the preliminary classification model.
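Continuing the earlier pipeline sketch, model persistence could look like the following; the variable name preliminary_model and the file names are illustrative assumptions.

```python
# Sketch of saving the preliminary classification model; the file names and
# the preliminary_model object from the earlier sketch are assumptions.
import joblib
import pickle

joblib.dump(preliminary_model, "preliminary_model.joblib")   # Joblib variant

with open("preliminary_model.pkl", "wb") as f:               # pickle variant
    pickle.dump(preliminary_model, f)
```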
In this embodiment, the server side selects, according to the preset sample selection method, first sample data that are as few as possible while covering as many sample data types as possible, and then establishes the preliminary classification model in combination with the preset training algorithm, so that the demand for sample data is kept low, further reducing the training cost; at the same time, because the first sample data cover a wide range of categories, the preliminary classification model can recognize a wider range of texts.
Further, in an embodiment, step S3, i.e. calculating the information entropy of each item of second sample data to obtain the information entropy value of each item of second sample data, specifically includes the following steps:
The information entropy of each item of second sample data is calculated according to the following formula:
H = -∑ p(x)·log p(x), where the sum runs over the phrases x in the second sample data;
wherein H represents the information entropy value of the second sample data, x represents a phrase in the second sample data, and p(x) represents the frequency with which the phrase occurs.
A phrase in the second sample data is a word obtained by the server side after performing word segmentation on the second sample data. The frequency with which a phrase occurs is the number of times the phrase appears in the second sample data.
Specifically, the server side first performs word segmentation on each item of second sample data to obtain a set of segmented words, and then substitutes the frequencies of all segmented words in the set into the formula to obtain the information entropy value of that item of second sample data.
In this embodiment, the server side calculates the information entropy of the second sample data according to Shannon's formula and the word frequencies of the phrases in the second sample data, so that the quantification of the amount of information contained in the sample data is more accurate.
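A minimal sketch of this calculation follows, assuming p(x) is taken as the relative frequency of each segmented word and using a whitespace tokenizer as a stand-in for a real word segmenter.

```python
# Sketch of the Shannon-entropy calculation in step S3; the whitespace
# tokenizer and the use of relative frequencies for p(x) are assumptions.
import math
from collections import Counter

def entropy_value(text: str) -> float:
    tokens = text.split()
    counts = Counter(tokens)
    total = sum(counts.values())
    # H = -sum over x of p(x) * log p(x), with p(x) the relative frequency of x
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```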
Further, in an embodiment, as shown in Fig. 4, step S4, i.e. calculating the relevance value of each item of second sample data according to the number of identical phrases contained in the second sample data, specifically includes the following steps:
S41: perform word segmentation on each item of second sample data to obtain N segmented-word sets, where N is the number of items of second sample data.
Specifically, the server side may perform the word segmentation in various ways, for example by cutting the second sample data with regular expressions to obtain a set consisting of several segmented words, i.e. a segmented-word set. It can be understood that the items of second sample data and the segmented-word sets are in one-to-one correspondence.
A regular expression (Regular Expression) is a method for retrieving or replacing target text within a context.
Specifically, the server side may cut the second sample data using the regular-expression engines built into Perl or Python, or using the grep tool provided in Unix systems, to obtain the sets containing the segmented words. grep (Globally search a Regular Expression and Print) is a powerful text search tool.
S42: for each item of second sample data, calculate the intersections between its segmented-word set and the segmented-word sets of the other N-1 items of second sample data, and determine, according to the number of phrases contained in each intersection, the local relevance values between this item of second sample data and the other N-1 items, obtaining the N-1 local relevance values corresponding to this item of second sample data.
To calculate the intersection between segmented-word sets, the different sets are compared; the intersection consists of the identical phrases.
The local relevance value represents the degree of relevance between one item of second sample data and another.
For example, if segmented-word set a is {"people", "interest", "bank", "lending"} and segmented-word set b is {"bank", "lending", "income"}, the intersection of a and b is {"bank", "lending"}, which contains 2 phrases, so the local relevance value of a and b is 2. Similarly, if segmented-word set c is {"meeting", "report", "income"}, the local relevance value of a and c is 0 and the local relevance value of b and c is 1.
S43: calculate the average of the N-1 local relevance values corresponding to each item of second sample data, and use the average as the relevance value of that item of second sample data.
Taking the segmented-word sets a, b, and c of step S42 as an example, the relevance value of the second sample data corresponding to set a is the average of the local relevance values of a with b and of a with c, i.e. 1. Similarly, the relevance values of the second sample data corresponding to sets b and c are 1.5 and 0.5 respectively.
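A minimal sketch of steps S41 to S43 follows, assuming each item of second sample data is already represented by its segmented-word set; the example sets mirror a, b, and c above.

```python
# Sketch of steps S41-S43; the example word sets mirror the a/b/c example.
def relevance_values(word_sets):
    n = len(word_sets)
    values = []
    for i in range(n):
        # local relevance value = size of the intersection with every other set
        locals_ = [len(word_sets[i] & word_sets[j]) for j in range(n) if j != i]
        values.append(sum(locals_) / len(locals_))   # average over N-1 values
    return values

a = {"people", "interest", "bank", "lending"}
b = {"bank", "lending", "income"}
c = {"meeting", "report", "income"}
print(relevance_values([a, b, c]))   # [1.0, 1.5, 0.5]
```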
In this embodiment, the server side performs word segmentation on the second sample data, determines the local relevance values between items of second sample data from the intersections of the segmented-word sets, and averages the local relevance values to obtain the relevance value of each item of second sample data, so that the relevance value reflects the degree of association between the items of second sample data more accurately.
Further, in an embodiment, as shown in Fig. 5, step S5, i.e. selecting the second sample data whose information entropy value exceeds the preset information entropy threshold and whose relevance value is below the preset relevance threshold as the data to be labeled, specifically includes the following steps:
S51: select the second sample data whose information entropy value exceeds the preset information entropy threshold and whose relevance value is below the preset relevance threshold as candidate sample data.
The server side screens again the second sample data that satisfy the specified conditions, which both reduces the number of training samples and picks out the sample data that ordinary classifiers find hard to recognize. The specified conditions are that the information entropy value exceeds the preset information entropy threshold and the relevance value is below the preset relevance threshold.
S52: classify the candidate sample data with at least two preset sample classifiers to obtain classification results.
The preset sample classifiers are text classification models, for example the common FastText and Text-CNN models.
FastText is a word-vector and text-classification tool open-sourced by Facebook, whose typical application scenario is supervised text classification; it provides a simple and efficient method for text classification and representation learning, with performance comparable to deep learning at much higher speed. TextCNN is an algorithm that classifies text with convolutional neural networks; because of its simple structure and good performance, it is widely used in text classification.
Different preset sample classifiers may classify the same sample data differently, i.e. the same sample data may be assigned to different categories after being classified by different classification models such as FastText and Text-CNN.
The classification results include the category to which each item of candidate sample data belongs.
S53: select, from the classification results, the candidate sample data that are assigned to different categories at the same time as the data to be labeled.
Candidate sample data that are assigned to different categories at the same time are those for which the recognition results of the different preset classifiers disagree. For example, an article is recognized as "history" by FastText and as "literature and art" by Text-CNN; this indicates that the article is hard to recognize, or hard to assign simply to a single category.
Specifically, the server side determines, according to the categories of the candidate sample data in the classification results, whether a candidate sample belongs to different categories at the same time.
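A minimal sketch of steps S52 and S53 follows, assuming two already-trained classifiers that expose a scikit-learn-style predict interface; both classifier objects are assumptions made for the example.

```python
# Sketch of steps S52-S53; classifier_a and classifier_b stand for two
# pre-trained preset sample classifiers and are illustrative assumptions.
def select_disagreements(candidates, classifier_a, classifier_b):
    to_label = []
    for text in candidates:
        label_a = classifier_a.predict([text])[0]
        label_b = classifier_b.predict([text])[0]
        if label_a != label_b:          # the classifiers disagree
            to_label.append(text)       # hard sample: send it for labeling
    return to_label
```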
In this embodiment, the server side screens the second sample data that satisfy the specified conditions with different preset classifiers and selects the second sample data that are hard to recognize as the data to be labeled. This discards the sample data that are easy to recognize, further reducing the number of training samples and the training time and improving training efficiency; at the same time, the sample data that are hard to recognize are selected as the data to be labeled, so that labeling them benefits the improvement of the model's training precision.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and does not constitute any limitation on the implementation of the embodiments of the present invention.
In an embodiment, a text classification model training apparatus is provided, which corresponds to the text classification model training method of the above embodiments. As shown in Fig. 6, the text classification model training apparatus includes a preliminary model establishing module 61, a sample data obtaining module 62, an information entropy calculating module 63, a relevance calculating module 64, a to-be-labeled data selecting module 65, a labeling module 66, a first model training module 67, and a second model training module 68. The functional modules are described in detail as follows:
the preliminary model establishing module 61 is configured to obtain first sample data with category labels from the preset sample database and establish a preliminary classification model according to the first sample data;
the sample data obtaining module 62 is configured to obtain second sample data without category labels from the preset sample database;
the information entropy calculating module 63 is configured to calculate the information entropy of each item of second sample data to obtain the information entropy value of each item of second sample data;
the relevance calculating module 64 is configured to calculate the relevance value of each item of second sample data according to the number of identical phrases contained in the second sample data;
the to-be-labeled data selecting module 65 is configured to select, as data to be labeled, the second sample data whose information entropy value exceeds the preset information entropy threshold and whose relevance value is below the preset relevance threshold;
the labeling module 66 is configured to perform category labeling on the data to be labeled according to the preset labeling method, to obtain third sample data;
the first model training module 67 is configured to train the preliminary classification model with the third sample data according to the preset model training method, to obtain an intermediate classification model;
the second model training module 68 is configured to train the intermediate classification model with the first sample data and the third sample data according to the preset model training method, to obtain a text classification model.
Further, the preliminary model establishing module 61 includes:
a selecting submodule 611, configured to select the first sample data with category labels from the preset sample database according to the preset sample selection method;
a training submodule 612, configured to establish the preliminary classification model by combining the first sample data with category labels and the preset training algorithm.
Further, the information entropy calculating module 63 includes:
an information entropy calculating submodule 631, configured to calculate the information entropy of each item of second sample data according to the following formula:
H = -∑ p(x)·log p(x), where the sum runs over the phrases x in the second sample data;
wherein H represents the information entropy value of the second sample data, x represents a phrase in the second sample data, and p(x) represents the frequency with which the phrase occurs.
Further, the relevance calculating module 64 includes:
a word segmentation submodule 641, configured to perform word segmentation on each item of second sample data to obtain N segmented-word sets, where N is the number of items of second sample data;
a local relevance calculating submodule 642, configured to, for each item of second sample data, calculate the intersections between its segmented-word set and the segmented-word sets of the other N-1 items of second sample data, and determine, according to the number of phrases contained in each intersection, the local relevance values between this item of second sample data and the other N-1 items, obtaining the N-1 local relevance values corresponding to this item of second sample data;
an average calculating submodule 643, configured to calculate the average of the N-1 local relevance values corresponding to each item of second sample data and use the average as the relevance value of that item of second sample data.
Further, the to-be-labeled data selecting module 65 includes:
a candidate sample selecting submodule 651, configured to select the second sample data whose information entropy value exceeds the preset information entropy threshold and whose relevance value is below the preset relevance threshold as candidate sample data;
a classification submodule 652, configured to classify the candidate sample data with at least two preset sample classifiers to obtain classification results;
a labeling submodule 653, configured to select, from the classification results, the candidate sample data that are assigned to different categories at the same time as the data to be labeled.
For specific limitations of the text classification model training apparatus, reference may be made to the limitations of the text classification model training method above, which are not repeated here. The modules of the above text classification model training apparatus may be implemented wholly or partly by software, hardware, or a combination thereof. The modules may be embedded in or independent of the processor of the computer device in hardware form, or stored in software form in the memory of the computer device, so that the processor can call and execute the operations corresponding to the modules.
In an embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in Fig. 7. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capability. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used to communicate with external terminals through a network connection. The computer program, when executed by the processor, implements a text classification model training method.
In an embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When executing the computer program, the processor implements the steps of the text classification model training method of the above embodiments, for example steps S1 to S8 shown in Fig. 2; alternatively, when executing the computer program, the processor implements the functions of the modules/units of the text classification model training apparatus of the above embodiments, for example the functions of modules 61 to 68 shown in Fig. 6. To avoid repetition, details are not described here again.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored. The computer program, when executed by a processor, implements the text classification model training method of the above method embodiments, or the functions of the modules/units of the text classification model training apparatus of the above apparatus embodiments. To avoid repetition, details are not described here again.
A person of ordinary skill in the art can understand that all or part of the processes of the methods of the above embodiments may be implemented by instructing relevant hardware through a computer program. The computer program may be stored in a non-volatile computer-readable storage medium, and when executed may include the processes of the embodiments of the above methods. Any reference to memory, storage, database, or other media used in the embodiments provided by the present invention may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It is clear to those skilled in the art that, for convenience and brevity of description, only the division of the above functional units and modules is used as an example; in practical applications, the above functions may be allocated to different functional units and modules as required, i.e. the internal structure of the apparatus may be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of the technical features may be equivalently replaced; such modifications or replacements do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all fall within the protection scope of the present invention.

Claims (10)

1. A text classification model training method, characterized in that the text classification model training method comprises:
obtaining first sample data with category labels from a preset sample database, and establishing a preliminary classification model according to the first sample data;
obtaining second sample data without the category labels from the preset sample database;
calculating the information entropy of each item of the second sample data to obtain an information entropy value of each item of the second sample data;
calculating a relevance value of each item of the second sample data according to the number of identical phrases contained in the second sample data;
selecting, as data to be labeled, the second sample data whose information entropy value exceeds a preset information entropy threshold and whose relevance value is below the preset relevance threshold;
performing category labeling on the data to be labeled according to a preset labeling method, to obtain third sample data;
training the preliminary classification model with the third sample data according to a preset model training method, to obtain an intermediate classification model;
training the intermediate classification model with the first sample data and the third sample data according to the preset model training method, to obtain a text classification model.
2. The text classification model training method according to claim 1, characterized in that obtaining the first sample data with category labels from the preset sample database and establishing the preliminary classification model according to the first sample data comprises:
selecting the first sample data with category labels from the preset sample database according to a preset sample selection method;
establishing the preliminary classification model by combining the first sample data with category labels and a preset training algorithm.
3. The text classification model training method according to claim 1, characterized in that calculating the information entropy of each item of the second sample data to obtain the information entropy value of each item of the second sample data comprises:
calculating the information entropy of each item of the second sample data according to the following formula:
H = -∑ p(x)·log p(x), where the sum runs over the phrases x in the second sample data;
wherein H represents the information entropy value of the second sample data, x represents a phrase in the second sample data, and p(x) represents the frequency with which the phrase occurs.
4. The text classification model training method according to claim 1, characterized in that calculating the relevance value of each item of the second sample data according to the number of identical phrases contained in the second sample data comprises:
performing word segmentation on each item of the second sample data to obtain N segmented-word sets, wherein N is the number of items of the second sample data;
for each item of the second sample data, calculating intersections between the segmented-word set of the item of second sample data and the segmented-word sets of the other N-1 items of second sample data, and determining, according to the number of phrases contained in each intersection, local relevance values between the item of second sample data and the other N-1 items of second sample data, to obtain N-1 local relevance values corresponding to the item of second sample data;
calculating an average of the N-1 local relevance values corresponding to each item of the second sample data, and using the average as the relevance value of each item of the second sample data.
5. The text classification model training method according to claim 1, characterized in that selecting, as data to be labeled, the second sample data whose information entropy value exceeds the preset information entropy threshold and whose relevance value is below the preset relevance threshold comprises:
selecting the second sample data whose information entropy value exceeds the preset information entropy threshold and whose relevance value is below the preset relevance threshold as candidate sample data;
classifying the candidate sample data with at least two preset sample classifiers to obtain classification results;
selecting, from the classification results, the candidate sample data that are assigned to different categories at the same time as the data to be labeled.
6. A text classification model training apparatus, characterized in that the text classification model training apparatus comprises:
a preliminary model establishing module, configured to obtain first sample data with category labels from a preset sample database and establish a preliminary classification model according to the first sample data;
a sample data obtaining module, configured to obtain second sample data without the category labels from the preset sample database;
an information entropy calculating module, configured to calculate the information entropy of each item of the second sample data to obtain an information entropy value of each item of the second sample data;
a relevance calculating module, configured to calculate a relevance value of each item of the second sample data according to the number of identical phrases contained in the second sample data;
a to-be-labeled data selecting module, configured to select, as data to be labeled, the second sample data whose information entropy value exceeds a preset information entropy threshold and whose relevance value is below the preset relevance threshold;
a labeling module, configured to perform category labeling on the data to be labeled according to a preset labeling method, to obtain third sample data;
a first model training module, configured to train the preliminary classification model with the third sample data according to a preset model training method, to obtain an intermediate classification model;
a second model training module, configured to train the intermediate classification model with the first sample data and the third sample data according to the preset model training method, to obtain a text classification model.
7. The text classification model training apparatus according to claim 6, characterized in that the preliminary model establishing module comprises:
a selecting submodule, configured to select the first sample data with category labels from the preset sample database according to a preset sample selection method;
a training submodule, configured to establish the preliminary classification model by combining the first sample data with category labels and a preset training algorithm.
8. The text classification model training apparatus according to claim 6, wherein the information entropy computing module comprises:
An information entropy computing submodule, configured to calculate the information entropy of each item of second sample data according to the following formula:
H = -Σx p(x) · log p(x)
wherein H represents the information entropy value of the second sample data, x represents a phrase in the second sample data, and p(x) represents the frequency with which the phrase x occurs.
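The entropy in claim 8 can be computed directly from the phrase frequencies of a single segmented sample. The small sketch below uses the natural logarithm (the base only rescales the values and the corresponding threshold); the function name phrase_entropy is illustrative, not from the patent.

import math
from collections import Counter
from typing import List

def phrase_entropy(phrases: List[str]) -> float:
    # p(x) is the relative frequency of phrase x within this sample;
    # H = -sum over x of p(x) * log p(x).
    counts = Counter(phrases)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

print(phrase_entropy(["理赔", "保单", "理赔", "保费"]))  # ≈ 1.04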
9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the text classification model training method according to any one of claims 1 to 5.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the text classification model training method according to any one of claims 1 to 5.
CN201910247846.8A 2019-03-29 2019-03-29 Textual classification model training method, device, computer equipment and storage medium Pending CN110110080A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910247846.8A CN110110080A (en) 2019-03-29 2019-03-29 Textual classification model training method, device, computer equipment and storage medium
PCT/CN2019/117095 WO2020199591A1 (en) 2019-03-29 2019-11-11 Text categorization model training method, apparatus, computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910247846.8A CN110110080A (en) 2019-03-29 2019-03-29 Textual classification model training method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110110080A true CN110110080A (en) 2019-08-09

Family

ID=67484695

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910247846.8A Pending CN110110080A (en) 2019-03-29 2019-03-29 Textual classification model training method, device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN110110080A (en)
WO (1) WO2020199591A1 (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112348203A (en) * 2020-11-05 2021-02-09 中国平安人寿保险股份有限公司 Model training method, device, terminal device and storage medium
CN112528022A (en) * 2020-12-09 2021-03-19 广州摩翼信息科技有限公司 Method for extracting characteristic words corresponding to theme categories and identifying text theme categories
CN112632219B (en) * 2020-12-17 2022-10-04 中国联合网络通信集团有限公司 Method and device for intercepting spam short messages
CN112651447B (en) * 2020-12-29 2023-09-26 广东电网有限责任公司电力调度控制中心 Ontology-based resource classification labeling method and system
CN112541595B (en) * 2020-12-30 2024-12-06 中国建设银行股份有限公司 Model building method and device, storage medium and electronic device
CN112446441B (en) * 2021-02-01 2021-08-20 北京世纪好未来教育科技有限公司 Model training data screening method, device, equipment and storage medium
CN113793191B (en) * 2021-02-09 2024-05-24 京东科技控股股份有限公司 Commodity matching method and device and electronic equipment
CN113704393B (en) * 2021-04-13 2025-07-15 腾讯科技(深圳)有限公司 Keyword extraction method, device, equipment and medium
CN113190154B (en) * 2021-04-29 2023-10-13 北京百度网讯科技有限公司 Model training and entry classification methods, apparatuses, devices, storage medium and program
CN113343695B (en) * 2021-05-27 2022-02-01 镁佳(北京)科技有限公司 Text labeling noise detection method and device, storage medium and electronic equipment
CN114169539A (en) * 2022-02-11 2022-03-11 阿里巴巴(中国)有限公司 Model training method, training device, electronic device, and computer-readable medium
CN114648980B (en) * 2022-03-03 2025-02-28 科大讯飞股份有限公司 Data classification and speech recognition method, device, electronic device and storage medium
CN115129872A (en) * 2022-06-21 2022-09-30 浙江大学 Active learning-based small sample text labeling method and device
CN115994225B (en) * 2023-03-20 2023-06-27 北京百分点科技集团股份有限公司 Text classification method, device, storage medium and electronic equipment
CN116304058B (en) * 2023-04-27 2023-08-08 云账户技术(天津)有限公司 Method and device for identifying negative information of enterprise, electronic equipment and storage medium
CN117783377B (en) * 2024-02-27 2024-08-30 南昌怀特科技有限公司 Component analysis method and system for tooth paste production
CN117973522B (en) * 2024-04-02 2024-06-04 成都派沃特科技股份有限公司 Knowledge data training technology-based application model construction method and system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102063642A (en) * 2010-12-30 2011-05-18 上海电机学院 Selection method for fuzzy neural network sample on basis of active learning
US11100420B2 (en) * 2014-06-30 2021-08-24 Amazon Technologies, Inc. Input processing for machine learning
CN104166706B (en) * 2014-08-08 2017-11-03 苏州大学 Multi-tag grader construction method based on cost-sensitive Active Learning
CN108090231A (en) * 2018-01-12 2018-05-29 北京理工大学 A kind of topic model optimization method based on comentropy
CN110110080A (en) * 2019-03-29 2019-08-09 平安科技(深圳)有限公司 Textual classification model training method, device, computer equipment and storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060095521A1 (en) * 2004-11-04 2006-05-04 Seth Patinkin Method, apparatus, and system for clustering and classification
US20140172754A1 (en) * 2012-12-14 2014-06-19 International Business Machines Corporation Semi-supervised data integration model for named entity classification
CN106131613A (en) * 2016-07-26 2016-11-16 深圳Tcl新技术有限公司 Intelligent television video sharing method and video sharing system
CN107025218A (en) * 2017-04-07 2017-08-08 腾讯科技(深圳)有限公司 A kind of text De-weight method and device
CN108304427A (en) * 2017-04-28 2018-07-20 腾讯科技(深圳)有限公司 A kind of user visitor's heap sort method and apparatus
CN107506793A (en) * 2017-08-21 2017-12-22 中国科学院重庆绿色智能技术研究院 Clothes recognition methods and system based on weak mark image
CN108665158A (en) * 2018-05-08 2018-10-16 阿里巴巴集团控股有限公司 A kind of method, apparatus and equipment of trained air control model
CN109101997A (en) * 2018-07-11 2018-12-28 浙江理工大学 A kind of source tracing method sampling limited Active Learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
GUAN, Yafu: "Research on Microblog Sentiment Analysis Methods Based on Active Learning", China Master's Theses Full-text Database, Information Science and Technology Series *
GUAN, Yafu: "Research on Microblog Sentiment Analysis Methods Based on Active Learning", China Master's Theses Full-text Database, Information Science and Technology Series, 15 October 2017 (2017-10-15), pages 138 - 297 *
HU, Zhengping; GAO, Wentao; WAN, Chunyan: "Research on a Controllable Active Learning Algorithm Combining Sample Uncertainty and Representativeness", Journal of Yanshan University, no. 04, pages 341 - 346 *
LONG, Jun; YIN, Jianping; ZHU, En; ZHAO, Wentao: "A Cost-Sensitive Active Learning Algorithm for Intrusion Detection", Journal of Nanjing University (Natural Science Edition), no. 05, pages 527 - 535 *

Cited By (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020199591A1 (en) * 2019-03-29 2020-10-08 平安科技(深圳)有限公司 Text categorization model training method, apparatus, computer device, and storage medium
CN112711940B (en) * 2019-10-08 2024-06-11 台达电子工业股份有限公司 Information processing system, information processing method, and non-transitory computer-readable recording medium
CN112711940A (en) * 2019-10-08 2021-04-27 台达电子工业股份有限公司 Information processing system, information processing method, and non-transitory computer-readable recording medium
CN111026851A (en) * 2019-10-18 2020-04-17 平安科技(深圳)有限公司 Model prediction capability optimization method, device, equipment and readable storage medium
WO2021073408A1 (en) * 2019-10-18 2021-04-22 平安科技(深圳)有限公司 Model prediction capability optimization method, apparatus and device, and readable storage medium
CN111026851B (en) * 2019-10-18 2023-09-15 平安科技(深圳)有限公司 Model prediction capability optimization method, device, equipment and readable storage medium
CN111159396A (en) * 2019-12-04 2020-05-15 中国电子科技集团公司第三十研究所 A method for establishing a text data classification and grading model for data sharing and exchange
CN111159396B (en) * 2019-12-04 2022-04-22 中国电子科技集团公司第三十研究所 A method for establishing a text data classification and grading model for data sharing and exchange
CN111081221A (en) * 2019-12-23 2020-04-28 合肥讯飞数码科技有限公司 Training data selection method, device, electronic device and computer storage medium
CN111081221B (en) * 2019-12-23 2022-10-14 合肥讯飞数码科技有限公司 Training data selection method and device, electronic equipment and computer storage medium
CN111143568A (en) * 2019-12-31 2020-05-12 郑州工程技术学院 Buffering method, device, equipment and storage medium for paper classification
CN111382268A (en) * 2020-02-25 2020-07-07 北京小米松果电子有限公司 Text training data processing method and device and storage medium
CN111382268B (en) * 2020-02-25 2023-12-01 北京小米松果电子有限公司 Text training data processing method, device and storage medium
CN111368515A (en) * 2020-03-02 2020-07-03 中国农业科学院农业信息研究所 Method and system for generating industry dynamic interactive report based on PDF document fragmentation
CN111368515B (en) * 2020-03-02 2021-01-26 中国农业科学院农业信息研究所 Method and system for generating industry dynamic interactive report based on PDF document fragmentation
CN111767400B (en) * 2020-06-30 2024-04-26 平安国际智慧城市科技股份有限公司 Training method and device for text classification model, computer equipment and storage medium
CN111767400A (en) * 2020-06-30 2020-10-13 平安国际智慧城市科技股份有限公司 Training method and device of text classification model, computer equipment and storage medium
CN111914061A (en) * 2020-07-13 2020-11-10 上海乐言信息科技有限公司 Radius-based uncertainty sampling method and system for text classification active learning
CN111914061B (en) * 2020-07-13 2021-04-16 上海乐言科技股份有限公司 Radius-based uncertainty sampling method and system for text classification active learning
CN112036166A (en) * 2020-07-22 2020-12-04 大箴(杭州)科技有限公司 Data labeling method and device, storage medium and computer equipment
WO2021139279A1 (en) * 2020-07-30 2021-07-15 平安科技(深圳)有限公司 Data processing method and apparatus based on classification model, and electronic device and medium
CN111881295A (en) * 2020-07-31 2020-11-03 中国光大银行股份有限公司 Text classification model training method and device and text labeling method and device
CN112069293B (en) * 2020-09-14 2024-04-19 上海明略人工智能(集团)有限公司 Data labeling method, device, electronic equipment and computer readable medium
CN112069293A (en) * 2020-09-14 2020-12-11 上海明略人工智能(集团)有限公司 Data annotation method and device, electronic equipment and computer readable medium
CN112434736A (en) * 2020-11-24 2021-03-02 成都潜在人工智能科技有限公司 Deep active learning text classification method based on pre-training model
CN112651211A (en) * 2020-12-11 2021-04-13 北京大米科技有限公司 Label information determination method, device, server and storage medium
CN114637843A (en) * 2020-12-15 2022-06-17 阿里巴巴集团控股有限公司 Data processing method and device, electronic equipment and storage medium
CN112633344A (en) * 2020-12-16 2021-04-09 中国平安财产保险股份有限公司 Quality inspection model training method, quality inspection model training device, quality inspection model training equipment and readable storage medium
CN113239128B (en) * 2021-06-01 2022-03-18 平安科技(深圳)有限公司 Data pair classification method, device, equipment and storage medium based on implicit characteristics
CN113239128A (en) * 2021-06-01 2021-08-10 平安科技(深圳)有限公司 Data pair classification method, device, equipment and storage medium based on implicit characteristics
CN113590822A (en) * 2021-07-28 2021-11-02 北京百度网讯科技有限公司 Document title processing method, device, equipment, storage medium and program product
CN113590822B (en) * 2021-07-28 2023-08-08 北京百度网讯科技有限公司 Processing method, device, device, storage medium and program product of document title
CN113761034B (en) * 2021-09-15 2022-06-17 深圳信息职业技术学院 A data processing method and device thereof
CN113761034A (en) * 2021-09-15 2021-12-07 深圳信息职业技术学院 A data processing method and device thereof
CN114117043A (en) * 2021-11-24 2022-03-01 阿里巴巴(中国)有限公司 Model training method and device and computer storage medium
CN114417882A (en) * 2022-01-04 2022-04-29 马上消费金融股份有限公司 Data labeling method and device, electronic equipment and readable storage medium
CN114548074A (en) * 2022-02-15 2022-05-27 中电云脑(天津)科技有限公司 Method and device for determining medical data to be annotated
WO2024021526A1 (en) * 2022-07-29 2024-02-01 上海智臻智能网络科技股份有限公司 Method and apparatus for generating training samples, device, and storage medium
CN119513321A (en) * 2025-01-16 2025-02-25 煤炭科学研究总院有限公司 Classification method of coal industry vocabulary

Also Published As

Publication number Publication date
WO2020199591A1 (en) 2020-10-08

Similar Documents

Publication Publication Date Title
CN110110080A (en) Textual classification model training method, device, computer equipment and storage medium
Kukačka et al. Regularization for deep learning: A taxonomy
CN115311687B (en) Natural language pedestrian retrieval method and system with joint token and feature alignment
CN109840322B (en) Complete shape filling type reading understanding analysis model and method based on reinforcement learning
CN110969020A (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN104966105A (en) Robust machine error retrieving method and system
CN114220086B (en) A cost-effective scene text detection method and system
Wu et al. Optimized deep learning framework for water distribution data-driven modeling
CN111309918A (en) Multi-label text classification method based on label relevance
CN110188195A (en) A kind of text intension recognizing method, device and equipment based on deep learning
CN117409206B (en) Small sample image segmentation method based on self-adaptive prototype aggregation network
CN113836896A (en) Patent text abstract generation method and device based on deep learning
CN111145914A (en) Method and device for determining lung cancer clinical disease library text entity
Liu et al. Hybrid neural network text classification combining tcn and gru
Chatterjee et al. ImageNet classification using wordnet hierarchy
Feng et al. Enhancing fitness evaluation in genetic algorithm-based architecture search for AI-Aided financial regulation
Safdari et al. A hierarchical feature learning for isolated Farsi handwritten digit recognition using sparse autoencoder
CN115223189A (en) Method and system for recognizing secondary drawings of substation, and retrieval method and system
CN119003769A (en) Netizen view analysis method based on double large models
Wang et al. Efficient deep convolutional model compression with an active stepwise pruning approach
Passalis et al. Deep temporal logistic bag-of-features for forecasting high frequency limit order book time series
CN118606469A (en) Multi-classification prediction method for intangible cultural heritage text based on multi-head attention and semantic features
Asfaw Deep learning hyperparameter’s impact on potato disease detection
Shanmugasundaram et al. Detection accuracy improvement on one-stage object detection using ap-loss-based ranking module and resnet-152 backbone
CN115455162B (en) Answer sentence selection method and device based on hierarchical capsule and multi-view information fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190809)