CN107273500A - Text classifier generation method, text classification method, device and computer equipment
- Publication number
- CN107273500A (application CN201710457280.2A)
- Authority
- CN
- China
- Prior art keywords
- classification
- sample
- training
- classifier
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3346—Query execution using probabilistic model
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Creation or modification of classes or clusters
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a text classifier generation method, a text classification method, a device and computer equipment, which address the poor classification performance caused by sample crossover in a training sample set. The text classifier generation method includes: merging at least two original sample categories in the training sample set that exhibit sample crossover to obtain a new sample category, where the training sample set includes a plurality of original sample categories and each training sample belongs to one of them; training a first classifier on the training sample set after the merging operation; and training a second classifier on the training samples belonging to the new sample category together with the original sample categories to which those samples belong. The second classifier reclassifies any text to be classified that the first classifier assigned to the new sample category, placing it in the corresponding original sample category.
Description
Technical field
The invention relates to the field of communication technology, and in particular to a text classifier generation method, a text classification method, a device and computer equipment.
Background
In text classification, the quality of the training samples largely determines the performance of the classifier.
For example, when the training samples exhibit inter-class sample crossover, the two or more classes involved inevitably degrade the performance of the overall classifier, and the classification accuracy for those classes is low.
Summary of the invention
The technical problem to be solved by the invention is to provide a text classifier generation method, a text classification method, a device and computer equipment that address the poor classification performance caused in the prior art by sample crossover in the training sample set.
In one aspect, the invention provides a text classifier generation method, including: merging at least two original sample categories in the training sample set that exhibit sample crossover to obtain a new sample category, where the training sample set includes a plurality of original sample categories and each training sample belongs to one of them; training a first classifier on the training sample set after the merging operation; and training a second classifier on the training samples belonging to the new sample category together with the original sample categories to which those samples belong, where the second classifier is used to reclassify the texts to be classified that the first classifier assigned to the new sample category, so as to place them in the corresponding original sample categories.
Optionally, training the first classifier on the training sample set after the merging operation includes: training on that set with at least one of the following classification algorithms to obtain the first classifier: the naive Bayes (NB) classification algorithm, the support vector machine (SVM) classification algorithm, the K-nearest-neighbor (KNN) classification algorithm and the random forest classification algorithm.
Optionally, training the second classifier on the training samples belonging to the new sample category together with the original sample categories to which those samples belong includes: performing that training separately for each new sample category, to obtain a second classifier for each new sample category.
Optionally, the training samples belonging to the new sample category and the original sample categories to which they belong are trained with at least one of the following classification algorithms to obtain the second classifier: the naive Bayes (NB) classification algorithm, the support vector machine (SVM) classification algorithm, the K-nearest-neighbor (KNN) classification algorithm and the random forest classification algorithm; the classification categories of the second classifier correspond to the original sample categories to which the training samples in the new sample category belong.
Optionally, before merging the at least two original sample categories in the training sample set that exhibit sample crossover to obtain the new sample category, the method further includes: preprocessing a training corpus to filter it and/or unify its format; and segmenting the preprocessed training corpus into words according to a word segmentation dictionary to obtain the training sample set.
Optionally, the training corpus includes sentences and/or text fragments.
Optionally, the method further includes: performing a new word discovery operation on the training corpus and adding the discovered new words to the word segmentation dictionary.
Optionally, the new word discovery operation is implemented by at least one of the following: mutual information, co-occurrence probability and information entropy.
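As an illustration of the mutual-information option, the following minimal sketch scores adjacent character pairs by pointwise mutual information (PMI). The function name `pmi_candidates`, the restriction to two-character candidates and the `min_count` cutoff are assumptions made for this example; the patent does not specify how the score is computed or thresholded.

```python
import math
from collections import Counter


def pmi_candidates(corpus, min_count=2):
    """Score adjacent character pairs by pointwise mutual information.

    corpus: list of strings (sentences or text fragments).
    min_count: hypothetical cutoff; rarer pairs are ignored.
    Returns a dict mapping each two-character candidate to its PMI score.
    """
    unigrams = Counter()
    bigrams = Counter()
    for text in corpus:
        unigrams.update(text)
        bigrams.update(text[i:i + 2] for i in range(len(text) - 1))

    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())
    scores = {}
    for pair, count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / total_bi
        p_first = unigrams[pair[0]] / total_uni
        p_second = unigrams[pair[1]] / total_uni
        # PMI = log( P(xy) / (P(x) * P(y)) ); high values suggest a fixed unit.
        scores[pair] = math.log(p_pair / (p_first * p_second))
    return scores
```

Pairs with a high score would then be reviewed and, if accepted, added to the word segmentation dictionary.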
Optionally, the method further includes: testing the classification accuracy of each classification category of the first classifier; and testing the classification accuracy of each classification category of the second classifier. The classification accuracies of the first classifier are denoted P1j, where j is an integer from 1 to m and m is the number of sample categories in the training sample set after the merging operation. The classification accuracies of the second classifier are P1h*P2k, where k is an integer from 1 to n and n is the number of original sample categories to which the training samples in the new sample category belong; P1h is the classification accuracy of the new sample category in the first classifier, h is an integer from 1 to g, and g is the number of new sample categories. The method then checks whether the classification accuracy of every classification category of the first classifier is greater than a first probability threshold and whether the classification accuracy of every classification category of the second classifier is greater than a second probability threshold; if so, the first classifier and the second classifier are determined to have been trained successfully.
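The acceptance check can be sketched as follows. The text does not define P2k explicitly; this sketch assumes it is the second classifier's own accuracy for original category k, so that the end-to-end accuracy is the product P1h*P2k. The function name and the dictionary layout are hypothetical.

```python
def training_succeeded(p1, p2, threshold1, threshold2):
    """Check both accuracy conditions.

    p1: dict mapping each category of the first classifier to its accuracy P1j.
    p2: dict mapping each new (merged) category h to a dict that maps each
        original category k to the second classifier's accuracy P2k (assumed).
    Returns (ok, combined), where combined maps (h, k) to P1h * P2k.
    """
    first_ok = all(acc > threshold1 for acc in p1.values())
    combined = {
        (h, k): p1[h] * p2k
        for h, per_class in p2.items()
        for k, p2k in per_class.items()
    }
    second_ok = all(acc > threshold2 for acc in combined.values())
    return first_ok and second_ok, combined
```

For example, with P1 = 0.9 for a merged category G and P2 = 0.8 for original category A, the combined accuracy for A is 0.72, which passes a second threshold of 0.7 but fails one of 0.75.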
Optionally, after checking whether the classification accuracy of every classification category of the first classifier is greater than the first probability threshold and whether the classification accuracy of every classification category of the second classifier is greater than the second probability threshold, the method further includes: if not, repeating the merging operation on at least two original sample categories in the training sample set that exhibit sample crossover and repeating classifier training, until the classification accuracy of every classification category of the first classifier is greater than the first probability threshold and the classification accuracy of every classification category of the second classifier is greater than the second probability threshold.
Optionally, the method further includes: adjusting the first probability threshold and the second probability threshold to select first and second classifiers of different classification accuracies.
Optionally, the classification accuracy of the first classifier is tested by cross-validation based on the sample population; and the classification accuracy of the second classifier is tested by cross-validation based on the sample population.
Optionally, the cross-validation based on the sample population includes: using 60% to 90% of the samples in the sample population as the training sample set and the remaining samples as texts to be classified for testing, where the sample population includes the plurality of original sample categories and each sample belongs to one of them.
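A single split of this kind can be sketched as below. The function name `holdout_split`, the default fraction of 0.8 and the fixed random seed are assumptions for illustration; only the 60% to 90% range comes from the text.

```python
import random


def holdout_split(samples, train_fraction=0.8, seed=0):
    """Split labelled samples into a training part and a test part.

    samples: list of (text, original_label) pairs.
    train_fraction: share used for training; the text allows 0.6 to 0.9.
    seed: hypothetical fixed seed so the split is reproducible.
    """
    if not 0.6 <= train_fraction <= 0.9:
        raise ValueError("train_fraction should lie between 0.6 and 0.9")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Repeating the split with different seeds and averaging the per-category accuracies would give a more stable estimate than a single split.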
Optionally, merging at least two original sample categories that exhibit sample crossover into a new sample category includes: merging all original sample categories that exhibit sample crossover into a single new sample category.
Optionally, before merging the at least two original sample categories in the training sample set that exhibit sample crossover to obtain the new sample category, the method further includes: training, on the training sample set, a third classifier whose classification categories are the plurality of original sample categories; testing the classification accuracy of each classification category of the third classifier; preliminarily screening out the original sample categories whose classification accuracy is lower than a third threshold; and identifying, among the screened original sample categories, those that exhibit sample crossover.
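The screening step can be sketched as follows. The text does not say how crossover is identified among the low-accuracy categories; this sketch assumes that two screened categories are treated as crossed when the third classifier confuses them at least `min_confusions` times. Both that rule and the function name are hypothetical.

```python
from collections import Counter


def screen_crossed_categories(true_labels, predicted_labels,
                              accuracy_threshold, min_confusions=1):
    """Find low-accuracy categories and the pairs confused with each other.

    true_labels / predicted_labels: parallel lists from testing the third
    classifier, which is trained on all original categories.
    Returns (low, pairs): the categories whose per-category accuracy is below
    accuracy_threshold, and the pairs of such categories that are confused
    at least min_confusions times (an assumed crossover criterion).
    """
    totals = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    low = {c for c in totals if correct[c] / totals[c] < accuracy_threshold}
    confusions = Counter(
        frozenset((t, p))
        for t, p in zip(true_labels, predicted_labels)
        if t != p and t in low and p in low
    )
    pairs = {pair for pair, n in confusions.items() if n >= min_confusions}
    return low, pairs
```

Each returned pair is a candidate for the merging operation.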
In another aspect, the invention further provides a text classification method that classifies with the classifiers generated by the text classifier generation method provided by the invention, the classification method including: inputting a set of texts to be classified into the first classifier to obtain a first classification result; and inputting the texts whose category in the first classification result is the new sample category into the second classifier corresponding to that new sample category to obtain a second classification result.
Optionally, the method further includes: taking the classification results in the first classification result whose category is an unmerged original sample category, together with the second classification result, as the final classification result of the texts to be classified.
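The two-stage routing can be sketched as below. Classifiers are modelled as plain callables that map a text to a label, which is an assumption for illustration; any trained model with a per-text prediction function could stand in.

```python
def classify_two_stage(texts, first_classifier, second_classifiers):
    """Route texts through the first classifier, then refine merged labels.

    first_classifier: callable text -> label over the merged label set.
    second_classifiers: dict mapping each new (merged) category to a callable
        text -> original label. Both shapes are hypothetical.
    Returns a list of (text, final_label) pairs.
    """
    results = []
    for text in texts:
        label = first_classifier(text)
        refine = second_classifiers.get(label)
        if refine is not None:
            # The text landed in a merged category: split it back into
            # one of the original categories.
            label = refine(text)
        results.append((text, label))
    return results
```

Texts assigned to an unmerged original category keep the first classifier's label, so the returned list is the combined final result.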
In another aspect, the invention further provides a text classifier generation device, including: a merging unit, configured to merge at least two original sample categories in the training sample set that exhibit sample crossover to obtain a new sample category, where the training sample set includes a plurality of original sample categories and each training sample belongs to one of them; a first training unit, configured to train a first classifier on the training sample set after the merging operation; and a second training unit, configured to train a second classifier on the training samples belonging to the new sample category together with the original sample categories to which those samples belong, where the second classifier is used to reclassify the texts to be classified that the first classifier assigned to the new sample category, so as to place them in the corresponding original sample categories.
Optionally, the first training unit is specifically configured to: train on the training sample set after the merging operation with at least one of the following classification algorithms to obtain the first classifier: the naive Bayes (NB) classification algorithm, the support vector machine (SVM) classification algorithm, the K-nearest-neighbor (KNN) classification algorithm and the random forest classification algorithm.
Optionally, the second training unit is specifically configured to: train on the training samples belonging to the new sample category together with the original sample categories to which those samples belong, to obtain a second classifier for each new sample category.
Optionally, the second classification unit is specifically configured to train with at least one of the following classification algorithms to obtain the second classifier: the naive Bayes (NB) classification algorithm, the support vector machine (SVM) classification algorithm, the K-nearest-neighbor (KNN) classification algorithm and the random forest classification algorithm; the classification categories of the second classifier correspond to the original sample categories to which the training samples in the new sample category belong.
Optionally, the device further includes: a preprocessing unit, configured to preprocess a training corpus, before the at least two original sample categories in the training sample set that exhibit sample crossover are merged to obtain the new sample category, so as to filter the training corpus and/or unify its format; and a word segmentation unit, configured to segment the preprocessed training corpus into words according to a word segmentation dictionary to obtain the training sample set.
Optionally, the training corpus includes sentences or text fragments.
Optionally, the device further includes a new word discovery unit, configured to perform a new word discovery operation on the training corpus and add the discovered new words to the word segmentation dictionary.
Optionally, the new word discovery operation is implemented by at least one of the following: mutual information, co-occurrence probability and information entropy.
Optionally, the device further includes: a first testing unit, configured to test the classification accuracy of each classification category of the first classifier; and a second testing unit, configured to test the classification accuracy of each classification category of the second classifier. The classification accuracies of the first classifier are denoted P1j, where j is an integer from 1 to m and m is the number of sample categories in the training sample set after the merging operation. The classification accuracies of the second classifier are P1h*P2k, where k is an integer from 1 to n and n is the number of original sample categories to which the training samples in the new sample category belong; P1h is the classification accuracy of the new sample category in the first classifier, h is an integer from 1 to g, and g is the number of new sample categories. The device further includes: a detection unit, configured to check whether the classification accuracy of every classification category of the first classifier is greater than a first probability threshold and whether the classification accuracy of every classification category of the second classifier is greater than a second probability threshold; and a determination unit, configured to determine, if the detection result of the detection unit is yes, that the first classifier and the second classifier have been trained successfully.
Optionally, the device further includes: a return unit, configured to, if the detection result of the detection unit is no, repeat the merging operation on at least two original sample categories in the training sample set that exhibit sample crossover and repeat classifier training, until the classification accuracy of every classification category of the first classifier is greater than the first probability threshold and the classification accuracy of every classification category of the second classifier is greater than the second probability threshold.
Optionally, the device further includes: an adjustment unit, configured to adjust the first probability threshold and the second probability threshold to select first and second classifiers of different classification accuracies.
Optionally, the first testing unit is specifically configured to test the classification accuracy of the first classifier by cross-validation based on the sample population; and the second testing unit is specifically configured to test the classification accuracy of the second classifier by cross-validation based on the sample population.
Optionally, the cross-validation based on the sample population includes: using 60% to 90% of the samples in the sample population as the training sample set and the remaining samples as texts to be classified for testing, where the sample population includes the plurality of original sample categories and each sample belongs to one of them.
Optionally, the merging unit is specifically configured to merge all original sample categories that exhibit sample crossover into a single new sample category.
Optionally, the device further includes a screening unit, configured to, before the at least two original sample categories in the training sample set that exhibit sample crossover are merged to obtain the new sample category: train, on the training sample set, a third classifier whose classification categories are the plurality of original sample categories; test the classification accuracy of each classification category of the third classifier; preliminarily screen out the original sample categories whose classification accuracy is lower than a third threshold; and identify, among the screened original sample categories, those that exhibit sample crossover.
In another aspect, the invention further provides a text classification device that classifies with the classifiers generated by any text classifier generation device provided by the invention, the classification device including: a first input unit, configured to input a set of texts to be classified into the first classifier to obtain a first classification result; and a second input unit, configured to input the texts whose category in the first classification result is the new sample category into the second classifier corresponding to that new sample category to obtain a second classification result.
Optionally, the device further includes: a result output unit, configured to take the classification results in the first classification result whose category is an unmerged original sample category, together with the second classification result, as the final classification result of the texts to be classified.
In another aspect, the invention further provides computer equipment including a processor and a memory; the memory is configured to store computer instructions, and the processor is configured to run the computer instructions stored in the memory to execute any text classifier generation method provided by the invention.
In another aspect, the invention further provides computer equipment including a processor and a memory; the memory is configured to store computer instructions, and the processor is configured to run the computer instructions stored in the memory to execute any text classification method provided by the invention.
In another aspect, the invention further provides a computer-readable storage medium storing instructions that, when run, execute any text classifier generation method provided by the invention.
In another aspect, the invention further provides a computer-readable storage medium storing instructions that, when run, execute any text classification method provided by the invention.
With the text classifier generation method, text classification method, device and computer equipment provided by the embodiments of the invention, training the first classifier accurately separates the original sample categories that exhibit sample crossover from those that do not, and training the second classifier isolates the original sample categories that exhibit sample crossover so that finer-grained classification training can be carried out within a narrower scope, which greatly improves the classification accuracy of the text classifier.
Brief description of the drawings
Fig. 1 is a flowchart of a text classifier generation method provided by an embodiment of the invention;
Fig. 2 is a detailed flowchart of a text classifier generation method provided by an embodiment of the invention;
Fig. 3 is a flowchart of a text classification method provided by an embodiment of the invention;
Fig. 4 is a schematic structural diagram of a text classifier generation device provided by an embodiment of the invention;
Fig. 5 is a schematic structural diagram of a text classification device provided by an embodiment of the invention.
具体实施方式detailed description
以下结合附图对本发明进行详细说明。应当理解,此处所描述的具体实施例仅仅用以解释本发明,并不限定本发明。The present invention will be described in detail below in conjunction with the accompanying drawings. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention.
如图1所示,本发明实施例提供一种文本分类器生成方法,包括:As shown in Figure 1, an embodiment of the present invention provides a method for generating a text classifier, including:
S11,将训练样本集中存在样本交叉的至少两个原始样本类别进行合并操作得到新样本类别;所述训练样本集包括多个原始样本类别,每个训练样本属于所述多个原始样本类别中的一个;S11, performing a merging operation on at least two original sample categories with sample intersections in the training sample set to obtain a new sample category; the training sample set includes multiple original sample categories, and each training sample belongs to one of the multiple original sample categories one;
S12,根据合并操作后的训练样本集训练得到第一分类器;S12, training according to the training sample set after the merging operation to obtain the first classifier;
S13,根据属于所述新样本类别的训练样本以及所述属于所述新样本类别的训练样本所属的原始样本类别进行训练得到第二分类器,所述第二分类器用于对所述第一分类器的分类结果中分类类别为所述新样本类别的待分类文本进行再次分类,以分入相应的原始样本类别中。S13. Perform training according to the training samples belonging to the new sample category and the original sample category to which the training samples belonging to the new sample category belong to obtain a second classifier, and the second classifier is used to classify the first classifier The text to be classified whose classification category is the new sample category in the classification result of the detector is classified again, so as to be classified into the corresponding original sample category.
本发明的实施例提供的文本分类器生成方法,在根据训练样本训练文本分类器时采用分层次的分类器生成方式,先将训练样本集中存在样本交叉的两个或者更多的原始样本类别合并成新样本类别,根据合并操作后的训练样本集训练得到第一分类器,再对新样本类别进行更细致的分类训练得到第二分类器。这样,通过第一分类器的训练,可以准确地将存在样本交叉的原始样本类别与不存在样本交叉的原始样本类别区分开,通过第二分类器的训练,可以将存在样本交叉的原始样本类别单独分离出来,在更具体的范围内进行更细致的分类训练,从而大大提高了文本分类器的分类准确率。The text classifier generation method provided by the embodiment of the present invention adopts a hierarchical classifier generation method when training the text classifier according to the training samples, and first merges two or more original sample categories with sample intersections in the training sample set into a new sample category, and obtain the first classifier according to the training sample set after the merging operation, and then conduct more detailed classification training on the new sample category to obtain the second classifier. In this way, through the training of the first classifier, the original sample category with sample crossover can be accurately distinguished from the original sample category without sample crossover, and through the training of the second classifier, the original sample category with sample crossover can be distinguished It is separated separately, and more detailed classification training is carried out in a more specific range, thus greatly improving the classification accuracy of the text classifier.
Specifically, sample crossover in the embodiments of the present invention means that, in the provided training sample set, the category assignment of the sample data is not entirely clear or accurate. For example, if sample data that should belong to category A has been placed in category B, sample crossover is considered to exist between category A and category B. Sample crossover is also known as class overlap or dataset overlap. Because the text classifier is trained on such a training sample set, sample crossover in the training sample set inevitably affects the classification accuracy of the resulting text classifier. The text classifier generation method provided by the embodiments of the present invention effectively addresses this crossover between sample categories. A detailed description is given below.
In step S11, the training sample set includes a plurality of original sample categories, which correspond to the target categories the user expects to obtain, and each training sample belongs to one of the plurality of original sample categories. In one embodiment of the present invention, the training sample set includes four original sample categories A, B, C, and D, and sample crossover exists between original sample category A and original sample category C. A and C can then be merged to generate a new sample category G, so that the merged training sample set includes original sample categories B and D and new sample category G.
Correspondingly, in step S12, training the first classifier on the training sample set obtained after the merging operation means assigning every element of the training sample set to one of sample categories B, D, and G. The first classifier is obtained by training on this classification.
Of course, more than two original sample categories may exhibit sample crossover, and more than one new sample category may result from merging, provided that this helps correct the sample crossover and improves classification accuracy. The embodiments of the present invention place no limitation on this.
For example, in the above embodiment, sample crossover may exist among all four original sample categories A, B, C, and D, or sample crossover may exist between A and B and, at the same time, between C and D. Correspondingly, the merging operation may merge the four original sample categories A, B, C, and D into one new sample category G1, that is, merge all original sample categories that exhibit sample crossover into a single new sample category. Alternatively, A and B may be merged into a new sample category G2 and C and D into a new sample category G3, that is, the original sample categories that exhibit sample crossover are merged into multiple new sample categories.
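The merging operation described above amounts to relabeling the training samples. The following is a minimal sketch of that relabeling, not the patented implementation; the function name merge_labels and the dictionary layout of merge_groups are assumptions made for illustration.

```python
def merge_labels(original_labels, merge_groups):
    """Replace each original label by the new category it was merged into.

    merge_groups maps a new category name to the set of original
    categories merged into it (hypothetical layout). Labels that are
    not part of any group are kept unchanged.
    """
    lookup = {
        original: new
        for new, members in merge_groups.items()
        for original in members
    }
    return [lookup.get(label, label) for label in original_labels]


# A and C exhibit sample crossover and are merged into G.
labels = ["A", "B", "C", "D", "A", "C"]
merged_single = merge_labels(labels, {"G": {"A", "C"}})

# A/B and C/D are merged into two separate new categories G2 and G3.
merged_multi = merge_labels(
    ["A", "B", "C", "D"], {"G2": {"A", "B"}, "G3": {"C", "D"}}
)
```

The first classifier would then be trained on the samples paired with the merged labels, while the original labels are retained for training the second classifier.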
Optionally, the first classifier may be trained on the training sample set obtained after the merging operation using one or more of the following classification algorithms: the naive Bayes (NB) classification algorithm, the support vector machine (SVM) classification algorithm, the k-nearest neighbors (KNN) classification algorithm, the random forest classification algorithm, and the like.
It should be noted that the original sample categories that exhibit sample crossover may be known in advance or may still need to be identified. When the number of samples in the training sample set and the number of original sample categories are large, the original sample categories that exhibit sample crossover can be identified as follows:
before merging the at least two original sample categories in the training sample set that exhibit sample crossover to obtain the new sample category, training, on the training sample set, a third classifier whose classification categories are the plurality of original sample categories, testing the classification accuracy of each classification category of the third classifier, and preliminarily screening out the original sample categories whose classification accuracy is lower than a third threshold;
identifying, among the preliminarily screened original sample categories, the original sample categories that exhibit sample crossover.
The principle of this identification method is that sample crossover between categories in the training sample set necessarily degrades the classification accuracy of the trained classifier. Classification accuracy can therefore be used for a preliminary screening of the training sample set, after which the original sample categories that exhibit sample crossover are identified among the screened original sample categories.
The original sample categories that exhibit sample crossover can be identified by manually checking the original sample categories whose accuracy is below the third threshold, or by machine data matching on those categories.
The higher the third threshold is set, the more sensitive the detection of sample crossover becomes. For example, suppose the training sample set contains sample data that should belong to category A but has been placed in category B, while category A contains no samples that should belong to other categories. The classification accuracy of the third classifier for category B will then fall below the third threshold. Although the classification accuracy for category A is not affected, manual checking or machine data matching will reveal that sample crossover exists between category A and category B in the training sample set. As another example, suppose sample data that should belong to category A has been placed in category B, and sample data that should belong to category B has been placed in category A. The classification accuracy of the third classifier for both category B and category A will then fall below the third threshold, and manual checking or machine data matching can determine whether categories A and B also exhibit sample crossover with other categories.
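The screening step can be illustrated with a short sketch. It assumes that the per-category "classification accuracy" is the fraction of samples labeled with a category that the third classifier also assigns to that category; the text does not define the metric precisely. The helper most_confused_with is a hypothetical stand-in for "machine data matching", not a method named in the source.

```python
from collections import Counter


def per_class_accuracy(true_labels, predicted_labels):
    """Fraction of samples of each labeled category predicted as that category."""
    totals = Counter(true_labels)
    correct = Counter(
        t for t, p in zip(true_labels, predicted_labels) if t == p
    )
    return {label: correct[label] / totals[label] for label in totals}


def screen_categories(accuracy, third_threshold):
    """Return the categories whose accuracy is below the third threshold."""
    return sorted(
        label for label, acc in accuracy.items() if acc < third_threshold
    )


def most_confused_with(label, true_labels, predicted_labels):
    """Hypothetical matching heuristic: the most frequent wrong prediction."""
    wrong = Counter(
        p for t, p in zip(true_labels, predicted_labels)
        if t == label and p != label
    )
    return wrong.most_common(1)[0][0] if wrong else None


# Toy predictions of a third classifier: half of the samples labeled B
# are predicted as A, while A and C are predicted correctly.
true_labels = ["A", "A", "B", "B", "B", "B", "C", "C"]
predicted = ["A", "A", "B", "A", "A", "B", "C", "C"]

accuracy = per_class_accuracy(true_labels, predicted)
candidates = screen_categories(accuracy, third_threshold=0.9)
partner = most_confused_with("B", true_labels, predicted)
```

Here only category B is screened out, and its most frequent wrong prediction points to A, which is the kind of pair a manual check would then confirm as exhibiting sample crossover.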
Testing the classification accuracy of each classification category of the third classifier may specifically be performed by cross-validation over the sample population.
After the first classifier has been trained, the second classifier can be trained in step S13. Specifically, training may be performed on the training samples that belong to the new sample category, labeled with the original sample categories to which those training samples belong, so as to obtain a separate second classifier for each new sample category.
Continuing with the above embodiment, if sample crossover exists between original sample category A and original sample category C, A and C are merged to generate new sample category G. Classification training is performed on the merged training sample set, assigning each element of the training sample set to sample category B, D, or G, which yields the first classifier. After the first classifier is obtained, classification training is performed on the samples in new sample category G, subdividing the elements of G into category A and category C, where A and C are the original sample categories to which the training samples belonging to new sample category G belong.
In this way, the second classifier isolates the original sample categories that exhibit sample crossover and is trained in finer detail within that narrower scope, thereby greatly improving the classification accuracy of the text classifier.
Optionally, the classification algorithm used to obtain the second classifier may include one or more of the following: the naive Bayes (NB) classification algorithm, the support vector machine (SVM) classification algorithm, the k-nearest neighbors (KNN) classification algorithm, and the random forest classification algorithm. The classification categories of the second classifier correspond to the original sample categories to which the training samples in the new sample category belong.
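As an illustration of how the training data for the second classifier could be assembled, the sketch below selects, for each new sample category, the samples that were merged into it together with their original labels. The function name and data layout are assumptions; any of the algorithms listed above could then be trained on each subset.

```python
def second_stage_training_sets(samples, original_labels, merge_groups):
    """Build one (sample, original label) training subset per new category.

    merge_groups maps a new category name to the set of original
    categories merged into it (hypothetical layout).
    """
    subsets = {new: [] for new in merge_groups}
    for sample, label in zip(samples, original_labels):
        for new, members in merge_groups.items():
            if label in members:
                subsets[new].append((sample, label))
    return subsets


samples = ["text 1", "text 2", "text 3", "text 4", "text 5"]
original_labels = ["A", "B", "C", "D", "A"]

# Only the samples merged into G are kept, with original labels A and C.
subsets = second_stage_training_sets(
    samples, original_labels, {"G": {"A", "C"}}
)
```

With several new sample categories, the returned dictionary holds one subset per category, matching the statement that a separate second classifier is obtained for each new sample category.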
Further, the training sample set in the above embodiments can be obtained by processing a training corpus. To obtain the training sample set, in one embodiment of the present invention, before the at least two original sample categories in the training sample set that exhibit sample crossover are merged to obtain the new sample category, the text classifier generation method provided by the embodiments of the present invention may further include:
preprocessing the training corpus so as to filter the training corpus and/or unify its format;
performing word segmentation on the preprocessed training corpus according to a word segmentation dictionary to obtain the training sample set.
Specifically, the collected training corpus includes sentences and/or text fragments and may take various forms, such as speech, text, and images. Preprocessing first converts the acquired training corpus to a uniform text format, filters out invalid formats, and saves the result for later use. Word segmentation is then performed on the preprocessed training corpus according to the word segmentation dictionary to obtain the training sample set.
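The text does not name a particular segmentation algorithm. As one possible illustration, the sketch below uses forward maximum matching, a common dictionary-based technique chosen here as an assumption; the function name, the max_word_length parameter, and the toy dictionary are all hypothetical.

```python
def forward_max_match(text, dictionary, max_word_length=4):
    """Segment text by repeatedly taking the longest dictionary word.

    Characters that start no dictionary word are emitted as
    single-character tokens.
    """
    tokens = []
    position = 0
    while position < len(text):
        longest = min(max_word_length, len(text) - position)
        for size in range(longest, 0, -1):
            piece = text[position:position + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                position += size
                break
    return tokens


dictionary = {"文本", "分类器", "生成"}
tokens = forward_max_match("文本分类器生成", dictionary)

# Without the entry "分类器", the same text falls apart into single characters.
tokens_small = forward_max_match("文本分类器生成", {"文本", "生成"})
```

The second call shows why the coverage of the segmentation dictionary matters: a word missing from the dictionary is split into single characters.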
Further, the word segmentation dictionary can be expanded. For example, a new word discovery operation can be performed on the training corpus and the discovered new words added to the word segmentation dictionary. New words obtained in this way are used to update the word segmentation dictionary, and subsequent word segmentation is performed according to the updated dictionary. The word segmentation dictionary is thus continuously improved, which effectively raises the accuracy of word segmentation.
Optionally, the new word discovery operation may be implemented using one or more of the following: mutual information, co-occurrence probability, and information entropy.
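Of the three measures, mutual information is the simplest to sketch. The example below computes pointwise mutual information (PMI) for adjacent token pairs, where pairs with high PMI occur together more often than chance and are candidates for new words. This is an illustrative sketch under the assumption that probabilities are estimated from raw counts; for Chinese text the tokens would typically be single characters or short segments.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Pointwise mutual information, in bits, of each adjacent token pair."""
    if len(tokens) < 2:
        return {}
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_pairs = len(tokens) - 1
    scores = {}
    for (left, right), count in bigrams.items():
        p_pair = count / n_pairs
        p_left = unigrams[left] / n_tokens
        p_right = unigrams[right] / n_tokens
        scores[(left, right)] = math.log2(p_pair / (p_left * p_right))
    return scores


# "ma" is always followed by "chine", so that pair scores highest.
tokens = ["ma", "chine", "is", "ma", "chine", "not", "is", "not"]
scores = pmi_scores(tokens)
best_pair = max(scores, key=scores.get)
```

In a fuller pipeline the highest-scoring pairs would be manually filtered before being added to the word segmentation dictionary, as step S202 later describes.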
To determine the classification performance of the generated first classifier and second classifier, the text classifier generation method provided by the embodiments of the present invention may further include:
testing the classification accuracy of each classification category of the first classifier;
testing the classification accuracy of each classification category of the second classifier;
wherein the classification accuracies of the first classifier are P1j, where j is an integer greater than or equal to 1 and less than or equal to m, and m is the number of sample categories in the training sample set after the merging operation;
the classification accuracies of the second classifier are P1h*P2k, where k is an integer greater than or equal to 1 and less than or equal to n, and n is the number of original sample categories to which the training samples in the new sample category belong; P1h is the classification accuracy of the first classifier for the new sample category, h is an integer greater than or equal to 1 and less than or equal to g, and g is the number of new sample categories;
checking whether the classification accuracy of every classification category of the first classifier is greater than a first probability threshold, and whether the classification accuracy of every classification category of the second classifier is greater than a second probability threshold;
if so, determining that the first classifier and the second classifier have been trained successfully.
For example, suppose the first probability threshold is 0.98 and the second probability threshold is 0.95. If, during testing, the classification accuracy of every classification category in the results of the first classifier is greater than 0.98, and the classification accuracy of every classification category in the results of the second classifier is greater than 0.95, then the classification accuracy of the text classifier generated by the method provided by the embodiments of the present invention meets the user's requirement.
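The success check can be sketched as follows, using the definition above that the accuracy reported for a second-classifier category is the product P1h*P2k. The function names, the dictionary layout, and the accuracy values are hypothetical and chosen only to exercise the thresholds 0.98 and 0.95 from the example.

```python
def combined_accuracies(p1, p2):
    """Return P1h * P2k for every original category inside a new category.

    p1 maps each first-classifier category to its accuracy P1j.
    p2 maps each new category to {original category: accuracy P2k}.
    """
    return {
        original: p1[new] * accuracy
        for new, per_original in p2.items()
        for original, accuracy in per_original.items()
    }


def training_succeeded(p1, p2, first_threshold, second_threshold):
    """True if every accuracy exceeds its corresponding threshold."""
    first_ok = all(acc > first_threshold for acc in p1.values())
    second_ok = all(
        acc > second_threshold
        for acc in combined_accuracies(p1, p2).values()
    )
    return first_ok and second_ok


# Hypothetical accuracies: G is the new category merged from A and C.
p1 = {"B": 0.99, "D": 0.985, "G": 0.99}
p2 = {"G": {"A": 0.97, "C": 0.96}}
ok = training_succeeded(p1, p2, first_threshold=0.98, second_threshold=0.95)

# Lowering P2 for C to 0.95 gives 0.99 * 0.95 = 0.9405, which fails.
p2_weak = {"G": {"A": 0.97, "C": 0.95}}
ok_weak = training_succeeded(p1, p2_weak, 0.98, 0.95)
```

Because the second-stage figure is a product, a second classifier that is accurate in isolation can still fail the check when the first classifier's accuracy for the new sample category is low.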
Optionally, when testing classification accuracy, data whose attributes are the same as or similar to those of the classification categories of the training sample set may be used, all of which is labeled with the relevant classification categories. Such data may be constructed by an algorithm or obtained by cross-validation.
Specifically, the classification accuracy of the first classifier may be tested by cross-validation over the sample population, and the classification accuracy of the second classifier may likewise be tested by cross-validation over the sample population.
Here, the sample population refers to all sample data related to the classification task at hand. Cross-validation over the sample population may use one part of the sample population as the training sample set and another part as the test sample set. For instance, 60% to 90% (for example, 80%) of the samples in the sample population may be used as the training sample set, and the remaining samples used as texts to be classified for testing. The sample population includes the plurality of original sample categories, and each sample belongs to one of the plurality of original sample categories.
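One way to realize the 80%/20% partition repeatedly is k-fold cross-validation with k = 5. The sketch below generates the index splits only. It is an assumption that k-fold is what is intended, since the text describes the partition but not the exact scheme, and in practice the samples would normally be shuffled before splitting.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds.

    With k = 5, each split uses 80% of the samples for training and
    the remaining 20% for testing.
    """
    indices = list(range(n_samples))
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [
            index
            for j, fold in enumerate(folds)
            if j != i
            for index in fold
        ]
        yield train, test


splits = list(k_fold_splits(n_samples=10, k=5))
```

Averaging the per-category accuracy over the k test folds gives the accuracy figures that are compared with the thresholds.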
Optionally, after checking whether the classification accuracy of every classification category of the first classifier is greater than the first probability threshold and whether the classification accuracy of every classification category of the second classifier is greater than the second probability threshold, the text classifier generation method provided by the embodiments of the present invention may further include: if not, repeating the merging operation and the classifier training on the at least two original sample categories in the training sample set that exhibit sample crossover, until the classification accuracy of every classification category of the first classifier is greater than the first probability threshold and the classification accuracy of every classification category of the second classifier is greater than the second probability threshold.
In other words, if, after the first classifier and the second classifier have been applied, the classification accuracy of any classification category is found to be below the corresponding probability threshold, the first classifier and the second classifier do not yet meet the user's requirement. The merging operation and classifier training must therefore be repeated on the original sample categories that exhibit sample crossover until the accuracy meets the threshold requirements.
Optionally, to accommodate different requirements on classification accuracy, in one embodiment of the present invention the first probability threshold and the second probability threshold may be adjusted so as to select first and second classifiers with different classification accuracies.
The text classifier generation method provided by the embodiments of the present invention is described in detail below by way of a specific embodiment.
As shown in FIG. 2, the text classifier generation method provided by this embodiment may specifically include the following steps:
S201, preprocessing: convert the acquired training corpus to a uniform text format, filter out invalid formats, and save the result for later use;
S202, new word discovery: use an existing new word discovery tool to find candidate new words in the training corpus, and add them to the word segmentation dictionary after manual filtering;
S203: perform word segmentation on the preprocessed training corpus according to the word segmentation dictionary;
S204, sample screening: construct a classifier over the original sample categories, and test its accuracy P0 by cross-validation over the sample population (the accuracies of the individual categories being P01, P02, ..., P0i, ...). Based on the classification results (categories that exhibit sample crossover all have relatively low accuracy), select, either manually or by simple matching, the category sets that exhibit sample crossover (categories that cross with one another form one category set, so there may be one or more category sets);
S205, sample reorganization: merge the two or more categories that exhibit sample crossover, and leave the other categories unchanged;
S206, training the first classifier: perform classification on the training sample set obtained after the merging operation, and likewise test the classification accuracy P1 by cross-validation over the sample population (the accuracies of the individual categories being P11, P12, ..., P1j, ...);
S207, training the second classifier: construct one or more classifiers for the new sample categories generated by merging, and test their classification accuracy P2 by cross-validation over the sample population (the accuracies of the individual categories being P21, P22, ..., P2k, ...);
S208, accuracy test: check whether the classification accuracy of every classification category of the first classifier is greater than the first probability threshold, and whether the classification accuracy of every classification category of the second classifier is greater than the second probability threshold;
if so, determine that the first classifier and the second classifier have been trained successfully;
if not, repeat the merging operation and the classifier training on the at least two original sample categories in the training sample set that exhibit sample crossover, until the classification accuracy of every classification category of the first classifier is greater than the first probability threshold and the classification accuracy of every classification category of the second classifier is greater than the second probability threshold.
Correspondingly, as shown in FIG. 3, an embodiment of the present invention further provides a text classification method that performs classification using classifiers generated by any of the text classifier generation methods provided in the foregoing embodiments. The classification method includes:
S31: inputting a set of texts to be classified into the first classifier to obtain a first classification result;
S32: inputting the texts to be classified that the first classification result assigns to the new sample category into the second classifier corresponding to that new sample category, to obtain a second classification result.
The text classification method provided by the embodiments of the present invention uses a text classifier generated by any of the text classifier generation methods provided in the foregoing embodiments. The first classifier can accurately distinguish the original sample categories that exhibit sample crossover from those that do not, while the second classifier isolates the original sample categories that exhibit sample crossover and is trained in finer detail within that narrower scope, thereby greatly improving the accuracy of text classification.
Optionally, the text classification method provided by the embodiments of the present invention may further include: taking the results in the first classification result whose classification category is an unmerged original sample category, together with the second classification result, as the final classification result of the texts to be classified.
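Steps S31 and S32 and the combination of results can be sketched as a two-stage dispatch. The classifiers here are modeled as plain callables, and the keyword rules standing in for trained classifiers are purely hypothetical, chosen only to make the example self-contained.

```python
def classify_two_stage(texts, first_classifier, second_classifiers):
    """Classify each text with the first classifier, then refine.

    second_classifiers maps each new sample category to the second
    classifier trained for it. Texts assigned to an unmerged original
    category keep the first classifier's result.
    """
    results = []
    for text in texts:
        label = first_classifier(text)
        refine = second_classifiers.get(label)
        results.append(refine(text) if refine else label)
    return results


# Hypothetical keyword rules standing in for trained classifiers.
def first_classifier(text):
    if "refund" in text or "invoice" in text:
        return "G"  # new sample category merged from A and C
    return "B" if "login" in text else "D"


def second_classifier_g(text):
    return "A" if "refund" in text else "C"


texts = ["refund my order", "invoice missing", "login fails", "hello"]
final = classify_two_stage(texts, first_classifier, {"G": second_classifier_g})
```

The returned list contains only original sample categories, which corresponds to combining the unmerged first-stage results with the second classification result.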
The text classification method provided by the embodiments of the present invention uses a text classifier generated by any of the text classifier generation methods provided in the foregoing embodiments. The specific classification process and principle have been described in detail above and are not repeated here.
Correspondingly, as shown in FIG. 4, an embodiment of the present invention further provides a text classifier generation device, including:
a merging unit 41, configured to merge at least two original sample categories in a training sample set that exhibit sample crossover to obtain a new sample category, wherein the training sample set includes a plurality of original sample categories and each training sample belongs to one of the plurality of original sample categories;
a first training unit 42, configured to train a first classifier on the training sample set obtained after the merging operation;
a second training unit 43, configured to train a second classifier on the training samples that belong to the new sample category, labeled with the original sample categories to which those training samples belong, wherein the second classifier is used to reclassify any text to be classified that the first classifier has assigned to the new sample category, so that the text is placed in the corresponding original sample category.
The text classifier generation device provided by the embodiments of the present invention generates classifiers hierarchically when training a text classifier from training samples. Two or more original sample categories in the training sample set that exhibit sample crossover are first merged into a new sample category, and a first classifier is trained on the training sample set obtained after the merging operation. A second classifier is then obtained by finer-grained classification training on the new sample category. In this way, the first classifier can accurately distinguish the original sample categories that exhibit sample crossover from those that do not, while the second classifier isolates the original sample categories that exhibit sample crossover and is trained in finer detail within that narrower scope, thereby greatly improving the classification accuracy of the text classifier.
Specifically, sample crossover in the embodiments of the present invention means that, in the provided training sample set, the category assignment of the sample data is not entirely clear or accurate. For example, if sample data that should belong to category A has been placed in category B, sample crossover is considered to exist between category A and category B. Because the text classifier is trained on such a training sample set, sample crossover in the training sample set inevitably affects the classification accuracy of the resulting text classifier. The text classifier generation device provided by the embodiments of the present invention effectively addresses this crossover between sample categories. A detailed description is given below.
Optionally, the merging unit may be specifically configured to merge all original sample categories that exhibit sample crossover into one new sample category.
Optionally, when the merging unit 41 performs the merging operation, the training sample set includes a plurality of original sample categories, which correspond to the target categories the user expects to obtain, and each training sample belongs to one of the plurality of original sample categories. In one embodiment of the present invention, the training sample set includes four original sample categories A, B, C, and D, and sample crossover exists between original sample category A and original sample category C. A and C can then be merged to generate a new sample category G, so that the merged training sample set includes original sample categories B and D and new sample category G.
Correspondingly, when the first training unit 42 trains the first classifier on the training sample set obtained after the merging operation, every element of the training sample set is assigned to one of sample categories B, D, and G. The first classifier is obtained by training on this classification.
Of course, more than two original sample categories may exhibit sample crossover, and more than one new sample category may result from merging, provided that this helps correct the sample crossover and improves classification accuracy. The embodiments of the present invention place no limitation on this.
For example, in the above embodiment, sample crossover may exist among all four original sample categories A, B, C, and D, or sample crossover may exist between A and B and, at the same time, between C and D. Correspondingly, the merging operation may merge the four original sample categories A, B, C, and D into one new sample category G1, that is, merge all original sample categories that exhibit sample crossover into a single new sample category. Alternatively, A and B may be merged into a new sample category G2 and C and D into a new sample category G3, that is, the original sample categories that exhibit sample crossover are merged into multiple new sample categories.
Optionally, the first training unit 42 may specifically be configured to:
train on the merged training sample set using at least one of the following classification algorithms to obtain the first classifier: the naive Bayes (NB) classification algorithm, the support vector machine (SVM) classification algorithm, the K-nearest-neighbor (KNN) classification algorithm, and the random forest classification algorithm.
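As an illustration of training the first classifier on merged labels, the following is a minimal sketch of one of the listed algorithms, naive Bayes, written with the standard library only. The class name `TinyNaiveBayes`, the add-one smoothing, and the toy token lists are assumptions for this example and are not taken from the description.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Minimal multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, documents, labels):
        # Count documents per category and word occurrences per category.
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(doc_count / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy segmented documents whose labels are already merged (G replaces A and C).
documents = [
    ["refund", "order"], ["refund", "payment"],
    ["login", "password"], ["password", "reset"],
    ["delivery", "late"], ["delivery", "courier"],
]
merged_labels = ["G", "G", "B", "B", "D", "D"]

first_classifier = TinyNaiveBayes().fit(documents, merged_labels)
predicted = first_classifier.predict(["password", "login"])
```

The same class could be fitted a second time on only the samples of G, with their original labels A and C, to play the role of the second classifier.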
It should be noted that the original sample categories exhibiting sample overlap may be known in advance or may still need to be identified. When the number of samples in the training sample set and the data volume of the original sample categories are large, the text classifier generation apparatus may further include a screening unit configured to:
before the merging operation is performed on at least two original sample categories with sample overlap in the training sample set to obtain a new sample category, train on the training sample set to obtain a third classifier whose classification categories are the plurality of original sample categories, test the classification accuracy of each classification category of the third classifier, and preliminarily screen out the original sample categories whose classification accuracy is below a third threshold; and
identify, among the preliminarily screened original sample categories, the original sample categories that exhibit sample overlap.
The principle behind this identification method is that inter-class sample overlap in the training sample set necessarily degrades the classification accuracy of the trained classifier. Classification accuracy can therefore be used for a preliminary screening of the training sample set, after which the original sample categories that actually exhibit sample overlap are identified among the screened categories.
Here, the original sample categories with sample overlap can be identified by manually checking, or by performing machine data matching on, the original sample categories whose classification accuracy is below the third threshold.
The higher the third threshold is set, the more sensitive the detection of sample overlap. For example, suppose that, among the original sample categories of the training sample set, sample data that should belong to category A has been placed in category B, while category A contains no samples that should belong to other categories. In this case the classification accuracy of category B in the third classifier falls below the third threshold. Although the classification accuracy of category A is not affected, manual checking or machine data matching reveals that categories A and B overlap in the training sample set. Suppose instead that sample data that should belong to category A has been placed in category B, and sample data that should belong to category B has been placed in category A. In this case the classification accuracies of both category B and category A in the third classifier fall below the third threshold, and manual checking or machine data matching can determine whether categories A and B also overlap with other categories.
Testing the classification accuracy of each classification category of the third classifier may specifically be: testing the classification accuracy of the third classifier by cross-validation based on the sample population.
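The preliminary screening step can be sketched as follows. The description does not define how per-category accuracy is computed; this sketch assumes it is the fraction of test samples of a category that the third classifier labels correctly. The function names and the toy labels are hypothetical.

```python
from collections import Counter


def per_category_accuracy(true_labels, predicted_labels):
    """Fraction of test samples of each category that were predicted correctly."""
    totals = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {category: correct[category] / totals[category] for category in totals}


def screen_categories(true_labels, predicted_labels, third_threshold):
    """Return the categories whose accuracy falls below the third threshold."""
    accuracy = per_category_accuracy(true_labels, predicted_labels)
    return sorted(c for c, value in accuracy.items() if value < third_threshold)


# Toy test results of a third classifier: two samples labeled B are predicted as A.
true_labels      = ["A", "A", "A", "A", "B", "B", "B", "B", "D", "D"]
predicted_labels = ["A", "A", "A", "A", "B", "A", "A", "B", "D", "D"]

candidates = screen_categories(true_labels, predicted_labels, third_threshold=0.9)
```

With these toy results only B is flagged, which mirrors the first case in the text: the accuracy of B drops while that of A is unaffected, and the overlap between A and B is then confirmed by manual checking or machine data matching.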
After the first classifier has been trained, the second training unit 43 can train the second classifier. Specifically, training may be performed on the training samples belonging to each new sample category together with the original sample categories to which those training samples belong, yielding a second classifier for each new sample category.
Continuing the above embodiment, if sample overlap exists between original sample category A and original sample category C, A and C are merged to generate the new sample category G. Classification training is performed on the merged training sample set, assigning each element of the training sample set to sample category B, D, or G, which yields the first classifier. After the first classifier is obtained, classification training is performed on the samples in the new sample category G, subdividing the elements of G into categories A and C, where A and C are the original sample categories to which the training samples of the new sample category G belong.
In this way, training the second classifier isolates the original sample categories that exhibit sample overlap and allows finer-grained classification training within a narrower scope, thereby improving the classification accuracy of the text classifier.
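The training data for the second classifiers can be assembled as in the sketch below. The function name `build_second_stage_sets` and the `merge_groups` dictionary are assumptions made for this illustration.

```python
from collections import defaultdict


def build_second_stage_sets(samples, original_labels, merge_groups):
    """Group the samples of each new category together with their original labels."""
    original_to_new = {
        original: new_name
        for new_name, originals in merge_groups.items()
        for original in originals
    }
    second_stage = defaultdict(list)
    for sample, label in zip(samples, original_labels):
        new_name = original_to_new.get(label)
        if new_name is not None:  # unmerged categories need no second classifier
            second_stage[new_name].append((sample, label))
    return dict(second_stage)


samples = ["s1", "s2", "s3", "s4", "s5"]
original_labels = ["A", "B", "C", "D", "A"]

# A and C were merged into G, so only their samples feed the second classifier for G.
second_stage_sets = build_second_stage_sets(samples, original_labels, {"G": {"A", "C"}})
```

Each entry of `second_stage_sets` is the training set of one second classifier, whose classification categories are the original categories inside that new category.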
Optionally, the second training unit 43 may specifically be configured to train using at least one of the following classification algorithms to obtain the second classifier: the naive Bayes (NB) classification algorithm, the support vector machine (SVM) classification algorithm, the K-nearest-neighbor (KNN) classification algorithm, and the random forest classification algorithm. The classification categories of the second classifier correspond to the original sample categories to which the training samples in the new sample category belong.
Further, the training sample set in the above embodiments may be obtained by processing a training corpus. To obtain the training sample set, the text classifier generation apparatus provided by the embodiments of the present invention may further include:
a preprocessing unit, configured to preprocess the training corpus, before the at least two original sample categories with sample overlap in the training sample set are merged to obtain a new sample category, so as to filter the training corpus and/or unify its format; and
a word segmentation unit, configured to segment the preprocessed training corpus according to a word segmentation dictionary to obtain the training sample set.
Specifically, the collected training corpus includes sentences and/or text fragments and may take various forms such as speech, text, and images. The corpus is first preprocessed to unify its format into text, invalid formats are filtered out, and the result is saved for later use. The preprocessed training corpus is then segmented according to the word segmentation dictionary to obtain the training sample set.
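Dictionary-based word segmentation can be sketched as follows. The description does not name a segmentation algorithm; forward maximum matching is used here purely as an assumption, and the function name, the `max_word_length` parameter, and the toy dictionary are hypothetical.

```python
def forward_max_match(text, dictionary, max_word_length=4):
    """Segment text greedily, always taking the longest dictionary word at the cursor."""
    tokens = []
    position = 0
    while position < len(text):
        # Try the longest candidate first and shrink until a dictionary word is found.
        for length in range(min(max_word_length, len(text) - position), 0, -1):
            candidate = text[position:position + length]
            # A single character is always accepted so the loop makes progress.
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                position += length
                break
    return tokens


segmentation_dictionary = {"文本", "分类", "分类器", "训练", "样本"}
tokens = forward_max_match("文本分类器训练样本", segmentation_dictionary)
```

Because the longest match wins, "分类器" is kept as one token rather than being split into "分类" and "器", which is why the coverage of the dictionary directly affects segmentation quality.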
Optionally, the apparatus may further include a new word discovery unit, configured to perform a new word discovery operation on the training corpus and add the discovered new words to the word segmentation dictionary. New words obtained by the new word discovery method are used to update the word segmentation dictionary, and subsequent word segmentation uses the updated dictionary. The dictionary is thus continually improved, which effectively raises the accuracy of word segmentation.
Optionally, the new word discovery operation may be implemented using one or more of the following: mutual information, co-occurrence probability, and information entropy.
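A minimal sketch of new word discovery based on mutual information is given below. It scores adjacent token pairs by pointwise mutual information, PMI = log(p(x, y) / (p(x) p(y))), and proposes high-scoring pairs that are missing from the dictionary. The function name, the thresholds `min_pmi` and `min_count`, and the toy corpus are assumptions for this illustration; the co-occurrence probability and information entropy criteria mentioned in the text are not shown.

```python
import math
from collections import Counter


def discover_new_words(tokenized_sentences, dictionary, min_pmi=1.0, min_count=2):
    """Propose adjacent token pairs with high pointwise mutual information as new words."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in tokenized_sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())

    new_words = set()
    for (left, right), count in bigrams.items():
        if count < min_count:
            continue  # too rare to be reliable
        p_pair = count / total_bigrams
        p_left = unigrams[left] / total_unigrams
        p_right = unigrams[right] / total_unigrams
        pmi = math.log(p_pair / (p_left * p_right))
        candidate = left + right
        if pmi >= min_pmi and candidate not in dictionary:
            new_words.add(candidate)
    return new_words


# Toy corpus segmented into single characters.
corpus = [list("区块链技术"), list("区块链应用"), list("学习区块链")]
segmentation_dictionary = {"技术", "应用", "学习"}

new_words = discover_new_words(corpus, segmentation_dictionary)
segmentation_dictionary |= new_words  # update the dictionary with the discovered words
```

This sketch only joins two adjacent tokens per pass; longer words would require repeating the procedure on the re-segmented corpus.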
To determine the classification performance of the generated first classifier and second classifier, the apparatus may further include:
a first test unit, configured to test the classification accuracy of each classification category of the first classifier;
a second test unit, configured to test the classification accuracy of each classification category of the second classifier;
wherein the classification accuracies of the first classifier are P1j, where j is an integer greater than or equal to 1 and less than or equal to m, and m is the number of sample categories in the training sample set after the merging operation;
the classification accuracies of the second classifier are P1h*P2k, where k is an integer greater than or equal to 1 and less than or equal to n, and n is the number of original sample categories to which the training samples in the new sample category belong; P1h is the classification accuracy of the new sample category in the first classifier, h is an integer greater than or equal to 1 and less than or equal to g, and g is the number of new sample categories;
a detection unit, configured to detect whether the classification accuracy of every classification category of the first classifier is greater than a first probability threshold and whether the classification accuracy of every classification category of the second classifier is greater than a second probability threshold; and
a determination unit, configured to determine, if the detection result of the detection unit is yes, that the first classifier and the second classifier have been trained successfully.
For example, suppose the first probability threshold is 0.98 and the second probability threshold is 0.95. If, during testing, the classification accuracy of every classification category in the results of the first classifier is greater than 0.98 and the classification accuracy of every classification category in the results of the second classifier is greater than 0.95, then the text classifier generated by the text classifier generation method provided by the embodiments of the present invention meets the user's accuracy requirement.
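The accuracy product P1h*P2k and the threshold check can be worked through in a short sketch. The thresholds 0.98 and 0.95 come from the example above; the function names and the individual accuracy values are hypothetical.

```python
def second_stage_accuracies(p1_new_category, p2_by_original_category):
    """Combine the first-stage accuracy of a new category (P1h) with the
    second-stage accuracy of each original category inside it (P2k)."""
    return {
        category: p1_new_category * p2
        for category, p2 in p2_by_original_category.items()
    }


def training_succeeded(first_accuracies, second_accuracies,
                       first_threshold=0.98, second_threshold=0.95):
    """True only if every category of both classifiers exceeds its threshold."""
    first_ok = all(p > first_threshold for p in first_accuracies.values())
    second_ok = all(p > second_threshold for p in second_accuracies.values())
    return first_ok and second_ok


# Hypothetical test results for the first classifier over categories B, D and G.
first_accuracies = {"B": 0.985, "D": 0.99, "G": 0.99}

# Inside G, the second classifier reaches 0.97 on A and 0.965 on C.
second_accuracies = second_stage_accuracies(first_accuracies["G"], {"A": 0.97, "C": 0.965})
accepted = training_succeeded(first_accuracies, second_accuracies)

# If the second-stage accuracy of C were only 0.95, the product would fall below 0.95.
weaker = second_stage_accuracies(first_accuracies["G"], {"A": 0.97, "C": 0.95})
rejected = training_succeeded(first_accuracies, weaker)
```

Because the two stages are multiplied, a second-stage accuracy that equals the second threshold is not sufficient on its own: the first-stage accuracy of the new category lowers the product further.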
Optionally, classification accuracy may be tested using data whose attributes are the same as or similar to those of the classification categories of the training sample set; all such data is labeled with the relevant classification category. This data may be constructed by an algorithm or obtained by cross-validation.
Optionally, the first test unit may specifically be configured to test the classification accuracy of the first classifier by cross-validation based on the sample population; the second test unit may specifically be configured to test the classification accuracy of the second classifier by cross-validation based on the sample population.
Here, the sample population refers to all sample data relevant to the classification task. In cross-validation based on the sample population, one part of the population serves as the training sample set and the other part as the test sample set: 60% to 90% (for example, 80%) of the samples in the population may be used as the training sample set, and the remaining samples are used as texts to be classified for testing. The sample population includes the plurality of original sample categories, and each sample belongs to one of the plurality of original sample categories.
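The split of the sample population described above can be sketched as follows. The function name `split_population`, the fixed `seed`, and the toy data are assumptions; the 80% training fraction is the example value from the text.

```python
import random


def split_population(samples, labels, train_fraction=0.8, seed=0):
    """Shuffle the sample population and split it into training and test parts."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    cut = int(len(indices) * train_fraction)
    train = [(samples[i], labels[i]) for i in indices[:cut]]
    test = [(samples[i], labels[i]) for i in indices[cut:]]
    return train, test


samples = [f"text-{i}" for i in range(10)]
labels = ["A", "B", "C", "D", "A", "B", "C", "D", "A", "B"]

train_set, test_set = split_population(samples, labels)
```

Repeating the split with different seeds and averaging the per-category accuracies is one possible way to reduce the dependence on a single partition; the description leaves this choice open.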
Optionally, the apparatus may further include a return unit, configured to, if the detection result of the detection unit is no, perform the merging operation and classifier training again on at least two original sample categories with sample overlap in the training sample set, until the classification accuracy of every classification category of the first classifier is greater than the first probability threshold and the classification accuracy of every classification category of the second classifier is greater than the second probability threshold.
That is, if, after the first classifier and the second classifier have been applied, the classification accuracy of any classification category is found to be below the corresponding probability threshold, the first classifier and the second classifier do not yet meet the user's requirement. The merging operation and classifier training must therefore be repeated on the original sample categories with sample overlap until the accuracies satisfy the above thresholds.
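The retraining loop of the return unit can be sketched as below. The description does not say how a different merge is chosen on each round; this sketch assumes a finite list of candidate merge configurations tried in order. All function names are hypothetical, and `fake_train_and_test` is a stand-in for real training and testing.

```python
def train_until_accepted(candidate_merges, train_and_test, is_accepted):
    """Try merge configurations in order until one passes the acceptance check.

    candidate_merges: iterable of merge configurations to try (assumed input).
    train_and_test:   callable returning (first_accuracies, second_accuracies).
    is_accepted:      callable applying the two probability thresholds.
    """
    for merge_groups in candidate_merges:
        first_accuracies, second_accuracies = train_and_test(merge_groups)
        if is_accepted(first_accuracies, second_accuracies):
            return merge_groups
    return None  # no configuration met the thresholds


def fake_train_and_test(merge_groups):
    # Stand-in for real training: pretend merging A with C works and A with B does not.
    if merge_groups == {"G": {"A", "C"}}:
        return {"B": 0.99, "D": 0.99, "G": 0.99}, {"A": 0.96, "C": 0.97}
    return {"C": 0.90, "D": 0.99, "G": 0.99}, {"A": 0.96, "B": 0.97}


def meets_thresholds(first_accuracies, second_accuracies):
    return (all(p > 0.98 for p in first_accuracies.values())
            and all(p > 0.95 for p in second_accuracies.values()))


chosen = train_until_accepted(
    [{"G": {"A", "B"}}, {"G": {"A", "C"}}],
    fake_train_and_test,
    meets_thresholds,
)
```

Bounding the loop by a finite candidate list is a design choice of this sketch; it avoids looping forever when no merge configuration can reach the thresholds.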
To accommodate different requirements on classifier accuracy, in one embodiment of the present invention the apparatus may optionally further include an adjustment unit, configured to adjust the first probability threshold and the second probability threshold so as to select first classifiers and second classifiers with different classification accuracies.
Correspondingly, as shown in FIG. 5, an embodiment of the present invention further provides a text classification apparatus that performs classification using the classifiers generated by any of the text classifier generation apparatuses provided in the foregoing embodiments. The classification apparatus includes:
a first input unit 51, configured to input a set of texts to be classified into the first classifier to obtain a first classification result; and
a second input unit 52, configured to input the texts to be classified whose classification category in the first classification result is the new sample category into the second classifier corresponding to that new sample category to obtain a second classification result.
The text classification apparatus provided by the embodiments of the present invention applies the text classifier generated by any of the text classifier generation apparatuses provided in the foregoing embodiments. Training the first classifier makes it possible to accurately distinguish the original sample categories that exhibit sample overlap from those that do not. Training the second classifier isolates the original sample categories with sample overlap and allows finer-grained classification training within a narrower scope, thereby improving the accuracy of text classification.
Further, the text classification apparatus also includes:
a result output unit, configured to take the classification results in the first classification result whose classification category is an unmerged original sample category, together with the second classification result, as the final classification result of the texts to be classified.
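The two-stage classification flow of the first input unit, second input unit, and result output unit can be sketched as follows. The classifiers are represented as plain callables, and the keyword rules are hypothetical stand-ins for trained models.

```python
def classify_two_stage(texts, first_classifier, second_classifiers):
    """Route each text through the first classifier and, if needed, a second one.

    second_classifiers maps each new (merged) category to its second classifier.
    """
    final_labels = []
    for text in texts:
        label = first_classifier(text)
        refine = second_classifiers.get(label)
        if refine is not None:
            # The first result is a merged category: refine it into an original category.
            label = refine(text)
        final_labels.append(label)
    return final_labels


# Hypothetical keyword rules standing in for trained classifiers.
def first_classifier(text):
    if "refund" in text or "invoice" in text:
        return "G"
    return "B" if "login" in text else "D"


def second_classifier_g(text):
    return "A" if "refund" in text else "C"


results = classify_two_stage(
    ["refund please", "invoice missing", "login failed", "late delivery"],
    first_classifier,
    {"G": second_classifier_g},
)
```

Texts assigned to an unmerged category keep the label from the first classifier, while texts assigned to G receive their final label from the second classifier, so the output contains only original sample categories.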
Correspondingly, an embodiment of the present invention further provides a computer device including a processor and a memory. The memory is configured to store computer instructions, and the processor is configured to run the computer instructions stored in the memory so as to execute any of the text classifier generation methods provided in the foregoing embodiments. The device therefore achieves the corresponding technical effects, which have been described in detail above and are not repeated here.
Correspondingly, an embodiment of the present invention further provides a computer device including a processor and a memory. The memory is configured to store computer instructions, and the processor is configured to run the computer instructions stored in the memory so as to execute any of the text classification methods provided in the foregoing embodiments. The device therefore achieves the corresponding technical effects, which have been described in detail above and are not repeated here.
Correspondingly, an embodiment of the present invention further provides a computer-readable storage medium storing instructions that, when run, execute any of the text classifier generation methods provided in the foregoing embodiments. The storage medium therefore achieves the corresponding technical effects, which have been described in detail above and are not repeated here.
Correspondingly, an embodiment of the present invention further provides a computer-readable storage medium storing instructions that, when run, execute any of the text classification methods provided in the foregoing embodiments. The storage medium therefore achieves the corresponding technical effects, which have been described in detail above and are not repeated here.
It should be noted that, in this document, the terms "include", "comprise", and any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes the element.
From the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software together with a necessary general-purpose hardware platform, or by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and do not thereby limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.
Claims (38)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710457280.2A CN107273500A (en) | 2017-06-16 | 2017-06-16 | Text classifier generation method, file classification method, device and computer equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107273500A true CN107273500A (en) | 2017-10-20 |
Family
ID=60066353
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710457280.2A Pending CN107273500A (en) | 2017-06-16 | 2017-06-16 | Text classifier generation method, file classification method, device and computer equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107273500A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101876987A (en) * | 2009-12-04 | 2010-11-03 | 中国人民解放军信息工程大学 | A Two-Class Text Classification Method Oriented to Class Overlap |
US20130103695A1 (en) * | 2011-10-21 | 2013-04-25 | Microsoft Corporation | Machine translation detection in web-scraped parallel corpora |
CN106503254A (en) * | 2016-11-11 | 2017-03-15 | 上海智臻智能网络科技股份有限公司 | Language material sorting technique, device and terminal |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107844553A (en) * | 2017-10-31 | 2018-03-27 | 山东浪潮通软信息科技有限公司 | A kind of file classification method and device |
CN108038208A (en) * | 2017-12-18 | 2018-05-15 | 深圳前海微众银行股份有限公司 | Training method, device and the storage medium of contextual information identification model |
CN108038208B (en) * | 2017-12-18 | 2022-01-11 | 深圳前海微众银行股份有限公司 | Training method and device of context information recognition model and storage medium |
CN109961063A (en) * | 2017-12-26 | 2019-07-02 | 杭州海康机器人技术有限公司 | Method for text detection and device, computer equipment and storage medium |
CN108229564A (en) * | 2018-01-05 | 2018-06-29 | 阿里巴巴集团控股有限公司 | A kind of processing method of data, device and equipment |
CN108710651A (en) * | 2018-05-08 | 2018-10-26 | 华南理工大学 | A kind of large scale customer complaint data automatic classification method |
CN108710651B (en) * | 2018-05-08 | 2022-03-25 | 华南理工大学 | Automatic classification method for large-scale customer complaint data |
CN108920694B (en) * | 2018-07-13 | 2020-08-28 | 鼎富智能科技有限公司 | Short text multi-label classification method and device |
CN108920694A (en) * | 2018-07-13 | 2018-11-30 | 北京神州泰岳软件股份有限公司 | A kind of short text multi-tag classification method and device |
WO2020034126A1 (en) * | 2018-08-15 | 2020-02-20 | 深圳先进技术研究院 | Sample training method, classification method, identification method, device, medium, and system |
CN109359186B (en) * | 2018-10-25 | 2020-12-08 | 杭州时趣信息技术有限公司 | Method and device for determining address information and computer readable storage medium |
CN109359186A (en) * | 2018-10-25 | 2019-02-19 | 杭州时趣信息技术有限公司 | A kind of method, apparatus and computer readable storage medium of determining address information |
CN110489545A (en) * | 2019-07-09 | 2019-11-22 | 平安科技(深圳)有限公司 | File classification method and device, storage medium, computer equipment |
CN112396084A (en) * | 2019-08-19 | 2021-02-23 | 中国移动通信有限公司研究院 | Data processing method, device, equipment and storage medium |
CN111339250A (en) * | 2020-02-20 | 2020-06-26 | 北京百度网讯科技有限公司 | Mining method, electronic device and computer readable medium of new category label |
CN111339250B (en) * | 2020-02-20 | 2023-08-18 | 北京百度网讯科技有限公司 | Method for mining new category labels, electronic device, and computer-readable medium |
US11755654B2 (en) | 2020-02-20 | 2023-09-12 | Beijing Baidu Netcom Science Technology Co., Ltd. | Category tag mining method, electronic device and non-transitory computer-readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20171020 |