CN105512311B - An Adaptive Feature Selection Method Based on Chi-square Statistics
- Publication number
- CN105512311B (application CN201510927759.9A / CN201510927759A)
- Authority
- CN
- China
- Prior art keywords
- feature
- class
- text
- chi
- word
- Prior art date
- Legal status
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2216/00—Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
- G06F2216/03—Data mining
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
An adaptive feature selection method based on chi-square statistics, relating to the field of computer text data processing. First, the training text set and the test text set are preprocessed, including word segmentation and stop-word removal. Then adaptive text feature selection based on chi-square statistics is performed: a word frequency factor and an inter-class variance are defined and introduced into the CHI algorithm, and a suitable scale factor is added to the CHI algorithm. Finally, combined with the evaluation metrics of the classic KNN algorithm, the scale factor is adjusted automatically, so that the improved CHI adapts to different corpora and maintains high classification accuracy. Experimental results show that, compared with the traditional CHI method, the classification accuracy of the invention is improved on both balanced and unbalanced corpora.
Description
Technical Field
The invention relates to the field of computer text data processing, and in particular to an adaptive text feature selection method based on chi-square statistics (χ², CHI).
Background
In today's era of big data, mining the latent value of data is of great importance, and data mining, as a technology for discovering that value, has attracted wide attention. Text data accounts for a large share of big data, and text classification, as a data mining method for organizing and managing text data effectively, has gradually become a research focus. It is widely used in information filtering, information organization and management, information retrieval, digital libraries, and spam filtering. Text classification (TC) is the process of automatically assigning a text of unknown category to one or more categories according to its content under a predefined category system. Commonly used text classification methods include K-Nearest Neighbor (KNN), Naive Bayes (NB), and Support Vector Machine (SVM).
The current text classification pipeline comprises preprocessing, feature dimensionality reduction, text representation, classifier learning, and evaluation. The most common text representation is the vector space model; the high dimensionality and sparsity of the vector space increase time and space complexity and strongly affect classification accuracy, so feature dimensionality reduction is crucial and directly determines the efficiency and accuracy of classification. Feature dimensionality reduction includes two main approaches, feature extraction and feature selection. Linguistics-based feature extraction requires natural language processing techniques and has high computational complexity, whereas feature selection based on statistical theory has lower complexity and requires little background knowledge, so feature selection is more widely used. The basic idea of feature selection is to construct an evaluation function, score each feature item in the feature set, rank all feature items by their scores, and select a specific number of features as the final text feature set. Commonly used feature selection methods include chi-square statistics (CHI), Document Frequency (DF), Information Gain (IG), Mutual Information (MI), Expected Cross Entropy (ECE), and Weight of Evidence (WE).
As one of the commonly used text feature selection methods, CHI is simple to implement and has low time complexity, but it also has shortcomings that leave its classification performance unsatisfactory. These shortcomings fall into two aspects: first, CHI considers only the document frequency of a feature item and ignores its term frequency, so the weights of low-frequency words are inflated; second, it inflates the weights of feature items that appear rarely in one category but frequently in other categories. Many researchers have improved the CHI algorithm, and the improvements can be summarized in two directions: first, introducing adjustment parameters to reduce the reliance on low-frequency words, which however ignores the positive and negative correlation between feature items and categories; second, introducing a scale factor α, classifying feature items by their positive or negative correlation and assigning them different weights to improve the feature selection ability of the CHI model, where however the scale factor must be chosen empirically. Given the shortcomings of the existing improved CHI algorithms, designing a CHI text feature selection method with high classification accuracy has significant academic and practical value.
Summary of the Invention
The purpose of the present invention is to provide an improved CHI text feature selection method and thereby raise the accuracy of text classification. On the one hand, a word frequency factor and an inter-class variance are introduced to reduce CHI's reliance on low-frequency words and to select feature items that occur frequently in a specific class and are evenly distributed within it; on the other hand, an adaptive scale factor μ is introduced so that feature items are classified by their positive or negative correlation and assigned different weights, reducing the error caused by choosing the scale factor manually.
The features of the present invention are as follows:
Step 1: download the Chinese corpus released by Fudan University from the Internet, consisting of a training text set and a test text set.
Step 2: use ICTCLAS, the open-source word segmentation software of the Chinese Academy of Sciences, to preprocess the training and test text sets, including word segmentation and stop-word removal, obtaining the segmented training and test text sets.
Step 3: apply the CHI-based adaptive text feature selection method to the segmented training text set to obtain the feature vocabulary corresponding to the training set.
The traditional CHI text feature selection method is computed as
χ²(t_k, C_i) = N(AD − BC)² / [(A + B)(C + D)(A + C)(B + D)]
where A is the number of documents that contain feature t_k and belong to category C_i, B is the number of documents that contain t_k but do not belong to C_i, C is the number of documents that do not contain t_k but belong to C_i, D is the number of documents that neither contain t_k nor belong to C_i, and N = A + B + C + D is the total number of documents.
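For illustration, a minimal Python sketch of this traditional CHI score computed from the four document counts (the function name and the example counts are assumptions made for illustration, not part of the patent):

```python
def chi_square(A, B, C, D):
    """Traditional CHI score of feature t_k for category C_i.

    A: documents containing t_k and in C_i;  B: containing t_k, not in C_i
    C: not containing t_k but in C_i;        D: neither containing t_k nor in C_i
    """
    N = A + B + C + D
    denom = (A + B) * (C + D) * (A + C) * (B + D)
    if denom == 0:
        return 0.0
    return N * (A * D - B * C) ** 2 / denom

# A feature present in 40 of the 50 documents of C_i but in only 10 of the
# 150 documents outside C_i receives a high score.
print(chi_square(A=40, B=10, C=10, D=140))
```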
The proposed CHI-based adaptive text feature selection method is formulated as follows:
χ²_new(t_k, C_i) = [μ·χ²(t_k, C_i)⁺ + (1 − μ)·χ²(t_k, C_i)⁻] · α · β
where μ is the scale factor, χ²(t_k, C_i)⁺ and χ²(t_k, C_i)⁻ denote the CHI values of positively and negatively correlated feature items respectively, α is the word frequency factor, and β is the inter-class variance; α and β are defined as follows:
Here m is the total number of categories in the training set, tf(t_k, C_i) is the number of times feature item t_k occurs in category C_i, and Σ_{j=1}^{m} tf(t_k, C_j) is the number of times t_k occurs in the entire training text set. Category C_i of the training set contains p documents d_i1, d_i2, ..., d_ij, ..., d_ip; tf(t_k, d_ij) is the number of times t_k occurs in the j-th document of C_i, so tf(t_k, C_i) = Σ_{j=1}^{p} tf(t_k, d_ij) is the total number of occurrences of t_k in category C_i, and summing over all categories gives the total number of occurrences of t_k in all documents of the training text set. The word frequency factor α = tf(t_k, C_i) / Σ_{j=1}^{m} tf(t_k, C_j) is the proportion of the term frequency of t_k within the specific class C_i to its term frequency over the whole training set. The larger α is, the more frequently the feature item occurs in the specific class and the less it occurs (or hardly occurs) in other classes, so such a feature item has stronger class discrimination ability; the smaller α is, the less frequently it occurs in the specific class and the more it occurs in other classes, so such a feature item has weaker class discrimination ability.
For the inter-class variance β: m denotes the number of categories, df_i is the number of documents in category C_i that contain t_k, and df_avg = (1/m) Σ_{j=1}^{m} df_j is the average number of documents containing t_k per category; one case covers feature words whose document count in the specific class is greater than or equal to this average, the other covers feature words whose document count is below it. The value of β measures how far the document frequency of the feature word in a given class deviates from the average document frequency over all classes: the larger β is, the more the number of documents containing t_k in class C_i exceeds the average over all classes, and such a feature item has stronger class discrimination ability.
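The following Python sketch illustrates how α, β, and the combined score χ²_new could be computed. The helper names are hypothetical, and β is implemented here as the relative deviation of df_i from the per-class average document frequency — an assumption made for illustration, since the exact formula for β is only described verbally above:

```python
def word_freq_factor(tf_per_class, i):
    """alpha: share of t_k's term frequency that falls into class C_i."""
    total = sum(tf_per_class)
    return tf_per_class[i] / total if total else 0.0

def inter_class_variance(df_per_class, i):
    """beta (assumed form): relative deviation of df_i from the average
    document frequency of t_k over all m classes."""
    m = len(df_per_class)
    avg = sum(df_per_class) / m
    return (df_per_class[i] - avg) / avg if avg else 0.0

def chi_new(chi_pos, chi_neg, mu, alpha, beta):
    """chi_new = [mu * chi_plus + (1 - mu) * chi_minus] * alpha * beta."""
    return (mu * chi_pos + (1.0 - mu) * chi_neg) * alpha * beta

# Term frequencies and document frequencies of t_k in three classes;
# class 0 is the class of interest.
tf_per_class = [120, 5, 3]
df_per_class = [40, 4, 2]
score = chi_new(chi_pos=35.2, chi_neg=1.4, mu=0.5,
                alpha=word_freq_factor(tf_per_class, 0),
                beta=inter_class_variance(df_per_class, 0))
print(score)
```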
Step 4: represent each training text and each test text as a vector over the feature words of the feature vocabulary. The weight of each dimension is computed as TFIDF = TF × IDF, where TF (term frequency) is the number of times the feature item occurs in the document and IDF (inverse document frequency) is IDF = log(M/n_k + 0.01), with M the number of texts in the document collection and n_k the number of documents containing the word.
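A short Python sketch of this TF–IDF weighting (the document representation, function name, and sample counts are assumptions for illustration):

```python
import math

def tfidf_vector(doc_tokens, feature_words, doc_freq, M):
    """Represent one segmented document as a TF*IDF vector over the feature words.

    doc_tokens:    tokens of the document after segmentation and stop-word removal
    feature_words: ordered list of selected feature words
    doc_freq:      mapping word -> number of documents containing the word (n_k)
    M:             total number of texts in the collection
    """
    vector = []
    for word in feature_words:
        tf = doc_tokens.count(word)                        # term frequency in this document
        idf = math.log(M / doc_freq.get(word, 1) + 0.01)   # IDF = log(M / n_k + 0.01)
        vector.append(tf * idf)
    return vector

print(tfidf_vector(["sports", "football", "football"],
                   ["football", "economy"],
                   {"football": 120, "economy": 300}, M=1000))
```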
Step 5: perform KNN classification.
Let the training text set be S and the test text be d; n is the feature vector dimension threshold and K is set to 35.
Step 5.1: compute the similarity between the test text d and every text in S as the cosine of the angle between their vectors.
Step 5.2: select the K texts with the highest similarity from step 5.1 as the K nearest neighbor texts of the test text d.
Step 5.3: compute the weight of the test text d for each category and assign d to the category with the largest weight.
Let the known category of training text d_i be C_j; the weight of test text d for category C_j is computed as
W(d, C_j) = Σ_{d_i ∈ KNN(d)} Sim(d, d_i) · y(d_i, C_j)
where Sim(d, d_i) is the cosine similarity between the test text d and the known-category text d_i:
Sim(d, d_i) = (Σ_{j=1}^{n} X_j · x_ij) / (sqrt(Σ_{j=1}^{n} X_j²) · sqrt(Σ_{j=1}^{n} x_ij²))
where n is the feature vector dimension threshold, X_j is the weight of the j-th dimension of the text d to be classified (0 < j ≤ n), and x_ij is the weight of the j-th dimension of the training text vector d_i.
y(d_i, C_j) is the category membership function: y(d_i, C_j) = 1 if d_i belongs to category C_j, and y(d_i, C_j) = 0 otherwise.
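A simplified Python sketch of this KNN step, with cosine similarity and weighted category voting (names and data layout are assumptions for illustration):

```python
import math

def cosine(x, y):
    """Cosine similarity Sim(d, d_i) between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def knn_classify(d, train_vectors, train_labels, K=35):
    """Assign test vector d to the category with the largest summed similarity
    over its K nearest neighbours (steps 5.1-5.3)."""
    neighbours = sorted(((cosine(d, x), c) for x, c in zip(train_vectors, train_labels)),
                        key=lambda pair: pair[0], reverse=True)[:K]
    weights = {}
    for sim, label in neighbours:        # W(d, C_j) = sum of Sim(d, d_i) * y(d_i, C_j)
        weights[label] = weights.get(label, 0.0) + sim
    return max(weights, key=weights.get)

print(knn_classify([0.2, 0.7, 0.0],
                   [[0.1, 0.8, 0.0], [0.9, 0.0, 0.1], [0.0, 0.6, 0.2]],
                   ["sports", "economy", "sports"], K=2))
```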
Step 6: compute the precision, recall, and F1 value of the KNN classification algorithm, and obtain the final value of the scale factor μ by setting a maximum threshold on the difference between the F1 values of two successive classification runs and a step size by which μ is increased, so as to ensure high classification accuracy.
Step 6.1: set the initial F1 value to 0 and the initial μ to 0.5; ε = 0.0001 is the threshold on the difference between two successive F1 values, and τ = 0.05 is the step size by which the scale factor μ is increased.
Step 6.2: repeat step 5 to obtain F1′, and compute the difference ΔF = |F1′ − F1| between the two successive F1 values.
Step 6.3: if ΔF is less than ε, take the current scale factor μ; if ΔF is greater than or equal to ε, set μ′ = μ + τ and F1 = F1′, and repeat steps 6.2 and 6.3 until a suitable scale factor μ is obtained.
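A Python sketch of this adaptive adjustment of μ; the evaluation callback is a hypothetical stand-in for rerunning feature selection (step 3) through KNN classification (step 5) and returning the resulting F1 value:

```python
def tune_scale_factor(evaluate_f1, mu=0.5, eps=1e-4, tau=0.05, mu_max=1.0):
    """Increase mu in steps of tau until the F1 values of two successive
    classification runs differ by less than eps (steps 6.1-6.3)."""
    f1_prev = 0.0
    while mu <= mu_max:
        f1_cur = evaluate_f1(mu)          # rerun steps 3-5 with this mu
        if abs(f1_cur - f1_prev) < eps:   # delta F below the threshold: accept mu
            return mu, f1_cur
        f1_prev = f1_cur
        mu += tau
    return mu_max, f1_prev

# Toy stand-in that returns a constant F1; a real pipeline would retrain
# and re-evaluate the KNN classifier for each candidate mu.
print(tune_scale_factor(lambda mu: 0.88))
```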
Compared with the prior art, the present invention has the following beneficial effects.
The present invention proposes an adaptive feature selection method based on chi-square statistics; the KNN algorithm is used as the classifier for the test texts. The flow chart of the whole process is shown in Figure 1 and the flow chart for computing the scale factor μ in Figure 2; the accuracy metrics on the balanced corpus are listed in Table 1 and those on the unbalanced corpus in Table 2. Compared with the traditional CHI method, on the one hand the word frequency factor and the inter-class variance are introduced to reduce CHI's reliance on low-frequency words and to select feature items that occur frequently in a specific class and are evenly distributed within it; on the other hand the adaptive scale factor μ is introduced so that feature items are classified by their positive or negative correlation and assigned different weights, and the method adapts to corpora with different distributions, thereby reducing the error caused by choosing the scale factor manually. Tables 1 and 2 show that, compared with the traditional CHI method, the present invention improves classification accuracy on both the balanced and the unbalanced corpus.
Brief Description of the Drawings
Figure 1 is a flow chart of the overall process of the present invention.
Figure 2 is a flow chart of computing the scale factor μ in the present invention.
Detailed Description
The present invention is implemented by the following technical means:
An adaptive text feature selection method based on chi-square statistics. First, the training text set and the test text set are preprocessed, including word segmentation and stop-word removal. Second, adaptive text feature selection based on chi-square statistics is performed: the word frequency factor α and the inter-class variance β are defined and introduced into the CHI algorithm, and a suitable scale factor μ is added to the CHI algorithm. Finally, combined with the classic KNN algorithm, the scale factor μ is adjusted automatically so that the improved CHI adapts to different corpora and maintains high classification accuracy.
The above adaptive text feature selection method based on chi-square statistics is used for text classification and comprises the following steps:
Step 1: download the Chinese corpus released by Fudan University from the Internet, consisting of a training text set and a test text set.
Step 2: use the word segmentation software ICTCLAS to preprocess the training and test text sets, including word segmentation and stop-word removal, obtaining the segmented training and test text sets.
Step 3: apply the CHI-based adaptive text feature selection method to the segmented training text set to obtain the feature vocabulary corresponding to the training set.
The traditional CHI text feature selection method is computed as
χ²(t_k, C_i) = N(AD − BC)² / [(A + B)(C + D)(A + C)(B + D)]    (1)
where A is the number of documents that contain feature t_k and belong to category C_i, B is the number of documents that contain t_k but do not belong to C_i, C is the number of documents that do not contain t_k but belong to C_i, D is the number of documents that neither contain t_k nor belong to C_i, and N = A + B + C + D is the total number of documents.
The proposed CHI-based adaptive text feature selection method is formulated as follows:
χ²_new(t_k, C_i) = [μ·χ²(t_k, C_i)⁺ + (1 − μ)·χ²(t_k, C_i)⁻] · α · β    (2)
where μ is the adaptive scale factor, χ²(t_k, C_i)⁺ and χ²(t_k, C_i)⁻ denote the CHI values of positively and negatively correlated feature items respectively, α is the word frequency factor, and β is the inter-class variance; α and β are defined as follows:
Here m is the total number of categories in the training set, tf(t_k, C_i) is the number of times feature item t_k occurs in category C_i, and Σ_{j=1}^{m} tf(t_k, C_j) is the number of times t_k occurs in the entire training text set. Category C_i of the training set contains p documents d_i1, d_i2, ..., d_ij, ..., d_ip; tf(t_k, d_ij) is the number of times t_k occurs in the j-th document of C_i, so tf(t_k, C_i) = Σ_{j=1}^{p} tf(t_k, d_ij) is the total number of occurrences of t_k in category C_i, and summing over all categories gives the total number of occurrences of t_k in all documents of the training text set. The word frequency factor α = tf(t_k, C_i) / Σ_{j=1}^{m} tf(t_k, C_j) is the proportion of the term frequency of t_k within the specific class C_i to its term frequency over the whole training set. The larger α is, the more frequently the feature item occurs in the specific class and the less it occurs (or hardly occurs) in other classes, so such a feature item has stronger class discrimination ability; the smaller α is, the less frequently it occurs in the specific class and the more it occurs in other classes, so such a feature item has weaker class discrimination ability.
For the inter-class variance β: m denotes the number of categories, df_i is the number of documents in category C_i that contain t_k, and df_avg = (1/m) Σ_{j=1}^{m} df_j is the average number of documents containing t_k per category; one case covers feature words whose document count in the specific class is greater than or equal to this average, the other covers feature words whose document count is below it. The value of β measures how far the document frequency of the feature word in a given class deviates from the average document frequency over all classes: the larger β is, the more the number of documents containing t_k in class C_i exceeds the average over all classes, and such a feature item has stronger class discrimination ability.
Step 4: represent each training text and each test text as a vector over the feature words of the feature vocabulary. The weight of each dimension is computed as TFIDF = TF × IDF, where TF (term frequency) is the number of times the feature item occurs in the document and IDF (inverse document frequency) is IDF = log(M/n_k + 0.01), with M the number of texts in the document collection and n_k the number of documents containing the word.
Step 5: perform KNN classification.
Let the training text set be S and the test text be d; n is the feature vector dimension threshold and K is set to 35.
Compute the similarity between the test text d and every text in S as the cosine of the angle between their vectors; select the K texts with the highest computed similarity as the K nearest neighbors of d; compute the weight of d for each category and assign the test text d to the category with the largest weight.
Let the known category of training text d_i be C_j; the weight of test text d for category C_j is computed as
W(d, C_j) = Σ_{d_i ∈ KNN(d)} Sim(d, d_i) · y(d_i, C_j)
where Sim(d, d_i) is the cosine similarity between the test text d and the known-category text d_i:
Sim(d, d_i) = (Σ_{j=1}^{n} X_j · x_ij) / (sqrt(Σ_{j=1}^{n} X_j²) · sqrt(Σ_{j=1}^{n} x_ij²))
where n is the feature vector dimension threshold, X_j is the weight of the j-th dimension of the text d to be classified (0 < j ≤ n), and x_ij is the weight of the j-th dimension of the training text vector d_i.
y(d_i, C_j) is the category membership function: y(d_i, C_j) = 1 if d_i belongs to category C_j, and y(d_i, C_j) = 0 otherwise.
Step 6: compute the precision, recall, and F1 value of the KNN classification algorithm, and obtain the final value of the scale factor by setting a maximum threshold on the difference between the F1 values of two successive classification runs and a step size by which the scale factor μ is increased.
Set the initial F1 value to 0 and the initial μ to 0.5; ε = 0.0001 is the threshold on the difference between two successive F1 values, and τ = 0.05 is the step size by which the scale factor μ is increased.
Repeat step 5 to obtain F1′ and compute the difference ΔF = |F1′ − F1| between the two successive F1 values; if ΔF is less than ε, take the current scale factor μ; if ΔF is greater than or equal to ε, set μ′ = μ + τ and F1 = F1′, and repeat this iteration until a suitable scale factor μ is obtained, ensuring high classification accuracy.
Table 1. Comparison of results before and after the algorithm improvement (balanced corpus) (%)
Table 2. Comparison of results before and after the algorithm improvement (unbalanced corpus) (%)
Claims (1)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510927759.9A CN105512311B (en) | 2015-12-14 | 2015-12-14 | An Adaptive Feature Selection Method Based on Chi-square Statistics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510927759.9A CN105512311B (en) | 2015-12-14 | 2015-12-14 | An Adaptive Feature Selection Method Based on Chi-square Statistics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105512311A CN105512311A (en) | 2016-04-20 |
CN105512311B true CN105512311B (en) | 2019-02-26 |
Family
ID=55720291
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510927759.9A Expired - Fee Related CN105512311B (en) | 2015-12-14 | 2015-12-14 | An Adaptive Feature Selection Method Based on Chi-square Statistics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105512311B (en) |
Families Citing this family (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106021461A (en) * | 2016-05-17 | 2016-10-12 | 深圳市中润四方信息技术有限公司 | Text classification method and text classification system |
CN108073567B (en) * | 2016-11-16 | 2021-12-28 | 北京嘀嘀无限科技发展有限公司 | Feature word extraction processing method, system and server |
CN108090088A (en) * | 2016-11-23 | 2018-05-29 | 北京国双科技有限公司 | Feature extracting method and device |
CN106611057B (en) * | 2016-12-27 | 2019-08-13 | 上海利连信息科技有限公司 | The text classification feature selection approach of importance weighting |
CN107291837B (en) * | 2017-05-31 | 2020-04-03 | 北京大学 | A word segmentation method for web text based on domain adaptability |
CN107256214B (en) * | 2017-06-30 | 2020-09-25 | 联想(北京)有限公司 | Junk information judgment method and device and server cluster |
CN107577794B (en) * | 2017-09-19 | 2019-07-05 | 北京神州泰岳软件股份有限公司 | A kind of news category method and device |
CN108197307A (en) * | 2018-01-31 | 2018-06-22 | 湖北工业大学 | The selection method and system of a kind of text feature |
CN108491429A (en) * | 2018-02-09 | 2018-09-04 | 湖北工业大学 | A kind of feature selection approach based on document frequency and word frequency statistics between class in class |
CN108376130A (en) * | 2018-03-09 | 2018-08-07 | 长安大学 | A kind of objectionable text information filtering feature selection approach |
CN108346474B (en) * | 2018-03-14 | 2021-09-28 | 湖南省蓝蜻蜓网络科技有限公司 | Electronic medical record feature selection method based on word intra-class distribution and inter-class distribution |
CN108920545B (en) * | 2018-06-13 | 2021-07-09 | 四川大学 | Chinese sentiment feature selection method based on extended sentiment dictionary and chi-square model |
CN109325511B (en) * | 2018-08-01 | 2020-07-31 | 昆明理工大学 | A method to improve feature selection |
CN109543037A (en) * | 2018-11-21 | 2019-03-29 | 南京安讯科技有限责任公司 | A kind of article classification method based on improved TF-IDF |
CN112256865B (en) * | 2019-01-31 | 2023-03-21 | 青岛科技大学 | Chinese text classification method based on classifier |
CN110069630B (en) * | 2019-03-20 | 2023-07-21 | 重庆信科设计有限公司 | Improved mutual information feature selection method |
CN110705247B (en) * | 2019-08-30 | 2020-08-04 | 山东科技大学 | Text similarity calculation method based on χ2-C |
CN110688481A (en) * | 2019-09-02 | 2020-01-14 | 贵州航天计量测试技术研究所 | Text classification feature selection method based on chi-square statistic and IDF |
CN111144106B (en) * | 2019-12-20 | 2023-05-02 | 山东科技大学 | Two-stage text feature selection method under unbalanced data set |
CN111062212B (en) * | 2020-03-18 | 2020-06-30 | 北京热云科技有限公司 | Feature extraction method and system based on optimized TFIDF |
CN112200259A (en) * | 2020-10-19 | 2021-01-08 | 哈尔滨理工大学 | A kind of information gain text feature selection method and classification device based on classification and screening |
CN113032564B (en) * | 2021-03-22 | 2023-05-30 | 建信金融科技有限责任公司 | Feature extraction method, device, electronic equipment and storage medium |
CN113515623B (en) * | 2021-04-28 | 2022-12-06 | 西安理工大学 | Feature selection method based on word frequency difference factor |
CN113378567B (en) * | 2021-07-05 | 2022-05-10 | 广东工业大学 | An improved Chinese short text classification method for low-frequency words |
CN113987536A (en) * | 2021-10-27 | 2022-01-28 | 建信金融科技有限责任公司 | Method, device, electronic device and medium for determining security level of field in data sheet |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103678274A (en) * | 2013-04-15 | 2014-03-26 | 南京邮电大学 | Feature extraction method for text categorization based on improved mutual information and entropy |
CN103886108A (en) * | 2014-04-13 | 2014-06-25 | 北京工业大学 | Feature selection and weight calculation method of imbalance text set |
CN104750844A (en) * | 2015-04-09 | 2015-07-01 | 中南大学 | Method and device for generating text characteristic vectors based on TF-IGM, method and device for classifying texts |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9311390B2 (en) * | 2008-01-29 | 2016-04-12 | Educational Testing Service | System and method for handling the confounding effect of document length on vector-based similarity scores |
-
2015
- 2015-12-14 CN CN201510927759.9A patent/CN105512311B/en not_active Expired - Fee Related
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103678274A (en) * | 2013-04-15 | 2014-03-26 | 南京邮电大学 | Feature extraction method for text categorization based on improved mutual information and entropy |
CN103886108A (en) * | 2014-04-13 | 2014-06-25 | 北京工业大学 | Feature selection and weight calculation method of imbalance text set |
CN104750844A (en) * | 2015-04-09 | 2015-07-01 | 中南大学 | Method and device for generating text characteristic vectors based on TF-IGM, method and device for classifying texts |
Non-Patent Citations (1)
Title |
---|
An optimized mutual information text feature selection method based on word frequency; Liu Haifeng (刘海峰); Computer Engineering (《计算机工程》); 2014-12-31; pp. 179-182 |
Also Published As
Publication number | Publication date |
---|---|
CN105512311A (en) | 2016-04-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105512311B (en) | An Adaptive Feature Selection Method Based on Chi-square Statistics | |
CN107609121B (en) | News text classification method based on LDA and word2vec algorithm | |
CN105224695B (en) | A kind of text feature quantization method and device and file classification method and device based on comentropy | |
CN102622373B (en) | Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm | |
CN104750844B (en) | Text eigenvector based on TF-IGM generates method and apparatus and file classification method and device | |
Fan et al. | Research on text classification based on improved tf-idf algorithm | |
CN104391835B (en) | Feature Words system of selection and device in text | |
CN110543564B (en) | Domain Label Acquisition Method Based on Topic Model | |
CN107066555B (en) | On-line theme detection method for professional field | |
CN108763348B (en) | Classification improvement method for feature vectors of extended short text words | |
CN108197144B (en) | Hot topic discovery method based on BTM and Single-pass | |
CN103294817A (en) | Text feature extraction method based on categorical distribution probability | |
CN109960799A (en) | An optimized classification method for short texts | |
CN104346459A (en) | Text classification feature selecting method based on term frequency and chi-square statistics | |
CN107908624A (en) | A kind of K medoids Text Clustering Methods based on all standing Granule Computing | |
CN112579783B (en) | Short text clustering method based on Laplace atlas | |
CN109376235B (en) | Feature selection method based on document layer word frequency reordering | |
CN106570076A (en) | Computer text classification system | |
CN114896398A (en) | Text classification system and method based on feature selection | |
CN113032573A (en) | Large-scale text classification method and system combining theme semantics and TF-IDF algorithm | |
CN110348497B (en) | Text representation method constructed based on WT-GloVe word vector | |
CN113626604B (en) | Web Page Text Classification System Based on Maximum Spacing Criterion | |
CN108614825B (en) | Webpage feature extraction method and device | |
CN105760471B (en) | Two-class text classification method based on combined convex linear perceptron | |
Luo | A new text classifier based on random forests |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20190226 Termination date: 20211214 |