
CN105512311B - An Adaptive Feature Selection Method Based on Chi-square Statistics - Google Patents

An Adaptive Feature Selection Method Based on Chi-square Statistics

Info

Publication number
CN105512311B
CN105512311B CN201510927759.9A
Authority
CN
China
Prior art keywords
feature
class
text
chi
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201510927759.9A
Other languages
Chinese (zh)
Other versions
CN105512311A (en)
Inventor
汪友生
樊存佳
王雨婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201510927759.9A priority Critical patent/CN105512311B/en
Publication of CN105512311A publication Critical patent/CN105512311A/en
Application granted granted Critical
Publication of CN105512311B publication Critical patent/CN105512311B/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 - Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 - Data mining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An adaptive feature selection method based on chi-square statistics, relating to the field of computer text data processing. First, the training text set and the test text set are preprocessed, including word segmentation and stop-word removal. Then, adaptive text feature selection based on chi-square statistics is performed: a word frequency factor and an inter-class variance are defined and introduced into the CHI algorithm, and a suitable scale factor is added to the CHI algorithm. Finally, combined with the evaluation metrics of the classic KNN algorithm, the scale factor is adjusted automatically so that the improved CHI is applicable to different corpora and maintains high classification accuracy. Experimental results show that, compared with the traditional CHI method, the classification accuracy of the invention is improved on both balanced and unbalanced corpora.

Description

An Adaptive Feature Selection Method Based on Chi-square Statistics

Technical Field

The invention relates to the field of computer text data processing, and in particular to an adaptive text feature selection method based on chi-square statistics (χ², CHI).

Background Art

In today's era of big data, mining the latent value of data is crucial, and data mining, as a technology for discovering that value, has attracted great attention. Text data accounts for a considerable proportion of big data, and text classification, as a data mining method for effectively organizing and managing text data, has gradually become a research focus. It is widely used in information filtering, information organization and management, information retrieval, digital libraries, and spam filtering. Text Classification (TC) refers to the process of automatically assigning texts of unknown category to one or more categories according to their content under a predefined category system. Commonly used text classification methods include K-Nearest Neighbor (KNN), Naive Bayes (NB), and Support Vector Machine (SVM).

The current text classification pipeline includes preprocessing, feature dimensionality reduction, text representation, classifier training, and evaluation. The most commonly used text representation is the vector space model; the high dimensionality and sparsity of the vector space increase time and space complexity and greatly affect classification accuracy, so feature dimensionality reduction is crucial and directly affects the efficiency and accuracy of classification. Feature dimensionality reduction mainly includes two approaches: feature extraction and feature selection. Linguistics-based feature extraction requires natural language processing techniques and has high computational complexity, whereas feature selection methods based on statistical theory have lower complexity and need little background knowledge, so feature selection is more widely used. The basic idea of feature selection is to construct an evaluation function, score each feature item in the feature set, sort all feature items by their scores, and select a specific number of features as the final text feature set. Commonly used feature selection methods include chi-square statistics (CHI), Document Frequency (DF), Information Gain (IG), Mutual Information (MI), Expected Cross Entropy (ECE), and Weight of Evidence (WE).
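As a minimal illustration of this generic score, sort, and select scheme, the Python sketch below is not taken from the patent; `score` stands for any evaluation function (CHI, DF, IG, and so on) and all names are illustrative:

```python
def select_features(vocabulary, score, n_features):
    """Generic filter-style feature selection: score every term,
    sort by score, and keep the top n_features terms."""
    ranked = sorted(vocabulary, key=score, reverse=True)
    return ranked[:n_features]

# Example: keep the 2000 highest-scoring terms under some scoring function.
# selected = select_features(all_terms, lambda t: chi_score[t], 2000)
```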

As one of the commonly used text feature selection methods, CHI is simple to implement and has low time complexity, but it also has shortcomings that make its classification performance less than ideal. The shortcomings of the CHI algorithm are mainly two. First, CHI only considers the document frequency of a feature item and ignores its term frequency, so the weight of low-frequency words is amplified. Second, it amplifies the weight of feature items that appear only a few times in one category but frequently in other categories. To address these shortcomings, many researchers have improved CHI, and the improvements can be summarized in two directions. First, several adjustment parameters are introduced to reduce the reliance on low-frequency words, but this does not consider the positive and negative correlation between feature items and categories. Second, a scale factor α is introduced to distinguish positive from negative correlation and assign different weights, improving the feature selection ability of the CHI model, but the scale factor must be chosen empirically. Given the shortcomings of existing improved CHI algorithms, designing a CHI text feature selection method with high classification accuracy has important academic significance and practical value.

Summary of the Invention

The purpose of the present invention is to provide an improved CHI text feature selection method that increases the accuracy of text classification. On the one hand, a word frequency factor and an inter-class variance are introduced to reduce CHI's reliance on low-frequency words and to select feature items that occur frequently in a specific class and are evenly distributed within that class; on the other hand, an adaptive scale factor μ is introduced so that features are weighted differently according to their positive or negative correlation with a class, reducing the error introduced by choosing the scale factor by hand.

The features of the present invention are as follows:

Step 1: download from the Internet the Chinese corpus released by Fudan University, consisting of a training text set and a test text set;

Step 2: use ICTCLAS, the open-source word segmentation software from the Chinese Academy of Sciences, to preprocess the training and test text sets, including word segmentation and stop-word removal, and obtain the segmented training and test text sets;

Step 3: apply the CHI-based adaptive text feature selection method to the segmented training text set to obtain the feature vocabulary corresponding to the training text set;

The traditional CHI text feature selection method uses the standard chi-square statistic (with N = A + B + C + D the total number of training documents):

χ²(t_k, C_i) = N·(A·D − B·C)² / [(A+B)·(C+D)·(A+C)·(B+D)]

where A is the number of documents that contain feature t_k and belong to category C_i, B is the number of documents that contain t_k but do not belong to C_i, C is the number of documents that do not contain t_k but belong to C_i, and D is the number of documents that neither contain t_k nor belong to C_i.
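A minimal Python sketch of this traditional CHI score, computed from the four document counts defined above; the function name and interface are illustrative, not part of the patent:

```python
def chi_square(A, B, C, D):
    """Traditional CHI score of feature t_k for category C_i.
    A: docs with t_k, in C_i        B: docs with t_k, not in C_i
    C: docs without t_k, in C_i     D: docs without t_k, not in C_i
    """
    N = A + B + C + D
    denom = (A + B) * (C + D) * (A + C) * (B + D)
    return 0.0 if denom == 0 else N * (A * D - B * C) ** 2 / denom
```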

A CHI-based adaptive text feature selection method is proposed, with the following formula:

χ²_new(t_k, C_i) = [μ·χ²(t_k, C_i)⁺ + (1 − μ)·χ²(t_k, C_i)⁻]·α·β

where μ is the scale factor, χ²(t_k, C_i)⁺ and χ²(t_k, C_i)⁻ denote the chi-square statistics of features positively and negatively correlated with category C_i, α is the word frequency factor, and β is the inter-class variance; α and β are defined as follows:

where m is the total number of categories in the training set, tf(t_k, C_i) is the number of occurrences of feature item t_k in category C_i, and the sum of tf(t_k, C_j) over all categories is the number of occurrences of t_k in the entire training text set. If category C_i of the training set contains the p documents d_i1, d_i2, ..., d_ij, ..., d_ip, then tf(t_k, d_ij) is the number of times t_k occurs in the j-th document of C_i, the sum over these documents is the total number of occurrences of t_k in C_i, and the sum over all categories and documents is the total number of occurrences of t_k in all documents of the training text set. The word frequency factor α is therefore the proportion of the term frequency of t_k in the specific class C_i to its term frequency over the whole training set, α(t_k, C_i) = tf(t_k, C_i) / Σ_{j=1}^{m} tf(t_k, C_j). The larger α is, the more frequently the feature item occurs in the specific class and the more rarely (or not at all) it occurs in other classes, so such a feature item has stronger class-discriminating ability; the smaller α is, the less frequently it occurs in the specific class and the more frequently it occurs in other classes, so it has weaker class-discriminating ability.

where m is the number of categories, df_i is the number of documents in category C_i that contain t_k, and the average number of documents containing t_k per category is the mean of the df_i. The two cases distinguished are whether the number of texts in which the feature word appears in the specific class is greater than or equal to this average, or less than it. The β value measures the degree of deviation between the document frequency of a feature word in a given class and the average document frequency over all classes. The larger β is, the more the number of documents containing t_k in category C_i exceeds the average number of documents containing t_k over all classes, and such a feature item has stronger class-discriminating ability.
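The sketch below illustrates the two correction factors in Python. α follows the ratio described above exactly; the patent's exact expression for β is not reproduced in this text, so the variance-style formula in the code is only an assumption consistent with the verbal description (deviation of the per-class document frequency from the class average):

```python
import numpy as np

def word_freq_factor(tf_per_class, i):
    """alpha(t_k, C_i) = tf(t_k, C_i) / sum over all classes of tf(t_k, C_j)."""
    total = tf_per_class.sum()
    return tf_per_class[i] / total if total > 0 else 0.0

def inter_class_variance(df_per_class, i):
    """Assumed beta: squared deviation of df_i from the mean document frequency
    over all classes, normalized by that mean. Illustrative only; the patent
    defines beta by a formula that is not reproduced here."""
    mean_df = df_per_class.mean()
    return (df_per_class[i] - mean_df) ** 2 / (mean_df + 1e-12)
```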

Step 4: represent each training text and each test text as a vector over the feature words of the feature vocabulary. The weight of each dimension is computed as TFIDF = TF × IDF, where TF (term frequency) is the number of times the feature item appears in the document, and IDF (inverse document frequency) is computed as IDF = log(M/n_k + 0.01), where M is the number of texts in the document collection and n_k is the number of documents containing the word;
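A one-function sketch of the step-4 weighting; variable names are illustrative:

```python
import math

def tfidf_weight(tf, M, n_k):
    """Weight of one vector dimension: TFIDF = TF * IDF, with IDF = log(M/n_k + 0.01)."""
    return tf * math.log(M / n_k + 0.01)
```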

Step 5: perform KNN classification.

The training text set is S, the test text is d, n is the feature vector dimension threshold, and K is set to 35.

Step 5.1: compute the similarity between the test text d and every text in S using the cosine of the angle between their vectors;

Step 5.2: select the K texts with the highest similarity from step 5.1 as the K nearest-neighbor texts of the test text d;

Step 5.3: compute the weight of the test text d for each category and assign d to the category with the largest weight.

Let the known category of the training text d_i be C_j; the weight of the test text d for category C_j is computed over its K nearest neighbors as follows:

W(d, C_j) = Σ_{d_i ∈ KNN(d)} Sim(d, d_i) · y(d_i, C_j)

where Sim(d, d_i) is the cosine similarity between the test text d and the known-category text d_i:

Sim(d, d_i) = Σ_{j=1}^{n} X_j·x_ij / ( √(Σ_{j=1}^{n} X_j²) · √(Σ_{j=1}^{n} x_ij²) )

where n is the feature vector dimension threshold, X_j is the weight of the j-th dimension of the text d to be classified (0 < j ≤ n), and x_ij is the weight of the j-th dimension of the training text vector d_i.

y(d_i, C_j) is the category attribute function: y(d_i, C_j) = 1 if d_i belongs to category C_j, and 0 otherwise.
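The following Python sketch covers steps 5.1 to 5.3, using y(d_i, C_j) as the 0/1 membership indicator defined above and assuming the texts are already TF-IDF vectors; all helper names are illustrative:

```python
import numpy as np

def knn_classify(d, train_vecs, train_labels, K=35):
    """Assign test vector d to the category with the largest similarity-weighted vote."""
    # Step 5.1: cosine similarity between d and every training text.
    denom = np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(d) + 1e-12
    sims = train_vecs @ d / denom
    # Step 5.2: indices of the K most similar training texts.
    neighbors = np.argsort(sims)[-K:]
    # Step 5.3: weight of d for each category; y(d_i, C_j) acts as the indicator,
    # so each neighbor adds its similarity to the weight of its own category.
    weights = {}
    for idx in neighbors:
        c = train_labels[idx]
        weights[c] = weights.get(c, 0.0) + sims[idx]
    return max(weights, key=weights.get)
```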

Step 6: compute the precision, recall, and F1 value of the KNN classification algorithm; the final value of the scale factor μ is obtained by setting a maximum threshold on the difference between the F1 values of two consecutive classification runs and a step size by which μ grows, so as to ensure high classification accuracy.

Step 6.1: set the initial F1 value to 0 and the initial μ to 0.5; ε = 0.0001 is the threshold on the difference between two consecutive F1 values, and τ = 0.05 is the step size by which the scale factor μ grows;

Step 6.2: repeat step 5 to obtain F1′ and compute the difference between the two consecutive F1 values, ΔF = |F1′ − F1|;

Step 6.3: if ΔF is less than ε, take the current scale factor μ as the result; if ΔF is greater than or equal to ε, set μ′ = μ + τ and F1 = F1′, and repeat steps 6.2 and 6.3 until a suitable scale factor μ is obtained.
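A sketch of this step-6 search loop in Python. Here `run_classification(mu)` is a placeholder that stands for rerunning steps 3 to 5 with the given μ (feature selection, TF-IDF weighting, KNN) and returning the resulting F1 value; the upper bound on μ is an added safety assumption, since the patent only specifies the stopping test on ΔF:

```python
def find_scale_factor(run_classification, mu=0.5, eps=1e-4, tau=0.05, mu_max=1.0):
    """Grow mu in steps of tau until two consecutive F1 values differ by less than eps."""
    f1 = 0.0
    while mu <= mu_max:
        f1_new = run_classification(mu)   # step 6.2: rerun classification with current mu
        if abs(f1_new - f1) < eps:        # step 6.3: converged, keep this mu
            return mu
        mu, f1 = mu + tau, f1_new         # step 6.3: otherwise mu' = mu + tau, F1 = F1'
    return mu
```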

Compared with the prior art, the present invention has the following beneficial effects.

The present invention proposes an adaptive feature selection method based on chi-square statistics; the KNN algorithm is chosen as the classifier for the test texts. The flow chart of the whole process is shown in Figure 1, and the flow chart for computing the scale factor μ is shown in Figure 2; the accuracy on the balanced corpus is shown in Table 1 and on the unbalanced corpus in Table 2. Compared with the traditional CHI method, on the one hand the word frequency factor and the inter-class variance are introduced to reduce CHI's reliance on low-frequency words and to select feature items that occur frequently and are evenly distributed in a specific class; on the other hand, an adaptive scale factor μ is introduced so that features are weighted according to their positive or negative correlation, and the method adapts to corpora with different distributions, reducing the error caused by choosing the scale factor by hand. Tables 1 and 2 show that, compared with the traditional CHI method, the classification accuracy of the present invention is improved on both the balanced and the unbalanced corpus.

Brief Description of the Drawings

Figure 1 is a flow chart of the overall process of the present invention.

Figure 2 is a flow chart of the computation of the scale factor μ in the present invention.

Detailed Description of the Embodiments

The present invention is implemented by the following technical means:

An adaptive text feature selection method based on chi-square statistics. First, the training and test text sets are preprocessed, including word segmentation and stop-word removal. Second, adaptive text feature selection based on chi-square statistics is performed: the word frequency factor α and the inter-class variance β are defined and introduced into the CHI algorithm, and a suitable scale factor μ is added. Finally, combined with the classic KNN algorithm, the scale factor μ is adjusted automatically so that the improved CHI is applicable to different corpora and ensures high classification accuracy.

The above adaptive text feature selection method based on chi-square statistics is used for text classification and comprises the following steps:

Step 1: download from the Internet the Chinese corpus released by Fudan University, consisting of a training text set and a test text set;

Step 2: use the word segmentation software ICTCLAS to preprocess the training and test text sets, including word segmentation and stop-word removal, and obtain the segmented training and test text sets;

Step 3: apply the CHI-based adaptive text feature selection method to the segmented training text set to obtain the feature vocabulary corresponding to the training text set;

The calculation formula of the traditional CHI text feature selection method is as follows:

where A is the number of documents that contain feature t_k and belong to category C_i, B is the number of documents that contain t_k but do not belong to C_i, C is the number of documents that do not contain t_k but belong to C_i, and D is the number of documents that neither contain t_k nor belong to C_i.

A CHI-based adaptive text feature selection method is proposed, with the following formula:

χ²_new(t_k, C_i) = [μ·χ²(t_k, C_i)⁺ + (1 − μ)·χ²(t_k, C_i)⁻]·α·β    (2)

where μ is the adaptive scale factor, α is the word frequency factor, and β is the inter-class variance; α and β are defined as follows:

where m is the total number of categories in the training set, tf(t_k, C_i) is the number of occurrences of feature item t_k in category C_i, and the sum over all categories is the number of occurrences of t_k in the entire training text set. If category C_i of the training set contains the p documents d_i1, d_i2, ..., d_ij, ..., d_ip, then tf(t_k, d_ij) is the number of times t_k occurs in the j-th document of C_i, the sum over these documents is the total number of occurrences of t_k in C_i, and the sum over all categories and documents is the total number of occurrences of t_k in all documents of the training text set. The word frequency factor α is the proportion of the term frequency of t_k in a specific class to its term frequency over the whole training set. The larger α is, the more frequently the feature item occurs in the specific class and the more rarely (or not at all) it occurs in other classes, so such a feature item has stronger class-discriminating ability; the smaller α is, the less frequently it occurs in the specific class and the more frequently it occurs in other classes, so it has weaker class-discriminating ability.

where m is the number of categories, df_i is the number of documents in category C_i that contain t_k, and the average number of documents containing t_k per category is the mean of the df_i. The two cases distinguished are whether the number of texts in which the feature word appears in the specific class is greater than or equal to this average, or less than it. The β value measures the degree of deviation between the document frequency of a feature word in a given class and the average document frequency over all classes. The larger β is, the more the number of documents containing t_k in category C_i exceeds the average number of documents containing t_k over all classes, and such a feature item has stronger class-discriminating ability.

Step 4: represent each training text and each test text as a vector over the feature words of the feature vocabulary. The weight of each dimension is computed as TFIDF = TF × IDF, where TF (term frequency) is the number of times the feature item appears in the document, and IDF (inverse document frequency) is computed as IDF = log(M/n_k + 0.01), where M is the number of texts in the document collection and n_k is the number of documents containing the word;

Step 5: perform KNN classification.

The training text set is S, the test text is d, n is the feature vector dimension threshold, and K is set to 35.

Use the cosine of the angle between the vectors to compute the similarity between the test text d and every text in S; select the K texts with the highest similarity as the K nearest neighbors of d; compute the weight of d for each category and assign d to the category with the largest weight.

Let the known category of the training text d_i be C_j; the weight calculation formula is as follows:

where Sim(d, d_i) is the cosine similarity between the test text d and the known-category text d_i, with the following formula:

where n is the feature vector dimension threshold, X_j is the weight of the j-th dimension of the text d to be classified (0 < j ≤ n), and x_ij is the weight of the j-th dimension of the training text vector d_i.

y(d_i, C_j) is the category attribute function, with the following formula:

Step 6: compute the precision, recall, and F1 value of the KNN classification algorithm; the final value of the scale factor is obtained by setting a maximum threshold on the difference between the F1 values of two consecutive classification runs and a step size by which the scale factor μ grows.

Set the initial F1 value to 0 and the initial μ to 0.5; ε = 0.0001 is the threshold on the difference between two consecutive F1 values, and τ = 0.05 is the step size by which the scale factor μ grows.

Repeat step 5 to obtain F1′ and compute ΔF = |F1′ − F1|. If ΔF is less than ε, take the current scale factor μ; if ΔF is greater than or equal to ε, set μ′ = μ + τ and F1 = F1′, and repeat this iteration until a suitable scale factor μ is obtained, ensuring high classification accuracy.

Table 1: Comparison of results before and after the algorithm improvement (balanced corpus) (%)

Table 2: Comparison of results before and after the algorithm improvement (unbalanced corpus) (%)

Claims (1)

1. An adaptive feature selection method based on chi-square statistics, characterized by comprising the following steps:
step 1, downloading from the Internet the Chinese corpus released by Fudan University, namely a training text set and a test text set;
step 2, performing word segmentation and stop-word removal preprocessing on the training text set and the test text set using the word segmentation software ICTCLAS, to obtain a segmented training text set and test text set;
step 3, performing feature selection on the training text set after word segmentation by adopting a CHI-based adaptive text feature selection method to obtain a feature word bank corresponding to the training text set;
the calculation formula of the traditional CHI text feature selection method is as follows:
wherein A represents the number of documents that contain the feature t_k and belong to class C_i, B represents the number of documents that contain t_k and do not belong to C_i, C represents the number of documents that do not contain t_k and belong to C_i, and D represents the number of documents that do not contain t_k and do not belong to C_i;
a CHI-based adaptive text feature selection method is provided, and the formula is as follows:
χ²_new(t_k, C_i) = [μ·χ²(t_k, C_i)⁺ + (1 − μ)·χ²(t_k, C_i)⁻]·α·β
wherein μ is a scale factor, α is a word frequency factor, β is an inter-class variance, and the formulas α and β are defined as follows:
where m is the total number of classes in the training set, tf(t_k, C_i) represents the number of occurrences of the feature item t_k in class C_i, and the sum over all classes represents the number of occurrences of t_k in the entire training text set; class C_i in the training set contains p documents d_i1, d_i2, ..., d_ij, ..., d_ip; tf(t_k, d_ij) represents the number of occurrences of t_k in the j-th document of class C_i, the sum over these documents represents the total number of occurrences of t_k in class C_i, and the sum over all classes and documents represents the total number of occurrences of t_k in all documents of the entire training text set; the word frequency factor α indicates the proportion of the term frequency of t_k in a specific class C_i to the term frequency of t_k in the whole training set; the larger α is, the more frequently the feature item occurs in the specific class and the less (or hardly at all) it occurs in other classes, so such a feature item has stronger class-distinguishing capability; the smaller α is, the less frequently the feature item occurs in the specific class and the more frequently it occurs in other classes, so such a feature item has weaker class-distinguishing capability;
wherein m represents the number of classes, df_i is the number of documents in class C_i that contain t_k, and the average number of documents containing t_k per class is the mean of the df_i; the two cases distinguished are whether the number of texts in which the feature word appears in a specific class is greater than or equal to this average value, or less than the average value; the β value is used to measure the degree of deviation between the document frequency of a feature word in a certain class and the average document frequency over all classes; the larger β is, the more the number of documents containing the feature word t_k in class C_i exceeds the average number of documents containing t_k over all classes, and such a feature item has higher class-distinguishing capability;
step 4, representing each training text and each test text in vector form using the feature words of the feature word bank, wherein the weight of each dimension is calculated according to TFIDF = TF × IDF, TF (term frequency) is the number of times the feature item appears in the document, IDF (inverse document frequency) is calculated as IDF = log(M/n_k + 0.01), M is the number of texts contained in the document set, and n_k represents the number of documents containing the word;
step 5, KNN classification is carried out;
the training text set is S, the test text is d, n is a feature vector dimension threshold, and K is 35;
step 5.1, calculating the similarity between the test text d and all texts in S using the cosine of the angle between their vectors;
step 5.2, selecting the K texts with the maximum similarity obtained in the step 5.1 as K nearest neighbor texts of the test text d;
step 5.3, calculating the weight of the test text d belonging to each category, and classifying the test text d into the category with the maximum weight;
let the known class of the training text d_i be C_j; the weight calculation formula is as follows:
wherein Sim(d, d_i) is the cosine similarity between the test text d and the known-category text d_i, and the formula is as follows:
where n is the feature vector dimension threshold, X_j represents the weight of the j-th dimension of the text d to be classified, wherein 0 < j ≤ n, and x_ij represents the weight of the j-th dimension of the training text vector d_i;
y(d_i, C_j) is the category attribute function, and the formula is as follows:
step 6, calculating the precision, recall and F_1 value of the KNN classification algorithm, and obtaining the final value of the scale factor μ by setting a maximum threshold on the difference between the F_1 values of two successive classification runs and the step size by which the scale factor μ grows, so as to ensure higher classification accuracy;
step 6.1, setting the initial F_1 value to 0 and the initial value of μ to 0.5, wherein ε = 0.0001 is the threshold on the difference between two successive F_1 values and τ = 0.05 is the step size of the increase of the scale factor μ;
step 6.2, repeating step 5 to obtain the value of F_1′ and the difference between the two successive F_1 values, ΔF = |F_1′ − F_1|;
step 6.3, if ΔF is smaller than ε, obtaining the scale factor μ at that moment; if ΔF is greater than or equal to ε, letting μ′ = μ + τ and F_1 = F_1′, and repeating step 6.2 and step 6.3 until a suitable scale factor μ is obtained, so as to ensure higher classification accuracy.
CN201510927759.9A 2015-12-14 2015-12-14 An Adaptive Feature Selection Method Based on Chi-square Statistics Expired - Fee Related CN105512311B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510927759.9A CN105512311B (en) 2015-12-14 2015-12-14 An Adaptive Feature Selection Method Based on Chi-square Statistics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510927759.9A CN105512311B (en) 2015-12-14 2015-12-14 An Adaptive Feature Selection Method Based on Chi-square Statistics

Publications (2)

Publication Number Publication Date
CN105512311A CN105512311A (en) 2016-04-20
CN105512311B true CN105512311B (en) 2019-02-26

Family

ID=55720291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510927759.9A Expired - Fee Related CN105512311B (en) 2015-12-14 2015-12-14 An Adaptive Feature Selection Method Based on Chi-square Statistics

Country Status (1)

Country Link
CN (1) CN105512311B (en)

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021461A (en) * 2016-05-17 2016-10-12 深圳市中润四方信息技术有限公司 Text classification method and text classification system
CN108073567B (en) * 2016-11-16 2021-12-28 北京嘀嘀无限科技发展有限公司 Feature word extraction processing method, system and server
CN108090088A (en) * 2016-11-23 2018-05-29 北京国双科技有限公司 Feature extracting method and device
CN106611057B (en) * 2016-12-27 2019-08-13 上海利连信息科技有限公司 The text classification feature selection approach of importance weighting
CN107291837B (en) * 2017-05-31 2020-04-03 北京大学 A word segmentation method for web text based on domain adaptability
CN107256214B (en) * 2017-06-30 2020-09-25 联想(北京)有限公司 Junk information judgment method and device and server cluster
CN107577794B (en) * 2017-09-19 2019-07-05 北京神州泰岳软件股份有限公司 A kind of news category method and device
CN108197307A (en) * 2018-01-31 2018-06-22 湖北工业大学 The selection method and system of a kind of text feature
CN108491429A (en) * 2018-02-09 2018-09-04 湖北工业大学 A kind of feature selection approach based on document frequency and word frequency statistics between class in class
CN108376130A (en) * 2018-03-09 2018-08-07 长安大学 A kind of objectionable text information filtering feature selection approach
CN108346474B (en) * 2018-03-14 2021-09-28 湖南省蓝蜻蜓网络科技有限公司 Electronic medical record feature selection method based on word intra-class distribution and inter-class distribution
CN108920545B (en) * 2018-06-13 2021-07-09 四川大学 Chinese sentiment feature selection method based on extended sentiment dictionary and chi-square model
CN109325511B (en) * 2018-08-01 2020-07-31 昆明理工大学 A method to improve feature selection
CN109543037A (en) * 2018-11-21 2019-03-29 南京安讯科技有限责任公司 A kind of article classification method based on improved TF-IDF
CN112256865B (en) * 2019-01-31 2023-03-21 青岛科技大学 Chinese text classification method based on classifier
CN110069630B (en) * 2019-03-20 2023-07-21 重庆信科设计有限公司 Improved mutual information feature selection method
CN110705247B (en) * 2019-08-30 2020-08-04 山东科技大学 Text similarity calculation method based on χ2-C
CN110688481A (en) * 2019-09-02 2020-01-14 贵州航天计量测试技术研究所 Text classification feature selection method based on chi-square statistic and IDF
CN111144106B (en) * 2019-12-20 2023-05-02 山东科技大学 Two-stage text feature selection method under unbalanced data set
CN111062212B (en) * 2020-03-18 2020-06-30 北京热云科技有限公司 Feature extraction method and system based on optimized TFIDF
CN112200259A (en) * 2020-10-19 2021-01-08 哈尔滨理工大学 A kind of information gain text feature selection method and classification device based on classification and screening
CN113032564B (en) * 2021-03-22 2023-05-30 建信金融科技有限责任公司 Feature extraction method, device, electronic equipment and storage medium
CN113515623B (en) * 2021-04-28 2022-12-06 西安理工大学 Feature selection method based on word frequency difference factor
CN113378567B (en) * 2021-07-05 2022-05-10 广东工业大学 An improved Chinese short text classification method for low-frequency words
CN113987536A (en) * 2021-10-27 2022-01-28 建信金融科技有限责任公司 Method, device, electronic device and medium for determining security level of field in data sheet

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103678274A (en) * 2013-04-15 2014-03-26 南京邮电大学 Feature extraction method for text categorization based on improved mutual information and entropy
CN103886108A (en) * 2014-04-13 2014-06-25 北京工业大学 Feature selection and weight calculation method of imbalance text set
CN104750844A (en) * 2015-04-09 2015-07-01 中南大学 Method and device for generating text characteristic vectors based on TF-IGM, method and device for classifying texts

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9311390B2 (en) * 2008-01-29 2016-04-12 Educational Testing Service System and method for handling the confounding effect of document length on vector-based similarity scores

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103678274A (en) * 2013-04-15 2014-03-26 南京邮电大学 Feature extraction method for text categorization based on improved mutual information and entropy
CN103886108A (en) * 2014-04-13 2014-06-25 北京工业大学 Feature selection and weight calculation method of imbalance text set
CN104750844A (en) * 2015-04-09 2015-07-01 中南大学 Method and device for generating text characteristic vectors based on TF-IGM, method and device for classifying texts

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
An optimized mutual information text feature selection method based on word frequency; 刘海峰 (Liu Haifeng); 《计算机工程》 (Computer Engineering); 2014-12-31; pp. 179-182

Also Published As

Publication number Publication date
CN105512311A (en) 2016-04-20

Similar Documents

Publication Publication Date Title
CN105512311B (en) An Adaptive Feature Selection Method Based on Chi-square Statistics
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
CN105224695B (en) A kind of text feature quantization method and device and file classification method and device based on comentropy
CN102622373B (en) Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm
CN104750844B (en) Text eigenvector based on TF-IGM generates method and apparatus and file classification method and device
Fan et al. Research on text classification based on improved tf-idf algorithm
CN104391835B (en) Feature Words system of selection and device in text
CN110543564B (en) Domain Label Acquisition Method Based on Topic Model
CN107066555B (en) On-line theme detection method for professional field
CN108763348B (en) Classification improvement method for feature vectors of extended short text words
CN108197144B (en) Hot topic discovery method based on BTM and Single-pass
CN103294817A (en) Text feature extraction method based on categorical distribution probability
CN109960799A (en) An optimized classification method for short texts
CN104346459A (en) Text classification feature selecting method based on term frequency and chi-square statistics
CN107908624A (en) A kind of K medoids Text Clustering Methods based on all standing Granule Computing
CN112579783B (en) Short text clustering method based on Laplace atlas
CN109376235B (en) Feature selection method based on document layer word frequency reordering
CN106570076A (en) Computer text classification system
CN114896398A (en) Text classification system and method based on feature selection
CN113032573A (en) Large-scale text classification method and system combining theme semantics and TF-IDF algorithm
CN110348497B (en) Text representation method constructed based on WT-GloVe word vector
CN113626604B (en) Web Page Text Classification System Based on Maximum Spacing Criterion
CN108614825B (en) Webpage feature extraction method and device
CN105760471B (en) Two-class text classification method based on combined convex linear perceptron
Luo A new text classifier based on random forests

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190226

Termination date: 20211214