
CN105512311B - An Adaptive Feature Selection Method Based on Chi-square Statistics - Google Patents

An Adaptive Feature Selection Method Based on Chi-square Statistics

Info

Publication number
CN105512311B
CN105512311B CN201510927759.9A
Authority
CN
China
Prior art keywords
feature
class
text
chi
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201510927759.9A
Other languages
Chinese (zh)
Other versions
CN105512311A (en)
Inventor
汪友生
樊存佳
王雨婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201510927759.9A priority Critical patent/CN105512311B/en
Publication of CN105512311A publication Critical patent/CN105512311A/en
Application granted granted Critical
Publication of CN105512311B publication Critical patent/CN105512311B/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 - Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 - Data mining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An adaptive feature selection method based on chi-square statistics, relating to the field of computer text data processing. First, the training text set and the test text set are preprocessed, including word segmentation and stop-word removal. Then, adaptive text feature selection based on chi-square statistics is performed: a word frequency factor and an inter-class variance are defined and introduced into the CHI algorithm, and a suitable scale factor is added to the CHI algorithm. Finally, combined with the evaluation metrics of the classic KNN algorithm, the scale factor is adjusted automatically so that the improved CHI is applicable to different corpora and maintains high classification accuracy. Experimental results show that, compared with the traditional CHI method, the classification accuracy of the invention is improved on both balanced and unbalanced corpora.

Description

An Adaptive Feature Selection Method Based on Chi-square Statistics

Technical Field

The invention relates to the field of computer text data processing, and in particular to an adaptive text feature selection method based on chi-square statistics (χ², CHI).

Background Art

In today's era of big data, mining the latent value of data is crucial, and data mining, as a technology for discovering that value, has attracted great attention. Text data accounts for a considerable proportion of big data, and text classification, as a data mining method for effectively organizing and managing text data, has gradually become a research focus. It is widely used in information filtering, information organization and management, information retrieval, digital libraries, and spam filtering. Text Classification (TC) refers to the process of automatically assigning texts of unknown category to one or more categories according to their content under a predefined category system. Commonly used text classification methods include K-Nearest Neighbor (KNN), Naive Bayes (NB), and Support Vector Machine (SVM).

The current text classification pipeline includes preprocessing, feature dimensionality reduction, text representation, classifier training, and evaluation. The most commonly used text representation is the vector space model; the high dimensionality and sparsity of the vector space increase time and space complexity and greatly affect classification accuracy, so feature dimensionality reduction is crucial and directly affects the efficiency and accuracy of classification. Feature dimensionality reduction mainly includes two approaches: feature extraction and feature selection. Linguistics-based feature extraction requires natural language processing techniques and has high computational complexity, whereas feature selection methods based on statistical theory have lower complexity and need little background knowledge, so feature selection is more widely used. The basic idea of feature selection is to construct an evaluation function, score each feature item in the feature set, sort all feature items by their scores, and select a specific number of features as the final text feature set. Commonly used feature selection methods include chi-square statistics (CHI), Document Frequency (DF), Information Gain (IG), Mutual Information (MI), Expected Cross Entropy (ECE), and Weight of Evidence (WE).
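As a minimal illustration of this generic score, sort, and select scheme, the Python sketch below is not taken from the patent; `score` stands for any evaluation function (CHI, DF, IG, and so on) and all names are illustrative:

```python
def select_features(vocabulary, score, n_features):
    """Generic filter-style feature selection: score every term,
    sort by score, and keep the top n_features terms."""
    ranked = sorted(vocabulary, key=score, reverse=True)
    return ranked[:n_features]

# Example: keep the 2000 highest-scoring terms under some scoring function.
# selected = select_features(all_terms, lambda t: chi_score[t], 2000)
```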

As one of the commonly used text feature selection methods, CHI is simple to implement and has low time complexity, but it also has shortcomings that make its classification performance less than ideal. The shortcomings of the CHI algorithm are mainly two. First, CHI only considers the document frequency of a feature item and ignores its term frequency, so the weight of low-frequency words is amplified. Second, it amplifies the weight of feature items that appear only a few times in one category but frequently in other categories. To address these shortcomings, many researchers have improved CHI, and the improvements can be summarized in two directions. First, several adjustment parameters are introduced to reduce the reliance on low-frequency words, but this does not consider the positive and negative correlation between feature items and categories. Second, a scale factor α is introduced to distinguish positive from negative correlation and assign different weights, improving the feature selection ability of the CHI model, but the scale factor must be chosen empirically. Given the shortcomings of existing improved CHI algorithms, designing a CHI text feature selection method with high classification accuracy has important academic significance and practical value.

Summary of the Invention

The purpose of the present invention is to provide an improved CHI text feature selection method that increases the accuracy of text classification. On the one hand, a word frequency factor and an inter-class variance are introduced to reduce CHI's reliance on low-frequency words and to select feature items that occur frequently in a specific class and are evenly distributed within that class; on the other hand, an adaptive scale factor μ is introduced so that features are weighted differently according to their positive or negative correlation with a class, reducing the error introduced by choosing the scale factor by hand.

The features of the present invention are as follows:

Step 1: download from the Internet the Chinese corpus released by Fudan University, consisting of a training text set and a test text set;

Step 2: use ICTCLAS, the open-source word segmentation software from the Chinese Academy of Sciences, to preprocess the training and test text sets, including word segmentation and stop-word removal, and obtain the segmented training and test text sets;

Step 3: apply the CHI-based adaptive text feature selection method to the segmented training text set to obtain the feature vocabulary corresponding to the training text set;

The traditional CHI text feature selection method uses the standard chi-square statistic (with N = A + B + C + D the total number of training documents):

χ²(t_k, C_i) = N·(A·D − B·C)² / [(A+B)·(C+D)·(A+C)·(B+D)]

where A is the number of documents that contain feature t_k and belong to category C_i, B is the number of documents that contain t_k but do not belong to C_i, C is the number of documents that do not contain t_k but belong to C_i, and D is the number of documents that neither contain t_k nor belong to C_i.
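A minimal Python sketch of this traditional CHI score, computed from the four document counts defined above; the function name and interface are illustrative, not part of the patent:

```python
def chi_square(A, B, C, D):
    """Traditional CHI score of feature t_k for category C_i.
    A: docs with t_k, in C_i        B: docs with t_k, not in C_i
    C: docs without t_k, in C_i     D: docs without t_k, not in C_i
    """
    N = A + B + C + D
    denom = (A + B) * (C + D) * (A + C) * (B + D)
    return 0.0 if denom == 0 else N * (A * D - B * C) ** 2 / denom
```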

A CHI-based adaptive text feature selection method is proposed, with the following formula:

χ²_new(t_k, C_i) = [μ·χ²(t_k, C_i)⁺ + (1 − μ)·χ²(t_k, C_i)⁻]·α·β

where μ is the scale factor, χ²(t_k, C_i)⁺ and χ²(t_k, C_i)⁻ denote the chi-square statistics of features positively and negatively correlated with category C_i, α is the word frequency factor, and β is the inter-class variance; α and β are defined as follows:

where m is the total number of categories in the training set, tf(t_k, C_i) is the number of occurrences of feature item t_k in category C_i, and the sum of tf(t_k, C_j) over all categories is the number of occurrences of t_k in the entire training text set. If category C_i of the training set contains the p documents d_i1, d_i2, ..., d_ij, ..., d_ip, then tf(t_k, d_ij) is the number of times t_k occurs in the j-th document of C_i, the sum over these documents is the total number of occurrences of t_k in C_i, and the sum over all categories and documents is the total number of occurrences of t_k in all documents of the training text set. The word frequency factor α is therefore the proportion of the term frequency of t_k in the specific class C_i to its term frequency over the whole training set, α(t_k, C_i) = tf(t_k, C_i) / Σ_{j=1}^{m} tf(t_k, C_j). The larger α is, the more frequently the feature item occurs in the specific class and the more rarely (or not at all) it occurs in other classes, so such a feature item has stronger class-discriminating ability; the smaller α is, the less frequently it occurs in the specific class and the more frequently it occurs in other classes, so it has weaker class-discriminating ability.

where m is the number of categories, df_i is the number of documents in category C_i that contain t_k, and the average number of documents containing t_k per category is the mean of the df_i. The two cases distinguished are whether the number of texts in which the feature word appears in the specific class is greater than or equal to this average, or less than it. The β value measures the degree of deviation between the document frequency of a feature word in a given class and the average document frequency over all classes. The larger β is, the more the number of documents containing t_k in category C_i exceeds the average number of documents containing t_k over all classes, and such a feature item has stronger class-discriminating ability.
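The sketch below illustrates the two correction factors in Python. α follows the ratio described above exactly; the patent's exact expression for β is not reproduced in this text, so the variance-style formula in the code is only an assumption consistent with the verbal description (deviation of the per-class document frequency from the class average):

```python
import numpy as np

def word_freq_factor(tf_per_class, i):
    """alpha(t_k, C_i) = tf(t_k, C_i) / sum over all classes of tf(t_k, C_j)."""
    total = tf_per_class.sum()
    return tf_per_class[i] / total if total > 0 else 0.0

def inter_class_variance(df_per_class, i):
    """Assumed beta: squared deviation of df_i from the mean document frequency
    over all classes, normalized by that mean. Illustrative only; the patent
    defines beta by a formula that is not reproduced here."""
    mean_df = df_per_class.mean()
    return (df_per_class[i] - mean_df) ** 2 / (mean_df + 1e-12)
```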

Step 4: represent each training text and each test text as a vector over the feature words of the feature vocabulary. The weight of each dimension is computed as TFIDF = TF × IDF, where TF (term frequency) is the number of times the feature item appears in the document, and IDF (inverse document frequency) is computed as IDF = log(M/n_k + 0.01), where M is the number of texts in the document collection and n_k is the number of documents containing the word;
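A one-function sketch of the step-4 weighting; variable names are illustrative:

```python
import math

def tfidf_weight(tf, M, n_k):
    """Weight of one vector dimension: TFIDF = TF * IDF, with IDF = log(M/n_k + 0.01)."""
    return tf * math.log(M / n_k + 0.01)
```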

Step 5: perform KNN classification.

The training text set is S, the test text is d, n is the feature vector dimension threshold, and K is set to 35.

Step 5.1: compute the similarity between the test text d and every text in S using the cosine of the angle between their vectors;

Step 5.2: select the K texts with the highest similarity from step 5.1 as the K nearest-neighbor texts of the test text d;

Step 5.3: compute the weight of the test text d for each category and assign d to the category with the largest weight.

Let the known category of the training text d_i be C_j; the weight of the test text d for category C_j is computed over its K nearest neighbors as follows:

W(d, C_j) = Σ_{d_i ∈ KNN(d)} Sim(d, d_i) · y(d_i, C_j)

where Sim(d, d_i) is the cosine similarity between the test text d and the known-category text d_i:

Sim(d, d_i) = Σ_{j=1}^{n} X_j·x_ij / ( √(Σ_{j=1}^{n} X_j²) · √(Σ_{j=1}^{n} x_ij²) )

where n is the feature vector dimension threshold, X_j is the weight of the j-th dimension of the text d to be classified (0 < j ≤ n), and x_ij is the weight of the j-th dimension of the training text vector d_i.

y(d_i, C_j) is the category attribute function: y(d_i, C_j) = 1 if d_i belongs to category C_j, and 0 otherwise.
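The following Python sketch covers steps 5.1 to 5.3, using y(d_i, C_j) as the 0/1 membership indicator defined above and assuming the texts are already TF-IDF vectors; all helper names are illustrative:

```python
import numpy as np

def knn_classify(d, train_vecs, train_labels, K=35):
    """Assign test vector d to the category with the largest similarity-weighted vote."""
    # Step 5.1: cosine similarity between d and every training text.
    denom = np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(d) + 1e-12
    sims = train_vecs @ d / denom
    # Step 5.2: indices of the K most similar training texts.
    neighbors = np.argsort(sims)[-K:]
    # Step 5.3: weight of d for each category; y(d_i, C_j) acts as the indicator,
    # so each neighbor adds its similarity to the weight of its own category.
    weights = {}
    for idx in neighbors:
        c = train_labels[idx]
        weights[c] = weights.get(c, 0.0) + sims[idx]
    return max(weights, key=weights.get)
```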

Step 6: compute the precision, recall, and F1 value of the KNN classification algorithm; the final value of the scale factor μ is obtained by setting a maximum threshold on the difference between the F1 values of two consecutive classification runs and a step size by which μ grows, so as to ensure high classification accuracy.

Step 6.1: set the initial F1 value to 0 and the initial μ to 0.5; ε = 0.0001 is the threshold on the difference between two consecutive F1 values, and τ = 0.05 is the step size by which the scale factor μ grows;

Step 6.2: repeat step 5 to obtain F1′ and compute the difference between the two consecutive F1 values, ΔF = |F1′ − F1|;

Step 6.3: if ΔF is less than ε, take the current scale factor μ as the result; if ΔF is greater than or equal to ε, set μ′ = μ + τ and F1 = F1′, and repeat steps 6.2 and 6.3 until a suitable scale factor μ is obtained.
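A sketch of this step-6 search loop in Python. Here `run_classification(mu)` is a placeholder that stands for rerunning steps 3 to 5 with the given μ (feature selection, TF-IDF weighting, KNN) and returning the resulting F1 value; the upper bound on μ is an added safety assumption, since the patent only specifies the stopping test on ΔF:

```python
def find_scale_factor(run_classification, mu=0.5, eps=1e-4, tau=0.05, mu_max=1.0):
    """Grow mu in steps of tau until two consecutive F1 values differ by less than eps."""
    f1 = 0.0
    while mu <= mu_max:
        f1_new = run_classification(mu)   # step 6.2: rerun classification with current mu
        if abs(f1_new - f1) < eps:        # step 6.3: converged, keep this mu
            return mu
        mu, f1 = mu + tau, f1_new         # step 6.3: otherwise mu' = mu + tau, F1 = F1'
    return mu
```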

Compared with the prior art, the present invention has the following beneficial effects.

The present invention proposes an adaptive feature selection method based on chi-square statistics; the KNN algorithm is chosen as the classifier for the test texts. The flow chart of the whole process is shown in Figure 1, and the flow chart for computing the scale factor μ is shown in Figure 2; the accuracy on the balanced corpus is shown in Table 1 and on the unbalanced corpus in Table 2. Compared with the traditional CHI method, on the one hand the word frequency factor and the inter-class variance are introduced to reduce CHI's reliance on low-frequency words and to select feature items that occur frequently and are evenly distributed in a specific class; on the other hand, an adaptive scale factor μ is introduced so that features are weighted according to their positive or negative correlation, and the method adapts to corpora with different distributions, reducing the error caused by choosing the scale factor by hand. Tables 1 and 2 show that, compared with the traditional CHI method, the classification accuracy of the present invention is improved on both the balanced and the unbalanced corpus.

Brief Description of the Drawings

Figure 1 is a flow chart of the overall process of the present invention.

Figure 2 is a flow chart of the computation of the scale factor μ in the present invention.

Detailed Description of the Embodiments

The present invention is implemented by the following technical means:

An adaptive text feature selection method based on chi-square statistics. First, the training and test text sets are preprocessed, including word segmentation and stop-word removal. Second, adaptive text feature selection based on chi-square statistics is performed: the word frequency factor α and the inter-class variance β are defined and introduced into the CHI algorithm, and a suitable scale factor μ is added. Finally, combined with the classic KNN algorithm, the scale factor μ is adjusted automatically so that the improved CHI is applicable to different corpora and ensures high classification accuracy.

The above adaptive text feature selection method based on chi-square statistics is used for text classification and comprises the following steps:

Step 1: download from the Internet the Chinese corpus released by Fudan University, consisting of a training text set and a test text set;

Step 2: use the word segmentation software ICTCLAS to preprocess the training and test text sets, including word segmentation and stop-word removal, and obtain the segmented training and test text sets;

Step 3: apply the CHI-based adaptive text feature selection method to the segmented training text set to obtain the feature vocabulary corresponding to the training text set;

The calculation formula of the traditional CHI text feature selection method is as follows:

where A is the number of documents that contain feature t_k and belong to category C_i, B is the number of documents that contain t_k but do not belong to C_i, C is the number of documents that do not contain t_k but belong to C_i, and D is the number of documents that neither contain t_k nor belong to C_i.

A CHI-based adaptive text feature selection method is proposed, with the following formula:

χ²_new(t_k, C_i) = [μ·χ²(t_k, C_i)⁺ + (1 − μ)·χ²(t_k, C_i)⁻]·α·β    (2)

where μ is the adaptive scale factor, α is the word frequency factor, and β is the inter-class variance; α and β are defined as follows:

where m is the total number of categories in the training set, tf(t_k, C_i) is the number of occurrences of feature item t_k in category C_i, and the sum over all categories is the number of occurrences of t_k in the entire training text set. If category C_i of the training set contains the p documents d_i1, d_i2, ..., d_ij, ..., d_ip, then tf(t_k, d_ij) is the number of times t_k occurs in the j-th document of C_i, the sum over these documents is the total number of occurrences of t_k in C_i, and the sum over all categories and documents is the total number of occurrences of t_k in all documents of the training text set. The word frequency factor α is the proportion of the term frequency of t_k in a specific class to its term frequency over the whole training set. The larger α is, the more frequently the feature item occurs in the specific class and the more rarely (or not at all) it occurs in other classes, so such a feature item has stronger class-discriminating ability; the smaller α is, the less frequently it occurs in the specific class and the more frequently it occurs in other classes, so it has weaker class-discriminating ability.

where m is the number of categories, df_i is the number of documents in category C_i that contain t_k, and the average number of documents containing t_k per category is the mean of the df_i. The two cases distinguished are whether the number of texts in which the feature word appears in the specific class is greater than or equal to this average, or less than it. The β value measures the degree of deviation between the document frequency of a feature word in a given class and the average document frequency over all classes. The larger β is, the more the number of documents containing t_k in category C_i exceeds the average number of documents containing t_k over all classes, and such a feature item has stronger class-discriminating ability.

Step 4: represent each training text and each test text as a vector over the feature words of the feature vocabulary. The weight of each dimension is computed as TFIDF = TF × IDF, where TF (term frequency) is the number of times the feature item appears in the document, and IDF (inverse document frequency) is computed as IDF = log(M/n_k + 0.01), where M is the number of texts in the document collection and n_k is the number of documents containing the word;

Step 5: perform KNN classification.

The training text set is S, the test text is d, n is the feature vector dimension threshold, and K is set to 35.

Use the cosine of the angle between the vectors to compute the similarity between the test text d and every text in S; select the K texts with the highest similarity as the K nearest neighbors of d; compute the weight of d for each category and assign d to the category with the largest weight.

Let the known category of the training text d_i be C_j; the weight calculation formula is as follows:

where Sim(d, d_i) is the cosine similarity between the test text d and the known-category text d_i, with the following formula:

where n is the feature vector dimension threshold, X_j is the weight of the j-th dimension of the text d to be classified (0 < j ≤ n), and x_ij is the weight of the j-th dimension of the training text vector d_i.

y(d_i, C_j) is the category attribute function, with the following formula:

Step 6: compute the precision, recall, and F1 value of the KNN classification algorithm; the final value of the scale factor is obtained by setting a maximum threshold on the difference between the F1 values of two consecutive classification runs and a step size by which the scale factor μ grows.

Set the initial F1 value to 0 and the initial μ to 0.5; ε = 0.0001 is the threshold on the difference between two consecutive F1 values, and τ = 0.05 is the step size by which the scale factor μ grows.

Repeat step 5 to obtain F1′ and compute ΔF = |F1′ − F1|. If ΔF is less than ε, take the current scale factor μ; if ΔF is greater than or equal to ε, set μ′ = μ + τ and F1 = F1′, and repeat this iteration until a suitable scale factor μ is obtained, ensuring high classification accuracy.

Table 1: Comparison of results before and after the algorithm improvement (balanced corpus) (%)

Table 2: Comparison of results before and after the algorithm improvement (unbalanced corpus) (%)

Claims (1)

1. An adaptive feature selection method based on chi-square statistics, characterized by comprising the following steps:
step 1, downloading from the Internet the Chinese corpus released by Fudan University, namely a training text set and a test text set;
step 2, performing word segmentation and stop-word removal preprocessing on the training text set and the test text set using the word segmentation software ICTCLAS, to obtain a segmented training text set and test text set;
step 3, performing feature selection on the training text set after word segmentation by adopting a CHI-based adaptive text feature selection method to obtain a feature word bank corresponding to the training text set;
the calculation formula of the traditional CHI text feature selection method is as follows:
wherein A represents the number of documents that contain the feature t_k and belong to class C_i, B represents the number of documents that contain t_k and do not belong to C_i, C represents the number of documents that do not contain t_k and belong to C_i, and D represents the number of documents that do not contain t_k and do not belong to C_i;
a CHI-based adaptive text feature selection method is provided, and the formula is as follows:
χ²_new(t_k, C_i) = [μ·χ²(t_k, C_i)⁺ + (1 − μ)·χ²(t_k, C_i)⁻]·α·β
wherein μ is a scale factor, α is a word frequency factor, β is an inter-class variance, and the formulas α and β are defined as follows:
where m is the total number of classes in the training set, tf(t_k, C_i) represents the number of occurrences of the feature item t_k in class C_i, and the sum over all classes represents the number of occurrences of t_k in the entire training text set; class C_i in the training set contains p documents d_i1, d_i2, ..., d_ij, ..., d_ip; tf(t_k, d_ij) represents the number of occurrences of t_k in the j-th document of class C_i, the sum over these documents represents the total number of occurrences of t_k in class C_i, and the sum over all classes and documents represents the total number of occurrences of t_k in all documents of the entire training text set; the word frequency factor α indicates the proportion of the term frequency of t_k in a specific class C_i to the term frequency of t_k in the whole training set; the larger α is, the more frequently the feature item occurs in the specific class and the less (or hardly at all) it occurs in other classes, so such a feature item has stronger class-distinguishing capability; the smaller α is, the less frequently the feature item occurs in the specific class and the more frequently it occurs in other classes, so such a feature item has weaker class-distinguishing capability;
wherein m represents the number of classes, df_i is the number of documents in class C_i that contain t_k, and the average number of documents containing t_k per class is the mean of the df_i; the two cases distinguished are whether the number of texts in which the feature word appears in a specific class is greater than or equal to this average value, or less than the average value; the β value is used to measure the degree of deviation between the document frequency of a feature word in a certain class and the average document frequency over all classes; the larger β is, the more the number of documents containing the feature word t_k in class C_i exceeds the average number of documents containing t_k over all classes, and such a feature item has higher class-distinguishing capability;
step 4, representing each training text and each test text in vector form using the feature words of the feature word bank, wherein the weight of each dimension is calculated according to TFIDF = TF × IDF, TF (term frequency) is the number of times the feature item appears in the document, IDF (inverse document frequency) is calculated as IDF = log(M/n_k + 0.01), M is the number of texts contained in the document set, and n_k represents the number of documents containing the word;
step 5, KNN classification is carried out;
the training text set is S, the test text is d, n is a feature vector dimension threshold, and K is 35;
step 5.1, calculating the similarity between the test text d and all texts in S using the cosine of the angle between their vectors;
step 5.2, selecting the K texts with the maximum similarity obtained in the step 5.1 as K nearest neighbor texts of the test text d;
step 5.3, calculating the weight of the test text d belonging to each category, and classifying the test text d into the category with the maximum weight;
let the known class of the training text d_i be C_j; the weight calculation formula is as follows:
wherein Sim(d, d_i) is the cosine similarity between the test text d and the known-category text d_i, and the formula is as follows:
where n is the feature vector dimension threshold, X_j represents the weight of the j-th dimension of the text d to be classified, wherein 0 < j ≤ n, and x_ij represents the weight of the j-th dimension of the training text vector d_i;
y(d_i, C_j) is the category attribute function, and the formula is as follows:
step 6, calculating the precision, recall and F_1 value of the KNN classification algorithm, and obtaining the final value of the scale factor μ by setting a maximum threshold on the difference between the F_1 values of two successive classification runs and the step size by which the scale factor μ grows, so as to ensure higher classification accuracy;
step 6.1, setting the initial F_1 value to 0 and the initial value of μ to 0.5, wherein ε = 0.0001 is the threshold on the difference between two successive F_1 values and τ = 0.05 is the step size of the increase of the scale factor μ;
step 6.2, repeating step 5 to obtain the value of F_1′ and the difference between the two successive F_1 values, ΔF = |F_1′ − F_1|;
step 6.3, if ΔF is smaller than ε, obtaining the scale factor μ at that moment; if ΔF is greater than or equal to ε, letting μ′ = μ + τ and F_1 = F_1′, and repeating step 6.2 and step 6.3 until a suitable scale factor μ is obtained, so as to ensure higher classification accuracy.
CN201510927759.9A 2015-12-14 2015-12-14 An Adaptive Feature Selection Method Based on Chi-square Statistics Expired - Fee Related CN105512311B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510927759.9A CN105512311B (en) 2015-12-14 2015-12-14 An Adaptive Feature Selection Method Based on Chi-square Statistics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510927759.9A CN105512311B (en) 2015-12-14 2015-12-14 An Adaptive Feature Selection Method Based on Chi-square Statistics

Publications (2)

Publication Number Publication Date
CN105512311A CN105512311A (en) 2016-04-20
CN105512311B true CN105512311B (en) 2019-02-26

Family

ID=55720291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510927759.9A Expired - Fee Related CN105512311B (en) 2015-12-14 2015-12-14 An Adaptive Feature Selection Method Based on Chi-square Statistics

Country Status (1)

Country Link
CN (1) CN105512311B (en)

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021461A (en) * 2016-05-17 2016-10-12 深圳市中润四方信息技术有限公司 Text classification method and text classification system
CN108073567B (en) * 2016-11-16 2021-12-28 北京嘀嘀无限科技发展有限公司 Feature word extraction processing method, system and server
CN108090088A (en) * 2016-11-23 2018-05-29 北京国双科技有限公司 Feature extracting method and device
CN106611057B (en) * 2016-12-27 2019-08-13 上海利连信息科技有限公司 The text classification feature selection approach of importance weighting
CN107291837B (en) * 2017-05-31 2020-04-03 北京大学 A word segmentation method for web text based on domain adaptability
CN107256214B (en) * 2017-06-30 2020-09-25 联想(北京)有限公司 Junk information judgment method and device and server cluster
CN107577794B (en) * 2017-09-19 2019-07-05 北京神州泰岳软件股份有限公司 A kind of news category method and device
CN108197307A (en) * 2018-01-31 2018-06-22 湖北工业大学 The selection method and system of a kind of text feature
CN108491429A (en) * 2018-02-09 2018-09-04 湖北工业大学 A kind of feature selection approach based on document frequency and word frequency statistics between class in class
CN108376130A (en) * 2018-03-09 2018-08-07 长安大学 A kind of objectionable text information filtering feature selection approach
CN108346474B (en) * 2018-03-14 2021-09-28 湖南省蓝蜻蜓网络科技有限公司 Electronic medical record feature selection method based on word intra-class distribution and inter-class distribution
CN108920545B (en) * 2018-06-13 2021-07-09 四川大学 Chinese sentiment feature selection method based on extended sentiment dictionary and chi-square model
CN109325511B (en) * 2018-08-01 2020-07-31 昆明理工大学 A method to improve feature selection
CN109543037A (en) * 2018-11-21 2019-03-29 南京安讯科技有限责任公司 A kind of article classification method based on improved TF-IDF
CN112256865B (en) * 2019-01-31 2023-03-21 青岛科技大学 Chinese text classification method based on classifier
CN110069630B (en) * 2019-03-20 2023-07-21 重庆信科设计有限公司 Improved mutual information feature selection method
CN110705247B (en) * 2019-08-30 2020-08-04 山东科技大学 Text similarity calculation method based on χ2-C
CN110688481A (en) * 2019-09-02 2020-01-14 贵州航天计量测试技术研究所 Text classification feature selection method based on chi-square statistic and IDF
CN111144106B (en) * 2019-12-20 2023-05-02 山东科技大学 Two-stage text feature selection method under unbalanced data set
CN111062212B (en) * 2020-03-18 2020-06-30 北京热云科技有限公司 Feature extraction method and system based on optimized TFIDF
CN112200259A (en) * 2020-10-19 2021-01-08 哈尔滨理工大学 A kind of information gain text feature selection method and classification device based on classification and screening
CN113032564B (en) * 2021-03-22 2023-05-30 建信金融科技有限责任公司 Feature extraction method, device, electronic equipment and storage medium
CN113515623B (en) * 2021-04-28 2022-12-06 西安理工大学 Feature selection method based on word frequency difference factor
CN113378567B (en) * 2021-07-05 2022-05-10 广东工业大学 An improved Chinese short text classification method for low-frequency words
CN113987536A (en) * 2021-10-27 2022-01-28 建信金融科技有限责任公司 Method, device, electronic device and medium for determining security level of field in data sheet

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103678274A (en) * 2013-04-15 2014-03-26 南京邮电大学 Feature extraction method for text categorization based on improved mutual information and entropy
CN103886108A (en) * 2014-04-13 2014-06-25 北京工业大学 Feature selection and weight calculation method of imbalance text set
CN104750844A (en) * 2015-04-09 2015-07-01 中南大学 Method and device for generating text characteristic vectors based on TF-IGM, method and device for classifying texts

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9311390B2 (en) * 2008-01-29 2016-04-12 Educational Testing Service System and method for handling the confounding effect of document length on vector-based similarity scores

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103678274A (en) * 2013-04-15 2014-03-26 南京邮电大学 Feature extraction method for text categorization based on improved mutual information and entropy
CN103886108A (en) * 2014-04-13 2014-06-25 北京工业大学 Feature selection and weight calculation method of imbalance text set
CN104750844A (en) * 2015-04-09 2015-07-01 中南大学 Method and device for generating text characteristic vectors based on TF-IGM, method and device for classifying texts

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
An optimized mutual information text feature selection method based on word frequency; 刘海峰 (Liu Haifeng); 《计算机工程》 (Computer Engineering); 2014-12-31; pp. 179-182

Also Published As

Publication number Publication date
CN105512311A (en) 2016-04-20

Similar Documents

Publication Publication Date Title
CN105512311B (en) An Adaptive Feature Selection Method Based on Chi-square Statistics
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
CN105224695B (en) A kind of text feature quantization method and device and file classification method and device based on comentropy
CN102622373B (en) Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm
CN104750844B (en) Text eigenvector based on TF-IGM generates method and apparatus and file classification method and device
Fan et al. Research on text classification based on improved tf-idf algorithm
CN104391835B (en) Feature Words system of selection and device in text
CN110543564B (en) Domain Label Acquisition Method Based on Topic Model
CN107066555B (en) On-line theme detection method for professional field
CN108763348B (en) Classification improvement method for feature vectors of extended short text words
CN108197144B (en) Hot topic discovery method based on BTM and Single-pass
CN103294817A (en) Text feature extraction method based on categorical distribution probability
CN109960799A (en) An optimized classification method for short texts
CN104346459A (en) Text classification feature selecting method based on term frequency and chi-square statistics
CN107908624A (en) A kind of K medoids Text Clustering Methods based on all standing Granule Computing
CN112579783B (en) Short text clustering method based on Laplace atlas
CN109376235B (en) Feature selection method based on document layer word frequency reordering
CN106570076A (en) Computer text classification system
CN114896398A (en) Text classification system and method based on feature selection
CN113032573A (en) Large-scale text classification method and system combining theme semantics and TF-IDF algorithm
CN110348497B (en) Text representation method constructed based on WT-GloVe word vector
CN113626604B (en) Web Page Text Classification System Based on Maximum Spacing Criterion
CN108614825B (en) Webpage feature extraction method and device
CN105760471B (en) Two-class text classification method based on combined convex linear perceptron
Luo A new text classifier based on random forests

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190226

Termination date: 20211214