CN100495408C - Text clustering meta-learning method and device - Google Patents
Text clustering meta-learning method and device
- Publication number
- CN100495408C (application CN200710117752A)
- Authority
- CN
- China
- Prior art keywords
- text
- clustering
- matrix
- unit
- soft
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a text clustering meta-learning method comprising: performing soft clustering or soft classification on a text set with text analysis methods to obtain at least two clustering or classification results; expressing each result as a processing result matrix and splicing the processing result matrices into a text vector matrix; and performing meta-learning on the text vector matrix to obtain the final clustering result. The invention also discloses a text clustering meta-learning device comprising a text analysis module, a matrix synthesis module and a meta-learning module. Because the method and device take the results of multiple clustering and classification runs into account, they yield more accurate text clustering results, effectively reduce the bias introduced by any single clustering run, and improve the accuracy and stability of the clustering results.
Description
Technical Field
The present invention relates to text clustering methods, and in particular to a text clustering meta-learning method and device.
Background Art
Text clustering is a form of cluster analysis applied to the field of text processing. A text clustering method automatically discovers clusters in a text set and partitions all the texts into them, so that texts within the same cluster are highly similar in content while texts in different clusters differ substantially. Text clustering has many applications. For example, the Topic Detection and Tracking (TDT) project of the U.S. Department of Defense uses text clustering to automatically discover hot topics in a stream of news text. Text clustering can also group the result pages returned by a search engine so that users obtain more structured and understandable search results, and it can automatically produce classification hierarchies of web texts similar to the Yahoo Directory.
Current text clustering methods are usually based on the vector space model. In this model, each text is represented as a vector in a multi-dimensional Euclidean space in which every dimension corresponds to a feature word; the value of a text vector in a given dimension is commonly defined as the number of times the corresponding feature word occurs in the text. For any text set, the vector space model yields a feature-word-based text vector matrix V(n×k), where n is the number of texts in the set, k is the dimensionality of each text vector, and each row of the matrix corresponds to one text vector. Once this matrix is obtained, classical clustering algorithms such as K-means or hierarchical agglomerative clustering (HAC) can be applied to it to produce text clustering results.
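As a minimal illustration of the vector space model just described, the n×k matrix V can be built as follows (whitespace tokenization stands in here for real word segmentation, which Chinese text would actually require):

```python
from collections import Counter

def build_text_vector_matrix(texts, feature_words):
    """Build the n x k matrix V described above: row i holds the
    occurrence counts of each feature word in text i."""
    rows = []
    for text in texts:
        counts = Counter(text.split())  # count each word in this text
        rows.append([counts[w] for w in feature_words])
    return rows

V = build_text_vector_matrix(["apple banana apple", "banana cherry"],
                             ["apple", "banana", "cherry"])
```

Each row of `V` is then the vector handed to whichever clustering algorithm is chosen.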
Existing clustering algorithms can be roughly divided into hierarchical, partitioning, density-based, grid-based and model-based algorithms. Among the partitioning algorithms, K-means has long been one of the most widely used. K-means assigns samples to categories by comparing the distance between each data sample and the class center points, iteratively dividing the data set into K parts, where K, the desired number of clusters, must be specified in advance. Specifically, K-means comprises three steps: first, determine K initial class center points in the data set, each representing one cluster; second, assign every data sample to the cluster whose center point is nearest; third, compute the center point of each cluster currently formed, replace the original center points, and return to the second step. The second and third steps are repeated until the result converges, i.e. until no data sample changes cluster, thereby achieving the partitioning.
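The three steps above can be sketched as follows (a minimal illustration with random initial centers, not the patent's specific implementation):

```python
import random

def kmeans(points, k, max_iter=100):
    """Plain K-means following the three steps described above."""
    centers = random.sample(points, k)          # step 1: K initial class centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                        # step 2: assign to nearest center
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        new_centers = [[sum(dim) / len(cl) for dim in zip(*cl)] if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        if new_centers == centers:              # converged: centers no longer change
            break
        centers = new_centers                   # step 3: recompute centers, repeat
    return centers, clusters

random.seed(0)
centers, clusters = kmeans([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]], k=2)
```

On this toy data the two well-separated pairs end up in separate clusters regardless of which two points are drawn as initial centers.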
Besides clustering, text classification is another approach to text analysis. Unlike clustering, classification requires manual training: categories must be specified by hand and some training data provided for each category; the category of a text under test is then judged from the difference between that text and the training data. A commonly used text classification method is the K-Nearest Neighbor (KNN) algorithm.
Ordinary text clustering and classification methods assign each text in a text set to one specific cluster or category. Soft clustering and soft classification extend these methods: instead of placing a text into a single cluster or category, they assign it to multiple clusters or categories with different probabilities. Generally, the results obtained by soft clustering and soft classification are more sound.
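The patent does not fix a formula for these membership probabilities; one common illustrative scheme is a softmax over negative squared distances to the class centers (the `beta` sharpness parameter is an assumption of this sketch):

```python
import math

def soft_memberships(point, centers, beta=1.0):
    """Turn distances from a point to each class center into membership
    probabilities. Illustrative scheme only; the patent does not specify
    how soft assignments are computed."""
    d2 = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centers]
    expd = [math.exp(-beta * d) for d in d2]    # closer center -> larger weight
    total = sum(expd)
    return [e / total for e in expd]            # probabilities summing to 1

probs = soft_memberships([0.0, 0.0], [[0.0, 0.0], [3.0, 4.0]], beta=0.1)
```

The resulting probability vector, rather than a single cluster label, is what a soft method contributes for each text.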
A major problem with current text clustering methods is their poor stability: for different text sets, the results of applying a given clustering method may be good or bad. It can even happen that on one text set clustering method A yields better results than clustering method B, while on another text set method A yields worse results than method B.
To solve this problem, some researchers have proposed text clustering meta-learning methods, which improve the stability of text clustering by combining the clustering results produced by several text clustering algorithms. Meta-learning here means learning again from learning results, i.e. re-learning over the clustering results produced by the multiple algorithms. Such meta-learning can be realized through different strategies, of which the most intuitive is the voting (consensus) method, which is very easy to implement: several text clustering algorithms are used, each gives its own judgment of which cluster an object should belong to, and the cluster receiving the most votes becomes the object's final assignment. In addition, Alexander Strehl et al. proposed three voting-based meta-learning methods. The first, the Cluster-based Similarity Partitioning Algorithm (CSPA), evaluates the similarity between every pair of objects from their co-membership in the same clusters and re-partitions the clusters according to these similarities, thereby obtaining better clustering results. The second, the HyperGraph Partitioning Algorithm (HGPA), treats text clustering meta-learning as a hypergraph partitioning problem in which the hyperedges are the produced clusters. The third regards meta-learning as a cluster correspondence process: similar clusters in two different clustering results are matched, and the final clusters are synthesized from the matches. This third method must first solve the cluster correspondence problem, but since text clustering has no fixed category system, matching the clusters of two different results one to one is very difficult. A looser variant is to keep only the clusters that appear in both results and ignore the rest; but then, when many clustering runs are performed, such recurring clusters become very rare. The criterion can be relaxed further so that clusters from different results are matched whenever they are sufficiently similar, but issues such as choosing the similarity threshold then introduce instability into the meta-learning itself, making it hard to achieve the goal of using meta-learning to improve algorithm stability.
In summary, although prior-art text clustering meta-learning methods combine the results produced by multiple clustering algorithms, the meta-learning over those results is performed by voting, so clusters cannot be accurately partitioned on the basis of similarity, and both the accuracy and the stability of the clustering results remain low.
Summary of the Invention
In view of this, the present invention provides a text clustering meta-learning method and device that improve the accuracy and stability of clustering results.
To achieve the above object, the technical solution of the present invention is realized as follows:
An embodiment of the present invention provides a text clustering meta-learning method comprising the following steps:
A. Perform soft clustering or soft classification on a text set with text analysis methods to obtain at least two clustering or classification results.
B. Express each clustering or classification result as a processing result matrix, and splice the processing result matrices into a text vector matrix.
C. Perform meta-learning on the text vector matrix to obtain the final clustering result.
An embodiment of the present invention also provides a text clustering meta-learning device comprising a text analysis module, a matrix synthesis module and a meta-learning module.
The text analysis module performs soft clustering or soft classification on a text set and sends the obtained clustering or classification results to the matrix synthesis module.
The matrix synthesis module converts the received clustering or classification results into matrices, splices the converted matrices into a text vector matrix, and sends the text vector matrix to the meta-learning module.
The meta-learning module performs meta-learning on the received text vector matrix and outputs the final clustering result.
As can be seen from the above technical solution, the text clustering meta-learning method and device of the present invention synthesize a text vector matrix from multiple text analysis results and perform meta-learning on that matrix, thereby taking the analysis results of multiple soft clustering or soft classification runs into account. More accurate text clustering results can therefore be obtained, the bias introduced by any single clustering run is effectively reduced, and the accuracy and stability of the clustering results are improved.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of the text clustering meta-learning method in an embodiment of the present invention.
Fig. 2 is a flow chart of the text clustering meta-learning method in an embodiment of the present invention.
Fig. 3 is a structural diagram of the text clustering meta-learning device in an embodiment of the present invention.
Detailed Description of the Embodiments
To make the object, technical solution and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
The present invention provides a text clustering meta-learning method that first processes a text set several times with soft clustering and soft classification methods to obtain processing results, then constructs a text vector matrix from these multiple results, and finally applies a clustering algorithm to the text vector matrix for meta-learning to obtain the final clustering result.
Fig. 1 is a schematic diagram of the text clustering meta-learning method in an embodiment of the present invention. As shown in Fig. 1, several text soft clustering and soft classification methods are first used to process the text set, yielding multiple clustering results. A text vector matrix is then constructed from these results, the matrix is normalized, and finally a meta-learning method based on a hard clustering algorithm is applied to the matrix to obtain the final text clustering result.
Fig. 2 is a flow chart of the text clustering meta-learning method in an embodiment of the present invention. As shown in Fig. 2, the method comprises the following steps:
Step 201: process the text set m times with text analysis methods.
The text analysis methods here comprise soft clustering methods and soft classification methods. These are two different kinds of data mining method, each with its own application areas. Both produce a set of classes together with, for each text, the probability that it belongs to a particular class: for soft clustering a class represents a cluster discovered by the method, while for soft classification a class represents a specific category. In the embodiments of the present invention these two kinds of method are combined organically, so that more accurate and stable clustering results can be obtained.
Processing the text set m times with text analysis methods may mean using one or more soft clustering methods alone, using one or more soft classification methods alone, or using one or more soft clustering and soft classification methods together. For convenience, the following description takes the combined use of soft clustering and soft classification for m processing runs as the example.
In this step, publicly known soft clustering and soft classification methods such as soft K-means and soft KNN can be used to process the text set. In practice, n algorithms may be selected from the soft clustering and soft classification methods, and each selected algorithm used to process the text set m times; n and m are integers that can be set in advance and satisfy n ≥ 1 and m ≥ 1. The same method is run m times because even the same method applied to the same text set may produce different results each time: for soft K-means, for instance, every run on the same text set can yield a different result. To improve the accuracy of the final outcome, the same method is therefore run m times, producing m processing results.
In addition, before the soft clustering and soft classification methods are applied, the text set must be preprocessed to convert its texts into feature-word-based text vectors, so that the soft clustering and soft classification methods can process them. Preprocessing mainly comprises three steps: word segmentation, feature selection and text vectorization; through these three steps every text in the set is converted into a vector in a multi-dimensional Euclidean space. The specific steps are as follows:
1) First, word segmentation is performed.
Word segmentation divides the texts in the text set into individual words and counts the number of times each word occurs in each text; these counts serve as the basis for text vectorization.
2) Next, feature selection is performed.
Because a text set contains a very large number of distinct words, many of which contribute nothing to distinguishing clusters and may even harm the clustering effect, feature selection must be applied to the words in the text set. Its purpose is to retain the words that help distinguish clusters and remove most of those that do not, thereby reducing the dimensionality of the text vectors without affecting the clustering effect. Several feature selection methods can be used in the present invention, including removing stop words, removing words that occur in too many or too few texts, and removing words that occur too few times within a single text. Combining these methods can reduce the number of feature words in a text set from tens of thousands to around one thousand with almost no impact on the clustering effect.
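A minimal sketch of the document-frequency-based filtering described above (the `min_df` and `max_df_ratio` thresholds are illustrative assumptions, not values from the patent):

```python
def select_features(tokenized_texts, stop_words, min_df=2, max_df_ratio=0.5):
    """Keep words that are not stop words and occur in neither too few
    nor too many documents, as described above."""
    n = len(tokenized_texts)
    df = {}                                  # document frequency of each word
    for tokens in tokenized_texts:
        for w in set(tokens):
            df[w] = df.get(w, 0) + 1
    return sorted(
        w for w, c in df.items()
        if w not in stop_words and c >= min_df and c / n <= max_df_ratio
    )

features = select_features(
    [["the", "cat", "sat"], ["the", "dog", "sat"],
     ["the", "bird", "flew"], ["the", "cat", "flew"]],
    stop_words={"the"},
)
```

Here `the` is dropped as a stop word, `dog` and `bird` are dropped for occurring in only one text each, and the remaining mid-frequency words survive.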
3) Finally, text vectorization is performed.
Text vectorization converts each text in the text set into a vector in a multi-dimensional Euclidean space in which every dimension corresponds to a feature word; the value of a text vector in a given dimension is the weight of the corresponding feature word in that text. Commonly used weighting schemes include term frequency (TF), term frequency-inverse document frequency (TFIDF), tfc and ltc. Compared with the other schemes, ltc reduces the influence of differing text lengths on vectorization and clustering, so the present invention uses the ltc scheme. The formula for computing the weights with ltc is:
Here N is the number of texts in the text set, M is the number of selected feature words, f<sub>ik</sub> is the number of times the i-th feature word occurs in the k-th text, and n<sub>i</sub> is the number of texts containing the i-th feature word.
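The formula itself did not survive extraction from the original document; a standard ltc weight (log-scaled term frequency times inverse document frequency, cosine-normalized per text), consistent with the variables defined above, would be:

```latex
w_{ik} \;=\; \frac{\log(f_{ik} + 1.0)\,\log\!\left(N / n_i\right)}
{\sqrt{\displaystyle\sum_{j=1}^{M}\bigl[\log(f_{jk} + 1.0)\,\log\!\left(N / n_j\right)\bigr]^{2}}}
```

The denominator normalizes each text vector to unit length, which is what makes the weights insensitive to the differing lengths of the texts.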
After this preprocessing, the soft clustering and soft classification methods described above are applied to the text set multiple times to obtain the processing results.
The above describes processing the text set m times with soft clustering and soft classification methods together; the steps for using a soft clustering method or a soft classification method alone are similar and are not repeated here.
Step 202: construct the text vector matrix from the m processing results.
In this step, the text vector matrix is constructed as follows:
First, each processing result obtained by a soft clustering or soft classification method is expressed as a matrix whose rows represent the clusters produced by the soft clustering method or the categories of the soft classification method and whose columns represent the texts in the text set; each entry is the probability that a particular text belongs to a particular cluster or category. For example, for a text set Corpus: (d1, d2, d3, ..., dn) containing n texts that is processed m times by soft clustering and soft classification, m partition results are obtained, denoted Partition: (P1, P2, P3, ..., Pm), where Pi is the result of the i-th soft clustering or soft classification run and can be written as Pi = (v1i, v2i, ..., vni).
Here vli is the result of the l-th text in the i-th soft clustering or soft classification run, i.e. the vector of probabilities with which that text belongs to each cluster or category.
Through the above method, all the soft clustering and soft classification result matrices are obtained. These matrices are then spliced into one block matrix M, shown below; M is the text vector matrix.
M = [M1 M2 ... Mm]  (3)
Here Mi is the processing result matrix obtained by a soft clustering or soft classification run, and m is the number of soft clustering and soft classification runs performed.
Because different soft clustering and soft classification methods differ in accuracy, the user can, according to the actual situation, assign different weights k in advance to the processing results obtained by the different methods. Each processing result matrix is then multiplied by its corresponding weight k before all the matrices are spliced into one text vector matrix, as follows:
M = [k1M1 k2M2 ... kmMm]  (4)
Here kj is the weight preset for the j-th soft clustering or soft classification result, and m is the number of different soft clustering and soft classification methods used.
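A small sketch of the weighted splicing in formula (4), using randomly generated membership matrices (the weights and run sizes are made up for illustration; the sketch stores one text per row, i.e. the transpose of the row-per-cluster orientation described earlier, so that the spliced M serves directly as a text vector matrix):

```python
import numpy as np

# Hypothetical setup: m = 3 runs over n = 4 texts, producing 2, 3 and 2
# clusters/categories; each row of a result matrix holds the membership
# probabilities of one text in that run.
rng = np.random.default_rng(0)
result_matrices = [rng.dirichlet(np.ones(c), size=4) for c in (2, 3, 2)]
weights = [1.0, 0.5, 2.0]  # user-chosen per-run weights k_j

# Splice the weighted result matrices side by side into the block matrix M.
M = np.hstack([k * Mi for k, Mi in zip(weights, result_matrices)])
```

Each of the 4 rows of `M` now concatenates that text's (weighted) membership probabilities from all three runs, giving a 7-dimensional meta-feature vector per text.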
Step 203: perform meta-learning to obtain the final clustering result.
In this step, the embodiment of the present invention first normalizes the text vector matrix and then applies a clustering method to it for meta-learning, obtaining the final text clustering result. The meta-learning is a re-learning of the soft clustering and soft classification learning results of the previous step. Common clustering algorithms in the field, such as K-means, hierarchical agglomerative clustering or density-based clustering, can be used for the normalization and clustering meta-learning of the text vector matrix. The final clustering result is P = {C1, C2, ..., CK}, where K is the number of clusters produced and Ci is the i-th cluster.
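The patent does not specify which normalization is used; one common choice compatible with the clustering algorithms listed is row-wise L2 normalization, sketched here:

```python
import numpy as np

def normalize_rows(M):
    """L2-normalize every row so each text vector has unit length before
    the meta-learning clustering step (one common normalization choice,
    assumed here rather than taken from the patent)."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged
    return M / norms

# Hypothetical spliced matrix for 3 texts; any ordinary clustering
# algorithm (K-means, HAC, ...) can then be run on the normalized rows.
M = np.array([[3.0, 4.0], [0.0, 2.0], [1.0, 0.0]])
Mn = normalize_rows(M)
```

After normalization, Euclidean distances between rows behave like cosine distances, so texts with similar membership profiles across runs end up close together.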
In summary, the present invention provides a text clustering meta-learning method that first vectorizes the text set and processes it several times with soft clustering and soft classification algorithms, then constructs from the obtained soft clustering and soft classification results a text vector matrix in which every entry is the probability that a text belongs to a cluster produced by a soft clustering method or to a category of a soft classification method, and finally applies an ordinary clustering algorithm to that matrix for meta-learning to obtain the final text clustering result. Because the method comprehensively considers the results of multiple clustering and classification runs, more accurate text clustering results can be obtained, the bias introduced by any single clustering run is effectively reduced, and the accuracy and stability of the clustering results are improved.
In addition, an embodiment of the present invention further provides a text clustering meta-learning device. FIG. 3 is a structural diagram of the text clustering meta-learning device in an embodiment of the present invention. As shown in FIG. 3, the device includes: a preprocessing module 300, a text analysis module 301, a matrix synthesis module 302, and a meta-learning module 303. The preprocessing module 300 vectorizes the texts in the text set and sends the vectorized text set to the text analysis module 301. The text analysis module 301 performs text analysis on the text set to be processed and sends the resulting analysis results to the matrix synthesis module 302; the text analysis methods used in the text analysis module 301 include soft clustering methods and soft classification methods. The matrix synthesis module 302 synthesizes a text vector matrix from the received analysis results and sends the synthesized matrix to the meta-learning module 303. The meta-learning module 303 performs meta-learning on the received text vector matrix and outputs the final clustering result.
In the above text clustering meta-learning device, the preprocessing module 300 is not a required module and is therefore drawn with dashed lines. As shown in FIG. 3, the preprocessing module 300 further includes: a word segmentation unit 304, a feature selection unit 305, and a text vectorization unit 306. The word segmentation unit 304 splits the texts in the text set into individual words, counts how many times each word appears in the text set, and sends the segmentation and count results to the feature selection unit 305. The feature selection unit 305 selects feature words from the text set according to the received segmentation and count results and sends the selected feature words to the text vectorization unit 306. The text vectorization unit 306 converts the texts in the text set into text vectors according to the received feature words and sends the vectorized text set to the text analysis module 301.
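The three preprocessing units can be sketched as follows. This is a minimal illustration, not the patented implementation: whitespace splitting stands in for real word segmentation, and keeping the most frequent words is one hypothetical feature selection criterion among many.

```python
from collections import Counter

def segment(text):
    # Stand-in for the word segmentation unit (304): for this
    # illustration we simply split on whitespace.
    return text.lower().split()

def select_features(docs, top_n=4):
    # Feature selection unit (305): keep the top_n most frequent
    # words across the text set (a hypothetical criterion).
    counts = Counter(w for d in docs for w in segment(d))
    return [w for w, _ in counts.most_common(top_n)]

def vectorize(doc, features):
    # Text vectorization unit (306): term-frequency vector over
    # the selected feature words.
    counts = Counter(segment(doc))
    return [counts[w] for w in features]

docs = ["the cat sat on the mat", "the dog sat", "cat and dog"]
features = select_features(docs)
vectors = [vectorize(d, features) for d in docs]
```

Each row of `vectors` is one text's vector, ready to be handed to the text analysis module for soft clustering and soft classification.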
As shown in FIG. 3, the matrix synthesis module 302 further includes: a matrixing unit 307 and a synthesis unit 308. The matrixing unit 307 converts the received text analysis results into matrices and sends the converted matrices to the synthesis unit 308; the synthesis unit 308 combines all the received matrices into a single text vector matrix and sends the synthesized matrix to the meta-learning module 303.
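The synthesis step amounts to placing the per-method membership matrices side by side, one combined row per text. The sketch below is an assumption-laden illustration (the probability values and the 2-cluster / 3-class shapes are invented for the example):

```python
def synthesize(result_matrices):
    # Synthesis unit (308): concatenate the per-method membership
    # matrices horizontally, producing one combined row per text.
    n_texts = len(result_matrices[0])
    assert all(len(m) == n_texts for m in result_matrices)
    return [[x for m in result_matrices for x in m[i]] for i in range(n_texts)]

# Hypothetical outputs of the matrixing unit (307) for 3 texts:
# one soft clustering with 2 clusters, one soft classifier with 3 classes.
soft_clustering = [[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]]
soft_classification = [[0.6, 0.3, 0.1], [0.1, 0.1, 0.8], [0.3, 0.4, 0.3]]
text_vector_matrix = synthesize([soft_clustering, soft_classification])
# Each row now carries 2 + 3 = 5 membership probabilities for one text.
```

Adding further soft clustering or soft classification runs simply appends more columns to each text's row, which is what lets the meta-learner weigh all runs at once.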
As shown in FIG. 3, the meta-learning module 303 further includes: a normalization unit 309 and a learning unit 310. The normalization unit 309 normalizes the received text vector matrix and sends the normalized matrix to the learning unit 310; the learning unit 310 performs meta-learning on the normalized text vector matrix and outputs the final clustering result.
The above descriptions are merely preferred embodiments of the present invention and are not intended to limit its protection scope. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (9)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN 200710117752 CN100495408C (en) | 2007-06-22 | 2007-06-22 | Text clustering element study method and device |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN101079072A CN101079072A (en) | 2007-11-28 |
| CN100495408C true CN100495408C (en) | 2009-06-03 |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20040088156A1 (en) * | 2001-01-04 | 2004-05-06 | Rajaraman Kanagasabai | Method of text similarity measurement |
| JP2004341948A (en) * | 2003-05-16 | 2004-12-02 | Ricoh Co Ltd | Concept extraction system, concept extraction method, program, and storage medium |
| CN1755687A (en) * | 2004-09-30 | 2006-04-05 | 微软公司 | Forming intent-based clusters and employing same by search engine |
| US20070050388A1 (en) * | 2005-08-25 | 2007-03-01 | Xerox Corporation | Device and method for text stream mining |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| C06 | Publication | ||
| PB01 | Publication | ||
| C10 | Entry into substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| C14 | Grant of patent or utility model | ||
| GR01 | Patent grant | ||
| C41 | Transfer of patent application or patent right or utility model | ||
| C56 | Change in the name or address of the patentee | ||
| CP03 | Change of name, title or address |
Address after: 100049 No. 19, Yuquanlu Road, Shijingshan District, Beijing

Patentee after: University of Chinese Academy of Sciences

Address before: 100039 No. 19 (A), Yuquanlu Road, Shijingshan District, Beijing

Patentee before: Graduate University of Chinese Academy of Sciences
|
| TR01 | Transfer of patent right |
Effective date of registration: 2015-11-20

Address after: 100195 No. 87 C, Minzhuang Road, Haidian District, Beijing

Patentee after: Institute of Information Engineering, Chinese Academy of Sciences

Address before: 100049 No. 19, Yuquanlu Road, Shijingshan District, Beijing

Patentee before: University of Chinese Academy of Sciences
|
| CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 2009-06-03

Termination date: 2019-06-22
|