[go: up one dir, main page]

CN102147813A - Method for automatically classifying documents based on K nearest neighbor algorithm under power cloud environment - Google Patents

Method for automatically classifying documents based on K nearest neighbor algorithm under power cloud environment Download PDF

Info

Publication number
CN102147813A
CN102147813A CN 201110086018 CN201110086018A CN102147813A CN 102147813 A CN102147813 A CN 102147813A CN 201110086018 CN201110086018 CN 201110086018 CN 201110086018 A CN201110086018 A CN 201110086018A CN 102147813 A CN102147813 A CN 102147813A
Authority
CN
China
Prior art keywords
document
nearest neighbor
similarity
classification
map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 201110086018
Other languages
Chinese (zh)
Inventor
赵俊峰
王磊
祁建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Jiangsu Electric Power Co Ltd
Original Assignee
State Grid Jiangsu Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Jiangsu Electric Power Co Ltd filed Critical State Grid Jiangsu Electric Power Co Ltd
Priority to CN 201110086018 priority Critical patent/CN102147813A/en
Publication of CN102147813A publication Critical patent/CN102147813A/en
Pending legal-status Critical Current

Links

Images

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本发明公开了一种电力云环境下基于K最近邻算法的文档自动分类方法,该方法对云计算的MapReduce编程框架进行了改进,其中Map函数完成文档相似性的计算,reduce函数规约出相似性最高的K个样本,统计最近邻所属各个类别的权重,并输出权重最大的类别,对文档进行自动分类。本发明快速完成大量的文档分类任务,大幅缩短文档分类任务的执行时间,提高分类效率;且具备健壮性。

The invention discloses an automatic document classification method based on the K-nearest neighbor algorithm in a power cloud environment. The method improves the MapReduce programming framework of cloud computing, wherein the Map function completes the calculation of document similarity, and the reduce function stipulates the similarity For the highest K samples, the weights of each category to which the nearest neighbor belongs are counted, and the category with the largest weight is output to automatically classify the document. The invention quickly completes a large number of document classification tasks, greatly shortens the execution time of the document classification tasks, improves the classification efficiency, and has robustness.

Description

Under a kind of electric power cloud environment based on the document automatic classification method of K nearest neighbor algorithm
Technical field
The invention belongs to cloud computing and data mining field, relate to a kind of Utilities Electric Co. document classification method, under specifically a kind of electric power cloud environment based on the document automatic classification method of K nearest neighbor algorithm.
Background technology
The automatic document sorting technique is to utilize natural language, and data mining and artificial intelligence technology are through after certain training, and the technology that makes program discern and to classify document automatically has important use aspect the large-scale data processing.
Traditional K nearest neighbor algorithm is because it is simply effective, obtained using widely aspect the classification automatically at document.Because there is the computation complexity height in traditional K nearest neighbor algorithm, the shortcoming of poor expandability, under the situation that Utilities Electric Co.'s document sharply increases, if directly use this algorithm that document is classified, its calculated amount sharply rises, and the classification real-time descends.Since IBM in 2007 and Google unite release cloud computing since, cloud computing has become the hot issue that industry member and academia all pay close attention to, cloud computing has become the Distributed Calculation future thrust.On this basis, national grid has been set up electric power cloud simulation laboratory, has erected the electric power cloud computing environment, and the magnanimity information of intelligent grid is carried out analyzing and processing.The MapReduce programming framework that is proposed by Google is the representative art in the cloud computing, it is applicable to the distributed treatment large-scale dataset, the programmer specifies the processing procedure to each block data in the Map function, specify in the Reduce function and how the intermediate result of block data processing is carried out stipulations.But traditional MapReduce programming framework can only be handled the individual data collection, and can not directly support the processing to a plurality of associated data set, and the document classification method of K nearest neighbor algorithm must be in the face of the handling problem of a plurality of associated data set.Though it is very wide that traditional K nearest neighbor algorithm is used, also there is the high shortcoming of computation complexity simultaneously.In the operational process of algorithm, need find K neighbour, need to calculate for this reason and all samples between distance, calculated amount is huge.Though the improvement algorithm is arranged, major part is to make with the cost that reduces classification performance.
Summary of the invention
In order to overcome the problem that traditional K nearest neighbor algorithm exists, the purpose of this invention is to provide under a kind of electric power cloud environment document automatic classification method based on the K nearest neighbor algorithm, this method is improved the MapReduce programming framework of cloud computing, wherein the Map function is finished the calculating of document similarity, reduce function stipulations go out the highest K of a similarity sample, can utilize electric power cloud powerful computing ability, significantly shorten the document classification task executions time, improve classification effectiveness; And possesses robustness.
The objective of the invention is to be achieved through the following technical solutions:
Under a kind of electric power cloud environment based on the document automatic classification method of K nearest neighbor algorithm, it is characterized in that this method improves the MapReduce programming framework of cloud computing, wherein the Map function is finished the calculating of document similarity, reduce function stipulations go out the highest K of a similarity sample, the weight of each classification under the statistics arest neighbors, and the classification of output weight maximum, particular content comprises:
1) utilize metadata in the power system information storehouse, the special use of structure electric system industry feature speech dictionary, stop word word set and concept set; Then the training set document is carried out structuring and handle, set up model, remove useless, empty general stop word according to the forbidding word set; According to feature speech dictionary document is carried out participle; According to concept set the same concept of different expression waies is mapped as identical concept; Structured document after handling is carried out characteristic item extract and handle, generate the document vectors storehouse at last; According to this vector storehouse training set document and new text document to be classified are handled again, be expressed as a vector space model;
2) with the vector space model of the vector space model of training set document and new document to be classified, be kept on the distributed file system with file by row, the vector space model of a document of each line display is represented; The Map function from DFS by< a, La〉and read file logging, and the similarity between calculation training document and the new document, Top-k result returns as intermediate result with this node similarity; The Reduce function converges the intermediate result that all Map nodes produce, and calculates user K-arest neighbors set N (u); The Merge function reads style of writing part or row file on the DFS once more based on N (u), calculate each class in the K-arest neighbors set weight P (x, u);
3) weight is input to buffer memory, and ordering, find out the wherein class of the weight of maximum, this classification is as a result of exported, finish document and classify automatically.
Because traditional MapReduce programming framework can only be handled the individual data collection, the K nearest neighbor algorithm need be added up the weight of affiliated each classification of arest neighbors at last, therefore the present invention re-constructs a merge function on original MapReduce programming framework, in order to calculate this weight, export the classification of weight maximum at last.
Training set document and the to be classified new text document of the present invention after with vector space modelization is saved on the distributed file system (DFS), handle with the MapReduce programming framework, the calculating of K-arest neighbors is distributed to executed in parallel on each node, under the condition of the computational accuracy that does not influence the K-nearest neighbor algorithm, utilize the powerful calculating ability of electric power cloud, improve the execution efficient of K-nearest neighbor algorithm., on this basis, the MapReduce programming framework is improved simultaneously, added the Merge function, finish the implementation of K-nearest neighbor algorithm by Map, Reduce and Merge three phases because the MapReduce programming framework can only a reading of data collection.
The present invention is according to the characteristics of K nearest neighbor algorithm, this framework some improvement have been carried out, redesigned the K nearest neighbor algorithm on this basis, and use this algorithm that Utilities Electric Co.'s document is classified automatically, finish a large amount of document classification tasks fast, significantly shorten the document classification task executions time, improve classification effectiveness; And possesses robustness.
Description of drawings
Fig. 1 is a FB(flow block) of the present invention.
Embodiment
Under a kind of electric power cloud environment of the present invention based on the document automatic classification method of K nearest neighbor algorithm, this method is improved the MapReduce programming framework of cloud computing, wherein the Map function is finished the calculating of document similarity, reduce function stipulations go out the highest K of a similarity sample, the weight of each classification under the statistics arest neighbors, and the classification of output weight maximum, particular content comprises:
Utilize the metadata in the power system information storehouse, the special use of structure electric system industry feature speech dictionary, stop word word set and concept set.Then the training set document is carried out structuring and handle, set up model, remove useless, empty general stop word according to the forbidding word set; According to feature speech dictionary document is carried out participle; According to concept set the same concept of different expression waies is mapped as identical concept.Structured document after handling is carried out characteristic item extract and handle, generate the document vectors storehouse at last.According to this vector storehouse training set document and new text document to be classified are handled again, be expressed as a vector space model.
With the vector space model of the vector space model of training set document and new document to be classified, just a matrix is kept on the distributed file system (DFS) with file by row, and each row is exactly that the vector space model of a document is represented.The Map function from DFS by< a, La〉and read file logging, and the similarity between calculation training document and the new document, Top-k result returns as intermediate result with this node similarity.The Reduce function converges the intermediate result that all Map nodes produce, and calculates user K-arest neighbors set N (u).The Merge function reads style of writing part or row file on the DFS once more based on N (u), calculate each class in the K-arest neighbors set weight P (x, u).
Weight is input to buffer memory, and ordering, find out the wherein class of the weight of maximum.This classification is as a result of exported, finished document and classify automatically.
During enforcement, concrete steps are as follows:
1. the pre-service of document.Utilities Electric Co.'s document is carried out structuring handle,, document is carried out participle, remove stop word simultaneously and carry out the notion mapping according to the metadata of electric system.
2. characteristic item extracts, and reduces, and makes it the classification of document in the degree of depth portrayal electric system.
With training text according to characteristic item, save as vector space model.Promptly with the vector representation text: (ω 1, ω 2..., ω n) ω wherein iIt is the weight of i characteristic item.
4. the vector of training text is stored by row, and the classification of each training text is stored simultaneously, All Files is by the distributed file system unified management, to user transparent;
With new text according to characteristic item, save as vector space model, pending.
6.Map function reads file logging from DFS, calculates similarity between each training text that new text vector and distributed file system DFS read with calculating formula of similarity.And similarity K-top the result that this node is produced returns as intermediate result.
7.Reduce function converges the intermediate result that all Map nodes produce, and it is carried out queue order, stipulations go out K the training text vector that wherein similarity is the highest, generate K arest neighbors set N (u).
8.Merge, read the file logging on the DFS once more according to the weight calculation formula, find out K the training text vector that obtains in previous step, generate K arest neighbors set N (u) based on previous step, calculate the weight of each class.Output in the buffer memory.
9. all kinds of weight in the buffer memory is sorted, find out maximum weight, the classification under it is exactly the classification of new text.
The present invention can significantly shorten the document classification task executions time, improves classification effectiveness, and possesses robustness.

Claims (1)

1.一种电力云环境下基于K最近邻算法的文档自动分类方法,其特征在于该方法对云计算的MapReduce编程框架进行了改进,其中Map函数完成文档相似性的计算,reduce函数规约出相似性最高的K个样本,统计最近邻所属各个类别的权重,并输出权重最大的类别,具体内容包括:1. An automatic document classification method based on the K-nearest neighbor algorithm in a power cloud environment, characterized in that the method improves the MapReduce programming framework of cloud computing, wherein the Map function completes the calculation of document similarity, and the reduce function reduces similarity The K samples with the highest reliability, count the weights of each category to which the nearest neighbor belongs, and output the category with the largest weight. The specific content includes: 1)利用电力系统信息库里的元数据,构造电力系统行业专用的的特征词词典、禁用词词集以及概念集;然后将训练集文档进行结构化处理,建立模型,根据禁用词集去除无用、虚泛的禁用词;根据特征词词典对文档进行分词;根据概念集将不同表达方式的相同概念映射为同一概念;将处理后的结构化文档进行特征项提取并处理,最后生成文档矢量库;根据该矢量库将训练集文档以及待分类的新文本文档进行再处理,表示为一个空间向量模型;1) Using the metadata in the power system information base, construct a feature word dictionary, a set of stop words and a concept set dedicated to the power system industry; then structure the training set documents, build a model, and remove useless words based on the set of stop words , imaginary stop words; segment the document according to the feature word dictionary; map the same concept in different expressions to the same concept according to the concept set; extract and process the feature items of the processed structured document, and finally generate a document vector library ; According to the vector library, the training set document and the new text document to be classified are reprocessed, and expressed as a space vector model; 2)将训练集文档的空间向量模型以及待分类的新文档的空间向量模型,按行以文件保存在分布式文件系统上,每一行表示一个文档的空间向量模型表示;Map函数从DFS按<a,La>读取文件记录,并计算训练文档和新文档之间的相似度,将本结点相似度Top-k个结果作为中间结果返回;Reduce函数汇聚所有Map结点产生的中间结果,计算用户K-最近邻集合N(u);Merge函数基于N(u),再次读取DFS上的行文件或列文件,计算出K-最近邻集合中每个类的权重P(x,u);2) Save the space vector model of the training set document and the space vector model of the new document to be classified on the distributed file system by row, and each row represents the space vector model representation of a document; the Map function is obtained from the DFS by < a , La> read the file records, and calculate the similarity between the training document and the new document, and return the Top-k results of the node similarity as intermediate results; the Reduce function aggregates the intermediate results generated by all Map nodes, Calculate the user's K-nearest neighbor set N(u); based on N(u), the Merge function reads the row file or column file on the DFS again, and calculates the weight P(x,u) of each class in the K-nearest neighbor set ); 3)将权重输入到缓存,并排序,找出其中最大的权重的类,将该类别作为结果输出,完成文档自动分类。3) Input the weights into the cache, sort them, find out the class with the largest weight, output the class as the result, and complete the automatic classification of documents.
CN 201110086018 2011-04-07 2011-04-07 Method for automatically classifying documents based on K nearest neighbor algorithm under power cloud environment Pending CN102147813A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110086018 CN102147813A (en) 2011-04-07 2011-04-07 Method for automatically classifying documents based on K nearest neighbor algorithm under power cloud environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201110086018 CN102147813A (en) 2011-04-07 2011-04-07 Method for automatically classifying documents based on K nearest neighbor algorithm under power cloud environment

Publications (1)

Publication Number Publication Date
CN102147813A true CN102147813A (en) 2011-08-10

Family

ID=44422078

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110086018 Pending CN102147813A (en) 2011-04-07 2011-04-07 Method for automatically classifying documents based on K nearest neighbor algorithm under power cloud environment

Country Status (1)

Country Link
CN (1) CN102147813A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102799624A (en) * 2012-06-19 2012-11-28 北京大学 Large-scale graph data query method in distributed environment based on Datalog
CN103279478A (en) * 2013-04-19 2013-09-04 国家电网公司 Method for extracting features based on distributed mutual information documents
CN103309984A (en) * 2013-06-17 2013-09-18 腾讯科技(深圳)有限公司 Data processing method and device
CN104699772A (en) * 2015-03-05 2015-06-10 孟海东 Big data text classifying method based on cloud computing
CN104967121A (en) * 2015-07-13 2015-10-07 中国电力科学研究院 Large-scale electric power system node load flow computing method
CN106682035A (en) * 2015-11-11 2017-05-17 中国移动通信集团公司 Individualized learning recommendation method and device
CN107203579A (en) * 2016-03-18 2017-09-26 滴滴(中国)科技有限公司 The day off sorting technique and device for data of being called a taxi based on user

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080120314A1 (en) * 2006-11-16 2008-05-22 Yahoo! Inc. Map-reduce with merge to process multiple relational datasets

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080120314A1 (en) * 2006-11-16 2008-05-22 Yahoo! Inc. Map-reduce with merge to process multiple relational datasets

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
《EDBT》 20110324 K. Selçuk Candan .etc RanKloud: A Scalable Ranked Query Processing Framework on Hadoop 第1-4页 1 , *
《Proceedings of the 2007 ACM SIGMOD international conference on Management of data》 20070614 Hung-chih Yang. etc Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters 1029-1040 1 , *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102799624A (en) * 2012-06-19 2012-11-28 北京大学 Large-scale graph data query method in distributed environment based on Datalog
CN102799624B (en) * 2012-06-19 2015-03-04 北京大学 Large-scale graph data query method in distributed environment based on Datalog
CN103279478A (en) * 2013-04-19 2013-09-04 国家电网公司 Method for extracting features based on distributed mutual information documents
CN103279478B (en) * 2013-04-19 2016-08-10 国家电网公司 A kind of based on distributed mutual information file characteristics extracting method
CN103309984A (en) * 2013-06-17 2013-09-18 腾讯科技(深圳)有限公司 Data processing method and device
CN103309984B (en) * 2013-06-17 2016-12-28 腾讯科技(深圳)有限公司 The method and apparatus that data process
CN104699772A (en) * 2015-03-05 2015-06-10 孟海东 Big data text classifying method based on cloud computing
CN104967121A (en) * 2015-07-13 2015-10-07 中国电力科学研究院 Large-scale electric power system node load flow computing method
CN104967121B (en) * 2015-07-13 2018-01-19 中国电力科学研究院 A kind of tidal current computing method of large-scale electrical power system node
CN106682035A (en) * 2015-11-11 2017-05-17 中国移动通信集团公司 Individualized learning recommendation method and device
CN107203579A (en) * 2016-03-18 2017-09-26 滴滴(中国)科技有限公司 The day off sorting technique and device for data of being called a taxi based on user
CN107203579B (en) * 2016-03-18 2020-12-25 北京嘀嘀无限科技发展有限公司 User taxi taking data-based holiday classification method and device

Similar Documents

Publication Publication Date Title
CN104699772B (en) A kind of big data file classification method based on cloud computing
Saqib et al. Pipelined decision tree classification accelerator implementation in FPGA (DT-CAIF)
CN101308496A (en) Large scale text data external clustering method and system
CN102147813A (en) Method for automatically classifying documents based on K nearest neighbor algorithm under power cloud environment
CN108921188B (en) Parallel CRF method based on Spark big data platform
CN104156463A (en) Big-data clustering ensemble method based on MapReduce
CN103559016A (en) Frequent subgraph excavating method based on graphic processor parallel computing
CN104239553A (en) Entity recognition method based on Map-Reduce framework
Osman et al. Towards real-time analytics in the cloud
CN110837584B (en) A method and system for constructing a suffix array in blocks and in parallel
Zhu et al. A classification algorithm of CART decision tree based on MapReduce attribute weights
CN110717040A (en) Dictionary expansion method and device, electronic equipment and storage medium
Jiang et al. Parallel K-Medoids clustering algorithm based on Hadoop
CN104536830A (en) KNN text classification method based on MapReduce
CN103473275A (en) Automatic image labeling method and automatic image labeling system by means of multi-feature fusion
CN104378371A (en) Network intrusion detection method for parallel AP cluster based on MapReduce
CN107679244A (en) File classification method and device
CN103116636A (en) Method and device of excavation of subject of text big data based on characteristic space decomposition
CN102496033B (en) Image SIFT feature matching method based on MR computation framework
Zhang et al. MapReduce-based distributed tensor clustering algorithm
CN103995690B (en) A kind of parallel time sequential mining method based on GPU
CN106203508A (en) A kind of image classification method based on Hadoop platform
Qin et al. Mixer: efficiently understanding and retrieving visual content at web-scale
CN105760478A (en) Large-scale distributed data clustering method based on machine learning
CN119202344A (en) Multimodal data storage method, device, electronic device and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20110810