CN102147813A

CN102147813A - Method for automatically classifying documents based on K nearest neighbor algorithm under power cloud environment

Info

Publication number: CN102147813A
Application number: CN 201110086018
Authority: CN
Inventors: 赵俊峰; 王磊; 祁建
Original assignee: State Grid Jiangsu Electric Power Co Ltd
Current assignee: State Grid Jiangsu Electric Power Co Ltd
Priority date: 2011-04-07
Filing date: 2011-04-07
Publication date: 2011-08-10

Abstract

The invention discloses an automatic document classification method based on the K-nearest neighbor algorithm in a power cloud environment. The method improves the MapReduce programming framework of cloud computing, wherein the Map function completes the calculation of document similarity, and the reduce function stipulates the similarity For the highest K samples, the weights of each category to which the nearest neighbor belongs are counted, and the category with the largest weight is output to automatically classify the document. The invention quickly completes a large number of document classification tasks, greatly shortens the execution time of the document classification tasks, improves the classification efficiency, and has robustness.

Description

Under a kind of electric power cloud environment based on the document automatic classification method of K nearest neighbor algorithm

Technical field

The invention belongs to cloud computing and data mining field, relate to a kind of Utilities Electric Co. document classification method, under specifically a kind of electric power cloud environment based on the document automatic classification method of K nearest neighbor algorithm.

Background technology

The automatic document sorting technique is to utilize natural language, and data mining and artificial intelligence technology are through after certain training, and the technology that makes program discern and to classify document automatically has important use aspect the large-scale data processing.

Traditional K nearest neighbor algorithm is because it is simply effective, obtained using widely aspect the classification automatically at document.Because there is the computation complexity height in traditional K nearest neighbor algorithm, the shortcoming of poor expandability, under the situation that Utilities Electric Co.'s document sharply increases, if directly use this algorithm that document is classified, its calculated amount sharply rises, and the classification real-time descends.Since IBM in 2007 and Google unite release cloud computing since, cloud computing has become the hot issue that industry member and academia all pay close attention to, cloud computing has become the Distributed Calculation future thrust.On this basis, national grid has been set up electric power cloud simulation laboratory, has erected the electric power cloud computing environment, and the magnanimity information of intelligent grid is carried out analyzing and processing.The MapReduce programming framework that is proposed by Google is the representative art in the cloud computing, it is applicable to the distributed treatment large-scale dataset, the programmer specifies the processing procedure to each block data in the Map function, specify in the Reduce function and how the intermediate result of block data processing is carried out stipulations.But traditional MapReduce programming framework can only be handled the individual data collection, and can not directly support the processing to a plurality of associated data set, and the document classification method of K nearest neighbor algorithm must be in the face of the handling problem of a plurality of associated data set.Though it is very wide that traditional K nearest neighbor algorithm is used, also there is the high shortcoming of computation complexity simultaneously.In the operational process of algorithm, need find K neighbour, need to calculate for this reason and all samples between distance, calculated amount is huge.Though the improvement algorithm is arranged, major part is to make with the cost that reduces classification performance.

Summary of the invention

In order to overcome the problem that traditional K nearest neighbor algorithm exists, the purpose of this invention is to provide under a kind of electric power cloud environment document automatic classification method based on the K nearest neighbor algorithm, this method is improved the MapReduce programming framework of cloud computing, wherein the Map function is finished the calculating of document similarity, reduce function stipulations go out the highest K of a similarity sample, can utilize electric power cloud powerful computing ability, significantly shorten the document classification task executions time, improve classification effectiveness; And possesses robustness.

The objective of the invention is to be achieved through the following technical solutions:

Under a kind of electric power cloud environment based on the document automatic classification method of K nearest neighbor algorithm, it is characterized in that this method improves the MapReduce programming framework of cloud computing, wherein the Map function is finished the calculating of document similarity, reduce function stipulations go out the highest K of a similarity sample, the weight of each classification under the statistics arest neighbors, and the classification of output weight maximum, particular content comprises:

1) utilize metadata in the power system information storehouse, the special use of structure electric system industry feature speech dictionary, stop word word set and concept set; Then the training set document is carried out structuring and handle, set up model, remove useless, empty general stop word according to the forbidding word set; According to feature speech dictionary document is carried out participle; According to concept set the same concept of different expression waies is mapped as identical concept; Structured document after handling is carried out characteristic item extract and handle, generate the document vectors storehouse at last; According to this vector storehouse training set document and new text document to be classified are handled again, be expressed as a vector space model;

2) with the vector space model of the vector space model of training set document and new document to be classified, be kept on the distributed file system with file by row, the vector space model of a document of each line display is represented; The Map function from DFS by＜ a, La〉and read file logging, and the similarity between calculation training document and the new document, Top-k result returns as intermediate result with this node similarity; The Reduce function converges the intermediate result that all Map nodes produce, and calculates user K-arest neighbors set N (u); The Merge function reads style of writing part or row file on the DFS once more based on N (u), calculate each class in the K-arest neighbors set weight P (x, u);

3) weight is input to buffer memory, and ordering, find out the wherein class of the weight of maximum, this classification is as a result of exported, finish document and classify automatically.

Because traditional MapReduce programming framework can only be handled the individual data collection, the K nearest neighbor algorithm need be added up the weight of affiliated each classification of arest neighbors at last, therefore the present invention re-constructs a merge function on original MapReduce programming framework, in order to calculate this weight, export the classification of weight maximum at last.

Training set document and the to be classified new text document of the present invention after with vector space modelization is saved on the distributed file system (DFS), handle with the MapReduce programming framework, the calculating of K-arest neighbors is distributed to executed in parallel on each node, under the condition of the computational accuracy that does not influence the K-nearest neighbor algorithm, utilize the powerful calculating ability of electric power cloud, improve the execution efficient of K-nearest neighbor algorithm., on this basis, the MapReduce programming framework is improved simultaneously, added the Merge function, finish the implementation of K-nearest neighbor algorithm by Map, Reduce and Merge three phases because the MapReduce programming framework can only a reading of data collection.

The present invention is according to the characteristics of K nearest neighbor algorithm, this framework some improvement have been carried out, redesigned the K nearest neighbor algorithm on this basis, and use this algorithm that Utilities Electric Co.'s document is classified automatically, finish a large amount of document classification tasks fast, significantly shorten the document classification task executions time, improve classification effectiveness; And possesses robustness.

Description of drawings

Fig. 1 is a FB(flow block) of the present invention.

Embodiment

Under a kind of electric power cloud environment of the present invention based on the document automatic classification method of K nearest neighbor algorithm, this method is improved the MapReduce programming framework of cloud computing, wherein the Map function is finished the calculating of document similarity, reduce function stipulations go out the highest K of a similarity sample, the weight of each classification under the statistics arest neighbors, and the classification of output weight maximum, particular content comprises:

Utilize the metadata in the power system information storehouse, the special use of structure electric system industry feature speech dictionary, stop word word set and concept set.Then the training set document is carried out structuring and handle, set up model, remove useless, empty general stop word according to the forbidding word set; According to feature speech dictionary document is carried out participle; According to concept set the same concept of different expression waies is mapped as identical concept.Structured document after handling is carried out characteristic item extract and handle, generate the document vectors storehouse at last.According to this vector storehouse training set document and new text document to be classified are handled again, be expressed as a vector space model.

With the vector space model of the vector space model of training set document and new document to be classified, just a matrix is kept on the distributed file system (DFS) with file by row, and each row is exactly that the vector space model of a document is represented.The Map function from DFS by＜ a, La〉and read file logging, and the similarity between calculation training document and the new document, Top-k result returns as intermediate result with this node similarity.The Reduce function converges the intermediate result that all Map nodes produce, and calculates user K-arest neighbors set N (u).The Merge function reads style of writing part or row file on the DFS once more based on N (u), calculate each class in the K-arest neighbors set weight P (x, u).

Weight is input to buffer memory, and ordering, find out the wherein class of the weight of maximum.This classification is as a result of exported, finished document and classify automatically.

During enforcement, concrete steps are as follows:

1. the pre-service of document.Utilities Electric Co.'s document is carried out structuring handle,, document is carried out participle, remove stop word simultaneously and carry out the notion mapping according to the metadata of electric system.

2. characteristic item extracts, and reduces, and makes it the classification of document in the degree of depth portrayal electric system.

With training text according to characteristic item, save as vector space model.Promptly with the vector representation text: (ω ₁, ω ₂..., ω _n) ω wherein _iIt is the weight of i characteristic item.

4. the vector of training text is stored by row, and the classification of each training text is stored simultaneously, All Files is by the distributed file system unified management, to user transparent;

With new text according to characteristic item, save as vector space model, pending.

6.Map function reads file logging from DFS, calculates similarity between each training text that new text vector and distributed file system DFS read with calculating formula of similarity.And similarity K-top the result that this node is produced returns as intermediate result.

7.Reduce function converges the intermediate result that all Map nodes produce, and it is carried out queue order, stipulations go out K the training text vector that wherein similarity is the highest, generate K arest neighbors set N (u).

8.Merge, read the file logging on the DFS once more according to the weight calculation formula, find out K the training text vector that obtains in previous step, generate K arest neighbors set N (u) based on previous step, calculate the weight of each class.Output in the buffer memory.

9. all kinds of weight in the buffer memory is sorted, find out maximum weight, the classification under it is exactly the classification of new text.

The present invention can significantly shorten the document classification task executions time, improves classification effectiveness, and possesses robustness.

Claims

1. An automatic document classification method based on the K-nearest neighbor algorithm in a power cloud environment, characterized in that the method improves the MapReduce programming framework of cloud computing, wherein the Map function completes the calculation of document similarity, and the reduce function reduces similarity The K samples with the highest reliability, count the weights of each category to which the nearest neighbor belongs, and output the category with the largest weight. The specific content includes:

1) Using the metadata in the power system information base, construct a feature word dictionary, a set of stop words and a concept set dedicated to the power system industry; then structure the training set documents, build a model, and remove useless words based on the set of stop words , imaginary stop words; segment the document according to the feature word dictionary; map the same concept in different expressions to the same concept according to the concept set; extract and process the feature items of the processed structured document, and finally generate a document vector library ; According to the vector library, the training set document and the new text document to be classified are reprocessed, and expressed as a space vector model;

2) Save the space vector model of the training set document and the space vector model of the new document to be classified on the distributed file system by row, and each row represents the space vector model representation of a document; the Map function is obtained from the DFS by < a , La> read the file records, and calculate the similarity between the training document and the new document, and return the Top-k results of the node similarity as intermediate results; the Reduce function aggregates the intermediate results generated by all Map nodes, Calculate the user's K-nearest neighbor set N(u); based on N(u), the Merge function reads the row file or column file on the DFS again, and calculates the weight P(x,u) of each class in the K-nearest neighbor set );

3) Input the weights into the cache, sort them, find out the class with the largest weight, output the class as the result, and complete the automatic classification of documents.