CN104239479A - Document classification method and system

Info

Publication number: CN104239479A
Application number: CN201410449140.7A
Authority: CN
Inventors: 宗栋瑞; 郭美思; 吴楠
Original assignee: Inspur Beijing Electronic Information Industry Co Ltd
Current assignee: Inspur Beijing Electronic Information Industry Co Ltd
Priority date: 2014-09-04
Filing date: 2014-09-04
Publication date: 2014-12-24

Abstract

The invention discloses a document classification method and system applied to a Hadoop cluster that includes a Map program and a Reduce program. The method comprises the following steps: the Map program parses the training documents and the documents to be classified, determines feature attributes according to the parsing results, and divides the feature attributes; the Map program generates a classifier according to the feature attributes of the training documents and the classification results of the training documents; the Reduce program uses the classifier to classify the documents to be classified and obtains their classification results. The invention makes full use of the distributed nature of the Hadoop cluster and avoids the limitations of traditional system frameworks. Because processing is parallel, massive numbers of documents can be classified quickly, which shortens classification time and improves both classification efficiency and system performance.

Description

Document classification method and system

Technical field

The invention relates to the field of computer technology, and in particular to a document classification method and system.

Background

With the growing popularity of network technology, the amount of data on networks has increased dramatically, and the types of applications have become very diverse. Data mining makes full use of existing information resources to uncover hidden knowledge in large amounts of data, and it is a promising direction of development. Data mining draws on machine learning, pattern recognition, statistics, intelligent databases, data visualization, high-performance computing and other fields; its purpose is to discover implicit, novel and interesting relationships and patterns in large amounts of data. Document classification is an important branch of data mining.

In the prior art, document classification is usually performed with a traditional system framework, which leads to long classification times and poor system performance when massive amounts of data are processed.

Summary of the invention

The invention provides a document classification method and system to overcome the poor system performance of the prior art.

The invention provides a document classification method applied to a Hadoop cluster that includes a Map program and a Reduce program. The method comprises the following steps:

The Map program parses the training documents and the documents to be classified, determines feature attributes according to the parsing results, and divides the feature attributes;

The Map program generates a classifier according to the feature attributes of the training documents and the classification results of the training documents;

The Reduce program uses the classifier to classify the documents to be classified and obtains the classification results of the documents to be classified.

Optionally, after the Map program determines the feature attributes according to the parsing results, the method further comprises:

The Map program performs format conversion on the training documents and the documents to be classified according to the feature attributes, obtaining training documents and documents to be classified that conform to a preset format;

The step in which the Map program generates a classifier according to the feature attributes of the training documents and the classification results of the training documents is specifically:

The Map program generates a classifier according to the feature attributes of the format-converted training documents and the classification results of the training documents;

The step in which the Reduce program uses the classifier to classify the documents to be classified and obtains their classification results is specifically:

The Reduce program uses the classifier to classify the format-converted documents to be classified and obtains the classification results of the documents to be classified.

Optionally, the step in which the Map program generates a classifier according to the feature attributes of the format-converted training documents and the classification results of the training documents is specifically:

The Map program calculates, according to the value ranges of the feature attributes of the format-converted training documents and the classification results of the training documents, the frequency of occurrence of each category in the training documents and the conditional probability estimate of each value range of every feature attribute under each category, and records the frequencies of occurrence and the conditional probability estimates as the classifier.

Optionally, the step in which the Reduce program uses the classifier to classify the format-converted documents to be classified and obtains their classification results is specifically:

The Reduce program obtains the value ranges of all feature attributes of the format-converted document to be classified; calculates the conditional probability that the document belongs to each category from the obtained value ranges, the frequency of occurrence of each category in the training documents, and the conditional probability estimates of the value ranges of all feature attributes under each category; and takes the category with the largest conditional probability as the classification result of the document to be classified.

Optionally, the step in which the Map program parses the training documents and the documents to be classified, determines feature attributes according to the parsing results, and divides the feature attributes is specifically:

The Map program parses the training documents and the documents to be classified to obtain the attributes they contain, selects feature attributes from the parsed attributes, and divides each feature attribute into multiple value ranges.

The invention also provides a document classification system applied to a Hadoop cluster. The system comprises:

A parsing module, configured to parse the training documents and the documents to be classified, determine feature attributes according to the parsing results, and divide the feature attributes;

A generation module, configured to generate a classifier according to the feature attributes of the training documents determined by the parsing module and the classification results of the training documents;

A classification module, configured to use the classifier generated by the generation module to classify the documents to be classified and obtain the classification results of the documents to be classified.

Optionally, the system further comprises:

A conversion module, configured to perform format conversion on the training documents and the documents to be classified according to the feature attributes determined by the parsing module, obtaining training documents and documents to be classified that conform to a preset format;

The generation module is specifically configured to generate a classifier according to the feature attributes of the training documents format-converted by the conversion module and the classification results of the training documents;

The classification module is specifically configured to use the classifier generated by the generation module to classify the documents to be classified that were format-converted by the conversion module, and to obtain the classification results of the documents to be classified.

Optionally, the generation module is specifically configured to calculate, according to the value ranges of the feature attributes of the training documents format-converted by the conversion module and the classification results of the training documents, the frequency of occurrence of each category in the training documents and the conditional probability estimate of each value range of every feature attribute under each category, and to record the frequencies of occurrence and the conditional probability estimates as the classifier.

Optionally, the classification module is specifically configured to obtain the value ranges of all feature attributes of the document to be classified that was format-converted by the conversion module; to calculate the conditional probability that the document belongs to each category from the obtained value ranges, the frequency of occurrence of each category in the training documents, and the conditional probability estimates of the value ranges of all feature attributes under each category; and to take the category with the largest conditional probability as the classification result of the document to be classified.

Optionally, the parsing module is specifically configured to parse the training documents and the documents to be classified to obtain the attributes they contain, select feature attributes from the parsed attributes, and divide each feature attribute into multiple value ranges.

The invention makes full use of the distributed nature of the Hadoop cluster and avoids the limitations of traditional system frameworks. Because processing is parallel, massive numbers of documents can be classified quickly, which shortens classification time and improves both classification efficiency and system performance.

Brief description of the drawings

Fig. 1 is a flowchart of a document classification method in an embodiment of the invention;

Fig. 2 is a schematic structural diagram of a document classification system in an embodiment of the invention.

Detailed description

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the invention.

It should be noted that, where no conflict arises, the embodiments of the invention and the features within them may be combined with one another, and all such combinations fall within the protection scope of the invention. In addition, although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in a different order.

An embodiment of the invention proposes a document classification method applied to a Hadoop cluster that includes a Map program and a Reduce program. After the training documents and the documents to be classified have been placed on HDFS (Hadoop Distributed File System) using Hadoop commands, the operations shown in Fig. 1 are performed:

Step 101: the Map program parses the training documents and the documents to be classified, determines feature attributes according to the parsing results, and divides the feature attributes.

Specifically, the Map program may parse the training documents and the documents to be classified to obtain the attributes they contain, select feature attributes from the parsed attributes, and divide each feature attribute into multiple value ranges.

The training documents and the documents to be classified may be located in different directories in HDFS and are managed by category directories: the name of each folder is a class label, and the folder contains the documents that belong to the class corresponding to that label.
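As a minimal sketch of this directory convention, the snippet below recovers the class label of a document by taking the name of its enclosing folder. The paths and the category names `real` and `fake` are hypothetical; the text only states that each folder name is a class label.

```python
from pathlib import PurePosixPath


def label_from_path(document_path: str) -> str:
    """Return the class label, i.e. the name of the folder holding the document."""
    return PurePosixPath(document_path).parent.name


# Hypothetical HDFS-style layout: /train/<class label>/<document>
print(label_from_path("/train/real/user_0001.txt"))
print(label_from_path("/train/fake/user_0002.txt"))
```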

For example, the training documents are located in the /train directory in HDFS, and the documents to be classified are located in the /test directory. Based on its analysis of the training documents and the documents to be classified, the Map program selects three feature attributes: a, the number of logs divided by the number of days since registration; b, the number of friends divided by the number of days since registration; c, whether a real avatar is used. It divides each feature attribute as follows: {a <= 0.05, 0.05 < a < 0.2, a >= 0.2}; {b <= 0.1, 0.1 < b < 0.8, b >= 0.8}; {c = 0 (no), c = 1 (yes)}.
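The division into value ranges can be sketched as a small discretization routine. The thresholds are the ones given in the example; the function names, the raw input fields and the string labels used for the ranges are assumptions made for illustration.

```python
def range_of_a(a: float) -> str:
    """Value range of attribute a (number of logs / days since registration)."""
    if a <= 0.05:
        return "a<=0.05"
    if a < 0.2:
        return "0.05<a<0.2"
    return "a>=0.2"


def range_of_b(b: float) -> str:
    """Value range of attribute b (number of friends / days since registration)."""
    if b <= 0.1:
        return "b<=0.1"
    if b < 0.8:
        return "0.1<b<0.8"
    return "b>=0.8"


def range_of_c(uses_real_avatar: bool) -> str:
    """Value of attribute c (1 if a real avatar is used, otherwise 0)."""
    return "c=1" if uses_real_avatar else "c=0"


def discretize(log_count: int, friend_count: int, days_registered: int,
               uses_real_avatar: bool) -> dict[str, str]:
    """Map the raw fields of one account (hypothetical field names) to value ranges."""
    if days_registered <= 0:
        raise ValueError("days_registered must be positive")
    return {
        "a": range_of_a(log_count / days_registered),
        "b": range_of_b(friend_count / days_registered),
        "c": range_of_c(uses_real_avatar),
    }


print(discretize(log_count=10, friend_count=50, days_registered=100,
                 uses_real_avatar=False))
```

Representing each range by a label keeps the later counting and lookup steps independent of the numeric thresholds.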

Step 102: the Map program performs format conversion on the training documents and the documents to be classified according to the determined feature attributes, obtaining training documents and documents to be classified that conform to a preset format.

Specifically, the Map program may use the PrepareTwentyNewsgroups class of the Mahout command line to convert the training documents and the documents to be classified into the preset format. The preset format may be the VectorWritable format; in a document conforming to the VectorWritable format, the first character is the class label and the remaining characters are the feature attributes.
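The "label first, feature attributes after" layout can be illustrated with a plain-text stand-in. This is not Mahout's own serialization and does not call any Mahout API; the `encode` and `decode` helpers and the space-separated layout are assumptions chosen only to show the idea of a uniform record per document.

```python
def encode(label: str, ranges: dict[str, str]) -> str:
    """Serialize one document as '<label> <range> <range> ...' (hypothetical layout)."""
    return " ".join([label, *ranges.values()])


def decode(line: str) -> tuple[str, list[str]]:
    """Split a serialized line back into its class label and feature ranges."""
    label, *features = line.split()
    return label, features


line = encode("0", {"a": "0.05<a<0.2", "b": "0.1<b<0.8", "c": "c=0"})
print(line)
print(decode(line))
```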

Step 103: the Map program generates a classifier according to the feature attributes of the format-converted training documents and the classification results of the training documents.

Specifically, the Map program may calculate, according to the value ranges of the feature attributes of the format-converted training documents and the classification results of the training documents, the frequency of occurrence of each category in the training documents and the conditional probability estimate of each value range of every feature attribute under each category, and record these frequencies and conditional probability estimates as the classifier.

For example, suppose there are 10,000 training documents, classified as follows: 8,900 training documents belong to real accounts (i.e., C = 0) and 1,100 training documents belong to non-real accounts (i.e., C = 1).

The frequency of occurrence of each category in the training documents is:

P(C = 0) = 8900/10000 = 0.89;

P(C = 1) = 1100/10000 = 0.11;

The conditional probability estimates of the value ranges of all feature attributes under each category are:

P(a <= 0.05 | C = 0) = 0.3
P(0.05 < a < 0.2 | C = 0) = 0.5
P(a >= 0.2 | C = 0) = 0.2

P(a <= 0.05 | C = 1) = 0.8
P(0.05 < a < 0.2 | C = 1) = 0.1
P(a >= 0.2 | C = 1) = 0.1

P(b <= 0.1 | C = 0) = 0.1
P(0.1 < b < 0.8 | C = 0) = 0.7
P(b >= 0.8 | C = 0) = 0.2

P(b <= 0.1 | C = 1) = 0.7
P(0.1 < b < 0.8 | C = 1) = 0.2
P(b >= 0.8 | C = 1) = 0.1

P(c = 0 | C = 0) = 0.2
P(c = 1 | C = 0) = 0.8

P(c = 0 | C = 1) = 0.9
P(c = 1 | C = 1) = 0.1
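Estimates of this kind can be obtained by counting. The sketch below mimics a map step that emits one count per class and one per (class, attribute, value range), then sums the counts and normalizes them. It runs in a single Python process rather than on Hadoop, and the record layout, key names and toy data are assumptions for illustration; no smoothing is applied, so a value range never seen with a class simply has no entry.

```python
from collections import Counter
from typing import Iterable, Iterator

# One training record: (class label, {attribute name: value-range label}).
Record = tuple[int, dict[str, str]]


def map_record(record: Record) -> Iterator[tuple[tuple, int]]:
    """Emit (key, 1) pairs for the class and for each attribute range within it."""
    label, ranges = record
    yield ("class", label), 1
    for attribute, value_range in ranges.items():
        yield ("range", label, attribute, value_range), 1


def train(records: Iterable[Record]):
    """Sum the emitted counts and turn them into relative frequencies."""
    counts: Counter = Counter()
    for record in records:
        for key, one in map_record(record):
            counts[key] += one
    class_counts = {key[1]: n for key, n in counts.items() if key[0] == "class"}
    total = sum(class_counts.values())
    priors = {label: n / total for label, n in class_counts.items()}
    conditionals = {
        key[1:]: n / class_counts[key[1]]  # key[1:] = (label, attribute, range)
        for key, n in counts.items()
        if key[0] == "range"
    }
    return priors, conditionals


# Toy data (hypothetical): four real accounts (0) and one non-real account (1).
toy = (
    [(0, {"a": "0.05<a<0.2", "c": "c=1"})] * 3
    + [(0, {"a": "a<=0.05", "c": "c=0"})]
    + [(1, {"a": "a<=0.05", "c": "c=0"})]
)
priors, conditionals = train(toy)
print(priors)
print(conditionals[(0, "a", "0.05<a<0.2")])
```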

Step 104: the Reduce program uses the classifier to classify the format-converted documents to be classified and obtains the classification results of the documents to be classified.

Specifically, the Reduce program may obtain the value ranges of all feature attributes of the format-converted document to be classified; calculate the conditional probability that the document belongs to each category from the obtained value ranges, the frequency of occurrence of each category in the training documents, and the conditional probability estimates of the value ranges of all feature attributes under each category; and record the category with the largest conditional probability on HDFS as the classification result of the document to be classified.
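The computation described here matches the decision rule of a naive Bayes classifier. The text does not use that name, so this reading is an interpretation. Writing $x = (x_1, \dots, x_n)$ for the value ranges observed for the $n$ feature attributes of a document, and assuming the attributes are conditionally independent given the category $C$, the rule can be written as:

```latex
\hat{C} \;=\; \arg\max_{C} \; P(C)\, P(x \mid C)
        \;=\; \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(x_i \mid C)
```

The product $P(C)\,P(x \mid C)$ is proportional to the posterior $P(C \mid x)$, because the normalizing factor $P(x)$ is the same for every category; comparing the products is therefore enough to pick the most probable category.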

For example, if the value ranges of the three feature attributes of a document to be classified are 0.05 < a < 0.2, 0.1 < b < 0.8 and c = 0, then the conditional probability that the document belongs to a real account (i.e., C = 0) is:

P(C = 0) P(x | C = 0)

= P(C = 0) P(0.05 < a < 0.2 | C = 0) P(0.1 < b < 0.8 | C = 0) P(c = 0 | C = 0)

= 0.89 * 0.5 * 0.7 * 0.2

= 0.0623;

The conditional probability that the document to be classified belongs to a non-real account (i.e., C = 1) is:

P(C = 1) P(x | C = 1)

= P(C = 1) P(0.05 < a < 0.2 | C = 1) P(0.1 < b < 0.8 | C = 1) P(c = 0 | C = 1)

= 0.11 * 0.1 * 0.2 * 0.9

= 0.00198

Since the conditional probability that the document belongs to a real account is the larger of the two, the Reduce program determines that the document to be classified belongs to a real account.
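The worked example can be reproduced with a few lines of Python. The probabilities are the ones listed in the example; the dictionary layout, the string labels for the value ranges and the function names are assumptions for illustration, and the code runs locally rather than inside a Reduce task.

```python
import math

# Frequency of occurrence of each category (0 = real account, 1 = non-real account).
PRIORS = {0: 0.89, 1: 0.11}

# Conditional probability estimate of each value range under each category.
CONDITIONALS = {
    0: {"a<=0.05": 0.3, "0.05<a<0.2": 0.5, "a>=0.2": 0.2,
        "b<=0.1": 0.1, "0.1<b<0.8": 0.7, "b>=0.8": 0.2,
        "c=0": 0.2, "c=1": 0.8},
    1: {"a<=0.05": 0.8, "0.05<a<0.2": 0.1, "a>=0.2": 0.1,
        "b<=0.1": 0.7, "0.1<b<0.8": 0.2, "b>=0.8": 0.1,
        "c=0": 0.9, "c=1": 0.1},
}


def score(label: int, ranges: list[str]) -> float:
    """Return P(C) multiplied by P(range | C) for every observed value range."""
    return PRIORS[label] * math.prod(CONDITIONALS[label][r] for r in ranges)


def classify(ranges: list[str]) -> int:
    """Return the category whose score is largest."""
    return max(PRIORS, key=lambda label: score(label, ranges))


document = ["0.05<a<0.2", "0.1<b<0.8", "c=0"]
for label in PRIORS:
    print(label, score(label, document))
print("classified as", classify(document))
```

With many feature attributes the product of probabilities can underflow; summing logarithms instead of multiplying is a common remedy and leaves the chosen category unchanged.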

The embodiment of the invention makes full use of the distributed nature of the Hadoop cluster and avoids the limitations of traditional system frameworks. Because processing is parallel, massive numbers of documents can be classified quickly, which shortens classification time and improves both classification efficiency and system performance.

Based on the document classification method described above, an embodiment of the invention proposes a document classification system applied to a Hadoop cluster. As shown in Fig. 2, the system comprises:

A parsing module 210, configured to parse the training documents and the documents to be classified, determine feature attributes according to the parsing results, and divide the feature attributes;

Specifically, the parsing module 210 is configured to parse the training documents and the documents to be classified to obtain the attributes they contain, select feature attributes from the parsed attributes, and divide each feature attribute into multiple value ranges.

A generation module 220, configured to generate a classifier according to the feature attributes of the training documents determined by the parsing module 210 and the classification results of the training documents;

A classification module 230, configured to use the classifier generated by the generation module 220 to classify the documents to be classified and obtain the classification results of the documents to be classified.

Further, the system also comprises:

A conversion module 240, configured to perform format conversion on the training documents and the documents to be classified according to the feature attributes determined by the parsing module 210, obtaining training documents and documents to be classified that conform to the preset format;

Correspondingly, the generation module 220 is specifically configured to generate a classifier according to the feature attributes of the training documents format-converted by the conversion module 240 and the classification results of the training documents;

The classification module 230 is specifically configured to use the classifier generated by the generation module 220 to classify the documents to be classified that were format-converted by the conversion module 240, and to obtain the classification results of the documents to be classified.

Further, the generation module 220 is specifically configured to calculate, according to the value ranges of the feature attributes of the training documents format-converted by the conversion module 240 and the classification results of the training documents, the frequency of occurrence of each category in the training documents and the conditional probability estimate of each value range of every feature attribute under each category, and to record these frequencies and conditional probability estimates as the classifier.

Correspondingly, the classification module 230 is specifically configured to obtain the value ranges of all feature attributes of the document to be classified that was format-converted by the conversion module 240; to calculate the conditional probability that the document belongs to each category from the obtained value ranges, the frequency of occurrence of each category in the training documents, and the conditional probability estimates of the value ranges of all feature attributes under each category; and to take the category with the largest conditional probability as the classification result of the document to be classified.
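How the four modules feed one another can be sketched as a small pipeline object. The module numbers come from Fig. 2 as described; the class name, the method signature and the toy stand-ins passed in at the end (including a majority-class "classifier") are hypothetical and only demonstrate the data flow, not the classifier of the embodiment.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class DocumentClassificationSystem:
    parse: Callable[[Any], Any]          # parsing module 210
    convert: Callable[[Any], Any]        # conversion module 240
    generate: Callable[[list], Any]      # generation module 220
    classify: Callable[[Any, Any], Any]  # classification module 230

    def run(self, training: list, unlabeled: list) -> list:
        """Parse and convert every document, build the classifier, then classify."""
        prepared = [(self.convert(self.parse(doc)), label) for doc, label in training]
        classifier = self.generate(prepared)
        return [self.classify(classifier, self.convert(self.parse(doc)))
                for doc in unlabeled]


# Toy stand-ins: the "classifier" is just the most frequent training label.
system = DocumentClassificationSystem(
    parse=str.split,
    convert=tuple,
    generate=lambda prepared: Counter(label for _, label in prepared).most_common(1)[0][0],
    classify=lambda classifier, features: classifier,
)
print(system.run([("a b", "real"), ("c", "real"), ("d", "fake")], ["e f"]))
```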

The steps of the methods described in connection with the embodiments disclosed herein may be implemented directly in hardware, in software modules executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The above are only specific embodiments of the invention, but the protection scope of the invention is not limited to them. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the protection scope of the invention. Therefore, the protection scope of the invention shall be determined by the protection scope of the claims.

Claims

1. A document classification method, characterized in that it is applied to a Hadoop cluster that includes a Map program and a Reduce program, the method comprising the following steps:

The Map program parses the training documents and the documents to be classified, determines feature attributes according to the parsing results, and divides the feature attributes;

The Map program generates a classifier according to the feature attributes of the training documents and the classification results of the training documents;

The Reduce program uses the classifier to classify the documents to be classified and obtains the classification results of the documents to be classified.

2. The method according to claim 1, characterized in that, after the Map program determines the feature attributes according to the parsing results, the method further comprises:

The Map program performs format conversion on the training documents and the documents to be classified according to the feature attributes, obtaining training documents and documents to be classified that conform to a preset format;

The step in which the Map program generates a classifier according to the feature attributes of the training documents and the classification results of the training documents is specifically:

The Map program generates a classifier according to the feature attributes of the format-converted training documents and the classification results of the training documents;

The step in which the Reduce program uses the classifier to classify the documents to be classified and obtains their classification results is specifically:

The Reduce program uses the classifier to classify the format-converted documents to be classified and obtains the classification results of the documents to be classified.

3. The method according to claim 2, wherein the step in which the Map program generates a classifier according to the feature attributes of the format-converted training documents and the classification results of the training documents is specifically:

The Map program calculates, according to the value ranges of the feature attributes of the format-converted training documents and the classification results of the training documents, the frequency of occurrence of each category in the training documents and the conditional probability estimate of each value range of every feature attribute under each category, and records the frequencies of occurrence and the conditional probability estimates as the classifier.

4. The method according to claim 3, wherein the step in which the Reduce program uses the classifier to classify the format-converted documents to be classified and obtains their classification results is specifically:

The Reduce program obtains the value ranges of all feature attributes of the format-converted document to be classified; calculates the conditional probability that the document belongs to each category from the obtained value ranges, the frequency of occurrence of each category in the training documents, and the conditional probability estimates of the value ranges of all feature attributes under each category; and takes the category with the largest conditional probability as the classification result of the document to be classified.

5. The method according to claim 1, wherein the step in which the Map program parses the training documents and the documents to be classified, determines feature attributes according to the parsing results, and divides the feature attributes is specifically:

The Map program parses the training documents and the documents to be classified to obtain the attributes they contain, selects feature attributes from the parsed attributes, and divides each feature attribute into multiple value ranges.

6. A document classification system, characterized in that it is applied to a Hadoop cluster, the system comprising:

A parsing module, configured to parse the training documents and the documents to be classified, determine feature attributes according to the parsing results, and divide the feature attributes;

A generation module, configured to generate a classifier according to the feature attributes of the training documents determined by the parsing module and the classification results of the training documents;

A classification module, configured to use the classifier generated by the generation module to classify the documents to be classified and obtain the classification results of the documents to be classified.

7. The system according to claim 6, further comprising:

A conversion module, configured to perform format conversion on the training documents and the documents to be classified according to the feature attributes determined by the parsing module, obtaining training documents and documents to be classified that conform to a preset format;

The generation module is specifically configured to generate a classifier according to the feature attributes of the training documents format-converted by the conversion module and the classification results of the training documents;

The classification module is specifically configured to use the classifier generated by the generation module to classify the documents to be classified that were format-converted by the conversion module, and to obtain the classification results of the documents to be classified.

8. The system according to claim 7, wherein:

The generation module is specifically configured to calculate, according to the value ranges of the feature attributes of the training documents format-converted by the conversion module and the classification results of the training documents, the frequency of occurrence of each category in the training documents and the conditional probability estimate of each value range of every feature attribute under each category, and to record the frequencies of occurrence and the conditional probability estimates as the classifier.

9. The system according to claim 8, wherein:

The classification module is specifically configured to obtain the value ranges of all feature attributes of the document to be classified that was format-converted by the conversion module; to calculate the conditional probability that the document belongs to each category from the obtained value ranges, the frequency of occurrence of each category in the training documents, and the conditional probability estimates of the value ranges of all feature attributes under each category; and to take the category with the largest conditional probability as the classification result of the document to be classified.

10. The system according to claim 6, wherein:

The parsing module is specifically configured to parse the training documents and the documents to be classified to obtain the attributes they contain, select feature attributes from the parsed attributes, and divide each feature attribute into multiple value ranges.