KR20110057421A

KR20110057421A - Apparatus and method for classifying documents in a single class category

Info

Publication number: KR20110057421A
Application number: KR1020090113822A
Authority: KR
Inventors: 김현기; 임수종; 황이규; 윤여찬; 허정; 오효정; 이충희; 최미란; 이창기; 장명길
Original assignee: 한국전자통신연구원
Priority date: 2009-11-24
Filing date: 2009-11-24
Publication date: 2011-06-01

Abstract

The present invention relates to a technique for classifying documents in a single class category. Important features are selected by indexing a training document set; association rule candidates are generated incrementally, by depth-first or breadth-first search, from a matrix built from those features; an association rule learning model is generated from the set of candidates; and documents are then classified using that model. According to the invention, applying association rule mining to document classification and using the extracted association rules as classification features can improve classification accuracy.

Description

Apparatus and method for classifying documents in a single class category {APPARATUS AND METHOD FOR CLASSIFICATING DOCUMENT OF SINGLE CALSS CATEGORY}

The present invention relates to a technique for performing document classification, and more particularly to an apparatus and method for classifying documents in a single class category that are suited to improving classification accuracy by applying association rule mining to document classification and using the extracted association rules as classification features.

The present invention is derived from research conducted as part of the IT Growth Engine Technology Development Program of the Ministry of Knowledge Economy [Project management number: 2008-S-020-02, Project title: WebQA technology development].

In general, document classification starts from a finite set of document categories C = {c₁, c₂, …, c_n} and a training document set in which each training document is expressed as a pair (d_i, c_i). Document d_i is represented as a vector of m term attributes, d_i = <w₁, w₂, …, w_m>, and belongs to class c_i. The training document set is the set S = {(d₁, c₁), (d₂, c₂), …, (d_m, c_m)}, in which every training document is labeled with the category to be predicted. The goal of document classification is to find classification rules from the training document set S so that newly input documents are classified with high accuracy.

Widely used document classification methods include the Naive Bayes classifier, the support vector machine (SVM), and the maximum entropy (ME) classifier.

The Naive Bayes classifier is a probabilistic classification method based on Bayes' theorem, a theorem on conditional probability, and on the Naive Bayes independence assumption. Under Bayes' theorem, high classification accuracy is obtained by assigning document d to the document category class c_i with the highest posterior probability P(c_i|d). The independence assumption states that, given the class category to which a document belongs, the probabilities of its words are independent of one another, which allows P(W_j|c_i) to be computed quickly. Although its accuracy is somewhat lower than that of other classification methods, the Naive Bayes classifier is widely used because it is simple to implement and fast to train.

The support vector machine is a method used for statistical classification and regression analysis, and is one way of raising the accuracy of document classification. Given a training document set, it searches the many candidate planes that can separate the document category classes for the maximum-margin hyperplane and uses that hyperplane for classification. ME classifiers and conditional random field (CRF) classifiers are also used.

More recently, research has also been under way on applying association rule mining (also called association rule learning) to document classification. Association rule mining is a data mining technique that systematically and automatically finds statistical rules or patterns in large stored data sets.

Given a transaction database and a user-specified minimum support and minimum confidence, association rule mining finds the itemsets that occur frequently. For example, supermarket shoppers who buy bread and milk often also buy butter, and association rule mining can be used to extract such a {bread, milk} -> butter pattern from a large transaction database.

In association rule mining, given a set of distinct items, a transaction T = (TID, X) consists of an identifier TID and an itemset X.

A transaction database is a set of transactions, and an itemset with k elements is called a k-itemset. The support of itemset X, written support(X), is the number of transactions in the database that contain X. X is said to be frequent when its support is higher than the user-specified minimum support. The confidence that Y also occurs when X occurs, confidence(X -> Y), is the conditional probability that a transaction containing X also contains Y.
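The definitions above can be illustrated with a minimal sketch. The toy transaction database and the function names `support`, `confidence`, and `is_frequent` are hypothetical and are not taken from the patent; only the definitions themselves follow the text.

```python
# Hypothetical toy transaction database: TID -> itemset.
# Item names follow the {bread, milk} -> butter example in the text.
transactions = {
    1: {"bread", "milk", "butter"},
    2: {"bread", "milk"},
    3: {"bread", "milk", "butter", "jam"},
    4: {"milk", "butter"},
    5: {"bread", "jam"},
}


def support(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in db.values() if itemset <= items)


def confidence(x, y, db):
    """Estimate of P(Y | X): support(X union Y) / support(X)."""
    sx = support(x, db)
    return support(x | y, db) / sx if sx else 0.0


def is_frequent(itemset, db, min_support):
    """True when the support is higher than the user-specified minimum."""
    return support(itemset, db) > min_support


rule_conf = confidence({"bread", "milk"}, {"butter"}, transactions)
```

Here {bread, milk} appears in three transactions and {bread, milk, butter} in two, so the rule {bread, milk} -> butter has confidence 2/3.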

Besides support and confidence, various other measures for association rules have been proposed, such as lift, conviction, and chi-square (χ²). As a way of automatically classifying multi-category documents with a large number of association classification rules extracted by such mining techniques, association rules have been used to classify, category by category, document sets that carry many category labels and multiple category labels per document.

In that work, FP-growth (frequent pattern growth) was the association rule mining algorithm applied to classify multi-class category documents: the rules were mined using a frequent pattern tree and stored in a prefix tree.

The prior-art document classification approach operating as described above addresses only documents of multi-class categories; no approach has been proposed for classifying documents that belong to a single class category.

Association rule mining could simply be applied to single-class-category documents in the same way as to multi-class-category documents, but applying it naively to document classification raises two problems. First, if the minimum support is set too low, an extremely large number of itemsets is generated. Second, if the minimum support is set too high, itemsets that occur infrequently but are important are lost.

Accordingly, the present invention provides an apparatus and method for classifying documents in a single class category that can classify documents accurately by applying association rule mining to document classification and using the extracted association rules as classification features.

The present invention also provides an apparatus and method for classifying documents in a single class category that newly define class support (Class support) and class all-confidence (Class all confidence) for evaluating the interrelation of association rules, generate important and frequent itemsets during association rule learning to build a learning model for document classification, and, when a new document is input for classification, classify it using the association rules.

An apparatus for classifying documents in a single class category according to an embodiment of the present invention includes: an association rule learning unit that selects features from a training document set, generates a matrix from the selected features, generates association rule candidates from the matrix by depth-first or breadth-first search, and generates an association rule learning model from the generated candidates; and a document class category classification unit that classifies the documents of an input document set using the generated association rule learning model.

The association rule learning unit includes an indexing and feature selection preprocessing unit, which selects features by indexing the training document set and generates a horizontal feature matrix from the selected features, and a data layout conversion unit, which converts the horizontal feature matrix into a vertical feature matrix.

The association rule learning unit starts from 1-itemsets, each having a single element, and incrementally extracts itemsets longer than the previous ones by depth-first or breadth-first search.

The association rule learning unit also calculates class support, which is the probability that a specific itemset belonging to a document class is drawn at random; calculates class all-confidence, which is the minimum confidence for that itemset in the document class; identifies the minimum class support and the minimum class all-confidence with respect to the calculated values; generates the itemsets whose class support and class all-confidence are higher than those minimums; and repeats this itemset generation procedure until no further itemsets are generated, at which point the final itemsets constitute the association rule learning model.

The document class category classification unit includes: a preprocessing unit that indexes the set of documents to be classified, selects features, and generates a matrix; a document itemset generation unit that generates, from the matrix, candidate itemsets composed of non-duplicate elements, selects from the candidates first itemsets containing the largest number of not-yet-discovered items, selects from the first itemsets second itemsets having the greatest length, and selects from the second itemsets a third itemset having the highest class support; and a class category classification unit that, based on the third itemset, classifies the input document into the document class category to which it most probably belongs.

A method of classifying documents in a single class category according to an embodiment of the present invention includes: selecting features by indexing a training document set; generating a matrix from the selected features; incrementally generating association rule candidates from the matrix by depth-first or breadth-first search; generating an association rule learning model from the generated candidates; and classifying documents using the generated association rule learning model.

Generating the matrix includes generating a horizontal feature matrix from the selected features and converting the horizontal feature matrix into a vertical feature matrix.

Incrementally generating the association rule candidates starts from 1-itemsets, each having a single element, and incrementally extracts itemsets longer than the previous ones by depth-first or breadth-first search.

Generating the association rule learning model includes: a first step of calculating class support, which is the probability that a specific itemset belonging to a document class is drawn at random; a second step of calculating class all-confidence, which is the minimum confidence for that itemset in the document class; a third step of identifying the minimum class support and the minimum class all-confidence with respect to the calculated values; a fourth step of generating the itemsets whose class support and class all-confidence are higher than those minimums; and a step of repeating the first to fourth steps until no further itemsets are generated, at which point the final itemsets constitute the association rule learning model.

Classifying the documents includes: indexing the set of documents to be classified, selecting features, and generating a matrix; generating, from the matrix, candidate itemsets composed of non-duplicate elements; selecting from the candidates first itemsets containing the largest number of not-yet-discovered items; selecting from the first itemsets second itemsets having the greatest length; selecting from the second itemsets a third itemset having the highest class support; and, based on the third itemset, classifying the input document into the document class category to which it most probably belongs.
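One possible reading of the selection steps above is sketched below. It is an interpretation, not the patented procedure. The text does not say whether selection is repeated or how the class probability is computed from the chosen itemsets, so this sketch assumes a greedy loop and scores each class by the sum of class supports. The function name `classify` and the model layout (itemset mapped to per-class class support) are hypothetical.

```python
def classify(doc_terms, model):
    """Pick a class for a document given a learned itemset model.

    model: frozenset(itemset) -> {class_label: class_support}
    Assumed, not stated in the text: greedy repetition and sum-of-supports scoring.
    """
    # Candidate itemsets: learned itemsets fully contained in the document.
    candidates = [x for x in model if x <= doc_terms]
    covered, chosen = set(), []
    while True:
        # Three-stage preference encoded as a lexicographic key:
        # most not-yet-discovered items, then longest, then highest class support.
        best = max(
            candidates,
            key=lambda x: (len(x - covered), len(x), max(model[x].values())),
            default=None,
        )
        if best is None or not (best - covered):
            break  # nothing left that covers a new item
        chosen.append(best)
        covered |= best
    scores = {}
    for x in chosen:
        for cls, sup in model[x].items():
            scores[cls] = scores.get(cls, 0.0) + sup
    return max(scores, key=scores.get) if scores else None


# Hypothetical learned model and input documents.
model = {
    frozenset({"cheap"}): {"spam": 0.3, "ham": 0.1},
    frozenset({"cheap", "loans"}): {"spam": 0.4},
    frozenset({"meeting"}): {"ham": 0.5},
}
label_a = classify({"cheap", "loans"}, model)
label_b = classify({"cheap", "loans", "meeting"}, model)
label_c = classify({"unknown"}, model)
```

In the second call the itemset {cheap, loans} is chosen first because it covers two new items, then {meeting}; {cheap} is never chosen because it adds no undiscovered item.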

The apparatus and method for classifying documents in a single class category according to the embodiments of the present invention described above provide one or more of the following effects.

According to the apparatus and method of the embodiments, when documents belonging to a single class category are classified, classification accuracy can be improved by using the association rules obtained through association rule mining as classification features.

In addition, the association rule data built using class support and class all-confidence can serve as important classification features beyond the bag of words or morphological analysis results used in existing document classification methods. Because training takes far less time than with a conventional support vector machine, the approach can also be used efficiently for classifying large volumes of data.

The advantages and features of the present invention, and the methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only so that the disclosure of the present invention is complete and so that a person of ordinary skill in the art to which the invention belongs is fully informed of its scope; the invention is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

In describing the embodiments of the present invention, detailed descriptions of well-known functions or configurations are omitted where they would unnecessarily obscure the gist of the invention. The terms used below are defined in view of their functions in the embodiments and may vary according to the intention or practice of users and operators. Their definitions should therefore be based on the content of this specification as a whole.

Combinations of the blocks of the accompanying block diagram and of the steps of the flowchart may be carried out by computer program instructions. These instructions may be loaded onto a processor of a general-purpose computer, special-purpose computer, or other programmable data processing equipment, so that the instructions executed by that processor create means for performing the functions described in each block of the block diagram or each step of the flowchart. The instructions may also be stored in a computer-usable or computer-readable memory that can direct a computer or other programmable data processing equipment to implement a function in a particular manner, so that the stored instructions produce an article of manufacture containing instruction means that perform the functions described in each block or step. The instructions may also be loaded onto a computer or other programmable data processing equipment so that a series of operational steps is performed on it to create a computer-executed process, in which case the instructions that run on the equipment provide steps for executing the functions described in each block of the block diagram and each step of the flowchart.

Each block or step may represent a module, segment, or portion of code that includes one or more executable instructions for performing the specified logical function(s). It should also be noted that in some alternative embodiments the functions mentioned in the blocks or steps may occur out of order. For example, two blocks or steps shown in succession may in fact be performed substantially simultaneously, or may sometimes be performed in reverse order, depending on the functions involved.

An embodiment of the present invention newly defines class support and class all-confidence for evaluating the interrelation of association rules, generates important and frequent itemsets during association rule learning to build a learning model for document classification, and classifies a new document using the association rules when it is input for classification.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram showing the structure of a single-category document classification apparatus based on association rule learning according to an embodiment of the present invention.

Referring to FIG. 1, the single-category document classification apparatus includes an association rule learning unit (100), which performs the preprocessing procedure for generating an association rule learning model, and a document class category classification unit (150), which classifies new documents using the association rule learning model (130) generated by the association rule learning unit (100). The association rule learning unit (100) includes a term indexing and feature selection preprocessing unit (102), a data layout conversion unit (106), a candidate k-itemset generation unit (110), and a high-frequency/important k-itemset extraction unit (112). The document class category classification unit (150) includes a term indexing and feature selection preprocessing unit (152), a document itemset generation unit (154), and a class category classification unit (156).

Specifically, as the preprocessing step for generating the association rule learning model, the term indexing and feature selection preprocessing unit (102) in the association rule learning unit (100) first receives the training document set (120). The training document set (120) consists of the documents to be used for machine learning. Like ordinary text or e-mail, each document is composed of words or phrases, and the basic feature is the word.

The term indexing and feature selection preprocessing unit (102) indexes each training document in the input training document set (120) and extracts a term vector in bag-of-words form. Because the extracted term vectors form a large sparse matrix, feature selection is performed to remove unimportant terms.

The term indexing and feature selection preprocessing unit (102) thus receives the training document set (120) and outputs an m×n term-document matrix, that is, the horizontal feature matrix (104), in which each document d_i is a vector:

d_i = <w_i1, w_i2, …, w_im>    (Equation 1)

In the document vector of Equation 1 above, w_ij denotes the weight of the j-th term of document d_i.

The horizontal feature matrix (104) is produced as the preprocessing result of the term indexing and feature selection preprocessing unit (102). It is an m×n term-document matrix in which the m distinct terms form the columns and the n documents form the rows; that is, each row is made up of the terms contained in the corresponding document.
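The preprocessing described above can be sketched as follows. The patent does not state the tokenizer, the term weighting, or the feature selection criterion, so whitespace tokenization, raw term frequency, and a document-frequency cutoff (`min_df`) are assumptions made only for illustration; all function names are hypothetical.

```python
from collections import Counter


def index_document(text):
    """Bag-of-words term vector (term -> frequency); whitespace tokenization is assumed."""
    return Counter(text.lower().split())


def select_features(term_vectors, min_df=2):
    """Keep terms that occur in at least `min_df` documents (an assumed criterion)."""
    df = Counter()
    for vec in term_vectors:
        df.update(vec.keys())
    return sorted(t for t, n in df.items() if n >= min_df)


def horizontal_matrix(term_vectors, vocabulary):
    """One row per document, one column per selected term."""
    return [[vec.get(t, 0) for t in vocabulary] for vec in term_vectors]


# Hypothetical training documents.
docs = ["cheap loans apply now", "apply for cheap flights", "meeting notes for monday"]
vectors = [index_document(d) for d in docs]
vocab = select_features(vectors, min_df=2)
matrix = horizontal_matrix(vectors, vocab)
```

With these three documents only "apply", "cheap", and "for" survive the cutoff, so the matrix has three rows (documents) and three columns (terms).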

The data layout conversion unit (106) generates the vertical feature matrix (108) by swapping the rows and columns of the horizontal feature matrix (104), i.e., the m×n term-document matrix.
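A minimal sketch of this layout conversion follows. The plain transpose matches the description directly. The second form, which stores for each term the set of document identifiers in which it occurs, is an assumption suggested by the later use of transaction identifier intersections; the name `to_vertical` and the sample data are hypothetical.

```python
def to_vertical(horizontal, vocabulary):
    """Map each term to the set of document ids (row indices) where its weight is nonzero."""
    vertical = {term: set() for term in vocabulary}
    for doc_id, row in enumerate(horizontal):
        for term, weight in zip(vocabulary, row):
            if weight:
                vertical[term].add(doc_id)
    return vertical


# Hypothetical horizontal matrix: rows are documents, columns are terms.
vocabulary = ["apply", "cheap", "for"]
horizontal = [[1, 1, 0], [1, 1, 1], [0, 0, 1]]

# Literal row/column swap.
transposed = [list(col) for col in zip(*horizontal)]

# Term -> document id set (assumed tid-list representation).
vertical = to_vertical(horizontal, vocabulary)
```

The tid-set form makes the support of an itemset the size of a set intersection, which is why vertical layouts are commonly paired with depth-first itemset mining.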

The candidate k-itemset generation unit (110) receives the vertical feature matrix (108) and, using class support and class all-confidence, generates important and frequent k-itemsets, starting from 1-itemsets and proceeding by breadth-first search or depth-first search.

For document classification, the transaction database D of association rule mining can be regarded as the document collection, with each transaction T mapped to a document instance. To use association rule learning as a feature selection method for document classification, each itemset must have a statistical weight for each document class c_i. The embodiment of the present invention uses class support for this purpose.

The class support of itemset X is the proportion of all transactions D made up of transactions that belong to class category c_i and contain X, and is given by Equation 2 below.

Class support is also the probability that an itemset X belonging to document class c_i is drawn at random, and the sum of the class supports of X over all classes equals the support of X in the entire transaction set D.
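Equation 2 is not reproduced in this text, so the sketch below implements class support directly from the verbal definition: the number of transactions that are labeled c_i and contain X, divided by the total number of transactions. The function names and the labeled toy database are hypothetical.

```python
def class_support(itemset, cls, db):
    """Fraction of ALL transactions that contain `itemset` and are labeled `cls`.

    db: list of (items, class_label) pairs; one label per document is assumed.
    """
    hits = sum(1 for items, label in db if label == cls and itemset <= items)
    return hits / len(db)


def relative_support(itemset, db):
    """Fraction of all transactions that contain `itemset`, regardless of class."""
    return sum(1 for items, _ in db if itemset <= items) / len(db)


# Hypothetical labeled transaction database.
db = [
    ({"a", "b"}, "spam"),
    ({"a", "b", "c"}, "spam"),
    ({"a"}, "ham"),
    ({"a", "b"}, "ham"),
    ({"c"}, "ham"),
]
x = {"a", "b"}
per_class = {cls: class_support(x, cls, db) for cls in ("spam", "ham")}
```

For X = {a, b} the class supports are 2/5 for "spam" and 1/5 for "ham", and their sum equals the overall relative support 3/5, as the text states.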

The class all-confidence (Class all confidence) of itemset X is the minimum confidence in document class c_i. That is, every association rule generated from itemset X has a confidence greater than or equal to the class all-confidence.

In <Equation 3>, the denominator is the maximum number of transactions in the entire transaction database D that contain an element of the power set of item set X. This maximum occurs for a subset consisting of a single element. Therefore, the transaction count is not computed for every element of the power set of X, but only for the one-item subsets of X.
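A minimal Python sketch of this calculation follows. It assumes that the numerator is the number of transactions of class c_i that contain X and that the denominator is the largest single-item transaction count in D, as the paragraph above describes; the function name `class_all_confidence` and the data set are hypothetical.

```python
from typing import FrozenSet, List, Tuple

Transaction = Tuple[str, FrozenSet[str]]  # (class label, items of the document)


def class_all_confidence(itemset: FrozenSet[str], label: str, db: List[Transaction]) -> float:
    """Class count of `itemset` divided by the largest count of any single item in it."""
    if not itemset:
        return 0.0
    # Numerator: transactions of class `label` that contain the whole item set.
    class_count = sum(1 for cls, items in db if cls == label and itemset <= items)
    # Denominator: only one-item subsets are examined, since they have the largest counts.
    max_single = max(sum(1 for _, items in db if item in items) for item in itemset)
    return class_count / max_single if max_single else 0.0


# Hypothetical data set, used only for illustration.
DOCS: List[Transaction] = [
    ("c1", frozenset({"a", "b"})),
    ("c1", frozenset({"a", "b", "c"})),
    ("c2", frozenset({"b", "c"})),
    ("c2", frozenset({"a", "c"})),
]

# {a, b} occurs in two c1 documents; a and b each occur in three documents overall.
print(class_all_confidence(frozenset({"a", "b"}), "c1", DOCS))
```

Restricting the denominator to single items avoids enumerating the full power set of X, which grows exponentially with the size of X.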

The high-frequency/important k-item set extraction unit 112 takes the input data in the form of the vertical feature matrix 108 and, starting from 1-item sets consisting of a single element, applies depth-first search to recursively extract the item sets whose values exceed the minimum class support and the minimum class all confidence. Each item set carries, as attributes, the transaction identifiers and class frequencies needed to compute its class support for each document class c_i.

From the 1-item sets, the high-frequency/important k-item set extraction unit 112 computes the intersection of the transaction identifiers for every 2-item candidate set and sets the class frequency to the number of elements in the intersection. After computing the class support and class all confidence for every document class c_i, it extracts the 2-item candidate sets whose values exceed the minimum class support and the minimum class all confidence.

Once all 2-item candidate sets have been extracted, the above procedure is repeated recursively to extract the 3-item sets. The procedure repeats until no further item sets are generated. The final item sets are output as the association rule learning model 130, which consists of a set F of item sets with class support.
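The recursive extraction described above can be sketched in Python as a depth-first search over intersections of transaction-identifier sets. This is a minimal illustration under stated assumptions, not the patented implementation: the name `mine_itemsets`, the data, and the threshold values are hypothetical, and the sketch assumes that an item set is kept when at least one class exceeds both thresholds.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Model = Dict[FrozenSet[str], Dict[str, float]]  # item set -> {class: class support}


def mine_itemsets(vertical: Dict[str, Set[int]], labels: Dict[int, str],
                  min_sup: float, min_conf: float) -> Model:
    """Depth-first growth of item sets from a vertical layout (item -> transaction ids)."""
    n = len(labels)
    single_count = {item: len(tids) for item, tids in vertical.items()}
    model: Model = {}

    def passing_classes(itemset: Tuple[str, ...], tids: Set[int]) -> Dict[str, float]:
        if not tids:
            return {}
        counts: Dict[str, int] = {}
        for tid in tids:  # class frequency = ids in the intersection, split by class
            counts[labels[tid]] = counts.get(labels[tid], 0) + 1
        denom = max(single_count[i] for i in itemset)  # largest one-item count
        return {c: k / n for c, k in counts.items()
                if k / n > min_sup and k / denom > min_conf}

    def extend(prefix: Tuple[str, ...], tids: Set[int], remaining: List[str]) -> None:
        for idx, item in enumerate(remaining):
            new_tids = tids & vertical[item]  # intersect transaction identifiers
            itemset = prefix + (item,)
            kept = passing_classes(itemset, new_tids)
            if kept:  # only surviving item sets are grown further
                model[frozenset(itemset)] = kept
                extend(itemset, new_tids, remaining[idx + 1:])

    extend((), set(labels), sorted(vertical))
    return model


# Hypothetical data: four documents, two classes.
LABELS = {1: "c1", 2: "c1", 3: "c2", 4: "c2"}
VERTICAL = {"a": {1, 2, 4}, "b": {1, 2, 3}, "c": {2, 3, 4}}
MODEL = mine_itemsets(VERTICAL, LABELS, min_sup=0.3, min_conf=0.5)
for itemset, per_class in MODEL.items():
    print(sorted(itemset), per_class)
```

Storing transaction identifiers per item means that the count of any larger item set is obtained by set intersection alone, without rescanning the documents.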

Meanwhile, when a classification document set 140 requiring classification is input, the document class category classification unit 150 uses the extracted association rule learning model 130 to classify each document into the document class c_i to which it most probably belongs.

Specifically, the term indexing and feature selection preprocessing unit 152 performs the same function as the term indexing and feature selection preprocessing unit 102 of the association rule learning unit 100, and outputs a vertical feature matrix for each document to be classified in the classification document set 140.

The document item set generation unit 154 receives the vertical feature matrix extracted from the document to be classified. The new document is represented as X = {l_1, l_2, …, l_n}, whose elements are distinct single items. For classification, a power set L composed of non-duplicated elements is generated from the set X. The condition for constructing the power set L incrementally is given by <Equation 4> below, where C denotes the set of items already discovered and is a subset of X.

Given a document class c_i, the set of discovered items C, and the candidate item sets L, the item sets L' containing the largest number of undiscovered items are selected; from L', the item sets L'' of greatest length are selected; and from L'', the item set L''' with the highest class support is selected.
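The three-stage selection can be sketched as a chain of filters. The function name `select_itemset`, the candidate item sets, and the class support values below are hypothetical; the sketch assumes the class supports for the fixed class c_i are already available from the learning model.

```python
from typing import Dict, FrozenSet, List, Optional, Set


def select_itemset(candidates: List[FrozenSet[str]], discovered: Set[str],
                   class_sup: Dict[FrozenSet[str], float]) -> Optional[FrozenSet[str]]:
    """Pick one candidate: most undiscovered items, then longest, then highest class support."""
    if not candidates:
        return None
    # Stage 1 (L'): keep candidates with the largest number of undiscovered items.
    most_new = max(len(c - discovered) for c in candidates)
    stage1 = [c for c in candidates if len(c - discovered) == most_new]
    # Stage 2 (L''): among those, keep the longest item sets.
    longest = max(len(c) for c in stage1)
    stage2 = [c for c in stage1 if len(c) == longest]
    # Stage 3 (L'''): among those, take the one with the highest class support.
    return max(stage2, key=lambda c: class_sup.get(c, 0.0))


# Hypothetical candidates and class supports for one fixed class c_i.
fs = frozenset
SUP = {fs("ab"): 0.5, fs("bc"): 0.3, fs("abc"): 0.2, fs("cd"): 0.4}
CANDIDATES = list(SUP)

print(sorted(select_itemset(CANDIDATES, {"a"}, SUP)))               # ['a', 'b', 'c']
print(sorted(select_itemset([fs("ab"), fs("bc")], set(), SUP)))     # ['a', 'b']
```

In the first call three candidates tie on two undiscovered items, and the longest of them wins; in the second call both candidates tie on the first two criteria, so class support decides.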

The class category classification unit 156 classifies the input document into the document class category c_i to which it most probably belongs.

FIG. 2 is a flowchart illustrating the operating procedure of the single-category document classification apparatus using association rule learning according to an embodiment of the present invention.

Referring to FIG. 2, in step 200 the term indexing and feature selection preprocessing unit 102 in the association rule learning unit 100 indexes the training documents of the input training document set 120, performs feature selection, and extracts term vectors as the horizontal feature matrix 104. The data layout conversion unit 106 then converts the horizontal feature matrix 104 into the vertical feature matrix 108.
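The layout conversion in step 200 can be sketched in Python, assuming the horizontal layout maps each document identifier to its selected features and the vertical layout maps each feature to the identifiers of the documents containing it. The function name `to_vertical` and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set


def to_vertical(horizontal: Dict[int, Set[str]]) -> Dict[str, Set[int]]:
    """Invert document -> features into feature -> document ids."""
    vertical: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, features in horizontal.items():
        for feature in features:
            vertical[feature].add(doc_id)
    return dict(vertical)


# Hypothetical horizontal feature matrix: one row per document.
HORIZONTAL = {1: {"a", "b"}, 2: {"a", "b", "c"}, 3: {"b", "c"}}
VERTICAL = to_vertical(HORIZONTAL)
print({feature: sorted(ids) for feature, ids in sorted(VERTICAL.items())})
# {'a': [1, 2], 'b': [1, 2, 3], 'c': [2, 3]}
```

With the vertical layout, the documents containing a multi-item set are found by intersecting the identifier sets of its items.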

In step 202, the candidate k-item set generation unit 110 receives the vertical feature matrix 108 and, using class support and class all confidence, generates important, high-frequency k-item sets from item sets consisting of a single item by breadth-first search or depth-first search.

In step 204, the high-frequency/important k-item set extraction unit 112 generates the item sets whose calculated class support and class all confidence are higher than the minimum class support and the minimum class all confidence. This procedure is repeated, and in step 206 the finally generated item sets are assembled into the association rule learning model 130.

Next, in step 208, the term indexing and feature selection preprocessing unit 152 in the document class category classification unit 150 indexes the input classification document set and selects features to generate a matrix. In step 210, the document item set generation unit 154 generates a power set of non-duplicated elements from the generated matrix and, constructing the power set incrementally, selectively extracts item sets longer than the previous item set.

Then, in step 212, the class category classification unit 156 uses the extracted item sets to classify the input document into the document class category to which it most probably belongs.
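The final decision in step 212 amounts to choosing the class with the highest score. The text does not state how the probability is computed, so the sketch below makes a loudly labeled assumption: each class is scored by summing the class supports of the learned item sets contained in the document. The function name `classify` and the model contents are hypothetical.

```python
from typing import Dict, FrozenSet, Optional, Set

Model = Dict[FrozenSet[str], Dict[str, float]]  # item set -> {class: class support}


def classify(doc_items: Set[str], model: Model) -> Optional[str]:
    """Return the class with the highest score, or None if no learned item set matches.

    ASSUMPTION: the score is the sum of class supports of matching item sets;
    the source only says the most probable class is chosen.
    """
    scores: Dict[str, float] = {}
    for itemset, per_class in model.items():
        if itemset <= doc_items:  # the learned item set occurs in the document
            for label, sup in per_class.items():
                scores[label] = scores.get(label, 0.0) + sup
    if not scores:
        return None
    # Sorting first makes ties resolve to the alphabetically first label.
    return max(sorted(scores), key=lambda label: scores[label])


# Hypothetical learning model.
fs = frozenset
MODEL: Model = {
    fs({"a"}): {"c1": 0.5},
    fs({"b"}): {"c1": 0.5},
    fs({"c"}): {"c2": 0.5},
    fs({"a", "b"}): {"c1": 0.5},
}

print(classify({"a", "b"}, MODEL))  # c1
print(classify({"c"}, MODEL))       # c2
```

Any other monotone scoring rule could be substituted for the sum without changing the argmax structure of the decision.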

FIGS. 3A and 3B illustrate how class support and class all confidence are calculated according to an embodiment of the present invention.

FIG. 3A shows an example document set in which ten transactions (T1 to T10) over five items A, B, C, D, and E belong to two classes, C1 and C2.

Transactions T1 to T5 belong to class C1 and transactions T6 to T10 belong to class C2. For example, T1 belongs to C1 and contains items A and B, and T2 belongs to C1 and contains items A, B, and C.

FIG. 3B shows the class support and class all confidence calculated for each of the classes C1 and C2 over every item set that can be formed from the five items A, B, C, D, and E.

FIG. 4 illustrates how association rules are extracted from a training document set according to an embodiment of the present invention.

Referring to FIG. 4, after the horizontal feature matrix 400 is converted into the vertical feature matrix 402, every 2-item set that can be formed from the 1-item sets is generated and its class support and class all confidence are calculated (404). The 2-item sets ab, bc, be, and ce that satisfy the minimum class support and the minimum class all confidence are then generated (406), and the 3-item set bce is generated recursively (408).

FIG. 5 illustrates a document classification scheme according to an embodiment of the present invention.

FIG. 5 shows an example in which a power set L is constructed from the index features {a, b, c} extracted from a test document X requiring classification, and the class category to which the document belongs is determined.
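For the index features {a, b, c}, the candidate item sets drawn from the power set can be enumerated as in the sketch below. The function name `nonempty_subsets` is hypothetical, and leaving out the empty set is an assumption made here because an empty item set carries no features.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List


def nonempty_subsets(items: Iterable[str]) -> List[FrozenSet[str]]:
    """All non-empty subsets of `items`, ordered from shortest to longest."""
    ordered = sorted(set(items))  # duplicates are dropped before enumeration
    return [frozenset(combo)
            for size in range(1, len(ordered) + 1)
            for combo in combinations(ordered, size)]


SUBSETS = nonempty_subsets({"a", "b", "c"})
print(["".join(sorted(s)) for s in SUBSETS])
# ['a', 'b', 'c', 'ab', 'ac', 'bc', 'abc']
```

A set of n features yields 2^n - 1 such subsets, which is why the description builds L incrementally rather than enumerating it in full for large documents.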

As described above, the apparatus and method for classifying documents in a single class category according to an embodiment of the present invention newly define class support and class all confidence for evaluating the interrelationship of association rules, generate important, high-frequency item sets during association rule learning to build a learning model for document classification, and, when a new document is input, classify it using the association rules.

Although specific embodiments have been described in the detailed description of the present invention, various modifications are of course possible without departing from the scope of the invention. Therefore, the scope of the present invention should not be limited to the described embodiments, but should be determined by the claims set forth below and their equivalents.

FIG. 1 is a block diagram showing the structure of a single-category document classification apparatus using association rule learning according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating the operating procedure of the single-category document classification apparatus using association rule learning according to an embodiment of the present invention;

FIGS. 3A and 3B illustrate how class support and class all confidence are calculated according to an embodiment of the present invention;

FIG. 4 illustrates how association rules are extracted from a training document set according to an embodiment of the present invention; and

FIG. 5 illustrates a document classification scheme according to an embodiment of the present invention.

<Description of Reference Numerals for Major Parts of the Drawings>

100: association rule learning unit
102: term indexing and feature selection preprocessing unit
104: horizontal feature matrix
106: data layout conversion unit
108: vertical feature matrix
110: candidate k-item set generation unit
112: high-frequency/important k-item set extraction unit
130: association rule learning model
150: document class category classification unit
152: term indexing and feature selection preprocessing unit
154: document item set generation unit
156: class category classification unit

Claims

1. An apparatus for classifying documents in a single class category, comprising:

an association rule learning unit that selects features from a training document set, generates a matrix from the selected features, incrementally generates association rule candidates from the matrix by depth-first search or breadth-first search, and generates an association rule learning model from the generated association rule candidates; and

a document class category classification unit that classifies the documents of an input document set using the generated association rule learning model.

2. The apparatus of claim 1, wherein the association rule learning unit comprises:

an indexing and feature selection preprocessing unit that indexes the training document set to select features and generates a horizontal feature matrix from the selected features; and

a data layout conversion unit that converts the horizontal feature matrix into a vertical feature matrix.

3. The apparatus of claim 1, wherein the association rule learning unit, starting from 1-item sets each having a single element, incrementally extracts item sets longer than the previous item set by depth-first search or breadth-first search.

4. The apparatus of claim 1, wherein the association rule learning unit:

calculates class support, which is the probability that a specific item set belonging to a document class is drawn at random;

calculates class all confidence, which is the minimum confidence for the specific item set in the document class;

checks the calculated class support and class all confidence against a minimum class support and a minimum class all confidence;

generates the item sets whose class support and class all confidence are higher than the minimum class support and the minimum class all confidence; and

repeats the generation of item sets and, when no further item set is generated, configures the final item sets as the association rule learning model.

5. The apparatus of claim 1, wherein the document class category classification unit comprises:

a preprocessing unit that indexes the classification document set, selects features, and generates a matrix;

a document item set generation unit that generates candidate item sets composed of non-duplicated elements from the generated matrix, selects from the candidate item sets a first item set having the largest number of undiscovered items as elements, selects from the first item set a second item set having the greatest length, and then selects from the second item set a third item set having the highest class support; and

a class category classification unit that, based on the third item set, classifies the input document into the document class category to which it most probably belongs.

6. A method for classifying documents in a single class category, comprising:

indexing a training document set and selecting features;

generating a matrix from the selected features;

incrementally generating association rule candidates from the generated matrix by depth-first search or breadth-first search;

generating an association rule learning model from the generated association rule candidates; and

classifying documents using the generated association rule learning model.

7. The method of claim 6, wherein generating the matrix comprises:

generating a horizontal feature matrix from the selected features; and

converting the horizontal feature matrix into a vertical feature matrix.

8. The method of claim 6, wherein incrementally generating the association rule candidates comprises, starting from 1-item sets each having a single element, extracting item sets longer than the previous item set by depth-first search or breadth-first search.

9. The method of claim 6, wherein generating the association rule learning model comprises:

a first process of calculating class support, which is the probability that a specific item set belonging to a document class is drawn at random;

a second process of calculating class all confidence, which is the minimum confidence for the specific item set in the document class;

a third process of checking the calculated class support and class all confidence against a minimum class support and a minimum class all confidence;

a fourth process of generating the item sets whose class support and class all confidence are higher than the minimum class support and the minimum class all confidence; and

a process of repeating the first to fourth processes and, when no further item set is generated, configuring the final item sets as the association rule learning model.

10. The method of claim 9, wherein classifying the documents comprises:

indexing a classification document set, selecting features, and generating a matrix;

generating candidate item sets composed of non-duplicated elements from the generated matrix;

selecting, from the candidate item sets, a first item set having the largest number of undiscovered items as elements;

selecting, from the first item set, a second item set having the greatest length;

selecting, from the second item set, a third item set having the highest class support; and

classifying, based on the third item set, the input document into the document class category to which it most probably belongs.