JPH08221447A

JPH08221447A - Document automatic classifier

Info

Publication number: JPH08221447A
Application number: JP7046564A
Authority: JP
Inventors: Makoto Hirota; 誠廣田; Shiro Ito; 史朗伊藤; Shogo Shibata; 昇吾柴田; Takanari Ueda; 隆也上田; Yuji Ikeda; 裕治池田; Minoru Fujita; 稔藤田
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 1995-02-10
Filing date: 1995-02-10
Publication date: 1996-08-30

Abstract

(57) [Abstract]

[Purpose] To retain as many words effective for classification (effective words) as possible, thereby raising the probability that a word contained in a document to be classified matches one of the stored effective words, while keeping the number of base words that serve as the axes of the vector space representing documents as small as possible, thereby reducing the processing cost in that vector space.

[Configuration] The effective word extraction unit 25 extracts as many effective words as possible from all the documents stored in the document database 24, in which a plurality of documents are stored pre-sorted into categories, and registers them in the effective word dictionary 26. The base word extraction unit 27 extracts, from the effective words registered in the effective word dictionary 26, as few base words as possible to serve as the axes of the vector space representing documents. Each effective word registered in the effective word dictionary 26 is given correlation information with the base words. Based on this correlation information, the vector expression unit 22 represents a document input for classification as a low-dimensional vector, and the identification determination unit 23 performs distance calculations and the like between documents in that vector space to determine which category the document belongs to.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to an automatic document classification apparatus that classifies input documents into given categories.

[0002]

[Prior Art] One known method of automatic document classification represents a document as a vector whose axes are a given finite set of words, and determines which of the given categories the document belongs to by, for example, calculating distances between documents in that vector space. In this method, the base words that serve as the axes of the vector representation are extracted from a group of documents that have already been stored separately by category.

[0003]

[Problems to Be Solved by the Invention] In this case, considering the cost of the various mathematical operations performed in the vector space, the dimension of the vector space, that is, the number of base words, should be small, so it is desirable to select as base words as few words as possible that are effective for classification. However, if the number of base words is small, the probability that a newly input document contains any of the base words is low; many words in the document become unknown words to the classification apparatus, and classification accuracy falls. Conversely, if the number of base words is increased to reduce unknown words, the processing cost in the vector space rises.

[0004] The present invention has been made against this background. Its object is to retain as many words effective for classification (effective words) as possible, thereby raising the probability that a word contained in a document to be classified matches one of the stored effective words, while keeping the number of base words that serve as the axes of the vector space representing documents as small as possible so that the processing cost in the vector space can be reduced.

[0005]

[Means for Solving the Problems] To achieve the above object, the invention according to claim 1 is an automatic document classification apparatus that represents a document as a vector whose axes are a finite number of words (base words) and determines into which category the document is classified, the apparatus comprising: a document database in which a plurality of documents are stored pre-sorted into the categories; effective word extraction means for extracting, from the documents stored in the document database, words that are effective for automatically classifying an input document as effective words; an effective word dictionary in which the effective words extracted by the effective word extraction means are registered; and base word extraction means for extracting, from the effective words registered in the effective word dictionary, base words that serve as the axes of the vector representation of a document.

[0006] To achieve the above object, in the invention according to claim 2, the effective word extraction means of claim 1 is configured to extract, as effective words, words whose frequency of appearance varies from category to category across the plurality of documents stored in the document database.

[0007] To achieve the above object, in the invention according to claim 3, the base word extraction means of claim 1 is configured to extract as base words, from the effective words registered in the effective word dictionary, words that have high effectiveness for classification and low correlation with the other base words.

[0008] To achieve the above object, in the invention according to claim 4, the effective words registered in the effective word dictionary of claim 1 are given correlation information with the base words.

[0009] To achieve the above object, in the invention according to claim 5, among the effective words registered in the effective word dictionary of claim 4, synonyms are given the same correlation information.

[0010]

[Operation] In the inventions according to claims 1 to 5, registering a sufficient number of effective words in the effective word dictionary raises the probability that a word contained in a document to be classified matches one of the registered effective words, so that as little as possible of the information carried by the document is lost. In addition, having the base word extraction means extract as few base words as possible from the effective words registered in the effective word dictionary allows the content of the document to be classified to be represented as a point in a low-dimensional vector space, which reduces the processing cost in the vector space.

[0011] In the invention according to claim 2, the effective word extraction means of claim 1 extracts, as effective words, words whose frequency of appearance varies from category to category across the plurality of documents stored in the document database, and thereby extracts words that characterize a category as effective words.

[0012] In the invention according to claim 3, the base word extraction means of claim 1 extracts as base words, from the effective words registered in the effective word dictionary, as few words as possible that have high effectiveness for classification and low correlation with the other base words, so that the content of a document can be represented as a point in a low-dimensional vector space.

[0013] In the invention according to claim 4, the effective words registered in the effective word dictionary are given correlation information with the base words, and this correlation information allows the information carried by the effective words to be reflected in the vector space whose axes are the base words.

[0014] In the invention according to claim 5, among the effective words registered in the effective word dictionary of claim 4, synonyms are given the same correlation information, so that synonyms are represented as the same vector.

[0015]

[Embodiment] An embodiment of the present invention is described in detail below with reference to the drawings.

[0016] FIG. 1 is a block diagram showing the hardware configuration of an automatic document classification apparatus according to an embodiment of the present invention. In FIG. 1, 11 is a central processing unit, 12 is a ROM, 13 is a RAM, and 14 is a disk device; these are connected by a bus 15 so that data and control signals can be transferred among them.

[0017] The ROM 12 is preloaded with a control program for performing a series of automatic document classification processes according to the control procedures shown in the flowcharts of FIGS. 3, 5, and 8. The central processing unit 11 performs automatic document classification according to this control program while using the RAM 13 as a work area and the like.

[0018] FIG. 2 is a functional block diagram showing the functions of the automatic document classification apparatus according to the embodiment of the present invention. In FIG. 2, 21 is a document classification unit, 22 is a vector expression unit, 23 is an identification determination unit, 24 is a document database, 25 is an effective word extraction unit, 26 is an effective word dictionary, 27 is a base word extraction unit, and 28 is a synonym dictionary.

[0019] The document database 24, the effective word dictionary 26, and the synonym dictionary 28 are built on the disk device 14 of FIG. 1. The document database 24 holds documents classified into categories $C_1, C_2, \ldots, C_M$ (M is the number of categories) given in advance by the user. The document classification unit 21 (the vector expression unit 22 and the identification determination unit 23), the effective word extraction unit 25, and the base word extraction unit 27 are implemented by the central processing unit 11, the ROM 12, and the RAM 13. The vector expression unit 22 of the document classification unit 21 represents the content of an input document as a vector whose axes are as few base words as possible, and the identification determination unit 23 performs, for example, distance calculations between documents in that vector space to identify and determine which of the categories $C_1, C_2, \ldots, C_M$ the document belongs to.

[0020] Next, the effective word extraction process performed by the effective word extraction unit 25 is described with reference to the flowchart shown in FIG. 3.

[0021] The effective word extraction unit 25 first performs morphological analysis on all the documents held in the document database 24 to extract words (step S301) and stores the extracted words in a list L (step S302). It then takes an arbitrary word W from the word list L (step S303) and calculates the effectiveness E(W) of this word W for document classification (step S304). Here the effectiveness is evaluated as follows. For each category $C_k$, the proportion $p_k(W)$ of documents containing the word W among the documents belonging to $C_k$ is calculated. These proportions are then normalized so that their sum over all categories is 1, as in Equation 1.

[0022]

[Equation 1]

$$\sum_{k=1}^{M} p_k(W) = 1$$

Examples of $p_k(W)$ are shown in FIGS. 4(a) and 4(b). When the proportion of documents containing the word W differs from category to category, as in FIG. 4(a), the word W is considered to characterize the categories in which that proportion is high and can be said to be effective for classification. Conversely, when the proportion of documents containing the word W does not differ between categories, as in FIG. 4(b), the word is considered not to be effective for classification. To evaluate this kind of bias in the distribution, the entropy H(W) is calculated, for example, as follows.

[0023]

[Equation 2]

Here H(W) takes a value from 0 to 1; the larger the bias of the distribution (the more effective the word is for classification), the smaller the value, and the smaller the bias (the less effective the word is for classification), the larger the value. The effectiveness E(W) is defined as E(W) = 1 − H(W).
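The calculation of E(W) described above can be sketched in Python. The formula for Equation 2 is not reproduced in this text, so the sketch assumes H(W) is the Shannon entropy of the normalized $p_k(W)$ divided by log M, which is one form that takes values from 0 to 1 as stated. The function name `effectiveness` and the handling of the degenerate cases are illustrative assumptions, not taken from the specification.

```python
import math


def effectiveness(doc_fractions):
    """Return E(W) = 1 - H(W) for one word.

    doc_fractions[k] is the fraction of documents in category k that
    contain the word (p_k(W) before normalization).
    """
    m = len(doc_fractions)
    total = sum(doc_fractions)
    if m < 2 or total == 0:
        # Assumption: with fewer than two categories, or a word that never
        # occurs, the word carries no information for classification.
        return 0.0
    # Normalize so that the p_k(W) sum to 1 over all categories.
    p = [x / total for x in doc_fractions]
    # Assumed form of H(W): Shannon entropy divided by log M (range 0..1).
    h = -sum(pk * math.log(pk) for pk in p if pk > 0) / math.log(m)
    return 1.0 - h
```

Under this assumption a word found only in one category scores 1, and a word spread evenly over all categories scores 0, matching the qualitative behaviour described for FIGS. 4(a) and 4(b).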

[0024] Once the effectiveness E(W) has been calculated in this way, it is determined whether E(W) is greater than a threshold h (step S305). If E(W) is greater than h, the word W is regarded as an effective word and registered in the effective word dictionary 26 (step S306). If E(W) is less than or equal to h, step S306 is skipped and the process proceeds to step S307, so that W is not regarded as an effective word and is not registered in the effective word dictionary 26. In this embodiment, the threshold h is set appropriately so that as many effective words as possible are extracted and registered in the effective word dictionary 26, within the range in which document classification can still be performed effectively.

[0025] When the above processing is finished, the word W is deleted from the word list L (step S307). It is then determined whether the word list L is empty (step S308). If it is not empty, the process returns to step S303 and the same processing is performed for the next word; if the word list L is empty, the effective word extraction process ends.
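The loop of steps S301 to S308 can be sketched as follows. This is a minimal illustration under stated assumptions: whitespace splitting stands in for morphological analysis, the entropy is assumed to be Shannon entropy normalized by log M, every category is assumed to hold at least one document, and the names `tokenize` and `extract_effective_words` are hypothetical.

```python
import math
from collections import defaultdict


def tokenize(text):
    # Stand-in for morphological analysis (assumption): whitespace split.
    return text.lower().split()


def extract_effective_words(docs_by_category, h):
    """Map each word whose effectiveness exceeds h to its E(W).

    docs_by_category maps a category name to a non-empty list of
    document strings.
    """
    categories = list(docs_by_category)
    m = len(categories)
    if m < 2:
        raise ValueError("at least two categories are required")
    # S301-S302: collect the words and count, per category, the documents
    # that contain each word.
    containing = defaultdict(lambda: [0] * m)
    for k, cat in enumerate(categories):
        for doc in docs_by_category[cat]:
            for word in set(tokenize(doc)):
                containing[word][k] += 1
    dictionary = {}
    for word, counts in containing.items():              # S303, S307, S308
        fractions = [counts[k] / len(docs_by_category[categories[k]])
                     for k in range(m)]
        total = sum(fractions)
        p = [x / total for x in fractions]
        entropy = -sum(pk * math.log(pk) for pk in p if pk > 0) / math.log(m)
        e = 1.0 - entropy                                 # S304
        if e > h:                                         # S305
            dictionary[word] = e                          # S306
    return dictionary
```

Lowering h admits more words into the dictionary, which corresponds to the embodiment's aim of registering as many effective words as the classification quality allows.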

[0026] Next, the base word extraction process is described. It is desirable that the base words each have high effectiveness for classification and, at the same time, form a combination with low mutual correlation. For example, suppose that "exchange rate", "record high", "business conditions", "financial engineering", "inflation", and so on are registered in the effective word dictionary 26 as effective words. Each of these can be said to characterize the category "economy" well. However, "exchange rate" and "record high" often appear in the same document, so adopting both as base words is redundant. It is better to select, say, "exchange rate", "financial engineering", and "inflation" as base words and to attach to "record high" in the effective word dictionary 26 its correlation information with "exchange rate".

[0027] The base word extraction process based on this idea is described with reference to the flowchart shown in FIG. 5.

[0028] The base word extraction unit 27 first calculates, from the documents held in the document database 24, the co-occurrence probabilities between the words (effective words) registered in the effective word dictionary 26 (step S501). The co-occurrence probability r(W, W′) of a word W with respect to a word W′ is obtained as follows.

[0029]

[Equation 3]

$$r(W, W') = \frac{\text{number of documents containing both } W \text{ and } W'}{\text{number of documents containing } W}$$
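Equation 3 can be illustrated with a short Python sketch. Representing each document as a set of words, the function name `cooccurrence`, and returning 0 when W never occurs are assumptions made for the example.

```python
def cooccurrence(documents, w, w_prime):
    """r(W, W'): share of the documents containing w that also contain w_prime.

    documents is an iterable of word sets, one set per document.
    """
    with_w = [doc for doc in documents if w in doc]
    if not with_w:
        return 0.0  # assumption: r is taken as 0 when w never occurs
    both = sum(1 for doc in with_w if w_prime in doc)
    return both / len(with_w)
```

Because the denominator counts only documents containing W, r is not symmetric in general, and r(W, W) is 1 for any word that occurs at all.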

[0030] Next, as initialization, the base word list B, which holds the selected base words, and the base word candidate list C, which holds the base word candidates, are each emptied, and the number n of selected base words is set to 0 (step S502). Then, of the words registered in the effective word dictionary 26, all words other than those held in the base word list B are stored in the base word candidate list C as base word candidates, and the maximum value max of the base word evaluation value is initialized to 0 (step S503).

[0031] An arbitrary word W is then taken from the base word candidate list C (step S504), and its evaluation value $E^*(W)$ as a base word is calculated (step S505). This evaluation value is obtained as follows.

[0032] When the list B of selected base words is still empty, the evaluation value is $E^*(W) = E(W)$, the effectiveness of W. When base words $W^b_1, W^b_2, W^b_3, \ldots, W^b_n$ have already been selected, the evaluation value $E^*(W)$ of the word W as a base word should be higher the higher the effectiveness E(W) of W itself for document classification, and lower the higher its correlation with the base words $W^b_1, W^b_2, W^b_3, \ldots, W^b_n$. The evaluation value $E^*(W)$ of the word W is therefore defined as follows.

[0033]

[Equation 4]

$$E^*(W) = E(W) \times (1 - r(W, W^b_1)) \times (1 - r(W, W^b_2)) \times \cdots \times (1 - r(W, W^b_n))$$

Next, it is determined whether the evaluation value $E^*(W)$ calculated in step S505 is greater than the maximum value max (step S506). If it is, the word W is set as the next base word candidate $W^*$, the maximum value max is updated to the evaluation value $E^*(W)$ of the word W (step S507), and the process proceeds to step S508. If $E^*(W)$ is less than or equal to max, step S507 is skipped and the process proceeds to step S508.
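The evaluation value of Equation 4 can be written directly as a product. In this sketch the effectiveness values are passed as a dictionary and the co-occurrence probability as a callable `r(w, base)`; both interfaces, and the name `evaluation_value`, are assumptions for illustration.

```python
def evaluation_value(word, effectiveness, selected, r):
    """E*(W) = E(W) * product over selected base words of (1 - r(W, W^b_i)).

    With no base word selected yet, the product is empty and E*(W) = E(W).
    """
    score = effectiveness[word]
    for base in selected:
        score *= 1.0 - r(word, base)
    return score
```

A candidate that always co-occurs with an already selected base word has a factor of 0 and therefore an evaluation value of 0, however high its own effectiveness.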

[0034] In step S508, the word W is deleted from the base word candidate list C. It is then checked whether the base word candidate list C has become empty (step S509). If it is not empty, the process returns to step S504 and the same evaluation is performed on the remaining effective words (base word candidates). If the base word candidate list C is empty, the base word candidate $W^*$ is added to the base word list B and the number of base words n is incremented by one (step S510).

[0035] It is then checked whether the number of base words n has reached the number L set in advance by the user (a small number, 7, in this embodiment) (step S511). If not, the process returns to step S503 and the next base word candidate is selected. When the number of base words n reaches the set number L, extraction of the L base words is complete.
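Steps S502 to S511 amount to a greedy selection: each round picks the candidate with the highest evaluation value given the base words chosen so far. The sketch below assumes effectiveness values in a dictionary and a callable `r(w, base)` for the co-occurrence probability; the early exits when no candidate is left or every candidate scores 0 are additions not described in the flowchart, and the name `select_base_words` is hypothetical.

```python
def select_base_words(effectiveness, r, limit):
    """Greedily select up to `limit` base words (L in the text)."""
    selected = []                                    # base word list B (S502)
    while len(selected) < limit:                     # S511
        candidates = [w for w in effectiveness if w not in selected]  # S503
        if not candidates:
            break  # assumption: stop early if the dictionary runs out
        best_word, best_score = None, 0.0            # max initialized to 0
        for w in candidates:                         # S504, S508, S509
            score = effectiveness[w]
            for b in selected:
                score *= 1.0 - r(w, b)               # Equation 4 (S505)
            if score > best_score:                   # S506
                best_word, best_score = w, score     # S507
        if best_word is None:
            break  # assumption: every remaining candidate scored 0
        selected.append(best_word)                   # S510
    return selected
```

Each round rescans all remaining candidates, so selecting L base words from N effective words costs on the order of N × L² multiplications, which stays small when L is a handful of words as in the embodiment.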

[0036] Correlation information between the base words extracted in this way and each word registered in the effective word dictionary 26 is attached to the effective word dictionary 26. This is done by recording, for each word W in the effective word dictionary 26, the co-occurrence probability $r(W, W^b_i)$ of W with respect to each base word $W^b_i$ ($i = 1, 2, \ldots, L$). In addition, the synonym dictionary 28 is used so that synonymous effective words are given the same correlation information. For example, when words W and W′ are synonyms and the effectiveness E(W) of W is greater than the effectiveness E(W′) of W′, the co-occurrence probabilities $r(W, W^b_i)$ of W with respect to the base words $W^b_i$ ($i = 1, 2, \ldots, L$) are recorded as the correlation information for both W and W′.
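Attaching correlation information, including the sharing between synonyms, might look like the following sketch. The synonym dictionary is modelled here as a list of word sets, and the function name `attach_correlation` is hypothetical. The text covers only the two-word case with unequal effectiveness; extending it to larger groups and breaking ties by whichever member `max` returns first are assumptions.

```python
def attach_correlation(effectiveness, base_words, r, synonym_groups):
    """Return a dict mapping each word to [r(word, b) for b in base_words].

    Within a synonym group, every registered word receives the vector of
    the group member with the highest effectiveness.
    """
    info = {w: [r(w, b) for b in base_words] for w in effectiveness}
    for group in synonym_groups:
        members = [w for w in group if w in effectiveness]
        if len(members) < 2:
            continue
        representative = max(members, key=lambda w: effectiveness[w])
        for w in members:
            info[w] = list(info[representative])
    return info
```

Copying the representative's vector means that synonyms contribute identically to a document vector regardless of which spelling the document uses.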

[0037] FIG. 6 shows an example of the contents of the effective word dictionary 26 created by this method from a group of documents classified into the two categories "politics" and "economy". FIG. 7 shows an example of the contents of the synonym dictionary 28.

[0038] FIG. 6 is an example of the effective word dictionary when "forex", "prime minister", "summit", and others (seven in total) have been selected as base words. For example, the correlation information (co-occurrence probabilities) with the base words for the effective word "forex" registered in the effective word dictionary 26 is (1.0, 0.1, 0.0, 0.3, 0.0, 0.4, 0.0); the first base word is "forex" itself, so that co-occurrence probability is 1.0. The correlation information with the base words for the effective word "cabinet formation" is (0.0, 0.9, 0.0, 0.0, 0.1, 0.1, 0.0, 0.4), which shows that its correlation with the first base word "forex" is low (0.0) and its correlation with the second base word "prime minister" is high (0.9). Also, as shown in FIG. 7, the words "forex" and "foreign exchange" are registered as synonyms in the synonym dictionary 28, so, as shown in FIG. 6, the correlation information (co-occurrence probabilities) with the base words for the word "foreign exchange" registered in the effective word dictionary 26 is exactly the same as that for "forex": (1.0, 0.1, 0.0, 0.3, 0.0, 0.4, 0.0).

[0039] Next, the vector expression process performed by the vector expression unit 22 is described with reference to the flowchart shown in FIG. 8.

[0040] The vector expression unit 22 first initializes an L-dimensional vector x to 0 (step S801). It then performs morphological analysis on the target document (step S802), creates a word list D consisting of pairs (W, f) of each word W contained in the document and its frequency f (step S803), and takes an arbitrary pair (W, f) from the word list D (step S804).

[0041] The word W is then looked up in the effective word dictionary 26 (step S805), and it is checked whether W is registered there (step S806). If it is not registered, the process returns to step S804. If W is registered in the effective word dictionary 26, a vector representation w of the word W is generated from its dictionary entry (step S807). The vector representation w is defined as follows. This representation is basically the same as that used in Yuasa et al., "Automatic Classification of Documents by Noun Co-occurrence Relationships Automatically Extracted from a Large Amount of Document Data", Natural Language Processing 98-11, 1993.

[0042]

[Equation 5]

$$w = (r(W, W^b_1),\ r(W, W^b_2),\ \ldots,\ r(W, W^b_n))$$

Here $r(W, W^b_i)$ is the co-occurrence probability of the word W with respect to the i-th base word $W^b_i$, which is recorded in the effective word dictionary 26 as described above.

[0043] Next, the vector x is updated as x ← x + f·w (step S808). The pair (W, f) is then deleted from the word list D (step S809), and it is checked whether the word list D has become empty (step S810). If it is not empty, the process returns to step S804 and the same processing is performed for the next word. If the word list D is empty, the process ends.
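The accumulation x ← x + f·w over the words of a document can be sketched as follows. The dictionary is assumed to map each effective word to its list of co-occurrence probabilities with the base words, the input is assumed to be the already segmented word sequence, and the name `document_vector` is hypothetical.

```python
from collections import Counter


def document_vector(words, dictionary, dims):
    """Return x, the sum of f * w over the document's registered words.

    words: the document's words (output of morphological analysis);
    dictionary: dict mapping a word to a list of `dims` probabilities;
    words missing from the dictionary are skipped.
    """
    x = [0.0] * dims                                 # S801
    for word, freq in Counter(words).items():        # S803, S804
        w = dictionary.get(word)                     # S805, S806
        if w is None:
            continue
        for i in range(dims):                        # S808: x <- x + f * w
            x[i] += freq * w[i]
    return x
```

A word that is not itself a base word still moves the vector through its correlation entries, which is how a large dictionary feeds a low-dimensional representation.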

[0044] In this way, using an effective word dictionary 26 containing N words, the content of a document can be represented by an L-dimensional vector (L ≪ N). That is, as many words effective for classification as possible are registered in the dictionary, as few base words as possible are extracted from them to serve as the axes of the vector space representing documents, and the correlation information between these base words and each effective word registered in the effective word dictionary 26 is held in the effective word dictionary 26. This raises the probability that a word contained in a document input for classification matches one of the registered effective words and, because the document is represented as a low-dimensional vector based on the correlation information between effective words and base words, lowers the processing cost in the vector space.
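The specification states only that the identification determination unit 23 performs distance calculations and the like in this vector space; it does not fix a particular rule. As one possible illustration, and purely as an assumption, the sketch below assigns a document to the category whose centroid of stored document vectors is nearest in cosine distance. All function names are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0  # assumption: a zero vector is treated as maximally distant
    return 1.0 - dot / (na * nb)


def classify(x, vectors_by_category):
    """Return the category whose centroid is nearest to x (assumed rule)."""
    centroids = {c: centroid(vs) for c, vs in vectors_by_category.items()}
    return min(centroids, key=lambda c: cosine_distance(x, centroids[c]))
```

With L = 7 as in the embodiment, each comparison touches only seven components per category, which is the cost saving the low-dimensional representation is meant to deliver.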

[0045] The present invention is not limited to the above embodiment. For example, when extracting words from a document as described above, a method such as segmentation by character type may be used instead of morphological analysis to speed up word extraction. Also, although the above embodiment uses an entropy calculation to evaluate the effectiveness of a word for classification, another evaluation function may be used as long as it can evaluate the bias of the distribution.
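Segmentation by character type can be illustrated for Japanese text by cutting wherever the script changes (kanji, hiragana, katakana, alphanumerics). The sketch below is one reading of that idea, not the method of the specification: the coarse classes derived from Unicode character names, the choice to discard hiragana runs as mostly function words, and the function names are all assumptions.

```python
import unicodedata


def char_type(ch):
    """Coarse character class derived from the Unicode character name."""
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "kanji"
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("KATAKANA"):
        return "katakana"
    if ch.isalnum():
        return "alnum"
    return "other"


def split_by_char_type(text, keep=("kanji", "katakana", "alnum")):
    """Split text at character-type boundaries and keep runs of selected types."""
    runs, current, current_type = [], "", None
    for ch in text:
        t = char_type(ch)
        if t != current_type and current:
            if current_type in keep:
                runs.append(current)
            current = ""
        current += ch
        current_type = t
    if current and current_type in keep:
        runs.append(current)
    return runs
```

This needs no dictionary lookup and runs in a single pass over the text, at the cost of cruder word boundaries than morphological analysis would give.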

[0046] Furthermore, the evaluation function for base words is not limited to the one shown in the above embodiment; another evaluation function may be used as long as it takes into account the effectiveness of the word itself and the correlation between base words.

[0047] In the above embodiment, each word in the effective word dictionary is given correlation information with all the base words, but the size of the effective word dictionary may be reduced by attaching only the correlation information for the few base words with the highest correlation.
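Keeping only the strongest correlations turns each dictionary entry into a short sparse record. A minimal sketch, in which the sparse format (base word index mapped to value), the dropping of zero entries, and the name `keep_top_correlations` are assumptions:

```python
def keep_top_correlations(vector, k):
    """Return {base_index: value} for the k largest nonzero entries of vector."""
    ranked = sorted(range(len(vector)), key=lambda i: vector[i], reverse=True)
    return {i: vector[i] for i in ranked[:k] if vector[i] > 0}
```

Applied with k = 3 to the "forex" entry of FIG. 6, (1.0, 0.1, 0.0, 0.3, 0.0, 0.4, 0.0), this keeps the first, sixth, and fourth components and discards the rest.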

[0048]

[Effects of the Invention] As described above, according to the automatic document classification apparatus of the present invention, as many words effective for classification as possible are registered in the dictionary, as few base words as possible are extracted from them to serve as the axes of the vector space representing documents, and the correlation information between these base words and each effective word registered in the effective word dictionary is held in the effective word dictionary. This raises the probability that a word contained in a document input for classification matches one of the registered effective words and, because the document is represented as a low-dimensional vector based on the correlation information between effective words and base words, makes it possible to reduce the processing cost in the vector space.

[Brief description of drawings]

FIG. 1 is a block diagram showing the hardware configuration of an automatic document classification device according to an embodiment of the present invention.

FIG. 2 is a functional block diagram showing the functions of the automatic document classification device according to an embodiment of the present invention.

FIG. 3 is a flowchart showing the procedure of the effective word extraction process.

FIG. 4 is a diagram showing, for each category, the proportion of documents that contain a given word.

FIG. 5 is a flowchart showing the procedure of the base word extraction process.

FIG. 6 is a diagram showing an example of the contents of the effective word dictionary.

FIG. 7 is a diagram showing an example of the contents of the synonym dictionary.

FIG. 8 is a flowchart showing the procedure of the vector representation process.

[Explanation of symbols]

11 ... Central processing unit
12 ... ROM
13 ... RAM
14 ... Disk device
15 ... Bus
21 ... Document classification unit
22 ... Vector representation unit
23 ... Identification determination unit
24 ... Document database
25 ... Effective word extraction unit
26 ... Effective word dictionary
27 ... Base word extraction unit
28 ... Synonym dictionary

Continuation of front page
(72) Inventor: Takaya Ueda, c/o Canon Inc., 3-30-2 Shimomaruko, Ota-ku, Tokyo
(72) Inventor: Yuji Ikeda, c/o Canon Inc., 3-30-2 Shimomaruko, Ota-ku, Tokyo
(72) Inventor: Minoru Fujita, c/o Canon Inc., 3-30-2 Shimomaruko, Ota-ku, Tokyo

Claims


1. An automatic document classification device that represents a document as a vector whose axes are a finite number of words and determines into which category the document is classified, the device comprising: a document database in which a plurality of documents are stored, divided in advance into the categories; effective word extraction means for extracting, from the documents stored in the document database, effective words that are effective for automatically classifying an input document; an effective word dictionary in which the effective words extracted by the effective word extraction means are registered; and base word extraction means for extracting, from the effective words registered in the effective word dictionary, base words that serve as the axes of the vector representation of a document.

2. The automatic document classification device according to claim 1, wherein the effective word extraction means extracts, as effective words, words whose frequency of appearance in the plurality of documents stored in the document database differs from category to category.

3. The automatic document classification device according to claim 1, wherein the base word extraction means extracts, as base words, words among the effective words registered in the effective word dictionary that have a high effectiveness for classification and a low correlation with the other base words.

4. The automatic document classification device according to claim 1, wherein each effective word registered in the effective word dictionary is given correlation information with the base words.

5. The automatic document classification device according to claim 4, wherein, among the effective words registered in the effective word dictionary, words that are synonyms of one another are given the same correlation information.