JP2001312501A

JP2001312501A - Automatic document classification system, automatic document classification method, and computer-readable recording medium recording automatic document classification program

Info

Publication number: JP2001312501A
Application number: JP2000131009A
Authority: JP
Inventors: Yoichi Fujii; 洋一藤井; Yasuhiro Takayama; 泰博高山; Katsushi Suzuki; 克志鈴木
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2000-04-28
Filing date: 2000-04-28
Publication date: 2001-11-09

Abstract

(57) [Abstract] [Problem] Conventional systems require the classification categories for the documents to be classified to be created in advance, and a large number of training documents to be prepared for each category. [Solution] The system comprises: document structure analysis means that analyzes the sentence structure of classification reference sentences, provided for each classification field, and of classification target documents to be classified into those fields; similarity calculation means that clusters the classification reference sentences based on the analysis results and, following classification rules that define the points of interest for classification, calculates the similarity with respect to those points of interest between the analysis result of a classification target document and the clustered classification reference sentences; and document classification means that determines the classification field of the classification target document based on the similarity calculated by the similarity calculation means.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to an automatic document classification system, an automatic document classification method, and a computer-readable recording medium on which an automatic document classification program is recorded. These analyze and classify the free-text answers in document data, such as questionnaires, in which various questions in a specific field are answered through multiple-choice options and free descriptions.

[0002]

[Prior Art] A conventional automatic document classification system, which classifies documents automatically using the frequency of the words that appear in them, is disclosed in "Automatic Document Classification Method Based on Learning Results of Semantic Attributes" (Transactions of the Information Processing Society of Japan, Vol. 33, No. 9, pp. 1114-1122).

[0003] FIG. 10 is a block diagram showing the configuration of the conventional automatic document classification system described above. In the figure, 1 is a dictionary referred to during document analysis; 3 is a classification target document storage unit that stores the documents to be classified; 101 is a training document storage unit that stores training documents, which are sample data already classified into classification categories (for example, classification fields such as economy, sports, or culture to which the document content belongs); 102 is noun extraction processing means that applies morphological analysis or similar processing to the classification target documents and training documents, extracts nouns, and stores them in the classification target document noun storage unit 103 or the training document noun storage unit 104; 103 is the classification target document noun storage unit, which stores the nouns extracted from the classification target documents by the noun extraction processing means 102; and 104 is the training document noun storage unit, which stores the nouns extracted from the training documents by the noun extraction processing means 102. 105 is classification vector creation means, which totals, for each classification category already assigned to the training documents, the frequency of the semantic attributes assigned to the nouns stored in the training document noun storage unit 104, calculates the weight of each noun's semantic attribute for each classification category, and stores the result in the classification vector storage unit 106 as a classification vector.

[0004] 106 is the classification vector storage unit, which stores the classification vectors; 107 is document classification means that totals the frequency of the semantic attributes assigned to the nouns stored in the classification target document noun storage unit 103 and likewise expresses them as a vector; 108 is vector similarity calculation means that calculates the similarity between the classification vectors stored in the classification vector storage unit 106 and the vector created by the document classification means 107, and outputs the result to the document classification means 107; and 109 is a classification result storage unit that stores the classification category with the highest similarity as the classification destination of the classification target document.

[0005] The operation is described next. First, the classification items to be used (hereinafter, classification categories) are created in advance using the dictionary 1, and training documents, which are sample data classified into each classification category, are stored in the training document storage unit 101. Next, the noun extraction processing means 102 extracts nouns from each training document in the training document storage unit 101, using morphological analysis or a similar technique, and stores them in the training document noun storage unit 104.

[0006] The classification vector creation means 105 then totals, for each classification category, the frequency of the semantic attributes assigned to the nouns in the training document noun storage unit 104, calculates the weight of each noun's semantic attribute for each classification category, and stores the result in the classification vector storage unit 106. The conventional system calculates these weights with a method that applies the chi-square statistic. Each classification vector is thus a vector representation of the weights of the nouns' semantic attributes for one classification category.

[0007] Automatic classification is described next. First, the classification target documents to be classified are taken one at a time from the classification target document storage unit 3, and the noun extraction processing means 102 extracts nouns from each document, as it does for the training documents, and stores them in the classification target document noun storage unit 103. Next, the document classification means 107 totals the frequency of the semantic attributes assigned to the nouns stored in the classification target document noun storage unit 103 and expresses them as a vector, as for the training documents. The vector similarity calculation means 108 then calculates the similarity between this vector and the classification vectors in the classification vector storage unit 106, and the classification category with the highest similarity (that is, the classification category set for the training documents corresponding to the most similar classification vector) is stored in the classification result storage unit 109 as the classification destination of the classification target document.
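The conventional flow described in paragraphs 0003 to 0007 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the cited paper's implementation: the toy `SEMANTIC_ATTRS` table stands in for the dictionary 1, raw frequencies replace the chi-square-based weights, and cosine similarity is an assumed choice because the text does not name the similarity measure.

```python
from collections import Counter
from math import sqrt

# Hypothetical noun -> semantic attribute lookup (stands in for dictionary 1).
SEMANTIC_ATTRS = {"yen": "money", "stock": "money", "goal": "sport", "team": "sport"}

def attribute_vector(nouns):
    """Count how often each semantic attribute occurs among the nouns."""
    return Counter(SEMANTIC_ATTRS[n] for n in nouns if n in SEMANTIC_ATTRS)

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (assumed measure)."""
    dot = sum(u[k] * v[k] for k in u)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def build_category_vectors(training_docs):
    """Aggregate the attribute counts of the training documents per category."""
    vectors = {}
    for category, nouns in training_docs:
        vectors.setdefault(category, Counter()).update(attribute_vector(nouns))
    return vectors

def classify(nouns, category_vectors):
    """Return the category whose vector is most similar to the document's vector."""
    doc_vec = attribute_vector(nouns)
    return max(category_vectors, key=lambda c: cosine(doc_vec, category_vectors[c]))

training = [("economy", ["yen", "stock"]), ("sports", ["goal", "team", "team"])]
vectors = build_category_vectors(training)
print(classify(["team", "goal"], vectors))  # sports
```

Even in this toy form, every category needs its own labeled training documents before anything can be classified, which is the cost the invention aims to avoid.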

[0008]

[Problems to Be Solved by the Invention] Because the conventional automatic document classification system is configured as described above, the classification categories for the documents to be classified must be created in advance, and a large number of training documents must be prepared for each category.

[0009] To explain this problem concretely: automatic classification that compares classification target documents against training documents with preassigned classification categories requires the categories to be fixed beforehand. This is difficult when, as with questionnaire results, it is not known in advance what content has been written.

[0010] As for the need to prepare a large number of training documents for each classification category, the cost is justified where classification is carried out periodically, as in the automatic classification of newspaper articles. Where documents are classified automatically only for a one-off analysis, as with a questionnaire, preparing a large number of training documents is not worth the cost.

[0011] Furthermore, because the conventional automatic document classification system calculates similarity in only one way, it cannot classify documents using a point of interest that matches the user's purpose.

[0012] To explain this problem concretely: free-text questionnaire answers typically consist of one sentence, or a few sentences at most, and because the questions tend to restrict the classification categories to some extent, the words that appear in the free-text answers are likely to show similar tendencies. Free-text answers may also contain the respondent's particular intentions or judgments. To classify such a questionnaire automatically, the classification must focus on points such as "which subject did most people answer about?" or "do users think it is good or bad?". In other words, the point of interest of the classification, as a matter of linguistic expression, changes as the classification of the questionnaire proceeds. A classification method that simply handles the presence or absence of words, as the conventional automatic document classification system does, cannot distinguish between such points of interest.

[0013] The present invention was made to solve the problems described above. For a document set such as a questionnaire, whose classification categories are limited to some extent, classification rules defining the points to focus on during classification are given in advance, and the classification reference sentences set for each classification category are clustered automatically when they are supplied as sample data. The object is to obtain an automatic document classification system, an automatic document classification method, and a computer-readable recording medium on which an automatic document classification program is recorded, with which automatic classification that previously required a large number of training documents can be performed with a small amount of sample data.

[0014]

[Means for Solving the Problems] The automatic document classification system according to the present invention comprises: document structure analysis means that analyzes the sentence structure of classification reference sentences, provided for each classification field, and of classification target documents to be classified into those fields; similarity calculation means that clusters the classification reference sentences based on the analysis results of the document structure analysis means and, following classification rules that define the points of interest for classification, calculates the similarity with respect to those points of interest between the analysis result of a classification target document and the clustered classification reference sentences; and document classification means that determines the classification field of the classification target document based on the similarity calculated by the similarity calculation means.

[0015] In the automatic document classification system according to the present invention, the classification rules define the points of interest for classification in terms of feature quantities of natural-language documents.

[0016] In the automatic document classification system according to the present invention, the classification rules use, as the feature quantities of natural-language documents, at least one of the following: a clause similarity indicating the similarity between the clauses that make up the classification reference sentence and those of the classification target document; a dependency similarity indicating the similarity of the dependency relations between those clauses; and a modal similarity indicating the similarity of the modal expressions in the classification reference sentence and the classification target document.

[0017] In the automatic document classification system according to the present invention, the classification rules define a plurality of points of interest for classification and set, for the similarity of each point of interest, a threshold suited to the purpose of the classification, and the document classification means determines the classification field of the classification target document based on the similarities that are at or above their thresholds.
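One way to picture a classification rule with several points of interest and per-point thresholds is sketched below. The `PointOfInterest` structure, the feature names, the threshold values, and the "all points must pass" decision are illustrative assumptions; the text only states that thresholds are set per point of interest and that similarities at or above them are used.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PointOfInterest:
    feature: str      # "clause", "dependency", or "modal"
    threshold: float  # minimum similarity for this point of interest to count

# Hypothetical rule: look at what is being discussed and at the respondent's attitude.
RULE = [PointOfInterest("clause", 0.5), PointOfInterest("modal", 0.7)]

def passing_similarities(similarities, rule):
    """Keep only the similarities that reach the threshold of their point of interest."""
    return {
        p.feature: similarities[p.feature]
        for p in rule
        if similarities.get(p.feature, 0.0) >= p.threshold
    }

def matches(similarities, rule):
    """Assumed decision: every point of interest in the rule must pass its threshold."""
    return len(passing_similarities(similarities, rule)) == len(rule)

scores = {"clause": 0.8, "dependency": 0.2, "modal": 0.6}
print(passing_similarities(scores, RULE))  # {'clause': 0.8}
print(matches(scores, RULE))               # False
```

Changing the thresholds or the listed features changes what the classifier treats as "similar" without touching the similarity code itself.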

[0018] In the automatic document classification system according to the present invention, the classification target document consists of a plurality of sentences, and the similarity calculation means calculates the similarity for the classification target document sentence by sentence.

[0019] In the automatic document classification system according to the present invention, the similarity calculation means takes the highest of the similarities between the individual sentences of the classification target document and the classification reference sentence as the similarity between the classification target document and the classification reference sentence.
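The sentence-by-sentence scoring of paragraphs 0018 and 0019 reduces to taking a maximum over per-sentence similarities. In the sketch below, the regular-expression sentence splitter and the word-overlap measure `word_overlap` are placeholders chosen for illustration, not the similarity defined by the classification rules.

```python
import re

def split_sentences(document):
    """Split on Japanese or Western sentence-ending punctuation (simplified)."""
    return [s.strip() for s in re.split(r"[。.!?！？]", document) if s.strip()]

def word_overlap(a, b):
    """Placeholder sentence similarity: Jaccard overlap of lowercase word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def document_similarity(document, reference_sentence, sentence_similarity=word_overlap):
    """Score each sentence separately and keep the highest value."""
    scores = [sentence_similarity(s, reference_sentence) for s in split_sentences(document)]
    return max(scores, default=0.0)

doc = "The screen is bright. The battery drains fast."
print(document_similarity(doc, "the battery drains fast"))  # 1.0
```

Taking the maximum means one closely matching sentence is enough to associate a multi-sentence answer with a reference sentence, even if its other sentences are unrelated.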

[0020] In the automatic document classification system according to the present invention, when the classification rules use the clause similarity, which indicates the similarity between the clauses of the classification reference sentence and those of the classification target document, as the feature quantity of natural-language documents, the similarity calculation means calculates the clause similarity either from only the clauses consisting of nominals (taigen) or from only the clauses consisting of predicates (yogen, that is, declinable words).
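Restricting the clause similarity to one clause type, as paragraph 0020 describes, can be sketched as a filter applied before comparison. The `(word, kind)` pair representation and the Jaccard overlap are assumptions made for this example only.

```python
def clause_similarity(doc_clauses, ref_clauses, kind):
    """Jaccard overlap of head words, restricted to clauses of one kind.

    Each clause is a (word, kind) pair, where kind is "nominal" or "predicate".
    """
    doc = {word for word, k in doc_clauses if k == kind}
    ref = {word for word, k in ref_clauses if k == kind}
    return len(doc & ref) / len(doc | ref) if doc | ref else 0.0

answer = [("battery", "nominal"), ("bad", "predicate")]
reference = [("battery", "nominal"), ("good", "predicate")]

print(clause_similarity(answer, reference, "nominal"))    # 1.0
print(clause_similarity(answer, reference, "predicate"))  # 0.0
```

In this toy case the nominal-only comparison groups the two sentences by subject, while the predicate-only comparison separates them by what is said about that subject.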

[0021] In the automatic document classification system according to the present invention, the document classification means classifies the classification target documents hierarchically according to the points of interest for classification, and the classification rules are set in advance for each level of the hierarchy corresponding to a point of interest.
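A hierarchical classification with one rule per level might look like the following sketch. The two levels ("topic", then "attitude"), the field names, the feature lists, and the Jaccard measure are all hypothetical; the text only says that a rule is preset for each level.

```python
def jaccard(a, b):
    """Assumed similarity: overlap of two collections of feature values."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Each level: (point of interest, {field name: reference feature values}).
LEVELS = [
    ("topic", {"battery": ["battery", "charge"], "screen": ["screen", "display"]}),
    ("attitude", {"positive": ["good", "like"], "negative": ["bad", "poor"]}),
]

def classify_hierarchically(doc_features, levels):
    """Apply each level's rule in turn and return the path of chosen fields."""
    path = []
    for point, references in levels:
        best = max(references, key=lambda f: jaccard(doc_features[point], references[f]))
        path.append(best)
    return tuple(path)

answer = {"topic": ["battery"], "attitude": ["poor"]}
print(classify_hierarchically(answer, LEVELS))  # ('battery', 'negative')
```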

[0022] In the automatic document classification system according to the present invention, when the similarities between the classification target document and a plurality of classification reference sentences are at or above a fixed threshold, the document classification means classifies the classification target document into the classification field corresponding to each of those classification reference sentences.
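The multi-field assignment of paragraph 0022 amounts to a threshold filter over the per-field similarities. The field names, scores, and threshold below are made up for illustration.

```python
def assign_fields(similarity_by_field, threshold):
    """Return every field whose reference sentence reaches the threshold."""
    return sorted(f for f, s in similarity_by_field.items() if s >= threshold)

scores = {"battery": 0.82, "screen": 0.75, "price": 0.10}
print(assign_fields(scores, threshold=0.7))  # ['battery', 'screen']
```

An answer that mentions two subjects is then counted under both fields instead of being forced into one.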

[0023] In the automatic document classification system according to the present invention, a plurality of classification reference sentences are set for one classification field, the similarity calculation means calculates the similarity between each classification reference sentence and the classification target document, and the document classification means classifies the classification target document into the classification field corresponding to the classification reference sentence whose similarity is the highest among those calculated by the similarity calculation means.
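When one field has several reference sentences, paragraph 0023 selects the field that owns the single best-matching reference sentence. A minimal sketch, with invented scores:

```python
def best_field(similarities):
    """Pick the field that owns the highest-scoring reference sentence.

    similarities maps each field to the similarities between the target
    document and each of that field's reference sentences.
    """
    return max(similarities, key=lambda field: max(similarities[field]))

scores = {"battery": [0.2, 0.9], "screen": [0.6, 0.7]}
print(best_field(scores))  # battery
```

Note that the choice depends on the best sentence per field, not on the average: "battery" wins here although its mean similarity is lower than that of "screen".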

[0024] The automatic document classification method according to the present invention comprises: a document structure analysis step of analyzing the sentence structure of classification reference sentences, provided for each classification field, and of classification target documents to be classified into those fields; a similarity calculation step of clustering the classification reference sentences based on the analysis results of the document structure analysis step and, following classification rules that define the points of interest for classification, calculating the similarity with respect to those points of interest between the analysis result of a classification target document and the clustered classification reference sentences; and a document classification step of determining the classification field of the classification target document based on the similarity calculated in the similarity calculation step.

[0025] In the automatic document classification method according to the present invention, the classification rules define the points of interest for classification in terms of feature quantities of natural-language documents.

[0026] In the automatic document classification method according to the present invention, the classification rules use, as the feature quantities of natural-language documents, at least one of the following: a clause similarity indicating the similarity between the clauses that make up the classification reference sentence and those of the classification target document; a dependency similarity indicating the similarity of the dependency relations between those clauses; and a modal similarity indicating the similarity of the modal expressions in the classification reference sentence and the classification target document.

[0027] In the automatic document classification method according to the present invention, the classification rules define a plurality of points of interest for classification and set, for the similarity of each point of interest, a threshold suited to the purpose of the classification, and in the document classification step the classification field of the classification target document is determined based on the similarities that are at or above their thresholds.

[0028] In the automatic document classification method according to the present invention, in the similarity calculation step the similarity is calculated sentence by sentence for a classification target document consisting of a plurality of sentences.

[0029] In the automatic document classification method according to the present invention, in the similarity calculation step the highest of the similarities between the individual sentences of the classification target document and the classification reference sentence is taken as the similarity between the classification target document and the classification reference sentence.

[0030] In the automatic document classification method according to the present invention, when the classification rules use the clause similarity, which indicates the similarity between the clauses of the classification reference sentence and those of the classification target document, as the feature quantity of natural-language documents, the clause similarity is calculated in the similarity calculation step either from only the clauses consisting of nominals (taigen) or from only the clauses consisting of predicates (yogen, that is, declinable words).

[0031] In the automatic document classification method according to the present invention, in the document classification step the classification target documents are classified hierarchically according to the points of interest for classification, and the classification rules are set in advance for each level of the hierarchy corresponding to a point of interest.

[0032] In the automatic document classification method according to the present invention, in the document classification step, when the similarities between the classification target document and a plurality of classification reference sentences are at or above a fixed threshold, the classification target document is classified into the classification field corresponding to each of those classification reference sentences.

[0033] In the automatic document classification method according to the present invention, a plurality of classification reference sentences are set for one classification field; in the similarity calculation step the similarity between each classification reference sentence and the classification target document is calculated; and in the document classification step the classification target document is classified into the classification field corresponding to the classification reference sentence whose similarity is the highest among those calculated in the similarity calculation step.

[0034] The computer-readable recording medium on which the automatic document classification program according to the present invention is recorded provides: a document structure analysis function that analyzes the sentence structure of classification reference sentences, provided for each classification field, and of classification target documents to be classified into those fields; a similarity calculation function that clusters the classification reference sentences based on the analysis results of the document structure analysis function and, following classification rules that define the points of interest for classification, calculates the similarity with respect to those points of interest between the analysis result of a classification target document and the clustered classification reference sentences; and a document classification function that determines the classification field of the classification target document based on the calculated similarity.

[0035]

[Embodiments of the Invention] An embodiment of the present invention is described below. Embodiment 1. FIG. 1 is a block diagram showing the configuration of an automatic document classification system according to Embodiment 1 of the present invention. In the figure, 1 is a dictionary referred to when analyzing the sentence structure of classification reference sentences and classification target documents; 2 is a classification reference sentence storage unit that stores the classification reference sentences associated with the classification categories (classification fields); and 3 is a classification target document storage unit that stores the classification target documents. 4 is document analysis means (document structure analysis means) that uses the dictionary 1 to analyze the sentence structure of the classification reference sentences in the classification reference sentence storage unit 2 and of the classification target documents in the classification target document storage unit 3.

[0036] 5 is a classification reference sentence analysis result storage unit that stores the results obtained when the document analysis means 4 analyzes the sentence structure of the classification reference sentences in the classification reference sentence storage unit 2, and 6 is a classification target document analysis result storage unit that stores the results obtained when the document analysis means 4 analyzes the sentence structure of the classification target documents stored in the classification target document storage unit 3. 7 is a classification rule storage unit that stores the classification rules used when calculating the similarity between the analyzed classification reference sentences stored in the classification reference sentence analysis result storage unit 5 and the analyzed classification target documents stored in the classification target document analysis result storage unit 6.

[0037] 8 is document similarity calculation means (similarity calculation means) that, following the classification rules stored in the classification rule storage unit 7, calculates the similarity between the analyzed classification reference sentences stored in the classification reference sentence analysis result storage unit 5 and the analyzed classification target documents stored in the classification target document analysis result storage unit 6. 9 is document classification means that determines the classification destination of each classification target document according to the similarity obtained by the document similarity calculation means 8, and 10 is a classification result storage unit that stores the classification destination determined by the document classification means 9.
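The components of FIG. 1 can be read as a small pipeline. The sketch below mirrors the numbered units with plain Python objects, purely for orientation: `analyze` and `jaccard` are toy stand-ins for the document analysis means 4 and the similarity calculation means 8, the clustering of reference sentences is omitted, and picking the single highest-scoring field is only one of the decision policies described.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

def analyze(text):
    """Toy stand-in for document analysis means 4: a set of lowercase words."""
    return set(text.lower().split())

def jaccard(a, b):
    """Toy stand-in for document similarity calculation means 8."""
    return len(a & b) / len(a | b) if a | b else 0.0

@dataclass
class Classifier:
    reference_sentences: Dict[str, str]  # unit 2: field -> reference sentence
    analyze: Callable                    # means 4
    similarity: Callable                 # means 8 (would consult the rules in unit 7)
    results: Dict[str, str] = field(default_factory=dict)  # unit 10

    def run(self, target_documents):
        # Unit 5: analysis results of the reference sentences.
        refs = {f: self.analyze(s) for f, s in self.reference_sentences.items()}
        for doc_id, text in target_documents.items():  # unit 3
            analyzed = self.analyze(text)              # unit 6
            scores = {f: self.similarity(analyzed, r) for f, r in refs.items()}
            self.results[doc_id] = max(scores, key=scores.get)  # means 9
        return self.results

clf = Classifier(
    {"battery": "battery life is short", "screen": "screen is too dark"},
    analyze,
    jaccard,
)
print(clf.run({"a1": "the battery life is bad"}))  # {'a1': 'battery'}
```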

[0038] The operation is described next. FIG. 2 is a flowchart showing the operation of the automatic document classification system according to Embodiment 1 of the present invention, and the operation is described with reference to it. First, the document analysis means 4 performs sentence structure analysis by applying morphological analysis and related processing to the classification target documents stored in the classification target document storage unit 3. Specifically, it performs, for example, the morphological analysis disclosed in "Morphological Analysis of Japanese Sentences Containing Unregistered Words" (Transactions of the Information Processing Society of Japan, Vol. 30, No. 3, pp. 294-301 (1989)). This is briefly described below.

【００３９】 First, the document analysis means 4 uses the dictionary 1 to analyze each morpheme of the classification target document. Next, it checks the morpheme analysis result for unknown words; if an unknown word is present, it estimates the range that makes up the unknown word and groups the morphemes in that range together. It then analyzes the phrase (bunsetsu) structure from the morpheme analysis result. At this point, attribute information, independent word information, and attached word information are extracted for each phrase of the classification target document. The attribute information describes grammatical properties such as the dependency of the words that make up the phrase; the independent word information concerns the independent word of the phrase; and the attached word information concerns attached words, such as particles and adverbs, that accompany that independent word. Once the phrase structure has been analyzed, dependency analysis is performed on the morphemes making up each phrase, based on the extracted attribute information, independent word information, and attached word information. This determines the dependency structure of the phrases that make up the classification target document.

【００４０】 Having analyzed the sentence structure of the classification target document in this way, the document analysis means 4 stores the result in the classification target document analysis result storage unit 6. In the same manner, the document analysis means 4 analyzes the sentence structure of the classification reference sentences stored in the classification reference sentence storage unit 2 and stores the result in the classification reference sentence analysis result storage unit 5. The processing up to this point corresponds to step ST2-1 (sentence structure analysis step).

【００４１】 FIG. 3 shows free-text questionnaire responses as an example of the classification target documents handled by the automatic document classification system according to Embodiment 1. The example shows the free-text answers given to a question such as "Please write below anything you would like to tell us about the personal computer you currently use." In the figure, the questionnaire number is a serial number assigned to each free-text response. For example, the answer for questionnaire number 1 is "cannot add memory" (メモリを増設できない). As the figure shows, each answer consists of one sentence or at most a few sentences. In addition, the question limits the possible classification categories to some extent, and the words that appear in the free-text answers show similar tendencies.

【００４２】 FIG. 4 shows an example of the result of analyzing the sentence structure of the answer for questionnaire number 1 in FIG. 3. As shown in the figure, "memory" and "add" are extracted as independent words, and the analysis yields a sentence structure in which "memory" depends on "add", where "add" carries the attributes "possible, negative", that is, "cannot".
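The analysis result described for questionnaire number 1 can be pictured as a small data structure. The following is a minimal sketch rather than the storage format of the system: the `Phrase` class, its field names, and the English glosses standing in for the Japanese words are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Phrase:
    """One phrase (bunsetsu) of an analyzed sentence; field names are hypothetical."""
    independent_word: str                 # content word of the phrase
    attributes: frozenset = frozenset()   # modal attributes attached to the word
    head: Optional[int] = None            # index of the phrase this one depends on


# Hand-built result for "cannot add memory" (questionnaire number 1):
# "memory" depends on "add", and "add" carries the attributes possible + negative.
answer_1 = [
    Phrase("memory", head=1),
    Phrase("add", attributes=frozenset({"possible", "negative"})),
]


def independent_words(sentence):
    """List the independent words in sentence order."""
    return [p.independent_word for p in sentence]


def dependencies(sentence):
    """Return (dependent, head) pairs of independent words."""
    return [
        (p.independent_word, sentence[p.head].independent_word)
        for p in sentence
        if p.head is not None
    ]


print(independent_words(answer_1))
print(dependencies(answer_1))
```

Keeping the modal attributes separate from the independent word is what later allows the modal parameter to be weighted independently of word overlap.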

【００４３】 Next, the document classifying means 9 reads the structure-analyzed classification reference sentences and classification target documents, one simple sentence at a time, from the classification reference sentence analysis result storage unit 5 and the classification target document analysis result storage unit 6. For the simple sentences read out in this way, the document similarity calculating means 8 calculates the similarity between the classification reference sentence and the classification target document in accordance with the classification rules stored in the classification rule storage unit 7 (step ST2-2, similarity calculation step).

【００４４】 Based on the similarity between the classification reference sentences and the classification target document calculated by the document similarity calculating means 8, the document classifying means 9 assigns the classification target document to the classification category that has the most similar classification reference sentence, and stores the classification result in the classification result storage unit 10 (step ST2-3, document classification step).

【００４５】 Next, the similarity calculation processing and the document classification processing will be described in detail. FIG. 5 is a flowchart showing the similarity calculation processing and the document classification processing in the automatic document classification system according to Embodiment 1; it corresponds to the operation of steps ST2-2 and ST2-3 in FIG. 2. Before proceeding to step ST3-1, the document analysis means 4 analyzes in advance the sentence structure of the classification target documents stored in the classification target document storage unit 3. Next, the classification categories, the classification reference sentences corresponding to them, and the classification rules to be applied at classification time are input (step ST3-1). The document analysis means 4 then analyzes the sentence structure of the classification reference sentences in the same manner as above and stores the result in the classification reference sentence analysis result storage unit 5 (step ST3-2, document structure analysis step).

【００４６】 When the sentence structure of the classification reference sentences has been analyzed in step ST3-2, the document classifying means 9 determines whether the classification target document analysis result storage unit 6 holds analysis results for classification target documents (step ST3-3, similarity calculation step). If an analyzed classification target document remains, the document classifying means 9 extracts it one simple sentence at a time, outputs it to the document similarity calculating means 8, and proceeds to step ST3-4, where it is matched for similarity against the classification reference sentences clustered under each classification category. If no analyzed classification target document remains, the process proceeds to step ST3-6 and the classification results are stored in the classification result storage unit 10.

【００４７】 In step ST3-4, the similarity between the classification reference sentences corresponding to each classification category and the classification target document is calculated in accordance with the classification rule being applied. This similarity calculation is described in detail after a concrete example of the classification rules is given.

【００４８】 FIG. 6 shows an example of the classification rules used by the automatic document classification system according to Embodiment 1. In the figure, reference numeral 31, which points to the region enclosed by a broken line, indicates the classification rule with rule number 1. As feature quantities of natural language documents that define the point of interest at classification time, the classification rule 31 uses three similarities: the phrase similarity (labeled "phrase" in FIG. 6), which indicates how similar the phrases making up the classification reference sentence and the classification target document are; the dependency similarity (labeled "dependency" in FIG. 6), which indicates how similar the dependency information between those phrases is; and the modal similarity (labeled "modal" in FIG. 6), which indicates how similar the modal expressions of the classification reference sentence and the classification target document are. Here, a modal expression accompanying a word, such as "negation", is referred to as a "modal". In the illustrated example, the phrase similarity parameter is 0.8, the dependency similarity parameter is 0.2, and the modal similarity parameter for modal expressions is 1.0. These parameters are weights (thresholds) for the respective similarities; that is, they are set according to which similarity the user wants to emphasize when classifying. Note that a similarity whose parameter is set to 1.0 is not used in calculating the similarity between the classification reference sentence and the classification target document.

【００４９】 Increasing the phrase similarity parameter makes it possible to gather sentences that contain the same words. That is, when classifying a plurality of classification target documents such as those shown in FIG. 3 (questionnaire numbers 1 to 2000), the documents containing the same words can be gathered on the basis of the phrase similarity between the classification reference sentence and each classification target document. Thus, when the purpose of classification is, for example, to find out what topics the questionnaire answers address, this can be determined easily.

【００５０】 Increasing the dependency similarity parameter makes it possible to gather classification target documents that contain similar expressions. This allows classification that focuses on whether the documents are saying the same kind of thing. Furthermore, decreasing the modal similarity parameter for modal expressions makes it possible to classify documents by fine differences in the expressions attached to predicates, so that classification can focus, for example, on whether something negative is being said. In this way, changing the similarity parameters changes the point of interest at classification time, and classification that matches the user's purpose can be performed.

【００５１】 The processing in step ST3-4 that calculates the similarity between a classification reference sentence and a classification target document will now be described in detail. As a method of calculating this similarity, the one disclosed in, for example, "Similar document retrieval system and method, and computer-readable recording medium recording a similar document retrieval program" (Japanese Patent Application No. H11-257167) is used. In that similar document retrieval system, the similarity is calculated by the following equation.

Similarity = α × phrase similarity + β × dependency similarity ... Equation (A)

α: phrase similarity parameter, β: dependency similarity parameter

Here, the phrase similarity is calculated from the number of independent words that the two sentences have in common, and the dependency similarity is calculated from the number of dependency structures that they have in common.

【００５２】 In that phrase similarity calculation, the similarity of each word is also weighted according to whether the modal, that is, the expression accompanying the word such as "negation", differs (specifically, the word similarity is 1.0 when the modals match and 0.1 when they do not). In the present system, this modal weight is treated independently as a separate parameter, and the calculation is changed to the following equation.

Similarity = α × phrase similarity(γ) + β × dependency similarity ... Equation (B)

γ: parameter applied when the modals do not match (when the modals do not match, the phrase similarity is γ (0 ≤ γ ≤ 1); when they match, the phrase similarity is 1.0)

That is, when γ is set to 0.1, equation (B) coincides with equation (A) used for similarity calculation by the similar document retrieval system of Japanese Patent Application No. H11-257167. In the following, the document similarity calculating means 8 is assumed to calculate the similarity using equation (B).
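Equation (B) can be sketched in code as follows. The description states only that the phrase similarity is based on the number of shared independent words and the dependency similarity on the number of shared dependency structures, so the normalization by the reference sentence's counts, the dictionary layout, the function names, and the two example sentences are all assumptions made for illustration.

```python
def phrase_similarity(ref, target, gamma):
    """Average per-word score over the reference sentence's independent words.

    A shared word scores 1.0 when its modal attributes match, gamma when they
    differ, and 0.0 when the target lacks the word (assumed normalization).
    """
    words = ref["words"]
    if not words:
        return 0.0
    total = 0.0
    for word, modal in words.items():
        if word in target["words"]:
            total += 1.0 if target["words"][word] == modal else gamma
    return total / len(words)


def dependency_similarity(ref, target):
    """Fraction of the reference sentence's dependency pairs found in the target."""
    if not ref["deps"]:
        return 0.0
    return len(ref["deps"] & target["deps"]) / len(ref["deps"])


def similarity(ref, target, alpha, beta, gamma):
    """Equation (B): alpha * phrase similarity(gamma) + beta * dependency similarity."""
    return (alpha * phrase_similarity(ref, target, gamma)
            + beta * dependency_similarity(ref, target))


def sentence(words, deps=()):
    """Build a sentence from {word: modal attributes} and (dependent, head) pairs."""
    return {"words": {w: frozenset(m) for w, m in words.items()}, "deps": set(deps)}


# Illustrative sentences: same words and dependency, different modality.
want = sentence({"memory": [], "add": ["desire"]}, [("memory", "add")])
cannot = sentence({"memory": [], "add": ["possible", "negative"]}, [("memory", "add")])

same = similarity(cannot, cannot, alpha=0.2, beta=0.8, gamma=0.1)
differ = similarity(want, cannot, alpha=0.2, beta=0.8, gamma=0.1)
ignore = similarity(want, cannot, alpha=0.2, beta=0.8, gamma=1.0)
print(same, differ, ignore)
```

Under these assumptions, a small γ makes a modal mismatch lower the score, while γ = 1.0 makes the two sentences indistinguishable, which is the behavior the parameter is meant to control.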

【００５３】 In the similar document retrieval system disclosed in Japanese Patent Application No. H11-257167, the similarity is calculated between an input search sentence and a cluster representative structure. Here, the same calculation is performed with the classification reference sentence taking the place of the input search sentence and the classification target document taking the place of the cluster representative structure.

【００５４】 Next, an application example of the classification categories, classification reference sentences, and classification rules given in step ST3-1 is shown, and the concrete classification processing is described. FIG. 7 shows an example in which the classification rules described above are applied to classification target documents that are classified hierarchically. In the figure, reference numeral 36, which points to the region enclosed by a broken line, indicates that the classification rule applied to the classification target documents classified at the first level has rule number 1, that is, the classification rule indicated by reference numeral 31 in FIG. 6. The level column indicates the hierarchical rank in the classification: 1 denotes classification target documents of the first level, and 2 denotes classification target documents of the second level, which is subordinate to the first.
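The association between hierarchy levels and rules in FIG. 7 amounts to a lookup table. The following is a minimal sketch, assuming a hypothetical `ClassificationRule` record; the parameter values are the ones this description quotes for rule numbers 1 to 3 of FIG. 6, with the modal parameter written as γ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassificationRule:
    """Weights of one classification rule (record and field names are hypothetical)."""
    alpha: float  # weight of phrase similarity
    beta: float   # weight of dependency similarity
    gamma: float  # phrase similarity assigned when modals differ


# Rule number -> parameters, as quoted in the description of FIG. 6.
RULES = {
    1: ClassificationRule(alpha=0.8, beta=0.2, gamma=1.0),
    2: ClassificationRule(alpha=0.2, beta=0.8, gamma=1.0),
    3: ClassificationRule(alpha=0.2, beta=0.8, gamma=0.1),
}

# Hierarchy level -> rule number, following FIG. 7.
RULE_FOR_LEVEL = {1: 1, 2: 2, 3: 3}


def rule_for_level(level):
    """Look up the rule applied to documents classified at the given level."""
    return RULES[RULE_FOR_LEVEL[level]]


print(rule_for_level(1))
```

Keeping the level-to-rule mapping separate from the rule parameters lets the point of interest change from level to level without touching the similarity calculation itself.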

【００５５】 FIG. 8 shows the classification categories used by the automatic document classification system according to Embodiment 1 and the classification reference sentences corresponding to them. In the figure, reference numeral 41, which points to a region enclosed by a broken line, indicates that the parent classification category C0 has classification categories C1 and C2 as classification destinations, and that the sentences (which may be single words) "memory" and "printer" are clustered as their respective classification reference sentences. Likewise, reference numeral 42, also enclosed by a broken line, indicates that the parent classification category C1 has classification categories C3 and C4 as classification destinations, with the sentences "memory is insufficient" (メモリが不足), "not enough memory" (メモリが足りない), and "adding memory" (メモリの増設) clustered as their classification reference sentences. The classification category C0 is taken to contain the unclassified classification target documents, that is, all of the classification target documents.

【００５６】 Suppose that in step ST3-1 the two classification categories C1 and C2 indicated by reference numeral 41 in FIG. 8 are given to the automatic document classification system according to Embodiment 1. Then, in step ST3-2, the document analysis means 4 analyzes the sentence structure of the classification reference sentences "memory" and "printer" given for the two classification categories. In this case, because the classification reference sentences of the two classification categories C1 and C2 are single words ("memory" and "printer"), the words themselves are the analysis result.

【００５７】 Next, because the parent classification category of these two classification reference sentences is C0, the similarity is calculated between the result of analyzing the free-text answers (FIG. 4) and the analysis results of "memory" and "printer", using the weights of rule number 1 indicated by reference numeral 36 in FIG. 7 (the parameters set for each similarity under rule number 1, indicated by reference numeral 31 in FIG. 6), and the classification destination is determined. That is, in step ST3-3, the document classifying means 9 takes the structure-analyzed classification target documents out of the classification target document analysis result storage unit 6 one sentence at a time. Because nothing has been classified at first, the classification category C0, which contains all of the classification target documents, is set. The document similarity calculating means 8 then receives the analysis result of questionnaire number 1 from the document classifying means 9 and calculates the similarity in step ST3-4.

【００５８】 To describe this similarity calculation concretely: the analysis result of "memory" and questionnaire number 1, "cannot add memory", share the independent word "memory", and the phrase similarity parameter of rule number 1 is 0.8, so their similarity is calculated as 0.8. The analysis result of "printer" and questionnaire number 1, on the other hand, share no independent word, so their similarity is 0. Accordingly, in step ST3-5, the document classifying means 9 uses these similarities to classify the classification target document with questionnaire number 1 into the most similar classification category, C1.
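Written out with equation (B) and the rule number 1 parameters (α = 0.8, β = 0.2), the two results take the following form. This is a sketch under stated assumptions: the phrase similarity is taken to be 1 when the single independent word of the reference sentence is shared, and the dependency similarity is taken to be 0 because a one-word reference sentence contains no dependency. The description itself gives only the final values 0.8 and 0.

```latex
\begin{align*}
\mathrm{sim}(\text{``memory''},\ \text{answer 1})
  &= \alpha \cdot 1 + \beta \cdot 0 = 0.8 \cdot 1 + 0.2 \cdot 0 = 0.8 \\
\mathrm{sim}(\text{``printer''},\ \text{answer 1})
  &= \alpha \cdot 0 + \beta \cdot 0 = 0
\end{align*}
```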

【００５９】 For the other questionnaire numbers 2 to 2000, the processing of steps ST3-3 to ST3-5 is executed in the same way to calculate the similarity. A classification target document whose similarity is 0 for every classification category is classified into the "other" category. When all classification target documents have been processed, step ST3-6 yields the final set of (classification category, questionnaire number list) pairs: (C1, [1, 3, 7, 8, 12, ...]), (C2, [11, ...]), (C0', [2, 10, ...]). Here, Ci' denotes the "other" category for documents under Ci that match none of the specified classification categories.
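The assignment with an "other" fallback and the resulting (category, number list) pairs can be sketched as follows. The function names are hypothetical, and apart from the scores for questionnaire number 1, which follow the worked example, the similarity values are made up to reproduce the category memberships quoted in the text.

```python
def assign(scores, other):
    """Return the best-scoring category, or `other` when every score is 0."""
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else other


def group(all_scores, other):
    """Collect questionnaire numbers per assigned category."""
    result = {}
    for number in sorted(all_scores):
        result.setdefault(assign(all_scores[number], other), []).append(number)
    return result


# Scores for number 1 follow the worked example; the rest are illustrative.
scores = {
    1: {"C1": 0.8, "C2": 0.0},
    2: {"C1": 0.0, "C2": 0.0},
    3: {"C1": 0.8, "C2": 0.0},
    11: {"C1": 0.0, "C2": 0.8},
}
print(group(scores, other="C0'"))
```

How ties between categories with equal non-zero scores are resolved is not stated in the description; this sketch simply keeps the first category encountered.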

【００６０】 Next, the classification category C1 is classified further in the same way, following the processing flow of FIG. 5. The classification category C1 is divided into the two classification categories C3 and C4 indicated by reference numeral 42 in FIG. 8. Two classification reference sentences, "memory is insufficient" and "not enough memory", are clustered under the classification category C3, and one classification reference sentence, "adding memory", is clustered under the classification category C4.

【００６１】 When classifying the classification category C1 in this way, the classification rule for level 2 shown in FIG. 7 (rule number 2) is applied. As shown in FIG. 6, rule number 2 calculates the similarity by equation (B) with α = 0.2, β = 0.8, and γ = 1.0, so the dependency similarity is emphasized as the point of interest at classification time. As a result, the classification target documents with questionnaire numbers 7 and 8, whose dependency structures resemble those of the classification reference sentences "memory is insufficient" and "not enough memory" for the classification category C3, are classified into the classification category C3.

【００６２】 Likewise, questionnaire numbers 1 and 3, whose dependency structures resemble that of the classification reference sentence "adding memory" for the classification category C4, are classified into the classification category C4. The classification target sentences of questionnaire numbers 1 and 3 differ in their modals, but because γ = 1.0 they have the same similarity to the classification reference sentence "adding memory". The result is therefore (C3, [7, 8, ...]), (C4, [1, 3, ...]), (C1', [12, ...]).

【００６３】 Finally, the classification category C4 is classified by comparing the classification reference sentence "want to add memory" (メモリ増設したい) for the classification category C5 and the classification reference sentence "cannot add memory" (メモリ増設できない) for the classification category C6 with questionnaire numbers 1 and 3. Because this is a third-level classification applied to the classification category C4, rule number 3 is applied, as shown in FIG. 7. Accordingly, α = 0.2, β = 0.8, and γ = 0.1, so a difference in modality produces a difference in similarity: questionnaire number 1 is classified into the classification category C6 and questionnaire number 3 into C5.

【００６４】 FIG. 9 shows the classification result of the automatic document classification system according to Embodiment 1. As described above, the classification target documents are classified hierarchically according to the point of interest of each classification. Starting from the unclassified "whole" (classification category C0), the first level divides the documents into three categories, "memory", "printer", and "other", corresponding to the classification categories C1, C2, and C0'. At the second level, the classification category C1 is divided into three categories, "memory shortage", "adding memory", and "other", corresponding to the classification categories C3, C4, and C1'. Finally, at the third level, the classification category C4 is divided into two categories, "want to add memory" and "cannot add memory", corresponding to the classification categories C5 and C6.
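The level-by-level descent that produces the tree of FIG. 9 can be sketched as follows. This is a simplified illustration, not the system's procedure: the token-overlap similarity stands in for equation (B), the modal is treated as an ordinary token, the per-level rules are omitted, and the `TREE` layout, names, and English glosses are assumptions.

```python
def overlap(ref, doc):
    """Toy similarity: share of the reference's tokens found in the document."""
    return len(set(ref) & set(doc)) / len(set(ref))


def classify_path(doc, tree, root, similarity=overlap):
    """Descend from `root`, choosing the best-matching child at each level."""
    path = [root]
    node = root
    while node in tree:
        # A category's score is that of its most similar reference sentence.
        scores = {
            child: max(similarity(ref, doc) for ref in refs)
            for child, refs in tree[node].items()
        }
        best = max(scores, key=scores.get)
        if scores[best] <= 0:
            path.append(node + "'")  # "other" category of this parent
            break
        path.append(best)
        node = best
    return path


# Parent -> {child: reference sentences}, loosely following FIG. 8 and FIG. 9.
TREE = {
    "C0": {"C1": [("memory",)], "C2": [("printer",)]},
    "C1": {"C3": [("memory", "insufficient"), ("memory", "not-enough")],
           "C4": [("memory", "add")]},
    "C4": {"C5": [("memory", "add", "want")],
           "C6": [("memory", "add", "cannot")]},
}

print(classify_path(("memory", "add", "cannot"), TREE, "C0"))
print(classify_path(("screen", "dark"), TREE, "C0"))
```

Passing the similarity function as an argument is a design choice of this sketch; it leaves room to substitute a level-dependent calculation based on equation (B).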

【００６５】 In the embodiment described above, the similarity between a classification reference sentence and a classification target document is calculated simply between the result of analyzing the classification reference sentence and the result of analyzing the classification target document. When the classification target document consists of a plurality of sentences, however, the similarity may instead be calculated between the analysis result of each sentence of the classification target document and the analysis result of the classification reference sentence, with the maximum of these taken as the similarity between the classification reference sentence and the classification target document. This makes it possible to classify using a point of interest that better suits the user's purpose, and also to speed up classification.
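Taking the maximum over the sentences of a document is a one-line reduction. A minimal sketch follows, assuming a hypothetical `document_similarity` helper and a toy token-overlap score in place of equation (B).

```python
def document_similarity(reference, sentences, sentence_similarity):
    """Score a multi-sentence document by its best-matching sentence."""
    return max((sentence_similarity(reference, s) for s in sentences), default=0.0)


def overlap(ref, sent):
    """Toy sentence similarity: share of reference tokens found in the sentence."""
    return len(set(ref) & set(sent)) / len(set(ref))


# A two-sentence answer; only the second sentence matches the reference.
doc = [("printer", "slow"), ("memory", "add", "cannot")]
print(document_similarity(("memory", "add"), doc, overlap))
```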

【００６６】 Furthermore, in the embodiment described above, the similarity may be calculated between the analysis result of each sentence of the classification target document and the analysis result of the classification reference sentence, and when the similarity is equal to or greater than a fixed threshold, the document may be classified into a plurality of classification categories. This makes it possible to classify using a point of interest that better suits the user's purpose.
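Threshold-based assignment to several categories can be sketched as a simple filter. The function name, the scores, and the threshold value below are illustrative assumptions; the description does not give concrete values.

```python
def categories_at_or_above(scores, threshold):
    """Return every category whose similarity reaches the threshold."""
    return sorted(cat for cat, score in scores.items() if score >= threshold)


# Made-up similarities of one document to three categories.
scores = {"C1": 0.8, "C2": 0.6, "C3": 0.1}
print(categories_at_or_above(scores, threshold=0.5))
```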

【００６７】 In the embodiment, automatic classification is performed after the classification categories and their classification reference sentences have been set in advance. Alternatively, the user of the system of the present invention may be provided with an interface for setting, under any classification category, a lower level into which that category is to be classified in more detail, so that classification is performed interactively and step by step. This makes it possible to classify using a point of interest that better suits the user's purpose.

【００６８】 Furthermore, in the embodiment described above, the phrase similarity of equation (B) is calculated over all independent words that appear in the classification reference sentence and the classification target document. The words for which the phrase similarity is obtained may instead be restricted to nominals only or to predicates only. This makes it possible to speed up classification.
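Restricting the phrase comparison to one word class can be sketched as a filter applied before counting shared words. The word-class labels, function names, and example words are assumptions for illustration; the description does not specify how word classes are represented.

```python
def restrict(words, keep=None):
    """Keep only independent words of the given class; keep=None keeps all.

    `words` is a list of (word, word_class) pairs, where word_class is
    "nominal" or "predicate" (labels assumed for this sketch).
    """
    return [w for w, cls in words if keep is None or cls == keep]


def shared_count(ref, target, keep=None):
    """Count the independent words shared by two sentences after filtering."""
    return len(set(restrict(ref, keep)) & set(restrict(target, keep)))


ref = [("printer", "nominal"), ("slow", "predicate")]
target = [("printer", "nominal"), ("broken", "predicate")]
print(shared_count(ref, target), shared_count(ref, target, keep="predicate"))
```

Filtering first means fewer words take part in each comparison, which is consistent with the speed-up the paragraph describes.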

【００６９】 In the embodiment, processing is performed on the assumption that the similarity threshold is a fixed value that does not depend on the classification rule. The threshold may instead be settable for each classification rule. This makes it possible to classify using a point of interest that better suits the user's purpose.

【００７０】 As described above, Embodiment 1 comprises: the document analysis means 4, which analyzes the sentence structure of the classification reference sentences provided for the classification categories and of the classification target documents to be classified into those categories; the document similarity calculating means 8, which clusters the classification reference sentences based on the analysis result of the document analysis means 4 and, in accordance with a classification rule that defines the point of interest of the classification, calculates the similarity, with respect to that point of interest, between the analysis result of a classification target document and the clustered classification reference sentences; and the document classifying means 9, which determines the classification category of the classification target document based on the similarity calculated by the document similarity calculating means 8. Because the classification reference sentences, serving as sample data, are clustered into the classification categories C0 to Cn and their similarity to the classification target documents is used, automatic classification, which previously required training on a large number of learning documents, can be performed with a small amount of sample data. In addition, because the similarity is calculated in accordance with a classification rule that defines the point of interest of the classification, documents can be classified using a point of interest that suits the user's purpose.

[0071] Further, according to Embodiment 1, the classification rule defines the point of interest for classification in terms of feature amounts of natural-language documents, so the same effects as above are obtained.

[0072] Further, according to Embodiment 1, the classification rule uses, as the feature amount of a natural-language document, at least one of: a phrase similarity indicating the similarity between the phrases (bunsetsu) constituting the classification reference sentence and the classification target document; a dependency similarity indicating the similarity of the dependency relations between those phrases; and a modal similarity indicating the similarity of the modality expressions of the classification reference sentence and the classification target document. The same effects as above are therefore obtained.

[0073] Further, according to Embodiment 1, the classification rule defines a plurality of points of interest to be considered at classification time and sets, for the similarity relating to each point of interest, a threshold according to the purpose of classification, and document classification means 9 determines the classification category of the classification target document based on the similarities at or above those thresholds. Classification can therefore use points of interest better suited to the user's purpose.
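
The per-point-of-interest thresholds described above can be pictured with the following minimal Python sketch. The rule layout, the names `ClassificationRule` and `similarities_passing`, and the numeric values are assumptions made for illustration; the source does not specify how a rule is stored or how the surviving similarities are combined.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ClassificationRule:
    """Hypothetical rule: one threshold per point of interest."""
    thresholds: Dict[str, float]  # e.g. {"phrase": 0.6, "modal": 0.8}


def similarities_passing(rule: ClassificationRule,
                         similarities: Dict[str, float]) -> Dict[str, float]:
    """Keep only the similarities that reach the threshold of their point of interest.

    Points of interest that the rule does not mention are ignored.
    """
    return {
        point: value
        for point, value in similarities.items()
        if point in rule.thresholds and value >= rule.thresholds[point]
    }


# Illustrative values only.
rule = ClassificationRule(thresholds={"phrase": 0.6, "modal": 0.8})
scores = {"phrase": 0.7, "dependency": 0.9, "modal": 0.5}
print(similarities_passing(rule, scores))  # {'phrase': 0.7}
```

In this sketch only the phrase similarity would be handed on to the classification step, because the modal similarity falls below its threshold and the rule does not name the dependency similarity.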

[0074] Further, according to Embodiment 1, the classification target document is composed of a plurality of sentences and document similarity calculation means 8 calculates the similarity for the classification target document sentence by sentence, so classification can be sped up.

[0075] Further, according to Embodiment 1, document similarity calculation means 8 takes, as the similarity between the classification target document and a classification reference sentence, the highest of the similarities between the sentences constituting the classification target document and that classification reference sentence. Classification can therefore use points of interest better suited to the user's purpose, and classification can be sped up further.
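
As a rough illustration of taking the highest sentence-level similarity as the document-level similarity, consider the Python sketch below. The function names are hypothetical, and `token_overlap` is only a stand-in (Jaccard overlap of whitespace tokens); it is not the similarity formula used in the embodiment.

```python
from typing import Callable, Iterable


def document_similarity(sentences: Iterable[str],
                        reference: str,
                        sentence_similarity: Callable[[str, str], float]) -> float:
    """Score a document against one reference sentence by its best-matching sentence."""
    return max((sentence_similarity(s, reference) for s in sentences), default=0.0)


def token_overlap(a: str, b: str) -> float:
    """Stand-in similarity (Jaccard overlap of tokens), NOT the patent's formula."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


doc = ["the screen is hard to read", "the price is too high"]
print(document_similarity(doc, "the price is high", token_overlap))  # 0.8
```

The second sentence shares four of five distinct tokens with the reference sentence, so it alone determines the document's score; the weaker match of the first sentence is discarded.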

[0076] Further, according to Embodiment 1, document classification means 9 classifies the classification target documents hierarchically according to the points of interest for classification, and a classification rule is set in advance for each level of the hierarchy corresponding to a point of interest. Classification can therefore use points of interest better suited to the user's purpose.
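
The hierarchical classification with one preset rule per level might be sketched as follows. The `Category` structure, the single shared threshold, and the policy of descending into the best-scoring category are assumptions chosen for illustration; the source only states that a rule is preset for each level.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Category:
    """Hypothetical category node with its reference sentences and sub-categories."""
    name: str
    reference_sentences: List[str]
    children: List["Category"] = field(default_factory=list)


# A rule scores (document, reference sentence) under one point of interest.
Rule = Callable[[str, str], float]


def classify_hierarchically(document: str,
                            top_level: List[Category],
                            rules_by_level: List[Rule],
                            threshold: float) -> List[str]:
    """Walk down the hierarchy, applying the rule preset for each level.

    Returns the path of chosen category names; stops as soon as no
    category at the current level reaches the threshold.
    """
    path: List[str] = []
    candidates = top_level
    for rule in rules_by_level:
        best: Optional[Category] = None
        best_score = threshold
        for category in candidates:
            score = max((rule(document, ref) for ref in category.reference_sentences),
                        default=0.0)
            if score >= best_score:
                best, best_score = category, score
        if best is None:
            break
        path.append(best.name)
        candidates = best.children
    return path
```

A caller would supply one scoring function per level, for example a phrase-based rule at the top level and a modality-based rule at the second level.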

[0077]

[Effects of the Invention] As described above, according to the present invention, the sentence structure of the classification reference sentences provided for the classification fields and of the classification target document to be classified into those fields is analyzed; the classification reference sentences are clustered based on the analysis results; the similarity with respect to the point of interest between the analysis result of the classification target document and the clustered classification reference sentences is calculated in accordance with a classification rule that defines the point of interest for classification; and the classification field of the classification target document is determined based on that similarity. Because the similarity between the clustered classification reference sentences, serving as sample data, and the classification target document is used, automatic classification that previously required training on a large number of learning documents can be performed with a small amount of sample data. In addition, because the similarity is calculated in accordance with a classification rule that defines the point of interest for classification, documents can be classified using a point of interest suited to the user's purpose.

[0078] According to the present invention, the classification rule defines the point of interest for classification in terms of feature amounts of natural-language documents, so applying this to the configuration of paragraph 0077 above yields the same effects.

[0079] According to the present invention, the classification rule uses, as the feature amount of a natural-language document, at least one of: a phrase similarity indicating the similarity between the phrases constituting the classification reference sentence and the classification target document; a dependency similarity indicating the similarity of the dependency relations between those phrases; and a modal similarity indicating the similarity of the modality expressions of the classification reference sentence and the classification target document. Applying this to the configuration of paragraph 0078 above yields the same effects.

[0080] According to the present invention, the classification rule defines a plurality of points of interest to be considered at classification time and sets, for the similarity relating to each point of interest, a threshold according to the purpose of classification, and the classification field of the classification target document is determined based on the similarities at or above those thresholds. Classification can therefore use points of interest better suited to the user's purpose.

[0081] According to the present invention, the similarity is calculated sentence by sentence for a classification target document composed of a plurality of sentences, so classification can be sped up.

[0082] According to the present invention, the highest of the similarities between the sentences constituting the classification target document and a classification reference sentence is taken as the similarity between the classification target document and that classification reference sentence. Classification can therefore use points of interest better suited to the user's purpose, and classification can be sped up further.

[0083] According to the present invention, when the classification rule uses, as the feature amount of a natural-language document, the phrase similarity indicating the similarity between the phrases constituting the classification reference sentence and the classification target document, the phrase similarity is calculated only from phrases consisting of nominals (taigen) or only from phrases consisting of declinable words (yogen), so classification can be sped up.
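
To make the restriction to one word class concrete, the following Python sketch filters phrases before comparing them. The phrase representation, the class labels `"nominal"` and `"declinable"`, and the Jaccard-style score are all assumptions for illustration; the actual phrase similarity is given by the formula defined earlier in the description, which is not reproduced here.

```python
from typing import List, Set, Tuple

# A phrase is modelled as (head word, word class); the class labels are assumed.
Phrase = Tuple[str, str]


def select_phrases(phrases: List[Phrase], word_class: str) -> Set[str]:
    """Keep only the head words of phrases whose word class matches."""
    return {word for word, cls in phrases if cls == word_class}


def restricted_phrase_similarity(a: List[Phrase], b: List[Phrase],
                                 word_class: str) -> float:
    """Stand-in phrase similarity (Jaccard overlap) over one word class only."""
    wa, wb = select_phrases(a, word_class), select_phrases(b, word_class)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


reference = [("price", "nominal"), ("lower", "declinable")]
answer = [("price", "nominal"), ("service", "nominal"), ("improve", "declinable")]
print(restricted_phrase_similarity(reference, answer, "nominal"))     # 0.5
print(restricted_phrase_similarity(reference, answer, "declinable"))  # 0.0
```

Because phrases of the other word class are dropped before any comparison, fewer phrase pairs need to be examined, which is the source of the speed-up the paragraph describes.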

[0084] According to the present invention, the classification target documents are classified hierarchically according to the points of interest for classification, and a classification rule is set in advance for each level of the hierarchy corresponding to a point of interest. Classification can therefore use points of interest better suited to the user's purpose.

[0085] According to the present invention, when the similarities between the classification target document and a plurality of classification reference sentences are at or above a fixed threshold, the classification target document is classified into the classification field corresponding to each of those classification reference sentences. Classification can therefore use points of interest better suited to the user's purpose.
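
Assigning a document to every classification field whose similarity reaches the threshold could look like the short Python sketch below. The function name, the per-field score dictionary, and the best-first ordering of the result are illustrative assumptions.

```python
from typing import Dict, List


def categories_at_or_above(similarity_by_category: Dict[str, float],
                           threshold: float) -> List[str]:
    """Return every category whose similarity reaches the threshold, best first."""
    hits = [(score, name) for name, score in similarity_by_category.items()
            if score >= threshold]
    return [name for score, name in sorted(hits, key=lambda h: -h[0])]


# Illustrative scores only.
print(categories_at_or_above({"C1": 0.82, "C2": 0.40, "C3": 0.75}, 0.7))  # ['C1', 'C3']
```

Here the document would be filed under both C1 and C3, while C2 is left out because its similarity is below the threshold.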

[0086] According to the present invention, a plurality of classification reference sentences are set for one classification field, the similarity between each classification reference sentence and the classification target document is calculated, and the classification target document is classified into the classification field corresponding to the classification reference sentence whose similarity indicates the highest degree of similarity among those calculated. Classification can therefore use points of interest better suited to the user's purpose, and classification can be sped up further.
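
Selecting the classification field of the single best-matching reference sentence, when each field has several reference sentences, can be sketched in Python as follows. The function name and the dictionary layout are hypothetical, and the similarity function is supplied by the caller rather than taken from the source.

```python
from typing import Callable, Dict, List, Optional, Tuple


def best_category(document: str,
                  references: Dict[str, List[str]],
                  similarity: Callable[[str, str], float]
                  ) -> Optional[Tuple[str, str, float]]:
    """Return (category, reference sentence, score) for the single best match.

    Returns None when no reference sentences are registered at all.
    """
    best: Optional[Tuple[str, str, float]] = None
    for category, sentences in references.items():
        for sentence in sentences:
            score = similarity(document, sentence)
            if best is None or score > best[2]:
                best = (category, sentence, score)
    return best
```

Returning the matching reference sentence along with the category makes it easy to show a user why a document was placed in a given field.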

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of the automatic document classification system according to Embodiment 1 of the present invention.

FIG. 2 is a flowchart showing the operation of the automatic document classification system according to Embodiment 1 of the present invention.

FIG. 3 is a diagram showing free-text questionnaire responses as an example of classification target documents classified by the automatic document classification system according to Embodiment 1.

FIG. 4 is an example showing the result of analyzing the sentence structure of the response to questionnaire number 1 in FIG. 3.

FIG. 5 is a flowchart showing the similarity calculation process and the document classification process in the automatic document classification system according to Embodiment 1.

FIG. 6 is a diagram showing an example of the classification rules used by the automatic document classification system according to Embodiment 1.

FIG. 7 is a diagram showing an example in which classification rules are applied to classification target documents that are classified hierarchically.

FIG. 8 is a diagram showing the classification categories used by the automatic document classification system according to Embodiment 1 and the classification reference sentences corresponding to them.

FIG. 9 is a diagram showing a classification result of the automatic document classification system according to Embodiment 1.

FIG. 10 is a block diagram showing the configuration of a conventional automatic document classification system.

[Explanation of symbols]

1: dictionary
2: classification reference sentence storage unit
3: classification target document storage unit
4: document analysis means (document structure analysis means)
5: classification reference sentence analysis result storage unit
6: classification target document analysis result storage unit
7: classification rule storage unit
8: document similarity calculation means (similarity calculation means)
9: document classification means
10: classification result storage unit
31: classification rule
36: hierarchy level and rule number
41: parent classification category, classification category, and classification reference sentence
42: parent classification category, classification category, and classification reference sentence
C0 to C6: classification categories (classification fields)

Continuation of front page. (72) Inventor: Katsushi Suzuki, c/o Mitsubishi Electric Corporation, 2-2-3 Marunouchi, Chiyoda-ku, Tokyo. F-term (reference): 5B075 ND03 NK32 NR12 PR06 QM08

Claims

[Claims]

1. An automatic document classification system comprising: document structure analysis means for analyzing the sentence structure of classification reference sentences provided in correspondence with classification fields and of a classification target document to be classified into the classification fields; similarity calculation means for clustering the classification reference sentences based on the analysis results of the document structure analysis means and calculating, in accordance with a classification rule that defines a point of interest for classification, a similarity with respect to the point of interest between the analysis result of the classification target document and the clustered classification reference sentences; and document classification means for determining the classification field of the classification target document based on the similarity calculated by the similarity calculation means.

2. The automatic document classification system according to claim 1, wherein the classification rule defines the point of interest for classification by a feature amount of a natural-language document.

3. The automatic document classification system according to claim 2, wherein the classification rule uses, as the feature amount of the natural-language document, at least one of: a phrase similarity indicating the similarity between the phrases constituting the classification reference sentence and the classification target document; a dependency similarity indicating the similarity of the dependency relations between the phrases constituting the classification reference sentence and the classification target document; and a modal similarity indicating the similarity of the modality expressions of the classification reference sentence and the classification target document.

4. The automatic document classification system according to claim 1, wherein the classification rule defines a plurality of points of interest to be considered at classification time and sets, for the similarity relating to each point of interest, a threshold according to the purpose of classification, and the document classification means determines the classification field of the classification target document based on the similarities at or above the thresholds.

5. The automatic document classification system according to any one of the preceding claims, wherein the classification target document is composed of a plurality of sentences, and the similarity calculation means calculates the similarity for the classification target document sentence by sentence.

6. The automatic document classification system according to claim 5, wherein the similarity calculation means takes, as the similarity between the classification target document and a classification reference sentence, the highest of the similarities between the sentences constituting the classification target document and that classification reference sentence.

7. The automatic document classification system according to claim 3, wherein, when the classification rule uses, as the feature amount of the natural-language document, the phrase similarity indicating the similarity between the phrases constituting the classification reference sentence and the classification target document, the similarity calculation means calculates the phrase similarity only from phrases consisting of nominals (taigen) or only from phrases consisting of declinable words (yogen).

8. The automatic document classification system according to any one of claims 1 to 7, wherein the document classification means classifies the classification target documents hierarchically according to the points of interest for classification, and the classification rule is set in advance for each level of the hierarchy corresponding to a point of interest.

9. The automatic document classification system according to any one of claims 1 to 8, wherein, when the similarities between the classification target document and a plurality of classification reference sentences are equal to or greater than a predetermined threshold, the document classification means classifies the classification target document into the classification field corresponding to each of the plurality of classification reference sentences.

10. The automatic document classification system according to claim 1, wherein a plurality of classification reference sentences are set for one classification field, the similarity calculation means calculates the similarity between each of the classification reference sentences and the classification target document, and the document classification means classifies the classification target document into the classification field corresponding to the classification reference sentence whose similarity indicates the highest degree of similarity among the similarities calculated by the similarity calculation means.

11. An automatic document classification method comprising: a document structure analysis step of analyzing the sentence structure of classification reference sentences provided in correspondence with classification fields and of a classification target document to be classified into the classification fields; a similarity calculation step of clustering the classification reference sentences based on the analysis results of the document structure analysis step and calculating, in accordance with a classification rule that defines a point of interest for classification, a similarity with respect to the point of interest between the analysis result of the classification target document and the clustered classification reference sentences; and a document classification step of determining the classification field of the classification target document based on the similarity calculated in the similarity calculation step.

12. The automatic document classification method according to claim 11, wherein the classification rule defines the point of interest for classification by a feature amount of a natural-language document.

13. The automatic document classification method according to claim 12, wherein the classification rule uses, as the feature amount of the natural-language document, at least one of: a phrase similarity indicating the similarity between the phrases constituting the classification reference sentence and the classification target document; a dependency similarity indicating the similarity of the dependency relations between the phrases constituting the classification reference sentence and the classification target document; and a modal similarity indicating the similarity of the modality expressions of the classification reference sentence and the classification target document.

14. The automatic document classification method according to claim 11, wherein the classification rule defines a plurality of points of interest to be considered at classification time and sets, for the similarity relating to each point of interest, a threshold according to the purpose of classification, and, in the document classification step, the classification field of the classification target document is determined based on the similarities at or above the thresholds.

15. The automatic document classification method according to claim 11, wherein, in the similarity calculation step, the similarity is calculated sentence by sentence for a classification target document composed of a plurality of sentences.

16. The automatic document classification method according to claim 15, wherein, in the similarity calculation step, the highest of the similarities between the sentences constituting the classification target document and a classification reference sentence is taken as the similarity between the classification target document and that classification reference sentence.

17. The automatic document classification method according to claim 13, wherein, in the similarity calculation step, when the classification rule uses, as the feature amount of the natural-language document, the phrase similarity indicating the similarity between the phrases constituting the classification reference sentence and the classification target document, the phrase similarity is calculated only from phrases consisting of nominals (taigen) or only from phrases consisting of declinable words (yogen).

18. The automatic document classification method according to any one of claims 11 to 17, wherein, in the document classification step, the classification target documents are classified hierarchically according to the points of interest for classification, and the classification rule is set in advance for each level of the hierarchy corresponding to a point of interest.

19. The automatic document classification method according to claim 11, wherein, in the document classification step, when the similarities between the classification target document and a plurality of classification reference sentences are equal to or greater than a predetermined threshold, the classification target document is classified into the classification field corresponding to each of the plurality of classification reference sentences.

20. The automatic document classification method according to claim 11, wherein a plurality of classification reference sentences are set for one classification field, the similarity between each of the classification reference sentences and the classification target document is calculated in the similarity calculation step, and, in the document classification step, the classification target document is classified into the classification field corresponding to the classification reference sentence whose similarity indicates the highest degree of similarity among the similarities calculated in the similarity calculation step.

21. A computer-readable recording medium on which is recorded an automatic document classification program that causes a computer to implement: a document structure analysis function of analyzing the sentence structure of classification reference sentences provided in correspondence with classification fields and of a classification target document to be classified into the classification fields; a similarity calculation function of clustering the classification reference sentences based on the analysis results of the document structure analysis function and calculating, in accordance with a classification rule that defines a point of interest for classification, a similarity with respect to the point of interest between the analysis result of the classification target document and the clustered classification reference sentences; and a document classification function of determining the classification field of the classification target document based on the calculated similarity.