JP2001331514A

JP2001331514A - Document classification device and document classification method

Info

Publication number: JP2001331514A
Application number: JP2000148443A
Authority: JP
Inventors: Eiji Kenmochi; 栄治剣持
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2000-05-19
Filing date: 2000-05-19
Publication date: 2001-11-30

Abstract

(57) [Abstract] [Problem] To provide a document classification device and a document classification method that prevent document classification contrary to the user's intention and that can easily generate initial classification representative feature vectors. [Solution] A document classification device comprising: a document input unit 101; a document analysis unit 102 that analyzes the words of document data; a document feature vector generation unit 103 that calculates a document feature vector for each document; a classification representative vector generation unit 104 that generates classification representative vectors having the same number of dimensions as the document feature vectors; a refinement exclusion vector designation unit 105 that designates classification representative vectors that are not to be subjected to refinement processing; a document data assignment unit 106 that assigns each piece of document data to one of the classification representative vectors; a classification representative vector refinement unit 107 that recalculates the classification representative vectors, except for the refinement-excluded vectors, based on the document feature vectors assigned by the document data assignment unit; and a classification result storage unit 108.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a document classification device and a document classification method.

[0002]

[Prior Art] In recent years, the spread of the Internet and similar networks has made large amounts of document information accessible, and intellectual work such as classifying large collections of gathered document information into meaningful groups and grasping the structure of a document set has begun to be carried out. When analyzing a large document set, the analysis can be performed efficiently by first classifying the set by several topics and then carrying out the various tasks on each resulting partial document set (a plurality of documents gathered according to some criterion). Because the cost in labor and time becomes enormous when a user classifies a large amount of document information manually, a device that can automatically classify a document set according to document content is desired.

[0003] By extracting the words that make up a document using natural language processing such as Japanese morphological analysis, the document can be represented spatially as a vector of word frequencies (a document feature vector). This is called the vector space model of documents and is widely used. Because the vector space model maps documents into a measurable space, statistical methods can be used to classify documents automatically according to their content.
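
As a rough illustration of the word-frequency representation described in [0003], the following minimal Python sketch turns tokenized documents into document feature vectors. The function names and the sample tokens are hypothetical; a real system would obtain the tokens from a morphological analyzer, which is outside the scope of this sketch.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Return a sorted list of every distinct word in the corpus."""
    return sorted({word for doc in tokenized_docs for word in doc})


def document_feature_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Hypothetical, already-tokenized documents (illustrative data only).
docs = [
    ["printer", "toner", "printer"],
    ["network", "printer"],
    ["network", "router", "network"],
]
vocabulary = build_vocabulary(docs)
vectors = [document_feature_vector(d, vocabulary) for d in docs]
```

Every vector has one component per vocabulary word, so all documents share the same dimensionality and can be compared directly.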

[0004] A representative example of such a method, which uses statistical techniques to classify documents automatically by content, is the clustering method described in Japanese Patent Application Laid-Open No. Hei 7-114572.

[0005] However, the vector space model used in clustering methods must hold and compute high-dimensional vector information for each of a large number of documents, and therefore places heavy demands on computational resources.

[0006] It is therefore efficient to use a method, such as non-hierarchical clustering, in which several initial classification representative feature vectors are first set and documents are classified by assigning each document to the appropriate classification representative value based on the similarity between those classification representative feature vectors and the document's feature vector.
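
The assignment step described in [0006] can be sketched as follows. The source does not name a similarity measure; cosine similarity is used here purely as an assumption, and the function names and sample vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign_documents(doc_vectors, representatives):
    """Return, for each document, the index of its most similar representative."""
    assignments = []
    for vec in doc_vectors:
        sims = [cosine_similarity(vec, rep) for rep in representatives]
        assignments.append(sims.index(max(sims)))  # ties go to the lowest index
    return assignments


# Illustrative data: three documents, two initial representative vectors.
doc_vectors = [[0, 2, 0, 1], [1, 3, 0, 0], [2, 0, 1, 0]]
representatives = [[0, 1, 0, 0], [1, 0, 0, 0]]
assignments = assign_documents(doc_vectors, representatives)
```

Only the representative vectors and one document vector at a time are needed to make each assignment, which is the efficiency argument the paragraph makes.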

[0007] This approach also has the advantage that, by specifying the initial classification representative feature vectors, the user can obtain the document classification he or she intends.

[0008]

[Problems to Be Solved by the Invention] However, in conventional non-hierarchical clustering methods it is common, in order to improve classification accuracy, to perform refinement processing in which the classification representative feature vectors are changed dynamically and the assignment of documents is executed iteratively (for example, the centroid of the vectors is computed, documents are classified using this centroid as the new classification representative feature vector, and this classification operation is repeated a predetermined number of times or until the classification error is minimized). As a result of this refinement processing, the classification representative feature vectors originally specified by the user change, and a document classification result that does not match the user's intention may be produced (for example, because the centroid moves during refinement, the original classification representative feature vector may itself come to be classified under another classification representative feature vector).

[0009] Further, when the user can specify the classification representative feature vectors, the information available to the user is, for example, documents and the words that make up those documents. By specifying the initial classification representative feature vectors with logical expressions whose elements are documents and words, as widely used in conventional document retrieval, the user can easily specify the initial classification representative values. The quality of the resulting classification is expected to depend on how the initial classification representative feature vectors are generated from the specified logical expression, but conventional methods do not address this point.

[0010] In view of the above problems, an object of the present invention is to perform refinement processing while excluding the initial classification representative feature vectors specified by the user, so that refinement proceeds with those user-specified vectors held fixed, thereby preventing document classification contrary to the user's intention.

[0011] A further object is to make it easy for the user to generate initial classification representative feature vectors by expanding the specified logical expression into a logical sum (OR) whose terms are logical products (AND) of documents and words, treating each logical product as a combination of the feature vectors of the documents and words that constitute it, and using each term (logical product) of the logical sum as one initial classification representative feature vector.

[0012] In short, an object of the present invention is to provide a document classification device and a document classification method that prevent document classification contrary to the user's intention and that can easily generate initial classification representative feature vectors.

[0013]

[Means for Solving the Problems] The present invention provides the following means for solving the above problems, together with their operation and effects.

[0014] The invention described in claim 1 is a document classification device that sets a plurality of initial classification representative feature vectors and classifies documents based on the similarity between the classification representative feature vectors and the feature vectors of the documents, the device comprising a refinement processing unit (for example, the classification representative vector refinement unit 107 in FIG. 1) that, to improve classification accuracy, dynamically changes the classification representative feature vectors and iteratively assigns documents, wherein the refinement processing unit does not perform refinement processing on some or all of the classification representative feature vectors.

[0015] In the invention described in claim 1, bypassing the refinement processing for some or all of the initial classification representative feature vectors makes it possible to obtain a classification result that clearly shows the viewpoint those initial classification representative vectors express.

[0016] The invention described in claim 2 is a document classification device that classifies documents according to their content, comprising: a document input unit (for example, the document input unit 101 in FIG. 1) that inputs document data; a document analysis unit (for example, the document analysis unit 102 in FIG. 1) that analyzes the words of the document data; a document feature vector generation unit (for example, the document feature vector generation unit 103 in FIG. 1) that calculates a document feature vector for each document based on the analysis result of the document analysis unit; a classification representative vector generation unit (for example, the classification representative vector generation unit 104 in FIG. 1) that generates classification representative vectors having the same number of dimensions as the document feature vectors; a refinement exclusion vector designation unit (for example, the refinement exclusion vector designation unit 105 in FIG. 1) that designates classification representative vectors that are not to be subjected to refinement processing; a document data assignment unit (for example, the document data assignment unit 106 in FIG. 1) that assigns each piece of document data to one of the classification representative vectors based on the similarity between the document feature vectors and the classification representative vectors; a classification representative vector refinement unit (for example, the classification representative vector refinement unit 107 in FIG. 1) that, for the classification representative vectors other than those designated by the refinement exclusion vector designation unit, recalculates the classification representative vectors based on the document feature vectors assigned by the document data assignment unit, and repeats the document data assignment and the recalculation of the classification representative vectors until a specific criterion is satisfied; and a classification result storage unit (for example, the classification result storage unit 108 in FIG. 1) that stores the classification result.
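
The assign-and-recompute loop of claim 2 might be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: cosine similarity, the stopping rule (assignments unchanged, or an iteration cap), and all function and variable names are introduced here for the example. The source only says that similarity is used and that the loop repeats until a specific criterion is satisfied.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign(doc_vectors, reps):
    """Index of the most similar representative for each document."""
    result = []
    for vec in doc_vectors:
        sims = [cosine_similarity(vec, rep) for rep in reps]
        result.append(sims.index(max(sims)))
    return result


def refine(doc_vectors, representatives, excluded, max_iterations=10):
    """Alternate assignment and centroid recomputation.

    Representatives whose index is in `excluded` are never recomputed,
    so they stay exactly as initially specified.
    """
    reps = [[float(x) for x in rep] for rep in representatives]
    assignments = assign(doc_vectors, reps)
    for _ in range(max_iterations):
        for k in range(len(reps)):
            if k in excluded:
                continue  # refinement-excluded vector stays fixed
            members = [d for d, a in zip(doc_vectors, assignments) if a == k]
            if members:  # keep the old vector if no document was assigned
                reps[k] = [sum(col) / len(members) for col in zip(*members)]
        new_assignments = assign(doc_vectors, reps)
        if new_assignments == assignments:
            break  # assumed stopping criterion: assignments are stable
        assignments = new_assignments
    return reps, assignments


# Illustrative data: representative 0 is user-specified and excluded.
doc_vectors = [[4, 0], [3, 1], [0, 4], [1, 3]]
reps, assignments = refine(doc_vectors, [[1, 0], [0, 1]], excluded={0})
```

In this run representative 0 is returned unchanged, while representative 1 moves to the centroid of the documents assigned to it.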

[0017] In the invention described in claim 2, bypassing the refinement processing for the designated initial classification representative feature vectors makes it possible to obtain a classification result that clearly shows the viewpoint those designated vectors express.

[0018] The invention described in claim 3 is the document classification device according to claim 2, wherein, in the classification representative vector generation unit, some of the classification representative vectors are generated from information specified by the user.

[0019] In the invention described in claim 3, in addition to the features of the document classification device according to claim 2, the user specifies the initial classification representative vectors that bypass refinement processing, so a classification result reflecting the user's classification intention can be obtained.

[0020] The invention described in claim 4 is the document classification device according to claim 3, wherein the refinement exclusion vector designation unit designates, as the classification representative vectors not to be refined, only those classification representative vectors that the classification representative vector generation unit generated from information specified by the user.

[0021] In the invention described in claim 4, in addition to the features of the document classification device according to claim 3, refinement processing is not performed on the classification representative feature vectors specified by the user, so a classification result in which the user's classification intention is forcibly preserved can be obtained.

[0022] The invention described in claim 5 is the document classification device according to claim 4, wherein, in the classification representative vector generation unit, the information specified by the user is a logical expression over document data and words present in the document data.

[0023] In the invention described in claim 5, in addition to the features of the document classification device according to claim 4, generating the initial classification representative feature vectors from a logical expression over user-specified documents and words contained in the documents realizes a document classification device in which the user can easily describe his or her classification intention.

[0024] The invention described in claim 6 is the document classification device according to claim 5, further comprising: a classification representative vector generation information storage unit (for example, the classification representative vector generation information storage unit 201 in FIG. 8) that stores the information for generating the classification representative vectors specified in the classification representative vector generation unit; a classification representative vector generation information reading unit (for example, the classification representative vector generation information reading unit 202 in FIG. 8) that reads the information stored in the classification representative vector generation information storage unit; a refinement exclusion vector information storage unit (for example, the refinement exclusion vector information storage unit 203 in FIG. 8) that stores information on the classification representative vectors, specified in the classification representative vector generation unit, that are not to be subjected to refinement processing; and a refinement exclusion vector information reading unit (for example, the refinement exclusion vector information reading unit 204 in FIG. 8) that reads the information stored in the refinement exclusion vector information storage unit.

[0025] In the invention described in claim 6, in addition to the features of the document classification device according to claim 5, the specified initial classification representative feature vectors and the classification representative feature vectors excluded from refinement processing are stored, and a mechanism is provided for reading that information back later. This makes it possible to obtain a new classification result by slightly modifying a previous classification setup, and to obtain a classification result with the same settings even when the set of documents to be classified has changed.

[0026] The invention described in claim 7 is a document classification device that classifies documents according to their content, comprising: a document input unit (for example, the document input unit 101 in FIG. 11) that inputs document data; a document analysis unit (for example, the document analysis unit 102 in FIG. 11) that analyzes the words of the document data; a document feature vector generation unit (for example, the document feature vector generation unit 103 in FIG. 11) that calculates a document feature vector for each document from the analysis result of the document analysis unit; a classification representative vector designation generation unit (for example, the classification representative vector designation generation unit 301 in FIG. 11) that, for some of the classification representative vectors, expands a specified logical expression over a plurality of information primitives into a logical-sum (OR) combination whose elements are each a logical product (AND) of information primitives, and generates, from each resulting logical product of information primitives, a classification representative vector having the same number of dimensions as the document feature vectors; a classification representative vector information storage unit that stores information on the specified logical expression and the classification representative vectors generated from it; a classification representative vector automatic generation unit (for example, the classification representative vector automatic generation unit 303 in FIG. 11) that automatically generates classification representative vectors using the information generated by the document feature vector generation unit; a document classification unit (for example, the document classification unit 304 in FIG. 11) that classifies document data based on the similarity between the document feature vectors and the classification representative vectors; and a classification result storage unit (for example, the classification result storage unit 305 in FIG. 11) that generates and stores a classification result based on the information stored in the classification representative vector information storage unit.

[0027] In the invention described in claim 7, the specified logical expression over information primitives is expanded into a logical sum, and an initial classification representative vector is generated from each logical product of information primitives that makes up that logical sum, so a document classification result suited to the meaning of the written logical expression can be produced.
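
The expansion of a logical expression into a logical sum of logical products (disjunctive normal form) could be sketched as below. The nested-tuple encoding of expressions, the primitive names such as "doc1" and "word:printer", and the function name are all hypothetical choices for this illustration. Only AND and OR are handled, since the source does not mention negation, and duplicate terms are not merged.

```python
from itertools import product


def to_dnf(expr):
    """Expand an AND/OR expression into a list of conjunctions.

    A primitive is a string; a compound expression is a tuple
    ("and", operand, ...) or ("or", operand, ...). Each returned
    conjunction is a tuple of primitive names, and the list as a whole
    stands for their logical sum (OR).
    """
    if isinstance(expr, str):
        return [(expr,)]
    op, *operands = expr
    expanded = [to_dnf(operand) for operand in operands]
    if op == "or":
        # OR simply concatenates the conjunctions of its operands.
        return [conj for part in expanded for conj in part]
    if op == "and":
        # Distribute AND over OR: pick one conjunction from each operand.
        return [
            tuple(p for conj in combo for p in conj)
            for combo in product(*expanded)
        ]
    raise ValueError(f"unsupported operator: {op}")


# doc1 AND (printer OR network)
expr = ("and", "doc1", ("or", "word:printer", "word:network"))
conjunctions = to_dnf(expr)
```

Each tuple in `conjunctions` would then give rise to one initial classification representative vector.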

[0028] The invention described in claim 8 is the document classification device according to claim 7, further comprising a classification representative vector information reading unit (for example, the classification representative vector information reading unit 305 in FIG. 11) that reads the stored information on the classification representative vectors.

[0029] In the invention described in claim 8, in addition to the features of the document classification device according to claim 7, previously stored information on the logical expressions over information primitives and on the document representative vectors generated from them is reused, so a document classification device can be realized that classifies different document sets by the same classification criteria.

[0030] The invention described in claim 9 is the document classification device according to claim 8, wherein, in the classification representative vector generation unit, a classification representative vector is generated as the arithmetic mean of the information primitives contained in each generated logical product of information primitives.

[0031] In the invention described in claim 9, in addition to the features of the document classification device according to claim 8, for each generated logical product of information primitives, the document representative vector is generated from the arithmetic mean of the feature vectors that appropriately represent the constituent information primitives. This makes it possible to quantify the logical product appropriately, and thereby to produce a document classification result suited to the meaning of the logical expression.
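
A minimal sketch of the arithmetic-mean step follows. How each information primitive is turned into a feature vector is not specified in this passage; the sketch assumes a document primitive uses its word-frequency vector and a word primitive uses a one-hot vector, and the names and data are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise arithmetic mean of equal-length vectors."""
    if not vectors:
        raise ValueError("a conjunction needs at least one primitive")
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


# Hypothetical primitive vectors over the vocabulary
# (network, printer, router, toner). The one-hot encoding of the word
# primitive is an assumption of this sketch.
primitive_vectors = {
    "doc1": [0, 2, 0, 1],
    "word:printer": [0, 1, 0, 0],
}
conjunction = ("doc1", "word:printer")
representative = mean_vector([primitive_vectors[p] for p in conjunction])
```

The resulting vector has the same number of dimensions as the document feature vectors, so it can be compared with them directly.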

[0032] The invention described in claim 10 is the document classification device according to claim 9, wherein the information primitives specified in the classification representative vector generation unit are documents and words contained in documents.

[0033] In the invention described in claim 10, in addition to the features of the document classification device according to claim 9, using documents and words contained in documents as the information primitives realizes a document classification device in which logical expressions can be written easily.

[0034] The invention described in claim 11 is the document classification device according to claim 10, wherein the logical expression over information primitives specified in the classification representative vector generation unit is specified by the user.

[0035] In the invention described in claim 11, in addition to the features of the document classification device according to claim 10, the user can specify the logical expression, so a document classification result that makes the user's intention explicit can be produced.

[0036] The invention described in claim 12 is a document classification method that sets a plurality of initial classification representative feature vectors and classifies documents based on the similarity between the classification representative feature vectors and the feature vectors of the documents, the method comprising a refinement processing step of dynamically changing the classification representative feature vectors and iteratively assigning documents in order to improve classification accuracy, wherein the refinement processing step does not perform refinement processing on some or all of the classification representative feature vectors.

[0037] The invention described in claim 13 is a document classification method that classifies documents according to their content, comprising: a document input step of inputting document data; a document analysis step of analyzing the words of the document data; a document feature vector generation step of calculating a document feature vector for each document based on the analysis result of the document analysis step; a classification representative vector generation step of generating classification representative vectors having the same number of dimensions as the document feature vectors; a refinement exclusion vector designation step of designating classification representative vectors that are not to be subjected to refinement processing; a document data assignment step of assigning each piece of document data to one of the classification representative vectors based on the similarity between the document feature vectors and the classification representative vectors; a classification representative vector refinement step of, for the classification representative vectors other than those designated in the refinement exclusion vector designation step, recalculating the classification representative vectors based on the document feature vectors assigned in the document data assignment step, and repeating the document data assignment and the recalculation of the classification representative vectors until a specific criterion is satisfied; and a classification result storage step of storing the classification result.

[0038] The invention described in claim 14 is the document classification method according to claim 13, wherein, in the classification representative vector generation step, some of the classification representative vectors are generated from information specified by the user.

[0039] The invention described in claim 15 is the document classification method according to claim 14, wherein the refinement exclusion vector designation step designates, as the classification representative vectors not to be refined, only those classification representative vectors that were generated in the classification representative vector generation step from information specified by the user.

[0040] The invention described in claim 16 is the document classification method according to claim 15, wherein, in the classification representative vector generation step, the information specified by the user is a logical expression over document data and words present in the document data.

[0041] The invention described in claim 17 is the document classification method according to claim 16, further comprising: a classification representative vector generation information storage step of storing the information for generating the classification representative vectors specified in the classification representative vector generation step; a classification representative vector generation information reading step of reading the information stored in the classification representative vector generation information storage step; a refinement exclusion vector information storage step of storing information on the classification representative vectors, specified in the classification representative vector generation step, that are not to be subjected to refinement processing; and a refinement exclusion vector information reading step of reading the information stored in the refinement exclusion vector information storage step.

[0042] The invention described in claim 18 is a document classification method for classifying documents according to their content, comprising: a document input step of inputting document data; a document analysis step of analyzing the words of the document data; a document feature vector generation step of calculating a document feature vector for each document from the analysis result of the document analysis step; a classification representative vector designation generation step of expanding a specified logical expression of a plurality of information primitives into a logical sum whose elements are each a logical product of information primitives, and generating, from each resulting logical product, a classification representative vector having the same number of dimensions as the document feature vectors; a classification representative vector information storing step of storing information on the specified logical expression and the classification representative vectors generated from it; a classification representative vector automatic generation step of automatically generating classification representative vectors using the information generated in the document feature vector generation step; a document classification step of classifying the document data based on the similarity between the document feature vectors and the classification representative vectors; and a classification result storing step of generating and storing a classification result based on the information stored in the classification representative vector information storing step.

[0043] The invention described in claim 19 is the document classification method according to claim 18, further comprising a classification representative vector information reading step of reading the stored information on the classification representative vectors.

[0044] The invention described in claim 20 is the document classification method according to claim 19, wherein, in the classification representative vector generating step, each classification representative vector is generated as the arithmetic mean of the information primitives contained in a generated logical product of information primitives.

[0045] The invention described in claim 21 is the document classification method according to claim 20, wherein the information primitives specified in the classification representative vector generating step are documents and words contained in the documents.

[0046] The invention described in claim 22 is the document classification method according to claim 21, wherein the logical expression of information primitives specified in the classification representative vector generating step is specified by a user.

[0047] The document classification methods according to claims 12 to 22 are document classification methods suited to the document classification devices according to claims 1 to 11.

【００４８】[0048]

[Embodiments of the Invention] Next, embodiments of the present invention will be described with reference to the drawings.

[0049] In the present invention, a collection of one or more sentences written in natural language is called a document when it is the object of classification. It is assumed that an identifiable end-of-document symbol is placed at the end of each document.

[0050] To give concrete examples, a published patent application or a particular newspaper article is a document, and a claim or a single sentence extracted from one of them is also regarded as a document.

(First Embodiment)
FIG. 1 is a configuration diagram of a document classification device for explaining the first embodiment.

[0051] The document input unit 101 acquires a document or a group of documents from a keyboard, an OCR device, an auxiliary storage device such as a hard disk, or via a network, and inputs the document data.

[0052] The document analysis unit 102 performs natural language analysis on each input document and extracts the words, their parts of speech, and so on. It also extracts the order in which words appear in each document and document meta-information such as the author and the creation date. After extraction, it assigns a unique word ID to each word that appears in the document group and counts the number of occurrences of each word both within each document and over the whole document group.
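
The word-ID assignment and counting described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `analyze_documents`, the ID scheme (consecutive integers from 0), and the assumption that tokenization has already been done by some external analyzer are all hypothetical.

```python
from collections import Counter


def analyze_documents(tokenized_docs):
    """Assign a unique ID to each word and count word occurrences.

    tokenized_docs: list of token lists, one per document. Tokenization
    (the natural language analysis step) is assumed to be done elsewhere.
    """
    word_ids = {}              # word -> unique word ID (hypothetical: 0, 1, 2, ...)
    per_doc_counts = []        # per document: Counter of word ID -> occurrences
    corpus_counts = Counter()  # word ID -> occurrences over the whole group
    for tokens in tokenized_docs:
        counts = Counter()
        for word in tokens:
            # A word seen for the first time receives the next free ID.
            wid = word_ids.setdefault(word, len(word_ids))
            counts[wid] += 1
        per_doc_counts.append(counts)
        corpus_counts.update(counts)
    return word_ids, per_doc_counts, corpus_counts
```

Parts of speech and meta-information are left out here to keep the sketch short.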

[0053] The document feature vector generation unit 103 generates document feature vectors based on the document analysis data generated by the document analysis unit 102, such as words, word IDs, word occurrence counts, and part-of-speech information.

[0054] The document feature vector is described with reference to FIG. 2. FIG. 2 is a table whose rows are documents (doc_i [i = 1 to 20]) and whose columns are words (word_j [j = 1 to 10]). The cell at the intersection of a document and a word (word_ji) shows the frequency with which that word appears in that document. For example, word1 appears once in document doc1 and five times in document doc3.

[0055] In this way, document-word frequency matrix data is generated in which the (i, j) element is the frequency of word word_j in document doc_i, and each column vector of this document-word frequency matrix is used as a document feature vector.
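
As a rough sketch of this construction, the following builds a small document-word frequency matrix in plain Python. The function name `build_frequency_matrix` and the list-of-lists layout (one list per document, following the row/column arrangement of FIG. 2) are assumptions made for illustration.

```python
def build_frequency_matrix(tokenized_docs, vocabulary):
    """Return a matrix whose (i, j) element is the count of word j in document i.

    tokenized_docs: list of token lists, one per document.
    vocabulary: ordered list of the words that define the columns.
    """
    column = {word: j for j, word in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in tokenized_docs]
    for i, tokens in enumerate(tokenized_docs):
        for word in tokens:
            if word in column:  # words outside the vocabulary are ignored
                matrix[i][column[word]] += 1
    return matrix
```

Each inner list then plays the role of the feature vector of one document.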

[0056] In the following description of the present invention, the case of classifying the document data shown in FIG. 2 into three document sets is considered.

[0057] Classification accuracy can also be improved by applying normalization to the document-word frequency matrix at the same time, so that the document feature vectors take the effect of differing document lengths into account.
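
The text does not say which normalization is meant. One common choice, shown here purely as an assumption, is to scale each document vector to unit Euclidean length; `normalize_rows` is a hypothetical helper name.

```python
import math


def normalize_rows(matrix):
    """Scale every row to unit Euclidean length (all-zero rows are left as-is)."""
    normalized = []
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row))
        normalized.append([x / norm for x in row] if norm > 0 else list(row))
    return normalized
```

With this scaling, a long document and a short document with the same word proportions receive the same feature vector.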

[0058] Furthermore, by applying a multidimensional scaling technique such as factor analysis, quantification method type III, or singular value decomposition to the document-word matrix, document feature vectors that take the polysemy and synonymy of words into account can also be used.

[0059] As an example, FIG. 3 shows the document feature vectors obtained by applying singular value decomposition to the document-word frequency matrix of FIG. 2 with the effective singular value dimension set to 10.

[0060] By applying singular value decomposition to the document-word frequency matrix, the words are also placed in the same feature space as the documents. These word feature vectors are shown in FIG. 4.
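
For orientation, a truncated singular value decomposition can be written as below. The notation is an assumption and does not come from the source: X is the document-word frequency matrix with documents as rows, k is the number of retained dimensions, d_i is the feature vector of document i, and w_j is the feature vector of word j. Scaling both by Σ_k is one common convention in latent semantic indexing; the source does not state which scaling was used for FIGS. 3 and 4.

```latex
X \approx U_k \Sigma_k V_k^{\top}, \qquad
\mathbf{d}_i = \left( U_k \Sigma_k \right)_{i,:}, \qquad
\mathbf{w}_j = \left( V_k \Sigma_k \right)_{j,:}
```

Because both kinds of vector have k components, documents and words can be compared or averaged in the same space.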

[0061] In FIGS. 3 and 4, dim_k [k = 1 to 6] denotes each feature dimension. In the examples of FIGS. 3 and 4, the singular value decomposition was computed using SVDPACKC [http://www.cs.utk.edu/berry/].

[0062] As preprocessing for classifying documents by a statistical method, the classification representative vector generation unit 104 generates classification representative vectors having the same number of dimensions as the document feature vectors.

[0063] Classification representative vectors can be generated, for example, by selecting several document feature vectors according to a fixed rule, such as selecting at equal intervals, selecting at random, or selecting from the head of the document feature vectors, and using the selected document feature vectors as the classification representative vectors.
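
The three selection rules named above might be sketched as follows. The function `select_initial`, its `rule` strings, and the way the equal-interval step is computed are illustrative assumptions; the source names the rules but not their details.

```python
import random


def select_initial(doc_vectors, k, rule="interval", seed=None):
    """Pick k document feature vectors as initial classification representatives."""
    n = len(doc_vectors)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of documents")
    if rule == "interval":
        step = n // k                       # assumed spacing between picks
        indices = [i * step for i in range(k)]
    elif rule == "random":
        indices = random.Random(seed).sample(range(n), k)
    elif rule == "head":
        indices = list(range(k))            # first k documents
    else:
        raise ValueError(f"unknown rule: {rule}")
    return [list(doc_vectors[i]) for i in indices]
```

Copies are returned so that later refinement does not modify the document feature vectors themselves.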

[0064] In addition, by generating classification representative vectors from document or word information specified by the user, classification representative vectors that clearly reflect the user's classification intention can be generated.

[0065] The information specified by the user may be of any kind as long as it relates to documents or to words contained in documents. Each piece of information must, however, have a specific correspondence with a document or with a word in a document.

[0066] Furthermore, the user can specify a classification representative vector by a logical expression of documents and words present in the documents. This lets the user describe the classification intention easily. The concrete method of specifying a classification representative vector by a logical expression is not limited.

[0067] Here, the case is described in which two classification representative vectors are generated from information specified by the user and one classification representative vector is generated automatically.

[0068] For simplicity, the user is assumed to specify each classification representative vector either as a logical product of words or as a document. When a word set is specified, the classification representative vector is the arithmetic mean of the feature vectors of the words in the set; when a document is specified, the corresponding document feature vector is used as the classification representative vector.
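
A minimal sketch of this rule follows. The `(kind, value)` specification format, the dictionary lookups, and the function names are hypothetical; only the two cases (mean of word vectors, or a document vector used directly) come from the text.

```python
def mean_vector(vectors):
    """Component-wise arithmetic mean of equally sized vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def representative_from_spec(spec, doc_vectors, word_vectors):
    """Build one classification representative vector from a user specification.

    spec is ("words", [word names]) or ("doc", document name).
    doc_vectors and word_vectors map names to feature vectors.
    """
    kind, value = spec
    if kind == "words":
        return mean_vector([word_vectors[name] for name in value])
    if kind == "doc":
        return list(doc_vectors[value])
    raise ValueError(f"unknown specification kind: {kind}")
```

This relies on word and document feature vectors having the same number of dimensions.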

[0069] FIG. 5 shows the initial classification representative vectors for the case in which the user specifies words word1 and word3 for classification representative vector 1 and document doc5 for classification representative vector 2, and the document feature vector of document doc7 is selected automatically as classification representative vector 3.

[0070] The refinement exclusion vector designation unit 105 designates, based on a specific criterion, whether refinement is to be performed on each classification representative vector generated by the classification representative vector generation unit 104.

[0071] The criterion for deciding whether to perform refinement can also be determined automatically, for example from information related to the documents and words or from the user's habitual tendencies. Any method may be used as long as a decision can be made for every classification target vector.

[0072] By using the criterion that only the classification representative vectors specified by the user in the classification representative vector generation unit 104 are refinement exclusion vectors, the classification representative vectors that express the user's classification intention can be prevented from shifting as a result of refinement.

[0073] The document data allocation unit 106 calculates the similarity between each document feature vector and the classification representative vectors, and assigns each item of document data to the classification representative vector with the highest similarity. The similarity can be calculated using, for example, the inner product, cosine, or Euclidean distance between vectors; the method is not limited here.
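
The assignment step can be sketched as below, using the cosine as the similarity measure (one of the options the text lists). The function names and the tie-breaking behavior (the lowest-numbered representative wins) are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign(doc_vectors, representatives):
    """For each document, return the index of its most similar representative."""
    return [
        max(range(len(representatives)),
            key=lambda c: cosine(vec, representatives[c]))
        for vec in doc_vectors
    ]
```

If Euclidean distance were used instead, the smallest distance rather than the largest value would have to be selected.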

[0074] For every classification representative vector that was not designated for exclusion from refinement by the refinement exclusion vector designation unit 105, the classification representative vector refinement unit 107 recalculates the classification representative vector from the feature vectors of the document data assigned to it by the document data allocation unit 106. The document data assignment and the recalculation of the classification representative vectors are repeated until a specific criterion is met (for example, until refinement has been performed a predetermined number of times, or until the classification error falls to or below a predetermined value).
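
The loop described above might look like the following sketch, which recalculates after all documents have been assigned and stops after a fixed number of iterations. The function `refine`, the use of the member mean as the recalculation, the cosine similarity, and the choice to leave a representative with no members unchanged are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest(vec, reps):
    """Index of the representative most similar to vec."""
    return max(range(len(reps)), key=lambda c: cosine(vec, reps[c]))


def refine(doc_vectors, representatives, excluded, max_iters=5):
    """Alternate assignment and recalculation, skipping excluded representatives.

    excluded: set of representative indices that must never be recalculated.
    """
    reps = [list(r) for r in representatives]
    for _ in range(max_iters):
        assignment = [nearest(v, reps) for v in doc_vectors]
        for c in range(len(reps)):
            if c in excluded:
                continue  # refinement exclusion vector: keep the given value
            members = [v for v, a in zip(doc_vectors, assignment) if a == c]
            if members:
                reps[c] = [sum(col) / len(members) for col in zip(*members)]
    # Final assignment against the refined representatives.
    return reps, [nearest(v, reps) for v in doc_vectors]
```

A classification-error threshold could replace the fixed iteration count as the stopping criterion.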

[0075] As the recalculation method, the k-means method used in non-hierarchical clustering (a method that recalculates each time one document to be classified is assigned) or the Forgy method (a method that recalculates after all documents to be classified have been assigned) can be used; the method is not limited here.
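
To illustrate the per-document variant, the sketch below updates a representative immediately after each assignment using a running mean. The running-mean formula, the convention of counting the initial representative as one member, and the names `incremental_pass` and `similarity` are assumptions; the source only states that recalculation happens each time a document is assigned.

```python
def incremental_pass(doc_vectors, representatives, similarity, excluded=frozenset()):
    """One pass that recalculates a representative right after each assignment.

    similarity: function (vector, vector) -> number, larger means more similar.
    excluded: indices of representatives that are never recalculated.
    """
    reps = [list(r) for r in representatives]
    counts = [1] * len(reps)  # assumption: the initial vector counts as one member
    assignment = []
    for vec in doc_vectors:
        c = max(range(len(reps)), key=lambda i: similarity(vec, reps[i]))
        assignment.append(c)
        if c not in excluded:
            counts[c] += 1
            # Running mean: move the representative toward the new member.
            reps[c] = [r + (x - r) / counts[c] for r, x in zip(reps[c], vec)]
    return reps, assignment
```

Unlike the batch variant, the result of this pass can depend on the order in which documents are processed.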

[0076] The classification result storage unit 108 stores the generated classification result in an appropriate format.

[0077] Next, the case in which some of the classification representative vectors are designated as refinement exclusion vectors is described.

[0078] First, of the three classification representative feature vectors shown in FIG. 5, the ones specified by the user, namely classification representative vector 1 and classification representative vector 2, are designated as refinement exclusion vectors. When the cosine is used as the similarity measure of the document data allocation unit 106, the k-means method is used as the method of the classification representative vector refinement unit, and five iterations is used as the criterion for stopping refinement, the classification representative vectors shown in FIG. 5 become as shown in FIG. 6.

[0079] As is clear from FIG. 6, only the values of classification representative vector 3 have changed. The classification result is shown in FIG. 7.

[0080] In FIG. 7, the value in parentheses after each member document is its similarity to the classification representative vector to which it belongs.

[0081] In this classification result, classification representative vector 2, which was excluded from refinement, forms a document set centered on the specified document doc5. Classification representative vector 3, which was refined, was generated from document doc7, yet doc7 belongs not to its document set but to the document set of classification representative vector 2.

[0082] Therefore, had classification representative vector 3 been specified by the user, the resulting document set would not be centered on document doc7, and the result can be expected not to have matched the user's intention.

(Second Embodiment)
FIG. 8 is a configuration diagram of a document classification device for explaining the second embodiment.

[0083] Components identical to those in FIG. 1 are given the same reference numerals.

[0084] The classification representative vector generation information storage unit 201 stores the information used when the classification representative vector generation unit 104 generated the classification representative vectors (for example, the rules used for generation and the information on documents and words specified by the user) together with the classification representative vectors themselves.

[0085] The classification representative vector generation information reading unit 202 reads the information stored by the classification representative vector generation information storage unit 201 when the classification representative vector generation unit 104 generates classification representative vectors.

[0086] The refinement exclusion vector information storage unit 203 stores information on the excluded classification representative vectors used by the refinement exclusion vector designation unit 105.

[0087] The refinement exclusion vector information reading unit 204 reads the information on the excluded classification representative vectors stored by the refinement exclusion vector information storage unit 203 when the refinement exclusion vector designation unit 105 designates the excluded classification representative vectors.

[0088] With this configuration, classification results can be obtained easily under a variety of settings, by generating classification representative vectors differently or by changing which classification representative vectors are excluded from refinement.

[0089] Next, the following case is described: documents are first classified with classification representative vector 1 and classification representative vector 2, of the three classification representative feature vectors shown in FIG. 5, designated as refinement exclusion vectors, and then all three classification representative feature vectors shown in FIG. 5 are refined.

[0090] First, the values of the classification representative vectors generated by the classification representative vector generation unit 104 are stored; any storage method may be used. The information that classification representative vector 1 and classification representative vector 2 were designated as refinement exclusion vectors by the refinement exclusion vector designation unit is also stored.

[0091] Executing document classification with these settings yields the classification result shown in FIG. 7.

[0092] Next, the same processing is performed again on the same document data. At this time, if the document feature vectors and word feature vectors have been stored in an appropriate format, they can be read in for the reprocessing so that the singular value decomposition does not have to be executed again.

[0093] The classification representative vector generation unit 104 reads the previously stored values of the classification representative vectors. The refinement exclusion vector designation unit 105 reads the information on the refinement exclusion vectors and, in this example, changes the setting so that refinement is performed on all classification representative vectors.
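
Storing and reloading these settings could be sketched as follows. The JSON file layout, the key names, and the function names `save_settings` and `load_settings` are hypothetical; the source leaves the storage format open.

```python
import json


def save_settings(path, representatives, excluded):
    """Store representative vectors and the indices excluded from refinement."""
    data = {"representatives": representatives, "excluded": sorted(excluded)}
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(data, handle)


def load_settings(path):
    """Read back the stored representative vectors and exclusion indices."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    return data["representatives"], set(data["excluded"])
```

To reproduce the second run described above, the stored vectors would be loaded and the exclusion set replaced with an empty set before refinement is run again.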

[0094] For document classification executed with these settings, the values of the classification representative vectors are shown in FIG. 9 and the classification result is shown in FIG. 10.

[0095] In this way, adding the classification representative vector generation information storage unit 201, the classification representative vector generation information reading unit 202, the refinement exclusion vector information storage unit 203, and the refinement exclusion vector information reading unit 204 makes it possible to obtain a variety of classification results easily.

(Third Embodiment)
FIG. 11 is a configuration diagram of a document classification device for explaining the third embodiment. Components identical to those in FIG. 1 and FIG. 2 are given the same reference numerals.

[0096] The classification representative vector designation generation unit 301 expands the specified logical expression of information primitives into a logical sum of primitive logical products, each of which consists only of a logical product of information primitives.

[0097] For example, with A, B, C, D, E, and F each being an information primitive, when the logical expression ((A * B) + (C * D) + E) * F is given, it is expanded into (A * B * F) + (C * D * F) + (E * F).

[0098] Here, * denotes a logical product and + denotes a logical sum.
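
The expansion above amounts to distributing logical products over logical sums. The following is a small sketch of that expansion; the nested-tuple representation of an expression and the function name `to_dnf` are assumptions, and no simplification is attempted (a repeated primitive such as A * A is not merged).

```python
from itertools import product


def to_dnf(expr):
    """Expand a logical expression into a list of logical products.

    expr is either a primitive name (str) or a tuple ("and" | "or", operand, ...).
    The result is a list of lists: the outer list is the logical sum, and each
    inner list holds the primitives of one logical product.
    """
    if isinstance(expr, str):
        return [[expr]]
    operator, *operands = expr
    expanded = [to_dnf(operand) for operand in operands]
    if operator == "or":
        # A sum of sums is just the concatenation of all products.
        return [term for terms in expanded for term in terms]
    if operator == "and":
        # Distribute: choose one product from each operand and join them.
        return [[p for term in combo for p in term] for combo in product(*expanded)]
    raise ValueError(f"unknown operator: {operator}")


# ((A * B) + (C * D) + E) * F
expression = ("and", ("or", ("and", "A", "B"), ("and", "C", "D"), "E"), "F")
print(to_dnf(expression))  # [['A', 'B', 'F'], ['C', 'D', 'F'], ['E', 'F']]
```

Parsing a textual expression such as the one in the example into this tuple form is a separate step that is not shown.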

[0099] A classification representative vector is then generated from each of these primitive logical products. The method of generating a classification representative vector from a primitive logical product is not limited here. The generated classification representative vector must, however, have the same number of dimensions as the document feature vectors.

[0100] If each information primitive in a primitive logical product can be represented by a vector having the same number of dimensions as the document feature vectors, the classification representative vector may be generated as the arithmetic mean of those vectors. That is, if the vectors representing the information primitives A, B, C, D, E, and F above are Va, Vb, Vc, Vd, Ve, and Vf, the generated classification representative vectors are (Va + Vb + Vf) / 3, (Vc + Vd + Vf) / 3, and (Ve + Vf) / 2.
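
The averaging rule can be sketched as follows. The function name `representatives_from_products` and the dictionary mapping primitive names to vectors are illustrative assumptions.

```python
def representatives_from_products(products, primitive_vectors):
    """Return one classification representative vector per logical product.

    products: list of lists of primitive names, one inner list per product.
    primitive_vectors: mapping from primitive name to its feature vector.
    """
    representatives = []
    for term in products:
        vectors = [primitive_vectors[name] for name in term]
        # Arithmetic mean, taken component by component.
        representatives.append([sum(col) / len(vectors) for col in zip(*vectors)])
    return representatives
```

For the products (A * B * F) and (E * F) this yields (Va + Vb + Vf) / 3 and (Ve + Vf) / 2.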

[0101] Documents and words contained in documents can be used as information primitives. That is, document feature vectors and word feature vectors such as those shown in FIGS. 3 and 4 can be used as the vectors Va, Vb, Vc, Vd, Ve, and Vf representing the information primitives described above.

[0102] Furthermore, the logical expression of information primitives can be specified by the user. The concrete method of specification is not limited here.

[0103] The classification representative vector information storage unit 302 stores, in an appropriate format, information on the specified logical expression used by the classification representative vector designation generation unit 301 and on the classification representative vectors generated from it.

[0104] The classification representative vector automatic generation unit 303 automatically generates some classification representative vectors.

[0105] As the automatic generation method, several document feature vectors are selected according to a fixed rule, such as selecting at equal intervals, selecting at random, or selecting from the head of the document feature vectors, and the selected document feature vectors are used as the classification representative vectors.

[0106] The generated classification representative vectors must, however, have the same number of dimensions as the document feature vectors.

[0107] The document classification unit 304 classifies documents using the generated classification representative vectors and the document feature vectors.

[0108] The concrete method is not limited here; non-hierarchical clustering methods such as the k-means method and the Forgy method can be used.

[0109] The classification result storage unit 305 stores the generated document classification result in an appropriate format. The classification representative vector information reading unit 306 reads the information, previously stored by the classification representative vector information storage unit 302, on the logical expressions used by the classification representative vector designation generation unit 301 and on the classification representative vectors generated from them. With reference to the information read, the classification representative vector designation generation unit 301 newly specifies classification representative feature vectors.

[0110] This makes it possible to create new logical expressions with reference to the various logical expressions and other information specified previously.

[0111] As an example, document classification is described for the case in which the document set shown in FIG. 2 is given, the document feature vectors and word feature vectors are given by FIGS. 3 and 4, and the user specifies two logical expressions composed of documents and words together with the automatic generation of one document set.

[0112] First, suppose the user specifies the logical expressions "word1 * word3" and "(doc2 + doc12) * word5". Then classification representative vector 1 is word1 * word3, classification representative vector 2 is doc2 * word5, and classification representative vector 3 is doc12 * word5. For classification representative vector 4, a document is selected automatically (at random), and doc17 is assumed to have been selected.

[0113] In this case, if each logical product of primitives is taken to be the arithmetic mean of the vectors representing its information primitives, classification representative vector 1 is (word feature vector of word1 + word feature vector of word3) / 2, classification representative vector 2 is (word feature vector of word5 + document feature vector of doc2) / 2, classification representative vector 3 is (word feature vector of word5 + document feature vector of doc12) / 2, and classification representative vector 4 is the document feature vector of doc17. This result is shown in FIG. 13. The result of document classification performed by the k-means method using the cosine measure, with the user-specified classification representative vectors designated as refinement exclusion vectors, is shown in FIG. 13.

(Document Classification Method)
Next, the document classification method is described.

[0114] As described above, FIGS. 1, 8, and 11 are configuration diagrams of the document classification device, but they also describe the content of the processing performed by the document classification device. A document classification method can therefore be shown by performing the processing of the units shown in FIGS. 1, 8, and 11 in time series so as to achieve the intended purpose. For example, in FIG. 1, a document classification method can be shown by performing the processing of the document input unit 101, then the processing of the document analysis unit 102, and thereafter, in order, the processing of the document feature vector generation unit 103, the classification representative vector generation unit 104, the refinement exclusion vector designation unit 105, the document data allocation unit 106, the classification representative vector refinement unit 107, and the classification result storage unit 108.

【０１１５】Accordingly, the document classification method can be understood once the flow of processing of each unit shown in FIGS. 1, 8 and 11 is understood.

【０１１６】The flow of processing of each unit shown in FIGS. 1, 8 and 11 is therefore described below.
(Processing flow of each component)
The processing flows of the document input unit 101, the document analysis unit 102, the document feature vector generation unit 103, the classification representative vector generation unit 104, the refinement exclusion vector designation unit 105, the document data allocation unit 106, the classification representative vector refinement unit 107, and the classification result storage unit 108 in FIG. 1 are described with reference to FIGS. 14 to 21.
・The processing of the document input unit 101 is described with reference to FIG. 14.

【０１１７】A target ID is assigned to a target document. This is performed for all target documents.
・The processing of the document analysis unit 102 is described with reference to FIG. 15.

【０１１８】Word information (word notation and part of speech) is extracted from the target document.

【０１１９】Next, if an extracted word is being extracted for the first time, its word information is registered as independent word information, and an ID is assigned to the registered independent word information. If the word has already been extracted, the corresponding independent word information is retrieved instead. Any word information that can be replaced by a reference to independent word information is so replaced.
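
As a minimal sketch of the registration step described in this paragraph, the following Python fragment assigns an ID to each word the first time it is seen and returns the existing ID on later occurrences. The class name `WordRegistry`, the method `lookup_or_register`, and the sample tokens are hypothetical and are not part of the disclosure; morphological analysis is assumed to have already produced (notation, part of speech) pairs.

```python
class WordRegistry:
    """Maps each distinct (notation, part_of_speech) pair to a stable integer ID."""

    def __init__(self):
        self._ids = {}

    def lookup_or_register(self, notation, part_of_speech):
        key = (notation, part_of_speech)
        if key not in self._ids:
            # First occurrence: register as independent word information.
            self._ids[key] = len(self._ids)
        # Later occurrences: return the existing independent word ID.
        return self._ids[key]

    def __len__(self):
        return len(self._ids)


def analyze(tokens, registry):
    """Replace each token with a reference (ID) to its independent word entry."""
    return [registry.lookup_or_register(n, p) for n, p in tokens]


registry = WordRegistry()
doc1_ids = analyze(
    [("vector", "noun"), ("classify", "verb"), ("vector", "noun")], registry
)
doc2_ids = analyze([("document", "noun"), ("vector", "noun")], registry)
```

Because the registry is shared across documents, the same word receives the same ID in every document, which is what later allows all document feature vectors to share one set of dimensions.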

【０１２０】This is performed for all extracted word information, and then for all target documents.
・The processing of the document feature vector generation unit 103 is described with reference to FIG. 16.

【０１２１】The occurrence frequency of every independent word in a document is counted, and the counts form the document feature vector. This is performed for all target documents.

【０１２２】Next, if normalization is to be applied, the document feature vector is normalized. This is performed for all document feature vectors.
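
The counting and optional normalization steps above can be sketched as follows. This is an illustration only: the function name `document_feature_vector` is hypothetical, and scaling to unit Euclidean length is an assumed choice, since the text does not state which normalization is used.

```python
import math
from collections import Counter


def document_feature_vector(word_ids, vocabulary_size, normalize=False):
    """Count occurrences of each independent word ID in one document."""
    counts = Counter(word_ids)
    vector = [float(counts.get(i, 0)) for i in range(vocabulary_size)]
    if normalize:
        # Assumption: normalization means scaling to unit Euclidean length.
        norm = math.sqrt(sum(x * x for x in vector))
        if norm > 0:
            vector = [x / norm for x in vector]
    return vector


raw_vector = document_feature_vector([0, 1, 0], vocabulary_size=3)
unit_vector = document_feature_vector([0, 1, 0], vocabulary_size=3, normalize=True)
```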

【０１２３】Further, if a linear transformation is to be applied, a linear transformation matrix is applied to the document feature vector. This is performed for all document feature vectors.
・The processing of the classification representative vector generation unit 104 is described with reference to FIG. 17.

【０１２４】Document representative vectors are generated one after another until all of them have been generated.

【０１２５】A process in which the user specifies document representative vectors may be placed before the above process. If the user has specified them as a logical expression of document data and words existing in the document data, the logical expression of document IDs and word IDs generated by the user is obtained first, and then the above process (document representative vector generation) is performed.
・The processing of the refinement exclusion vector designation unit 105 is described with reference to FIG. 18.

【０１２６】First, refinement exclusion information is generated.

【０１２７】Next, if information excluding a vector from refinement has been given, the flag indicating that refinement is to be performed is cleared. If no such information has been given, the flag is set.

【０１２８】This is performed for all target documents.

【０１２９】Document representative vectors specified by the user may be treated as refinement exclusion vectors. In that case, a process that attaches refinement exclusion information to each user-specified document representative vector may be placed before the above process.
・The processing of the document data allocation unit 106 is described with reference to FIG. 19.

【０１３０】The similarity between a document feature vector and every document representative vector is calculated, and the document feature vector is assigned to the document representative vector with the highest similarity.

【０１３１】This is performed for all document feature vectors.
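
The assignment step can be illustrated with a short Python sketch. It assumes the cosine measure as the similarity, as in the k-means example mentioned earlier in the description; the function names `cosine` and `assign` and the sample vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign(document_vectors, representative_vectors):
    """Return, for each document, the index of its most similar representative."""
    return [
        max(
            range(len(representative_vectors)),
            key=lambda k: cosine(d, representative_vectors[k]),
        )
        for d in document_vectors
    ]


documents = [[2.0, 0.0, 1.0], [0.0, 3.0, 0.0], [1.0, 0.0, 0.0]]
representatives = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
labels = assign(documents, representatives)
```

When two representatives are equally similar, this sketch picks the one with the lower index; the source does not specify a tie-breaking rule.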

【０１３２】The documents are thereby classified.
・The processing of the classification representative vector refinement unit 107 is described with reference to FIG. 20.

【０１３３】The following process is repeated until the refinement stop criterion is satisfied.

【０１３４】For a document representative vector whose refinement flag is set, the classification representative vector is recalculated from the document feature vectors that belong to it.

【０１３５】This is performed for all document representative vectors.
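
The refinement loop, including the flag that exempts a representative vector from recalculation, might look like the following sketch. Several points are assumptions rather than statements of the disclosure: recalculation is taken to be the arithmetic mean of the member vectors (as in k-means), the stop criterion is taken to be "assignments unchanged or an iteration limit reached", and all names (`refine`, `refine_flags`, `max_iterations`) are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def assign(docs, reps):
    return [max(range(len(reps)), key=lambda k: cosine(d, reps[k])) for d in docs]


def refine(docs, reps, refine_flags, max_iterations=20):
    """Recompute only the representatives whose refinement flag is set."""
    reps = [list(r) for r in reps]  # do not modify the caller's vectors
    labels = assign(docs, reps)
    for _ in range(max_iterations):
        for k, flag in enumerate(refine_flags):
            if not flag:
                continue  # refinement exclusion vector: keep as specified
            members = [d for d, lab in zip(docs, labels) if lab == k]
            if members:
                # Assumed recalculation: mean of the assigned document vectors.
                reps[k] = [sum(col) / len(members) for col in zip(*members)]
        new_labels = assign(docs, reps)
        if new_labels == labels:  # assumed stop criterion
            break
        labels = new_labels
    return reps, labels


docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]
initial_reps = [[1.0, 0.0], [0.5, 0.5]]
refine_flags = [False, True]  # vector 0 is excluded from refinement
final_reps, final_labels = refine(docs, initial_reps, refine_flags)
```

In this example the first representative stays exactly as given, while the second moves toward the documents assigned to it.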

【０１３６】The processing of FIG. 19 is then performed.
・The processing of the classification result storage unit 108 is described with reference to FIG. 21.

【０１３７】For a document representative vector, information on the document feature vectors that belong to it is stored. This is performed for all document representative vectors.

【０１３８】The classification representative vector generation information storage unit 201, the classification representative vector generation information reading unit 202, the refinement exclusion vector information storage unit 203, and the refinement exclusion vector information reading unit 204 in FIG. 8 store or read vectors. Because this storing and reading does not differ materially from ordinary storage and read processing, its description is omitted.

【０１３９】The processing flows of the document input unit 101, the document analysis unit 102, the document feature vector generation unit 103, the classification representative vector generation unit 104, the refinement exclusion vector designation unit 105, the document data allocation unit 106, the classification representative vector refinement unit 107, and the classification result storage unit 108 duplicate the description above and are likewise omitted.

【０１４０】The processing flows of the classification representative vector designation generation unit 301, the classification representative vector information storage unit 302, the classification representative vector automatic generation unit 303, the document classification unit 304, and the classification result storage unit 305 in FIG. 11 are described with reference to FIGS. 22 to 26. The processing flows of the document input unit 101, the document analysis unit 102, and the document feature vector generation unit 103 duplicate the description of FIG. 1 and are omitted.
・The processing of the classification representative vector designation generation unit 301 is described with reference to FIG. 22.

【０１４１】First, the logical expressions of the specified information primitives are obtained, and an ID is assigned to each logical expression.

【０１４２】Next, each obtained information primitive logical expression is expanded, using the associative, commutative, idempotent, and absorption laws, into a logical sum (OR) of elements each consisting only of logical products (AND).
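
One way to picture this expansion is the following sketch, which rewrites a nested AND/OR expression into a set of AND-only terms. The tuple encoding of expressions and the function name `to_dnf` are hypothetical. The sketch distributes AND over OR, which the paragraph does not list by name but which the expansion requires; representing each term as a `frozenset` gives the commutative and idempotent laws for free, and the final filter applies the absorption law.

```python
def to_dnf(expr):
    """Expand an expression into a set of AND-terms (each a frozenset of primitives).

    An expression is either a primitive name (str) or a tuple
    ("and", e1, e2, ...) / ("or", e1, e2, ...).
    """
    if isinstance(expr, str):
        return {frozenset([expr])}
    op, *args = expr
    parts = [to_dnf(a) for a in args]
    if op == "or":
        terms = set().union(*parts)
    elif op == "and":
        terms = {frozenset()}
        for part in parts:
            # Distribute: combine every term so far with every term of this part.
            terms = {t | u for t in terms for u in part}
    else:
        raise ValueError(f"unknown operator: {op!r}")
    # Absorption: drop any term that strictly contains another term.
    return {t for t in terms if not any(other < t for other in terms)}


expression = ("and", "word5", ("or", "doc2", "doc12"))
and_terms = to_dnf(expression)
```

For the sample expression, the result contains the two AND-only elements word5 AND doc2 and word5 AND doc12, matching the form of the elements used in the worked example of the description.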

【０１４３】Next, for each element consisting only of logical products, a document representative vector is generated from the information of the information primitives that make up the logical product.

【０１４４】If the generated document representative vector matches one generated earlier, the ID of the earlier document representative vector is obtained. Otherwise, an ID is assigned to the newly generated document representative vector.

【０１４５】The ID of the document representative vector is then stacked.
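
The ID handling in the preceding paragraphs can be sketched as below. The names `RepresentativeStore`, `id_for`, and `stacked_ids` are hypothetical, the sample vectors are invented, and exact element-wise equality is assumed as the test for "matches a previously generated vector"; the source does not say how the match is determined.

```python
class RepresentativeStore:
    """Assigns one ID per distinct representative vector."""

    def __init__(self):
        self._vectors = []

    def id_for(self, vector):
        key = tuple(vector)
        for existing_id, existing in enumerate(self._vectors):
            if existing == key:
                return existing_id  # reuse the ID of the earlier vector
        self._vectors.append(key)
        return len(self._vectors) - 1  # new ID for a new vector


# Vectors generated per logical expression ID (illustrative values only).
generated = {
    0: [[1.0, 0.0], [0.5, 0.5]],
    1: [[0.5, 0.5]],
}

store = RepresentativeStore()
stacked_ids = {
    expression_id: [store.id_for(v) for v in vectors]
    for expression_id, vectors in generated.items()
}
```

Keeping a list of vector IDs per logical expression is what later lets the classification results be grouped by the expression that produced them.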

【０１４６】This is performed for all obtained information primitive logical expressions.
・The processing of the classification representative vector information storage unit 302 is described with reference to FIG. 23.

【０１４７】First, for a logical expression of information primitives, the ID assigned to the logical expression is stored. Next, the IDs of the stacked document representative vectors are stored.

【０１４８】This is performed for the logical expressions of all information primitives.
・The processing of the classification representative vector automatic generation unit 303 is described with reference to FIG. 24.

【０１４９】First, information on the document feature vectors is obtained.

【０１５０】Until all document representative vectors have been generated, a document representative vector is generated and a document representative vector ID is assigned to it.
・The processing of the document classification unit 304 is described with reference to FIG. 25.

【０１５１】For a document feature vector, the similarity to every document representative vector is calculated, and the document feature vector is assigned to the document representative vector with the highest similarity.

【０１５２】This is performed for all document feature vectors.

【０１５３】Next, the following process is repeated until the refinement stop criterion is satisfied.

【０１５４】The classification representative vector is recalculated from the document feature vectors belonging to the document representative vector, and the processing of FIG. 19 is performed.

【０１５５】This is performed for all document feature vectors.
・The processing of the classification result storage unit 305 is described with reference to FIG. 26.

【０１５６】First, the information stored in the document representative vector information storage unit is read.

【０１５７】Next, for an information primitive logical expression, a document classification identifier is stored. Then, for every document representative vector, if its ID is one of the stacked document representative vector IDs, information on the document feature vectors belonging to that document representative vector is stored.

【０１５８】This is performed for all information primitive logical expressions.

【０１５９】Further, for a document representative vector that is not stacked under any information primitive logical expression, a document identifier is stored together with information on the document feature vectors belonging to that document representative vector.

【０１６０】FIGS. 27 and 28 show other examples of the processing flow of the classification representative vector designation generation unit 301.

【０１６１】[0161]

[Effects of the Invention] As described above, the present invention provides the following effects. In the invention of claim 1, refinement is bypassed for some or all of the initial classification representative feature vectors, so that, for those vectors, a classification result is obtained that clearly shows the viewpoint they express.

【０１６２】In the invention of claim 2, refinement is bypassed for the several designated initial classification representative feature vectors, so that, for the designated vectors, a classification result is obtained that clearly shows the viewpoint they express.

【０１６３】In the invention of claim 3, in addition to the features of the document classification device of claim 2, the user specifies the initial classification representative vectors for which refinement is bypassed, so that a classification result reflecting the user's classification intent is obtained.

【０１６４】In the invention of claim 4, in addition to the features of the document classification device of claim 3, refinement is not performed on the classification representative feature vectors specified by the user, so that a classification result that forcibly preserves the user's classification intent is obtained.

【０１６５】In the invention of claim 5, in addition to the features of the document classification device of claim 4, the initial classification representative feature vectors are generated from a logical expression of user-specified documents and words contained in the documents, which realizes a document classification device in which the user can easily describe the classification intent.

【０１６６】In the invention of claim 6, in addition to the features of the document classification device of claim 5, the designated initial classification representative feature vectors and the classification representative feature vectors excluded from refinement are stored, and a mechanism is provided for reading that information later. This makes it possible to obtain a new classification result by slightly modifying a previously used classification setting, and to obtain a classification result under the same settings even when the set of documents to be classified has changed.

【０１６７】In the invention of claim 7, the logical expression of the specified information primitives is expanded into a logical sum expression, and an initial classification representative vector is generated from each logical product of information primitives that makes up the logical sum, so that a document classification result suited to the meaning of the written logical expression can be generated.

【０１６８】In the invention of claim 8, in addition to the features of the document classification device of claim 7, the previously stored logical expressions of information primitives and the information on the generated document representative vectors are reused, which realizes a document classification device that can classify different document sets by the same classification criteria.

【０１６９】In the invention of claim 9, in addition to the features of the document classification device of claim 8, for each generated logical product of information primitives, the document representative vector is generated as the arithmetic mean of the feature vectors that appropriately represent the constituent information primitives. The logical product can thus be quantified appropriately, and a document classification result suited to the meaning of the logical expression can be generated.
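
The arithmetic-mean rule for a logical product can be shown in a few lines of Python. The function name `mean_vector` and the numeric values are purely illustrative; in the device, the inputs would be the word or document feature vectors of the primitives joined by AND.

```python
def mean_vector(vectors):
    """Element-wise arithmetic mean of equally sized feature vectors."""
    if not vectors:
        raise ValueError("at least one vector is required")
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Hypothetical feature vectors for two primitives joined by AND.
word1_vector = [1.0, 0.0, 2.0]
word3_vector = [3.0, 2.0, 0.0]
representative = mean_vector([word1_vector, word3_vector])
```

A logical product containing a single primitive reduces to that primitive's own feature vector, which is consistent with a representative vector defined by one document alone.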

【０１７０】In the invention of claim 10, in addition to the features of the document classification device of claim 9, documents and words contained in the documents are used as the information primitives, which realizes a document classification device in which logical expressions can be generated easily.

【０１７１】In the invention of claim 11, in addition to the features of the document classification device of claim 10, the user can specify the logical expression, so that a document classification result that makes the user's intent explicit can be generated.

【０１７２】The document classification methods of claims 12 to 22 provide document classification methods suited to the document classification devices of claims 1 to 11.

[Brief description of the drawings]

FIG. 1 is a configuration diagram of a document classification device for describing a first embodiment.

FIG. 2 is a diagram for explaining a document-word frequency matrix.

FIG. 3 is a diagram for explaining a document feature vector.

FIG. 4 is a diagram for explaining a word feature vector.

FIG. 5 is a diagram (part 1) for explaining an initial document representative feature vector.

FIG. 6 is a diagram (part 1) for explaining a document representative feature vector after refinement.

FIG. 7 is a diagram (part 1) for explaining a classification result.

FIG. 8 is a configuration diagram of a document classification device for describing a second embodiment.

FIG. 9 is a diagram (part 2) for explaining a document representative feature vector after refinement.

FIG. 10 is a diagram (part 2) for explaining a classification result.

FIG. 11 is a configuration diagram of a document classification device for describing a third embodiment.

FIG. 12 is a diagram (part 2) for explaining an initial document representative feature vector.

FIG. 13 is a diagram (part 3) for explaining a classification result.

FIG. 14 is a diagram for explaining the processing procedure of the document input unit 101.

FIG. 15 is a diagram for explaining the processing procedure of the document analysis unit 102.

FIG. 16 is a diagram for explaining the processing procedure of the document feature vector generation unit 103.

FIG. 17 is a diagram for explaining the processing procedure of the classification representative vector generation unit 104.

FIG. 18 is a diagram for explaining the processing procedure of the refinement exclusion vector designation unit 105.

FIG. 19 is a diagram for explaining the processing procedure of the document data allocation unit 106.

FIG. 20 is a diagram for explaining the processing procedure of the classification representative vector refinement unit 107.

FIG. 21 is a diagram for explaining the processing procedure of the classification result storage unit 108.

FIG. 22 is a diagram (part 1) for explaining the processing procedure of the classification representative vector designation generation unit 301.

FIG. 23 is a diagram for explaining the processing procedure of the classification representative vector information storage unit 302.

FIG. 24 is a diagram for explaining the processing procedure of the classification representative vector automatic generation unit 303.

FIG. 25 is a diagram for explaining the processing procedure of the document classification unit 304.

FIG. 26 is a diagram for explaining the processing procedure of the classification result storage unit 305.

FIG. 27 is a diagram (part 2) for explaining the processing procedure of the classification representative vector designation generation unit 301.

FIG. 28 is a diagram (part 3) for explaining the processing procedure of the classification representative vector designation generation unit 301.

[Explanation of symbols]

101 Document input unit
102 Document analysis unit
103 Document feature vector generation unit
104 Classification representative vector generation unit
105 Refinement exclusion vector designation unit
106 Document data allocation unit
107 Classification representative vector refinement unit
108, 305 Classification result storage unit
201 Classification representative vector generation information storage unit
202 Classification representative vector generation information reading unit
203 Refinement exclusion vector information storage unit
204 Refinement exclusion vector information reading unit
301 Classification representative vector designation generation unit
302 Classification representative vector information storage unit
303 Classification representative vector automatic generation unit
304 Document classification unit

Claims

[Claims]

1. A document classification device that sets a plurality of initial classification representative feature vectors and classifies documents based on the similarity between the classification representative feature vectors and document feature vectors, the device comprising a refinement processing unit that repeatedly assigns documents while changing the classification representative feature vectors, wherein the refinement processing unit does not perform refinement on some or all of the classification representative feature vectors.

2. A document classification device for classifying documents according to the contents of the documents, comprising: a document input unit that inputs document data; a document analysis unit that analyzes words of the document data; a document feature vector generation unit that calculates a document feature vector for the document based on the analysis result of the document analysis unit; a classification representative vector generation unit that generates classification representative vectors having the same number of dimensions as the document feature vector; a refinement exclusion vector designation unit that designates classification representative vectors on which refinement is not performed; a document data allocation unit that assigns document data to one of the classification representative vectors based on the similarity between the document feature vector and the classification representative vectors; a classification representative vector refinement unit that, for classification representative vectors other than those designated by the refinement exclusion vector designation unit, recalculates the classification representative vector based on the document feature vectors assigned by the document data allocation unit, and repeats the document data assignment and the recalculation of the classification representative vectors until a specific criterion is satisfied; and a classification result storage unit that stores a classification result.

3. The document classification apparatus according to claim 2, wherein the classification representative vector generation unit generates some classification representative vectors based on information specified by a user.

4. The document classification device according to claim 3, wherein the refinement exclusion vector designation unit designates only the classification representative vectors generated from the information specified by the user in the classification representative vector generation unit as classification representative vectors on which refinement is not performed.

5. The document classification device according to claim 4, wherein, in the classification representative vector generation unit, the information specified by the user is a logical expression of document data and words existing in the document data.

6. The document classification device according to claim 5, further comprising: a classification representative vector generation information storage unit that stores information for generating the classification representative vectors specified in the classification representative vector generation unit; a classification representative vector generation information reading unit that reads the information stored in the classification representative vector generation information storage unit; a refinement exclusion vector information storage unit that stores information on the classification representative vectors, specified in the classification representative vector generation unit, on which refinement is not performed; and a refinement exclusion vector information reading unit that reads the information stored in the refinement exclusion vector information storage unit.

7. A document classification device for classifying documents according to the contents of the documents, comprising: a document input unit that inputs document data; a document analysis unit that analyzes words of the document data; a document feature vector generation unit that calculates a document feature vector for each document from the analysis result of the document analysis unit; a classification representative vector designation generation unit that expands a logical expression of a plurality of information primitives, specified as some of the classification representative vectors, into a logical sum in which each element is a logical product of information primitives, and generates, from each of the generated logical products of information primitives, a classification representative vector having the same number of dimensions as the document feature vector; a classification representative vector information storage unit that stores information on the classification representative vectors; a classification representative vector automatic generation unit that automatically generates classification representative vectors using information generated by a generation unit; a document classification unit that classifies document data based on the similarity between the document feature vector and the classification representative vectors; and a classification result storage unit that generates a classification result based on the information stored in the classification representative vector information storage unit and stores the classification result.

8. The document classification device according to claim 7, further comprising: a classification representative vector information reading unit that reads information on the stored classification representative vector.

9. The document classification device according to claim 8, wherein the classification representative vector generation unit generates each classification representative vector as the arithmetic mean of the information primitives included in the corresponding generated logical product of information primitives.

10. The document classification device according to claim 9, wherein the information primitive specified by the classification representative vector generation unit is a document and a word included in the document.

11. The document classification device according to claim 10, wherein a logical expression of the information primitive specified by the classification representative vector generation unit is specified by a user.

12. A document classification method for setting a plurality of initial classification representative feature vectors and classifying a document based on the similarity between the classification representative feature vector and a document feature vector. And a refining process step of repeatedly assigning documents by changing the classification representative feature vector. The refining process step performs a refining process on a part or all of the classification representative feature vector. A document classification method characterized by the absence of a document.

13. A document classification method for classifying documents according to the contents of the documents, comprising: a document input step of inputting document data; a document analysis step of analyzing words of the document data; a document feature vector generation step of calculating a document feature vector for the document based on the analysis result of the document analysis step; a classification representative vector generation step of generating classification representative vectors having the same number of dimensions as the document feature vector; a refining exclusion vector designating step of designating classification representative vectors that are not subjected to refining processing; a document data assigning step of assigning document data to one of the classification representative vectors based on a similarity between the document feature vector and the classification representative vector; a classification representative vector refining step of recalculating, for the classification representative vectors other than those designated in the refining exclusion vector designating step, the classification representative vector based on the document feature vectors assigned in the document data assigning step, and of repeating the document data assignment and the recalculation of the classification representative vectors until a specific criterion is satisfied; and a classification result storage step of storing a classification result.
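
The assignment and refining steps of claim 13 resemble a clustering loop in which some representative vectors are held fixed. The sketch below illustrates that idea under stated assumptions: cosine similarity as the similarity measure, the mean of the assigned document vectors as the recalculated representative, and "assignments no longer change, or an iteration cap is reached" as the stopping criterion. The claim fixes none of these choices, and the names `cosine`, `assign`, and `refine` are hypothetical.

```python
import math
from typing import List, Set, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign(docs: List[Vector], reps: List[Vector]) -> List[int]:
    """Assign each document to the index of its most similar representative."""
    return [max(range(len(reps)), key=lambda k: cosine(d, reps[k])) for d in docs]


def refine(
    docs: List[Vector],
    reps: List[Vector],
    excluded: Set[int],
    max_iter: int = 100,
) -> Tuple[List[Vector], List[int]]:
    """Alternate assignment and recalculation, skipping excluded representatives."""
    reps = [list(r) for r in reps]  # copy so the caller's vectors are untouched
    labels = assign(docs, reps)
    for _ in range(max_iter):
        for k in range(len(reps)):
            if k in excluded:
                continue  # refining-exclusion vector: never recalculated
            members = [d for d, label in zip(docs, labels) if label == k]
            if members:
                reps[k] = [sum(col) / len(members) for col in zip(*members)]
        new_labels = assign(docs, reps)
        if new_labels == labels:  # assumed stopping criterion
            break
        labels = new_labels
    return reps, labels


docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
initial = [[1.0, 0.0], [0.0, 1.0]]
final_reps, labels = refine(docs, initial, excluded={0})
print(labels)         # [0, 0, 1, 1]
print(final_reps[0])  # [1.0, 0.0], unchanged because index 0 is excluded
```

Because representative 0 is excluded, it stays exactly as specified even though the documents assigned to it would otherwise pull it toward their mean. Only representative 1 moves.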

14. The document classification method according to claim 13, wherein in the classification representative vector generation step, some classification representative vectors are generated based on information specified by a user.

15. The document classification method according to claim 14, wherein in the refining exclusion vector designating step, only the classification representative vectors generated from the information specified by the user in the classification representative vector generation step are designated as classification representative vectors that are not subjected to the refining process.

16. The document classification method according to claim 15, wherein in the classification representative vector generation step, the information specified by the user is document data and a logical expression of a word existing in the document data.

17. The document classification method according to claim 16, further comprising: a classification representative vector generation information storage step of storing the information for generating classification representative vectors specified in the classification representative vector generation step; a classification representative vector generation information reading step of reading the information stored in the classification representative vector generation information storage step; a refining exclusion vector information storage step of storing information on the classification representative vectors not subjected to the refining process specified in the classification representative vector generation step; and a refining exclusion vector information reading step of reading the information stored in the refining exclusion vector information storage step.

18. A document classification method for classifying documents according to the contents of the documents, comprising: a document input step of inputting document data; a document analysis step of analyzing words of the document data; a document feature vector generation step of calculating a document feature vector for each document from the analysis result of the document analysis step; a classification representative vector designation generation step of converting a logical expression of a plurality of information primitives, specified for some of the classification representative vectors, into a logical sum whose elements are each a logical product of information primitives, and of generating, from each of the resulting logical products of information primitives, a classification representative vector having the same number of dimensions as the document feature vector; a classification representative vector information storage step of storing information about the classification representative vectors; a classification representative vector automatic generation step of automatically generating a classification representative vector using information generated in the document feature vector generation step; a document classification step of classifying the document data based on a similarity between the document feature vector and the classification representative vector; and a classification result storage step of generating and storing a classification result based on the information stored in the classification representative vector information storage step.
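
Rewriting a logical expression as a logical sum of logical products, as claim 18 recites, amounts to a disjunctive-normal-form expansion. The following is a minimal sketch of such an expansion. It assumes expressions are nested tuples of the form `("and", ...)` or `("or", ...)` over primitive names and contain no negation. The representation and the function name `to_dnf` are hypothetical and do not appear in the claims.

```python
from itertools import product
from typing import List, Tuple, Union

Expr = Union[str, tuple]


def to_dnf(expr: Expr) -> List[Tuple[str, ...]]:
    """Expand an AND/OR expression into a list of conjunctions.

    A bare string is a single information primitive. The returned list is
    the logical sum; each tuple in it is one logical product of primitives.
    """
    if isinstance(expr, str):
        return [(expr,)]
    op, *args = expr
    parts = [to_dnf(arg) for arg in args]
    if op == "or":
        # The union of the sub-expressions' conjunctions.
        return [conj for part in parts for conj in part]
    if op == "and":
        # Distribute AND over OR: pick one conjunction from every operand.
        result = []
        for combo in product(*parts):
            merged: List[str] = []
            for conj in combo:
                for primitive in conj:
                    if primitive not in merged:  # drop repeated primitives
                        merged.append(primitive)
            result.append(tuple(merged))
        return result
    raise ValueError(f"unsupported operator: {op!r}")


# doc_A AND (word_x OR word_y)  ->  (doc_A AND word_x) OR (doc_A AND word_y)
print(to_dnf(("and", "doc_A", ("or", "word_x", "word_y"))))
# [('doc_A', 'word_x'), ('doc_A', 'word_y')]
```

Each tuple in the result corresponds to one logical product, and therefore to one classification representative vector to be generated.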

19. The document classification method according to claim 18, further comprising: a classification representative vector information reading step of reading information on the stored classification representative vector.

20. The document classification method according to claim 19, wherein in the classification representative vector generation step, each classification representative vector is generated as an arithmetic average of the information primitives included in the corresponding logical product of the generated information primitives.

21. The document classification method according to claim 20, wherein the information primitive specified in the classification representative vector generation step is a document and a word included in the document.

22. The document classification method according to claim 21, wherein a logical expression of an information primitive specified in the classification representative vector generation step is specified by a user.