JP2006277413A

JP2006277413A - Document classification device and document classification method

Info

Publication number: JP2006277413A
Application number: JP2005096374A
Authority: JP
Inventors: Tsutomu Kobayashi; 勉小林; Yoshihisa Otake; 能久大嶽; Toshihiko Kobayashi; 俊彦小林; Takeshi Matsukuma; 剛松隈; Hiroshi Yamazaki; 弘山崎
Original assignee: Toshiba Corp; Toshiba Solutions Corp
Current assignee: Toshiba Corp; Toshiba Digital Solutions Corp
Priority date: 2005-03-29
Filing date: 2005-03-29
Publication date: 2006-10-12

Abstract

PROBLEM TO BE SOLVED: To provide a document classification device that can change the parameters used for classification for each document to be classified.

SOLUTION: The document classification device includes a comparison target document information storage unit 11, a word weight information storage unit 12, and a classification processing unit 10. The comparison target document information storage unit 11 stores, as comparison target document information, information on comparison target documents to be compared with a document to be classified, together with the field of each comparison target document. The word weight information storage unit 12 stores words and their word weights. The classification processing unit 10 compares the document to be classified with the comparison target document information and extracts the commonly used words that appear in both the document to be classified and a comparison target document. It generates, as common word information, each commonly used word, its use count, its word weight read from the word weight information storage unit, and an adjustment value that is set for each word weight and adjusts that weight. Based on the generated common word information, it calculates the similarity between the document to be classified and each of a plurality of comparison target documents and specifies a field based on the calculated similarity. By varying the adjustment value, it specifies a new field based on the comparison target documents.

COPYRIGHT: (C)2007, JPO&INPIT

Description

本発明は、文書の分野を分類する文書分類装置および文書分類方法に関する。 The present invention relates to a document classification apparatus and a document classification method for classifying the field to which a document belongs.

従来、予めデータベースに記憶される複数の文書情報に基づいて、入力文書の属する分野を特定する文書分類システムがある。このような文書分類システムでは、まず、予め分野が特定されてデータベースに記憶されている複数の比較対象文書から分類を特定したい入力文書と類似する比較対象文書を抽出する。その後、その抽出された比較対象文書に予め関連付けられている分野に基づいて、入力文書が属する分野を特定する方式がある（例えば、特許文献１）。 Conventionally, there is a document classification system that identifies a field to which an input document belongs based on a plurality of document information stored in advance in a database. In such a document classification system, first, a comparison target document similar to an input document whose classification is to be specified is extracted from a plurality of comparison target documents whose fields are specified in advance and stored in the database. Thereafter, there is a method for specifying a field to which an input document belongs based on a field previously associated with the extracted comparison target document (for example, Patent Document 1).

さらに、分類処理の効率化を図るため、文書分類システムでは、一括して大量の入力文書の分類処理を行なう方式が一般的である。また、このような文書分類システムでは、入力文書について大量の比較対象文書との類似度算出を行なうことが多い。さらに、高い精度が求められる文書分類システムにおいては、コンピュータにより分類した結果を、人手によってチェックすることになる。 Further, to make classification more efficient, a document classification system generally classifies a large number of input documents in a single batch. In such a system, the similarity between each input document and a large number of comparison target documents is often calculated. Furthermore, in a document classification system that requires high accuracy, the results of classification by the computer are checked manually.
特開２００１−１５５０２５号公報 JP 2001-155025 A

上述したような、大量の入力文書を同一の基準により一括して分類する従来の文書分類システムでは、一度に大量の文書を効率良く処理できる反面、一括して分類処理を行った時点で分類の結果が確定される。そのため、対話性が犠牲になる問題が生じることがある。 As described above, a conventional document classification system that classifies a large number of input documents in a batch under the same criteria can process many documents efficiently at one time. However, the classification results are finalized at the moment the batch classification is performed, so interactivity may be sacrificed.

再び分類処理でのパラメータ調整して再度分類を実行することも考えられるが、これによれば、パラメータを調整することで、その後に分類される全ての分類結果に対して変更されたパラメータが適用される。これによれば、全ての分類結果が同一のパラメータで分類されてしまうという問題があった。 It is conceivable to adjust the classification parameters and run the classification again. In that case, however, the changed parameters are applied to every classification result produced afterward, so all classification results end up being classified with the same parameters.

本発明は上記の問題を解決するためになされたものであり、分類される文書ごとに分類するために利用されるパラメータを調整することが可能な文書分類装置および文書分類方法を提供することを目的とする。 The present invention has been made to solve the above problem, and its object is to provide a document classification apparatus and a document classification method capable of adjusting the parameters used for classification for each document to be classified.

本発明の第１の特徴に係る文書分類装置によれば、入力された被分類文書の属する分野を分類する文書分類装置であって、被分類文書と比較する比較対象文書の情報と、この比較対象文書の分野とを比較対象文書情報として記憶する比較対象文書情報記憶部と、単語と、この単語の単語重みを記憶する単語重み情報記憶部と、被分類文書を比較対象文書情報と比較して被分類文書と比較対象文書とで共通に使用されている共通使用単語を抽出し、共通使用単語と、共通使用単語の使用回数と、単語重み情報記憶部から読み出した共通使用単語の単語重みと、この単語重みごとに設定されて単語重みを調整する調整値とを共通単語情報として生成し、この生成された共通単語情報に基づいて複数の比較対象文書と被分類文書との類似度を求め、この求められた類似度に基づいて分野を特定し、入力装置からの指示に基づいて調整値を可変させて比較対象文書に基づいて新たな分野を特定する分類処理部とを有することを特徴としている。 A document classification apparatus according to a first aspect of the present invention classifies the field to which an input document to be classified belongs, and includes: a comparison target document information storage unit that stores, as comparison target document information, information on comparison target documents to be compared with the classified document and the fields of those comparison target documents; a word weight information storage unit that stores words and the word weight of each word; and a classification processing unit. The classification processing unit compares the classified document with the comparison target document information to extract commonly used words that are used in both the classified document and a comparison target document; generates, as common word information, the commonly used words, the use count of each commonly used word, the word weight of each commonly used word read from the word weight information storage unit, and an adjustment value that is set for each word weight and adjusts that word weight; obtains the similarity between the classified document and each of a plurality of comparison target documents based on the generated common word information; specifies a field based on the obtained similarity; and, by varying the adjustment value in response to an instruction from an input device, specifies a new field based on the comparison target documents.

また、第２の特徴に係る文書分類装置によれば、入力された被分類文書の属する分野を分類する文書分類装置であって、被分類文書と比較する比較対象文書の情報と、この比較対象文書の分野とを比較対象文書情報として記憶する比較対象文書情報記憶部と、単語と、この単語の単語重みを記憶する単語重み情報記憶部と、被分類文書を比較対象文書情報と比較して被分類文書と比較対象文書とで共通に使用されている共通使用単語を抽出し、共通使用単語と、共通使用単語の使用回数と、単語重み情報記憶部から読み出した共通使用単語の単語重みと、この単語重みごとに設定されて単語重みを調整する調整値とを共通単語情報として生成する共通単語情報生成部と、共通単語情報に含まれる各共通使用単語の使用回数とその共通使用単語の単語重みと単語重みに関連付けられる調整値とに基づいて類似度を算出する類似度算出部と、この類似度算出部で求められた類似度に基づいて被分類文書の属する分野を特定して分類結果とする分野特定部と、調整値を可変させる類似度パラメータ調整部と、分類結果が確定されると、被分類文書の各使用単語の調整値を含む分類結果を生成して記憶装置に記憶させる分類結果記憶部とを有することを特徴としている。 A document classification apparatus according to a second aspect classifies the field to which an input document to be classified belongs, and includes: a comparison target document information storage unit that stores, as comparison target document information, information on comparison target documents to be compared with the classified document and the fields of those comparison target documents; a word weight information storage unit that stores words and the word weight of each word; a common word information generation unit that compares the classified document with the comparison target document information to extract commonly used words that are used in both the classified document and a comparison target document, and generates, as common word information, the commonly used words, the use count of each commonly used word, the word weight of each commonly used word read from the word weight information storage unit, and an adjustment value that is set for each word weight and adjusts that word weight; a similarity calculation unit that calculates a similarity based on the use count of each commonly used word included in the common word information, the word weight of that word, and the adjustment value associated with the word weight; a field specifying unit that specifies the field to which the classified document belongs based on the similarity obtained by the similarity calculation unit and takes it as the classification result; a similarity parameter adjustment unit that varies the adjustment value; and a classification result storage unit that, when the classification result is confirmed, generates a classification result including the adjustment value of each used word of the classified document and stores it in a storage device.

上記構成の本発明によれば、文書を分類するためのパラメータを調整する文書分類装置及び文書分類方法を提供することができる。 According to the present invention configured as described above, it is possible to provide a document classification apparatus and a document classification method for adjusting parameters for classifying a document.

本発明によれば、文書分類装置及び文書分類方法において、文書を分類するためのパラメータを調整することができる。 According to the present invention, parameters for classifying a document can be adjusted in a document classification apparatus and a document classification method.

以下に、図面を参照して、本発明の最良の実施の形態に係る文書分類装置１及び文書分類方法を説明する。 Hereinafter, a document classification apparatus 1 and a document classification method according to the preferred embodiment of the present invention will be described with reference to the drawings.

［文書分類装置］ [Document Classification Device]
図１に示すのは、本発明の最良の実施の形態に係る文書分類装置１のブロック図である。 FIG. 1 is a block diagram of a document classification apparatus 1 according to the preferred embodiment of the present invention.

図１に示す文書分類装置１は、分類処理部１０、比較対象文書情報記憶部１１、単語重み情報記憶部１２及び分類結果記憶部１３を有する。 A document classification apparatus 1 shown in FIG. 1 includes a classification processing unit 10, a comparison target document information storage unit 11, a word weight information storage unit 12, and a classification result storage unit 13.

比較対象文書情報記憶部１１は、文書の属する分野を分類する対象となる被分類文書と比較する比較対象文書の情報（テキスト情報）と、この比較対象文書の分野が関連付けられた比較対象文書情報を記憶している。 The comparison target document information storage unit 11 stores comparison target document information, in which information (text information) on each comparison target document to be compared with a classified document, that is, a document whose field is to be classified, is associated with the field of that comparison target document.

単語重み情報記憶部１２は、単語と、単語が含まれる文書の分野の特徴を示す指標となる単語重みを記憶している。 The word weight information storage unit 12 stores a word and a word weight serving as an index indicating characteristics of the field of the document including the word.

分類結果記憶部１３は、分類処理部１０で分類された被分類文書に関する情報と、この被分類文書で使用される使用単語の調整値を含む分類結果を記憶する。 The classification result storage unit 13 stores the classification result including the information related to the classified document classified by the classification processing unit 10 and the adjustment value of the used word used in the classified document.

分類処理部１０は、分類キー文書を比較対象文書情報と比較して、分類キー文書および比較対象文書で共通に使用されている単語である共通使用単語を抽出し、少なくともこれらの共通使用単語と、共通使用単語の使用回数と、単語重み情報記憶部から読み出した共通使用単語の単語重みと、この単語重みごとに設定され、この単語重みを調整する調整値と、を関連付て共通単語情報として生成する。また、分類処理部１０は、共通単語情報から複数の比較対象文書との類似度を求め、求められた類似度の高い比較対象文書の分野に基づいて分野を特定した分類結果を求め、分類結果記憶部１３に記憶させる。さらに、分類処理部１０は、入力装置からの指示に基づいて、調整値を可変させて前記比較対象文書に基づいて、新たな分野を特定する。 The classification processing unit 10 compares the classification key document with the comparison target document information and extracts commonly used words, that is, words used in both the classification key document and a comparison target document. It then generates common word information that associates at least these commonly used words, the use count of each commonly used word, the word weight of each commonly used word read from the word weight information storage unit, and an adjustment value that is set for each word weight and adjusts that word weight. The classification processing unit 10 also obtains the similarity to each of a plurality of comparison target documents from the common word information, obtains a classification result in which a field is specified based on the fields of the comparison target documents with high similarity, and stores the result in the classification result storage unit 13. Furthermore, based on an instruction from the input device, the classification processing unit 10 varies the adjustment value and specifies a new field based on the comparison target documents.

図２に示すのは、比較対象文書情報記憶部１１で記憶する比較対象文書情報の一例である。比較対象文書とは、分野を分類する対象となる被分類文書と比較する文書である。また、比較対象文書情報は、この比較対象文書に基づいて生成される。 FIG. 2 shows an example of comparison target document information stored in the comparison target document information storage unit 11. The comparison target document is a document to be compared with a classified document to be classified into fields. The comparison target document information is generated based on the comparison target document.

具体的に図２に示す比較対象文書情報１１ａは、複数の比較対象文書の「タイトル」、「分野」、「使用単語」、「使用回数」および「調整値」の情報を含んでいる。「タイトル」は比較対象文書のタイトルであり、「分野」は比較対象文書に定められた分野である。また、「使用単語」は比較対象文書で使用されている単語であり、「使用回数」は、各使用単語が比較対象文書中で使用されている回数である。 Specifically, the comparison target document information 11a illustrated in FIG. 2 includes information on “title”, “field”, “use word”, “use count”, and “adjustment value” of a plurality of comparison target documents. “Title” is a title of the comparison target document, and “Field” is a field defined in the comparison target document. The “used word” is a word used in the comparison target document, and the “use count” is the number of times each used word is used in the comparison target document.

「調整値」は各被分類文書の使用単語毎に、単語重みを調整する値である。本発明の最良の実施の形態に係る文書分類装置１では、この「調整値」を変化させることにより、類似度を求めるためのパラメータが調整される。 The “adjustment value” is a value for adjusting the word weight for each used word of each classified document. In the document classification device 1 according to the preferred embodiment of the present invention, the parameter for obtaining the similarity is adjusted by changing the “adjustment value”.

この図２に示す比較対象文書情報１１ａによれば、比較対象文書１は、タイトルが「データベース更新処理時間の短縮」であり、分野は「データベース更新」である。また、比較対象文書１の中で使用されている単語とその使用回数として、それぞれ「大規模」が２回、「データベース」が５回、「更新処理」が８回、「時間」が３回、「短縮」が２回であることを表している。また、各使用単語の「調整値」は、初期値である「１．０」が設定されている。 According to the comparison target document information 11a shown in FIG. 2, comparison target document 1 has the title "Reduction in database update processing time" and the field "Database update". The words used in comparison target document 1 and their use counts are "large scale" twice, "database" five times, "update processing" eight times, "time" three times, and "reduction" twice. The "adjustment value" of each used word is set to the initial value "1.0".
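The record described for FIG. 2 can be pictured as a small data structure. The following is only an illustrative sketch: the class names `UsedWord` and `ComparisonDocument` and their attribute names are hypothetical and do not appear in the patent; the values are those given for comparison target document 1.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UsedWord:
    """One used word of a document (hypothetical structure)."""
    word: str
    count: int                # number of times the word is used in the document
    adjustment: float = 1.0   # adjustment value, initial value 1.0


@dataclass
class ComparisonDocument:
    """Comparison target document information (hypothetical structure)."""
    title: str
    field_name: str           # the field assigned to the document
    words: List[UsedWord] = field(default_factory=list)


# Comparison target document 1 as described for FIG. 2.
doc1 = ComparisonDocument(
    title="Reduction in database update processing time",
    field_name="Database update",
    words=[
        UsedWord("large scale", 2),
        UsedWord("database", 5),
        UsedWord("update processing", 8),
        UsedWord("time", 3),
        UsedWord("reduction", 2),
    ],
)
```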

図３は、単語重み情報記憶部１２で記憶する単語重み情報１２ａの一例である。図３に示す単語重み情報１２ａでは、例えば「自動分類」の単語重みは「８．５」であり、「データベース」の単語重みは「４．３」であることを表している。 FIG. 3 is an example of word weight information 12 a stored in the word weight information storage unit 12. In the word weight information 12a shown in FIG. 3, for example, the word weight of “automatic classification” is “8.5”, and the word weight of “database” is “4.3”.

この「単語重み」には、例えば比較対象文書情報記憶部１１中の全ての比較対象文書におけるその使用単語の使用回数の逆数を利用する。これは、使用される回数の多い単語は一般的な単語であり、文書の特徴を表さない単語であると考え、逆に、使用される回数の少ない単語は特徴的な単語であると考える。本発明の最良の実施の形態では、分類に使用する単語重み情報として、図３に示すような単語重み情報１２ａが予め作成され、単語重み情報記憶部１２に記憶されているものとする。 For this "word weight", for example, the reciprocal of the number of times the word is used across all comparison target documents in the comparison target document information storage unit 11 is used. The rationale is that a frequently used word is a general word that does not characterize a document, whereas an infrequently used word is a characteristic word. In the best mode of the present invention, it is assumed that word weight information 12a as shown in FIG. 3 has been created in advance and stored in the word weight information storage unit 12 as the word weight information used for classification.
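As a rough illustration of the reciprocal weighting described above, the sketch below totals each word's use count over all comparison target documents and takes the reciprocal. The function name `word_weights` and the `scale` parameter are assumptions made for this example; the patent does not say how values such as 8.5 or 4.3 in FIG. 3 are scaled, so the numbers produced here are not meant to match that figure.

```python
from collections import Counter


def word_weights(documents, scale=1.0):
    """Return {word: scale / total use count over all documents}.

    `documents` is a list of {word: use count} mappings, one per
    comparison target document. `scale` is a hypothetical constant.
    """
    totals = Counter()
    for counts in documents:
        totals.update(counts)  # adds the counts word by word
    return {word: scale / total for word, total in totals.items()}


# Toy data: "database" is common, "automatic classification" is rare.
docs = [
    {"database": 5, "update processing": 8},
    {"database": 3, "automatic classification": 1},
]
weights = word_weights(docs)
```

With this toy data the rarely used word receives the larger weight, which is the behavior the paragraph describes.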

本発明の最良の実施の形態に係る文書分類装置１は、図４に示すように、中央処理制御装置１０１、ＲＯＭ（Read Only Memory）１０２、ＲＡＭ（Random Access Memory）１０３および入出力インタフェース１０９が、バス１１０を介して接続されている。入出力インタフェース１０９には、入力装置１０４、表示装置１０５、通信制御装置１０６、記憶装置１０７およびリムーバブルディスク１０８が接続されている。 As shown in FIG. 4, the document classification apparatus 1 according to the preferred embodiment of the present invention includes a central processing control device 101, a ROM (Read Only Memory) 102, a RAM (Random Access Memory) 103, and an input / output interface 109. Are connected via a bus 110. An input device 104, a display device 105, a communication control device 106, a storage device 107, and a removable disk 108 are connected to the input / output interface 109.

中央処理制御装置１０１は、入力装置１０４からの入力信号に基づいてＲＯＭ１０２から文書分類装置１を起動するためのブートプログラムを読み出して実行し、更に記憶装置１０７に記憶されたオペレーティングシステムを読み出す。更に中央処理制御装置１０１は、入力装置１０４や通信制御装置１０６などの入力信号に基づいて、各種装置の制御を行ったり、ＲＡＭ１０３や記憶装置１０７などに記憶されたプログラムおよびデータを読み出してＲＡＭ１０３にロードするとともに、ＲＡＭ１０３から読み出されたプログラムのコマンドに基づいて、データの計算または加工など、後述する一連の処理を実現する処理装置である。 The central processing control device 101 reads a boot program for starting the document classification device 1 from the ROM 102 based on an input signal from the input device 104, executes it, and then reads the operating system stored in the storage device 107. The central processing control device 101 is also a processing device that controls various devices based on input signals from the input device 104, the communication control device 106, and the like; reads programs and data stored in the RAM 103, the storage device 107, and the like and loads them into the RAM 103; and, based on commands of the program read from the RAM 103, carries out the series of processes described later, such as calculating or processing data.

入力装置１０４は、操作者が各種の操作を入力するキーボード、マウスなどの入力デバイスにより構成されており、操作者の操作に基づいて入力信号を作成し、入出力インタフェース１０９およびバス１１０を介して中央処理制御装置１０１に送信される。表示装置１０５は、ＣＲＴ（Cathode Ray Tube）ディスプレイや液晶ディスプレイなどであり、中央処理制御装置１０１からバス１１０および入出力インタフェース１０９を介して表示装置１０５において表示させる出力信号を受信し、例えば、中央処理制御装置１０１の処理結果などを表示する装置である。通信制御装置１０６は、ＬＡＮカードやモデムなどの装置であり、文書分類装置１をインターネットやＬＡＮなどの通信ネットワークに接続する装置である。通信制御装置１０６を介して通信ネットワークと送受信したデータは入力信号または出力信号として、入出力インタフェース１０９およびバス１１０を介して中央処理制御装置１０１に送受信される。 The input device 104 consists of input devices such as a keyboard and a mouse with which an operator enters various operations. It creates an input signal based on the operator's operation and transmits it to the central processing control device 101 via the input/output interface 109 and the bus 110. The display device 105 is a CRT (Cathode Ray Tube) display, a liquid crystal display, or the like. It receives from the central processing control device 101, via the bus 110 and the input/output interface 109, an output signal to be displayed, and displays, for example, the processing results of the central processing control device 101. The communication control device 106 is a device such as a LAN card or a modem that connects the document classification device 1 to a communication network such as the Internet or a LAN. Data exchanged with the communication network via the communication control device 106 is sent to and received from the central processing control device 101 as an input signal or an output signal via the input/output interface 109 and the bus 110.

記憶装置１０７は半導体記憶装置または磁気ディスク装置等であって、中央処理制御装置１０１で実行されるプログラムやデータが記憶されている。リムーバブルディスク１０８は、光ディスクやフレキシブルディスクのことであり、ディスクドライブによって読み書きされた信号は、入出力インタフェース１０９およびバス１１０を介して中央処理制御装置１０１に送受信される。本発明の実施の形態に係る文書分類装置１の記憶装置１０７には、文書分類プログラムが記憶されるとともに、比較対象文書情報記憶部１１、単語重み情報記憶部１２および分類結果記憶部１３が記憶される。また、この文書分類プログラムが文書分類装置１の中央処理制御装置１０１に読み込まれて実行されることによって、分類処理部１０が実装される。 The storage device 107 is a semiconductor storage device, a magnetic disk device, or the like, and stores programs and data executed by the central processing control device 101. The removable disk 108 is an optical disk or a flexible disk, and signals read or written by the disk drive are sent to and received from the central processing control device 101 via the input/output interface 109 and the bus 110. The storage device 107 of the document classification device 1 according to the embodiment of the present invention stores the document classification program, as well as the comparison target document information storage unit 11, the word weight information storage unit 12, and the classification result storage unit 13. The classification processing unit 10 is implemented when this document classification program is read and executed by the central processing control device 101 of the document classification device 1.

なお、本発明の最良の実施の形態に係る文書分類装置１は、一つのコンピュータによって実現されても良いし、互いに通信可能な複数のコンピュータによって実現されても良い。例えば、一括処理を行なうための構成と対話処理を行なうための構成は、同一のコンピュータシステム上にあっても構わないし、ネットワーク等を介して接続された別のコンピュータシステム上にあっても構わない。また、分類処理部１０もそれぞれ一つのコンピュータによって実現されていても良く、また複数のコンピュータによって実現されていても良い。 The document classification apparatus 1 according to the best embodiment of the present invention may be realized by a single computer or a plurality of computers that can communicate with each other. For example, the configuration for performing batch processing and the configuration for performing interactive processing may be on the same computer system or on different computer systems connected via a network or the like. . Further, each of the classification processing units 10 may be realized by a single computer, or may be realized by a plurality of computers.

図５に示すように、本発明の実施の形態に係る文書分類装置１における分類処理部１０は制御部２００およびメモリ部２５０を有する。 As shown in FIG. 5, the classification processing unit 10 in the document classification apparatus 1 according to the embodiment of the present invention includes a control unit 200 and a memory unit 250.

制御部２００は、初期化部２０１、入力部２０２、単語重み読み込み部２０３、被分類文書情報生成部２０４、比較対象文書情報読み込み部２０５、共通単語情報生成部２０６、比較対象文書類似度算出部２０７、分野別類似度積算部２０８、分野特定部２０９、類似度算出パラメータ調整部２１０および分類結果記憶部２１１を有する。 The control unit 200 includes an initialization unit 201, an input unit 202, a word weight reading unit 203, a classified document information generation unit 204, a comparison target document information reading unit 205, a common word information generation unit 206, and a comparison target document similarity calculation unit. 207, a field-specific similarity accumulation unit 208, a field identification unit 209, a similarity calculation parameter adjustment unit 210, and a classification result storage unit 211.

また、メモリ部２５０は、単語重みバッファ部２５１、被分類文書情報バッファ部２５２、比較対象文書情報バッファ部２５３、共通単語情報バッファ部２５４、比較対象文書類似度バッファ部２５５および分野別類似度積算値バッファ部２５６を有する。 The memory unit 250 includes a word weight buffer unit 251, a classified document information buffer unit 252, a comparison target document information buffer unit 253, a common word information buffer unit 254, a comparison target document similarity buffer unit 255, and a field-specific similarity integrated value buffer unit 256.

単語重みバッファ部２５１は、比較対象文書で使用されている単語である各使用単語について、使用単語とその単語重みとが関連付けられた単語重み情報を記憶する。 The word weight buffer unit 251 stores word weight information in which a used word and its word weight are associated with each used word that is a word used in the comparison target document.

被分類文書情報バッファ部２５２は、分類の対象となる被分類文書から生成される被分類文書情報を記憶する。 The classified document information buffer unit 252 stores classified document information generated from the classified document to be classified.

比較対象文書情報バッファ部２５３は、被分類文書情報と比較する比較対象文書情報を記憶する。 The comparison target document information buffer unit 253 stores comparison target document information to be compared with classified document information.

共通単語情報バッファ部２５４は、被分類文書と比較対象文書で共通して使用されている単語である共通使用単語と、その共通使用単語の文書における使用回数と、その共通使用単語の単語重みと、この単語重みごとに設定されてこの単語重みを調整する調整値とを関連付けた共通単語情報を記憶する。 The common word information buffer unit 254 stores common word information that associates each commonly used word, that is, a word used in both the classified document and the comparison target document, with the number of times that word is used in the documents, the word weight of that word, and an adjustment value that is set for each word weight and adjusts that word weight.

比較対象文書類似度バッファ部２５５は、被分類文書に関して求められた比較対象文書毎の共通使用単語情報に基づいて算出された類似度を比較対象文書類似度として記憶する。本発明の最良の実施の形態においては、比較対象文書類似度算出部２０７において比較対象文書類似度を算出し、この比較対象文書類似度に基づいて生成した比較対象文書類似度情報を記憶している。 The comparison target document similarity buffer unit 255 stores, as the comparison target document similarity, the similarity calculated for the classified document from the common word information of each comparison target document. In the best mode of the present invention, the comparison target document similarity calculation unit 207 calculates the comparison target document similarity, and the comparison target document similarity information generated from it is stored here.

分野別類似度積算値バッファ部２５６は、比較対象文書類似度について、その比較対象文書が属する分野毎に合計した分野別類似度積算値を記憶する。 The field-specific similarity integrated value buffer unit 256 stores field-specific similarity integrated values totaled for each field to which the comparison target document belongs with respect to the comparison target document similarity.

初期化部２０１は、メモリ部２５０の各バッファ部２５１〜２５６を初期化する。 The initialization unit 201 initializes the buffer units 251 to 256 of the memory unit 250.

入力部２０２は、被分類文書や操作指示を入力装置を介して入力する。 The input unit 202 inputs a classified document and an operation instruction via an input device.

単語重み読み込み部２０３は、単語重み情報記憶部１２から単語重みバッファ部２５１に単語重み情報を読み込む。 The word weight reading unit 203 reads word weight information from the word weight information storage unit 12 into the word weight buffer unit 251.

被分類文書情報生成部２０４は、入力部２０２に入力された被分類文書を単語単位に分解し、分解された各単語とその単語の使用回数とを含む被分類文書情報を生成して被分類文書情報バッファ部２５２に記憶させる。 The classified document information generation unit 204 decomposes the classified document input to the input unit 202 into words, generates classified document information including each decomposed word and the number of times the word is used, and classifies the classified document. It is stored in the document information buffer unit 252.

比較対象文書情報読み込み部２０５は、比較対象文書情報記憶部１１から比較対象文書情報バッファ部２５３に比較対象文書情報を読み込む。 The comparison target document information reading unit 205 reads the comparison target document information from the comparison target document information storage unit 11 into the comparison target document information buffer unit 253.

共通単語情報生成部２０６は、被分類文書情報バッファ部２５２に記憶される被分類文書情報と比較対象文書情報バッファ部２５３に記憶される比較対象文書情報とを読み出し、被分類文書と比較対象文書で共通で使用している共通使用単語を抽出する。また、共通単語情報生成部２０６は、その共通使用単語の文書中での使用回数およびその共通使用単語の単語重みと調整値の初期値とを関連付けた共通単語情報を生成して共通単語情報バッファ部２５４に記憶させる。 The common word information generation unit 206 reads the classified document information stored in the classified document information buffer unit 252 and the comparison target document information stored in the comparison target document information buffer unit 253, and extracts the commonly used words that appear in both the classified document and the comparison target document. The common word information generation unit 206 also generates common word information that associates the number of times each commonly used word is used in the documents, the word weight of that word, and the initial value of the adjustment value, and stores it in the common word information buffer unit 254.
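The extraction of commonly used words can be sketched as a dictionary intersection. This is a minimal sketch under assumptions: the function name `build_common_word_info` and the record keys are hypothetical, and keeping the use counts of both documents separately is a choice made for this example (the text only speaks of the use count in the documents). The weight 4.3 for "database" is the value quoted for FIG. 3.

```python
def build_common_word_info(classified, comparison, weights, initial_adjustment=1.0):
    """Build common word information for one classified/comparison document pair.

    `classified` and `comparison` map each used word to its use count;
    `weights` maps words to word weights.
    """
    common = []
    for word, count in classified.items():
        if word in comparison:  # the word is used in both documents
            common.append({
                "word": word,
                "classified_count": count,
                "comparison_count": comparison[word],
                "weight": weights.get(word, 0.0),
                "adjustment": initial_adjustment,  # initial value 1.0
            })
    return common


classified_doc = {"database": 3, "index": 1}
comparison_doc = {"database": 5, "time": 3}
common_info = build_common_word_info(classified_doc, comparison_doc, {"database": 4.3})
```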

比較対象文書類似度算出部２０７は、共通単語情報バッファ部２５４に記憶されている共通単語情報を読み出し、読み出した共通単語情報に含まれる共通単語の使用回数と単語重みと調整値とに基づいて、被分類文書と各比較対象文書との類似度である比較対象文書類似度を算出する。また、比較対象文書類似度算出部２０７は、算出した各比較対象文書類似度により、比較対象文書類似度情報を生成して、比較対象文書類似度バッファ部２５５に記憶させる。 The comparison target document similarity calculation unit 207 reads the common word information stored in the common word information buffer unit 254, and based on the number of common words used, the word weight, and the adjustment value included in the read common word information. The comparison target document similarity that is the similarity between the classified document and each comparison target document is calculated. Further, the comparison target document similarity calculation unit 207 generates comparison target document similarity information based on each calculated comparison target document similarity, and stores it in the comparison target document similarity buffer unit 255.

なお、本発明の最良の実施の形態において比較対象文書類似度を算出する方法は、被分類文書および比較対象文書の２つの文書で共通して使用されている使用単語の出現回数の和に単語重みと調整値との積を掛け合わせたものを類似度とする例を用いて説明する。しかし、この類似度の算出方法は、上記の方法に限定するものではなく、他の算出方法で求めてもよい。 In the best mode of the present invention, the method of calculating the comparison target document similarity is described using an example in which, for the words used in both the classified document and the comparison target document, the sum of their occurrence counts is multiplied by the product of the word weight and the adjustment value, and the result is taken as the similarity. However, the similarity calculation is not limited to this method, and other calculation methods may be used.
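One possible reading of this example calculation is sketched below: for each commonly used word, the occurrence counts in the two documents are added, multiplied by the word weight and the adjustment value, and the per-word results are summed. Summing per word is an interpretation, not something the text states explicitly, and the function name `similarity` and the record keys are hypothetical. The weights 4.3 and 8.5 are the values quoted for FIG. 3; the counts are made up.

```python
def similarity(common_words):
    """Similarity of one classified/comparison document pair (one reading of the text).

    Each entry carries the word's use count in both documents,
    its word weight, and its adjustment value.
    """
    return sum(
        (w["classified_count"] + w["comparison_count"]) * w["weight"] * w["adjustment"]
        for w in common_words
    )


common_words = [
    {"word": "database", "classified_count": 3, "comparison_count": 5,
     "weight": 4.3, "adjustment": 1.0},
    {"word": "automatic classification", "classified_count": 1, "comparison_count": 2,
     "weight": 8.5, "adjustment": 1.0},
]
score = similarity(common_words)  # (3 + 5) * 4.3 * 1.0 + (1 + 2) * 8.5 * 1.0
```

Lowering the adjustment value of a word reduces that word's contribution, which is how the adjustment value acts as a per-document tuning parameter.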

分野別類似度積算部２０８は、比較対象文書類似度バッファ部２５５に比較対象文書類似度が記憶されると、この比較対象文書類似度を適合する分野について、各分野別に積算した分野別類似度積算値に加算し、分野別類似度積算値バッファ部２５６に記憶させる。 When a comparison target document similarity is stored in the comparison target document similarity buffer unit 255, the field-specific similarity accumulation unit 208 adds that similarity to the field-specific similarity integrated value of the field to which the comparison target document belongs, and stores the result in the field-specific similarity integrated value buffer unit 256.
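The per-field accumulation can be sketched as a running total keyed by field. The function names are hypothetical, the similarity values are made up, and picking the field with the largest total in `top_field` is an assumption about how the integrated values might be used, not a rule stated here.

```python
from collections import defaultdict


def accumulate_by_field(doc_similarities):
    """Sum comparison target document similarities per field.

    `doc_similarities` is an iterable of (field name, similarity) pairs,
    one pair per comparison target document.
    """
    totals = defaultdict(float)
    for field_name, score in doc_similarities:
        totals[field_name] += score
    return dict(totals)


def top_field(totals):
    """Return the field with the largest integrated similarity (assumed rule)."""
    return max(totals, key=totals.get)


totals = accumulate_by_field([
    ("Database update", 34.4),
    ("Document classification", 25.5),
    ("Database update", 10.0),
])
best = top_field(totals)
```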

分野特定部２０９は、分野別類似度積算値バッファ部２５６に記憶された分野別類似度積算値と、共通単語情報バッファ部２５４に記憶されている共通単語情報２５４ａとを読み出すとともに、これらを関連付けた分野特定結果を生成し、表示装置などの出力装置に出力する。 The field specifying unit 209 reads the field-specific similarity integrated values stored in the field-specific similarity integrated value buffer unit 256 and the common word information 254a stored in the common word information buffer unit 254, generates a field identification result that associates them, and outputs it to an output device such as a display device.

類似度算出パラメータ調整部２１０は、入力装置を介して利用者によって単語重みを調整するために変更された調整値に基づいて、新たな共通単語情報２５４ａに基づいて再分類させるため、共通単語情報バッファ部２５４の共通単語情報２５４ａを書き替える。 The similarity calculation parameter adjustment unit 210 rewrites the common word information 254a in the common word information buffer unit 254 based on an adjustment value that the user has changed via the input device in order to adjust a word weight, so that reclassification is performed based on the new common word information 254a.
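The effect of rewriting an adjustment value can be illustrated as follows. This is a sketch under assumptions: `apply_adjustment` and `similarity` are hypothetical helpers, the similarity formula is one reading of the example calculation (sum of both use counts times weight times adjustment value, summed over words), and the weight 2.0 for "time" is invented for the example.

```python
def apply_adjustment(common_words, word, new_adjustment):
    """Return a copy of the common word information with one adjustment value changed."""
    return [
        {**entry, "adjustment": new_adjustment} if entry["word"] == word else dict(entry)
        for entry in common_words
    ]


def similarity(common_words):
    """Assumed similarity: sum over words of (both use counts) * weight * adjustment."""
    return sum(
        (w["classified_count"] + w["comparison_count"]) * w["weight"] * w["adjustment"]
        for w in common_words
    )


common = [
    {"word": "database", "classified_count": 3, "comparison_count": 5,
     "weight": 4.3, "adjustment": 1.0},
    {"word": "time", "classified_count": 1, "comparison_count": 3,
     "weight": 2.0, "adjustment": 1.0},
]
before = similarity(common)
# The user halves the influence of "database" for this classified document only.
adjusted = apply_adjustment(common, "database", 0.5)
after = similarity(adjusted)
```

Because the change is made to the common word information of a single classified document, other classified documents keep their own adjustment values.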

分類結果記憶部２１１は、分類結果が確定されると、共通単語情報バッファ部２５４に記憶される共通単語情報２５４ａを読み出し、確定された分類と使用単語と調整値とを含む登録用の分類結果１３ａを生成し、分類結果記憶部１３に記憶させる。 When the classification result is confirmed, the classification result storage unit 211 reads the common word information 254a stored in the common word information buffer unit 254, and the classification result for registration including the confirmed classification, the used word, and the adjustment value. 13 a is generated and stored in the classification result storage unit 13.

［文書分類処理］ [Document classification processing]
次に、図６乃至図１７を用いて、本発明の実施の形態に係る文書分類装置１における文書分類処理を説明する。図６及び図７に示すフローチャートは、分類処理部１０における処理を示している。 Next, document classification processing in the document classification apparatus 1 according to the embodiment of the present invention will be described with reference to FIGS. 6 to 17. The flowcharts shown in FIGS. 6 and 7 show the processing in the classification processing unit 10.

まず、図６に示すフローチャートにあるように、初期化部２０１は、メモリ部２５０の各バッファ部２５１〜２５６を初期化する（Ｓ００１）。その後、単語重み読み込み部２０３は、単語重み情報記憶部１２から単語重みバッファ部２５１に単語重み情報を読み込む（Ｓ００２）。 First, as shown in the flowchart of FIG. 6, the initialization unit 201 initializes the buffer units 251 to 256 of the memory unit 250 (S001). Thereafter, the word weight reading unit 203 reads word weight information from the word weight information storage unit 12 into the word weight buffer unit 251 (S002).

Next, when a classified document is input via the input unit 202, the classified document information generation unit 204 decomposes the input classified document into words. The classified document information generation unit 204 also generates classified document information containing each decomposed word and the number of times each word is used, and stores the generated classified document information in the classified document information buffer unit 252 (S003).

FIG. 8 shows classified document 1, an example of a classified document input to the classified document information generation unit 204 via the input unit 202. The second half of the document is omitted. A plurality of such classified documents are input to the classified document information generation unit 204.

FIG. 9 shows an example of the classified document information 252a generated by the classified document information generation unit 204 from “classified document 1”. As shown in FIG. 9, the classified document information 252a associates, for example, each “used word” appearing in the classified document with its “use count”, which is the number of times that word is used in the classified document, and an “adjustment value”. The initial value of the “adjustment value” is set to “1.0”.
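As an illustration only, the classified document information generated in step S003 can be sketched in Python as follows. The function name, the record keys, and the sample word list are assumptions made for this sketch and do not appear in the specification. The input is assumed to be already split into words, since the specification does not state how the document is decomposed.

```python
from collections import Counter


def build_classified_document_info(words):
    """Return one record per distinct word: use count plus the initial adjustment value."""
    counts = Counter(words)
    return [
        {"word": word, "count": count, "adjustment": 1.0}  # 1.0 is the stated initial value
        for word, count in counts.items()
    ]


# Hypothetical, already-segmented input (not the content of FIG. 8).
words = ["database", "update", "database", "time", "large-scale", "database"]
info = build_classified_document_info(words)
```

Each record corresponds to one row of the kind of table described for FIG. 9: a used word, its use count, and an adjustment value that starts at 1.0.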

The process of step S003 is performed for every classified document to be classified (S004). For example, when 2000 documents are input as classified documents, the process of step S003 is repeated 2000 times.

When classified document information has been generated for all classified documents, the comparison target document information reading unit 205 reads the comparison target document information 11a from the comparison target document information storage unit 11 into the comparison target document information buffer unit 253 (S005).

Next, the common word information generation unit 206 reads the classified document information 252a from the classified document information buffer unit 252 and the comparison target document information 11a from the comparison target document information buffer unit 253. It extracts the words used in both the classified document and the comparison target document as commonly used words, generates common word information 254a that combines each extracted commonly used word with the total number of times it is used in the classified document and the comparison target document, and stores the common word information in the common word information buffer unit 254 for each comparison target document (S006).

Steps S005 and S006 are then repeated until the common word information 254a has been generated and stored for all comparison target documents (S007).

FIG. 10 shows an example of the common word information 254a. For each combination of a classified document and a comparison target document, the common word information 254a stores, in association with one another, the “field” of the comparison target document, each “used word”, the “use count” of that word in the classified document and the comparison target document, the “word weight” of that word, and the “adjustment value” that adjusts that word weight.

For example, the common word information 254a shown in FIG. 10 indicates that, when classified document 1 is compared with comparison target document 1, a document titled “Reduction of database update processing time”, the use count of the used word “large-scale” is 5, the use count of the used word “database” is 11, and the use count of “time” is 5.
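The extraction of commonly used words in step S006 might be sketched as follows. This is a minimal illustration under stated assumptions: the function name and record keys are hypothetical, the split of each total between the two documents is invented so that the totals come out as 5, 11, and 5, and the word weights are sample values chosen for the sketch. Every commonly used word is assumed to have an entry in the word weight table.

```python
def build_common_word_info(classified_counts, comparison_counts, word_weights, field):
    """Build one record per word that occurs in both documents."""
    common_words = sorted(set(classified_counts) & set(comparison_counts))
    return [
        {
            "field": field,
            "word": word,
            # Total use count across the classified and comparison target documents.
            "count": classified_counts[word] + comparison_counts[word],
            "weight": word_weights[word],
            "adjustment": 1.0,  # initial adjustment value
        }
        for word in common_words
    ]


# Hypothetical per-document counts; only the totals follow the example in the text.
classified_counts = {"large-scale": 2, "database": 6, "time": 3, "classification": 4}
comparison_counts = {"large-scale": 3, "database": 5, "time": 2, "update": 7}
word_weights = {"large-scale": 2.1, "database": 4.3, "time": 1.7}  # sample weights

common_word_info = build_common_word_info(
    classified_counts, comparison_counts, word_weights, "database update"
)
totals = {row["word"]: row["count"] for row in common_word_info}
```

Words that occur in only one of the two documents ("classification" and "update" in this sample) do not appear in the result.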

In the best embodiment of the present invention, the comparison target document similarity calculation unit 207 obtains the similarity by summing, over the common words contained in the common word information, the product of the use count, the word weight, and the adjustment value. The similarity may, however, also be calculated in other ways, for example by using the vector space method.

Next, the comparison target document similarity calculation unit 207 reads the common word information 254a stored in the common word information buffer unit 254, calculates the similarity, and stores the calculated similarity in the comparison target document similarity buffer unit 255 as comparison target document similarity information 255a (S008).

To calculate the comparison target document similarity, the product of the use count and the word weight is first calculated for each commonly used word. The sum of these products over the commonly used words is taken as the comparison target document similarity for each classified document, and the comparison target document similarity information 255a is generated from these comparison target document similarities.

For example, for “comparison target document 1” shown in FIG. 10, the comparison target document similarity is 5 × 2.1 + 11 × 4.3 + 5 × 1.7 = 66.3.
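The arithmetic above can be checked with a short Python sketch. The function name and record keys are assumptions for illustration, and the pairing of each weight with a word follows the order of the worked example, which is also an assumption. The adjustment value is included in the product and left at its initial value of 1.0, so it does not change the result.

```python
def similarity(common_word_info):
    """Sum of use count x word weight x adjustment value over the commonly used words."""
    return sum(
        row["count"] * row["weight"] * row["adjustment"] for row in common_word_info
    )


rows = [
    {"word": "large-scale", "count": 5, "weight": 2.1, "adjustment": 1.0},
    {"word": "database", "count": 11, "weight": 4.3, "adjustment": 1.0},
    {"word": "time", "count": 5, "weight": 1.7, "adjustment": 1.0},
]
score = similarity(rows)
print(round(score, 1))  # 66.3
```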

FIG. 11 shows an example of the comparison target document similarity information 255a stored in the comparison target document similarity buffer unit 255.

After that, when the comparison target document similarity information 255a has been stored in the comparison target document similarity buffer unit 255, the field-specific similarity integration unit 208 adds the comparison target document similarity to the field-specific similarity integrated value, which accumulates the comparison target document similarities by field, rewrites the field-specific similarity integrated value information 256a, and stores it in the field-specific similarity integrated value buffer unit 256 (S009).

When the similarity calculation results are as shown in FIG. 11, the similarity of 66.3 calculated for the document titled “Reduction of database update processing time” and the similarity of 43.5 calculated for the document titled “Text database update” are first added to the field “database update”. The similarities calculated for subsequent documents classified under “database update” are then added to give the field-specific similarity integrated value.
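The accumulation by field in step S009 can be illustrated with the following sketch. The function name is hypothetical, and the third sample entry (a “document classification” score of 12.0) is invented for the example; only the values 66.3 and 43.5 come from the text.

```python
from collections import defaultdict


def accumulate_by_field(document_similarities):
    """Add each comparison target document's similarity to the total of its field."""
    totals = defaultdict(float)
    for field, score in document_similarities:
        totals[field] += score
    return dict(totals)


document_similarities = [
    ("database update", 66.3),         # "Reduction of database update processing time"
    ("database update", 43.5),         # "Text database update"
    ("document classification", 12.0), # hypothetical entry for contrast
]
field_totals = accumulate_by_field(document_similarities)
```

With these inputs the running total for “database update” is 66.3 + 43.5, that is, about 109.8 before any further documents in that field are added.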

FIG. 12 shows an example of the field-specific similarity integrated value information 256a generated from the field-specific similarity integrated values obtained by processing all comparison target documents in step S105.

Next, when the comparison target document similarities calculated for all comparison target documents have been added to the field-specific similarity integrated values (YES in S010), the field specifying unit 209 reads the field-specific similarity integrated value information 256a from the field-specific similarity integrated value buffer unit 256 and sorts the field-specific similarity integrated values in descending order. It also reads the common word information 254a stored in the common word information buffer unit 254, generates a field identification result that associates the common word information 254a with the similarity integrated values, ordered from the field with the largest field-specific similarity integrated value, and outputs the result to the output device. The field identification result contains the “field name” and, for each comparison target document belonging to that field, its “title” and “similarity”. The field specifying unit 209 also displays the word weights used in calculating the similarity with each comparison target document (S011). At this time, the word weight of each word is displayed in a rewritable state.
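The descending ordering used when the field identification result is generated might look like the following sketch. The function name and all three totals are hypothetical sample values, not figures taken from FIG. 12.

```python
def rank_fields(field_totals):
    """Return (field, integrated similarity) pairs, largest integrated value first."""
    return sorted(field_totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical field-specific similarity integrated values.
field_totals = {
    "document search": 80.0,
    "database update": 109.8,
    "document classification": 75.0,
}
ranked = rank_fields(field_totals)
top_field = ranked[0][0]
```

The first entry of the ranked list is the field that would be presented at the top of the field identification result.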

FIG. 13 shows an example of the display screen 600 for the field identification result displayed in step S011. FIG. 14 shows an example of the display screen 601 for the word weight adjustment screen displayed in step S011. In the example display screen 601 of FIG. 14, the word weights are displayed sorted in descending order of use count.

In the example shown in FIG. 13, the display screen 600 is provided with a word weight adjustment button 600a, and pressing the word weight adjustment button 600a displays the display screen 601. The word weight adjustment button 600a is not essential, however: the field identification result and the word weight adjustment screen may be displayed simultaneously, or may be displayed one after the other after a fixed time has elapsed.

Next, the similarity calculation parameter adjustment unit 210 accepts changes to the adjustment values that adjust the word weights displayed in step S011 (S012).
Specifically, the user refers to the content of the displayed classified document and the classification result. If the user judges the classification result to be incorrect, the user can refer to the words that influenced the classification and their word weights, adjust a word weight by changing its adjustment value, and obtain a new classification result using the changed adjustment value. For example, the user can adjust a displayed word weight when a word carries a high weight even though it does not characterize the field of the classified document, or conversely carries a low weight even though it does characterize the field.

Taking the display screen 601 shown in FIG. 14 as an example, the “adjustment value” can be changed freely, and pressing the reclassification start button 601a restarts classification based on the changed “adjustment value”, as described later. Suppose, for example, that the user judges that “database update” and “document search” are not appropriate fields for the classified document shown in FIG. 8, and that the cause is the high word weight of “database”. In this case, the adjustment value of “database” can be reset to a lower value, for example 0.2.

Next, when the similarity calculation parameter adjustment unit 210 determines that a change to an adjustment value has been accepted (YES in S013), it rewrites the common word information 254a in the common word information buffer unit 254 using the changed adjustment value (S014). Specifically, the similarity calculation parameter adjustment unit 210 rewrites the “adjustment value” of the common word information 254a. FIG. 15 shows common word information 254b, an example in which the “adjustment value” of “database” has been rewritten to 0.2. The processing of steps S008 to S013 is then executed again.

FIG. 16 shows an example of the comparison target document similarity information 255b generated from the similarity calculated for comparison target document 1 when the “adjustment value” of “database” is rewritten to 0.2 as in the example above. As a result of this adjustment, the scores of the “database update” and “document search” fields, which were not very appropriate as fields for the classified document, fall, and the “document classification” field, which is an appropriate classification destination, rises to the top.
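The effect of rewriting the adjustment value can be sketched as follows. The function names and record keys are assumptions, and the word-to-weight pairing follows the order of the earlier worked example. The recalculated value is derived here from the sum-of-products formula and is not read from FIG. 16, which is not reproduced in this text.

```python
def similarity(rows):
    """Sum of use count x word weight x adjustment value over the commonly used words."""
    return sum(row["count"] * row["weight"] * row["adjustment"] for row in rows)


def with_adjustment(rows, word, value):
    """Return a copy of the rows with the adjustment value of one word rewritten."""
    return [
        dict(row, adjustment=value) if row["word"] == word else dict(row)
        for row in rows
    ]


rows = [
    {"word": "large-scale", "count": 5, "weight": 2.1, "adjustment": 1.0},
    {"word": "database", "count": 11, "weight": 4.3, "adjustment": 1.0},
    {"word": "time", "count": 5, "weight": 1.7, "adjustment": 1.0},
]
before = similarity(rows)
after = similarity(with_adjustment(rows, "database", 0.2))
```

Under these assumptions the similarity for comparison target document 1 falls from about 66.3 to about 5 × 2.1 + 11 × 4.3 × 0.2 + 5 × 1.7 = 28.46, which is the kind of reduction that lowers the score of the “database update” field.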

In the best embodiment of the present invention, the common words and the adjustment values that adjust their word weights are used as intermediate data for the similarity calculation. The invention is not limited to this, however; an embodiment is also conceivable in which the field is specified based on values calculated by a plurality of methods, by adjusting, for example, the ratio among those values.

If no adjustment was accepted in the determination of step S013 (NO in S013), the currently set adjustment values are judged to be final, and the classification result storage unit 211 generates a weight-adjusted classification result 13a for registration based on the common word information stored in the common word information buffer unit 254 and stores it in the classification result storage unit 13.

FIG. 17 shows the classification result 13a for registration stored in the classification result storage unit 13. The classification result 13a shown in FIG. 17 is the one generated for classified document 1, and contains the field and the content of classified document 1 together with the adjustment values associated with the used words.
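One possible shape for such a registered classification result is sketched below. The class name, attribute names, and placeholder document text are assumptions for illustration; the specification states only that the result contains the field, the content of the classified document, and the adjustment values associated with the used words.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class RegisteredClassificationResult:
    field_name: str                 # confirmed field
    document_text: str              # content of the classified document
    adjustments: Dict[str, float] = field(default_factory=dict)  # used word -> adjustment value


def build_result(field_name, document_text, common_word_info):
    """Collect the final adjustment value of each used word into one record."""
    adjustments = {row["word"]: row["adjustment"] for row in common_word_info}
    return RegisteredClassificationResult(field_name, document_text, adjustments)


# Hypothetical common word information after the user's adjustment.
common_word_info = [
    {"word": "large-scale", "adjustment": 1.0},
    {"word": "database", "adjustment": 0.2},
    {"word": "time", "adjustment": 1.0},
]
result = build_result(
    "document classification", "(text of classified document 1)", common_word_info
)
```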

The processing of steps S008 to S009 described above is repeated for all target classified documents (S016). For example, when 2000 documents are input as classified documents, it is repeated 2000 times.

According to the present invention described above, the parameters used for classification can be varied for each classified document to be classified. This makes it possible to improve classification accuracy.

The present invention has been described above by way of its embodiments, but the statements and drawings that form part of this disclosure should not be understood as limiting the invention. Various alternative embodiments, examples, and operational techniques will be apparent to those skilled in the art from this disclosure.

The present invention naturally includes various embodiments and the like that are not described here. Accordingly, the technical scope of the present invention is defined only by the invention-specifying matters of the claims that are evident from the matters set out in the above description.

FIG. 1 is a block diagram illustrating the document classification device according to the best embodiment of the present invention.
FIG. 2 is an example of the comparison target document information stored in the comparison target document information storage unit according to the best embodiment of the present invention.
FIG. 3 is an example of the word weight information stored in the word weight information storage unit according to the best embodiment of the present invention.
FIG. 4 is a diagram illustrating the document classification device according to the best embodiment of the present invention.
FIG. 5 is a diagram illustrating the document classification device according to the best embodiment of the present invention.
FIG. 6 is a flowchart illustrating processing in the document classification device according to the best embodiment of the present invention.
FIG. 7 is a flowchart, continuing from FIG. 6, illustrating processing in the document classification device according to the best embodiment of the present invention.
FIG. 8 is an example of a classified document input to the document classification device according to the best embodiment of the present invention.
FIG. 9 is an example of the classified document information generated in the document classification device according to the best embodiment of the present invention.
FIG. 10 is an example of the common word information generated in the document classification device according to the best embodiment of the present invention.
FIG. 11 is an example of the comparison target document similarity stored in the document classification device according to the best embodiment of the present invention.
FIG. 12 is an example of the field-specific similarity integrated values stored in the document classification device according to the best embodiment of the present invention.
FIG. 13 is an example of the field identification result obtained in the document classification device according to the best embodiment of the present invention.
FIG. 14 is an example of the word weight adjustment screen displayed in the document classification device according to the best embodiment of the present invention.
FIG. 15 is an example of the common word information corrected in the document classification device according to the best embodiment of the present invention.
FIG. 16 is an example of the comparison target document similarity information generated from the corrected common word information in the document classification device according to the best embodiment of the present invention.
FIG. 17 is an example of the classification result for registration generated in the document classification device according to the best embodiment of the present invention.

Explanation of symbols

1 ... Document classification device
10 ... Classification processing unit
11 ... Comparison target document information storage unit
11a ... Comparison target document information
12 ... Word weight information storage unit
12a ... Word weight information
101 ... Central processing control device
102 ... ROM
103 ... RAM
104 ... Input device
105 ... Display device
106 ... Communication control device
107 ... Storage device
108 ... Removable disk
109 ... Input/output interface
110 ... Bus
200 ... Control unit
201 ... Initialization unit
202 ... Input unit
203 ... Word weight reading unit
204 ... Classified document information generation unit
205 ... Comparison target document information reading unit
206 ... Common word information generation unit
207 ... Comparison target document similarity calculation unit
208 ... Field-specific similarity integration unit
209 ... Field specifying unit
210 ... Similarity calculation parameter adjustment unit
211 ... Classification result storage unit
250 ... Memory unit
251 ... Word weight buffer unit
252 ... Classified document information buffer unit
252a ... Classified document information
253 ... Comparison target document information buffer unit
254 ... Common word information buffer unit
254a, 254b ... Common word information
255 ... Comparison target document similarity buffer unit
255a, 255b ... Comparison target document similarity information
256 ... Field-specific similarity integrated value buffer unit
256a ... Field-specific similarity integrated value information
600 ... Display screen
600a ... Adjustment button
601 ... Display screen
601a ... Reclassification start button

Claims

A document classification device for classifying a field to which an inputted classified document belongs,
A comparison target document information storage unit that stores information of a comparison target document to be compared with the classified document and a field of the comparison target document as comparison target document information;
A word weight information storage unit that stores words and the word weight of each word;
A classification processing unit that compares the classified document with the comparison target document information to extract commonly used words that are used in both the classified document and the comparison target document; generates, as common word information, each commonly used word, the number of times the commonly used word is used, the word weight of the commonly used word read from the word weight information storage unit, and an adjustment value that is set for each word weight and adjusts that word weight; obtains the similarity between each of a plurality of comparison target documents and the classified document based on the generated common word information; specifies a field based on the obtained similarity; and, by varying the adjustment value based on an instruction from an input device, specifies a new field based on the comparison target documents;
A document classification apparatus comprising:

A document classification device for classifying a field to which an inputted classified document belongs,
A comparison target document information storage unit that stores information of a comparison target document to be compared with the classified document and a field of the comparison target document as comparison target document information;
A word weight information storage unit that stores words and the word weight of each word;
A common word information generation unit that compares the classified document with the comparison target document information to extract commonly used words that are used in both the classified document and the comparison target document, and generates, as common word information, each commonly used word, the number of times the commonly used word is used, the word weight of the commonly used word read from the word weight information storage unit, and an adjustment value that is set for each word weight and adjusts that word weight;
A similarity calculation unit that calculates a similarity based on the number of times of use of each commonly used word included in the common word information, the word weight of the commonly used word, and an adjustment value associated with the word weight;
A field identification unit that identifies the field to which the classified document belongs based on the similarity obtained by the similarity calculation unit and sets the classification result;
A similarity parameter adjustment unit that varies the adjustment value;
When the classification result is confirmed, a classification result storage unit that generates a classification result including an adjustment value of each used word of the classified document and stores it in a storage device;
A document classification apparatus comprising:

The document classification device according to claim 2, wherein the similarity parameter adjustment unit generates the common word information by varying the adjustment value of a specific commonly used word.

The document classification device according to claim 2, wherein the similarity calculation unit calculates the sum, over the commonly used words, of the product of the number of times of use, the word weight, and the adjustment value, and uses the calculated total as the similarity.

A document classification method for classifying a field to which an input classified document belongs,
A classification processing step of comparing the classified document with comparison target document information, in which information of a comparison target document to be compared with the classified document is associated with the field of the comparison target document, to extract commonly used words that are used in both the classified document and the comparison target document; generating, as common word information, each commonly used word, the number of times the commonly used word is used, the word weight of the commonly used word, and an adjustment value that is set for each word weight and adjusts that word weight; obtaining the similarity between each of a plurality of comparison target documents and the classified document from the generated common word information; specifying a field based on the obtained similarity; and, by varying the adjustment value based on an instruction from an input device, specifying a new field based on the comparison target documents;
A document classification method characterized by comprising:

A document classification method for classifying a field to which an input classified document belongs,
A common word information generation step of comparing the classified document with information of a comparison target document to be compared with the classified document to extract commonly used words that are used in both the classified document and the comparison target document, and generating, as common word information, each commonly used word, the number of times the commonly used word is used, the word weight of the commonly used word, and an adjustment value that is set for each word weight and adjusts that word weight;
A similarity calculation step of calculating a similarity based on the number of times of use of each commonly used word, the word weight of the commonly used word, and an adjustment value associated with the word weight;
A field specifying step of specifying the field to which the classified document belongs based on the similarity obtained for the classified document and making it a classification result;
A similarity parameter adjustment step for varying the adjustment value;
When the classification result is confirmed, a classification result storing step of generating a classification result including an adjustment value of each used word of the classified document and storing it in a storage device;
A document classification method characterized by comprising:

The document classification method according to claim 6, wherein the similarity parameter adjustment step generates the common word information by varying the adjustment value of a specific commonly used word.

The document classification method according to claim 6 or 7, wherein the similarity calculation step calculates the sum, over the commonly used words, of the product of the number of times of use, the word weight, and the adjustment value, and uses the calculated total as the similarity.