JPH11259515A

JPH11259515A - Similar document retrieval device and method and recording medium recording similar document retrieval program

Info

Publication number: JPH11259515A
Application number: JP10061726A
Authority: JP
Inventors: Yasuo Tanosaki; 康雄田野崎; Yukio Nakamoto; 幸夫中本; Takuya Nishina; 卓哉仁科; Naohide Kubota; 直秀久保田
Original assignee: Toshiba Corp; Toshiba Computer Engineering Corp
Current assignee: Toshiba Corp; Toshiba Computer Engineering Corp
Priority date: 1998-03-12
Filing date: 1998-03-12
Publication date: 1999-09-24

Abstract

PROBLEM TO BE SOLVED: To improve both the accuracy of inter-document similarity calculation and the accuracy of similar document retrieval by optimizing the list of unnecessary words: some of the extracted words are decided to be unnecessary words, the unnecessary words are deleted from a retrieval key document and a retrieval object document, and the similarity between the two documents is calculated. SOLUTION: Some of the words extracted by a word extraction means are decided to be unnecessary words based on the occurrence frequency of each designated unnecessary word. The unnecessary words are then deleted from a retrieval key document and a retrieval object document, and the similarity between the two documents is calculated. An unnecessary word deletion part 28 of this similar document retrieval device deletes the words equivalent to the unnecessary words stored in an unnecessary word buffer 45 from a retrieval keyword information storing buffer 47 and a retrieval object word information storing buffer 42. A similarity calculation part 29 calculates the similarity between the retrieval key document and the retrieval object document based on the information stored in the buffer 47, the buffer 42 and a common word information storing buffer 48.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a similar document search device and a similar document search method that search a document database for document data based on the similarity between documents, and to a recording medium on which a program for similar document search is recorded.

[0002]

[Prior Art] In recent years, large volumes of digitized document data have come into circulation, and systems that automatically search a document database for documents similar to a specified document (hereinafter called the search key document), for purposes such as automatic classification, have been put to practical use. Such a document search system compares the words contained in the search key document with the words contained in each document to be searched (hereinafter called a search target document), calculates a similarity by the vector space method from the types of the common words, where they appear, how often they appear, and so on, and outputs the search target documents with high similarity as the search result.

[0003] Calculating the similarity with words that are unnecessary for similar document search (general words that do not characterize the content of a document) included can lower search accuracy. For this reason, an unnecessary word list is created in advance and consulted when words are extracted from a document, so that words corresponding to unnecessary words are not extracted.

[0004] Normally, however, creating an unnecessary word list requires the user to decide one by one which words should be unnecessary words, and a separate unnecessary word list must be prepared for each type of search target document database. This work not only places a heavy burden on the user; individual differences in the choice of unnecessary words also cause large variations in the accuracy of similar document search.

[0005]

[Problems to Be Solved by the Invention] As described above, to perform a highly accurate similar document search it is preferable to exclude unnecessary words from the words to be extracted from a document, but this requires an unnecessary word list to be created manually for each type of search target document database, which burdens the user. Moreover, in addition to unnecessary words being overlooked, individual differences in users' choices of unnecessary words are strongly reflected in the search results, so the accuracy of similar document search tends to vary.

[0006] The present invention has been made to solve these problems. Its object is to provide a similar document search device, a similar document search method, and a recording medium on which a program for similar document search is recorded, which can create an optimal unnecessary word list automatically and, by optimizing the unnecessary word list, improve both the accuracy of inter-document similarity calculation and the accuracy of similar document search.

[0007]

[Means for Solving the Problems] To achieve the above object, the similar document search device of the present invention, as set forth in claim 1, is a similar document search device that takes a certain document as a search key document and searches a plurality of search target documents for documents similar to the search key document, the device comprising: document data storage means storing a plurality of document data including the search key document and the search target documents; word designation means for designating an arbitrary word; word extraction means for extracting words from each document data stored in the document data storage means; appearance frequency calculation means for calculating the appearance frequency, in each document data, of the arbitrary word designated by the word designation means and of the words extracted by the word extraction means; unnecessary word determination means for determining at least some of the words extracted by the word extraction means to be unnecessary words, using as a reference the appearance frequency of the arbitrary word calculated by the appearance frequency calculation means; and means for calculating the similarity between the search key document and a search target document after removing from each the unnecessary words determined by the unnecessary word determination means.

[0008] In the present invention, from among the words extracted from a plurality of document data, at least some of the words extracted by the word extraction means are determined to be unnecessary words, using as a reference the appearance frequency calculated for a word that the user has arbitrarily designated as a representative unnecessary word. Similar document search is then performed by removing the unnecessary words from the search key document and the search target document and calculating the similarity between the two documents.

[0009] For example, as set forth in claim 2, among the words extracted by the word extraction means, words whose calculated appearance frequency is equal to or higher than the appearance frequency calculated for the arbitrary word are determined to be unnecessary words; or, as set forth in claim 3, among the extracted words, words whose calculated appearance frequency is equal to or higher than the minimum of the appearance frequencies of a plurality of arbitrary words are determined to be unnecessary words. Furthermore, as set forth in claim 4, a pre-designated number of the extracted words are determined to be unnecessary words, taken in descending order of calculated appearance frequency; or, as set forth in claim 5, extracted words whose calculated appearance frequency is equal to or higher than an arbitrary pre-designated appearance frequency are determined to be unnecessary words.

[0010] The invention described above makes it possible to automate the extraction of unnecessary words from the words contained in document data. That is, merely by entering one or several arbitrary words that are typical unnecessary words, entering an arbitrary number of unnecessary words, or entering a reference appearance frequency, the user can obtain an unnecessary word list close to the desired one, which raises the overall efficiency of similar document search. In addition, unnecessary words that are optimal and appropriate for each type of search target document database can be extracted quickly and without omission, so the accuracy of similar document search can be improved and stabilized.

[0011] Furthermore, determining as unnecessary words those words whose appearance frequency is equal to or higher than the minimum of the appearance frequencies of a plurality of arbitrary words reduces the degree to which individual differences between users show up as differences in the unnecessary word list. In this respect as well, the accuracy of similar document search can be improved and stabilized.

[0012]

[Embodiments of the Invention] An embodiment of the present invention will be described below with reference to the drawings.

[0013] FIG. 1 shows the hardware configuration of a similar document search device that is an embodiment of the present invention. As shown in the figure, the similar document search device of this embodiment comprises a control device 1 made up of a CPU, memory, and the like; an input device 2 such as a keyboard; a display device 3 that displays the progress and results of similar document search; and an external storage device 4 that stores document data and the various data needed for similar document search.

[0014] FIG. 2 is a functional block diagram showing the configuration of control device 1 of the similar document search device of this embodiment. As shown in the figure, control device 1 consists of a control unit and a memory unit.

[0015] The control unit comprises a main processing unit 11, an initialization unit 12, an input unit 13, an output unit 14, a search target document reading unit 15, a search target document word extraction unit 16, a search target word appearance frequency calculation unit 17, a search target word information writing unit 18, an all-search-target word statistics calculation unit 19, an unnecessary word setting unit 20, an unnecessary word list creation unit 21, an unnecessary word list reading unit 22, a search key document input unit 23, a search key word extraction unit 24, a search key word appearance frequency calculation unit 25, a search target word information reading unit 26, a common word extraction unit 27, an unnecessary word removal unit 28, a similarity calculation unit 29, a search result output unit 30, and so on.

[0016] The memory unit comprises a search target document storage buffer 41, a search target word information storage buffer 42, an all-search-target word information storage buffer 43, an unnecessary word setting buffer 44, an unnecessary word buffer 45, a search key document storage buffer 46, a search key word information storage buffer 47, a common word information storage buffer 48, a similarity storage buffer 49, a search result output buffer 50, a work buffer 51, and so on.

[0017] The initialization unit 12 initializes the buffers 41, 42, ..., 51. The input unit 13 passes data entered by the user through input device 2 to the control unit. The output unit 14 outputs the control unit's output data to display device 3.

[0018] The search target document reading unit 15 reads the search target document specified by the user from external storage device 4 and stores it in search target document storage buffer 41.

[0019] The search target document word extraction unit 16 segments the document data stored in search target document storage buffer 41 into words, extracts from the segmented words those that characterize the content of the document, and stores the extracted word types in search target word information storage buffer 42. Word segmentation is performed by morphological analysis or the like, and the words that characterize the content of the document are extracted by selecting, based on part-of-speech information, words that are, for example, nouns or sahen nouns (verbal nouns).
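
The part-of-speech selection described above can be sketched as follows. This is a minimal illustration that assumes the morphological analysis has already produced (surface form, part of speech) pairs; the tag names `noun` and `sahen_noun` and the function `extract_content_words` are hypothetical and do not come from the source.

```python
# Hypothetical part-of-speech labels; a real morphological analyzer
# would supply its own tag set.
CONTENT_POS = {"noun", "sahen_noun"}


def extract_content_words(tagged_tokens):
    """Return the distinct content words, in order of first appearance."""
    word_types = []
    for surface, pos in tagged_tokens:
        if pos in CONTENT_POS and surface not in word_types:
            word_types.append(surface)
    return word_types


# Illustrative input: (surface form, part of speech) pairs.
tagged = [
    ("document", "noun"),
    ("wo", "particle"),
    ("search", "sahen_noun"),
    ("suru", "verb"),
    ("document", "noun"),
]
content_words = extract_content_words(tagged)
```

Only word types are kept at this stage; counting how often each one occurs is left to the appearance frequency calculation unit.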

[0020] The search target word appearance frequency calculation unit 17 calculates, for each word stored in search target word information storage buffer 42, its appearance frequency (number of occurrences) within the source document, and stores the calculated appearance frequency in search target word information storage buffer 42 in association with the word.
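
As a rough sketch of this counting step, the per-document appearance frequency can be computed with a counter over the segmented tokens. The function name `count_occurrences` and the dictionary representation of the buffer contents are assumptions made for illustration.

```python
from collections import Counter


def count_occurrences(tokens, content_words):
    """Map each extracted word to its number of occurrences in one document."""
    wanted = set(content_words)
    counts = Counter(token for token in tokens if token in wanted)
    return dict(counts)


# Illustrative document, already segmented into words.
tokens = ["document", "search", "document", "device"]
word_info = count_occurrences(tokens, ["document", "search"])
```

Here `word_info` pairs each word with its count, which is the association the paragraph above describes.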

[0021] The search target word information writing unit 18 reads the word information and appearance frequency information of each search target document stored in search target word information storage buffer 42 and writes them to external storage device 4.

[0022] The all-search-target word statistics calculation unit 19 sequentially reads the word appearance frequency information of each search target document stored in external storage device 4 and writes it to search target document storage buffer 41, calculates for each type of word extracted from all the search target documents a statistic of its appearance frequency, such as the number of documents in which it appears, and stores the result in all-search-target word information storage buffer 43 as the all-search-target word statistics.
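
One statistic named above is the number of documents in which a word appears. A minimal sketch of that aggregation, assuming each document's word information is a dictionary from word to count (the name `document_frequency` is hypothetical):

```python
from collections import Counter


def document_frequency(per_document_word_info):
    """Count, for each word, how many documents contain it at least once."""
    frequency = Counter()
    for word_info in per_document_word_info:
        # Each document contributes at most 1 per word, whatever its count.
        frequency.update(word_info.keys())
    return dict(frequency)


documents = [
    {"document": 2, "search": 1},
    {"document": 1, "device": 3},
]
statistics = document_frequency(documents)
```

Other statistics, such as the total number of occurrences across all documents, could be accumulated in the same loop; the source leaves the choice open.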

[0023] The unnecessary word setting unit 20 accepts from the user the criterion for setting the types of words (unnecessary words) to be excluded from the words extracted from each document data when inter-document similarity is calculated, and stores the criterion in unnecessary word setting buffer 44. The criterion can be set by designating one or more arbitrary words (unnecessary words), by designating the number of unnecessary words, or by designating a reference value for the appearance frequency.

[0024] The unnecessary word list creation unit 21 creates an unnecessary word list based on the all-search-target word information stored in all-search-target word information storage buffer 43 and the settings in unnecessary word setting buffer 44, and stores the created list in external storage device 4.

[0025] In creating the unnecessary word list, if one arbitrary unnecessary word was designated in unnecessary word setting unit 20, words whose appearance frequency is equal to or higher than the frequency calculated for the designated word are stored in unnecessary word buffer 45 as unnecessary words; if several unnecessary words were designated, words whose appearance frequency is equal to or higher than the minimum of the frequencies calculated for the designated words are stored. If the number of unnecessary words was designated, that number of words, taken from all words in descending order of appearance frequency, is stored in unnecessary word buffer 45 as unnecessary words. If a reference value for the appearance frequency was designated, all words whose appearance frequency is equal to or higher than the reference value are stored in unnecessary word buffer 45 as unnecessary words.

[0026] The unnecessary word list reading unit 22 reads the unnecessary word list stored in external storage device 4 and stores it in unnecessary word buffer 45.

[0027] The search key document input unit 23 stores the search key document entered from input device 2 in search key document storage buffer 46.

[0028] The search key word extraction unit 24 segments the search key document stored in search key document storage buffer 46 into words, extracts from the segmented words the word types that characterize the content of the search key document, and stores the extracted word types in search key word information storage buffer 47. Word segmentation is performed by morphological analysis or the like, and the words that characterize the content of the document are extracted by selecting, based on part-of-speech information, words that are, for example, nouns or sahen nouns (verbal nouns).

[0029] The search key word appearance frequency calculation unit 25 calculates, for each word extracted by search key word extraction unit 24, its appearance frequency (number of occurrences) within the source document, and stores the calculated appearance frequency in search key word information storage buffer 47.

[0030] The search target word information reading unit 26 retrieves, one document at a time, the word information and appearance frequency information of each search target document in the document database stored in external storage device 4, and stores them in search target word information storage buffer 42.

[0031] The common word extraction unit 27 reads from search key word information storage buffer 47 and search target word information storage buffer 42 the word information, together with its appearance frequency information, for the words present in both the search key document and the search target document, and stores it in common word information storage buffer 48.
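
Assuming each document's word information is held as a dictionary from word to count, the common word extraction can be sketched as a dictionary intersection that keeps both frequencies. The function name `extract_common_words` and the tuple layout are illustrative assumptions, not the source's data format.

```python
def extract_common_words(key_info, target_info):
    """Return {word: (count in key document, count in target document)}
    for every word present in both documents."""
    return {
        word: (key_info[word], target_info[word])
        for word in key_info
        if word in target_info
    }


key_info = {"document": 2, "search": 1, "similarity": 4}
target_info = {"document": 5, "similarity": 1, "device": 3}
common = extract_common_words(key_info, target_info)
```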

[0032] The unnecessary word removal unit 28 deletes from search key word information storage buffer 47 and search target word information storage buffer 42 the words that correspond to the unnecessary words stored in unnecessary word buffer 45.
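
The removal step amounts to filtering both sets of word information against the unnecessary word list. A minimal sketch, with the dictionary representation and the name `remove_unnecessary_words` assumed for illustration:

```python
def remove_unnecessary_words(word_info, unnecessary_words):
    """Return a copy of word_info without the unnecessary words."""
    blocked = set(unnecessary_words)
    return {word: count for word, count in word_info.items() if word not in blocked}


unnecessary = ["koto", "device"]
key_info = remove_unnecessary_words({"koto": 7, "similarity": 4}, unnecessary)
target_info = remove_unnecessary_words({"device": 3, "similarity": 1}, unnecessary)
```

The same list is applied to both the search key document and the search target document, so neither side contributes the removed words to the similarity.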

[0033] The similarity calculation unit 29 calculates the similarity between the search key document and the search target document by the vector space method or the like, based on the information stored in search key word information storage buffer 47, search target word information storage buffer 42, and common word information storage buffer 48, and stores the calculated similarity in similarity storage buffer 49.
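
The source names the vector space method but does not give a formula. One common instance is the cosine of the angle between the two term-frequency vectors, sketched below under that assumption; the weighting by raw counts and the name `cosine_similarity` are illustrative choices, not the patent's stated method.

```python
import math


def cosine_similarity(key_info, target_info):
    """Cosine similarity between two {word: count} term-frequency vectors."""
    # Only words common to both documents contribute to the dot product.
    common = key_info.keys() & target_info.keys()
    dot = sum(key_info[word] * target_info[word] for word in common)
    norm_key = math.sqrt(sum(count * count for count in key_info.values()))
    norm_target = math.sqrt(sum(count * count for count in target_info.values()))
    if norm_key == 0 or norm_target == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_key * norm_target)


same = cosine_similarity({"search": 3, "index": 4}, {"search": 3, "index": 4})
disjoint = cosine_similarity({"search": 1}, {"device": 1})
```

Because high-frequency general words would otherwise dominate both the dot product and the norms, removing them before this step is what the unnecessary word list is for.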

[0034] The search result output unit 30 determines, from the similarity of each search target document stored in similarity storage buffer 49, the document information (for example, document IDs) to be returned as the similar document search result, stores it in search result output buffer 50, and outputs the contents of search result output buffer 50 to display device 3 through output unit 14.

[0035] Next, the operation of the similar document search device of this embodiment will be described.

[0036] First, the operation of creating the document database and the unnecessary word list will be described with reference to FIGS. 3 to 10.

[0037] First, initialization unit 12 is started and all buffers are initialized (step 301). Next, unnecessary word setting unit 20 is started and the criterion for setting unnecessary words is set (step 302). The criterion is set by whichever of the following three methods the user selects.

[0038] In the first method, the user designates an arbitrary number of words (unnecessary words). The appearance frequency calculated for the designated word (or, when several words are designated, the minimum of the appearance frequencies calculated for them) is taken as the reference value, and all words extracted from the document data whose calculated appearance frequency is equal to or higher than the reference value become unnecessary words. For example, as shown in FIG. 4, general words that do not characterize a document, such as 「こと」 ("koto", roughly "thing" or "matter") and 「装置」 ("device"), are designated; the minimum of the appearance frequencies calculated for these words is taken as the reference value, and words whose appearance frequency is equal to or higher than this reference value become unnecessary words.

[0039] In the second method, the user arbitrarily designates the number of unnecessary words (or a rank by appearance frequency), and that number of words, taken from the words extracted from the document data in descending order of calculated appearance frequency, become unnecessary words. For example, as shown in FIG. 5, when "designated rank = 2 (number of unnecessary words = 2)" is specified, the words with the two highest calculated appearance frequencies become unnecessary words.
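
The second method is a top-N selection by appearance frequency. In the sketch below the statistics are a dictionary from word to frequency with made-up values, and ties are broken alphabetically; the source does not say how ties are handled, so that rule and the name `unnecessary_by_rank` are assumptions.

```python
def unnecessary_by_rank(statistics, count):
    """Return the `count` most frequent words, most frequent first."""
    ranked = sorted(statistics.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:count]]


# Illustrative frequencies, not taken from the source figures.
statistics = {"koto": 900, "device": 600, "document": 650, "similarity": 40}
top_two = unnecessary_by_rank(statistics, 2)
```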

[0040] In the third method, the user arbitrarily designates a reference value for the appearance frequency, and all words extracted from the document data whose calculated appearance frequency is equal to or higher than the designated reference value become unnecessary words. For example, as shown in FIG. 6, when "designated appearance frequency = 500 or more" is specified, all words whose appearance frequency is 500 or more become unnecessary words.
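
The third method is a fixed threshold with an inclusive boundary ("500 or more"). A minimal sketch with made-up frequencies; the name `unnecessary_by_threshold` is hypothetical.

```python
def unnecessary_by_threshold(statistics, reference_value):
    """Return every word whose frequency is at least the reference value."""
    return {word for word, freq in statistics.items() if freq >= reference_value}


# Illustrative frequencies, not taken from the source figures.
statistics = {"koto": 900, "device": 500, "retrieval": 499}
selected = unnecessary_by_threshold(statistics, 500)
```

A word at exactly the reference value is selected, while one just below it is kept as a content word.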

[0041] The unnecessary word setting criterion set by any of these three methods is stored in unnecessary word setting buffer 44.

[0042] Next, search target document reading unit 15 is started. It determines whether external storage device 4 still holds a search target document that has not yet been processed (step 303); if so, it stores that search target document in search target document storage buffer 41, as shown in FIG. 7 (step 304).

[0043] Next, search target document word extraction unit 16 is started. It segments the search target document stored in search target document storage buffer 41 into words by morphological analysis or the like, extracts from the segmented words those that characterize the content of the document, such as nouns and sahen nouns (verbal nouns), and stores the extracted words in search target word information storage buffer 42 (step 305).

[0044] Next, search target word appearance frequency calculation unit 17 is started. It calculates, for each word stored in search target word information storage buffer 42, its appearance frequency in the source document and, as shown in FIG. 8 for example, stores the calculated appearance frequency in search target word information storage buffer 42 in association with the word. Hereinafter, this word and appearance frequency information is called "word information". In FIG. 8, the "frequency 2" entered against the word 「文書」 ("document") indicates that the word appears twice in the source document.

[0045] Next, search target word information writing unit 18 is started, and the contents of search target word information storage buffer 42 (the word information) are stored in external storage device 4 (step 306). The process then returns to step 303, the next search target document stored in external storage device 4 is read, and words are extracted from it and their appearance frequencies calculated. In this way, word extraction and appearance frequency calculation are performed for every search target document stored in external storage device 4, and the results are stored in external storage device 4.

[0046] Once the word information of all search target documents stored in the external storage device 4 has itself been stored in the external storage device 4, the all search target word statistics calculation unit 19 is activated. The all search target word statistics calculation unit 19 sequentially reads the word information (appearance frequency information) of all search target documents from the external storage device 4 into the search target document storage buffer 41 and, based on the word appearance frequencies held there, calculates a statistic of the appearance frequency of each word (for example, the number of documents in which the word appears). Then, as shown in FIG. 9, the statistics calculated in this way are stored in the all search target word information storage buffer 43 as all search target word information (step 307). Next, the unnecessary word list creation unit 21 is activated. The unnecessary word list creation unit 21 creates an unnecessary word list based on the all search target word information stored in the all search target word information storage buffer 43 and the unnecessary word setting criterion stored in the unnecessary word setting buffer 44, and stores the created list in the external storage device 4 (step 308). The unnecessary word list is created according to the unnecessary word selection criterion freely specified by the user.
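The statistics calculation of step 307 can be sketched as follows. This is a minimal Python illustration, not the implementation of the embodiment: it assumes that the word information of each document is a mapping from word to in-document count, and that the statistic is the number of documents containing each word (the example the text itself gives). The function name `document_frequencies` is hypothetical.

```python
from collections import Counter


def document_frequencies(per_doc_counts):
    """Count, for each word, the number of documents in which it appears.

    per_doc_counts: list of dicts mapping word -> in-document frequency
    (an assumed stand-in for the stored word information).
    """
    stats = Counter()
    for counts in per_doc_counts:
        # Each document contributes at most 1 to the count of a word.
        stats.update(word for word, n in counts.items() if n > 0)
    return dict(stats)
```

The resulting mapping plays the role of the all search target word information held in buffer 43.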

[0047] When the user has specified an arbitrary number of words (unnecessary words) in the unnecessary word setting unit 20 (the first method), the unnecessary word list creation unit 21 reads the appearance frequencies of those unnecessary words from the all search target word information storage buffer 43 and takes the minimum of the frequencies read as the reference value. It then extracts from the all search target word information storage buffer 43 every word whose appearance frequency is equal to or greater than the reference value, and stores these words in the unnecessary word buffer 45 as unnecessary words. FIG. 10 shows an example of the unnecessary words stored in the unnecessary word buffer 45.
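The first method can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: `doc_freq` is a hypothetical mapping from word to its appearance frequency statistic, and returning an empty set when none of the specified words occurs in the collection is a choice made here, not something the text specifies.

```python
def stopwords_from_examples(doc_freq, examples):
    """First method: every word at least as frequent as the rarest example.

    doc_freq: dict mapping word -> appearance frequency statistic.
    examples: words the user designated as unnecessary words.
    """
    known = [doc_freq[w] for w in examples if w in doc_freq]
    if not known:
        return set()  # assumption: no usable example yields no unnecessary words
    reference = min(known)  # minimum appearance frequency among the examples
    return {w for w, f in doc_freq.items() if f >= reference}
```

With frequencies of 90, 80, 75 and 12 for four words, designating the words with frequencies 90 and 75 sets the reference value to 75, so the word with frequency 80 is picked up automatically.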

[0048] When the user has specified the number of unnecessary words (or a rank in order of appearance frequency) in the unnecessary word setting unit 20 (the second method), the unnecessary word list creation unit 21 takes the words in the all search target word information in descending order of appearance frequency, determines the words up to the specified number (specified rank) to be unnecessary words, and stores them in the unnecessary word buffer 45.
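The second method amounts to a top-N selection. The sketch below is a hedged Python illustration; `doc_freq` is a hypothetical word-to-frequency mapping, and breaking ties alphabetically is an assumption introduced here because the text does not say how equal frequencies are ordered.

```python
def stopwords_top_n(doc_freq, n):
    """Second method: the n most frequent words become unnecessary words.

    Ties are broken alphabetically (an assumption, not stated in the text).
    """
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]
```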

[0049] Further, when the user has specified a reference value for the appearance frequency in the unnecessary word setting unit 20 (the third method), the unnecessary word list creation unit 21 determines every word whose appearance frequency is equal to or greater than the reference value to be an unnecessary word and stores it in the unnecessary word buffer 45.

[0050] This completes the creation of the document database and the unnecessary word list.

[0051] Next, the similar document search operation will be described with reference to FIGS. 11 to 16. First, the initialization unit 12 is activated and all buffers are initialized (step 401). Next, the unnecessary word list reading unit 22 is activated; it reads the unnecessary word list from the external storage device 4 and stores it in the unnecessary word buffer 45 (step 402).

[0052] Next, the search key document input unit 23 is activated, whereby the search key document specified by the user is read from the external storage device 4 and stored in the search key document storage buffer 46 (step 403). FIG. 12 shows an example of a search key document stored in the search key document storage buffer 46.

[0053] Subsequently, the search key word extraction unit 24 is activated. The search key word extraction unit 24 segments the search key document stored in the search key document storage buffer 46 into words by morphological analysis or the like, extracts from the resulting words those that characterize the content of the document, such as nouns and sa-hen nouns (verbal nouns), and stores the extracted words in the search key word information storage buffer 47 (step 404).

[0054] Next, the unnecessary word removing unit 28 is activated. The unnecessary word removing unit 28 finds, among the words of the search key document stored in the search key word information storage buffer 47, those that match unnecessary words stored in the unnecessary word buffer 45, and deletes them (step 405).

[0055] Subsequently, the search key word appearance frequency calculation unit 25 is activated. For each word stored in the search key word information storage buffer 47, the search key word appearance frequency calculation unit 25 calculates the appearance frequency of the word in the document from which it was extracted and, as shown in FIG. 13, stores the calculated appearance frequency in the search key word information storage buffer 47 in association with the word (step 406).
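Steps 405 and 406 can be sketched together in Python. This is a minimal illustration under stated assumptions: the extracted words are given as a list of strings, the unnecessary word buffer is modeled as a set, and the function name `word_info` is hypothetical.

```python
from collections import Counter


def word_info(words, unnecessary):
    """Drop unnecessary words (step 405), then count the rest (step 406).

    words: words extracted from one document, in order of occurrence.
    unnecessary: set of unnecessary words (stand-in for buffer 45).
    """
    kept = [w for w in words if w not in unnecessary]
    return Counter(kept)  # word -> appearance frequency in the document
```

The same helper applies to a search target document, since the description removes unnecessary words from both sides before comparing them.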

[0056] Next, the search target document reading unit 15 is activated. The search target document reading unit 15 determines whether the external storage device 4 still holds a search target document that has not yet been processed (step 407) and, if so, stores that document in the search target document storage buffer 41. The search target document word extraction unit 16 then segments the search target document stored in the search target document storage buffer 41 into words by morphological analysis or the like, extracts from the resulting words those that characterize the content of the document, such as nouns and sa-hen nouns (verbal nouns), and stores the information on the extracted words in the search target word information storage buffer 42 (step 408).

[0057] Subsequently, the unnecessary word removing unit 28 is activated. The unnecessary word removing unit 28 finds, among the words of the search target document stored in the search target word information storage buffer 42, those that match unnecessary words stored in the unnecessary word buffer 45, and deletes them (step 409).

[0058] Next, the common word extraction unit 27 is activated. The common word extraction unit 27 detects the words stored in both the search target word information storage buffer 42 and the search key word information storage buffer 47, each of which has already had its unnecessary words deleted, and stores the detected words in the common word information storage buffer 48, as shown in FIG. 14 (step 410).
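The common word extraction of step 410 is a set intersection over the two word lists. The Python sketch below assumes that each buffer is a mapping from word to appearance frequency; pairing the two frequencies in the result is an illustrative choice, and the function name `common_words` is hypothetical.

```python
def common_words(key_info, target_info):
    """Return the words present in both documents with both frequencies.

    key_info, target_info: dicts mapping word -> appearance frequency,
    each already stripped of unnecessary words.
    """
    shared = key_info.keys() & target_info.keys()
    return {w: (key_info[w], target_info[w]) for w in shared}
```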

[0059] After this, the similarity calculation unit 29 is activated. Based on the contents of the search target word information storage buffer 42, the search key word information storage buffer 47, and the common word information storage buffer 48, the similarity calculation unit 29 calculates the similarity between the search key document and the search target document by the vector space method or the like, and stores the calculated similarity in the similarity storage buffer 49 (step 411). FIG. 15 shows an example of the similarity information, stored in the similarity storage buffer 49, between the search key document and the individual search target documents.
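One common instance of the vector space method is cosine similarity over word frequency vectors. The Python sketch below uses it as an assumption: the text names only "the vector space method or the like" and does not state the exact formula or any term weighting. The function name `cosine_similarity` and the dict-based inputs are hypothetical.

```python
import math


def cosine_similarity(key_info, target_info):
    """Cosine of the angle between two word-frequency vectors.

    key_info, target_info: dicts mapping word -> appearance frequency.
    Only words common to both documents contribute to the dot product.
    """
    shared = key_info.keys() & target_info.keys()
    dot = sum(key_info[w] * target_info[w] for w in shared)
    norm_key = math.sqrt(sum(v * v for v in key_info.values()))
    norm_target = math.sqrt(sum(v * v for v in target_info.values()))
    if norm_key == 0 or norm_target == 0:
        return 0.0  # an empty document is treated as dissimilar to everything
    return dot / (norm_key * norm_target)
```

Because the dot product runs over the shared words only, the three buffers named in the text (the two word lists and the common word list) are exactly the inputs this calculation needs.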

[0060] The process then returns to step 407. If the external storage device 4 still holds a search target document that has not yet been processed, the same processing as described above is performed on that document, and the similarity thus calculated between the search key document and the search target document is stored in the similarity storage buffer 49.

[0061] After the similarities between the search key document and all search target documents stored in the external storage device 4 have been stored in the similarity storage buffer 49, the search result output unit 30 is activated. From the contents of the similarity storage buffer 49, the search result output unit 30 arranges the IDs of the search target documents in descending order of similarity, as shown for example in FIG. 16, and stores the result in the search result output buffer 50. The output unit 14 then outputs the contents of the search result output buffer 50 to the display device 3 (step 412).
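The ordering performed by the search result output unit can be sketched in Python as a sort on the similarity value. The mapping from document ID to similarity is an assumed stand-in for the similarity storage buffer, and the function name `rank_results` is hypothetical.

```python
def rank_results(similarities):
    """Return document IDs ordered from most to least similar.

    similarities: dict mapping document ID -> similarity to the key document.
    """
    ordered = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return [doc_id for doc_id, _ in ordered]
```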

[0062] Thus, the similar document search device of this embodiment greatly reduces the user's workload in creating the unnecessary word list and improves the overall efficiency of similar document search. Specifically, the only user operation needed to create the unnecessary word list is one of the following: designating one or a few unnecessary words, designating the number of unnecessary words, or designating a reference value for the appearance frequency. With only such a simple designation made in advance by the user, optimal and appropriate unnecessary words can be listed without omission, and the accuracy of similar document search can be improved and stabilized.

[0063] In this embodiment, the appearance frequency of a word designated by the user as a representative unnecessary word is used as the reference value, and all words whose appearance frequency is equal to or greater than this reference value are set as unnecessary words. Alternatively, a fixed margin may be provided on the low side of the reference value, so that words whose appearance frequency falls within this margin are also determined to be unnecessary words.

[0064]

[Effects of the Invention] As described above, according to the present invention, the user does not have to register one by one the words that should be treated as unnecessary words when creating the unnecessary word list. For example, the user can create the desired unnecessary word list simply by designating one or a few unnecessary words, designating the number of unnecessary words, or entering a reference value for the appearance frequency. This raises the overall efficiency of similar document search. In addition, because optimal and appropriate unnecessary words can be extracted without omission for each type of search target document database, the accuracy of similar document search can be improved and stabilized.

[Brief description of the drawings]

FIG. 1 is a diagram showing the hardware configuration of a similar document search device according to an embodiment of the present invention.

FIG. 2 is a functional block diagram showing the configuration of the control device of the similar document search device of FIG. 1.

FIG. 3 is a flowchart showing the procedure for creating the document database and the unnecessary word list.

FIG. 4 is a diagram showing an example of unnecessary words specified by the user.

FIG. 5 is a diagram showing an example of the number of unnecessary words (rank in order of appearance frequency) specified by the user.

FIG. 6 is a diagram showing an example of an appearance frequency reference value specified by the user.

FIG. 7 is a diagram showing an example of a search target document.

FIG. 8 is a diagram showing an example of the words and appearance frequencies stored in the search target word information storage buffer.

FIG. 9 is a diagram showing an example of the words of all search target documents and the statistics of their appearance frequencies stored in the all search target word information storage buffer.

FIG. 10 is a diagram showing an example of the unnecessary words stored in the unnecessary word buffer.

FIG. 11 is a flowchart showing the procedure of the similar document search operation.

FIG. 12 is a diagram showing an example of a search key document.

FIG. 13 is a diagram showing an example of the words and appearance frequencies stored in the search key word information storage buffer.

FIG. 14 is a diagram showing an example of the common words and appearance frequencies stored in the common word information storage buffer.

FIG. 15 is a diagram showing an example of the similarities between the search key document and the search target documents stored in the similarity storage buffer.

FIG. 16 is a diagram showing an example of a similar document search result.

[Explanation of symbols]

200: Main processing unit
201: Initialization unit
202: Input unit
203: Output unit
204: Search target document reading unit
205: Search target document word extraction unit
206: Search target word appearance frequency calculation unit
207: Search target word information calculation unit
208: Search target word statistics calculation unit
209: Unnecessary word setting unit
210: Unnecessary word list creation unit
211: Unnecessary word list reading unit
212: Search key document input unit
213: Search key word extraction unit
214: Search key word appearance frequency calculation unit
215: Search target word information reading unit
216: Common word extraction unit
217: Unnecessary word removing unit
218: Similarity calculation unit
219: Search result output unit

Continued from the front page
(72) Inventor: Yukio Nakamoto, c/o Toshiba Computer Engineering Corp., 1381-1 Shinmachi, Ome-shi, Tokyo
(72) Inventor: Takuya Nishina, c/o Toshiba Computer Engineering Corp., 1381-1 Shinmachi, Ome-shi, Tokyo
(72) Inventor: Naohide Kubota, c/o Toshiba Computer Engineering Corp., 1381-1 Shinmachi, Ome-shi, Tokyo

Claims

[Claims]

1. A similar document search device that uses a given document as a search key document and searches a plurality of search target documents for a document similar to the search key document, comprising: document data storage means in which a plurality of document data including the search key document and the search target documents are stored; word designation means for designating an arbitrary word; word extraction means for extracting words from each document data stored in the document data storage means; appearance frequency calculation means for calculating the appearance frequency, in each of the document data, of the arbitrary word designated by the word designation means and of each word extracted by the word extraction means; unnecessary word determination means for determining at least some of the words extracted by the word extraction means to be unnecessary words, based on the appearance frequency of the arbitrary word calculated by the appearance frequency calculation means; and means for calculating the similarity between the search key document and the search target document after removing from each of them the unnecessary words determined by the unnecessary word determination means.

2. A similar document search device that uses a given document as a search key document and searches a plurality of search target documents for a document similar to the search key document, comprising: document data storage means in which a plurality of document data including the search key document and the search target documents are stored; word designation means for designating an arbitrary word; word extraction means for extracting words from each document data stored in the document data storage means; appearance frequency calculation means for calculating the appearance frequency, in each of the document data, of the arbitrary word designated by the word designation means and of each word extracted by the word extraction means; unnecessary word determination means for determining to be unnecessary words those words, among the words extracted by the word extraction means, whose appearance frequency calculated by the appearance frequency calculation means is equal to or greater than the appearance frequency calculated for the arbitrary word; and means for calculating the similarity between the search key document and the search target document after removing from each of them the unnecessary words determined by the unnecessary word determination means.

3. A similar document search device that uses a given document as a search key document and searches a plurality of search target documents for a document similar to the search key document, comprising: document data storage means in which a plurality of document data including the search key document and the search target documents are stored; word designation means for designating a plurality of arbitrary words; word extraction means for extracting words from each document data stored in the document data storage means; appearance frequency calculation means for calculating the appearance frequency, in each of the document data, of the plurality of arbitrary words designated by the word designation means and of each word extracted by the word extraction means; unnecessary word determination means for determining to be unnecessary words those words, among the words extracted by the word extraction means, whose appearance frequency calculated by the appearance frequency calculation means is equal to or greater than the minimum of the appearance frequencies calculated for the plurality of arbitrary words; and means for calculating the similarity between the search key document and the search target document after removing from each of them the unnecessary words determined by the unnecessary word determination means.

4. A similar document search device that uses a given document as a search key document and searches a plurality of search target documents for a document similar to the search key document, comprising: document data storage means in which a plurality of document data including the search key document and the search target documents are stored; word extraction means for extracting words from each document data stored in the document data storage means; unnecessary word number designation means for arbitrarily designating the number of unnecessary words among the words extracted by the word extraction means; appearance frequency calculation means for calculating the appearance frequency, in each of the document data, of each word extracted by the word extraction means; unnecessary word determination means for determining to be unnecessary words, among the words extracted by the word extraction means, the number of words designated by the unnecessary word number designation means, giving priority to words with a higher appearance frequency calculated by the appearance frequency calculation means; and means for calculating the similarity between the search key document and the search target document after removing from each of them the unnecessary words determined by the unnecessary word determination means.

5. A similar document search device that uses a given document as a search key document and searches a plurality of search target documents for a document similar to the search key document, comprising: document data storage means in which a plurality of document data including the search key document and the search target documents are stored; word extraction means for extracting words from each document data stored in the document data storage means; appearance frequency calculation means for calculating the appearance frequency, in each of the document data, of each word extracted by the word extraction means; appearance frequency designation means for designating an arbitrary appearance frequency; unnecessary word determination means for determining to be unnecessary words those words, among the words extracted by the word extraction means, whose appearance frequency calculated by the appearance frequency calculation means is equal to or greater than the arbitrary appearance frequency designated by the appearance frequency designation means; and means for calculating the similarity between the search key document and the search target document after removing from each of them the unnecessary words determined by the unnecessary word determination means.

6. A similar document search method that uses a given document as a search key document and searches a plurality of search target documents for a document similar to the search key document, comprising the steps of: designating an arbitrary word; extracting words from a plurality of document data including the search key document and the search target documents; calculating the appearance frequency, in each of the document data, of the designated arbitrary word and of each extracted word; determining at least some of the extracted words to be unnecessary words, based on the calculated appearance frequency of the arbitrary word; and calculating the similarity between the search key document and the search target document after removing the determined unnecessary words from each of them.

7. A recording medium on which is recorded a program for searching a plurality of search target documents for a document similar to a search key document, using a given document as the search key document, the program comprising: word designation means for designating an arbitrary word; word extraction means for extracting words from a plurality of document data including the search key document and the search target documents; appearance frequency calculation means for calculating the appearance frequency, in each of the document data, of the arbitrary word designated by the word designation means and of each word extracted by the word extraction means; unnecessary word determination means for determining at least some of the words extracted by the word extraction means to be unnecessary words, based on the appearance frequency of the arbitrary word calculated by the appearance frequency calculation means; and means for calculating the similarity between the search key document and the search target document after removing from each of them the unnecessary words determined by the unnecessary word determination means.