JPH1185764A

JPH1185764A - Method and apparatus for statistically estimating the number of search results, and storage medium storing program for statistically estimating the number of search results

Info

Publication number: JPH1185764A
Application number: JP9241424A
Authority: JP
Inventors: Kazuhiro Hayakawa; 和宏早川; Takashi Inoue; 孝史井上; Masakatsu Ookubo; 雅且大久保; Kazuo Tanaka; 一男田中
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1997-09-05
Filing date: 1997-09-05
Publication date: 1999-03-30

Abstract

(57) [Abstract] [Problem] To provide a method and apparatus for statistically estimating the number of search results, and a storage medium storing a program for statistically estimating the number of search results, that can calculate the approximate change in the number of search results when another search term is added to the results obtained with a given search term, without retaining the entire search result set. [Solution] The invention uses a part of the search result set matching a search term to estimate the proportion of the entire search result set matching that search term that contains an arbitrary search term.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a method and apparatus for statistically estimating the number of search results, and to a storage medium storing a program for statistically estimating the number of search results. More particularly, it relates to such a method, apparatus, and storage medium for determining, in a data search system, how the number of search results changes when a search term is added to further narrow the results.

[0002]

[Prior Art] In a data search system, it is important to narrow the search results down to a number suitable for browsing. For this purpose, document databases and the like have conventionally used a method in which the set of data obtained by the searches so far is retained during the search process, and the search result set is reduced by successively adding search terms.

[0003]

[Problem to Be Solved by the Invention] The conventional method, however, requires a storage area for holding the search result set, and also requires that the history of the searches performed so far be retained. These conditions are difficult to satisfy when tens of thousands of search requests must be accepted at once, or when searches are performed over a communication protocol that cannot maintain a history. In such cases, either narrowing must be disallowed or, each time a search term is added, a new search query must be generated and the search performed again from the beginning.

[0004] The present invention has been made in view of the above points. Its object is to provide a method and apparatus for statistically estimating the number of search results, and a storage medium storing a program for statistically estimating the number of search results, that can calculate the approximate change in the number of search results when another search term is added to the results obtained with a given search term, without retaining the entire search result set.

[0005]

[Means for Solving the Problem] The first invention is a method for statistically estimating the number of search results, used in a data search system to determine how the number of search results changes when a search term is added to further narrow the results. The method uses a part of the search result set matching a search term to estimate the proportion of the entire search result set matching that search term that contains an arbitrary search term.

[0006] The second invention is a method for statistically estimating the number of search results, used in a data search system to determine how the number of search results changes when a search term is added to further narrow the results. The method uses a part of the document set matching a search term to generate and present a list of words that can serve as additional search terms, so that the user can choose the search term to add, and at the same time estimates the proportion of the entire document set matching the search term that contains the added search term.

[0007] The third invention is a method for statistically estimating the number of search results, used in a data search system to determine how the number of search results changes when a search term is added to further narrow the results. The method uses a part of the search result set matching a search term to estimate the number of search results obtained when another search term is added to that search term.

[0008] The fourth invention is a method for statistically estimating the number of search results, used in a data search system to determine how the number of search results changes when a search term is added to further narrow the results. The method uses a part of the document set matching a search term to generate and present a list of words that can serve as additional search terms, so that the user can choose the search term to add, and at the same time estimates the number of search results obtained when the search term is added.

[0009] FIG. 1 is a diagram showing the principle of the present invention. The fifth invention is an apparatus for statistically estimating the number of search results, used in a data search system to determine how the number of search results changes when a search term is added to further narrow the results. The apparatus has search result sample extracting means 110 for extracting a part of the search result set matching a search term, and additional search term ratio calculating means 120 for receiving the search result set and an additional search term and, using the part of the search result set, obtaining and outputting the proportion of the search result set that contains the additional search term.

[0010] The sixth invention is an apparatus for statistically estimating the number of search results, used in a data search system to determine how the number of search results changes when a search term is added to further narrow the results. The apparatus has: search result sample extracting means for extracting a part of the search result set matching a search term; search term candidate generating means for producing, as search term candidates, a list of items that can serve as additional search terms from the part of the search result set extracted by the search result sample extracting means; selecting means for selecting, from the search term candidates, the search term corresponding to a selection signal received from the user and making it the additional search term; and additional search term ratio calculating means for estimating and outputting the proportion of the part of the search result set that matches the additional search term.

[0011] The seventh invention is an apparatus for statistically estimating the number of search results, used in a data search system to determine how the number of search results changes when a search term is added to further narrow the results. The apparatus has search term candidate generating means for obtaining the total number of search results and a part of the search result set matching a search term, and additional search term count calculating means for estimating and outputting the estimated number of items containing the additional search term, based on the part of the search result set.

[0012] The eighth invention is an apparatus for statistically estimating the number of search results, used in a data search system to determine how the number of search results changes when a search term is added to further narrow the results. The apparatus has: search term candidate generating means for obtaining the total number of search results and extracting a part of the document set matching a search term; search term candidate generating means for producing, as search term candidates, a list of search terms that can serve as additional search terms from the total number of search results and the document set; selecting means for making the search term that corresponds to a selection signal obtained from the user, from among the search term candidates, the additional search term; and additional search term count calculating means for estimating and outputting the number of items in the document set for the additional search term.

[0013] The ninth invention is a storage medium storing a program for statistically estimating the number of search results, used in a data search system to determine how the number of search results changes when a search term is added to further narrow the results. The program has a search result sample extraction process that extracts a part of the search result set matching a search term, and an additional search term ratio calculation process that receives the search result set and an additional search term and, using the part of the search result set, obtains and outputs the proportion of the search result set that contains the additional search term.

[0014] The tenth invention is a storage medium storing a program for statistically estimating the number of search results, used in a data search system to determine how the number of search results changes when a search term is added to further narrow the results. The program has: a search result sample extraction process that extracts a part of the search result set matching a search term; a search term candidate generation process that produces, as search term candidates, a list of items that can serve as additional search terms from the part of the search result set extracted by the search result sample extraction process; a selection process that selects, from the search term candidates, the search term corresponding to a selection signal received from the user and makes it the additional search term; and an additional search term ratio calculation process that estimates and outputs the proportion of the part of the search result set that matches the additional search term.

[0015] The eleventh invention is a storage medium storing a program for statistically estimating the number of search results, used in a data search system to determine how the number of search results changes when a search term is added to further narrow the results. The program has a search term candidate generation process that obtains the total number of search results and a part of the search result set matching a search term, and an additional search term count calculation process that estimates and outputs the estimated number of items containing the additional search term, based on the part of the search result set.

[0016] The twelfth invention is a storage medium storing a program for statistically estimating the number of search results, used in a data search system to determine how the number of search results changes when a search term is added to further narrow the results. The program has: a search term candidate generation process that obtains the total number of search results and extracts a part of the document set matching a search term; a search term candidate generation process that produces, as search term candidates, a list of search terms that can serve as additional search terms from the total number of search results and the document set; a selection process that makes the search term that corresponds to a selection signal obtained from the user, from among the search term candidates, the additional search term; and an additional search term count calculation process that estimates and outputs the number of items in the document set for the additional search term.

[0017] Of the above, the first, fifth, and ninth inventions use a part of the search result set matching a search term to estimate the proportion of the entire search result set matching that search term that contains an arbitrary search term. This makes it possible to estimate the approximate change in the number of search results when another search term is added to a given search term, without retaining the entire search result set and without searching again.

[0018] The second, sixth, and tenth inventions use a part of the document set matching a search term to generate and present a list of words that can serve as additional search terms, so that the user can choose the search term to add, and at the same time estimate the proportion of the entire document set matching the search term that contains the added search term. This makes it possible to estimate the approximate change in the number of search results when another search term is added to a given search term, without retaining the entire search result set and without searching again, while also presenting candidate additional search terms to the user.

[0019] The third, seventh, and eleventh inventions use a part of the search result set matching a search term to estimate the search results obtained when another search term is added to that search term. This makes it possible to calculate the approximate number of search results when another search term is added to a given search term, without retaining the entire search result set and without searching again. The fourth, eighth, and twelfth inventions use a part of the document set matching a search term to generate and present a list of words that can serve as additional search terms, so that the user can choose the search term to add, and at the same time estimate the number of search results obtained when the search term is added. This makes it possible to estimate the approximate number of search results when another search term is added to a given search term, without searching again, while also presenting candidate additional search terms to the user.

[0020] As described above, the present invention retains only a part of the search results. When an additional search term is given, instead of searching again, it obtains the proportion of the retained partial results that contain the additional search term, and from that value calculates the proportion of the entire search result set that contains the additional search term. The calculated value is therefore not perfectly accurate, but in exchange there is no need to retain the entire search result set.

[0021]

[Embodiments of the Invention] First, as a first embodiment, a case is described in which a part of the search result set matching a search term is used to estimate the proportion of the entire search result set matching that search term that contains an arbitrary search term. FIG. 2 shows the configuration of the first estimating apparatus of the present invention.

[0022] The apparatus 100 shown in FIG. 2 comprises an overall search unit 101 and a sample search unit 102. The overall search unit 101 receives a search term 103, searches all the data for items matching the search term 103, and outputs a part of the matching items to the sample search unit 102 as a search result sample set 104.

[0023] The sample search unit 102 receives the search result sample set 104 and, from outside, an additional search term 105. It divides the number of elements of the search result sample set 104 that match the additional search term 105 by the number of elements of the search result sample set 104, and outputs the quotient as an appearance rate 106. The appearance rate 106 calculated in this way is based only on the search result sample set 104; if higher accuracy is desired, the search result sample set 104 should be chosen sufficiently large.
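The calculation performed by the sample search unit 102 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `appearance_rate`, the representation of each sampled result as a plain string, and the use of substring containment as the matching test are all assumptions made for the example.

```python
def appearance_rate(sample, additional_term):
    """Fraction of the sampled results that contain the additional term."""
    if not sample:
        raise ValueError("sample must not be empty")
    # "Matches" is modeled as substring containment (an assumption).
    matched = sum(1 for doc in sample if additional_term in doc)
    return matched / len(sample)


# Hypothetical sample set 104 of four results for some first search term.
sample_104 = [
    "movie review of a new release",
    "movie introduction and cast list",
    "introduction to a classic movie",
    "movie theater schedule",
]
rate_106 = appearance_rate(sample_104, "introduction")
print(rate_106)  # 0.5
```

Only the sample is touched, so the cost of the estimate depends on the sample size rather than on the size of the full result set.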

[0024] Next, as a second embodiment, a case is described in which a part of the document set matching a search term is used to generate and present a list of words that can serve as additional search terms, so that the user can choose the search term to add, and at the same time the proportion of the entire document set matching the search term that contains the added search term is estimated. FIG. 3 shows the configuration of the second estimating apparatus of the present invention.

[0025] The apparatus 200 shown in FIG. 3 comprises an overall search unit 201, a search term candidate generation unit 202, a search term selection unit 203, and a sample search unit 204. The overall search unit 201 receives a search term 205, searches all the data for items matching the search term 205, and outputs a part of the matching items as a search result sample set 206.

[0026] The search term candidate generation unit 202 receives the search result sample set 206 and outputs, as search term candidates 207, a list of items that can serve as the additional search term 209. The method of generating the search term candidates 207 depends on the type of data. For data to which keywords have been assigned in advance, for example, the set of keywords assigned to the individual items of the search result sample set 206 can be used. When the data is text, the text can be divided into words by morphological analysis, and the search term candidates 207 can be generated by an existing keyword estimation technique (for example, TF*IDF values).
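For the pre-assigned keyword case, candidate generation amounts to taking the union of the keywords attached to the sampled items. The sketch below assumes each sampled item is represented only by its set of keywords; the function name `candidates_from_keywords` and the step that drops terms already used in the query are illustrative choices, not something the description specifies.

```python
def candidates_from_keywords(sample_keywords, current_terms=()):
    """Union of the keywords attached to each sampled item.

    Terms already used in the query are removed (an assumption made here,
    since re-adding them would not narrow the results).
    """
    candidates = set()
    for keywords in sample_keywords:
        candidates.update(keywords)
    return sorted(candidates - set(current_terms))


# Hypothetical keyword sets for three items of sample set 206.
sample_206 = [
    {"movie", "review"},
    {"movie", "introduction"},
    {"movie", "schedule"},
]
print(candidates_from_keywords(sample_206, current_terms=["movie"]))
# ['introduction', 'review', 'schedule']
```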

[0027] The search term selection unit 203 receives the search term candidates 207 and a selection signal 208 from the user, and outputs the search term 205 corresponding to the selection signal 208 as the additional search term 209. The sample search unit 204 receives the search result sample set 206 and the additional search term 209, divides the number of elements of the search result sample set 206 that match the additional search term 209 by the number of elements of the search result sample set 206, and outputs the quotient as an appearance rate 210. That is,

appearance rate = MT / T

where MT is the number of elements of the search result sample set that match the additional search term, and T is the number of elements of the search result sample set.

Next, as a third embodiment, a case is described in which a part of the search result set matching a search term is used to estimate the number of search results obtained when another search term is added to that search term.

[0028] FIG. 4 is a block diagram of the third estimating apparatus of the present invention. The estimating apparatus 300 shown in FIG. 4 has the same configuration as that shown in FIG. 2, except that the overall search unit 301 outputs not only the search result sample set but also the number of results for the search term, as the total number of search results 304. The sample search unit 302 obtains the search result sample set 305 and an additional search term 306, divides the number of elements of the search result sample set 305 that match the additional search term 306 by the number of elements of the search result sample set 305 to obtain the appearance rate, and outputs the appearance rate multiplied by the total number of search results 304 as the estimated count.

[0029] Next, as a fourth embodiment, a case is described in which a part of the document set matching a search term is used to generate and present a list of words that can serve as additional search terms, so that the user can select the search term to add, and the number of search results obtained when the search term is added is estimated. FIG. 5 is a block diagram of the fourth estimating apparatus of the present invention.

[0030] The estimating apparatus 400 shown in FIG. 5 has the same configuration as that shown in FIG. 3, except that the overall search unit 401 outputs not only the search result sample set but also the number of results for the search term, as the total number of search results 406. The sample search unit 404 obtains the search result sample set 407 and, from the search term selection unit 403, an additional search term 410, and outputs the number of items in the search result sample set 407 that match the additional search term 410 as the estimated count 411.

[0031] As in the configuration shown in FIG. 3, the search term selection unit 403 obtains the search term candidates 408 from the search term candidate generation unit 402 and a selection signal 409 from the user, and selects from the search term candidates 408 the search term corresponding to the selection signal.

[0032]

[Example] Next, an example of the present invention is described with reference to the drawings. FIG. 6 shows the configuration of an estimating apparatus according to one example of the present invention. The estimating apparatus 500 shown in FIG. 6 is based on the fourth estimating apparatus 400 shown in FIG. 5 and targets document retrieval by full-text search.

[0033] The estimating apparatus 500 shown in FIG. 6 comprises a text database 501, a document-word matrix generation unit 502, a count estimation unit 503, a search term candidate presentation unit 504, and a search term selection unit 505. When given a search term 506, the text database 501 searches the database. If the number of results is small enough, the search operation can end and the results can be output. If there are too many results and narrowing is needed, the text database 501 randomly extracts 50 of the search results and outputs them as a document sample set 507. It also outputs the total number of search results 508. The sample size is not limited to 50; an optimal value may be chosen according to the capacity of the computer. The document sample set 507 is assumed to have already been divided into words by morphological analysis. In full-text search, the index file is usually created by morphological analysis, so the documents can be expected to have been morphologically analyzed before any search is performed.
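The sampling step can be sketched with Python's standard `random.sample`, which draws without replacement. The function name `search_and_sample`, the `browse_limit` threshold, and the representation of search results as a list of document identifiers are assumptions for illustration; the description only says that 50 results are drawn at random when there are too many to browse.

```python
import random


def search_and_sample(matching_ids, sample_size=50, browse_limit=50, rng=None):
    """Return (total_count, sample) for the IDs that matched a search term.

    browse_limit is a hypothetical threshold below which no narrowing is needed.
    """
    if rng is None:
        rng = random.Random()
    ids = list(matching_ids)
    total = len(ids)
    if total <= browse_limit:
        # Few enough results to browse directly: return them all.
        return total, ids
    # Too many results: keep only a uniform random sample without replacement.
    return total, rng.sample(ids, min(sample_size, total))


# 1234 matching documents, as in the screen example of FIG. 8.
total_508, sample_507 = search_and_sample(range(1234), rng=random.Random(0))
print(total_508, len(sample_507))  # 1234 50
```

Only `total_508` and the 50 sampled documents need to be kept afterwards; the remaining matches can be discarded.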

[0034] The document-word matrix generation unit 502 receives the document sample set 507 and, for each document, counts the number of occurrences of each word. It performs this operation over the whole document sample set 507, and generates and outputs a document-word matrix 509 such as that shown in FIG. 7. The document-word matrix 509 is information needed by both the count estimation unit 503 and the search term candidate presentation unit 504, which is why the document-word matrix generation unit 502 is provided to generate it; in practice, the document-word matrix generation unit 502 is a lower-level module of the sample search unit 404 and the search term candidate generation unit 402 in FIG. 5. In the document-word matrix 509 shown in FIG. 7, the rows and columns represent document identifiers and the list of search terms, respectively, and the value in each cell indicates how many times the search term of the corresponding column occurs in the document of the corresponding row.
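A document-word matrix of the kind described can be built as in the sketch below. It assumes the documents are already tokenized (the morphological analysis step is outside the example) and uses the standard-library `collections.Counter`; the function name `build_doc_word_matrix` and the sample documents are hypothetical.

```python
from collections import Counter


def build_doc_word_matrix(tokenized_docs):
    """tokenized_docs maps a document identifier to its list of words.

    Returns (doc_ids, vocabulary, matrix) with matrix[i][j] equal to the
    number of times vocabulary[j] occurs in document doc_ids[i].
    """
    doc_ids = list(tokenized_docs)
    counts = [Counter(tokenized_docs[d]) for d in doc_ids]
    vocabulary = sorted(set().union(*counts))
    # Counter returns 0 for words a document does not contain.
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return doc_ids, vocabulary, matrix


docs = {
    "doc1": ["movie", "introduction", "movie"],
    "doc2": ["movie", "review", "review", "review"],
    "doc3": ["introduction", "movie"],
}
doc_ids, vocabulary, matrix_509 = build_doc_word_matrix(docs)
print(vocabulary)   # ['introduction', 'movie', 'review']
print(matrix_509)   # [[1, 2, 0], [0, 1, 3], [1, 1, 0]]
```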

[0035] The search term candidate presentation unit 504 receives the document-word matrix 509, removes from the list words that are unsuitable as a search term 506, such as particles and pronouns, and then calculates how important a given word is for a given document as follows.

word importance = (number of occurrences of the word in the document) × log{50 / (number of documents in which the word occurs)}

This formula is the measure known as TF*IDF, and 50 is the number of elements of the document sample set 507. As is easily seen, the word importance is high for words that occur intensively in particular documents; conversely, the importance of a word that occurs in all documents is 0.
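The word importance formula can be applied to a document-word matrix as sketched below. Several details are assumptions, since the description does not fix them: the natural logarithm is used, stop words are passed in as a set, and `top_candidates` ranks each word by its highest importance in any sampled document. The sample size is taken from the matrix itself rather than hard-coded to 50.

```python
import math


def word_importance(matrix, vocabulary, stop_words=frozenset()):
    """scores[i][word] = tf * log(N / df), with N the number of sampled documents."""
    n_docs = len(matrix)
    scores = [dict() for _ in matrix]
    for j, word in enumerate(vocabulary):
        if word in stop_words:
            continue  # particles, pronouns, and similar unsuitable words
        df = sum(1 for row in matrix if row[j] > 0)
        if df == 0:
            continue
        idf = math.log(n_docs / df)  # natural log (an assumption)
        for i, row in enumerate(matrix):
            if row[j] > 0:
                scores[i][word] = row[j] * idf
    return scores


def top_candidates(scores, k):
    """Rank words by their best per-document score (a hypothetical aggregation)."""
    best = {}
    for doc_scores in scores:
        for word, s in doc_scores.items():
            best[word] = max(s, best.get(word, 0.0))
    return sorted(best, key=lambda w: (-best[w], w))[:k]


vocab = ["introduction", "movie", "review"]
matrix = [[1, 2, 0], [0, 1, 3], [1, 1, 0], [0, 1, 0]]
scores = word_importance(matrix, vocab)
print(scores[0]["movie"])         # 0.0, because "movie" occurs in every document
print(top_candidates(scores, 2))  # ['review', 'introduction']
```

The zero score for "movie" matches the remark above: a word present in every sampled document cannot narrow the results.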

[0036] As a result of the above calculation, the search word candidate presentation unit 504 outputs words having high weights as the search word candidates 510. The search word selection unit 505 receives the search word candidates 510, displays them on the screen, and has the user select the additional search word 512. The user's selection is input as the selection signal 511. The search word selection unit 505 outputs the search word 506 corresponding to the selection signal 511 as the additional search word 512.
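
Candidate selection as described above might look like the sketch below. The source does not say how the per-document importance values are combined into a single ranking per word; taking the maximum over documents is an assumption made here for illustration, as are the stop-word set, the natural logarithm, and all names.

```python
import math


def rank_candidates(vocabulary, matrix, stop_words, top_k):
    """Return the top_k candidate words by TF*IDF-style weight.

    vocabulary: list of words (matrix columns).
    matrix: list of rows, one per sampled document, of word counts.
    stop_words: words unsuitable as search words (hypothetical list).
    Assumption: a word's weight is its maximum importance over documents.
    """
    n_docs = len(matrix)
    scores = {}
    for col, word in enumerate(vocabulary):
        if word in stop_words:
            continue  # drop particles, pronouns, and similar words
        column = [row[col] for row in matrix]
        doc_freq = sum(1 for c in column if c > 0)
        if doc_freq == 0:
            continue
        idf = math.log(n_docs / doc_freq)
        scores[word] = max(column) * idf
    # Highest weight first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


vocabulary = ["it", "introduction", "movie", "review"]
matrix = [
    [1, 0, 2, 1],
    [1, 3, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 2],
]
candidates = rank_candidates(vocabulary, matrix, stop_words={"it"}, top_k=2)
```

In this toy sample "movie" appears in every document and so receives weight 0, which is why it is not offered as a candidate.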

[0037] The number estimation unit 503 receives and holds the total number of search results 508 and the document word matrix 509. On receiving the additional search word 512, the number estimation unit 503 examines the document word matrix 509 and counts the rows whose value in the column of the additional search word 512 is not 0. Dividing this count by the total number of rows in the matrix gives the appearance rate of the additional search word 512. Multiplying the appearance rate by the total number of search results 508 gives an estimate of how many of all the matching documents would be retrieved with the additional search word 512, and this value is output as the estimated number 513.
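
The estimation steps above (count non-zero rows, divide by the number of rows, multiply by the total) can be sketched as follows. The function name, the toy data, and the rounding of the result to the nearest integer are assumptions for illustration.

```python
def estimate_result_count(matrix, vocabulary, additional_word, total_results):
    """Estimate the result count after adding a search word.

    matrix: list of rows (one per sampled document) of word counts.
    vocabulary: list of words, giving the column order of the matrix.
    additional_word: the word the user wants to add.
    total_results: total number of documents matching the original query.
    Returns (appearance_rate, estimated_count).
    """
    col = vocabulary.index(additional_word)
    # Rows (documents) in which the additional word occurs at least once.
    rows_with_word = sum(1 for row in matrix if row[col] != 0)
    appearance_rate = rows_with_word / len(matrix)
    # Rounding to an integer count is an assumption of this sketch.
    return appearance_rate, round(appearance_rate * total_results)


# Hypothetical sample of 4 documents out of 1000 matching documents.
vocabulary = ["introduction", "movie"]
matrix = [[1, 2], [0, 1], [0, 3], [2, 1]]
rate, estimate = estimate_result_count(matrix, vocabulary, "introduction", 1000)
```

Only the sample matrix and the total count are needed, so no second search over the full database is performed.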

[0038] FIG. 8 shows an example of a screen of a search system using the embodiment of FIG. 6. In the figure, the user has performed a document search with the search word "movie" and is about to add "introduction" as the additional search word 512. How the information displayed on this screen is generated is described below with reference to FIG. 6. The first row of FIG. 8 is the field for entering the search word 506; the user first entered "movie" here and ran a search. The search word 506 was sent to the text database 501, and as a result 1234 documents were found, as shown in the second row. The text database 501 randomly extracts 50 of the 1234 documents and sends them to the document word matrix generation unit 502. The 50 documents are decomposed into words by morphological analysis and output as the document word matrix 509.

[0039] The search word candidate presentation unit 504 weights the words, extracts 15 words as shown in the third row of FIG. 8, and outputs them as the search word candidates 510. The search word selection unit 505 presents the screen of FIG. 8 and accepts input from the user. Here the user specified "introduction", so "introduction" is added to the search word input field in the first row and the additional search word 512 "introduction" is sent to the number estimation unit 503.

[0040] The number estimation unit 503 searches the document word matrix 509 to find the number of documents that contain "introduction". In this case, 8 documents contained "introduction", which is 16% of the 50 documents in the document sample set. Multiplying this rate by the total number of search results, 1234, gives 197, which is output as the estimated number 513. This output corresponds to the second row of FIG. 8.

[0041] To present the estimated number 513 more rigorously, the estimate may be given a range (for example, 197 ± 20). In this case, the probability that the true value lies within this range can be calculated statistically. The present invention is not limited to the above embodiment, and various modifications and applications are possible within the scope of the claims.
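
The source does not specify how the range or its probability is computed. One common approach, shown here purely as an assumption, is the normal approximation to the binomial distribution for a sampled proportion, ignoring any finite-population correction. The function name and the default `z = 1.96` (roughly a 95% level) are illustrative choices.

```python
import math


def estimate_with_range(hits, sample_size, total_results, z=1.96):
    """Return (estimate, margin) for the estimated result count.

    hits: sampled documents containing the additional search word.
    sample_size: number of sampled documents.
    total_results: total number of documents matching the original query.
    z: normal quantile; 1.96 corresponds to about 95% (an assumption).
    Uses the normal approximation to the binomial, which is not taken
    from the source and is rough for small samples.
    """
    p = hits / sample_size
    std_err = math.sqrt(p * (1 - p) / sample_size)
    estimate = p * total_results
    margin = z * std_err * total_results
    return estimate, margin


# Figures from the example: 8 of 50 sampled documents, 1234 results in total.
estimate, margin = estimate_with_range(8, 50, 1234)
```

Under this approximation a narrower range corresponds to a lower probability that the true value falls inside it, so the width of the range and the stated confidence have to be chosen together.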

[0042]

[Effects of the Invention] As described above, according to the first, fifth, and ninth inventions, the approximate change in the number of search results when another search word is added to a given search word can be estimated without holding the entire search result set and without performing a re-search. According to the second, sixth, and tenth inventions, the approximate change in the number of search results when another search word is added to a given search word can be estimated without holding the entire search result set and without performing a re-search, and at the same time candidates for the search word to be added can be presented to the user.

[0043] According to the third, seventh, and eleventh inventions, the approximate number of search results when another search word is added to a given search word can be calculated without holding the entire search result set and without performing a re-search. According to the fourth, eighth, and twelfth inventions, the approximate number of search results when another search word is added to a given search word can be estimated without performing a re-search, and at the same time candidates for the search word to be added can be presented to the user.

[Brief description of the drawings]

FIG. 1 is a diagram showing the principle configuration of the present invention.

FIG. 2 is a configuration diagram of a first estimation device of the present invention.

FIG. 3 is a configuration diagram of a second estimation device of the present invention.

FIG. 4 is a configuration diagram of a third estimation device of the present invention.

FIG. 5 is a configuration diagram of a fourth estimation device of the present invention.

FIG. 6 is an example of an estimation device according to an embodiment of the present invention.

FIG. 7 is an example of a document word matrix according to an embodiment of the present invention.

FIG. 8 is an example of a screen of a search system according to an embodiment of the present invention.

[Explanation of symbols]

100: Statistical estimation device for the number of search results; estimation device
101, 201, 301, 401: Overall search unit
102, 204, 302, 404: Sample search unit
103, 205, 303, 405, 506: Search word
104, 206, 305, 407: Search result sample set
105, 209, 306, 410, 512: Additional search word
106: Appearance rate
110: Search result sample extraction means
120: Additional search word ratio calculation means
200, 300: Estimation device
202, 402: Search word candidate generation unit
203, 403, 505: Search word selection unit
207, 408, 510: Search word candidate
208, 409, 511: Selection signal
210: Appearance rate
304, 406, 508: Total number of search results
307, 411, 513: Estimated number
400: Estimation device
500: Estimation device
501: Text database
502: Document word matrix generation unit
503: Number estimation unit
504: Search word candidate presentation unit
507: Document sample set
509: Document word matrix

Continued from the front page (72) Inventor: Kazuo Tanaka, c/o Nippon Telegraph and Telephone Corporation, 3-19-2 Nishi-Shinjuku, Shinjuku-ku, Tokyo

Claims

[Claims]

1. A method for statistically estimating the number of search results, for finding out, in a data search system, how the number of search results changes when a search word is added to further narrow down the search results, the method being characterized by using a part of the search result set that matches a search word to estimate the proportion of the entire search result set matching that search word in which an arbitrary search word is included.

2. A method for statistically estimating the number of search results, for finding out, in a data search system, how the number of search results changes when a search word is added to further narrow down the search results, the method being characterized by using a part of the document set that matches a search word to generate and present a list of words that can serve as additional search words, having the user select an additional search word, and estimating the proportion of the entire document set matching the search word in which the added search word is included.

3. A method for statistically estimating the number of search results, for finding out, in a data search system, how the number of search results changes when a search word is added to further narrow down the search results, the method being characterized by using a part of the search result set that matches a search word to estimate the number of search results obtained when another search word is added to that search word.

4. A method for statistically estimating the number of search results, for finding out, in a data search system, how the number of search results changes when a search word is added to further narrow down the search results, the method being characterized by using a part of the document set that matches a search word to generate and present a list of words that can serve as additional search words, having the user select an additional search word, and estimating the number of search results obtained when that search word is added.

5. A device for statistically estimating the number of search results, for finding out, in a data search system, how the number of search results changes when a search word is added to further narrow down the search results, the device being characterized by comprising: search result sample extracting means for extracting a part of the search result set that matches a search word; and additional search word ratio calculating means for receiving the search result set and an additional search word, and for calculating and outputting, using the part of the search result set, the proportion of the search result set that contains the additional search word.

6. A device for statistically estimating the number of search results, for finding out, in a data search system, how the number of search results changes when a search word is added to further narrow down the search results, the device being characterized by comprising: search result sample extracting means for extracting a part of the search result set that matches a search word; search word candidate generating means for taking, as search word candidates, a list of words that can serve as additional search words from the part of the search result set extracted by the search result sample extracting means; selecting means for taking, as an additional search word, the search word corresponding to a selection signal by which a user selects from among the search word candidates; and additional search word ratio calculating means for estimating and outputting the proportion of the part of the search result set that matches the additional search word.

7. A device for statistically estimating the number of search results, for finding out, in a data search system, how the number of search results changes when a search word is added to further narrow down the search results, the device being characterized by comprising: search word candidate generating means for extracting the total number of search results and a part of the search result set that matches a search word; and additional search word number calculating means for estimating and outputting an estimated number of results containing an additional search word from the part of the search result set.

8. A device for statistically estimating the number of search results, for finding out, in a data search system, how the number of search results changes when a search word is added to further narrow down the search results, the device being characterized by comprising: search word candidate generating means for extracting the total number of search results and a part of the document set that matches a search word; search word candidate generating means for taking, as search word candidates, a list of search words that can serve as additional search words in the total number of search results and the document set; selecting means for taking, as an additional search word, the search word corresponding to a selection signal by which a user selects from among the search word candidates; and additional search word number calculating means for estimating and outputting the number of results for the additional search word from the document set.

9. A storage medium storing a program for statistically estimating the number of search results, for finding out, in a data search system, how the number of search results changes when a search word is added to further narrow down the search results, the program being characterized by comprising: a search result sample extraction process of extracting a part of the search result set that matches a search word; and an additional search word ratio calculation process of receiving the search result set and an additional search word, and calculating and outputting, using the part of the search result set, the proportion of the search result set that contains the additional search word.

10. A storage medium storing a program for statistically estimating the number of search results, for finding out, in a data search system, how the number of search results changes when a search word is added to further narrow down the search results, the program being characterized by comprising: a search result sample extraction process of extracting a part of the search result set that matches a search word; a search word candidate generation process of taking, as search word candidates, a list of words that can serve as additional search words from the part of the search result set extracted by the search result sample extraction process; a selection process of taking, as an additional search word, the search word corresponding to a selection signal by which a user selects from among the search word candidates; and an additional search word ratio calculation process of estimating and outputting the proportion of the part of the search result set that matches the additional search word.

11. A storage medium storing a program for statistically estimating the number of search results, for finding out, in a data search system, how the number of search results changes when a search word is added to further narrow down the search results, the program being characterized by comprising: a search word candidate generation process of extracting the total number of search results and a part of the search result set that matches a search word; and an additional search word number calculation process of estimating and outputting an estimated number of results containing an additional search word from the part of the search result set.

12. A storage medium storing a program for statistically estimating the number of search results, for finding out, in a data search system, how the number of search results changes when a search word is added to further narrow down the search results, the program being characterized by comprising: a search word candidate generation process of extracting the total number of search results and a part of the document set that matches a search word; a search word candidate generation process of taking, as search word candidates, a list of search words that can serve as additional search words in the total number of search results and the document set; a selection process of taking, as an additional search word, the search word corresponding to a selection signal by which a user selects from among the search word candidates; and an additional search word number calculation process of estimating and outputting the number of results for the additional search word from the document set.