JPH11203310A

JPH11203310A - Search expression creation method and apparatus

Info

Publication number: JPH11203310A
Application number: JP10005130A
Authority: JP
Inventors: Hiroyuki Nakajima; 浩之中島; Tsuyoshi Kitani; 強木谷
Original assignee: NTT Data Corp
Current assignee: NTT Data Group Corp
Priority date: 1998-01-13
Filing date: 1998-01-13
Publication date: 1999-07-30

Abstract

(57) [Abstract] [Problem] To provide a search expression creation apparatus capable of creating a search expression that takes into account a group of documents to which no designation information has been assigned, without requiring a large group of learning documents. [Solution] A document set division unit 14 calculates the mutual information used when dividing the document set output from a keyword extraction unit 32 by virtually regarding the indefinite document group as an unnecessary document group, and determines the word that maximizes the mutual information as a search keyword. The document set division unit 14 stops dividing the document set and determining search keywords either when the determined search keyword has separated the document set into a necessary document group and an unnecessary document group, or when, at that point, the number of necessary documents in the document set of that necessary document group also exceeds a predetermined ratio of the number of indefinite documents. A search expression creation unit 33 creates a search expression by combining the determined search keywords with the logical operators "and" and "or".

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to document retrieval technology applied to, for example, document databases for retrieving specific information from large volumes of stored electronic documents, and to various support systems that use pre-stored sample electronic documents to support document creation and idea development. In particular, it relates to a technique for creating, by trial and error, a search expression for efficiently retrieving documents of interest to a searcher, using keywords extracted from electronic documents.

[0002]

[Prior Art] A search expression creation apparatus is known that extracts keywords from a document database storing the electronic documents to be searched and, in cooperation with the searcher, creates the required search expression by trial and error from combinations of logical products (AND) and logical sums (OR) of those keywords.

[0003] FIG. 5 is a functional block diagram of a conventional search expression creation apparatus of this kind. The search expression creation apparatus 50 comprises the functional blocks of a keyword extraction unit 51, a document set division unit 52, and a search expression creation unit 53, which are formed when a computer reads and executes a predetermined program. Each document is assumed to carry necessary/unnecessary designation information indicating whether it is a necessary document of interest to the searcher or an unnecessary document of no interest.

[0004] The keyword extraction unit 51 extracts a plurality of keywords from each of a plurality of documents by known morphological analysis. It also outputs, as a document set, discrimination information indicating whether each keyword appears in each document and designation information indicating whether the document is necessary or unnecessary, together with a document identifier such as a document name or document number. Reference numeral 31B illustrates the contents of the document set output from the keyword extraction unit 51.

[0005] The document set division unit 52 divides the document set step by step on the basis of the discrimination information and determines the plurality of search keywords that form the basis of the search expression used for document retrieval. Dividing the document set at each step by the discrimination information of a single keyword (that is, of as few keywords as possible) makes it possible to extract the intention with which the searcher distinguished necessary documents from unnecessary ones. The search keywords determined by the document set division unit 52 are combined in the search expression creation unit 53 with the logical operators "and", "or", and "not", and are output to subsequent processing as a search expression.

[0006] The division of the document set in the document set division unit 52 is based, for example, on the known MDL (Minimum Description Length) principle. The MDL principle is a heuristic holding that "a human (searcher's) intention can be expressed more accurately by distinguishing more necessary and unnecessary documents with as few keyword combinations (search expressions) as possible." Realizing the MDL principle exactly requires a large amount of computation, however, so it is generally realized approximately in order to reduce the processing load.

[0007] The MDL principle is approximated, for example, with "ID3", a known learning algorithm for decision trees (logical expressions represented as tree structures). For details of ID3, see "Knowledge Acquisition and Learning Series 1: Introduction to Knowledge Acquisition" (Michalski, R. S. et al., eds., Kyoritsu Shuppan). An outline of document set division by the decision tree learning algorithm ID3 is given below with reference to FIG. 6.

[0008] First, the document set sent from the keyword extraction unit 51 is taken as the initial document set Set₀ (step S201). Next, the "undivided" flag of the initial document set Set₀ is turned on (step S202). This set is taken as Set_i (step S203). Next, for each keyword t_j (1 ≤ j ≤ N) contained in the necessary and unnecessary documents of the document set Set_i, the mutual information I(t_j), which expresses the relation between the information content of the individual documents and that of the documents as a whole, is calculated (step S204). Specifically, the mutual information I(t_j) is the information content H of the undivided document set minus the information content H(t_j) of the document sets that do and do not contain keyword t_j, as expressed by equation (1) below.

[0009] Equation (1): I(t_j) = H − H(t_j)

[0010] The information contents H and H(t_j) are expressed by equations (2) and (3) below, respectively.

[0011] The parameters in equations (2) and (3) are as follows.
p_i: number of necessary documents in Set_i
n_i: number of unnecessary documents in Set_i
s_i: p_i + n_i
p_i(t_j): number of necessary documents in Set_i that contain keyword t_j
n_i(t_j): number of unnecessary documents in Set_i that contain keyword t_j
s_i(t_j): p_i(t_j) + n_i(t_j)
p_i^not(t_j): number of necessary documents in Set_i that do not contain keyword t_j
n_i^not(t_j): number of unnecessary documents in Set_i that do not contain keyword t_j
s_i^not(t_j): p_i^not(t_j) + n_i^not(t_j)
h(a, b, c): −{(a/c)·log₂(a/c) + (b/c)·log₂(b/c)}
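As a rough illustration of equation (1) and the parameters above, the following Python sketch computes I(t_j) for one keyword. Equations (2) and (3) are not reproduced in this text, so the sketch assumes the standard ID3 form: H = h(p_i, n_i, s_i), and H(t_j) as the size-weighted sum of h over the documents that do and do not contain t_j. The function names and the (label, keyword set) representation of a document are illustrative assumptions, not part of the specification.

```python
import math


def h(a, b, c):
    """h(a, b, c) = -{(a/c)*log2(a/c) + (b/c)*log2(b/c)}, taking 0*log2(0) as 0."""
    if c == 0:
        return 0.0
    total = 0.0
    for x in (a, b):
        if x > 0:
            total -= (x / c) * math.log2(x / c)
    return total


def mutual_information(docs, keyword):
    """I(t_j) = H - H(t_j) for docs given as (label, keyword_set) pairs.

    label is "necessary" or "unnecessary" (hypothetical encoding).
    """
    p = sum(1 for label, _ in docs if label == "necessary")
    n = len(docs) - p
    s = p + n
    if s == 0:
        return 0.0
    # Counts restricted to documents that contain the keyword.
    p_t = sum(1 for label, kws in docs if label == "necessary" and keyword in kws)
    s_t = sum(1 for _, kws in docs if keyword in kws)
    n_t = s_t - p_t
    # Counts for documents that do not contain the keyword.
    p_not, n_not = p - p_t, n - n_t
    s_not = p_not + n_not
    big_h = h(p, n, s)
    # Assumed standard ID3 form of H(t_j): size-weighted entropy of both subsets.
    h_t = (s_t / s) * h(p_t, n_t, s_t) + (s_not / s) * h(p_not, n_not, s_not)
    return big_h - h_t
```

With two necessary documents that contain a keyword and two unnecessary documents that do not, the keyword separates the set perfectly and the function returns one bit; a keyword that appears in no document yields zero.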

[0012] Next, from the keywords t_j, the keyword t_k that maximizes the mutual information I(t_k) is selected and taken as the search keyword (step S205). If this mutual information I(t_k) is a positive finite value (> 0) (step S206), the set is divided into Set_i′, the document set consisting of the numbers of the documents that contain search keyword t_k, and Set_i″, the document set consisting of the numbers of the documents that do not contain search keyword t_k, and the "undivided" flag of each resulting document set is turned on (steps S207 to S210). The indices i′ and i″ may take any values, provided that document sets Set_i′ and Set_i″ do not already exist. If the mutual information I(t_k) is zero (= 0), the document set is not divided (step S206). The "undivided" flag of Set_i is then turned off (step S211). If any document set still has its "undivided" flag on, processing returns to step S203 (step S212: Yes) and is repeated until no document set has its "undivided" flag on. Processing ends when the "undivided" flags of all document sets are off (step S212: No).
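The loop of steps S201 to S212 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the "undivided" flag is modelled as membership in a work list, the information contents use the standard ID3 entropy form (an assumption, since equations (2) and (3) are not reproduced here), and a small tolerance replaces the exact "> 0" test to absorb floating-point rounding. All names are hypothetical.

```python
import math


def entropy(p, n):
    """Two-class entropy in bits; an empty or single-class set gives 0."""
    s = p + n
    if s == 0:
        return 0.0
    return -sum((x / s) * math.log2(x / s) for x in (p, n) if x > 0)


def counts(group):
    """Return (necessary, other) document counts for a group."""
    p = sum(1 for label, _ in group if label == "necessary")
    return p, len(group) - p


def gain(docs, kw):
    """I(t_j) = H - H(t_j) for docs given as (label, keyword_set) pairs."""
    inc = [d for d in docs if kw in d[1]]
    exc = [d for d in docs if kw not in d[1]]
    weighted = sum(len(g) / len(docs) * entropy(*counts(g)) for g in (inc, exc))
    return entropy(*counts(docs)) - weighted


def split_document_sets(initial):
    """Repeat steps S203 to S212 until no set remains flagged "undivided"."""
    undivided = [initial]            # S201-S202: flag the initial set
    finished, chosen = [], []
    while undivided:                 # S212: any flagged set left?
        current = undivided.pop()    # S203: take one flagged set as Set_i
        keywords = sorted(set().union(*(kws for _, kws in current)))
        # S204-S205: keyword with the largest mutual information
        best = max(keywords, key=lambda k: gain(current, k), default=None)
        if best is not None and gain(current, best) > 1e-12:    # S206
            chosen.append(best)
            undivided.append([d for d in current if best in d[1]])      # Set_i'
            undivided.append([d for d in current if best not in d[1]])  # Set_i''
        else:
            finished.append(current)  # zero gain: the set is not divided
        # S211: `current` has left the work list, i.e. its flag is off
    return chosen, finished
```

Calling `split_document_sets` on a small labelled collection returns the keywords chosen along the way and the final, no longer divisible document sets.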

[0013] The processing by the ID3 algorithm described above can also be replaced with, for example, the known algorithm "C4.5". For details of C4.5, see "C4.5: Programs for Machine Learning" (Quinlan, J. R., Morgan Kaufmann Publishers).

[0014] FIG. 7 is an explanatory diagram showing how the search expression creation apparatus 50 divides one document set into a plurality of document sets and creates a search expression by trial and error. The conventional procedure for creating a search expression is described below with reference to FIG. 7. First, from the initial document set Set₀ output by the keyword extraction unit 51, the keyword that maximizes the mutual information is determined on the basis of the decision tree learning algorithm ID3 described above and is taken as the search keyword. Suppose here that search keyword kwd3 is determined. This search keyword kwd3 then divides the initial document set Set₀ into Set₁, the set of necessary documents that contain kwd3, and Set₂, the set of necessary and unnecessary documents that do not contain kwd3.

[0015] Document set Set₁ cannot be divided any further by search keyword kwd3, whereas document set Set₂ can be divided further. Therefore, the search keyword kwd2 that maximizes the mutual information in Set₂ is determined, and kwd2 divides Set₂ into Set₃, the set of unnecessary documents that do not contain kwd2, and Set₄, the set of necessary and unnecessary documents that contain kwd2.

[0016] Because document set Set₄ can be divided further, the keyword kwd1 that maximizes the mutual information in Set₄ is determined as the search keyword, and Set₄ is divided into Set₅, the set of necessary documents that contain kwd1, and Set₆, the set of documents that do not contain kwd1. Neither Set₅ nor Set₆ can be divided any further, so the division processing ends. The search keywords kwd1 to kwd3 determined during the division are stored successively in storage means (not shown) and are passed to the search expression creation unit 53 when the division processing ends.

[0017] The search expression creation unit 53 combines the search keywords resulting from the document set division unit 52 with the logical operators "and", "or", and "not" to create the search expression query. Reference numeral 53B illustrates the search expression output from the search expression creation unit 53.
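One plausible way to combine the determined keywords, sketched below in Python, is to turn each path from the initial document set to a set of necessary documents into an "and" clause and to join the clauses with "or". The actual expression shown at reference numeral 53B is not reproduced in this text, so the output format, the function name `build_query`, and the path representation are assumptions made for illustration only.

```python
def build_query(paths):
    """Join each path's terms with "and", then join the paths with "or".

    paths: list of paths; each path is a list of (keyword, present) pairs
    leading from the initial set to a set of necessary documents.
    """
    clauses = []
    for path in paths:
        terms = [kw if present else f"not {kw}" for kw, present in path]
        clause = " and ".join(terms)
        # Parenthesize multi-term clauses so the "or" grouping is unambiguous.
        clauses.append(f"({clause})" if len(terms) > 1 else clause)
    return " or ".join(clauses)


# Paths to the necessary sets in the FIG. 7 walk-through:
# Set1 contains kwd3; Set5 lacks kwd3 and contains kwd2 and kwd1.
paths = [
    [("kwd3", True)],
    [("kwd3", False), ("kwd2", True), ("kwd1", True)],
]
print(build_query(paths))  # kwd3 or (not kwd3 and kwd2 and kwd1)
```

Keeping the negated term in the second clause mirrors the decision path literally; a simplifier could drop it, since the first clause already covers every document that contains kwd3.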

[0018]

[Problems to be Solved by the Invention] In the ID3 decision tree learning algorithm described above, the search keywords used to divide the document set are determined on the basis of the necessary/unnecessary designation information assigned to the documents and the mutual information, so retrieval accuracy cannot be improved unless a large group of documents carrying designation information from the searcher is used as the learning document group. In other words, creating an accurate search expression requires a large number of learning documents.

[0019] However, the designation information is judged and assigned by the searcher personally, and the number of documents needed to create an accurate search expression has not always been provided. As a result, keywords contained in documents without designation information are not reflected in the search expression even when they are important to the searcher, and an accurate search expression could not be created. This is because, when the document set is divided completely on the basis of the decision tree learning algorithm ID3, the resulting search expression corresponds only to the necessary documents designated by the searcher. That is, the documents retrieved by that search expression are the learning documents already known to the searcher, and a search expression that corresponds only to the necessary documents the searcher designated is of little practical use. Moreover, if the searcher tried to judge and assign designation information to every document one by one, an enormous amount of work would be required, increasing the burden on the searcher.

[0020] Against this background, an object of the present invention is to provide an improved search expression creation method that enables a computer to determine search keywords and create a search expression taking into account documents without designation information, without requiring a large group of learning documents. Another object of the present invention is to provide a search expression creation apparatus suitable for carrying out that method.

[0021]

[Means for Solving the Problems] To solve the above problems, the present invention provides the following two forms of search expression creation method. (1) A method comprising: a step of extracting a plurality of words from a learning document group, in which necessary/unnecessary designation information has been assigned to each document in advance, and from an indefinite document group, whose designation information is unknown, and detecting the occurrence of each extracted word in each document; a step of determining, as a search keyword, the single word that maximizes the mutual information obtained as the difference between the total information content of the learning document group and the indefinite document group and the information content of the document groups that do and do not contain that word; a step of suppressing determination of the search keyword when the document groups that do and do not contain the search keyword can be distinguished into a necessary document group and an unnecessary document group; and a step of creating a search expression for document retrieval by combining the one or more determined search keywords in a logical expression; whereby a search expression reflecting the words of the indefinite document group is created.

[0022] (2) A method comprising: a step of extracting a plurality of words from a learning document group, in which necessary/unnecessary designation information has been assigned to each document in advance, and from an indefinite document group, whose designation information is unknown, and detecting the occurrence of each extracted word in each document; a step of determining, as a search keyword, the single word that maximizes the mutual information obtained as the difference between the total information content of the learning document group and the indefinite document group and the information content of the document groups that do and do not contain that word; a step of suppressing determination of the search keyword when the document groups that do and do not contain the search keyword can be distinguished into a necessary document group and an unnecessary document group and, in addition, the number of necessary documents in the document group containing the search keyword exceeds a predetermined ratio of the number of indefinite documents; and a step of creating a search expression for document retrieval by combining the one or more determined search keywords in a logical expression; whereby a search expression reflecting the words of the indefinite document group is created.

[0023] The search expression creation apparatus of the present invention, which solves the other problem above, creates a search expression for retrieving specific documents from a predetermined document group, for example a group made up of necessary documents of interest to the searcher, unnecessary documents of no interest, and indefinite documents of unknown interest, each assigned designation information of necessary, unnecessary, or indefinite that is used in the judgment made when determining the search keywords. The apparatus comprises the following elements.

[0024] (1) Document set generation means for performing morphological analysis on the document group to extract a plurality of words and generating a document set that gathers, together with identification information for each document, discrimination information indicating whether each extracted word is contained in the document and designation information indicating whether the document is necessary, unnecessary, or indefinite. The document set generation means preferably includes a word database in which discrimination information indicating whether each extracted word appears in a document is constructed together with the identification information of each document.
(2) Document set division means for determining a single word as a search keyword on the basis of the mutual information, obtained as the difference between the total information content of the document group and the information content of the document groups that do and do not contain each word, and of the number of documents in which the word appears, and for dividing one document set into a plurality of document sets using the determined search keyword.
(3) Search expression creation means for creating the search expression by combining, in a logical expression, the search keywords used in dividing the document set.
Document set dividing means for dividing one document set into a plurality of document sets using the determined search keyword. (3) Search expression creating means for combining the search keywords used for dividing the document set with a logical expression to create the search expression.

[0025] The document set division means is configured, for example, as follows.
(2-1) It calculates the mutual information by virtually regarding the indefinite document group as the unnecessary document group, on the basis of a predetermined minimum description length principle, thereby increasing the number of learning examples.
(2-2) It successively determines the word that maximizes the mutual information as the search keyword on the basis of a predetermined minimum description length principle, and divides the document set for the document group into a plurality of document sets using the determined search keyword.
(2-3) It stops determining search keywords when at least one of the divided document sets has been distinguished as a necessary document group or an unnecessary document group.
(2-4) It stops determining search keywords when at least one of the divided document sets has been distinguished as a necessary document group or an unnecessary document group and, in addition, the number of necessary documents in the document set of that necessary document group exceeds a predetermined ratio of the number of indefinite documents.
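The two stopping conditions (2-3) and (2-4) could be expressed as predicates along the following lines. This Python sketch is only one reading of the text: it assumes each document is a (label, keyword set) pair whose label has already had indefinite documents folded into "unnecessary" as in (2-1), and it leaves the counting of necessary and indefinite documents to the caller, because the passage does not state exactly which documents are counted. The `ratio` parameter stands for the "predetermined ratio", whose value is not given; all names are hypothetical.

```python
def stop_rule_2_3(subsets):
    """(2-3): stop once at least one divided set holds a single class of documents."""
    return any(len({label for label, _ in subset}) == 1 for subset in subsets)


def stop_rule_2_4(subsets, necessary_count, indefinite_count, ratio):
    """(2-4): rule (2-3) holds and necessary_count > ratio * indefinite_count.

    necessary_count and indefinite_count are supplied by the caller; ratio is
    a hypothetical tuning parameter for the "predetermined ratio".
    """
    return stop_rule_2_3(subsets) and necessary_count > ratio * indefinite_count
```

Rule (2-4) is strictly harder to satisfy than (2-3), so it lets the division continue longer before the search keywords are fixed.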

[0026]

[Embodiments of the Invention] Embodiments of the present invention are described in detail below. FIGS. 1 and 2 are functional block diagrams of a search expression creation apparatus suitable for carrying out the search expression creation method described above. Components with the same functions as those of the conventional search expression creation apparatus 50 described with reference to FIG. 5 are given the same reference numerals, and duplicate description is omitted.

[0027] The search expression creation apparatus 10 of this embodiment comprises the functions of a keyword extraction unit 11, a document database 12 (hereinafter, "database" is abbreviated to "DB" in this specification), a keyword DB 13, a document set division unit 14, and a search expression creation unit 33, which are formed when a computer reads and executes a predetermined program.

[0028] The program is normally stored in an internal or external storage device of the computer and is read and executed as needed. It may instead be stored on a recording medium distributed separately from the computer, for example a portable medium such as a CD-ROM or FD, installed in the internal or external storage device at the time of use, and executed as needed.

[0029] In this embodiment, the documents serving as learning examples (hereinafter, learning documents or the learning document group) have been assigned in advance, by the user or the like, necessary/unnecessary designation information indicating whether each document is necessary or unnecessary, and documents without this designation information are identified as indefinite documents (undesignated documents). This embodiment therefore has document groups of three kinds: necessary, unnecessary, and indefinite.

[0030] The document DB 12 stores the learning document group and a large indefinite document group. The keyword DB 13 stores in advance, for every keyword contained in the learning document group and the indefinite document group in the document DB 12, discrimination information indicating whether the keyword appears in each document, associated with document identification information such as a document number. Reference numeral 13A shows an example of the information in the keyword DB 13.

[0031] The keyword extraction unit 11 performs morphological analysis on the input necessary and unnecessary documents, or on the document group stored in the document DB 12, and extracts keywords from each document. It also outputs, as a document set, discrimination information indicating whether each keyword appears in each document and the designation information of the document, together with a document identifier such as a document name or document number. Reference numeral 11A illustrates the contents of the document set output from the keyword extraction unit 11. For the designation information in this case, documents other than the necessary/unnecessary learning documents designated by the searcher are, for example, given an "indefinite" tag when the document set is created and input to the document set division unit 14.
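The document set described above (document identifier, designation information, and per-keyword discrimination information) might be assembled as in the following Python sketch. Lower-cased whitespace splitting stands in for morphological analysis purely as a placeholder, and the record layout and the name `build_document_set` are assumptions for illustration; the actual layout shown at reference numeral 11A is not reproduced in this text.

```python
def build_document_set(documents, designations):
    """Build one record per document with its designation and keyword flags.

    documents: {doc_id: text}
    designations: {doc_id: "necessary" | "unnecessary"}; documents that are
    missing from this mapping are tagged "indefinite".
    """
    # Placeholder for morphological analysis: lower-cased whitespace tokens.
    extracted = {
        doc_id: set(text.lower().split()) for doc_id, text in documents.items()
    }
    vocabulary = sorted(set().union(*extracted.values()))
    records = []
    for doc_id in sorted(documents):
        records.append({
            "id": doc_id,
            "designation": designations.get(doc_id, "indefinite"),
            # Discrimination information: does each keyword occur in this document?
            "contains": {kw: kw in extracted[doc_id] for kw in vocabulary},
        })
    return records
```

Defaulting to "indefinite" means the searcher only has to label the few documents of known relevance; everything else in the database is tagged automatically.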

[0032] The document set division unit 14 determines the search keywords used to divide the document set output from the keyword extraction unit 11. The document set is divided with the decision tree learning algorithm "ID3" described above (hereinafter simply "ID3"). In general, most documents in the document DB 12 are unrelated to the searcher's interests, and the proportion of documents the searcher needs is considered extremely small. In other words, the documents in the document DB 12 other than the necessary document group can virtually all be regarded as unnecessary documents. The document set division unit 14 therefore treats the indefinite document group in the document DB 12 as an unnecessary document group, which increases the number of learning examples, that is, the learning document group, used to calculate the mutual information. Because this embodiment classifies the documents into three kinds as described above, the information contents H and H(t_j) used to obtain the mutual information I(t_j) of document set Set_i in equation (1) under ID3 are given by equations (4) and (5) below, respectively.

[0033] [Equations (4) and (5): not reproduced in this text]
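Equations (4) and (5) do not survive in this text. The following is a tentative reconstruction, not the published equations. It assumes the standard ID3 form, in which the two subsets are weighted by their share of Set_i, and it uses the parameters and the function h(a, b, c) defined in paragraph [0034]. The last line restates the relation that the mutual information is the difference between the total information amount and the information amount after the split.

```latex
\begin{align}
H &= h\bigl(p_i,\; n_i,\; s_i\bigr) \tag{4}\\
H(t_j) &= \frac{s_i(t_j)}{s_i}\,
          h\bigl(p_i(t_j),\; n_i(t_j),\; s_i(t_j)\bigr)
        + \frac{s_i^{\mathrm{not}}(t_j)}{s_i}\,
          h\bigl(p_i^{\mathrm{not}}(t_j),\; n_i^{\mathrm{not}}(t_j),\; s_i^{\mathrm{not}}(t_j)\bigr) \tag{5}\\
I(t_j) &= H - H(t_j)
\end{align}
```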

[0034] The parameters in equations (4) and (5) are as follows.
p_i: number of required documents in Set_i
n_i: sum of the number of unnecessary documents and the number of indefinite documents in Set_i
s_i: p_i + n_i
p_i(t_j): number of required documents in Set_i that contain the keyword t_j
n_i(t_j): sum of the number of unnecessary documents and the number of indefinite documents in Set_i that contain the keyword t_j
s_i(t_j): p_i(t_j) + n_i(t_j)
p_i not(t_j): number of required documents in Set_i that do not contain the keyword t_j
n_i not(t_j): sum of the number of unnecessary documents and the number of indefinite documents in Set_i that do not contain the keyword t_j
s_i not(t_j): p_i not(t_j) + n_i not(t_j)
h(a, b, c): -{(a/c)·log₂(a/c) + (b/c)·log₂(b/c)}
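The parameter definitions above can be turned into a small calculation. The Python sketch below is only an illustration under stated assumptions: the function names are hypothetical, the term 0·log₂(0) is taken as 0, and the two subsets are weighted by their share of the document set as in standard ID3, which this text does not confirm. Indefinite documents are counted together with unnecessary ones, as in the definition of n_i.

```python
import math


def h(a, b, c):
    """h(a, b, c) = -((a/c)*log2(a/c) + (b/c)*log2(b/c)), with 0*log2(0) taken as 0."""
    total = 0.0
    for x in (a, b):
        if x > 0 and c > 0:
            total -= (x / c) * math.log2(x / c)
    return total


def mutual_information(docs, keyword):
    """docs: list of (keyword_set, label) pairs.

    label is 'required', 'unnecessary' or 'indefinite'; the last two are
    pooled into n, following the definition of n_i.
    """
    def counts(subset):
        p = sum(1 for _, label in subset if label == "required")
        n = len(subset) - p  # unnecessary + indefinite
        return p, n, p + n

    with_kw = [d for d in docs if keyword in d[0]]
    without_kw = [d for d in docs if keyword not in d[0]]
    p, n, s = counts(docs)
    if s == 0:
        return 0.0
    total_info = h(p, n, s)
    # Assumed ID3 weighting: each subset contributes in proportion to its size.
    split_info = sum(len(sub) / s * h(*counts(sub)) for sub in (with_kw, without_kw))
    return total_info - split_info


# Hypothetical document set: keyword "a" separates the required documents exactly.
example = [
    ({"a"}, "required"),
    ({"a"}, "required"),
    ({"b"}, "unnecessary"),
    ({"b"}, "indefinite"),
]
```

In this example the keyword "a" yields the largest possible value for a half-and-half set, while a keyword that appears in no document yields zero.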

[0035] Equations (4) and (5) are themselves identical to equations (2) and (3) above; they are restated here because the makeup of the document set differs from that of the earlier method, which does not take the indefinite document group into account. If equations (4) and (5) are evaluated and the document set is divided completely according to ID3, the resulting search expression corresponds only to the required document group designated by the searcher. That is, the documents retrieved by such a search expression are learning documents already known to the searcher, so a search expression that matches only the required documents designated by the searcher is of little practical use. In this embodiment, the division of the document set is therefore stopped partway, so that an optimal search expression that also covers the indefinite document group is created. Two methods of stopping the division are described concretely below.

[0036] <First method> The division of the set is stopped at the stage (point in time) at which a required document group and an unnecessary document group have been distinguished in the course of the division process.

[0037] <Second method> The division of the set is stopped when the required document group and the unnecessary document group have been distinguished and, in addition, the number of required documents relative to the number of indefinite documents in the document set exceeds a predetermined ratio.

[0038] The predetermined ratio in the second method is handled by setting a fixed reference ratio in advance, for example as a system parameter, such as "the number of required documents is at least half, that is, 50% or more, of the number of indefinite documents". The first and second methods can each be used on their own. The second method subsumes the condition of the first method and is therefore a narrower stop condition than the first method.
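The two stop conditions can be sketched as a single predicate. The Python below is an illustration only: the names `is_separated` and `should_stop` are hypothetical, the comparison uses "at or above" to match the 50% example, and the handling of a subset with no indefinite documents is an assumption that the text does not address. Indefinite documents are ignored when testing for separation, because they are not treated as unnecessary when the stop condition is judged.

```python
def is_separated(labels):
    """First method: designated documents in the subset are all required or all unnecessary.

    Indefinite documents are ignored here rather than counted as unnecessary.
    """
    designated = {label for label in labels if label != "indefinite"}
    return len(designated) == 1


def should_stop(labels, min_ratio=None):
    """Decide whether to stop dividing a subset.

    labels: list of 'required' / 'unnecessary' / 'indefinite' strings.
    min_ratio: None selects the first method; a number selects the second.
    """
    if not is_separated(labels):
        return False
    if min_ratio is None:
        return True
    required = labels.count("required")
    indefinite = labels.count("indefinite")
    if indefinite == 0:
        return True  # assumption: the ratio is treated as met when nothing is indefinite
    return required / indefinite >= min_ratio
```

With `min_ratio=0.5`, a subset holding one required and three indefinite documents keeps being divided, whereas the first method would already stop there.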

[0039] FIG. 3 is a flowchart outlining the processing in the search expression creating apparatus 10 of this embodiment when the second method is used. The keyword extracting unit 11 extracts all keywords from the indefinite document group and the learning document group in the document DB 12 that are subject to search expression creation, and stores them in the keyword DB 13 (step S101). The keyword extracting unit 11 also creates, from the data in the keyword DB 13, the initial document set for which the search expression is to be created (step S102). The document set dividing unit 14 then divides the initial document set (step S103).

[0040] If at least one of the two document sets produced by the search keyword in step S103 is identified as a required or unnecessary document group (step S104: Yes), the document set dividing unit 14 checks whether the number of required documents in that document set exceeds the predetermined ratio to the number of indefinite documents. If it does (step S105: Yes), the division of the document set is stopped (step S106). Steps S104 and S105 correspond to step S212 in the ID3 procedure shown in FIG. 6, and are provided in the present invention so that the indefinite document group is reflected in the search expression. If, on the other hand, neither of the two document sets produced in step S103 is identified as a required or unnecessary document group (step S104: No), or the number of required documents in the document set does not exceed the predetermined ratio to the number of indefinite documents (step S105: No), the process returns to step S103 and the division of the document set is repeated.
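The loop through steps S103 to S106 can be pictured with the recursive Python sketch below. It is a simplification, not the procedure of FIG. 3: it applies the stop test to each subset separately, treats a subset with no required documents as finished, and uses the standard ID3 weighting for the mutual information. All function names and the sample data are hypothetical.

```python
import math


def entropy(p, n):
    """Two-class information amount; 0*log2(0) is taken as 0."""
    s = p + n
    return -sum(x / s * math.log2(x / s) for x in (p, n) if x > 0) if s else 0.0


def gain(docs, kw):
    """Mutual information of kw, counting indefinite documents as unnecessary."""
    def pn(sub):
        p = sum(1 for _, lab in sub if lab == "required")
        return p, len(sub) - p
    yes = [d for d in docs if kw in d[0]]
    no = [d for d in docs if kw not in d[0]]
    return entropy(*pn(docs)) - sum(len(s) / len(docs) * entropy(*pn(s)) for s in (yes, no))


def stop(docs, min_ratio):
    """Stop test; indefinite documents are NOT treated as unnecessary here."""
    labels = [lab for _, lab in docs]
    req, unn, ind = (labels.count(x) for x in ("required", "unnecessary", "indefinite"))
    if req == 0:
        return True   # assumption: nothing left to retrieve in this subset
    if unn > 0:
        return False  # still mixed (S104: No)
    return ind == 0 or req / ind >= min_ratio  # ratio test (S105)


def divide(docs, keywords, min_ratio=0.5):
    """Return a nested dict tree; leaves are lists of documents."""
    usable = [k for k in keywords if 0 < sum(k in d[0] for d in docs) < len(docs)]
    if stop(docs, min_ratio) or not usable:
        return docs
    best = max(usable, key=lambda k: gain(docs, k))  # S103: keyword with maximum mutual information
    rest = [k for k in usable if k != best]
    return {
        "keyword": best,
        "yes": divide([d for d in docs if best in d[0]], rest, min_ratio),
        "no": divide([d for d in docs if best not in d[0]], rest, min_ratio),
    }


sample = [
    ({"kwd1", "kwd2"}, "required"),
    ({"kwd1", "kwd2"}, "required"),
    ({"kwd1", "kwd2"}, "indefinite"),
    ({"kwd1"}, "unnecessary"),
    ({"kwd1"}, "indefinite"),
    ({"other"}, "unnecessary"),
    ({"other"}, "indefinite"),
]
tree = divide(sample, ["kwd1", "kwd2", "other"])
```

On this sample the first split already isolates the required documents together with one indefinite document, so division stops after a single keyword and that indefinite document remains covered by the resulting expression.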

[0041] The search expression creating unit 53 combines the search keywords obtained in the course of the division process, up to the point at which the division stopped, using the logical operators "and", "or", and "not" to create the search expression query (step S107).
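How the keywords are combined is not spelled out beyond the three operators. One common way to turn a decision tree into a Boolean expression, assumed here for illustration, is to join the keywords along each path that ends in a required group with "and" (using "not" for branches where the keyword is absent) and to join the paths with "or". The function `build_query` and the tree layout are hypothetical.

```python
def build_query(tree):
    """Build a Boolean search expression from a nested dict tree.

    Inner nodes look like {"keyword": str, "yes": node, "no": node}.
    Leaves are booleans: True marks a required document group.
    """
    def paths(node, terms):
        if isinstance(node, bool):
            return [terms] if node else []
        kw = node["keyword"]
        return (paths(node["yes"], terms + [kw])
                + paths(node["no"], terms + [f"not {kw}"]))

    clauses = [" and ".join(t) for t in paths(tree, []) if t]
    if len(clauses) == 1:
        return clauses[0]
    return " or ".join(f"({c})" for c in clauses)


# Hypothetical tree loosely modelled on the kwd1 / kwd2 split of FIG. 4.
fig4_like = {
    "keyword": "kwd1",
    "yes": {"keyword": "kwd2", "yes": True, "no": False},
    "no": False,
}
query = build_query(fig4_like)

# A second hypothetical tree with two required leaves.
two_paths = {"keyword": "a", "yes": True, "no": {"keyword": "b", "yes": True, "no": False}}
query2 = build_query(two_paths)
```

For the first tree this yields the expression `kwd1 and kwd2`; the second shows how "or" and "not" enter when more than one path leads to a required group.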

[0042] In this procedure, steps S104 and S105 correspond to the second method. When the first method is used alone, the division of the document set is stopped as soon as the distinction between required and unnecessary document groups is detected in step S104.

[0043] Thus, according to the above procedure, the determination of search keywords is curtailed by the stop conditions in steps S104 and S105; in other words, the document set is not divided completely. As a result, the indefinite document group is reflected in the resulting search expression.

[0044] FIG. 4 is a schematic diagram showing how the document set division stops under the first and second methods. In this figure, the initial document set is divided into a document set that contains the keyword "kwd1" and one that does not. Next, the document set that contains "kwd1" is divided into a document set that contains the keyword "kwd2" and one that does not. Here, the document set containing "kwd2" and the document set not containing it can be distinguished as a required document group and an unnecessary document group, so the determination of search keywords is stopped at this point. This means that the stop condition of the first method has been met. Note that the required and unnecessary document groups distinguished by "kwd2" respectively include the indefinite documents that contain "kwd2" and those that do not. The mutual information used when selecting "kwd2" is calculated by virtually regarding the indefinite document group as an unnecessary document group, but indefinite documents are not regarded as unnecessary documents when the stop condition of the division process is judged.

[0045] Further, in the document set containing the keyword "kwd2", the determination of search keywords is stopped once the number of required documents reaches or exceeds the predetermined ratio to the number of indefinite documents. This means that the stop condition of the second method has been met.

[0046] As described above, the search expression creating apparatus 10 of this embodiment virtually regards the undesignated indefinite document group as an unnecessary document group, calculates the mutual information according to ID3, and divides the document set accordingly. (The indefinite documents are regarded as unnecessary only when the mutual information is calculated; they are not regarded as unnecessary when the stop condition of the division process is judged.) Compared with the conventional method, only a small number of learning documents carrying designation information from the searcher is therefore needed.

[0047] In addition, a search expression that reflects the words contained in the indefinite document group can be created. Compared with a search expression produced by the conventional method, which corresponds only to the required document group, this gives the searcher a search expression with higher retrieval accuracy and practicality.

[0048] Moreover, because only a small number of learning documents is needed as described above, the searcher no longer has to judge and assign designation information to every document one by one as in the conventional approach, which reduces the burden on the searcher.

[0049]

[Effects of the Invention] As is clear from the above description, the present invention has the particular effect of making it possible to create a search expression that takes into account documents to which no designation information has been assigned, without requiring a large group of learning documents. In addition, using the created search expression for retrieval makes it possible to keep the retrieval accuracy for the searcher at or above a certain level and to obtain highly practical search results.

[Brief description of the drawings]

FIG. 1 is a functional block diagram showing an embodiment of a search expression creating apparatus according to one embodiment of the present invention.

FIG. 2 is a functional block diagram showing an embodiment of a search expression creating apparatus according to one embodiment of the present invention.

FIG. 3 is a flowchart of the processing in the search expression creating apparatus of this embodiment.

FIG. 4 is a schematic diagram showing the process of dividing a document set in this embodiment.

FIG. 5 is a functional block diagram of a conventional search expression creating apparatus.

FIG. 6 is a diagram explaining the processing procedure in a conventional search expression creating apparatus.

FIG. 7 is a schematic diagram of the information obtained in the conventional division process.

[Explanation of symbols]

10, 50: search expression creating apparatus
11, 51: keyword extracting unit
14, 52: document set dividing unit
12: document DB
13: keyword DB
53: search expression creating unit

Claims

[Claims]

1. A search expression creation method using a computer device, comprising: a step of extracting a plurality of words from a learning document group in which required/unnecessary designation information has been assigned to each document in advance and from an indefinite document group whose designation information is unknown, and detecting, for each document, whether each extracted word appears; a step of determining, as a search keyword, the single word that maximizes the mutual information obtained as the difference between the total information amount of the learning document group and the indefinite document group and the information amount of the document group containing the word and the document group not containing the word; a step of suppressing the determination of the search keyword when a required document group and an unnecessary document group can be distinguished by the presence or absence of the search keyword; and a step of creating a search expression used for document retrieval by combining the one or more determined search keywords with a logical expression, whereby a search expression reflecting the words of the indefinite document group is created.

2. A search expression creation method using a computer device, comprising: a step of extracting a plurality of words from a learning document group in which required/unnecessary designation information has been assigned to each document in advance and from an indefinite document group whose designation information is unknown, and detecting, for each document, whether each extracted word appears; a step of determining, as a search keyword, the single word that maximizes the mutual information obtained as the difference between the total information amount of the learning document group and the indefinite document group and the information amount of the document group containing the word and the document group not containing the word; a step of suppressing the determination of the search keyword when the document group containing the search keyword and the document group not containing it can be distinguished as a required document group and an unnecessary document group, and the number of required documents in the document group containing the search keyword exceeds a predetermined ratio to the number of indefinite documents; and a step of creating a search expression used for document retrieval by combining the one or more determined search keywords with a logical expression, whereby a search expression reflecting the words of the indefinite document group is created.

3. The search expression creation method according to claim 1 or 2, wherein, based on a predetermined minimum description length principle, the indefinite document group is virtually regarded as an unnecessary document group of the learning document group, so that the number of learning cases is increased and reflected in the calculation of the mutual information.

4. A search expression creating apparatus that creates a search expression for retrieving specific documents from a document group, comprising: document set generating means for performing morphological analysis on the document group to extract a plurality of words, and generating a document set that collects, together with identification information of each document, discrimination information indicating whether each extracted word is contained in the document and designation information indicating whether the document is a required document, an unnecessary document, or an indefinite document; document set dividing means for determining a single word as a search keyword based on the mutual information, obtained as the difference between the total information amount of the document group and the information amount of the document groups that contain and do not contain each word, and on the number of documents in which the word appears, and for dividing one document set into a plurality of document sets using the determined search keyword; and search expression creating means for creating the search expression by combining, with a logical expression, the search keywords used in dividing the document set.

5. The search expression creating apparatus according to claim 4, wherein the document group includes a required document group that is of interest to the searcher, an unnecessary document group that is not of interest, and an indefinite document group for which interest is unknown, each carrying required, unnecessary, or indefinite designation information that is used for discrimination when the search keyword is determined.

6. The search expression creating apparatus according to claim 4, wherein the document set generating means includes a word database that stores discrimination information indicating whether each extracted word appears in a document, together with identification information of each document.

7. The search expression creating apparatus according to claim 4, wherein the document set dividing means is configured to calculate the mutual information by virtually regarding the indefinite document group as the unnecessary document group, based on a predetermined minimum description length principle, thereby increasing the number of learning cases.

8. The search expression creating apparatus according to claim 4, wherein the document set dividing means is configured to successively determine, as the search keyword, the word for which the mutual information is maximal, based on a predetermined minimum description length principle, and to divide the document set for the document group into a plurality of document sets using the determined search keyword.

9. The search expression creating apparatus according to claim 4, wherein the document set dividing means is configured to stop determining the search keyword when at least one of the plurality of divided document sets is identified as a required document group or an unnecessary document group.

10. The search expression creating apparatus according to claim 4, wherein the document set dividing means is configured to stop determining the search keyword when at least one of the plurality of divided document sets is identified as a required document group or an unnecessary document group and the number of required documents in the document set for the required document group exceeds a predetermined ratio to the number of indefinite documents.