JP4011595B2

JP4011595B2 - Electronic document retrieval system and recording medium

Info

Publication number: JP4011595B2
Application number: JP2005316003A
Authority: JP
Inventors: 泰嗣小川; 透松田
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1998-02-02
Filing date: 2005-10-31
Publication date: 2007-11-21
Anticipated expiration: 2018-09-10
Also published as: JP2006073035A

Description

この発明は、電子化された文書群を索引登録しておき、指定した文字列の検索語を含む文書を検索する電子化文書検索システムおよび記録媒体に関する。 The present invention relates to an electronic document search system and a recording medium that search for a document including a search term of a designated character string by registering an electronic document group as an index.

電子化文書検索システムとしては、従来のキーワード検索方式の他に、近年の電子計算機の高速化と記憶装置の大容量化により、全文検索方式も用いられている。この全文検索方式では、基本的には検索の際に登録されている全文書を走査するものであるため、大量の文書に対しては膨大な検索処理時間を要することに鑑み、索引構造の工夫や検索処理の手法の工夫による処理の高速化が試みられている。 As an electronic document retrieval system, in addition to the conventional keyword retrieval system, a full-text retrieval system is also used due to the recent increase in the speed of electronic computers and the increase in capacity of storage devices. This full-text search method basically scans all the documents registered at the time of search, so that it takes a lot of search processing time for a large number of documents. Attempts have also been made to speed up processing by devising search processing techniques.

この索引構造としては、索引語に文書の識別子のみを対応付けて記憶する方式が従来から行なわれているが、一般には、検索語を構成する文字列を複数の索引語に分解して索引照合を行なうため検索ノイズ（過剰ヒット）が避けられず、このノイズの除去のため、全文書の走査によるあと処理を必要とし、処理の高速化に限界があった。 As the index structure, a method of storing only the identifier of the document in association with the index word has been conventionally performed. Generally, however, the index string is divided by dividing the character string constituting the search word into a plurality of index words. Therefore, search noise (excess hit) is unavoidable, and post-processing by scanning all documents is necessary to remove this noise, and there is a limit to speeding up the processing.

そこで、索引語の各文書における出現位置などの統計情報も索引にもたせる方式も提案されている。例えば、非特許文献１には、文字種に応じて１文字または連接２文字を索引語とする技術が開示されている（以下、従来技術１という）。また、特許文献１には、文字種によらず連接ｎ文字（ｎ≧２）を索引語とする技術が開示されている（以下、従来技術２という）。 Therefore, a method has been proposed in which statistical information such as the appearance position of each index word in each document is also provided in the index. For example, Non-Patent Document 1 discloses a technique in which one character or two consecutive characters are used as an index word according to the character type (hereinafter referred to as Conventional Technology 1). Patent Document 1 discloses a technique that uses n connected characters (n ≧ 2) as an index word regardless of the character type (hereinafter referred to as Conventional Technique 2).

「論文日本語文書用高速全文検索の一手法」（電子情報通信学会論文誌Ｄ−１Ｖｏｌ．Ｊ７５−Ｄ−１Ｎｏ．９８３６〜８４６頁１９９２年９月）"A Method for High-Speed Full-Text Search for Japanese Papers" (The IEICE Transactions D-1 Vol. J75-D-1 No. 9 pages 836-846, September 1992) 特開平７−５６９４４号公報JP 7-56944 A

しかしながら、従来技術１、２では、索引語が短いと頻出索引語が多くなって検索処理に要する時間が長くなる一方、索引語が長いと索引が非常に大きくなるという不具合がある。 However, the prior arts 1 and 2 have a problem that if the index word is short, frequent index words increase and the time required for the search process becomes long. On the other hand, if the index word is long, the index becomes very large.

この発明の目的は、検索処理全体の処理速度を向上などを図った電子化文書検索を可能とすることにある。 An object of the present invention is to enable digitized document search that improves the processing speed of the entire search process.

請求項１に記載の発明は、電子化された複数の登録文書を連続するｎ文字（ｎは２以上の整数）の索引語に分割する文書分割手段と、前記文書分割手段が分割した前記索引語と、この索引語を含んでいる前記登録文書の数である文書頻度、前記索引語を含む前記登録文書の文書識別子、前記索引語の前記各登録文書内での出現回数である文書内頻度および前記索引語の前記各登録文書内での出現位置を含む転置リストと、を対応付けた索引を記憶している索引記憶手段と、ｎ文字より大きい与えられた検索語を任意の連続するｎ文字に分割し、分割したｎ文字と一致する前記索引語を前記索引から取得し、取得した前記索引語の組合せのうち、前記検索語内のすべての文字が前記組合せ内の前記索引語のいずれかに含まれ、かつ、前記組合せ内の前記索引語の個数が最小であり、かつ、前記組合せ内の前記各索引語の文書頻度の合計が最小である前記組合せを選択する検索語分割手段と、選択された前記組合せ内の前記索引語に対応する前記文書識別子を検索する第１の検索条件を下位ノードとし、２つの前記第１の検索条件で用いる２つの前記索引語間の位置関係が前記検索語における位置関係と同一である前記登録文書の前記文書識別子を２つの前記第１の検索条件の検索結果である前記文書識別子の中から検索する第２の検索条件を上位ノードとする検索条件木を生成する検索条件解析手段と、前記検索条件木の各ノードの検索を実行し、最上位のノードの検索で取得された前記文書識別子を検索結果として得る検索条件評価手段とを備えたことを特徴とする。 According to the first aspect of the present invention, there is provided a document dividing unit that divides a plurality of digitized registered documents into consecutive n-character (n is an integer of 2 or more) index words, and the index divided by the document dividing unit. A document frequency that is the number of registered documents including the word and the index word, a document identifier of the registered document that includes the index word, and an intra-document frequency that is the number of occurrences of the index word in each registered document And an index storage means for storing an index in which the index word is associated with the transposed list including the appearance position of each of the index words in each of the registered documents, and a given search word larger than n characters can be arbitrarily repeated n The index word that matches the divided n characters is acquired from the index, and among the acquired combinations of the index words, all the characters in the search word are any of the index words in the combination And the combination Search word dividing means for selecting the combination with the smallest number of index words and the sum of the document frequencies of the index words in the combination, and the index in the selected combination. The first search condition for searching for the document identifier corresponding to the word is a lower node, and the positional relationship between the two index words used in the two first search conditions is the same as the positional relationship in the search word. Search condition analysis means for generating a search condition tree having a second search condition as a higher node for searching the document identifier of the registered document from among the document identifiers as search results of the two first search conditions; And a search condition evaluation unit that executes a search of each node of the search condition tree and obtains the document identifier obtained by the search of the highest node as a search result .

したがって、索引語は一律にｎ文字の連鎖とされるので、膨大な単語辞書を必要とする形態素解析を用いる手法と比較すると、単語辞書の管理などの手間が省ける。また、無駄な索引語の使用を省いて検索処理の高速化を図ることができる。また、文書頻度を小さくして検索処理の高速化を図ることができる。 Therefore, since the index word is uniformly a chain of n characters, it is possible to save time and labor for managing the word dictionary as compared with a method using morphological analysis that requires a huge word dictionary. Further, it is possible to speed up the search process by eliminating useless index words. In addition, the document frequency can be reduced to speed up the search process.

請求項２に記載の発明は、請求項１に記載の発明において、前記検索条件解析手段は、さらに、前記検索語分割手段によって取得された前記索引語に対応する前記文書識別子を検索する第３の検索条件を下位ノードとし、複数の前記第３の検索条件の検索結果である前記文書識別子の積集合を取得する第４の検索条件を上位ノードとする候補決定用条件木を生成し、前記検索条件評価手段は、前記候補決定用条件木の各ノードの検索を実行し、最上位のノードの検索で取得された前記文書識別子を対象として、前記検索条件木の各ノードの検索を実行して前記検索結果を得ることを特徴とする。 According to a second aspect of the present invention, in the first aspect of the invention, the search condition analysis unit further searches for the document identifier corresponding to the index word acquired by the search word division unit. Generating a candidate decision condition tree with the fourth search condition as an upper node for obtaining a product set of the document identifiers as a search result of a plurality of the third search conditions as a lower node, The search condition evaluation means executes a search for each node of the candidate determination condition tree, and executes a search for each node of the search condition tree for the document identifier acquired in the search for the highest node. And obtaining the search result .

したがって、詳細判定用条件木の検索結果合成処理に先立って、候補決定用条件木の検索結果合成処理を実行して対象となる文書に絞りをかけることにより、文字位置の突き合せの処理を低減して、検索処理の高速化を図ることができる。 Therefore, prior to the detailed decision condition tree search result synthesis process, the candidate decision condition tree search result synthesis process is executed to narrow down the target document, thereby reducing the character position matching process. Thus, the search process can be speeded up.

請求項３に記載の発明は、請求項１に記載の発明において、前記索引記憶手段は、ハードディスク装置であり、前記索引は、前記ハードディスク装置の固定長のブロックであるページをファイルの読み書きの単位としていて、複数の転置リストの大きさが前記ページの大きさより小さいときは１つのページに前記複数の転置リストを格納し、前記転置リストの大きさが前記ページより大きいときは１つの前記転置リストを複数のページに格納することを特徴とする。 The invention according to claim 3 is the invention according to claim 1, wherein the index storage means is a hard disk device, and the index is a unit for reading / writing a file which is a fixed-length block of the hard disk device. When the size of a plurality of transposed lists is smaller than the size of the page, the plurality of transposed lists are stored in one page, and when the size of the transposed list is larger than the page, one transposed list is stored. Is stored in a plurality of pages .

したがって、索引において転置リストを大きさに合わせてページ単位に管理するので、検索処理を高速化することができる。 Therefore, since the inverted list is managed in units of pages in the index according to the size, the search process can be speeded up.

請求項４に記載の発明は、電子化されて所定の記憶装置に登録されている複数の登録文書中から所望の検索語を含む前記登録文書の検索をコンピュータに実行させるためのプログラムを記録した記録媒体であって、前記登録文書を連続するｎ文字（ｎは２以上の整数）の索引語に分割する文書分割手順と、前記文書分割手順で分割した前記索引語と、この索引語を含んでいる前記登録文書の数である文書頻度、前記索引語を含む前記登録文書の文書識別子、前記索引語の前記各登録文書内での出現回数である文書内頻度および前記索引語の前記各登録文書内での出現位置を含む転置リストと、を対応付けた索引を記憶手段に記憶する索引記憶手順と、ｎ文字より大きい与えられた検索語を任意の連続するｎ文字に分割し、分割したｎ文字と一致する前記索引語を前記索引から取得し、取得した前記索引語の組合せのうち、前記検索語内のすべての文字が前記組合せ内の前記索引語のいずれかに含まれ、かつ、前記組合せ内の前記索引語の個数が最小であり、かつ、前記組合せ内の前記各索引語の文書頻度の合計が最小である前記組合せを選択する検索語分割手順と、選択された前記組合せ内の前記索引語に対応する前記文書識別子を検索する第１の検索条件を下位ノードとし、２つの前記第１の検索条件で用いる２つの前記索引語間の位置関係が前記検索語における位置関係と同一である前記登録文書の前記文書識別子を２つの前記第１の検索条件の検索結果である前記文書識別子の中から検索する第２の検索条件を上位ノードとする検索条件木を生成する検索条件解析手順と、前記検索条件木の各ノードの検索を実行し、最上位のノードの検索で取得された前記文書識別子を検索結果として得る検索条件評価手順手段と、を前記コンピュータに実行させるプログラムを記録したコンピュータ読み取り可能な記憶媒体である。 According to a fourth aspect of the present invention, there is recorded a program for causing a computer to search for a registered document including a desired search word from a plurality of registered documents that are digitized and registered in a predetermined storage device. A recording medium, a document dividing procedure for dividing the registered document into index words of continuous n characters (n is an integer of 2 or more), the index word divided by the document dividing procedure, and the index word A document frequency that is the number of the registered documents, a document identifier of the registered document that includes the index word, a document frequency that is the number of occurrences of the index word in each registered document, and each registration of the index word An index storage procedure for storing an index in which a transposed list including an appearance position in a document is associated with the storage means, and a given search term larger than n characters is divided into arbitrary consecutive n characters and divided. n letters and one The index word to be acquired from the index, and among the acquired combinations of index words, all the characters in the search word are included in any of the index words in the combination, and in the combination A search word dividing procedure for selecting the combination in which the number of the index words is the minimum and the sum of the document frequencies of the index words in the combination is the minimum, and the index words in the selected combination The first search condition for searching for the document identifier corresponding to is a lower node, and the positional relationship between the two index words used in the two first search conditions is the same as the positional relationship in the search word A search condition analysis procedure for generating a search condition tree having a second search condition as a higher node for searching the document identifier of a registered document from among the document identifiers that are search results of the two first search conditions; in front A computer readable recording of a program that causes the computer to execute a search condition evaluation procedure unit that executes a search of each node of the search condition tree and obtains the document identifier obtained by the search of the highest node as a search result Storage medium.

請求項１に記載の発明は、索引語は一律にｎ文字の連鎖とされるので、膨大な単語辞書を必要とする形態素解析を用いる手法と比較すると、単語辞書の管理などの手間が省ける。また、無駄な索引語の使用を省いて検索処理の高速化を図ることができる。また、文書頻度を小さくして検索処理の高速化を図ることができる。 According to the first aspect of the present invention, since the index word is uniformly a chain of n characters, it is possible to save time and labor for managing the word dictionary as compared with a method using morphological analysis that requires an enormous word dictionary. Further, it is possible to speed up the search process by eliminating useless index words. In addition, the document frequency can be reduced to speed up the search process.

請求項２に記載の発明は、詳細判定用条件木の検索結果合成処理に先立って、候補決定用条件木の検索結果合成処理を実行して対象となる文書に絞りをかけることにより、文字位置の突き合せの処理を低減して、検索処理の高速化を図ることができる。 According to the second aspect of the present invention, prior to the detailed determination condition tree search result synthesis process, the candidate determination condition tree search result synthesis process is executed to narrow down the target document, thereby reducing the character position. The matching process can be reduced to speed up the search process.

請求項３に記載の発明は、索引において転置リストを大きさに合わせてページ単位に管理するので、検索処理を高速化することができる。 According to the third aspect of the present invention, since the inverted list is managed in the index according to the size in the index, the search process can be speeded up.

請求項４に記載の発明は、索引語は一律にｎ文字の連鎖とされるので、膨大な単語辞書を必要とする形態素解析を用いる手法と比較すると、単語辞書の管理などの手間が省ける。また、無駄な索引語の使用を省いて検索処理の高速化を図ることができる。また、文書頻度を小さくして検索処理の高速化を図ることができる。 In the invention according to claim 4 , since the index word is uniformly made up of a chain of n characters, it is possible to save time and labor for managing the word dictionary as compared with a method using morphological analysis that requires a huge word dictionary. Further, it is possible to speed up the search process by eliminating useless index words. In addition, the document frequency can be reduced to speed up the search process.

［発明の実施の形態１］
図１は、この発明の実施の形態１にかかる電子化文書検索システム１のシステム構成の概略を示すブロック図である。この電子化文書検索システム１は、同図に示すように、例えば、クライアント２と、サーバ３とを通信回線４で接続したクライアント／サーバシステムとして実施することができる。 Embodiment 1 of the Invention
FIG. 1 is a block diagram showing an outline of the system configuration of an electronic document retrieval system 1 according to the first embodiment of the present invention. As shown in FIG. 1, the electronic document search system 1 can be implemented as a client / server system in which a client 2 and a server 3 are connected by a communication line 4, for example.

サーバ３は、その外部記憶に、電子化文書検索プログラムを記憶している。また、多数の電子化された文書をデータベースにして登録し、さらに、この文書から所望の検索語を含んでいる文書を検索する索引となる索引テーブルを登録している（これにより、この発明の文書登録手段、索引記憶手段を実現している）。前記電子化文書検索プログラム、前記データベースのフォーマット、前記索引テーブルなどは、フロッピー（登録商標）ディスク、光ディスク、光磁気ディスクなどの記録媒体（この発明の記録媒体を実現するものである）に記憶されていたものを、サーバ３が読み取って外部記憶に格納したものである。そして、この外部記憶に記憶されている、電子化文書検索プログラム、データベース、索引テーブルにより、以下に説明する各種の処理を実現している。 The server 3 stores an electronic document search program in its external storage. In addition, a large number of digitized documents are registered as a database, and an index table serving as an index for searching for documents including a desired search word from the documents is registered. Document registration means and index storage means are realized). The computerized document search program, the database format, the index table, and the like are stored in a recording medium such as a floppy (registered trademark) disk, an optical disk, or a magneto-optical disk (which realizes the recording medium of the present invention). What has been read is read by the server 3 and stored in the external storage. Various processes described below are realized by the computerized document search program, database, and index table stored in the external storage.

クライアント２は、文書検索の検索条件を入力する入力装置や、この検索の結果をディスプレイに表示したりプリンタにプリントアウトするための出力装置として機能する。 The client 2 functions as an input device for inputting search conditions for document search, and an output device for displaying the search results on a display or printing them out to a printer.

図２は、前記索引テーブルのデータ構造の一例を示す表である。この索引テーブル１１は、アドレス１２と、索引語１３と、この索引語１３が出現する文書の識別子１４と、この索引語１３が出現する文書中での索引語１３の出現位置１５と、索引登録の際に識別子１４の数からわかる索引語１３が出現する文書数１６とが対応付けて記憶されている。 FIG. 2 is a table showing an example of the data structure of the index table. The index table 11 includes an address 12, an index word 13, an identifier 14 of a document in which the index word 13 appears, an appearance position 15 of the index word 13 in a document in which the index word 13 appears, and an index registration. In this case, the number of documents 16 in which the index word 13 that can be understood from the number of identifiers 14 appears is stored in association with each other.

サーバ３では、クライアント２から文書検索の要求を受け付け、受け付けた検索条件を解析して、その内容に従い、索引テーブル１１に対し様々に索引照合を行なう。すなわち、検索語と索引語１３の照合、索引語１３に対応する識別子１４の照会、索引語１３に対応する出現位置１５の照会のうち、１つまたは複数の処理を個別にまたは一度に行なうことができる（これにより、この発明の索引照合手段を実現している）。 The server 3 receives a document search request from the client 2, analyzes the received search condition, and performs various index collations on the index table 11 according to the contents. That is, one or a plurality of processes are performed individually or at a time among the collation of the search word and the index word 13, the inquiry of the identifier 14 corresponding to the index word 13, and the inquiry of the appearance position 15 corresponding to the index word 13. (Accordingly, the index matching means of the present invention is realized).

そして、この索引照合により、クライアント２から受け付けた検索語を覆う１または複数の索引語１３を取得する。この検索語が１つの索引語１３で覆えるときは、当該索引語１３に対応する識別子１４を照会し、その識別子１４の文書の表示を文書検索結果としてクライアント２に返す。例えば、サーバ３が受け付けた検索語が“原子力”で、この“原子力”がそのまま索引語１３として索引テーブル１１に登録されている場合である。 Then, one or a plurality of index words 13 covering the search word received from the client 2 are acquired by this index collation. When the search word can be covered by one index word 13, the identifier 14 corresponding to the index word 13 is inquired, and the display of the document of the identifier 14 is returned to the client 2 as the document search result. For example, the search term received by the server 3 is “nuclear power”, and this “nuclear power” is registered in the index table 11 as the index word 13 as it is.

検索語が複数の索引語１３でしか覆えないときも、その複数の索引語１３の各々を新たな検索語として、索引照合により、この各検索語が出現する文書の文書数１６を取得することで（これにより、この発明の中間検索手段を実現している）、この文書数１６、すなわち文書頻度を見積もる（これにより、この発明の文書頻度見積手段を実現している）。 Even when a search word can only be covered by a plurality of index words 13, each of the plurality of index words 13 is used as a new search word, and the number 16 of documents of documents in which each search word appears is obtained by index matching. Thus, the number of documents 16, that is, the document frequency is estimated (which realizes the document frequency estimation unit of the present invention).

そして、この場合は、図３に示すように、サーバ３が受け付けた検索条件を、新たな２つの検索語の各々についての索引照合による検索を下位ノード２１に、この両下位ノード２１の検索結果の比較による、新たな両検索語の位置条件で限定する両下位ノード２１の合成処理条件２４を上位ノード２２とする、２進木の木構造２３の複合検索に変換する（図３(ａ)）。新たな検索語が３つ以上あるときは、２進木を階層的に並べて、２進木の木構造２３に変換する（図３(ｂ)）（これにより、この発明の第１検索条件変換手段を実現している）。図３(ａ)は、受け付けた検索語が“原子力発電”であるときに、この検索語を覆う索引語１３が、“原子力”、“発電”であるときの例であり、図３(ｂ)は、受け付けた検索語が“原子力発電”であるときに、この検索語を覆う索引語１３が、“原子力”、“発電”、“設備”であるときの例である。 In this case, as shown in FIG. 3, the search condition received by the server 3 is set to search the lower node 21 by the index matching for each of the two new search words, and the search results of both the lower nodes 21 3 is converted into a composite search of the tree structure 23 of the binary tree with the combination processing condition 24 of both lower nodes 21 limited by the position conditions of both new search words as the upper node 22 (FIG. 3A). ). When there are three or more new search terms, the binary trees are arranged in a hierarchy and converted to the tree structure 23 of the binary tree (FIG. 3 (b)) (the first search condition conversion of the present invention is thereby performed). Means). FIG. 3A is an example when the index word 13 covering the search word is “nuclear power” and “power generation” when the accepted search word is “nuclear power generation”. ) Is an example when the received search word is “nuclear power generation” and the index word 13 covering the search word is “nuclear power”, “power generation”, and “equipment”.

そして、この変換後の検索条件に従い、上位ノード２２の処理を行なう。すなわち、索引照合により、新たな検索語の各々の識別子１４と出現位置１５の情報を取得して比較し、新たな検索語同士で同一識別子１４があるときに、その識別子１４に対応する文書中で新たな検索語の出現位置１５を比較して、上位ノード２２の位置条件に合致するか否かを判断し、すべての上位ノード２２で合致する文書の表示を検索結果としてクライアント２に出力する。この出現位置１５の比較は、新たな検索語のうち、文書頻度が小さいものから順に行なう（これにより、この発明の複合検索手段を実現している）。すなわち、図３(ｂ)の例で、"原子力”、“発電”、“設備”の順に文書頻度が小さいときは、まず、“原子力”が出現する文書の中で“発電”も出現する文書に限定し、この文書の中で、文書中の“原子力”と“発電”の出現位置が下層の上位ノード２２の位置条件に合致するか否かを判断する。そして、合致するものがある文書の中で“設備”も出現する文書に限定し、この文書の中で上層の上位ノード２２の位置条件に合致するか否かを判断して、最終的な検索結果を取得する。 Then, the upper node 22 is processed according to the converted search condition. In other words, when the identifier 14 and the information on the appearance position 15 are acquired and compared by index matching, and the same identifier 14 exists between the new search terms, the information corresponding to the identifier 14 is included in the document. Compare the appearance position 15 of the new search word to determine whether or not the position condition of the upper node 22 is met, and output the display of the documents that match in all the upper nodes 22 to the client 2 as the search result. . The comparison of the appearance position 15 is performed in order from the lowest search frequency among new search words (this realizes the composite search means of the present invention). That is, in the example of FIG. 3B, when the document frequency is small in the order of “nuclear power”, “power generation”, and “equipment”, first, documents in which “power generation” also appears in documents in which “nuclear power” appears In this document, it is determined whether or not the appearance positions of “nuclear power” and “power generation” in the document match the position condition of the upper node 22 in the lower layer. Then, among documents that match, the “facility” is limited to documents that appear, and it is determined whether or not the document matches the position condition of the upper node 22 in the upper layer, and the final search is performed. Get the result.

このように、下位ノード２１の各々の文書頻度を見積ることで、検索条件評価の中間結果を得られれば、文書頻度の小さい順に評価を行なうことで、この中間結果を早めにしぼり込めるので、文書検索処理を従来より高速化することができる。 In this way, if the intermediate results of the search condition evaluation can be obtained by estimating the document frequency of each of the lower nodes 21, the intermediate results can be narrowed down earlier by performing the evaluation in ascending order of the document frequency. The search process can be made faster than before.

また、この場合に、図３に示すような木構造２３の２進木を作る際に、与えられた検索語を覆う索引語１３の、検索語の中での順番どおりに２進木の下位ノード２１を組み合わせるのではなく、文書頻度の小さい下位ノード２１の順に２進木に変換し、この変換により文書頻度の小さい下位ノード２１ほど木構造２３の下層となるようにしている（これにより、この発明の第１検索条件変換手段を実現している）。これにより、より下層でより末端ノードに近いほど、文書頻度の小さい下位ノード２１となるので、さらに中間結果を早めにしぼり込むことができ、さらに文書検索処理を高速化することができる。 Further, in this case, when a binary tree having a tree structure 23 as shown in FIG. 3 is created, the subordinate nodes of the binary tree in the order of the index word 13 covering the given search word in the search word. 21 is not combined, but is converted into a binary tree in the order of the lower node 21 with the lowest document frequency, and by this conversion, the lower node 21 with the lower document frequency becomes a lower layer of the tree structure 23 (thereby The first search condition conversion means of the invention is realized). As a result, the closer to the end node in the lower layer, the lower the node 21 becomes, so that the intermediate result can be narrowed earlier and the document search process can be further speeded up.

与えられた検索語が索引登録されている索引語１３で覆うことができないときは、検索結果は必ず空になる。このようなときは、２進木に変換するのではなく、検索条件の評価の際には直ちに空の検索結果を返す末端ノードに変換する（これにより、この発明の第１検索条件変換手段を実現している）。 When a given search word cannot be covered with the index word 13 registered in the index, the search result is always empty. In such a case, instead of converting to a binary tree, in the evaluation of the search condition, it is immediately converted to a terminal node that returns an empty search result (the first search condition conversion means of the present invention is thereby converted). Realized).

よって、検索条件の評価の際に直ちに空の検索結果を返して、検索処理を高速化することができる。 Therefore, an empty search result is immediately returned when the search condition is evaluated, and the search process can be speeded up.

以下では、合成処理条件２４の具体的内容、様々な検索条件への対応などについて説明する。 Hereinafter, specific contents of the composition processing condition 24, correspondence to various search conditions, and the like will be described.

１．新たな検索語間の距離を合成処理条件２４とする場合について
以下では、木構造２３の２進木の構造を、“合成処理条件（合成される条件１、合成される条件２、…、合成される条件ｍ）”の形式で記載する。 1. Regarding the case where the distance between new search terms is set as the synthesis processing condition 24 In the following, the structure of the binary tree of the tree structure 23 is referred to as “synthesis processing condition (synthesized condition 1, synthesized condition 2,... The condition m) ”is described.

最初の例は、新たな両検索語の位置条件を、“検索条件中の検索語を覆う２つの索引語（新たな２つの検索語）ｑ１，ｑ２の先頭がｎ文字分ずれていて、出現順が一定、すなわち正順序であって逆順序でないこと”とするもので（これにより、この発明の第１検索条件変換手段を実現している）、この位置条件を、以下では、“＃ｄｉｓｔａｎｃｅ〔ｎ〕（ｑ１，ｑ２）”と表記する。図３(ｂ)の例では、図４に示すとおりとなる。合成処理条件２４による合成後の文書頻度は実際に合成しないと不明であるが、近似を試みることはできる。検索条件ｑの文書頻度を“ＤＦ（ｑ）”、ｑ１とｑ２とのうち小さい方の値をとることを、“ｍｉｎ｛ｑ１，ｑ２｝”と表記すると、“＃ｄｉｓｔａｎｃｅ〔ｎ〕（ｑ１，ｑ２）”の文書頻度は、
“０≦ＤＦ（＃ｄｉｓｔａｎｃｅ〔ｎ〕（ｑ１，ｑ２））
≦ｍｉｎ｛ＤＦ（ｑ１），ＤＦ（ｑ２）｝” …… （１）
となる。 In the first example, the position condition of both new search terms is “the two index words (new two search terms) covering the search terms in the search conditions q1 and q2 are shifted by n characters and appear. It is assumed that the order is constant, that is, it is a normal order and not a reverse order (this realizes the first search condition converting means of the present invention), and this position condition is referred to as “# distance below”. [N] (q1, q2) ". In the example of FIG. 3B, it is as shown in FIG. The document frequency after composition under the composition processing condition 24 is unknown unless it is actually composed, but approximation can be attempted. If the document frequency of the search condition q is “DF (q)” and taking the smaller one of q1 and q2 is expressed as “min {q1, q2}”, “#distance [n] (q1, q2) ”document frequency is
“0 ≦ DF (# distance [n] (q1, q2))
≦ min {DF (q1), DF (q2)} ”(1)
It becomes.

そこで、（１）式の範囲内で、“ＤＦ（＃ｄｉｓｔａｎｃｅ〔ｎ〕（ｑ１，ｑ２））”の値を適当な値に近似すればよい。例えば、上限値である、“ｍｉｎ｛ＤＦ（ｑ１），ＤＦ（ｑ２）｝”に近似することができる。 Therefore, the value of “DF (#distance [n] (q1, q2))” may be approximated to an appropriate value within the range of the expression (1). For example, it can be approximated to “min {DF (q1), DF (q2)}” which is the upper limit value.

このように、新たな２つの検索語の出現位置の距離を上位ノードの位置条件とし、この位置条件を文書頻度の小さい方から評価することにより、不要な索引照合を省いて、文書検索速度を向上することができる。 In this way, the distance between the appearance positions of two new search terms is used as the position condition of the upper node, and this position condition is evaluated from the one with the lowest document frequency, thereby eliminating unnecessary index matching and increasing the document search speed. Can be improved.

２．検索条件に複数の検索語の論理演算を含む場合について
複数の検索語の積集合や和集合を意味する集合演算（論理演算）も、一般的に文書検索に利用されている。そこで、以下では、このような論理演算を含んでいる場合の処理について説明する。 2. When the search condition includes a logical operation of a plurality of search terms A set operation (logical operation) meaning a product set or union of a plurality of search terms is also generally used for document search. Therefore, hereinafter, processing when such a logical operation is included will be described.

このような場合には、各検索語を覆う索引語の各々を検索語として見積もった文書頻度に基づいて、各検索語の検索を下位ノード２１に、この下位ノードの前記論理演算を合成処理条件２４として上位ノード２２にする複合検索の木構造２３に変換する（これにより、この発明の第２検索条件変換手段を実現している）。 In such a case, based on the document frequency estimated using each index word covering each search word as a search word, the search of each search word is performed in the lower node 21 and the logical operation of this lower node is combined with the synthesis processing condition. 24 is converted into the tree structure 23 of the composite search to be the upper node 22 (this implements the second search condition conversion means of the present invention).

（１）複数の検索語が積集合で結ばれている場合について
複数の検索語ｑ１，ｑ２，……の積集合を“＃ａｎｄ（ｑ１，ｑ２，……）”で表記することにすると、この“＃ａｎｄ（ｑ１，ｑ２，……）”を含む検索条件の解析は図５に示す例のようになる。 (1) When a plurality of search terms are connected by a product set When a product set of a plurality of search terms q1, q2,... Is expressed by “#and (q1, q2,...)”, The analysis of the search condition including “#and (q1, q2,...)” Is as in the example shown in FIG.

この積集合による合成の場合も、“＃ｄｉｓｔａｎｃｅ〔ｎ〕（ｑ１，ｑ２）”の場合と同様に、文書頻度の小さい方から評価していった方が速くしぼり込むことができるので、文書検索処理を高速化することができる。 Also in the case of composition by intersection, as in the case of “# distance [n] (q1, q2)”, it is possible to squeeze faster by evaluating from the document with the lower document frequency. Processing can be speeded up.

しかし、“＃ｄｉｓｔａｎｃｅ〔ｎ〕（ｑ１，ｑ２）”が２進木であるのに対し、“＃ａｎｄ（ｑ１，ｑ２，……）”は多進木であるので、検索条件評価の途中で見積り文書頻度を比較するのは非効率である。そこで、予め見積り文書頻度の小さい順に並べなおす、すなわち、文書頻度の小さい下位ノード２１の順に木構造２３に変換し、この変換により文書頻度の小さい下位ノード２１ほど木構造２３の下層となるようにする（これにより、この発明の第２検索条件変換手段を実現している）。これにより、早めに結果をしぼり込むことができるので、文書検索速度を高速化することができる。例えば、検索条件に、“＃ａｎｄ（カラー，イメージ，プリンタ）”があり、“ＤＦ（イメージ）＜ＤＦ（カラー）＜ＤＦ（プリンタ）”であるときは、“＃ａｎｄ（イメージ，カラー，プリンタ）”と
変換する。 However, “#distance [n] (q1, q2)” is a binary tree, whereas “#and (q1, q2,...)” Is a multi-ary tree. Comparing estimated document frequencies is inefficient. Therefore, the estimated document frequencies are rearranged in ascending order, that is, the lower nodes 21 having the lowest document frequencies are converted into the tree structure 23. By this conversion, the lower nodes 21 having the lower document frequencies become lower layers of the tree structure 23. (Thus, the second search condition conversion means of the present invention is realized). As a result, the results can be narrowed down early, and the document search speed can be increased. For example, if the search condition is “#and (color, image, printer)” and “DF (image) <DF (color) <DF (printer)”, then “#and (image, color, printer) ) ”.

“＃ａｎｄ（ｑ１，ｑ２，……）”で合成される条件に、“＃ａｎｄ（ｑ１，ｑ２，……）”を含むような入れ子構造になっているときは、その入れ子構造を解いて平坦化しても論理演算の意味に変動はない。そして、入れ子構造のままで見積り文書頻度による並べ替えを行なうと、並べ替えの範囲が不必要に狭くなり、効率が悪くなることがある。例えば、“ＤＦ（ｑ２）＜ＤＦ（ｑ３）＜ＤＦ（ｑ４）＜ＤＦ（ｑ１）”である場合、“＃ａｎｄ（＃ａｎｄ（ｑ１，ｑ２），ｑ３，ｑ４）”に、入れ子構造のまま並べ替えを施すと、“＃ａｎｄ（＃ａｎｄ（ｑ２，ｑ１），ｑ３，ｑ４）”となるが、入れ子構造を解いて並べ替えると、“＃ａｎｄ（ｑ２，ｑ３，ｑ４，ｑ１）”となり、この方が早期のしぼり込みを期待でき、文書検索を高速化することができる（これにより、この発明の第２検索条件変換手段を実現している）。例えば、“＃ａｎｄ（＃ａｎｄ（カラー，イメージ），プリンタ）”で、“ＤＦ（イメージ）＜ＤＦ（カラー）＜ＤＦ（プリンタ）”であるときは、入れ子構造を解いて、“＃ａｎｄ（カラー，イメージ，プリンタ）”とし、並べ替えを行なって、“＃ａｎｄ（イメージ，カラー，プリンタ）”に変換する。また、末端のノードが文書頻度が正確であるのに対して、中間ノードの文書頻度は近似値であるから、合成処理演算の階層は少ない方が見積り誤差が少なくなるという利点もある。 If the condition synthesized by “#and (q1, q2,...)” Has a nested structure including “#and (q1, q2,...)”, Solve the nested structure. There is no change in the meaning of the logical operation even if it is flattened. If sorting is performed based on the estimated document frequency with the nested structure, the sorting range becomes unnecessarily narrow and efficiency may deteriorate. For example, if “DF (q2) <DF (q3) <DF (q4) <DF (q1)”, “#and (#and (q1, q2), q3, q4)” remains a nested structure. When rearrangement is performed, “#and (#and (q2, q1), q3, q4)” is obtained. However, when rearrangement is performed and rearrangement is performed, “#and (q2, q3, q4, q1)” is obtained. In this case, it is possible to expect early reduction, and it is possible to speed up the document search (the second search condition conversion means of the present invention is thereby realized). For example, when “#and (#and (color, image), printer)” is “DF (image) <DF (color) <DF (printer)”, the nested structure is solved and “#and ( Color, image, printer) ", and rearrangement is performed to convert it into" #and (image, color, printer) ". Further, since the document frequency of the end node is accurate while the document frequency of the intermediate node is an approximate value, there is an advantage that the estimation error is reduced when the hierarchy of the composition processing operation is smaller.

また、積集合の場合、合成される下位ノード２１に１つでも空の検索結果を返すものがあれば合成の結果も空であるから、２進木の木構造２３に変換せず、合成処理条件２４の際には直ちに空の検索結果を返す末端ノードに変換する。これにより、文書検索処理を高速化することができる（これにより、この発明の第２検索条件変換手段を実現している）。 Also, in the case of a product set, if even one of the subordinate nodes 21 to be synthesized returns an empty search result, the result of the synthesis is also empty. In the case of condition 24, it is immediately converted to a terminal node that returns an empty search result. As a result, the document search process can be speeded up (the second search condition conversion means of the present invention is thereby realized).

なお、積集合の文書頻度は、登録されている全文書数をＮとすると、
ｍｉｎ｛ＤＦ（ｑ１）＋ＤＦ（ｑ２）−Ｎ，０｝
≦ＤＦ（＃ａｎｄ（ｑ１，ｑ２））
≦ｍｉｎ｛ＤＦ（ｑ１），ＤＦ（ｑ２）｝ …… （２）
となる。 Note that the document frequency of the product set is N, where N is the total number of registered documents.
min {DF (q1) + DF (q2) -N, 0}
≦ DF (#and (q1, q2))
≦ min {DF (q1), DF (q2)} (2)
It becomes.

この範囲を適当な値で近似すれば、文書頻度の見積りを行なえる。例えば、変位の上限である“ｍｉｎ｛ＤＦ（ｑ１），ＤＦ（ｑ２）｝”で近似することができる。 If this range is approximated by an appropriate value, the document frequency can be estimated. For example, it can be approximated by “min {DF (q1), DF (q2)}” which is the upper limit of displacement.

（２）複数の検索語が和集合で結ばれている場合について
複数の検索語ｑ１，ｑ２，……の和集合を“＃ｏｒ（ｑ１，ｑ２，……）”で表記することにすると、この和集合による合成の場合は、“＃ａｎｄ（ｑ１，ｑ２，……）”の場合とは逆に、文書頻度の大きい方から評価していった方が速くしぼり込めて、文書検索処理を高速化することができる（これにより、この発明の第２検索条件変換手段を実現している）。 (2) Case where a plurality of search words are connected by a union If a union of a plurality of search words q1, q2,... Is expressed by “#or (q1, q2,...)”, In the case of composition by union, contrary to the case of “#and (q1, q2,...)”, Evaluation from the document with the highest document frequency can be performed faster, and the document search processing can be performed. The speed can be increased (the second search condition conversion means of the present invention is thereby realized).

そして、“＃ａｎｄ（ｑ１，ｑ２，……）”も多進木であるので、検索条件評価の途中で見積り文書頻度を比較するのは非効率である。そこで、予め見積り文書頻度の大きい順に並べなおす、すなわち、文書頻度の大きい下位ノード２１の順に木構造２３に変換していき、この変換により文書頻度の大きい下位ノード２１ほど木構造２３の下層となるようにする（これにより、この発明の第２検索条件変換手段を実現している）。これにより、早めに結果をしぼり込むことができるので、文書検索速度を高速化することができる。例えば、検索条件に、“＃ｏｒ（パルス変調，位置変調，ＰＰＭ）”があり、“ＤＦ（ＰＰＭ）＜ＤＦ（パルス）＜ＤＦ（位置変調）”であるときは、“＃ｏｒ（位置変調，パルス，ＰＰＭ）”と変換する。 Since “#and (q1, q2,...)” Is also a multi-ary tree, it is inefficient to compare the estimated document frequencies during the search condition evaluation. Therefore, the estimated document frequencies are rearranged in advance in descending order, that is, the lower node 21 having the highest document frequency is converted into the tree structure 23. By this conversion, the lower node 21 having the highest document frequency becomes a lower layer of the tree structure 23. (Thus, the second search condition conversion means of the present invention is realized). As a result, the results can be narrowed down early, and the document search speed can be increased. For example, when the search condition is “#or (pulse modulation, position modulation, PPM)” and “DF (PPM) <DF (pulse) <DF (position modulation)”, “#or (position modulation) , Pulse, PPM) ”.

また、積集合の場合と同様、和集合の場合も入れ子構造を有するときは、この入れ子構造を解いてからの方が、見積り文書頻度による処理が効果的になる（これにより、この発明の第２検索条件変換手段を実現している）。例えば、検索条件に、“＃ｏｒ（＃ｏｒ（パルス変調，位置変調），ＰＰＭ）”があるときに、“ＤＦ（ＰＰＭ）＜ＤＦ（パルス変調）＜ＤＦ（位置変調）”であるときは、入れ子構造を解いて“＃ｏｒ（パルス変調，位置変調，ＰＰＭ）”としてから、“＃ｏｒ（位置変調，パルス変調，ＰＰＭ）”に並べ替える。 Similarly to the product set, when the unset also has a nested structure, the processing based on the estimated document frequency is more effective after the nested structure is solved. 2 search condition conversion means is realized). For example, when the search condition includes “#or (#or (pulse modulation, position modulation), PPM)” and “DF (PPM) <DF (pulse modulation) <DF (position modulation)”. Then, after nesting the structure, “#or (pulse modulation, position modulation, PPM)” is arranged, and then “#or (position modulation, pulse modulation, PPM)” is rearranged.

なお、“＃ｏｒ（ｑ１，ｑ２，……）”の文書頻度は、登録されている全文書数をＮとすると、
ｍａｘ｛ＤＦ（ｑ１），ＤＦ（ｑ２）−Ｎ，０｝
≦ＤＦ（＃ｏｒ（ｑ１，ｑ２））
≦ｍｉｎ｛ＤＦ（ｑ１）＋ＤＦ（ｑ２），Ｎ｝ …… （３）
となる。 Note that the document frequency of “#or (q1, q2,...)” Is N as the total number of registered documents.
max {DF (q1), DF (q2) -N, 0}
≦ DF (#or (q1, q2))
≦ min {DF (q1) + DF (q2), N} (3)
It becomes.

そして、この範囲を適当な値で近似すれば、文書頻度の見積りを行なえる。例えば、範囲の上限である、限界和“ｍｉｎ｛ＤＦ（ｑ１）＋ＤＦ（ｑ２），Ｎ｝”で近似することができる。算術和“ＤＦ（ｑ１）＋ＤＦ（ｑ２）”がＮを超えるような検索は、大量の文書から必要な文書を見つけだそうとしている以上、実際にはめったにないと考えられるので、算術和を用いても実用上大差ない。 If this range is approximated by an appropriate value, the document frequency can be estimated. For example, it can be approximated by a limit sum “min {DF (q1) + DF (q2), N}” which is the upper limit of the range. Searches where the arithmetic sum "DF (q1) + DF (q2)" exceeds N is considered to be rare in practice as it tries to find the necessary documents from a large number of documents, so the arithmetic sum is used. But there is not much difference in practical use.

（３）複数の検索語が差集合で結ばれている場合について
複数の検索語ｑ１，ｑ２，……，ｑｍがあって、検索語ｑ１を含んでいるが、検索語ｑ２，……，ｑｍのいずれも含まないものを検索する、差集合による合成の場合を、“＃ａｎｄ−ｎｏｔ（ｑ１，ｑ２，……，ｑｍ）”と表記すると、これは“＃ｏｒ（ｑ１，ｑ２，……）”の検索結果の補集合と、ｑ１の検索結果との積集合を取る論理演算と同義である。そこで、第２項以下を見積り文書頻度の大きい順に並べれば、早めに絞り込んで文書検索処理を高速化することができる（これにより、この発明の第２検索条件変換手段を実現している）。例えば、“＃ａｎｄ−ｎｏｔ（空乏層，化合物，ＧａＡｓ，ＩｎＰ）”が検索条件に含まれていて、“ＤＦ（ＧａＡｓ）＜ＤＦ（ＩｎＰ）＜ＤＦ（化合物）”であるときは、“＃ａｎｄ−ｎｏｔ（空乏層，化合物，ＩｎＰ，ＧａＡｓ）”と並べ替える。 (3) Case where a plurality of search terms are connected by a difference set There are a plurality of search terms q1, q2,..., Qm, which includes the search term q1, but the search terms q2,. If a combination using a difference set that searches for items that do not contain any of the above is expressed as “# and-not (q1, q2,..., Qm)”, this is expressed as “#or (q1, q2,. This is synonymous with a logical operation that takes a product set of a complementary set of search results of “)” and a search result of q1. Therefore, if the second term and the subsequent terms are arranged in descending order of the estimated document frequency, the document search process can be speeded up by narrowing down earlier (thus realizing the second search condition conversion means of the present invention). For example, if “# and-not (depletion layer, compound, GaAs, InP)” is included in the search condition and “DF (GaAs) <DF (InP) <DF (compound)”, then “# and-not (depletion layer, compound, InP, GaAs) ".

また、第２項以下に和集合があるときは、この和集合を解いてからの方が、見積り文書頻度による処理が効果的である（これにより、この発明の第２検索条件変換手段を実現している）。例えば、“＃ａｎｄ−ｎｏｔ（空乏層，化合物，＃ｏｒ（ＧａＡｓ，ＩｎＰ））”が検索条件に含まれていて、“ＤＦ（ＧａＡｓ）＜ＤＦ（ＩｎＰ）＜ＤＦ（化合物）”であるときは、３項の和集合を解いて、“＃ａｎｄ−ｎｏｔ（空乏層，化合物，ＩｎＰ，ＧａＡｓ）”とした後、“＃ａｎｄ−ｎｏｔ（空乏層，化合物，ＩｎＰ，ＧａＡｓ）”と並べ替える。 Also, when there is a union below the second term, the processing based on the estimated document frequency is more effective after solving the union (this realizes the second search condition conversion means of the present invention). is doing). For example, when “# and-not (depletion layer, compound, #or (GaAs, InP))” is included in the search condition and “DF (GaAs) <DF (InP) <DF (compound)”. Solves the union of the three terms and makes it “# and-not (depletion layer, compound, InP, GaAs)”, and then rearranges it as “# and-not (depletion layer, compound, InP, GaAs)”. .

さらに、差集合の第１項が空の検索結果を返すときは、差集合も空になるので、差集合を直ちに空の検索結果を返すノードに変換することで、文書検索処理の高速化を図ることができる（これにより、この発明の第２検索条件変換手段を実現している）。 Furthermore, when the first term of the difference set returns an empty search result, the difference set also becomes empty. Therefore, by converting the difference set to a node that immediately returns an empty search result, the document search processing is speeded up. (Thus, the second search condition conversion means of the present invention is realized).

なお、“＃ａｎｄ−ｎｏｔ（ｑ１，ｑ２，……，ｑｍ）”の文書頻度は、
０≦ＤＦ（＃ａｎｄ−ｎｏｔ（ｑ１，ｑ２，……，ｑｍ））
≦ＤＦ（ｑ１） …… （４）
となる。 Note that the document frequency of “# and-not (q1, q2,..., Qm)” is
0 ≦ DF (# and-not (q1, q2,..., Qm))
≦ DF (q1) (4)
It becomes.

この範囲で、適当な値で近似すれば、文書頻度の見積りを行なうことができる。例えば、範囲の上限である“ＤＦ（ｑ１）”で近似することができる。 If an approximate value is approximated within this range, the document frequency can be estimated. For example, it can be approximated by “DF (q1)” which is the upper limit of the range.

（４）２つの検索語の出現位置間の距離の上限を指定する場合について
複数の検索語が共通に出現する文書というだけではなく、関連して出現する文書を検索したいというユーザの要望に応じて、近接演算に対応した文書検索システムも用いられている。 (4) When the upper limit of the distance between the appearance positions of two search terms is specified According to the user's desire to search not only for documents in which a plurality of search terms appear in common, but also for documents that appear in association with each other. Document retrieval systems that support proximity computation are also used.

近接演算の例として、出現位置間の距離の上限を指定するものがあるが、２つの検索語ｑ１，ｑ２の出現位置間の距離の上限をｎ字とする場合で、出現順序を問わない場合を、“＃ｗｉｎｄｏｗ〔ｎ〕（ｑ１，ｑ２）”と表記することにする。このような場合は、索引１１に出現位置１５の情報を含んでいるので、積集合とみなした評価の後、位置条件による吟味を行なうことで容易に対応することができる。すなわち、各検索語ｑ１，ｑ２の検索を下位ノード２１に、この下位ノード２１の積集合を上位ノード２２にする複合検索の木構造２３に変換し、この変換は文書頻度の小さい下位ノード２１の順に２進木に変換し、文書頻度の小さい下位ノード２１ほど木構造２３の下層となるように行なう（これにより、この発明の第２検索条件変換手段を実現している）。 As an example of the proximity calculation, there is one that specifies the upper limit of the distance between the appearance positions, but the upper limit of the distance between the appearance positions of the two search terms q1 and q2 is n characters, and the appearance order does not matter. Is expressed as “#window [n] (q1, q2)”. In such a case, since the information on the appearance position 15 is included in the index 11, it can be easily dealt with by performing examination based on the position condition after the evaluation that is regarded as a product set. That is, the search of each search word q1, q2 is converted to a lower node 21, and the product set of the lower nodes 21 is converted to a tree structure 23 of a complex search having an upper node 22. The binary tree is converted in order, and the lower node 21 with the lower document frequency is placed below the tree structure 23 (this realizes the second search condition converting means of the present invention).

“＃ｗｉｎｄｏｗ〔ｎ〕（ｑ１，ｑ２）”の場合も、積集合の場合と同様に、この見積り文書頻度による並べ替えで処理を高速化することができるが、それは、前記のように量検索語ｑ１，ｑ２の出現順序を考慮しない場合に限られる。また、入れ子構造をとる場合でも、この入れ子構造を外すと意味が異なってしまうので行なわない。 In the case of “#window [n] (q1, q2)”, as in the case of the product set, the processing can be speeded up by the rearrangement based on the estimated document frequency. Only when the appearance order of the words q1 and q2 is not considered. Also, even when a nested structure is adopted, the meaning is different if the nested structure is removed.

３．予備的評価について
前記した、“＃ｄｉｓｔａｎｃｅ〔ｎ〕（ｑ１，ｑ２）”や“＃ｗｉｎｄｏｗ〔ｎ〕（ｑ１，ｑ２）”の評価は、位置条件による吟味を必要とするため、位置条件による吟味を伴わない評価に比べ、時間を要する。そこで、不要な位置条件吟味を削減することができれば、文書検索処理を高速化することができる。“＃ａｎｄ（＃ｄｉｓｔａｎｃｅ〔３〕（原子力，発電），事故）”を例にとると、“＃ｄｉｓｔａｎｃｅ〔３〕（原子力，発電）”から評価していくことになるが、この場合、“事故”を含まない文書に関しても“原子力”と“発電”の位置条件吟味（この発明の例では、“原子力”と“発電”が、この順に出現して、先頭が３文字ずれているか、すなわち、隣接しているか否かを吟味）を行なうこととなる。 3. Preliminary evaluation Since the evaluation of “# distance [n] (q1, q2)” and “#window [n] (q1, q2)” described above requires examination based on position conditions, examination based on position conditions It takes time compared to the evaluation that does not involve. Therefore, if unnecessary position condition examination can be reduced, the document search process can be speeded up. Taking "#and (#distance [3] (nuclear power, power generation), accident)" as an example, the evaluation will start from "#distance [3] (nuclear power, power generation)". For documents that do not include "accident", examine the location conditions of "nuclear power" and "power generation" (in the example of the present invention, "nuclear power" and "power generation" appear in this order, and the first three characters are shifted, To examine whether or not they are adjacent to each other.

そこで、“＃ｄｉｓｔａｎｃｅ〔ｎ〕（ｑ１，ｑ２）”や“＃ｗｉｎｄｏｗ〔ｎ〕（ｑ１，ｑ２）”を、“＃ａｎｄ（ｑ１，ｑ２）”に置き換えた粗い評価を予備的評価として行ない、これにより予め絞り込みを行なって、無駄な位置条件吟味を削減して、文書検索処理を高速化することができる。 Therefore, a rough evaluation in which “# distance [n] (q1, q2)” or “#window [n] (q1, q2)” is replaced with “#and (q1, q2)” is performed as a preliminary evaluation. Thus, it is possible to narrow down in advance, reduce unnecessary position condition examination, and speed up the document search process.

すなわち、各検索語ｑ１，ｑ２の検索を下位ノード２１に、この下位ノード２１の“＃ｄｉｓｔａｎｃｅ〔ｎ〕（ｑ１，ｑ２）”や“＃ｗｉｎｄｏｗ〔ｎ〕（ｑ１，ｑ２）”の位置条件を上位ノード２２にする複合検索の木構造２３に変換する。そして、この変換は文書頻度の小さい下位ノード２１の順に２進木に変換し、文書頻度の小さい下位ノード２１ほど木構造２３の下層となるように行なう（これにより、この発明の第１、第２予備変換手段を実現している）。そして、変換後の木構造２３での上位ノード２２の処理を木構造２３中での最下位層から上位層に向けて順次実行して複合検索の結果を作成し、上位ノード２２の処理は各検索語ｑ１，ｑ２間の出現位置の比較を文書頻度の小さいものから順に行なう（これにより、この発明の第１、第２予備検索手段を実現している）。あとは、この粗い評価により限定された文書についてのみ、正確な評価を行なう（これにより、この発明の複合検索手段を実現している）。 That is, the search of each search word q1, q2 is performed on the lower node 21, and the position condition of “#distance [n] (q1, q2)” or “#window [n] (q1, q2)” of the lower node 21 is set. It is converted into a tree structure 23 of the complex search for the upper node 22. Then, this conversion is performed so that the lower nodes 21 having the lowest document frequency are converted into binary trees, and the lower nodes 21 having the lower document frequency are arranged in the lower layer of the tree structure 23 (thus, the first and the first of the present invention). 2 Preliminary conversion means are realized). Then, the processing of the upper node 22 in the tree structure 23 after conversion is sequentially executed from the lowest layer to the upper layer in the tree structure 23 to create a composite search result. Comparison of the appearance positions between the search terms q1 and q2 is performed in order from the document having the lowest frequency (this realizes the first and second preliminary search means of the present invention). After that, accurate evaluation is performed only for the document limited by this rough evaluation (this realizes the compound search means of the present invention).

４．全順序関係をもつ識別子１４を用いる場合について
検索条件の評価は、より具体的には、その条件を満たす文書の識別子１４のリストを得るための処理（これを“ｒｅｔｒｉｅｖｅ（）”と表記することにする）である。識別子１４としては、整数や文字列などの全順序関係をもつものを用いるのが一般的であるので、識別子１４の下限ｄ０を与えて、その検索条件を満たす文書の識別子１４のうちでｄ０以上で最小のものを得る（以下、“ｌｏｗｅｒ＿ｂｏｕｎｄ（ｄ０）”と表記する）。これにより、“ｒｅｔｒｉｅｖｅ（）”を得る処理は、“ｌｏｗｅｒ＿ｂｏｕｎｄ（ｄ０）”を用いて実現することができる。 4). When using identifiers 14 having a total order relationship More specifically, the evaluation of search conditions is a process for obtaining a list of identifiers 14 of documents satisfying the conditions (this is expressed as “retrieve ()”. ). Since it is common to use an identifier having a total order relationship such as an integer or a character string, the identifier 14 is given a lower limit d0 of the identifier 14 and is d0 or more among the identifiers 14 of documents that satisfy the search condition. To obtain the smallest one (hereinafter referred to as “lower_bound (d0)”). Thereby, the process of obtaining “retrieve ()” can be realized using “lower_bound (d0)”.

索引１１のデータを電子化文書検索システム１の記憶装置間で転送する際の時間や索引１１を格納する記憶装置の資源の節約のために、通常は索引１１は圧縮されており、圧縮の方式にもよるが、多くの場合、識別子１４の順序に従って走査する方が、順序を無視してランダムに識別子１４を変更しながら走査するよりも、索引照合がはるかに高速になる場合が多い。このような索引１１（これにより、この発明の索引記憶手段を実現している）に対して、索引語１３であるωと、識別子１４の下限値ｄ０を指定されれば、ωを含む識別子１４のうちでｄ０以上で最小のものを得ることができる比較的高速な索引照合が可能になる（これにより、この発明の第１予備照合手段を実現している）。これにより、索引語ωの索引照合に対応する検索条件の“ｌｏｗｅｒ＿ｂｏｕｎｄ（ｄ０）”は簡単に得ることができる。 The index 11 is usually compressed in order to save time for transferring the data of the index 11 between the storage devices of the electronic document search system 1 and resources of the storage device for storing the index 11. However, in many cases, scanning in the order of identifiers 14 is often much faster in index matching than scanning with random changes in identifiers 14 ignoring the order. For such an index 11 (which implements the index storage means of the present invention), if an index word 13 and a lower limit d0 of the identifier 14 are designated, an identifier 14 including ω Among them, a comparatively high-speed index collation capable of obtaining a minimum one at d0 or more becomes possible (thus realizing the first preliminary collation means of the present invention). Thus, the search condition “lower_bound (d0)” corresponding to the index matching of the index word ω can be easily obtained.

さらに、“ｌｏｗｅｒ＿ｂｏｕｎｄ（ｄ０）”に与えられた識別子１４の下限値ｄ０と、得られた識別子１４と、アドレス３２と、索引語ωとを関連付けて、図６に示すような履歴テーブル３１として記録しておいて（これにより、この発明の第１予備照合記憶手段を実現している）、これを参照することにより、索引照合の回数を削減することができる。 Furthermore, the lower limit d0 of the identifier 14 given to “lower_bound (d0)”, the obtained identifier 14, the address 32, and the index word ω are associated and recorded as a history table 31 as shown in FIG. (By this, the first preliminary collation storage means of the present invention is realized), and by referring to this, the number of index collations can be reduced.

例えば、“ｌｏｗｅｒ＿ｂｏｕｎｄ（１０）”が２０であれば、“ｌｏｗｅｒ＿ｂｏｕｎｄ（１１）”も、“ｌｏｗｅｒ＿ｂｏｕｎｄ（１５）”も２０であるので、１０以上２０以下の下限値ｄ０については、索引照合することなく、履歴テーブル３１参照により、２０の値を返すことができる（これにより、この発明の第１照合範囲指定手段を実現している）。なお、履歴テーブル３１の参照だけで得られたものは履歴テーブル３１に追加しない。 For example, if “lower_bound (10)” is 20, “lower_bound (11)” and “lower_bound (15)” are both 20, so the lower limit d0 of 10 or more and 20 or less is not index-checked. By referring to the history table 31, a value of 20 can be returned (this implements the first collation range specifying means of the present invention). What is obtained only by referring to the history table 31 is not added to the history table 31.

上位ノード２２の合成処理においても、合成処理条件２４を満たす識別子１４のうち、下限値ｄ０以上で最小のものを得て、索引語１３、識別子１４の下限値ｄ０、得られた識別子１４を対応付けて履歴テーブル３１に記憶することで、この履歴テーブル３１を参照することにより下位ノード２１の評価回数を削減することができる（これにより、この発明の第２予備照合手段、第２予備照合記憶手段、第２照合範囲指定手段を実現している）。 Also in the synthesis process of the upper node 22, the identifier 14 satisfying the synthesis process condition 24 is obtained with the minimum value equal to or higher than the lower limit value d 0, and the index word 13, the lower limit value d 0 of the identifier 14, and the obtained identifier 14 are associated. In addition, the number of evaluations of the lower node 21 can be reduced by referring to the history table 31 (the second preliminary collating means, the second preliminary collating memory according to the present invention). Means and second collation range designation means).

ただし、“ｌｏｗｅｒ＿ｂｏｕｎｄ（ｄ０）”の履歴をすべて記憶しておくと、その履歴が膨大になってしまう可能性がある。そこで、最新の一定数のもののみに限定することが望ましい（これにより、この発明の第１、第２予備照合記憶手段を実現している）。この場合、古い履歴を記録から消してしまうために、無駄な索引照合や、下位ノード２１の評価が発生する場合があるが、検索条件の評価の大部分は、“ｒｅｔｒｉｅｖｅ（）”を得る処理から派生するものであり、“ｒｅｔｒｉｅｖｅ（）”の処理では、識別子１４に関して１回走査するだけなので、最新の一つだけを履歴に残しておけば充分である。 However, if all the history of “lower_bound (d0)” is stored, the history may become enormous. Therefore, it is desirable to limit to only the latest constant number (this implements the first and second preliminary collation storage means of the present invention). In this case, since the old history is erased from the record, useless index collation or evaluation of the lower node 21 may occur. However, most of the evaluation of the search condition is processing for obtaining “retrieve ()”. In the process of “retrieve ()”, the identifier 14 is only scanned once, so it is sufficient to leave only the latest one in the history.

［発明の実施の形態２］
図７は、この発明の実施の形態２にかかる電子化文書検索システム４０の概要を示す機能ブロック図である。 [Embodiment 2 of the Invention]
FIG. 7 is a functional block diagram showing an outline of the digitized document search system 40 according to the second embodiment of the present invention.

図７に示すように、この電子化文書検索システム４０は、電子化された複数の登録文書中から所望の検索語を含む文書を検索するための索引を記憶する索引記憶手段４１を備えている。この索引の見出しとして登録される各索引語には、この索引語を含んでいる登録文書の数である文書頻度、索引語を含む文書の文書識別子、索引語の各登録文書内での出現回数である文書内頻度および索引語の各登録文書内での出現位置の各情報を対応付けて記憶している（以下では、これら索引語ごとの一連のデータを「転置リスト」という）。 As shown in FIG. 7, the computerized document search system 40 includes index storage means 41 for storing an index for searching for a document including a desired search word from a plurality of digitized registered documents. . For each index word registered as the index heading, the document frequency, which is the number of registered documents containing this index word, the document identifier of the document containing the index word, and the number of occurrences of the index word in each registered document Are stored in association with each other in the document frequency and the appearance position of each index word in each registered document (hereinafter, a series of data for each index word is referred to as a “transposition list”).

文書分割手段４２は、登録文書を索引語に分割する。検索語分割手段４３は、与えられた検索条件中の検索語を索引語に分割し、また、検索語中に索引語が１つも含まれていないときは該当文書がない旨を示す空文書集合を作成する。 The document dividing unit 42 divides the registered document into index words. The search word dividing means 43 divides a search word in a given search condition into index words, and when there is no index word in the search word, an empty document set indicating that there is no corresponding document Create

検索条件解析手段４４は、検索条件を解析して、この検索条件から、検索語分割手段４３が取得した索引語と空文書集合とのうちの少なくとも一方を演算子で合成した検索条件木を生成する。 The search condition analysis unit 44 analyzes the search condition and generates a search condition tree by combining at least one of the index word acquired by the search word dividing unit 43 and the empty document set with an operator from the search condition. To do.

検索条件評価手段４５は、検索条件木に基づき、索引から索引語に関する情報を取得して検索結果合成処理を実行し、検索結果を得る。 Based on the search condition tree, the search condition evaluation unit 45 obtains information related to the index word from the index and executes search result synthesis processing to obtain a search result.

（１）ｎを１以上の整数としたときに、文書分割手段４２は登録文書を一律にｎ文字の連鎖である索引語に分割する。これにより、膨大な単語辞書を必要とする形態素解析を用いる手法と比較して、単語辞書の管理などの手間を省くことができる。 (1) When n is an integer of 1 or more, the document dividing unit 42 uniformly divides a registered document into index words that are a chain of n characters. Thereby, compared with the method using the morphological analysis that requires a huge word dictionary, it is possible to save the trouble of managing the word dictionary.

すなわち、例えば、文書１＝“あああ”、文書２＝“あいうえお”、文書３＝“あいえ”、文書４＝“いう”を登録していたとすると、ｎ＝１なら索引には図８のような情報が記録されることになる。ここに、各文書ごとの出現情報は｛，｝で囲まれた範囲がひとつの文書に対応していて、この｛，｝内の１番目のフィールドが文書識別子、２番目のフィールドが文書内頻度、３番目のフィールド（“（，）”で囲まれている）が出現位置である。 That is, for example, if document 1 = “Ah”, document 2 = “Aiueo”, document 3 = “Ae”, and document 4 = “say” are registered, if n = 1, the index is as shown in FIG. Information will be recorded. Here, in the appearance information for each document, the range surrounded by {,} corresponds to one document, the first field in {,} is the document identifier, and the second field is the in-document frequency. The third field (enclosed by “(,)”) is the appearance position.

検索条件が、検索語の単体で、あるいは、複数の検索語をＡＮＤ，ＯＲなどの演算子により結合した形式で与えられた場合、まず検索語分割手段４３が、ｎ文字連鎖で索引に記録されている索引語に分割する。検索条件解析手段４４は、検索語が１つの索引語になる場合はその索引語のみからなる、検索語が２つ以上の索引語に分割される場合にはそれらの索引語の出現位置間の距離を指定する位置演算子で、それぞれ合成した検索条件木を作成する。検索語を、検索語を覆う索引語に分割できないときは、検索語分割手段４３が空文書集合を出力し、検索条件解析手段４４が空文書集合のみからなる検索条件木を作成する。 When the search condition is given as a single search word or in a form in which a plurality of search words are combined by operators such as AND and OR, the search word dividing means 43 is first recorded in the index by n-character chain. Divide into index words. The search condition analysis means 44 consists of only the index word when the search word becomes one index word. When the search word is divided into two or more index words, the search condition analysis unit 44 determines between the appearance positions of those index words. A combined search condition tree is created using position operators that specify distances. When the search word cannot be divided into index words covering the search word, the search word dividing unit 43 outputs an empty document set, and the search condition analyzing unit 44 creates a search condition tree including only the empty document set.

いま、“＃ｄｉｓｔａｎｃｅ〔ｘ〕（Ａ，Ｂ）”で索引語Ａと索引語Ｂが距離ｘ文字にある文書を検索することを指定するものとする。例えば、ｎ＝１で、検索語が「あいう」であれば、検索語分割手段４３は検索語を３個の索引語「あ」「い」「う」に分割し、検索条件解析手段４４は、＃ｄｉｓｔａｎｃｅ〔２〕（＃ｄｉｓｔａｎｃｅ〔１〕（あ，い），う）に相当する検索条件木を作成する。 Now, “# distance [x] (A, B)” designates that a document in which the index word A and the index word B are at a distance x characters is to be searched. For example, if n = 1 and the search term is “Yes”, the search term dividing unit 43 divides the search term into three index words “A”, “I”, “U”, and the search condition analyzing unit 44 , #Distance [2] (#distance [1] (a, i), u) is created.

そして、検索条件評価手段４５は、「あ」「い」「う」に関する転置リストを読み出して、これら３つの索引語が同時に出現しており、かつ、「あ」「い」の距離が１である場合の「あ」の出現位置と、「う」の距離が２であるものを探し出す検索結果合成処理を行なう。この場合、文書２のみがこれに該当しており、これが検索結果となる。 Then, the search condition evaluation unit 45 reads the transposed list regarding “a”, “i”, and “u”, and these three index words appear simultaneously, and the distance between “a” and “i” is 1. A search result synthesis process is performed to search for a position where the distance between “A” and “U” is 2 in some cases. In this case, only the document 2 corresponds to this, and this is a search result.

前記の場合に、ｎが２以上で、検索語がｎ文字未満である場合も正しく検索できるようにするためには、次のような処理を行なう。すなわち、検索語分割手段４３でｍ文字目（ｍは１以上で（ｎ−ｍ＋１）以下である整数）から検索語と一致するすべての索引語を索引から取り出し、検索条件解析手段４４で、検索語分割手段４３が取り出した複数の索引語を複数の検索結果の和集合をとる和集合演算子で合成する。 In the above case, the following processing is performed to enable a correct search even when n is 2 or more and the search word is less than n characters. That is, the search word dividing unit 43 extracts all index words that match the search word from the m-th character (m is an integer that is 1 or more and (n−m + 1) or less) from the index, and the search condition analysis unit 44 performs the search. A plurality of index words taken out by the word dividing means 43 are synthesized by a union operator that takes a union of a plurality of search results.

例えば、前記の例で、ｎ＝２のときの索引は図９に示すとおりである。ｍ＝１の場合に、検索語が「あ」であれば、検索語分割手段４３は「ああ」「あい」を出力し、検索条件解析手段４４は、“＃ｏｒ（ああ，あい）”という検索条件木をつくる。ここで、“＃ｏｒ（Ａ，Ｂ）”は索引語Ａを含む文書集合と、索引語Ｂを含む文書集合の和集合を検索することを指定するものである。この場合には、前記の例で文書１，２，３が検索結果となる。 For example, in the above example, the index when n = 2 is as shown in FIG. When m = 1, if the search word is “A”, the search word dividing unit 43 outputs “Ah” “Ai”, and the search condition analysis unit 44 is “#or (Ah, Ai)”. Create a search condition tree. Here, “#or (A, B)” specifies that a union of a document set including the index word A and a document set including the index word B is searched. In this case, the documents 1, 2, and 3 are search results in the above example.

ｎが２以上で、検索語がｎ文字以上である場合、次のような処理を行なうことで、無駄な索引語の処理を省いて検索処理を高速化することができる。 When n is 2 or more and the search word is n characters or more, the following process is performed to speed up the search process by eliminating unnecessary index word processing.

たとえば、検索語が「あいうえ」であれば、前記の例で、“＃ｄｉｓｔａｎｃｅ〔２〕（＃ｄｉｓｔａｎｃｅ〔１〕（あい，いう），うえ）”が検索条件木として作成される。しかし、“＃ｄｉｓｔａｎｃｅ〔２〕（あい，うえ）”という検索条件木を満たす文書は、必ず“＃ｄｉｓｔａｎｃｅ〔１〕（あい，いう）”を満たしているので、無駄な位置のつき合わせ処理を行なってしまう。そこで、このような場合は、検索語分割手段４３は検索語を覆う最小個数の索引語に分割して、検索を効率化する。すなわち、この例であれば、検索語からは「あい」「うえ」の２つの索引語だけに分割され、“＃ｄｉｓｔａｎｃｅ〔２〕(あい，うえ）”という検索条件木が作成される。 For example, if the search term is “Aiue”, “#distance [2] (#distance [1] (Ai)”) is created as a search condition tree in the above example. However, a document that satisfies the search condition tree “#distance [2] (ai, u)” always satisfies “#distance [1] (ai, u)”, and therefore a wasteful position matching process is performed. It will be done. Therefore, in such a case, the search word dividing unit 43 divides the search word into the minimum number of index words covering the search word, thereby improving the search efficiency. That is, in this example, the search term is divided into only two index words “Ai” and “Ue”, and a search condition tree “# distance [2] (Ai, Ue)” is created.

また、この場合に、検索語がさらに長いときは最小個数の索引語に分割できる場合が複数存在するときがある。例えば、検索語が「あいうえお」であれば、「あい」「うえ」「えお」にように分割することも、「あい」「いう」「えお」のように分割することもできる。この場合に、検索語分割手段４３は検索語を覆う最小個数の各索引語の文書頻度の合計が最小となるように検索語を索引語に分割する。このように文書頻度を低減して検索処理を高速化することができる。図９の索引では、「いう」の文書頻度が２であるのに対し、「うえ」は１であるので、検索語分割手段４３は「あい」「うえ」「えお」のように検索語を分割することとなる。 In this case, when the search term is longer, there may be a plurality of cases where the search term can be divided into the minimum number of index terms. For example, if the search term is “Aiueo”, it can be divided into “Ai”, “Ue”, “Eo”, or “Ai”, “Nou”, “Eo”. In this case, the search word dividing unit 43 divides the search word into index words so that the sum of the document frequencies of the minimum number of index words covering the search word is minimized. Thus, the document frequency can be reduced and the search process can be speeded up. In the index of FIG. 9, the document frequency of “say” is 2, whereas “up” is 1. Therefore, the search word segmenting means 43 searches the search terms like “ai”, “up”, “eo”. Will be divided.

（２）前記の例では、文書分割手段４２、検索語分割手段４３ともに、登録文書、検索語を一律にｎ文字の連鎖に分割したが、文字列の分割手段としては、文字種に応じて抽出するｎを変化させたり、異なるｎに対する分割の結果を組み合わせたりすることも考えられる。一例として、ｎ＝１とｎ＝２の分割の結果を組み合わせる場合を考えると、検索語「あいうえお」に対しては「あ」「い」「う」「え」「お」「あい」「いう」「うえ」「えお」が得られる。 (2) In the above example, both the document dividing unit 42 and the search word dividing unit 43 uniformly divide the registered document and the search word into a chain of n characters, but the character string dividing unit is extracted according to the character type. It is also conceivable to change n to be combined or to combine the division results for different n. As an example, when combining the division results of n = 1 and n = 2, the search terms “Aiueo” are “A” “I” “U” “E” “O” “Ai” “ "Ue" "Eo" is obtained.

このような場合に、検索語分割手段４３が検索語を分割して得た複数の索引語のうち、検索語を覆う他の索引語に包含されるものは除外するようにすることで、検索語を検索向きに索引語に分割して、検索処理を高速化することができる。すなわち、前記の例で検索語「あいうえお」に対しては、ｎ＝２のものに包含されるｎ＝１のものを除外した、「あい」「いう」「うえ」「えお」が得られる。 In such a case, the search word dividing unit 43 excludes the index words included in the other index words covering the search word from the plurality of index words obtained by dividing the search word. The search process can be speeded up by dividing the word into index words suitable for search. That is, in the above example, for the search term “aiueo”, “ai”, “say”, “up”, and “eo” are obtained by excluding n = 1, which are included in n = 2. .

この場合に、検索語分割手段４３は検索語を覆う最小個数の索引語に分割して、無駄な索引語の使用を省いて検索処理の高速化を図る。また、最小個数の索引語に分割できる場合が複数あるときは、検索語分割手段４３は検索語を覆う最小個数の各索引語の文書頻度の合計が最小となるように検索語を索引語に分割して、文書頻度を小さくして検索処理の高速化を図る。 In this case, the search word dividing means 43 divides the search word into the minimum number of index words covering the search word, thereby speeding up the search process by omitting use of useless index words. When there are a plurality of cases where the index word can be divided into the minimum number of index words, the search word dividing unit 43 sets the search words as index words so that the sum of the document frequencies of the minimum number of index words covering the search words is minimized. By dividing, the document frequency is reduced to speed up the search process.

（３）前記（１）（２）の手法では、検索語が２つ以上の索引語に分割される場合には、位置演算子で合成した検索条件のみを用いて検索を行なっている。しかし、この手法だと無駄な文字位置の突き合せ処理を生じる可能性がある。例えば（１）の例で、検索語「あいえお」を処理すると、索引語として「あい」「えお」が得られ、例えば、文書２＝“あいうえお”に関しては両索引語が出現しているものの、距離の条件を満たしていないということで、該当文書なしという検索結果が得られる。しかし、検索語に含まれる索引語「いえ」を考慮すれば、「あい」「いえ」「えお」を同時に含む文書は存在しないので、索引語の文書中での出現位置を調べることなく該当文書なしと判断することができる。 (3) In the methods (1) and (2), when the search word is divided into two or more index words, the search is performed using only the search condition synthesized by the position operator. However, this method may cause useless character position matching processing. For example, in the example of (1), when the search word “Aieo” is processed, “Ai” and “Eo” are obtained as index words. For example, both index words appear for document 2 = “Aiueo”. However, a search result that there is no corresponding document is obtained because the distance condition is not satisfied. However, considering the index word “No” included in the search term, there is no document that contains “Ai”, “No”, and “Eo” at the same time. It can be determined that there is no document.

そこで、検索語分割手段４３が検索語を２つ以上の索引語に分割する場合は次のような処理を行なってもよい。すなわち、検索条件解析手段４４は、検索語の分割で得られた複数の索引語を複数の検索結果の積集合をとる積集合演算子で合成した条件木である候補決定用条件木と、複数の索引語から検索語を覆う最小個数のものを選択してそれを出現位置間の距離を指定する位置演算子で合成した条件木である詳細判定用条件木とを作成する。そして、検索条件評価手段４５は、まず候補決定用条件木の検索結果合成処理を実行して複数の登録文書から該当文書を検索し、次に、この検索後の登録文書を対象に詳細判定用条件木の検索結果合成処理を実行して検索結果を得るようにする。 Therefore, when the search word dividing unit 43 divides the search word into two or more index words, the following processing may be performed. That is, the search condition analysis unit 44 includes a candidate decision condition tree that is a condition tree obtained by combining a plurality of index words obtained by dividing a search word with a product set operator that takes a product set of a plurality of search results, A detailed determination condition tree, which is a condition tree obtained by selecting the smallest number covering the search word from the index words and synthesizing it with a position operator specifying the distance between the appearance positions, is created. Then, the search condition evaluation means 45 first executes a search result synthesis process for the candidate decision condition tree to search for a corresponding document from a plurality of registered documents, and then performs a detailed determination for the registered documents after the search. A search result synthesis process is executed to obtain a search result.

これにより、詳細判定用条件木の検索結果合成処理に先立って、候補決定用条件木の検索結果合成処理を実行して対象となる文書に絞りをかけることにより、文字位置の突き合せの処理を低減して、検索処理の高速化を図ることができる。 Thus, prior to the detailed decision condition tree search result synthesis process, the candidate determination condition tree search result synthesis process is executed to narrow down the target document, thereby performing the character position matching process. This can reduce the speed of search processing.

この場合に、詳細判定用条件木を、複数の索引語から検索語を覆う最小個数のものを選択してそれを出現位置間の距離を指定する位置演算子で合成して作成するのに代えて、複数の索引語から前記検索語を覆いかつ索引語ごとの文書頻度の合計が最小となるものを選択してそれを出現位置間の距離を指定する位置演算子で合成して作成するようにしてもよい。 In this case, instead of creating a detailed decision condition tree by selecting the minimum number of search terms covering a search term from a plurality of index terms and combining them with a position operator that specifies the distance between the appearance positions. And selecting the one that covers the search word from the plurality of index words and that minimizes the total document frequency for each index word, and synthesizes it using a position operator that specifies the distance between the appearance positions. It may be.

（４）文書識別子、文書内頻度、出現位置などのデータは、通常、固定長(２あるいは４バイト）で表現される。例えば、４バイトで表現することにすると、図８の索引語「あ」のデータは、１１＊４＝４４バイト＝１４０８ビットが必要となる。これに対し、各文書識別子を前に出現している文書識別子の値との差分をとることで表現すると、その値は一般に小さな値となるので、可変長符号を用いて表現すると索引を小型化できる。同様に、文書内頻度は、その値を直接、可変長符号で表現し、出現位置は文書ごとに各出現位置の前の値との差分を可変長符号で表現することにより、索引を小型化することができる。 (4) Data such as a document identifier, in-document frequency, and appearance position is usually expressed in a fixed length (2 or 4 bytes). For example, if expressed in 4 bytes, the data of the index word “A” in FIG. 8 requires 11 * 4 = 44 bytes = 1408 bits. On the other hand, if each document identifier is expressed by taking the difference from the value of the previously appearing document identifier, the value will generally be a small value, so if the variable length code is used, the index will be downsized. it can. Similarly, the frequency in the document is directly expressed as a variable-length code, and the position of each occurrence is expressed in a variable-length code as the difference from the previous value of each occurrence position for each document. can do.

たとえば、可変長符号としてγ符号（I.H.witten著"Managing Gigabytes"，Van Nostrand Reinhold，1994年，84頁参照）を用いることとすると、図８の索引語「あ」の例で、文書識別子は、各々、１，２−１＝１，３−２＝１となり、この３つの１を各々“０”で表現できる。また、文書内頻度は、３，１，１を“１０１”“０”“０”で表現できる。さらに、出現位置は、文書１での出現が、各々、１，２−１＝１，３−２＝１となり、この３つの１を各々“０"で表現でき、文書２，３での出現はそれぞれ“０”で表現できる。結局、この場合の転置リストのデータは、｛“０”，“１０１”，（“０”，“０”，“０”）｝，｛“０”，“０”，（“０”）｝，｛“０”，“０”，（“０”）｝のように表現できるので、３＊１＋（３＋１＋１）＋（３＊１＋１＋１）＝１３ビットのデータ量ですむ。 For example, if a γ code (see “Managing Gigabytes” by IHwitten, Van Nostrand Reinhold, 1994, p. 84) is used as a variable length code, the document identifier is an example of the index word “A” in FIG. 1, 2-1 = 1, 3-2 = 1 respectively, and these three 1s can be expressed by “0”. In the document frequency, 3, 1, and 1 can be expressed as “101”, “0”, and “0”. Further, the appearance positions are 1, 2 = 1 = 1, 3-2 = 1 in the document 1, and these three 1s can be expressed by “0”, respectively, and appear in the documents 2 and 3. Can be represented by “0”. After all, the data of the transposed list in this case is {“0”, “101”, (“0”, “0”, “0”)}, {“0”, “0”, (“0”)} , {“0”, “0”, (“0”)}, 3 * 1 + (3 + 1 + 1) + (3 * 1 + 1 + 1) = 13 bits of data is sufficient.

しかしながら、このように圧縮した索引では、ある転置リストを処理するためには伸長処理が必要となるので、検索処理を遅くする可能性がある。しかも、検索語が「あ」であれば、検索処理においては出現位置は必要ないにもかかわらず、索引が圧縮されていると、その出現位置を伸長しなければ２番目以降の文書識別子を得ることができない。 However, an index compressed in this way requires a decompression process in order to process a certain transposed list, which may slow down the search process. Moreover, if the search term is “A”, the appearance position is not required in the search process, but if the index is compressed, the second and subsequent document identifiers are obtained unless the appearance position is expanded. I can't.

そこで、索引は、索引語ごとに、各文書識別子と各文書内頻度および各出現位置とで格納する領域を分けて、文書識別子だけを伸長処理すればよいようにして、検索処理を高速化することができる。 Therefore, the index speeds up the search process by dividing the area for storing each document identifier, the frequency in each document, and the appearance position for each index word, and decompressing only the document identifier. be able to.

例えば、前記の例で、文書識別子を記録するデータを、｛“０”｝，｛“０”｝，｛“０”｝とし、文書内頻度および出現位置のデータを、｛“１０１”，（“０”，“０”，“０”）｝，｛“０”，（“０”）｝，｛“０”，（“０”）｝として分けて記録する。 For example, in the above example, the data for recording the document identifier is {“0”}, {“0”}, {“0”}, and the frequency and appearance position data in the document are {“101”, ( “0”, “0”, “0”)}, {“0”, (“0”)}, {“0”, (“0”)} are recorded separately.

しかし、この場合に、特定の文書識別子の文書に関して位置の突き合せを行なうときには、その文書までの文書内頻度、出現位置を全て伸長処理しなければならない。 However, in this case, when position matching is performed for a document with a specific document identifier, all the in-document frequencies and appearance positions up to that document must be decompressed.

そこで、索引語ごとの各文書ごとに対応して、文書頻度および出現位置を表現するのに必要なビット数である文書内頻度出現位置表現ビット数を可変長符号で表現したデータを保持するようにする。これにより、文書内頻度出現位置表現ビット数を参照することにより、必要とする文書の文書頻度、出現位置のデータがどこにあるかがわかるので、不要な文書については文書頻度、出現位置のデータを伸長する必要がなくなり、文書頻度、出現位置を参照する場合に伸長するデータ量を少なくして、検索処理を高速化することができる。 Therefore, in correspondence with each document for each index word, data in which the frequency in the document frequency appearance position representing the number of bits necessary for expressing the document frequency and the appearance position is represented by a variable length code is retained. To. As a result, by referring to the number of occurrence frequency representation bits in the document, it is possible to know where the document frequency of the required document and the data of the appearance position are located. There is no need for decompression, and the amount of data to be decompressed when referring to the document frequency and appearance position can be reduced to speed up the search process.

例えば、前記した図８に示した例で、索引語「あ」に対する文書１の文書内頻度出現位置表現ビット数は６（γ符号で“１１０１０”）、文書２，３の場合はいずれも２（γ符号でいずれも“１００”）なので、前記の文書内頻度、出現位置のデータは、｛“１１０１０”，“１０１”，（“０”，“０”，“０”）｝，｛“１００”，“０”，（“０”）｝，｛“１００”，“０”，（“０”）｝のように表現することができる。この例で、３番目に位置する文書３に関する文書内頻度、出現位置を参照したい場合、はじめの２文書に関しては、文書内頻度出現位置表現ビット数のみを伸長して、そのビット数だけシフトすることで、３番目の文書に関する文書内頻度、出現位置を得るための位置を求めることができる。 For example, in the example shown in FIG. 8 described above, the document frequency appearance position number of bits in document 1 for the index word “A” is 6 (γ code “11010”), and in the case of documents 2 and 3, both are 2 (Gamma codes are both “100”), so the frequency in the document and the data of the appearance position are {“11010”, “101”, (“0”, “0”, “0”)}, {“ 100 "," 0 ", (" 0 ")}, {" 100 "," 0 ", (" 0 ")}. In this example, when it is desired to refer to the in-document frequency and appearance position regarding the third document 3, the first two documents are expanded by the number of in-document frequency appearance position expression bits and shifted by the number of bits. As a result, the position for obtaining the frequency and the appearance position in the document regarding the third document can be obtained.

一方、このような処理では、文書内頻度出現位置表現ビット数を記録するために索引が大型化するとともに、文書内頻度、出現位置が必要な文書に関しては、文書内頻度出現位置表現ビット数を伸長することが余分な処理として発生する。そこで、文書内頻度が指定された閾値以上である場合に限って出現位置を表現するのに必要なビット数である出現位置表現ビット数を可変長符号で表現したものを記録するようにする。例えば、図８の前記の例で、閾値を２とすれば、文書内頻度が３である文書１にのみ出現位置表現ビット数３（γ符号で“１０１”）が記録され、文書内頻度、出現位置のデータは、｛“１０１”，“１０１”，（“０”，“０”，“０”）｝，｛“０”，（“０”）｝，｛“０”，（“０”）｝となる。ここで、先頭の文書１の｛，｝内の２番目にある“１０１”が出現位置表現ビット数である。このように、２番目に置いたのは、検索時には、まず文書内頻度を伸長した後でないと出現位置ビット数が記録されているのか否かが判定できないためである。 On the other hand, in such a process, the index is enlarged to record the frequency in the document frequency appearance position expression bit, and the frequency in the document frequency appearance position expression bit is set for a document that requires the frequency in the document and the appearance position. Expansion occurs as an extra process. Therefore, only when the in-document frequency is equal to or higher than the specified threshold, the number of appearance position expression bits, which is the number of bits necessary for expressing the appearance position, is recorded with a variable length code. For example, in the above example of FIG. 8, if the threshold value is 2, the appearance position expression bit number 3 (“101” in the γ code) is recorded only in the document 1 whose document frequency is 3, and the document frequency, The data of the appearance positions are {“101”, “101”, (“0”, “0”, “0”)}, {“0”, (“0”)}, {“0”, (“0 ")}. Here, the second “101” in {,} of the first document 1 is the number of appearance position expression bits. As described above, the second reason is that at the time of retrieval, it is not possible to determine whether or not the number of appearance position bits is recorded unless the in-document frequency is first expanded.

なお、文書内頻度出現位置表現ビット数、出現位置表現ビット数は、前記の説明から明らかなように、文書内頻度および出現位置を格納する領域内に格納する。 Note that the document frequency appearance position expression bit number and the appearance position expression bit number are stored in an area for storing the document frequency and the appearance position, as is apparent from the above description.

（５）前記（１）の場合で、ｍ＝１であれば、和集合演算子で結合される索引語は文字コード順にソートされたものである。これらに対応する転置リストが索引ファイルの近接した領域に格納されていれば、データの読み出しが高速化できるので、検索処理も高速になる。そこで、索引は、索引語ごとの転置リストを文字コード順にソートしてファイルに格納している。 (5) In the case of (1) above, if m = 1, the index words combined by the union operator are sorted in the order of the character codes. If the permutation list corresponding to these is stored in an adjacent area of the index file, the data can be read at a high speed, so that the search process is also speeded up. Therefore, in the index, the transposed list for each index word is sorted in the character code order and stored in a file.

（６）新聞記事１年分のような大規模な文書データベースを考えた場合、索引のファイルからのデータの読み書きの効率化が重要である。ハードディスク装置からのデータの読み書きは、適当な大きさの固定長ブロック単位で行われるので、索引の読み出しも、このブロックの整数倍の固定長ブロックであるページ単位で行なう。この場合、実際の文書における索引語の出現頻度にはばらつきがあるため、各索引語の転置リストの大きさにもばらつきがでる。 (6) When considering a large-scale document database for one year of newspaper articles, it is important to improve the efficiency of reading and writing data from index files. Since reading and writing of data from the hard disk device is performed in units of fixed-length blocks of an appropriate size, index reading is also performed in units of pages that are fixed-length blocks that are integral multiples of this block. In this case, since the appearance frequency of the index word in the actual document varies, the size of the transposed list of each index word also varies.

そこで、転置リストの大きさがページより所定程度小さいものは、１つのページに１つ以上の転置リストを同時に格納し、転置リストの大きさがページより大きいものは複数のページを用いて格納する。このようにすることで、小さな転置リストを効率的に格納でき、また、転置リストの大きさの上限を取り払うことができる。 Therefore, when the size of the transposed list is smaller than the page by a predetermined amount, one or more transposed lists are simultaneously stored in one page, and when the size of the transposed list is larger than the page, it is stored using a plurality of pages. . By doing so, a small transposed list can be efficiently stored, and the upper limit of the size of the transposed list can be removed.

（７）次に、前記（４）の場合と（６）の場合とを組み合わせるような手法について説明する。大きな転置リストがあるような索引語のみからなる検索条件を処理する場合、文書内頻度、出現位置のデータは不要である。したがって、それらをファイルから読み出すことは無駄な処理となり、検索処理を遅くする。そこで、大きな配置リストを構成するページを、文書頻度、文書識別子とページの管理情報を格納するヘッダーページと、必要であれば文書識別子を格納する文書識別子ページと、文書内頻度、出現位置、文書内頻度出現位置表現ビット数あるいは出現位置表現ビット数がある場合は文書内頻度出現位置表現ビット数あるいは出現位置表現ビット数を格納する文書内頻度出現位置ページとに分けて格納する。その結果、文書内頻度、出現位置が不要な場合には文書内頻度、出現位置の読み出し処理が不要になり、検索処理が高速化できる。 (7) Next, a method for combining the cases (4) and (6) will be described. When processing a search condition consisting only of index terms with a large transposed list, data on the frequency in the document and the appearance position are not required. Therefore, reading them from the file is a wasteful process and slows down the search process. Therefore, the pages constituting the large arrangement list are divided into a document frequency, a header page for storing the document identifier and page management information, a document identifier page for storing the document identifier if necessary, an in-document frequency, an appearance position, and a document. If there is an internal frequency appearance position expression bit number or an appearance position expression bit number, it is stored separately in an in-document frequency appearance position page for storing the in-document frequency appearance position expression bit number or the appearance position expression bit number. As a result, when the in-document frequency and the appearance position are unnecessary, the reading process of the in-document frequency and the appearance position becomes unnecessary, and the search process can be speeded up.

図１０は、この場合の転置リストの構成例を示すものである。同図に示すように、この転置リストは、文書識別子ページ（id page）、文書内頻度出現位置ページ（loc page）、ヘッダーページ（header page）にページが別れている。 FIG. 10 shows a configuration example of the transposition list in this case. As shown in the figure, the transposed list is divided into a document identifier page (id page), an in-document frequency appearance position page (loc page), and a header page (header page).

ヘッダーページは、文書頻度を格納するヘッダー部分（header）、文書識別子ページの位置を記録する文書識別子ページ索引（id page index）、文書内頻度出現位置ページの位置を記録する文書内頻度出現位置ページ索引（loc page index）、文書識別子の圧縮データの一部（last id block）、未使用部分（wastage）から構成されている。 The header page includes a header part for storing the document frequency, a document identifier page index for recording the position of the document identifier page (id page index), and an in-document frequency appearance position page for recording the position of the in-document frequency appearance position page. It consists of an index (loc page index), a part of the compressed data of the document identifier (last id block), and an unused part (wastage).

また、文書識別子を複数の文書識別子ページにまたがって格納し、この文書識別子ページの管理情報である文書識別子ページ索引に文書識別子のページ番号と各ページの先頭に記録された文書識別子とを記録するようにしてもよい。これにより、必要な文書識別子のデータが格納された文書識別子ページのみにアクセスすればよいので、検索処理を高速化することができる。図１１は文書識別子ページ索引の例である。この例で、文書識別子１０１０の文書が存在しているか否かを調べるには文書識別子ページ索引を参照して、２番目のページであるページ番号２００の文書識別子ページを読み込めばよく、ページ番号１００のページの読み込みを回避することができる。 Further, the document identifier is stored across a plurality of document identifier pages, and the page number of the document identifier and the document identifier recorded at the head of each page are recorded in the document identifier page index which is management information of the document identifier page. You may do it. Thus, only the document identifier page storing the necessary document identifier data needs to be accessed, so that the search process can be speeded up. FIG. 11 is an example of a document identifier page index. In this example, in order to check whether or not the document with the document identifier 1010 exists, it is only necessary to read the document identifier page with the page number 200 as the second page by referring to the document identifier page index. Can be avoided.

さらに、図１２に例を示すように、文書識別子ページ索引には、文書識別子ページの各ページの末尾に記録された文書識別子を記録してもよい。このように、末尾の文書識別子を記録することで、文書識別子ページ索引を参照するだけで、例えば図１２の例では文書識別子１９００の文書は存在しないことが確認でき文書識別子ページにアクセスする必要がないので、検索処理が高速化する。 Furthermore, as shown in FIG. 12, the document identifier page index may record the document identifier recorded at the end of each page of the document identifier page. In this way, by recording the last document identifier, it is necessary to access the document identifier page by confirming that there is no document with the document identifier 1900 in the example of FIG. 12, for example, only by referring to the document identifier page index. This speeds up the search process.

また、文書識別子をページのサイズより小さな固定長ブロックである文書識別子ブロックに分割して格納し、文書識別子ブロックごとにそのはじめの文書の文書識別子は前の文書識別子との差分をとらずに格納し、文書内頻度、出現位置および文書内頻度出現位表現ビット数もしくは出現位置表現ビット数がある場合には前記文書内頻度出現位表現ビット数または出現位置表現ビット数は文書識別子ブロックと同じ文書の情報を格納するブロックである文書内頻度出現位置ブロックに分割して格納する。 Also, the document identifier is divided and stored in document identifier blocks that are fixed-length blocks smaller than the page size, and the document identifier of the first document is stored for each document identifier block without taking the difference from the previous document identifier. If the document frequency, the appearance position, and the document frequency appearance position expression bit number or the appearance position expression bit number are present, the document frequency appearance position expression bit number or the appearance position expression bit number is the same document as the document identifier block. Are divided into document frequency appearance position blocks, which are blocks for storing the information.

図１３に、この場合の転置リストの一例を示す。この転置リストは、図１０のものと比較して次の点が異なっている。すなわち、文書識別子ページ（id page）は固定長の文書識別子ブロック（id block）から構成されている。文書内頻度出現位置ページ（loc page）は文書内頻度出現位置ブロック（loc block）から構成されている。各文書識別子ブロックが対応する文書内頻度出現位置ブロックを指している。なお、文書内頻度出現位置ブロックは必ずしも１つのページに収まらなくてもかまわない。例えば、図１３の左から３番目の文書内頻度出現位置ブロックがこれに相当する。 FIG. 13 shows an example of the transposition list in this case. This transposed list differs from that of FIG. 10 in the following points. That is, the document identifier page (id page) is composed of a fixed-length document identifier block (id block). The in-document frequency appearance position page (loc page) is composed of in-document frequency appearance position blocks (loc block). Each document identifier block points to a corresponding intra-document frequency appearance position block. The in-document frequency appearance position block does not necessarily fit on one page. For example, the third intra-document frequency appearance position block from the left in FIG. 13 corresponds to this.

大きな転置リストの構成をこのようにすることで、ページ内で特定の文書の存在を調べる場合は、ページの先頭からすべての文書識別子を伸長する必要はなく、まず各文書識別子ブロックの先頭に記録されている文書識別子を使って所望の文書識別子が存在するならばどのブロックにあるかを調べることができる。所望の文書識別子が存在するとすればどのブロックにあるかを調べるには、ブロックの先頭に記録されている文書識別子が所望の文書識別子より大きいブロックの１つ前のブロックを探せばよい。もし、比較を開始したブロックの先頭の文書識別子がすでに所望の文書識別子より大きい場合には現在の文書識別子ページには所望の文書識別子は存在しておらず、そのようなブロックが見つかることなく最終のブロックとの比較が終わった場合には所望の文書識別子は最終ブロックそのものにあるとする。また、位置の突き合せのため位置情報が必要になった場合には、文書識別子ブロックに記録されている対応する文書内頻度出現位置ブロックの位置情報を用いて文書内頻度出現位置ブロックにアクセスし、所望の位置情報を取得すればよい。以上のようにして所望の文書識別子、位置情報が簡単に得られるので、検索処理が高速化できる。 By configuring a large transposed list in this way, when examining the existence of a specific document within a page, it is not necessary to expand all document identifiers from the top of the page, but first record them at the beginning of each document identifier block. If the desired document identifier exists, it is possible to check in which block the document identifier is stored. In order to check which block a desired document identifier exists in, it is sufficient to search for a block immediately before a block in which the document identifier recorded at the head of the block is larger than the desired document identifier. If the first document identifier of the block that started the comparison is already larger than the desired document identifier, the desired document identifier does not exist in the current document identifier page, and no such block is found. It is assumed that the desired document identifier is in the final block itself when the comparison with the block is completed. In addition, when position information is needed for position matching, the in-document frequency appearance position block is accessed using the position information of the corresponding in-document frequency appearance position block recorded in the document identifier block. What is necessary is just to acquire desired position information. Since the desired document identifier and position information can be obtained easily as described above, the search process can be speeded up.

この発明の実施の形態１である電子化文書検索システムの全体のシステム構成を示すブロック図である。BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a block diagram showing an overall system configuration of an electronic document retrieval system that is Embodiment 1 of the present invention. 前記電子化文書検索システムの索引テーブル１１を示す表である。It is a table | surface which shows the index table 11 of the said electronic document search system. 前記電子化文書検索システムで検索条件から変換して作成する木構造の例を示すブロック図である。It is a block diagram which shows the example of the tree structure created by converting from search conditions with the said electronic document search system. 前記木構造の例を示すブロック図である。It is a block diagram which shows the example of the said tree structure. 前記木構造の例を示すブロック図である。It is a block diagram which shows the example of the said tree structure. 前記電子化文書検索システムの履歴テーブルを示す表である。It is a table | surface which shows the history table of the said electronic document search system. この発明の実施の形態２にかかる電子化文書検索システムの全体構成を示す機能ブロック図である。It is a functional block diagram which shows the whole structure of the digitized document search system concerning Embodiment 2 of this invention. 前記電子化文書検索システムで用いる索引の例を示す表である。It is a table | surface which shows the example of the index used with the said electronic document search system. 前記電子化文書検索システムで用いる索引の例を示す表である。It is a table | surface which shows the example of the index used with the said electronic document search system. 前記索引の転置リストの例を示すブロック図である。It is a block diagram which shows the example of the transposition list | wrist of the said index. 前記転置リストの文書識別子ページ索引の例を示す表である。It is a table | surface which shows the example of the document identifier page index of the said transposition list. 前記転置リストの文書識別子ページ索引の例を示す表である。It is a table | surface which shows the example of the document identifier page index of the said transposition list. 前記索引の転置リストの例を示すブロック図である。It is a block diagram which shows the example of the transposition list | wrist of the said index.

Explanation of symbols

１電子化文書検索システム
１１索引
１３索引語
１４識別子
１５出現位置
１６文書数
３１第１、第２予備照合記憶手段
４０電子化文書検索システム
４１索引記憶手段
４２文書分割手段
４３検索語分割手段
４４検索条件解析手段
４５検索条件評価手段 DESCRIPTION OF SYMBOLS 1 Electronic document search system 11 Index 13 Index word 14 Identifier 15 Appearance position 16 Number of documents 31 1st, 2nd preliminary collation memory means 40 Electronic document search system 41 Index storage means 42 Document division means 43 Search word division means 44 Search Condition analysis means 45 Search condition evaluation means

Claims

Document dividing means for dividing a plurality of digitized registered documents into index words of consecutive n characters (n is an integer of 2 or more);
Said index word to the document divider means is divided, the document frequency is the number of the registered document containing this index words, the document identifier of the registered document including the index word, the index terms of the respective registration document Index storage means for storing an index that associates the frequency in the document that is the number of occurrences in and a transposed list that includes the appearance position of the index word in each registered document;
A given search word larger than n characters is divided into arbitrary consecutive n characters, the index word that matches the divided n characters is obtained from the index, and the search word among the obtained combinations of the index words Are included in any of the index words in the combination, the number of index words in the combination is minimal, and the document frequency of each index word in the combination Search word dividing means for selecting the combination having the smallest sum ;
A first search condition for searching for the document identifier corresponding to the index word in the selected combination is a lower node, and a positional relationship between the two index words used in the two first search conditions is A search condition in which the second search condition for searching the document identifier of the registered document that is the same as the positional relationship in the search word from the document identifiers that are the search results of the two first search conditions is an upper node . A search condition analysis means for generating a tree;
An electronic document search system comprising: search condition evaluation means for executing a search of each node of the search condition tree and obtaining the document identifier acquired by searching the highest node as a search result .

The search condition analysis means further uses a third search condition for searching for the document identifier corresponding to the index word acquired by the search word dividing means as a lower node, and searches for a plurality of the third search conditions. Generating a candidate decision condition tree with the fourth search condition for obtaining a product set of the document identifiers as a result as an upper node;
The search condition evaluation unit executes a search for each node of the candidate determination condition tree, and executes a search for each node of the search condition tree with respect to the document identifier acquired in the search for the highest node. The electronic document retrieval system according to claim 1, wherein the retrieval result is obtained.

The index storage means is a hard disk device, and the index uses a page, which is a fixed-length block of the hard disk device, as a file read / write unit, and the size of a plurality of transposed lists is smaller than the size of the page stores a plurality of inverted list on one page, according to claim 1 size of the inverted list is characterized by storing one of the inverted list into a plurality of pages is greater than the page Electronic document search system.

A recording medium that records a program for causing a computer to search for a registered document including a desired search word from among a plurality of registered documents that are digitized and registered in a predetermined storage device,
A document dividing procedure for dividing the registered document into consecutive n-letter (n is an integer of 2 or more) index terms;
Said index word divided by the document splitting procedure, document frequency is the number of the registered document containing this index words, the document identifier of the registered document including the index word, the index terms of the respective registration document An index storage procedure for storing in the storage means an index that associates the frequency in the document that is the number of appearances in the document and the transposed list that includes the appearance position of the index word in each registered document ;
A given search word larger than n characters is divided into arbitrary consecutive n characters, the index word that matches the divided n characters is obtained from the index, and the search word among the obtained combinations of the index words Are included in any of the index words in the combination, the number of index words in the combination is minimal, and the document frequency of each index word in the combination A search term splitting procedure for selecting the combination having the smallest sum;
A first search condition for searching for the document identifier corresponding to the index word in the selected combination is a lower node, and a positional relationship between the two index words used in the two first search conditions is A search condition in which the second search condition for searching the document identifier of the registered document that is the same as the positional relationship in the search word from the document identifiers that are the search results of the two first search conditions is an upper node. A search condition analysis procedure for generating a tree;
Search condition evaluation procedure means for executing a search for each node in the search condition tree and obtaining the document identifier acquired in the search for the highest node as a search result;
A computer-readable storage medium having recorded thereon a program for causing the computer to execute.