JP2004164555A - Apparatus and method for retrieval, and apparatus and method for index building

Info

Publication number: JP2004164555A
Application number: JP2003075724A
Authority: JP
Inventors: Shigehisa Kawabe; 惠久川邉; Minoru Ikeda; 稔池田; Takashi Osawa; 隆大澤; Atsushi Kadona; 敦門奈; Masao Nukaga; 雅夫額賀
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2002-09-17
Filing date: 2003-03-19
Publication date: 2004-06-10

Abstract

PROBLEM TO BE SOLVED: To provide retrieval and index building technology that enables high-speed retrieval without creating security problems, even when users perform integrated retrieval across document filings in various security domains.

SOLUTION: When a retrieval request is made from a retrieval user terminal 15, a retrieval part 10 sends the user ID of the retrieval user to an access controller 11, and the access controller 11 returns the access authority of the retrieval user by referring to a user authority storage part 12. The access controller 11, for example, looks up a table that specifies the relation between access authorities and their corresponding indexes, and returns to the retrieval part 10 the identifier (or identifiers) of each index 14 that can be referenced with the access authority of the retrieval user. The retrieval part 10 refers to the indexes 14 permitted by those identifiers, extracts the hit records, and returns them to the retrieval user terminal 15.

COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
この発明は、複数の文書ファイリングから、文書を取り出して、検索のためのインデクスを構築し、複数の文書ファイリングに存在する、複数の文書の属性や、ＵＲＬなどで示される文書の位置を、一元的に管理し、検索可能な統合検索データベースに関し、とくに、文書ファイリングの個々のセキュリティドメインを考慮した管理・検索を行えるようにしたものである。
【０００２】
【従来の技術】
従来、分散環境において、独立して管理され、開示される複数の文書ファイリングをまたがって、論理的に唯一のインデクスを構築し、インデクスに対して、一回の検索操作で、複数の文書ファイリングに存在する、複数の、文書の属性や、ＵＲＬなどで示される文書の位置を、一元的に管理し、検索が可能なデータベースが構築されている。このような検索を統合検索と呼ぶ。
【０００３】
統合検索のためのインデクス構築に際して、検索操作と独立して行われる収集操作によって、複数の文書ファイリングから、文書が収集される。収集操作は、収集対象とする文書ファイリングから、所定のアクセス権を有するユーザ、またはアプリケーションが、所定のネットワークプロトコルで、文書名を指定するか検索を行って、文書を特定し取得する。取得した文書を解析し、インデクス構築に必要な属性やキーワードを作成して、インデクスを構築する。
【０００４】
なお、この発明と関連する特許文献には、複数のデータベースにそれぞれ格納されている文章データを解析し必要項目を抽出し抽出結果をインデクス化し、単一のインデクスで複数のデータベースにアクセスすることを開示するものや（特許文献１）、記憶装置に記憶されている複数のファイルの各々から所定の情報を取得するとともに権限情報も取得し所定の情報と権限情報とを用いてインデクスを構築してユーザの権限に応じた範囲でしか検索が行われないようにすることを開示するもの（特許文献２）がある。
【特許文献１】
特開２０００−１６３４４５公報
【特許文献２】
特開２００１−３４４２４５公報
【０００５】
【発明が解決する課題】
ところで、統合検索は、インターネットのように、公開するかしないか、２者択一の環境で、広く用いられるが、これを企業内のネットワークで提供するには、以下に示す課題がある。
【０００６】
一般に、企業内で公開される文書または文書ファイリングは、公開範囲を指定して、公開される。たとえば、「部外秘」の文書は、部門メンバーに限定的に公開されていると考えられるし、Ｗｅｂサーバなどは、部門ごとに設定されているネットワークドメインを利用して、接続可能なクライアントを限定することが行われている。
【０００７】
特定の単位で収集された複数の文書が、同一の公開範囲を指定している場合、その文書は、同一のセキュリティ上のドメインに属している。文書ごとに定まる公開範囲をセキュリティドメインと呼ぶ。企業内では、複数のセキュリティドメインが存在し、収集対象とする文書は、いずれかのセキュリティドメインに属する。
【０００８】
先に述べた統合検索を提供するためには、収集操作が必要であり、収集のためには、対象とする文書や文書ファイリングが属するセキュリティドメインのアクセス権が必要となる。すなわち、統合検索の対象とする、すべての文書や文書ファイリングに対する特権的なアクセス権を有するものが、収集操作を行う必要がある。
【０００９】
しかし、通常は、企業内では、このような広い範囲の特権的なアクセス権をシステム管理者またはソフトウェアシステムに与えることはセキュリティ上問題がある（課題１）。
【００１０】
さらに、論理的に唯一のインデクスを構築するため、文書ごとにセキュリティドメインを管理できるインデクスを構築する必要がある。そうしないと、本来アクセス権をもたないセキュリティドメインの文書が検索結果に含まれてしまうため、セキュアでない検索となる。
【００１１】
加えて述べれば、検索をしたユーザのアクセス権と、文書ごとに定まる、セキュリティドメインの比較、判定をする必要があり、検索処理が複雑になるため、検索性能が下がる。企業内の統合検索では、１００万乃至数千万の文書を検索対象とする場合があり、このような大規模な検索を、高速に行うのが困難となる。この場合は、利用者からみたレスポンスが低下する（課題２）。
【００１２】
この発明は、以上の事情を考慮してなされたものであり、種々のセキュリティドメインの文書ファイリングにわたって統合検索を行う場合でも、セキュリティの問題を生じさせることなく、かつ、高速の検索が可能な検索・インデクス構築技術を提供することを目的としている。
【００１３】
【課題を解決するための手段】
この発明によれば、上述の目的を達成するために、特許請求の範囲に記載のとおりの構成を採用している。ここでは、発明を詳細に説明するのに先だって、特許請求の範囲の記載について補充的に説明を行なっておく。
【００１４】
上記に示した課題を解決するために、本発明の原理的な構成では、収集対象のセキュリティドメインごとに、該セキュリティドメインに所属する管理者か、または、アクセス権を有する管理者が文書を収集し、インデクスを構築する。具体的には、収集プログラムに、すべてのセキュリティドメインに有効な権限を与えるのではなく、それぞれのセキュリティドメインから収集するのに必要な権限を与えられた収集プログラムを、各々のセキュリティドメインにて稼動させる。
【００１５】
すなわち、一つのセキュリティドメインに対して、一つのインデクスを対応付けるように構成する。これらインデクスは、データモデルが同一で、論理的には一つのインデクスで、その中がセキュリティドメインで分割されているという構成となる。
【００１６】
このように構成した複数のインデクスは、特定のコンピュータシステム内に集中して、管理する構成と、セキュリティごとに管理部門を設け、分散的に配置する構成がある。
【００１７】
いずれの構成においてもインデクスごとに定められた管理者は、特定のインデクスに対する限定された閲覧、更新、バックアップ、リストア操作を許すようにすることで、さきの課題１が解消する。
【００１８】
次に、課題２について説明する。
【００１９】
アクセス制御は、一般に、アクセス主体（サブジェクト）がアクセス対象（オブジェクト）に対するアクセス操作をもって、モデル化される。アクセス制御リストと呼ばれる方法は、アクセス対象ごとの属性として、アクセス主体と（その主体に許される）アクセス操作の組のリストを持つことで具現化される。主体の対象に対する操作は、対象ごとに付与されたアクセス制御リストを、走査または検索し、アクセス制御リストに、該主体と操作の組が含まれるかどうかを調べ、含まれていたら、アクセスが許される。
【００２０】
本発明の原理的な構成では、アクセス対象は、セキュリティドメインごとに構築されたインデクスである。アクセス操作は、検索とする。アクセス主体は、検索を行うユーザである。アクセス制御リストは、インデクスの属性として、検索を許すユーザのリストで構成し、アクセス制御データベースで一元管理する。
【００２１】
本発明の検索操作を実現する方法を説明する。
【００２２】
検索を行うユーザが、セキュリティドメインに対して検索が許されるインデクスのリストを、アクセス制御データベースから検索する。インデクスを保持するコンピュータに対して、検索要求を発行し、検索処理を行い、検索結果を得る。複数のインデクスに対する検索結果は、複数のコンピュータから複数の検索結果を得る。所定のタイムアウト時間が経過するのを待ち、複数の検索結果を、併合して、統合した検索結果を構成する。
【００２３】
この発明では、一つのインデクスに格納されている文書は、同一のセキュリティドメインに属する文書であるため、該セキュリティドメインにアクセス可能なユーザからの検索要求に対しては、個別の文書ごとにアクセス権のチェックを行う必要がない。一方、統合された検索結果を構成するには、併合の必要があり、その分の処理コストが必要となる。
【００２４】
一般には、収集された総文書数に対して、検索を行うユーザがアクセス可能なセキュリティドメインに属する文書が、少ない場合は、個別にアクセス権のチェックを行う必要がある文書数は、インデクスの単位で枝狩りされた結果、少なくなるため、併合のコストを勘案しても、短い時間で検索処理ができる。
【００２５】
たとえばアクセス可能な文書が半分以下の場合、すべての文書に対してアクセス権を調べるコストに対して、検索結果を併合するコストは、少ないことが期待できる。
【００２６】
次に本発明の他の原理的な構成について説明する。
【００２７】
検索したユーザに対して、得られた検索結果をすべて表示するのではなく、所定のランキング計算を行った結果として得られる、ランキングスコアの大きい文書について、所定の表示上限件数に限定して、表示を行うことを考える。
【００２８】
ユーザからみて、検索結果に、不当に低いランキングスコアの文書が含まれないようにするために、表示上限件数に対して、所定の倍率をかけた値を要求件数とし、要求件数だけ、検索結果を取り出し、ランキング計算を行い、降順にソートを行い、上位から表示上限件数だけ取り出して表示を行うものとする。
【００２９】
セキュリティドメインごとに定まるインデクスに対して、検索要求を出す際に、要求件数を指定し、検索結果は、表示上限件数の検索結果と、ヒット件数を返すように構成する。
【００３０】
複数のインデクスからの検索結果を併合する際に、いずれか一つのインデクスからの検索結果が、要求件数以上のヒット件数である場合には、該インデクスからの検索結果を、（他のインデクスからの検索結果と併合せずに）統合検索結果として採用する。
【００３１】
該インデクスからの検索結果が、要求件数に満たない場合は、それ以外のインデクスの検索結果を、要求件数に達するまで併合し、ランキングスコアで降順にソートを行い、上位から表示件数だけ取り出して表示を行う。
【００３２】
すべてのインデクスの検索結果を併合しても、要求件数に満たない場合は、すべてのインデクスの検索結果を併合してランキングスコアで降順にソートを行い、上位から表示件数だけ取り出して表示を行う。
【００３３】
以上のように、構成することで、統合後の検索結果が要求件数に対して、多い場合、特に、一つのインデクスからの検索結果が要求件数に対して多い場合は、併合の処理を減らすか、行わないので、処理時間が短くなる。
【００３４】
本発明の更に他の原理的な構成では、複数のインデクスを用いて検索を行う検索装置において、各インデクスから取得した、スコアを含むヒットレコードをスコアに基づいて各インデクスごとにソートし、ソートした上記スコアを含むヒットレコードを所定の規則で連結し、連結した上記スコアを含むヒットレコードをスコアに基づいて再度ソートし、再度ソートした後のヒットレコードの上位の所定数を検索結果として出力するようにしている。
【００３５】
この構成においては、インデクスごとにスコアを計算しソートを行うので、複数のインデクスに対して分散処理が可能であり、応答性を高め、スケーラビリティを確保することができる。また、インデクスごとにソートしたヒットレコードを連結する際に、インデクスごとのヒットレコードの処理対象上限値を定めておけば、不必要なヒットレコードをヒットレコード連結部に送る必要がなくなり、例えば通信コストを低減することが可能となる。
【００３６】
なお、この発明は装置またはシステムとして実現できるのみでなく、方法としても実現可能である。また、そのような発明の一部をソフトウェアとして構成することができることはもちろんである。またそのようなソフトウェアをコンピュータに実行させるために用いるソフトウェア製品もこの発明の技術的な範囲に含まれることも当然である。
【００３７】
この発明の上述の側面およびこの発明の他の側面は特許請求の範囲に記載され、以下実施例を用いて詳細に説明される。
【００３８】
【発明の実施の形態】
以下、この発明の実施例について説明する。
【００３９】
［実施例１］
実施例１は複数のインデクスを用いアクセス権限に応じて検索を制御するものである。
【００４０】
図１は、実施例１の検索装置を模式的に示しており、この図において、検索装置は、検索部１０、アクセス制御部１１、ユーザ権限記憶部１２およびインデクス記憶装置１３を含んで構成されている。インデクス記憶装置１３は、複数のインデクス（便宜上Ａ〜Ｎを付す）１４を記憶している。複数のインデクス１４はそれぞれ異なるレベルのアクセス権限が付与されている。もちろん、同一のアクセス権限が複数のインデクス１４に付与され、同一のアクセス権限のグループとして管理されても良い。１つのインデクス記憶装置１３にすべてのインデクス１４を記憶するのでなく、複数のインデクス記憶装置１３を設け、分散させて記憶するようにしても良い。この実施例の検索装置には検索ユーザ端末１５から検索要求が送られ、検索結果が検索ユーザ端末１５に返される。
【００４１】
インデクス記憶装置１３のインデクス１４は、後述するインデクス構築装置（図２）により構築・管理される。
【００４２】
この実施例において、検索ユーザ端末１５から検索要求がなされると、検索部１０はアクセス制御部１１に検索ユーザのユーザＩＤ等を供給し、アクセス制御部１１は、ユーザ権限記憶部１２を参照して検索ユーザのアクセス権限を返す。アクセス制御部１１は、例えば、アクセス権限とそれに対応するインデクスとの関係を規定した表を表引きして、検索ユーザのアクセス権限で参照可能なインデクス１４の識別子（複数の場合もある）を検索部１０に返す。検索部１０は、参照可能なインデクス１４の識別子に基づいて許容されるインデクス１４を参照してヒットしたレコードを取りだし、検索ユーザ端末１５に返す。ヒットしたレコードを、ランキングスコアに基づいて整理し、所定の表示数のレコードのみ検索ユーザ端末１５に返すようにしてもよい。
【００４３】
この例では、インデクス１４を参照してヒットしたレコードは、すべてアクセス可能なものであり、ヒットしたレコードについて個々にユーザのアクセス権限を検証する必要がない。
【００４４】
なお、検索ユーザが指定したインデクス１４あるいはすべてのインデクス１４に対して検索部１０が参照要求を行い、アクセス制御部１１が、ユーザ権限記憶部１２のユーザのアクセス権限を参照して参照の許否を行うようにしても良い。
【００４５】
つぎにこの実施例のインデクス構築装置について説明する。
【００４６】
図２は、この実施例のインデクス構築装置を模式的に示しており、この図において、インデクス構築装置は、プロセス起動部２０、インデクスレコード管理部２１、アクセス制御部２３、プロセス権限記憶部２４を含んで構成されている。プロセス起動部２０は、予めアクセス権限が設定されている。プロセス起動部２０は、インデクスレコード管理部２１のインデクスレコード管理プロセス２２を起動し、プロセス起動部２０のアクセス権限を付与する。ユーザあるいは管理者がインデクスレコード管理部２１のインデクスレコード管理プロセス２２を起動し、そのアクセス権限を付与するようにしても良い。起動されたインデクスレコード管理プロセス２２は、文書を保持する文書ファイリングシステム１０３（図３参照）にアクセスし、自らのアクセス権限で許容される文書を参照してインデクスレコードを生成する。文書ファイリングシステム１０３の文書へのアクセスはアクセス制御部２３およびプロセス権限記憶部２４により制御される。こうしてインデクスレコード管理プロセス２２は、自らのアクセス権限に対応する（同等以下の）セキュリティドメインの文書のインデクスレコードを生成して、インデクス記憶装置１３中の対応するアクセス権限のインデクス１４を構築したり、修正（挿入・削除）したりする。このインデクス１４の構築・修正の処理についてもアクセス制御部２３およびプロセス権限記憶部２４により制御される。
【００４７】
このようにしてアクセス権限ごとにインデクス１４が構築・管理される。
【００４８】
図３は、実施例１の検索装置およびインデクス構築装置をイントラネット環境で実現した構成例を示す。図３において、検索システム１００、複数のインデクス構築システム１０２、複数の文書ファイリングシステム１０３、ディレクトリサーバ１０４、ウェブサーバ１０５、アプリケーションサーバ１０６、クライアント端末１２０等が、ＬＡＮ１０８に配置されている。またＬＡＮ１０８にはルータ１０７、ネットワーク１２１を介してクライアント端末１２０が接続されている。
【００４９】
検索システム１００はインデクス保持部１０１を有し、複数のインデクス（図１のインデクス１４）を参照できる。
【００５０】
検索システム１００、インデクス構築システム１０２はそれぞれ記憶媒体１０９、１１０、あるいはネットワーク１２１を用いてインストールされる。
【００５１】
文書ファイリングシステム１０３は全体として単一のアクセス権限が付与されていても良いし（例えば１０３Ａ）、文書ファイリングシステム１０３の個々の文書あるいはディレクトリにアクセス権限が個別に付与されても良い。文書ファイリングシステム１０３Ａとインデクス構築システム１０２Ａは例えば同一のアクセス権限を有し、対応するセキュリティドメイン２００をなす。他のファイリングシステム１０３は種々のアクセス権限の文書等を含み、それぞれ、アクセス権限に対応するインデクス構築システム１０２によりインデクスレコードを生成できるようになっている。
【００５２】
インデクス構築システム１０２は対応するアクセス権限で各文書ファイリングシステム１０３の文書をアクセスしていき、文書ファイリングシステム１０３はディレクトリサーバ１０４を用いて権限を認証し、アクセスの許否を決定する。インデクス構築システム１０２は、対応するアクセス権限の文書を参照してインデクスレコードを生成して、インデクス保持部１０１の対応するインデクス１４を構築し、あるいは対応するインデクスにレコードを挿入する。また、必要に応じ、インデクスのレコードの削除等の処理を行う。
【００５３】
このようにして、インデクス保持部１０１にアクセス権限ごとにインデクス１４が構築されその後管理される。
【００５４】
検索ユーザはクライアント端末１２０を用いてウェブサーバ１０５およびアプリケーションサーバ１０６（あるいはＣＧＩプログラム等を用いて）を介して検索システム１００に検索要求を行う。検索システム１００は、ディレクトリサーバ１０４を用いて検索ユーザのアクセス権限を調べ、これに応じて対応するインデクス１４を参照して検索ユーザに許容されるヒットレコードのみをリストとしてクライアント端末１２０に返す。検索ユーザは、リストから選択した文書を所定の文書ファイリングシステム１０３から取り出すことができる。
【００５５】
なお、インデクス保持部１０１をインデクス構築システム１０２サイトに分散して配置し、検索システム１００がこれを参照するようにしても良い。また、インデクス構築システム１０２サイトに検索システム１００およびインデクス保持部１０１を分散配置してもよい。この場合、クライアント端末１２０の検索要求を代行して分散配置された複数の検索システム１００にディスパッチする。
【００５６】
［実施例２］
つぎにこの発明の実施例２について説明する。この実施例は複数のインデクスを用いた場合でも、ランキングスコアの小さなヒットレコードが表示リストに含まれないようにするものである。
【００５７】
図４は、この実施例の検索装置を模式的に示しており、この図において、検索部１０は、インデクス別ヒットレコード数生成部３０、インデクス選択部３１、ヒットレコード併合部３２、ヒットレコード一時記憶部３３、表示レコード出力部３４等を含んで構成されている。
【００５８】
検索ユーザ端末１５は、検索部１０に検索要求を送る。検索要求には検索キーと共に表示レコードの数を含ませることができる。インデクス別ヒットレコード数生成部３０は、検索キーに対してインデクス１４ごとにヒットレコード数を算出する。これについては後に説明する。インデクス選択部３１は、指定された表示レコード数あるいはデフォルトの表示レコード数に基づいてインデクス記憶装置１３から取り出すヒットレコード数を決定する。これを閾値と呼ぶ。閾値は、表示レコード数のＮ倍である（Ｎは十分に精度の良い結果を得られるように決められる）。インデクス選択部３１は、最も少ないインデクス数で閾値のヒットレコードを得られるようにインデクスを選択する。種々の態様が可能であるが、例えば、ヒットレコード数が多い順にインデクスを選び、それで閾値に達したら、そのインデクスのみを選ぶ。ヒットレコード数が閾値に達しない場合には、つぎにヒットレコード数が多いインデクスを選び、そのヒットレコード数を、現在のヒットレコード数の総数に累積する。累積値が閾値に達するまで、同様の処理を繰り返し、用いる１または複数のインデクスを確定する。
【００５９】
用いるインデクスが複数の場合にはヒットレコードをヒットレコード併合部３２で併合し、ヒットレコード一時記憶部３３にストアする。用いるインデクスが一個の場合にはヒットレコードをそのままヒットレコード一時記憶部３３にストアする。
【００６０】
ヒットレコード一時記憶部３３のヒットレコードはそこに含まれるランキングスコアに基づいてソートされ、ソート順に表示レコード出力部３４に送られる。表示レコード出力部３４の出力表示レコードリストは検索ユーザ端末１５に返される。
【００６１】
こうして、ヒットレコードの併合処理の回数を少なくすることができる。
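Paragraph 【００５８】 describes how the index selection unit 31 picks indexes in descending order of hit count until the accumulated count reaches a threshold (the display record count multiplied by N). The following is a minimal Python sketch of that greedy selection, not the patent's implementation. The function name `select_indexes`, the dictionary of per-index hit counts, and the default multiplier `n=10` are assumptions made for illustration.

```python
def select_indexes(hit_counts, display_count, n=10):
    """Pick as few indexes as possible whose hit counts reach the threshold.

    hit_counts: dict mapping a (hypothetical) index id to its hit count.
    The threshold is display_count * n; n=10 is an illustrative choice.
    """
    threshold = display_count * n
    chosen, total = [], 0
    # Visit indexes from the largest hit count to the smallest.
    for index_id, count in sorted(hit_counts.items(),
                                  key=lambda item: item[1], reverse=True):
        chosen.append(index_id)
        total += count
        if total >= threshold:
            break  # enough hit records; remaining indexes are not needed
    return chosen
```

If no combination reaches the threshold, the sketch simply returns every index, which corresponds to merging all available results.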
【００６２】
つぎに、インデクス別ヒットレコード数生成部３０で行うヒットレコード数算出処理について説明する。もちろん、キーごとにヒットレコード数を予め求めて表を作成し、このような表を表引きしても良い。
【００６３】
インデクス記憶装置１３のインデクス１４は、例えば、図５に示すように、管理ノード、中間ノードおよびリーフノードにより記述されるＢ＋ツリー構造である。管理ノードは、図６に示すように、複数のＢ＋ツリーを管理する。各Ｂ＋ツリーはスキーマによりキー、バリュー等のバイト数等が規定される。管理ノードにより、検索キーが対応するＢ＋ツリーに振り分けられる。中間ノードは、図７に示すように、分岐を制御するキーと分岐する下位ノード（サブツリー）が規定される。また、この実施例に特有の構成として、各下位ノードについてそのサブツリーのリーフノードに属するレコードの数を件数管理情報として保持している。リーフノードは図８に示すようにキーとバリュー（例えば文書ＩＤ）との複数の対を含んでいる。リーフノードは、中間ノードにおいて分岐を制御するキーについても、そのキーとバリューとの対を含んでいる。また、つぎのリーフノードへのポインタも含まれ、いわゆる水平検索を行える。
【００６４】
検索に際しては、図９に示すように、管理ノードによりＢ＋ツリーが決定され、そのルートノードから中間ノードに沿って垂直検索が行われ、リーフノードに到達した後、水平検索が行われる。
【００６５】
ここで、図１０を用いて、中間ノードの件数管理情報について説明する。図１０において、中間ノードは、第１段目の中間ノード（管理ノードのつぎのノード）を例にすると、キー「ＬＥＦＴ」、Ｋ（０）_１、Ｋ（０）_２、Ｋ（０）_３、・・・により下位ノード（サブツリー）に分岐する。キー「ＬＥＦＴ」の直下にはレコードは格納されない。「Ｋ（０）」は第１段目のキーであることを示す。第ｎ段目の中間ノードのキーは同様に「Ｋ（ｎ−１）」で表す。「ＬＥＦＴ」からＫ（０）_１までの範囲のキーが分岐する下位ノード（サブツリー）のリーフノードに格納されるレコードの数Ｒ（０）_０を、下位ノード０の件数管理情報にストアする。Ｋ（０）_１からＫ（０）_２までの範囲のキーが分岐する下位ノード（サブツリー）のリーフノードに格納されるレコードの数ｒ（０）_１を求め、これにその前の下位ノードのレコードの数（この場合Ｒ（０）_０）を足して、Ｒ（０）_１＝Ｒ（０）_０＋ｒ（０）_１を得、下位ノード１の件数管理情報に格納する。キーＫ（０）_ＮからキーＫ（０）_Ｎ＋１までの範囲のキーが分岐する下位ノードＮのリーフノードに格納されるレコードの数ｒ（０）_Ｎを求め、これにその直前の下位ノードＮ−１の件数管理情報（Ｒ（０）_Ｎ−１）を足して、下位ノードＮの件数管理情報Ｒ（０）_Ｎ＝Ｒ（０）_Ｎ−１＋ｒ（０）_Ｎを得る。同様に最後の下位ノードまで、件数管理情報を取得して管理する。
【００６６】
開始キーおよび終了キーを用いて検索するときに、中間ノードの件数管理情報を用いてリーフノードに到達した時点の順位を求めることができる。すなわち、順次辿っていく中間ノードにおいて、つぎに辿る下位の中間ノードを決定する。このとき、その左側の中間ノードの件数管理情報を求める。つぎに辿る中間ノードでも同様にし、この操作をリーフノードに至るまで繰り返す。例えば、第１段から第Ｎ段のそれぞれのキーＫ（０）_Ａ、Ｋ（１）_Ｂ、Ｋ（２）_Ｃ、・・・、Ｋ（Ｎ−１）_Ｄを辿っていくとすると、中間ノード０のキー（下位のノードまたはサブツリー。以下同様）Ｋ（０）_Ａ−１の件数管理情報Ｒ（０）_Ａ−１、中間ノード１のキーＫ（１）_Ｂ−１の件数管理情報Ｒ（１）_Ｂ−１、中間ノード２のキーＫ（２）_Ｃ−１の件数管理情報Ｒ（２）_Ｃ−１、・・・中間ノード（Ｎ−１）のキーＫ（Ｎ−１）_Ｄ−１の件数管理情報Ｒ（Ｎ−１）_Ｄ−１を累積してリーフノードに到達したときレコードの順位を得ることができる。
【００６７】
まず、開始キーに基づいて中間ノードを辿り、対応する件数管理情報を累積してリーフノードに到達したときのレコードの順位を求め、さらにリーフノードを水平検索する。開始キーを含むレコードに到達したときにそのレコードに至るまでの水平検索時のレコード数を求め、これをリーフノードに到達したときのレコードの順位に足して開始キーを含むレコード（開始キーを含むレコードがない場合には、検索範囲に含まれて開始キーに最も近いキーを含むレコード）の順位（Ｎｓｔａｒｔ）を求める。
【００６８】
つぎに、終了キーに基づいて中間ノードを辿り、対応する件数管理情報を累積してリーフノードに到達したときのレコードの順位を求め、さらにリーフノードを水平検索する。終了キーを含むレコードに到達したときにそのレコードに至るまでの水平検索時のレコード数を求め、これをリーフノードに到達したときのレコードの順位に足して終了キーを含むレコード（終了キーを含むレコードがない場合には、検索範囲に含まれて終了キーに最も近いキーを含むレコード）の順位（Ｎｅｎｄ）を求める。
【００６９】
インデクス別ヒットレコード数生成部３０は、ＮｓｔａｒｔおよびＮｅｎｄに基づいて検索範囲に含まれるキーを持つレコードの総数を算出する。終了キーを含むレコードが有る場合には、そのレコードの総数はＮｅｎｄ−Ｎｓｔａｒｔ＋１であり、終了キーを含むレコードがない場合には、そのレコードの総数はＮｅｎｄ−Ｎｓｔａｒｔである。
【００７０】
図１１は、インデクス別ヒットレコード数生成部３０における各インデクスごとのヒットレコード数算出処理を示している。図１１においては、語および文書ＩＤを用いて範囲検索における検索範囲のレコード（文書）の総数を算出する。総数の算出の処理は以下のとおりである。なお、検索者が語を入力すると、文書ＩＤの範囲が自動的に０ｘ３０００（１６進数表示）から０ｘ３ｆｆｆとされる。
【００７１】
［ステップＳ１０］：検索範囲を受け取る。
［ステップＳ１１］：Ｂ＋ツリーを決定する。
［ステップＳ１２］：開始キーを検索キーとする。
［ステップＳ１３］：検索キーが該当するキーを、選択する
［ステップＳ１４］：順位算出ルーチンを実施する。図９参照。
［ステップＳ１５］：順位算出ルーチンで取得した順位をＮｓｔａｒｔとする。
［ステップＳ１６］：終了キーを検索キーとする。
［ステップＳ１７］：順位算出ルーチンを実施する。
［ステップＳ１８］：順位算出ルーチンで取得した順位をＮｅｎｄとする。
［ステップＳ１９］：終了キーに該当するレコードがあるか。あればステップＳ２０ヘ進み、なければステップＳ２１へ進む。
［ステップＳ２０］：検索範囲の件数をＮｅｎｄ−Ｎｓｔａｒｔ＋１で算出する。
［ステップＳ２１］：検索範囲の件数をＮｅｎｄ−Ｎｓｔａｒｔで算出する。
【００７２】
順位算出ルーチンはつぎのとおりである。
【００７３】
［ステップＳ４０］：順位を０にリセットする。
［ステップＳ４１］：中間ノードにおいて検索キーが該当するキーの左のキーの件数管理情報を順位に累積する。
［ステップＳ４２］：検索キーが該当するキーの下位のノードに進む。
［ステップＳ４３］：ノードが中間ノードかリーフノードかを判別する。中間ノードであれば、ステップＳ４１に戻る。リーフノードであればステップＳ４４に進む。
［ステップＳ４４］：リーフノードに到達したときのレコードから検索キーに対応するキーのレコードまで水平検索で辿る。
［ステップＳ４５］：水平検索で辿ったレコードの数を上述の順位に累積する。
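Steps S10 to S21 and S40 to S45 count the records in a key range without visiting every record, by accumulating the count management information of the left siblings on the way down and then scanning within the leaf. The following Python sketch illustrates the idea on a deliberately simplified structure: a single intermediate node over sorted leaves, rather than a full multi-level B+ tree. The class name `CountedNode`, its methods, and the assumptions of unique keys and non-empty leaves are illustrative, not taken from the patent. Here `rank(key)` is defined as the number of records whose key is smaller than `key`.

```python
from bisect import bisect_left, bisect_right
from itertools import accumulate


class CountedNode:
    """One intermediate node over sorted, non-empty leaves with unique keys."""

    def __init__(self, leaves):
        self.leaves = leaves
        # Branch keys: the first key of every leaf except the leftmost one.
        self.separators = [leaf[0] for leaf in leaves[1:]]
        # Count management information: R_N = R_(N-1) + r_N.
        self.cumulative = list(accumulate(len(leaf) for leaf in leaves))

    def _child(self, key):
        # Vertical step: which lower node does this key branch to?
        return bisect_right(self.separators, key)

    def rank(self, key):
        child = self._child(key)
        # Records held by all lower nodes to the left (S41).
        left_total = self.cumulative[child - 1] if child > 0 else 0
        # Horizontal step inside the leaf (S44, S45).
        return left_total + bisect_left(self.leaves[child], key)

    def count_range(self, start, end):
        n_start, n_end = self.rank(start), self.rank(end)
        end_present = end in self.leaves[self._child(end)]
        # S20: Nend - Nstart + 1 if the end key exists, S21: otherwise Nend - Nstart.
        return n_end - n_start + (1 if end_present else 0)
```

With leaves `[[1, 3, 5], [7, 9], [11, 13, 15]]`, the cumulative counts are `[3, 5, 8]`, so `rank(13)` adds the left sibling total 5 to the in-leaf offset 1.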
【００７４】
以上で実施例２の説明を終了する。
【００７５】
［実施例３］
つぎにこの発明の実施例３について説明する。この実施例は、インデクスを用いた検索処理を行う検索装置本体と検索装置本体の検索結果を連結等する検索管理装置とをネットワークを介して接続して検索システムを構築するものである。
【００７６】
図１３は、実施例３の検索システムを全体として示しており、図１３において図３と対応する箇所には対応する符号を付した。図１３において、検索システム１００は、検索管理サーバ３００と複数の検索サーバ３０１とを有して構成されている。検索サーバ３０１はそれぞれ対応するインデクス保持部３０２を有し、例えばこのインデクス保持部３０２に格納されているＢ＋ツリーの情報（実施例１、２と同様）を用いて検索を行う。検索管理サーバ３００は、クライアント１２０からの検索要求を受取り、アクセス制御等を行うとともに、検索要求に対して許容された検索サーバ３０１に検索要求をディスパッチする。検索管理サーバ３００は、検索要求をディスパッチした検索サーバ３０１から検索結果を受取り、出力上限値（例えばユーザが指定したもの。あるいはシステム上のデフォルト値）だけヒットレコードを取り出して検索結果としてクライアント端末１２０に返す。
【００７７】
図１４は、実施例３の検索システム１００における処理を示しており、その詳細は以下のとおりである。なお、これらの処理は検索管理サーバ３００および検索サーバ３０１で実行されるものであり、例えば記録媒体３０３、３０４に記憶されたプログラムを検索管理サーバ３００や検索サーバ３０１にインストールして実現できる。
【００７８】
［ステップＳ５０］：各検索サーバ３０１でインデクス保持部３０２のインデクスを用いて検索を行う。なお、各検索サーバ３０１は、出力制限値（ユーザに出力するレコードの数の上限）の例えば１０倍のレコード数を上限としてレコードを取り出す（上限値を超えたら検索を終了する）。このレコードは例えば図１５に示すようなキーとバリューとを含むものであり、キーは語キー（キーワード等の文書の属性）および文書ＩＤからなる。バリューは各レコードの検索スコアを算出するためのオカレンスデータであり、例えば、更新時刻、出現頻度、出現分布のデータからなる。オカレンスデータからスコアを計算し、このスコアに基づいてヒットレコードをソートする。
【００７９】
［ステップＳ５１］：検索管理サーバ３００は、検索サーバ３０１からソート済みのヒットレコードを受け取る。受け取るレコードはスコアを直接に含み、オカレンスデータは基本的には不要である。
［ステップＳ５２］：ソート済みのヒットレコード数が多い順に、検索サーバ３０１からのヒットレコードを連結する。
［ステップＳ５３］：連結したヒットレコードの総数が累積上限値、例えば、出力上限値の１０倍に達したかどうかを判別する。累積上限値に達しない場合にはステップＳ５２に戻り処理を繰り返す。達した場合にはステップＳ５４へ進む。
［ステップＳ５４］：連結したヒットレコードをスコアで再度ソートする。
［ステップＳ５５］：出力上限値だけ上位からヒットレコードを出力する。
【００８０】
各レコードのスコアは例えばつぎのように算出される。
【数１】
｛Ａ１×（出現密度）＋Ａ２×（更新日−基準日）｝×（出現分布情報で決定される値。例えば１〜２の値）
Ａ１、Ａ２は係数である。
【００８１】
出現密度は、キーワードが文書中に含まれる割合であり、例えば、定数×出現数／文書サイズで求められる。出現密度が大きいほどスコアが大きくなる。
【００８２】
更新日は文書を更新した日付であり、原則として基準日は検索を行っている日付から「２０４８」（約４年）を引いたものである。「日付」は例えば０〜３２７６７の整数値であり、およそ、１９７０年１月１日から２０３８年１月１９日をカバーする。１日は１．３に相当する。通常、更新日は数カ月から数年程度前の日付である。更新日をそのまま用いると、約３０年分使用しない期間ができてしまうので、ダイナミックレンジが小さくなってしまう。そのため検索実行日から４年前（約２０４８）を基準日としている（更新日−基準日＝更新日−検索実行日＋２０４８）。更新日が新しいほどスコアは大きくなる。
【００８３】
出現分布情報は、文書中の文の列に語キーがどのように分布するかを示すものであり、文の列を３２ビットであらわし、当該文の位置に語キーが出現すれば「１」を立てる。文の数だけビットを設ければより正確であるが、この例では、語キーが出現する文の番号の３２の剰余が示すビット位置に「１」を立てている。複数の語キーを用いたときに３２ビットの出現分布情報のＡＮＤをとり、同一文中に当該複数の語キーが共起するかどうかを表す。ＡＮＤ結果の３２ビットの各値を評価すればより正確であるが、８ビットずつに４つのフラグメントに分け、１つのフラグメント中に「１」があれば２５％ずつ増分する。４つのフラグメントのすべてに「１」があれば２倍となり、すべてのフラグメントに「１」がなければ１倍のままである。
【００８４】
また、スコアが同一の値にならないように、スコアに文書サイズの下位数ビットを連結する。
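The score formula in 【００８０】 and the occurrence distribution rule in 【００８３】 can be illustrated with a short Python sketch. It is only an approximation under stated assumptions: the function names, the default coefficients `a1` and `a2`, and the use of plain numbers for the density and date values are hypothetical. The tie-breaking step of 【００８４】 and the fixed-width integer packing of the worked example are not reproduced.

```python
def distribution_bits(sentence_numbers):
    """32-bit map: bit (n mod 32) is set if the word key occurs in sentence n."""
    bits = 0
    for n in sentence_numbers:
        bits |= 1 << (n % 32)
    return bits


def distribution_factor(bitmaps):
    """AND the per-keyword bitmaps, then add 25% per non-zero 8-bit fragment."""
    common = 0xFFFFFFFF
    for bitmap in bitmaps:
        common &= bitmap
    fragments = [(common >> shift) & 0xFF for shift in (0, 8, 16, 24)]
    return 1.0 + 0.25 * sum(1 for fragment in fragments if fragment)


def score(density, update_day, search_day, bitmaps, a1=1.0, a2=1.0):
    """{A1 * density + A2 * (update day - base day)} * distribution factor."""
    base_day = search_day - 2048  # base day is about four years before the search
    return (a1 * density + a2 * (update_day - base_day)) * distribution_factor(bitmaps)
```

The factor ranges from 1.0 (no fragment shows co-occurrence) to 2.0 (all four fragments do), matching the "1 to 2" range stated for the distribution term.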
【００８５】
図１６は、スコア計算の一例を示している。この例では、「コピー」と「富士ゼロックス株式会社」（「富士ゼロックス」は商標である）のＯＲ検索を行って、文書Ａ、Ｂ、Ｃがヒットした例である。検索日は「２００２年８月１日」である。
【００８６】
文書Ａのスコアはつぎのとおりである。すなわち、実際の出現密度の和が「０ｘ０９＋０ｘ１３＝０ｘ１Ｃ」（０ｘは１６進を表す）であり、文書サイズと合わせて「０ｘ１ＣＢ８」である。更新日の寄与を合わせて、「０ｘ３ＣＢ８」となり、出現分布で１．７５倍になり、「０ｘ６Ａ４７＝２７２０７」がスコアとなる。
【００８７】
文書Ｂのスコアはつぎのとおりである。出現密度と文書サイズから同様に「０ｘ１Ｆ４０」となる。「富士ゼロックス株式会社」からのオカレンスからは得られないので、デフォルト値の「０ｘ１８００」が用いられ、合わせて「０ｘ３７４０」となり、出現分布により２倍され、「０ｘ６Ｅ８０＝２８２８８」がスコアとなる。
【００８８】
文書Ｃのスコアはつぎのとおりである。実際の出現密度の和が「０ｘ１Ｄ＋０ｘ００＝０ｘ１Ｄ」であり、文書サイズと合わせて「０ｘ１Ｄ８０」である。更新日の寄与を合わせて「０ｘ１７Ｅ３」となる。出現分布により２倍され、「０ｘ２ＦＣ６＝１２２３０」がスコアとなる。
【００８９】
以上の結果、文書Ｂ、Ａ、Ｃの順にソートされる。
【００９０】
以上で実施例３の説明を終了する。この実施例によれば、スコア計算やソートを分散させて実行するため、応答性を高くでき、スケーラビリティもある。また、所定の上限値を超えるヒットレコードは検索管理サーバへ送らないので、通信コストが減少する。
【００９１】
なお、図１３では、検索管理サーバと検索サーバとを別々に構成し、ネットワーク（ＬＡＮやＷＡＮ）で接続したが、図３に示すように、検索管理サーバの機能と検索サーバの機能を一体化した場合にも、インデクスごとにスコアでソートを行い、これを連結し、その後、再度スコアでソートして検索結果とすることもできることはもちろんである。
【００９２】
なお、この発明は上述の実施例に限定されるものではなくその趣旨を逸脱しない範囲で種々変更が可能である。例えば、実施例２の検索装置を図３に示すイントラネット環境に適用できることはもちろんであり、その際記録媒体等を用いて同様のシステムをコンピュータシステムにインストールして構築することもできる。
【００９３】
【発明の効果】
以上説明したように、この発明によれば、アクセス権限に配慮して統合検索のインデクスを構築することができ、また、ヒットレコードごとにアクセス権限を検証することを回避し、高速に検索を行うことができる。また複数のインデクスを用いた場合でも、ヒットレコードの併合回数を減らし、高速に検索を行える。
【図面の簡単な説明】
【図１】この発明の実施例１の検索装置を模式的に示すブロック図である。
【図２】上述の実施例１のインデクス構築装置を模式的に示すブロック図である。
【図３】上述の実施例１をイントラネット環境に適用した構成例を説明する図である。
【図４】この発明の実施例２の検索装置を模式的に示すブロック図である。
【図５】上述実施例２のインデクス別ヒットレコード数生成部における各インデクスのヒットレコード算出処理を説明するための、Ｂ＋ツリー構造の説明図である。
【図６】図５のＢ＋ツリー構造の管理ノードを説明する図である。
【図７】図５のＢ＋ツリー構造の中間ノードを説明する図である。
【図８】図５のＢ＋ツリー構造のリーフノードを説明する図である。
【図９】図５のＢ＋ツリー構造の検索を説明する図である。
【図１０】図５のＢ＋ツリー構造に含まれる件数管理情報を説明する図である。
【図１１】図５のＢ＋ツリー構造を用いてインデクスのヒットレコードを算出する処理を説明するフローチャートである。
【図１２】図１１の順位算出ルーチンを説明するフローチャートである。
【図１３】この発明の実施例３の構成を説明する図である。
【図１４】上述実施例３の動作を説明するフローチャートである。
【図１５】上述実施例３におけるレコードのフォーマットを説明する図である。
【図１６】上述実施例のスコア計算の例を説明する図である。
【符号の説明】
１０検索部
１１アクセス制御部
１２ユーザ権限記憶部
１３インデクス記憶装置
１４インデクス
１５検索ユーザ端末
２０プロセス起動部
２１インデクスレコード管理部
２２インデクスレコード管理プロセス
２３アクセス制御部
２４プロセス権限記憶部
３０インデクス別ヒットレコード数生成部
３１インデクス選択部
３２ヒットレコード併合部
３３ヒットレコード一時記憶部
３４表示レコード出力部
１００検索システム
１０１インデクス保持部
１０２インデクス構築システム
１０３文書ファイリングシステム
１０４ディレクトリサーバ
１０５ウェブサーバ
１０６アプリケーションサーバ
１０７ルータ
１０８ＬＡＮ
１０９、１１０記憶媒体
１２０クライアント端末
１２１ネットワーク
２００セキュリティドメイン
３００検索管理サーバ
３０１検索サーバ
３０２インデクス保持部

[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to an integrated search database that extracts documents from a plurality of document filings, builds an index for searching, and centrally manages and searches the attributes of the documents in those filings and the document locations indicated by URLs or the like. In particular, it enables management and searching that take into account the individual security domain of each document filing.
[0002]
[Prior art]
Conventionally, in a distributed environment, databases have been built that construct a logically single index across multiple document filings that are managed and published independently, so that the attributes of the documents in those filings and the document locations indicated by URLs or the like can be centrally managed and searched with a single search operation on the index. Such a search is called an integrated search.
[0003]
When an index for an integrated search is constructed, documents are collected from a plurality of document filings by a collection operation performed independently of the search operation. In the collection operation, a user or application holding a predetermined access right identifies and retrieves documents from the target document filing over a predetermined network protocol, either by specifying document names or by running a search. The retrieved documents are analyzed, the attributes and keywords needed for the index are generated, and the index is built.
[0004]
Patent documents related to the present invention include one that discloses analyzing text data stored in each of a plurality of databases, extracting the necessary items, indexing the extraction results, and accessing the plurality of databases with a single index (Patent Document 1), and one that discloses acquiring predetermined information and authority information from each of a plurality of files stored in a storage device and building an index from both, so that a search is performed only within the range permitted by the user's authority (Patent Document 2).
[Patent Document 1]
JP 2000-163445 A
[Patent Document 2]
JP 2001-344245 A
[0005]
[Problems to be solved by the invention]
Integrated search is widely used in environments such as the Internet, where the only choice is whether or not to publish, but providing it on a corporate network raises the following problems.
[0006]
Generally, documents or document filings published within a company are published with a designated disclosure scope. For example, a document marked "department confidential" is disclosed only to members of the department, and Web servers and the like restrict which clients may connect by using the network domain set up for each department.
[0007]
If multiple documents collected in a specific unit specify the same disclosure range, the documents belong to the same security domain. The disclosure range determined for each document is called a security domain. In a company, a plurality of security domains exist, and documents to be collected belong to any one of the security domains.
[0008]
To provide the integrated search described above, a collection operation is required, and collection requires access rights to the security domain to which each target document or document filing belongs. In other words, the collection operation must be performed by an entity holding privileged access rights to every document and document filing covered by the integrated search.
[0009]
However, within a company, granting such broad privileged access rights to a system administrator or a software system is normally a security problem (Problem 1).
[0010]
Furthermore, because a logically single index is built, the index must be able to manage the security domain of each document. Otherwise, documents from security domains to which the user has no access rights would appear in the search results, making the search insecure.
[0011]
In addition, the access rights of the searching user must be compared against the security domain determined for each document, which complicates the search processing and lowers search performance. An integrated search within a company may cover from one million to tens of millions of documents, and it is difficult to perform a search of that scale at high speed. In that case, the response time experienced by the user degrades (Problem 2).
[0012]
The present invention has been made in view of the above circumstances, and its object is to provide search and index construction technology that enables high-speed searching without causing security problems, even when an integrated search is performed across document filings in various security domains.
[0013]
[Means for Solving the Problems]
According to the present invention, the configuration described in the claims is adopted to achieve the above object. Before the invention is described in detail, a supplementary explanation of the claims is given here.
[0014]
To solve the problems described above, in the basic configuration of the present invention, for each security domain to be collected, an administrator who belongs to that security domain or who has access rights to it collects the documents and builds the index. Specifically, instead of giving one collection program rights that are valid in every security domain, a collection program granted only the rights needed to collect from a given security domain is run in each security domain.
[0015]
That is, one index is associated with one security domain. These indexes share the same data model, so that logically they form one index that is partitioned internally by security domain.
[0016]
The indexes configured in this way may either be managed centrally within a specific computer system, or be distributed, with a managing department set up for each security domain.
[0017]
In either configuration, the administrator assigned to each index is permitted only limited viewing, update, backup, and restore operations on that specific index, which resolves Problem 1.
[0018]
Next, Problem 2 will be described.
[0019]
In general, access control is modeled in terms of an access subject performing an access operation on an access target (object). The method known as an access control list is realized by attaching to each access target, as an attribute, a list of pairs of an access subject and an access operation permitted to that subject. When a subject attempts an operation on a target, the access control list attached to the target is scanned or searched to check whether it contains that subject-operation pair; if it does, the access is permitted.
[0020]
In the principle configuration of the present invention, the access target is an index constructed for each security domain. The access operation is a search. The access subject is a user who performs a search. The access control list is composed of a list of users permitted to search as an index attribute, and is centrally managed by an access control database.
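The access control model of paragraphs [0019] and [0020], in which the protected object is a whole index rather than an individual document, can be sketched as follows. This is a minimal illustration in Python, not the patent's implementation. The class name `AccessControlDB`, its methods, and the sample index and user names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AccessControlDB:
    """Central store of per-index access control lists.

    Each index (one per security domain) carries, as its attribute,
    the set of users who are allowed to search it.
    """
    acl: dict = field(default_factory=dict)  # index id -> set of user ids

    def grant(self, index_id, user_id):
        self.acl.setdefault(index_id, set()).add(user_id)

    def may_search(self, index_id, user_id):
        return user_id in self.acl.get(index_id, set())

    def searchable_indexes(self, user_id):
        # One lookup per index, instead of one check per document.
        return sorted(i for i, users in self.acl.items() if user_id in users)
```

Because the check is made once per index, the number of access decisions grows with the number of security domains rather than with the number of documents.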
[0021]
A method for implementing the search operation of the present invention will be described.
[0022]
The list of indexes (security domains) that the searching user is permitted to search is retrieved from the access control database. A search request is issued to each computer holding one of those indexes, search processing is performed, and search results are obtained. When multiple indexes are searched, multiple search results are obtained from multiple computers. After waiting for a predetermined timeout period to elapse, the search results are merged into an integrated search result.
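The fan-out, timeout, and merge behavior described in paragraph [0022] might look like the following Python sketch, which uses the standard `concurrent.futures` module (Python 3.9 or later, because of `cancel_futures`). The function name `federated_search`, the representation of each remote index as a callable returning `(score, doc_id)` pairs, and the default timeout are assumptions for illustration. Threads stand in for the separate computers that hold the indexes.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def federated_search(query, searchers, allowed_ids, timeout_s=1.0):
    """Query every permitted index in parallel and merge what arrives in time.

    searchers: dict mapping index id -> callable(query) -> list of (score, doc_id).
    allowed_ids: index ids the user may search (from the access control database).
    """
    if not allowed_ids:
        return []
    pool = ThreadPoolExecutor(max_workers=len(allowed_ids))
    futures = [pool.submit(searchers[index_id], query) for index_id in allowed_ids]
    # Wait until all searches finish or the timeout elapses, whichever is first.
    done, _late = wait(futures, timeout=timeout_s)
    pool.shutdown(wait=False, cancel_futures=True)
    merged = []
    for future in done:
        if future.exception() is None:  # skip indexes whose search failed
            merged.extend(future.result())
    return sorted(merged, reverse=True)  # highest score first
```

Results from indexes that miss the timeout are simply left out of the merged list in this sketch; the source does not say how late results should be handled.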
[0023]
According to the present invention, the documents stored in one index all belong to the same security domain, so for a search request from a user who can access that security domain there is no need to check access rights for each individual document. On the other hand, forming an integrated search result requires merging, which incurs a corresponding processing cost.
[0024]
In general, when the documents belonging to security domains accessible to the searching user are few relative to the total number of collected documents, the number of documents whose access rights would need to be checked individually is reduced by pruning at the granularity of whole indexes, so the search can be completed in a short time even after allowing for the cost of merging.
[0025]
For example, if the number of accessible documents is less than half, the cost of merging the search results can be expected to be less than the cost of checking the access rights for all documents.
[0026]
Next, another principle configuration of the present invention will be described.
[0027]
Consider the case where, rather than displaying every search result to the searching user, only the documents with high ranking scores from a predetermined ranking calculation are displayed, up to a predetermined display limit.
[0028]
So that, from the user's point of view, the results do not include documents with unduly low ranking scores, the display limit multiplied by a predetermined factor is taken as the requested count. That many search results are retrieved, the ranking calculation is performed, the results are sorted in descending order, and only the top results up to the display limit are displayed.
[0029]
When a search request is issued to the index defined for each security domain, the requested count is specified, and the response returns search results up to the display limit together with the hit count.
[0030]
When search results from multiple indexes are merged, if the hit count from any one index is equal to or greater than the requested count, the search results from that index are adopted as the integrated search result (without merging them with the results from other indexes).
[0031]
If the search results from that index fall short of the requested count, the search results of the other indexes are merged in until the requested count is reached, the merged results are sorted in descending order of ranking score, and the top results up to the display count are displayed.
[0032]
If the requested count is not reached even after merging the search results of all indexes, the results of all indexes are merged and sorted in descending order of ranking score, and the top results up to the display count are displayed.
[0033]
With this configuration, when the integrated results are numerous relative to the requested count, and especially when the results from a single index are numerous relative to the requested count, the merge processing is reduced or skipped entirely, so the processing time is shortened.
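The merge policy of paragraphs [0028] to [0032] can be sketched in Python as follows. This is an illustration under assumptions, not the patented procedure: the function name `integrate`, the `(score, doc_id)` tuple format, the default factor of 10, and the choice to visit indexes in descending order of hit count are all introduced here.

```python
def integrate(results_by_index, display_limit, factor=10):
    """Merge per-index results only as far as needed, then rank and truncate.

    results_by_index: dict mapping index id -> list of (score, doc_id).
    The requested count is display_limit * factor (factor is illustrative).
    """
    requested = display_limit * factor
    # Start with the index that has the most hits; if it alone reaches the
    # requested count, no other index is merged at all.
    ordered = sorted(results_by_index.values(), key=len, reverse=True)
    merged = []
    for hits in ordered:
        merged.extend(hits)
        if len(merged) >= requested:
            break
    merged.sort(reverse=True)  # descending ranking score
    return merged[:display_limit]
```

When every index together yields fewer hits than the requested count, the loop runs to the end and all results are merged, which corresponds to the case in paragraph [0032].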
[0034]
In yet another basic configuration of the present invention, a search device that searches using multiple indexes sorts the scored hit records obtained from each index by score, index by index; concatenates the sorted hit records according to a predetermined rule; sorts the concatenated hit records again by score; and outputs a predetermined number of the top hit records after the second sort as the search result.
[0035]
In this configuration, the score is calculated and the sort is performed for each index, so the work can be distributed across the indexes, which improves responsiveness and secures scalability. In addition, if an upper limit on the number of hit records processed per index is set when the per-index sorted hit records are concatenated, unnecessary hit records need not be sent to the hit record concatenation unit, which can reduce, for example, communication cost.
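A minimal Python sketch of the sort, concatenate, and re-sort scheme in paragraphs [0034] and [0035] is shown below, using `heapq.nlargest` from the standard library. The function names, the per-index cap parameter, and the `(score, doc_id)` record format are assumptions for illustration. In a distributed deployment the first function would run on each index holder and the second on the coordinating side.

```python
import heapq


def top_hits_for_index(hits, cap):
    """Per-index step: keep at most `cap` records, sorted by descending score."""
    return heapq.nlargest(cap, hits)


def concat_and_resort(per_index_hits, cap, output_limit):
    """Coordinator step: concatenate the capped lists, re-sort, keep the top records."""
    capped = [top_hits_for_index(hits, cap) for hits in per_index_hits]
    concatenated = [record for hits in capped for record in hits]
    return heapq.nlargest(output_limit, concatenated)
```

Capping each list before it is sent is what keeps low-scoring records from being transferred at all, at the price of the cap having to be large enough for the final top records to survive.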
[0036]
The present invention can be realized not only as a device or a system but also as a method. In addition, it goes without saying that a part of such an invention can be configured as software. Also, it goes without saying that a software product used for causing a computer to execute such software is also included in the technical scope of the present invention.
[0037]
The above aspects of the present invention and other aspects of the present invention are set forth in the following claims, and will be described in detail below with reference to embodiments.
[0038]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described.
[0039]
[Example 1]
In the first embodiment, a search is controlled using a plurality of indexes according to the access authority.
[0040]
FIG. 1 schematically illustrates the search device of the first embodiment. In this figure, the search device includes a search unit 10, an access control unit 11, a user authority storage unit 12, and an index storage device 13. The index storage device 13 stores a plurality of indexes 14 (labeled A to N for convenience). The indexes 14 are each given a different level of access authority. Of course, the same access authority may be assigned to several indexes 14, which are then managed as a group with the same access authority. Instead of storing all the indexes 14 in one index storage device 13, a plurality of index storage devices 13 may be provided and the indexes distributed among them. A search request is sent from the search user terminal 15 to the search device of this embodiment, and the search result is returned to the search user terminal 15.
[0041]
The index 14 of the index storage device 13 is constructed and managed by an index construction device (FIG. 2) described later.
[0042]
In this embodiment, when a search request is made from the search user terminal 15, the search unit 10 supplies the user ID of the search user to the access control unit 11, and the access control unit 11 refers to the user authority storage unit 12 to obtain the access authority of the search user. The access control unit 11 then consults, for example, a table defining the relationship between access authorities and the corresponding indexes, and returns to the search unit 10 the identifier (or identifiers) of the index 14 that can be referred to with the access authority of the search user. The search unit 10 retrieves hit records by referring to the permitted index 14 identified by that identifier, and returns them to the search user terminal 15. The hit records may be ordered by ranking score, and only a predetermined number of display records may be returned to the search user terminal 15.
[0043]
In this example, all records hit by referring to the index 14 are accessible, and it is not necessary to individually verify the user's access authority for the hit records.
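The flow of paragraphs [0042] and [0043] can be illustrated with a minimal Python sketch. The user names, access rights, index contents, and the function names `allowed_indexes` and `search` are all hypothetical; only the division of roles between the access control unit 11 and the search unit 10 follows the description.

```python
from typing import Dict, List, Set

# Hypothetical contents of the user authority storage unit 12.
USER_RIGHTS: Dict[str, Set[str]] = {
    "alice": {"public", "engineering"},
    "bob": {"public"},
}

# Hypothetical table relating access rights to index identifiers.
RIGHT_TO_INDEXES: Dict[str, List[str]] = {
    "public": ["A"],
    "engineering": ["B", "C"],
    "executive": ["N"],
}

# Hypothetical indexes 14: identifier -> {keyword -> document IDs}.
INDEXES: Dict[str, Dict[str, List[str]]] = {
    "A": {"copy": ["doc-a1"]},
    "B": {"copy": ["doc-b1", "doc-b2"]},
    "C": {"toner": ["doc-c1"]},
    "N": {"copy": ["doc-n1"]},
}


def allowed_indexes(user_id: str) -> List[str]:
    """Role of the access control unit 11: user ID -> referable index identifiers."""
    rights = USER_RIGHTS.get(user_id, set())
    ids = {idx for right in rights for idx in RIGHT_TO_INDEXES.get(right, [])}
    return sorted(ids)


def search(user_id: str, keyword: str) -> List[str]:
    """Role of the search unit 10: consult only the permitted indexes."""
    hits: List[str] = []
    for index_id in allowed_indexes(user_id):
        hits.extend(INDEXES[index_id].get(keyword, []))
    # No per-record access check is needed: every hit came from a permitted index.
    return hits
```

Because the access decision is made once per index rather than once per record, the cost of access control in this sketch does not grow with the number of hits.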
[0044]
Alternatively, the search unit 10 may issue a reference request for the indexes 14 specified by the search user, or for all the indexes 14, and the access control unit 11 may refer to the user's access authority in the user authority storage unit 12 to determine whether each reference is permitted.
[0045]
Next, the index construction apparatus of this embodiment will be described.
[0046]
FIG. 2 schematically shows the index construction apparatus of this embodiment. In this figure, the index construction apparatus includes a process activation unit 20, an index record management unit 21, an access control unit 23, and a process authority storage unit 24. The process activation unit 20 has access rights set in advance. The process activation unit 20 activates the index record management process 22 of the index record management unit 21 and gives that process the access rights of the process activation unit 20. Alternatively, a user or an administrator may start the index record management process 22 and grant it the access right. The started index record management process 22 accesses the document filing system 103 (see FIG. 3) that holds the documents, and generates index records by referring to the documents permitted by its own access right. Access to the documents of the document filing system 103 is controlled by the access control unit 23 and the process authority storage unit 24. In this way, the index record management process 22 generates index records for documents in the security domain corresponding to (or at or below) its own access right, and constructs or modifies (by insertion and deletion) the index 14 of the corresponding access right in the index storage device 13. Construction and modification of the index 14 are also controlled by the access control unit 23 and the process authority storage unit 24.
[0047]
In this way, the index 14 is constructed and managed for each access right.
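As a rough illustration of building one index per access right, the following Python sketch starts one hypothetical process object per right and lets it index only the documents it may read. The class names, the `required_right` field, and the equality test that stands in for the access check of the access control unit 23 are assumptions made for this example, not details taken from the embodiment.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class Document:
    doc_id: str
    required_right: str   # access right needed to read this document (hypothetical field)
    text: str


@dataclass
class IndexRecordProcess:
    """Stand-in for an index record management process 22 started with one access right."""
    access_right: str
    index: Dict[str, List[str]] = field(default_factory=dict)

    def build(self, filing: Iterable[Document]) -> None:
        for doc in filing:
            # The filing system's access check is reduced to an equality test here.
            if doc.required_right != self.access_right:
                continue
            # One index record per distinct word: word -> document ID.
            for word in sorted(set(doc.text.lower().split())):
                self.index.setdefault(word, []).append(doc.doc_id)


def build_indexes(filing: List[Document], rights: List[str]) -> Dict[str, Dict[str, List[str]]]:
    """Start one process per access right and collect the resulting indexes."""
    indexes: Dict[str, Dict[str, List[str]]] = {}
    for right in rights:
        process = IndexRecordProcess(access_right=right)
        process.build(filing)
        indexes[right] = process.index
    return indexes
```

Each resulting index contains records only for documents readable with its access right, which is what lets the search side skip per-record verification.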
[0048]
FIG. 3 illustrates a configuration example in which the search device and the index construction device according to the first embodiment are implemented in an intranet environment. In FIG. 3, a search system 100, a plurality of index construction systems 102, a plurality of document filing systems 103, a directory server 104, a web server 105, an application server 106, a client terminal 120, and the like are arranged on a LAN 108. A client terminal 120 is also connected to the LAN 108 via a router 107 and a network 121.
[0049]
The search system 100 has an index holding unit 101, and can refer to a plurality of indexes (index 14 in FIG. 1).
[0050]
The search system 100 and the index construction system 102 are installed using the storage media 109 and 110 or the network 121, respectively.
[0051]
The document filing system 103 may be given a single access right as a whole (for example, 103A), or may be individually given an access right to each document or directory of the document filing system 103. The document filing system 103A and the index construction system 102A have, for example, the same access authority and form a corresponding security domain 200. The other filing systems 103 include documents with various access rights and the like, and index records can be generated by the index construction system 102 corresponding to the respective access rights.
[0052]
The index construction system 102 accesses the document of each document filing system 103 with the corresponding access right, and the document filing system 103 authenticates the right using the directory server 104 and determines whether or not the access is permitted. The index construction system 102 generates an index record by referring to the document of the corresponding access right, constructs the corresponding index 14 of the index holding unit 101, or inserts the record into the corresponding index. Further, if necessary, processing such as deletion of an index record is performed.
[0053]
In this way, the index 14 is constructed in the index holding unit 101 for each access right, and is subsequently managed.
[0054]
The search user uses the client terminal 120 to make a search request to the search system 100 via the web server 105 and the application server 106 (or using a CGI program or the like). The search system 100 checks the access authority of the search user using the directory server 104, refers to the indexes 14 corresponding to that authority, and returns only the hit records permitted to the search user to the client terminal 120 as a list. The search user can then retrieve a document selected from the list from the relevant document filing system 103.
[0055]
Note that the index holding unit 101 may be dispersedly arranged at the index construction system 102 site, and the search system 100 may refer to this. Further, the search system 100 and the index holding unit 101 may be distributed and arranged at the index construction system 102 site. In this case, the search request of the client terminal 120 is dispatched to a plurality of distributed search systems 100 on behalf of the client terminal 120.
[0056]
[Example 2]
Next, a second embodiment of the present invention will be described. In this embodiment, even when a plurality of indexes are used, a hit record having a small ranking score is not included in the display list.
[0057]
FIG. 4 schematically shows the search apparatus of this embodiment. In this figure, the search unit 10 includes an index-specific hit record number generation unit 30, an index selection unit 31, a hit record merging unit 32, a hit record temporary storage unit 33, a display record output unit 34, and the like.
[0058]
The search user terminal 15 sends a search request to the search unit 10. The search request can include the number of display records along with the search key. The index-specific hit record number generation unit 30 calculates the number of hit records in each index 14 for the search key, as described later. The index selection unit 31 determines the number of hit records to be retrieved from the index storage device 13 based on the specified number of display records or the default number of display records. This number is called the threshold. The threshold is N times the number of display records (N is chosen so that a sufficiently accurate result is obtained). The index selection unit 31 selects indexes so that the threshold number of hit records can be obtained from the smallest number of indexes. Various approaches are possible. For example, indexes are examined in descending order of their number of hit records; if the first index alone reaches the threshold, only that index is selected. If the threshold is not reached, the index with the next largest number of hit records is selected and its number of hit records is added to the running total. The same process is repeated until the cumulative value reaches the threshold, which determines the one or more indexes to be used.
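The selection procedure described in paragraph [0058] can be sketched in Python as follows. The function name `select_indexes`, the dictionary of per-index hit counts, and the default multiplier `n=10` are assumptions for illustration; the text only says that N is chosen to give a sufficiently accurate result.

```python
from typing import Dict, List


def select_indexes(hit_counts: Dict[str, int], display_count: int, n: int = 10) -> List[str]:
    """Pick indexes in descending order of hit count until the threshold is reached.

    hit_counts    : index identifier -> number of hit records for the search key
    display_count : number of records to display
    n             : multiplier for the threshold (hypothetical default)
    """
    threshold = n * display_count
    selected: List[str] = []
    total = 0
    ranked = sorted(hit_counts.items(), key=lambda item: item[1], reverse=True)
    for index_id, count in ranked:
        if total >= threshold or count == 0:
            break                      # enough records, or nothing left to gain
        selected.append(index_id)
        total += count
    return selected
```

With `{"A": 500, "B": 120, "C": 30}` and ten display records, index A alone exceeds the threshold of 100, so no merge is needed. With `{"A": 60, "B": 50, "C": 30}`, A and B together reach the threshold and only those two are merged.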
[0059]
When a plurality of indexes are used, the hit records are merged by the hit record merging unit 32 and stored in the hit record temporary storage unit 33. If only one index is used, the hit record is stored in the hit record temporary storage unit 33 as it is.
[0060]
The hit records in the hit record temporary storage unit 33 are sorted based on the ranking score included therein, and sent to the display record output unit 34 in the sort order. The output display record list of the display record output unit 34 is returned to the search user terminal 15.
[0061]
Thus, the number of hit record merging processes can be reduced.
[0062]
Next, the hit record number calculation process performed by the index-specific hit record number generation unit 30 will be described. Of course, the number of hit records for each key may instead be obtained in advance and stored in a table, and that table may be consulted.
[0063]
The index 14 of the index storage device 13 has, for example, a B+ tree structure made up of a management node, intermediate nodes, and leaf nodes, as shown in FIG. 5. The management node manages a plurality of B+ trees as shown in FIG. 6. For each B+ tree, the number of bytes of the key, the value, and so on is defined by a schema. The management node routes the search key to the corresponding B+ tree. In an intermediate node, as shown in FIG. 7, keys that control branching and the lower nodes (subtrees) to branch to are defined. Further, as a configuration unique to this embodiment, the number of records belonging to the leaf nodes of the subtree is stored for each lower node as number management information. A leaf node includes a plurality of key-value pairs (the value being, for example, a document ID) as shown in FIG. 8. The leaf nodes also contain the key-value pairs for the keys that control branching at the intermediate nodes. Each leaf node also includes a pointer to the next leaf node, so that a so-called horizontal search can be performed.
[0064]
At the time of the search, as shown in FIG. 9, the B+ tree is determined by the management node, a vertical search is performed from the root node down through the intermediate nodes, and after a leaf node is reached, a horizontal search is performed.
[0065]
Here, the number management information of the intermediate nodes will be described with reference to FIG. 10. In FIG. 10, as an example, the intermediate node of the first stage (the node next to the management node) branches to lower nodes (subtrees) at the keys "LEFT", $K(0)_1$, $K(0)_2$, $K(0)_3$, and so on. No record is stored immediately below the key "LEFT". "K(0)" indicates that the key belongs to the first stage; similarly, a key of the n-th stage intermediate node is written "K(n-1)". The number of records $R(0)_0$ stored in the leaf nodes of the lower node (subtree) to which keys in the range from "LEFT" to $K(0)_1$ branch is stored in the number management information of lower node 0. Next, the number of records $r(0)_1$ stored in the leaf nodes of the lower node (subtree) to which keys in the range from $K(0)_1$ to $K(0)_2$ branch is obtained, the count of the preceding lower node ($R(0)_0$ in this case) is added to give $R(0)_1 = R(0)_0 + r(0)_1$, and the result is stored in the number management information of lower node 1. In general, for the lower node N to which keys in the range from $K(0)_N$ to $K(0)_{N+1}$ branch, the number of records $r(0)_N$ stored in its leaf nodes is added to the number management information $R(0)_{N-1}$ of the immediately preceding lower node N-1 to obtain the number management information of lower node N: $R(0)_N = R(0)_{N-1} + r(0)_N$. The number management information is obtained and managed in the same way up to the last lower node.
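The number management information is a running total of the record counts of the lower nodes of one intermediate node. A minimal Python sketch, with the hypothetical function names `number_management_info` and `records_left_of`, shows both how the totals are formed and how the count of records to the left of a chosen lower node is read back.

```python
from itertools import accumulate
from typing import List


def number_management_info(subtree_counts: List[int]) -> List[int]:
    """Turn per-subtree record counts r_0, r_1, ... into running totals R_0, R_1, ...

    R_N = R_(N-1) + r_N, with R_0 = r_0.
    """
    return list(accumulate(subtree_counts))


def records_left_of(info: List[int], child: int) -> int:
    """Number of records stored in all lower nodes to the left of lower node `child`."""
    return info[child - 1] if child > 0 else 0
```

For subtree counts `[3, 4, 2]` the stored information is `[3, 7, 9]`, so seven records lie to the left of lower node 2.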
[0066]
When a search is performed using a start key and an end key, the rank of a record at the point of reaching a leaf node can be obtained from the number management information of the intermediate nodes. That is, at each intermediate node traced in turn, the lower node to be traced next is determined, and the number management information of the lower node immediately to its left is obtained. This operation is repeated at each subsequent intermediate node until a leaf node is reached. For example, when the keys $K(0)_A$, $K(1)_B$, $K(2)_C$, ..., $K(N-1)_D$ of the first to N-th stages are traced, the number management information $R(0)_{A-1}$ for key $K(0)_{A-1}$ of intermediate node 0 (lower node or subtree; the same applies hereinafter), $R(1)_{B-1}$ for key $K(1)_{B-1}$ of intermediate node 1, $R(2)_{C-1}$ for key $K(2)_{C-1}$ of intermediate node 2, ..., and $R(N-1)_{D-1}$ for key $K(N-1)_{D-1}$ of intermediate node (N-1) are accumulated, giving the rank of the record at the point of reaching the leaf node.
[0067]
First, the intermediate nodes are traced based on the start key, the corresponding number management information is accumulated to obtain the rank of the record at the point of reaching the leaf node, and the leaf nodes are searched horizontally. When the record containing the start key is reached, the number of records passed in the horizontal search up to that record is added to the rank obtained on reaching the leaf node. This gives the rank (Nstart) of the record containing the start key or, if no such record exists, of the record within the search range whose key is closest to the start key.
[0068]
Next, the intermediate nodes are traced based on the end key, the corresponding number management information is accumulated to obtain the rank of the record at the point of reaching the leaf node, and the leaf nodes are searched horizontally. When the record containing the end key is reached, the number of records passed in the horizontal search up to that record is added to the rank obtained on reaching the leaf node. This gives the rank (Nend) of the record containing the end key or, if no such record exists, of the record within the search range whose key is closest to the end key.
[0069]
The index-specific hit record number generation unit 30 calculates the total number of records having keys included in the search range based on Nstart and Nend. If there is a record containing the end key, the total number of records is Nend-Nstart + 1. If there is no record containing the end key, the total number of records is Nend-Nstart.
[0070]
FIG. 11 shows the process by which the index-specific hit record number generation unit 30 calculates the number of hit records for each index. In FIG. 11, the total number of records (documents) in the search range of a range search is calculated using the word and the document ID. When the searcher inputs a word, the range of the document ID is automatically set to 0x3000 through 0x3fff (hexadecimal notation). The process of calculating the total number is as follows.
[0071]
[Step S10]: A search range is received.
[Step S11]: B + tree is determined.
[Step S12]: The start key is used as a search key.
[Step S13]: A key corresponding to the search key is selected.
[Step S14]: The rank calculation routine is performed. See FIG. 12.
[Step S15]: The rank acquired in the rank calculation routine is set to Nstart.
[Step S16]: The end key is used as a search key.
[Step S17]: The rank calculation routine is performed.
[Step S18]: The rank acquired in the rank calculation routine is set to Nend.
[Step S19]: Is there a record corresponding to the end key? If there is, go to step S20, otherwise go to step S21.
[Step S20]: The number of records in the search range is calculated by Nend−Nstart + 1.
[Step S21]: The number of records in the search range is calculated by Nend−Nstart.
[0072]
The rank calculation routine is as follows.
[0073]
[Step S40]: Reset the rank to 0.
[Step S41]: In the intermediate node, the number management information of the key to the left of the key corresponding to the search key is accumulated in order.
[Step S42]: The search key proceeds to a node below the corresponding key.
[Step S43]: It is determined whether the node is an intermediate node or a leaf node. If it is an intermediate node, the process returns to step S41. If it is a leaf node, the process proceeds to step S44.
[Step S44]: Trace from the record at the time of reaching the leaf node to the record of the key corresponding to the search key by horizontal search.
[Step S45]: The number of records traversed in the horizontal search is added to the rank accumulated above.
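Steps S10 to S21 and S40 to S45 can be sketched in Python on a small in-memory tree. This is an illustration under stated assumptions, not the embodiment's data layout: the classes `Leaf` and `Inner`, the helper `make_inner`, integer keys, unique keys, and the convention that a rank is the number of records whose key is smaller than the search key are all hypothetical. The horizontal search is reduced to a scan inside the leaf that the vertical search reaches.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass
from itertools import accumulate
from typing import List, Tuple, Union


@dataclass
class Leaf:
    keys: List[int]            # sorted keys of the records in this leaf


@dataclass
class Inner:
    seps: List[int]            # branching keys; child i+1 holds keys >= seps[i]
    children: List["Node"]
    cum: List[int]             # number management information (running record totals)


Node = Union[Leaf, Inner]


def size(node: Node) -> int:
    return len(node.keys) if isinstance(node, Leaf) else node.cum[-1]


def make_inner(seps: List[int], children: List[Node]) -> Inner:
    return Inner(seps, children, list(accumulate(size(c) for c in children)))


def rank(root: Node, key: int) -> Tuple[int, bool]:
    """Return (number of records with a smaller key, whether `key` itself exists)."""
    order = 0                                   # S40: reset the rank
    node = root
    while isinstance(node, Inner):              # S43: intermediate node or leaf?
        i = bisect_right(node.seps, key)        # lower node to follow
        if i > 0:
            order += node.cum[i - 1]            # S41: add the count left of it
        node = node.children[i]                 # S42: descend
    pos = bisect_left(node.keys, key)           # S44: scan within the leaf
    order += pos                                # S45: add the records passed
    found = pos < len(node.keys) and node.keys[pos] == key
    return order, found


def count_in_range(root: Node, start: int, end: int) -> int:
    """S12 to S21: number of records whose key lies in [start, end]."""
    n_start, _ = rank(root, start)
    n_end, end_found = rank(root, end)
    return n_end - n_start + 1 if end_found else n_end - n_start
```

The count is obtained from two root-to-leaf descents, so its cost depends on the height of the tree rather than on the number of records in the range.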
[0074]
This is the end of the description of the second embodiment.
[0075]
[Example 3]
Next, a third embodiment of the present invention will be described. In this embodiment, a search system is constructed by connecting, via a network, a search device main body that performs a search process using an index and a search management device that links search results of the search device main body.
[0076]
FIG. 13 shows the entire search system according to the third embodiment. In FIG. 13, parts corresponding to those in FIG. 3 are given corresponding reference numerals. In FIG. 13, the search system 100 includes a search management server 300 and a plurality of search servers 301. Each search server 301 has a corresponding index holding unit 302, and performs a search using, for example, B+ tree information (similar to the first and second embodiments) stored in the index holding unit 302. The search management server 300 receives a search request from the client terminal 120, performs access control and the like, and dispatches the search request to the search servers 301 permitted for that request. The search management server 300 receives the search results from the search servers 301 to which it dispatched the request, extracts hit records up to an output upper limit value (for example, a value specified by the user, or a system default), and returns them to the client terminal 120 as the search result.
[0077]
FIG. 14 illustrates a process in the search system 100 according to the third embodiment, and details thereof are as follows. These processes are executed by the search management server 300 and the search server 301. For example, the programs stored in the recording media 303 and 304 can be installed and realized in the search management server 300 and the search server 301.
[0078]
[Step S50]: Each search server 301 performs a search using the index of the index holding unit 302. Each search server 301 retrieves records up to an upper limit of, for example, 10 times the output upper limit value (the upper limit of the number of records to be output to the user); the search ends when the upper limit is exceeded. Each record includes, for example, a key and a value as shown in FIG. 15, and the key includes a word key (a document attribute such as a keyword) and a document ID. The value is occurrence data for calculating the search score of the record, and includes, for example, the update time, the appearance frequency, and the appearance distribution. The search server calculates the score from the occurrence data and sorts the hit records by this score.
[0079]
[Step S51]: The search management server 300 receives the sorted hit records from the search server 301. The record received contains the score directly, and occurrence data is basically unnecessary.
[Step S52]: Hit records from the search servers 301 are linked, starting with the search server that returned the largest number of sorted hit records.
[Step S53]: It is determined whether or not the total number of linked hit records has reached the cumulative upper limit, for example, 10 times the output upper limit. If the cumulative upper limit has not been reached, the process returns to step S52 to repeat the processing. If it has been reached, the process proceeds to step S54.
[Step S54]: The linked hit records are sorted again by score.
[Step S55]: Hit records are output from the top, up to the output upper limit value.
[0080]
The score of each record is calculated, for example, as follows.
(Equation 1)
Score = {A1 × (appearance density) + A2 × (update date − reference date)} × (value determined by the appearance distribution information, for example, a value from 1 to 2)
A1 and A2 are coefficients.
[0081]
The appearance density is a rate at which the keyword is included in the document, and is obtained by, for example, constant × number of appearances / document size. The score increases as the appearance density increases.
[0082]
The update date is the date on which the document was updated. In principle, the reference date is offset from the search date by "2048" (about four years). The "date" is, for example, an integer value from 0 to 32767, covering approximately January 1, 1970 to January 19, 2038. One day corresponds to about 1.3. Typically, the update date is several months to several years in the past. If the update date were used as is, about 30 years of the representable range would go unused and the dynamic range would be reduced. Therefore, the reference date is set to about four years (2048) before the search execution date, so that update date − reference date = update date − search execution date + 2048. The newer the update date, the higher the score.
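The figures in paragraph [0082] (a range of 0 to 32767 spanning January 1, 1970 to January 19, 2038, and roughly 1.3 units per day) are consistent with a 32-bit Unix timestamp shifted right by 16 bits. That reading is an assumption made for this sketch and is not stated in the text; the function names `date_value` and `update_term` are likewise hypothetical.

```python
from datetime import datetime, timezone

REFERENCE_OFFSET = 2048  # about four years in the units below, as stated in the text


def date_value(moment: datetime) -> int:
    """Map a timestamp to 0..32767 (assumed encoding: Unix seconds >> 16)."""
    return int(moment.timestamp()) >> 16


def update_term(update: datetime, search: datetime) -> int:
    """update date - reference date = update date - search execution date + 2048."""
    return date_value(update) - date_value(search) + REFERENCE_OFFSET
```

Under this assumed encoding one unit is 65536 seconds, so one day is about 1.32 units and 2048 units is about 4.25 years. A document updated on the search date contributes 2048, and older documents contribute progressively less.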
[0083]
The appearance distribution information indicates how word keys are distributed over the sequence of sentences in a document. The sentence sequence is represented by 32 bits, and a bit is set to "1" when the word key appears in the sentence at that position. Providing as many bits as there are sentences would characterize the document more precisely, but in this example "1" is set at the bit position given by the remainder of dividing the number of the sentence in which the word key appears by 32. When a plurality of word keys are used, the AND of their 32-bit appearance distribution information is calculated to indicate whether the word keys co-occur in the same sentence. Evaluating each of the 32 bits of the AND result would be more accurate, but here the result is divided into four fragments of eight bits each, and the value is increased by 25% for each fragment that contains a "1". If all four fragments contain a "1", the value is doubled; if no fragment contains a "1", the value is left unchanged.
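The bit handling in paragraph [0083] can be sketched in Python as follows. The function names `distribution_bits` and `distribution_factor` are hypothetical, and the sketch assumes that the factor starts at 1 and gains 0.25 per non-empty fragment, in line with the range of 1 to 2 given for Equation 1.

```python
from typing import Iterable


def distribution_bits(sentence_numbers: Iterable[int]) -> int:
    """Set bit (sentence number mod 32) for every sentence containing the word key."""
    bits = 0
    for number in sentence_numbers:
        bits |= 1 << (number % 32)
    return bits


def distribution_factor(*word_bits: int) -> float:
    """AND the 32-bit patterns, then add 25% per 8-bit fragment that contains a 1."""
    combined = 0xFFFFFFFF
    for bits in word_bits:
        combined &= bits
    fragments = [(combined >> shift) & 0xFF for shift in (0, 8, 16, 24)]
    return 1.0 + 0.25 * sum(1 for fragment in fragments if fragment)
```

Two word keys that co-occur in sentences 0, 9, and 17 produce a 1 in three of the four fragments, giving a factor of 1.75. Because sentence numbers are folded modulo 32, sentences 1 and 33 share a bit, which is the loss of precision the text accepts.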
[0084]
Also, the lower several bits of the document size are appended to the score so that different records do not end up with the same score.
[0085]
FIG. 16 shows an example of score calculation. In this example, documents A, B, and C are hit by performing an OR search for “copy” and “Fuji Xerox Co., Ltd.” (“Fuji Xerox” is a trademark). The search date is “August 1, 2002”.
[0086]
The score of document A is as follows. The sum of the actual appearance densities is "0x09 + 0x13 = 0x1C" (0x denotes hexadecimal), which becomes "0x1CB8" when combined with the document size. Adding the contribution of the update date gives "0x3CB8", and applying the factor of 1.75 from the appearance distribution gives the score "0x6A47 = 27207".
[0087]
The score of document B is as follows. Similarly, "0x1F40" is obtained from the appearance density and the document size. Since the corresponding value cannot be obtained from the occurrence data for "Fuji Xerox Co., Ltd.", the default value "0x1800" is used, giving "0x3740" in total.
[0088]
The score of document C is as follows. The sum of the actual appearance densities is "0x1D + 0x00 = 0x1D", which becomes "0x1D80" when combined with the document size. Adding the contribution of the update date gives "0x17E3". This is doubled by the appearance distribution, giving the score "0x2FC6 = 12230".
[0089]
As a result, documents B, A, and C are sorted in this order.
[0090]
This is the end of the description of the third embodiment. According to this embodiment, since score calculation and sorting are executed in a distributed manner, responsiveness can be improved and scalability can be achieved. Also, since hit records exceeding a predetermined upper limit are not sent to the search management server, communication costs are reduced.
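The linking performed by the search management server in steps S51 to S55 can be sketched in Python. The function name `merge_results`, the representation of a hit record as a `(score, document ID)` tuple, and the default factor of 10 (taken from the "for example, 10 times" in the step list) are assumptions for illustration.

```python
from typing import List, Tuple

Hit = Tuple[int, str]   # (score, document ID); hypothetical record shape


def merge_results(per_server: List[List[Hit]], output_limit: int, factor: int = 10) -> List[Hit]:
    """Link per-server hit lists, largest list first, then re-sort and truncate."""
    cumulative_limit = factor * output_limit
    linked: List[Hit] = []
    for hits in sorted(per_server, key=len, reverse=True):   # S52: most hits first
        linked.extend(hits)
        if len(linked) >= cumulative_limit:                   # S53: enough records linked
            break
    linked.sort(key=lambda hit: hit[0], reverse=True)         # S54: re-sort by score
    return linked[:output_limit]                              # S55: output from the top
```

Each per-server list is assumed to arrive already scored, sorted, and truncated (step S50), so the management server only concatenates and performs one final sort over a bounded number of records.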
[0091]
In FIG. 13, the search management server and the search server are configured separately and connected by a network (LAN or WAN). However, the functions of the search management server and the search server may be integrated as shown in FIG. 3. In that case as well, it is of course possible to sort by score for each index, concatenate the results, and then sort again by score to obtain the search result.
[0092]
It should be noted that the present invention is not limited to the above-described embodiments, and various changes can be made without departing from the gist of the present invention. For example, it goes without saying that the search device according to the second embodiment (FIG. 4) can be applied to the intranet environment shown in FIG. 3, and in that case the system can likewise be installed in a computer system using a recording medium or the like.
[0093]
[Effect of the Invention]
As described above, according to the present invention, an index for an integrated search can be constructed in consideration of access rights, and a high-speed search can be performed because access rights need not be verified for each hit record. Further, even when a plurality of indexes are used, the number of hit record merges can be reduced and a high-speed search can be performed.
[Brief description of the drawings]
FIG. 1 is a block diagram schematically showing a search device according to a first embodiment of the present invention.
FIG. 2 is a block diagram schematically illustrating the index construction device according to the first embodiment.
FIG. 3 is a diagram illustrating a configuration example in which the first embodiment is applied to an intranet environment.
FIG. 4 is a block diagram schematically illustrating a search device according to a second embodiment of the present invention.
FIG. 5 is an explanatory diagram of a B+ tree structure for explaining the hit record number calculation process for each index in the index-specific hit record number generation unit according to the second embodiment.
FIG. 6 is a diagram illustrating the management node of the B+ tree structure in FIG. 5.
FIG. 7 is a diagram illustrating an intermediate node of the B+ tree structure in FIG. 5.
FIG. 8 is a diagram illustrating leaf nodes of the B+ tree structure in FIG. 5.
FIG. 9 is a diagram illustrating a search of the B+ tree structure in FIG. 5.
FIG. 10 is a diagram illustrating the number management information included in the B+ tree structure of FIG. 5.
FIG. 11 is a flowchart illustrating a process of calculating the number of hit records for an index using the B+ tree structure of FIG. 5.
FIG. 12 is a flowchart illustrating a ranking calculation routine of FIG. 11;
FIG. 13 is a diagram illustrating a configuration of a third embodiment of the present invention.
FIG. 14 is a flowchart illustrating the operation of the third embodiment.
FIG. 15 is a diagram illustrating a record format according to the third embodiment.
FIG. 16 is a diagram illustrating an example of score calculation according to the above embodiment.
[Explanation of symbols]
10 Search section
11 Access control unit
12 User authority storage
13 Index storage device
14 Index
15 Search user terminal
20 Process starter
21 Index record management unit
22 Index Record Management Process
23 Access control unit
24 Process authority storage
30 Hit Record Number Generator for Each Index
31 Index selector
32 Hit Record Merging Section
33 Hit Record Temporary Storage
34 Display record output section
100 search system
101 Index holding unit
102 Index construction system
103 Document Filing System
104 Directory server
105 Web server
106 Application Server
107 router
108 LAN
109, 110 storage medium
120 client terminal
121 Network
200 security domains
300 search management server
301 search server
302 Index holding unit

Claims

Index storage means for storing an index provided for each document access right;
Search means for searching for a document using the index,
Means for specifying the index allowed for the search request of the search user based on the access authority of the search user.

2. The retrieval apparatus according to claim 1, wherein said index storage means is constituted by one storage system.

3. The retrieval apparatus according to claim 1, wherein said index storage means comprises a plurality of storage systems that divide the index according to access authority.

In an index construction device of a search device, which constructs an index for each document access right,
Means for setting the access authority and starting a process for building and managing an index with reference to the document for each of the access authorities,
Means for generating an index record by referring to the corresponding access right document by the above process;
Means for causing the index record to be included in the corresponding index.

In a search device that retrieves a desired number or more of hit records, sorts the records according to a predetermined rule, and displays well-ranked records up to a predetermined display number not exceeding the desired number,
Means for storing a plurality of indexes;
A number calculation means for generating the number of hit records in each of the indexes,
Hit record merging means for merging one or more other index hit records with the single index hit record if the number of hit records of the single index does not reach the desired number;
Index selecting means for selecting, based on the number of hit records in each of the indexes and the desired number, the indexes used for the merging so that the number of indexes used for the merging is minimized,
Sorting means for sorting, according to a predetermined rule, either the desired number or more of hit records extracted from a single index or the hit records merged by the hit record merging means,
Hit record extracting means for extracting well-ranked hit records from the sorted records with the display number as an upper limit,
Hit record output means for outputting the extracted hit records as display data.

The search device according to claim 5, wherein the predetermined rule determines a ranking based on a magnitude of a ranking score given to a hit record.

A plurality of search device bodies, and a search management device connected to the plurality of search device bodies via a communication network,
Each of the above-described search device bodies performs a search using the corresponding index, calculates a score for the hit record, sorts the hit records based on the score,
The search management device includes the number calculation means, the hit record merging means, the index selecting means, the sorting means, the hit record extracting means, and the hit record output means, and receives hit records including scores from the search device bodies corresponding to the indexes selected by the index selecting means; the search device being the search device according to claim 5.

In a search method for performing a search using an index provided for each document access right,
Identifying the index allowed in the search request of the search user based on the access authority of the search user,
Searching for a document using the allowed index.

In an index construction method for a search device, which constructs an index for each document access right,
For each of the above access rights,
Launching the process of building and managing the index with reference to the document by setting the access authority; and
Generating an index record by referring to a document of a corresponding access right by the above process;
And a step of including the index record in a corresponding index.

A search method that uses a plurality of indexes, extracts at least a desired number of hit records from among the hit records of the indexes, sorts them according to a predetermined rule, and displays the highest-ranked records with a predetermined display number below the desired number as an upper limit, the method comprising the steps of:
Calculating the number of hit records in each of the indexes;
A hit record merging step of merging the hit records of one or more other indexes with the hit records of a single index if the number of hit records of the single index alone does not reach the desired number;
Selecting the indexes used for the merging, based on the number of hit records in each of the indexes and the desired number, so as to minimize the number of indexes used for the merging;
Sorting, according to a predetermined rule, either the hit records of at least the desired number extracted from a single index or the hit records merged in the hit record merging step;
Extracting, from the sorted records, the highest-ranked hit records with the display number as an upper limit; and
Outputting the extracted hit records as display data.
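
One way to read the index selection step is as a greedy choice: taking indexes in descending order of hit count reaches the desired number with the fewest indexes. The sketch below assumes this reading; the function names `select_indexes` and `top_hits`, the `(location, score)` record shape, and sorting by score as the predetermined rule are illustrative assumptions, not taken from the claims.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, float]  # (document location, ranking score); assumed shape


def select_indexes(hit_counts: Dict[str, int], desired: int) -> List[str]:
    """Pick the fewest indexes whose hit counts add up to at least `desired`.

    Indexes are taken in descending order of hit count. If all indexes
    together fall short, every index with at least one hit is returned.
    """
    selected: List[str] = []
    total = 0
    ordered = sorted(hit_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for index_id, count in ordered:
        if total >= desired or count == 0:
            break
        selected.append(index_id)
        total += count
    return selected


def top_hits(hits_by_index: Dict[str, List[Hit]], desired: int, display: int) -> List[Hit]:
    """Merge hits from the selected indexes, sort by score, keep `display`."""
    counts = {index_id: len(hits) for index_id, hits in hits_by_index.items()}
    merged: List[Hit] = []
    for index_id in select_indexes(counts, desired):
        merged.extend(hits_by_index[index_id])
    merged.sort(key=lambda hit: hit[1], reverse=True)
    return merged[:display]
```

The trade-off in this sketch is that a high-scoring record in an unselected index is not displayed; the method accepts this in exchange for touching as few indexes as possible.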

A search computer program for performing a search using indexes provided for each document access right, the program causing a computer to execute the steps of:
Identifying, based on the access authority of the search user, the index permitted for the search request of the search user; and
Searching for a document using the permitted index.

An index construction computer program for a search device, which constructs an index for each document access right, the program causing a computer to execute, for each of the access rights, the steps of:
Starting, with the access authority set, a process that builds and manages the index by referring to documents;
Generating, by the process, an index record by referring to a document of the corresponding access right; and
Including the index record in the corresponding index.

A search computer program that uses a plurality of indexes, extracts at least a desired number of hit records, sorts them according to a predetermined rule, and displays the highest-ranked records with a predetermined display number below the desired number as an upper limit, the program causing a computer to execute the steps of:
Calculating the number of hit records in each of the indexes;
A hit record merging step of merging the hit records of one or more other indexes with the hit records of a single index if the number of hit records of the single index alone does not reach the desired number;
Selecting the indexes used for the merging, based on the number of hit records in each of the indexes and the desired number, so as to minimize the number of indexes used for the merging;
Sorting, according to a predetermined rule, either the hit records of at least the desired number extracted from a single index or the hit records merged in the hit record merging step;
Extracting, from the sorted records, the highest-ranked hit records with the display number as an upper limit; and
Outputting the extracted hit records as display data.

A search device that performs a search using a plurality of indexes, wherein hit records including scores obtained from each index are sorted for each index based on the scores, the sorted hit records are concatenated according to a predetermined rule, the concatenated hit records are re-sorted based on the scores, and a predetermined number of the highest-ranked hit records are output as a search result.

The search device according to claim 14, wherein the hit records are concatenated in order starting from the index having the largest number of hit records, and the concatenation is terminated when a predetermined number is reached.
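
The per-index sort, the linking in order of hit count, and the final re-sort might look like the sketch below. It assumes that whole per-index sets are appended and that linking stops once the running total reaches the limit; the claim does not say whether a set may be cut partway, so that choice, along with the names `sort_per_index`, `link_and_resort`, `link_limit`, and `output_count`, is an assumption made for illustration.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, float]  # (document location, score); assumed shape


def sort_per_index(hits_by_index: Dict[str, List[Hit]]) -> Dict[str, List[Hit]]:
    """Sort each index's hit records by score, highest first."""
    return {
        index_id: sorted(hits, key=lambda hit: hit[1], reverse=True)
        for index_id, hits in hits_by_index.items()
    }


def link_and_resort(
    sorted_by_index: Dict[str, List[Hit]], link_limit: int, output_count: int
) -> List[Hit]:
    """Link the sorted sets, largest index first, then re-sort by score."""
    order = sorted(sorted_by_index, key=lambda i: (-len(sorted_by_index[i]), i))
    linked: List[Hit] = []
    for index_id in order:
        if len(linked) >= link_limit:
            break  # predetermined number reached: stop linking
        linked.extend(sorted_by_index[index_id])
    linked.sort(key=lambda hit: hit[1], reverse=True)
    return linked[:output_count]
```

The re-sort is what makes the output a single ranking: after linking, records from different indexes are interleaved purely by score.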

A search device that performs a search using a plurality of indexes, comprising:
First means, provided for each index, for searching records, calculating scores for hit records according to a predetermined rule, assigning the scores to the records, and sorting the hit records based on the scores; and
Second means for receiving, from each first means, the set of sorted hit records generated for the corresponding index, concatenating the hit records of the plurality of received sets, re-sorting the concatenated hit records, and outputting, as a search result, a predetermined number of the highest-ranked hit records after re-sorting.

17. The search device according to claim 16, wherein said first means and said second means are connected via a communication network.

The search device according to claim 16 or 17, wherein the score is calculated based on at least one of the update time of the document indicated by the record, the density with which keywords appear in the document, and the degree to which keywords specified together appear together in the same document.
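
The claim names three possible inputs to the score but gives no formula. Purely as an illustration, the sketch below combines them as a weighted sum; the formula, the default weights, the use of document age in days as a proxy for update time, and the function name `score` are all assumptions rather than anything stated in the source.

```python
from typing import List, Tuple


def score(
    words: List[str],
    keywords: List[str],
    age_days: float,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> float:
    """Hypothetical ranking score built from the three factors in the claim."""
    if not words or not keywords:
        return 0.0
    # Update time: more recently updated documents score higher.
    recency = 1.0 / (1.0 + age_days)
    # Appearance density: keyword occurrences per word of the document.
    density = sum(words.count(k) for k in keywords) / len(words)
    # Co-occurrence: fraction of the specified keywords found in the document.
    cooccurrence = sum(1 for k in keywords if k in words) / len(keywords)
    w_recency, w_density, w_cooccurrence = weights
    return w_recency * recency + w_density * density + w_cooccurrence * cooccurrence
```

Setting a weight to zero drops that factor, which matches the "at least one of" wording of the claim.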

A search method for performing a search using a plurality of indexes, comprising sorting hit records including scores obtained from the respective indexes for each index based on the scores, concatenating the sorted hit records according to a predetermined rule, re-sorting the concatenated hit records based on the scores, and outputting, as a search result, a predetermined number of the highest-ranked hit records after the re-sorting.

A search computer program for performing a search using a plurality of indexes, the program causing a computer to execute the steps of:
Concatenating hit records including the scores obtained from each index according to a predetermined rule;
Sorting the concatenated hit records based on the scores; and
Outputting, as a search result, a predetermined number of the highest-ranked hit records after sorting.