JP2004070957A

JP2004070957A - Search system

Info

Publication number: JP2004070957A
Application number: JP2003285107A
Authority: JP
Inventors: Setsu Suzuoka; 鈴岡　節; Shinichi Sugano; 菅野　伸一; Shinsuke Sawajima; 澤島　信介; Tetsuya Yamane; 山根　徹也
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2003-08-01
Filing date: 2003-08-01
Publication date: 2004-03-04

Abstract

PROBLEM TO BE SOLVED: To provide a search system that efficiently acquires the vast amount of search target data scattered across a network and builds a database from it.
SOLUTION: A search system that creates a database from data collected on a network by a robot and performs database searches comprises: cache means for holding data acquired in response to external reference requests and data collected by the robot; and data providing means which, when a reference request is received from outside, provides the data from the cache means if the cache means holds the corresponding data, and otherwise acquires the data from the original server that holds it and provides it.
[Selected drawing] FIG. 10

Description

The present invention relates to a system for searching data distributed over a network.

Many network search systems that use robots exist, such as Altavista (http://www.altavista.com/), Lycos (http://www.lycos.com/), and Yahoo! (http://www.yahoo.com/). These use software called robots that mechanically collect information on the network. The collected data is then compiled into a database so that users can search it.

Such a robot looks for documents written in HTML (Hyper Text Markup Language) on the network and follows the links they contain to collect data existing on the network. As for the database, some systems support full-text search, while others make only parts such as titles and URLs searchable.

Because such a database is large, it is sometimes distributed. However, this is merely a partition by volume; the data is not partitioned in any meaningful way.

Searching is done by keyword. That is, the user enters words likely to appear in the desired document and runs the search.

On the other hand, mirror sites are sometimes set up to spread the concentration of accesses to popular sites and reduce traffic. For example, I-Server (http://www.pointcast.com/products/iserver.html) from Point Cast Network (PCN) periodically contacts PCN headquarters to prefetch information and maintains a mirror site.
S. Shimizu, "Construction of a Search System on a WWW Server," Interface, CQ Publishing Co., Ltd., Japan, July 1, 1996, Vol. 22, No. 6, pp. 130-139.

Conventional systems for searching data distributed over a network have had the following problems.

(1) It is becoming difficult to handle the growing volume of data.
For example, it is said that there were more than 40 million pages on the WWW worldwide in 1996, and the number is expected to keep growing exponentially. At present, both the number of pages and the amount of data per page tend to increase rapidly.
Simply partitioning such rapidly growing data by volume makes database management extremely difficult.

(2) It is difficult to handle information that is updated frequently.
Current search systems exclude data that is updated many times a day from robot crawling. This is because even if a robot collects frequently updated data and builds a database from it, the data is often updated again before it is ever searched. In such cases, a page that appears in the search results may already be gone, or its content may have changed so completely that what is displayed differs from what the user intended.

The present invention has been made in view of the above circumstances, and its object is to provide a search system that efficiently acquires the vast amount of search target data scattered across a network and builds a database from it.

Another object of the present invention is to provide a search system that can effectively build a database even from data that is updated extremely frequently.

The present invention provides a search system that creates a database from data collected on a network by a robot and performs database searches, the search system comprising: cache means for holding data acquired in response to external reference requests and data collected by the robot; and data providing means which, when a reference request is received from outside, provides the data from the cache means if the cache means holds the corresponding data, and otherwise acquires the data from the original server that holds it and provides it.

This search system also serves as a proxy. If data requested by a user is in the system, it can be presented to the user regardless of whether it was acquired in response to a user request or collected by the robot. This makes it possible to apply search even to data that is updated extremely frequently.
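The cache-or-origin behavior of the data providing means can be illustrated with a minimal Python sketch. The class name `CachingProvider`, the method names, and the stand-in origin fetcher are assumptions made for illustration and do not appear in the patent.

```python
from typing import Callable, Dict


class CachingProvider:
    """Holds pages from both robot crawls and user-triggered fetches."""

    def __init__(self, fetch_origin: Callable[[str], bytes]) -> None:
        self._fetch_origin = fetch_origin   # hypothetical origin-server fetcher
        self._cache: Dict[str, bytes] = {}

    def store_from_robot(self, url: str, body: bytes) -> None:
        """Robot-collected data goes into the same cache as user-requested data."""
        self._cache[url] = body

    def provide(self, url: str) -> bytes:
        """Serve from the cache on a hit; otherwise fetch from the origin and keep it."""
        if url in self._cache:
            return self._cache[url]
        body = self._fetch_origin(url)
        self._cache[url] = body
        return body


origin_calls = []


def fake_origin(url: str) -> bytes:
    """Stand-in for the original server; records which URLs were requested."""
    origin_calls.append(url)
    return b"origin:" + url.encode()


provider = CachingProvider(fake_origin)
provider.store_from_robot("http://example.com/a", b"crawled")
first = provider.provide("http://example.com/a")    # hit: robot-collected data
second = provider.provide("http://example.com/b")   # miss: fetched from the origin
third = provider.provide("http://example.com/b")    # hit: kept from the earlier miss
```

In this sketch the origin is contacted only once, for the URL that was not yet cached.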

In the above search system, the present invention further comprises: prediction means for performing statistical processing on the data for which reference requests have been received from outside, and predicting the data that will be requested in the future; and prefetch means for acquiring, with the robot, the predicted data and data explicitly specified in advance, and prefetching it into the cache means.

In the present invention, the robot does not collect all obtainable data in advance. Instead, it prefetches data specified in advance and data that, from a statistical standpoint, users are expected to request, so that the appropriate data is mirrored effectively.
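One simple way to picture the prediction and prefetch selection is shown below. The patent does not specify the statistical method; this sketch assumes, purely for illustration, that the most frequently requested URLs in a request log are the ones predicted to be requested again. The function name `plan_prefetch` and the sample paths are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def plan_prefetch(request_log: Iterable[str],
                  explicit_urls: Iterable[str],
                  top_n: int = 2) -> List[str]:
    """Return the explicitly specified URLs plus the top_n most requested URLs."""
    counts = Counter(request_log)
    predicted = [url for url, _ in counts.most_common(top_n)]
    plan: List[str] = []
    for url in list(explicit_urls) + predicted:
        if url not in plan:   # keep each URL once, explicit ones first
            plan.append(url)
    return plan


log = ["/news", "/news", "/sports", "/news", "/sports", "/weather"]
plan = plan_prefetch(log, explicit_urls=["/index"], top_n=2)
```

The resulting plan is what the robot would fetch into the cache; everything else is left to be fetched on demand.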

In the above search system, the prefetch means re-acquires the data to be acquired at a frequency that depends on the update frequency of that data.
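As a rough illustration of re-acquisition tied to update frequency, the sketch below derives a refetch interval that is inversely proportional to the observed number of updates per day. The function name, the unit (updates per day), and the clamping bounds are assumptions for illustration; the patent only states that the frequency depends on the update frequency.

```python
def refetch_interval_hours(updates_per_day: float,
                           min_hours: float = 1.0,
                           max_hours: float = 24.0 * 7) -> float:
    """Interval between refetches, inversely proportional to the update frequency."""
    if updates_per_day <= 0:
        return max_hours   # never observed to change: revisit rarely
    interval = 24.0 / updates_per_day
    return min(max(interval, min_hours), max_hours)


frequent = refetch_interval_hours(12.0)   # page updated about twelve times a day
daily = refetch_interval_hours(1.0)
static = refetch_interval_hours(0.0)
```

The lower bound keeps a very busy page from being fetched so often that it generates needless traffic.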

In the above search system, the present invention imposes, as a constraint on the range of data targeted by a search performed in response to a search request, at least one of the following conditions: limiting the search to data collected by the robot; limiting it to data acquired in response to external reference requests; limiting it to only the latest version of data having the same name or address; limiting it to data other than dynamically or interactively generated data; and limiting it to a specified group of sites or group of data.
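The constraints listed above can be pictured as filters over cached entries. The following sketch is illustrative only: the `CachedEntry` fields, the `restrict` function, and the use of URL prefixes to identify a site group are assumptions, not details given in the patent.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Set


@dataclass(frozen=True)
class CachedEntry:
    url: str
    origin: str          # "robot" or "request"
    collected_at: float  # collection time on an arbitrary clock
    dynamic: bool        # True if generated dynamically or interactively


def restrict(entries: Iterable[CachedEntry],
             origin: Optional[str] = None,
             latest_only: bool = False,
             static_only: bool = False,
             sites: Optional[Set[str]] = None) -> List[CachedEntry]:
    """Apply whichever constraints are set; unset constraints are ignored."""
    selected = [e for e in entries
                if (origin is None or e.origin == origin)
                and (not static_only or not e.dynamic)
                and (sites is None or any(e.url.startswith(s) for s in sites))]
    if latest_only:
        # Among the entries that passed the other filters, keep the newest per URL.
        newest: Dict[str, CachedEntry] = {}
        for e in selected:
            if e.url not in newest or e.collected_at > newest[e.url].collected_at:
                newest[e.url] = e
        selected = [e for e in selected if newest[e.url] is e]
    return selected


entries = [
    CachedEntry("http://a.example/x", "robot", 100.0, False),
    CachedEntry("http://a.example/x", "request", 200.0, False),
    CachedEntry("http://b.example/cgi", "request", 150.0, True),
]
robot_only = restrict(entries, origin="robot")
latest_static = restrict(entries, latest_only=True, static_only=True)
```

Here `robot_only` keeps just the crawled copy, while `latest_static` keeps the newer of the two copies of the same URL and drops the dynamically generated page.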

In the above search system, the cache means holds the acquired data with at least one of its update time information and its collection time information attached. This makes it possible to manage data that has the same name at the source but differs from one time to another.

The inventions described above as apparatuses can equally be expressed as methods.

The above invention can also be embodied as a machine-readable medium storing a program that causes a computer to execute the corresponding procedures or means.

According to the present invention, data can be managed in different databases according to its update frequency. As a result, for example, the processing capacity of the computer or the like hosting each database can be set according to how frequently the data managed by that database is updated, and the vast amount of data distributed over a network can be managed effectively.

Furthermore, according to the present invention, the search system has a built-in proxy function, so data stored in the proxy can be searched and presented. As a result, for example, search and reference services can be provided even for data that is updated extremely frequently.

Embodiments of the invention are described below with reference to the drawings.

First, terms are defined.

A proxy is a server that, when a client (for example, a user terminal) accesses a resource on a server (for example, a WWW site), sits between the client and the server at the application level, relays the client's resource access requests to the server, and relays the server's responses to the client.

A page means a hypertext page. On the WWW, each page has a unique URL.

A URL (Uniform Resource Locator) is the information needed to access page data. A URL contains the protocol, domain name, port number, and path name.
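As a small illustration of these four components, Python's standard `urllib.parse.urlsplit` separates them as follows. The example URL is hypothetical.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.example.com:8080/docs/index.html")

protocol = parts.scheme    # protocol
domain = parts.hostname    # domain name
port = parts.port          # port number, as an integer
path = parts.path          # path name
```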

CGI (Common Gateway Interface) is an interface through which a server launches a program in order to produce interactive or dynamic pages.

A robot reads documents written in hypertext such as HyperText Markup Language (HTML) or Standard Generalized Markup Language (SGML) and collects documents over the network by mechanically following the links written in them. It is implemented in software. A robot is sometimes also called a spider or a wanderer.

The basic operation of a robot is as follows.

(Step 1) Register the root of the specified URL in the visit list.
(Step 2) The robot acquires a page according to the visit list.
(Step 3) Analyze the acquired page and extract the URLs it contains.
(Step 4) Add the extracted URLs to the visit list (without registering any URL twice).
Steps 2 to 4 are then repeated. The frequency at which a page is acquired may be determined according to how frequently that page is updated.
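The four steps above can be sketched as a small breadth-first crawler. This is a minimal illustration under stated assumptions: the `crawl` function, the injected `fetch` callable, the `max_pages` limit, and the in-memory `SITE` dictionary standing in for the network are all hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict, List
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags (step 3)."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(root_url: str, fetch: Callable[[str], str],
          max_pages: int = 100) -> Dict[str, str]:
    visit_list = deque([root_url])          # step 1: register the root
    registered = {root_url}
    pages: Dict[str, str] = {}
    while visit_list and len(pages) < max_pages:
        url = visit_list.popleft()          # step 2: acquire the next page
        html = fetch(url)
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)                   # step 3: extract URLs
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in registered:  # step 4: add without duplicates
                registered.add(absolute)
                visit_list.append(absolute)
    return pages


# Tiny in-memory "web" so the sketch runs without network access.
SITE = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/">home</a> <a href="/b.html">B</a>',
    "http://example.com/b.html": "<p>leaf</p>",
}
pages = crawl("http://example.com/", lambda url: SITE.get(url, ""))
```

The `registered` set is what prevents a URL from being added to the visit list twice, even when several pages link to it.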

Next, an outline of the embodiments is given.

In the embodiments, pages are used as an example of data distributed over a network.

As mentioned above, the number of pages (distinct pages) on the World Wide Web (WWW), for example, is said to exceed 40 million, and this number is predicted to keep growing exponentially. Managing such a vast number of pages in a single database is extremely difficult.

The simplest way to partition the database is by site (domain), but then every database must be equally fast. Even if the database can be partitioned, the burden of building it remains high if every partition must be fast.

Therefore, in the first embodiment, the contents of the database are partitioned according to popularity. Popular databases are placed on fast systems (for example, machines with large memory), and less popular databases are placed on slower systems. In this way, fast machines are needed only for the popular databases, which effectively reduces the burden of building the database.

Strictly speaking, knowing how popular a page is would require something like an audience rating survey of the network, but such work involves great difficulty and is not realistic. This embodiment therefore uses the following approximation, which holds well in practice. First, assume that "for a page to stay popular without boring its audience, its content must be updated continually." Then take the converse and approximate: "a page whose data is updated frequently is a popular page." In other words, this embodiment uses the update frequency of data as a barometer of popularity and partitions the contents of the database according to the update frequency of the data. The update frequency of a page is information that can be obtained by running the robot.
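How a robot might derive an update frequency from its own visits can be sketched as follows. The patent does not say how the statistic is computed; this sketch assumes, for illustration only, that the robot compares a content digest between consecutive visits and reports the fraction of visits on which the page had changed. The class and field names are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class UpdateStats:
    comparisons: int = 0               # visits that could be compared with a previous one
    changes: int = 0                   # comparisons on which the content differed
    last_digest: Optional[str] = None

    def observe(self, content: bytes) -> bool:
        """Record one robot visit; return True if the page changed since the last visit."""
        digest = hashlib.sha256(content).hexdigest()
        changed = self.last_digest is not None and digest != self.last_digest
        if self.last_digest is not None:
            self.comparisons += 1
            self.changes += int(changed)
        self.last_digest = digest
        return changed

    @property
    def change_rate(self) -> float:
        """Fraction of comparisons on which the page had changed."""
        return self.changes / self.comparisons if self.comparisons else 0.0


stats = UpdateStats()
for body in [b"v1", b"v1", b"v2", b"v3", b"v3"]:
    stats.observe(body)
```

A page visited at a fixed interval with a change rate near 1.0 is probably changing faster than the robot visits it, which is one signal that it belongs in a higher-frequency database.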

Some frequently updated pages are updated many times a day. If such pages are accessed only occasionally, the actual page data and the database in the search system fall out of agreement. In particular, when a user goes to view a page on the basis of a database search result, the page may already be gone, or the page may still exist but with different content, and in such cases problems arise.

To reduce the inconsistencies caused by a stale database, on the other hand, the robot would have to access pages at a very high frequency. However, accessing pages frequently just to keep up with content that changes often and irregularly increases needless traffic and is a disadvantage both for the sites holding the information and for the search system.

Therefore, in the second embodiment, the original data from which the database was built is saved and presented to the user. Although this lags somewhat behind changes to the actual page, it does not needlessly increase traffic, and the original page corresponding to a search result can always be viewed.

The first and second embodiments can also be combined, in which case the effects of both are obtained.

Embodiments of the present invention are described in detail below.

(First Embodiment)
First, the first embodiment is described.

Examples of the system configuration of this embodiment are shown in FIGS. 1, 4, and 6.

In this embodiment, a plurality of databases are prepared and used selectively according to the update frequency of the data. That is, each database is assigned a range of update frequencies for the page data it covers. When a search is performed for a keyword requested by the user, the databases are searched in cooperation and the results are combined and presented to the user.

Possible methods of assigning pages to the databases include, for example, the following.
(a) Assignment by statistical update frequency information
(b) Assignment by last update time
(c) Assignment by combined information from the statistical update frequency information and the last update time
Method (b), assignment by last update time, is described here.

A page is considered to be accessed frequently (that is, to be popular) immediately after it is updated, and to be accessed less often (that is, to be less popular) the more time has passed since its last update. Therefore, as shown in FIG. 3 for example, the database in which a page is stored is assigned according to the range into which its last update time falls.
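Assignment by last update time can be sketched as a lookup over age ranges. FIG. 3 is not reproduced in this text, so the boundaries (one day, thirty days) and the database names below are purely hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical boundaries; the actual ranges are left to the implementer.
RANGES = [
    (timedelta(days=1), "db_high"),     # updated within the last day
    (timedelta(days=30), "db_medium"),  # updated within the last 30 days
]
DEFAULT_DB = "db_low"                   # anything older


def database_for(last_updated: datetime, now: datetime) -> str:
    """Pick the database whose last-update-time range contains the page."""
    age = now - last_updated
    for limit, name in RANGES:
        if age <= limit:
            return name
    return DEFAULT_DB


now = datetime(1997, 1, 31, 12, 0)
recent = database_for(datetime(1997, 1, 31, 9, 0), now)
month = database_for(datetime(1997, 1, 10), now)
old = database_for(datetime(1996, 6, 1), now)
```

Because the age grows as time passes, a page that is never updated again drifts from the first range toward the last, which matches the idea that it gradually becomes less popular.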

Possible methods of determining the database that stores information about a given page include, for example, the following.
(1) Determine the database on a per-site basis. In this case, the average update frequency of the data in the site is used as the evaluation value.
(2) Determine the database on a per-directory basis within a site. In this case, the average update frequency of the data in the directory is used as the evaluation value.
(3) Determine the database on a per-data-item basis. In this case, the update frequency of that data item is used as the evaluation value.
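The three granularities above differ only in how pages are grouped before averaging. The sketch below illustrates this with hypothetical function names (`group_key`, `evaluation_values`) and made-up update frequencies expressed in updates per day.

```python
import posixpath
from collections import defaultdict
from statistics import mean
from typing import Dict
from urllib.parse import urlsplit


def group_key(url: str, unit: str) -> str:
    """Key under which a page is grouped for the chosen unit."""
    parts = urlsplit(url)
    if unit == "site":
        return parts.netloc
    if unit == "directory":
        return parts.netloc + posixpath.dirname(parts.path)
    if unit == "data":
        return url
    raise ValueError(f"unknown unit: {unit}")


def evaluation_values(frequencies: Dict[str, float], unit: str) -> Dict[str, float]:
    """Average update frequency per site, per directory, or per data item."""
    groups = defaultdict(list)
    for url, freq in frequencies.items():
        groups[group_key(url, unit)].append(freq)
    return {key: mean(values) for key, values in groups.items()}


freqs = {  # hypothetical updates per day
    "http://a.example/news/today.html": 12.0,
    "http://a.example/news/archive.html": 2.0,
    "http://a.example/about/index.html": 1.0,
}
by_site = evaluation_values(freqs, "site")
by_dir = evaluation_values(freqs, "directory")
by_data = evaluation_values(freqs, "data")
```

Grouping by site hides the busy news page behind the site average, while grouping by directory or by data item lets it be assigned to a higher-frequency database on its own.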

Here, the update frequency is the statistical update frequency information, the last update time, or the like described above. Methods (1) to (3) can be used together. For example, site A may be assigned to a database as a whole site, while site B is assigned item by item. Likewise, within site C, directory a may be assigned as a whole directory, while directory b is assigned item by item.

It is also conceivable to place more frequently updated data on servers connected to an internal network. For example, the more frequently updated data is placed on the organization's intranet, and the less frequently updated data is managed at a location connected directly to the Internet.

In this embodiment, the database stores keywords and URLs rather than the pages themselves. Keywords extracted from a page, for example by full-text search, are stored attached to its URL, and URLs are retrieved by keyword.

Although this embodiment describes a database organized by word or keyword, a database organized by character may also be used.

Next, the system configuration examples shown in FIGS. 1, 4, and 6 are described.

In the configuration example of FIG. 1, search devices 100, 110, and 120, each consisting of a database and robot pair (101 and 102, 111 and 112, 121 and 122), a plurality of WWW sites (131, 132), and a user terminal (133) are connected to a network 100.

Each database is assigned the update frequency it covers by a page assignment method such as those described above.

The first robot 102 collects the group of sites or data that changes at a high frequency (for example, from WWW sites 131 and 132), converts it into database form, and stores it in the first database 101.

The third robot 122 collects the group of sites or data that changes at a low frequency, converts it into database form, and stores it in the third database 121.

The second robot 112 collects the remaining group of sites or data, which changes at a medium frequency, converts it into database form, and stores it in the second database 111.

The ranges of actual statistical update frequency information (or last update time, etc.) that correspond to high, low, and medium frequency are set as appropriate.

Next, dynamic changes to the database assignment are described.

In this embodiment, the URL of each page is placed in the database whose range, among the databases partitioned according to update frequency information obtained from statistics, contains that page. However, the update frequency of a page (or the average update frequency of the site to which it belongs, etc.) can change over time, so the update frequency of a page (or of its site) may leave the update frequency range of the database to which the page was initially assigned. It is therefore desirable for the database currently responsible for a page to ask a database with an appropriate update frequency range to take over that page data or site. This request is assumed to be realized by negotiation between the databases.

For example, in FIG. 1, the first robot 102 fetches the group of data that is statistically updated at a high frequency and stores it in the first database 101. If the update frequency of data that was initially updated frequently falls below the range the first robot covers, the data is handed over to the second robot 112 and database 111. If the update frequency has dropped sharply, the third robot 122 and database 121 are asked to take over.

FIG. 2 shows an example of the processing procedure of each search device when, as in FIG. 1, there are a plurality of robots according to update frequency, each with its own database.

In step S21, the search device checks whether another search device has asked it to take over a page. If so, step S27 is performed; if not, step S22 is performed.

In step S22, each robot selects one of the designated pages and acquires it. Acquisition is scheduled so that the page is acquired at a frequency proportional to its current statistical update frequency. If no statistical update frequency information is available for the page, the average of the statistical update frequencies obtained for other pages of the same site, or a default value, may be used instead.

In step S23, the statistical update information of the page is updated according to whether the acquired page has changed since the previous acquisition. If acquisition of the page fails because of a problem with the network or the remote server, the failure is recorded and processing returns to step S22.

In step S24, the search device checks whether the new update frequency is within the range it is responsible for.

In step S25, if the page has fallen outside its own range, the search device asks the search device whose range contains the page to handle subsequent processing, and erases its own data for the page.

In step S26, if the page is within its own range, the acquired page is converted into database form and stored. For example, the page data is morphologically analyzed and broken down into words, and the database records which pages contain each word. The previous data for the page is erased at this point.

In step S27, when a request has been received from another search device, the page is registered so that the device's own robot can handle it, and the statistical update frequency information of the page is set.
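Steps S21 to S27 can be sketched as one processing step of a search device that owns an update frequency range. This is a simplified illustration, not the patent's implementation: the `SearchDevice` class, its fields, and the way the new frequency is passed in by the caller are assumptions, and the fetch-failure branch of step S23 is omitted.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, List, Tuple


@dataclass
class SearchDevice:
    name: str
    low: float     # inclusive lower bound of the assigned update frequency range
    high: float    # exclusive upper bound
    pages: Dict[str, float] = field(default_factory=dict)   # url -> update frequency
    index: Dict[str, str] = field(default_factory=dict)     # url -> stored data
    requests: Deque[Tuple[str, float]] = field(default_factory=deque)

    def in_range(self, freq: float) -> bool:
        return self.low <= freq < self.high

    def step(self, url: str, fetch: Callable[[str], str],
             new_freq: float, peers: List["SearchDevice"]) -> None:
        if self.requests:                            # S21: a takeover request is pending
            handed_url, handed_freq = self.requests.popleft()
            self.pages[handed_url] = handed_freq     # S27: register the page; url is ignored
            return
        if url not in self.pages:
            return
        content = fetch(url)                         # S22: acquire the page
        self.pages[url] = new_freq                   # S23: update the statistics
        if self.in_range(new_freq):                  # S24: still within our range?
            self.index[url] = content                # S26: store, replacing old data
            return
        for peer in peers:                           # S25: ask the responsible device
            if peer.in_range(new_freq):
                peer.requests.append((url, new_freq))
                break
        self.pages.pop(url)                          # S25: erase our own data
        self.index.pop(url, None)


page = "http://example.com/p"
high = SearchDevice("high", 5.0, float("inf"))
medium = SearchDevice("medium", 1.0, 5.0)
high.pages[page] = 8.0


def fetch(url: str) -> str:
    return "content"


high.step(page, fetch, new_freq=2.0, peers=[medium])    # frequency dropped: hand over
medium.step(page, fetch, new_freq=2.0, peers=[high])    # S21/S27: accept the page
medium.step(page, fetch, new_freq=2.0, peers=[high])    # S22 to S26: fetch and store
```

Note that in this sketch a page whose new frequency matches no peer's range is simply dropped; a real system would need a policy for that case.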

In this embodiment, when a user performs a database search, the user terminal 133 may either issue the search request to all of the databases 101, 111, and 121, or issue it to any one of them. In the latter case, two modes are possible: one in which only the database that received the search request returns results, and one in which that database also queries the other databases, merges the results, and returns them.
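The two modes for a request sent to a single database can be sketched as follows, with each database modeled as a keyword-to-URL mapping in line with the keyword and URL storage described earlier. The function names and sample data are hypothetical.

```python
from typing import Dict, List, Set

Database = Dict[str, Set[str]]   # keyword -> URLs of pages containing it


def search_one(db: Database, keyword: str) -> Set[str]:
    """Mode 1: only the database that received the request answers."""
    return set(db.get(keyword, set()))


def search_merged(entry_db: Database, others: List[Database], keyword: str) -> Set[str]:
    """Mode 2: the receiving database also asks the others and merges the results."""
    results = search_one(entry_db, keyword)
    for db in others:
        results |= search_one(db, keyword)
    return results


db_high = {"weather": {"http://example.com/today"}}
db_mid = {"weather": {"http://example.com/climate"}, "maps": {"http://example.com/maps"}}
db_low = {"history": {"http://example.com/1990"}}

local_only = search_one(db_high, "weather")
merged = search_merged(db_high, [db_mid, db_low], "weather")
```

Sending the request to all databases from the user terminal gives the same result set as the merged mode; the difference is only in who performs the fan-out.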

Next, the configuration example of FIG. 4 is described. FIG. 4 is basically the same as FIG. 1 in that a plurality of databases 201 to 203 are provided according to the update frequency of the data, but differs from the configuration example of FIG. 1 in that a single robot 204 serves all of them.

FIG. 5 shows an example of the processing procedure of the search device when, as in FIG. 4, there is one robot and a plurality of databases.

In step S11, one of the designated pages is selected and acquired using the robot 204. Acquisition is scheduled so that the page is acquired at a frequency proportional to its current statistical update frequency. If no statistical update frequency information is available for the page, the average of the statistical update frequencies obtained for other pages of the same site, or a default value, may be used instead.

In step S12, the statistical update information of the page is updated according to whether the acquired page has changed since the previous acquisition. If acquisition of the page fails because of a problem with the network or the remote server, the failure is recorded and processing returns to step S11.

In step S13, the database that is to be responsible for the page acquired in step S11 is determined from the page's new statistical update probability.

In step S14, the page information is converted into database form. For example, the page data is morphologically analyzed and broken down into words, and the database records which pages contain each word. This data is stored in the database determined in step S13, and the previous data for the page is erased. If the page had been stored in a database different from the one now chosen, that data is erased as well. If the acquired page has not changed since the previous acquisition, it is not converted again; however, if the database in which it should be stored has changed as a result, the data is simply moved.
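Step S14 can be sketched as an inverted-index update that also removes the page from whichever database held it before. The sketch is illustrative: the function names and the `location` bookkeeping are hypothetical, a simple regular-expression tokenizer stands in for morphological analysis, and the "unchanged page, move only" case is not modeled.

```python
import re
from typing import Dict, Optional, Set

Index = Dict[str, Set[str]]   # word -> URLs of pages containing it


def remove_page(index: Index, url: str) -> None:
    """Erase every entry for the page, dropping words that no longer match any page."""
    for word in list(index):
        index[word].discard(url)
        if not index[word]:
            del index[word]


def store_page(databases: Dict[str, Index], location: Dict[str, str],
               url: str, text: str, target: str) -> None:
    """Index the page in `target`, erasing its old data wherever it was stored."""
    previous: Optional[str] = location.get(url)
    if previous is not None:
        remove_page(databases[previous], url)
    # Regex tokenization is an assumption standing in for morphological analysis.
    for word in set(re.findall(r"\w+", text.lower())):
        databases[target].setdefault(word, set()).add(url)
    location[url] = target


page = "http://example.com/p"
databases = {"high": {}, "medium": {}, "low": {}}
location = {}
store_page(databases, location, page, "Breaking news today", "high")
store_page(databases, location, page, "Old news", "low")   # frequency dropped: moved
```

After the second call the page is searchable only through the low-frequency database, and only under the words of its current content.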

As described above, the number of robots need not match the number of databases. For example, in the case of FIG. 4, the number of robots may be two, or four or more. The correspondence between robots and databases may be set as appropriate.

Database search by a search user is the same as in FIG. 1 described above.

Next, the configuration example of FIG. 6 will be described. The search device 300 of FIG. 6 differs from the search device 200 of FIG. 4 in that a database front end (DBF) 301, which coordinates the databases as a whole, is provided.

In this configuration example, the DBF 301 accepts a search request from the user terminal 133, queries the appropriate databases, and presents the results to the user.

Next, specification of the search target range in a database search will be described.

In the first embodiment, it is preferable that a search request can specify, in addition to a search condition using keywords, a target range of update frequency and/or a target range of update time. When no update frequency is explicitly specified in the search request, the database or the DBF may perform the search using a default value (for example, an update frequency range covering only the database with the highest update frequency).

FIG. 7 shows an example of the search procedure in the search device of FIG. 6.

When the user sends a search request from the user terminal 133 to the database front end 301, the database front end 301 receives the search request from the user terminal 308 in step S31.

In step S32, it is determined whether the search request specifies an update frequency range.

If it does, in step S33 the search is performed only in the databases covering the range specified in the user's search request.

If it does not, in step S34 the search is performed in all databases.

In step S35, the results are merged and returned to the user terminal 308.
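The routing in steps S31 to S35 can be sketched as follows. This is only an illustration under stated assumptions: each database is represented as a hypothetical tuple of the update frequency range it covers and a word-to-URL table, and the function name `dbf_search` is not from the specification.

```python
def dbf_search(databases, keyword, freq_range=None):
    """Route a keyword query to the relevant databases and merge the results.

    databases:  list of (min_freq, max_freq, index) tuples, where index maps
                a word to the set of URLs containing it (assumed layout).
    freq_range: optional (low, high) update frequency range from the request.
    """
    if freq_range is not None:
        # S32/S33: only databases whose range overlaps the requested range.
        low, high = freq_range
        targets = [db for db in databases if db[0] <= high and db[1] >= low]
    else:
        # S34: no range given, so every database is searched.
        targets = databases

    merged = set()
    for _min_freq, _max_freq, index in targets:
        merged |= index.get(keyword, set())
    # S35: merged result returned to the caller (sorted for a stable order).
    return sorted(merged)
```

A request with `freq_range=(2.0, 5.0)` would then touch only the database holding pages updated at least that often, while a request without a range falls through to all databases.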

Next, the hardware configuration of the system will be described.

In the first embodiment, it is preferable to configure the system so that the computers constituting the database (or the database and robot) in charge of the more frequently updated data (for example, data with a higher statistical update frequency or a more recent last update time) have processing capacity equal to or higher than that of the computers constituting the database (or the database and robot) in charge of the less frequently updated data (for example, data with a lower statistical update frequency or an older last update time), for instance by using computers of equal or higher speed, or an equal or larger number of computers.

That is, the database in charge of the more frequently updated data is used more often than the database in charge of the less frequently updated data, so the overall processing capacity can be improved effectively simply by raising the processing capacity of the former.

Therefore, by dividing the database according to update frequency as in the present embodiment, high-speed computers are needed only for the frequently updated database, and the burden of constructing the databases can be reduced effectively.

For example, as shown in FIG. 8, when the computer group constituting the first search device 410 is in charge of the frequently updated data group and the computer group constituting the second search device 401 is in charge of the infrequently updated data group, the database in the first computer group 410 is duplicated in hardware to increase speed. Besides multiplexing the hardware, speed can also be increased by using computers built from faster devices, by increasing memory capacity, and so on.

The present embodiment has been described above with a single network, but the environment may be one in which a plurality of networks 500 to 504 are interconnected as shown in FIG. 9. Furthermore, the networks 500 to 504 may connect locations that are physically far apart, such as different organizations or countries.

(Second Embodiment)
Next, the second embodiment will be described.

In this embodiment, the search system is also equipped with a proxy function. If the system already holds the page data to be referenced as a search result, it returns the data it already holds instead of fetching the data anew over the network.

This also addresses the problem of frequently changing pages described above. With a frequently changing page, by the time the user follows a link shown in the search results, the page may already be gone, or may have been updated so that it is no longer useful. If the data used to build the search database is presented instead, this problem does not arise.

That is, frequently changing pages are acquired by sampling as shown in FIG. 13, and the content is retained until the next acquisition. Thus, even if the page disappears or its content is replaced at t1 in FIG. 13, for example, the content as last sampled at t0 can be presented.

FIG. 10 shows an example of the system configuration of the present embodiment.

As shown in FIG. 10, the search device 601 of the present embodiment is connected to a network 600 and has a robot 602, a cache 603, a database conversion unit 604, a database 605, a database front end (DBF) 607, and a WWW front end 606. Although not shown in FIG. 10, WWW sites and user terminals are assumed to be connected via the network 600. The database is shown as a single database in FIG. 10, but it may be divided into a plurality of databases. The invention described in the first embodiment may also be applied to the plurality of databases so that the storage of information is shared among them according to the update frequency of the data.

In the present embodiment, the database stores the URLs of pages. Keywords extracted from a page, for example by full-text analysis, are stored together with the URL, and URLs are searched by keyword.

Database creation is described first, followed by use of the system.

An example of the procedure up to database creation is as follows.
First, the robot 602 collects data from other WWW sites via the network 600 according to a visit list. If the search device is itself a WWW site with its own content, data is collected from itself as well.
The collected data is stored in the cache 603.
The database conversion unit 604 creates the search database 605 from the data stored in the cache 603. For example, when keyword search is performed in word units, the database conversion unit 604 performs morphological analysis on the data in the cache 603 and builds the database in word units. Thus, when a user requests information containing a specific word, the database can be searched immediately. In this search device, the location recorded for the data when building the database is the address of the data stored in the cache 603, not the network address (URL) from which the data was acquired.
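The word-unit database built from the cache can be pictured as an inverted index keyed by cache address rather than by origin URL. The sketch below is an illustration only: the function name `build_index`, the cache address strings, and the use of lowercase whitespace splitting in place of morphological analysis are all assumptions.

```python
from collections import defaultdict


def build_index(cache):
    """Build a word -> set of cache addresses table.

    cache: dict mapping a cache address (not the origin URL) to page text.
    """
    index = defaultdict(set)
    for cache_address, text in cache.items():
        # Lowercase whitespace splitting stands in for morphological analysis.
        for word in text.lower().split():
            index[word].add(cache_address)
    return index
```

Because the index points at cache addresses, a hit can be served from the cache even if the origin page has since changed or disappeared.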

Pages that the WWW front end 606 accesses and acquires in response to reference requests from users are likewise stored in the cache 603 and made into a database in the same manner as above.

Next, an example of the procedure for using the system is as follows.

The user accesses the WWW front end 606 of the search device 601 via the network 600 and issues a search request.
The request is passed to the database front end (DBF) 607. If there are a plurality of databases, the appropriate databases are selected and the search request is issued to them.
When the search request has been issued to a plurality of databases, the database front end (DBF) 607 combines the results and presents them to the user via the WWW front end 606.
If the search results include an item whose content the user wishes to see, the user issues a reference request to the WWW front end 606 of the search device 601.
If the requested page is stored in the cache 603, the WWW front end 606 retrieves it from the cache 603 and returns it to the requester. If it is not in the cache 603, the WWW front end 606 notifies the requester accordingly.

Here, rather than collecting all obtainable data with the robot, the search device may have the robot prefetch, in addition to data designated in advance, data that is statistically likely to be the subject of reference requests. This is effective when not all data on the WWW is to be searched, or when data is to be refreshed on the basis of user demand rather than the actual update frequency of the pages.

That is, when not all data on the WWW is to be searched, the question is what range the robot should collect. The pages and sites appearing in requests to this combined search server and proxy are therefore processed statistically, and frequently requested data and data from frequently requested sites are preferentially prefetched by the robot. In this case, the robot visits a page more often not only the higher the actual update frequency of the page, but also the higher the probability that a reference request for the page will occur. As a result, appropriate data is mirrored without any special designation by the system administrator.

FIG. 11 shows a configuration example of such a search device.

The search device 701 of FIG. 11 is the search device 601 of FIG. 10 with a user request recording unit 708 added. Description of the corresponding parts is therefore omitted, and the description focuses on the differences.

FIG. 12 shows the information collection procedure of the search device 701.

In step S41, the users' access log is analyzed to obtain information on the pages and sites that are frequently referenced at that site.

In step S42, information on pages and sites explicitly designated separately, for example by the system administrator, is merged with the information obtained in step S41.

In step S43, the data identified above is acquired using the robot according to its statistical update probability. If no statistical update probability information has been obtained for a page, the average of the known statistical update probability information for the pages of the site containing that page is substituted. If the statistical update probability information for that site is also unknown, the statistical update probability information of all known sites, or a default value, is substituted. Data is acquired repeatedly at a frequency proportional to this statistical update probability information. In addition, if a site is found to be likely to be updated at a particular time, the information is fetched shortly after that time.
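The fallback chain in step S43 (page, then site average, then all known data, then a default) can be sketched as below. This is a hedged illustration: the function names, the unit of "updates per day", the default value, and the choice to average over all known pages for the third fallback are assumptions, not details given in the specification.

```python
from statistics import mean
from urllib.parse import urlsplit

DEFAULT_RATE = 1.0  # assumed default, in updates per day


def estimated_update_rate(url, page_rates, default=DEFAULT_RATE):
    """Return an update rate for url from page_rates (url -> updates per day)."""
    if url in page_rates:
        return page_rates[url]
    # Fallback 1: average over known pages of the same site.
    site = urlsplit(url).netloc
    site_rates = [r for u, r in page_rates.items() if urlsplit(u).netloc == site]
    if site_rates:
        return mean(site_rates)
    # Fallback 2: average over everything known; fallback 3: default value.
    if page_rates:
        return mean(page_rates.values())
    return default


def fetch_interval_hours(rate_per_day, visits_per_update=1.0):
    """Fetch frequency proportional to the update rate means the interval
    between fetches is inversely proportional to it."""
    if rate_per_day <= 0:
        return float("inf")
    return 24.0 / (visits_per_update * rate_per_day)
```

For example, a page estimated to change four times a day would be revisited every six hours with the assumed proportionality constant of one visit per update.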

Because the search device 701 also serves as a proxy, a user who simply wants information on the network, rather than a search, issues a reference request to the search device 701. The reference request is passed via the WWW front end 706 to the user request recording unit 708, where a record of the requested data is kept. If the requested data is in the cache 703, it is returned as is; otherwise the data is fetched via the network 700, stored in the cache 703, and then returned to the user via the WWW front end 707.

Thus, in the search device of FIG. 11, information on which data users are most interested in is stored in the user request recording unit 708. Accordingly, when the robot collects data in advance, it does not attempt to acquire all the data it could obtain; it acquires the data recorded in the user request recording unit 708 and the data explicitly designated for acquisition.

A group of data that should not be acquired may also be designated so that such data is not acquired even if it is recorded in the user request recording unit 708.
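The proxy behaviour and the selection of prefetch targets described above can be sketched together. The class `SearchProxy`, its method names, the `fetch_from_origin` callback, and the `top_n` cut-off are hypothetical; the request counter merely stands in for the user request recording unit 708.

```python
from collections import Counter


class SearchProxy:
    def __init__(self, fetch_from_origin):
        self.cache = {}                  # url -> page body
        self.request_log = Counter()     # stand-in for the request recording unit
        self.fetch_from_origin = fetch_from_origin

    def reference(self, url):
        """Record the request, then serve from cache or fetch, store, and serve."""
        self.request_log[url] += 1
        if url not in self.cache:
            self.cache[url] = self.fetch_from_origin(url)
        return self.cache[url]

    def prefetch_candidates(self, explicit=(), excluded=(), top_n=10):
        """Explicitly designated URLs plus the most requested ones,
        minus anything designated as not to be acquired."""
        popular = [url for url, _count in self.request_log.most_common(top_n)]
        wanted = list(dict.fromkeys(list(explicit) + popular))  # de-duplicate, keep order
        blocked = set(excluded)
        return [url for url in wanted if url not in blocked]
```

The robot would then visit only the URLs returned by `prefetch_candidates`, rather than everything it could reach.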

For frequently updated data, however, the records in the user request recording unit 708 are not considered useful as they stand, because the data is likely to have disappeared by the time it is visited again. For such data, therefore, only the site or the path to the data is treated as valid information, and the robot acquires data from the same site even if it is not the same data.

For example, a URL whose file name is a number, such as the following, is likely to exist only temporarily.
http://www.tsb.co.jp/foo/1246389.html
In such a case, rather than acquiring this file again, the file that links to it is acquired, and the files reached by following the links from that file are acquired.
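One way to recognize such numerically named URLs and redirect the robot to the linking page is sketched below. The regular expression, the function names, and the `linked_from` table recording which pages link to a URL are assumptions for illustration; the specification does not say how the check is made.

```python
import re
from urllib.parse import urlsplit

# A file name made only of digits, with an optional extension (assumed heuristic).
NUMERIC_NAME = re.compile(r"^\d+(\.\w+)?$")


def looks_temporary(url):
    """True if the last path segment is a bare number such as 1246389.html."""
    name = urlsplit(url).path.rsplit("/", 1)[-1]
    return bool(NUMERIC_NAME.match(name))


def refetch_targets(url, linked_from):
    """Return the URLs the robot should start from when revisiting url.

    linked_from: dict mapping a URL to the pages known to link to it.
    For a temporary-looking URL the linking pages are returned, so that the
    robot can follow their current links instead of the stale file.
    """
    if looks_temporary(url):
        return list(linked_from.get(url, []))
    return [url]
```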

The search device of FIG. 11 assumes that prefetched data will be used in the future. The media to be prefetched, such as text, images, audio, and video, can be selected arbitrarily. For example, if only text is designated for prefetching because of storage capacity constraints and a page contains a video, the video is either fetched over the network when the user references it or is not displayed.

Next, the page acquisition frequency in the search devices of FIGS. 10 and 11 will be described.

The robot periodically re-acquires the page at the same URL, preferably at a frequency corresponding to the update frequency of the target page. That is, it fetches the page a number of times proportional to the number of times the page is statistically changed per day. However, once designated data has disappeared, it is never fetched again. If the acquired data contains hyperlinks, the linked information may also be fetched.

For data of designated site groups or URL groups, reload requests issued by users are not honored. This guarantees that the search server does not issue more than a certain number of requests for the same URL.

Next, the search targets in the search devices of FIGS. 10 and 11 will be described.

In this embodiment, data collected by the robot is also placed in the proxy cache and managed in the same place as data directly requested by users.

Because referenced content may be unencrypted paid data, and because user privacy is also a concern, restrictions may be placed on the data that the search system makes searchable.

The restriction is given as a combination of one or more of the following conditions:
(1) limited to data collected by the robot;
(2) limited to data held as a proxy;
(3) for information with the same name or address, limited to the latest version;
(4) excluding information generated dynamically or interactively, for example by CGI;
(5) limited to designated site groups or URL groups.

For example, in FIG. 10, when data is placed in the cache 603, how the data was acquired is also recorded. That is, information such as whether the data was collected by the robot or directly requested by a user, whether it was generated dynamically or interactively by CGI or the like (judged by whether the URL path name contains the string CGI or BIN), and whether it belongs to a designated site group or URL group is recorded together with the data. The administrator can then specify which kinds of data in the cache are to be searchable. The search system builds the database only from the data satisfying the specified conditions.
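Recording the acquisition status with each cache entry and filtering on it before indexing could look like the following sketch. The names `CacheEntry`, `SearchPolicy`, and `is_searchable` are hypothetical, condition (3) (latest version only) is omitted for brevity, and an entry is assumed to be either robot-collected or proxy-acquired, not both.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class CacheEntry:
    url: str
    collected_by_robot: bool           # False: fetched for a user's reference request
    in_designated_group: bool = False  # belongs to a designated site or URL group

    @property
    def looks_dynamic(self) -> bool:
        # Heuristic from the text: path name contains "CGI" or "BIN".
        path = urlsplit(self.url).path.upper()
        return "CGI" in path or "BIN" in path


@dataclass(frozen=True)
class SearchPolicy:
    robot_only: bool = False        # condition (1)
    proxy_only: bool = False        # condition (2)
    exclude_dynamic: bool = False   # condition (4)
    designated_only: bool = False   # condition (5)


def is_searchable(entry, policy):
    """True if the entry may be made into the search database under policy."""
    if policy.robot_only and not entry.collected_by_robot:
        return False
    if policy.proxy_only and entry.collected_by_robot:
        return False
    if policy.exclude_dynamic and entry.looks_dynamic:
        return False
    if policy.designated_only and not entry.in_designated_group:
        return False
    return True
```

Note that the substring heuristic is coarse: any path that happens to contain the letters "BIN" is treated as dynamic.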

Next, rewriting of the addresses of collected data in the search devices of FIGS. 10 and 11 will be described.

In this embodiment, when collected data is stored in the cache of the search device, its address or URL may be rewritten. That is, because the data has moved from a location on the network into the cache in the search device, the domain name is changed to that of the search device, and the original domain name is prepended to the path name. For example:
Original URL: http://www.foo.co.jp/bar/index.html
Domain name of the search device: www.search.co.jp
New URL: http://www.search.co.jp/www.foo.co.jp/bar/index.html
In this way, mirroring of the data can be realized.
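The address rewriting above is simple enough to express directly with Python's standard `urllib.parse` module. The function names `mirror_url` and `original_url` are illustrative, and the inverse mapping is an addition for convenience that the specification does not describe.

```python
from urllib.parse import urlsplit, urlunsplit


def mirror_url(original, search_host):
    """Replace the host with the search device's host and prepend the
    original host to the path."""
    parts = urlsplit(original)
    new_path = "/" + parts.netloc + parts.path
    return urlunsplit((parts.scheme, search_host, new_path, parts.query, parts.fragment))


def original_url(mirrored):
    """Recover the origin URL from a mirrored one (first path segment is the host)."""
    parts = urlsplit(mirrored)
    host, _sep, rest = parts.path.lstrip("/").partition("/")
    return urlunsplit((parts.scheme, host, "/" + rest, parts.query, parts.fragment))
```

With the example from the text, `mirror_url("http://www.foo.co.jp/bar/index.html", "www.search.co.jp")` yields `http://www.search.co.jp/www.foo.co.jp/bar/index.html`.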

Next, time management of collected data in the search devices of FIGS. 10 and 11 will be described.

In this embodiment, update time data may also be attached to the collected data and managed. Unlike an ordinary proxy, which holds only the latest data for a given address (URL), the search device also manages and retains past data. The time recorded here is the time at which the data became valid, optionally together with the time at which it became invalid.

When the content at the same URL is updated, the update time reported by the server changes; that time is taken as the time at which the new data became valid and as the time at which the previous data became invalid. When the data itself has disappeared, the invalidation time is the time at which an access revealed that it had disappeared.

Addresses (URL names) are rewritten for the purpose of time management.

First, because the data has moved from a location on the network into the cache in the search device, the domain name is changed to that of the search device. Next, the original domain name is prepended to the path name. For example:

Original URL: http://www.foo.co.jp/bar/index.html
Domain name of the search device: www.search.co.jp
New URL: http://www.search.co.jp/www.foo.co.jp/bar/index.html
Time information is then appended. For example, for data that was valid from 16:39 on March 23, 1996 to 10:23 on April 30, 1996, the address becomes:

http://www.search.co.jp/www.foo.co.jp/bar/index.html/199603231639-199604301023
The following variant is also conceivable:

http://www.search.co.jp/www.foo.co.jp/bar/index.html/1996.3.23.16.39-1996.4.30.10.23
Each configuration in the embodiments of the present invention described above can be realized by creating a program that causes a computer to execute the corresponding procedures or means and having the computer execute it.
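As one small illustration of such a program, the time-stamped address formats described above could be generated as follows. The function name `timestamped_url` and the `dotted` flag are assumptions; only the two output formats come from the text.

```python
from datetime import datetime


def _compact(moment):
    # Fixed-width form, e.g. 199603231639.
    return moment.strftime("%Y%m%d%H%M")


def _dotted(moment):
    # Variant form without zero padding, e.g. 1996.3.23.16.39.
    return f"{moment.year}.{moment.month}.{moment.day}.{moment.hour}.{moment.minute}"


def timestamped_url(mirrored_url, valid_from, valid_until, dotted=False):
    """Append the validity period of a cached version to its mirrored URL."""
    fmt = _dotted if dotted else _compact
    return f"{mirrored_url}/{fmt(valid_from)}-{fmt(valid_until)}"
```

Keeping the validity period in the address lets several past versions of the same page coexist in the cache under distinct names.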

The program may also be recorded on a machine-readable medium, and the computer may be configured to read the program from the medium and execute it.

The present invention is not limited to the embodiments described above and can be implemented with various modifications within its technical scope.

FIG. 1 is a diagram showing a configuration example of a search device according to the first embodiment of the present invention.
FIG. 2 is a flowchart showing an example of a processing procedure of the search device.
FIG. 3 is a diagram for explaining a method of dividing data among databases according to the last update time.
FIG. 4 is a diagram showing another configuration example of the search device according to the embodiment.
FIG. 5 is a flowchart showing an example of a processing procedure of the search device.
FIG. 6 is a diagram showing still another configuration example of the search device according to the embodiment.
FIG. 7 is a flowchart showing an example of a processing procedure of the search device.
FIG. 8 is a diagram showing still another configuration example of the search device according to the embodiment.
FIG. 9 is a diagram showing an example of a system configuration in which a plurality of networks are connected.
FIG. 10 is a diagram showing a configuration example of a search device according to the second embodiment of the present invention.
FIG. 11 is a diagram showing a configuration example of another search device according to the second embodiment of the present invention.
FIG. 12 is a flowchart showing an example of a processing procedure of the search device.
FIG. 13 is a diagram for explaining sampling of frequently changing pages.

Explanation of reference numerals

100, 500 to 504, 600: Network
100, 110, 120, 200, 300, 401, 410, 601: Search device
102, 112, 122, 204, 602: Robot
101, 101-1, 101-2, 111, 121, 605: Database
131, 132: WWW site
133: User terminal
301, 301-1, 301-2, 607: Database front end (DBF)
603: Cache
604: Database conversion unit
606: WWW front end
708: User request recording unit

Claims

1. A search system that creates a database based on data collected on a network using a robot and performs database searches, the search system comprising:
cache means for holding data acquired in response to reference requests from outside and data collected using the robot; and
data providing means for, when a reference request is given from outside, providing the requested data from the cache means if the data is held in the cache means, and acquiring the data from the original server holding the data and providing it if the data is not held in the cache means.

2. The search system according to claim 1, further comprising:
prediction means for performing statistical processing on data for which reference has been requested from outside and predicting data for which reference will be requested in the future; and
prefetch means for acquiring, using the robot, the predicted data and data explicitly designated in advance, and prefetching the data into the cache means.

3. The search system according to claim 1, wherein the prefetch means re-acquires the data to be acquired at a frequency corresponding to the update frequency of the data.

4. The search system according to claim 1, wherein at least one of the following conditions is imposed as a constraint on the range of data to be searched in response to a search request: a condition limiting the range to data collected by the robot; a condition limiting the range to data acquired in response to reference requests from outside; a condition limiting data having the same name or address to the latest version; a condition excluding data generated dynamically or interactively; and a condition limiting the range to designated sites or data groups.

5. The search system according to claim 1, wherein the cache means attaches at least one of update time information and collection time information to the acquired data and holds the data.