JP2009282913A - Personal-adaptive web information search device, method, and program

Info

Publication number: JP2009282913A
Application number: JP2008136861A
Authority: JP
Inventors: Shuichi Nakawatase; 秀一中渡瀬; Minako Izawa; 味奈子井沢
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: NTT Inc
Priority date: 2008-05-26
Filing date: 2008-05-26
Publication date: 2009-12-03

Abstract

[Problem] To perform personally adaptive search by appropriately acquiring an individual's profile information, so that Internet information search can be tailored to that individual's interests.
[Solution] The present invention extracts the set of feature words that serves as a user's profile information from the body text of the Web pages indicated by the URLs in the user's bookmarks, and also from the Web pages that those pages link to. To follow changes caused by bookmark updates, the profile information (feature word group) before and after an update is compared, the feature word group forming their common part is generated, and the profile information is updated with it. Search results are sorted and output based on the updated profile information together with the updated link-destination profile information whose similarity to it is at or above a certain level.
[Selected drawing] FIG. 1

Description

The present invention relates to a personal adaptive Web information retrieval apparatus, method, and program, and more particularly to a personal adaptive Web information retrieval apparatus, method, and program capable of updating the profile information used when Web search results are scored according to a personal profile.

Among conventional personal adaptive information retrieval techniques, the following method uses profile information created according to an individual's interests to change the relevance ranking of information retrieval results.

The search expressions a user entered in ordinary information retrieval, the character strings displayed in the search results, and the words contained in the text of documents the user saved from those results are collected. The words that remain after stop words are removed, together with high-frequency words (feature words), are taken as the user's profile information. The ranking by similarity between this profile and the document vectors of the documents returned by a search is then used as the ranking of the search results (see, for example, Non-Patent Document 1).
Non-Patent Document 1: Hiroshi Nomiyama, "Personal Adaptive Information Retrieval System: Hierarchical Memory Model for Learning Individual Interests and Its Application to Collaborative Filtering," IPSJ SIG Notes (Information Processing Society of Japan, SIG on Fundamental Informatics), Vol. 96, No. 70, pp. 49-56 (1996)

However, the conventional method described above has the following problems.

In the above method, the search results as a whole include many Web pages that do not match the user's request, so the feature words extracted from them include many that do not match the user's interests.

In addition, users usually do not save Web pages that belong to long-lived Web sites. With the conventional method it is therefore difficult to extract feature words suitable as profile information from such pages, and the personal adaptation of search results based on that profile is not performed appropriately either.

The present invention has been made in view of the above points. Its object is to provide a personal adaptive Web information retrieval apparatus, method, and program that can perform personally adaptive search by appropriately acquiring an individual's profile information, so that Internet information retrieval can be tailored to the individual's interests.

FIG. 1 is a diagram for explaining the principle of the present invention.

The present invention (claim 1) is a personal adaptive information retrieval method for retrieving Web page information according to an individual's interests, in which the following steps are performed:
a URL extraction step (step 1) in which URL extraction means acquires, from bookmark storage means, the URLs saved as useful among the Web pages that the user has browsed and saved in the past;
a feature word extraction step (step 2) in which first feature word extraction means acquires a Web page based on each URL and stores a feature word group WAa extracted from the body text of the Web page in first profile information storage means;
a first update step (step 3) in which first update means compares the previous feature word group WBa stored in second profile information storage means with the feature word group WAa stored in the first profile information storage means, generates a feature word group WCa forming their common part, and stores it in the second profile information storage means, thereby updating the contents of the second profile information storage means;
a reference destination page acquisition step (step 4) in which second feature word extraction means acquires reference destination Web pages based on the acquired URLs and stores a feature word group WAb extracted from the body text of the reference destination Web pages in first additional profile information storage means;
a second update step (step 5) in which second update means compares the previous feature word group WBb stored in second additional profile information storage means with the feature word group WAb stored in the first additional profile information storage means, generates a feature word group WCb forming their common part, and stores it in the second additional profile information storage means, thereby updating the contents of the second additional profile information storage means;
a feature word synthesis step (step 6) in which feature word synthesis means compares similarity for each of the feature word groups WAa and WAb stored in the second profile information storage means and the second additional profile information storage means, and extracts a feature word group WD having at least a predetermined similarity; and
a search result sorting step (step 8) in which, when a search expression is input (step 7), search result sorting means sorts the retrieved results based on the feature word group WD and outputs them.

Further, in the present invention (claim 2), in the reference destination page acquisition step (step 4),
the process of acquiring the source indicated by a URL, based on the acquired reference destination URL and its reference level n, is repeated until the reference level reaches a predetermined level N.

Further, in the present invention (claim 3), in the search result sorting step (step 8),
a document vector is generated from each document included in the search results,
the inner product of each document vector with the vector based on the feature word group WD is calculated, and the documents are sorted in descending order of the inner product.
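
The sorting described in claim 3 can be illustrated with a minimal Python sketch. This is not the patent's implementation: the names `document_vector` and `sort_by_profile`, the whitespace tokenization, the raw term-frequency weighting, and the weight of 1 given to each feature word are all assumptions made for illustration.

```python
from collections import Counter


def document_vector(text):
    """Build a term-frequency vector for one document.

    Whitespace tokenization and lowercasing are assumptions; the patent
    does not specify how the document vector is constructed.
    """
    return Counter(text.lower().split())


def sort_by_profile(documents, feature_words):
    """Sort documents in descending order of the inner product between
    each document vector and the vector of the feature word group WD."""
    # Each feature word gets weight 1 per occurrence in the list (assumption).
    profile = Counter(feature_words)

    def score(doc):
        vec = document_vector(doc)
        # Inner product over the words present in the profile vector.
        return sum(vec[word] * weight for word, weight in profile.items())

    # sorted() is stable, so documents with equal scores keep their
    # original search-engine order.
    return sorted(documents, key=score, reverse=True)
```

Because Python's sort is stable, documents that score the same against the profile keep the order in which the search engine returned them, which seems a reasonable default when the profile says nothing about a result.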

FIG. 2 is a diagram showing the principle configuration of the present invention.

The present invention (claim 4) is a personal adaptive information retrieval apparatus for retrieving Web page information according to an individual's interests, comprising:
bookmark storage means 1 in which Web pages selected by the user are stored as bookmarks;
URL extraction means 2 for acquiring, from the bookmark storage means 1, the URLs saved as useful among the Web pages that the user has browsed and saved in the past;
page acquisition means 3 for acquiring a Web page based on each URL;
first feature word extraction means 4 for storing a feature word group WAa extracted from the body text of the Web page in first profile information storage means 9;
first update means 10 for comparing the previous feature word group WBa stored in second profile information storage means 8 with the feature word group WAa stored in the first profile information storage means, generating a feature word group WCa forming their common part, and storing it in the second profile information storage means 8, thereby updating the contents of the second profile information storage means 8;
reference destination page acquisition means 13 for extracting reference destination URLs from the Web page acquired by the page acquisition means 5 and acquiring the reference destination Web pages;
second feature word extraction means 11 for storing a feature word group WAb extracted from the body text of the reference destination Web pages in first additional profile information storage means 17;
second update means 18 for comparing the previous feature word group WBb stored in second additional profile information storage means 16 with the feature word group WAb stored in the first additional profile information storage means 17, generating a feature word group WCb forming their common part, and storing it in the second additional profile information storage means 16, thereby updating the contents of the second additional profile information storage means 16;
feature word synthesis means 19 for comparing similarity for each of the feature word groups WAa and WAb stored in the second profile information storage means 8 and the second additional profile information storage means 16, and extracting a feature word group WD having at least a predetermined similarity; and
search result sorting means 22 for sorting, based on the feature word group WD, the results retrieved by search means 21 using the input search expression, and outputting them.

Further, in the present invention (claim 5), the page acquisition means 5
includes means for outputting the reference destination URLs of the acquired Web page and the reference level n to the reference destination page acquisition means 13, and
the reference destination page acquisition means 13
includes means for repeating the process of acquiring the source indicated by a URL, based on the acquired reference destination URL and the reference level n, until the reference level reaches a predetermined level N.

Further, in the present invention (claim 6), the search result sorting means 22 includes:
means for generating a document vector from each document included in the search results; and
means for calculating the inner product of each document vector with the vector based on the feature word group WD and sorting the documents in descending order of the inner product.

The present invention (claim 7) is a personal adaptive Web information retrieval program for causing a computer to function as each of the means constituting the personal adaptive Web information retrieval apparatus according to any one of claims 4 to 6.

As described above, the conventional feature word extraction method extracts feature words from the entire search result pages that the user viewed temporarily during an Internet search, or from the body text of saved Web pages. In contrast, the present invention extracts feature words from the Web pages indicated by the URLs in the user's bookmarks and from the other Web pages those pages refer to. This makes it possible to create profile information that is better adapted to the individual user, and consequently the relevance ranking of search results based on that profile is more personally adapted.

Furthermore, even if the bookmarks are subsequently updated and their contents change, the feature word group that changes accordingly is computed from a comparison of the feature word groups before and after the update, so the contents of the profile information can be updated appropriately.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

FIG. 3 shows the configuration of a personal adaptive Internet information retrieval apparatus according to an embodiment of the present invention.

The apparatus shown in the figure comprises a bookmark storage unit 1, a URL extraction unit 2, a timing control unit 3, a URL filtering unit 4, a basic HTML acquisition unit 5, a text extraction unit 6, a feature word acquisition unit 7, a basic feature word storage unit A8, a basic feature word storage unit B9, a comparison update unit 10, a link acquisition unit 11, a URL filtering unit 12, an HTML acquisition unit 13, a text extraction unit 14, a feature word acquisition unit 15, an additional feature word storage unit A16, an additional feature word storage unit B17, a comparison update unit 18, a feature word synthesis unit 19, a search expression input unit 20, a search engine unit 21, a search result sorting unit 22, and a result display unit 23.

The operation of the above configuration is described below.

The personal adaptive Internet information retrieval apparatus acquires Web page information, such as the URLs that the user has saved as bookmarks because the pages were found useful, from the bookmarks of an Internet browser. It extracts feature word groups from the body text of the Web pages at those URLs and of the Web pages they refer to, and uses them as profile information. When the bookmarks are subsequently updated, the feature word groups obtained from the Web pages before and after the update are compared, and the profile information is updated to a more appropriate one.

At search time, the information retrieval results are sorted into a relevance ranking that is personally adapted by the feature word group in this profile information.

FIG. 4 is a flowchart of the overall processing in an embodiment of the present invention.

It is assumed that, prior to information retrieval, the user has stored Web pages of interest in the bookmark storage unit 1.

Step 101) The URL extraction unit 2 extracts the URLs stored in the bookmark storage unit 1 and sends them to the URL filtering unit 4. The timing of this URL extraction is designated by the timing control unit 3, either as a point in time specified by the user or as a timing scheduled in advance, such as every X hours.

Step 102) The URL filtering unit 4 filters the URLs, removes unnecessary ones, and sends the rest to the basic HTML acquisition unit 5. Specific filtering methods include, for example, removing URLs that match a URL blacklist created in advance, or applying matching rules based on URL character string patterns.

The processing of this step is described in detail with reference to FIG. 5.

Step 103) The basic HTML acquisition unit 5 acquires the Web page at the designated URL (the basic page) from the Internet and sends the HTML source of the page to the text extraction unit 6 and the link acquisition unit 11. At this time, reference level information is attached.

Step 104) The text extraction unit 6 removes tags and the like from the HTML source of the basic page, extracts only the body text, and sends it to the feature word acquisition unit 7.

Step 105) The feature word acquisition unit 7 analyzes the extracted body text to extract words, removes stop words from them, and then selects feature words from the remaining words using a feature word extraction method. The feature words are stored in the basic feature word storage unit B9. Any existing feature word extraction method, such as the TF-IDF method, may be used here.

Step 106) The comparison update unit 10 acquires the feature word group WB from the basic feature word storage unit B9. If nothing is stored in the basic feature word storage unit A8, the acquired feature word group WB is stored in the basic feature word storage unit A8. If a feature word group already exists in the basic feature word storage unit A8, that feature word group WA is acquired, WA and WB are compared, a feature word group WC forming their common part is generated, and the contents of the basic feature word storage unit A8 are updated with WC.

Step 107) The link acquisition unit 11 acquires the link destination URLs indicated by anchor tags from the HTML source sent by the basic HTML acquisition unit 5 in step 103, and sends them to the URL filtering unit 12. The processing of this step is described in detail with reference to FIG. 6.

Step 108) The URL filtering unit 12 filters the link destination URLs in the same manner as the URL filtering unit 4 and sends the resulting URLs to the HTML acquisition unit 13.

Step 109) The HTML acquisition unit 13 acquires the Web page at the designated URL from the Internet and sends the HTML source of the page to the text extraction unit 14. In addition, if the URL is within a predetermined N links of the basic page, it sends the HTML source to the link acquisition unit 11. Here, the link acquisition unit 11, the URL filtering unit 12, and the HTML acquisition unit 13 iteratively acquire the Web pages within N links of the basic page, but an equivalent result may also be obtained by cascading these blocks in N stages. The processing of this step is described in detail with reference to FIG. 7.

Step 110) The text extraction unit 14 removes tags and the like from the HTML source acquired by the HTML acquisition unit 13, extracts only the body text, and sends it to the feature word acquisition unit 15.

Step 111) The feature word acquisition unit 15 analyzes the extracted body text to extract words, removes stop words from them, selects feature words from the remaining words using a feature word extraction method, and stores them in the additional feature word storage unit B17. Any existing feature word extraction method, such as the TF-IDF method, may be used here.

Step 112) The comparison update unit 18 acquires the feature word group WB from the additional feature word storage unit B17. If nothing is stored in the additional feature word storage unit A16, the acquired feature word group is stored in the additional feature word storage unit A16. If a feature word group already exists in the additional feature word storage unit A16, that feature word group WA is acquired, WA and WB are compared, a feature word group WC forming their common part is generated, and the contents of the additional feature word storage unit A16 are updated with WC.

Step 113) The feature word synthesis unit 19 acquires the feature word group WA from the basic feature word storage unit A8 and sends it to the search result sorting unit 22. It also acquires the feature words in the additional feature word storage unit A16, compares each of them for similarity with every word in the feature word group WA, and sends to the search result sorting unit 22 those whose similarity to at least one feature word of WA is at or above a certain level. Any existing similarity measure, such as cosine similarity, may be used here. The processing of this step is described in detail with reference to FIG. 8.

Step 114) The search expression input unit 20 accepts a search expression from the user and sends it to the search engine unit 21.

Step 115) The search engine unit 21 performs an Internet search using the search expression obtained by the search expression input unit 20 and sends the search results to the search result sorting unit 22.

Step 116) The search result sorting unit 22 generates a document vector from each document included in the search results obtained from the search engine unit 21, calculates its inner product with the vector based on the feature word group obtained from the feature word synthesis unit 19, sorts the original documents in descending order of the inner product, and sends the result to the result display unit 23.

Step 117) Finally, the result display unit 23 displays the search results obtained from the search result sorting unit 22 to the user.
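
Steps 105 and 106 (and likewise steps 111 and 112) can be sketched in Python as follows. This is an illustrative sketch only: the patent allows any existing feature word extraction method, and the particular TF-IDF weighting, the tiny stop word list, the `corpus` argument, and the function names `extract_feature_words` and `compare_and_update` are assumptions introduced here.

```python
import math
from collections import Counter

# Illustrative stop word list (assumption); a real system would use a fuller one.
STOP_WORDS = {"the", "a", "of", "and", "is", "to", "in"}


def extract_feature_words(page_text, corpus, top_k=3):
    """Step 105/111: pick the top_k words of a page by a TF-IDF score.

    `corpus` is a list of reference texts used for document frequencies;
    where those texts come from is not specified in the patent.
    """
    words = [w for w in page_text.lower().split() if w not in STOP_WORDS]
    tf = Counter(words)

    def idf(word):
        df = sum(1 for doc in corpus if word in doc.lower().split())
        # Smoothed inverse document frequency (one common variant).
        return math.log((1 + len(corpus)) / (1 + df)) + 1.0

    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(tf, key=lambda w: (-tf[w] * idf(w), w))
    return set(ranked[:top_k])


def compare_and_update(stored, fresh):
    """Step 106/112: if store A is empty, adopt the fresh group WB;
    otherwise keep only the common part WC of WA (stored) and WB (fresh)."""
    if not stored:
        return set(fresh)
    return set(stored) & set(fresh)
```

Keeping only the intersection means a word survives in the profile only if it keeps appearing across updates, which matches the stated aim of tracking the user's stable interests. One consequence of the rule as written is that an empty intersection empties the store, so the next update adopts the fresh group wholesale.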

Next, the URL filtering in steps 102 and 108 is described, using the blacklist method as an example.

FIG. 5 is a flowchart of the URL filtering processing in an embodiment of the present invention. The URL filtering unit 4 is used as the example below, but the URL filtering unit 12, which filters link destination URLs, performs the same processing.

Step 201) A URL is acquired from the URL extraction unit 2.

Step 202) The acquired URL is checked against a blacklist database (not shown) in which the URLs to be removed are registered.

Step 203) If the URL matches the blacklist, the processing ends and the URL is discarded. If it does not match, the processing proceeds to step 204.

Step 204) The URL is output to the basic HTML acquisition unit 5.
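
The filtering of steps 201 to 204 can be sketched as below. The blacklist entry, the pattern rule, and the function name `filter_urls` are hypothetical examples; the patent refers only to a blacklist database that is not shown and to matching rules based on URL string patterns.

```python
import re

# Hypothetical blacklist and pattern rules, for illustration only.
BLACKLIST = {"http://ads.example.com/banner"}
BLOCKED_PATTERNS = [re.compile(r"\.(jpg|png|gif)$", re.IGNORECASE)]


def filter_urls(urls, blacklist=BLACKLIST, patterns=BLOCKED_PATTERNS):
    """Return the URLs that pass both the blacklist and the pattern rules."""
    kept = []
    for url in urls:
        # Steps 202-203: drop a URL registered in the blacklist.
        if url in blacklist:
            continue
        # Alternative rule from step 102: drop a URL matching a string pattern.
        if any(pattern.search(url) for pattern in patterns):
            continue
        # Step 204: pass the URL on to the next unit.
        kept.append(url)
    return kept
```

The same function could serve both the URL filtering unit 4 and the URL filtering unit 12, since the text states that they perform the same processing.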

Next, the operation of the link acquisition unit 11 in step 107 is described.

FIG. 6 is a flowchart of the link acquisition processing in an embodiment of the present invention.

Step 301) The link acquisition unit 11 acquires an HTML document and its reference level n = k from the basic HTML acquisition unit 5.

Step 302) For every anchor tag in the HTML document, the value of its HREF attribute is extracted as a URL.

Step 303) Each URL is output with its reference level n set to k + 1.
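
A minimal sketch of steps 301 to 303 using Python's standard `html.parser` module follows. The names `AnchorCollector` and `extract_links` are hypothetical, and resolving relative links against a `base_url` is an addition that the flowchart does not mention.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect the HREF attribute of every anchor tag (step 302)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, level, base_url=""):
    """Steps 301-303: return (url, level + 1) for each anchor in the document.

    Relative links are resolved against base_url (an assumption; the
    patent text only says the HREF attribute is extracted as a URL).
    """
    collector = AnchorCollector()
    collector.feed(html)
    return [(urljoin(base_url, href), level + 1) for href in collector.hrefs]
```

Carrying the reference level alongside each URL is what lets the HTML acquisition unit decide later whether to follow links any further.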

Next, the operation of the HTML acquisition unit 13 in step 109 is described.

FIG. 7 is a flowchart showing the processing of the HTML acquisition unit in an embodiment of the present invention.

Step 401) A URL and its reference level are acquired from the URL filtering unit 12.

Step 402) The source of the Web page indicated by the URL is acquired.

Step 403) The reference level n is compared with N (the preset maximum reference level). If n < N, the page source and the reference level n are output to the link acquisition unit 11, and the processing proceeds to step 404. If n ≥ N, the processing proceeds directly to step 404.

Step 404) The page source is output to the text extraction unit 14.
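
The loop formed by the link acquisition unit 11 and the HTML acquisition unit 13 can be sketched as a depth-limited traversal. To keep the sketch runnable without network access, page fetching and link extraction are passed in as callables. The function name `collect_linked_pages`, its parameters, and the de-duplication of already visited URLs are assumptions not found in the flowchart, and URL filtering (step 108) is omitted for brevity.

```python
from collections import deque


def collect_linked_pages(start_links, fetch, extract_links, max_level):
    """Fetch pages reachable from start_links up to reference level N.

    start_links   -- iterable of (url, level) pairs (step 401)
    fetch         -- callable url -> page source (step 402)
    extract_links -- callable (source, level) -> list of (url, level + 1)
    max_level     -- the preset maximum reference level N
    Returns a list of (url, source) pairs in the order they were fetched.
    """
    queue = deque(start_links)
    seen = set()  # de-duplication is an addition, not part of FIG. 7
    pages = []
    while queue:
        url, level = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        source = fetch(url)  # step 402
        if level < max_level:  # step 403: follow links only while n < N
            queue.extend(extract_links(source, level))
        pages.append((url, source))  # step 404: hand over for text extraction
    return pages
```

Every fetched page is handed on for text extraction regardless of its level; the level check controls only whether its links are followed, which mirrors how both branches of step 403 lead to step 404.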

Next, the processing of the feature word synthesis unit 19 in step 113 is described.

FIG. 8 is a flowchart of the feature word synthesis processing in an embodiment of the present invention.

Step 501) The basic feature words W1, ..., Wn are acquired from the basic feature word storage unit A8 and output.

Step 502) The loop variables k and i are initialized to 1.

Step 503) If i > m (the maximum number of additional feature words), the processing ends. Otherwise, the processing proceeds to step 504.

Step 504) If k > n (the maximum number of basic feature words), the processing proceeds to step 509. Otherwise, the processing proceeds to step 505.

Step 505) The similarity S_ki between the basic feature word Wk and the additional feature word Ti is calculated.

Step 506) If the similarity S_ki is greater than or equal to a predetermined threshold C, the processing proceeds to step 507. Otherwise, the processing proceeds to step 508.

Step 507) The additional feature word Ti is output, and the processing proceeds to step 509.

Step 508) k is set to k + 1, and the processing proceeds to step 504.

Step 509) i is set to i + 1, and the processing proceeds to step 503.
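
The synthesis loop of FIG. 8 can be sketched as follows. The patent leaves the similarity measure open (cosine similarity is given only as an example) and does not say how a single word is represented as a vector, so the character-bigram cosine similarity used here is a stand-in chosen purely for illustration, and the names `bigram_cosine` and `synthesize` are hypothetical. The sketch also restarts the scan over basic feature words for each additional feature word; the flowchart text initializes k only once in step 502 and does not state this reset explicitly, so it is an assumption about the intended behavior.

```python
import math
from collections import Counter


def bigram_cosine(a, b):
    """Cosine similarity between character-bigram count vectors of two words.

    A stand-in similarity for illustration; not specified by the patent.
    """
    va = Counter(a[i:i + 2] for i in range(len(a) - 1))
    vb = Counter(b[i:i + 2] for i in range(len(b) - 1))
    dot = sum(va[gram] * vb[gram] for gram in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def synthesize(basic_words, additional_words, threshold, similarity=bigram_cosine):
    """Build the feature word group WD from basic and additional words."""
    output = list(basic_words)  # step 501: basic feature words are output as-is
    for t in additional_words:  # loop over i (steps 503 and 509)
        for w in basic_words:  # loop over k (steps 504 and 508), restarted per i
            if similarity(w, t) >= threshold:  # steps 505-506
                output.append(t)  # step 507: keep the additional word once
                break
    return output
```

With a threshold of 0.5, for example, `synthesize(["guitar"], ["guitars", "piano"], 0.5)` keeps "guitars" and drops "piano", so only additional words close to an existing basic feature word extend the profile.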

The operations of the components shown in FIG. 3 can be implemented as a program, which can be installed and executed on a computer used as the personal adaptive Internet information retrieval apparatus, or distributed via a network.

The program can also be stored on a hard disk or on a portable storage medium such as a flexible disk or a CD-ROM, and installed on a computer or distributed.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible within the scope of the claims.

The present invention is applicable to Web search technology on the Internet.

FIG. 1 is a diagram for explaining the principle of the present invention.
FIG. 2 is a diagram showing the principle configuration of the present invention.
FIG. 3 is a configuration diagram of the personal adaptive Web information retrieval apparatus in an embodiment of the present invention.
FIG. 4 is a flowchart of the overall processing in an embodiment of the present invention.
FIG. 5 is a flowchart of the URL filtering processing in an embodiment of the present invention.
FIG. 6 is a flowchart showing the processing of the link acquisition unit in an embodiment of the present invention.
FIG. 7 is a flowchart showing the processing of the HTML acquisition unit in an embodiment of the present invention.
FIG. 8 is a flowchart showing the processing of the feature word synthesis unit in an embodiment of the present invention.

Explanation of symbols

1 Bookmark storage means, bookmark storage unit
2 URL extraction means, URL extraction unit
3 Timing control unit
4 URL filtering unit
5 Page acquisition means
6 Text extraction unit
7 First feature word extraction means, feature word acquisition unit
8 Second profile information storage unit, basic feature word storage unit A
9 First profile information storage means, basic feature word storage unit B
10 First update means, comparison update unit
11 Second feature word extraction means, link acquisition unit
12 URL filtering unit
13 Reference destination page acquisition means, HTML acquisition unit
14 Text extraction unit
15 Feature word acquisition unit
16 Second additional profile information storage means, additional feature word storage unit A
17 First additional profile information storage means, additional feature word storage unit B
18 Second update means, comparison update unit
19 Feature word synthesis means, feature word synthesis unit
20 Search expression input unit
21 Search means, search engine
22 Search result sorting means, search result sorting unit
23 Result output unit

Claims

A personal adaptive Web information retrieval method for retrieving information on Web pages according to an individual's interests, comprising:
a URL extraction step in which URL extraction means acquires, from bookmark storage means, the URLs of Web pages that the user browsed in the past and saved as useful;
a first feature word extraction step in which first feature word extraction means acquires a Web page based on the URL and stores a feature word group WAa extracted from the text of the Web page in first profile information storage means;
a first update step in which first update means compares the feature word group WAa stored in the first profile information storage means with the previous feature word group WBa stored in second profile information storage means, generates a feature word group WCa that is their common part, and stores WCa in the second profile information storage means, thereby updating the contents of the second profile information storage means;
a reference page acquisition step in which second feature word extraction means acquires a reference destination Web page based on the acquired URL and stores a feature word group WAb extracted from the text of the reference destination Web page in first additional profile information storage means;
a second update step in which second update means compares the feature word group WAb stored in the first additional profile information storage means with the previous feature word group WBb stored in second additional profile information storage means, generates a feature word group WCb that is their common part, and stores WCb in the second additional profile information storage means, thereby updating the contents of the second additional profile information storage means;
a feature word synthesis step in which feature word synthesis means compares the similarity of the feature word groups WAa and WAb stored in the second profile information storage means and the second additional profile information storage means, and extracts a feature word group WD having at least a predetermined similarity; and
a search result sorting step in which search result sorting means sorts the results retrieved by an input search expression based on the feature word group WD and outputs the sorted results.
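
As an illustration of the first and second update steps, the "common part" of the current and previous feature word groups can be read as a set intersection. The following is a minimal sketch under that reading, not the patented implementation; the function name `update_profile` and the sample words are hypothetical.

```python
def update_profile(current_words, previous_words):
    """Return the common part of two feature word groups.

    current_words  -- feature words extracted this time (WAa or WAb)
    previous_words -- feature words kept from last time (WBa or WBb)
    The result plays the role of WCa or WCb. Treating the groups as
    plain sets of strings is an assumption of this sketch.
    """
    return set(current_words) & set(previous_words)


# Hypothetical sample data
waa = {"search", "profile", "bookmark", "ranking"}
wba = {"search", "bookmark", "news"}
wca = update_profile(waa, wba)
print(sorted(wca))
```

Words that appear only before or only after a bookmark update are dropped, so the stored profile keeps the terms that persist across updates.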

The personal adaptive Web information retrieval method according to claim 1, wherein, in the reference page acquisition step, the process of acquiring the source indicated by a URL is repeated, based on the acquired reference destination URL and its reference level n, until the reference level reaches a predetermined level N.
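
The repetition up to a predetermined reference level N can be pictured as a depth-limited, level-by-level link traversal. The sketch below assumes that the bookmarked page has reference level 0, that links come from a caller-supplied function `get_links` (hypothetical, standing in for the actual page fetching and link extraction), and that each URL is visited only once; none of these details are stated in the claim.

```python
def collect_reference_pages(start_url, get_links, max_level):
    """Follow links from start_url until the reference level reaches max_level (N).

    Returns a list of (url, reference_level) pairs for the pages reached.
    """
    visited = {start_url}      # assumption: never fetch the same URL twice
    frontier = [start_url]     # pages at the current reference level n
    collected = []
    level = 0                  # assumption: the bookmarked page is level 0
    while frontier and level < max_level:
        next_frontier = []
        for url in frontier:
            for link in get_links(url):
                if link not in visited:
                    visited.add(link)
                    collected.append((link, level + 1))
                    next_frontier.append(link)
        frontier = next_frontier
        level += 1
    return collected


# Hypothetical in-memory link graph instead of real HTTP requests
links = {"a": ["b", "c"], "b": ["d"], "c": ["a"], "d": ["e"]}
pages = collect_reference_pages("a", lambda u: links.get(u, []), 2)
print(pages)
```

With N = 2, page "e" is not collected because it would sit at reference level 3.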

The personal adaptive Web information retrieval method according to claim 1, wherein, in the search result sorting step, a document vector is generated from each document included in the search results, the inner product of each document vector with a vector based on the feature word group WD is calculated, and the documents are sorted in descending order of the inner product.
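
The inner-product sorting can be sketched as follows. The claim does not fix how document vectors or the WD vector are weighted, so this example assumes raw term counts for documents and a weight of 1 for every word in WD; `rank_by_profile` and the sample documents are hypothetical.

```python
from collections import Counter


def rank_by_profile(documents, feature_words):
    """Sort documents in descending order of inner product with the WD vector."""
    # Assumption: every feature word in WD has weight 1.0
    profile = {word: 1.0 for word in feature_words}

    def score(text):
        # Assumption: the document vector is a bag of lowercase term counts
        counts = Counter(text.lower().split())
        return sum(counts[word] * weight for word, weight in profile.items())

    return sorted(documents, key=score, reverse=True)


# Hypothetical search results and feature word group WD
docs = ["web search ranking", "cooking recipes", "search search profile web"]
wd = {"search", "web", "profile"}
ranked = rank_by_profile(docs, wd)
print(ranked)
```

Documents that share more terms with the profile move to the top, while documents with no overlap score zero and fall to the end.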

A personal adaptive Web information retrieval device for retrieving information on Web pages according to an individual's interests, comprising:
bookmark storage means for storing, as bookmarks, Web pages selected by the user;
URL extraction means for acquiring, from the bookmark storage means, the URLs of Web pages that the user browsed in the past and saved as useful;
page acquisition means for acquiring a Web page based on the URL;
first feature word extraction means for storing a feature word group WAa extracted from the text of the Web page in first profile information storage means;
first update means for comparing the feature word group WAa stored in the first profile information storage means with the previous feature word group WBa stored in second profile information storage means, generating a feature word group WCa that is their common part, and storing WCa in the second profile information storage means, thereby updating the contents of the second profile information storage means;
reference destination page acquisition means for extracting a reference destination URL from the Web page acquired by the page acquisition means and acquiring the reference destination Web page;
second feature word extraction means for storing a feature word group WAb extracted from the text of the reference destination Web page in first additional profile information storage means;
second update means for comparing the feature word group WAb stored in the first additional profile information storage means with the previous feature word group WBb stored in second additional profile information storage means, generating a feature word group WCb that is their common part, and storing WCb in the second additional profile information storage means, thereby updating the contents of the second additional profile information storage means;
feature word synthesis means for comparing the similarity of the feature word groups WAa and WAb stored in the second profile information storage means and the second additional profile information storage means, and extracting a feature word group WD having at least a predetermined similarity; and
search result sorting means for sorting the results retrieved by an input search expression based on the feature word group WD and outputting the sorted results.

The personal adaptive Web information retrieval device according to claim 4, wherein:
the page acquisition means includes means for outputting the reference destination URL and the reference level n of the acquired Web page to the reference destination page acquisition means; and
the reference destination page acquisition means includes means for repeating, based on the acquired reference destination URL and the reference level n, the process of acquiring the source indicated by the URL until the reference level reaches a predetermined level N.

The personal adaptive Web information retrieval device according to claim 4, wherein the search result sorting means includes:
means for generating a document vector from each document included in the search results; and
means for calculating the inner product of each document vector with a vector based on the feature word group WD and sorting the documents in descending order of the inner product.

A personal adaptive Web information search program for causing a computer to function as each means constituting the personal adaptive Web information search device according to any one of claims 4 to 6.