JP2007264718A

JP2007264718A - User interest analysis device, method, program

Info

Publication number: JP2007264718A
Application number: JP2006085174A
Authority: JP
Inventors: Masahiro Matsumura (松村真宏); Julian Brody (ブローディ ジュリアン)
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2006-03-27
Filing date: 2006-03-27
Publication date: 2007-10-11

Abstract

[Problem] To provide an algorithm that estimates a user's changing interests in real time from the words propagating between the files the user has viewed, and a device that implements the algorithm.
[Solution] The device comprises: means for inputting, as text for each file, the words contained in a plurality of files taken from the user's browsing history; means for dividing the text into language units; means for extracting "propagating words" that the user followed between the viewed files; means for storing one or more of the propagating words; means for obtaining a predetermined "influence degree" from the frequency with which each propagating word appears across all the files, together with a predetermined iDF value representing the degree to which the propagating word appears in specific files; and means for extracting, as user profile information, a set of words of interest to the user according to an "influence-iDF value" that is a function of the influence degree and the iDF value.
[Selected drawing] FIG. 1

Description

The present invention relates to a user interest analysis device, a method of controlling it, and a computer program that implements the method.

In recent years, so-called interactive media in which users can participate, beginning with the Internet, have grown steadily in both variety and volume in response to diverse needs. Growth has been especially marked in bulletin board media, where posting anonymously lets people exchange opinions under a separate persona; in blog (weblog) media, where entries are displayed in chronological order and authors hyperlink to one another's entries to publish information and exchange opinions; and in social media, communities that members join in order to share friendships, hobbies, and ways of thinking.

On such media, attempts have been made to profile and categorize users and to find keywords that strongly influence them. If so-called "attributes" of a user, such as interests, preferences, needs, age, gender, region, occupation, and values, can be evaluated, content and advertisements can be delivered to that user accurately (referred to as targeted delivery). It has also been pointed out that exchanges of opinion between users affect purchasing decisions, so finding keywords that strongly influence users is expected to help companies with product development and marketing strategy.

For this reason, in marketing practice, for example, analysts read bulletin boards and blogs or join communities to pick out influential keywords. This is done manually, however, and the criteria rest on the analyst's experience and intuition, so the results cannot be evaluated against a uniform index, and analyzing a vast body of media consumes substantial resources.

One way to profile and categorize users is by questionnaire. Collecting a sufficient number of samples takes great effort, however. In addition, accurate targeted delivery after the questionnaires are collected requires tying the responses to information unique to the user's browser, or tracing the user only while logged in, so the approach applies only within a particular medium and cannot be used generally. It has also been pointed out that questionnaire responses are inaccurate for various reasons.

As an attempt to profile and categorize users automatically, a technique is known that analyzes a user's browsing history and search queries and holds information about the user's preferences and interests as a user profile (for example, Patent Document 1).

However, such methods treat word frequency as a key factor when evaluating attributes and selecting keywords. It has been pointed out that in highly anonymous media such as bulletin boards, polarized opinions and defamatory posts mean that frequent words are not necessarily influential. Moreover, in blog media and social media as well, the most frequent words are often not the influential words of the central topic but words that appear in peripheral topics or general-purpose words, so it has been difficult to extract truly influential keywords and to profile and categorize users accurately.

To address this, a model has been proposed that finds influential keywords quantitatively by focusing on how interest in words propagates in text-based communication, the main component of interactive media (Non-Patent Document 1). This model, which represents the strength of contextual dominance, that is, the diffusion of influence, defines a mediated influence quantity for text content and for words and uses it as a measure, which is said to allow keywords with large influence to be extracted even when they are infrequent.

An algorithm has also been proposed that profiles and categorizes users on such media by defining, for each user, the features derived from the set of high-influence keywords extracted as described above as that user's profile (Non-Patent Document 2).

Patent Document 1: JP 2003-67410 A
Non-Patent Document 1: Masahiro Matsumura et al., "Diffusion model of influence in text-based communication" (テキストによるコミュニケーションにおける影響の普及モデル), Transactions of the Japanese Society for Artificial Intelligence, Vol. 17, No. 3, SP-B, pp. 259-267, 2002
Non-Patent Document 2: Masahiro Matsumura et al., "Profiling of online community participants based on the influence diffusion model" (影響の普及モデルに基づくオンラインコミュニティ参加者のプロファイリング), Transactions of the Japanese Society for Artificial Intelligence, Vol. 18, No. 4, A, pp. 165-172, 2003

However, none of these proposals makes effective use of the direction and history of propagation, so a user's changing interests could not be estimated in real time. A feature common to these interactive media is that technologies enabling responses, comments, links, and trackbacks let users write, exchange, and refer to opinions and information. By defining nodes and a browsing order over the time-series record of such writing, exchange, and reference, directed links can be defined that make effective use of the direction and history of propagation. Furthermore, because a user browses files (for example, web pages) according to his or her own interests, the characteristic words that appear consistently across the set of files the user has viewed reflect the user's interests at that moment in real time.

The present invention therefore proposes an algorithm, and a device, method, and program implementing it, that estimate a user's changing interests in real time. In a directed graph whose nodes are the files the user has viewed and whose directed links are the browsing order, the appearance frequency of words propagating between nodes is measured recursively, and the set of top-ranked words by that value is extracted.

(1) A user interest analysis device that extracts words of interest to a user who browses files, comprising: means for using the user's browsing history information to input, as text for each file, the plurality of words contained in the files the user viewed; means for dividing the text morphologically into the smallest meaningful language units; means for extracting "propagating words" that the user followed between the plurality of viewed files; means for storing one or more of the propagating words; means for obtaining a predetermined "influence degree" from the frequency with which a propagating word appears in the target files, and a predetermined iDF value representing the degree to which the propagating word appears in specific files; means for extracting words of interest to the user as user profile information according to an "influence-iDF value" that is a function of the influence degree and the iDF value; and means for outputting the user profile information.

According to invention (1), words that the user followed further, through links or the like, are first extracted from the history of files the user viewed on the Internet as words propagating across the files. Next, the influence degree of each propagating word on the files referred to afterward is quantified. An iDF (Inverse Document Frequency) value, the degree to which the propagating word appears across all the files, is then obtained, and words of interest to the user are detected according to the influence-iDF value, a function of the influence degree and the iDF value. Finally, a specified set of the detected words is output, together with their influence-iDF values, as the user's profile information. With these functions, a user interest analysis device can be provided that analyzes a user's changing words of interest in real time.

By referring to the profile information output by the device, it is also possible to inform sales strategy for products related to the words the user is interested in, to deliver content and advertisements, and to send direct mail and the like to that user efficiently.

(2) The user interest analysis device according to (1), further comprising means for disclosing the profile information to other users.

According to invention (2), by learning which words interest other users in an Internet community, a user can find others who share the same interests (finding friends) or find users who appear knowledgeable in a field and ask them questions (finding experts).

(3) The user interest analysis device according to (1) or (2), further comprising a synonym dictionary for detecting words related to the propagating word, and means for calculating the influence-iDF value for those related words as well.

(4) The user interest analysis device according to any of (1) to (3), wherein the influence-iDF value is obtained by a predetermined formula (described later).

The device according to inventions (1) to (4) can also be realized as an equivalent control method and as a computer program that causes a computer to execute that control method.

According to the present invention, in a directed graph whose nodes are the files viewed by the user and whose directed links are the browsing order, a value that takes into account both the influence and the appearance frequency of words propagating between nodes is measured recursively, and the set of top-ranked words by that value is extracted, so that the user's changing interests can be estimated in real time.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 shows an example of a functional block diagram of the user interest analysis device according to the present invention. As shown, the user interest analysis device 10 comprises file text input means 2, morpheme division means 3 (optional), propagating word extraction means 4, influence degree calculation means 5, iDF value calculation means 6, storage means 7 for temporarily holding data, user interest word extraction means 8, profile information output means 9, and a synonym dictionary 11. This configuration is only an example; other configurations with equivalent functions may be used.

First, the user interest analysis device 10 takes the user's file browsing history 1 as input, and the file text input means 2 extracts the text of each page. The file browsing history generally resides in the temporary storage files of an Internet browser, but it may also be browsing history information from a bulletin board or blog.

Next, when the extracted text consists of sentences, the morpheme division means 3 divides the sentences into the required units. This step may be skipped, for example when the page metadata is used or when the extracted page consists only of single words. The propagating word extraction means 4 then analyzes the group of files the user referred to over a certain period, or the necessary parts of them, and extracts the words common to those pages or the words that propagate. A common word is a keyword that appears in every file. As the later examples show, however, a common word need not match exactly from file to file; partially matching words and synonyms are included.

A propagating word is a word that triggered, or influenced, the user's move from one file to the next. Propagating words likewise need not match exactly on each page and include partially matching words and synonyms. Synonyms are defined using a well-known thesaurus (synonym dictionary) or the like. Propagating words are explained in more detail in the examples below.
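As a rough illustration of treating synonyms as the same word when looking for common words, the following sketch maps each variant to a canonical form before intersecting the word sets of the pages. The dictionary contents, the function names `normalize` and `common_words`, and the sample pages are all hypothetical; the patent does not specify a data structure, and partial matching is not handled here.

```python
# Hypothetical synonym dictionary: each variant maps to one canonical form.
SYNONYMS = {
    "LCD television": "LCD TV",
    "liquid crystal TV": "LCD TV",
}


def normalize(word: str) -> str:
    """Return the canonical form of a word, or the word itself if unknown."""
    return SYNONYMS.get(word, word)


def common_words(pages: list[list[str]]) -> set[str]:
    """Return the words that occur on every page after synonym normalization."""
    if not pages:
        return set()
    normalized = [{normalize(word) for word in page} for page in pages]
    return set.intersection(*normalized)


# Illustrative pages, already split into words.
pages = [
    ["LCD TV", "Product A", "Company X"],
    ["liquid crystal TV", "Product A"],
    ["LCD television", "Product B", "Product C"],
]
print(common_words(pages))  # {'LCD TV'}
```

Normalizing before comparison keeps the later frequency counts from being split across spelling variants of the same concept.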

Next, for each of the one or more extracted propagating words, the influence degree calculation means 5 and the iDF value calculation means 6 calculate an influence degree representing the strength of the propagation and an iDF (Inverse Document Frequency) value, which is a function of the number of files in which the propagating word appears. The influence degree of propagation is a quantity representing the influence (weight) of the propagating word on the files referred to afterward. For example, the definition of the TF (Term Frequency) value can be applied. The TF value generally represents how often a target word appears in a target document; in the present invention, the document is the group of files the user viewed, or the necessary parts of them. Hereinafter, the influence degree of propagation is called EDT (Effect of Diffusible Term).

The iDF value is a function of the frequency with which a target word appears in the target documents and is generally defined as a function that decreases as that frequency increases. Hereinafter, the product of the influence degree described above and the iDF value is called the "influence-iDF value". It is convenient to compute the influence-iDF value using the general TFiDF formula proposed by G. Salton (G. Salton, M. McGill, Introduction to Modern Information Retrieval, New York, McGraw-Hill, 1983) or a variant of it, but any formula will do provided it defines the influence degree with attention to the propagation of words. As one example of an embodiment of the present invention, the value is calculated using the following formula.
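The formula itself is an image in the original publication and is not reproduced in this text. The worked example given for FIG. 5, 10 × 8 / log(6 + 1) = 94.7, is consistent with the form below. It is offered as a reconstruction under that assumption, not as the published notation; the symbol name EDTiDF and the split into an iDF(t) factor are illustrative.

```latex
\mathrm{EDTiDF}(t) = \mathrm{EDT}(t) \times \mathrm{iDF}(t),
\qquad
\mathrm{iDF}(t) = \frac{N}{\log_{10}\bigl(\mathrm{DF}(t) + 1\bigr)}
```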

Here,
t is a propagating word,
EDT is the frequency with which the propagating word appears in the group of files the user viewed during a predetermined time,
N is the total number of files the user viewed during the predetermined time, and
DF(t) is the number of files containing the propagating word t.
The predetermined time is the period analyzed by the analysis device and can be set individually according to the analysis target and needs. For example, it may be several hours or several months.

In the example above, the general definition of the TF value is used as the influence degree, but the influence degree may be defined in other ways. Also, in the iDF(t) formula above, a logarithm need not be used; when one is used, the base may be 10 or, alternatively, the natural base e, 2, or another value. Because the influence degree calculation means 5 and the iDF value calculation means 6 can each select from several formulas, each comprises a corresponding plurality of means, shown in FIG. 1 as 5a, 5b, 6a, and 6b.
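The choice of logarithm base can be illustrated with a small sketch. It assumes the score has the form EDT × N / log(DF + 1), which is inferred from the worked example for FIG. 5 rather than taken from the published formula; the function name `influence_idf` and the sample figures are hypothetical. Changing the base rescales every score by the same constant factor, so the ranking of words is unaffected.

```python
import math


def influence_idf(edt: float, n_files: int, df: int, base: float = 10.0) -> float:
    """Score = EDT * N / log_base(DF + 1); the form is an assumption (see lead-in)."""
    if df < 1:
        raise ValueError("df must be at least 1 so that the logarithm is positive")
    return edt * n_files / math.log(df + 1, base)


# Hypothetical (EDT, N, DF) triples for three words.
examples = {"word_a": (10, 8, 6), "word_b": (8, 8, 8), "word_c": (3, 8, 3)}

for base in (10.0, math.e, 2.0):
    ranked = sorted(
        examples,
        key=lambda word: influence_idf(*examples[word], base=base),
        reverse=True,
    )
    print(f"base {base:.3f}: {ranked}")
```

Because only the ordering (or a threshold scaled to the same base) matters for selecting words, the base can be chosen for convenience.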

Using the calculated influence degree and iDF value, the user interest word extraction means 8 then obtains the influence-iDF value for each of the previously extracted propagating words and extracts the words that interested the user according to this value. For example, words with a large influence-iDF value can be extracted as words that interested the user.

Finally, the profile information output means 9 compares each influence-iDF value with a predetermined threshold and outputs the user's profile.
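The threshold comparison can be sketched as a simple filter. The function name `build_profile`, the sample scores, the threshold value, and the decision to keep words whose score is greater than or equal to the threshold are all assumptions made for illustration; the text says only that the value is compared with a predetermined threshold.

```python
def build_profile(scores: dict[str, float], threshold: float) -> list[tuple[str, float]]:
    """Keep words whose score meets the threshold, highest score first."""
    selected = [(word, score) for word, score in scores.items() if score >= threshold]
    return sorted(selected, key=lambda item: item[1], reverse=True)


# Hypothetical influence-iDF values for four words.
scores = {"word_a": 94.7, "word_b": 67.1, "word_c": 39.9, "word_d": 12.0}
profile = build_profile(scores, threshold=50.0)
print(profile)  # [('word_a', 94.7), ('word_b', 67.1)]
```

Returning the scores alongside the words matches the description of outputting the detected words together with their influence-iDF values.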

FIG. 2 shows another embodiment of the user interest analysis device. The user interest analysis device 20 of FIG. 2 is realized on a general computer system and comprises a CPU 21, an input unit 22, an output unit 23, a communication unit 24, a program memory 25, a working memory 26, and a user profile 27. The synonym dictionary 11 described above may be added as an option.

The input unit 22 may be an ordinary input device, such as a mouse or keyboard, that accepts operations from the user, and the output unit 23 may be a display such as an LCD or CRT. The communication unit 24 sends and receives data over a LAN or the Internet.
The program memory 25 stores the programs, executed by the CPU 21, that provide the functions of the device: a control unit for the device as a whole, a keyword extraction unit that extracts keywords from the input files, an influence-iDF value calculation unit that obtains the influence-iDF value by a predetermined algorithm, a profile creation unit, and so on. The programs need not be divided by function and may be combined into a single program.

The program memory 25 may be ROM or flash memory, or RAM into which programs are loaded from a hard disk drive (HDD). The working memory 26 temporarily stores intermediate data processed by the CPU 21 and generally consists of RAM or an HDD.

The user profile 27 is a storage unit that holds the results of executing the programs stored in the program memory 25. The synonym dictionary 11, as already described, is a dictionary that defines synonyms for the words extracted from the text and is consulted by the keyword extraction unit as needed.

FIG. 3 illustrates the concept of the propagating word described above, using as an example the history of pages viewed by a user. The user first finds a word of interest on page 1. To look into it further, the user either enters the word on a search page and searches for files or, if the word itself carries a hyperlink, clicks the link, and so arrives at page 2. Transitions other than search and hyperlinks are of course also possible. The user likewise moves from page 2 to page 3, finds nothing there about the word of interest, returns to page 2, goes from page 2 to another page 4, and then from page 4 to yet another page 5.

Such a browsing history can be represented as a directed graph in which the viewed pages are nodes and the order of browsing from page to page forms directed links (edges). A directed graph is a graph whose edges between nodes have a direction.
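The browsing sequence described for FIG. 3 (page 1, 2, 3, back to 2, then 4, 5) can be turned into such a directed graph with an adjacency list. This is a minimal sketch: the function name `build_browsing_graph` is hypothetical, and recording the return from page 3 to page 2 as an edge of its own is an assumption, since the text does not say how backward navigation is treated.

```python
def build_browsing_graph(visits: list[int]) -> dict[int, list[int]]:
    """Map each visited page to the pages the user moved to directly from it."""
    graph: dict[int, list[int]] = {page: [] for page in visits}
    # Each consecutive pair of visits is one directed link (source -> destination).
    for source, destination in zip(visits, visits[1:]):
        if destination not in graph[source]:
            graph[source].append(destination)
    return graph


# Order of page views taken from the FIG. 3 description.
graph = build_browsing_graph([1, 2, 3, 2, 4, 5])
print(graph)  # {1: [2], 2: [3, 4], 3: [2], 4: [5], 5: []}
```

Keeping the edges directed preserves the order in which interest moved from page to page, which an undirected co-occurrence count would lose.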

As illustrated, pages 1 to 5 viewed by this user all contain the "common word" 36, but the word that interests the user is not necessarily this common word; it is often a word searched for with the search bar or a word appearing on a hyperlink. A word that triggered a transition (jump) between pages, or that exerted influence on it, is called a "propagating word" (shown in the figure as propagating word 37). A propagating word is therefore considered to express the user's interest in real time better than a word that merely happens to appear on several pages (a common word). If common words alone are simply extracted in order of frequency, the top-ranked words tend to be general nouns such as "product" or "Internet" and verbs such as "do" or "be", and it is not easy to find the words that truly interest the user (influential words). The present invention therefore focuses on two points: the words propagating between files show the user's interests most directly, and adjusting the upper limit on the number of files considered, or the analysis period, makes it possible to control the analysis so that the course of changing interests is shown in real time.

FIG. 4 shows an example of how the words that interest a user change while browsing files. First, the user learns from new product news 41 (page 1) that Company X has released Product A, its latest LCD TV model. Having long been interested in LCD TVs, the user goes straight to Company X's product information site 42 (page 2) and looks at the information on Product A. While reading the details, the user feels the urge to compare it with similar products from other companies and displays a list of LCD TVs from several manufacturers on price comparison site 43 (page 3). There the user becomes interested in Company Y's Product B, which offers equivalent functions to Product A at a lower price, and jumps to Company Y's product information site 44 (page 4) to look at Product B. On learning there that Product B is the successor to Product C but costs considerably more, the user becomes interested in Product C and views its information on Company Y's product information site 45 (page 5). Now even more interested in Product C, the user returns to the price comparison site 46 (page 6) to find the cheapest place to buy it and learns that Shop Z sells it at the lowest price. The user moves to the Shop Z site 47 (page 7), finally decides to buy, and orders Product C from purchase page 48 (page 8).

Under these assumptions, when all the text on pages 1 to 8 visited by this user is analyzed with the user interest analysis device, the keywords "LCD TV", "Company X", "Product A", "Company Y", "Product B", and "Product C" are extracted. The word "LCD TV" appears on every page, whereas "Product A", "Product B", and "Product C" are assumed to appear many times on pages such as the specification descriptions on the manufacturer's site. For example, as shown in the figure, "LCD TV" appears once on each page, and "Product B" and "Product C" each appear five times on Company Y's product specification pages. In this example the user did not view the specifications of Product A, so "Product A" appears once each on pages 1, 2, and 3. The user's interest was initially in Product A but gradually shifted to Product B and then Product C, which the user finally ordered, so "Product C" appears once on page 3, once on page 4, five times on page 5, and once each on pages 6 to 8.

FIG. 5 shows an example in which the influence iDF value of each keyword in the example of FIG. 4 is actually calculated. The influence iDF values were calculated using <Formula 1> given above. The total number of pages N viewed by the user is 8. “Product C”, for example, appears on six pages, pages 3 to 8, so its DF value is 6. It appears five times on page 5 and once each on pages 3, 4, 6, 7, and 8, so its influence is 5 + 1 + 1 + 1 + 1 + 1 = 10. Its influence iDF value is therefore 10 * 8 / (log(6 + 1)) = 94.7 (a result that corresponds to a base-10 logarithm). Obtaining the influence iDF values of the other keywords in the same way and arranging them in descending order gives the table shown in FIG. 5.
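The calculation above can be reproduced in a few lines. The following is a minimal Python sketch, not the patent's implementation: the function name `influence_idf` and the dictionary layout are illustrative assumptions, and the formula EDT * N / log(DF + 1) with a base-10 logarithm is inferred from the worked figure 10 * 8 / (log(6 + 1)) = 94.7. The per-page counts are the ones stated for FIG. 4.

```python
import math


def influence_idf(page_counts, total_pages):
    """Score one word from its per-page occurrence counts.

    page_counts maps page number -> occurrences of the word on that page.
    Assumed formula (inferred from the worked example):
        EDT * N / log10(DF + 1)
    """
    edt = sum(page_counts.values())                     # influence: total occurrences
    df = sum(1 for c in page_counts.values() if c > 0)  # pages containing the word
    return edt * total_pages / math.log10(df + 1)


N = 8  # pages viewed in the FIG. 4 example

# "Product C": once on pages 3, 4, 6, 7, 8 and five times on page 5.
product_c = {3: 1, 4: 1, 5: 5, 6: 1, 7: 1, 8: 1}
# "LCD TV": once on every page.
lcd_tv = {page: 1 for page in range(1, N + 1)}

score_c = influence_idf(product_c, N)
score_tv = influence_idf(lcd_tv, N)

print(round(score_c, 1))   # 94.7
print(score_c > score_tv)  # True
```

Under this assumed formula a word that appears everywhere is penalized by the larger DF in the denominator, which is why “LCD TV” scores below “Product C” despite appearing on every page.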

As the table shows, “LCD TV” is a common word that appears on every page, yet its influence iDF value is low; “Product C” and “Product B” are far better indicators of the user's interest. A profile of the user can be created by collecting the set of words with the highest influence iDF values in this way. A predetermined threshold or the like may be used to select these top-ranked words.
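Threshold-based selection of profile words might look like the sketch below. It is only an illustration under stated assumptions: `build_profile`, the `(EDT, DF)` tuple layout, and the threshold of 80 are hypothetical, and the scoring formula is the one inferred from the worked example (EDT * N / log10(DF + 1)). Only the two words whose counts are fully stated in the text are included.

```python
import math


def influence_idf(edt, df, total_pages):
    # Assumed formula, inferred from the worked example: EDT * N / log10(DF + 1)
    return edt * total_pages / math.log10(df + 1)


def build_profile(stats, total_pages, threshold):
    """Return (word, score) pairs at or above threshold, highest score first.

    stats maps word -> (EDT, DF).
    """
    scored = {w: influence_idf(edt, df, total_pages) for w, (edt, df) in stats.items()}
    kept = [(w, s) for w, s in scored.items() if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# (EDT, DF) for the two words whose counts the text states in full.
stats = {"LCD TV": (8, 8), "Product C": (10, 6)}

# The threshold value is arbitrary and chosen only for this illustration.
profile = build_profile(stats, total_pages=8, threshold=80.0)
print([word for word, _ in profile])  # ['Product C']
```

A fixed threshold keeps the profile small when many words score low; taking the top k words instead would be an equally plausible reading of "or the like".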

FIG. 6 shows a specific example of how the words a user is interested in change on an Internet bulletin board. The figure shows the following exchange on a bulletin board among Mr. A, Mr. B, Mr. C, and Mr. D.

Having decided to take a trip soon, Mr. A posted the following comment on the bulletin board to find accommodation at the destination: “I am going on a four-day, three-night trip to the Hakodate area. Please tell me about any hotels you recommend” (61). Mr. B replied, “For Hakodate I recommend Hotel X. It is clean and reasonably priced” (62). Mr. C commented, “Mr. A, do you like hot springs? If you are going to Hakodate, there are some good hot spring ryokan” (63). Mr. A promptly thanked Mr. B and Mr. C and replied to Mr. C, “Thank you, Mr. C. I love hot springs too” (64). Mr. C then introduced Ryokan Y and Ryokan Z (65). Meanwhile, Mr. D, who had been following the exchange, wrote, “Mr. A, if you are going to Hakodate, why not go a little further to Jozankei? Here are my recommendations” (66), and introduced Hotel Q and Ryokan R, which are in Jozankei rather than in Hakodate where Mr. A planned to go, with links to their websites. Seeing this, Mr. A thought that extending the trip from Hakodate to Jozankei would not be a bad idea and replied, “I will look into the hot spring ryokan in Jozankei right away” (67). Mr. A then examined the websites of Hotel Q and Ryokan R (68, 69) at the URLs Mr. D had provided and finally made a reservation at Ryokan R. In the following, the comments 61 to 69 (including the websites of Hotel Q and Ryokan R) are referred to as pages 1 to 9.

The main keywords that appear in this exchange are, on page 1, “Hakodate”, “trip”, “recommend”, and “hotel”. On page 2 the keywords include “Hakodate”, “hotel”, “recommend”, “clean”, “price”, and “reasonable”. The keywords of pages 3 to 8 are extracted in the same way, words that are unlikely to be keywords, such as conjunctions and particles, are excluded, and the remaining words are arranged in descending order of influence iDF value. The result is shown in the table of FIG. 7.

Mr. A originally posted on the bulletin board intending to find a recommended hotel in Hakodate. After reading Mr. C's comment, however, Mr. A became interested in hot springs, and the links Mr. D posted to the hot spring ryokan websites then proved decisive: Mr. A ended up booking a hot spring ryokan in Jozankei, quite far from Hakodate, the original destination.

As this example shows, the word that had a significant influence on Mr. A's behavior (interest) is clearly “hot spring”. The table of FIG. 7 confirms this: the influence iDF value of “hot spring” ranks highest. It also shows that “Hakodate” and “trip”, in which Mr. A was initially interested, rank low, indicating that interest in them gradually faded.

In this way, by using the user interest analysis device of the present invention to analyze, in chronological order, the pages a user viewed over a predetermined period, shifts in the user's interest can be examined in real time. If words that strongly influence the user's interest (“hot spring” in the example above) can be found, a large amount of such information can be collected and put to good use as a tool for product planning and marketing.

FIG. 8 shows an example of calculating the influence iDF value with synonyms taken into account, for the bulletin board example of FIG. 6. Here “hotel” and “ryokan” are defined as synonyms and treated together as a single word, “hotel/ryokan”, which is compared with the top three other words of FIG. 7. Because “hotel” and “ryokan” are treated as one word, their combined appearance frequency is higher and the influence iDF value rises accordingly. There is thus no doubt that the user is interested in “lodging”, whether a ryokan or a hotel. Even so, the influence iDF value of “hotel/ryokan” does not reach that of “hot spring”. The purpose of the user interest analysis device is to find such highly influential words. For this reason, the formula used to obtain the optimum influence iDF value can also be selected from among several formulas.
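Merging synonyms before scoring could be sketched as follows. This is an illustrative sketch only: `merge_synonyms` and the group label are hypothetical names, the per-page counts are invented and are not the values of FIG. 7 or FIG. 8, and the scoring formula is the one inferred from the worked example of FIG. 5 (EDT * N / log10(DF + 1)).

```python
import math
from collections import defaultdict


def influence_idf(page_counts, total_pages):
    # Assumed formula, inferred from the FIG. 5 worked example.
    edt = sum(page_counts.values())
    df = sum(1 for c in page_counts.values() if c > 0)
    return edt * total_pages / math.log10(df + 1)


def merge_synonyms(counts_by_word, synonym_groups):
    """Fold each synonym's per-page counts into its group label."""
    canonical = {w: label for label, words in synonym_groups.items() for w in words}
    merged = defaultdict(lambda: defaultdict(int))
    for word, page_counts in counts_by_word.items():
        label = canonical.get(word, word)  # words outside any group keep their name
        for page, count in page_counts.items():
            merged[label][page] += count
    return {label: dict(pages) for label, pages in merged.items()}


# Hypothetical counts, NOT taken from FIG. 7 or FIG. 8.
counts = {
    "hotel": {1: 1, 2: 1, 8: 3},
    "ryokan": {3: 1, 5: 2, 9: 3},
    "hot spring": {3: 2, 4: 2, 5: 1, 6: 1, 7: 2, 9: 4},
}
groups = {"hotel/ryokan": ["hotel", "ryokan"]}  # hypothetical synonym dictionary entry

N = 9  # pages in the bulletin board example
merged = merge_synonyms(counts, groups)
scores = {word: influence_idf(pc, N) for word, pc in merged.items()}

for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word}: {score:.1f}")
```

With these invented counts the merged “hotel/ryokan” entry scores higher than either word alone but still below “hot spring”, mirroring the qualitative behavior the text describes.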

FIG. 9 shows another application of the user interest analysis device. The terminals (91 to 93) of user A, user B, and user C are each equipped with a user interest analysis device, and each user has agreed to publish his or her own profile, which is the output of the device, via the Internet 94. If a profile contains information the user wishes to keep private, it may of course be published with that information removed, and it may be disclosed only to member users rather than to the general public. The published user profile information is accumulated in the public profile DB 96 of the profile server 95. Within the public profile DB 96, profile tables A, B, and C (97 to 99) are created, one for each user. Each profile table lists the words the user is interested in together with their ranks, so publishing the tables can serve as a tool for forming a variety of communities.

For example, if user A is interested in “fishing”, user A can search the public profile DB 96 for users with the same hobby. In this case the user interest analysis device is a “friend search” tool. In this example, words related to “fishing” rank high in user C's profile table C 99, so user A can see that user C shares the same hobby and may contact user C directly. Because user A also knows which words user C is interested in, a lively conversation can be expected.

In addition, if the public profile DB 96 allows reference not only to the ranking of the influence iDF values of the words of interest but also to figures such as the total number of pages on which a word of interest appeared, the EDT value, and the period covered by the page history, it becomes possible to judge on what scale (volume) that word shapes the user's profile. For example, if the total number of pages on which “fishing” or its synonyms appear (among the pages user C viewed in a predetermined period) is extraordinarily large, user C can be presumed to be a great fishing enthusiast or an expert. In other words, the user interest analysis device can also serve as an “expert search” tool.
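The “friend search” and “expert search” lookups described above could be sketched as a simple filter over published profile tables. Everything here is a hypothetical illustration: the `ProfileEntry` fields, the `find_users` function, its rank and page-count cutoffs, and the sample data are assumptions, not structures defined by the patent.

```python
from dataclasses import dataclass


@dataclass
class ProfileEntry:
    word: str
    rank: int        # 1 = highest influence iDF value in the user's profile
    page_total: int  # number of viewed pages on which the word appeared


def find_users(profile_db, word, max_rank=3, min_pages=0):
    """Return users who list `word` within max_rank and on at least min_pages pages."""
    matches = []
    for user, entries in profile_db.items():
        for entry in entries:
            if entry.word == word and entry.rank <= max_rank and entry.page_total >= min_pages:
                matches.append(user)
                break
    return sorted(matches)


# Hypothetical public profiles; all values are made up for illustration.
profile_db = {
    "A": [ProfileEntry("fishing", 1, 40), ProfileEntry("camping", 2, 12)],
    "B": [ProfileEntry("movies", 1, 30), ProfileEntry("fishing", 7, 2)],
    "C": [ProfileEntry("fishing", 1, 900), ProfileEntry("lures", 2, 350)],
}

print(find_users(profile_db, "fishing"))                 # ['A', 'C']
print(find_users(profile_db, "fishing", min_pages=500))  # ['C']
```

The rank cutoff corresponds to the friend search (shared top interests), while the page-count cutoff corresponds to the expert search (an unusually large volume of pages on the topic).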

As described above, the more users publish the user profile information obtained by the user interest analysis device of the present invention, the more useful it becomes, not only as a tool for product planning and marketing but also as a tool for finding friends with shared hobbies or for finding experts. Even users who do not want to make their profiles public can share them only within the family or among close friends, where the profiles may serve as basic information for many purposes, such as choosing gifts for one another or inviting someone on a trip or to a meal.

The present invention has been described above using embodiments and examples, but the technical scope of the present invention is not limited to the embodiments described above. Various modifications or improvements can be made to the above embodiments.

The user interest analysis device according to the embodiment of FIG. 1 or FIG. 2 of the present invention can also be realized by a program on a computer. The storage medium that stores the program can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device). Examples of computer-readable media include semiconductor or solid-state storage devices, magnetic tape, removable floppy (registered trademark) disks, random access memory (RAM), read-only memory (ROM), rigid magnetic disks, and optical disks. Current examples of optical disks include compact disk read-only memory (CD-ROM), compact disk read/write (CD-R/W), and DVD.

FIG. 1 is a diagram showing the functional blocks of one embodiment of the user interest analysis device according to the present invention.
FIG. 2 is a diagram showing the functional blocks of another embodiment of the user interest analysis device according to the present invention.
FIG. 3 is a diagram showing the concept of propagating words and a directed graph between pages according to the present invention.
FIG. 4 is a diagram showing a specific example of file browsing as Example 1 of the present invention.
FIG. 5 is a diagram showing a specific example of calculating the influence iDF value in the example of FIG. 4.
FIG. 6 is a diagram showing, as Example 2 of the present invention, a specific example of how the words a user is interested in change on a bulletin board.
FIG. 7 is a diagram showing a specific example of calculating the influence iDF value in the example of FIG. 6.
FIG. 8 is a diagram showing a specific example of calculating the influence iDF value with synonyms taken into account in the example of FIG. 6.
FIG. 9 is a diagram showing, as Example 3 of the present invention, a profile server and profile tables that allow a user profile to be disclosed to other users.

Explanation of symbols

1 File browsing history
2 File text input means
3 Morpheme division means
4 Propagating word extraction means
5 Influence calculation means
5a, 5b Influence calculation means
6 iDF value calculation means
6a, 6b iDF value calculation means
7 Storage means
8 User interest word extraction means
9 Profile information output means
10 User interest analysis device (first embodiment)
11 Synonym dictionary
20 User interest analysis device (second embodiment)
21 CPU
22 Input unit
23 Output unit
24 Communication unit
25 Program memory
26 Working memory
27 User profile
36 Common words
37 Propagating words
41 New product news
42 Product information site
43 Price comparison site
44 Company Y product information site (product B specification page)
45 Company Y product information site (product C specification page)
46 Price comparison site
47 Shop Z site
48 Purchase page
61-69 Pages 1-9
91-93 User terminals
94 Internet
95 Profile server
97-99 Profile tables

Claims

A user interest analysis device that extracts words of interest of a user browsing a file,
Means for inputting, as text for each file, a plurality of words included in the file from the history of the file viewed by the user;
Means for dividing the text into predetermined units;
Means for extracting a propagating word referred to by the user among the plurality of files viewed by the user;
Means for storing one or more of the propagating words;
Means for obtaining, from the appearance frequencies of the propagating word across the plurality of files, a predetermined influence degree and a predetermined iDF value representing the degree to which the propagating word appears in a specific file;
Means for extracting a set of words of interest of the user as user profile information according to an influence iDF value which is a function of the influence and the iDF value;
Means for outputting the user profile information;
A user interest analysis device comprising:

The user interest analysis device according to claim 1, further comprising means for disclosing the user profile information to other users.

The user interest analysis device according to claim 1, further comprising:
a synonym dictionary for detecting words related to the propagating word; and
means for calculating the influence iDF value for the words related to the propagating word.

The user interest analysis device according to claim 1, wherein the influence iDF value is obtained by the following mathematical formula.

$$\text{influence iDF}(t) = \mathrm{EDT}(t) \times \frac{N}{\log(\mathrm{DF}(t) + 1)}$$

here,
t is the propagating word,
EDT(t) is the frequency at which the propagating word t appears in the group of files viewed by the user,
N is the number of files viewed by the user during a predetermined time,
DF(t) is the number of files that include the propagating word t.

A user interest analysis method for extracting words of interest of a user browsing a file,
Inputting a plurality of words included in the file from the history of the file viewed by the user as text for each file;
Dividing the text into predetermined units;
Extracting the propagating word referred to by the user among the plurality of files viewed by the user;
Storing one or more of the propagating words;
Obtaining, from the appearance frequencies of the propagating word across the plurality of files, a predetermined influence degree and a predetermined iDF value representing the degree to which the propagating word appears in a specific file;
Extracting a set of words of interest of the user as user profile information in descending order of the influence iDF value, which is the product of the influence and the iDF value;
Outputting the user profile information;
A user interest analysis method including:

A user interest analysis computer program that extracts words of interest of a user browsing a file, the program comprising the steps of:
Inputting a plurality of words included in the file from the history of the file viewed by the user as text for each file;
Dividing the text into morphemes, the smallest linguistic units having meaning;
Extracting the propagating word referred to by the user among the plurality of files viewed by the user;
Storing one or more of the propagating words;
Obtaining, from the appearance frequencies of the propagating word across all of the files, a predetermined influence degree and a predetermined iDF value representing the degree to which the propagating word appears in a specific file;
Extracting a set of words of interest of the user as user profile information according to an influence iDF value that is a function of the influence and the iDF value;
Outputting the user profile information;
A computer program that causes a computer to execute the above steps.

The user interest analysis device according to claim 1, wherein the file is a WEB page.

The user interest analysis method according to claim 5, wherein the file is a WEB page.

The computer program according to claim 6, wherein the file is a WEB page.