JP2014021862A

JP2014021862A - Associated node searching device, associated node searching method and program

Info

Publication number: JP2014021862A
Application number: JP2012161986A
Authority: JP
Inventors: Yasuhiro Fujiwara; 靖宏藤原
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: NTT Inc
Priority date: 2012-07-20
Filing date: 2012-07-20
Publication date: 2014-02-03
Anticipated expiration: 2032-07-20
Also published as: JP5727421B2

Abstract

【課題】PPRに基づく関連度の高い個数のノードもしくは閾値内のノードを高速かつ正確に検索する。
【解決手段】複数のノードからなるグラフデータと問い合わせ分布からPPR（Personalized PageRank）に基づき関連度を計算し、関連度が高い順にK 個のノードを検索する、又は所定の閾値よりも大きい関連度を持つノードを検索する関連ノード検索装置において、前記グラフデータから、PPRに基づく特定のノードの関連度を算出するために用いるユニタリ行列と上三角行列の逆行列を計算し、記憶手段に格納する事前計算手段と、前記問い合わせ分布、及び前記事前計算手段により求めたユニタリ行列と上三角行列の逆行列を用いて、関連度が高い順にK 個のノードを検索、又は所定の閾値よりも大きい関連度を持つノードを検索し、出力する検索手段とを備える。
【選択図】図１An object of the present invention is to quickly and accurately search a number of nodes having high relevance based on PPR or nodes within a threshold.
Relevance is calculated based on PPR (Personalized PageRank) from graph data consisting of a plurality of nodes and a query distribution, and K nodes are searched in descending order of relevance, or the relevance is greater than a predetermined threshold. In the related node search device for searching for nodes having a node, the inverse matrix of the unitary matrix and the upper triangular matrix used for calculating the degree of association of a specific node based on the PPR is calculated from the graph data and stored in the storage means Search for K nodes in descending order of relevance using the pre-calculation means, the query distribution, and the inverse matrix of the unitary matrix and upper triangular matrix obtained by the pre-calculation means, or greater than a predetermined threshold Searching means for searching for and outputting nodes having relevance.
[Selection] Figure 1

Description

本発明はPersonalized PageRank に基づき関連度を計算し検索を行う技術に関するものである。 The present invention relates to a technique for calculating a degree of association based on a personalized page rank and performing a search.

グラフはデータをノードとエッジで表現するデータ構造であり、様々な分野で用いられている。近年グラフを用いたパーソナライズドサーチに関心が高まっている。パーソナライズドサーチの例としては個人の購入履歴から個別にレコメンドを行うサービスなどがあげられる。 Graphs are data structures that represent data with nodes and edges, and are used in various fields. In recent years, interest in personalized search using graphs has increased. An example of a personalized search is a service that makes recommendations individually from an individual purchase history.

グラフ理論においてノードの関連度は重要な性質の一つであり、ノードの関連度として今まで様々な手法が提案されてきた。その中でもPersonalized PageRank(PPR)はノードの関連度として最も注目を集めているもののひとつである。PPRは今までグラフ理論でよく用いられてきたノード間の最短距離などと異なり、グラフの構造的な特徴に基づいて関連度が計算できるからである（非特許文献１、２）。 Node relevance is one of the important properties in graph theory, and various methods have been proposed for node relevance. Among them, Personalized PageRank (PPR) is one of the things that attracts the most attention as the degree of relevance of nodes. This is because PPR can calculate the degree of relevance based on the structural features of the graph, unlike the shortest distance between nodes that has been often used in graph theory so far (Non-Patent Documents 1 and 2).

PPRは概念的に以下のように説明できる。問い合わせ分布に基づいた各ノードの存在確率からランダムウォークを開始し、隣接するノードにエッジの重みに比例した確率でランダムに移動する。さらにノードに到達するたびに一定の確率で問い合わせ分布に基づいた確率でノードに戻る。この操作を再帰的に繰り返した結果として各ノードにおける定常状態確率が得られるが、PPRはこの得られた定常状態確率を関連度とする方法である。 PPR can be conceptually explained as follows. A random walk is started from the existence probability of each node based on the inquiry distribution, and the node randomly moves to an adjacent node with a probability proportional to the edge weight. Furthermore, every time the node is reached, the node returns to the node with a probability based on the query distribution with a certain probability. As a result of recursively repeating this operation, a steady state probability at each node is obtained. PPR is a method in which the obtained steady state probability is used as a degree of relevance.

Glen Jeh and Jennifer Widom, Scaling personalized web search, WWW, 2003Glen Jeh and Jennifer Widom, Scaling personalized web search, WWW, 2003 Hanghang Tong and Christos Faloutsos, Center-piece subgraphs: problem definition and fast solutions,KDD, 2006Hanghang Tong and Christos Faloutsos, Center-piece subgraphs: problem definition and fast solutions, KDD, 2006 D′aniel Fogaras and Bal′azs R′acz and K′aroly Csalog′any and Tam′as Sarl′os, Towards Scaling FullyD′ aniel Fogaras and Bal′azs R′acz and K′aroly Csalog′any and Tam′as Sarl′os, Towards Scaling Fully Hanghang Tong and Christos Faloutsos and Jia-Yu Pan, Fast Random Walk with Restart and Its Applications, ICDM, 2006Hanghang Tong and Christos Faloutsos and Jia-Yu Pan, Fast Random Walk with Restart and Its Applications, ICDM, 2006 Jimeng Sun and Huiming Qu and Deepayan Chakrabarti and Christos Faloutsos, Neighborhood Formation and Anomaly Detection in Bipartite Graphs, ICDM, 2005Jimeng Sun and Huiming Qu and Deepayan Chakrabarti and Christos Faloutsos, Neighborhood Formation and Anomaly Detection in Bipartite Graphs, ICDM, 2005 Thomas H. Cormen and Charles E. Leiserson and Ronald L. Rivest and Clifford Stein, Introduction to Algorithms, The MIT Press, 2001Thomas H. Cormen and Charles E. Leiserson and Ronald L. Rivest and Clifford Stein, Introduction to Algorithms, The MIT Press, 2001 David A. Harville, Matrix Algebra From a Statistician's Perspective. Springer, 2008David A. Harville, Matrix Algebra From a Statistician's Perspective. Springer, 2008

PPRは様々な分野のアプリケーションに応用されている関連度であるが、計算量が高いという問題がある。そのため今まで様々な高速化手法が提案されてきた（非特許文献３、４、５）が、それらは精度を犠牲にするものであった。しかし、アプリケーションに対する応用を考えた場合、精度が犠牲になるのは好ましくない。また実際のアプリケーションにおいては問い合わせ分布からほかのすべてのノードの関連度を計算するのではなく、関連度の高いノードの処理が行われている（非特許文献５）。 PPR is a degree of relevance applied to applications in various fields, but there is a problem that the calculation amount is high. Therefore, various speed-up methods have been proposed so far (Non-Patent Documents 3, 4, and 5), but they sacrifice accuracy. However, it is not preferable that accuracy is sacrificed when considering application to an application. In an actual application, the relevance of all other nodes is not calculated from the query distribution, but processing of nodes with high relevance is performed (Non-Patent Document 5).

本発明は上記の点に鑑みてなされたものであり、問い合わせ分布と検索個数もしくは閾値が与えられたときに、問い合わせ分布に対して、PPRに基づく関連度の高い個数のノードもしくは閾値内のノードを高速かつ正確に検索することを可能とした技術を提供することを目的とする。 The present invention has been made in view of the above points, and when a query distribution and the number of searches or a threshold value are given, the number of nodes having a high degree of relevance based on PPR or the nodes within the threshold value for the query distribution An object of the present invention is to provide a technique that enables a high-speed and accurate search for a computer.

上記の課題を解決するために、本発明は、複数のノードからなるグラフデータと問い合わせ分布からPPR（Personalized PageRank）に基づき関連度を計算し、関連度が高い順にK 個のノードを検索する、又は所定の閾値よりも大きい関連度を持つノードを検索する関連ノード検索装置であって、
前記グラフデータから、PPRに基づく特定のノードの関連度を算出するために用いるユニタリ行列と上三角行列の逆行列を計算し、記憶手段に格納する事前計算手段と、
前記問い合わせ分布、及び前記事前計算手段により求めたユニタリ行列と上三角行列の逆行列を用いて、関連度が高い順にK 個のノードを検索、又は所定の閾値よりも大きい関連度を持つノードを検索し、出力する検索手段とを備えることを特徴とする関連ノード検索装置として構成される。 In order to solve the above problems, the present invention calculates the relevance based on PPR (Personalized PageRank) from graph data consisting of a plurality of nodes and a query distribution, and searches for K nodes in descending order of relevance. Or a related node search device for searching for a node having a relevance greater than a predetermined threshold,
From the graph data, a unitary matrix used to calculate the degree of association of a specific node based on PPR and an inverse matrix of an upper triangular matrix are calculated, and a pre-calculating unit that stores the matrix in a storage unit;
Using the inquiry distribution and the inverse matrix of the unitary matrix and upper triangular matrix obtained by the pre-calculation means, search for K nodes in descending order of relevance, or nodes having relevance greater than a predetermined threshold It is comprised as a related node search apparatus characterized by including the search means which searches for and outputs.

前記事前計算手段は、前記ユニタリ行列と前記上三角行列の逆行列とがそれぞれ疎になるように、前記グラフデータにおけるノードを並び替え、ノードを並び替えたグラフデータを記憶手段に格納するノード並び替え手段と、前記ノード並び替え手段によりノードが並び替えられたグラフデータを用いて、前記ユニタリ行列と前記上三角行列の逆行列を計算する逆行列計算手段と、を備え、
前記検索手段は、グラフの隣接行列及び前記問い合わせ分布に基づいて全てのノードについての関連度の下限値を推定し、当該下限値に基づいて関連度の上限値を推定する関連度推定手段と、特定のノードについて、前記ユニタリ行列と前記上三角行列の逆行列を用いて関連度を計算する関連度計算手段と、を備えるように構成することができる。 The pre-calculating means rearranges the nodes in the graph data so that the unitary matrix and the inverse matrix of the upper triangular matrix are sparse, and stores the graph data in which the nodes are rearranged in the storage means Rearrangement means; and inverse matrix calculation means for calculating an inverse matrix of the unitary matrix and the upper triangular matrix using the graph data in which the nodes are rearranged by the node rearrangement means,
The search means estimates a lower limit value of the relevance level for all nodes based on an adjacency matrix of the graph and the query distribution, and a relevance level estimation means for estimating an upper limit value of the relevance level based on the lower limit value; For a specific node, it may be configured to include relevance calculation means for calculating relevance using the unitary matrix and the inverse matrix of the upper triangular matrix.

また、前記検索手段において、前記関連度推定手段により、下限値の降順に、関連度の上限値が前記K番目の関連度未満又は前記所定の閾値未満となるまでノードを探索し、関連度の上限値が前記K番目の関連度未満又は前記所定の閾値未満にならない間は、前記関連度計算手段により正確な関連度を計算することにより、関連度が高い順にK 個のノードを検索、又は前記所定の閾値よりも大きい関連度を持つノードを検索するようにしてもよい。 Further, in the search means, the relevance estimation means searches the nodes until the upper limit of the relevance is less than the Kth relevance or less than the predetermined threshold in descending order of the lower limit. While the upper limit value is less than the Kth relevance level or less than the predetermined threshold, the relevance calculation unit calculates an accurate relevance level to search K nodes in descending order of relevance level, or A node having a degree of relevance greater than the predetermined threshold may be searched.

また、本発明は、上記関連ノード検索装置が実行する関連ノード検索方法として構成することもできる。更に、本発明は、コンピュータを、上記関連ノード検索装置における各手段として機能させるためのプログラムとして構成することもできる。 The present invention can also be configured as a related node search method executed by the related node search apparatus. Furthermore, the present invention can also be configured as a program for causing a computer to function as each means in the related node search apparatus.

本発明によれば、特定のノードの関連度を疎行列を用いて計算するとともに、不必要な関連度の計算を探索において省略することとしたので、従来手法より大幅に高速に関連ノードを検索できる。また、本発明による検索結果は正確であり、更に、本発明に対してユーザはパラメータ調整する必要がないという効果も奏する。 According to the present invention, the relevance of a specific node is calculated using a sparse matrix, and unnecessary relevance calculation is omitted in the search. it can. In addition, the search result according to the present invention is accurate, and further, there is an effect that the user does not need to adjust parameters for the present invention.

本発明の実施の形態に係る関連ノード検索装置１の機能構成図である。It is a functional block diagram of the related node search apparatus 1 which concerns on embodiment of this invention. 本発明の実施の形態に係る事前計算部１０の機能構成図である。It is a functional block diagram of the prior calculation part 10 which concerns on embodiment of this invention. 本発明の実施の形態に係る検索部２０の機能構成図である。It is a functional block diagram of the search part 20 which concerns on embodiment of this invention. ノードの置換のアルゴリズムであるアルゴリズム１を示す図である。It is a figure which shows the algorithm 1 which is an algorithm of the replacement | exchange of a node. ノードの置換のフローチャートである。It is a flowchart of node replacement. 上位K個検索のアルゴリズムであるアルゴリズム２を示す図である。It is a figure which shows the algorithm 2 which is an algorithm of a top K piece search. 上位K個検索のフローチャートである。It is a flowchart of a top K piece search. 範囲検索のアルゴリズムであるアルゴリズム３を示す図である。It is a figure which shows the algorithm 3 which is an algorithm of a range search. 範囲検索のフローチャートである。It is a flowchart of a range search.

以下、図面を参照して本発明の実施の形態を説明する。なお、以下で説明する実施の形態は一例に過ぎず、本発明が適用される実施の形態は、以下の実施の形態に限られるわけではない。 Embodiments of the present invention will be described below with reference to the drawings. The embodiment described below is only an example, and the embodiment to which the present invention is applied is not limited to the following embodiment.

（装置の基本構成）
本実施の形態に係る関連ノード検索装置１の構成、及び機能の概要を図１〜図３を参照して説明する。 (Basic configuration of the device)
An outline of the configuration and functions of the related node search device 1 according to the present embodiment will be described with reference to FIGS.

図１は、本実施の形態に係る関連ノード検索装置１の機能構成図である。図１に示すように、本実施の形態に係る関連ノード検索装置１は、事前計算部１０と検索部２０を有する。事前計算部１０はグラフデータを外部入力とし、ユニタリ行列と上三角行列の逆行列を出力する。検索部２０には、事前計算部１０からユニタリ行列と上三角行列の逆行列が入力され、外部から問い合わせ分布と検索個数等が入力される。検索部２０は、これらの入力に基づいて、検索結果である関連ノードを出力する。 FIG. 1 is a functional configuration diagram of a related node search device 1 according to the present embodiment. As shown in FIG. 1, the related node search device 1 according to the present embodiment includes a pre-calculation unit 10 and a search unit 20. The pre-calculation unit 10 receives the graph data as an external input and outputs an inverse matrix of the unitary matrix and the upper triangular matrix. The search unit 20 receives the unitary matrix and the inverse matrix of the upper triangular matrix from the pre-calculation unit 10 and receives the query distribution, the number of searches, and the like from the outside. Based on these inputs, the search unit 20 outputs related nodes that are search results.

図２に、事前計算部１０の機能構成を示す。図２に示すとおり、事前計算部１０は、ノード並び替え部１１と逆行列計算部１２を有する。ノード並び替え部１１は、グラフを入力とし、グラフのノードを並び替えて出力する。出力された並び替え後のグラフのデータは、メモリ等の記憶手段に格納される。逆行列計算部１２はノードが並び替えられたグラフを入力としユニタリ行列と上三角行列の逆行列を出力する。出力されたユニタリ行列と上三角行列の逆行列のデータは、後述するデータ保存部２３（記憶手段）に格納される。 FIG. 2 shows a functional configuration of the pre-calculation unit 10. As illustrated in FIG. 2, the prior calculation unit 10 includes a node rearrangement unit 11 and an inverse matrix calculation unit 12. The node rearrangement unit 11 receives a graph, rearranges the nodes of the graph, and outputs the result. The output rearranged graph data is stored in a storage means such as a memory. The inverse matrix calculator 12 receives the graph in which the nodes are rearranged and outputs an inverse matrix of the unitary matrix and the upper triangular matrix. The output data of the inverse matrix of the unitary matrix and the upper triangular matrix is stored in the data storage unit 23 (storage means) described later.

図３に、検索部２０の機能構成を示す。図３に示すとおり、検索部２０は、関連度推定部２１、関連度計算部２２、データ保存部２３、検索処理制御部２４を有する。 FIG. 3 shows a functional configuration of the search unit 20. As illustrated in FIG. 3, the search unit 20 includes a relevance level estimation unit 21, a relevance level calculation unit 22, a data storage unit 23, and a search processing control unit 24.

関連度推定部２１には、問い合わせ分布と検索個数等が外部から入力され、関連度計算部２２から計算された関連度が入力され、更に、データ保存部２３から検索対象のノードが入力される。関連度推定部２１は、これらの入力に基づいて、ノードの関連度を推定し、推定した関連度をデータ保存部２３に格納する。 The relevance estimation unit 21 receives the query distribution, the number of searches, and the like from the outside, the relevance calculated by the relevance calculation unit 22, and the search target node from the data storage unit 23. . Based on these inputs, the relevance level estimation unit 21 estimates the relevance level of the node, and stores the estimated relevance level in the data storage unit 23.

関連度計算部２２には、事前計算部１０からユニタリ行列と上三角行列の逆行列が入力され、データ保存部２３から検索対象のノードが入力される。関連度計算部２２は、これらの入力に基づいて、ノードの関連度を計算し、計算した関連度をデータ保存部２３に出力する。 The relevance calculation unit 22 receives an inverse matrix of a unitary matrix and an upper triangular matrix from the pre-calculation unit 10, and receives a search target node from the data storage unit 23. The relevance calculation unit 22 calculates the relevance level of the nodes based on these inputs, and outputs the calculated relevance level to the data storage unit 23.

検索処理制御部２４は、後述するアルゴリズムに従って処理が行われるように検索対象のノードを決定するなどの制御を実施するとともに、検索結果を外部に出力する。 The search processing control unit 24 performs control such as determining a search target node so that processing is performed according to an algorithm described later, and outputs the search result to the outside.

本実施の形態における関連ノード検索装置１は、例えば、ＣＰＵ、メモリやハードディスク等の記憶手段を備えるコンピュータに、本実施の形態で説明する機能部の処理に対応するプログラムを実行させることにより実現可能である。当該プログラムは、可搬メモリ等の記憶媒体に格納して配布し、上記コンピュータにインストールして用いてもよいし、ネットワーク上のサーバからダウンロードして上記コンピュータにインストールしてもよい。本実施の形態で説明する各計算処理は、メモリ等の記憶手段に格納されたデータをCPUが読み出して処理し、処理された結果を記憶手段に格納する処理を繰り返しながら実行されるものである。 The related node search device 1 according to the present embodiment can be realized by causing a computer including a storage unit such as a CPU, a memory, and a hard disk to execute a program corresponding to the processing of the functional units described in the present embodiment. It is. The program may be stored in a storage medium such as a portable memory and distributed, installed on the computer, or downloaded from a server on a network and installed on the computer. Each calculation process described in the present embodiment is executed while the CPU reads and processes data stored in a storage unit such as a memory and stores the processed result in the storage unit. .

（関連ノード検索装置１の処理動作）
以下、本発明の実施の形態における関連ノード検索装置１が実行する処理内容について、その原理や処理手順などを詳細に説明する。実施の形態の説明にあたり、まずは本明細書で用いる記号を定義し、必要となる背景知識を説明する。 (Processing operation of related node search device 1)
Hereinafter, the principle and processing procedure of the processing contents executed by the related node search device 1 according to the embodiment of the present invention will be described in detail. In describing the embodiment, first, symbols used in this specification are defined, and necessary background knowledge is described.

＜PPRについて＞
PPRはグラフのノードの関連度を計算するひとつの方法である。PPRでは問い合わせ分布から各ノードの存在確率を決定し、決定されたノードからランダムウォークを行う。そしてランダムウォークでノードに到達するたびに一定の確率cで問い合わせ分布に基づき各ノードに戻る。ここで、グラフのノード数をnとする。sをn×1行列とし、s_u要素をノードuにランダムウォークする確率を表すとする。またdを問い合わせ分布で決定されるn×1行列とする。またAを列が正規化されたグラフの隣接行列とする（すなわちA_u,v要素はノードvからノードuへランダムウォークする確率を表す）。定常状態における各ノードにおける存在確率は以下の式を再帰的に収束するまで繰り返すことで計算することができる。
s = (1-c)As + cd （１）
PPR は定常状態における確率を関連度とする方法である。すなわちs行列のs_u要素はノードuの分布dに対する関連度となる。 <About PPR>
PPR is a way to calculate the relevance of graph nodes. In PPR, the existence probability of each node is determined from the query distribution, and a random walk is performed from the determined node. Each time a node is reached in a random walk, it returns to each node based on the query distribution with a certain probability c. Here, n is the number of nodes in the graph. Let s be an n × 1 matrix and represent the probability of a random walk of s _u elements to node u. Also, let d be an n × 1 matrix determined by the query distribution. Also, let A be an adjacency matrix of a graph with normalized columns (that is, the A _{u, v} element represents the probability of a random walk from node v to node u). The existence probability at each node in the steady state can be calculated by repeating the following expression until it converges recursively.
s = (1-c) As + cd (1)
PPR is a method that uses the probability in steady state as the relevance. That is, the s _u element of the s matrix is the degree of relevance to the distribution d of the node u.

定義から繰り返し回数をtとしたときPPRの計算コストはO(mt)となる。そのためグラフが大規模である場合PPRの値を計算するのは非常に時間がかかる。 From the definition, when the number of repetitions is t, the calculation cost of PPR is O (mt). Therefore, it takes a very long time to calculate the PPR value when the graph is large.

＜本実施の形態に係る技術の原理＞
本実施の形態に係る技術は、大きく以下の２つの手法（疎行列計算と関連度の推定）で構成される。 <Principle of the technology according to the present embodiment>
The technique according to the present embodiment is roughly configured by the following two methods (sparse matrix calculation and relevance estimation).

（１）疎行列計算
上記のように、問い合わせ分布に対する関連度は繰り返し計算によって定常状態における確率を計算することで求めることができる。この方法はグラフの全てのノードの関連度を再帰的に計算するため計算コストが高い。そこで本実施の形態に係る手法では全てのノードの関連度を計算せずに、選んだノードの関連度のみを計算し高速な検索を可能にする。 (1) Sparse Matrix Calculation As described above, the degree of relevance to the query distribution can be obtained by calculating the probability in the steady state by iterative calculation. Since this method recursively calculates the relevance of all nodes in the graph, the calculation cost is high. Therefore, in the method according to the present embodiment, the relevance of all the nodes is not calculated, but only the relevance of the selected node is calculated to enable high-speed search.

選んだノードの関連度は式（１）から直接求められる逆行列を用いることで計算できる。そのためこの逆行列を事前に計算しておけば関連度を高速に計算できる。しかし逆行列を保持するには一般的にO(n²) のメモリが必要になるのが問題になる。 The degree of relevance of the selected node can be calculated by using an inverse matrix obtained directly from the equation (1). Therefore, if this inverse matrix is calculated in advance, the relevance can be calculated at high speed. However, it is generally a problem that O (n ² ) memory is required to hold the inverse matrix.

そこで本実施の形態に係る手法では、この逆行列を疎行列として保持するために、検索の事前処理でまずノードを並び替えてからQR分解を計算し、得られたユニタリ行列と上三角行列の逆行列を計算する。このユニタリ行列と上三角行列の逆行列は疎行列なため、隣接リスト表現（非特許文献６）を用いることにより選んだノードの関連度を疎行列から計算できる。 Therefore, in the method according to the present embodiment, in order to hold this inverse matrix as a sparse matrix, the QR decomposition is calculated by first rearranging the nodes in the search preprocessing, and the obtained unitary matrix and upper triangular matrix are calculated. Calculate the inverse matrix. Since the inverse matrix of the unitary matrix and the upper triangular matrix is a sparse matrix, the relevance of the selected node can be calculated from the sparse matrix by using the adjacent list representation (Non-patent Document 6).

（２）関連度の推定
上記の疎行列計算の手法により、選んだノードの関連度を高速に計算することができる。しかし、上位K個のノードを検索するためには全てのノードの関連度を計算しなければならない。そこで、本実施の形態に係る技術では、高速に検索を行うために関連度の推定を行う。具体的には各ノードに対して関連度の推定を行い、関連度が高いことが期待されるノードに対してのみ疎行列を用いて正確な関連度を計算する。ノードの推定値はO(1)で計算することができるため、少ない計算コストで関連度の計算を省略することができる。後述するように、本実施の形態で用いる関連度の推定値は正確な関連度の上限値になることが保証されているため、推定を用いても検索結果が正確になるという利点がある。 (2) Estimation of relevance The relevance of the selected node can be calculated at high speed by the above sparse matrix calculation method. However, in order to search for the top K nodes, the relevance of all nodes must be calculated. Therefore, in the technique according to the present embodiment, the relevance is estimated in order to search at high speed. Specifically, the degree of association is estimated for each node, and an accurate degree of association is calculated using a sparse matrix only for nodes that are expected to have a high degree of association. Since the estimated value of the node can be calculated by O (1), the calculation of the relevance can be omitted with a small calculation cost. As will be described later, the estimated value of the relevance used in the present embodiment is guaranteed to be an accurate upper limit of the relevance, so that there is an advantage that the search result is accurate even if the estimation is used.

＜疎行列による関連度計算、疎行列計算について＞
以下、疎行列を用いて選択したノードの関連度を計算する処理について詳細に説明する。まず選択したノードの関連度は逆行列から計算できることを説明し、その逆行列を疎行列から計算する手法について説明する。 <Relationship calculation using sparse matrix, sparse matrix calculation>
Hereinafter, processing for calculating the degree of association of the selected node using the sparse matrix will be described in detail. First, it will be explained that the relevance of the selected node can be calculated from an inverse matrix, and a method for calculating the inverse matrix from a sparse matrix will be described.

行列Pをノードの置換行列とすると、グラフの隣接行列AはA′= PAP^Tという計算を経て行列A′に変換される。ここで行列P^Tは行列Pの転置行列である。また、ここでn×nの大きさの置換行列Pは直交行列であり、各行と列においてただ一つの値が1である要素と、それ以外は値が0である要素を持つ。またP_ij= 1はj番目の行がi番目の行に置換されることを表す。 When the matrix P and the node of the permutation matrix, the adjacency matrix of the graph A is converted into A '= through calculations that PAP ^T matrix A'. Here, the matrix P ^T is a transposed matrix of the matrix P. Here, the permutation matrix P having a size of n × n is an orthogonal matrix, and has an element having a single value of 1 in each row and column and an element having a value of 0 in the other cases. P _ij = 1 indicates that the j-th row is replaced with the i-th row.

ここで、I = P^TIP、P^-1 = P^T、A = P^TAPであるため（P^-1 はPの逆行列）、式（１）は以下のように書き換えられる。
s=c{I- (1-c)A}^-1d=c{P^TIP- (1-c)P^TA′P}^-1d =cP^T{I- (1-c)A′}^-1Pd （２）
特定のノードの関連度を計算するため、本実施の形態では行列I - (1 - c)A′ のQR分解を計算する。すなわちQR = I - (1 - c)A′となるが、ここで行列Qは直交行列（ユニタリ行列）であり、行列Rは上三角行列である。形式的には特定のノードxの関連度は以下のように計算する。 Here, since I = P ^T IP, P ⁻¹ = P ^T , and A = P ^T AP (P ⁻¹ is an inverse matrix of P), Equation (1) can be rewritten as follows.
s = c {I- (1-c) A} ^-1 d = c {P ^T IP- (1-c) P ^T A′P} ^-1 d = cP ^T {I- (1-c) A ′ } ^-1 Pd (2)
In order to calculate the relevance of a specific node, the QR decomposition of the matrix I − (1 −c) A ′ is calculated in this embodiment. That is, QR = I− (1−c) A ′, where the matrix Q is an orthogonal matrix (unitary matrix), and the matrix R is an upper triangular matrix. Formally, the relevance of a specific node x is calculated as follows.

定義1：n × nの行列F = cP^TR^-1とし、n × 1のベクトルg = Q^TPd とし、1 × nのベクトルf_iを行列Fのi番目の行ベクトルとすると、ノードxの分布dに対する関連度s_xは以下のように計算できる。
s_x = f_x・g （３）
式（３）からノードxの関連度はベクトルf_xとgを用いれば繰り返し計算なしに求めることができる。またベクトルdはユーザから与えられるため、関連度を計算するためまずベクトルgを計算する必要がある。 Definition 1: Given an n × n matrix F = cP ^T R ⁻¹ , an n × 1 vector g = Q ^T Pd, and a 1 × n vector f _i as the i-th row vector of the matrix F, the node x The degree of relevance s _x for the distribution d of can be calculated as follows.
s _x = f _x · g (3)
Relevance node x from equation (3) can be determined without repeatedly calculating the use of the vector f _x and g. Since the vector d is given by the user, it is necessary to calculate the vector g first in order to calculate the relevance.

定義1に対して以下の定理が成り立つ。 The following theorem holds for definition 1.

定理1：|f_x|と|Q|をそれぞれf_x and Qにおける非零要素の数とすると、特定のノードの関連度を計算するために必要な計算量はO(|f_x| + |Q|) となる。 Theorem 1: If | f _x | and | Q | are the numbers of nonzero elements in f _x and Q, respectively, the amount of computation required to calculate the relevance of a particular node is O (| f _x | + | Q |)

定理１の証明：本実施の形態では関連度を計算するため、まずベクトルgを分布dから計算する。ベクトルPdは分布dのノードを置換すれば得られるため、ベクトルg = Q^TPd を計算するために必要な計算量はO(|Q|) となる。ベクトルf_xとgの内積を計算するための計算コストはO(|f_x| + |Q|) であるため、関連度を計算するために必要な計算量はO(|f_x| + |Q|) となる。 Proof of Theorem 1: In this embodiment, in order to calculate the relevance, first, the vector g is calculated from the distribution d. Since the vector Pd can be obtained by replacing the node of the distribution d, the amount of calculation required to calculate the vector g = Q ^T Pd is O (| Q |). Since the calculation cost for calculating the inner product of the vectors f _x and g is O (| f _x | + | Q |), the amount of calculation required to calculate the relevance is O (| f _x | + | Q |)

ここでF = cP^TR^-1でありg = Q^TPdであるため、定理1から関連度を高速に計算するためには行列R^-1とQの非零要素を減らす必要があることがわかる。そのため本実施の形態では非零要素を減らすための後述する手法を用いる。 Since F = cP ^T R ^-1 and g = Q ^T Pd, it is necessary to reduce the non-zero elements of the matrices R ^-1 and Q in order to calculate the relevance from Theorem 1 at high speed. Recognize. Therefore, in this embodiment, a method described later for reducing non-zero elements is used.

当該手法を詳しく説明する前に行列R^-1とQの計算方法について説明する。ベクトルq_i を行列Qのi 番目の列ベクトル、w_iを行列W = I - (1 - c)A′のi番目の列ベクトルとすると、行列QとRはグラム・シュミットの正規直交化法（非特許文献７参照）から以下のように計算できる。 Before describing the method in detail, a method for calculating the matrices R ⁻¹ and Q will be described. If the vector q _i is the i-th column vector of the matrix Q and w _i is the i-th column vector of the matrix W = I-(1-c) A ′, then the matrices Q and R are Gram-Schmidt orthonormalization methods. (See Non-Patent Document 7).

ここで||q′_i||はq′_iのノルムである。行列R^-1の要素は後退代入（非特許文献７参照）を用いることで以下のように計算できる。 Where || q ′ _i || is the norm of q ′ _i . The elements of the matrix R ⁻¹ can be calculated as follows by using backward substitution (see Non-Patent Document 7).

式（４）と（５）から行列QとRの列ベクトルは左から右の列へそれぞれ直交な列ベクトルと行列Wとの内積を計算することで求められることがわかる。式（６）から行列R^-1 の列ベクトルは右から左の列へ、また各列ベクトルの要素は下から上の要素の順番で求められることがわかる。また式（４）と（５）と（６）から行列QとRは行列Wから計算できることがわかる。 It can be seen from equations (4) and (5) that the column vectors of the matrices Q and R can be obtained by calculating the inner product of the orthogonal column vector and the matrix W from the left to the right column. It can be seen from equation (6) that the column vector of the matrix R ⁻¹ is obtained from the right to the left column, and the elements of each column vector are obtained in the order of the bottom to top elements. It can also be seen from equations (4), (5), and (6) that the matrices Q and R can be calculated from the matrix W.

行列Qの非零要素を減らすために、本実施の形態では以下の知見を用いる。「もし列ベクトルw_iが行列Ｗにおいて全ての左の要素がすべて零である一つの非零要素をもつならば、その列ベクトルw_iは行列Qのすべての左の列ベクトルに対して直交となる。」。これは列ベクトルw_iが線形独立になるためである。よって行列Qは疎なデータ構造となる。 In order to reduce non-zero elements of the matrix Q, the following knowledge is used in the present embodiment. "If column vector w _i has one non-zero element in matrix W where all left elements are all zero, then column vector w _i is orthogonal to all left column vectors of matrix Q. Become.". This is because the column vector w _i is linearly independent. Thus, the matrix Q has a sparse data structure.

また行列R^-1 を疎にするため以下の２つの知見を用いる。「もし行列R の右と下の要素が疎であれば、行列R^-1の右と下の要素も疎になる。」。「もし行列Wの右と下の要素が疎であれば、行列R の右と下の要素も疎になる。」。 The following two findings are used to make the matrix R ^-1 sparse. “If the right and bottom elements of matrix R are sparse, the right and bottom elements of matrix R ⁻¹ are also sparse.” “If the right and bottom elements of matrix W are sparse, the right and bottom elements of matrix R are also sparse.”

図４に、疎な行列を得るためのノードの置換方法についてのアルゴリズム１を示す。また図５に、このアルゴリズム１に対応して、事前計算部１０のノード並び替え部１１が実行する処理のフローチャートを示す。 FIG. 4 shows an algorithm 1 for a node replacement method for obtaining a sparse matrix. FIG. 5 shows a flowchart of processing executed by the node rearrangement unit 11 of the pre-calculation unit 10 corresponding to the algorithm 1.

図４に示すアルゴリズムは行列QとR^-1に対する知見に基づいている。このアルゴリズムにおいてVはグラフにおけるノードの集合とし、集合"P"は置換されたノードの集合とする。deg(u) はノードu に入るエッジの数とし、e(u)はノードuから集合"P"へ出ていないエッジの数とする。まずノード並び替え部１１は、置換行列Pを零行列に初期化し、集合"P"を空集合とする（１〜２行目、図５のステップ１１）。 The algorithm shown in FIG. 4 is based on knowledge of the matrices Q and R ⁻¹ . In this algorithm, V is a set of nodes in the graph, and set “P” is a set of replaced nodes. deg (u) is the number of edges that enter node u, and e (u) is the number of edges that have not exited node u from the set “P”. First, the node rearrangement unit 11 initializes the permutation matrix P to a zero matrix, and sets the set “P” as an empty set (first and second rows, step 11 in FIG. 5).

置換されていないノードの集合V＼Pからエッジの数がmin{e(u)|u ∈ V＼P}になるノード集合Uを計算する（４行目、ステップ１２）。この処理は行列Qに対する知見に基づいている。もし行列A′の右と下の要素が疎であれば行列R^-1も疎になるため、行列A′の左と上の要素を密にするためにエッジの数が最大になるノードを選択する（５行目、ステップ１３）。そして選択されたノードから行列P の要素を決定し、置換の順番を決定する（６〜７行目、ステップ１４、１５）。 A node set U in which the number of edges is min {e (u) | uεV \ P} is calculated from the set V \ P of nodes that are not replaced (line 4, step 12). This process is based on knowledge of the matrix Q. If the right and bottom elements of the matrix A 'are sparse, the matrix R ^-1 will also be sparse, so select the node with the largest number of edges to make the left and top elements of the matrix A' dense (5th line, step 13). Then, the elements of the matrix P are determined from the selected node, and the order of replacement is determined (lines 6-7, steps 14 and 15).

＜関連度の推定について＞
次に、関連度の推定について説明する。これは、上位K個のノードを検索するために、検索の途中において正確な関連度を計算していないノードの関連度の上限値を高速に推定するものである。関連度の上限値を計算するために、本実施の形態ではまず関連度の下限値を推定する。関連度の下限値は与えられた分布の値が零にならないシードノードからの最小ホップ数を用いて計算する。シードノードからの最小ホップ数は幅優先探索を用いて計算する。h_uをノードuに対するシードノードからの最小ホップ数とし、H(i)をシードノードからの最小ホップ数がiとなるノードの集合とする。すなわちH(i)はi番目のレイヤを構成し、H(0)はシードノードの集合となる。本実施の形態では以下のように関連度の下限値を計算する。 <About estimation of relevance>
Next, the estimation of relevance will be described. In this method, in order to search for the top K nodes, the upper limit value of the relevance level of a node for which an accurate relevance level is not calculated during the search is estimated at high speed. In order to calculate the upper limit value of the relevance level, the lower limit value of the relevance level is first estimated in the present embodiment. The lower limit value of the relevance is calculated using the minimum number of hops from the seed node where the value of the given distribution does not become zero. The minimum number of hops from the seed node is calculated using a breadth-first search. Let h _{u be} the minimum number of hops from the seed node for node u, and H (i) be the set of nodes whose minimum hop number from the seed node is i. That is, H (i) constitutes the i-th layer, and H (0) is a set of seed nodes. In the present embodiment, the lower limit value of the relevance is calculated as follows.

定義2：ノードuの分布d に対する関連度の下限値s _uは以下のように定義される。 Definition 2: The lower limit value s _{u of the} degree of association with the distribution d of the node u is defined as follows.

幅優先探索を用いることにより、すべてのノードに対して下限値を計算するには、O(n+m) の計算コストが必要になる。この下限値の性質を示すために以下の補助定理を示す。

Using the breadth-first search requires O (n + m) calculation cost to calculate the lower limit for all nodes. In order to show the property of this lower limit, the following lemma is shown.

補助定理1 ：H(i)に含まれる全てのノードに対してs _u ≦ s_uが成り立つ。 Lemma 1: s _u ≤ s _u holds for all nodes included in H (i).

補助定理1の証明：証明には数学的帰納法を用いる。まずH(0)に含まれるノードに対して成り立つことを示す。PPRにおいてランダムウォークがシードノードに至るのは、（１）ランダムウォークが確率cでシードノードにジャンプするか、（２）ランダムウォークがその過程でシードノードに至るかの２つの場合である。（１）の場合、ランダムウォークはあるシードノードuに確率cd_uでジャンプする。そのためシードノードuの定常状態における確率はcd_uより小さくなることはない。よってH(0)に含まれるノードに対して成り立つ。 Proof of Lemma 1: Use mathematical induction for proof. First, we show that this holds for the nodes included in H (0). In PPR, the random walk reaches the seed node in two cases: (1) the random walk jumps to the seed node with probability c, or (2) the random walk reaches the seed node in the process. In the case of (1), the random walk jumps to a certain seed node u with a probability cd _u . Therefore, the probability of the seed node u in the steady state does not become smaller than cd _u . Therefore, this holds for nodes included in H (0).

次に、H(i-1)に含まれるノードに対してs _v ≦ s_v が成り立つ場合、H(i) に含まれるノードに対してs _v ≦ s_vが成り立つことを示す。H(i - 1) に含まれるノードはグラフ全体のノードの部分集合でありc、du ≧ 0であるため、式（１）において以下が成り立つ。 Next, when s _v ≦ s _v holds for the nodes included in H (i−1), it is shown that s _v ≦ s _v holds for the nodes included in H (i). The nodes included in H (i-1) are a subset of the nodes of the entire graph, and c and du ≥ 0, so the following holds in equation (1).

よって補助定理1 が成り立つ。 Therefore, Lemma 1 holds.

上位K個のノードを検索するために本実施の形態では一つ一つノードをたどり上限値と正確な関連度を計算する。関連度の上限値は関連度の下限値を用いて計算するが、u_iをi番目にたどるノードとしたときに関連度の上限値は以下のように計算する。 In this embodiment, in order to search for the top K nodes, each node is traced one by one to calculate the upper limit value and the exact degree of association. The upper limit value of the relevance level is calculated using the lower limit value of the relevance level, and the upper limit value of the relevance level is calculated as follows when u _i is the i-th tracing node.

定義3：ノードu_iの Definition 3: node u _i

は以下のように定義される。 Is defined as follows:

上限値を計算する計算コストはもしi = 1であればO(n) になる。またもしそうでなければ上限値はすでに計算した The computational cost of calculating the upper limit is O (n) if i = 1. If not, the upper limit has already been calculated

を用いて逐次的に計算できるため、その計算コストはO(1)になる。 Since the calculation can be performed sequentially using, the calculation cost is O (1).

本実施の形態ではノードは関連度の下限値の降順でたどられる。これには２つの理由がある。はじめの理由は下限値が大きいほど正確な下限値も大きくなることが期待されるからである。そのため効率的に上位K個のノードを検索できる。２つめの理由は以下に示す補助定理により検索の途中で処理を停止し、より高速に検索を行うためである。 In this embodiment, the nodes are traced in descending order of the lower limit of the relevance level. There are two reasons for this. The first reason is that it is expected that the larger the lower limit value, the larger the accurate lower limit value. Therefore, the top K nodes can be searched efficiently. The second reason is that the processing is stopped in the middle of the search by the following theorem and the search is performed at a higher speed.

補助定理2：ノードを関連度の下限値の降順にたどった場合、 Lemma 2: When following the nodes in descending order of the lower limit of relevance,

が成り立つ。 Holds.

証明：まず Proof: First

になることを示す。式（８）と補助定理1 から以下の式が成り立つ。 Show that. From Equation (8) and Lemma 1, the following equation holds.

つぎに、 Next,

が成り立つことを示す。式（８）から Indicates that From equation (8)

となる。よって成り立つ。 It becomes. Therefore it holds.

図６に、上位K個のノードを検索するアルゴリズム２を示す。このアルゴリズムに従って検索部２０において実行される処理のフローチャートを図７に示す。アルゴリズム2 においてθは解候補のノード集合におけるK番目の関連度であり、V_aは解候補のノード集合であり、V_eは探索済みのノード集合とする。この処理において、検索部２０における関連度推定部２１が、関連度の下限値及び上限値の計算を行い、関連度計算部２２が正確な関連度の計算を行う。また、データ保存部２３には、計算結果のデータ、計算途中のデータ、計算に必要な行列のデータや、その他の計算に必要なデータが格納される。検索処理制御部２４は、アルゴリズム２の通りに処理が行われるように処理制御を行い、最終結果（K個の関連ノード）を出力する制御を行う。 FIG. 6 shows algorithm 2 for searching for the top K nodes. FIG. 7 shows a flowchart of processing executed in the search unit 20 in accordance with this algorithm. Is θ in the algorithm 2 is K-th relevance in the node set of candidate solutions, V _a is the node set of candidate solutions, V _e is the searched-node set. In this process, the relevance estimation unit 21 in the search unit 20 calculates the lower limit and upper limit of the relevance, and the relevance calculation unit 22 calculates the correct relevance. The data storage unit 23 stores calculation result data, mid-calculation data, matrix data necessary for the calculation, and other data necessary for the calculation. The search processing control unit 24 performs processing control so that the processing is performed according to the algorithm 2, and performs control to output a final result (K related nodes).

アルゴリズム２において、初期化の後、関連度推定部２１はデータ保存部２３に格納するデータとして、関連度が0であるK個のダミーノードをV_aに追加する（４行目、図７のステップ２２）。そして関連度推定部２１はすべてのノードに対して関連度の下限値を計算する（５行目、ステップ２３）。そして最も関連度の下限値が高いノードを、データ保存部２３に保存された探索していないノードの集合V＼V_eから選択し、選択されたノードの関連度の上限値を計算する（７〜８行目、ステップ２４、２５）。補助定理2から選択されたノードの関連度の上限値がθより小さい場合、そのノードおよび探索されていないノードは解ノードになり得ない。そのため、その場合に関連度推定部２１は処理を停止する（９〜１０行目、ステップ２６、２７）。 In the algorithm 2, after initialization, the relevance estimation unit 21 adds K dummy nodes having _a relevance of 0 to Va as data to be stored in the data storage unit 23 (line 4, FIG. 7). Step 22). Then, the relevance estimation unit 21 calculates a lower limit value of relevance for all nodes (5th line, step 23). Then, the node having the highest relevance value is selected from the unsearched node set V \ V _e stored in the data storage unit 23, and the relevance value upper limit value of the selected node is calculated (7). -8th line, steps 24, 25). When the upper limit value of the relevance of the node selected from Lemma 2 is smaller than θ, the node and the unsearched node cannot be the solution node. Therefore, in that case, the relevance estimation unit 21 stops the processing (9th to 10th lines, steps 26 and 27).

もしそうでなければ、選択されたノードは解になり得る。そのため、そのノードの正確な関連度を関連度計算部２２が計算する（１２行目、ステップ２８）。もし計算した関連度がθ より大きければ、関連度計算部２２はデータ保存部２３において解候補の集合V_aとK番目の関連度θを更新する（１３〜１８行目、ステップ２９〜３３）。 If not, the selected node can be the solution. Therefore, the relevance calculation unit 22 calculates the exact relevance of the node (12th line, step 28). If the calculated relevance is greater than theta, relevance calculator 22 updates the set V _a and K th relevance theta candidate solutions in the data storage unit 23 (13-18 line, step 29-33) .

このアルゴリズムが正確に解ノードを検索できることを示すために以下の定理を示す。 The following theorem is shown to show that this algorithm can search the solution node accurately.

定理2：アルゴリズム2は上位K個のノードを正確に検索する。 Theorem 2: Algorithm 2 searches the top K nodes accurately.

定理２の証明：θ_Kを解ノードにおけるK番目の関連度とした場合、解ノードの関連度の上限値はθ_Kより小さくなることはない（補助定理2）。もし関連度の上限値がθより小さければ、アルゴリズム2はノードを枝刈りする。解候補のノード集合におけるK番目の関連度θはθ_Kより大きくなることはないため、解ノードがアルゴリズム2により枝刈りされることはない。またもしあるノードがθより小さければ、そのノードはアルゴリズム2 により枝刈りされる。よって定理２が成り立つ。
上位K個のノードを検索するために必要な計算コストは以下のようになる。 Proof of Theorem 2: When θ _K is the Kth relevance level at the solution node, the upper limit value of the relevance level of the solution node is never smaller than θ _K (Lemma Theorem 2). If the upper limit of relevance is smaller than θ, algorithm 2 prunes the node. The K-th relevance θ in the solution candidate node set never becomes larger than θ _K , so that the solution node is not pruned by the algorithm 2. If a node is smaller than θ, the node is pruned by algorithm 2. Therefore, Theorem 2 holds.
The calculation cost required to search the top K nodes is as follows.

定理3：アルゴリズム2 はO(n log n + m + |F| + |Q|)の計算コストを要する。 Theorem 3: Algorithm 2 requires O (n log n + m + | F | + | Q |).

定理３の証明：アルゴリズム2 はまずすべてのノードに対して関連度の下限値を計算するが、下限値はO(n + m)の計算コストを要する幅優先探索で計算できる。そして関連度の下限値が最も大きいノードを選択しその関連度の上限値を計算するが、もしどのノードも枝刈りされなければ、これらはそれぞれO(n log n) とO(n)の計算コストを要する。 Proof of Theorem 3: Algorithm 2 first calculates the lower limit of relevance for all nodes, but the lower limit can be calculated by a breadth-first search that requires O (n + m) calculation costs. Then select the node with the highest lower relevance value and calculate the upper relevance value, but if no nodes are pruned, these calculate O (n log n) and O (n) respectively. Cost is required.

であるため、定理1からすべてのノードの関連度を計算するにはO(|F| + |Q|)の計算コストを要する。そのため上位K 個のノードを検索をするためには、O(n log n +m+ |F| + |Q|) の計算コストを要する。 Therefore, calculating the relevance of all nodes from Theorem 1 requires O (| F | + | Q |) calculation cost. Therefore, in order to search the top K nodes, a calculation cost of O (n log n + m + | F | + | Q |) is required.

＜閾値εより大きい関連度を持つノードの範囲検索＞
これまで上位K個のノードを検索する方法について説明したが、本実施の形態に係る関連ノード検索装置１により、閾値εより大きい関連度を持つノードを検索することもできる。すなわち本実施の形態に係る関連ノード検索装置１により、閾値εより大きい関連度を持つノードの範囲検索が可能である。以下にその詳細を説明する。 <Range search for nodes with relevance greater than threshold ε>
Although the method of searching for the top K nodes has been described so far, the related node search apparatus 1 according to this embodiment can also search for a node having a degree of association greater than the threshold ε. That is, the related node search apparatus 1 according to the present embodiment can perform a range search for nodes having a relevance greater than the threshold ε. Details will be described below.

閾値εより大きい関連度を持つノードを検索するために関連度の下限値を用いる。具体的には、もしあるノードの関連度の下限値が閾値εより大きい場合、明らかにそのノードの正確な関連度は閾値εより大きくなる。そのためそのようなノードに対して正確な関連度の計算を省略する。上位K個のノードを検索する場合と同様に、下限値の降順にノードを探索する。そして関連度の上限値を計算し、不要な関連度の計算を省略する。しかしもし正確な関連度の計算を行わない場合、定義3 を用いて上限値を計算することができない。そのためもし正確な関連度の計算を省略した場合、以下のように関連度の上限値を計算する。 The lower limit value of the degree of association is used to search for nodes having a degree of association greater than the threshold ε. Specifically, if the lower limit value of the relevance level of a certain node is larger than the threshold value ε, the correct relevance level of the node is obviously larger than the threshold value ε. Therefore, the calculation of the correct relevance for such a node is omitted. As in the case of searching for the upper K nodes, the nodes are searched in the descending order of the lower limit value. Then, the upper limit value of the relevance level is calculated, and the unnecessary relevance level calculation is omitted. However, if accurate relevance is not calculated, the upper limit cannot be calculated using Definition 3. Therefore, if the calculation of the correct relevance is omitted, the upper limit value of the relevance is calculated as follows.

定義4 ：もしノードu_i-1の正確な関連度の計算を省略した場合、ノードu_iの Definition 4: If the exact relevance calculation of node u _i-1 is omitted, node u _i

を以下のように計算する。 Is calculated as follows.

この定義について以下の補助定理を示す。 The following lemma for this definition is shown.

補助定理3 ：定義4に対してi番目に探索したノードu_iに対して Lemma 3: For node u _i searched for i-th definition 4

が成り立つ。また上限値を計算する計算コストはi ≠ 1であればO(1) になる。 Holds. The calculation cost for calculating the upper limit is O (1) if i ≠ 1.

補助定理３の証明：まず Proof of Lemma 3: First

になることを示す。V_i-1をi - 1番目までに探索したノードの集合、V_eをi - 1番目までに探索したノードの集合のうち、正確な関連度を計算したノードとすると、式（８）と補助定理1から以下の式が成り立つ。 Show that. If V _i-1 is a set of nodes searched up to i −1 and V _e is a set of nodes searched up to i −1, the exact relevance is calculated. From Lemma 1, the following equation holds.

つぎに Next

が成り立つことを示す。式（９）から Indicates that From equation (9)

となる。 It becomes.

次に、計算コストがi ≠ 1 であればO(1) になることを示す。これは Next, we show that if the calculation cost is i ≠ 1, then O (1). this is

を計算する前にすでに Already before calculating

が計算済みであることから明らかである。よって成り立つ。 It is clear from the fact that has been calculated. Therefore it holds.

図８に、範囲検索のアルゴリズム３を示す。また、検索部２０が、アルゴリズム３に従って実行する範囲検索のフローチャートを図９に示す。この処理において、検索部２０における関連度推定部２１が、アルゴリズム３に基づく関連度の下限値及び上限値の計算を行い、関連度計算部２２が正確な関連度の計算を行う。また、データ保存部２３には、計算結果のデータ、計算途中のデータ、計算に必要な行列のデータや、その他の計算に必要なデータが格納される。検索処理制御部２４は、アルゴリズム３の通りに処理が行われるように処理制御を行い、最終結果（条件を満たす関連ノード）を出力する制御を行う。 FIG. 8 shows a range search algorithm 3. FIG. 9 shows a flowchart of range search executed by the search unit 20 according to the algorithm 3. In this process, the relevance estimation unit 21 in the search unit 20 calculates the lower limit and upper limit of the relevance based on the algorithm 3, and the relevance calculation unit 22 calculates the correct relevance. The data storage unit 23 stores calculation result data, mid-calculation data, matrix data necessary for the calculation, and other data necessary for the calculation. The search processing control unit 24 performs process control so that the process is performed according to Algorithm 3, and performs control to output a final result (related node that satisfies a condition).

このアルゴリズム３において、初期化の後、すべてのノードに対して関連度推定部２１が関連度の下限値を計算する（３行目、図９のステップ４１）。そして最も関連度の下限値が高いノードを、データ保存部２３に格納された探索していないノードの集合から選択し、選択されたノードの関連度の上限値を関連度推定部２１が計算する（５〜６行目、ステップ４３、４４）。もしその上限値がεより大きくなければ、そのノードおよび探索されていないノードは解ノードになり得ない。そのためその場合、関連度推定部２１は処理を停止する（７〜８行目、ステップ４５、４６）。 In this algorithm 3, after initialization, the relevance estimation unit 21 calculates a lower limit value of relevance for all nodes (third line, step 41 in FIG. 9). Then, the node having the highest relevance value is selected from the set of unsearched nodes stored in the data storage unit 23, and the relevance estimation unit 21 calculates the relevance value of the selected node. (5th to 6th lines, steps 43 and 44). If the upper limit is not greater than ε, the node and unsearched nodes cannot be solution nodes. Therefore, in that case, the relevance estimation unit 21 stops the processing (7th to 8th lines, steps 45 and 46).

もし選択されたノードの下限値がεより大きければ、そのノードは解ノードとなる。そのため、関連度推定部２１は、データ保存部２３においてそのノードを解ノードの集合に追加する（９〜１０行目、ステップ４７、４８）。そうでなければ、関連度計算部２２が正確な関連度を計算し、選択されたノードが解ノードか否かの確認を行う（１１〜１６行目、ステップ４９〜５１）。 If the lower limit value of the selected node is larger than ε, the node becomes a solution node. Therefore, the relevance estimation unit 21 adds the node to the set of solution nodes in the data storage unit 23 (9th to 10th lines, steps 47 and 48). Otherwise, the relevance calculation unit 22 calculates an accurate relevance, and checks whether or not the selected node is a solution node (11th to 16th lines, steps 49 to 51).

アルゴリズム3 により正確に範囲検索を実施できることを以下に示す。 The following shows that the range search can be accurately performed by the algorithm 3.

定理4 ：アルゴリズム3 は範囲検索を正確にO(n log n + m + |F| + |Q|) の計算コストで行う。 Theorem 4: Algorithm 3 performs a range search exactly at the computational cost of O (n log n + m + | F | + | Q |).

まずアルゴリズム3は範囲検索を正確に行うことを示す。アルゴリズム3は選択されたノードの上限値がεより小さければ、そのノードとそれ以外の探索されていないノードを枝刈りする。またアルゴリズム3は選択されたノードの正確な関連度がεより小さければ、そのノードを枝刈りする。探索されていないノードの上限値は選択されたノードの上限値より小さくなることはないので、これらのノードが解ノードの集合に含まれることはない。 First, Algorithm 3 shows that the range search is performed correctly. If the upper limit value of the selected node is smaller than ε, algorithm 3 prunes that node and other unsearched nodes. Algorithm 3 also prunes the selected node if the exact relevance of the selected node is less than ε. Since the upper limit value of the unsearched nodes is never smaller than the upper limit value of the selected node, these nodes are not included in the set of solution nodes.

またアルゴリズム3は選択されたノードの下限値及び正確な関連度がεより大きい場合、そのノードを解ノードとする。よってアルゴリズム3 は範囲検索を正確に行う。 In addition, when the lower limit value and the exact relevance of the selected node are larger than ε, the algorithm 3 sets that node as a solution node. Therefore, algorithm 3 performs a range search accurately.

次にアルゴリズム3は範囲検索をO(n log n+m+|F|+|Q|)の計算コストで行うことを示す。アルゴリズム3はまずすべてのノードに対して関連度の下限値を計算するが、下限値はO(n+m)の計算コストを要する幅優先探索で計算できる。そして関連度の下限値が最も大きいノードを選択しその関連度の上限値を計算するが、もしどのノードも枝刈りされなければこれらはそれぞれO(n log n) とO(n) の計算コストを要する。 Next, algorithm 3 shows that the range search is performed at a calculation cost of O (n log n + m + | F | + | Q |). Algorithm 3 first calculates the lower limit of the relevance level for all nodes, but the lower limit can be calculated by a breadth-first search that requires O (n + m) calculation costs. The node with the highest relevance value is selected and the upper relevance value is calculated. If no node is pruned, these are the calculation costs of O (n log n) and O (n), respectively. Cost.

であるため、定理1からすべてのノードの関連度を計算するにはO(|F| + |Q|)の計算コストを要する。そのため上位K個のノードを検索をするためにはO(n log n + m + |F| + |Q|) の計算コストを要する。よって定理４が成り立つ。 Therefore, calculating the relevance of all nodes from Theorem 1 requires O (| F | + | Q |) calculation cost. Therefore, in order to search the top K nodes, a calculation cost of O (n log n + m + | F | + | Q |) is required. Therefore, Theorem 4 holds.

本発明は、上記の実施の形態に限定されることなく、特許請求の範囲内において、種々変更・応用が可能である。 The present invention is not limited to the above-described embodiments, and various modifications and applications are possible within the scope of the claims.

１関連ノード検索装置
１０事前計算部
１１ノード並び替え部
１２逆行列計算部
２０検索部
２１関連度推定部
２２関連度計算部
２３データ保存部
２４検索処理制御部 DESCRIPTION OF SYMBOLS 1 Related node search apparatus 10 Prior calculation part 11 Node rearrangement part 12 Inverse matrix calculation part 20 Search part 21 Relevance degree estimation part 22 Relevance degree calculation part 23 Data storage part 24 Search process control part

Claims

Calculate the relevance based on PPR (Personalized PageRank) from graph data consisting of multiple nodes and query distribution, and search for K nodes in descending order of relevance, or nodes with relevance greater than a predetermined threshold A related node search device for searching,
From the graph data, a unitary matrix used to calculate the degree of association of a specific node based on PPR and an inverse matrix of an upper triangular matrix are calculated, and a pre-calculating unit that stores the matrix in a storage unit;
Using the inquiry distribution and the inverse matrix of the unitary matrix and upper triangular matrix obtained by the pre-calculation means, search for K nodes in descending order of relevance, or nodes having relevance greater than a predetermined threshold And a search means for searching for and outputting the related node search device.

The pre-calculating means is
Reordering the nodes in the graph data so that the unitary matrix and the inverse matrix of the upper triangular matrix are sparse, and node reordering means for storing the reordered nodes in the storage means;
Inverse matrix calculation means for calculating an inverse matrix of the unitary matrix and the upper triangular matrix using the graph data in which the nodes are rearranged by the node rearrangement means,
The search means includes
Relevance estimation means for estimating a lower limit of relevance for all nodes based on an adjacency matrix of the graph and the query distribution, and estimating an upper limit of relevance based on the lower limit;
The related node search device according to claim 1, further comprising: a relevance level calculating unit that calculates a relevance level for a specific node using an inverse matrix of the unitary matrix and the upper triangular matrix.

In the search means,
The relevance level estimation means searches for nodes until the upper limit value of the relevance level is less than the Kth relevance level or less than the predetermined threshold in descending order of the lower limit value, and the upper limit value of the relevance level is the Kth While the degree of relevance is not less than or less than the predetermined threshold value, the relevance degree calculating means calculates the exact relevance degree, thereby searching for K nodes in descending order of the relevance degree, or larger than the predetermined threshold value. The related node search device according to claim 2, wherein a node having a relevance level is searched.

Calculate the relevance based on PPR (Personalized PageRank) from graph data consisting of multiple nodes and query distribution, and search for K nodes in descending order of relevance, or nodes with relevance greater than a predetermined threshold A related node search method executed by a related node search device to search,
A pre-calculation step of calculating a unitary matrix and an inverse matrix of an upper triangular matrix used for calculating the relevance of a specific node based on the PPR from the graph data, and storing in a storage unit;
Using the inquiry distribution and the inverse matrix of the unitary matrix and upper triangular matrix obtained by the pre-calculation means, search for K nodes in descending order of relevance, or nodes having relevance greater than a predetermined threshold And a retrieval step for retrieving and outputting the related node.

The pre-calculation step includes
Reordering the nodes in the graph data so that the unitary matrix and the inverse matrix of the upper triangular matrix are sparse, respectively, and a node reordering step of storing the reordered nodes in the storage means;
An inverse matrix calculation step of calculating an inverse matrix of the unitary matrix and the upper triangular matrix using the graph data in which the nodes are rearranged by the node rearrangement step, and
The search step includes
A relevance estimation step for estimating a lower limit value of relevance for all nodes based on an adjacency matrix of the graph and the query distribution, and estimating an upper limit value of relevance based on the lower limit;
The related node search method according to claim 4, further comprising: a relevance level calculating step of calculating a relevance level for a specific node using an inverse matrix of the unitary matrix and the upper triangular matrix.

In the search step,
The relevance level estimation step searches for nodes until the upper limit value of the relevance level is less than the Kth relevance level or less than the predetermined threshold in descending order of the lower limit value, and the upper limit value of the relevance level is the Kth While the degree of relevance is less than or less than the predetermined threshold, K nodes are searched in descending order of the degree of relevance by calculating an accurate degree of relevance in the relevance degree calculating step, or larger than the predetermined threshold The related node search method according to claim 5, wherein a node having a relevance level is searched.

The program for functioning a computer as each means in the related node search apparatus of any one of Claims 1 thru | or 3.