JP2008210024A

JP2008210024A - Document set analysis apparatus, document set analysis method, program implementing the method, and recording medium storing the program

Info

Publication number: JP2008210024A
Application number: JP2007044141A
Authority: JP
Inventors: Hiroyuki Toda; 浩之戸田; Takashi Fujimura; 考藤村; Ryoji Kataoka; 良治片岡
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: NTT Inc
Priority date: 2007-02-23
Filing date: 2007-02-23
Publication date: 2008-09-11

Abstract

[Problem] To enable a document set analysis apparatus to judge documents issued at different dates and times as highly related when their contents are highly correlated.
[Solution] The apparatus comprises: a document set specifying unit 10 that specifies a document set according to a specific condition; a content similarity evaluation unit 20 that evaluates the content similarity between the documents in the document set; a time stamp similarity evaluation unit 30 that evaluates the time similarity between the documents in the document set; a relationship extraction unit 40 that extracts relationships between documents based on the content similarity and the time similarity; a centrality determination unit 50 that calculates the centrality of each document based on the relationships between documents; an information analysis unit 100 that, based on the obtained relationships and the centrality of each document, identifies from the whole document set the topic words it contains, the document sets related to each topic word, and the role of each document within those sets; and an information output unit 110 that visualizes and outputs the identified document sets.
[Selected Drawing] Figure 1

Description

The present invention relates to the analysis of large collections of documents that reside in a computer or are accessible via a computer network, in particular documents with time stamps, and concerns a document set analysis apparatus, a document set analysis method, a program implementing the method, and a recording medium storing the program.

Techniques for searching and data mining over document sets (also called document data sets) that contain text with a publication time stamp, such as Web pages, blog articles, and news articles, are now widely known.

When handling a large number of documents with such techniques, users often want document-related information, for example "I want to know the main topics present in the document set" or "I want to access the group of information related to a specific topic in the document set".

The following methods are known for meeting these needs.

One is a method using a clustering algorithm (see, for example, Non-Patent Document 1). Each document is represented as a word vector, and similar vectors are merged using the similarity between vectors (cosine similarity, etc.), so that documents on similar topics (or topic words) are identified as clusters. Treating each cluster as a set of information related to a specific topic meets the needs described above.

Meanwhile, when there are many time-stamped documents, users want "information organized with time stamps in mind so that it is easy to access".

One way to achieve this is a method that identifies the topics of a document set while taking the time series into account (see, for example, Non-Patent Document 2). This method introduces the concept of forgetting and builds a model in which old articles have low similarity to any newly obtained document while new articles are more similar to one another, and thereby identifies the topics in the latest articles.

Non-Patent Document 3 proposes a clustering method that introduces time information when extracting events from a group of news articles, and reports that this improves the clustering accuracy of news articles and, as a result, the classification of events.

Patent Document 1 discloses search using time information.

As related techniques, methods that regard documents as nodes and calculate the centrality of each node (document) are known, for example PageRank (see, for example, Non-Patent Document 4). Search engines on the Web that can be used to specify a set of documents (see, for example, Non-Patent Document 5) are also widely known, as are techniques for representing a document as a word vector (see, for example, Non-Patent Document 6).
Patent Document 1: JP 2001-117941 A (Japanese Patent No. 3425906)
Non-Patent Document 1: D. Cutting, D. Karger, J. Pedersen, and J. Tukey, "Scatter/Gather: a cluster-based approach to browsing large document collections", Proc. of SIGIR 1992, ACM, June 1992, pp. 318-329.
Non-Patent Document 2: Yoshiharu Ishikawa, Hiroyuki Kitagawa, "Improved Clustering Method Based on the Concept of Forgetting" (in Japanese), DBSJ Letters, Vol. 2, No. 3, 2003.
Non-Patent Document 3: Y. Yang, T. Pierce, and J. Carbonell, "A Study on Retrospective and On-Line Event Detection", Proc. of SIGIR 1998.
Non-Patent Document 4: S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine", Proc. of WWW7, Elsevier Science, April 1998, pp. 107-117.
Non-Patent Document 5: NTT Resonant Inc., "Portal site goo", [online], 2007, NTT Resonant Inc., [retrieved February 22, 2007], Internet <URL: http://www.goo.ne.jp/>
Non-Patent Document 6: Kenji Kita, Kazuhiko Tsuda, Masami Shishibori, "Information Retrieval Algorithms" (in Japanese), Kyoritsu Shuppan, January 2002.

However, the above prior art has the following problems.

The method of Non-Patent Document 1 is essentially based only on document content, so even when documents carry time stamps it cannot organize them with time in mind. For example, the "Gulf War" and the "Iraq War" consist of very similar topics when only document content is considered, so it is very difficult to tell documents about the two apart.

The methods of Non-Patent Documents 2 and 3 take time stamps into account and thus organize information by temporal proximity. However, they are based on clustering that assumes every document belongs to some cluster. Real data contains documents unrelated to the others, the so-called "other" documents, so appropriate clustering is not always possible, and the results returned for the above needs contain a lot of noise.

Patent Document 1 uses time information for filtering: documents are treated as related if their time difference is within a threshold and as unrelated if it exceeds the threshold. Consequently, documents whose time difference exceeds the threshold are uniformly judged "unrelated", and documents issued at different dates and times cannot be judged highly related even when their contents are highly correlated.

In view of the above, the present invention takes the closeness of time stamps into account in the criterion for similarity between documents. Its objects are to organize documents in a time-aware manner and, even when "other" documents are present, to identify only the document sets with strong connections without being affected by them; not merely to classify the document set into clusters but also to clearly analyze the role and position of each document within its cluster; and to judge documents issued at different dates and times as highly related when their contents are highly correlated. To these ends it provides a document set analysis apparatus, a document set analysis method, a program implementing the method, and a recording medium storing the program.

To solve the above problems, the present invention evaluates, for a document set specified based on a user request, the mutual content similarity between documents and the time similarity based on their temporal proximity, and identifies the relatedness between documents from these two similarities. The centrality of each document is then evaluated from this relatedness. Using both the relatedness between documents and the centrality of individual documents makes it possible to detect specific topics in the document set, cluster the documents belonging to a specific topic, and clarify the position of each document within its cluster.
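As an illustration of how centrality might be derived from pairwise relatedness, the following is a minimal sketch of a PageRank-style power iteration on an undirected weighted graph. PageRank is named in this document only as a related technique; this passage does not state the formula the apparatus uses, so the function name `centrality`, the damping factor, and the iteration count are all assumptions made for the example.

```python
from typing import Dict, Hashable, Tuple


def centrality(
    weights: Dict[Tuple[Hashable, Hashable], float],
    damping: float = 0.85,      # assumed value, not taken from the source
    iterations: int = 50,       # assumed value, not taken from the source
) -> Dict[Hashable, float]:
    """PageRank-style centrality on an undirected relation graph.

    `weights` maps a pair of document ids to a positive relatedness value.
    Documents that appear in no pair are not scored.
    """
    # Build a symmetric adjacency map from the pairwise weights.
    neighbors: Dict[Hashable, Dict[Hashable, float]] = {}
    for (a, b), w in weights.items():
        neighbors.setdefault(a, {})[b] = w
        neighbors.setdefault(b, {})[a] = w

    nodes = list(neighbors)
    n = len(nodes)
    score = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        new_score = {}
        for node in nodes:
            # Each neighbor passes on its score in proportion to edge weight.
            incoming = sum(
                score[other] * w / sum(neighbors[other].values())
                for other, w in neighbors[node].items()
            )
            new_score[node] = (1.0 - damping) / n + damping * incoming
        score = new_score
    return score
```

In a star-shaped graph, where one document is related to all the others, that document receives the highest score, which matches the intuition of centrality as an index of how strongly a document is related to the rest.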

That is, the document set analysis apparatus according to claim 1 specifies the role of documents based on the relatedness between time-stamped documents in a document set managed by document data management means, and comprises: document set specifying means for specifying the document set based on a document set specifying condition input from input means; content similarity evaluation means for evaluating the content similarity with respect to topic words between the documents included in the specified document set; time stamp similarity evaluation means for evaluating the time similarity with respect to topic words between the documents included in the specified document set; relationship extraction means for extracting relationships between documents based on the content similarity evaluated by the content similarity evaluation means and the time similarity evaluated by the time stamp similarity evaluation means; centrality determination means for calculating, based on the relationships extracted by the relationship extraction means, the centrality of each document as an index of how strongly the document is related to the other documents; information analysis means for identifying, from the whole document set, the topic words contained in the document set, the document sets related to the topic words, and the roles of documents within those document sets, based on the relationships obtained by the relationship extraction means and the centrality of the individual documents obtained by the centrality determination means; and information output means for visualizing and outputting the document set based on the identified topic words, the document sets related to the topic words, and the roles of documents within those document sets.

The document set analysis method according to claim 5 specifies the role of documents based on the relatedness between time-stamped documents in a document set managed by document data management means, and comprises: a document set specifying step in which document set specifying means specifies the document set based on a document set specifying condition input from input means; a content similarity evaluation step in which content similarity evaluation means evaluates the content similarity with respect to topic words between the documents included in the specified document set; a time stamp similarity evaluation step in which time stamp similarity evaluation means evaluates the time similarity with respect to topic words between the documents included in the specified document set; a relationship extraction step in which relationship extraction means extracts relationships between documents based on the content similarity evaluated in the content similarity evaluation step and the time similarity evaluated in the time stamp similarity evaluation step; a centrality determination step in which centrality determination means calculates, based on the relationships extracted in the relationship extraction step, the centrality of each document as an index of how strongly the document is related to the other documents; an information analysis step in which information analysis means identifies, from the whole document set, the topic words contained in the document set, the document sets related to the topic words, and the roles of documents within those document sets, based on the relationships obtained in the relationship extraction step and the centrality of the individual documents obtained in the centrality determination step; and an information output step in which information output means visualizes and outputs the document set based on the identified topic words, the document sets related to the topic words, and the roles of documents within those document sets.

With this configuration, relationships between documents are extracted from the content similarity and the time similarity and the roles of documents are identified, so documents issued at different dates and times can still be judged highly related when their contents are highly correlated.

Furthermore, because the document group is analyzed with the document creation date and time taken into account together with content similarity, analysis accuracy improves. For example, document groups such as newspaper articles, in which related documents frequently carry different time stamps (an article from when an incident occurred and an article from when the trial was held), can be classified and searched accurately.

In the present invention, the document set is also regarded as a three-dimensional graph structure based on the relatedness between documents and the centrality value of each document. Identifying the vertex nodes and the mountain-shaped node groups in that structure makes it possible to detect specific topics in the document set, cluster the documents belonging to a specific topic, and clarify the position of each document within its cluster.

That is, the document set analysis apparatus according to claim 2 is the apparatus according to claim 1, wherein the information analysis means comprises: graph structure construction means for representing, based on the relatedness between the documents and the centrality of each document, the relatedness between documents in two-dimensional coordinates and the centrality in a third coordinate, thereby representing the document set as a three-dimensional graph structure; vertex node extraction means for extracting from the obtained graph structure, as a vertex, each node whose centrality is higher than that of every other node connected to it by an edge; mountain-shaped node group identification means for tracing the graph structure from each obtained vertex node in the direction of decreasing centrality to identify a mountain made up of nodes; and labeling means for assigning each node a label indicating its role based on the identified vertex nodes and mountain-shaped nodes.

The document set analysis method according to claim 6 is the method according to claim 5, wherein the information analysis step comprises: a graph structure construction step of representing, based on the relatedness between the documents and the centrality of each document, the relatedness between documents in two-dimensional coordinates and the centrality in a third coordinate, thereby representing the document set as a three-dimensional graph structure; a vertex node extraction step of extracting from the obtained graph structure, as a vertex, each node whose centrality is higher than that of every other node connected to it by an edge; a mountain-shaped node group identification step of tracing the graph structure from each obtained vertex node in the direction of decreasing centrality to identify a mountain made up of nodes; and a labeling step of assigning each node a label indicating its role based on the identified vertex nodes and mountain-shaped nodes.

With this configuration, a graph structure based on the relationships between documents can be obtained.
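The vertex node extraction and mountain-shaped node group identification described in claims 2 and 6 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names `find_peaks` and `trace_mountain`, the adjacency-set representation, and the decision to ignore nodes without edges are assumptions introduced for the example.

```python
from typing import Dict, Hashable, List, Set

# Undirected graph as a symmetric adjacency map: node -> set of neighbors.
Graph = Dict[Hashable, Set[Hashable]]


def find_peaks(graph: Graph, centrality: Dict[Hashable, float]) -> List[Hashable]:
    """Return nodes whose centrality exceeds that of every connected node.

    Nodes with no edges are skipped (an assumption: an unrelated
    document is not treated as the vertex of a topic).
    """
    return [
        node
        for node, nbrs in graph.items()
        if nbrs and all(centrality[node] > centrality[n] for n in nbrs)
    ]


def trace_mountain(
    graph: Graph, centrality: Dict[Hashable, float], peak: Hashable
) -> Set[Hashable]:
    """Collect the nodes reachable from `peak` by strictly descending centrality."""
    mountain = {peak}
    stack = [peak]
    while stack:
        node = stack.pop()
        for nbr in graph[node]:
            # Follow an edge only when it leads downhill.
            if nbr not in mountain and centrality[nbr] < centrality[node]:
                mountain.add(nbr)
                stack.append(nbr)
    return mountain
```

With a chain of five documents a-b-c-d-e whose centralities are 0.1, 0.5, 0.2, 0.6, and 0.3, `find_peaks` returns `b` and `d`, and the low-centrality document `c` between them belongs to the foot of both mountains.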

The document set analysis apparatus according to claim 3 is the apparatus according to claim 1 or 2, wherein the time stamp similarity evaluation means evaluates the time similarity using a forgetting function that changes continuously with the time stamp difference between the documents.

The document set analysis method according to claim 7 is the method according to claim 5 or 6, wherein the time stamp similarity evaluation step evaluates the time similarity using a forgetting function that changes continuously with the time stamp difference between the documents.

With this configuration, temporal proximity can be used as a continuous value.

The document set analysis apparatus according to claim 4 is the apparatus according to any one of claims 1 to 3, wherein the relationship extraction means extracts relationships between documents based on the content similarity and on a time similarity obtained by applying a predetermined weighting to the time similarity evaluated by the time stamp similarity evaluation means.

The document set analysis method according to claim 8 is the method according to any one of claims 5 to 7, wherein the relationship extraction step extracts relationships between documents based on the content similarity and on a time similarity obtained by applying a predetermined weighting to the time similarity evaluated in the time stamp similarity evaluation step.

With this configuration, the time similarity is weighted, so the resulting similarity does not depend excessively on temporal proximity alone.

The document set analysis program according to claim 9 describes the document set analysis method according to any one of claims 5 to 8 as a computer program executable by a computer.

With this configuration, the document set analysis method according to any one of claims 5 to 8 can be described as a computer program.

The recording medium according to claim 10 records a computer program, executable by a computer, that describes the document set analysis method according to any one of claims 5 to 8.

With this configuration, a computer program implementing the document set analysis method according to any one of claims 5 to 8 can be recorded on a recording medium.

(1) According to the inventions of claims 1 to 10, the role of each document can be clearly analyzed based on the relatedness between documents obtained by combining content similarity and time similarity.

In particular, documents issued at different dates and times can be judged highly related when their contents are highly correlated.

Furthermore, because the document group is analyzed with the document creation date and time taken into account together with content similarity, analysis accuracy improves. For example, document groups such as newspaper articles, in which related documents frequently carry different time stamps (an article from when an incident occurred and an article from when the trial was held), can be classified and searched accurately.
(2) According to the inventions of claims 2 and 6, only document sets with strong connections can be identified.
(3) According to the inventions of claims 3 and 7, temporal proximity can be used as a continuous value.
(4) According to the inventions of claims 4 and 8, the time similarity is weighted, so the resulting similarity does not depend excessively on temporal proximity alone.
(5) According to the invention of claim 9, a computer program implementing the document set analysis method according to any one of claims 5 to 8 can be provided.
(6) According to the invention of claim 10, a recording medium recording a computer program that implements the document set analysis method according to any one of claims 5 to 8 can be provided.

Embodiments of the present invention are described in detail below with reference to the drawings. The document set analysis apparatus of this embodiment identifies the topics (that is, topic words) present in retrieved news articles, clusters the documents related to each identified topic, and performs document analysis that clarifies the position of each document within its cluster.

The configuration of the document set analysis apparatus of this embodiment is described with reference to FIG. 1. The apparatus comprises a document DB (database) 1 as document data management means, a document set specifying unit 10 as document set specifying means, a content similarity evaluation unit 20 as content similarity evaluation means, a time stamp similarity evaluation unit 30 as time stamp similarity evaluation means, a relationship extraction unit 40 as relationship extraction means, a centrality determination unit 50 as centrality determination means, an information analysis unit 100 as information analysis means, and an information output unit 110 as information output means. The information analysis unit 100 in turn comprises a graph structure construction unit 60 as graph structure construction means, a vertex node extraction unit 70 as vertex node extraction means, a mountain-shaped node group identification unit 80 as mountain-shaped node group identification means, and a labeling unit 90 as labeling means.

The document set specifying unit 10 accesses the document DB 1 and specifies a document set made up of multiple documents, based either on a specification or request that includes a document set specifying condition (for example, a specification or request from a user) or on a predetermined document set specifying condition. The document set specifying condition may be entered through input means provided in advance (for example, a keyboard).

The content similarity evaluation unit 20 evaluates the content similarity between the documents in the document set with respect to topics (or topic words). Possible methods include representing each document as a word vector and using cosine similarity (see, for example, Non-Patent Document 1), or a language-model-based evaluation in which a language model is built from one document and the probability that the other document is generated from that model is estimated.
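The word-vector option can be illustrated with a short sketch of cosine similarity over raw term counts. The function name `cosine_similarity` and the use of unweighted term frequencies are assumptions for the example; the source does not specify the term weighting, and a practical system would typically tokenize the text and apply a weighting scheme first.

```python
import math
from collections import Counter
from typing import List


def cosine_similarity(words_a: List[str], words_b: List[str]) -> float:
    """Cosine similarity between two documents given as lists of words."""
    va, vb = Counter(words_a), Counter(words_b)
    # Only words present in both documents contribute to the dot product.
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Two documents with the same word counts score approximately 1, and documents with no words in common score 0.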

The time stamp similarity evaluation unit 30 evaluates the temporal proximity between the documents in the document set. Temporal proximity is evaluated from the difference between time stamps. The proximity value may simply use the time stamp difference as a distance, or it may use an index that emphasizes differences between recent time stamps and grows only slightly once the time stamps are more than a certain distance apart.
As one example, an index can be based on the assumption that "the similarity decreases at a fixed rate each time the time stamps of two documents move apart by a fixed amount of time". This assumption can be expressed by the following equation (1).
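The equation itself is not reproduced in this text. Read as a differential equation in the time stamp difference t, the stated assumption of a constant rate of decrease corresponds to the following form, reconstructed here from the definitions of λ and TimeWeight(t) that follow:

```latex
\frac{d\,\mathrm{TimeWeight}(t)}{dt} = -\lambda\,\mathrm{TimeWeight}(t) \tag{1}
```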

Here, λ is a constant indicating how quickly the similarity decays through forgetting, t is the difference between the time stamps of two documents, and TimeWeight(t) is a function indicating the degree of forgetting; it gives the time similarity when the time stamp difference between two documents is t.

Solving this ordinary differential equation, the time similarity TimeWeight(t) for a time stamp difference of t can be expressed by the following equation (2).
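Equation (2) is likewise missing from this text. Assuming the decay assumption has the form dTimeWeight/dt = -λ TimeWeight(t), separating variables and using the initial value T0 at t = 0 (T0 is defined in the next paragraph) gives the reconstruction below:

```latex
\begin{align}
\frac{d\,\mathrm{TimeWeight}(t)}{\mathrm{TimeWeight}(t)} &= -\lambda\,dt \notag \\
\ln \mathrm{TimeWeight}(t) &= -\lambda t + C \notag \\
\mathrm{TimeWeight}(t) &= T_0\, e^{-\lambda t} \tag{2}
\end{align}
```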

Here, T0 is the weight when the time stamp difference is 0. As the time stamp difference t increases, the time similarity gradually decreases and eventually approaches 0 asymptotically.
In this form, however, the value indicating the degree of decay through forgetting is not intuitive, so the equation is transformed as in equation (3) below, and the parameter indicating the degree of decay is taken to be t_1/2.
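Equation (3) is also absent here. Assuming equation (2) has the exponential form TimeWeight(t) = T0 e^(-λt), requiring the value to halve at t = t_1/2 fixes λ and yields the half-life form sketched below:

```latex
\begin{align}
T_0\, e^{-\lambda t_{1/2}} &= \tfrac{1}{2}\, T_0
  \;\Longrightarrow\; \lambda = \frac{\ln 2}{t_{1/2}} \notag \\
\mathrm{TimeWeight}(t) &= T_0\, e^{-\frac{\ln 2}{t_{1/2}}\, t}
  = T_0 \left(\tfrac{1}{2}\right)^{t / t_{1/2}} \tag{3}
\end{align}
```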

In the above equation, t_1/2 denotes the time stamp difference at which forgetting reduces the time similarity to 50% (the half-life). FIG. 3 shows the forgetting function for T0 = 1 and t_1/2 = 10.
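A minimal sketch of such a half-life forgetting function, assuming the exponential form TimeWeight(t) = T0 × (1/2)^(t / t_1/2) implied by the description; the function name `time_weight` and its parameter names are illustrative.

```python
def time_weight(t, t_half=10.0, t0=1.0):
    """Time similarity for a time stamp difference t (same unit as t_half)."""
    if t_half <= 0:
        raise ValueError("t_half must be positive")
    # The sign of the difference is ignored: only the distance in time matters.
    return t0 * 0.5 ** (abs(t) / t_half)

# Values for the FIG. 3 setting (T0 = 1, t_1/2 = 10)
samples = {t: time_weight(t) for t in (0, 10, 20, 30)}
```

With these settings the weight is 1 at a difference of 0, halves at a difference of 10, and halves again at 20.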

The relationship extraction unit 40 calculates the similarity between documents from the content similarity and the temporal proximity evaluated by the two similarity evaluation functions (the content similarity evaluation unit 20 and the time stamp similarity evaluation unit 30), and determines whether or not a relationship exists between documents.

One example of an inter-document similarity sim(i, j) that uses the two similarities is the technique of the following equation (4).

Here, sim'(i, j) denotes the similarity based on the contents of documents i and j, and α is a parameter that adjusts the weight of the time similarity.

This technique is fundamentally based on content similarity, so the resulting similarity does not depend excessively on temporal proximity alone. FIG. 4 shows how the weight derived from the time similarity is applied by equation (4).

In FIG. 4, (a) shows the case where the time stamp difference t(i, j) varies, and (b) shows the case where the content similarity sim'(i, j) differs.
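Equation (4) is not reproduced in this text, so its exact form cannot be confirmed here. The sketch below assumes one form that matches the description, in which the content similarity is scaled by a time-dependent factor: sim(i, j) = sim'(i, j) × (1 + α × TimeWeight(t(i, j))). This form, the half-life decay with T0 = 1, and the names `combined_similarity`, `alpha`, and `t_half` are assumptions for illustration only, not the patent's definition.

```python
def combined_similarity(content_sim, t_diff, alpha=0.5, t_half=30.0):
    """Content similarity boosted by temporal proximity (assumed form).

    content_sim: sim'(i, j), e.g. a cosine similarity in [0, 1]
    t_diff:      time stamp difference t(i, j), in the same unit as t_half
    alpha:       weight of the time similarity (alpha = 0 ignores time)
    """
    time_sim = 0.5 ** (abs(t_diff) / t_half)  # half-life decay with T0 = 1
    return content_sim * (1.0 + alpha * time_sim)
```

With this assumed form, two documents with no content overlap stay at similarity 0 however close their time stamps are, which is one way of keeping the measure from depending on temporal proximity alone.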

Next, the extraction of relationships between documents using this inter-document similarity is described. When the relationships between documents are expressed as a matrix A, the matrix can be defined as in the following equation (5).

Here, TopSim_p(i) denotes the set of the p documents most similar to document i. In general, when all similarities are used, low similarities tend to act as noise, so links are set only between highly similar documents. sim(i, j) denotes the cosine similarity between document i and document j when each document is expressed as a word vector with log tf-idf weights (see, for example, Non-Patent Document 6). The log tf-idf weight is the weight of each element when an individual document is expressed as a vector.
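Equation (5) is missing from this text. A plausible reading, assumed here, is that A[i][j] holds sim(i, j) when j is among the p documents most similar to i and 0 otherwise. The sketch below builds such a matrix from a precomputed similarity matrix; `build_relation_matrix` and its arguments are hypothetical names, and excluding a document from its own top-p set is also an assumption.

```python
def build_relation_matrix(sim, p):
    """Keep, for each document i, only links to its p most similar documents.

    sim: square list of lists with sim[i][j] = similarity of documents i and j
    Returns A with A[i][j] = sim[i][j] if j is in TopSim_p(i), else 0.0.
    """
    n = len(sim)
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # TopSim_p(i): the p documents most similar to document i
        top = sorted(others, key=lambda j: sim[i][j], reverse=True)[:p]
        for j in top:
            a[i][j] = sim[i][j]
    return a

# Made-up similarity values for four documents
sim = [
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.2, 0.6],
    [0.1, 0.2, 1.0, 0.4],
    [0.3, 0.6, 0.4, 1.0],
]
A = build_relation_matrix(sim, p=2)
```

The resulting matrix need not be symmetric: document 2 links to document 1, but document 1 does not link back, so links built this way are directed.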

Furthermore, when all similarities are used as described above, some links have clearly smaller weights than the others. It is therefore conceivable to remove, from the outlinks, those links that are followed only with a very small probability. This operation is expressed by the following equation (6).

Here, l_{i,q} denotes the sum of transition probabilities obtained by sorting the outlinks from node i in descending order of transition probability and adding them until the threshold q is exceeded. TopLink_q(i) denotes the set of destination nodes of the links included in that sum.
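Equation (6) is likewise not reproduced. The sketch below illustrates the described selection: the outlinks of a node are sorted by transition probability in descending order and accumulated until the running total exceeds q; the links added so far form TopLink_q(i) and the total is l_{i,q}. Dividing the kept probabilities by l_{i,q} so that they sum to 1 again is an assumption of this example, as are the function and variable names.

```python
def prune_outlinks(probs, q):
    """Keep the most probable outlinks of one node.

    probs: dict mapping destination node -> transition probability
    q:     cumulative-probability threshold, 0 < q <= 1
    Returns (kept, total): kept maps each destination in TopLink_q to its
    renormalized probability, total is the accumulated sum l_{i,q}.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    top, total = [], 0.0
    for dest, prob in ranked:
        top.append((dest, prob))
        total += prob
        if total > q:  # stop once the threshold is exceeded
            break
    if total == 0.0:
        return {}, 0.0
    kept = {dest: prob / total for dest, prob in top}
    return kept, total

kept, total = prune_outlinks({"d1": 0.5, "d2": 0.3, "d3": 0.15, "d4": 0.05}, q=0.9)
```

In this example the three most probable links are kept and the least probable one, `d4`, is removed.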

The centrality determination unit 50 regards the relationships between documents obtained by the relationship extraction unit 40 as a graph structure in which documents are nodes and the relationships between documents are edges weighted by inter-document similarity, and calculates the centrality of each node (document). The centrality may be computed, for example, by simply counting links or by using PageRank (see Non-Patent Document 4).
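As an illustration of the PageRank option, the sketch below runs a basic power iteration over a weighted link matrix. The damping factor of 0.85, the fixed iteration count, the uniform handling of nodes without outlinks, and the name `pagerank` are assumptions of this example; the text only names PageRank as one possible centrality measure.

```python
def pagerank(weights, damping=0.85, iterations=100):
    """Power-iteration PageRank on a weighted directed graph.

    weights[i][j] is the weight of the link from node i to node j.
    Returns a list of scores that sums to 1.
    """
    n = len(weights)
    out_sum = [sum(row) for row in weights]
    rank = [1.0 / n] * n
    for _ in range(iterations):
        # Rank held by nodes without outlinks is spread uniformly
        dangling = sum(rank[i] for i in range(n) if out_sum[i] == 0.0)
        new_rank = [(1.0 - damping) / n + damping * dangling / n] * n
        for i in range(n):
            if out_sum[i] == 0.0:
                continue
            for j in range(n):
                if weights[i][j] > 0.0:
                    new_rank[j] += damping * rank[i] * weights[i][j] / out_sum[i]
        rank = new_rank
    return rank

# Made-up weights: node 0 is linked from every other node; node 3 has no inlinks.
weights = [
    [0.0, 0.5, 0.5, 0.0],
    [0.9, 0.0, 0.1, 0.0],
    [0.8, 0.2, 0.0, 0.0],
    [0.7, 0.0, 0.0, 0.0],
]
scores = pagerank(weights)
```

Node 0, which receives heavy links from all other nodes, ends up with the highest score, and node 3, which nothing links to, with the lowest.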

The graph structure construction unit 60 places a graph structure representing the relationships between documents on a two-dimensional plane (for example, the xy plane), based on the relationships between documents and the centrality score of each document obtained by the centrality determination unit 50, and assigns the centrality score of each document to the third dimension (for example, the z axis), thereby constructing a three-dimensional graph structure. FIG. 5 is a conceptual diagram of this three-dimensional graph structure and is described later.

The vertex node extraction unit 70 extracts as a vertex, from the graph structure constructed by the graph structure construction unit 60, each node (nodes correspond one-to-one with documents) whose centrality is higher than that of every other node connected to it by an edge.
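The vertex test can be sketched as a local-maximum check over each node's neighbors. The adjacency representation, the made-up centrality values, the name `find_peaks`, and the decision to skip nodes with no edges are assumptions of this example.

```python
def find_peaks(neighbors, centrality):
    """Return nodes whose centrality exceeds that of every adjacent node.

    neighbors:  dict mapping node -> set of nodes connected to it by an edge
    centrality: dict mapping node -> centrality score
    Nodes without any edge are skipped (an assumption of this sketch).
    """
    peaks = []
    for node, adjacent in neighbors.items():
        if adjacent and all(centrality[node] > centrality[m] for m in adjacent):
            peaks.append(node)
    return peaks

# Toy graph; node names follow FIG. 5, values are made up
neighbors = {
    "a1": {"a2", "a3"},
    "a2": {"a1", "a3"},
    "a3": {"a1", "a2", "a5"},
    "a5": {"a3", "b4"},
    "b1": {"b2"},
    "b2": {"b1", "b4"},
    "b4": {"b2", "a5"},
    "c1": set(),
}
centrality = {"a1": 0.30, "a2": 0.20, "a3": 0.18, "a5": 0.05,
              "b1": 0.15, "b2": 0.08, "b4": 0.03, "c1": 0.01}
peaks = find_peaks(neighbors, centrality)
```

In this toy graph only `a1` and `b1` are higher than all of their neighbors, so they are the two vertices.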

The mountain-shaped node group specifying unit 80 follows the graph structure from each vertex node extracted by the vertex node extraction unit 70 in the direction of decreasing centrality, and identifies the mountain formed by the nodes. In other words, the processing up to and including the mountain-shaped node group specifying unit 80 clusters the documents.
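The downhill traversal can be sketched as a breadth-first search that only steps to neighbors with strictly lower centrality. The name `find_mountain`, the toy graph, and its centrality values are assumptions of this example.

```python
from collections import deque

def find_mountain(peak, neighbors, centrality):
    """Collect the nodes reachable from a peak by only stepping downhill."""
    mountain = {peak}
    queue = deque([peak])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in mountain and centrality[nxt] < centrality[node]:
                mountain.add(nxt)
                queue.append(nxt)
    return mountain

# Two peaks (a1, b1) joined through a low-centrality node v; values are made up
neighbors = {
    "a1": {"a2"},
    "a2": {"a1", "v"},
    "v": {"a2", "b2"},
    "b2": {"v", "b1"},
    "b1": {"b2"},
}
centrality = {"a1": 0.4, "a2": 0.2, "v": 0.05, "b2": 0.15, "b1": 0.3}
mountain_a = find_mountain("a1", neighbors, centrality)
mountain_b = find_mountain("b1", neighbors, centrality)
```

In this toy graph the low-centrality node `v` is reachable downhill from both peaks. The text does not say how such shared nodes are assigned, so the sketch simply reports them in both mountains.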

The labeling unit 90 assigns labels to the vertex nodes extracted by the vertex node extraction unit 70, to the mountain-shaped node groups centered on those vertices identified by the mountain-shaped node group specifying unit 80, and to the relationships between them.

Here, the graph structure and centrality are explained. By the definition of the centrality score, nodes in areas with many edges have high scores. Consider a model in which a person moves from node to node along the graph structure (a model in which a user browses nodes by following the graph). In such a high-centrality area, many transitions stay within the area and the nodes are closely related to one another. The area therefore consists of nodes related to the same topic, and each mountain in FIG. 5 is considered to correspond to a different topic.

Each document is also considered to have characteristics that depend on the position of its node within a mountain in FIG. 5. The characteristics of the document corresponding to each kind of node are described below, together with the method of identifying the role in the document set for each kind of node.

The first-stage nodes in FIG. 5 are the nodes at the top of a mountain (for example, the nodes labeled a1 and b1); each mountain has exactly one such node. These nodes receive the most state transitions from the surrounding nodes and are the most closely related to them, so they can be said to be the documents that best express the topic. That is, the document indicated by a vertex node identifies the topic of the area and roughly when the topic was current. Hereinafter, a document (node) that identifies the topic of an area is labeled a core document (or core node).

The second-stage nodes are nodes close to the vertex (for example, the nodes labeled a2, a3, a4 and b2, b3 in FIG. 5). They can be reached from the core node, directly or indirectly, by following only bidirectional links. A bidirectional link is a pair of mutual links and indicates high relevance. These nodes exchange many state transitions with the core node, and their contents are also highly related to the core document. Hereinafter, a document (node) that is highly related to the core document is labeled a supplemental document (or supplemental node).

The third-stage nodes are nodes that link to a core node or a supplemental node, such as the nodes labeled a5, a6, a7, and b4 in FIG. 5. Compared with state transitions to outside nodes or self-transitions, they have a high probability of transitioning to the core node or supplemental nodes of a particular topic. These nodes are not necessarily central to the topic, but they contain information related to it and often contain highly novel information, such as information on the periphery of the topic. Hereinafter, a document (node) that often contains such highly novel information is labeled a subtopic document (or subtopic node).

The final-stage nodes are nodes with no strong relationship to the nodes of any topic, for example the node labeled c1 in FIG. 5. Such a node has few similar nodes and a high self-transition probability. Hereinafter, a document (node) with few similar nodes and a high self-transition probability is labeled an outlier document (or outlier node). Allowing outlier documents to exist prevents such documents from being forced into some cluster and becoming a source of noise.
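The four roles described above can be sketched as a simple classification over directed links. This is a simplified illustration: it uses only link structure and ignores the transition-probability comparisons mentioned in the text, and the name `assign_roles`, the link representation, and the toy graph are assumptions.

```python
def assign_roles(links, cores):
    """Label each node as core, supplemental, subtopic, or outlier.

    links: dict mapping node -> set of nodes it links to (directed)
    cores: iterable of nodes already identified as mountain tops
    """
    roles = {node: "outlier" for node in links}
    for core in cores:
        roles[core] = "core"
    # Supplemental: reachable from a core by following only bidirectional links
    stack = list(cores)
    while stack:
        node = stack.pop()
        for nxt in links[node]:
            if node in links.get(nxt, set()) and roles[nxt] == "outlier":
                roles[nxt] = "supplemental"
                stack.append(nxt)
    # Subtopic: remaining nodes that link to a core or supplemental node
    for node, targets in links.items():
        if roles[node] == "outlier" and any(
            roles.get(t) in ("core", "supplemental") for t in targets
        ):
            roles[node] = "subtopic"
    return roles

# Toy directed graph; node names follow FIG. 5
links = {
    "a1": {"a2"},
    "a2": {"a1", "a3"},
    "a3": {"a2"},
    "a5": {"a3"},
    "c1": set(),
}
roles = assign_roles(links, cores=["a1"])
```

Here `a2` and `a3` are reached from the core `a1` through mutual links, `a5` only links into the group without being linked back, and `c1` has no links at all.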

Based on the above, labels are assigned to the nodes as follows.

First, each node is given, as a label (i.e., core node), information on which topic the document is related to and what role it plays in expressing that topic.

Next, each mountain-shaped node group is given a label (i.e., supplemental node) as a cluster of documents related to the topic expressed by its vertex node.

Finally, combinations of mountain-shaped node groups are labeled (i.e., subtopic node or outlier node) according to how strongly the topics expressed by the two mountains are related, as determined from how the mountains are connected.

The information output unit 110 displays (visualizes) the contents of the document set to the user, using the relationships between nodes, the centrality of each node, and the role of each node in the document set obtained by the information analysis unit 100. The visualization is performed, for example, on a display device.

The document DB 1 (document data management means) is a document data storage device (for example, a device including a hard disk device or memory) with a search function that can specify a document set according to conditions such as a search keyword specified by the user or the last update date of a document. The document DB 1 may be built by collecting information in advance from the web or other sources. A search engine available on the web (see Non-Patent Document 5) may also be used directly as the document DB 1.

Next, the document set analysis method of this embodiment is described with reference to FIG. 2.

First, a document set specifying condition, either specified by the user or determined in advance, is read from the input means (S1). The input means may be, for example, a keyboard device.

Next, the document set specifying unit 10 specifies the set of documents that satisfy the document set specifying condition (S2).

Next, the content similarity evaluation unit 20 calculates the content similarity for each pair of documents in the document group specified by the document set specifying unit 10 (S3).

Next, the time stamp similarity evaluation unit 30 calculates the temporal proximity (time similarity) for each pair of documents in the document group specified by the document set specifying unit 10 (S4).

Next, the relationship extraction unit 40 extracts strongly related pairs based on the calculated content similarity and time similarity, and determines their relationships with weights (S5).

Next, the centrality determination unit 50 computes an index (for example, PageRank) based on the information obtained by the similarity evaluation units 20 and 30 and the relationship extraction unit 40, and determines the centrality of each node (S6).

Next, the graph structure construction unit 60 constructs a graph structure in which nodes (in one-to-one correspondence with documents) are placed in a three-dimensional space, based on the information obtained from the similarity evaluation units 20 and 30, the relationship extraction unit 40, and the centrality determination unit 50 (S7).

Next, the vertex node extraction unit 70 extracts vertex nodes based on the graph structure obtained by the graph structure construction unit 60 (S8).

Next, the mountain-shaped node group specifying unit 80 extracts mountain-shaped node groups from the graph structure obtained by the graph structure construction unit 60 and the vertices obtained by the vertex node extraction unit 70 (S9).

Next, the labeling unit 90 labels the nodes, the mountain-shaped node groups, and the relationships between node groups, based on the information obtained by the graph structure construction unit 60, the vertex node extraction unit 70, and the mountain-shaped node group specifying unit 80 (S10).

Finally, the labeled nodes, mountain-shaped node groups, and relationships between node groups are visualized as a list or a 3D map (S11).

As described above, according to this embodiment, similarities based on content and time stamps are determined between the documents of the document set given by the document set specifying condition, and strongly connected pairs of documents are identified, with weights, from those similarities. The connections between documents are then treated as a graph structure to calculate the centrality of each document. Finally, the relationships between documents and the centrality values are used to treat the document group as a graph structure arranged in three dimensions, and the position of each document is identified from its place in that structure. In this way, results that reflect the closeness of time stamps between documents can be obtained, such as "identification of major topics", "identification of documents related to a topic", "the role of each document among the documents related to each topic", and "relationships between topics".

In more detail, this embodiment evaluates the mutual similarity between documents in the document set specified in response to a user request, and determines the relationships between documents from those similarities. The centrality of each document is then evaluated from these relationships. Using the relationships between documents together with the centrality of individual documents makes it possible to detect specific topics in the document set, to cluster the documents belonging to a specific topic, and to clarify the position of each document within its cluster.

In addition, the document set is regarded as a three-dimensional graph structure based on the relationships between documents and the centrality value of each document, and identifying the vertices and mountain-shaped node groups in that structure makes it possible to detect specific topics in the document set, to cluster the documents belonging to a specific topic, and to clarify the position of each document within its cluster.

Also, according to this embodiment, "identification of major topics" and "identification of documents related to a topic" are more accurate than when time is not taken into account.

FIG. 6 shows the result of applying the method of FIGS. 1 to 5 to newspaper articles. It shows the relationship between clustering accuracy and the two parameters that control the time similarity: α, which determines the strength (time weight) of the time similarity, and t_1/2, the half-life of the time similarity. The broken line in FIG. 6 is the baseline obtained when time information is not taken into account.

In this study, the evaluation used search results obtained by searching a newspaper article corpus. Because newspaper articles are issued daily, the unit of the time difference is the day.

According to FIG. 6, clustering accuracy is higher when the time information (α, t_1/2) is taken into account than when it is not. In particular, when the half-life t_1/2 is set to about 30 to 90 and the weight of the time similarity is set to around 0.5, the clustering accuracy significantly exceeds the baseline (broken line).

Furthermore, according to this embodiment, for "the role of each document among the documents related to each topic", the core document obtained is not only representative of the cluster in content but also a document from the period when the topic was actually being discussed, that is, when related documents were being issued in succession. This also makes it possible to obtain information such as when a particular topic became active.

It goes without saying that some or all of the functions of the means of the document set analysis apparatus of this embodiment can be implemented as a computer program and executed on a computer to realize the present invention, and that the procedure of the document set analysis method of this embodiment can likewise be implemented as a computer program and executed by a computer. A program for realizing these functions on a computer can be recorded on a computer-readable recording medium, such as an FD (Floppy (registered trademark) Disk), MO (Magneto-Optical disk), ROM (Read Only Memory), memory card, CD (Compact Disk), DVD (Digital Versatile Disk), or removable disk, and stored or distributed in that form. The program can also be provided over a network, such as the Internet or by electronic mail.

Furthermore, a computer program describing the method of the document set analysis apparatus may be implemented so as to access a memory, an external storage device, or the like that stores the input and output data required by the method.

Although an embodiment of the present invention has been described above, the present invention is not limited to the described embodiment, and various modifications can be made within the scope of each claim.

For example, the information analysis unit of this embodiment is composed of the means from the graph structure construction unit through the labeling unit, but it is not limited to these; other processing means that regard the document group and the relationships between documents as a graph structure are also conceivable. More specifically, the labeling unit may identify roles with finer-grained labels (for example, five or more levels).

FIG. 1 is a block diagram of the document set analysis apparatus of this embodiment.
FIG. 2 is a flowchart of the document set analysis method of this embodiment.
FIG. 3 is a graph of the forgetting function in this embodiment.
FIG. 4 is a characteristic diagram of the inter-document similarity using time similarity in this embodiment.
FIG. 5 is a conceptual diagram of the three-dimensional structure in this embodiment.
FIG. 6 is a characteristic diagram showing the relationship between the parameters controlling the influence of time information and clustering accuracy in this embodiment.

Explanation of symbols

1 ... Document DB
10 ... Document set specifying unit
20 ... Content similarity evaluation unit
30 ... Time stamp similarity evaluation unit
40 ... Relationship extraction unit
50 ... Centrality determination unit
60 ... Graph structure construction unit
70 ... Vertex node extraction unit
80 ... Mountain-shaped node group specifying unit
90 ... Labeling unit
100 ... Information analysis unit
110 ... Information output unit
a1, a2, a3, a4, a5, a6, a7, a8, b1, b2, b3, b4, c1 ... Nodes

Claims

A document set analysis apparatus for identifying the role of a document based on relationships between time-stamped documents in a document set managed by document data management means, the apparatus comprising:
document set specifying means for specifying the document set based on a document set specifying condition input from input means;
content similarity evaluation means for evaluating a content similarity, with respect to topic words, between the documents included in the specified document set;
time stamp similarity evaluation means for evaluating a time similarity, with respect to topic words, between the documents included in the specified document set;
relationship extraction means for extracting relationships between documents based on the content similarity evaluated by the content similarity evaluation means and the time similarity evaluated by the time stamp similarity evaluation means;
centrality determination means for calculating, based on the relationships between documents extracted by the relationship extraction means, the centrality of each document as an index indicating how strongly the document is related to the other documents;
information analysis means for identifying, from the document set as a whole, the topic words contained in the document set, the document sets related to the topic words, and the roles of the documents within those document sets, based on the relationships between documents obtained by the relationship extraction means and the centrality of each document obtained by the centrality determination means; and
information output means for visualizing and outputting the document set based on the identified topic words, the document sets related to the topic words, and the roles of the documents within those document sets.

The document set analysis apparatus according to claim 1, wherein the information analysis means comprises:
graph structure construction means for expressing, based on the relationships between documents and the centrality of each document, the relationships between documents in two-dimensional coordinates and the centrality in a third coordinate relative to those two-dimensional coordinates, thereby representing a three-dimensional graph structure;
vertex node extraction means for extracting from the obtained graph structure, as a vertex, a node having a higher centrality than the other nodes connected to it by edges;
mountain-shaped node group specifying means for following the graph structure from the obtained vertex node in the direction of decreasing centrality and identifying a mountain composed of nodes; and
labeling means for assigning a label indicating a role to each node based on the identified vertex node and mountain-shaped node group.

The document set analysis apparatus according to claim 1 or 2,
wherein the time stamp similarity evaluation means evaluates the time similarity using a forgetting function that changes continuously with the time stamp difference between documents.

The document set analysis apparatus according to any one of claims 1 to 3,
wherein the relationship extraction means extracts relationships between documents based on the content similarity and on a time similarity obtained by applying a predetermined weighting to the time similarity evaluated by the time stamp similarity evaluation means.

A document set analysis method for identifying a role of a document based on a relationship between documents with time stamps in a document set managed by a document data management means,
A document set specifying step in which the document set specifying means specifies the document set based on the document set specifying condition input from the input means;
A content similarity evaluation means for evaluating a content similarity regarding a topic word between the documents included in the specified document set;
A time-stamp similarity evaluation unit that evaluates a time-similarity related to a topic word between documents included in the specified document set; and
A relationship extraction step for extracting a relationship between documents based on the content similarity evaluated by the content similarity evaluation step and the time similarity evaluated by the time stamp similarity evaluation step;
A centrality determination step of calculating, based on the relationship between documents extracted in the relationship extraction step, the centrality of each document as an index indicating the degree of relevance between that document and the other documents;
An information analysis step of identifying, from the entire document set, the topic words included in the document set, the document set related to each topic word, and the role of each document in that document set, based on the relationship between documents obtained in the relationship extraction step and the centrality of the individual documents obtained in the centrality determination step;
An information output step in which the information output means visualizes and outputs the document set based on the topic words identified from the entire document set, the document set related to each topic word, and the role of each document in that document set;
A document set analysis method comprising the steps above.
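
The centrality determination step can be pictured with a minimal sketch such as the following. Using weighted degree (the sum of relation strengths on a document's edges) as the centrality index is an assumption made here for illustration; the claim only describes centrality as an index of how strongly a document relates to the other documents.

```python
from collections import defaultdict


def weighted_degree_centrality(relations):
    """Assumed centrality: sum of relation strengths over a document's edges.

    `relations` maps (doc_a, doc_b) pairs to a relation strength.
    """
    centrality = defaultdict(float)
    for (doc_a, doc_b), strength in relations.items():
        centrality[doc_a] += strength
        centrality[doc_b] += strength
    return dict(centrality)
```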

The document set analysis method according to claim 5,
The information analysis step includes
A graph structure construction step of constructing a three-dimensional graph structure in which, based on the relationship between the documents and the centrality of each document, the relationship between the documents is expressed in two-dimensional coordinates and the centrality is expressed along a third coordinate axis relative to those two-dimensional coordinates;
A vertex node extraction step of extracting from the obtained graph structure, as a vertex, each node whose centrality is higher than that of every other node connected to it by an edge;
A mountain-shaped node group specifying step of following the graph structure from each obtained vertex node in the direction of decreasing centrality, thereby specifying a mountain composed of nodes;
A labeling step of assigning a label indicating a role to each node based on the identified vertex nodes and mountain nodes;
A document set analysis method characterized by comprising the steps above.
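
The vertex extraction, mountain tracing, and labeling steps can be sketched as below. The adjacency-dict representation, the strict "lower centrality" descent rule, the treatment of isolated nodes as peaks, and the label strings are all assumptions for this sketch and are not taken from the specification.

```python
def find_peaks(adj, centrality):
    """Nodes whose centrality exceeds that of every neighbor.

    Assumption: a node with no neighbors trivially counts as a peak.
    """
    return [n for n in adj
            if all(centrality[n] > centrality[m] for m in adj[n])]


def mountain(adj, centrality, peak):
    """Follow edges downhill (strictly decreasing centrality) from `peak`."""
    seen = {peak}
    stack = [peak]
    while stack:
        node = stack.pop()
        for neighbor in adj[node]:
            if neighbor not in seen and centrality[neighbor] < centrality[node]:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


def label_roles(adj, centrality):
    """Attach hypothetical role labels: 'peak' or 'slope of <peak>'."""
    roles = {}
    for peak in find_peaks(adj, centrality):
        for node in mountain(adj, centrality, peak):
            label = "peak" if node == peak else f"slope of {peak}"
            roles.setdefault(node, []).append(label)
    return roles
```

A node reachable downhill from two peaks receives two labels in this sketch, which is one possible way to represent a document that sits between two topics.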

The document set analysis method according to claim 5 or 6,
In the time stamp similarity evaluation step, the time similarity is evaluated using a forgetting function that continuously changes in accordance with a time stamp difference between the documents.

The document set analysis method according to any one of claims 5 to 7,
The relationship extraction step extracts the relationship between documents based on the content similarity and on a weighted time similarity obtained by applying a predetermined weighting to the time similarity evaluated in the time stamp similarity evaluation step.

A document set analysis program in which the document set analysis method according to claim 5 is described as a computer program executable by a computer.

A recording medium on which is recorded a computer program that describes the document set analysis method according to claim 5 in a form executable by a computer.