JP2015200961A

JP2015200961A - Semantic relationship extraction apparatus and program

Info

Publication number: JP2015200961A
Application number: JP2014078011A
Authority: JP
Inventors: 山田 一郎 (Ichiro Yamada); 望月 菊佳 (Kikuka Mochizuki); 宮崎 太郎 (Taro Miyazaki); 田中 英輝 (Hideki Tanaka)
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2014-04-04
Filing date: 2014-04-04
Publication date: 2015-11-12
Anticipated expiration: 2034-04-04
Also published as: JP6410455B2

Abstract

PROBLEM TO BE SOLVED: To provide a semantic relation extraction device and a program capable of estimating the relation between the two nouns of a noun pair appearing in a sentence. SOLUTION: A noun pair extraction unit extracts a noun pair contained in the same sentence in document data. A third noun extraction unit extracts a third noun from the sentence containing the noun pair extracted by the noun pair extraction unit. A seed pattern storage unit stores, as a seed pattern associated with a relation, a sentence-structure pattern in which a seed word pair (a noun pair known to have that relation) appears. A score calculation unit calculates a score indicating the degree to which the noun pair has the relation associated with the seed pattern, based on the noun pair extracted by the noun pair extraction unit, the third noun extracted by the third noun extraction unit, and the seed pattern stored in the seed pattern storage unit. A specific relation noun pair extraction unit extracts a noun pair having the predetermined relation based on the score.

Description

The present invention relates to a semantic relationship extraction apparatus and program for extracting semantic relationships between words from an input document.

In one field of natural language processing, attempts have been made to extract information from text data.
One conventional approach to automatically estimating the relationship between two nouns appearing in a text uses the pattern that connects the two nouns as a clue. Here, a pattern is a sentence pattern such as "A (noun) causes B (noun)".

Non-Patent Document 1 describes a technique that uses such patterns as clues and acquires semantic relationships between words on a large scale through a learning process.

Non-Patent Document 1: Stijn De Saeger，鳥澤健太郎，風間淳一，黒田航，村田真樹，「単語の意味クラスを用いたパターン学習による大規模な意味的関係獲得」，言語処理学会第１６回年次大会発表論文集，Ｄ４−２，２０１０年，ｐ．９３２−ｐ．９３５ (Stijn De Saeger, Kentaro Torisawa, Jun'ichi Kazama, Kow Kuroda, Masaki Murata, "Large-scale semantic relation acquisition by pattern learning using word semantic classes", Proceedings of the 16th Annual Meeting of the Association for Natural Language Processing, D4-2, 2010, pp. 932-935)

In the conventional technique described above, the relationship between words is estimated using the pattern connecting the two words as a clue. The number of distinct patterns available for this estimation is enormous. Because so many kinds of patterns exist, patterns that occur infrequently, and patterns in which a noun appears, are difficult to use as patterns for relationship estimation.

Here, a relationship between words (nouns) is, for example, a causal relationship, a hypernym-hyponym relationship, a relationship between a disease and its treatment, a relationship between a disease and its prevention, or a relationship between a place and its local specialty.

The present invention has been made in recognition of the above problem, and provides a semantic relationship extraction apparatus and program capable of estimating relationships between nouns that were previously difficult to estimate.

[1] To solve the above problem, a semantic relationship extraction apparatus according to one aspect of the present invention comprises: a noun pair extraction unit that extracts, from document data, a noun pair contained in the same sentence in the document data; a third noun extraction unit that extracts a third noun from the sentence containing the noun pair extracted by the noun pair extraction unit; a seed pattern storage unit that stores, as a seed pattern associated with a relationship, a sentence-structure pattern in which a seed word pair, which is a noun pair having that known relationship, appears; a score calculation unit that calculates, based on the noun pair extracted by the noun pair extraction unit, the third noun extracted by the third noun extraction unit, and the seed pattern stored in the seed pattern storage unit, a score indicating the degree to which the noun pair has the relationship associated with the seed pattern; and a specific relation noun pair extraction unit that extracts, based on the score calculated by the score calculation unit, the noun pair estimated to have the relationship.

[2] In one aspect of the present invention, in the above semantic relationship extraction apparatus, the seed pattern is represented as the path of clauses between the words belonging to the seed word pair, in tree-structured data representing the dependency relationships between clauses.

[3] In one aspect of the present invention, in the above semantic relationship extraction apparatus, the score calculation unit calculates a first score, which is a value representing the degree to which the pair of noun classes corresponding to the noun pair appears in the relationship, and calculates the score based on the first score.

[4] In one aspect of the present invention, in the above semantic relationship extraction apparatus, the score calculation unit calculates a second score, which is a value indicating how readily the pattern between one noun of the noun pair and the third noun appears among the seed patterns, and calculates the score based on the second score.

[5] In one aspect of the present invention, in the above semantic relationship extraction apparatus, the score calculation unit calculates a third score, which is a value representing how likely the pattern between one noun of the noun pair and the third noun is to express a hypernym-hyponym relationship or a parallel relationship, and calculates the score based on the third score.

[6] One aspect of the present invention is a program for causing a computer to function as: a noun pair extraction unit that extracts, from document data, a noun pair contained in the same sentence in the document data; a third noun extraction unit that extracts a third noun from the sentence containing the noun pair extracted by the noun pair extraction unit; a seed pattern storage unit that stores, as a seed pattern associated with a relationship, a sentence-structure pattern in which a seed word pair, which is a noun pair having that known relationship, appears; a score calculation unit that calculates, based on the noun pair extracted by the noun pair extraction unit, the third noun extracted by the third noun extraction unit, and the seed pattern stored in the seed pattern storage unit, a score indicating the degree to which the noun pair has the relationship associated with the seed pattern; and a specific relation noun pair extraction unit that extracts, based on the score calculated by the score calculation unit, the noun pair estimated to have the relationship.

According to the present invention, the kind of relationship that holds between two words can be estimated automatically. In other words, information can be extracted from document data.

FIG. 1 is a block diagram showing the schematic functional configuration of a semantic relationship extraction apparatus according to one embodiment of the present invention.
FIG. 2 is a flowchart showing the operating procedure by which the semantic relationship extraction apparatus of the embodiment extracts relationships between words.
FIG. 3 is a schematic diagram, relating to the processing of the noun pair extraction unit of the embodiment, showing examples of noun pairs and the dependency relationships in a sentence.
FIG. 4 is a flowchart showing the procedure by which the score calculation unit of the embodiment calculates the score used as an index for determining whether words have a specific relationship.
FIG. 5 is a schematic diagram showing an example of the seed word pairs used by the score calculation unit of the embodiment.

Next, an embodiment of the present invention is described with reference to the drawings.
When text is analyzed to evaluate the similarity between texts, words that are related to one another are useful. In this embodiment, new related word pairs are extracted by using three words that appear in the same sentence together with word pairs prepared in advance that are known to be related (seed word pairs).

FIG. 1 is a block diagram showing the schematic functional configuration of the semantic relationship extraction apparatus according to this embodiment. As illustrated, the semantic relationship extraction apparatus 1 comprises a document acquisition unit 10, a noun extraction unit 11, a noun pair extraction unit 12, a third noun extraction unit 13, a score calculation unit 14, a specific relation noun pair extraction unit 15, and a specific relation noun pair set output unit 16. The functions of these units are realized, for example, by appropriately combining logic operations implemented in electronic circuits. Each unit can store information as needed by using semiconductor memory or a magnetic hard disk drive (HDD).

The semantic relationship extraction apparatus 1 also includes a seed pattern storage unit (not shown). The seed pattern storage unit stores, as a seed pattern associated with a relationship, a sentence-structure pattern in which a seed word pair, which is a noun pair having that known relationship, appears. Seed patterns are described in detail later (see also FIG. 4).

The document acquisition unit 10 takes in the input document to be processed.
The noun extraction unit 11 splits the input document acquired by the document acquisition unit 10 into sentences and extracts the nouns from each sentence. To extract nouns, the noun extraction unit 11 performs morphological analysis using existing technology.
The noun pair extraction unit 12 extracts arbitrary pairs from the set of nouns obtained by the noun extraction unit 11; that is, it picks any two nouns from that set. In other words, the noun pair extraction unit 12 extracts, from the input document data, noun pairs contained in the same sentence of the document data.
The third noun extraction unit 13 extracts, from the same sentence, a third noun closely related to the noun pair extracted by the noun pair extraction unit 12. In other words, the third noun extraction unit 13 extracts a third noun from the sentence that contained the noun pair extracted by the noun pair extraction unit 12.

The score calculation unit 14 uses the patterns and relationships between the noun pair extracted by the noun pair extraction unit 12 and the third noun extracted by the third noun extraction unit 13 to calculate a score for determining whether the noun pair has the target relationship. In other words, the score calculation unit 14 calculates, based on the noun pair extracted by the noun pair extraction unit 12, the third noun extracted by the third noun extraction unit 13, and the seed patterns stored in the seed pattern storage unit, a score indicating the degree to which the noun pair has the relationship associated with the seed patterns.

Based on the scores calculated by the score calculation unit 14, the specific relation noun pair extraction unit 15 extracts noun pairs with high scores as noun pairs having the target relationship. In other words, the specific relation noun pair extraction unit 15 extracts the noun pairs estimated to have the specific relationship based on the scores calculated by the score calculation unit 14.

FIG. 2 is a flowchart showing the operating procedure of the semantic relationship extraction apparatus 1. The procedure for extracting semantic relationships between words is described below following this flowchart. As a precondition for this series of steps, the document acquisition unit 10 has already acquired document data consisting of a large number of sentences.

First, in step S1-1, the noun extraction unit 11 splits the document data acquired by the document acquisition unit 10 into sentences. It does so by using full stops and other symbols that mark sentence boundaries ("！", "？", and so on). The noun extraction unit 11 then performs morphological analysis on each sentence, for example with an existing morphological analyzer, and extracts the nouns that appear in each sentence. MeCab, for example, can be used as the morphological analyzer.
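The sentence-splitting half of step S1-1 can be sketched as follows. This is a minimal illustration, not the apparatus's implementation: the function name `split_sentences` and the exact set of boundary marks are assumptions, and the morphological analysis step (for which the text names MeCab) is deliberately left out.

```python
import re

# A run of non-terminator characters followed by any sentence-final marks.
# The mark set (Japanese full stop, full-width and ASCII "!" and "?") is an
# assumption made for this sketch.
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]*")

def split_sentences(text: str) -> list:
    """Split text into sentences, keeping each sentence's final mark."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]
```

Each returned sentence would then be passed to a morphological analyzer to obtain its nouns.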

Next, in step S1-2, the noun pair extraction unit 12 generates one noun pair by combining any two of the nouns extracted by the noun extraction unit 11. The noun pair extraction unit 12 generates noun pairs subject to syntactic constraints in the original sentence. Specifically, it refers to the dependency structure of the original sentence that contained the nouns. If the path from the two clauses containing the two nouns to their common dependency head contains a predetermined number (for example, three clauses) or more of clauses, the noun pair extraction unit 12 does not output those two nouns as a noun pair. A concrete example of this dependency-based selection of noun pairs is described later with reference to FIG. 3.

An existing syntactic parser can be used to segment a sentence into clauses and to analyze the dependency relationships between clauses. One example is the dependency parser CaboCha, which is also described in the literature [Taku Kudo, Yuji Matsumoto, "Dependency analysis by cascaded application of chunking", SIG-NL-142, 2001].

Next, in step S1-3, for one of the noun pairs extracted by the noun pair extraction unit 12, the third noun extraction unit 13 extracts one third noun that is closely related to that noun pair from the same sentence to which the noun pair belongs.

At this point, the third noun extraction unit 13 extracts a noun that satisfies one of the following two rules. The first rule: the noun is in a clause that has a direct dependency relationship, or an indirect dependency relationship, with the clause containing one of the words of the noun pair. The second rule: when there is a clause in a parallel relationship with the clause containing one of the words of the noun pair, the noun is in the clause that appears last in that parallel relationship. When the second rule is applicable, only the second rule is applied and the first rule is not. More than one noun may satisfy the first or the second rule. The third noun extraction unit 13 then generates a triple consisting of the two nouns of the original noun pair and a noun that satisfies the first or the second rule.
Each of these two cases is described later with reference to FIG. 3.

Next, in step S1-4, the score calculation unit 14 calculates a score for the noun triple generated by the third noun extraction unit 13. This score serves as an index for determining whether the noun pair that is part of the triple (the noun pair extracted by the noun pair extraction unit 12) has a particular relationship (for example, a causal relationship).

Next, in step S1-5, the semantic relationship extraction apparatus 1 determines whether another (unprocessed) third noun exists for the noun pair currently being processed. If one exists (step S1-5: YES), processing returns to step S1-3 to calculate a score for the other third noun. If none exists (step S1-5: NO), processing proceeds to step S1-6.

Next, in step S1-6, the semantic relationship extraction apparatus 1 determines whether all extracted noun pairs have been processed. If not all noun pairs have been processed (that is, an unprocessed noun pair remains; step S1-6: NO), processing returns to step S1-2 to handle another noun pair. If all noun pairs have been processed (step S1-6: YES), processing proceeds to step S1-7.

Next, in step S1-7, the semantic relationship extraction apparatus 1 determines whether all sentences have been processed. If not all sentences have been processed (that is, an unprocessed sentence remains; step S1-7: NO), processing returns to step S1-1 to handle another sentence. If all sentences have been processed (step S1-7: YES), processing proceeds to step S1-8.

Next, in step S1-8, the specific relation noun pair extraction unit 15 extracts the noun pairs that have the target relationship. Here, for a noun pair n_i, n_k, the largest value of scoreAll (described later; see the flowchart of FIG. 4 and its description) is taken. That is, the specific relation noun pair extraction unit 15 calculates the score Score(n_i, n_k) for the noun pair n_i, n_k by Equation (1) below.

In other words, Score(n_i, n_k) is a value indicating the degree to which the noun pair (n_i, n_k) has the specific relationship corresponding to the seed patterns SP. Seed patterns are described later.

The calculation of scoreAll(n_i, n_k, SP, n_j) in Equation (1) is described later with reference to FIG. 4.

The specific relation noun pair extraction unit 15 then determines that any noun pair whose score calculated by Equation (1) is at or above a predetermined threshold is a noun pair having the target relationship (the specific relationship).
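Equation (1) itself is not reproduced in this text. Going only by the surrounding description, a hedged reading is that Score(n_i, n_k) is the largest scoreAll(n_i, n_k, SP, n_j) over the third nouns n_j found for the pair, with the seed patterns SP held fixed. The sketch below implements that reading plus the threshold test; the function name `select_pairs` and the dictionary layout are hypothetical, and the scoreAll values are taken as given.

```python
from typing import Dict, Tuple

Triple = Tuple[str, str, str]  # (n_i, n_k, n_j): noun pair plus a third noun

def select_pairs(score_all: Dict[Triple, float],
                 threshold: float) -> Dict[Tuple[str, str], float]:
    """Keep each noun pair whose best triple score reaches the threshold."""
    best: Dict[Tuple[str, str], float] = {}
    for (n_i, n_k, _n_j), score in score_all.items():
        pair = (n_i, n_k)
        # Assumed form of Equation (1): maximum over the third nouns n_j.
        if pair not in best or score > best[pair]:
            best[pair] = score
    # "At or above a predetermined threshold", as stated in the text.
    return {pair: s for pair, s in best.items() if s >= threshold}
```

With scores of 0.8 and 0.3 for two third nouns of the same pair and a threshold of 0.5, the pair is kept with a score of 0.8.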

FIG. 3 relates to the processing of the noun pair extraction unit 12 and is a schematic diagram showing examples of noun pairs and the dependency relationships in a sentence. Each of (A) to (C) in the figure shows tree-structured data representing the dependency relationships of a sentence. In each tree, a node (a rectangular box) corresponds to a clause in the sentence, and an edge (an arrow) corresponds to a dependency relationship between clauses (solid line) or a parallel relationship between clauses (broken line).

In the example of FIG. 3(A), the path from the clause containing noun A and the clause containing noun B to their common dependency head (clause 4) contains four clauses (clause 1, clause 2, clause 3, clause 4). Because this path is longer than the reference length of 3, the pair of nouns A and B appearing in this sentence is not extracted as a noun pair and is excluded from subsequent processing. In this example, the clause in a direct dependency relationship from noun A is clause 1. The clause in a direct dependency relationship from noun B is clause 4. The clauses in an indirect dependency relationship from noun A are clause 2, clause 3, and clause 4.

In the example of FIG. 3(B), the path from the clause containing noun A and the clause containing noun B to their common dependency head (clause 3) contains three clauses (clause 1, clause 2, clause 3). Because the path length does not exceed the reference length of 3, the pair of nouns A and B appearing in this sentence is not excluded and is extracted as a noun pair.

In the example of FIG. 3(C), the path from the clause containing noun A and the clause containing noun B to their common dependency head (clause 4) contains three clauses (clause 2, clause 3, clause 4). The clause containing noun B, clause 1, and clause 2 are in a parallel relationship. In this case too, the path length of the dependency relationships does not exceed the reference length of 3, so the pair of nouns A and B appearing in this sentence is not excluded and is extracted as a noun pair. The path length counting both dependency and parallel relationships is 5, but the decision here is based only on whether the path length of the dependency relationships is within the reference length.
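The path-length check illustrated by FIG. 3(A) and FIG. 3(B) can be sketched as below. This is an illustration under stated assumptions: the tree is encoded as a hypothetical `heads` dictionary mapping each clause id to the clause it depends on, the clause ids (`cA`, `c1`, ...) are invented, the parallel links of FIG. 3(C) are not modelled, and the pair is rejected only when the path is longer than 3, following the worked examples.

```python
from typing import Dict, List, Optional

Heads = Dict[str, Optional[str]]  # clause id -> clause it depends on (None at the root)

def chain_to_root(clause: str, heads: Heads) -> List[str]:
    """Clauses reached by following dependency links upward, excluding `clause`."""
    chain: List[str] = []
    cur = heads.get(clause)
    while cur is not None and cur not in chain:
        chain.append(cur)
        cur = heads.get(cur)
    return chain

def path_to_common_head(a: str, b: str, heads: Heads) -> Optional[List[str]]:
    """Clauses on the way from a and b up to their first shared head, inclusive."""
    chain_a, chain_b = chain_to_root(a, heads), chain_to_root(b, heads)
    common = next((c for c in chain_a if c in chain_b), None)
    if common is None:
        return None  # no shared head; how to treat this case is an assumption
    path = chain_a[: chain_a.index(common) + 1]
    path += [c for c in chain_b[: chain_b.index(common)] if c not in path]
    # The clauses holding the two nouns themselves are not part of the path.
    return [c for c in path if c not in (a, b)]

def keep_pair(a: str, b: str, heads: Heads, max_len: int = 3) -> bool:
    """True when the dependency path is no longer than max_len clauses."""
    path = path_to_common_head(a, b, heads)
    return path is not None and len(path) <= max_len
```

Encoding FIG. 3(A) as `cA -> c1 -> c2 -> c3 -> c4` with `cB -> c4` gives a path of four clauses, so the pair is dropped; FIG. 3(B), with `cA -> c1 -> c2 -> c3` and `cB -> c3`, gives three clauses and the pair is kept.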

Next, the nouns that the third noun extraction unit 13 can extract are described with reference to FIG. 3.
In the example of FIG. 3(B), extracting every clause that has a direct or indirect dependency relationship with the clauses containing the nouns of the pair (noun A and noun B) yields clause 1, clause 2, and clause 3. The nouns contained in these clauses are combined with noun A and noun B to generate noun triples. That is, the third noun extraction unit 13 generates the triples (noun A, noun B, a noun in clause 1), (noun A, noun B, a noun in clause 2), and (noun A, noun B, a noun in clause 3).
In the example of FIG. 3(C), there are clauses in a parallel relationship with the clause containing noun B. The noun in clause 2, which is the last clause of that parallel relationship, is therefore combined with noun A and noun B to generate a noun triple. That is, the third noun extraction unit 13 generates the triple (noun A, noun B, a noun in clause 2).
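The two selection rules for the third noun can be sketched as follows, assuming a hypothetical encoding: `heads` maps a clause id to the clause it depends on, `parallel_next` maps a clause to the next clause in its parallel chain, and `nouns_in` lists the nouns of each clause. None of these names come from the source, and the tree used for FIG. 3(C) is only one encoding consistent with the description.

```python
from typing import Dict, List, Optional, Tuple

def ancestors(clause: str, heads: Dict[str, Optional[str]]) -> List[str]:
    """Clauses that `clause` depends on directly or indirectly, nearest first."""
    found: List[str] = []
    cur = heads.get(clause)
    while cur is not None and cur not in found:
        found.append(cur)
        cur = heads.get(cur)
    return found

def last_parallel(clause: str, parallel_next: Dict[str, str]) -> Optional[str]:
    """Final clause of the parallel chain starting at `clause`, or None if no chain."""
    if clause not in parallel_next:
        return None
    seen, cur = {clause}, clause
    while cur in parallel_next and parallel_next[cur] not in seen:
        cur = parallel_next[cur]
        seen.add(cur)
    return cur

def third_noun_triples(noun_a: str, clause_a: str, noun_b: str, clause_b: str,
                       heads: Dict[str, Optional[str]],
                       parallel_next: Dict[str, str],
                       nouns_in: Dict[str, List[str]]) -> List[Tuple[str, str, str]]:
    """Build (noun A, noun B, third noun) triples; the second rule overrides the first."""
    candidates: List[str] = []
    for clause in (clause_a, clause_b):  # second rule: last clause of a parallel chain
        last = last_parallel(clause, parallel_next)
        if last is not None and last not in candidates:
            candidates.append(last)
    if not candidates:  # first rule: direct or indirect dependency heads
        for clause in (clause_a, clause_b):
            for anc in ancestors(clause, heads):
                if anc not in candidates:
                    candidates.append(anc)
    return [(noun_a, noun_b, n) for c in candidates for n in nouns_in.get(c, [])]
```

For the FIG. 3(B) shape this yields one triple per noun in clauses 1 to 3; for the FIG. 3(C) shape only the noun in clause 2 is used.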

FIG. 4 is a flowchart showing the procedure by which the score calculation unit 14 calculates the score.

Before the processing shown in the figure is performed, seed word pairs are generated in advance as examples of word pairs having the specific relationship to be extracted. A seed word pair is data consisting of a first word, a second word, and information representing the relationship between the two words (a relation name). A relation name is, for example, "causal relationship". The seed word pairs themselves can be generated by the method described in the literature [De Saeger, Torisawa, Kazama, Kuroda, Murata, "Large-scale semantic relation acquisition by pattern learning using word semantic classes", Proceedings of the 16th Annual Meeting of the Association for Natural Language Processing, D4-2, 2010]. Alternatively, the seed word pair data may be created by having a person list word pairs that come to mind as having the specific relationship.
Examples of seed word pairs are described later with reference to FIG. 5.

Then, in step S3-1, the score calculation unit 14 reads the seed word pair data and generates seed patterns from it. The details of seed pattern generation are as follows.

Here, a pattern is, in the dependency structure of an input sentence (represented as a directed graph with a tree structure), the path from the clauses to which two nouns belong up to their common dependency head, not including the clauses to which the two nouns themselves belong. Among such patterns, those that express a specific relationship between nouns, or that express such a relationship to a high degree, are seed patterns. In other words, a seed pattern is represented as the path of clauses between the words belonging to a seed word pair, in tree-structured data representing the dependency relationships between clauses.

A concrete example of a pattern is as follows. As shown in FIG. 3(A), for example, the clause containing noun A depends on clause 1, clause 1 depends on clause 2, clause 2 depends on clause 3, and clause 3 depends on clause 4. The clause containing noun B also depends on clause 4. The common dependency head of noun A and noun B is therefore clause 4. Accordingly, the path from the clause containing noun A to the clause containing noun B is {clause 1, clause 2, clause 3, clause 4} (clauses 1 to 4 are connected in series), and this path is the pattern.

From the patterns extracted in this way, seed patterns are generated according to a predetermined criterion. For example, any one of the first to third seed pattern generation methods described below is used.

In the first seed pattern generation method, every pattern in which a seed word pair appears is generated as a seed pattern. Equation (2) below represents the set of seed patterns generated by the first seed pattern generation method.

Here, (n_i, pat, n_j) denotes the event that the noun pair (n_i, n_j) co-occurs with the pattern pat. In the first seed pattern generation method, as expressed by Equation (2), every pattern (pat) that co-occurs with a noun pair included in the seed word pairs (seedpair) is taken as a seed pattern.

In the second seed pattern generation method, only those patterns that appear frequently among the patterns in which seed word pairs appear are generated as seed patterns. Equation (3) below represents the set of seed patterns generated by the second seed pattern generation method.

Here, Freq(pat) is the number of occurrences of the pattern pat, and Freq(seedpair) is the number of occurrences of the seed word pairs. The numerator on the left side of the inequality in Equation (3) is the number of occurrences, in the input sentences, of the pattern pat co-occurring with the two words of a seed word pair. A pattern is taken as a seed pattern when this value divided by the number of occurrences of the seed word pairs exceeds a predetermined threshold (β).

In the third seed pattern generation method, among all patterns in which seed word pairs appear, those that co-occur with seed word pairs at a high rate are generated as seed patterns. Equation (4) below represents the set of seed patterns generated by the third seed pattern generation method.

Here, the numerator on the left side of the inequality in Equation (4) is the number of occurrences of the pattern pat co-occurring with the two words of a seed word pair. The denominator on the left side is the total number of occurrences of the pattern pat. A pattern is taken as a seed pattern when the value of the left side exceeds a predetermined threshold (γ).
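The three generation methods can be compared side by side in a small sketch. It follows the verbal descriptions of Equations (2) to (4) rather than the equations themselves, and it makes several assumptions: co-occurrence observations are given as a list of (n_i, pat, n_j) triples, Freq(seedpair) is taken to be the total number of observations whose noun pair is a seed word pair, and the function name `seed_patterns`, its parameters, and the sample data are all hypothetical.

```python
from collections import Counter


def seed_patterns(triples, seed_pairs, method=1, beta=0.0, gamma=0.0):
    """Select seed patterns from (n_i, pat, n_j) observations.

    method 1: every pattern that co-occurs with a seed word pair
    method 2: Freq(pat with seed pair) / Freq(seed pair) > beta
    method 3: Freq(pat with seed pair) / Freq(pat)       > gamma
    """
    pat_freq = Counter(pat for _, pat, _ in triples)
    seed_hits = [pat for a, pat, b in triples if (a, b) in seed_pairs]
    pat_seed_freq = Counter(seed_hits)
    seed_freq = len(seed_hits)  # assumption: total occurrences of seed word pairs

    if method == 1:
        return set(pat_seed_freq)
    if method == 2:
        return {p for p, f in pat_seed_freq.items() if f / seed_freq > beta}
    if method == 3:
        return {p for p, f in pat_seed_freq.items() if f / pat_freq[p] > gamma}
    raise ValueError("method must be 1, 2, or 3")


# Illustrative observations (not data from the specification).
TRIPLES = [
    ("cholesterol", "A causes B", "hypertension"),
    ("dextrin", "A causes B", "cavities"),
    ("cholesterol", "A and B", "hypertension"),
    ("tea", "A and B", "coffee"),
    ("tea", "A and B", "milk"),
    ("rain", "A causes B", "flood"),
]
SEED_PAIRS = {("cholesterol", "hypertension"), ("dextrin", "cavities")}

all_patterns = seed_patterns(TRIPLES, SEED_PAIRS, method=1)
frequent = seed_patterns(TRIPLES, SEED_PAIRS, method=2, beta=0.5)
precise = seed_patterns(TRIPLES, SEED_PAIRS, method=3, gamma=0.5)
```

With this sample data, the first method keeps both patterns, while the second and third methods with a threshold of 0.5 keep only "A causes B". The second method normalizes by how often seed pairs occur, and the third by how often the pattern itself occurs, so the third penalizes patterns that are common regardless of the relationship.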

Using any one of the above seed pattern generation methods, the score calculation unit 14 generates a set containing a predetermined number of seed patterns. Alternatively, the set of seed patterns may be created entirely by hand, without using these methods. Storing the generated seed pattern set in a storage device (the seed pattern storage unit) allows the same set to be shared across the score calculations for every noun triple. The seed pattern storage unit is implemented with a semiconductor memory, a hard disk drive, or the like.

In the following processing, the score calculation unit 14 uses the generated seed pattern set to calculate a score that serves as an index for determining whether a given word pair has the target relationship. Here, the word pair is denoted (n_i, n_k), and a third noun closely related to this noun pair is denoted n_j.
The following describes the case where n_j is in a direct or indirect dependency relation from n_k, or where n_j is in a parallel relationship with n_k. The case where n_j is in a direct or indirect dependency relation from n_i, or is in a parallel relationship with n_i, is handled in the same way: the score is obtained by swapping n_i and n_k in the equations below.

Next, in step S3-2, the score calculation unit 14 calculates score 1 (score1, the first score). Score 1 expresses how readily the classes (c_i, c_k), to which the two words (n_i, n_k) respectively belong, appear in the seed patterns that express the target relationship. Here, a class is a hypernym or superordinate concept of a word. For example, the word “hypertension” corresponds to the class “disease”. A word may correspond to more than one class. The correspondence between words and classes can be extracted from document data using existing techniques or obtained from thesaurus data, and it is held in advance as data in the semantic relationship extraction apparatus 1. Specifically, the score calculation unit 14 calculates score 1 by Equation (5) below.

Here, SP is the seed pattern set, (n_i, n_k) is the word pair, and c_i and c_k are the classes to which the words n_i and n_k respectively belong. The numerator on the right side of Equation (5) is the number of co-occurrences of words belonging to the classes (c_i, c_k) with SP. The denominator on the right side is the number of co-occurrences of words belonging to the classes (c_i, c_k) with any pattern.

In other words, score 1 is a value representing the degree (likelihood, probability) to which the class pair (c_i, c_k) corresponding to the noun pair (n_i, n_k) appears in the specific relationship (target relationship).
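As a rough illustration of the ratio described for Equation (5), the following sketch counts how often word pairs of a given class pair co-occur with a seed pattern and divides by how often they co-occur with any pattern. The representation of observations as (n_i, pat, n_k) triples, the word-to-classes dictionary, the function name `score1`, and the sample data are assumptions made for this example.

```python
def score1(triples, word_classes, seed_pats, c_i, c_k):
    """Share of class-pair co-occurrences that fall on a seed pattern."""
    in_class = [
        pat
        for a, pat, b in triples
        if c_i in word_classes.get(a, ()) and c_k in word_classes.get(b, ())
    ]
    if not in_class:
        return 0.0  # assumption: an unseen class pair scores zero
    return sum(pat in seed_pats for pat in in_class) / len(in_class)


# Illustrative data (not from the specification).
S1_CLASSES = {
    "cholesterol": {"substance"},
    "dextrin": {"substance"},
    "hypertension": {"disease"},
    "cavities": {"disease"},
}
S1_TRIPLES = [
    ("cholesterol", "A causes B", "hypertension"),
    ("dextrin", "A causes B", "cavities"),
    ("cholesterol", "A and B", "hypertension"),
    ("cholesterol", "A causes B", "cavities"),
]
s1_value = score1(S1_TRIPLES, S1_CLASSES, {"A causes B"}, "substance", "disease")
```

Here three of the four (substance, disease) observations use the seed pattern, so `s1_value` is 0.75.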

Next, in step S3-3, the score calculation unit 14 calculates score 2 (score2, the second score). Score 2 expresses how readily the pattern (with class restriction) between the two words (n_i, n_j) appears among the seed patterns. Specifically, the score calculation unit 14 calculates score 2 by Equation (6) below.

Here, for the noun triple (n_i, n_j, n_k) being processed, I(P_ij, c_i, c_k) denotes the set of noun pairs that belong to the classes c_i, c_k (the classes of n_i and n_k, respectively) and co-occur with the pattern P_ij between the words n_i and n_j. I(SP(c_i, c_k)) denotes the set of noun pairs that belong to the classes c_i and c_k and co-occur with any of the seed patterns. I(P_ij) is the set of noun pairs that co-occur with the pattern P_ij between n_i and n_j. I(SP) is the set of noun pairs that co-occur with any of the seed patterns.

In other words, score 2 is the value calculated as {(A) × (B)} / {(C) × (D)} using the following (A) to (D).
(A) The number of elements in the intersection of the set of noun pairs that belong to the classes c_i, c_k and co-occur with the pattern P_ij between the words n_i and n_j, and the set of noun pairs that belong to the classes c_i and c_k and co-occur with any of the seed patterns.
(B) The number of elements in the intersection of the set of noun pairs that co-occur with the pattern P_ij between the words n_i and n_j, and the set of noun pairs that co-occur with any of the seed patterns.
(C) The number of elements in the union of the set of noun pairs that belong to the classes c_i, c_k and co-occur with the pattern P_ij between the words n_i and n_j, and the set of noun pairs that belong to the classes c_i and c_k and co-occur with any of the seed patterns.
(D) The number of elements in the union of the set of noun pairs that co-occur with the pattern P_ij between the words n_i and n_j, and the set of noun pairs that co-occur with any of the seed patterns.

In other words, score 2 is a value representing the degree (likelihood, probability) to which the pattern P_ij between the two words (n_i, n_j) in the noun triple readily appears among the seed patterns.

Next, in step S3-4, the score calculation unit 14 calculates score 3 (score3, the third score). Score 3 expresses how strongly the two words (n_k, n_j) and the pattern between them resemble a hypernym-hyponym relationship, and how strongly they resemble a parallel relationship. Specifically, the score calculation unit 14 calculates score 3 by Equation (7) below.

Here, probNoun_hyp(n_k, n_j) is the probability that the two words (n_k, n_j) are in a hypernym-hyponym relationship; it can be calculated from their co-occurrence probability with hypernym-hyponym patterns given in advance. probPattern_hyp(p_kj) is the probability expressing how strongly the pattern p_kj between the two words (n_k, n_j) indicates a hypernym-hyponym relationship; it can be calculated from its co-occurrence probability with hypernym-hyponym word pairs given in advance. probNoun_para(n_k, n_j) is the probability that the two words (n_k, n_j) are in a parallel relationship; it can be calculated from their co-occurrence frequency with parallel-relationship patterns. probPattern_para(p_kj) expresses how strongly the pattern between the two words (n_k, n_j) indicates a parallel relationship; it can be obtained by extracting sibling word pairs from the set of hypernym-hyponym word pairs and using the co-occurrence frequency with those sibling word pairs.

The reason for using the hypernym-hyponym relationship and the parallel relationship to calculate score 3 is that, when a word in the original word pair and the third noun are in a hypernym-hyponym or parallel relationship, the third noun plays an important role in identifying the relationship of the word pair.

In other words, score 3 is the larger of two values: the hypernym-hyponym likelihood of the two words (n_k, n_j) and the pattern p_kj between them, and the parallel-relationship likelihood of the two words (n_k, n_j) and the pattern p_kj between them.
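The "larger of the two" rule can be illustrated as below. Note a significant assumption: the text does not state how the noun probability and the pattern probability are combined within each relationship type, so this sketch multiplies them. That choice, the function name `score3`, and the sample probabilities are all hypothetical.

```python
def score3(prob_noun_hyp, prob_pattern_hyp, prob_noun_para, prob_pattern_para):
    """Larger of the hypernym-hyponym likelihood and the parallel likelihood.

    Assumption: each likelihood is the product of the noun-pair probability
    and the pattern probability; Equation (7) may combine them differently.
    """
    hyp_likelihood = prob_noun_hyp * prob_pattern_hyp
    para_likelihood = prob_noun_para * prob_pattern_para
    return max(hyp_likelihood, para_likelihood)


# Illustrative probabilities for a pair such as ("cholesterol", "substance").
s3_value = score3(prob_noun_hyp=0.8, prob_pattern_hyp=0.5,
                  prob_noun_para=0.25, prob_pattern_para=0.5)
```

With these inputs the hypernym-hyponym side (0.4) exceeds the parallel side (0.125), so it determines the score.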

Next, in step S3-5, the score calculation unit 14 calculates the overall score (ScoreAll). ScoreAll is the value obtained by taking the product of score 1, score 2, and score 3 calculated in the preceding steps. Specifically, the score calculation unit 14 calculates the value of ScoreAll by Equation (8) below.

In other words, ScoreAll(n_i, n_k, SP, n_j) is a value representing the degree to which the noun pair (n_i, n_k) has the specific relationship corresponding to the seed pattern set SP, given the noun pair (n_i, n_k) and the third noun n_j.
As Equation (8) expresses, ScoreAll(n_i, n_k, SP, n_j) is the product of score 1, score 2, and score 3. More precisely, it is the value of that product for the class pair (c_i, c_k), chosen from among the class pairs to which the noun pair (n_i, n_k) belongs, that maximizes the product. This presupposes that each of the class sets class(n_i) and class(n_k), to which the nouns n_i and n_k respectively belong, is a set that may have more than one element.
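The maximization over class pairs can be sketched as follows. The sketch assumes that score 1 and score 2 are supplied as functions of a class pair and that score 3 does not depend on the classes, which is how the preceding text describes them. The name `score_all` and the lookup tables are hypothetical.

```python
from itertools import product


def score_all(classes_i, classes_k, score1_fn, score2_fn, score3_value):
    """Maximum of score1 * score2 * score3 over all class pairs (c_i, c_k)."""
    return max(
        (
            score1_fn(c_i, c_k) * score2_fn(c_i, c_k) * score3_value
            for c_i, c_k in product(classes_i, classes_k)
        ),
        default=0.0,  # assumption: no class pair means a score of zero
    )


# Illustrative per-class-pair scores (not from the specification).
S1_TABLE = {("substance", "disease"): 0.75, ("food", "disease"): 0.5}
S2_TABLE = {("substance", "disease"): 0.5, ("food", "disease"): 0.25}

overall = score_all(
    classes_i={"substance", "food"},   # class(n_i) may contain several classes
    classes_k={"disease"},             # class(n_k)
    score1_fn=lambda ci, ck: S1_TABLE.get((ci, ck), 0.0),
    score2_fn=lambda ci, ck: S2_TABLE.get((ci, ck), 0.0),
    score3_value=0.4,
)
```

Here the pair (substance, disease) gives 0.75 × 0.5 × 0.4 = 0.15, which is larger than the value for (food, disease), so it determines the overall score.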

FIG. 5 is a schematic diagram showing an example of the seed word pairs mentioned in the description of the processing by the score calculation unit 14. As illustrated, the seed word pair data contains a plurality of word pairs and is represented, for example, as tabular data. The data has the data items (columns) word 1, relation name, and word 2, and consists of a plurality of rows. For example, the first row indicates that the word “cholesterol” (word 1) and the word “hypertension” (word 2) have the relation name “causal relationship”. The second row indicates that the word “dextrin” (word 1) and the word “cavities” (word 2) have the relation name “causal relationship”. In the illustrated example, word 1 corresponds to the cause and word 2 to the result, but the direction of the relationship may equally be reversed.
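One simple way to hold a table like the one in FIG. 5 is a list of three-field records. The record type `SeedWordPair`, the helper `pairs_for`, and the exact spelling of the relation name are assumptions for illustration; the two rows are the ones described above.

```python
from typing import NamedTuple


class SeedWordPair(NamedTuple):
    word1: str      # cause, in the illustrated example
    relation: str   # relation name
    word2: str      # result, in the illustrated example


SEED_TABLE = [
    SeedWordPair("cholesterol", "causal relationship", "hypertension"),
    SeedWordPair("dextrin", "causal relationship", "cavities"),
]


def pairs_for(table, relation):
    """Return the (word 1, word 2) pairs registered under one relation name."""
    return {(row.word1, row.word2) for row in table if row.relation == relation}


causal_pairs = pairs_for(SEED_TABLE, "causal relationship")
```

Keeping the relation name in each row lets a single table serve several target relationships, with one seed pattern set generated per relation name.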

[Processing example and effects of the configuration of this embodiment]
A brief example of processing by the semantic relationship extraction apparatus 1 follows. Consider the sentence “Substances such as cholesterol that cause hypertension ...”. Dependency analysis of this part of the sentence yields a tree of phrases containing the dependency chain [hypertension] - [cause] - [substances] and the dependency relation [such as cholesterol] - [substances]. An example of a noun pair extracted from such a sentence is (hypertension, cholesterol), and an example of the extracted third noun is “substance”. One of the patterns obtained here is [phrase containing noun A] - “cause” - [phrase containing noun B]. Because such a pattern expresses a causal relationship well, it is included in the seed patterns corresponding to the specific relationship of causality. As described above, such seed patterns are obtained from seed word pairs.

If, instead, the third noun were not used and the pattern (the path of phrases to the common dependency target) were obtained from the noun pair alone, the result would be (phrase containing the noun “hypertension”) - “cause” - “substances” - (phrase containing the noun “cholesterol”).

Comparing the pattern that uses the third noun with the pattern that does not, the former is shorter and is better suited to statistically calculating the degree of co-occurrence with a specific relationship (causality, in this example). One reason is that the pattern obtained without extracting the third noun, (phrase containing the noun “hypertension”) - “cause” - “substances” - (phrase containing the noun “cholesterol”), contains the noun “substance” in the middle of the path, and this increases the number of distinct pattern types.

The semantic relationship extraction apparatus 1 of this embodiment includes the third noun extraction unit 13, and the score calculation unit 14 calculates scores using the pattern between the extracted third noun and a word included in the original word pair. This configuration yields better results than the prior art. In addition, the noun “cholesterol” in the noun pair and the third noun “substance” are in a hypernym-hyponym relationship, and for such a third noun the value of score 3 described above is high.

In other words, the semantic relationship extraction apparatus 1 has a technical configuration that estimates the relationship of a noun pair (that is, calculates the degree to which the pair has a specific (target) relationship) by using information such as the hypernym-hyponym relationship between a noun in the noun pair and the third noun (for example, that the hypernym of cholesterol is substance).

The functions of the semantic relationship extraction apparatus in the above embodiment, or some of those functions, may be implemented by a computer. In that case, a program for implementing the functions may be recorded on a computer-readable recording medium, and the program recorded on the medium may be read into a computer system and executed. The term “computer system” here includes an OS and hardware such as peripheral devices. The term “computer-readable recording medium” refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into a computer system. The term “computer-readable recording medium” may further include media that hold a program dynamically for a short time, such as a communication line used when the program is transmitted over a network such as the Internet or over a communication line such as a telephone line, as well as media that hold the program for a certain period, such as volatile memory inside the computer system acting as the server or client in that case. The program may implement only some of the functions described above, or may implement those functions in combination with a program already recorded in the computer system.

Although an embodiment has been described above, the present invention can also be implemented in the following modified example.

Modified example: In the above embodiment, the value of ScoreAll(n_i, n_k, SP, n_j) in Equation (8) is calculated using the product of score 1, score 2, and score 3. In this modified example, any one of the following values (A) to (F) is used in place of that product.
(A) The product of score 1 and score 2 only
(B) The product of score 2 and score 3 only
(C) The product of score 1 and score 3 only
(D) The value of score 1 itself
(E) The value of score 2 itself
(F) The value of score 3 itself
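The options (A) to (F), together with the full product of the main embodiment, all amount to multiplying a chosen subset of the three scores. A minimal sketch, assuming the scores are passed in a dictionary and using the hypothetical function name `combined_score`:

```python
from math import prod


def combined_score(scores, use=("score1", "score2", "score3")):
    """Product of the selected scores; the default selects all three."""
    return prod(scores[name] for name in use)


# Illustrative values (not from the specification).
SCORES = {"score1": 0.75, "score2": 0.5, "score3": 0.4}

option_a = combined_score(SCORES, use=("score1", "score2"))   # option (A)
option_f = combined_score(SCORES, use=("score3",))            # option (F)
```

Selecting a single name reproduces options (D) to (F), and selecting two names reproduces options (A) to (C).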

An embodiment of the present invention has been described in detail above with reference to the drawings. The specific configuration is not limited to this embodiment, however, and also covers designs and the like that do not depart from the gist of the invention.

The present invention can be used to estimate relationships between documents. For example, when the documents are web pages (web documents), the invention can be used to estimate relationships between web pages. If documents (for example, program summary texts in an EPG) are associated with broadcast programs, the invention can be used to estimate relationships between broadcast programs. It can therefore be used to implement functions such as an associative search that presents broadcast programs related to a given broadcast program (for example, the program a viewer is currently watching). The invention is thus also applicable to on-demand distribution services for content such as video.

DESCRIPTION OF SYMBOLS
1 Semantic relationship extraction apparatus
10 Document acquisition unit
11 Noun extraction unit
12 Noun pair extraction unit
13 Third noun extraction unit
14 Score calculation unit
15 Specific relation noun pair extraction unit
16 Specific relation noun pair set output unit

Claims

A semantic relationship extraction apparatus comprising:
a noun pair extraction unit that extracts, based on document data, a noun pair contained in the same sentence in the document data;
a third noun extraction unit that extracts a third noun from the sentence containing the noun pair extracted by the noun pair extraction unit;
a seed pattern storage unit that stores, as a seed pattern associated with a relationship, a pattern of a sentence structure in which a seed word pair, which is a noun pair known to have that relationship, appears;
a score calculation unit that calculates, based on the noun pair extracted by the noun pair extraction unit, the third noun extracted by the third noun extraction unit, and the seed pattern stored in the seed pattern storage unit, a score indicating the degree to which the noun pair has the relationship associated with the seed pattern; and
a specific relation noun pair extraction unit that extracts, based on the score calculated by the score calculation unit, the noun pair estimated to have the relationship.

The semantic relationship extraction apparatus according to claim 1, wherein
the seed pattern is represented as a path of phrases between the words belonging to the seed word pair in tree-structured data representing dependency relations between phrases.

The semantic relationship extraction apparatus according to claim 1, wherein
the score calculation unit calculates a first score, which is a value representing the degree to which a pair of noun classes corresponding to the noun pair appears in the relationship, and calculates the score based on the first score.

The semantic relationship extraction apparatus according to any one of claims 1 to 3, wherein
the score calculation unit calculates a second score, which is a value representing the degree to which a pattern between one noun included in the noun pair and the third noun readily appears among the seed patterns, and calculates the score based on the second score.

The semantic relationship extraction apparatus according to any one of claims 1 to 4, wherein
the score calculation unit calculates a third score, which is a value representing the hypernym-hyponym likelihood or the parallel-relationship likelihood of a pattern between one noun included in the noun pair and the third noun, and calculates the score based on the third score.

A program for causing a computer to function as:
a noun pair extraction unit that extracts, based on document data, a noun pair contained in the same sentence in the document data;
a third noun extraction unit that extracts a third noun from the sentence containing the noun pair extracted by the noun pair extraction unit;
a seed pattern storage unit that stores, as a seed pattern associated with a relationship, a pattern of a sentence structure in which a seed word pair, which is a noun pair known to have that relationship, appears;
a score calculation unit that calculates, based on the noun pair extracted by the noun pair extraction unit, the third noun extracted by the third noun extraction unit, and the seed pattern stored in the seed pattern storage unit, a score indicating the degree to which the noun pair has the relationship associated with the seed pattern; and
a specific relation noun pair extraction unit that extracts, based on the score calculated by the score calculation unit, the noun pair estimated to have the relationship.