JP2017123168A

JP2017123168A - Method for making entity mention in short text associated with entity in semantic knowledge base, and device

Info

Publication number: JP2017123168A
Application number: JP2016255039A
Authority: JP
Inventors: ミアオ・チンリアン; Qingliang Miao; 遥孟; Yao Meng; 双永宋; Zhuang Yong Song
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2016-01-05
Filing date: 2016-12-28
Publication date: 2017-07-13
Also published as: CN106940702A

Abstract

[Problem] To provide a method and device for associating an entity mention in a short text with an entity in a semantic knowledge base. [Solution] The method includes: selecting, from the entities in the semantic knowledge base, candidate entities related to the entity mention in the short text; determining the categories to which the candidate entities and the entity mention belong; determining the most discriminative attribute set of the category to which the entity mention belongs; calculating, based on that attribute set, the similarity between each candidate entity belonging to the category and the entity mention; and selecting, based on the similarity, a candidate entity to associate with the entity mention. [Selected drawing] FIG. 1

Description

The present invention relates to the field of information processing, and specifically to a method and device for associating an entity mention in a short text with an entity in a semantic knowledge base.

In recent years, with the rapid development of Internet semantic knowledge bases (SKBs) such as DBpedia and of short-text platforms such as microblogging and the short message service (SMS), the question of how to associate an "entity mention" in a short text with an entity in an Internet semantic knowledge base, and thereby give the content of the short text semantic meaning, has become an open problem in the field of language information processing.

Giving short-text content semantic meaning lets users and computers search and use the semantic information of short texts efficiently, and provides the foundation needed for semantic analysis of short-text data. It also allows Internet knowledge bases to be extended in real time, improving their capacity for dynamic updating.

The present invention therefore aims to accurately associate an entity mention in a short text with an entity in a semantic knowledge base.

The following presents a brief summary of the invention in order to provide a basic understanding of some of its aspects. This summary is not an exhaustive overview of the invention. It is not intended to identify key or critical parts of the invention, nor to limit its scope. Its sole purpose is to present some concepts in simplified form as a prelude to the more detailed description given later.

An object of the present invention is to provide a method and device capable of accurately associating an entity mention in a short text with an entity in a semantic knowledge base.

To achieve the above object, one aspect of the present invention provides a method for associating an entity mention in a short text with an entity in a semantic knowledge base, the method including: selecting, from the entities in the semantic knowledge base, candidate entities related to the entity mention in the short text; determining the categories to which the candidate entities and the entity mention belong; determining the most discriminative attribute set of the category to which the entity mention belongs; calculating, based on that attribute set, the similarity between each candidate entity belonging to the category and the entity mention; and selecting, based on the similarity, a candidate entity to associate with the entity mention.

Another aspect of the present invention provides a device for associating an entity mention in a short text with an entity in a semantic knowledge base, the device including: candidate entity selection means for selecting, from the entities in the semantic knowledge base, candidate entities related to the entity mention in the short text; category determination means for determining the categories to which the candidate entities and the entity mention belong; attribute set determination means for determining the most discriminative attribute set of the category to which the entity mention belongs; similarity calculation means for calculating, based on that attribute set, the similarity between each candidate entity belonging to the category and the entity mention; and association means for selecting, based on the similarity, a candidate entity to associate with the entity mention.

Another aspect of the present invention further provides a storage medium. The storage medium contains machine-readable program code which, when executed on an information processing apparatus, causes the information processing apparatus to perform the above method of the present invention.

Another aspect of the present invention further provides a program product. The program product contains machine-executable instructions which, when executed on an information processing apparatus, cause the information processing apparatus to perform the above method of the present invention.

The above and other objects, features and advantages of the embodiments of the present invention will become clearer from the following detailed description with reference to the drawings. The units in the drawings merely illustrate the principles of the present invention. In the drawings, identical or similar technical features or units are denoted by identical or similar reference signs.

FIG. 1 is a flowchart of a method for associating an entity mention in a short text with an entity in a semantic knowledge base according to an embodiment of the present invention.
FIG. 2 is a flowchart of a first method of determining the categories to which candidate entities and an entity mention belong.
FIG. 3 is a flowchart of a second method of determining the categories to which candidate entities and an entity mention belong.
FIG. 4 is a block diagram showing the configuration of a device for associating an entity mention in a short text with an entity in a semantic knowledge base according to an embodiment of the present invention.
FIG. 5 is a block diagram showing the configuration of a computer for implementing the method and device according to the embodiments of the present invention.

Exemplary embodiments of the present invention are described in detail below with reference to the drawings. For clarity and brevity, the specification does not describe every feature of an actual implementation. In developing any actual implementation, implementation-specific decisions may be made to achieve the developer's particular goals, for example to comply with system-related and business-related constraints, and these may vary from one implementation to another. Moreover, although such development work may be complex and time-consuming, it is a routine undertaking for those skilled in the art having the benefit of this disclosure.

To avoid obscuring the invention with unnecessary detail, the drawings show only the device components or processing steps closely related to the invention and omit details that have little to do with it. Elements and features shown in one drawing or embodiment of the invention may be combined with elements and features shown in one or more other drawings or embodiments.

The flow of the method for associating an entity mention in a short text with an entity in a semantic knowledge base according to an embodiment of the present invention is described below with reference to FIG. 1.

FIG. 1 is a flowchart of the method for associating an entity mention in a short text with an entity in a semantic knowledge base according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps. Candidate entities related to the entity mention in the short text are selected from the entities in the semantic knowledge base (step S1). The categories to which the candidate entities and the entity mention belong are determined (step S2). The most discriminative attribute set of the category to which the entity mention belongs is determined (step S3). Based on that attribute set, the similarity between each candidate entity belonging to the category and the entity mention is calculated (step S4). Based on the similarity, a candidate entity is selected and associated with the entity mention (step S5).

In step S1, candidate entities related to the entity mention in the short text are selected.

That is, potential association targets are first gathered as candidates, and are then filtered in the subsequent steps.

This can be implemented in several ways. For example, an entity in the semantic knowledge base whose entity name is identical to the entity mention (a same-name entity) may be selected as a candidate entity. The knowledge base may be a specific knowledge base suited to the scenario in which the method is applied, for example an Internet semantic knowledge base such as Wikipedia, DBpedia or Baidu Baike, but is not limited to these. For example, when the entity mention is "apple", several candidate entities, such as "apple (the fruit)" and "Apple Inc. of the United States", can be found in an Internet semantic knowledge base.

An entity in the semantic knowledge base that has an equivalence relationship with a same-name entity may also be selected as a candidate entity. Here, equivalence relationships include redirect relationships and alias relationships. For example, for the entity mention "IBM", the content found in the knowledge base includes a redirect link to "International Business Machines Corporation", and that entity may be taken as a candidate entity for the entity mention "IBM".

Alternatively, an entity in the semantic knowledge base that is the link target of anchor text equal to the entity mention may be selected as a candidate entity. For the entity mention "Washington", if clicking the anchor text "Washington" on web pages leads to the encyclopedia entries "Washington, the U.S. capital" and "Washington, the American", then both entries may be taken as candidate entities for the entity mention "Washington".

Alternatively, an entity in the semantic knowledge base that is related to the entity mention through an encyclopedia disambiguation page may be selected as a candidate entity. For example, for the entity mention "Apple", a disambiguation page is found in the knowledge base, and "Apple Inc.", "Apple Daily", "Apple (film)" and so on may all be taken as candidate entities for the entity mention "Apple".
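The candidate sources described so far (same-name entities, redirects and aliases, anchor-text link targets, disambiguation pages) can be pictured as lookups whose results are merged. The following is a minimal sketch under stated assumptions: `ToyKnowledgeBase`, its four dictionaries and `select_candidates` are hypothetical names introduced here for illustration, and the toy entries merely echo the examples in the text. A real knowledge base would be queried through its own interface.

```python
from dataclasses import dataclass, field


@dataclass
class ToyKnowledgeBase:
    """Hypothetical in-memory stand-in for a semantic knowledge base."""
    names: dict = field(default_factory=dict)           # entity name -> entities
    redirects: dict = field(default_factory=dict)       # redirect/alias -> entities
    anchors: dict = field(default_factory=dict)         # anchor text -> link targets
    disambiguation: dict = field(default_factory=dict)  # title -> listed entities


def select_candidates(mention, kb):
    """Merge the results of all four lookups, keeping first-seen order."""
    key = mention.strip().lower()
    candidates = []
    for source in (kb.names, kb.redirects, kb.anchors, kb.disambiguation):
        for entity in source.get(key, []):
            if entity not in candidates:
                candidates.append(entity)
    return candidates


# Toy data only; keys are lower-cased surface forms.
kb = ToyKnowledgeBase(
    names={"apple": ["Apple (fruit)"]},
    redirects={"ibm": ["International Business Machines Corporation"]},
    anchors={"washington": ["Washington (U.S. capital)", "Washington (American)"]},
    disambiguation={"apple": ["Apple Inc.", "Apple Daily", "Apple (film)"]},
)

print(select_candidates("Apple", kb))
print(select_candidates("IBM", kb))
```

The merge deliberately favors recall: step S1 only gathers candidates, and the later steps do the filtering.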

Alternatively, an entity in the semantic knowledge base whose entity name has an anaphoric relationship with the entity mention within the entity's description text may be selected as a candidate entity. Likewise, an entity in the semantic knowledge base whose entity name has an anaphoric relationship with the entity mention within the text containing the entity mention may be selected as a candidate entity.

Here, whether an anaphoric relationship exists may be decided by checking whether the entity name and the entity mention match a specific anaphoric pattern in the entity's description text or in the text containing the entity mention. It may also be decided by performing text analysis on the entity's description text in the semantic knowledge base or on the text containing the entity mention. The text analysis includes anaphora resolution.

For example, the following pairs match specific anaphoric patterns and may be determined to have an anaphoric relationship, by pattern matching or by text analysis such as anaphora resolution: the content before the parentheses and the content inside them in the short texts "IBM (International Business Machines Corporation)" and "Agricultural Bank of China (ABC)"; the content before and after "is also called" in "A calculating machine is also called a computer"; and "Guangzhou Evergrande" and "Guangzhou Evergrande football club team" in "Beijing time, March 12: in the second round of the 2013 AFC Champions League group stage, the Guangzhou Evergrande football club team plays away against Jeonbuk Hyundai; Guangzhou Evergrande's starting lineup has been announced".
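As an illustration of pattern-based detection, the two surface patterns named above ("X (Y)" and "X is also called Y") can be approximated with regular expressions. This is only a sketch: the pattern list and the function name `anaphoric_pairs` are assumptions introduced here, and the third example in the text (a short name and a long name in the same passage) would need real anaphora resolution rather than a regex.

```python
import re

# Hypothetical patterns covering two shapes named in the text.
PATTERNS = [
    # "X (Y)": content before the parentheses and content inside them
    re.compile(r"(?P<a>[^()]+?)\s*\((?P<b>[^()]+)\)"),
    # "X is also called Y"
    re.compile(r"(?P<a>.+?)\s+is also called\s+(?P<b>.+?)\.?$"),
]


def anaphoric_pairs(text):
    """Return (expression, co-referring expression) pairs matched in text."""
    pairs = []
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            pairs.append((match.group("a").strip(), match.group("b").strip()))
    return pairs


print(anaphoric_pairs("IBM (International Business Machines Corporation)"))
print(anaphoric_pairs("A calculating machine is also called a computer."))
```

Each returned pair can then be checked against the knowledge base: if one side equals the entity mention and the other is an entity name, that entity becomes a candidate.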

In step S2, the categories to which the candidate entities and the entity mention belong are determined. Two exemplary approaches are described below, but the present invention is not limited to them.

The categories may follow the entity classification scheme of an existing knowledge base. For example, entities may be divided into categories such as organization, person, place and building. Because at least some entities carry type information in the knowledge base, the category of an entity mention, or of a candidate entity without type information, can be determined from that information, as in approach 1. Alternatively, as in approach 2, training data may be built from entities that already have type information, a classifier may be trained on it, and the classifier may then be used to classify candidate entities without type information and entity mentions.

Approach 1: The category of an entity mention, or of a candidate entity without type information, is determined based on topic vectors.

FIG. 2 is a flowchart of the first method of determining the categories to which candidate entities and an entity mention belong.

Specifically, in step S21, a first topic vector is obtained that corresponds to the text containing the entity mention, or to the entity description text (for example, subject, comment, abstract) of a candidate entity without type information. The vector may be obtained by feeding that text into a topic model.

In step S22, second topic vectors are obtained that correspond to the entity description texts of the entities in each category. These vectors may be obtained by feeding the entity description texts of the entities in each category into the topic model.

In step S23, the average similarity between the first topic vector and the second topic vectors of each category is calculated.

That is, the similarity between the first topic vector and each of the one or more second topic vectors, corresponding to the one or more entities of a category, is calculated, and these similarities are averaged per category. The similarity between two vectors may be calculated, for example, as the cosine of the angle between them.

In step S24, the category with the highest average similarity is determined to be the category of the entity mention or of the candidate entity without type information.

That is, the average similarities of the categories are compared, the highest one is selected, and the category corresponding to it is taken as the category of the entity mention or of the candidate entity without type information.
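Steps S23 and S24 amount to an average-cosine nearest-category rule. The sketch below assumes the topic vectors have already been produced by some topic model (steps S21 and S22). The function names and the three-dimensional toy vectors are illustrative assumptions, not values from the source.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_category(first_vector, category_vectors):
    """S23: average similarity per category; S24: pick the highest average."""
    averages = {
        category: sum(cosine(first_vector, v) for v in vectors) / len(vectors)
        for category, vectors in category_vectors.items()
        if vectors
    }
    return max(averages, key=averages.get), averages


# Toy second topic vectors, one list per category (assumed values).
category_vectors = {
    "person": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0]],
    "organization": [[0.1, 0.8, 0.1], [0.0, 0.9, 0.1]],
}

# Toy first topic vector for a mention's context or an untyped candidate.
best, averages = assign_category([0.1, 0.9, 0.0], category_vectors)
print(best, averages)
```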

Approach 2: The categories of the candidate entities and the entity mention are determined using a classifier.

FIG. 3 is a flowchart of the second method of determining the categories to which candidate entities and an entity mention belong.

Specifically, in step S31, a classifier is trained on at least one of the following features: the degree to which the entity description text of an entity in each category matches a predefined template; whether that entity description text contains keywords related to each category; the corresponding topic information of the entities of each category in an encyclopedia; and the attribute types related to the entities of each category.

Predefined template feature: the degree of match between the entity description text of an entity in a category and a predefined template. The feature is 1 if the entity description text matches the predefined template, and 0 otherwise.

The predefined templates are as follows. The left side shows examples of categories, and the right side shows examples of the predefined templates corresponding to each category.

Keyword feature: whether the entity description text of an entity in a category contains keywords related to that category. Keywords are extracted from the entity description texts of the entities in each category and serve as the keywords related to that category. It is then determined whether an entity description text contains these keywords: the feature is 1 if at least one of them is present, and 0 otherwise. Examples of keywords related to each category are as follows. The left side shows examples of categories, and the right side shows examples of the keywords corresponding to each category.

Encyclopedia topic feature: the corresponding topic information of the entities of each category in an encyclopedia, for example the topic information of the entity 青龍山 (Qinglong Mountain) in Baidu Baike. It is determined whether the entity description text of an entity contains this topic information: the feature is 1 if at least one piece of topic information is present, and 0 otherwise. Examples of topic information related to each category are as follows. The left side shows examples of categories, and the right side shows examples of the topic information corresponding to each category.

Related attribute type feature: the attribute types related to the entities of a category are the attribute types that entities of that category typically or characteristically have in the knowledge base. For example, an entity of category "person" usually has attributes such as "date of birth", "place of birth" and "nationality". An entity of category "company" usually has attributes such as "registered address", "date of establishment" and "business scope". It is determined whether an entity has these attributes: the feature is 1 if at least one of them is present, and 0 otherwise.

In step S32, the candidate entities and the entity mention are classified using the classifier.

At classification time, the predefined template feature, the keyword feature and the encyclopedia topic feature are computed from the entity description text of the candidate entity or from the text containing the entity mention, whereas the related attribute type feature is computed from the candidate entity or the entity mention itself.
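The four binary features can be sketched as a small feature extractor whose output would be fed to any standard classifier. Because the example tables of templates, keywords and topic information are not reproduced in this text, every entry of `CATEGORY_RESOURCES` below is a hypothetical placeholder, except the attribute names, which follow the "person" and "company" examples given above. The function name `binary_features` is likewise an assumption.

```python
import re

# Hypothetical per-category resources (placeholders, not the source's tables).
CATEGORY_RESOURCES = {
    "person": {
        "template": re.compile(r"\bwas born (in|on)\b"),
        "keywords": {"actor", "writer", "politician"},
        "topics": {"biography", "people"},
        "attributes": {"date of birth", "place of birth", "nationality"},
    },
    "company": {
        "template": re.compile(r"\bwas founded in\b"),
        "keywords": {"company", "corporation"},
        "topics": {"enterprise", "business"},
        "attributes": {"registered address", "date of establishment", "business scope"},
    },
}


def binary_features(text, attributes, category):
    """Return [template, keyword, topic, attribute-type] features, each 0 or 1."""
    resources = CATEGORY_RESOURCES[category]
    lowered = text.lower()
    return [
        int(bool(resources["template"].search(lowered))),       # template matches
        int(any(k in lowered for k in resources["keywords"])),  # any keyword present
        int(any(t in lowered for t in resources["topics"])),    # any topic term present
        int(bool(resources["attributes"] & set(attributes))),   # any typical attribute
    ]


print(binary_features(
    "She was born in 1970 and became a writer; see her biography.",
    {"nationality"},
    "person",
))
```

In training (step S31) the vectors would be built from entities that already carry type information. In classification (step S32) the text argument is the candidate's description text or the text containing the mention.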

In step S3, the most discriminative attribute set of the category to which the entity mention belongs is determined.

Attributes whose attribute discriminability is higher than a discrimination threshold are taken as the attributes of the most discriminative attribute set of the category.

Two examples of how the attribute discriminability of an attribute may be calculated are described below.

Approach 1: For each attribute of each category in the semantic knowledge base, a first frequency is counted, namely how often the attribute occurs in that category in the semantic knowledge base. A second frequency is also counted, which relates to the number of times each attribute value of the attribute occurs under that attribute in that category. The product of the first frequency and the second frequency is calculated as the attribute discriminability of the attribute for the category.

For example, consider a candidate entity set $E$ in which each candidate entity $e_i$ has $m(e_i)$ attributes and $m(e_i)$ attribute values $v_j$, where $i$ and $j$ are indices. For ease of explanation, each attribute is assumed to correspond to exactly one attribute value. The first frequency $pf$ of each attribute and the second frequency $ief$ of its attribute values are counted over $E$. $pf$ is the number of times attribute $p$ occurs among all attributes in $E$. $ief$ is obtained by summing, over the attribute values of the attribute, the reciprocal of the number of times each value occurs under that attribute, and then dividing the sum by the total number of occurrences of the attribute.

In the example of Table 1:

For $p_1$, $pf = 3$ and the attribute values are $v_1$, $v_4$ and $v_7$, so $ief = (1/1 + 1/1 + 1/1)/3 = 1.0$.
For $p_2$, $pf = 3$ and the attribute values are $v_2$ and $v_5$, where $v_2$ occurs once and $v_5$ occurs twice, so $ief = (1/1 + 1/2)/3 = 0.5$.
For $p_3$, $pf = 3$ and the only attribute value is $v_3$, so $ief = (1/3)/3 = 0.11$.

The attribute discriminabilities of $p_1$, $p_2$ and $p_3$ in the category corresponding to $E$ are therefore $3 \times 1.0 = 3.0$, $3 \times 0.5 = 1.5$ and $3 \times 0.11 = 0.33$, respectively. A discrimination threshold $\delta$ is set, and the attributes whose discriminability exceeds $\delta$ form the most discriminative attribute set of the category. The attribute discriminabilities of the attributes in that set are then normalized.

Table 1. Examples of attributes of candidate entities and their attribute values
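The worked numbers above can be reproduced with a few lines of code. Since the body of Table 1 is not reproduced in this text, the three rows below are an assumed layout chosen only to be consistent with the worked example (three entities; $p_1$ with three distinct values, $p_2$ with $v_5$ occurring twice, $p_3$ always $v_3$). The function names and the sum-to-one normalization are likewise assumptions, since the source does not specify how normalization is done.

```python
from collections import Counter

# Assumed rows consistent with the worked example (not the original Table 1).
entities = [
    {"p1": "v1", "p2": "v2", "p3": "v3"},
    {"p1": "v4", "p2": "v5", "p3": "v3"},
    {"p1": "v7", "p2": "v5", "p3": "v3"},
]


def discriminability(entities):
    """Attribute discriminability = pf * ief for every attribute in the set."""
    scores = {}
    for p in sorted({name for e in entities for name in e}):
        values = [e[p] for e in entities if p in e]
        pf = len(values)                                  # occurrences of the attribute
        counts = Counter(values)                          # occurrences of each value
        ief = sum(1 / c for c in counts.values()) / pf    # averaged reciprocal counts
        scores[p] = pf * ief
    return scores


def select_and_normalize(scores, delta):
    """Keep attributes scoring above delta; normalize so weights sum to 1 (assumption)."""
    kept = {p: s for p, s in scores.items() if s > delta}
    total = sum(kept.values())
    return {p: s / total for p, s in kept.items()} if total else {}


scores = discriminability(entities)
weights = select_and_normalize(scores, delta=1.0)
print(scores)
print(weights)
```

With this layout the scores come out as 3.0, 1.5 and about 0.33, matching the text. An attribute whose value is shared by every entity, such as $p_3$, scores lowest because it cannot tell the candidates apart.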

Approach 2: For each attribute of each category in the semantic knowledge base, a correlation matrix between entities and attribute values is calculated, the maximum values of the columns of the correlation matrix are added together, and the resulting sum is taken as the attribute discriminability of the attribute for the category.

For example, for an attribute $p$ of a category, the probability $P(e_i \mid v_j)$ is calculated using a pointwise mutual information (PMI) function to obtain the correlation matrix $M$, where $e_i$ is an entity and $v_j$ is an attribute value.

For example, the matrices $M_1$, $M_2$ and $M_3$ are obtained for the attributes $p_1$, $p_2$ and $p_3$, respectively, as follows.

The maximum values of the columns of the correlation matrix $M$ are added together, and the resulting sum is taken as the attribute discriminability of attribute $p$ for the category.

For example, for attribute $p_1$, attribute discriminability = 0.8 + 0.7 + 0.5 = 2.0.

For attribute $p_2$, attribute discriminability = 0.9 + 0.9 + 0.9 = 2.6.

For attribute $p_3$, attribute discriminability = 0.4 + 0.4 + 0.4 = 1.2.

A discrimination threshold $\delta$ may be set, and the attributes whose discriminability exceeds $\delta$ may form the most discriminative attribute set of the category. The attribute discriminabilities of the attributes in that set are then normalized.
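The column-maximum rule of approach 2 is easy to state in code. The matrices $M_1$ to $M_3$ are not reproduced in this text, so the matrix below is a hypothetical one whose column maxima are 0.8, 0.7 and 0.5, chosen only to match the $p_1$ example. Rows stand for entities and columns for attribute values, and the entries are assumed to have been computed beforehand.

```python
def column_max_sum(matrix):
    """Sum the largest entry of each column of an entity-by-value matrix."""
    return sum(max(column) for column in zip(*matrix))


# Hypothetical correlation matrix for p1 (rows: entities, columns: attribute values).
m1 = [
    [0.8, 0.1, 0.2],
    [0.1, 0.7, 0.3],
    [0.1, 0.2, 0.5],
]

print(round(column_max_sum(m1), 2))
```

Intuitively, a column with a high maximum is an attribute value that points strongly at one particular entity, so an attribute with many such columns separates the candidates well.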

The two approaches above each yield an attribute discriminability. The attribute discriminability may be calculated with just one of the approaches, or the two values may be merged to obtain a final attribute discriminability.

The values may be merged, for example, by a weighted sum of the two, with the weights summing to 1.
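A weighted sum with weights adding to 1 can be sketched as follows. The function name and the default weight of 0.5 are assumptions, and in practice the two scores would probably be brought to a comparable scale (for example by the normalization mentioned above) before merging.

```python
def merge_discriminability(score_1, score_2, weight_1=0.5):
    """Weighted sum of two discriminability scores; the weights sum to 1."""
    if not 0.0 <= weight_1 <= 1.0:
        raise ValueError("weight_1 must lie in [0, 1]")
    return weight_1 * score_1 + (1.0 - weight_1) * score_2


# Merging the two example scores for p1 (3.0 from approach 1, 2.0 from approach 2).
print(merge_discriminability(3.0, 2.0))
print(merge_discriminability(3.0, 2.0, weight_1=0.25))
```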

In step S4, the similarity between a candidate entity belonging to the category and the entity reference is calculated based on the attribute set.

Specifically, the attribute values of the entity reference for the attributes in the attribute set are extracted from the text in which the entity reference is located, using a relation extraction/classification technique. The similarity between the candidate entity and the entity reference is then calculated based on the similarity between the attribute values of the candidate entity belonging to the category, for the attributes in the attribute set, and the corresponding attribute values of the entity reference.

That is, for a candidate entity and an entity reference belonging to the same category, the similarity of their attribute values is compared for the attributes in the attribute set having the highest discriminability of the category, and the result is taken as the similarity between the candidate entity and the entity reference.

For example, the similarity between a candidate entity (entity) and an entity reference (mention) is $\mathrm{sim}(\mathrm{mention}, \mathrm{entity}) = \sum_i \mathrm{sim}(v_i(\mathrm{mention}), v_i(\mathrm{entity}))$.

Here, $\mathrm{sim}(v_i(\mathrm{mention}), v_i(\mathrm{entity}))$ is the similarity between the attribute values $v_i$ of the entity reference and of the candidate entity for the attribute $p_i$.
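The unweighted sum can be sketched as below. The text does not specify how two attribute values are compared, so the character-level ratio from Python's `difflib.SequenceMatcher` is a stand-in chosen only for illustration; the function names and the dictionary representation of attribute values are likewise assumptions.

```python
from difflib import SequenceMatcher
from typing import Dict


def value_similarity(a: str, b: str) -> float:
    """Assumed per-attribute similarity: case-insensitive character-level ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def similarity(mention_values: Dict[str, str], entity_values: Dict[str, str]) -> float:
    """Sum the value similarities over attributes present on both sides."""
    # Attributes missing from either side contribute nothing (an assumption).
    shared = mention_values.keys() & entity_values.keys()
    return sum(value_similarity(mention_values[p], entity_values[p]) for p in shared)
```

For instance, `similarity({"team": "Bulls"}, {"team": "bulls", "born": "1963"})` compares only the shared `team` attribute.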

In a preferred aspect, the similarity between the candidate entity and the entity reference is calculated based on at least one of the co-reference probability between the candidate entity and the entity reference and the attribute discrimination degree of each attribute in the attribute set.

For example, the similarity between a candidate entity (entity) and an entity reference (mention) is $\mathrm{sim}(\mathrm{mention}, \mathrm{entity}) = \sum_i \mathrm{weight}(p_i) \cdot \mathrm{sim}(v_i(\mathrm{mention}), v_i(\mathrm{entity}))$.

Here, $\mathrm{weight}(p_i)$ is the attribute discrimination degree of the attribute $p_i$, and $\mathrm{sim}(v_i(\mathrm{mention}), v_i(\mathrm{entity}))$ is the similarity between the attribute values $v_i$ of the entity reference and of the candidate entity for the attribute $p_i$.
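A minimal sketch of the weighted form, assuming the per-attribute similarities have already been computed and that attributes without a weight are ignored. The function name and the example numbers are hypothetical.

```python
from typing import Dict


def weighted_similarity(
    weights: Dict[str, float], value_sims: Dict[str, float]
) -> float:
    """Sum weight(p_i) * sim(v_i(mention), v_i(entity)) over weighted attributes."""
    return sum(weights[p] * s for p, s in value_sims.items() if p in weights)


# Illustrative numbers only: normalized weights and per-attribute similarities.
score = weighted_similarity({"p_1": 0.625, "p_3": 0.375}, {"p_1": 1.0, "p_3": 0.5})
```

Weighting by discrimination degree lets a match on a highly discriminative attribute count for more than a match on a weakly discriminative one.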

That is, when calculating the similarity between a candidate entity and an entity reference, both the co-reference probability between the candidate entity and the entity reference and the attribute discrimination degrees of the attributes in the attribute set having the highest discriminability of the category may be used.

Here, the co-reference probability between a candidate entity and an entity reference represents the reliability of the information used in selecting that candidate entity. In the preceding step S1, candidate entities related to the entity reference in the short text are selected from the entities in the semantic knowledge base. Because candidate entities are selected using several methods, different co-reference probabilities, each representing the reliability of the information used when the candidate entity was selected, may be obtained according to the origin of the candidate entity.

For example, the co-reference probability between a candidate entity $e$ and an entity reference $m$ is denoted $p(e \mid m)$.

If the candidate entity $e$ comes from the same-name entities in the semantic knowledge base, the co-reference probability is $p(e \mid m) = 1/r$, where $r$ is the total number of same-name entities.

If the candidate entity $e$ comes from an equivalence relationship (a redirect relationship or an alias relationship), the co-reference probability is $p(e \mid m) = 1$.

If the candidate entity $e$ comes from an anaphoric relationship matching a specific pattern, the co-reference probability is $p(e \mid m) = 1$.

If the candidate entity $e$ comes from a disambiguation page, the co-reference probability is $p(e \mid m) = 1/k$, where $k$ is the total number of ambiguous entities.

If the candidate entity $e$ comes from Internet anchor text, the co-reference probability is $p(e \mid m) = w/n$, where $w$ is the number of links between the entity reference and the entity that the anchor text links to, and $n$ is the number of links between the entity reference and all entities.
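The five cases above can be collected into one function. This is a sketch only: the `Origin` enumeration and the function name are hypothetical, while the parameter names `r`, `k`, `w`, and `n` follow the symbols used in the text.

```python
from enum import Enum


class Origin(Enum):
    """How a candidate entity was found (names are illustrative)."""
    SAME_NAME = 1
    EQUIVALENCE = 2
    ANAPHORIC_PATTERN = 3
    DISAMBIGUATION_PAGE = 4
    ANCHOR_TEXT = 5


def coreference_probability(
    origin: Origin, *, r: int = 0, k: int = 0, w: int = 0, n: int = 0
) -> float:
    """Return p(e|m) according to the origin of the candidate entity."""
    if origin in (Origin.EQUIVALENCE, Origin.ANAPHORIC_PATTERN):
        return 1.0
    if origin is Origin.SAME_NAME:
        numerator, denominator = 1, r   # r: total number of same-name entities
    elif origin is Origin.DISAMBIGUATION_PAGE:
        numerator, denominator = 1, k   # k: total number of ambiguous entities
    else:  # Origin.ANCHOR_TEXT
        numerator, denominator = w, n   # w of the n links point to this entity
    if denominator <= 0:
        raise ValueError("the count for this origin must be positive")
    return numerator / denominator
```

For example, a candidate that is one of four same-name entities receives `coreference_probability(Origin.SAME_NAME, r=4)`.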

In step S5, a candidate entity is selected and associated with the entity reference based on the similarity.

Specifically, a candidate entity whose similarity is greater than a similarity threshold is associated with the entity reference.

If all of the similarities are smaller than the similarity threshold, the entity reference is added to the semantic knowledge base as a new entity.
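Step S5 can be sketched as follows, with the semantic knowledge base modeled as a plain set of entity names purely for illustration. The function name is hypothetical. The text does not say what happens when a similarity equals the threshold exactly; this sketch treats that case as no association.

```python
from typing import Dict, List, Set


def associate(
    mention: str,
    similarities: Dict[str, float],
    threshold: float,
    knowledge_base: Set[str],
) -> List[str]:
    """Link the mention to candidates above the threshold, or add it as new."""
    linked = [entity for entity, s in similarities.items() if s > threshold]
    if not linked:
        # No candidate is similar enough: register the mention as a new entity.
        knowledge_base.add(mention)
    return linked
```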

An apparatus for associating an entity reference in a short text with an entity in a semantic knowledge base is described below with reference to FIG. 4.

FIG. 4 is a block diagram showing the configuration of an apparatus, according to an embodiment of the present invention, for associating an entity reference in a short text with an entity in a semantic knowledge base. As shown in FIG. 4, the association apparatus 400 of the present invention includes: a candidate entity selection unit 41 that selects, from the entities in the semantic knowledge base, candidate entities related to the entity reference in the short text; a category determination unit 42 that determines the categories to which the candidate entities and the entity reference belong; an attribute set determination unit 43 that determines the attribute set having the highest discriminability of the category to which the entity reference belongs; a similarity calculation unit 44 that calculates, based on the attribute set, the similarity between a candidate entity belonging to the category and the entity reference; and an association unit 45 that selects a candidate entity and associates it with the entity reference based on the similarity.

In one aspect, the candidate entity selection unit 41 performs one of the following steps: selecting, as a candidate entity, a same-name entity in the semantic knowledge base whose entity name is identical to the entity reference; selecting, as a candidate entity, an entity in the semantic knowledge base that has an equivalence relationship with a same-name entity; selecting, as a candidate entity, an entity in the semantic knowledge base whose entity name has an anaphoric relationship with the entity reference in the entity description text; selecting, as a candidate entity, an entity in the semantic knowledge base that has an encyclopedia disambiguation relationship with the entity reference; selecting, as a candidate entity, an entity in the semantic knowledge base that is the link target of anchor text consisting of the entity reference; and selecting, as a candidate entity, an entity in the semantic knowledge base whose entity name has an anaphoric relationship with the entity reference in the text in which the entity reference is located.

In one aspect, whether an anaphoric relationship exists is determined based on whether the entity name of an entity in the semantic knowledge base and the entity reference match a specific anaphoric pattern in the entity description text of the entity or in the text in which the entity reference is located, or by performing text analysis on the entity description text of the entity in the semantic knowledge base or on the text in which the entity reference is located.

In one aspect, the category determination unit 42 obtains a first topic vector corresponding to the text in which the entity reference is located, or to the entity description text of a candidate entity that has no type information; obtains second topic vectors corresponding to the entity description texts of the entities of each category; calculates the average similarity between the first topic vector and the second topic vectors of each category; and determines the category with the highest average similarity to be the category to which the entity reference, or the candidate entity that has no type information, belongs.
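The average-similarity rule used by the category determination unit can be sketched as below. The text does not name the similarity measure between topic vectors, so cosine similarity is assumed here; how the topic vectors are produced is outside the sketch, and the function names and category labels are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_category(
    first_vector: Sequence[float],
    category_vectors: Dict[str, List[Sequence[float]]],
) -> str:
    """Pick the category whose second topic vectors are most similar on average."""
    def average_similarity(vectors: List[Sequence[float]]) -> float:
        if not vectors:
            return 0.0
        return sum(cosine(first_vector, v) for v in vectors) / len(vectors)

    return max(category_vectors, key=lambda c: average_similarity(category_vectors[c]))
```

As a usage example, `best_category([1.0, 0.0], {"person": [[0.9, 0.1], [1.0, 0.0]], "place": [[0.0, 1.0]]})` selects the category whose entity descriptions lie closest to the mention's context.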

In one aspect, the category determination unit 42 trains a classifier based on at least one of the following features: the degree of match between the entity description text of the entities of each category and predefined templates; whether the entity description text contains keywords related to each category; the corresponding topic information of the entities of each category in an encyclopedia; and the attribute types related to the entities of each category. The classifier is then used to classify the candidate entities and the entity reference.

In one aspect, the similarity calculation unit 44 extracts, from the text in which the entity reference is located, the attribute values of the entity reference for the attributes in the attribute set, and calculates the similarity between the candidate entity and the entity reference based on the similarity between the attribute values of the candidate entity belonging to the category, for the attributes in the attribute set, and the corresponding attribute values of the entity reference.

In one aspect, the similarity calculation unit 44 calculates the similarity between the candidate entity and the entity reference based on at least one of the co-reference probability between the candidate entity and the entity reference and the attribute discrimination degree of each attribute in the attribute set.

In one aspect, in the step of obtaining the attribute discrimination degree of an attribute, the attribute set determination unit 43 performs the following for each attribute of each category in the semantic knowledge base: it collects statistics on a first frequency with which the attribute appears in the category in the semantic knowledge base; collects statistics on a second frequency relating to the number of times each attribute value of the attribute of the category appears in the attribute in the semantic knowledge base; and calculates the product of the first frequency and the second frequency as the attribute discrimination degree of the attribute for the category. An attribute whose attribute discrimination degree is higher than the discrimination threshold is determined to be an attribute in the attribute set having the highest discriminability of the category.

In one aspect, in the step of obtaining the attribute discrimination degree of an attribute, the attribute set determination unit 43 calculates, for each attribute of each category in the semantic knowledge base, a correlation matrix between entities and attribute values, sums the maximum values of the columns of the correlation matrix, and takes the resulting sum as the attribute discrimination degree of the attribute for the category. An attribute whose attribute discrimination degree is higher than the discrimination threshold is determined to be an attribute in the attribute set having the highest discriminability of the category.

In one aspect, the association unit 45 associates a candidate entity whose similarity is greater than the similarity threshold with the entity reference. If all of the similarities are smaller than the similarity threshold, the association unit 45 adds the entity reference to the semantic knowledge base as a new entity.

Because the processing of each unit of the association apparatus 400 of the present invention is similar to the processing of the corresponding step of the association method described above, a detailed description of these units is omitted here for brevity.

Each component and unit of the above apparatus may be implemented by software, firmware, hardware, or a combination thereof. The specific means or manner of implementation is well known to those skilled in the art and is not described here. When implemented by software or firmware, a program constituting the software for carrying out the above method may be installed from a recording medium or a network onto a computer having a dedicated hardware configuration (for example, the general-purpose computer 500 shown in FIG. 5), and the computer can execute various functions when the various programs are installed.

FIG. 5 is a block diagram showing the configuration of a computer for implementing the method and apparatus according to the embodiments of the present invention.

In FIG. 5, a central processing unit (CPU) 501 executes various processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage unit 508 into a random access memory (RAM) 503. The RAM 503 also stores, as necessary, the data that the CPU 501 needs in order to execute the various processes. The CPU 501, the ROM 502, and the RAM 503 are connected to one another via a bus 504. An input/output interface 505 is also connected to the bus 504.

An input unit 506 (including a keyboard, a mouse, and the like), an output unit 507 (including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker, and the like), the storage unit 508 (including a hard disk and the like), and a communication unit 509 (including a network interface card such as a LAN card, a modem, and the like) are connected to the input/output interface 505. The communication unit 509 performs communication processing via a network such as the Internet. A drive unit 510 may also be connected to the input/output interface 505 as necessary. A removable medium 511, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive unit 510 as necessary, and a computer program read from it is installed in the storage unit 508 as necessary.

When the above processing is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 511.

These storage media are not limited to the removable medium 511 shown in FIG. 5, which stores the program and is distributed separately from the device in order to provide the program to the user. Examples of the removable medium 511 include a magnetic disk (including a floppy disk (registered trademark)), an optical disk (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disk (including a MiniDisc (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM 502, a hard disk included in the storage unit 508, or the like, which stores the program and is provided to the user together with the device that contains it.

The present invention further provides a program product that stores machine-readable instruction code. When the instruction code is read and executed by a machine, the method according to the embodiments of the present invention described above can be carried out.

Accordingly, the present invention also includes a storage medium on which the program product storing the machine-readable instruction code is recorded. Such storage media include, but are not limited to, floppy disks, optical disks, magneto-optical disks, memory cards, and memory sticks.

In the above description of specific embodiments of the present invention, a feature described and/or illustrated for one aspect may be used in the same or a similar manner in one or more other aspects, may be combined with features of other aspects, or may replace features of other aspects.

The term "comprise/include", as used herein, indicates the presence of a feature, element, step, or component, and does not exclude the presence or addition of one or more other features, elements, steps, or components.

The method of the present invention is not limited to being performed in the chronological order described in the specification; it may also be performed in another chronological order, in parallel, or independently. The order in which the method is described in this specification therefore does not limit the technical scope of the present invention.

The present invention has been disclosed above through the description of specific embodiments, but all of the above embodiments and examples are illustrative and not restrictive. Those skilled in the art may make various modifications, improvements, or equivalent changes to the present invention within the spirit and scope of the appended claims. Such modifications, improvements, and equivalents fall within the scope of protection of the present invention.

In addition, the following appendices are disclosed regarding the embodiments, including the examples described above.
(Appendix 1)
A method of associating an entity reference in a short text with an entity in a semantic knowledge base, the method comprising:
selecting, from the entities in the semantic knowledge base, candidate entities related to the entity reference in the short text;
determining the categories to which the candidate entities and the entity reference belong;
determining the attribute set having the highest discriminability of the category to which the entity reference belongs;
calculating, based on the attribute set, the similarity between a candidate entity belonging to the category and the entity reference; and
selecting a candidate entity and associating it with the entity reference based on the similarity.
(Appendix 2)
The method according to appendix 1, wherein the step of selecting, from the entities in the semantic knowledge base, candidate entities related to the entity reference in the short text includes one of the following steps:
selecting, as a candidate entity, a same-name entity in the semantic knowledge base whose entity name is identical to the entity reference;
selecting, as a candidate entity, an entity in the semantic knowledge base that has an equivalence relationship with a same-name entity;
selecting, as a candidate entity, an entity in the semantic knowledge base whose entity name has an anaphoric relationship with the entity reference in the entity description text;
selecting, as a candidate entity, an entity in the semantic knowledge base that has an encyclopedia disambiguation relationship with the entity reference;
selecting, as a candidate entity, an entity in the semantic knowledge base that is the link target of anchor text consisting of the entity reference; and
selecting, as a candidate entity, an entity in the semantic knowledge base whose entity name has an anaphoric relationship with the entity reference in the text in which the entity reference is located.
(Appendix 3)
The method according to appendix 2, wherein whether an anaphoric relationship exists is determined based on whether the entity name of an entity in the semantic knowledge base and the entity reference match a specific anaphoric pattern in the entity description text of the entity or in the text in which the entity reference is located, or
by performing text analysis on the entity description text of the entity in the semantic knowledge base or on the text in which the entity reference is located.
(Appendix 4)
The method according to appendix 1, wherein the step of determining the category to which the entity reference belongs includes:
obtaining a first topic vector corresponding to the text in which the entity reference is located, or to the entity description text of a candidate entity that has no type information;
obtaining second topic vectors corresponding to the entity description texts of the entities of each category;
calculating the average similarity between the first topic vector and the second topic vectors of each category; and
determining the category with the highest average similarity to be the category to which the entity reference, or the candidate entity that has no type information, belongs.
(Appendix 5)
The method according to appendix 1, wherein the step of determining the categories to which the candidate entities and the entity reference belong includes:
training a classifier based on at least one of the following features: the degree of match between the entity description text of the entities of each category and predefined templates; whether the entity description text contains keywords related to each category; the corresponding topic information of the entities of each category in an encyclopedia; and the attribute types related to the entities of each category; and
classifying the candidate entities and the entity reference using the classifier.
(Appendix 6)
The method according to appendix 1, wherein the step of calculating, based on the attribute set, the similarity between a candidate entity belonging to the category and the entity reference includes:
extracting, from the text in which the entity reference is located, the attribute values of the entity reference for the attributes in the attribute set; and
calculating the similarity between the candidate entity and the entity reference based on the similarity between the attribute values of the candidate entity belonging to the category, for the attributes in the attribute set, and the corresponding attribute values of the entity reference.
(Appendix 7)
The method according to appendix 6, wherein the step of calculating, based on the attribute set, the similarity between a candidate entity belonging to the category and the entity reference includes:
calculating the similarity between the candidate entity and the entity reference based on at least one of the co-reference probability between the candidate entity and the entity reference and the attribute discrimination degree of each attribute in the attribute set.
(Appendix 8)
The method according to appendix 1, wherein the step of determining the attribute set having the highest discriminability of the category to which the entity reference belongs includes determining each attribute whose attribute identification degree is higher than an identification threshold as an attribute in the attribute set having the highest discriminability of the category, and
Wherein the attribute identification degree of an attribute is obtained by, for each attribute of each category in the semantic knowledge base:
Counting a first frequency with which the attribute appears in the category in the semantic knowledge base;
Counting a second frequency relating to the number of times each attribute value of the attribute of the category appears in the attribute in the semantic knowledge base; and
Calculating the product of the first frequency and the second frequency as the attribute identification degree of the attribute of the category.
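A minimal sketch of the frequency-based attribute identification degree of appendix 8 follows. The text does not fix how the two frequencies are normalized, so both definitions below are assumptions: the first frequency is taken as the fraction of the category's entities that carry the attribute, and the second frequency as the ratio of distinct values to value occurrences. The function names are hypothetical.

```python
from collections import Counter


def attribute_identification_degree(entities: list, attribute: str) -> float:
    """Product of the first and second frequency for one attribute.

    entities: the entities of one category, each a dict mapping
              attribute name -> attribute value.
    """
    values = [e[attribute] for e in entities if attribute in e]
    if not entities or not values:
        return 0.0
    # ASSUMPTION: first frequency = share of entities having the attribute.
    first_frequency = len(values) / len(entities)
    # ASSUMPTION: second frequency = distinct values / value occurrences,
    # so attributes whose values rarely repeat score higher.
    value_counts = Counter(values)
    second_frequency = len(value_counts) / len(values)
    return first_frequency * second_frequency


def discriminative_attributes(entities: list, threshold: float) -> set:
    """Attributes whose identification degree exceeds the threshold."""
    attributes = {name for e in entities for name in e}
    return {
        name for name in attributes
        if attribute_identification_degree(entities, name) > threshold
    }
```

Under these assumptions, an attribute such as a name (present on every entity, almost never repeated) scores close to 1, while a sparsely filled or highly repetitive attribute scores lower and is dropped by the threshold.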
(Appendix 9)
The method according to appendix 1, wherein the step of determining the attribute set having the highest discriminability of the category to which the entity reference belongs includes determining each attribute whose attribute identification degree is higher than an identification threshold as an attribute in the attribute set having the highest discriminability of the category, and
Wherein the attribute identification degree of an attribute is obtained by:
Calculating, for each attribute of each category in the semantic knowledge base, a correlation matrix between the entities and the attribute values; and
Adding the maximum values of the columns of the correlation matrix and using the obtained sum as the attribute identification degree of the attribute of the category.
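The matrix-based variant of appendix 9 can be illustrated as below. Summing the column maxima follows the text directly; how the correlation matrix itself is filled is not specified here, so `build_correlation_matrix` is a hypothetical construction (rows are entities, columns are distinct attribute values, and an entry is 1 divided by the number of entities sharing that value).

```python
def build_correlation_matrix(entity_values: list) -> list:
    """Entity-by-value matrix for one attribute of one category.

    entity_values: the attribute's value for each entity, in order.
    ASSUMPTION: entry [i][j] is 1 / (number of entities with value j)
    when entity i carries value j, and 0 otherwise.
    """
    distinct = sorted(set(entity_values))
    counts = {v: entity_values.count(v) for v in distinct}
    return [
        [1.0 / counts[v] if value == v else 0.0 for v in distinct]
        for value in entity_values
    ]


def column_max_sum(matrix: list) -> float:
    """Sum of the maximum of each column, used as the identification degree."""
    if not matrix:
        return 0.0
    # zip(*matrix) iterates over the columns of a row-major matrix.
    return sum(max(column) for column in zip(*matrix))
```

For the values `["a", "b", "b"]` this construction yields column maxima of 1 and 0.5: a value that singles out one entity contributes fully, and a value shared by two entities contributes half.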
(Appendix 10)
The method according to appendix 1, wherein the step of selecting a candidate entity based on the similarity and associating it with the entity reference includes associating a candidate entity whose similarity is greater than a similarity threshold with the entity reference, and wherein, when all the similarities are smaller than the similarity threshold, the entity reference is added to the semantic knowledge base as a new entity.
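The threshold rule of appendix 10 amounts to a small decision function. The sketch below is an assumption-laden illustration: the knowledge base is modelled as a plain Python set, the function name is hypothetical, and when several candidates exceed the threshold the one with the highest similarity is chosen, which the text does not prescribe.

```python
def link_or_add(entity_reference: str, similarities: dict,
                knowledge_base: set, threshold: float) -> str:
    """Link the entity reference to a candidate or register it as new.

    similarities: candidate entity id -> similarity to the reference.
    Returns the id of the linked entity, or the reference itself when
    it was added to the knowledge base as a new entity.
    """
    above = [c for c, s in similarities.items() if s > threshold]
    if above:
        # ASSUMPTION: among qualifying candidates, take the most similar.
        return max(above, key=lambda c: similarities[c])
    # No candidate exceeds the threshold (a similarity exactly equal to
    # the threshold is treated as not exceeding it): add a new entity.
    knowledge_base.add(entity_reference)
    return entity_reference
```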
(Appendix 11)
An apparatus for associating an entity reference in a short text with an entity in a semantic knowledge base, comprising:
A candidate entity selection means for selecting, from the entities in the semantic knowledge base, a candidate entity related to the entity reference in the short text;
A category determining means for determining a category to which the candidate entity and the entity reference belong;
An attribute set determining means for determining an attribute set having the highest discriminability of the category to which the entity reference belongs;
A similarity calculation means for calculating the similarity between the candidate entity belonging to the category and the entity reference based on the attribute set; and
An association means for selecting a candidate entity and associating it with the entity reference based on the similarity.
(Appendix 12)
The apparatus according to appendix 11, wherein the candidate entity selection means executes one of the following steps:
Selecting, as a candidate entity, an entity in the semantic knowledge base having the same name as the entity reference;
Selecting, as a candidate entity, an entity in the semantic knowledge base having an equivalence relationship with the entity of the same name;
Selecting, as a candidate entity, an entity in the semantic knowledge base whose entity name has an anaphoric relationship with the entity reference in the entity description text;
Selecting, as a candidate entity, an entity in the semantic knowledge base having an encyclopedia disambiguation relationship with the entity reference;
Selecting, as a candidate entity, an entity in the semantic knowledge base that is linked with the entity reference as anchor text; and
Selecting, as a candidate entity, an entity in the semantic knowledge base whose entity name has an anaphoric relationship with the entity reference in the text in which the entity reference is located.
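Several of the candidate selection rules of appendix 12 reduce to dictionary lookups. The sketch below is a simplified illustration and not the apparatus itself: the knowledge base layout (`names`, `equivalents`, `disambiguation`, `anchors`) is hypothetical, the anaphora-based rules are omitted, and the results of the individual rules are merged into one set, whereas the appendix only requires that one of the rules be executed.

```python
def select_candidates(entity_reference: str, kb: dict) -> set:
    """Collect candidate entity ids for an entity reference.

    kb is a hypothetical index with four mappings:
      "names":          surface name -> ids of entities with that name
      "equivalents":    entity id -> ids of equivalent entities
      "disambiguation": surface name -> ids listed on a disambiguation page
      "anchors":        anchor text -> ids of the entities linked to
    """
    candidates = set()
    # Rule 1: entities with the same name as the entity reference.
    same_name = kb["names"].get(entity_reference, set())
    candidates |= same_name
    # Rule 2: entities equivalent to a same-name entity.
    for entity_id in same_name:
        candidates |= kb["equivalents"].get(entity_id, set())
    # Rule 3: entities reached through encyclopedia disambiguation.
    candidates |= kb["disambiguation"].get(entity_reference, set())
    # Rule 4: entities linked with the reference as anchor text.
    candidates |= kb["anchors"].get(entity_reference, set())
    return candidates
```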
(Appendix 13)
The apparatus according to appendix 12, wherein whether an anaphoric relationship exists is determined based on whether the entity name of the entity in the semantic knowledge base and the entity reference match a particular anaphoric pattern in the entity description text of the entity or in the text in which the entity reference is located, or by performing text analysis on the entity description text of the entity in the semantic knowledge base or on the text in which the entity reference is located.
(Appendix 14)
The apparatus according to appendix 11, wherein the category determining means:
Obtains a first theme vector corresponding to the text in which the entity reference is located, or to the entity description text of a candidate entity having no type information;
Obtains second theme vectors corresponding to the entity description texts of the entities of each category;
Calculates, for each category, the average similarity between the first theme vector and the second theme vectors of that category; and
Determines the category having the highest average similarity as the category to which the entity reference, or the candidate entity having no type information, belongs.
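The category decision of appendix 14 can be sketched as an average-similarity comparison. This is a hedged illustration: the theme vectors are assumed to be given already as plain lists of floats (how they are derived from text is outside the sketch), cosine similarity is an assumed choice of measure, and the function names are hypothetical.

```python
import math


def cosine(u: list, v: list) -> float:
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign_category(first_vector: list, category_vectors: dict):
    """Return the category with the highest average similarity.

    first_vector: theme vector of the text containing the entity
                  reference (or of an untyped candidate's description).
    category_vectors: category -> list of second theme vectors, one per
                      entity of that category.
    Returns None when no category has any vectors.
    """
    best_category = None
    best_average = float("-inf")
    for category, vectors in category_vectors.items():
        if not vectors:
            continue
        average = sum(cosine(first_vector, v) for v in vectors) / len(vectors)
        if average > best_average:
            best_category, best_average = category, average
    return best_category
```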
(Appendix 15)
The apparatus according to appendix 11, wherein the category determining means:
Trains a classifier based on at least one of the following features: the degree of match between the entity description text of the entities of each category and a predefined template; whether the entity description text includes a keyword associated with each category; the corresponding subject information, in the encyclopedia, of the entities of each category; and the attribute types associated with the entities of each category; and
Classifies the candidate entity and the entity reference using the classifier.
(Appendix 16)
The apparatus according to appendix 11, wherein the similarity calculation means:
Extracts, from the text in which the entity reference is located, the attribute values of the entity reference for the attributes of the attribute set; and
Calculates, for the candidate entity belonging to the category, the similarity between the candidate entity and the entity reference based on the similarity between the candidate entity's attribute values for the attributes of the attribute set and the corresponding attribute values of the entity reference.
(Appendix 17)
The apparatus according to appendix 16, wherein the similarity calculation means calculates the similarity between the candidate entity and the entity reference based on at least one of a cross-correlation probability between the candidate entity and the entity reference and the attribute identification degree of each attribute of the attribute set.
(Appendix 18)
The apparatus according to appendix 11, wherein the attribute set determining means:
Obtains the attribute identification degree of each attribute of each category in the semantic knowledge base by counting a first frequency with which the attribute appears in the category, counting a second frequency relating to the number of times each attribute value of the attribute of the category appears in the attribute, and calculating the product of the first frequency and the second frequency as the attribute identification degree of the attribute of the category; and
Determines each attribute whose attribute identification degree is higher than an identification threshold as an attribute in the attribute set having the highest discriminability of the category.
(Appendix 19)
The apparatus according to appendix 11, wherein the attribute set determining means:
Obtains the attribute identification degree of each attribute of each category in the semantic knowledge base by calculating a correlation matrix between the entities and the attribute values, adding the maximum values of the columns of the correlation matrix, and using the obtained sum as the attribute identification degree of the attribute of the category; and
Determines each attribute whose attribute identification degree is higher than an identification threshold as an attribute in the attribute set having the highest discriminability of the category.
(Appendix 20)
The apparatus according to appendix 11, wherein the association means associates a candidate entity whose similarity is greater than a similarity threshold with the entity reference, and
Wherein, when all the similarities are smaller than the similarity threshold, the association means adds the entity reference to the semantic knowledge base as a new entity.

Claims

A method of associating an entity reference in a short text with an entity in a semantic knowledge base, comprising:
Selecting, from the entities in the semantic knowledge base, a candidate entity related to the entity reference in the short text;
Determining a category to which the candidate entity and the entity reference belong;
Determining an attribute set having the highest discriminability of the category to which the entity reference belongs;
Calculating the similarity between the candidate entity belonging to the category and the entity reference based on the attribute set; and
Selecting a candidate entity and associating it with the entity reference based on the similarity.

The method according to claim 1, wherein the step of selecting, from the entities in the semantic knowledge base, a candidate entity related to the entity reference in the short text includes one of the following steps:
Selecting, as a candidate entity, an entity in the semantic knowledge base having the same name as the entity reference;
Selecting, as a candidate entity, an entity in the semantic knowledge base having an equivalence relationship with the entity of the same name;
Selecting, as a candidate entity, an entity in the semantic knowledge base whose entity name has an anaphoric relationship with the entity reference in the entity description text;
Selecting, as a candidate entity, an entity in the semantic knowledge base having an encyclopedia disambiguation relationship with the entity reference;
Selecting, as a candidate entity, an entity in the semantic knowledge base that is linked with the entity reference as anchor text; and
Selecting, as a candidate entity, an entity in the semantic knowledge base whose entity name has an anaphoric relationship with the entity reference in the text in which the entity reference is located.

The method according to claim 2, wherein whether an anaphoric relationship exists is determined based on whether the entity name of the entity in the semantic knowledge base and the entity reference match a particular anaphoric pattern in the entity description text of the entity or in the text in which the entity reference is located, or by performing text analysis on the entity description text of the entity in the semantic knowledge base or on the text in which the entity reference is located.

The step of determining the category to which the entity reference belongs includes:
Obtaining a first theme vector corresponding to the text in which the entity reference is located, or to the entity description text of a candidate entity having no type information;
Obtaining second theme vectors corresponding to the entity description texts of the entities of each category;
Calculating, for each category, the average similarity between the first theme vector and the second theme vectors of that category; and
Determining the category having the highest average similarity as the category to which the entity reference, or the candidate entity having no type information, belongs.

The step of determining the category to which the candidate entity and the entity reference belong includes:
Training a classifier based on at least one of the following features: the degree of match between the entity description text of the entities of each category and a predefined template; whether the entity description text includes a keyword associated with each category; the corresponding subject information, in the encyclopedia, of the entities of each category; and the attribute types associated with the entities of each category; and
Classifying the candidate entity and the entity reference using the classifier.

The method according to claim 1, wherein the step of calculating the similarity between the candidate entity belonging to the category and the entity reference based on the attribute set includes:
Extracting, from the text in which the entity reference is located, the attribute values of the entity reference for the attributes of the attribute set; and
Calculating, for the candidate entity belonging to the category, the similarity between the candidate entity and the entity reference based on the similarity between the candidate entity's attribute values for the attributes of the attribute set and the corresponding attribute values of the entity reference.

The method according to claim 6, wherein the step of calculating the similarity between the candidate entity belonging to the category and the entity reference based on the attribute set includes:
Calculating the similarity between the candidate entity and the entity reference based on at least one of a cross-correlation probability between the candidate entity and the entity reference and the attribute identification degree of each attribute of the attribute set.

The method according to claim 1, wherein the step of determining the attribute set having the highest discriminability of the category to which the entity reference belongs includes determining each attribute whose attribute identification degree is higher than an identification threshold as an attribute in the attribute set having the highest discriminability of the category, and
Wherein the attribute identification degree of an attribute is obtained by, for each attribute of each category in the semantic knowledge base:
Counting a first frequency with which the attribute appears in the category in the semantic knowledge base;
Counting a second frequency relating to the number of times each attribute value of the attribute of the category appears in the attribute in the semantic knowledge base; and
Calculating the product of the first frequency and the second frequency as the attribute identification degree of the attribute of the category.

The method according to claim 1, wherein the step of determining the attribute set having the highest discriminability of the category to which the entity reference belongs includes determining each attribute whose attribute identification degree is higher than an identification threshold as an attribute in the attribute set having the highest discriminability of the category, and
Wherein the attribute identification degree of an attribute is obtained by:
Calculating, for each attribute of each category in the semantic knowledge base, a correlation matrix between the entities and the attribute values; and
Adding the maximum values of the columns of the correlation matrix and using the obtained sum as the attribute identification degree of the attribute of the category.

An apparatus for associating an entity reference in a short text with an entity in a semantic knowledge base, comprising:
A candidate entity selection means for selecting, from the entities in the semantic knowledge base, a candidate entity related to the entity reference in the short text;
A category determining means for determining a category to which the candidate entity and the entity reference belong;
An attribute set determining means for determining an attribute set having the highest discriminability of the category to which the entity reference belongs;
A similarity calculation means for calculating the similarity between the candidate entity belonging to the category and the entity reference based on the attribute set; and
An association means for selecting a candidate entity and associating it with the entity reference based on the similarity.