JP5426292B2

JP5426292B2 - Opinion classification device and program

Info

Publication number: JP5426292B2
Application number: JP2009214909A
Authority: JP
Inventors: 健小早川; 正熊野; 英輝田中; 潤一辻井; 進東金; 直観岡崎
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2009-09-16
Filing date: 2009-09-16
Publication date: 2014-02-26
Anticipated expiration: 2029-09-16
Also published as: JP2011065380A

Description

The present invention relates to an opinion classification device that classifies an input sentence or input document written in natural language into predetermined categories, and to a program therefor.

To efficiently analyze feedback from users and viewers on products, services, and content such as broadcasts, a technique that automatically classifies sentences written in natural language by opinion type is effective. Conventional techniques split an input document into words and classify the document by analyzing the distribution of the frequencies with which those words appear.
This approach is called the bag-of-words method, and it has delivered reasonable performance.
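As a rough illustration of the bag-of-words representation described above, the following sketch counts word occurrences and discards everything else. The whitespace tokenizer and the example sentences are assumptions made for illustration and do not come from the patent.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase, whitespace-separated tokens (hypothetical tokenizer)."""
    return Counter(text.lower().split())


# Two word sequences with different structure but the same words.
first = bag_of_words("the theme is good but the content is boring")
second = bag_of_words("the theme is boring but the content is good")

# Word order and modification relationships are lost, so the two
# representations cannot be told apart.
indistinguishable = first == second
```

Because only the frequencies survive, any classifier built on this representation sees the two sentences as identical, which is the kind of limitation the following paragraphs discuss.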

The technique described in Patent Document 1, while also relying on the number of occurrences of words in a text, registers in advance positive-polarity expressions that represent favorable evaluations and negative-polarity expressions that represent unfavorable evaluations. It extracts evaluation expressions and connective expressions indicating connection relationships from the text, detects the registered evaluation expressions among those extracted, and determines the polarity of each detected evaluation expression.

Japanese Patent No. 3962382

The bag-of-words method described as background art uses only the surface forms of the segmented words as features. First, it therefore cannot identify synonyms: even if the features needed to classify one expression have been learned, a different expression that uses a synonym is not understood, and classification performance does not improve. Second, it cannot identify collocations (groups of several words that together carry a single meaning), so it processes each word of a collocation separately and misclassifies cases, such as idioms, in which the group has a meaning of its own, that is, in which the meaning is not compositional. Third, it cannot identify modification relationships between words, so it cannot determine what a negation negates or which of several candidate opinion predicates is the more important, and classification performance again does not improve.
In short, what is needed is the ability to determine whether a sentence belongs to a predetermined sentence category based on meaning, not only on the statistical features of the surface of the expression.

The present invention was made in recognition of the above problems. It provides an opinion classification device that can classify sentences while handling synonyms; that can classify sentences while handling the meaning of a group of words as a whole, as in idioms; and that can correctly classify sentences even when they contain several predicates or modification relationships. The present invention also provides programs for these devices.

[1] To solve the above problems, an opinion classification device according to one aspect of the present invention comprises: a morphological analysis unit that reads input sentence data and performs morphological analysis; a dependency analysis unit that performs dependency analysis on the morphologically analyzed input sentence data and outputs parse tree data; a predicate detection unit that detects opinion predicates in the parse tree data and adds to the parse tree data opinion predicate information including a predicate category representing the type of each detected opinion predicate; a functional expression detection unit that detects functional expressions in the input sentence data, based on the input sentence data morphologically analyzed by the morphological analysis unit, and adds to the parse tree data functional expression information including a functional expression type representing the type of each detected functional expression; and a determination unit that reads the parse tree data to which the predicate detection unit has added the opinion predicate information and the functional expression detection unit has added the functional expression information, and uses features of that parse tree data to determine whether the parse tree data falls under a predetermined sentence category.

Here, functional expressions include the modalities of natural language as well as other functional expressions. Modality is one kind of functional expression.
With the above configuration, the dependency analysis unit outputs parse tree data representing the structure of the sentence. The predicate detection unit detects predicates and adds information on their predicate categories to the parse tree data. The parse tree data therefore holds not only the surface form of each predicate but also its predicate category and its position in the sentence structure. Likewise, the functional expression detection unit detects functional expressions and adds information on their functional expression types to the parse tree data, so the parse tree data also holds the functional expression type of each functional expression and its position in the sentence structure. The determination unit reads this richly annotated parse tree data, uses it as features, and on that basis determines whether the sentence falls under a predetermined sentence category. In other words, the determination unit classifies the sentence into a sentence category.

Using predicate categories and functional expression types as features in this way allows classification based on the type of predicate and the type of functional expression, not only on the surface of the expression. For example, sentences containing synonyms of a word can be classified more appropriately.
Because the parse tree holds information on groups of words such as collocations, sentences can also be classified using features such as where a group of words sits in the sentence structure and what role it plays.
Because the parse tree also holds information on modification relationships between words, classification can take into account not only how often opinion predicates and the like appear but also the role they play within those modification relationships.

[2] In an opinion classification device according to one aspect of the present invention, in the above configuration, the functional expression detection unit is a modality detection unit that detects modalities as the functional expressions of the input sentence data and adds to the parse tree data functional expression information including a functional expression type representing the type of each detected modality.

[3] In an opinion classification device according to one aspect of the present invention, in the above configuration, the determination unit uses information on subtrees of the read parse tree data as the features. The determination unit has a model trained in advance by machine learning on parse tree data of correct example sentences, to which the opinion predicate information and the functional expression information have been added, together with the sentence categories corresponding to those correct example sentences, and determines whether the parse tree data falls under the predetermined sentence category by referring to the model.

With this configuration, correct example sentences and their corresponding sentence categories are supplied in advance. The determination unit performs machine learning on the correct example sentences beforehand to build a model, and then uses that model to classify sentences.
As a result, neither the device designer nor the user needs to work out analytically how the feature space of parse trees relates to the sentence categories; supplying examples is enough to obtain accurate classification.
One machine learning method that can be used is a tree kernel SVM.

[4] A program according to one aspect of the present invention causes a computer to function as an opinion classification device comprising: a morphological analysis unit that reads input sentence data and performs morphological analysis; a dependency analysis unit that performs dependency analysis on the morphologically analyzed input sentence data and outputs parse tree data; a predicate detection unit that detects opinion predicates in the parse tree data and adds to the parse tree data opinion predicate information including a predicate category representing the type of each detected opinion predicate; a functional expression detection unit that detects functional expressions in the input sentence data, based on the input sentence data morphologically analyzed by the morphological analysis unit, and adds to the parse tree data functional expression information including a functional expression type representing the type of each detected functional expression; and a determination unit that reads the parse tree data to which the predicate detection unit has added the opinion predicate information and the functional expression detection unit has added the functional expression information, and uses features of that parse tree data to determine whether the parse tree data falls under a predetermined sentence category.

[5] A program according to one aspect of the present invention causes a computer to function as an opinion classification device in which the functional expression detection unit is a modality detection unit that detects modalities as the functional expressions of the input sentence data and adds to the parse tree data functional expression information including a functional expression type representing the type of each detected modality.

According to the present invention, sentences can be classified accurately and efficiently based not only on the surface of the expression but also on the structure of the sentence and the roles that predicates and modalities play in that structure.

FIG. 1 is a block diagram showing the functional configuration of an opinion classification device according to an embodiment of the present invention.
FIG. 2 is a schematic diagram showing an example data structure of the dependency parse tree output by the dependency analysis unit of the embodiment.
FIG. 3 is a schematic diagram showing the parse tree corresponding to the parse tree data of FIG. 2.
FIG. 4 is a schematic diagram showing the structure of the predicate dictionary data of the embodiment and example entries.
FIG. 5 is a schematic diagram showing the structure of the semantic category dictionary data of the embodiment and example entries.
FIG. 6 is a schematic diagram showing an example structure of parse tree data to which, in the embodiment, the predicate detection unit has added opinion predicate information and the modality detection unit has added modality information.
FIG. 7 is a schematic diagram showing the parse tree corresponding to the parse tree data of FIG. 6.
FIG. 8 is a schematic diagram showing a parse tree obtained when dependency structure grammar analysis is performed, according to a modification of the embodiment.
FIG. 9 is a table showing the sentence classification accuracy obtained in a verification experiment with the opinion classification device 1 of the embodiment.

An embodiment of the present invention is described below with reference to the drawings.
FIG. 1 is a block diagram showing the functional configuration of the opinion classification device according to the embodiment. As shown, the opinion classification device 1 comprises a morphological analysis unit 11, a dependency analysis unit 12, a predicate detection unit 14, a modality detection unit 15 (functional expression detection unit), and an integrated determination unit 17 (determination unit).

The morphological analysis unit 11 reads the input sentence data to be analyzed, performs morphological analysis (morpheme analysis) on it, and outputs the resulting morpheme sequence. Here, the input sentence data is text data containing passages or sentences written in a natural language such as Japanese. The morphological analysis itself can be performed with any of various widely known methods. For example, taking a character sequence as input and referring to morpheme dictionary data, the unit can generate a lattice, that is, a graph of possible morpheme candidates, and search for the optimal path based on part-of-speech information, n-gram frequencies, and the like.
The dependency analysis unit 12 reads the result of the morphological analysis, performs dependency analysis, and outputs the resulting dependency parse tree data. The dependency analysis itself can likewise be performed with any of various widely known methods.

The predicate detection unit (predicate detector) 14 detects opinion predicates (opinion-holding predicates) based on the dependency parse tree data, referring to a list of opinion predicates registered in advance in dictionary data and additionally handling Japanese conjugation and collocations. An opinion predicate is a predicate that expresses an opinion. The predicate detection unit 14 then adds information on each detected opinion predicate to the corresponding node of the parse tree. The opinion predicate information added to the parse tree includes a predicate category representing the type of the detected opinion predicate.

The modality detection unit (modality detector) 15 detects Japanese modalities based on the dependency parse tree data and assigns them meanings such as "negation", "desire", and "question". The modality detection unit 15 then adds information on each detected modality or other functional expression to the corresponding node of the parse tree. (In the following description, modalities and other functional expressions are sometimes referred to simply as "modalities"; likewise, information on them is sometimes called "modality information", and their types "modality types".) The modality information added to the parse tree includes the modality type, which represents the type of the detected modality. As described later, the modality detection unit 15 does not need the dependency structure; the morpheme sequence contained in the parse tree data is sufficient as input. The modality detection unit 15 may therefore use the result of the morphological analysis directly instead of reading the parse tree data. In either case, the modality detection unit 15 still operates on input sentence data that has been morphologically analyzed by the morphological analysis unit 11.

The integrated determination unit 17 reads the parse tree data to which the predicate detection unit 14 and the modality detection unit 15 have added, respectively, information on opinion predicates (including predicate categories) and information on modalities (including modality types). It uses this data as features (as the feature space) to perform classification and outputs the resulting sentence classification category (sentence category) data (classification result data). The integrated determination unit 17 performs this processing using a tree kernel SVM (Support Vector Machine). A tree kernel SVM is a machine learning method, and the model is trained in advance by supplying the integrated determination unit 17 with a large number of correct example data.
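The text does not say which tree kernel is used. As one possible reading, the sketch below computes a subset-tree kernel in the style of Collins and Duffy, which counts the tree fragments two trees share. The tuple representation of trees, the node labels such as `pred:category1`, and the decay parameter `lam` are all assumptions made for illustration.

```python
def nodes(tree):
    """Yield every node of a tree given as (label, child, child, ...)."""
    yield tree
    for child in tree[1:]:
        yield from nodes(child)


def production(node):
    """A node's label together with the labels of its children."""
    return (node[0], tuple(child[0] for child in node[1:]))


def common(n1, n2, lam):
    """Weighted count of fragments rooted at both n1 and n2."""
    if len(n1) == 1 or len(n2) == 1:
        return 0.0  # leaves have no production of their own
    if production(n1) != production(n2):
        return 0.0
    score = lam
    for c1, c2 in zip(n1[1:], n2[1:]):
        score *= 1.0 + common(c1, c2, lam)
    return score


def tree_kernel(t1, t2, lam=1.0):
    """Sum the shared-fragment counts over all pairs of nodes."""
    return sum(common(a, b, lam) for a in nodes(t1) for b in nodes(t2))


# Hypothetical annotated trees: inner labels carry the added category information.
t = ("S", ("pred:category2", ("good",)), ("pred:category1", ("boring",)))
u = ("S", ("pred:category2", ("good",)), ("pred:category1", ("dull",)))
```

A Gram matrix of such kernel values over the correct example sentences could then be passed to any SVM implementation that accepts precomputed kernels. Because the annotated categories appear as node labels, two sentences that use different words of the same predicate category still share fragments.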

Next, the main processing of the above units and the data used in that processing are described in detail.

FIG. 2 is a schematic diagram showing an example structure of the dependency parse tree data output by the dependency analysis unit 12. As shown, the data is in text form and is expressed as an S-expression. The example shown is what the dependency analysis unit 12 outputs for the input sentence data "The theme is good, but the content is boring."; the S-expression as a whole represents the parse tree for the input sentence. The corresponding Japanese sentence is 「テーマは良いが内容がつまらん」.
In this data, for example, the expression "(word but)" corresponds to the node for the word "but" in the parse tree. The expression "(chunk (word The) (word theme))" corresponds to the node for the chunk "The theme", a group of several words. An expression of the form (element1 element2 element3) represents a structure in which the child nodes of the element1 node are, from left to right, the element2 node and the element3 node. Nesting such parent-child relationships to multiple levels expresses the tree structure of the parse tree.
The line break positions in the S-expression data can be changed freely. Although this embodiment expresses the parse tree as an S-expression, other data structures may be used instead.
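To make the S-expression format concrete, here is a minimal sketch of a reader that turns such text into nested Python lists. The exact S-expression of FIG. 2 is not reproduced in the text, so the string `EXAMPLE` below is a hypothetical reconstruction based only on the fragments quoted above and on the tree described for FIG. 3.

```python
def tokenize(text):
    """Split an S-expression into parentheses and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Consume one expression from the token list and return it."""
    if not tokens:
        raise ValueError("unexpected end of input")
    token = tokens.pop(0)
    if token != "(":
        return token  # an atom such as "word" or "but"
    node = []
    while tokens and tokens[0] != ")":
        node.append(parse(tokens))
    if not tokens:
        raise ValueError("missing closing parenthesis")
    tokens.pop(0)  # drop the ")"
    return node


def read_sexpr(text):
    """Parse a string holding exactly one S-expression."""
    tokens = tokenize(text)
    tree = parse(tokens)
    if tokens:
        raise ValueError("trailing input after expression")
    return tree


# Hypothetical data: (parent child child), with "but" as the root.
EXAMPLE = """((word but)
  ((word is) (chunk (word The) (word theme)) (word good))
  ((word is) (chunk (word the) (word content)) (word boring)))"""
tree = read_sexpr(EXAMPLE)
```

Since the tokenizer splits on any whitespace, moving the line breaks in `EXAMPLE` does not change the result, which matches the remark that line break positions can be changed freely.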

FIG. 3 is a schematic diagram that visualizes the parse tree corresponding to the data shown in FIG. 2. The figure shows the Japanese expression alongside each English one. In this example, "が / but" is the root node, and it has two direct child nodes, from left to right "は / is" and "が / is". The child node "は / is" in turn has two child nodes, from left to right "テーマ / The theme" and "良い / good". The child node "が / is" likewise has two child nodes, from left to right "内容 / the content" and "つまらん / boring".
The parse trees shown in FIGS. 2 and 3 were obtained as the result of phrase structure grammar analysis by the dependency analysis unit 12.

The predicate detection unit 14 reads the S-expression data for the parse tree and analyzes it to detect the opinion predicates in the input sentence, referring to predicate dictionary data to do so.
FIG. 4 is a schematic diagram showing the structure of the predicate dictionary data and example entries. As shown, the predicate dictionary data is organized as a table with two data items: opinion predicate and predicate category. The opinion predicate item stores the character string of a word or phrase that expresses an opinion, such as an adjective, adjective phrase, verb, or verb phrase. The predicate category item stores information identifying the category to which the predicate belongs. In the example shown, the entry for the adjective "良い / good" is associated with "category 2", and the entry for the adjective "つまらない / boring" is associated with "category 1".
The contents of the predicate dictionary data are prepared in advance. Around 500 entries suffice as an example, but a richer set of entries may be prepared.
The predicate detection unit 14 adds information on each opinion predicate it detects to the parse tree data. In other words, it attaches the opinion predicate information to nodes of the parse tree. The resulting data is described later.
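The dictionary lookup and node annotation described above could be sketched as follows. The two dictionary entries come from the FIG. 4 example; the nested-list tree format and the `["predicate", category]` annotation are assumptions, since the annotated data format is described elsewhere in the document. The sketch matches surface forms only and omits the handling of conjugation and collocations that the text attributes to the predicate detection unit.

```python
# Entries taken from the FIG. 4 example; the key format is an assumption.
PREDICATE_DICT = {"good": "category 2", "boring": "category 1"}


def annotate_predicates(node, dictionary=PREDICATE_DICT):
    """Return a copy of the tree with categories attached to opinion predicates."""
    if isinstance(node, str):
        return node
    is_word = len(node) == 2 and node[0] == "word" and isinstance(node[1], str)
    if is_word:
        category = dictionary.get(node[1].lower())
        if category is None:
            return list(node)
        # Hypothetical annotation format: an extra child on the word node.
        return ["word", node[1], ["predicate", category]]
    return [annotate_predicates(child, dictionary) for child in node]


# Hypothetical parse tree in nested-list form: (parent child child).
tree = [
    ["word", "but"],
    [["word", "is"], ["chunk", ["word", "The"], ["word", "theme"]], ["word", "good"]],
    [["word", "is"], ["chunk", ["word", "the"], ["word", "content"]], ["word", "boring"]],
]
annotated = annotate_predicates(tree)
```

Because the annotation is attached to the node where the predicate occurs, the category keeps its position in the sentence structure, which is what lets later processing use structural features instead of plain frequencies.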

The modality detection unit 15 reads the S-expression data for the parse tree and analyzes it to detect the modalities in the input sentence. However, the detection itself does not use the structure of the parse tree: the modality detection unit 15 obtains the word sequence of the input sentence from the S-expression it has read and detects modalities from that sequence.

FIG. 5 is a schematic diagram showing the structure of the semantic category dictionary data and example entries. As shown, the semantic category dictionary data is organized as a table with two data items: type and semantic category. In the example data shown, semantic categories such as request, recommendation, intention, and desire belong to the type "modality". There are also other kinds of expressions similar to modalities, and all of these are collectively called functional expressions. As FIG. 5 shows, modality is one of the types of functional expression. The types other than modality include subjunctive expressions (接続法表現), voice-related, tense-related, aspect-related (様相関連), case-related, and others, and the relationships between these types and the semantic categories they contain are also stored in the semantic category dictionary data.
By referring to this semantic category dictionary data, the modality detection unit 15 can detect modalities and other functional expressions in the same way. The description below therefore does not distinguish between detecting modalities and detecting other functional expressions.
The "type" or the "semantic category" shown in FIG. 5, or a combination of the two, serves as the modality type (functional expression type) representing the type of a functional expression.

In Japanese, modalities and other functional expressions are expressed as sequences of auxiliary verbs and similar elements that follow a verb in the sentence. In some cases, a sequence following a verb that itself contains verbs or nouns expresses a modality. Such sequences express the modality, aspect, tense, voice, and so on shown in FIG. 5.

To detect the modalities and other functional expressions of a sentence correctly, two kinds of ambiguity in Japanese must be resolved: semantic ambiguity and range ambiguity. Semantic ambiguity means that a single expression can represent several different modalities. Range ambiguity is a problem that arises because Japanese is an agglutinative language: there can be more than one way to delimit the expression following a verb, that is, to identify how far the sequence that follows the verb extends.

To solve this problem and detect modality and other functional expressions correctly, the modality detection unit 15 first explores, based on the word sequence of the input sentence obtained from the output of the dependency analysis unit 12, all the possibilities arising from the two kinds of ambiguity above, and generates a word lattice structure from the result. This processing refers to functional expression dictionary data prepared in advance. Among the word sequences contained in the input sentence, any sequence that matches an entry in the functional expression dictionary data is regarded as a candidate functional expression in the input sentence, and an edge corresponding to each candidate is added to the lattice. Here, the functional expression dictionary data is data that has a large number of entries, each entry being a word sequence that can be a functional expression.
The modality detection unit 15 then selects the probabilistically optimal path (maximum likelihood path) from among the many paths contained in the generated lattice structure (paths connecting the beginning of the sentence to its end), and extracts the edges on the selected optimal path as the functional expressions in the input sentence.
A Markov model is used as the probabilistic model for selecting the optimal path from the lattice structure. The features used are the surface string, the part of speech, and the semantic attribute of the functional expression. The model is a first-order Markov model that handles 2-grams (bigrams), and Katz back-off smoothing is used for low-frequency connection probabilities. The Markov model is trained in advance using gold-standard data to which correct answers have been assigned manually or by similar means.
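The lattice construction and maximum likelihood path selection described above can be illustrated with a minimal sketch. It is not the patented implementation: the dictionary entries, the romanized tokens, the label set, and every probability value are hypothetical, the features are reduced to a single label per edge, and a fixed floor value stands in for Katz back-off smoothing.

```python
import math

# Hypothetical functional expression dictionary: word sequence -> semantic attribute.
FUNC_DICT = {("te", "hoshii"): "desire", ("hoshii",): "desire"}

# Hypothetical bigram log-probabilities between edge labels. A real system
# would estimate these from annotated data and smooth them with Katz back-off.
BIGRAM_LOGP = {
    ("<s>", "WORD"): math.log(0.9),
    ("WORD", "WORD"): math.log(0.5),
    ("WORD", "desire"): math.log(0.3),
}
UNSEEN_LOGP = math.log(1e-4)  # crude stand-in for back-off smoothing


def build_lattice(words, func_dict):
    """Return edges (start, end, label) for every candidate reading."""
    edges = []
    for i in range(len(words)):
        edges.append((i, i + 1, "WORD"))  # plain-word reading of each token
        for j in range(i + 1, len(words) + 1):
            label = func_dict.get(tuple(words[i:j]))
            if label is not None:
                edges.append((i, j, label))  # candidate functional expression
    return edges


def best_path(words, edges, bigram_logp, unseen_logp):
    """Viterbi search for the most probable path from sentence start to end."""
    n = len(words)
    # best[(position, last label)] = (score, previous key, edge used to arrive)
    best = {(0, "<s>"): (0.0, None, None)}
    for pos in range(n):
        arrivals = [(k, v) for k, v in best.items() if k[0] == pos]
        for start, end, label in edges:
            if start != pos:
                continue
            for prev_key, (score, _, _) in arrivals:
                new = score + bigram_logp.get((prev_key[1], label), unseen_logp)
                key = (end, label)
                if key not in best or new > best[key][0]:
                    best[key] = (new, prev_key, (start, end, label))
    key = max((k for k in best if k[0] == n), key=lambda k: best[k][0])
    path = []
    while best[key][2] is not None:  # walk back-pointers to the start
        path.append(best[key][2])
        key = best[key][1]
    return path[::-1]


words = ["housou", "shi", "te", "hoshii"]
lattice = build_lattice(words, FUNC_DICT)
path = best_path(words, lattice, BIGRAM_LOGP, UNSEEN_LOGP)
functional = [edge for edge in path if edge[2] != "WORD"]
```

In this toy lattice the range ambiguity is whether the functional expression covers “te hoshii” or only “hoshii”; with the assumed probabilities the search keeps the two-word edge.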

The modality detection unit 15 then refers to the semantic category dictionary data described above and, from the functional expressions detected above, selects only those that fall under a semantic category defined in the semantic category dictionary data. In other words, by adjusting in advance the semantic categories stored in the semantic category dictionary data, the modality detection unit 15 can be made to select only the functional expressions needed for the application. For example, the modality detection unit 15 may select all functional expressions, only modalities, or only the expressions corresponding to an arbitrary subset of all functional expressions. When the semantic category dictionary data illustrated in FIG. 5 is used, the modality detection unit 15 selects all modalities and some of the other functional expressions.
The modality detection unit 15 adds information on the detected modality to the parse tree data. In other words, it attaches information on the modality type and semantic category to the nodes of the parse tree. The resulting data is described next.
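As a rough illustration of how the dictionary contents control which functional expressions are kept, the following sketch filters detected expressions against a type-to-category table. The class, the field names, and the table contents are assumptions made for the example and do not reproduce FIG. 5.

```python
from typing import NamedTuple


class FunctionalExpression(NamedTuple):
    surface: str
    expr_type: str          # e.g. "modality", "tense"
    semantic_category: str  # e.g. "desire", "past"


# Hypothetical semantic category dictionary: type -> semantic categories to keep.
SEMANTIC_CATEGORY_DICT = {
    "modality": {"request", "recommendation", "intention", "desire"},
    "conjunction": {"but"},
}


def select_expressions(detected, category_dict):
    """Keep only expressions whose semantic category is listed for their type."""
    return [
        e for e in detected
        if e.semantic_category in category_dict.get(e.expr_type, set())
    ]


detected = [
    FunctionalExpression("te hoshii", "modality", "desire"),
    FunctionalExpression("ta", "tense", "past"),
    FunctionalExpression("ga", "conjunction", "but"),
]
selected = select_expressions(detected, SEMANTIC_CATEGORY_DICT)
# Restricting the table to the modality type keeps modalities only.
modality_only = select_expressions(
    detected, {"modality": SEMANTIC_CATEGORY_DICT["modality"]}
)
```

Changing the table, rather than the detection code, is what adapts the selection to a given application.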

FIG. 6 is a schematic diagram showing the data that results when the predicate detection unit 14 and the modality detection unit 15 add information on opinion predicates and information on modality, respectively, to the parse tree data shown in FIG. 2.
As shown in the figure, the predicate detection unit 14 adds the opinion predicate information “(pred-category 1) (pred-surface good)” to the “(word good)” portion of the original parse tree data, giving “(chunk (word good) (pred-category 1) (pred-surface good))”. Similarly, the predicate detection unit 14 adds the opinion predicate information “(pred-category 2) (pred-surface boring)” to the “(word boring)” portion of the original parse tree data, giving “(chunk (word boring) (pred-category 2) (pred-surface boring))”.
Likewise, the modality detection unit 15 adds the modality information “(modality-type but+determinate)” to the “(word but)” portion of the original parse tree data, giving “(chunk (word but) (modality-type but+determinate))”.
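The annotation step can be sketched as a small S-expression round trip. This is only an illustration: the overall layout of the input string is hypothetical (the data of FIG. 2 is not reproduced here), and the helper names are invented for the example.

```python
def parse_sexpr(text):
    """Parse a parenthesised S-expression into nested Python lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        items = []
        pos += 1  # skip the opening parenthesis
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                sub, pos = read(pos)
                items.append(sub)
            else:
                items.append(tokens[pos])
                pos += 1
        return items, pos + 1  # skip the closing parenthesis

    tree, _ = read(0)
    return tree


def to_sexpr(node):
    """Serialise nested lists back into S-expression text."""
    if isinstance(node, list):
        return "(" + " ".join(to_sexpr(child) for child in node) + ")"
    return node


def annotate(node, word, extra):
    """Append copies of `extra` to every chunk whose (word ...) entry is `word`."""
    if not isinstance(node, list):
        return
    if node and node[0] == "chunk" and ["word", word] in node:
        node.extend(list(item) for item in extra)
    for child in node:
        annotate(child, word, extra)


# Hypothetical input layout; only the chunk/word shape follows the description.
tree = parse_sexpr("(s (chunk (word good)) (chunk (word but)) (chunk (word boring)))")
annotate(tree, "good", [["pred-category", "1"], ["pred-surface", "good"]])
annotate(tree, "boring", [["pred-category", "2"], ["pred-surface", "boring"]])
annotate(tree, "but", [["modality-type", "but+determinate"]])
annotated = to_sexpr(tree)
```

Because the added items are ordinary list elements of the chunk, the annotated tree stays a valid S-expression and can be passed on unchanged to later stages.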

FIG. 7 is a schematic diagram that shows visually the parse tree corresponding to the data in the previous figure. As shown, the root node “が / but” carries the modality information “meaning ‘but’ + determinate”. This corresponds to “(modality-type but+determinate)” in the previous figure. Each of the nodes “良い / good” and “つまらん / boring” carries the opinion predicate information “opinion-holding predicate”. These correspond to “(pred-category 1) (pred-surface good)” and “(pred-category 2) (pred-surface boring)” in the previous figure, respectively.

As the above data shows, the input sentence in this example contains two opinions, “good” and “boring”, which contradict each other. Moreover, the words that express these opinions, “good” and “boring”, each appear only once in the sentence. A simple bag-of-words approach therefore cannot classify this sentence appropriately.
The opinion classification device 1 of this embodiment, by contrast, uses the opinion predicates, opinion-bearing modalities, and functional expressions detected by the predicate detection unit 14 and the modality detection unit 15, and makes an integrated judgment that also takes into account their positions in the structure of the syntax tree.
To that end, the S-expression data representing the parse tree to which the opinion predicate information and the modality information have been attached is the input to the integrated judgment unit 17.

Next, the processing of the integrated judgment unit 17 is described in detail.
The integrated judgment unit 17 classifies input sentence data into predetermined sentence categories, using as its feature space the parse tree to which opinion predicate information and modality information have been added (below, this parse tree data is called feature data for convenience). For example, when the input sentence data is an opinion about a broadcast program, the integrated judgment unit 17 classifies the input sentence into one of eight sentence categories: (1) positive opinion (strong opinion), (2) negative opinion (strong opinion), (3) something the viewer thought on watching the program (weak opinion), (4) something the viewer learned by watching the program (weak opinion), (5) request concerning the program, (6) question about the program, (7) other opinion (an opinion that fits none of the above), and (8) input sentence that is not an opinion. Although eight sentence categories are set here, this is not a limitation; in general, any set of categories can be used.

As described above, the integrated judgment unit 17 performs machine learning in advance using the tree kernel SVM technique and, once trained, classifies the feature data corresponding to an input sentence.
Here, classifying the feature data corresponding to an input sentence reduces to judging, for each element of the category set described above, whether or not the feature data falls under that element.
The processing method adopted here is a kind of kernel method. In general, a kernel method maps feature data (the feature space) into a Euclidean space and applies an SVM to partition that Euclidean space with a hyperplane. In this embodiment, this hyperplane is the criterion for judging whether each item of feature data corresponding to an input sentence falls under each category. Here, the similarity between items of feature data, which corresponds to the inner product between feature vectors in the Euclidean space, is calculated as follows. That is, the integrated judgment unit 17 calculates the similarity S(T₁, T₂) between a first parse tree T₁ and a second parse tree T₂ by expressions (1) and (2) below.
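Expressions (1) and (2) do not survive in this text. Going only by the definitions that the description gives for PT₁, PT₂, and m(t_i, t_j), one form consistent with them would be the following. The values 1 and 0 for m are an assumption, since the description says only that m returns a value depending on whether the two trees match.

```latex
S(T_1, T_2) = \sum_{t_i \in PT_1} \sum_{t_j \in PT_2} m(t_i, t_j) \tag{1}

m(t_i, t_j) =
\begin{cases}
1 & \text{if } t_i \text{ and } t_j \text{ match} \\
0 & \text{otherwise}
\end{cases} \tag{2}
```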

Here, PT₁ and PT₂ are the sets of all subtrees of the parse trees T₁ and T₂, respectively, and m(t_i, t_j) is a function that returns a value according to whether or not tree t_i and tree t_j match. Tree t_i and tree t_j match when the two are completely identical, although a different definition of a match may be supplied as appropriate.
In other words, the similarity S(T₁, T₂) is determined by how many subtrees of the parse tree T₁ match subtrees of T₂.
As concrete processing, to calculate the similarity S(T₁, T₂), the integrated judgment unit 17 enumerates all subtrees of the parse trees T₁ and T₂ and judges, by brute force, whether each pair of elements drawn from the two subtree sets matches.
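The brute-force comparison can be sketched in a few lines. The sketch makes two assumptions that the description leaves open: a subtree is taken to be a node together with all of its descendants, and a match counts as 1. The tuple encoding of trees and the node labels are likewise invented for the example.

```python
def all_subtrees(tree):
    """Yield the complete subtree rooted at every node.

    A tree is a tuple (label, child, child, ...); a leaf is a 1-tuple.
    """
    yield tree
    for child in tree[1:]:
        yield from all_subtrees(child)


def similarity(t1, t2):
    """Count the pairs of subtrees, one from each tree, that are identical."""
    return sum(
        1
        for a in all_subtrees(t1)
        for b in all_subtrees(t2)
        if a == b  # exact-match definition of m(t_i, t_j)
    )


# Hypothetical trees that differ in a single leaf.
T1 = ("but", ("good", ("theme",)), ("boring", ("content",)))
T2 = ("but", ("good", ("story",)), ("boring", ("content",)))

s_12 = similarity(T1, T2)  # ("boring", ("content",)) and ("content",) match
s_11 = similarity(T1, T1)  # every subtree of T1 matches itself once
```

The pairwise loop costs time proportional to the product of the two subtree counts, which is acceptable for sentence-sized trees.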

That is, the integrated judgment unit 17 uses data representing the set of subtrees of the read parse tree data as the feature data. The integrated judgment unit 17 also holds a model that has been machine-learned in advance from parse tree data to which opinion predicate information and modality information have been added for given correct example sentences, together with the sentence category (the correct classification) corresponding to each correct example sentence. In other words, at the learning stage, the integrated judgment unit 17 is given a large quantity of paired data, each pair consisting of the parse tree data obtained by analyzing an example sentence and the correct category to which that sentence belongs. The parse tree data may be obtained by having the morphological analysis unit 11, the dependency analysis unit 12, the predicate detection unit 14, and the modality detection unit 15 analyze the example sentences as appropriate. The integrated judgment unit 17 then performs machine learning by SVM on this gold-standard data and builds an internal model. Physically, the model is a set of parameter values obtained as a result of the learning process and stored in memory.
Once machine learning has been completed in advance, the integrated judgment unit 17 refers to this model to judge whether or not an input sentence belongs to each category.

The processing described above can be summarized as follows. The predicate detection unit 14 detects opinion predicates and outputs the predicate category corresponding to each predicate. Judgments can therefore use not only the surface string of a predicate but also its predicate category as a feature. The modality detection unit 15 detects modality and outputs its semantic category. Judgments can therefore use the semantic category as a feature.
The opinion classification device 1 equipped with the integrated judgment unit 17 described above uses as features the information represented by the parse tree of the input sentence: the surface forms of the expressions (words and word sequences) the sentence contains, the dependency structure of the sentence, the types of opinion predicate it contains (predicate categories), the types of modality it contains (modality types), and the positions of those opinion predicates and modalities within the dependency structure. It does so without analytically deriving the relationship between these features and the classification of the sentence (the category to which the sentence belongs); instead, it learns from examples how to judge which category an input sentence belongs to.

In this way, meanings such as “negation”, “desire”, and “question” can be detected in the predicates of a sentence. For example, when extracting from a large number of sentences a list of only the requests concerning broadcast programs, a keyword search for the expression “〜してほしい” (“I would like you to ...”) alone is not sufficient, whereas searching for predicates that carry the meaning “desire” retrieves a larger and more appropriate list of requests. Search performance and classification performance higher than those of the prior art are obtained in this way.

The opinion classification device 1 according to this embodiment can be implemented using electronic circuits. Some or all of the functions of the device may also be implemented by a computer.
When the device is configured using a computer, it may be realized by recording a program for implementing the necessary functions on a computer-readable recording medium and having a computer system read and execute the program recorded on that medium. The “computer system” here includes an OS and hardware such as peripheral devices. The “computer-readable recording medium” refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into a computer system. The “computer-readable recording medium” may further include media that hold the program dynamically for a short time, such as a communication line used when the program is transmitted over a network such as the Internet or over a communication line such as a telephone line, and media that hold the program for a certain period, such as the volatile memory inside the computer system acting as server or client in that case. The program may implement only part of the functions described above, or may implement those functions in combination with a program already recorded in the computer system.

The embodiment has been described above, but the present invention can also be carried out in the following modifications.

<Modification 1> In Modification 1, the predicate detection unit 14 is removed from the functional blocks shown in FIG. 1 of the above embodiment. In this case, no information on opinion predicates is added to the nodes of the parse tree. The integrated judgment unit 17 nevertheless performs machine learning using the parse tree data to which the modality detection unit 15 has added modality information, and uses the learning result to classify input sentence data.

<Modification 2> In the above embodiment, the dependency analysis unit 12 performs phrase structure grammar analysis, and the resulting parse tree is used. The parse trees shown in FIGS. 2, 3, 6, and 7 are based on phrase structure grammar analysis. In Modification 2, the dependency analysis unit 12 instead outputs parse tree data by performing dependency structure grammar analysis. FIG. 8 is a schematic diagram showing a parse tree obtained by dependency structure grammar analysis. As shown, this parse tree consists of four nodes: “テーマは / The theme”, “良いが / is good, but”, “内容が / the content”, and “つまらん / is boring”. The root node is “つまらん / is boring”, which has two child nodes, “良いが / is good, but” and “内容が / the content”; “良いが / is good, but” in turn has “テーマは / The theme” as its child node. Opinion predicate information (opinion-holding predicate) is attached to each of the nodes “良いが / is good, but” and “つまらん / is boring”. Functional expression type information (meaning “BUT” + determinate) is also attached to the node “良いが / is good, but”.
Dependency structure grammar analysis is also commonly used in Japanese dependency parsing. The tree structure obtained from such analysis can likewise be represented using S-expressions or other data structures.
Parse trees with still other structures, produced by other analysis methods, may also be used.
Whichever of these structures is used, each parse tree contains information representing the syntax of the sentence, together with predicate category and functional expression type information and information on the positions of those predicates and functional expressions within the syntax. In Modification 2, therefore, the integrated judgment unit 17 can likewise classify sentences appropriately.

<Modification 3> In the above embodiment, the modality detection unit 15 detects not only modality but also other functional expressions. In Modification 3, the modality detection unit 15 detects modality only. Accordingly, the semantic category dictionary data used by the modality detection unit 15 stores only entries relating to modality and no entries relating to other functional expressions.

The embodiment of the present invention has been described in detail above with reference to the drawings. The specific configuration is not limited to this embodiment, however, and also includes designs and the like within a scope that does not depart from the gist of the invention.
For example, although the above embodiment uses S-expressions to represent the parse tree, data in other formats may be used instead. The parse tree may, for example, be represented by linking the memory blocks corresponding to the nodes with pointers (data representing the address pointed to). It may also be represented by using a relational database to hold the attributes of each node and the relationships between nodes.
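The two alternative representations mentioned above can be sketched side by side. In this illustration, object references play the role of pointers and a list of rows plays the role of a relational table; the class, the field names, and the row layout are hypothetical, and the node labels follow the dependency tree described for FIG. 8.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)  # references act as pointers


def to_rows(root):
    """Flatten a linked tree into (node_id, parent_id, label, attrs) rows."""
    rows = []

    def visit(node, parent_id):
        node_id = len(rows)  # ids are assigned in depth-first order
        rows.append((node_id, parent_id, node.label, dict(node.attrs)))
        for child in node.children:
            visit(child, node_id)

    visit(root, None)
    return rows


# Linked-node form of the four-node dependency tree.
theme = Node("The theme")
good = Node("is good, but", {"predicate": "opinion-holding"}, [theme])
content = Node("the content")
root = Node("is boring", {"predicate": "opinion-holding"}, [good, content])

rows = to_rows(root)
```

Each row holds a node's attributes plus a parent reference, so the tree shape can be recovered from the table alone.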

<Demonstration experiment and results> To verify the effectiveness of the present invention, the opinion classification device of the above embodiment was actually built on a computer, and an experiment in classifying input sentences was carried out.
FIGS. 9(a) to 9(c) are tables showing the sentence classification accuracy obtained in this experiment.
FIG. 9(a) shows a comparison between the method of the present invention and a baseline using the bag-of-words method. As shown there, the accuracy of the method of the present invention is higher than the baseline both when word surface forms, predicate surface forms, and predicate categories are used as features and when predicate surface forms, predicate categories, and modality types are used.
FIG. 9(b) shows a comparison between the method of the present invention and a baseline that uses only word surface forms as features; the accuracy of the method of the present invention is higher. FIG. 9(c) likewise shows a comparison against a baseline that uses only word surface forms as features, and again the accuracy of the method of the present invention is higher.
The effectiveness of the method of the present invention was thus demonstrated.

The present invention can be used to classify a large volume of sentences written in natural language into categories efficiently and accurately. In particular, businesses that sell products or provide services can use it to classify the large volume of opinion text fed back by their users into categories efficiently and accurately. It can be used, for example, in the broadcasting business.

DESCRIPTION OF SYMBOLS
1 Opinion classification device
11 Morphological analysis unit
12 Dependency analysis unit
14 Predicate detection unit
15 Modality detection unit (functional expression detection unit)
17 Integrated judgment unit (judgment unit)

Claims

An opinion classification device comprising:
a morphological analysis unit that reads input sentence data and performs morphological analysis processing;
a dependency analysis unit that performs dependency analysis processing based on the input sentence data that has undergone morphological analysis processing, and outputs parse tree data;
a predicate detection unit that detects an opinion predicate from the parse tree data and adds, to the parse tree data, opinion predicate information including a predicate category representing the type of the detected opinion predicate;
a functional expression detection unit that detects a functional expression in the input sentence data based on the input sentence data that has undergone morphological analysis processing by the morphological analysis unit, and adds, to the parse tree data, functional expression information including a functional expression type representing the type of the detected functional expression; and
a determination unit that reads the parse tree data to which the opinion predicate information has been added by the predicate detection unit and the functional expression information has been added by the functional expression detection unit, and determines whether or not the parse tree data falls under a predetermined sentence category, using feature data of the opinion predicate information and the functional expression information added to the read parse tree data,
wherein the determination unit includes a model machine-learned in advance from the parse tree data to which the opinion predicate information and the functional expression information have been added for correct example sentences and the sentence categories corresponding to those correct example sentences, and determines whether or not the parse tree data falls under the predetermined sentence category by referring to the model.

The opinion classification device according to claim 1, wherein the functional expression detection unit is a modality detection unit that detects modality as the functional expression of the input sentence data and adds, to the parse tree data, functional expression information including a functional expression type representing the type of the detected modality.

The opinion classification device according to claim 1 or 2, wherein the determination unit uses information on the branches of the read parse tree data as the feature data.

A program that causes a computer to function as an opinion classification device comprising:
a morphological analysis unit that reads input sentence data and performs morphological analysis processing;
a dependency analysis unit that performs dependency analysis processing based on the input sentence data that has undergone morphological analysis processing, and outputs parse tree data;
a predicate detection unit that detects an opinion predicate from the parse tree data and adds, to the parse tree data, opinion predicate information including a predicate category representing the type of the detected opinion predicate;
a functional expression detection unit that detects a functional expression in the input sentence data based on the input sentence data that has undergone morphological analysis processing by the morphological analysis unit, and adds, to the parse tree data, functional expression information including a functional expression type representing the type of the detected functional expression; and
a determination unit that reads the parse tree data to which the opinion predicate information has been added by the predicate detection unit and the functional expression information has been added by the functional expression detection unit, and determines whether or not the parse tree data falls under a predetermined sentence category, using feature data of the opinion predicate information and the functional expression information added to the read parse tree data,
wherein the determination unit includes a model machine-learned in advance from the parse tree data to which the opinion predicate information and the functional expression information have been added for correct example sentences and the sentence categories corresponding to those correct example sentences, and determines whether or not the parse tree data falls under the predetermined sentence category by referring to the model.

The program according to claim 4, which causes a computer to function as the opinion classification device, wherein the functional expression detection unit is a modality detection unit that detects modality as the functional expression of the input sentence data and adds, to the parse tree data, functional expression information including a functional expression type representing the type of the detected modality.