JP2008305291A

JP2008305291A - Information processing apparatus, information processing method, and program

Info

Publication number: JP2008305291A
Application number: JP2007153518A
Authority: JP
Inventors: Naoki Kamimaeda; 直樹上前田
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2007-06-11
Filing date: 2007-06-11
Publication date: 2008-12-18

Abstract

[Problem] To reduce memory capacity while maintaining the accuracy of assigning parts of speech to words.
[Solution] Based on co-occurrence probabilities stored in a co-occurrence probability table 63, each being the probability that two parts of speech co-occur, a part-of-speech candidate determination unit 101 determines attention part-of-speech candidates. These are candidates for the part of speech of an attention word (the word currently of interest among the words constituting a word string) that may co-occur with adjacent part-of-speech candidates, which are candidates for the part of speech of an adjacent word immediately before or after the attention word. A word appearance probability determination unit 102 determines, based on the co-occurrence probability of an adjacent part-of-speech candidate and an attention part-of-speech candidate, a word appearance probability, which is the probability that the attention word appears with the attention part-of-speech candidate as its part of speech. A word lattice generation unit 103 generates a word lattice based on the co-occurrence probabilities for adjacent words in the word string and the word appearance probability of each word constituting the word string. The present invention can be applied to, for example, a morphological analysis engine.
[Selected drawing] FIG. 9

Description

The present invention relates to an information processing apparatus, an information processing method, and a program, and more particularly to an information processing apparatus, an information processing method, and a program that can, for example, reduce memory capacity while maintaining the accuracy of assigning parts of speech to words.

For example, a morphological analysis engine that assigns a part of speech to each word constituting a sentence does so by performing morphological analysis processing on the sentence.

That is, suppose a given word in a sentence is taken as the attention word (the word currently of interest). If the attention word is a known word, meaning that the information needed to assign its part of speech is stored in a dictionary database, the morphological analysis engine assigns the part of speech to the attention word based on the dictionary database it holds.

On the other hand, if the attention word is an unknown word that is not stored in the dictionary database, the morphological analysis engine assigns a part of speech to it based on, for example, n-gram co-occurrence probabilities, that is, probabilities that n parts of speech co-occur, obtained in advance from a learning corpus such as a collection of sample sentences.

For example, when parts of speech are assigned to the words t1, t2, and t3 of the sentence "t1 (known word), t2 (unknown word), t3 (known word)", the words t1 and t3 are assigned the parts of speech stored in the dictionary database.

Then, among the 3-gram co-occurrence probabilities with which the part of speech of t1, a candidate part of speech for t2, and the part of speech of t3 co-occur in this order, the maximum is found, and the candidate for t2 that yields this maximum is assigned to t2 as its part of speech.
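The selection described above can be sketched as a simple argmax over candidate parts of speech. This is a minimal illustration, not the engine's actual implementation: the function name `tag_unknown_between`, the tag names, and the probability values in `TRIGRAM_PROB` are all hypothetical.

```python
# Hypothetical 3-gram part-of-speech co-occurrence probabilities (made-up values).
TRIGRAM_PROB = {
    ("DET", "NOUN", "VERB"): 0.20,
    ("DET", "ADJ", "VERB"): 0.01,
    ("DET", "VERB", "VERB"): 0.02,
}


def tag_unknown_between(prev_pos, next_pos, candidates, trigram_prob):
    """Return the candidate for the unknown middle word that maximizes
    the 3-gram co-occurrence probability (prev_pos, candidate, next_pos).
    Triples missing from the table are treated as probability 0.0."""
    return max(
        candidates,
        key=lambda c: trigram_prob.get((prev_pos, c, next_pos), 0.0),
    )


# t1 is a known DET, t3 is a known VERB, t2 is unknown.
best = tag_unknown_between("DET", "VERB", ["NOUN", "ADJ", "VERB"], TRIGRAM_PROB)
```

With these made-up values the unknown word t2 is tagged with the candidate whose triple has the largest table entry.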

Non-Patent Document 1 discloses morphological analysis engines that divide a sentence into morphemes and assign a part of speech and a stem to each morpheme by using various models, such as the HMM-based model, maximum entropy model, conditional Markov model, conditional random fields, and cyclic dependency networks disclosed in Non-Patent Documents 2 and 3.

Masayuki Asahara, "Natural Language Processing Technology for Constructing Morphological Analyzers", International Symposium Comparative Vocabulary Research X, 2006.
Taku Kudo, Kaoru Yamamoto, Yuji Matsushita, "Japanese Morphological Analysis Using Conditional Random Fields", Information Processing Society of Japan, Natural Language Processing Study Group, SIGNL-161, 2004.
Sheila M. Reynolds and Jeff A. Bilmes, "Part-of-Speech Tagging using Virtual Evidence and Negative Training", Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 459-466, Vancouver, October 2005.

Now consider assigning parts of speech to the words t1, t2, t3, and t4 of the sentence "t1 (known word), t2 (unknown word), t3 (unknown word), t4 (known word)". The known words t1 and t4 are assigned the parts of speech stored in the dictionary database.

Then, among the 3-gram co-occurrence probabilities with which the part of speech of t1, a candidate part of speech for t2, and a candidate part of speech for t3 co-occur in this order, the maximum is found. The candidate for t2 that yields this maximum is assigned to t2 as its part of speech, and the candidate for t3 that yields this maximum is assigned to t3 as its part of speech.

However, when 3-gram co-occurrence probabilities are used, parts of speech are assigned to the unknown words t2 and t3 without considering the part of speech of the known word t4, so accurate parts of speech may not be assigned to t2 and t3.

That is, when n-1 or more unknown words occur consecutively in a sentence, assigning parts of speech to them using n-gram co-occurrence probabilities may lower the accuracy of the assignment.

To maintain accuracy when assigning parts of speech to the words of the sentence "t1 (known word), t2 (unknown word), t3 (unknown word), t4 (known word)", the part of speech of t4 must be taken into account when assigning parts of speech to t2 and t3. In other words, among the 4-gram co-occurrence probabilities with which the part of speech of t1, a candidate part of speech for t2, a candidate part of speech for t3, and the part of speech of t4 co-occur in this order, the maximum must be found; the candidate for t2 that yields this maximum must be assigned to t2 as its part of speech, and the candidate for t3 that yields this maximum must be assigned to t3 as its part of speech.

Therefore, a method that assigns parts of speech to unknown words using n-gram co-occurrence probabilities must, in order to handle runs of consecutive unknown words, store co-occurrence probabilities for multiple values of n (2-gram, 3-gram, ..., N-gram), and the memory capacity needed to store them becomes large.
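To give a feel for the memory growth, the following sketch counts table entries under the simplifying assumption that every n-gram table is stored densely, with one entry per ordered combination of parts of speech. Real tables are usually sparse, so this is an upper bound; the tag-set size of 40 is a hypothetical figure, not one taken from this document.

```python
def ngram_table_entries(num_pos, max_n):
    """Total entries needed to store dense 2-gram through max_n-gram
    co-occurrence tables over a tag set of size num_pos."""
    return sum(num_pos ** n for n in range(2, max_n + 1))


NUM_POS = 40  # hypothetical number of distinct parts of speech

bigram_only = ngram_table_entries(NUM_POS, 2)   # 2-gram table only
up_to_4gram = ngram_table_entries(NUM_POS, 4)   # 2-, 3-, and 4-gram tables
```

Under these assumptions a 2-gram table alone needs 1,600 entries, while storing 2-gram through 4-gram tables needs 2,625,600.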

The present invention has been made in view of such circumstances, and makes it possible to reduce memory capacity while maintaining the accuracy of assigning parts of speech to words.

An information processing apparatus or program according to one aspect of the present invention is an information processing apparatus that generates a word lattice from a word string, or a program that causes a computer to function as such an apparatus. The apparatus includes: part-of-speech candidate determination means for determining, based on co-occurrence probabilities stored in storage means, each being a probability obtained in advance that two parts of speech co-occur, attention part-of-speech candidates, which are candidates for the part of speech of an attention word (the word currently of interest among the words constituting the word string) and which may co-occur with adjacent part-of-speech candidates, which are candidates for the part of speech of an adjacent word immediately before or after the attention word; word appearance probability determination means for determining, based on the co-occurrence probability of an adjacent part-of-speech candidate and an attention part-of-speech candidate, a word appearance probability, which is the probability that the attention word appears with the attention part-of-speech candidate as its part of speech; and word lattice generation means for generating the word lattice based on the co-occurrence probabilities for adjacent words in the word string and the word appearance probability of each word constituting the word string.

The word appearance probability determination means can determine the co-occurrence probability of the adjacent part-of-speech candidate and the attention part-of-speech candidate as the word appearance probability of the attention word.

When there are multiple adjacent part-of-speech candidates, the word appearance probability determination means can determine, as the word appearance probability of the attention word, the maximum of the co-occurrence probabilities between each adjacent part-of-speech candidate and the attention part-of-speech candidate.

When there are multiple adjacent part-of-speech candidates, the word appearance probability determination means can instead determine, as the word appearance probability of the attention word, the sum of the co-occurrence probabilities between each adjacent part-of-speech candidate and the attention part-of-speech candidate.

The storage means can further store part-of-speech appearance probabilities, each being a probability obtained in advance that a part of speech appears, and the word appearance probability determination means can determine, as the word appearance probability of the attention word, the product of the co-occurrence probability of the adjacent part-of-speech candidate and the attention part-of-speech candidate and the part-of-speech appearance probability of the attention part-of-speech candidate.

When there are multiple adjacent part-of-speech candidates, the word appearance probability determination means can determine, as the word appearance probability of the attention word, the product of the maximum of the co-occurrence probabilities between each adjacent part-of-speech candidate and the attention part-of-speech candidate and the part-of-speech appearance probability of the attention part-of-speech candidate.
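The alternatives listed above (maximum, sum, and an optional product with the part-of-speech appearance probability) can be sketched in one function. This is an illustrative sketch only: the function name, the `mode` parameter, the table layout keyed by `(adjacent, attention)` pairs, and all probability values are assumptions. It also assumes the adjacent word precedes the attention word, and it allows combining the sum with the part-of-speech appearance probability, a combination the text does not explicitly list.

```python
# Hypothetical 2-gram co-occurrence and part-of-speech appearance probabilities.
COOC = {("DET", "NOUN"): 0.5, ("ADJ", "NOUN"): 0.3}
POS_PROB = {"NOUN": 0.4}


def word_appearance_probability(adjacent_candidates, attention_pos, cooc,
                                pos_prob=None, mode="max"):
    """Combine the co-occurrence probabilities between each adjacent
    part-of-speech candidate and the attention part-of-speech candidate.

    mode="max" takes the largest value, mode="sum" adds them up.
    If pos_prob is given, the result is multiplied by the appearance
    probability of the attention part of speech."""
    probs = [cooc.get((adj, attention_pos), 0.0) for adj in adjacent_candidates]
    if mode == "max":
        combined = max(probs, default=0.0)
    elif mode == "sum":
        combined = sum(probs)
    else:
        raise ValueError(f"unknown mode: {mode}")
    if pos_prob is not None:
        combined *= pos_prob.get(attention_pos, 0.0)
    return combined


p_max = word_appearance_probability(["DET", "ADJ"], "NOUN", COOC)
p_sum = word_appearance_probability(["DET", "ADJ"], "NOUN", COOC, mode="sum")
p_prod = word_appearance_probability(["DET", "ADJ"], "NOUN", COOC, pos_prob=POS_PROB)
```

With a single adjacent candidate, the "max" and "sum" modes both reduce to the plain co-occurrence probability, matching the simplest variant in the text.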

The co-occurrence probabilities or the part-of-speech appearance probabilities stored in the storage means can be learned in advance from a learning corpus consisting of sample sentences.

The information processing apparatus according to one aspect of the present invention can further include word part-of-speech assignment means for assigning a part of speech to the attention word based on the word lattice generated by the word lattice generation means, and output means for outputting the attention word with its assigned part of speech.

The storage means can further store a word table that associates each word with its stem. Stem assignment means can further be provided for assigning a stem to the attention word based on the word table stored in the storage means, and the output means can output the attention word with its assigned part of speech and stem.

The storage means can further store a compound word table that associates each compound word, composed of multiple words, with its part of speech. Compound-word part-of-speech assignment means can further be provided for assigning a part of speech to a compound word included in the word string based on the compound word table stored in the storage means, and the output means can further output the compound word with its assigned part of speech.

The storage means can further store word appearance probabilities obtained in advance, each being the probability that a word with a given part of speech appears. In that case, when the word appearance probability of the attention word is stored in the storage means, the word appearance probability determination means determines the word appearance probability of the attention word based on the stored value; when it is not stored, the means determines the co-occurrence probability of the attention word based on the co-occurrence probability of the adjacent part-of-speech candidate and the attention part-of-speech candidate.
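The lookup-with-fallback behavior can be sketched as follows. The function name, the table layouts, and the values are hypothetical. The sketch treats the fallback value as the attention word's word appearance probability, which is an interpretation of the paragraph above, and it uses the maximum over adjacent candidates as one of several possible fallback rules.

```python
# Hypothetical stored word appearance probabilities, keyed by (word, part of speech).
STORED = {("the", "DET"): 0.9}
# Hypothetical 2-gram co-occurrence probabilities, keyed by (adjacent, attention).
COOC = {("DET", "NOUN"): 0.5, ("ADJ", "NOUN"): 0.3}


def lookup_word_appearance_probability(word, pos, stored, adjacent_candidates, cooc):
    """Use the stored probability for (word, pos) when it exists;
    otherwise fall back to the largest co-occurrence probability between
    an adjacent part-of-speech candidate and pos."""
    key = (word, pos)
    if key in stored:
        return stored[key]
    return max(
        (cooc.get((adj, pos), 0.0) for adj in adjacent_candidates),
        default=0.0,
    )


known = lookup_word_appearance_probability("the", "DET", STORED, ["NOUN"], COOC)
unknown = lookup_word_appearance_probability("blorf", "NOUN", STORED, ["DET", "ADJ"], COOC)
```

Because only the 2-gram table is consulted in the fallback, no per-word statistics are needed for unknown words.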

The word appearance probabilities stored in the storage means can be learned in advance from a learning corpus consisting of sample sentences.

The part-of-speech candidate determination means can determine, as the attention part-of-speech candidate, the part of speech whose co-occurrence probability with the adjacent part-of-speech candidate is the maximum.
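Candidate determination can be sketched in two flavors: keeping every part of speech that may co-occur with the adjacent candidate, or keeping only the one with the maximum co-occurrence probability. The function names, the tag set, and the probability values are hypothetical, and reading "may co-occur" as "has a co-occurrence probability greater than zero" is an assumption.

```python
# Hypothetical 2-gram co-occurrence probabilities, keyed by (adjacent, attention).
COOC = {("DET", "NOUN"): 0.5, ("DET", "ADJ"): 0.3, ("DET", "VERB"): 0.0}
ALL_POS = ["NOUN", "ADJ", "VERB", "DET"]


def possible_candidates(adjacent_pos, all_pos, cooc):
    """Every part of speech with non-zero co-occurrence probability
    with the adjacent part-of-speech candidate."""
    return [p for p in all_pos if cooc.get((adjacent_pos, p), 0.0) > 0.0]


def best_candidate(adjacent_pos, all_pos, cooc):
    """The single part of speech whose co-occurrence probability with
    the adjacent part-of-speech candidate is the largest."""
    return max(all_pos, key=lambda p: cooc.get((adjacent_pos, p), 0.0))


candidates = possible_candidates("DET", ALL_POS, COOC)
top = best_candidate("DET", ALL_POS, COOC)
```

Keeping only the top candidate yields a smaller lattice, at the cost of discarding alternatives that a later word might have favored.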

An information processing method according to one aspect of the present invention is an information processing method for an information processing apparatus that generates a word lattice from a word string. The method includes the steps of: determining, based on co-occurrence probabilities stored in storage means, each being a probability obtained in advance that two parts of speech co-occur, attention part-of-speech candidates, which are candidates for the part of speech of an attention word (the word currently of interest among the words constituting the word string) and which may co-occur with adjacent part-of-speech candidates, which are candidates for the part of speech of an adjacent word immediately before or after the attention word; determining, based on the co-occurrence probability of an adjacent part-of-speech candidate and an attention part-of-speech candidate, a word appearance probability, which is the probability that the attention word appears with the attention part-of-speech candidate as its part of speech; and generating the word lattice based on the co-occurrence probabilities for adjacent words in the word string and the word appearance probability of each word constituting the word string.

In one aspect of the present invention, attention part-of-speech candidates are determined based on co-occurrence probabilities stored in storage means, each being a probability obtained in advance that two parts of speech co-occur. The attention part-of-speech candidates are candidates for the part of speech of an attention word (the word currently of interest among the words constituting the word string) that may co-occur with adjacent part-of-speech candidates, which are candidates for the part of speech of an adjacent word immediately before or after the attention word. A word appearance probability, which is the probability that the attention word appears with the attention part-of-speech candidate as its part of speech, is determined based on the co-occurrence probability of the adjacent part-of-speech candidate and the attention part-of-speech candidate. The word lattice is then generated based on the co-occurrence probabilities for adjacent words in the word string and the word appearance probability of each word constituting the word string.
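The overall flow can be sketched as building a lattice with one column of part-of-speech nodes per word and then picking the highest-scoring path through it. This is a sketch under stated assumptions: the text above does not say how a path is chosen from the lattice, so the Viterbi-style search below is a swapped-in standard technique, and the function names, data layout, and all probability values are hypothetical.

```python
def build_lattice(words, candidates, word_prob):
    """One column per word; each node is (part of speech, word appearance probability)."""
    return [[(pos, word_prob[(w, pos)]) for pos in candidates[w]] for w in words]


def best_path(lattice, cooc):
    """Viterbi-style search for the part-of-speech sequence that maximizes the
    product of word appearance probabilities and 2-gram co-occurrence
    probabilities. Assumes a non-empty lattice."""
    scores = {pos: p for pos, p in lattice[0]}
    back_pointers = []
    for column in lattice[1:]:
        new_scores, pointers = {}, {}
        for pos, p in column:
            # Best predecessor for this node.
            prev = max(scores, key=lambda q: scores[q] * cooc.get((q, pos), 0.0))
            new_scores[pos] = scores[prev] * cooc.get((prev, pos), 0.0) * p
            pointers[pos] = prev
        scores = new_scores
        back_pointers.append(pointers)
    # Trace back from the best final node.
    path = [max(scores, key=scores.get)]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))


# Hypothetical data: "blorf" is an unknown word with two candidates.
WORDS = ["the", "blorf", "runs"]
CANDIDATES = {"the": ["DET"], "blorf": ["NOUN", "VERB"], "runs": ["VERB"]}
WORD_PROB = {("the", "DET"): 1.0, ("blorf", "NOUN"): 0.5,
             ("blorf", "VERB"): 0.1, ("runs", "VERB"): 1.0}
COOC = {("DET", "NOUN"): 0.5, ("DET", "VERB"): 0.1,
        ("NOUN", "VERB"): 0.4, ("VERB", "VERB"): 0.05}

lattice = build_lattice(WORDS, CANDIDATES, WORD_PROB)
tags = best_path(lattice, COOC)
```

Only the 2-gram table `COOC` is consulted, regardless of how many unknown words appear in a row, which is the memory saving the text argues for.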

According to the present invention, memory capacity can be reduced while maintaining the accuracy of assigning parts of speech to words.

Embodiments of the present invention are described below. The correspondence between the constituent elements of the present invention and the embodiments described in the specification or drawings is exemplified as follows. This description is intended to confirm that embodiments supporting the present invention are described in the specification or drawings. Accordingly, even if an embodiment is described in the specification or drawings but is not listed here as corresponding to a constituent element of the present invention, that does not mean the embodiment does not correspond to that constituent element. Conversely, even if an embodiment is listed here as corresponding to a constituent element, that does not mean the embodiment does not correspond to constituent elements other than that one.

An information processing apparatus or program according to one aspect of the present invention is
an information processing apparatus that generates a word lattice from a word string (for example, the morphological analysis engine of FIG. 1), or a program that causes a computer to function as such an apparatus, and includes:
part-of-speech candidate determination means (for example, the part-of-speech candidate determination unit 101 in FIG. 9) for determining, based on co-occurrence probabilities (for example, the co-occurrence probabilities held in the co-occurrence probability table 63 of FIG. 1) stored in storage means (for example, the dictionary database 12 of FIG. 1), each being a probability obtained in advance that two parts of speech co-occur, attention part-of-speech candidates, which are candidates for the part of speech of an attention word among the words constituting the word string and which may co-occur with adjacent part-of-speech candidates, which are candidates for the part of speech of an adjacent word immediately before or after the attention word;
word appearance probability determination means (for example, the word appearance probability determination unit 102 in FIG. 9) for determining, based on the co-occurrence probability of an adjacent part-of-speech candidate and an attention part-of-speech candidate, a word appearance probability, which is the probability that the attention word appears with the attention part-of-speech candidate as its part of speech; and
word lattice generation means (for example, the word lattice generation unit 103 in FIG. 9) for generating the word lattice based on the co-occurrence probabilities for adjacent words in the word string and the word appearance probability of each word constituting the word string.

The information processing apparatus according to one aspect of the present invention can further include:
word part-of-speech assignment means (for example, the word part-of-speech assignment unit 104 in FIG. 9) for assigning a part of speech to the attention word based on the word lattice generated by the word lattice generation means; and
output means (for example, the compound-word part-of-speech assignment unit 35 in FIG. 9) for outputting the attention word with its assigned part of speech.

In the information processing apparatus according to one aspect of the present invention,
the storage means further stores a word table (for example, the word table 61 of FIG. 1) that associates each word with its stem,
the apparatus further includes stem assignment means (for example, the stem assignment unit 33 of FIG. 1) for assigning a stem to the attention word based on the word table stored in the storage means, and
the output means can output the attention word with its assigned part of speech and stem.

In the information processing apparatus according to one aspect of the present invention,
the storage means further stores a compound word table (for example, the compound word table 65 of FIG. 1) that associates each compound word, composed of multiple words, with its part of speech,
the apparatus further includes compound-word part-of-speech assignment means (for example, the compound-word part-of-speech assignment unit 35 of FIG. 1) for assigning a part of speech to a compound word included in the word string based on the compound word table stored in the storage means, and
the output means can further output the compound word with its assigned part of speech.

An information processing method according to one aspect of the present invention is
an information processing method for an information processing apparatus that generates a word lattice from a word string (for example, the morphological analysis engine of FIG. 1), and includes the steps of:
determining, based on co-occurrence probabilities (for example, the co-occurrence probabilities held in the co-occurrence probability table 63 of FIG. 1) stored in storage means (for example, the dictionary database 12 of FIG. 1), each being a probability obtained in advance that two parts of speech co-occur, attention part-of-speech candidates, which are candidates for the part of speech of an attention word among the words constituting the word string and which may co-occur with adjacent part-of-speech candidates, which are candidates for the part of speech of an adjacent word immediately before or after the attention word (for example, step S34 in FIG. 18);
determining, based on the co-occurrence probability of an adjacent part-of-speech candidate and an attention part-of-speech candidate, a word appearance probability, which is the probability that the attention word appears with the attention part-of-speech candidate as its part of speech (for example, step S35 in FIG. 18); and
generating the word lattice based on the co-occurrence probabilities for adjacent words in the word string and the word appearance probability of each word constituting the word string (for example, step S36 in FIG. 18).

Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 is a block diagram showing a first configuration example of an embodiment of a morphological analysis engine as an information processing apparatus to which the present invention is applied.

The morphological analysis engine of FIG. 1 includes a word analysis unit 11 and a dictionary database 12.

A document, that is, text data of a word string such as a sentence composed of multiple words, is input to the morphological analysis engine of FIG. 1, for example when a user operates an operation unit (not shown), and the document is supplied to the word analysis unit 11.

The word analysis unit 11 includes a sentence delimiter unit 31, a word delimiter unit 32, a stem assignment unit 33, a part-of-speech assignment unit 34, and a compound-word part-of-speech assignment unit 35, and performs processing that assigns a stem and a part of speech to each word in the document supplied to it.

The input document is supplied to the sentence delimiter unit 31.

The sentence delimiter unit 31 divides the document supplied to it into sentences based on a predetermined rule for dividing a document into sentences.

For example, the sentence delimiter unit 31 regards "." and "?" in the document supplied to it as boundaries between adjacent sentences, and divides the document into individual sentences. Specifically, when the document "I wanted to go to United States. So, I bought a ticket." is supplied to the sentence delimiter unit 31, the unit divides it into the two sentences "I wanted to go to United States." and "So, I bought a ticket."
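A minimal sketch of such a rule, using Python's standard `re` module, is shown below. The function name `split_sentences` is hypothetical, and the rule is deliberately naive: it treats every "." or "?" as a sentence boundary, so abbreviations such as "U.S." would be split incorrectly.

```python
import re


def split_sentences(document):
    """Split a document into sentences, treating "." and "?" as the end
    of a sentence and keeping the terminator with its sentence."""
    parts = re.findall(r"[^.?]+[.?]?", document)
    return [p.strip() for p in parts if p.strip()]


sentences = split_sentences("I wanted to go to United States. So, I bought a ticket.")
```

For the example document this yields the two sentences given in the text.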

文区切り部３１は、そこに供給された文書を分割することで得られた「I wanted to go to United States.」や「So, I bought a ticket.」等の文を、適宜、単語区切り部３２に供給する。 The sentence delimiter 31 appropriately converts sentences such as “I wanted to go to United States.” And “So, I bought a ticket” obtained by dividing the document supplied thereto into word delimiters. 32.

単語区切り部３２は、文を単語に分割する所定のルールに基づいて、文区切り部３１から供給された文を、その文を構成する単語に分割する。 The word delimiter 32 divides the sentence supplied from the sentence delimiter 31 into words constituting the sentence based on a predetermined rule for dividing the sentence into words.

即ち、例えば、単語区切り部３２は、文区切り部３１から供給された１文内の”,”や” ”(空白文字)等を隣接する単語どうしの区切りであるとみなして、文を単語に分割する。具体的には、例えば、文「I wanted to go to United States.」が、文区切り部３１から単語区切り部３２に供給されると、単語区切り部３２は、そこに供給された文を、「I」、「wanted」、「to」、「go」、「to」、「United」、「States」、および「.」の８つの単語に分割する。 That is, for example, the word delimiter 32 regards “,”, “” (blank character) or the like in one sentence supplied from the sentence delimiter 31 as a delimiter between adjacent words, and converts the sentence into words. To divide. Specifically, for example, when the sentence “I wanted to go to United States.” Is supplied from the sentence delimiter 31 to the word delimiter 32, the word delimiter 32 converts the sentence supplied thereto into “ It is divided into eight words: “I”, “wanted”, “to”, “go”, “to”, “United”, “States”, and “.”.
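The two delimiting rules described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function names `split_sentences` and `split_words` are hypothetical, and the choices to keep "." and "?" as tokens of their own while discarding commas and spaces are assumptions that merely reproduce the eight-word example given in the text.

```python
import re


def split_sentences(document: str) -> list[str]:
    """Split a document after each "." or "?" that is followed by whitespace."""
    parts = re.split(r"(?<=[.?])\s+", document.strip())
    return [p for p in parts if p]


def split_words(sentence: str) -> list[str]:
    """Split a sentence at spaces and commas.

    Assumption: "." and "?" are emitted as separate tokens, while the
    delimiters themselves (spaces and commas) are discarded.
    """
    return re.findall(r"[^\s,.?]+|[.?]", sentence)


sentences = split_sentences("I wanted to go to United States. So, I bought a ticket.")
words = split_words(sentences[0])
```

With the example document, `sentences` holds the two sentences from the text and `words` holds the eight tokens listed for the first sentence.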

The word delimiter 32 supplies the sentence divided into words to the stem assigning unit 33.

The stem assigning unit 33 assigns a stem to each word constituting the sentence supplied from the word delimiter 32 by referring to the word table 61 stored in the dictionary database 12, which associates a word, a word ID (identification) that uniquely identifies the word, and the stem of the word.

The stem assigning unit 33 also reads from the word table 61 the word ID of each word constituting the sentence supplied from the word delimiter 32, and associates each word ID with the corresponding word of the sentence.

The stem assigning unit 33 supplies the sentence, in which each word has been assigned a stem and associated with a word ID, to the part-of-speech assigning unit 34.

The part-of-speech assigning unit 34 generates a word lattice from the word string, that is, the sentence supplied from the stem assigning unit 33, by referring to two tables stored in the dictionary database 12. The first is the word appearance probability table 62, which associates a part-of-speech ID that uniquely identifies a given part of speech, the word ID of a word having that part of speech, and the word appearance probability, which is the probability that a word having that part of speech appears. The second is the co-occurrence probability table 63, which associates the part-of-speech IDs of two parts of speech with their co-occurrence probability, which is the probability that the two parts of speech co-occur.

Then, based on the word lattice generated from the word string supplied from the stem assigning unit 33, the part-of-speech assigning unit 34 determines the part-of-speech ID of the part of speech of each word constituting the sentence. By referring to the part-of-speech table 64 stored in the dictionary database 12, which associates each part of speech with its part-of-speech ID, it assigns a part of speech to each word constituting the sentence.

The part-of-speech assigning unit 34 supplies the sentence, in which a part of speech has been assigned to each word, to the compound-word part-of-speech assigning unit 35.

When the sentence supplied from the part-of-speech assigning unit 34 includes a compound word composed of a plurality of words, the compound-word part-of-speech assigning unit 35 refers to the compound word table 65 stored in the dictionary database 12, which associates a compound word, the part of speech of the compound word, and a compound word ID that uniquely identifies the compound word. It then treats the plurality of words forming the compound word as a single compound word and assigns the part of speech of the compound word to it.

The compound-word part-of-speech assigning unit 35 also outputs each compound word included in the sentence from the part-of-speech assigning unit 34, in the format "compound word/part of speech/stem", to a monitor (not shown) or the like for display.

Further, the compound-word part-of-speech assigning unit 35 outputs each word of the sentence that is not part of a compound word, in the format "word/part of speech/stem", to a monitor (not shown) or the like for display.

The dictionary database 12 stores the word table 61, the word appearance probability table 62, the co-occurrence probability table 63, the part-of-speech table 64, and the compound word table 65, and is referred to when the word analysis unit 11 assigns a stem and a part of speech to each word in the supplied document.

In the word table 61, a word, the word ID of the word, and the stem of the word are registered in association with each other.

In the word appearance probability table 62, the part-of-speech ID of a given part of speech, the word ID of a word having that part of speech, and the word appearance probability of that word are registered in association with each other.

In the co-occurrence probability table 63, two part-of-speech IDs that uniquely identify two parts of speech and the co-occurrence probability between the two parts of speech are registered in association with each other.

In the part-of-speech table 64, a part of speech and the part-of-speech ID of that part of speech are registered in association with each other.

In the compound word table 65, a compound word, the part of speech of the compound word, and the compound word ID of the compound word are registered in association with each other.

FIG. 2 is a diagram showing the word table 61 stored in the dictionary database 12 of FIG. 1.

In the word table 61 of FIG. 2, a word (its written form), the word ID of the word, and the stem of the word are associated with each other.

As described above, the stem assigning unit 33 assigns a stem to each word constituting the sentence supplied from the word delimiter 32 by referring to the word table 61 of FIG. 2, and supplies the result to the part-of-speech assigning unit 34.

Specifically, for the sentence "I wanted to go to United States.", for example, the stem assigning unit 33 assigns the stems "I", "want", "to", "go", "to", "unit", "state", and "." to the words "I", "wanted", "to", "go", "to", "United", "States", and ".", respectively.

Here, if each word to which a stem has been assigned is written in the format "word/stem", with the word first and its stem second, the stem assigning unit 33 supplies the stemmed sentence "I/I", "wanted/want", "to/to", "go/go", "to/to", "United/unit", "States/state", "./." to the part-of-speech assigning unit 34.
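The lookup against the word table 61 can be sketched as a dictionary access. The sketch below is illustrative only: the name `WORD_TABLE`, the word ID values, and the fallback for a word missing from the table (no ID, the word itself as stem) are assumptions, since the text gives neither the actual IDs nor the behavior for a missing word. Only the word-to-stem pairs come from the example above.

```python
# Hypothetical in-memory stand-in for the word table 61:
# word -> (word ID, stem). The ID values are invented for illustration.
WORD_TABLE = {
    "I": (1, "I"),
    "wanted": (2, "want"),
    "to": (3, "to"),
    "go": (4, "go"),
    "United": (5, "unit"),
    "States": (6, "state"),
    ".": (7, "."),
}


def assign_stems(words):
    """Return (word, word ID, stem) for each word of a sentence."""
    entries = []
    for word in words:
        # Assumption: a word absent from the table gets no ID and is its own stem.
        word_id, stem = WORD_TABLE.get(word, (None, word))
        entries.append((word, word_id, stem))
    return entries


def format_stemmed(entries):
    """Render each entry in the "word/stem" format used in the text."""
    return [f"{word}/{stem}" for word, _, stem in entries]


stemmed = format_stemmed(
    assign_stems(["I", "wanted", "to", "go", "to", "United", "States", "."])
)
```

Here `stemmed` reproduces the "word/stem" sequence that the stem assigning unit 33 passes on in the example.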

Next, FIG. 3 is a diagram showing the part-of-speech table 64 stored in the dictionary database 12.

In the part-of-speech table 64 of FIG. 3, a part of speech and the part-of-speech ID of that part of speech are associated with each other.

For example, for the sentence "I wanted to go to United States.", the part-of-speech assigning unit 34 assigns the parts of speech "NN", "VBD", "TO", "VB", "TO", "VBN", "NNS", and "ST" to the stemmed words "I/I", "wanted/want", "to/to", "go/go", "to/to", "United/unit", "States/state", and "./.", respectively.

Here, if each word to which a stem and a part of speech have been assigned is written in the format "word/part of speech/stem", in that order, the part-of-speech assigning unit 34 supplies the sentence "I/NN/I", "wanted/VBD/want", "to/TO/to", "go/VB/go", "to/TO/to", "United/VBN/unit", "States/NNS/state", "./ST/." to the compound-word part-of-speech assigning unit 35.

The part of speech "NN" represents a singular noun, and "VBD" represents a past-tense verb. "TO" represents the word "to", and "VB" represents a present-tense verb. "VBN" represents a past-participle verb, and "NNS" represents a plural noun. "ST" represents sentence-final punctuation.

FIG. 4 is a diagram showing the compound word table 65 stored in the dictionary database 12 of FIG. 1.

In the compound word table 65 of FIG. 4, a compound word, the part of speech of the compound word, and the compound word ID of the compound word are associated with each other.

For example, for the sentence "I wanted to go to United States.", the compound-word part-of-speech assigning unit 35 takes the words "United/VBN/unit" and "States/NNS/state", among the words "I/NN/I", "wanted/VBD/want", "to/TO/to", "go/VB/go", "to/TO/to", "United/VBN/unit", "States/NNS/state", and "./ST/.", regards them as the compound word "United States" composed of the words "United" and "States", and assigns the part of speech "Place" of that compound word to the compound word "United States".

The compound-word part-of-speech assigning unit 35 then outputs the words "I/NN/I", "wanted/VBD/want", "to/TO/to", "go/VB/go", "to/TO/to", and "./ST/.", and the compound word "united states/Place/united states".

The part of speech "Place" represents a place.
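The compound-word step can be illustrated with a short sketch. It is written under several assumptions that the text does not state: compounds are matched case-insensitively, the longest match is preferred, output is emitted in sentence order, and the names `COMPOUND_TABLE` and `merge_compounds` as well as the compound word ID value are hypothetical. The lower-cased rendering of the compound follows the "united states/Place/united states" example.

```python
# Hypothetical stand-in for the compound word table 65:
# tuple of lower-cased words -> (compound word ID, part of speech).
COMPOUND_TABLE = {("united", "states"): (1, "Place")}
MAX_LEN = max(len(key) for key in COMPOUND_TABLE)


def merge_compounds(tagged):
    """tagged: list of (word, part of speech, stem) tuples for one sentence."""
    output = []
    i = 0
    while i < len(tagged):
        # Try the longest possible compound starting at position i first.
        for n in range(min(MAX_LEN, len(tagged) - i), 1, -1):
            key = tuple(word.lower() for word, _, _ in tagged[i:i + n])
            if key in COMPOUND_TABLE:
                _, pos = COMPOUND_TABLE[key]
                text = " ".join(key)
                output.append(f"{text}/{pos}/{text}")
                i += n
                break
        else:
            # No compound starts here: emit the single word unchanged.
            word, pos, stem = tagged[i]
            output.append(f"{word}/{pos}/{stem}")
            i += 1
    return output


tagged_sentence = [
    ("I", "NN", "I"), ("wanted", "VBD", "want"), ("to", "TO", "to"),
    ("go", "VB", "go"), ("to", "TO", "to"), ("United", "VBN", "unit"),
    ("States", "NNS", "state"), (".", "ST", "."),
]
result = merge_compounds(tagged_sentence)
```

In `result`, the two entries for "United" and "States" are replaced by the single entry "united states/Place/united states", and all other words keep their "word/part of speech/stem" form.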

FIG. 5 is a diagram showing the word lattice generated by the part-of-speech assigning unit 34.

FIG. 5 shows the word lattice that the part-of-speech assigning unit 34 generates from, for example, the sentence "Time fly like an arrow."

The word lattice is composed of nodes and links.

The circles (○) in FIG. 5 represent nodes. Each node has a word constituting the sentence "Time fly like an arrow.", a part-of-speech candidate, which is a candidate for the part of speech of that word, and the word appearance probability of the word when its part of speech is that candidate. The leftmost node is the sentence-head node φ, which represents the beginning of the sentence and has the value 1 as its word appearance probability.

The arrows in FIG. 5, each connecting two adjacent nodes, one preceding and one following, represent links. Each link has the co-occurrence probability between the two part-of-speech candidates of the two adjacent nodes.

The part-of-speech assigning unit 34 generates the word lattice shown in FIG. 5 from the word string, that is, the sentence supplied from the stem assigning unit 33, obtains the maximum likelihood path of the word lattice, and assigns to each word on that path the part-of-speech candidate of its node as the part of speech of the word.

The maximum likelihood path is the path through the word lattice for which expression (1) below takes its maximum value.

$$\prod_{t=1}^{T} P(w_t \mid p_t)\, P(p_t \mid p_{t-1}) \qquad (1)$$

Here, Π denotes the product of P(w_t | p_t) P(p_t | p_{t-1}) taken as t varies from 1 to T, and T denotes the total number of words constituting the sentence supplied from the stem assigning unit 33.

Further, w_t denotes the t-th word from the beginning of the sentence, and p_t denotes a candidate for the part of speech (part-of-speech candidate) of the word w_t.

P(w_t | p_t) denotes the word appearance probability of the word w_t when its part of speech is the part-of-speech candidate p_t, and P(p_t | p_{t-1}) denotes the co-occurrence probability between the part-of-speech candidate p_{t-1} of the word w_{t-1} and the part-of-speech candidate p_t of the word w_t.

The co-occurrence probability P(p_1 | p_0) denotes the probability that the part-of-speech candidate p_1 of the word w_1 is the part of speech that appears first in the sentence.
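As an illustration of how a path maximizing expression (1) could be found, the sketch below uses the Viterbi algorithm. The text does not name the search procedure, so this choice is an assumption, as are the function name `best_path`, the marker `"<s>"` standing in for the sentence-head node φ, and treating a missing co-occurrence entry as probability 0. In the toy data, only the values 0.3 (NN followed by VBZ) and 0.4 (NN followed by NN) come from the description of FIG. 5; every other number is invented.

```python
def best_path(columns, cooccurrence, start="<s>"):
    """Return (score, tags) of the path that maximizes expression (1).

    columns: one list per word, each holding (part-of-speech candidate,
             word appearance probability P(w_t | p_t)) pairs.
    cooccurrence: dict mapping (previous tag, tag) to P(p_t | p_{t-1}).
    """
    # best[tag] = (score of the best path ending in tag, that path)
    best = {start: (1.0, [])}
    for column in columns:
        new_best = {}
        for tag, emit in column:
            candidates = [
                (score * cooccurrence.get((prev, tag), 0.0) * emit, path + [tag])
                for prev, (score, path) in best.items()
            ]
            new_best[tag] = max(candidates, key=lambda c: c[0])
        best = new_best
    return max(best.values(), key=lambda c: c[0])


# Toy two-word lattice ("time", "fly") with mostly hypothetical probabilities.
columns = [[("NN", 0.5)], [("VBZ", 0.2), ("NN", 0.1)]]
cooccurrence = {("<s>", "NN"): 0.6, ("NN", "VBZ"): 0.3, ("NN", "NN"): 0.4}
score, tags = best_path(columns, cooccurrence)
```

For this toy input the path NN, VBZ scores 0.6 × 0.5 × 0.3 × 0.2 = 0.018, which exceeds the 0.6 × 0.5 × 0.4 × 0.1 = 0.012 of the path NN, NN, so `tags` is `["NN", "VBZ"]`.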

Next, with reference to FIGS. 6 to 8, the method by which the part-of-speech assigning unit 34 generates the word lattice shown in FIG. 5 is described.

FIG. 6 is a diagram showing the word appearance probability table 62 stored in the dictionary database 12 of FIG. 1.

In the word appearance probability table 62 of FIG. 6, the part-of-speech ID of a given part of speech, the word ID of a word having that part of speech, and the word appearance probability of that word (the probability that the word with the word ID appears as a word of the part of speech with the part-of-speech ID) are associated with each other.

FIG. 7 is a diagram showing the co-occurrence probability table 63 stored in the dictionary database 12 of FIG. 1.

In the co-occurrence probability table 63 of FIG. 7, a preceding part-of-speech ID representing a given part of speech, a following part-of-speech ID representing the part of speech that follows it, and the co-occurrence probability between the part of speech represented by the preceding part-of-speech ID and the part of speech represented by the following part-of-speech ID are associated with each other.

FIG. 8 is a diagram showing an HMM (Hidden Markov Model) equivalent to the word appearance probability table 62 of FIG. 6 and the co-occurrence probability table 63 of FIG. 7.

The circles in FIG. 8 represent the states of the HMM. Each state has a given part of speech, a word having that part of speech, and the word appearance probability of that word. These correspond, respectively, to the part-of-speech ID, the word ID, and the word appearance probability held in the word appearance probability table 62 of FIG. 6.

The arrows connecting the states in FIG. 8 represent the state transitions of the HMM. Each state transition has the co-occurrence probability that the co-occurrence probability table 63 of FIG. 7 associates with the preceding part-of-speech ID of the part-of-speech candidate of the state before the transition and the following part-of-speech ID of the part-of-speech candidate of the state after the transition.

The part-of-speech assigning unit 34 generates the word lattice of FIG. 5, that is, the nodes and links constituting it, by referring to the word appearance probability table 62 of FIG. 6 and the co-occurrence probability table 63 of FIG. 7.

That is, taking the word "like", for example, among the words constituting the sentence "Time fly like an arrow." as the target word, the part-of-speech assigning unit 34 reads from the word appearance probability table 62 of FIG. 6 the part-of-speech IDs and word appearance probabilities associated with the word ID of the target word "like", and generates nodes, each having the part of speech indicated by one of those part-of-speech IDs and the corresponding word appearance probability.

In FIG. 5, three sets of a part-of-speech ID and a word appearance probability are associated with the word ID of the target word "like": the part-of-speech ID indicating "IN" with the word appearance probability 0.01, the part-of-speech ID indicating "VB" with 0.006, and the part-of-speech ID indicating "NN" with 0.002. Accordingly, a node having the part of speech "IN" and the word appearance probability 0.01 (lower center of FIG. 5), a node having "VB" and 0.006 (middle center of FIG. 5), and a node having "NN" and 0.002 (upper center of FIG. 5) are generated as the nodes for the word "like".

The part-of-speech assigning unit 34 generates the nodes for the other words constituting the sentence "Time fly like an arrow." (and the sentence-head node φ) in the same manner as for the word "like".

The part-of-speech assigning unit 34 also reads from the co-occurrence probability table 63 of FIG. 7 the co-occurrence probability between the two part-of-speech candidates of each pair of adjacent nodes, and generates a link (FIG. 5) having that co-occurrence probability.

In FIG. 5, for the adjacent words "time" and "fly" of the sentence "Time fly like an arrow.", for example, there are two co-occurrence probabilities: 0.3 between the part-of-speech candidate NN of "time" and the part-of-speech candidate VBZ of "fly", and 0.4 between the part-of-speech candidate NN of "time" and the part-of-speech candidate NN of "fly". Accordingly, a link with the co-occurrence probability 0.3 is generated between the node "time/NN" (the node whose word is "time" and whose part-of-speech candidate is NN) and the node "fly/VBZ", and a link with the co-occurrence probability 0.4 is generated between the node "time/NN" and the node "fly/NN". The part-of-speech assigning unit 34 generates the other links shown in FIG. 5 in the same manner.
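The node and link generation described above can be sketched as two table lookups. This is a simplified illustration, not the embodiment itself: the tables are keyed directly by word and part-of-speech name rather than by word ID and part-of-speech ID, `"<s>"` stands in for the sentence-head node φ, and the names `WORD_APPEARANCE`, `COOCCURRENCE`, and `build_lattice` are hypothetical. The probabilities for "like" and the two "time" to "fly" links are taken from the description of FIG. 5; the values marked hypothetical are invented.

```python
# Stand-in for the word appearance probability table 62: word -> {tag: P(word | tag)}.
WORD_APPEARANCE = {
    "time": {"NN": 0.005},                           # hypothetical value
    "fly": {"VBZ": 0.003, "NN": 0.001},              # hypothetical values
    "like": {"IN": 0.01, "VB": 0.006, "NN": 0.002},  # values from FIG. 5
}

# Stand-in for the co-occurrence probability table 63: (preceding tag, following tag) -> probability.
COOCCURRENCE = {
    ("<s>", "NN"): 0.5,   # hypothetical value
    ("NN", "VBZ"): 0.3,   # value from FIG. 5
    ("NN", "NN"): 0.4,    # value from FIG. 5
}


def build_lattice(words):
    """Return (columns, links) for a sentence made up of known words.

    columns[0] holds only the sentence-head node; columns[i] holds the
    (tag, word appearance probability) nodes of the i-th word.
    links maps (column index, tag, next column index, next tag) to the
    co-occurrence probability of the two tags.
    """
    columns = [[("<s>", 1.0)]]
    for word in words:
        columns.append(list(WORD_APPEARANCE[word.lower()].items()))
    links = {}
    for i in range(len(columns) - 1):
        for prev_tag, _ in columns[i]:
            for tag, _ in columns[i + 1]:
                prob = COOCCURRENCE.get((prev_tag, tag))
                if prob is not None:
                    links[(i, prev_tag, i + 1, tag)] = prob
    return columns, links


columns, links = build_lattice(["Time", "fly"])
```

For the two words "Time" and "fly", `links` contains the 0.3 link from "time/NN" to "fly/VBZ" and the 0.4 link from "time/NN" to "fly/NN", matching the text. Pairs of tags with no entry in the co-occurrence table get no link.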

The part-of-speech assigning unit 34 then assigns a part of speech to each word constituting the sentence supplied from the stem assigning unit 33, based on the word lattice of FIG. 5 obtained by generating the nodes and links.

As described above, the part-of-speech assigning unit 34 generates the nodes for the words constituting the sentence supplied from the stem assigning unit 33 by reading the part-of-speech IDs and word appearance probabilities for those words from the word appearance probability table 62 of FIG. 6.

However, when a target word of the sentence supplied from the stem assigning unit 33 is an unknown word, that is, a word whose part-of-speech ID and word appearance probability are not held in the word appearance probability table 62 of FIG. 6, the part-of-speech ID and word appearance probability for the target word cannot be obtained from that table, and therefore a node for the target word cannot be generated.

Therefore, when the target word is a known word, that is, a word whose part-of-speech ID and word appearance probability are stored in the word appearance probability table 62 of FIG. 6, the part-of-speech assigning unit 34 of FIG. 1 generates the node for the target word from that table as described above. When the target word is an unknown word, it generates the node for the target word from the co-occurrence probability table 63 of FIG. 7.

FIG. 9 is a block diagram showing a detailed configuration example of the part-of-speech assigning unit 34 of FIG. 1.

The part-of-speech assigning unit 34 includes a part-of-speech candidate determination unit 101, a word appearance probability determination unit 102, a word lattice generation unit 103, and a word part-of-speech assigning unit 104.

The part-of-speech candidate determination unit 101 of the part-of-speech assigning unit 34 is supplied, from the stem assigning unit 33 of FIG. 1, with a sentence composed of words to each of which a stem has been assigned.

The part-of-speech candidate determination unit 101 takes each word constituting the sentence supplied from the stem assigning unit 33 in turn as the target word and, by referring to the word appearance probability table 62 of FIG. 6 or the co-occurrence probability table 63 of FIG. 7, determines the target part-of-speech candidates, which are the candidates for the part of speech of the target word.

That is, when the part-of-speech ID for the target word is held in the word appearance probability table 62 of FIG. 6, so that the target word is a known word, the part-of-speech candidate determination unit 101 refers to that table and determines the part of speech indicated by the part-of-speech ID for the target word as a target part-of-speech candidate.

On the other hand, when the part-of-speech ID for the target word is not held in the word appearance probability table 62 of FIG. 6, so that the target word is an unknown word, the part-of-speech candidate determination unit 101 refers to the co-occurrence probability table 63 of FIG. 7 and determines, as a target part-of-speech candidate, each part of speech that may co-occur with an adjacent part-of-speech candidate, which is a candidate for the part of speech of an adjacent word, that is, the word immediately before or after the target word.

The part-of-speech candidate determination unit 101 supplies the target word, for which the target part-of-speech candidates have been determined, to the word appearance probability determination unit 102 as a target word whose part of speech is a target part-of-speech candidate.

単語出現確率決定部１０２は、図６の単語出現確率テーブル６２や図７の共起確率テーブル６３を参照することにより、品詞候補決定部１０１から供給された、品詞が注目品詞候補の注目単語の単語出現確率を決定する。 The word appearance probability determining unit 102 refers to the word appearance probability table 62 in FIG. 6 and the co-occurrence probability table 63 in FIG. 7, so that the part of speech supplied from the part of speech candidate determining unit 101 is the attention word of the target part of speech candidate. Determine the word appearance probability.

即ち、注目単語の単語出現確率が図６の単語出現確率テーブル６２に保持されており、従って、注目単語が既知語である場合、単語出現確率決定部１０２は、図６の単語出現確率テーブル６２を参照することにより、注目単語の単語出現確率を決定する。 That is, the word appearance probability of the attention word is held in the word appearance probability table 62 in FIG. 6. Therefore, when the attention word is a known word, the word appearance probability determination unit 102 selects the word appearance probability table 62 in FIG. 6. The word appearance probability of the attention word is determined by referring to.

一方、注目単語の単語出現確率が図６の単語出現確率テーブル６２に保持されておらず、従って、注目単語が未知語である場合、単語出現確率決定部１０２は、図７の共起確率テーブル６３が保持する、隣接品詞候補と注目品詞候補との共起確率を参照することにより、注目単語の単語出現確率を決定する。 On the other hand, when the word appearance probability of the attention word is not held in the word appearance probability table 62 in FIG. 6 and, therefore, the attention word is an unknown word, the word appearance probability determination unit 102 performs the co-occurrence probability table in FIG. The word appearance probability of the attention word is determined by referring to the co-occurrence probability between the adjacent part-of-speech candidate and the attention part-of-speech candidate held by 63.

文を構成する各単語の単語出現確率が決定された後、単語出現確率決定部１０２は、文を構成する各単語の単語出現確率を、単語ラティス生成部１０３に供給する。 After the word appearance probability of each word constituting the sentence is determined, the word appearance probability determination unit 102 supplies the word appearance probability of each word constituting the sentence to the word lattice generation unit 103.

単語ラティス生成部１０３は、図７の共起確率テーブル６３が保持する共起確率（文の隣接する単語どうしについての共起確率）と、単語出現確率決定部１０２から供給された、文を構成する各単語の単語出現確率とに基づいて、単語ラティス（図５）を生成し、単語品詞付与部１０４に供給する。 The word lattice generation unit 103 forms a sentence supplied from the co-occurrence probability (co-occurrence probability between adjacent words of the sentence) held in the co-occurrence probability table 63 of FIG. 7 and the word appearance probability determination unit 102. Based on the word appearance probability of each word to be generated, a word lattice (FIG. 5) is generated and supplied to the word part-of-speech adding unit 104.

単語品詞付与部１０４は、単語ラティス生成部１０３から供給された単語ラティスの最尤パスを求め、その最尤パスに基づいて、各単語の品詞の品詞IDを決定する。 The word part-of-speech providing unit 104 obtains the maximum likelihood path of the word lattice supplied from the word lattice generation unit 103 and determines the part-of-speech ID of the part of speech of each word based on the maximum likelihood path.

Further, the word part-of-speech assignment unit 104 refers to the part-of-speech table 64 of FIG. 3, assigns a part of speech to each word constituting the sentence supplied from the stem assignment unit 33 via the part-of-speech candidate determination unit 101, the word appearance probability determination unit 102, and the word lattice generation unit 103, and supplies the result to the compound-word part-of-speech assignment unit 35 of FIG. 1. Note that the sentence supplied to the word part-of-speech assignment unit 104 along this route consists of words to each of which a stem has been assigned.

Next, with reference to FIGS. 10 to 17, the word appearance probability determination process is described, in which, when the target word is an unknown word, the word appearance probability determination unit 102 of FIG. 9 determines the word appearance probability of the target word whose part of speech is the target part-of-speech candidate, based on the co-occurrence probability between the adjacent part-of-speech candidate and the target part-of-speech candidate.

FIG. 10 illustrates the word appearance probability determination process for the case where one target part-of-speech candidate has one adjacent part-of-speech candidate, for example a part-of-speech candidate (hereinafter, the preceding adjacent part-of-speech candidate) of the word immediately before the target word (hereinafter, the preceding adjacent word). In this case, the co-occurrence probability between the preceding adjacent part-of-speech candidate and the target part-of-speech candidate is determined as the word appearance probability of the target word whose part of speech is the target part-of-speech candidate.

In the following, the part of speech Pos of a word L is written as part of speech L/Pos where appropriate.

In FIG. 10, the part of speech L1/Pos1 is the preceding adjacent part-of-speech candidate Pos1 of the preceding adjacent word L1, and the part of speech L2/Pos1 is a target part-of-speech candidate of the target word L2 that may co-occur with the preceding adjacent part-of-speech candidate L1/Pos1.

Also in FIG. 10, the number 0.02 shown below the target part-of-speech candidate L2/Pos1 is the word appearance probability of the target word L2 whose part of speech is L2/Pos1.

In FIG. 10, the part of speech Pos1 exists as a part of speech of the word L2 that may co-occur with the preceding adjacent part-of-speech candidate L1/Pos1, and the co-occurrence probability P(Pos1|Pos1) between L1/Pos1 and the target part-of-speech candidate L2/Pos1 is 0.02.

When one target part-of-speech candidate L2/Pos1 has one adjacent part-of-speech candidate, for example the preceding adjacent part-of-speech candidate L1/Pos1, the word appearance probability determination unit 102 determines the co-occurrence probability P(Pos1|Pos1) = 0.02 between L1/Pos1 and L2/Pos1 as the word appearance probability of the target word L2 whose part of speech is L2/Pos1.

Next, FIG. 11 illustrates the word appearance probability determination process for the case where one target part-of-speech candidate has one adjacent part-of-speech candidate, for example a part-of-speech candidate (hereinafter, the following adjacent part-of-speech candidate) of the word immediately after the target word (hereinafter, the following adjacent word). In this case, the co-occurrence probability between the following adjacent part-of-speech candidate and the target part-of-speech candidate is determined as the word appearance probability of the target word whose part of speech is the target part-of-speech candidate.

In FIG. 11, the part of speech L3/Pos1 is the following adjacent part-of-speech candidate Pos1 of the following adjacent word L3, and the part of speech L2/Pos1 is a target part-of-speech candidate of the target word L2 that may co-occur with the following adjacent part-of-speech candidate L3/Pos1.

Also in FIG. 11, the number 0.01 shown below the target part-of-speech candidate L2/Pos1 is the word appearance probability of the target word L2 whose part of speech is L2/Pos1.

In FIG. 11, the part of speech Pos1 exists as a part of speech of the word L2 that may co-occur with the following adjacent part-of-speech candidate L3/Pos1, and the co-occurrence probability P(Pos1|Pos1) between L3/Pos1 and the target part-of-speech candidate L2/Pos1 is 0.01.

When one target part-of-speech candidate L2/Pos1 has one adjacent part-of-speech candidate, for example the following adjacent part-of-speech candidate L3/Pos1, the word appearance probability determination unit 102 determines the co-occurrence probability P(Pos1|Pos1) = 0.01 between L3/Pos1 and L2/Pos1 as the word appearance probability of the target word L2 whose part of speech is L2/Pos1.

Next, FIG. 12 illustrates the word appearance probability determination process for the case where one target part-of-speech candidate has two adjacent part-of-speech candidates, for example one preceding adjacent part-of-speech candidate and one following adjacent part-of-speech candidate. In this case, one of the two co-occurrence probabilities, namely the co-occurrence probability between the preceding adjacent part-of-speech candidate and the target part-of-speech candidate or the co-occurrence probability between the following adjacent part-of-speech candidate and the target part-of-speech candidate, is determined as the word appearance probability of the target word whose part of speech is the target part-of-speech candidate.

In FIG. 12, the part of speech L1/Pos1 is the preceding adjacent part-of-speech candidate Pos1 of the preceding adjacent word L1, and the part of speech L3/Pos1 is the following adjacent part-of-speech candidate Pos1 of the following adjacent word L3. The part of speech L2/Pos1 is a target part-of-speech candidate of the target word L2 that may co-occur with both the preceding adjacent part-of-speech candidate L1/Pos1 and the following adjacent part-of-speech candidate L3/Pos1.

Further, in FIG. 12, the number 0.02 shown below the target part-of-speech candidate L2/Pos1 is the word appearance probability of the target word L2 whose part of speech is L2/Pos1.

In FIG. 12, the part of speech Pos1 exists as a part of speech of the word L2 that may co-occur with the preceding adjacent part-of-speech candidate L1/Pos1, and the co-occurrence probability P(Pos1|Pos1) between L1/Pos1 and the target part-of-speech candidate L2/Pos1 is 0.02.

Also in FIG. 12, the part of speech Pos1 exists as a part of speech of the word L2 that may co-occur with the following adjacent part-of-speech candidate L3/Pos1, and the co-occurrence probability P(Pos1|Pos1) between L3/Pos1 and the target part-of-speech candidate L2/Pos1 is 0.01.

When one target part-of-speech candidate L2/Pos1 has two adjacent part-of-speech candidates, for example one preceding adjacent part-of-speech candidate L1/Pos1 and one following adjacent part-of-speech candidate L3/Pos1, the word appearance probability determination unit 102 takes the co-occurrence probability P(Pos1|Pos1) = 0.02 between L1/Pos1 and L2/Pos1 and the co-occurrence probability P(Pos1|Pos1) = 0.01 between L3/Pos1 and L2/Pos1, and determines, for example, the larger of the two, P(Pos1|Pos1) = 0.02, as the word appearance probability of the target word L2 whose part of speech is L2/Pos1.

FIG. 13 illustrates the word appearance probability determination process for the case where one target part-of-speech candidate has a plurality of preceding adjacent part-of-speech candidates. In this case, the largest of the co-occurrence probabilities between each preceding adjacent part-of-speech candidate and the target part-of-speech candidate is determined as the word appearance probability of the target word whose part of speech is the target part-of-speech candidate.

In FIG. 13, the preceding adjacent part-of-speech candidates L1/Pos1 to L1/Pos3 exist as a plurality of preceding adjacent part-of-speech candidates that may co-occur with the one target part-of-speech candidate L2/Pos1.

In FIG. 13, the co-occurrence probabilities P(Pos1|Pos1), P(Pos1|Pos2), and P(Pos1|Pos3) between each of the three preceding adjacent part-of-speech candidates L1/Pos1 to L1/Pos3 and the target part-of-speech candidate L2/Pos1 are 0.02, 0.01, and 0.5, respectively.

When one target part-of-speech candidate L2/Pos1 has a plurality of preceding adjacent part-of-speech candidates L1/Pos1 to L1/Pos3, the word appearance probability determination unit 102 determines the largest of the co-occurrence probabilities P(Pos1|Pos1) = 0.02, P(Pos1|Pos2) = 0.01, and P(Pos1|Pos3) = 0.5, namely P(Pos1|Pos3) = 0.5, as the word appearance probability of the target word L2 whose part of speech is L2/Pos1.

FIG. 14 illustrates the word appearance probability determination process for the case where one target part-of-speech candidate has a plurality of following adjacent part-of-speech candidates. In this case, the largest of the co-occurrence probabilities between the target part-of-speech candidate and each following adjacent part-of-speech candidate is determined as the word appearance probability of the target word whose part of speech is the target part-of-speech candidate.

In FIG. 14, the following adjacent part-of-speech candidates L3/Pos1 to L3/Pos3 exist as a plurality of following adjacent part-of-speech candidates that may co-occur with the one target part-of-speech candidate L2/Pos1.

In FIG. 14, the co-occurrence probabilities P(Pos1|Pos1), P(Pos2|Pos1), and P(Pos3|Pos1) between the target part-of-speech candidate L2/Pos1 and each of the three following adjacent part-of-speech candidates L3/Pos1 to L3/Pos3 are 0.01, 0.04, and 0.4, respectively.

When one target part-of-speech candidate L2/Pos1 has a plurality of following adjacent part-of-speech candidates L3/Pos1 to L3/Pos3, the word appearance probability determination unit 102 determines the largest of the co-occurrence probabilities P(Pos1|Pos1) = 0.01, P(Pos2|Pos1) = 0.04, and P(Pos3|Pos1) = 0.4, namely P(Pos3|Pos1) = 0.4, as the word appearance probability of the target word L2 whose part of speech is L2/Pos1.

FIG. 15 illustrates the word appearance probability determination process for the case where one target part-of-speech candidate has a plurality of preceding adjacent part-of-speech candidates and a plurality of following adjacent part-of-speech candidates. In this case, the largest value among the co-occurrence probabilities between each preceding adjacent part-of-speech candidate and the target part-of-speech candidate and the co-occurrence probabilities between the target part-of-speech candidate and each following adjacent part-of-speech candidate is determined as the word appearance probability of the target word whose part of speech is the target part-of-speech candidate.

In FIG. 15, the preceding adjacent part-of-speech candidates L1/Pos1 to L1/Pos3 exist as a plurality of preceding adjacent part-of-speech candidates that may co-occur with the one target part-of-speech candidate L2/Pos1.

In FIG. 15, the co-occurrence probabilities P(Pos1|Pos1), P(Pos1|Pos2), and P(Pos1|Pos3) between each of the three preceding adjacent part-of-speech candidates L1/Pos1 to L1/Pos3 and the target part-of-speech candidate L2/Pos1 are 0.02, 0.01, and 0.5, respectively.

Also in FIG. 15, the following adjacent part-of-speech candidates L3/Pos1 to L3/Pos3 exist as a plurality of following adjacent part-of-speech candidates that may co-occur with the one target part-of-speech candidate L2/Pos1.

In FIG. 15, the co-occurrence probabilities P(Pos1|Pos1), P(Pos2|Pos1), and P(Pos3|Pos1) between the target part-of-speech candidate L2/Pos1 and each of the three following adjacent part-of-speech candidates L3/Pos1 to L3/Pos3 are 0.01, 0.04, and 0.4, respectively.

When one target part-of-speech candidate L2/Pos1 has a plurality of preceding adjacent part-of-speech candidates L1/Pos1 to L1/Pos3 and a plurality of following adjacent part-of-speech candidates L3/Pos1 to L3/Pos3, the word appearance probability determination unit 102 considers the co-occurrence probabilities P(Pos1|Pos1) = 0.02, P(Pos1|Pos2) = 0.01, and P(Pos1|Pos3) = 0.5 on the preceding side together with the co-occurrence probabilities P(Pos1|Pos1) = 0.01, P(Pos2|Pos1) = 0.04, and P(Pos3|Pos1) = 0.4 on the following side, and determines the largest of these six values, P(Pos1|Pos3) = 0.5, as the word appearance probability of the target word L2 whose part of speech is L2/Pos1.
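The maximum-value rule of FIGS. 12 to 15 can be sketched as a single function. The function name `unknown_word_probability` and the dictionary layout are assumptions made for illustration. The numeric values are the ones given for FIG. 15.

```python
def unknown_word_probability(preceding, following):
    """Return the largest co-occurrence probability among all neighbours.

    preceding: preceding adjacent POS candidate -> P(target | preceding).
    following: following adjacent POS candidate -> P(following | target).
    """
    candidates = list(preceding.values()) + list(following.values())
    if not candidates:
        raise ValueError("at least one adjacent part-of-speech candidate is required")
    return max(candidates)


# Values taken from the description of FIG. 15.
preceding = {"Pos1": 0.02, "Pos2": 0.01, "Pos3": 0.5}
following = {"Pos1": 0.01, "Pos2": 0.04, "Pos3": 0.4}
result = unknown_word_probability(preceding, following)
print(result)  # 0.5
```

Passing an empty dictionary for one side reproduces the one-sided cases: with only the following candidates above, the result is 0.4, as in FIG. 14.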

FIG. 16 illustrates the word appearance probability determination process for the case where a plurality of target part-of-speech candidates may co-occur with one preceding adjacent part-of-speech candidate. In this case, the co-occurrence probability between the preceding adjacent part-of-speech candidate and each target part-of-speech candidate is determined as the word appearance probability of the target word whose part of speech is that target part-of-speech candidate.

In FIG. 16, the part of speech L1/Pos1 is the preceding adjacent part-of-speech candidate Pos1 of the preceding adjacent word L1, and the part of speech L2/Pos(n) is a target part-of-speech candidate of the target word L2 that may co-occur with the preceding adjacent part-of-speech candidate L1/Pos1 (n = 1, 2, ..., N).

Also in FIG. 16, the number shown below each target part-of-speech candidate L2/Pos(n) is the word appearance probability of the target word L2 whose part of speech is L2/Pos(n).

In FIG. 16, N parts of speech Pos1, Pos2, ..., PosN exist as parts of speech of the word L2 that may co-occur with the preceding adjacent part-of-speech candidate L1/Pos1.

When a plurality of target part-of-speech candidates L2/Pos1, L2/Pos2, ..., L2/PosN may co-occur with one preceding adjacent part-of-speech candidate L1/Pos1, the word appearance probability determination unit 102 proceeds as described for FIG. 10 and determines the co-occurrence probability P(Pos(n)|Pos1) between L1/Pos1 and the n-th target part-of-speech candidate L2/Pos(n) as the word appearance probability of the target word L2 whose part of speech is L2/Pos(n).

In FIG. 16, for example, the co-occurrence probability P(Pos1|Pos1) between the preceding adjacent part-of-speech candidate L1/Pos1 and the target part-of-speech candidate L2/Pos1 is 0.02, and this value is used as is for the word appearance probability of the target word L2 whose part of speech is L2/Pos1. The word appearance probability determination unit 102 determines the word appearance probabilities of the target word L2 for the other target part-of-speech candidates L2/Pos2, L2/Pos3, ..., L2/PosN in the same way.

FIG. 17 illustrates the word appearance probability determination process for the case where a plurality of target part-of-speech candidates may co-occur with a plurality of preceding adjacent part-of-speech candidates. In this case, for each target part-of-speech candidate, the largest of the co-occurrence probabilities between each preceding adjacent part-of-speech candidate and that target part-of-speech candidate is determined as the word appearance probability of the target word whose part of speech is that target part-of-speech candidate.

In FIG. 17, the preceding adjacent part-of-speech candidates Pos1 to Pos3 exist as a plurality of preceding adjacent part-of-speech candidates that may co-occur with the target part-of-speech candidate L2/Pos(n).

As described for FIG. 13, the word appearance probability determination unit 102 determines the largest of the co-occurrence probabilities between each of the three preceding adjacent part-of-speech candidates L1/Pos1 to L1/Pos3 and the target part-of-speech candidate L2/Pos(n) as the word appearance probability of the target word L2 whose part of speech is L2/Pos(n).
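The case of FIG. 17 applies the maximum-value rule once per target part-of-speech candidate. A minimal sketch follows. The function name `appearance_probabilities` and the table layout are hypothetical. The Pos1 column reuses the values given for FIG. 13, while the Pos2 values are invented, because the text gives no numbers for FIG. 17.

```python
def appearance_probabilities(preceding_candidates, target_candidates, cooccurrence):
    """Map each target POS candidate to its word appearance probability.

    cooccurrence: (preceding POS, target POS) -> P(target | preceding).
    preceding_candidates must not be empty.
    """
    result = {}
    for target in target_candidates:
        # Largest co-occurrence probability over all preceding candidates.
        result[target] = max(
            cooccurrence.get((prev, target), 0.0) for prev in preceding_candidates
        )
    return result


cooccurrence = {
    # Values given for FIG. 13 (target candidate Pos1).
    ("Pos1", "Pos1"): 0.02, ("Pos2", "Pos1"): 0.01, ("Pos3", "Pos1"): 0.5,
    # Invented values for a second target candidate Pos2.
    ("Pos1", "Pos2"): 0.3, ("Pos2", "Pos2"): 0.05,
}
probs = appearance_probabilities(["Pos1", "Pos2", "Pos3"], ["Pos1", "Pos2"], cooccurrence)
print(probs)  # {'Pos1': 0.5, 'Pos2': 0.3}
```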

Next, with reference to the flowchart of FIG. 18, the word analysis process is described, in which the morphological analysis engine of FIG. 1 assigns a stem and a part of speech to each word constituting a word string supplied as a sentence.

In step S31, the sentence segmentation unit 31 divides the document supplied to it into sentences based on a predetermined rule for dividing a document into sentences, supplies the resulting sentences to the word segmentation unit 32, and the process proceeds to step S32.

In step S32, the word segmentation unit 32 divides each sentence supplied from the sentence segmentation unit 31 into the words constituting it, based on a predetermined rule for dividing a sentence into words, and supplies the sentence so divided to the stem assignment unit 33.

After step S32, the process proceeds to step S33, where the stem assignment unit 33 refers to the word table 61 of FIG. 2 and assigns a stem to each word constituting the sentence supplied from the word segmentation unit 32.

Also in step S33, the stem assignment unit 33 reads, from the word table 61 of FIG. 2, the word ID of each word constituting the sentence supplied from the word segmentation unit 32, and associates each word ID with the corresponding word of the sentence.

Further, in step S33, the stem assignment unit 33 supplies the sentence, in which each word has been assigned a stem and associated with a word ID, to the part-of-speech assignment unit 34, and the process proceeds to step S34.

In steps S34 to S36, the part-of-speech assignment unit 34 generates a word lattice from the word string supplied as a sentence from the stem assignment unit 33, by referring to the word appearance probability table 62 of FIG. 6 and the co-occurrence probability table 63 of FIG. 7 stored in the dictionary database 12.

That is, in step S34, the part-of-speech candidate determination unit 101 of the part-of-speech assignment unit 34 performs a part-of-speech candidate determination process that determines the part-of-speech candidates of each word constituting the sentence supplied from the stem assignment unit 33.

Specifically, in step S34, the part-of-speech candidate determination unit 101 takes each word of the sentence supplied from the stem assignment unit 33 in turn as the target word, determines the target part-of-speech candidates (the candidates for the part of speech of the target word) by referring to the word appearance probability table 62 of FIG. 6 or the co-occurrence probability table 63 of FIG. 7, and supplies the target word, as a target word whose part of speech is the target part-of-speech candidate, to the word appearance probability determination unit 102.

The process then proceeds from step S34 to step S35, where the word appearance probability determination unit 102 performs a word appearance probability determination process that determines the word appearance probability of each word constituting the sentence.

Specifically, in step S35, the word appearance probability determination unit 102 determines the word appearance probability of the target word whose part of speech is the target part-of-speech candidate, supplied from the part-of-speech candidate determination unit 101, by referring to the word appearance probability table 62 of FIG. 6 or the co-occurrence probability table 63 of FIG. 7. Once the word appearance probability of each word constituting the sentence has been determined in step S35, the word appearance probability determination unit 102 supplies those word appearance probabilities to the word lattice generation unit 103, and the process proceeds to step S36.

In step S36, the word lattice generation unit 103 generates a word lattice (FIG. 5) based on the co-occurrence probabilities held in the co-occurrence probability table 63 of FIG. 7 and the word appearance probability of each word constituting the sentence, supplied from the word appearance probability determination unit 102, supplies the word lattice to the word part-of-speech assignment unit 104, and the process proceeds to step S37.

In step S37, the word part-of-speech assignment unit 104 obtains the maximum likelihood path of the word lattice supplied from the word lattice generation unit 103 and, based on that path, determines the part-of-speech ID of the part of speech of each word. Further, in step S37, the word part-of-speech assignment unit 104 refers to the part-of-speech table 64 of FIG. 3, assigns a part of speech to each word constituting the sentence supplied from the stem assignment unit 33 via the part-of-speech candidate determination unit 101, the word appearance probability determination unit 102, and the word lattice generation unit 103, and supplies the result to the compound-word part-of-speech assignment unit 35 of FIG. 1.

Note that the sentence supplied to the word part-of-speech assignment unit 104 from the stem assignment unit 33 via the part-of-speech candidate determination unit 101, the word appearance probability determination unit 102, and the word lattice generation unit 103 consists of words to each of which a stem has been assigned.

The process then proceeds from step S37 to step S38. When the sentence supplied from the part-of-speech assignment unit 34 (the word part-of-speech assignment unit 104) contains a compound word, the compound-word part-of-speech assignment unit 35 refers to the compound word table 65 stored in the dictionary database 12, treats the multiple words in the sentence that form the compound word as a single compound word, and assigns the compound word its part of speech. The process then proceeds to step S39.

In step S39, the compound-word part-of-speech assignment unit 35 outputs each compound word contained in the sentence from the part-of-speech assignment unit 34 to a monitor or the like (not shown) for display in the format "compound word/part of speech/stem", and outputs each word of the sentence that is not part of a compound word in the format "word/part of speech/stem". The word analysis process then ends.
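The output format of step S39 can be illustrated with a short sketch. The function name `format_token` and the example tokens are invented. The text does not say how the stem of a compound word is derived, so the example simply reuses the surface form.

```python
def format_token(surface, pos, stem):
    """Render one analysed word or compound word as 'surface/POS/stem'."""
    return f"{surface}/{pos}/{stem}"


# Hypothetical analysis result: one compound word and one ordinary word.
tokens = [("New York", "ProperNoun", "New York"), ("sleeps", "Verb", "sleep")]
lines = [format_token(*token) for token in tokens]
print(lines)  # ['New York/ProperNoun/New York', 'sleeps/Verb/sleep']
```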

次に、図１９のフローチャートを参照して、図１８のステップＳ３４で行われる品詞候補決定処理を詳細に説明する。 Next, the part of speech candidate determination process performed in step S34 in FIG. 18 will be described in detail with reference to the flowchart in FIG.

In step S61, the part-of-speech candidate determination unit 101 takes each word constituting the sentence supplied from the stem adding unit 33 in turn as the attention word, and determines whether the attention word is an unknown word. If the attention word is determined in step S61 to be an unknown word, the process proceeds to step S63, where the part-of-speech candidate determination unit 101 refers to the co-occurrence probability table 63 of FIG. 7, determines the parts of speech that may co-occur with the adjacent part-of-speech candidates as attention part-of-speech candidates, and supplies the attention word whose part of speech is an attention part-of-speech candidate to the word appearance probability determination unit 102. The process then proceeds to step S64.

On the other hand, if the attention word is determined in step S61 not to be an unknown word, that is, to be a known word, the process proceeds to step S62, where the part-of-speech candidate determination unit 101 refers to the word appearance probability table 62 of FIG. 6, determines the part of speech indicated by the part-of-speech ID for the attention word as the attention part-of-speech candidate, and supplies the attention word whose part of speech is the attention part-of-speech candidate to the word appearance probability determination unit 102. The process then proceeds to step S64.

In step S64, the part-of-speech candidate determination unit 101 determines whether every word constituting the sentence supplied from the stem adding unit 33 has been taken as the attention word. If not, the process returns to step S61, a word of the sentence that has not yet been taken as the attention word becomes the new attention word, and the same processing is repeated.

On the other hand, if it is determined in step S64 that every word constituting the sentence supplied from the stem adding unit 33 has been taken as the attention word, the process returns to step S34 of FIG. 18 and then proceeds to step S35.
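The loop of steps S61 to S64 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the dictionary layouts `WORD_TABLE` (standing in for the word appearance probability table 62) and `COOC_TABLE` (standing in for the co-occurrence probability table 63), the tag names, and all probability values are hypothetical. For brevity only the preceding adjacent word is consulted, whereas the description also allows the following adjacent word.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical stand-in for table 62: word -> {part of speech: word appearance probability}
WORD_TABLE: Dict[str, Dict[str, float]] = {
    "the": {"DET": 0.9},
    "runs": {"NOUN": 0.3, "VERB": 0.7},
}

# Hypothetical stand-in for table 63: (preceding POS, following POS) -> co-occurrence probability
COOC_TABLE: Dict[Tuple[str, str], float] = {
    ("DET", "NOUN"): 0.6,
    ("DET", "ADJ"): 0.3,
    ("ADJ", "NOUN"): 0.7,
    ("NOUN", "VERB"): 0.5,
}


def determine_candidates(words: List[str]) -> List[Set[str]]:
    """Return the set of part-of-speech candidates for each word of the sentence."""
    candidates: List[Set[str]] = []
    for i, word in enumerate(words):          # loop closed by the check in S64
        if word in WORD_TABLE:                # S61: known word -> S62
            candidates.append(set(WORD_TABLE[word]))
        else:                                 # S61: unknown word -> S63
            prev = candidates[i - 1] if i > 0 else set()
            # Keep every POS that can follow one of the adjacent candidates.
            candidates.append({cur for (p, cur) in COOC_TABLE if p in prev})
    return candidates


result = determine_candidates(["the", "blorf", "runs"])
```

With these toy tables the unknown word "blorf" receives the candidates NOUN and ADJ, because those are the only parts of speech that the table pairs with a preceding DET. An unknown word at the start of a sentence gets an empty set in this sketch, which a fuller version would handle by looking at the following word.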

Next, the word appearance probability determination process performed in step S35 of FIG. 18 will be described in detail with reference to the flowchart of FIG. 20.

In step S91, the word appearance probability determination unit 102 determines whether the attention word supplied from the part-of-speech candidate determination unit 101, whose part of speech is an attention part-of-speech candidate, is an unknown word.

If the attention word is determined in step S91 to be an unknown word, the process proceeds to step S93, where the word appearance probability determination unit 102 determines the word appearance probability of the attention word supplied from the part-of-speech candidate determination unit 101 by referring to the co-occurrence probability, held in the co-occurrence probability table 63 of FIG. 7, between the adjacent part-of-speech candidate and the attention part-of-speech candidate. The process then proceeds to step S94.

On the other hand, if the attention word is determined in step S91 to be a known word, the process proceeds to step S92, where the word appearance probability determination unit 102 determines the word appearance probability of the attention word supplied from the part-of-speech candidate determination unit 101 by referring to the word appearance probability table 62 of FIG. 6. The process then proceeds to step S94.

In step S94, the word appearance probability determination unit 102 determines whether the word appearance probabilities of all the words constituting the sentence have been determined. If not, the process returns to step S91 and the same processing is repeated.

On the other hand, if it is determined in step S94 that the word appearance probabilities of all the words constituting the sentence have been determined, the word appearance probability determination unit 102 supplies the word appearance probability of each word constituting the sentence to the word lattice generation unit 103. The process returns to step S35 of FIG. 18 and then proceeds to step S36.
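The branch of steps S91 to S93 for a single attention word might look like the sketch below. The table layouts, tag names, and values are hypothetical, and the maximum over the adjacent candidates is used for the unknown-word case, which is the choice the description gives as an example.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical stand-in for table 62: word -> {part of speech: word appearance probability}
WORD_TABLE: Dict[str, Dict[str, float]] = {"runs": {"NOUN": 0.3, "VERB": 0.7}}

# Hypothetical stand-in for table 63: (adjacent POS, attention POS) -> co-occurrence probability
COOC_TABLE: Dict[Tuple[str, str], float] = {("DET", "NOUN"): 0.6, ("DET", "ADJ"): 0.3}


def word_appearance_probs(
    word: str,
    candidates: Iterable[str],
    adjacent_candidates: Iterable[str],
) -> Dict[str, float]:
    """Map each attention part-of-speech candidate to a word appearance probability."""
    adjacent = list(adjacent_candidates)
    if word in WORD_TABLE:                        # S91: known word -> S92
        return {pos: WORD_TABLE[word][pos] for pos in candidates}
    # S91: unknown word -> S93, fall back to the co-occurrence probabilities.
    return {
        pos: max((COOC_TABLE.get((adj, pos), 0.0) for adj in adjacent), default=0.0)
        for pos in candidates
    }


unknown = word_appearance_probs("blorf", ["NOUN", "ADJ"], ["DET"])
known = word_appearance_probs("runs", ["NOUN", "VERB"], ["DET"])
```

Here `unknown` takes its values from the co-occurrence table and `known` from the word table, mirroring the two branches of the flowchart.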

As described above, in the word analysis process performed by the morphological analysis engine of FIG. 1 and described with reference to the flowchart of FIG. 18, when the attention word is a known word, the attention part-of-speech candidate and the word appearance probability of the attention word whose part of speech is that candidate are determined by referring to the word appearance probability table 62 of FIG. 6. When the attention word is an unknown word, they are determined by referring to the co-occurrence probability table 63 of FIG. 7.

Therefore, even when the word string forming a sentence contains an unknown word, a word lattice is generated based on the co-occurrence probabilities of adjacent words in the word string and the word appearance probability of each word constituting the word string, so that the unknown word can be given its part of speech.

Furthermore, even when unknown words occur in succession in the word string, if, for example, the adjacent word next to an attention word that is an unknown word is a known word, the attention part-of-speech candidates of the attention word and the word appearance probabilities of the attention word for those candidates are determined in a chain, so to speak, starting from the adjacent part-of-speech candidates of that adjacent word. A word lattice can therefore be generated from a word string containing consecutive unknown words, and parts of speech can be given to the unknown words.

That is, based on the adjacent part-of-speech candidates of the adjacent word next to an attention word that is an unknown word, the attention part-of-speech candidates of the attention word and the corresponding word appearance probabilities are determined. Based on the part-of-speech candidates just determined for that unknown word, the attention part-of-speech candidates and word appearance probabilities of the unknown word taken as the next attention word are determined, and the same processing is repeated. Part-of-speech candidates and word appearance probabilities are thus determined for every unknown word, so that even when unknown words occur in succession, a word lattice can be generated and parts of speech can be given to the unknown words.
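The chained determination for a run of unknown words can be illustrated as follows. The sketch assumes left-to-right propagation from a known word, hypothetical table layouts and values, and the maximum co-occurrence probability as the word appearance probability of an unknown word.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for table 62: word -> {part of speech: word appearance probability}
WORD_TABLE: Dict[str, Dict[str, float]] = {"the": {"DET": 0.9}}

# Hypothetical stand-in for table 63: (preceding POS, following POS) -> co-occurrence probability
COOC_TABLE: Dict[Tuple[str, str], float] = {
    ("DET", "NOUN"): 0.6,
    ("DET", "ADJ"): 0.3,
    ("ADJ", "NOUN"): 0.7,
    ("NOUN", "VERB"): 0.5,
}


def propagate(words: List[str]) -> List[Dict[str, float]]:
    """For each word, return {candidate POS: word appearance probability}."""
    columns: List[Dict[str, float]] = []
    prev: Dict[str, float] = {}
    for word in words:
        if word in WORD_TABLE:
            cur = dict(WORD_TABLE[word])
        else:
            # Candidates and probabilities come only from the bigram table,
            # using whatever candidates the previous word ended up with.
            cur = {}
            for (p, c), prob in COOC_TABLE.items():
                if p in prev:
                    cur[c] = max(cur.get(c, 0.0), prob)
        columns.append(cur)
        prev = cur          # the next unknown word chains off this result
    return columns


columns = propagate(["the", "foo", "bar"])
```

The second unknown word "bar" never sees a known neighbour directly. Its candidates NOUN and VERB are derived from the candidates just assigned to "foo", which is the chaining the paragraph above describes.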

Furthermore, even when n-1 or more unknown words occur in succession in the word string, the attention part-of-speech candidates of an attention word that is an unknown word, and the word appearance probabilities of the attention word for those candidates, are determined from the co-occurrence probabilities of adjacent words in the word string (bigrams). It is therefore unnecessary to store co-occurrence probabilities for a plurality of n-grams, one of which would be selected according to the number of consecutive unknown words and used to give parts of speech to them. Only the co-occurrence probabilities of adjacent words need be stored in the dictionary database 12, so memory capacity can be saved compared with storing co-occurrence probabilities for a plurality of n-grams.

Also, in the word analysis process performed by the morphological analysis engine of FIG. 1, when parts of speech are given to the unknown words t2 and t3 in, for example, the sentence "t1 (known word), t2 (unknown word), t3 (unknown word), t4 (known word)", they can be given by generating a word lattice that takes the known word t4 (its part-of-speech candidates) into account, that is, a word lattice having nodes for the word t4. Parts of speech can therefore be given to unknown words more accurately than when they are given to the words t2 and t3 based on a 3-gram co-occurrence probability with which the part of speech of the word t1, a part-of-speech candidate of the word t2, and a part-of-speech candidate of the word t3 co-occur in this order, without considering the part of speech of the word t4.
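The effect of including nodes for t4 can be seen in a small lattice search. The sketch below assumes that the best path through the lattice is chosen by a Viterbi-style search over the product of word appearance and co-occurrence probabilities, which this passage does not spell out. The tags A, X, Y, B and all values are hypothetical.

```python
from typing import Dict, List, Tuple

Column = Dict[str, float]                 # candidate POS -> word appearance probability
Cooc = Dict[Tuple[str, str], float]       # (preceding POS, following POS) -> probability


def best_path(lattice: List[Column], cooc: Cooc) -> List[str]:
    """Return the POS sequence with the highest product of node and edge probabilities."""
    best = {pos: (prob, [pos]) for pos, prob in lattice[0].items()}
    for column in lattice[1:]:
        layer = {}
        for pos, prob in column.items():
            # Pick the best predecessor for this node.
            score, path = max(
                ((s * cooc.get((p, pos), 0.0) * prob, pth) for p, (s, pth) in best.items()),
                key=lambda item: item[0],
            )
            layer[pos] = (score, path + [pos])
        best = layer
    return max(best.values(), key=lambda item: item[0])[1]


cooc: Cooc = {
    ("A", "X"): 0.6, ("A", "Y"): 0.4,
    ("X", "X"): 0.7, ("X", "Y"): 0.3,
    ("Y", "X"): 0.5, ("Y", "Y"): 0.5,
    ("X", "B"): 0.1, ("Y", "B"): 0.9,
}
t1, t4 = {"A": 1.0}, {"B": 1.0}           # known words
t2 = t3 = {"X": 0.5, "Y": 0.5}            # unknown words

without_t4 = best_path([t1, t2, t3], cooc)
with_t4 = best_path([t1, t2, t3, t4], cooc)
```

In this toy setting the path that ignores t4 tags both unknown words as X, while the lattice that includes the t4 node tags them as Y, because Y is far more likely to precede the known tag B.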

Moreover, if the probability distributions such as the word appearance probabilities held in the word appearance probability table 62 of FIG. 6 and the co-occurrence probabilities held in the co-occurrence probability table 63 of FIG. 7 represent the probability distribution of the actual language, parts of speech can be given based on that distribution, and a result can be obtained as if a human had given a part of speech to each word constituting the sentence.

In the word appearance probability determination process described with reference to FIGS. 12 to 15, the maximum, for example, of the co-occurrence probabilities between each of a plurality of adjacent part-of-speech candidates and the attention part-of-speech candidate is determined as the word appearance probability of the attention word whose part of speech is the attention part-of-speech candidate. Alternatively, the sum of those co-occurrence probabilities may be determined as the word appearance probability.

FIG. 21 is a diagram explaining a word appearance probability determination process in which, when a plurality of previous adjacent part-of-speech candidates exist as adjacent part-of-speech candidates for one attention part-of-speech candidate, the sum of the co-occurrence probabilities between each of those candidates and the attention part-of-speech candidate is determined as the word appearance probability of the attention word whose part of speech is the attention part-of-speech candidate.

In FIG. 21, as in FIG. 13, the previous adjacent part-of-speech candidates L1/Pos1 to L1/Pos3 exist as a plurality of previous adjacent part-of-speech candidates that may co-occur with one attention part-of-speech candidate L2/Pos1.

In FIG. 21, the co-occurrence probabilities P(Pos1|Pos1), P(Pos1|Pos2), and P(Pos1|Pos3) between each of the three previous adjacent part-of-speech candidates L1/Pos1 to L1/Pos3 and the attention part-of-speech candidate L2/Pos1 are 0.02, 0.01, and 0.5, respectively.

When a plurality of previous adjacent part-of-speech candidates L1/Pos1 to L1/Pos3 exist for one attention part-of-speech candidate L2/Pos1, the word appearance probability determination unit 102 determines the sum 0.53 (= 0.02 + 0.01 + 0.5) of the co-occurrence probabilities P(Pos1|Pos1) = 0.02, P(Pos1|Pos2) = 0.01, and P(Pos1|Pos3) = 0.5 between each of the three previous adjacent part-of-speech candidates and the attention part-of-speech candidate L2/Pos1 as the word appearance probability of the attention word L2 whose part of speech is the part-of-speech candidate L2/Pos1.
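The two ways of combining the co-occurrence probabilities can be compared with the FIG. 21 values. The three probabilities come from the description; the variable names are chosen only for illustration.

```python
# P(Pos1 | previous adjacent candidate) for the attention candidate L2/Pos1, keyed by previous candidate
cooc_to_pos1 = {"Pos1": 0.02, "Pos2": 0.01, "Pos3": 0.5}

prob_max = max(cooc_to_pos1.values())   # maximum variant (FIG. 13)
prob_sum = sum(cooc_to_pos1.values())   # sum variant (FIG. 21)

print(prob_max)             # 0.5
print(round(prob_sum, 2))   # 0.53
```

The sum lets every previous adjacent candidate contribute, whereas the maximum keeps only the single most likely transition.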

FIG. 22 is a diagram explaining a word appearance probability determination process in which, when there are a plurality of attention part-of-speech candidates that may co-occur with a plurality of previous adjacent part-of-speech candidates, the sum of the co-occurrence probabilities between each of the previous adjacent part-of-speech candidates and an attention part-of-speech candidate is determined as the word appearance probability of the attention word whose part of speech is that attention part-of-speech candidate.

In FIG. 22, the previous adjacent part-of-speech candidates L1/Pos1 to L1/Pos3 exist as a plurality of previous adjacent part-of-speech candidates that may co-occur with the attention part-of-speech candidate L2/Pos(n).

As in the case described with reference to FIG. 21, the word appearance probability determination unit 102 determines the sum of the co-occurrence probabilities between each of the three previous adjacent part-of-speech candidates L1/Pos1 to L1/Pos3 and the attention part-of-speech candidate L2/Pos(n) as the word appearance probability of the attention word L2 whose part of speech is the part-of-speech candidate L2/Pos(n).

FIG. 23 is a block diagram showing a second configuration example of an embodiment of a morphological analysis engine to which the present invention is applied.

In the figure, portions corresponding to those in FIG. 1 are denoted by the same reference numerals, and their description is omitted below as appropriate.

That is, the morphological analysis engine of FIG. 23 is configured in the same manner as that of FIG. 1, except that a part-of-speech giving unit 44 is provided in place of the part-of-speech giving unit 34 of the word analysis unit 11, and a part-of-speech appearance probability table 201 is additionally stored in the dictionary database 12.

The part-of-speech giving unit 44 is supplied, from the stem adding unit 33, with a sentence in which a stem has been given to each word and a word ID has been associated with each word.

The part-of-speech giving unit 44 generates a word lattice (FIG. 5) from the word string forming the sentence supplied from the stem adding unit 33 by referring to the word appearance probability table 62 of FIG. 6, the co-occurrence probability table 63 of FIG. 7, and the part-of-speech appearance probability table 201, which associates a part-of-speech ID that uniquely identifies a part of speech with a part-of-speech appearance probability, that is, the probability that the part of speech appears.

Then, based on the word lattice generated from that word string, the part-of-speech giving unit 44 determines the part-of-speech ID of the part of speech of each word constituting the sentence supplied from the stem adding unit 33 and, by referring to the part-of-speech table 64 of FIG. 3, gives each word its part of speech.

The part-of-speech giving unit 44 supplies the sentence in which parts of speech have been given to the words to the compound word part-of-speech giving unit 35.

FIG. 24 is a diagram showing the part-of-speech appearance probability table 201 stored in the dictionary database 12 of FIG. 23.

In the part-of-speech appearance probability table 201 of FIG. 24, each part-of-speech ID is associated with the part-of-speech appearance probability of the part of speech that the ID represents.

Next, FIG. 25 is a block diagram showing a detailed configuration example of the part-of-speech giving unit 44 of FIG. 23.

In the figure, portions corresponding to the part-of-speech giving unit 34 of FIG. 9 are denoted by the same reference numerals, and their description is omitted below as appropriate.

That is, the part-of-speech giving unit 44 of FIG. 25 is configured in the same manner as the part-of-speech giving unit 34 of FIG. 9, except that a word appearance probability determination unit 302 is provided in place of the word appearance probability determination unit 102 of FIG. 9.

The word appearance probability determination unit 302 is supplied, from the part-of-speech candidate determination unit 101, with the attention word whose part of speech is an attention part-of-speech candidate.

The word appearance probability determination unit 302 determines the word appearance probability of the attention word supplied from the part-of-speech candidate determination unit 101, whose part of speech is an attention part-of-speech candidate, by referring to the word appearance probability table 62 of FIG. 6, the co-occurrence probability table 63 of FIG. 7, and the part-of-speech appearance probability table 201 of FIG. 24.

That is, when the word appearance probability of the attention word is held in the word appearance probability table 62 of FIG. 6, and the attention word is therefore a known word, the word appearance probability determination unit 302 determines the word appearance probability of the attention word by referring to the word appearance probability table 62 of FIG. 6, in the same way as the word appearance probability determination unit 102 of FIG. 9.

On the other hand, when the word appearance probability of the attention word is not held in the word appearance probability table 62 of FIG. 6, and the attention word is therefore an unknown word, the word appearance probability determination unit 302 determines the word appearance probability of the attention word by referring to the co-occurrence probability between the adjacent part-of-speech candidate and the attention part-of-speech candidate, held in the co-occurrence probability table 63 of FIG. 7, and the part-of-speech appearance probability of the attention part-of-speech candidate, held in the part-of-speech appearance probability table 201 of FIG. 24.

After the word appearance probability of each word constituting the sentence has been determined, the word appearance probability determination unit 302 supplies those word appearance probabilities to the word lattice generation unit 103.

Next, with reference to FIGS. 26 and 27, a description will be given of the word appearance probability determination process in which, when the attention word is an unknown word, the word appearance probability determination unit 302 of FIG. 25 determines the word appearance probability of the attention word whose part of speech is an attention part-of-speech candidate, based on the co-occurrence probability between the adjacent part-of-speech candidate and the attention part-of-speech candidate and on the part-of-speech appearance probability of the attention part-of-speech candidate.

FIG. 26 is a diagram explaining a word appearance probability determination process in which, when there are a plurality of attention part-of-speech candidates that may co-occur with one previous adjacent part-of-speech candidate, the product of the co-occurrence probability between the previous adjacent part-of-speech candidate and an attention part-of-speech candidate and the part-of-speech appearance probability of that attention part-of-speech candidate is determined as the word appearance probability of the attention word whose part of speech is that candidate.

In FIG. 26, the part of speech L1/Pos1 is the previous adjacent part-of-speech candidate Pos1 of the previous adjacent word L1, and the part of speech L2/Pos(n) is an attention part-of-speech candidate of the attention word L2 that may co-occur with the previous adjacent part-of-speech candidate L1/Pos1 (n = 1, 2, ..., N).

Also in FIG. 26, the number shown below each attention part-of-speech candidate L2/Pos(n) indicates the word appearance probability of the attention word L2 whose part of speech is the attention part-of-speech candidate L2/Pos(n).

In FIG. 26, a plurality of parts of speech, namely the N parts of speech Pos1, Pos2, ..., PosN, exist as parts of speech of the word L2 that may co-occur with the part of speech L1/Pos1.

In the word appearance probability determination process described with reference to FIG. 16, the co-occurrence probability P(Pos(n)|Pos1) between the previous adjacent part-of-speech candidate L1/Pos1 and the n-th attention part-of-speech candidate L2/Pos(n) is determined as the word appearance probability of the attention word L2 whose part of speech is the n-th part-of-speech candidate L2/Pos(n).

The word appearance probability determination process described with reference to FIG. 26 is the same as that described with reference to FIG. 16 in that the co-occurrence probability P(Pos(n)|Pos1) between the previous adjacent part-of-speech candidate L1/Pos1 and the n-th attention part-of-speech candidate L2/Pos(n) is obtained.

It differs, however, in that the product of the co-occurrence probability P(Pos(n)|Pos1) and the part-of-speech appearance probability of the n-th attention part-of-speech candidate L2/Pos(n) is determined as the word appearance probability of the attention word L2 whose part of speech is the n-th attention part-of-speech candidate L2/Pos(n).

In FIG. 26, for example, the co-occurrence probability P(Pos1|Pos1) between the previous adjacent part-of-speech candidate L1/Pos1 and the attention part-of-speech candidate L2/Pos1 is 0.02, and the part-of-speech appearance probability of the attention part-of-speech candidate L2/Pos1 is 0.04. The product 0.02*0.04 of the co-occurrence probability P(Pos1|Pos1) = 0.02 and the part-of-speech appearance probability 0.04 is determined as the word appearance probability of the attention word L2 whose part of speech is the attention part-of-speech candidate L2/Pos1. The word appearance probability determination unit 302 determines the word appearance probabilities of the attention word L2 for the other attention part-of-speech candidates L2/Pos2, L2/Pos3, ..., L2/PosN in the same way.
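The product rule of FIG. 26 can be sketched for several attention candidates at once. Only the Pos1 values (0.02 and 0.04) come from the description. The values for Pos2 and Pos3 and the dictionary names are hypothetical and are included only to show the loop over n.

```python
# P(Pos(n) | Pos1): co-occurrence with the single previous adjacent candidate L1/Pos1.
# Pos1 is from the description; Pos2 and Pos3 are made-up values.
cooc_given_pos1 = {"Pos1": 0.02, "Pos2": 0.10, "Pos3": 0.30}

# Part-of-speech appearance probabilities, standing in for table 201.
# Pos1 is from the description; Pos2 and Pos3 are made-up values.
pos_appearance = {"Pos1": 0.04, "Pos2": 0.20, "Pos3": 0.10}

# Word appearance probability of the unknown attention word L2 for each candidate.
word_prob = {
    pos: cooc_given_pos1[pos] * pos_appearance[pos]
    for pos in cooc_given_pos1
}
```

For L2/Pos1 this yields 0.02 * 0.04 = 0.0008. Multiplying by the part-of-speech appearance probability lowers the score of a part of speech that is rare overall even when its transition probability is high.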

FIG. 27 is a diagram explaining a word appearance probability determination process in which, when a plurality of previous adjacent part-of-speech candidates exist, the product of the maximum of the co-occurrence probabilities between each of those candidates and the attention part-of-speech candidate and the part-of-speech appearance probability of the attention part-of-speech candidate is determined as the word appearance probability of the attention word whose part of speech is the attention part-of-speech candidate.

In FIG. 27, the previous adjacent part-of-speech candidates L1/Pos1 to L1/Pos3 exist as a plurality of previous adjacent part-of-speech candidates that may co-occur with one attention part-of-speech candidate L2/Pos1.

In FIG. 27, the co-occurrence probabilities P(Pos1|Pos1), P(Pos1|Pos2), and P(Pos1|Pos3) between each of the three previous adjacent part-of-speech candidates L1/Pos1 to L1/Pos3 and the attention part-of-speech candidate L2/Pos1 are 0.02, 0.01, and 0.5, respectively, and the part-of-speech appearance probability of the attention part-of-speech candidate L2/Pos1 is 0.03.

In the word appearance probability determination process described with reference to FIG. 13, when a plurality of previous adjacent part-of-speech candidates L1/Pos1 to L1/Pos3 exist for one attention part-of-speech candidate L2/Pos1, the maximum co-occurrence probability P(Pos1|Pos3) = 0.5 among P(Pos1|Pos1) = 0.02, P(Pos1|Pos2) = 0.01, and P(Pos1|Pos3) = 0.5 is determined as the word appearance probability of the attention word L2 whose part of speech is the part-of-speech candidate L2/Pos1.

The word appearance probability determination process described with reference to FIG. 27 is the same as that described with reference to FIG. 13 in that, when a plurality of previous adjacent part-of-speech candidates L1/Pos1 to L1/Pos3 exist for one attention part-of-speech candidate L2/Pos1, the maximum co-occurrence probability P(Pos1|Pos3) = 0.5 among P(Pos1|Pos1) = 0.02, P(Pos1|Pos2) = 0.01, and P(Pos1|Pos3) = 0.5 is obtained.

It differs, however, in that the product 0.5*0.03 of the maximum co-occurrence probability P(Pos1|Pos3) = 0.5, taken over the co-occurrence probabilities between each of the previous adjacent part-of-speech candidates L1/Pos1 to L1/Pos3 and the attention part-of-speech candidate L2/Pos1, and the part-of-speech appearance probability 0.03 of the attention part-of-speech candidate L2/Pos1 is determined as the word appearance probability of the attention word L2.
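The FIG. 27 computation reduces to a maximum followed by a product. The numbers below are those given in the description. The function name and argument layout are assumptions made for illustration.

```python
from typing import Dict


def unknown_word_prob(cooc_with_prev: Dict[str, float], pos_appearance_prob: float) -> float:
    """Maximum co-occurrence probability times the part-of-speech appearance probability."""
    return max(cooc_with_prev.values()) * pos_appearance_prob


# P(Pos1 | previous adjacent candidate) for L1/Pos1 to L1/Pos3, and the
# part-of-speech appearance probability of L2/Pos1.
p = unknown_word_prob({"Pos1": 0.02, "Pos2": 0.01, "Pos3": 0.5}, 0.03)
```

With these inputs the maximum is P(Pos1|Pos3) = 0.5, so the word appearance probability is 0.5 * 0.03 = 0.015.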

As described above, in the word analysis process performed by the morphological analysis engine of FIG. 23 and described with reference to FIGS. 23 to 27, when the attention word is a known word, the attention part-of-speech candidate and the word appearance probability of the attention word whose part of speech is that candidate are determined by referring to the word appearance probability table 62 of FIG. 6. When the attention word is an unknown word, the attention part-of-speech candidate is determined by referring to the co-occurrence probability table 63 of FIG. 7, and the word appearance probability of the attention word whose part of speech is that candidate is determined by referring to the co-occurrence probability table 63 of FIG. 7 and the part-of-speech appearance probability table 201 of FIG. 24.

Furthermore, even when n-1 or more unknown words occur in succession in the word string, the target part-of-speech candidates of an unknown target word, and the word appearance probability for each candidate, are determined from the co-occurrence probabilities (bigrams) of adjacent words in the word string. There is therefore no need to store co-occurrence probabilities for a plurality of n-grams, one of which would be selected according to the number of consecutive unknown words and used to assign parts of speech to them. Only the co-occurrence probabilities of adjacent words and the part-of-speech appearance probabilities need to be stored in the dictionary database 12, so memory is saved compared with storing co-occurrence probabilities for a plurality of n-grams.
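To give a rough sense of the saving, the sketch below counts table entries for a dense (worst-case) layout. The tag-set size of 50, the dense-table assumption, and the choice of 2-gram to 4-gram tables for comparison are all hypothetical; the specification gives no table sizes.

```python
def dense_entries(num_pos, order):
    """Number of entries in a dense table over sequences of `order` parts of speech."""
    return num_pos ** order


num_pos = 50  # hypothetical tag-set size, not taken from the specification

# Bigram co-occurrence table plus the part-of-speech appearance table.
bigram_plus_unigram = dense_entries(num_pos, 2) + num_pos

# Separate 2-gram, 3-gram and 4-gram co-occurrence tables.
multiple_ngrams = sum(dense_entries(num_pos, k) for k in range(2, 5))

print(bigram_plus_unigram, multiple_ngrams)
```

Under these assumptions the bigram-based scheme needs a few thousand entries, while the multi-n-gram scheme needs several million.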

In addition, consider the case where the word analysis process performed by the morphological analysis engine of FIG. 23 assigns parts of speech to the unknown words t2 and t3 in, for example, the sentence "t1 (known word), t2 (unknown word), t3 (unknown word), t4 (known word)". Parts of speech can be assigned to the words t2 and t3 by generating a word lattice that takes the known word t4 (its part-of-speech candidates) into account, that is, a word lattice that has nodes for the word t4. Parts of speech can therefore be assigned to unknown words more accurately than when they are assigned to t2 and t3 based on a 3-gram co-occurrence probability, which does not consider the part of speech of t4 and covers only the part of speech of t1, a part-of-speech candidate of t2, and a part-of-speech candidate of t3 co-occurring in this order.
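The effect of including t4 in the lattice can be illustrated with a small best-path search. This section does not say how the best path through the word lattice is found, so the Viterbi-style search below is an assumption, as are the function name, the part-of-speech labels, and every probability in the example.

```python
def best_path(lattice, cooc):
    """lattice: one dict {pos: word_appearance_probability} per word.
    cooc: dict {(pos, prev_pos): P(pos | prev_pos)}; missing pairs count as 0.
    Returns (probability, part-of-speech sequence) of the most probable path."""
    # scores[pos] = (probability of the best path ending in pos, that path)
    scores = {pos: (p, [pos]) for pos, p in lattice[0].items()}
    for column in lattice[1:]:
        new_scores = {}
        for pos, emit in column.items():
            new_scores[pos] = max(
                ((s * cooc.get((pos, prev), 0.0) * emit, path + [pos])
                 for prev, (s, path) in scores.items()),
                key=lambda item: item[0],
            )
        scores = new_scores
    return max(scores.values(), key=lambda item: item[0])


# Hypothetical values: t1 and t4 are known words, t2 and t3 are unknown words.
cooc = {
    ("Verb", "Noun"): 0.6, ("Adj", "Noun"): 0.4,
    ("Noun", "Verb"): 0.5, ("Adv", "Verb"): 0.5,
    ("Noun", "Adj"): 0.9, ("Adv", "Adj"): 0.1,
    ("Verb", "Adv"): 0.9,
}
t1, t4 = {"Noun": 1.0}, {"Verb": 1.0}
t2, t3 = {"Verb": 0.5, "Adj": 0.5}, {"Noun": 0.5, "Adv": 0.5}

prob_with_t4, path_with_t4 = best_path([t1, t2, t3, t4], cooc)
prob_without_t4, path_without_t4 = best_path([t1, t2, t3], cooc)
print(path_with_t4, path_without_t4)
```

With these made-up numbers, the path that ignores t4 tags t2 and t3 as Adj and Noun, while the path through the t4 node tags them as Verb and Adv, because only the latter continues well into the known part of speech of t4.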

Moreover, if the probability distributions held in the tables, such as the word appearance probabilities in the word appearance probability table 62 of FIG. 6, the co-occurrence probabilities in the co-occurrence probability table 63 of FIG. 7, and the part-of-speech appearance probabilities in the part-of-speech appearance probability table 201 of FIG. 24, represent the probability distribution of the actual language, parts of speech can be assigned based on that distribution. The result is then as if a human had assigned a part of speech to each word of the document.

In the word appearance probability determination process described with reference to FIG. 26, the word appearance probability determination unit 302 determines the product 0.02*0.04 of the co-occurrence probability P(Pos1|Pos1)=0.02 between the target part-of-speech candidate L2/Pos1 and the preceding adjacent part-of-speech candidate L1/Pos1 and the part-of-speech appearance probability 0.04 of the target part-of-speech candidate L2/Pos1 as the word appearance probability of the target word L2 whose part of speech is the target part-of-speech candidate L2/Pos1. Alternatively, for example, the part-of-speech appearance probability 0.04 of the target part-of-speech candidate L2/Pos1 may itself be determined as the word appearance probability of the target word L2 whose part of speech is the target part-of-speech candidate L2/Pos1.

FIG. 28 is a diagram illustrating a word appearance probability determination process that determines the part-of-speech appearance probability of a target part-of-speech candidate as the word appearance probability of the target word whose part of speech is that candidate.

In FIG. 28, the part of speech L1/Pos1 is the preceding adjacent part-of-speech candidate Pos1 of the preceding adjacent word L1, and the parts of speech L2/Pos(n) are the target part-of-speech candidates of the target word L2 that may co-occur with the preceding adjacent part-of-speech candidate L1/Pos1 (n = 1, 2, ..., N).

Also in FIG. 28, the number shown below each target part-of-speech candidate L2/Pos(n) indicates the word appearance probability of the target word L2 whose part of speech is the target part-of-speech candidate Pos(n).

The word appearance probability determination unit 302 determines the part-of-speech appearance probability of the target part-of-speech candidate L2/Pos(n) as the word appearance probability of the target word whose part of speech is the target part-of-speech candidate L2/Pos(n).

In FIG. 28, for example, the part-of-speech appearance probability of the target part-of-speech candidate L2/Pos1 is 0.04, and this value 0.04 is used as is as the word appearance probability of the target word L2 whose part of speech is the target part-of-speech candidate L2/Pos1. The word appearance probability determination unit 302 determines the word appearance probabilities of the target word L2 for the other target part-of-speech candidates L2/Pos2, L2/Pos3, ..., L2/PosN in the same way.

In this case, compared with the process described with reference to FIG. 26, which determines the product of the co-occurrence probability between the preceding adjacent part-of-speech candidate and the target part-of-speech candidate and the part-of-speech appearance probability of the target part-of-speech candidate as the word appearance probability, two steps can be omitted: reading that co-occurrence probability from the co-occurrence probability table 63 of FIG. 7, and multiplying it by the part-of-speech appearance probability of the target part-of-speech candidate. The word appearance probability of the target word can therefore be determined more quickly.

When a plurality of target part-of-speech candidates exist, the word appearance probabilities of the target word for all of those candidates may be set to the same value.

FIG. 29 is a diagram illustrating a word appearance probability determination process that, when a plurality of target part-of-speech candidates exist, sets the word appearance probabilities of the target word for all of those candidates to the same value.

In FIG. 29, the part of speech L1/Pos1 is the preceding adjacent part-of-speech candidate Pos1 of the preceding adjacent word L1, and the parts of speech L2/Pos(n) are the target part-of-speech candidates of the target word L2 that may co-occur with the preceding adjacent part-of-speech candidate L1/Pos1 (n = 1, 2, ..., N).

Also in FIG. 29, the number shown below each target part-of-speech candidate L2/Pos(n) indicates the word appearance probability of the target word L2 whose part of speech is the target part-of-speech candidate L2/Pos(n).

The word appearance probability determination unit 102 or the word appearance probability determination unit 302 sets each word appearance probability of the target word L2 whose part of speech is a target part-of-speech candidate L2/Pos(n) to the same value, for example 1/N, where N is the total number of target part-of-speech candidates.

In FIG. 29, the total number of target part-of-speech candidates is 1000, so each word appearance probability of the target word L2 whose part of speech is a target part-of-speech candidate L2/Pos(n) is set to the same value 1/1000=0.001.

In this case, compared with the process described with reference to FIG. 28, which determines the part-of-speech appearance probability of the target part-of-speech candidate as the word appearance probability, there is no need to read the part-of-speech appearance probability of the target part-of-speech candidate from the part-of-speech appearance probability table 201 of FIG. 24. The word appearance probability of the target word can therefore be determined more quickly.
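The three alternatives of FIGS. 26, 28, and 29 trade accuracy of the estimate against the number of table lookups. The sketch below places them side by side; the function names and the dictionary layout are assumptions for illustration, and the values are the ones quoted in the text.

```python
def product_variant(cooc, pos_prob, prev, target):
    # FIG. 26: two table lookups and one multiplication.
    return cooc[(target, prev)] * pos_prob[target]


def pos_only_variant(pos_prob, target):
    # FIG. 28: a single lookup in the part-of-speech appearance probability table.
    return pos_prob[target]


def uniform_variant(num_candidates):
    # FIG. 29: no table lookup; every candidate gets 1/N.
    return 1.0 / num_candidates


# Keys of cooc are (target POS, preceding POS); this layout is an assumption.
cooc = {("Pos1", "Pos1"): 0.02}
pos_prob = {"Pos1": 0.04}

p26 = product_variant(cooc, pos_prob, "Pos1", "Pos1")  # 0.02 * 0.04
p28 = pos_only_variant(pos_prob, "Pos1")               # 0.04
p29 = uniform_variant(1000)                            # 1/1000
```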

In the part-of-speech candidate determination unit 101 of FIG. 9 (FIG. 25), when the target word is an unknown word, the parts of speech that may co-occur with an adjacent part-of-speech candidate are determined as target part-of-speech candidates by referring to the co-occurrence probability table 63 of FIG. 7. Alternatively, for example, the single part of speech whose co-occurrence probability with the adjacent part-of-speech candidate is the largest among a plurality of parts of speech may be determined as the target part-of-speech candidate (the part of speech of the target word).

FIG. 30 is a diagram illustrating a part-of-speech candidate determination process in which the part-of-speech candidate determination unit 101 determines, as the target part-of-speech candidate, the single part of speech whose co-occurrence probability with the adjacent part-of-speech candidate is the largest among a plurality of parts of speech.

In FIG. 30, the part of speech L1/Pos1 is the preceding adjacent part-of-speech candidate Pos1 of the preceding adjacent word L1, and the part of speech L2/Pos10 is the part of speech for which the co-occurrence probability between the preceding adjacent part-of-speech candidate L1/Pos1 and a part of speech Pos(n) takes its maximum value P(Pos10|Pos1)=0.2 (n = 1, 2, ..., 10, ..., N).

In FIG. 30, the parts of speech Pos1, Pos2, ..., Pos10, ..., PosN exist as parts of speech of the word L2 that may co-occur with the preceding adjacent part-of-speech candidate L1/Pos1. The co-occurrence probability P(Pos10|Pos1) between the preceding adjacent part-of-speech candidate L1/Pos1 and the target part-of-speech candidate L2/Pos10 is 0.2, the maximum among the co-occurrence probabilities between L1/Pos1 and each of the target part-of-speech candidates L2/Pos(n).

The part-of-speech candidate determination unit 101 determines one of the parts of speech Pos(n) that may co-occur with the preceding adjacent part-of-speech candidate L1/Pos1 as the target part-of-speech candidate of the target word L2.

In FIG. 30, for example, the part of speech Pos10, for which the co-occurrence probability between the adjacent part-of-speech candidate L1/Pos1 and the parts of speech Pos(n) takes its maximum value P(Pos10|Pos1)=0.2, is determined as the target part-of-speech candidate.

In the part-of-speech candidate determination process described with reference to FIG. 30, a single part of speech is selected as the target part-of-speech candidate, so that candidate is always adopted as the part of speech of the target word. The word appearance probability of the target word with that part of speech may therefore be set to, for example, the value 1.
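The FIG. 30 selection amounts to an argmax over the co-occurrence table. In the sketch below, only P(Pos10|Pos1)=0.2 comes from the text; the other probabilities, the function name, and the dictionary layout are hypothetical.

```python
def most_likely_pos(cooc, prev, all_pos):
    """Return the part of speech whose co-occurrence probability with `prev` is largest.
    cooc keys are (pos, prev_pos); pairs missing from the table count as 0."""
    return max(all_pos, key=lambda pos: cooc.get((pos, prev), 0.0))


# Only the 0.2 entry is quoted in the text; the other values are made up.
cooc = {("Pos1", "Pos1"): 0.02, ("Pos2", "Pos1"): 0.05, ("Pos10", "Pos1"): 0.2}

target_candidate = most_likely_pos(cooc, "Pos1", ["Pos1", "Pos2", "Pos10"])
word_appearance_prob = 1.0  # a single candidate remains, so the value 1 may be used
```

Because only one node per unknown word enters the word lattice, this variant keeps the lattice small, at the cost of committing to a part of speech before the following words are considered.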

In addition, when the target word is an unknown word, the part-of-speech candidate determination unit 101 can determine all of the parts of speech held in the part-of-speech table 64 of FIG. 3 as target part-of-speech candidates.

Next, FIG. 31 is a block diagram showing a third configuration example of an embodiment of a morphological analysis engine to which the present invention is applied.

In the figure, parts corresponding to those in FIG. 23 are denoted by the same reference numerals, and their description is omitted below as appropriate.

That is, the morphological analysis engine of FIG. 31 is configured in the same way as that of FIG. 23, except that a learning unit 13 is newly provided.

When a user operates an operation unit (not shown), for example, a learning corpus is input to the morphological analysis engine of FIG. 31. The learning corpus is a sample of sentences, such as documents or newspapers in the form of text data, in which each word is annotated with its part of speech. The learning corpus is then supplied to the learning unit 13.

The learning unit 13 generates, for example before the word analysis process of the morphological analysis engine of FIG. 31 is performed, the word appearance probability table 62 of FIG. 6, the co-occurrence probability table 63 of FIG. 7, and the part-of-speech appearance probability table 201 of FIG. 24 based on the input learning corpus, and supplies them to the dictionary database 12 for storage.
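One plausible way for a learning unit to build the three tables is relative-frequency counting over the tagged corpus. This section does not describe the estimation method, so the sketch below is an assumption throughout: the function name, the table layouts, the definition of the word appearance probability as P(word | part of speech), and the toy corpus are all hypothetical.

```python
from collections import Counter


def learn_tables(tagged_sentences):
    """tagged_sentences: list of sentences, each a list of (word, pos) pairs.
    Returns (word_table, cooc_table, pos_table) estimated by relative frequency."""
    word_pos, pos_count = Counter(), Counter()
    bigram, prev_count = Counter(), Counter()
    for sentence in tagged_sentences:
        for word, pos in sentence:
            word_pos[(word, pos)] += 1
            pos_count[pos] += 1
        # Adjacent pairs of parts of speech within the sentence.
        for (_, prev), (_, cur) in zip(sentence, sentence[1:]):
            bigram[(cur, prev)] += 1
            prev_count[prev] += 1
    total = sum(pos_count.values())
    # Assumed definition: P(word | pos).
    word_table = {(w, p): c / pos_count[p] for (w, p), c in word_pos.items()}
    # P(pos | preceding pos), keyed as (pos, preceding pos).
    cooc_table = {(cur, prev): c / prev_count[prev] for (cur, prev), c in bigram.items()}
    # P(pos).
    pos_table = {p: c / total for p, c in pos_count.items()}
    return word_table, cooc_table, pos_table


corpus = [
    [("the", "Det"), ("dog", "Noun"), ("runs", "Verb")],
    [("the", "Det"), ("cat", "Noun")],
]
word_table, cooc_table, pos_table = learn_tables(corpus)
```

A practical system would typically add smoothing so that unseen pairs do not receive probability zero; that is omitted here for brevity.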

The learning unit 13 may also generate the word appearance probability table 62 of FIG. 6, the co-occurrence probability table 63 of FIG. 7, and the part-of-speech appearance probability table 201 of FIG. 24 from the learning corpus as needed.

In addition, for example, the word table 61 of FIG. 2, the part-of-speech table 64 of FIG. 3, and the compound word table 65 of FIG. 4 are generated by the learning unit 13 using an English dictionary or the like as the learning corpus.

The word analysis process of FIG. 18, the part-of-speech candidate determination process of FIG. 19, and the word appearance probability determination process of FIG. 20 described above can be executed by dedicated hardware or by software. When these processes are executed by software, the program constituting the software is installed from a program recording medium onto a computer built into dedicated hardware, or onto, for example, a general-purpose personal computer that can execute various functions when various programs are installed.

FIG. 32 is a block diagram showing a configuration example of a computer that performs the word analysis process of FIG. 18, the part-of-speech candidate determination process of FIG. 19, and the word appearance probability determination process of FIG. 20 by executing a program.

A CPU (Central Processing Unit) 901 executes various processes according to programs stored in a ROM (Read Only Memory) 902 or a storage unit 908. A RAM (Random Access Memory) 903 stores programs executed by the CPU 901, data, and the like as appropriate. The CPU 901, the ROM 902, and the RAM 903 are connected to one another by a bus 904.

An input/output interface 905 is also connected to the CPU 901 via the bus 904. Connected to the input/output interface 905 are an input unit 906 including a keyboard, a mouse, a microphone, and the like, and an output unit 907 including a monitor, a speaker, and the like. The CPU 901 executes various processes in response to commands input from the input unit 906, and outputs the processing results to the output unit 907.

The storage unit 908 connected to the input/output interface 905 is, for example, a hard disk, and stores programs executed by the CPU 901 and various data. A communication unit 909 communicates with external devices via a network such as the Internet or a local area network.

A drive 910 connected to the input/output interface 905 drives a removable medium 911, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, when one is mounted, and acquires the programs, data, and the like recorded on it. The acquired programs and data are transferred to the storage unit 908 and stored there as necessary.

As shown in FIG. 32, the program recording medium that stores the program to be installed on the computer and made executable by the computer is the removable medium 911, which is a package medium such as a magnetic disk (including a flexible disk), an optical disk (including a CD-ROM (Compact Disc-Read Only Memory) and a DVD (Digital Versatile Disc)), a magneto-optical disk, or a semiconductor memory; the ROM 902 in which the program is stored temporarily or permanently; or the hard disk constituting the storage unit 908. The program is stored on the program recording medium, as necessary, via the communication unit 909, which is an interface such as a router or a modem, using a wired or wireless communication medium such as a local area network, the Internet, or digital satellite broadcasting.

In this specification, the steps describing the program stored on the program recording medium include not only processes performed in time series in the described order, but also processes executed in parallel or individually without necessarily being performed in time series.

The embodiments of the present invention are not limited to the embodiments described above, and various modifications can be made without departing from the gist of the present invention.

FIG. 1 is a block diagram showing a first configuration example of an embodiment of a morphological analysis engine to which the present invention is applied.
FIG. 2 is a diagram showing the word table 61 stored in the dictionary database 12 of FIG. 1.
FIG. 3 is a diagram showing the part-of-speech table 64 stored in the dictionary database 12 of FIG. 1.
FIG. 4 is a diagram showing the compound word table 65 stored in the dictionary database 12 of FIG. 1.
FIG. 5 is a diagram showing a word lattice generated by the part-of-speech assignment unit 34 of FIG. 1.
FIG. 6 is a diagram showing the word appearance probability table 62 stored in the dictionary database 12 of FIG. 1.
FIG. 7 is a diagram showing the co-occurrence probability table 63 stored in the dictionary database 12 of FIG. 1.
FIG. 8 is a diagram showing an HMM for the word appearance probability table 62 of FIG. 6 and the co-occurrence probability table 63 of FIG. 7.
FIG. 9 is a block diagram showing a detailed configuration example of the part-of-speech assignment unit 34 of FIG. 1.
FIG. 10 is a diagram illustrating a word appearance probability determination process that determines the co-occurrence probability between a preceding adjacent part-of-speech candidate and a target part-of-speech candidate as the word appearance probability of the target word.
FIG. 11 is a diagram illustrating a word appearance probability determination process that determines the co-occurrence probability between a target part-of-speech candidate and a following adjacent part-of-speech candidate as the word appearance probability of the target word.
FIG. 12 is a diagram illustrating a word appearance probability determination process that determines one of the co-occurrence probability between a preceding adjacent part-of-speech candidate and a target part-of-speech candidate and the co-occurrence probability between the target part-of-speech candidate and a following adjacent part-of-speech candidate as the word appearance probability of the target word.
FIG. 13 is a diagram illustrating a word appearance probability determination process that determines the maximum of the co-occurrence probabilities between each of a plurality of preceding adjacent part-of-speech candidates and a target part-of-speech candidate as the word appearance probability of the target word.
FIG. 14 is a diagram illustrating a word appearance probability determination process that determines the maximum of the co-occurrence probabilities between a target part-of-speech candidate and each of a plurality of following adjacent part-of-speech candidates as the word appearance probability of the target word.
FIG. 15 is a diagram illustrating a word appearance probability determination process that determines the maximum of the co-occurrence probabilities between each of a plurality of preceding and following adjacent part-of-speech candidates and a target part-of-speech candidate as the word appearance probability of the target word.
FIG. 16 is a diagram illustrating a word appearance probability determination process that determines the co-occurrence probability between a preceding adjacent part-of-speech candidate and a target part-of-speech candidate as the word appearance probability of the target word.
FIG. 17 is a diagram illustrating a word appearance probability determination process that determines the maximum of the co-occurrence probabilities between each of a plurality of preceding adjacent part-of-speech candidates and a target part-of-speech candidate as the word appearance probability of the target word.
FIG. 18 is a flowchart illustrating the word analysis process.
FIG. 19 is a flowchart illustrating the part-of-speech candidate determination process of step S34 of FIG. 18.
FIG. 20 is a flowchart illustrating the word appearance probability determination process of step S35 of FIG. 18.
FIG. 21 is a first diagram illustrating a word appearance probability determination process that determines the sum of the co-occurrence probabilities between each of a plurality of adjacent part-of-speech candidates and a target part-of-speech candidate as the word appearance probability of the target word.
FIG. 22 is a second diagram illustrating a word appearance probability determination process that determines the sum of the co-occurrence probabilities between each of a plurality of preceding adjacent part-of-speech candidates and a target part-of-speech candidate as the word appearance probability of the target word.
FIG. 23 is a block diagram showing a second configuration example of an embodiment of a morphological analysis engine to which the present invention is applied.
FIG. 24 is a diagram showing the part-of-speech appearance probability table 201 stored in the dictionary database 12 of FIG. 23.
FIG. 25 is a block diagram showing a detailed configuration example of the part-of-speech assignment unit 44 of FIG. 23.
FIG. 26 is a diagram illustrating a word appearance probability determination process that determines the product of the co-occurrence probability between an adjacent part-of-speech candidate and a target part-of-speech candidate and the part-of-speech appearance probability of the target part-of-speech candidate as the word appearance probability of the target word.
FIG. 27 is a diagram illustrating a word appearance probability determination process that determines the product of the maximum of the co-occurrence probabilities between each of a plurality of adjacent part-of-speech candidates and a target part-of-speech candidate and the part-of-speech appearance probability of the target part-of-speech candidate as the word appearance probability of the target word.
FIG. 28 is a diagram illustrating a word appearance probability determination process that determines the part-of-speech appearance probability of a target part-of-speech candidate as the word appearance probability of the target word.
FIG. 29 is a diagram illustrating a word appearance probability determination process that sets each word appearance probability of the target word to the same value.
FIG. 30 is a diagram illustrating a word appearance probability determination process that determines, as the target part-of-speech candidate, the single part of speech whose co-occurrence probability with an adjacent part-of-speech candidate is the largest among a plurality of parts of speech.
FIG. 31 is a block diagram showing a third configuration example of an embodiment of a morphological analysis engine to which the present invention is applied.
FIG. 32 is a block diagram showing a configuration example of a computer.

Explanation of symbols

11 word analysis unit, 12 dictionary database, 31 sentence segmentation unit, 32 word segmentation unit, 33 stem assignment unit, 34 part-of-speech assignment unit, 35 compound-word part-of-speech assignment unit, 44 part-of-speech assignment unit, 61 word table, 62 word appearance probability table, 63 co-occurrence probability table, 64 part-of-speech table, 65 compound word table, 101 part-of-speech candidate determination unit, 102 word appearance probability determination unit, 103 word lattice generation unit, 104 word part-of-speech assignment unit, 201 part-of-speech appearance probability table, 302 word appearance probability determination unit

Claims

An information processing apparatus that generates a word lattice from a word string, comprising:
part-of-speech candidate determination means for determining, based on co-occurrence probabilities stored in advance in storage means, each co-occurrence probability being the probability that two parts of speech co-occur, an attention part-of-speech candidate, which is a candidate for the part of speech of an attention word, the attention word being the word currently of interest among the words constituting the word string, and the attention part-of-speech candidate being one that may co-occur with an adjacent part-of-speech candidate, which is a candidate for the part of speech of an adjacent word, the adjacent word being a word immediately before or after the attention word;
word appearance probability determination means for determining, based on the co-occurrence probability between the adjacent part-of-speech candidate and the attention part-of-speech candidate, a word appearance probability, which is the probability that the attention word appears with the attention part-of-speech candidate as its part of speech; and
word lattice generation means for generating the word lattice based on the co-occurrence probabilities for adjacent words in the word string and the word appearance probability of each word constituting the word string.
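
For illustration only, the three means recited above can be sketched in Python as follows. The tag names, the probability values, and all function names are hypothetical and do not appear in the specification. The sketch also assumes that only the preceding word serves as the adjacent word, although the claim equally allows the following word, and that the candidates for the first word are already known.

```python
from typing import Dict, List, Tuple

# Hypothetical co-occurrence table keyed by (adjacent POS, attention POS).
# Tags and values are illustrative assumptions, not data from the patent.
CO_OCCURRENCE: Dict[Tuple[str, str], float] = {
    ("DET", "NOUN"): 0.6,
    ("DET", "ADJ"): 0.3,
    ("ADJ", "NOUN"): 0.7,
    ("NOUN", "VERB"): 0.5,
}


def attention_candidates(adjacent: List[str]) -> List[str]:
    """Parts of speech that may co-occur with at least one adjacent candidate."""
    found = {
        cur
        for (prev, cur), prob in CO_OCCURRENCE.items()
        if prev in adjacent and prob > 0.0
    }
    return sorted(found)


def word_appearance_probability(adjacent: List[str], candidate: str) -> float:
    """One possible choice: the largest co-occurrence probability found."""
    return max(
        (CO_OCCURRENCE.get((prev, candidate), 0.0) for prev in adjacent),
        default=0.0,
    )


def build_lattice(words: List[str], first_candidates: List[str]) -> List[Dict[str, float]]:
    """Return one column per word, mapping each POS candidate to its probability."""
    lattice = [{pos: 1.0 for pos in first_candidates}]
    for _ in words[1:]:
        adjacent = list(lattice[-1])
        column = {
            pos: word_appearance_probability(adjacent, pos)
            for pos in attention_candidates(adjacent)
        }
        lattice.append(column)
    return lattice


if __name__ == "__main__":
    for word, column in zip(["the", "big", "dog"], build_lattice(["the", "big", "dog"], ["DET"])):
        print(word, column)
```

In this simplified representation each column holds the nodes for one word, and the edge weight between two nodes in neighbouring columns is looked up in the co-occurrence table when needed.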

The information processing apparatus according to claim 1, wherein the word appearance probability determination means determines the co-occurrence probability between the adjacent part-of-speech candidate and the attention part-of-speech candidate as the word appearance probability of the attention word.

The information processing apparatus according to claim 1, wherein,
when there are a plurality of adjacent part-of-speech candidates, the word appearance probability determination means determines, as the word appearance probability of the attention word, the maximum value among the co-occurrence probabilities between each of the plurality of adjacent part-of-speech candidates and the attention part-of-speech candidate.

The information processing apparatus according to claim 1, wherein,
when there are a plurality of adjacent part-of-speech candidates, the word appearance probability determination means determines, as the word appearance probability of the attention word, the total value obtained by summing the co-occurrence probabilities between each of the plurality of adjacent part-of-speech candidates and the attention part-of-speech candidate.
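
The difference between taking the maximum value and taking the total value over a plurality of adjacent part-of-speech candidates can be illustrated with a minimal Python sketch. The table contents and the function names `score_max` and `score_sum` are assumptions made for this example only.

```python
from typing import Dict, Iterable, List, Tuple

CoTable = Dict[Tuple[str, str], float]


def co_values(adjacent: Iterable[str], candidate: str, co: CoTable) -> List[float]:
    """Co-occurrence probability of the candidate with each adjacent candidate."""
    return [co.get((adj, candidate), 0.0) for adj in adjacent]


def score_max(adjacent: Iterable[str], candidate: str, co: CoTable) -> float:
    """Maximum value among the co-occurrence probabilities."""
    return max(co_values(adjacent, candidate, co), default=0.0)


def score_sum(adjacent: Iterable[str], candidate: str, co: CoTable) -> float:
    """Total value of the co-occurrence probabilities."""
    return sum(co_values(adjacent, candidate, co))


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    table = {("NOUN", "VERB"): 0.5, ("ADJ", "VERB"): 0.1}
    print(score_max(["NOUN", "ADJ"], "VERB", table))
    print(score_sum(["NOUN", "ADJ"], "VERB", table))
```

With a single adjacent candidate both functions return that candidate's co-occurrence probability. Depending on how the table is normalized, the total value may exceed 1, in which case it acts as a score rather than a probability in the strict sense.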

The information processing apparatus according to claim 1, wherein
the storage means further stores part-of-speech appearance probabilities obtained in advance, each being the probability that a part of speech appears, and the word appearance probability determination means determines, as the word appearance probability of the attention word, the product of the co-occurrence probability between the adjacent part-of-speech candidate and the attention part-of-speech candidate and the part-of-speech appearance probability of the attention part-of-speech candidate.

The information processing apparatus according to claim 5, wherein,
when there are a plurality of adjacent part-of-speech candidates, the word appearance probability determination means determines, as the word appearance probability of the attention word, the product of the maximum value among the co-occurrence probabilities between each of the plurality of adjacent part-of-speech candidates and the attention part-of-speech candidate and the part-of-speech appearance probability of the attention part-of-speech candidate.
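
As a worked illustration of the product described above, the following Python sketch multiplies the maximum co-occurrence probability by the part-of-speech appearance probability of the attention part-of-speech candidate. The function name `product_score` and all numeric values are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def product_score(
    adjacent: Iterable[str],
    candidate: str,
    co: Dict[Tuple[str, str], float],
    pos_probs: Dict[str, float],
) -> float:
    """Maximum co-occurrence probability times the POS appearance probability."""
    best_co = max((co.get((adj, candidate), 0.0) for adj in adjacent), default=0.0)
    return best_co * pos_probs.get(candidate, 0.0)


if __name__ == "__main__":
    # Hypothetical tables for illustration only.
    co_table = {("DET", "NOUN"): 0.6, ("ADJ", "NOUN"): 0.7}
    pos_table = {"NOUN": 0.4}
    print(product_score(["DET", "ADJ"], "NOUN", co_table, pos_table))
```

Here the maximum co-occurrence probability is 0.7 and the part-of-speech appearance probability is 0.4, so the result is approximately 0.28. Weighting by the part-of-speech appearance probability lowers the score of parts of speech that are rare overall even when they co-occur strongly with the adjacent candidate.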

The information processing apparatus according to claim 5, wherein the co-occurrence probability or the part-of-speech appearance probability stored in the storage means is learned in advance from a learning corpus, which is a sample of sentences.

The information processing apparatus according to claim 1, further comprising:
word part-of-speech assignment means for assigning the part of speech of the attention word to the attention word based on the word lattice generated by the word lattice generation means; and output means for outputting the attention word to which the part of speech of the attention word has been assigned.
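
The claim does not state how a part of speech is selected from the word lattice. One common approach, assumed here purely for illustration, is a Viterbi-style search that maximizes the product of the word appearance probabilities (nodes) and co-occurrence probabilities (edges) along a path. The lattice layout, the function name `best_path`, and the values are hypothetical.

```python
from typing import Dict, List, Tuple


def best_path(
    lattice: List[Dict[str, float]],
    co: Dict[Tuple[str, str], float],
) -> List[str]:
    """Return the POS sequence with the highest product of node and edge scores."""
    if not lattice or any(not column for column in lattice):
        return []
    # best[pos] = (score of the best path ending in pos, that path)
    best = {pos: (prob, [pos]) for pos, prob in lattice[0].items()}
    for column in lattice[1:]:
        step = {}
        for pos, node_prob in column.items():
            # Pick the predecessor that gives the highest score into this node.
            prev_pos, (prev_score, prev_path) = max(
                best.items(),
                key=lambda item: item[1][0] * co.get((item[0], pos), 0.0),
            )
            score = prev_score * co.get((prev_pos, pos), 0.0) * node_prob
            step[pos] = (score, prev_path + [pos])
        best = step
    return max(best.values(), key=lambda entry: entry[0])[1]


if __name__ == "__main__":
    # Hypothetical lattice: one column of {POS: word appearance probability} per word.
    columns = [{"DET": 1.0}, {"ADJ": 0.3, "NOUN": 0.6}, {"NOUN": 0.7, "VERB": 0.5}]
    table = {("DET", "NOUN"): 0.6, ("DET", "ADJ"): 0.3,
             ("ADJ", "NOUN"): 0.7, ("NOUN", "VERB"): 0.5}
    print(best_path(columns, table))
```

The returned list gives one part of speech per word, which corresponds to the assignment that the output means would then emit together with each word.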

The information processing apparatus according to claim 8, wherein
the storage means further stores a word table in which each word is associated with the stem of the word,
the apparatus further comprises stem assignment means for assigning the stem of the attention word to the attention word based on the word table stored in the storage means, and the output means outputs the attention word to which the part of speech and the stem of the attention word have been assigned.

The information processing apparatus according to claim 8, wherein
the storage means further stores a compound word table in which each compound word composed of a plurality of words is associated with the part of speech of the compound word,
the apparatus further comprises compound word part-of-speech assignment means for assigning, based on the compound word table stored in the storage means, the part of speech of a compound word to the compound word included in the word string, and the output means further outputs the compound word included in the word string to which the part of speech of the compound word has been assigned.

The storage means further stores word appearance probabilities obtained in advance, each being the probability that a word having a predetermined part of speech appears, and
the word appearance probability determination means
determines the word appearance probability of the attention word based on the word appearance probability stored in the storage means when the word appearance probability of the attention word is stored in the storage means, and
determines the word appearance probability of the attention word based on the co-occurrence probability between the adjacent part-of-speech candidate and the attention part-of-speech candidate when the word appearance probability of the attention word is not stored in the storage means. The information processing apparatus described.
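
The two cases above amount to a table lookup with a fallback, which can be sketched in Python as follows. The function name, the table layout keyed by (word, part of speech), and the use of the maximum co-occurrence probability in the fallback are assumptions for this example; the claim itself only requires that the fallback be based on the co-occurrence probability.

```python
from typing import Dict, Iterable, Tuple


def word_appearance_probability(
    word: str,
    candidate: str,
    adjacent: Iterable[str],
    word_probs: Dict[Tuple[str, str], float],
    co: Dict[Tuple[str, str], float],
) -> float:
    """Use the stored probability if present, else fall back to co-occurrence."""
    stored = word_probs.get((word, candidate))
    if stored is not None:
        return stored
    # Fallback for words missing from the table (assumed here: maximum value).
    return max((co.get((adj, candidate), 0.0) for adj in adjacent), default=0.0)


if __name__ == "__main__":
    # Hypothetical tables for illustration only.
    word_table = {("dog", "NOUN"): 0.02}
    co_table = {("DET", "NOUN"): 0.6}
    print(word_appearance_probability("dog", "NOUN", ["DET"], word_table, co_table))
    print(word_appearance_probability("blorf", "NOUN", ["DET"], word_table, co_table))
```

Because words absent from the table are still scored through the fallback, the word appearance probability table does not need an entry for every word, which is consistent with the stated aim of reducing memory capacity.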

The information processing apparatus according to claim 11, wherein the word appearance probability stored in the storage means is learned in advance from a learning corpus, which is a sample of sentences.

The information processing apparatus according to claim 1, wherein the part-of-speech candidate determination means determines, as the attention part-of-speech candidate, the part of speech having the maximum co-occurrence probability with the adjacent part-of-speech candidate.
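
Selecting the single part of speech with the maximum co-occurrence probability can be sketched in Python as follows. The function name `best_candidate` and the table values are hypothetical, and the table is assumed to be keyed by (adjacent part of speech, attention part of speech).

```python
from typing import Dict, Optional, Tuple


def best_candidate(
    adjacent_pos: str,
    co: Dict[Tuple[str, str], float],
) -> Optional[str]:
    """Part of speech with the highest co-occurrence probability, if any."""
    scores = {cur: prob for (prev, cur), prob in co.items() if prev == adjacent_pos}
    if not scores:
        return None
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Hypothetical table for illustration only.
    table = {("DET", "NOUN"): 0.6, ("DET", "ADJ"): 0.3}
    print(best_candidate("DET", table))
```

Keeping only one candidate per adjacent part of speech yields a smaller word lattice than keeping every part of speech that may co-occur.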

An information processing method for an information processing apparatus that generates a word lattice from a word string, the method comprising the steps of:
determining, based on co-occurrence probabilities stored in advance in storage means, each co-occurrence probability being the probability that two parts of speech co-occur, an attention part-of-speech candidate, which is a candidate for the part of speech of an attention word, the attention word being the word currently of interest among the words constituting the word string, and the attention part-of-speech candidate being one that may co-occur with an adjacent part-of-speech candidate, which is a candidate for the part of speech of an adjacent word, the adjacent word being a word immediately before or after the attention word;
determining, based on the co-occurrence probability between the adjacent part-of-speech candidate and the attention part-of-speech candidate, a word appearance probability, which is the probability that the attention word appears with the attention part-of-speech candidate as its part of speech; and
generating the word lattice based on the co-occurrence probabilities for adjacent words in the word string and the word appearance probability of each word constituting the word string.

A program that causes a computer to function as an information processing apparatus that generates a word lattice from a word string, the program causing the computer to function as:
part-of-speech candidate determination means for determining, based on co-occurrence probabilities stored in advance in storage means, each co-occurrence probability being the probability that two parts of speech co-occur, an attention part-of-speech candidate, which is a candidate for the part of speech of an attention word, the attention word being the word currently of interest among the words constituting the word string, and the attention part-of-speech candidate being one that may co-occur with an adjacent part-of-speech candidate, which is a candidate for the part of speech of an adjacent word, the adjacent word being a word immediately before or after the attention word;
word appearance probability determination means for determining, based on the co-occurrence probability between the adjacent part-of-speech candidate and the attention part-of-speech candidate, a word appearance probability, which is the probability that the attention word appears with the attention part-of-speech candidate as its part of speech; and
word lattice generation means for generating the word lattice based on the co-occurrence probabilities for adjacent words in the word string and the word appearance probability of each word constituting the word string.