JP2000075894A

JP2000075894A - Speech recognition method and apparatus, speech dialogue system, recording medium

Info

Publication number: JP2000075894A
Application number: JP10246624A
Authority: JP
Inventors: Atsuji Nagahara; 敦示永原; Toshihiro Isobe; 俊洋磯部
Original assignee: NTT Data Corp
Current assignee: NTT Data Group Corp
Priority date: 1998-09-01
Filing date: 1998-09-01
Publication date: 2000-03-14

Abstract

(57) [Abstract]
[Problem] To provide a speech recognition apparatus that yields highly accurate recognition results even when the acoustic features of the recognition targets are very similar and their word occurrence frequencies are nearly identical.
[Solution] A first emotion information processing unit 12 selects, from a prosodic emotion model 121, the emotion value Qk corresponding to the prosodic features of the input speech data. A speech recognition unit 13 recognizes the input speech data using an acoustic model 131 and a language model 132, derives a score Swn for each candidate word, and selects the N words with the highest Swn as word candidates Wn. A second emotion information processing unit 14 uses a semantic emotion model 141 to obtain the emotion value Rwn of each of the N word candidates Wn. For each of the N word candidates Wn, an emotional state integration unit 15 calculates a recognition score Twn by weighting the word candidate score Swn with the emotion values Qk and Rwn. A recognition result output unit 16 outputs the word candidate Wn with the highest recognition score Twn as the recognition result.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a speech recognition method and a speech recognition apparatus that correctly recognize the speech of a speaker who is expressing emotion, and to a spoken dialogue system to which the speech recognition apparatus is applied.

[0002]

[Prior Art] A conventional speech recognition apparatus performs speech recognition using acoustic models held for each speech recognition unit (phonemes, words, and so on) and a language model built from the occurrence frequencies of words in word sequences. The acoustic models are created in advance from a large amount of speech data and hold the acoustic features of speech. The language model is created in advance from a large amount of text data and holds the occurrence frequencies and chain probabilities of the words to be recognized.

[0003] Speech recognition by such an apparatus proceeds as follows. First, the input speech is segmented into phonemes, and each segment is converted into a feature vector. Each feature vector is then matched against the acoustic models, and the word with the highest matching score is selected, which turns the input speech into a phoneme/word sequence. During this step the language model estimates the word sequence likely to appear next. Recognition accuracy is thereby improved, because the search is not a plain phoneme-sequence search but also takes into account what is likely to come next. Through this series of recognition steps, a matching score is calculated for each character string against the input speech, and the string with the maximum matching score is taken as the recognition result.

[0004]

[Problems to Be Solved by the Invention] A conventional speech recognition apparatus may produce an incorrect recognition result when the recognition targets have very similar acoustic features and nearly identical word occurrence frequencies, as with "seiseki ga agaru" ("grades go up") and "seiseki ga sagaru" ("grades go down"). The problem is especially pronounced when the speech is badly degraded by noise or the like. Furthermore, when speech recognition is used for dialogue processing, the recognition result may contradict the flow of the dialogue.

[0005] An object of the present invention is therefore to provide an improved speech recognition method that yields highly accurate recognition results even when the acoustic features of the recognition targets are very similar and their word occurrence frequencies are nearly identical, and that, when speech recognition is used for dialogue processing, yields recognition results consistent with the flow of the dialogue. Another object of the present invention is to provide a speech recognition apparatus suited to carrying out this speech recognition method, and a recording medium for causing a computer to execute the method. A further object of the present invention is to provide a spoken dialogue system to which the speech recognition method is applied.

[0006]

[Means for Solving the Problems] The speech recognition method of the present invention that solves the above problems basically extracts prosodic information from the uttered speech data to estimate the speaker's emotional state, and then, among the word candidates produced by speech recognition processing, suppresses recognition results that contradict the estimated emotional state. Highly accurate speech recognition is achieved in this way.

[0007] The emotional state referred to here is a state that expresses human feelings, such as pleasant or unpleasant, or joy, anger, sorrow, and pleasure. To deal with candidates whose acoustic features and linguistic features (occurrence frequency and so on) are very similar, the method obtains an emotional state based on the prosodic features of the whole utterance and an emotional state based on the linguistic meaning of each recognition candidate word, and lowers the priority of candidates for which the two emotional states contradict each other. This reduces recognition candidates that contradict the content of the utterance, so that speech recognition that better matches the utterance content can be expected.

[0008] The speech recognition method of the present invention is embodied by executing the following processes on a computer.
(1) Speech input processing that takes in the speaker's speech and converts it into digital speech data.
(2) First emotion information processing that extracts prosodic information from the speech data obtained by the speech input processing and derives the prosodic emotion value corresponding to this prosodic information.
(3) Speech recognition processing that recognizes the speech data obtained by the speech input processing and selects word candidates with a high recognition occurrence rate.
(4) Second emotion information processing that derives the semantic emotion value of each word candidate obtained by the speech recognition processing.
(5) Emotional state integration processing that, for each word candidate obtained by the speech recognition processing, derives a recognition score reflecting the speaker's emotional state by weighting with the prosodic emotion value obtained by the first emotion information processing, the word candidate score obtained by the speech recognition processing, and the semantic emotion value obtained by the second emotion information processing.
(6) Recognition result output processing that selects the word candidate with the highest recognition score obtained by the emotional state integration processing and outputs it as the recognition result.

[0009] The recording medium of the present invention that solves the other problem above is a computer-readable recording medium on which a program for causing a computer to execute the above processes is recorded.

[0010] The speech recognition apparatus of the present invention that solves the other problem above comprises: speech input means that takes in the speaker's speech and converts it into digital speech data; first emotion information processing means that extracts prosodic information from the input speech data obtained by the speech input means and obtains the prosodic emotion value corresponding to this prosodic information; speech recognition means that recognizes the input speech data obtained by the speech input means and selects word candidates with a high recognition occurrence rate; second emotion information processing means that obtains the semantic emotion value of each word candidate obtained by the speech recognition means; emotional state integration means that, for each word candidate obtained by the speech recognition means, calculates a recognition score reflecting the speaker's emotional state by weighting with the prosodic emotion value obtained by the first emotion information processing means, the word candidate score obtained by the speech recognition means, and the semantic emotion value obtained by the second emotion information processing means; and recognition result output means that selects the word candidate with the highest recognition score obtained by the emotional state integration means and outputs it as the recognition result.

[0011] The emotional state integration means is configured, for example, to weight the word candidate score according to the difference between the prosodic emotion value and the semantic emotion value, thereby suppressing word candidates whose semantic emotional state contradicts the prosodic emotional state.

[0012] More specifically, the speech recognition apparatus comprises: first emotion information processing means that extracts a prosodic feature vector Xi from the input speech data obtained by the speech input means and selects the prosodic emotion value Qk corresponding to the vector Xj closest to the current feature vector Xi; speech recognition means that recognizes the input speech data obtained by the speech input means, derives a score Swn for each candidate word, and selects the N words with the highest scores Swn as word candidates Wn; second emotion information processing means that obtains the semantic emotion value Rwn of each of the N word candidates Wn obtained by the speech recognition means; emotional state integration means that, for each of the N word candidates Wn obtained by the speech recognition means, calculates a recognition score Twn from the prosodic emotion value Qk obtained by the first emotion information processing means, the word candidate score Swn obtained by the speech recognition means, and the semantic emotion value Rwn obtained by the second emotion information processing means, according to

Twn = Swn × α(Qk − Rwn)^(−β) (where α and β are positive constants);

and recognition result output means that selects the word candidate Wn with the highest recognition score Twn obtained by the emotional state integration means and outputs it as the recognition result.

[0013] The first emotion information processing means has a prosodic emotion model in the form of a table showing the correspondence between prosodic feature vectors Xk and emotion values Qk obtained in advance from a large amount of speech data, and is configured to derive the prosodic emotion value Qk corresponding to the input speech data by referring to this prosodic emotion model. The second emotion information processing means has a semantic emotion model in the form of a table showing the correspondence between words Wj and emotion values Rwj obtained in advance from a large amount of text data, and is configured to obtain the semantic emotion value Rwn of a word Wn by referring to this semantic emotion model.

[0014] A spoken dialogue system to which the speech recognition technique of the present invention is applied uses two kinds of information in its speech recognition function, "sound information (information based on acoustic and linguistic features)" and "emotion information (information based on prosodic features)", to narrow down the recognition result. This reduces recognition errors and enables user-friendly speech recognition in line with the speaker's emotional state.

[0015]

[Embodiments of the Invention] Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 shows an embodiment of the speech recognition apparatus of the present invention. This embodiment uses the concept of an "emotional state". An emotional state is a state that expresses human feelings such as pleasant or unpleasant, or joy, anger, sorrow, and pleasure, and is expressed as degrees of emotion in k categories. The degree for each category is represented by a continuous numerical value. For the pleasant/unpleasant category, for example, it is a value on a single axis on which unpleasant is "1" and pleasant is "5". The vector that combines the continuous values of these categories is referred to as an emotion value (there are emotion values for words and emotion values for feature vectors).
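
As an illustration of the emotion value described above, the following Python sketch stores one continuous degree per category. The `EmotionValue` class, the category name, and the range check are assumptions made for illustration; only the axis running from 1 (unpleasant) to 5 (pleasant) comes from the text.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical category list; the text gives pleasant/unpleasant only as an example.
CATEGORIES: Tuple[str, ...] = ("pleasantness",)


@dataclass(frozen=True)
class EmotionValue:
    """Vector of continuous degrees, one per emotion category."""

    degrees: Tuple[float, ...]

    def __post_init__(self) -> None:
        if len(self.degrees) != len(CATEGORIES):
            raise ValueError("one degree per category is required")
        # Assumed range: 1.0 (unpleasant) to 5.0 (pleasant), as on the example axis.
        for degree in self.degrees:
            if not 1.0 <= degree <= 5.0:
                raise ValueError("degree must lie in [1.0, 5.0]")


happy = EmotionValue((4.5,))
sad = EmotionValue((1.5,))
```

With more categories, `CATEGORIES` would simply grow and each emotion value would carry k degrees.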

[0016] In FIG. 1, a speech input unit 11 takes in the speaker's analog speech signal through a microphone or the like, converts it into digital input speech data, and outputs the data. The input speech data is sent to the first emotion information processing unit 12 and the speech recognition unit 13.

[0017] As shown in FIG. 2, the first emotion information processing unit 12 has a prosodic emotion model 121 that shows the correspondence between feature vectors (for example, the first formant, the second formant, and so on) representing the prosodic information (for example, pitch information) contained in uttered speech and emotion values (emotion values for feature vectors). The prosodic emotion model 121 is a table of correspondences between prosodic feature vectors and emotion values obtained in advance through psychological experiments using a large amount of speech data. The unit applies well-known processing such as FFT or LPC analysis to the input speech data to extract a prosodic feature vector, finds the vector in the prosodic emotion model 121 that is closest to the current feature vector, and selects the corresponding emotion value by referring to the model 121. The selected emotion value is sent to the emotional state integration unit 15.
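
The table lookup described above can be sketched as a nearest-neighbour search. This is a minimal sketch, not the patented implementation: the table contents, the two-component feature vectors, and the choice of Euclidean distance are all assumptions, since the text does not say how "closest" is measured.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical rows of the prosodic emotion model: (feature vector Xj, emotion value Qk).
# All numbers are made up for illustration.
PROSODIC_MODEL: List[Tuple[Tuple[float, ...], float]] = [
    ((220.0, 0.8), 4.5),  # assumed "pleasant" prosody
    ((140.0, 0.3), 1.5),  # assumed "unpleasant" prosody
    ((180.0, 0.5), 3.0),  # assumed neutral prosody
]


def lookup_prosodic_emotion(x: Sequence[float]) -> float:
    """Return Qk for the table vector closest to x (Euclidean distance, assumed)."""
    _, q_k = min(PROSODIC_MODEL, key=lambda row: math.dist(row[0], x))
    return q_k


q_k = lookup_prosodic_emotion((210.0, 0.7))
```

In this toy table the first component dominates the distance; a real table would probably need its features scaled to comparable ranges before the comparison.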

[0018] The speech recognition unit 13 has acoustic models 131 held for each speech recognition unit (phonemes, words, and so on) and a language model 132 built from the occurrence frequencies of words in word sequences. Using these models, it performs speech recognition on the input speech data and derives a score for each candidate word. Conventional speech recognition technology can be used for this. One or more words with high recognition scores are then selected as word candidates. The word candidates are sent to the second emotion information processing unit 14 and the emotional state integration unit 15, and the word candidate scores are sent to the emotional state integration unit 15.
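
Selecting the highest-scoring words as candidates can be sketched as follows. The function name `n_best` and the scores are hypothetical; the recognizer that produces the scores is outside the scope of this sketch.

```python
import heapq
from typing import Dict, List, Tuple


def n_best(scores: Dict[str, float], n: int) -> List[Tuple[str, float]]:
    """Return the n (word, score) pairs with the highest score, best first."""
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])


# Made-up recognizer scores for illustration.
scores = {"went up": 0.41, "went down": 0.40, "turned": 0.12, "understood": 0.05}
candidates = n_best(scores, 2)
```

Here the two top candidates are separated by only 0.01, which is the situation the emotion-based weighting is meant to resolve.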

[0019] As shown in FIG. 3, the second emotion information processing unit 14 has a semantic emotion model 141 that shows the correspondence between words and emotion values (emotion values for words). The semantic emotion model 141 is a table of correspondences between words and emotion values obtained in advance through psychological experiments using a large amount of text data. Using this semantic emotion model 141, the unit obtains the emotion value of each word candidate passed from the speech recognition unit 13. The emotion values obtained are sent to the emotional state integration unit 15.
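
A word-to-emotion-value table of this kind maps naturally onto a dictionary. In the sketch below the table entries and the neutral fallback for unknown words are assumptions; the text does not say what happens when a candidate is missing from the model.

```python
from typing import Dict, List

# Hypothetical semantic emotion model: word Wj -> emotion value Rwj
# on the 1 (unpleasant) to 5 (pleasant) axis. Values are made up.
SEMANTIC_MODEL: Dict[str, float] = {
    "went up": 4.0,
    "went down": 2.0,
}
NEUTRAL = 3.0  # assumed fallback for words absent from the table


def semantic_emotions(candidates: List[str]) -> Dict[str, float]:
    """Look up the emotion value Rwn of every word candidate."""
    return {word: SEMANTIC_MODEL.get(word, NEUTRAL) for word in candidates}


r = semantic_emotions(["went up", "went down", "turned"])
```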

[0020] For each word candidate from the speech recognition unit 13, the emotional state integration unit 15 calculates a recognition score using the emotion value from the first emotion information processing unit 12, the word candidate score from the speech recognition unit 13, and the emotion value from the second emotion information processing unit 14. The recognition scores are sent to the recognition result output unit 16. The recognition result output unit 16 selects the word candidate with the highest recognition score and outputs it as the recognition result.

[0021] The speech recognition apparatus of this embodiment, which has the functions described above, can be implemented by a computer whose operation is controlled by a computer program recorded on, for example, a magnetic disk, an optical or magneto-optical disk, or a semiconductor memory, or by a computer program transmitted through a communication medium. It can also be implemented using a signal processor.

[0022] Next, a speech recognition method using the above speech recognition apparatus is described. Specifically, the method proceeds as follows. First, the speaker's input speech data is taken in from the speech input unit 11. This input speech data reflects the speaker's emotional state.

[0023] The first emotion information processing unit 12 extracts a prosodic feature vector Xi from the input speech data, finds the vector Xj in the prosodic emotion model 121 that is closest to the current feature vector Xi, and selects the corresponding emotion value Qk. The speech recognition unit 13 performs speech recognition on the input speech data to derive recognition scores Swn for the candidate words, and selects the N word candidates Wn with the highest recognition scores Swn. These word candidates Wn are input to the second emotion information processing unit 14. The second emotion information processing unit 14 obtains the emotion value Rwn of each of the N word candidates Wn and sends these values to the emotional state integration unit 15.

[0024] For each of the N word candidates Wn passed from the speech recognition unit 13, the emotional state integration unit 15 calculates a recognition score Twn by the following formula, using the emotion value Qk obtained by the first emotion information processing unit 12, the word candidate score Swn obtained by the speech recognition unit 13, and the emotion value Rwn obtained by the second emotion information processing unit 14. Here α and β are positive constants.

Twn = Swn × α(Qk − Rwn)^(−β)
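
The formula above can be sketched for the single-axis case, where Qk and Rwn are plain numbers. Two details are assumptions of this sketch rather than statements of the source: the difference is taken as an absolute value, and it is floored at a small `eps` so that an exact match does not divide by zero. The text gives (Qk − Rwn)^(−β) without saying how a zero or negative difference is handled.

```python
def recognition_score(
    s_wn: float,
    q_k: float,
    r_wn: float,
    alpha: float = 1.0,
    beta: float = 1.0,
    eps: float = 1e-3,
) -> float:
    """Twn = Swn * alpha * d ** (-beta), with d = max(|Qk - Rwn|, eps).

    The absolute value and the eps floor are assumptions of this sketch.
    """
    if alpha <= 0 or beta <= 0:
        raise ValueError("alpha and beta must be positive")
    d = max(abs(q_k - r_wn), eps)
    return s_wn * alpha * d ** (-beta)


# Same recognizer score, different agreement with the prosodic emotion value 4.5.
t_consistent = recognition_score(0.4, q_k=4.5, r_wn=4.0)
t_contradictory = recognition_score(0.4, q_k=4.5, r_wn=2.0)
```

A candidate whose semantic emotion value is close to the prosodic one keeps a high score, while a contradictory candidate is scaled down; larger β makes the suppression stronger.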

[0025] The recognition score Twn of each of the N word candidates Wn is sent to the recognition result output unit 16. The recognition result output unit 16 selects the word candidate Wn with the highest score Twn according to the following expression and outputs it as the recognition result.

Wn = {Wn | arg max_wn (Twn)}

Here, writing

γ = α(Qk − Rwn)^(−β)

the recognition result can be expressed as follows.

Wn = {Wn | arg max_wn (Swn × γ)}

[0026] In this embodiment, as described above, an emotion value corresponding to the speaker's emotional state is estimated from the input speech data, the second emotion information processing unit 14 obtains the emotion value of each word among the recognition result candidates produced by the speech recognition unit 13, and the emotional state integration unit 15 suppresses, among the word candidates, recognition results that contradict the estimated emotional state. Input speech data that reflects an emotional state can therefore be recognized with high accuracy.

[0027] In addition, to deal with candidates whose acoustic features and linguistic features (occurrence frequency and so on) are very similar, an emotional state based on the prosodic features of the whole utterance and an emotional state based on the linguistic meaning of each recognition candidate word are obtained, and the priority of candidates for which the two emotional states contradict each other is lowered. Recognition candidates that contradict the utterance content can therefore be suppressed, and speech recognition that better matches the utterance content can be expected. Specifically, recognition results that contradict the utterance content are suppressed by multiplying the recognition score Swn used in conventional speech recognition by the emotional-state-based score γ, as in the following expression.

Wn = {Wn | arg max_wn (Swn × γ)}

[0028] FIG. 4 shows an example of the processing procedure when the above speech recognition method is realized by having a computer read and execute a computer program. In this case the models 121, 131, 132, and 141 are stored on an external recording medium.

[0029] In FIG. 4, when the speech to be recognized is input (S1), the prosodic emotion model is referred to, and the feature vector Xk corresponding to the prosodic information in the speech and its emotion value Qk are obtained (S2). Next, the acoustic model and the language model are referred to, the recognized word candidates W1 to Wn are extracted, and their recognition scores Sw1 to Swn are obtained (S3). The semantic emotion model is then referred to, and the emotion values Rw1 to Rwn of the word candidates W1 to Wn are obtained (S4). Next, for each of the word candidates W1 to Wn, the recognition scores Tw1 to Twn are obtained from the corresponding semantic emotion values Rw1 to Rwn, word candidate scores Sw1 to Swn, and prosodic emotion value Qk (S5). Finally, the word candidate with the highest of the recognition scores Tw1 to Twn is selected and output as the recognition result (S6).
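
Steps S2 to S6 can be sketched as a single function that receives each stage as a callable. The function name `recognize`, the stage signatures, and the stub stages in the usage example are hypothetical; they only show how the data flows between the steps, not how any stage is actually implemented.

```python
from typing import Callable, Dict, Sequence


def recognize(
    audio: Sequence[float],
    prosodic_emotion: Callable[[Sequence[float]], float],    # S2: audio -> Qk
    n_best: Callable[[Sequence[float]], Dict[str, float]],   # S3: audio -> {Wn: Swn}
    semantic_emotion: Callable[[str], float],                # S4: Wn -> Rwn
    integrate: Callable[[float, float, float], float],       # S5: (Swn, Qk, Rwn) -> Twn
) -> str:
    """Run S2 to S6 on already digitized audio (S1) and return the best word."""
    q_k = prosodic_emotion(audio)
    scores = n_best(audio)
    t = {w: integrate(s, q_k, semantic_emotion(w)) for w, s in scores.items()}
    return max(t, key=t.get)  # S6: highest recognition score wins


# Usage with stub stages and made-up numbers.
result = recognize(
    audio=[0.0, 0.1, -0.1],
    prosodic_emotion=lambda a: 4.5,
    n_best=lambda a: {"went up": 0.40, "went down": 0.41},
    semantic_emotion=lambda w: {"went up": 4.0, "went down": 2.0}[w],
    integrate=lambda s, q, r: s / max(abs(q - r), 1e-3),
)
```

In this toy run the recognizer alone slightly prefers "went down" (0.41 against 0.40), but the pleasant prosody reverses the ranking after integration.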

[0030] Next, a spoken dialogue system to which the speech recognition apparatus of the present invention is applied is described. FIG. 5 shows the configuration of this spoken dialogue system. The speech recognition apparatus 21 is the speech recognition apparatus of this embodiment, a configuration example of which is shown in FIG. 1. The word sequence recognized in turn by the speech recognition apparatus 21 is sent to a response processing device 22. The response processing device 22 determines the meaning of the input word sequence, extracts the corresponding response speech data from a speech dictionary file 23, and outputs it to a speech synthesizer 24.

[0031] The speech synthesizer 24 converts the input response speech data into an analog speech signal and outputs it as sound through a speaker 25. When extracting the response speech data, the response processing device 22 also selects the corresponding character information, such as an animation, from a character file 26 and displays it on a display 27 as appropriate.

[0032] As a concrete example of speech recognition, the following describes the case of recognizing the two words "agatta" ("went up") and "sagatta" ("went down"). Suppose first that the user (speaker) says, "My test score went up today." The speech recognition apparatus 21 then performs the following processing (see FIG. 1 for the reference numerals of the functional blocks). The speech input unit 11 converts the analog speech into digital input speech data. Prosodic feature extraction in the first emotion information processing unit 12 yields a prosodic feature vector Xk, and the corresponding emotion value Qk is selected. This emotion value Qk is passed to the emotional state integration unit 15.

[0033] Focusing on the recognition of the "went up" portion, the speech recognition unit 13 derives "went up (W1)" and "went down (W2)" as word candidates for the recognition result. These word candidates W1 and W2 are passed to the second emotion information processing unit 14. The recognition scores Sw1 and Sw2 of the word candidates obtained by the speech recognition unit 13 are passed to the emotional state integration unit 15.

[0034] The second kansei information processing unit 14 derives the emotion value Rw1 of "went up (W1)" and the emotion value Rw2 of "went down (W2)" by comparison with the semantic kansei model 141. These emotion values Rw1 and Rw2 are passed to the kansei state integration unit 15. The kansei state integration unit 15 calculates the recognition scores Tw1 and Tw2 from the processing results of the first kansei information processing unit 12, the speech recognition unit 13, and the second kansei information processing unit 14, and passes them to the recognition result output unit 16. The recognition result output unit 16 takes the candidate with the larger of the recognition scores Tw1 and Tw2 as the recognition result. Here Tw1 > Tw2, so the recognition result is "went up".
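The integration step can be sketched in a few lines using the weighting formula given in claim 5, Twn = Swn × α(Qk − Rwn)^(−β). The sketch below is a reading of that formula under stated assumptions, not the implementation of the embodiment: it takes the absolute difference and adds a small constant `EPS` so the expression stays defined when Qk equals Rwn, and the values of α and β, the candidate scores, and the table entries are all made up for illustration.

```python
ALPHA = 1.0  # positive constant alpha (value assumed)
BETA = 1.0   # positive constant beta (value assumed)
EPS = 1e-6   # assumption: keeps the base nonzero when Qk equals Rwn

# Hypothetical semantic kansei model: word -> semantic emotion value Rw.
SEMANTIC_KANSEI_MODEL = {"went up": 0.8, "went down": 0.2}


def recognition_score(s_w, q_k, r_w, alpha=ALPHA, beta=BETA):
    """Recognition score T = S * alpha * (|Q - R| + EPS) ** (-beta).

    A larger gap between the prosodic and semantic emotion values
    gives a smaller score, which suppresses mismatched candidates.
    """
    return s_w * alpha * (abs(q_k - r_w) + EPS) ** (-beta)


def recognize(candidate_scores, q_k):
    """Return the candidate with the highest recognition score, plus all scores."""
    scores = {
        word: recognition_score(s_w, q_k, SEMANTIC_KANSEI_MODEL[word])
        for word, s_w in candidate_scores.items()
    }
    return max(scores, key=scores.get), scores


# Two acoustically similar candidates with a prosodic emotion value on the happy side.
best, scores = recognize({"went up": 0.52, "went down": 0.48}, q_k=0.9)
```

With these illustrative numbers, the candidate whose semantic emotion value lies closer to Qk receives the higher recognition score and is selected.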

[0035] The response processing device 22 takes note of the recognition result "went up", extracts the response voice data "That's great." from the voice dictionary file 23, and outputs it as speech from the speaker 25 via the speech synthesis device 24.

[0036] If, on the other hand, the user says, "My test score went down today," the recognition scores Tw1 and Tw2 in the speech recognition device 21 satisfy Tw1 < Tw2, and the recognition result is "went down". The response processing device 22 then takes note of the recognition result "went down", extracts the response voice data "That's too bad." from the voice dictionary file 23, and outputs it as speech from the speaker 25 via the speech synthesis device 24.
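The response selection performed by the response processing device 22 amounts to a lookup keyed on the recognition result. The following minimal sketch assumes that the voice dictionary file 23 can be modeled as a mapping from recognized words to responses. Text strings stand in for the stored response voice data, and the names `VOICE_DICTIONARY` and `select_response` are hypothetical.

```python
# Hypothetical stand-in for the voice dictionary file 23:
# recognition result -> response (text used in place of voice data).
VOICE_DICTIONARY = {
    "went up": "That's great.",
    "went down": "That's too bad.",
}


def select_response(recognition_result, dictionary=VOICE_DICTIONARY):
    """Return the response for a recognition result, or None if there is no entry."""
    return dictionary.get(recognition_result)


response = select_response("went down")
```

In the described system the selected response would then be handed to the speech synthesis device 24 for output from the speaker 25.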

[0037] Now suppose that, because of noise or the like, the "went up" or "went down" portion becomes unclear, and the input speech is "My test score went × today." (the × portion is missing). With the conventional method, the occurrence rates of "went up" and "went down" are identical, so a correct response cannot be expected. With the method of the present invention, by contrast, the kansei state is estimated from the prosodic intonation to determine whether the utterance is a happy or a sad expression, and this is applied as a weight in the form of an emotion value.

[0038] With this method, when the utterance is judged to be a happy expression, the recognition scores satisfy Tw1 > Tw2, the recognition result is "went up", and the response voice "That's great." is output. When it is judged to be a sad expression, the recognition scores satisfy Tw1 < Tw2, the recognition result is "went down", and the response voice "That's too bad." is output. Therefore, even if the speech input is unclear and part of a key word is lost as information, the choice that does not match the prosodic kansei state is eliminated, so the probability of outputting a correct response improves markedly.
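Why the prosodic weighting settles the noisy case can be seen from the formula in claim 5. The short derivation below assumes that the difference Qk − Rwn enters as a nonzero absolute value, which is an interpretation rather than something the specification states, and it treats the two candidate scores as exactly equal.

```latex
% Assumption: the formula of claim 5 is read with an absolute, nonzero difference.
\[
T_{w_n} = S_{w_n}\,\alpha\,\lvert Q_k - R_{w_n}\rvert^{-\beta},
\qquad \alpha,\ \beta > 0
\]
% With equal word candidate scores, S and alpha cancel in the ratio.
\[
S_{w_1} = S_{w_2}
\;\Longrightarrow\;
\frac{T_{w_1}}{T_{w_2}}
= \left( \frac{\lvert Q_k - R_{w_2}\rvert}{\lvert Q_k - R_{w_1}\rvert} \right)^{\beta}
\]
```

Under these assumptions Tw1 exceeds Tw2 exactly when Rw1 lies closer to Qk than Rw2 does, so the ranking is decided entirely by which candidate's semantic emotion value better matches the prosodic emotion value.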

[0039] Thus, when speech recognition is used as a man-machine interface in which the computer recognizes speech and performs some reaction, and the aim is to build a user-friendly interface, for example one that outputs facial expressions as animation as the reaction, applying the present invention is considered extremely effective, because a high recognition rate can be expected even when the acoustic features of the recognition targets are very similar and their word occurrence frequencies are almost identical.

[0040]

[Effects of the Invention] As is apparent from the above description, the present invention yields highly accurate recognition results even when the acoustic features of the recognition targets are very similar and their word occurrence frequencies are almost identical. In addition, when speech recognition is used for dialogue processing, it yields recognition results that are consistent with the flow of the dialogue.

[Brief description of the drawings]

FIG. 1 is a functional block diagram showing the configuration of an embodiment of the speech recognition device of the present invention.

FIG. 2 is a diagram showing a specific example of the prosodic kansei model used in the embodiment.

FIG. 3 is a diagram showing a specific example of the semantic kansei model used in the embodiment.

FIG. 4 is a flowchart showing the processing flow of the speech recognition method according to the present invention.

FIG. 5 is a block diagram showing the configuration of a spoken dialogue system using the speech recognition device of the present invention.

[Explanation of symbols]

11 Voice input unit
12 First kansei information processing unit
121 Prosodic kansei model
13 Speech recognition unit
131 Acoustic model
132 Language model
14 Second kansei information processing unit
141 Semantic kansei model
15 Kansei state integration unit
16 Recognition result output unit
21 Speech recognition device
22 Response processing device
23 Voice dictionary file
24 Speech synthesis device
25 Speaker
26 Character file
27 Display

Claims

[Claims]

1. A speech recognition method comprising: a step of extracting prosodic information from input speech data and estimating the kansei state of the speaker; a step of identifying, from among the word candidates nominated by speech recognition processing, a recognition result that contradicts the estimated kansei state; and a step of suppressing the identified recognition result.

2. A speech recognition method characterized by causing a computer device to execute: voice input processing for inputting the speech of a speaker and converting it into digital speech data; first kansei information processing for extracting prosodic information from the speech data obtained by the voice input processing and deriving a prosodic emotion value corresponding to the prosodic information; speech recognition processing for recognizing the speech data obtained by the voice input processing and selecting word candidates having a high recognition appearance rate; second kansei information processing for deriving a semantic emotion value for each of the word candidates obtained by the speech recognition processing; kansei state integration processing for deriving, for each of the word candidates obtained by the speech recognition processing, a recognition score that takes the kansei state of the speaker into account by weighting with the prosodic emotion value obtained by the first kansei information processing, the word candidate score obtained by the speech recognition processing, and the semantic emotion value obtained by the second kansei information processing; and recognition result output processing for selecting the word candidate having the highest recognition score obtained by the kansei state integration processing and outputting it as the recognition result.

3. A speech recognition device characterized by comprising: voice input means for inputting the speech of a speaker and converting it into digital speech data; first kansei information processing means for extracting prosodic information from the input speech data obtained by the voice input means and obtaining a prosodic emotion value corresponding to the prosodic information; speech recognition means for recognizing the input speech data obtained by the voice input means and selecting word candidates having a high recognition appearance rate; second kansei information processing means for obtaining a semantic emotion value for each of the word candidates obtained by the speech recognition means; kansei state integration means for calculating, for each of the word candidates obtained by the speech recognition means, a recognition score that takes the kansei state of the speaker into account by weighting with the prosodic emotion value obtained by the first kansei information processing means, the word candidate score obtained by the speech recognition means, and the semantic emotion value obtained by the second kansei information processing means; and recognition result output means for selecting the word candidate having the highest recognition score obtained by the kansei state integration means and outputting it as the recognition result.

4. The speech recognition device according to claim 3, wherein the kansei state integration means weights the word candidate score on the basis of the difference between the prosodic emotion value and the semantic emotion value, thereby suppressing word candidates corresponding to a semantic kansei state that contradicts the prosodic kansei state.

5. A speech recognition device characterized by comprising: voice input means for inputting the speech of a speaker and converting it into digital speech data; first kansei information processing means for extracting a prosodic feature vector XI from the input speech data obtained by the voice input means and selecting the prosodic emotion value Qk corresponding to the closest vector Xj; speech recognition means for recognizing the input speech data obtained by the voice input means, deriving the scores Swn of candidate words, and selecting the N words having the highest scores Swn as word candidates Wn; second kansei information processing means for obtaining the semantic emotion value Rwn of each of the N word candidates Wn obtained by the speech recognition means; kansei state integration means for calculating, for each of the N word candidates Wn obtained by the speech recognition means, a recognition score Twn from the prosodic emotion value Qk obtained by the first kansei information processing means, the word candidate score Swn obtained by the speech recognition means, and the semantic emotion value Rwn obtained by the second kansei information processing means, according to Twn = Swn × α(Qk − Rwn)^(−β) (where α and β are positive constants); and recognition result output means for selecting the word candidate Wn having the highest recognition score Twn obtained by the kansei state integration means and outputting it as the recognition result.

6. The speech recognition device according to claim 5, wherein the first kansei information processing means holds a prosodic kansei model in the form of a table showing the correspondence, obtained in advance from a large amount of speech data, between prosodic feature vectors Xk and emotion values Qk, and is configured to derive the prosodic emotion value Qk corresponding to the input speech data by referring to the prosodic kansei model.

7. The speech recognition device according to claim 5, wherein the second kansei information processing means holds a semantic kansei model in the form of a table showing the correspondence, obtained in advance from a large amount of text data, between words Wj and emotion values Rwj, and is configured to obtain the semantic emotion value Rwn of the word Wn by referring to the semantic kansei model.

8. A spoken dialogue system characterized by comprising: the speech recognition device according to claim 3; response processing means for outputting response voice data whose meaning corresponds to the recognition result of the speech recognition device; and voice output means for converting the response voice data into an analog voice signal and outputting it, the system responding automatically, with emotion, to the input speech.

9. The spoken dialogue system as described, further comprising character information holding means for holding character information such as animation with facial expressions, and display means for displaying specified character information, wherein the response processing means is configured to select character information corresponding to the response voice data from the character information holding means and cause the display means to display it.

10. A computer-readable recording medium on which is recorded a program for causing a computer device to execute: voice input processing for inputting the speech of a speaker and converting it into digital speech data; first kansei information processing for extracting prosodic information from the speech data obtained by the voice input processing and deriving a prosodic emotion value corresponding to the prosodic information; speech recognition processing for recognizing the speech data obtained by the voice input processing and selecting word candidates having a high recognition appearance rate; second kansei information processing for deriving a semantic emotion value for each of the word candidates obtained by the speech recognition processing; kansei state integration processing for deriving, for each of the word candidates obtained by the speech recognition processing, a recognition score that takes the kansei state of the speaker into account by weighting with the prosodic emotion value obtained by the first kansei information processing, the word candidate score obtained by the speech recognition processing, and the semantic emotion value obtained by the second kansei information processing; and recognition result output processing for selecting the word candidate having the highest recognition score obtained by the kansei state integration processing and outputting it as the recognition result.