JPH11119791A

JPH11119791A - System and method for voice feeling recognition

Info

Publication number: JPH11119791A
Application number: JP9286372A
Authority: JP
Inventors: Shinji Wakizaka; 新路脇坂; Kazuo Kondo; 和夫近藤; Yasunari Obuchi; 康成大淵; Tetsuji Toushita; 哲司塔下; Yasuyo Ishikawa; 泰代石川
Original assignee: Hitachi Ltd; Hitachi ULSI Systems Co Ltd
Current assignee: Hitachi Ltd; Hitachi Solutions Technology Ltd
Priority date: 1997-10-20
Filing date: 1997-10-20
Publication date: 1999-04-30

Abstract

(57) [Abstract]

[Problem] To recognize the level of a speaker's emotion in a speech recognition system.

[Solution] The system comprises a dictionary unit that collects the words to be recognized, a speech analysis unit that performs speech analysis, an acoustic model unit that holds speech patterns in phoneme units, an utterance-deformation emotion model unit that represents the deformation of the phoneme spectrum caused by emotion, and a speech recognition unit that performs speech recognition on the speech analysis result by linking the acoustic model unit, the utterance-deformation emotion model unit, and the dictionary unit. From the features of the speech, the system outputs the target word as the speech recognition result together with an emotion level indicating the degree of the speaker's emotion carried by the speech. In another configuration, the speech analysis unit outputs the emotion level from the power features of the speech.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a speech recognition system and method for use in car navigation systems, in small information devices typified by in-vehicle PCs, PDAs (personal digital assistants), and handheld PCs, and in portable speech translators, games, and home appliances. In particular, it relates to a speech emotion recognition system and method that recognizes the words and sentences targeted by speech recognition and, for words and sentences that express emotion, also recognizes the degree of that emotion.

[0002]

[Description of the Related Art] In recent years, small information systems that use speech recognition technology have been becoming widespread. Examples include car navigation systems, small information devices typified by PDAs, and portable translators.

[0003] As an example of such a speech recognition system, Japanese Patent Application No. 5-35776, "Translation Device with Automatic Language Selection Function," discloses a technique for a portable translation device that recognizes the operator's speech input from a microphone, translates it, and outputs speech in the translated language.

[0004] An outline of a speech translation device according to this prior art is described below with reference to FIG. 7.

[0005] FIG. 7 is a block diagram showing the configuration of a speech translation device according to the prior art. The control unit 701 comprises a microprocessor or the like and controls each part of the device. The speech segment extraction unit 702 converts the speech input from the microphone 709 into a digital signal, extracts the speech segment, and sends it to the speech recognition unit 703. On instruction from the control unit 701, which has received an operation signal 711 from a keyboard, switch, or the like, the speech recognition unit 703 analyzes the speech extracted via the microphone 709 and the speech segment extraction unit 702. It then performs speech recognition by comparing the result with the standard speech patterns stored in the speech recognition dictionary unit 707. The speech synthesis unit 705 reads the translated word corresponding to the speech recognized by the speech recognition unit 703 from the translated-word data memory card 706, converts it into a speech signal, and outputs it via the speaker amplifier 710 and the speaker 708.

[0006] The display unit 704 displays instructions to the user of the translation device and shows translated words as text. The translated-word data memory card 706 is a ROM card or the like and stores speech data for cases where the translated word is output as synthesized speech. The character codes corresponding to the translated word are also read from the translated-word data memory card 706 and displayed on the display unit 704. Translation into multiple languages becomes possible by exchanging the translated-word data memory card 706 for one for another language. The speech recognition dictionary unit 707 is a RAM or the like and stores standard speech patterns that match the operator's utterances. The operator stores these standard speech patterns in advance.

[0007]

[Problems to Be Solved by the Invention] Against the background of advances in semiconductor technology, the fields of speech recognition and speech synthesis are expected to develop further, driven by the demand that systems provide a more human-like user interface. Small information systems that use the conventional speech recognition technology described above are also expected to spread further: car navigation systems, portable information devices typified by PDAs, portable translators, and information appliances with a voice interface.

[0008] However, because speech recognition must process an enormous amount of information, the conventional technology needs to limit the number of words to be recognized in order not to degrade the recognition rate or the recognition response time. To that end, for each word or sentence registered in advance, the statistical features of speakers' speech for that character string are compared with the features of the speech actually uttered by the speaker, and the probabilistically closest candidate is taken as the recognition result.

[0009] Technical innovation in speech recognition and improvements in the performance of the software and hardware that implement it can be expected to improve the recognition rate and the recognition response time. To provide an even more human-like user interface, however, it is not enough to recognize the character strings of pre-registered words and sentences as conventional speech recognition does. If the speaker's emotion and intention could also be recognized, usability could be expected to improve even with a limited recognition vocabulary. Conventional speech recognition systems, however, only match speech against the character strings of pre-registered words and sentences and output the string closest to the input speech as the recognition result; they cannot recognize the emotion or intention of the speaker.

[0010] The present invention has been made to solve the above problems so that a system can have a user interface that is at least somewhat more human-like.

[0011] An object of the present invention is to provide, for speech recognition systems used in small information systems, a speech emotion recognition system and method that recognizes the character string of a word or sentence registered in the dictionary from the input speech and also recognizes the speaker's emotion and intention carried by that speech.

[0012] Another object of the present invention is to realize a good speech interface between humans and systems by converting the speaker's emotion and intention carried by the input speech into a number or a modifier that expresses the degree of the emotion, in a speech recognition system used in a small information system.

[0013]

[Means for Solving the Problems] To achieve the above objects, the speech emotion recognition system and method of the present invention apply to a speech recognition system in which the words and sentences to be recognized are collected and defined as a dictionary, and in which those words and sentences are picked up from the dictionary unit as the recognition result and output by character-string display or speech synthesis. The system comprises a speech analysis unit that performs speech analysis on the captured speech, an acoustic model unit that holds speech patterns in phoneme units, an utterance-deformation emotion model unit that represents the deformation of the phoneme spectrum caused by emotion, and a speech recognition unit that performs speech recognition on the speech analysis result by linking the acoustic model unit, the utterance-deformation emotion model unit, and the dictionary unit. From the features of the speech, it outputs the target word or sentence as the recognition result together with the degree of the speaker's emotion carried by the speech.

[0014] In a more detailed embodiment, the level indicating the degree of the speaker's emotion carried by the speech is a number from 0 to N (where N is an integer).

[0015] In the speech emotion recognition system and method of the present invention, the degree of the speaker's emotion may also be expressed as follows. The dictionary unit has a dictionary that collects the words and sentences to be recognized and a dictionary that collects modifiers expressing the level of emotion for those words and sentences. As the recognition result, the word or sentence is picked up together with the modifier that expresses the emotion level, the modifier is attached to the word or sentence, and the result is output as text or synthesized speech.
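The modifier lookup described above can be sketched in a few lines of Python. This is only an illustration: the modifier table, the choice N = 3, and the function name `decorate` are assumptions and do not come from the patent.

```python
# Hypothetical modifier dictionary: index = emotion level (0..N), here N = 3.
MODIFIERS = ["", "a little", "very", "extremely"]


def decorate(word: str, level: int) -> str:
    """Attach the modifier for `level` to the recognized `word`."""
    if not 0 <= level < len(MODIFIERS):
        raise ValueError(f"emotion level must be in 0..{len(MODIFIERS) - 1}")
    modifier = MODIFIERS[level]
    # Level 0 has an empty modifier, so strip the leading space.
    return f"{modifier} {word}".strip()


print(decorate("hot", 0))
print(decorate("hot", 2))
```

Keeping the modifiers in a separate table mirrors the two-dictionary arrangement in the text: the word dictionary and the modifier dictionary can be swapped independently.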

[0016] In a further detailed embodiment, in a speech recognition system in which the words and sentences to be recognized are collected and defined as a dictionary and are picked up as the recognition result and output by character-string display or speech synthesis, the system comprises a speech analysis unit that performs speech analysis on the captured speech, an acoustic model unit that holds speech patterns in phoneme units, and a speech recognition unit that performs speech recognition on the speech analysis result by linking the acoustic model unit and the dictionary unit. Here, the speech analysis unit can output the degree of emotion from the power features of the speech, that is, from the loudness in which the degree of emotion appears.
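As a rough sketch of the power-based variant, the following Python computes the mean power of one frame and quantizes it to an integer level 0..N. The patent does not specify how power maps to a level; the decibel scale, the thresholds `floor_db` and `ceil_db`, and the linear quantization are all assumptions made for illustration.

```python
import math


def frame_power_db(samples):
    """Mean power of one frame in dB, for samples scaled to [-1.0, 1.0]."""
    mean_square = sum(s * s for s in samples) / len(samples)
    # Clamp to avoid log10(0) on silent frames.
    return 10.0 * math.log10(max(mean_square, 1e-12))


def power_to_level(power_db, floor_db=-40.0, ceil_db=0.0, n_levels=3):
    """Linearly quantize power into an integer emotion level 0..n_levels."""
    ratio = (power_db - floor_db) / (ceil_db - floor_db)
    ratio = min(max(ratio, 0.0), 1.0)  # clip to the assumed range
    return round(ratio * n_levels)


loud = [1.0, -1.0, 1.0, -1.0]
print(power_to_level(frame_power_db(loud)))
```

In practice the thresholds would have to be calibrated per microphone and speaker, since absolute loudness alone is not a reliable indicator of emotion.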

[0017]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to FIGS. 1 to 6.

[0018] FIG. 1 is a block diagram showing the functions of the speech and emotion recognition system according to the present invention and its processing flow.

[0019] To perform speech and emotion recognition, speech is captured by the microphone 101 shown in FIG. 1. The captured speech, an analog signal, is converted from analog data to digital data by the A/D converter 102 at an arbitrarily chosen sampling period. The speech analysis unit 103 applies preprocessing such as noise processing, speech analysis, and speaker adaptation to the digitized speech, and the speech emotion recognition unit 104 then performs speech and emotion recognition. Speech and emotion recognition here consists of two processes.

[0020] The first process analyzes the speech signal as phonemes at short time intervals, analyzes the resulting pattern, and selects the corresponding word or sentence from the dictionary.

[0021] The second process analyzes the speech signal as phonemes at short time intervals (5 to 20 ms), analyzes the resulting pattern, and selects, for each word or sentence, a level indicating the degree of emotion in the speech uttered by the speaker.
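The short-time analysis mentioned above starts by cutting the digitized signal into frames of 5 to 20 ms. A minimal Python sketch follows; the 8 kHz sample rate, the 10 ms default, and the non-overlapping framing are assumptions, since the text only says that the sampling period is chosen arbitrarily.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=10):
    """Split samples into consecutive, non-overlapping frames of frame_ms."""
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    # Drop the trailing partial frame for simplicity.
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


signal = list(range(1000))            # stand-in for digitized speech
frames = split_into_frames(signal, sample_rate_hz=8000, frame_ms=10)
print(len(frames), len(frames[0]))
```

Each frame would then be passed to feature extraction; real front ends usually overlap adjacent frames and apply a window function, which this sketch omits.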

[0022] From these two processes, the speech recognition result and the speech emotion level 109 are generated as the output of the speech emotion recognition system.

[0023] The speech emotion recognition unit 104 matches the speech analysis result of the input speech, produced by the speech analysis unit 103, in phoneme units against the acoustic model unit 105, the utterance-deformation emotion model unit 106, and the word dictionary unit 107 as linked by the model linking unit 108, and picks up the closest word among those registered in the word dictionary unit 107. It also selects the level indicating the degree of emotion carried by the input speech for the picked-up word. In the embodiment shown in FIG. 1, when the power is turned on, the phoneme-unit words linked by the model linking unit 108 and the levels indicating the degree of emotion are stored in the speech emotion recognition unit 104 so that they can be matched immediately against the speech analysis result from the speech analysis unit 103.

[0024] The acoustic model unit 105 is the model used for speech recognition. Specifically, it stores the correspondence between the characters used in the word dictionary unit 107 and phonemes, the distribution of the probability that a phoneme feature appears, and the distribution of the probability that an observed phoneme feature transitions to a state in which a given next feature appears. Consider first the distribution of the probability that a phoneme feature appears. The speech spectrum of the "a" in the utterance "atsui" differs from person to person. If, for the phoneme "a", the speech spectrum is plotted on the horizontal axis and the probability that the phoneme appears on the vertical axis, the probability that the sound is recognized as "a" varies with the speech spectrum. Consider next the distribution of the probability that an observed phoneme feature transitions to a state in which a given next feature appears. The phoneme "a" may be followed by "tsu" as in "atsui", by "sa" as in "asai", or by "ma" as in "amai". The probability of which phoneme follows "a" differs from phoneme to phoneme. In other words, the probability that the feature of one phoneme changes to the feature of a particular next phoneme varies, and this is the distribution in question.
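The transition distribution described above can be pictured as a table indexed by the current and the next phoneme. The Python sketch below uses the three continuations named in the text ("tsu", "sa", "ma"); the probability values and the helper names are hypothetical.

```python
# Hypothetical transition probabilities out of the phoneme "a".
TRANSITIONS = {
    "a": {"tsu": 0.5, "sa": 0.3, "ma": 0.2},
}


def transition_prob(prev: str, nxt: str) -> float:
    """Probability that phoneme `prev` is followed by `nxt` (0.0 if unseen)."""
    return TRANSITIONS.get(prev, {}).get(nxt, 0.0)


def is_distribution(row: dict, tol: float = 1e-9) -> bool:
    """Check that the outgoing probabilities of one phoneme sum to 1."""
    return abs(sum(row.values()) - 1.0) < tol


print(transition_prob("a", "tsu"), is_distribution(TRANSITIONS["a"]))
```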

[0025] For the acoustic model unit 105, so-called speaker-independent operation, in which anyone's voice can be recognized without registering it in advance, has become common. A hidden Markov model (HMM), for example, can be used as such an acoustic model.

[0026] The utterance-deformation emotion model unit 106 focuses on the deformation of the phoneme spectrum caused by changes in emotion and holds the correspondence between the characters used in the word dictionary unit 107 and phonemes when the emotion has changed. That is, the probability distributions of phonemes change when emotion is put into the voice, and the unit stores, for that case, the distribution of the probability that a phoneme feature appears and the distribution of the probability that an observed phoneme feature transitions to a state in which a given next feature appears. The latter distribution does not change when, for example, the word "atsui" (hot) is still pronounced "atsui" with emotion, but it does change when the pronunciation shifts to "achī" or "achii". A hidden Markov model (HMM), for example, can be used for the utterance-deformation emotion model unit 106.

[0027] The word dictionary unit 107 is a collection of words (nouns, verbs, and so on) and sentences. In a car navigation system, for example, it is a set of street names, place names, building names, town names, street addresses, intersection names, private residences (personal names), telephone numbers, and the like, together with the words needed for a bare minimum of conversation. In the speech emotion recognition system, however, it is in particular a set of words that express emotion or in which emotion appears. More specifically, these are words uttered by the speaker such as "hot" (weather), "cold" (weather), "hot" (to the touch), "cold" (to the touch), "quickly", "slowly", "big", "small", "red", "white", "higher", "lower", "run", "go forward", "go back", "turn", and "fly". Words that are not emotional expressions, such as nouns, are also included. Depending on the capability of the system, one dictionary of the word dictionary unit 107 consists of, for example, 10 to 5,000 words.

[0028] In summary, speech emotion recognition means analyzing the speech as phonemes at short time intervals, analyzing the resulting pattern, selecting the corresponding word or sentence from the dictionary, and also selecting, for each word or sentence, a level indicating the degree of emotion in the speech uttered by the speaker.

[0029] The processing blocks shown in FIG. 1 may be a system composed of multiple LSIs and memories, or one or more systems-on-chip built on a semiconductor device. Each process may be hardware implemented in a dedicated LSI or dedicated IC, or middleware implemented in software on a DSP, RISC microcomputer, or the like.

[0030] FIG. 2 is an example of modeling Japanese phonemes with a hidden Markov model (HMM).

[0031] Reference numerals 201, 202, and 203 denote states of the phoneme distribution. The speech uttered by the speaker is the word for "hot", written "atsui" in one example of phonetic notation. To simplify the explanation, in FIG. 2(a), "a" corresponds to state 201, "tsu" to state 202, and "i" to state 203. In actual speech recognition, the states are subdivided further. Speech is a non-stationary signal and conveys linguistic information through moment-to-moment changes in its spectral properties, with the spectrum of "a" at one moment and the spectrum of "tsu" at another. This non-stationary signal can be regarded as a sequence of stationary segments with different properties. Each of these stationary segments corresponds to one state of the HMM state transition network shown as 201 to 203. The speech spectrum is observed as the output of these states, that is, of a non-stationary signal source. The observed value may be the LPC analysis result of the speech signal for each short-time frame or a vector-quantized code. The HMM thus consists of the state transition probability 207 between state 201 and state 202 and the probability 201 that a given speech spectrum is output from state 201. The probability 201 indicates which probability value of the phoneme distribution of "a" is output. That is, 201 to 203 are phoneme distributions that give the probability that a given speech spectrum is output from each state. Reference numerals 204 to 209 give the probabilities of which state each state transitions to next. Of these, 204 to 206 indicate that, when a sound is held for a long time, for example, speech analysis over a given time interval returns to the same sound again.
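The state network described above, with a self-loop on each state and a transition to the next state, is commonly called a left-to-right HMM. The Python sketch below builds its transition matrix. The probability values are hypothetical, and reading numerals 204 to 206 as the self-loops and 207 to 209 as the forward transitions (with the last one leaving the word) is an interpretation of the text rather than something the figure is known to state.

```python
STATES = ["a", "tsu", "i"]  # states 201, 202, 203 in FIG. 2


def left_to_right_matrix(self_loop):
    """Build transition rows; the extra last column is an exit state."""
    n = len(self_loop)
    matrix = [[0.0] * (n + 1) for _ in range(n)]
    for k, p_stay in enumerate(self_loop):
        matrix[k][k] = p_stay            # stay in the same state (self-loop)
        matrix[k][k + 1] = 1.0 - p_stay  # advance to the next state or exit
    return matrix


A = left_to_right_matrix([0.6, 0.5, 0.7])
for name, row in zip(STATES, A):
    print(name, row)
```

A large self-loop probability models a sound that is held for many frames, which matches the remark about long sounds returning to the same state.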

[0032] Next, an example of an HMM used for speech recognition is described.

[0033] Curve 211 in FIG. 2(a) expresses, as a continuous distribution, the probability that a given speech spectrum is output from state 201. Here, the speech spectrum is taken to be the n-th feature parameter when the speech feature parameters are i-dimensional. That is, there are many ways of representing the features of speech; if there are i such representations, this means the feature parameter of the n-th one. The horizontal axis is the speech spectrum output from state 201, and the vertical axis is its probability value. The distribution is a continuous distribution with mean μ_a and variance σ_a. Similarly, curve 212 in FIG. 2(b) expresses, as a continuous distribution, the probability that a given speech spectrum is output from state 202. The horizontal axis is the speech spectrum output from state 202, and the vertical axis is its probability value. The distribution is a continuous distribution with mean μ_tsu and variance σ_tsu.

[0034] Curve 213 in FIG. 2(c) expresses, as a continuous distribution, the probability that a given speech spectrum is output from state 203. The horizontal axis is the speech spectrum output from state 203, and the vertical axis is its probability value. The distribution is a continuous distribution with mean μ_i and variance σ_i.
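The text characterizes each output distribution only by a mean and a variance. One common choice with exactly those two parameters is the normal (Gaussian) density, and the Python sketch below evaluates it for a single feature value. Treating curves 211 to 213 as Gaussians is an assumption, and the numeric parameters are hypothetical.

```python
import math


def gaussian_pdf(x: float, mean: float, variance: float) -> float:
    """Density of a one-dimensional normal distribution at x."""
    if variance <= 0.0:
        raise ValueError("variance must be positive")
    norm = math.sqrt(2.0 * math.pi * variance)
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / norm


# Hypothetical parameters for state 201 ("a"): mean 1.0, variance 0.5.
print(gaussian_pdf(1.0, mean=1.0, variance=0.5))
print(gaussian_pdf(2.0, mean=1.0, variance=0.5))
```

The value is highest at the mean and falls off on both sides, which is the shape the description of the horizontal and vertical axes suggests.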

[0035] Suppose now that the speaker says "atsui" and that "atsui" is registered in the word dictionary unit 107 as a recognition target word. Speech analysis is performed on the sound "a" and the speech features are output. If, for example, the n-th feature parameter of the i-dimensional speech feature parameters is used, the feature f_n1 of "a" is output. For the dictionary entry "a", the probability that the feature f_n1 appears is computed from the continuous distribution curve 211, giving the probability value p_n1. In the same way, speech analysis is performed on the sounds "tsu" and "i" and their features are output. For the dictionary entry "tsu", the probability that the feature f_n2 appears is computed from the continuous distribution curve 212, giving p_n2, and for the dictionary entry "i", the probability that the feature f_n3 appears is computed from the continuous distribution curve 213, giving p_n3. The same kind of processing is applied to the state transition probability from phoneme distribution state 201 to phoneme distribution state 202, and the state transition destination is determined to be 207. Finally, for the registered dictionary word "atsui", the probability value for the spoken input "atsui" is P_atsui = p_n1 + p_n2 + p_n3. This series of computations is carried out for every word registered in the dictionary, and the word with the highest probability value becomes the recognition result. This completes the sequence of speech recognition processing.
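The scoring loop above can be sketched in Python as follows. The patent writes the word score as a sum of per-phoneme values; this sketch keeps the sum and uses log-densities of a Gaussian as the summands, which is an assumption (adding log-probabilities is a common way for such a sum to arise). It also assumes one feature value per phoneme and ignores transition probabilities and time alignment. All model parameters and the second dictionary word are hypothetical.

```python
import math

# Hypothetical per-phoneme (mean, variance) for a single feature parameter.
PHONEME_MODELS = {
    "a": (1.0, 0.5), "tsu": (3.0, 0.5), "i": (5.0, 0.5),
    "sa": (2.0, 0.5), "mu": (4.0, 0.5),
}
# Hypothetical word dictionary: word -> phoneme sequence.
DICTIONARY = {"atsui": ["a", "tsu", "i"], "samui": ["sa", "mu", "i"]}


def log_gaussian(x, mean, variance):
    """Log-density of a one-dimensional normal distribution at x."""
    return -0.5 * math.log(2.0 * math.pi * variance) - (x - mean) ** 2 / (2.0 * variance)


def score_word(features, phonemes):
    """Sum of per-phoneme log-densities, one feature value per phoneme."""
    if len(features) != len(phonemes):
        return float("-inf")
    return sum(log_gaussian(f, *PHONEME_MODELS[p]) for f, p in zip(features, phonemes))


def recognize(features):
    """Return the dictionary word with the highest score."""
    return max(DICTIONARY, key=lambda w: score_word(features, DICTIONARY[w]))


print(recognize([1.1, 2.9, 5.2]))
```

A full recognizer would search over alignments of frames to states (for example with the Viterbi algorithm) instead of assuming one observation per phoneme.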

[0036] Next, an example of an HMM that uses the utterance-deformation emotion model in speech emotion recognition is described with reference to FIGS. 2(e) to 2(g).

[0037] The utterance-deformation emotion model unit 106 focuses on the deformation of the phoneme spectrum caused by changes in emotion and holds the correspondence between the characters of the words stored in the word dictionary unit 107 and phonemes when the emotion has changed. It stores the distribution of the probability that a phoneme feature appears and the distribution of the probability that an observed phoneme feature transitions to a state in which a given next feature appears.

【００３８】曲線２１１は前に説明したように、状態２
０１から出力される音声のスペクトルが出力される確率
を連続分布で表現したものである。ここで、音声のスペ
クトルは、音声の特徴パラメータをｉ次元としたときの
ｎ番目の特徴パラメータとする。横軸は、状態２０１か
ら出力される音声のスペクトルであり、縦軸は、その確
率値である。この分布は、平均μ＿ａ、分散σ＿ａをも
つ連続分布である。このとき、音声の特徴パラメータを
ｉ次元としたときのｎ番目の特徴パラメータにおいて、
感情の変化による音韻スペクトルの変形が顕著に現われ
たとする。そこで、話者が通常の感情で発声したときの
音声スペクトルの連続分布曲線を２１１とし、話者が感
情をこめて発声したとき、すなわち、感情の変化により
変形した時の音声スペクトルの連続分布曲線を２１４と
する。よって、従来の音声認識に用いられてきた音響モ
デルのHMMに加えて、感情の変化により音韻スペクトル
の変形が現われる特徴パラメータだけで構成した確率分
布を音声感情認識モデルのHMMとして用意する。曲線２
１４は、音声感情認識モデルにおいて、状態２０１から
出力される感情の変化による音声のスペクトルが出力さ
れる確率を連続分布で表現したものである。横軸は、状
態２０１から出力される音声のスペクトルであり、縦軸
は、その確率値である。この分布は、平均μ＿ａ＿ｓ、
分散σ＿ａ＿ｓをもつ連続分布である。ここで、認識対
象単語として登録された「ａｔｓｕｉ」の単語辞書にお
いて、実際に、話者が「あつい」と感情をこめて音声を
入力する。「あ」の音声に対して、音声分析が行われ、
音声の特徴が出力される。例えば、音声の特徴パラメー
タをｉ次元としたときに、ｎ番目の音声の特徴パラメー
タを採用したとすると、「あ」の特徴ｆ＿ｎ１＿ｅが出
力される。このとき、単語辞書「ａ」において、特徴ｆ
＿ｎ１＿ｅの出現する確率が連続分布曲線２１４から計
算され、確率値ｐ＿ｎ１＿ｅが出力される。ここで、こ
の連続分布曲線２１４に関して、確率値ｐ＿ｎ１＿ｅ
は、話者が通常の発声をしたときの特徴ｆ＿ｎ１での確
率値ｐ＿ｎ１より高い値をとる。As described earlier, curve 211 represents, as a continuous distribution, the probability that a given speech spectrum is output from state 201. Here, the speech spectrum is taken to be the n-th feature parameter of an i-dimensional speech feature vector. The horizontal axis is the speech spectrum output from state 201, and the vertical axis is its probability value. This distribution is a continuous distribution with mean μ_a and variance σ_a. Suppose that the deformation of the phoneme spectrum caused by a change in emotion appears markedly in this n-th feature parameter. Let 211 be the continuous distribution curve of the speech spectrum when the speaker utters with normal emotion, and let 214 be the continuous distribution curve when the speaker utters with emotion, that is, when the spectrum has been deformed by the change in emotion. Thus, in addition to the HMM of the acoustic model used in conventional speech recognition, a probability distribution composed only of the feature parameters in which the emotion-induced deformation of the phoneme spectrum appears is prepared as the HMM of the speech emotion recognition model. Curve 214 represents, as a continuous distribution, the probability that an emotion-deformed speech spectrum is output from state 201 in the speech emotion recognition model. The horizontal axis is the speech spectrum output from state 201, and the vertical axis is its probability value. This distribution is a continuous distribution with mean μ_a_s and variance σ_a_s. Now suppose that, with “atsui” registered in the word dictionary as a recognition target word, the speaker actually utters “atsui” (“hot”) with emotion. Speech analysis is performed on the sound “a”, and the speech features are output. For example, if the n-th feature parameter of the i-dimensional speech feature vector is used, the feature f_n1_e of “a” is output. Then, for the dictionary entry “a”, the probability of occurrence of the feature f_n1_e is calculated from the continuous distribution curve 214, and the probability value p_n1_e is output. With respect to the continuous distribution curve 214, the probability value p_n1_e is higher than the probability value p_n1 obtained for the feature f_n1 when the speaker utters normally.
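The relationship between the normal-utterance curve 211 and the emotional-utterance curve 214 can be illustrated with a small numeric sketch. The Gaussian shape, the parameter values, and the feature value below are assumptions chosen for illustration; the text only states that each curve is a continuous distribution with its own mean and variance. The sketch scores one emotionally shifted feature under both curves, which is a simplified reading of the comparison in the text.

```python
import math

def gaussian_pdf(x, mean, var):
    """Value of a 1-D Gaussian density with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Hypothetical parameters for the n-th feature parameter of state "a" (state 201).
MU_A, VAR_A = 1.0, 0.5      # curve 211: normal utterance (mu_a, sigma_a)
MU_A_S, VAR_A_S = 2.0, 0.5  # curve 214: emotional utterance (mu_a_s, sigma_a_s)

f_n1_e = 1.9  # feature of an emotionally uttered "a" (illustrative value)

p_under_211 = gaussian_pdf(f_n1_e, MU_A, VAR_A)   # scored by the normal model
p_n1_e = gaussian_pdf(f_n1_e, MU_A_S, VAR_A_S)    # scored by the emotion model

# The emotion model assigns the deformed feature a higher probability value.
print(p_n1_e > p_under_211)
```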

【００３９】また、曲線２２２は、状態２０２から出力
される音声のスペクトルが出力される確率を連続分布で
表現したものである。横軸は、状態２０２から出力され
る音声のスペクトルであり、縦軸は、その確率値であ
る。この分布は、平均μ＿ｔｓｕ＿ｓ、分散σ＿ｔｓｕ
＿ｓをもつ連続分布である。曲線２２３は、状態２０３
から出力される音声のスペクトルが出力される確率を連
続分布で表現したものである。横軸は、状態２０３から
出力される音声のスペクトルであり、縦軸は、その確率
値である。この分布は、平均μ＿ｉ＿ｓ、分散σ＿ｉ＿
ｓをもつ連続分布である。Curve 222 represents, as a continuous distribution, the probability that a given speech spectrum is output from state 202. The horizontal axis is the speech spectrum output from state 202, and the vertical axis is its probability value. This distribution is a continuous distribution with mean μ_tsu_s and variance σ_tsu_s. Curve 223 represents, as a continuous distribution, the probability that a given speech spectrum is output from state 203. The horizontal axis is the speech spectrum output from state 203, and the vertical axis is its probability value. This distribution is a continuous distribution with mean μ_i_s and variance σ_i_s.

【００４０】曲線２１４の場合と同様に、「つ」「い」
の音声に対して、音声分析が行われ、音声の特徴が出力
される。それぞれ、単語辞書「ｔｓｕ」において、特徴
ｆ＿ｎ２＿ｅの出現する確率が連続分布曲線２２２から
計算され、確率値ｐ＿ｎ２＿ｅが出力される。また、単
語辞書「ｉ」において、特徴ｆ＿ｎ３＿ｅの出現する確
率が連続分布曲線２２３から計算され、確率値ｐ＿ｎ３
＿ｅが出力される。さらに、状態と状態の間の状態遷移
確率においても同様の処理がおこなわれ、状態遷移先を
決定している。最終的に、登録された単語辞書「ａｔｓ
ｕｉ」に対して、感情をこめて音声入力された「あつ
い」の出現する確率値はＰ＿ａｔｓｕｉ＝ｐ＿ｎ１＿ｅ
＋ｐ＿ｎ２＿ｅ＋ｐ＿ｎ３＿ｅとなる。この一連の処理
を、登録された単語辞書全てにおいて計算し、計算され
た確率値の範囲によって感情のレベルを出力する。以上
が音声の感情レベルを認識する一連の処理である。As in the case of curve 214, speech analysis is performed on the sounds “tsu” and “i”, and the speech features are output. For the dictionary entry “tsu”, the probability of occurrence of the feature f_n2_e is calculated from the continuous distribution curve 222, and the probability value p_n2_e is output. For the dictionary entry “i”, the probability of occurrence of the feature f_n3_e is calculated from the continuous distribution curve 223, and the probability value p_n3_e is output. Similar processing is performed for the state transition probabilities between states to determine the state transition destination. Finally, for the registered dictionary word “atsui”, the probability value of “atsui” uttered with emotion is P_atsui = p_n1_e + p_n2_e + p_n3_e. This series of processing is carried out for every word registered in the dictionary, and the emotion level is output according to the range in which the calculated probability value falls. The above is the series of processing for recognizing the emotion level of speech.
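The final step, outputting an emotion level according to the range of the calculated probability value, might look like the sketch below. The text does not give the ranges, so the boundary values, the function name, and the example probability values are all hypothetical.

```python
def emotion_level_from_score(score, boundaries=(0.5, 1.0, 1.5)):
    """Map a summed probability value to a level.

    Returns 0 below the first boundary and increases by one for every
    boundary that the score reaches (boundaries are illustrative only).
    """
    return sum(score >= b for b in boundaries)

# P_atsui = p_n1_e + p_n2_e + p_n3_e, with illustrative per-phoneme values.
p_atsui = 0.55 + 0.50 + 0.52
print(emotion_level_from_score(p_atsui))
```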

【００４１】図３は、本発明に係る他の音声および感情
認識システムの各機能とその処理の流れを示すブロック
図である。FIG. 3 is a block diagram showing functions of another voice and emotion recognition system according to the present invention and a flow of processing thereof.

【００４２】音声および感情認識をおこなうために、図
３に於いては、マイク３０１から音声が取り込まれる。
取り込まれた音声であるアナログ信号は、アナログ信号
をデジタル信号に変換するＡ／Ｄ変換器３０２によっ
て、任意に決められたサンプリング周期により、アナロ
グデータからデジタルデータに変換される。変換された
音声のデジタルデータは、音声分析部３０３によって、
雑音処理や音声分析や話者適応などの前処理がなされる
と共に、音声分析部３０３に含まれている音声パワー分
析部３０３ａで音声パワーを分析して感情のレベルが出
力される。音声分析部３０３の出力は音声感情認識部３
０４で処理され、音声および感情認識がなされる。ここ
で、音声感情認識部３０４で行われる音声感情認識と
は、２つの処理を実行する。In FIG. 3, voice is captured by a microphone 301 in order to perform speech and emotion recognition. The captured voice, an analog signal, is converted from analog data to digital data at an arbitrarily chosen sampling period by an A/D converter 302. The converted digital voice data undergoes preprocessing such as noise processing, speech analysis, and speaker adaptation in the voice analysis unit 303; in addition, the voice power analysis unit 303a included in the voice analysis unit 303 analyzes the voice power and outputs the emotion level. The output of the voice analysis unit 303 is processed by the voice emotion recognition unit 304, which performs speech and emotion recognition. The voice emotion recognition performed by the voice emotion recognition unit 304 consists of two processes.

【００４３】第１の処理は、音声信号を解析して、それ
を短い時間ごとの音素として分析して、そのパターンを
解析し、該当する単語や文章を辞書から選択することで
ある。The first process is to analyze a voice signal, analyze it as a phoneme for each short time, analyze its pattern, and select a corresponding word or sentence from a dictionary.

【００４４】第２の処理は、音声信号を解析して、それ
を短い時間（５〜２０ｍｓ）ごとの音素として分析し
て、そのパターンを解析し、話者が発声した音声の感情
の度合いを示すレベルを単語や文章ごとに選択すること
である。The second process analyzes the voice signal as phonemes in short time units (5 to 20 ms), analyzes the resulting pattern, and selects, for each word or sentence, a level indicating the degree of emotion in the voice uttered by the speaker.

【００４５】以上の２つの処理から、音声感情認識シス
テムの出力として、音声認識結果および音声の感情レベ
ル３０９を生成する。From the above two processes, a speech recognition result and a speech emotion level 309 are generated as outputs of the speech emotion recognition system.

【００４６】音声感情認識部３０４は、音声分析部３０
３で分析された入力音声の音声分析結果に対して、音響
モデル３０５、単語辞書３０７をモデル連結部３０８に
よって連結された音素単位で照合して、単語辞書部３０
７に登録した単語辞書３０７の中で、一番近い単語をピ
ックアップする。さらに、ピックアップされた単語の入
力音声が持っている感情の度合いを示すレベルを選択す
る。The voice emotion recognition unit 304 collates the analysis result of the input voice produced by the voice analysis unit 303 against the acoustic model 305 and the word dictionary 307, which are connected in phoneme units by the model connection unit 308, and picks up the closest word among those in the word dictionary 307 registered in the word dictionary unit 307. It then selects a level indicating the degree of emotion carried by the input voice for the picked-up word.

【００４７】音響モデル部３０５は、音声認識に用いら
れるモデルであり、具体的には、単語辞書部３０７に格
納されている文字と音素との対応であり、音素の特徴が
出現する確率の分布、出現した音素の特徴が次のどの特
徴が現れる状態に遷移するかの確率の分布を記憶したも
のである。音響モデル部３０５は、あらかじめ声を登録
しなくても、誰が話し手でもその声を認識できるいわゆ
る「不特定話者対応」が、一般的になってきている。こ
のような音響モデルとしては、例えば、隠れマルコフモ
デル（ＨＭＭ：Hidden Markov Model）を用いることが
できる。The acoustic model unit 305 is a model used for speech recognition. Specifically, it stores the correspondence between the characters stored in the word dictionary unit 307 and phonemes, the distribution of the probability that each phoneme feature appears, and the distribution of the probability that, after a feature has appeared, the model transitions to the state in which each possible next feature appears. For the acoustic model unit 305, so-called speaker-independent operation, in which the voice of any speaker can be recognized without registering the voice in advance, is becoming common. As such an acoustic model, for example, a Hidden Markov Model (HMM) can be used.

【００４８】単語辞書部３０７は、言葉、単語（名詞、
動詞等）、文章を集めたものである。例えば、カーナビ
ゲーションシステムにおいては、通り名、地名、建造物
名、町名、番地、交差点名、個人住宅（個人名）、電話
番号等や、必要最小限の会話に必要な言葉の集合体であ
る。ただし、音声認識感情システムでは、特に、単語の
中でも、感情を表現する単語、あるいは、感情が現われ
る単語で構成された単語の集合体である。より具体的に
は、話者が発声する「暑い」「寒い」「熱い」「冷た
い」「はやく」「おそく」「大きい」「小さい」「赤
い」「白い」「高く」「低く」「走れ」「進め」「戻
れ」「回れ」「飛べ」等の言葉である。また、名詞等の
感情表現でない単語も含まれる。この単語辞書部３０７
に格納される単語数は、システムの能力に応じて決めら
れるが、一つの辞書あたり、例えば、１０〜５０００語
である。The word dictionary unit 307 is a collection of terms, words (nouns, verbs, and the like), and sentences. In a car navigation system, for example, it is a collection of street names, place names, building names, town names, street addresses, intersection names, private residences (personal names), telephone numbers, and the like, together with the words needed for a minimal conversation. In the voice emotion recognition system, however, it is in particular a collection of words that express emotion or in which emotion appears. More specifically, these are words uttered by the speaker such as “hot (weather)”, “cold (weather)”, “hot (to the touch)”, “cold (to the touch)”, “fast”, “slow”, “large”, “small”, “red”, “white”, “high”, “low”, “run”, “advance”, “return”, “turn”, and “fly”. Words that are not emotional expressions, such as nouns, are also included. The number of words stored in the word dictionary unit 307 is determined according to the capability of the system and is, for example, 10 to 5000 words per dictionary.

【００４９】以上から、音声感情認識システム又は音声
感情認識方法とは、音声信号を解析して、それを短い時
間ごとの音素として分析して、そのパターンを解析し、
該当する単語や文章を辞書から選択するとともに、話者
が発声した音声の感情の度合いを示すレベルを単語や文
章ごとに選択することである。As described above, the voice emotion recognition system or voice emotion recognition method analyzes a voice signal as phonemes in short time units, analyzes the resulting pattern, selects the corresponding word or sentence from the dictionary, and also selects, for each word or sentence, a level indicating the degree of emotion in the voice uttered by the speaker.

【００５０】なお、図３に示す各処理ブロックは、複数
のＬＳＩやメモリで構成されたシステムであっても、半
導体素子上に構成された一つないし複数のシステムオン
チップであってもよい。また、各処理は、専用ＬＳＩや
専用ＩＣで処理するハードウエアであっても、ＤＳＰや
ＲＩＳＣマイコン等のソフトウエアで実現したミドルウ
エアであってもよい。Each processing block shown in FIG. 3 may be a system constituted by a plurality of LSIs and memories, or one or a plurality of system-on-chips constituted on semiconductor elements. Further, each process may be hardware processed by a dedicated LSI or a dedicated IC, or middleware realized by software such as a DSP or a RISC microcomputer.

【００５１】図４（ａ）は、図３で説明した音声感情認
識システムにおいて、話者が発声した音声「あつい」の
音声入力波形を示すもので、横軸は時間を、縦軸は音声
レベルを示す。また、図４（ｂ）は「あつい」の音声の
パワーを示したものであり、横軸に時間を、縦軸に音声
のパワーを示している。FIG. 4 (a) shows a voice input waveform of a voice "hot" uttered by the speaker in the voice emotion recognition system described in FIG. 3, in which the horizontal axis represents time and the vertical axis represents voice level. Is shown. FIG. 4B shows the power of the "hot" sound, in which the horizontal axis represents time and the vertical axis represents the sound power.

【００５２】音声入力波形４０１は、話者が平常の音声
で「あつい」と発声したときの音声波形である。音声信
号は、時々刻々と変化する非定常な信号である。この音
声信号を２０ｍｓの短時間で切り出して見ると、定常信
号と同様なスペクトル音声分析ができる。切り出された
音声信号のサンプル値から、例えば、音声分析で広く用
いられているＬＰＣ分析において、自己相関関数を計算
すると、音声の特徴パラメータの一つとして、音声のパ
ワーが求められる。The voice input waveform 401 is the waveform obtained when the speaker utters “atsui” (“hot”) in a normal voice. A voice signal is a non-stationary signal that changes from moment to moment. When it is cut into short segments of 20 ms, however, the same spectral analysis as for a stationary signal can be applied. From the sample values of a segment, the voice power is obtained as one of the voice feature parameters, for example by calculating the autocorrelation function in LPC analysis, which is widely used in speech analysis.
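A minimal sketch of the short-time power computation follows. It assumes non-overlapping 20 ms frames and takes the power of a frame as the lag-0 autocorrelation divided by the frame length; the normalization, the 8 kHz sampling rate, and the function names are assumptions, since the text only says that the power is obtained from the autocorrelation function.

```python
import math

def autocorrelation(frame, lag):
    """Short-time autocorrelation R(lag) = sum over n of x[n] * x[n + lag]."""
    return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))

def frame_powers(samples, sample_rate_hz, frame_ms=20):
    """Split the signal into non-overlapping frames and return R(0) / N per frame."""
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        powers.append(autocorrelation(frame, 0) / frame_len)
    return powers

# Illustrative input: 40 ms of a 200 Hz sine sampled at 8 kHz,
# with amplitude 0.5 in the first frame and 1.0 in the second.
rate = 8000
signal = [(0.5 if n < 160 else 1.0) * math.sin(2 * math.pi * 200 * n / rate)
          for n in range(320)]
print(frame_powers(signal, rate))
```

Doubling the amplitude quadruples the power, which is why a strongly uttered word produces a clearly higher power curve than a normally uttered one.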

【００５３】音声パワーを示す曲線４０２は、音声波形
４０１の音声信号から計算されたパワーである。時間ｔ
に対するパワーの変化を表わしている。ここで、このパ
ワー情報に対して、しきい値を任意に設定し、入力され
た音声毎にこのしきい値を超えたかどうかを観測する。
この観測は、音声分析部３０３で行う。さらに、複数の
しきい値を設け、入力された音声毎にそれぞれのしきい
値を超えたかどうかを観測する。例えば、音声パワー曲
線４０２の音声の場合は、しきい値ＴＨ１超えている
が、しきい値ＴＨ２は超えていない。すなわち、連続的
にパワーが、しきい値ＴＨ１とＴＨ２の間にある場合に
は、感情のレベルを１と見なし、音声分析部３０３は感
情レベル１を出力する。Curve 402 shows the voice power calculated from the voice signal of the voice waveform 401, that is, the change in power with respect to time t. A threshold is set arbitrarily for this power information, and for each input utterance it is observed whether the threshold is exceeded. This observation is performed by the voice analysis unit 303. Furthermore, a plurality of thresholds are provided, and for each input utterance it is observed whether each threshold is exceeded. For example, the voice of the voice power curve 402 exceeds the threshold TH1 but does not exceed the threshold TH2. When the power stays continuously between the thresholds TH1 and TH2 in this way, the emotion level is regarded as 1, and the voice analysis unit 303 outputs emotion level 1.

【００５４】つぎに、話者が、感情を込めた強い口調の
音声で「あつい」と発声したときの音声波形及び音声パ
ワーをそれぞれ図４（ｃ）及び図４（ｄ）に示す。図４
（ｃ）は横軸に時間を、縦軸に音声レベルを示し、図４
（ｄ）は横軸に時間を、縦軸に音声パワーを示す。図４
（ｃ）に於いて、４０３は音声波形を示す。、図４
（ｄ）に於いて、４０４は音声波形４０３の音声信号か
ら計算された音声のパワーであり、時間ｔに対するパワ
ーの変化を表わしている。例えば、音声パワー４０４の
場合は、しきい値ＴＨ１超えて、さらに、しきい値ＴＨ
２を超えている。すなわち、連続的にパワーが、しきい
値ＴＨ２を超えている場合には、感情のレベルを２と見
なし、音声分析部３０３は感情レベル２を出力する。こ
の例の場合は、感情レベルを２段階に設定したが、しき
い値を増やすことによって、感情レベルをN（Nは整数）
段階に設定できる。Next, FIGS. 4(c) and 4(d) show the voice waveform and the voice power, respectively, when the speaker utters “atsui” (“hot”) in a strong, emotional tone. In FIG. 4(c), the horizontal axis indicates time and the vertical axis indicates voice level; in FIG. 4(d), the horizontal axis indicates time and the vertical axis indicates voice power. In FIG. 4(c), reference numeral 403 denotes the voice waveform. In FIG. 4(d), reference numeral 404 denotes the voice power calculated from the voice signal of the voice waveform 403 and represents the change in power with respect to time t. The voice power 404, for example, exceeds the threshold TH1 and also exceeds the threshold TH2. When the power continuously exceeds the threshold TH2 in this way, the emotion level is regarded as 2, and the voice analysis unit 303 outputs emotion level 2. In this example the emotion level has two levels, but by adding thresholds it can be set to N levels (N is an integer).
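The threshold comparison can be sketched as follows. As a simplifying assumption, the sketch uses the peak of the power curve and counts how many thresholds it exceeds; the requirement that the power stay above a threshold "continuously" is not modeled, and the threshold values and power curves are made up for illustration.

```python
def emotion_level(power_curve, thresholds):
    """Count how many thresholds (TH1 < TH2 < ...) the peak power exceeds.

    With thresholds (TH1, TH2) this yields level 1 when the peak lies
    between TH1 and TH2 and level 2 when it exceeds TH2; N thresholds
    give N levels.
    """
    peak = max(power_curve)
    return sum(peak > th for th in sorted(thresholds))

TH1, TH2 = 0.2, 0.6                       # hypothetical threshold values
normal_power = [0.05, 0.30, 0.40, 0.10]   # stands in for power curve 402
strong_power = [0.10, 0.70, 0.90, 0.30]   # stands in for power curve 404

print(emotion_level(normal_power, (TH1, TH2)))
print(emotion_level(strong_power, (TH1, TH2)))
```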

【００５５】また、音声分析部３０３からは、感情レベ
ルとともに、音声認識の為の音声の特徴パラメータが時
々刻々と音声感情認識部３０４にに入力され、最終的
に、音声感情認識部３０４からは音声認識結果「ａｔｕ
ｓｉ」（＝あつい）と感情レベルNを示すデータ３０９
が出力される。The voice analysis unit 303 supplies the voice emotion recognition unit 304, moment by moment, with the voice feature parameters for speech recognition together with the emotion level. Finally, the voice emotion recognition unit 304 outputs data 309 indicating the speech recognition result “atusi” (= “atsui”, “hot”) and the emotion level N.

【００５６】次に、図５を用いて本発明に係る音声認識
システムのハードウエア構成について説明する。Next, the hardware configuration of the speech recognition system according to the present invention will be described with reference to FIG.

【００５７】音声を取り込むためのマイク５０１は、カ
ーナビゲーションシステム、携帯型情報端末、ＰＤＡ、
ハンドヘルドＰＣ、ゲーム、携帯型翻訳機、並びに、エ
アコン等の家庭電化製品等では、周囲の雑音を取り込ま
ないために指向性をもたせた指向性マイクである。５０
４は、マイク５０１により取り込まれたアナログ音声デ
ータを、デジタル音声データに変換するＡ／Ｄ変換器で
ある。The microphone 501 for capturing voice is, in a car navigation system, portable information terminal, PDA, handheld PC, game machine, portable translator, or home appliance such as an air conditioner, a directional microphone, so that ambient noise is not picked up. Reference numeral 504 denotes an A/D converter that converts the analog voice data captured by the microphone 501 into digital voice data.

【００５８】音声入力用ボタン５０２は、音声を入力し
ている区間を指定するためのボタンである。ボタンが押
されている間、あるいは、ボタンが押された時点から音
声が入力されたことをシステムに知らせる。５０５は、
音声入力用ボタン５０２と、システムを接続するための
インタフェースである。The voice input button 502 is a button for designating the interval during which voice is input. It notifies the system that voice is being input while the button is held down, or from the moment the button is pressed. Reference numeral 505 denotes an interface for connecting the voice input button 502 to the system.

【００５９】キー入力用デバイス５０９は、例えば、携
帯型情報端末であれば、ペン入力用のデジタイザであ
り、ハンドヘルドＰＣであれば、キーボードである。ま
た、ファミコンなどのゲーム機であれば、キャラクタ等
を操作するキーパッドや、ジョイスティックである。５
１０は、キー入力用デバイス５０９と、システムを接続
するためのインタフェースである。The key input device 509 is, for example, a digitizer for pen input in a portable information terminal, or a keyboard in a handheld PC. In a game machine such as the Famicom, it is a keypad or joystick for operating characters and the like. Reference numeral 510 denotes an interface for connecting the key input device 509 to the system.

【００６０】ＣＰＵ５０３は、カーナビゲーションシス
テム、携帯型情報端末、ＰＤＡ、ハンドヘルドＰＣ、ゲ
ーム、携帯型翻訳機、並びに、家庭電化製品等のメイン
システムの制御と、音声感情認識システムにおける音声
認識および感情認識処理を行う。図３に示す本発明の音
声感情認識システムの音声分析部３０３、音声感情認識
部３０４及びモデル連結部３０８はこのＣＰＵ５０３に
設けられる。このＣＰＵ５０３には、ＲＩＳＣマイコン
やＤＳＰが用いられるのが、最近の潮流である。The CPU 503 controls the main system of the car navigation system, portable information terminal, PDA, handheld PC, game machine, portable translator, home appliance, or the like, and performs the speech recognition and emotion recognition processing of the voice emotion recognition system. The voice analysis unit 303, the voice emotion recognition unit 304, and the model connection unit 308 of the voice emotion recognition system of the present invention shown in FIG. 3 are provided in this CPU 503. The recent trend is to use a RISC microcomputer or a DSP as the CPU 503.

【００６１】ＲＯＭ５０６は、音声認識用単語辞書、音
響モデル、発声変形感情モデル、プログラムを格納して
おく記憶装置である。また、複数の辞書や、音響モデ
ル、発声変形感情モデルを格納しておくために、メモリ
カードを用いてもよい。The ROM 506 is a storage device for storing a word dictionary for speech recognition, an acoustic model, an utterance deformation emotion model, and a program. Further, a memory card may be used to store a plurality of dictionaries, acoustic models, and utterance deformation emotion models.

【００６２】ＲＡＭ５０７は、ＲＯＭ５０６から転送さ
れた一部の辞書や、音響モデル、プログラムが格納さ
れ、また、音声感情認識処理に必要な必要最小限のワー
クメモリであり、ＲＯＭ５０６に比べて、通常アクセス
時間の短い半導体素子が用いられる。また、ここにはＣ
ＰＵ５０３から音声認識結果及び感情のレベルを示すデ
ータ３０９が入力される。The RAM 507 stores the part of the dictionaries, acoustic models, and programs transferred from the ROM 506, and serves as the minimum work memory needed for the voice emotion recognition processing. It normally uses semiconductor elements with a shorter access time than the ROM 506. The data 309 indicating the speech recognition result and the emotion level is also input to it from the CPU 503.

【００６３】バス５０８は、システムにおけるデータバ
ス、アドレスバス、制御信号バスとして用いられる。The bus 508 is used as a data bus, an address bus, and a control signal bus in the system.

【００６４】音声感情認識結果を出力表示するためのデ
ィスプレイ５１２は、ＴＦＴ液晶ディスプレイ等のＬＣ
Ｄで構成し、音声認識結果および音声の感情レベルを表
示する。５１１は、ディスプレイ５１２と、システムを
接続するためのインタフェースである。The display 512 for outputting the voice emotion recognition result is an LCD such as a TFT liquid crystal display, and displays the speech recognition result and the emotion level of the voice. Reference numeral 511 denotes an interface for connecting the display 512 to the system.

【００６５】音声感情認識結果を音で出力するためのス
ピーカ５１４は、音声認識結果および音声の感情レベル
を音声合成して出力する。５１３は、音声認識結果およ
び、音声の感情レベルをテキストから音声合成データに
変換処理した後、デジタル音声合成データからアナログ
音声信号に変換するＤ／Ａ変換器である。A speaker 514 for outputting the voice emotion recognition result as sound synthesizes and outputs the voice recognition result and the emotion level of the voice. A D / A converter 513 converts the speech recognition result and the emotion level of the speech from text to speech synthesis data, and then converts the digital speech synthesis data to an analog speech signal.

【００６６】以下、本発明に係る実施形態の一例を、図
６及び図８を用いて説明する。Hereinafter, an example of an embodiment according to the present invention will be described with reference to FIGS. 6 and 8.

【００６７】本実施形態では、本発明の音声感情認識シ
ステムをカーエレクトロニクス製品に適用した場合につ
いて説明する。In the present embodiment, a case where the voice emotion recognition system of the present invention is applied to a car electronics product will be described.

【００６８】図６（ａ）は本発明による音声感情認識シ
ステムをカーエレクトロニクス製品のエアコン操作に利
用した場合のブロック図であり、図６（ｂ）はこの音声
感情認識システムのおける音声入力例とその認識結果を
示す模式図である。FIG. 6A is a block diagram when the voice emotion recognition system according to the present invention is used for operating an air conditioner of a car electronics product. FIG. 6B shows an example of voice input in the voice emotion recognition system. It is a schematic diagram showing the recognition result.

【００６９】図６（ａ）に於いて、６０１は音声入力用
マイク、６０２は音声感情認識システム、６０３は音声
感情認識結果や、話者との双方向のやり取りを行うため
に、会話形式の文字情報を出力するためのディスプレ
イ、６０４は音声感情認識結果や、話者との双方向のや
り取りを行うために、会話形式の文字情報を音声合成し
て出力するためのスピーカである。In FIG. 6(a), reference numeral 601 denotes a voice input microphone; 602, the voice emotion recognition system; 603, a display for outputting the voice emotion recognition result and conversational text for two-way interaction with the speaker; and 604, a loudspeaker for outputting, as synthesized speech, the voice emotion recognition result and conversational text for two-way interaction with the speaker.

【００７０】次に、図６（ｂ）を用いて、話者が発声し
た音声入力例と、音声感情認識システムが出力した認識
結果例を説明する。Next, with reference to FIG. 6B, an example of a voice input uttered by a speaker and an example of a recognition result output by a voice emotion recognition system will be described.

【００７１】６０５は話者が音声感情システム６０２に
対して、普通の発声で、「暑い」と発声した場合の音声
感情システム６０２の認識結果である「暑い＝感情レベ
ル３」を示す。次に、６０６は話者が音声感情システム
に対して、強い調子で、「暑い」と発声した時の音声感
情システム６０２の認識結果であり、「暑い＝感情レベ
ル５」を示す。Reference numeral 605 denotes “hot = emotional level 3” which is a recognition result of the voice emotion system 602 when the speaker utters “hot” with a normal voice to the voice emotion system 602. Next, reference numeral 606 denotes a recognition result of the voice emotion system 602 when the speaker utters "hot" in a strong tone with respect to the voice emotion system, and indicates "hot = emotion level 5".

【００７２】また、６０７は話者が音声感情システムに
対して、普通の発声で、「暑い」と発声したときの音声
感情システム６０２の認識結果である「少し暑いです
か」を示す。次に、６０８は話者が音声感情システムに
対して、強い調子で、「暑い」と発声した時の音声感情
システム６０２の認識結果である「かなり暑いですか」
を示す。Reference numeral 607 denotes “Is it a little hot?”, the response of the voice emotion system 602 when the speaker says “hot” to it in a normal voice. Reference numeral 608 denotes “Is it quite hot?”, the response of the voice emotion system 602 when the speaker says “hot” to it in a strong tone.

【００７３】さらに、認識結果６０８に対して「はい」
と声感情システム６０２に対して発声すると、音声感情
システム６０２その認識結果として、「車内を２５℃に
設定します」を出力する。実際に、車内が２５℃に設定
される。Further, when the speaker answers “yes” to the voice emotion system 602 in response to the recognition result 608, the voice emotion system 602 outputs “Setting the cabin temperature to 25°C” as its recognition result, and the cabin temperature is actually set to 25°C.
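The air-conditioner dialogue can be sketched as a small mapping from the recognized word and emotion level to a confirmation prompt. Levels 3 and 5 and the 25 °C setting come from the example in the text; the treatment of other levels, the function names, and the English wording of the prompts are assumptions.

```python
def confirmation_prompt(word, level):
    """Return the confirmation question for a recognized word and emotion level."""
    if word != "hot":
        return None  # only the "hot" example from the text is modeled
    # Assumption: level 5 and above counts as "quite hot", anything lower as "a little hot".
    return "Is it quite hot?" if level >= 5 else "Is it a little hot?"

def handle_reply(reply, target_celsius=25):
    """Return the action announced when the speaker confirms with "yes"."""
    if reply == "yes":
        return f"Setting the cabin temperature to {target_celsius} °C"
    return None

print(confirmation_prompt("hot", 3))
print(confirmation_prompt("hot", 5))
print(handle_reply("yes"))
```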

【００７４】また、他の本実施形態では、本発明の音声
感情認識システムをファミリーコンピュータ（登録商
標）等のゲーム製品に適用した場合について説明する。In another embodiment, a case will be described in which the voice emotion recognition system of the present invention is applied to a game product such as Family Computer (registered trademark).

【００７５】図８（ａ）から図８（ｄ）は、ファミリコ
ンピュータ等のゲーム機のキャラクタの操作において、
音声感情認識システムを用いた一例であり、音声による
インターフェースの音声入力例及び認識結果による動作
例を示す模式図である。FIGS. 8(a) to 8(d) are schematic diagrams showing an example in which the voice emotion recognition system is used to operate a character on a game machine such as the Family Computer, with examples of voice input through the voice interface and of the resulting actions.

【００７６】図８に於いて、８０１、８０２、８０９及
び８１０は、ゲーム機本体のディスプレイやゲーム機が
接続されたＴＶ等の画面である。In FIG. 8, reference numerals 801, 802, 809, and 810 denote screens of a game machine main body display and a TV or the like to which the game machine is connected.

【００７７】図８（ａ）に於いて、ゲームに登場するキ
ャラクター８０５は、例えば、画面８０１に向かって左
から右へと進んでいる。この操作を音声感情認識を使っ
て行う。そこで、操作者（話者）は、発声例８０３に示
すように「進め」と普通の音声で発声する。音声感情認
識システムは、進め（ｓｕｓｕｍｅ）を認識し、さら
に、感情レベルを認識する。例えば、このゲーム機にお
ける音声感情認識システムでは、感情レベルを５段階に
設定したとすると、感情レベル＝３と認識する。そこ
で、ゲーム機本体側のシステムでは、キャラクター８０
５をキャラクター８０６の位置へ移動する。In FIG. 8(a), a character 805 appearing in the game is moving, for example, from left to right on the screen 801. This operation is performed using voice emotion recognition. The operator (speaker) says “susume” (“advance”) in a normal voice, as shown in utterance example 803. The voice emotion recognition system recognizes “susume” and also recognizes the emotion level. If, for example, the voice emotion recognition system of this game machine is set to five emotion levels, it recognizes emotion level 3. The system on the game machine side then moves the character 805 to the position of the character 806.

【００７８】図８（ｂ）に於いて、ゲームに登場するキ
ャラクター８０７は、例えば、画面８０２に向かって左
から右へと進んでいる。この操作を音声感情認識を使っ
て行う。そこで、操作者（話者）は、音声例８０４に示
すように、「進め」と強い調子で発声する。音声感情認
識システムは、進め（ｓｕｓｕｍｅ）を認識し、さら
に、感情レベルを認識する。例えば、このゲーム機にお
ける音声感情認識システムでは、感情レベルを５段階に
設定したとすると、感情レベル＝５と認識する。そこ
で、ゲーム機本体側のシステムでは、キャラクター８０
７はキャラクター８０８の位置へ大きく移動する。ここ
で、キャラクターの移動量は、認識された音声「進め」
の感情レベルに比例する。In FIG. 8(b), a character 807 appearing in the game is moving, for example, from left to right on the screen 802. This operation is performed using voice emotion recognition. The operator (speaker) says “susume” (“advance”) in a strong tone, as shown in voice example 804. The voice emotion recognition system recognizes “susume” and also recognizes the emotion level. If, for example, the voice emotion recognition system of this game machine is set to five emotion levels, it recognizes emotion level 5. The system on the game machine side then moves the character 807 a large distance, to the position of the character 808. Here, the amount of movement of the character is proportional to the emotion level of the recognized utterance “susume”.

【００７９】図８（ｃ）に於いては、ゲームに登場する
キャラクター８１３は、例えば、画面８０９に向かって
左から右へと進んでいる。このとき、前方に障害物８１
６が現われたとする。そこで、この障害物８１６を飛び
超えなくてはならない。この操作を音声感情認識を使っ
て行う。そこで、操作者（話者）は、音声例８１１に示
すように「ジャンプ」と普通の音声で発声する。音声感
情認識システムは、ジャンプ（ｊｙａｎｐｕ）を認識
し、さらに、感情レベルを認識する。例えば、このゲー
ム機における音声感情認識システムでは、感情レベルを
５段階に設定したとすると、感情レベル＝３と認識す
る。そこで、ゲーム機本体側のシステムでは、キャラク
ター８１３をキャラクター８１４の位置へ移動し、さら
に、キャラクター８１５の位置へ移動する。In FIG. 8(c), a character 813 appearing in the game is moving, for example, from left to right on the screen 809. Suppose that an obstacle 816 appears ahead. The character must jump over this obstacle 816. This operation is performed using voice emotion recognition. The operator (speaker) says “jyanpu” (“jump”) in a normal voice, as shown in voice example 811. The voice emotion recognition system recognizes “jyanpu” and also recognizes the emotion level. If, for example, the voice emotion recognition system of this game machine is set to five emotion levels, it recognizes emotion level 3. The system on the game machine side then moves the character 813 to the position of the character 814 and further to the position of the character 815.

【００８０】図８（ｄ）にに於いては、ゲームに登場す
るキャラクター８１７は、例えば、画面８１０に向かっ
て左から右へと進んでいる。このとき、前方に障害物８
２０が現われたとする。この障害物８２０は、画面８０
９のときの障害物８１６よりも大きい。そこで、この障
害物８２０を高く飛び超えなくてはならない。この操作
を音声感情認識を使って行う。そこで、操作者（話者）
は、音声例８１２に示すように「ジャンプ」と強い調子
で発声する。音声感情認識システムは、ジャンプ（ｊｙ
ａｎｐｕ）を認識し、さらに、感情レベルを認識する。
例えば、このゲーム機における音声感情認識システムで
は、感情レベルを５段階に設定したとすると、感情レベ
ル＝５と認識する。そこで、ゲーム機本体側のシステム
では、キャラクター８１７をキャラクター８１８の位置
へ大きく移動し、さらに、キャラクター８１９の位置へ
移動する。キャラクターの移動量は、認識された音声
「ジャンプ」の感情レベルに比例する。In FIG. 8(d), a character 817 appearing in the game is moving, for example, from left to right on the screen 810. Suppose that an obstacle 820 appears ahead. This obstacle 820 is larger than the obstacle 816 on the screen 809, so the character must jump high to clear it. This operation is performed using voice emotion recognition. The operator (speaker) says “jyanpu” (“jump”) in a strong tone, as shown in voice example 812. The voice emotion recognition system recognizes “jyanpu” and also recognizes the emotion level. If, for example, the voice emotion recognition system of this game machine is set to five emotion levels, it recognizes emotion level 5. The system on the game machine side then moves the character 817 a large distance, to the position of the character 818, and further to the position of the character 819. The amount of movement of the character is proportional to the emotion level of the recognized utterance “jyanpu”.
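The proportional relationship between emotion level and character movement might be sketched as follows. The unit step, the coordinate convention, and the reduction of a jump to a single height value are assumptions made for illustration; the text only states that the amount of movement is proportional to the emotion level.

```python
UNIT_STEP = 10  # hypothetical distance (e.g. pixels) per emotion level

def move_character(position, command, level):
    """Return the new (x, y) position after a recognized command.

    The displacement is UNIT_STEP * level, i.e. proportional to the
    emotion level: "susume" moves right, "jyanpu" sets the jump height.
    """
    x, y = position
    amount = UNIT_STEP * level
    if command == "susume":
        return (x + amount, y)
    if command == "jyanpu":
        return (x, y + amount)
    return position  # unknown commands leave the character where it is

print(move_character((0, 0), "susume", 3))  # normal voice
print(move_character((0, 0), "susume", 5))  # strong tone
print(move_character((0, 0), "jyanpu", 5))  # high jump over a large obstacle
```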

【００８１】[0081]

【発明の効果】本発明によれば、カーナビゲーションシ
ステム、小型情報システム、ゲームに用いられる音声認
識システムにおいて、登録した辞書の単語の文字列を音
声で認識するとともに、音声認識された単語において、
話者の音声が持つ感情のレベルを認識することができる
音声感情認識システムを提供することができる。According to the present invention, it is possible to provide a voice emotion recognition system that, in speech recognition systems used in car navigation systems, small information systems, and games, recognizes by voice the character strings of the words registered in the dictionary and also recognizes, for each recognized word, the level of emotion carried by the speaker's voice.

【００８２】また、本発明によれば、音声認識を用いた
カーナビゲーションシステム、小型情報システム、ゲー
ムにおいて、音声の感情レベルを認識できることから、
限られた単語数においても、音声認識によるインタフェ
ースのバリエーションを増やすことができ、良好な音声
認識インタフェースを実現することができる。Further, according to the present invention, because the emotion level of the voice can be recognized in car navigation systems, small information systems, and games that use speech recognition, the variety of speech-based interface operations can be increased even with a limited number of words, and a good speech recognition interface can be realized.

[Brief description of the drawings]

【図１】本発明に係る音声感情認識システムの一実施例
を示す示すブロック図である。FIG. 1 is a block diagram showing one embodiment of a voice emotion recognition system according to the present invention.

【図２】図１に示す音声感情認識システムの音響モデル
および発声変形感情モデルを説明するための模式図であ
る。FIG. 2 is a schematic diagram for explaining an acoustic model and an utterance deformation emotion model of the speech emotion recognition system shown in FIG. 1;

【図３】本発明に係る音声感情認識システムの他の実施
例を示すブロック図である。FIG. 3 is a block diagram showing another embodiment of the voice emotion recognition system according to the present invention.

【図４】図３に示す音声感情認識システムの音声波形お
よび音声パワーと感情レベルの関係を説明するための模
式図である。FIG. 4 is a schematic diagram for explaining a relationship between a voice waveform and voice power and a feeling level of the voice emotion recognition system shown in FIG. 3;

【図５】本発明のハードウエア構成を示すブロック図で
ある。FIG. 5 is a block diagram showing a hardware configuration of the present invention.

【図６】本発明の音声認識感情システムを適用したカー
ナビゲーションシステムにおける音声によるインタフェ
ースの音声入力例および認識結果例を示す模式図であ
る。FIG. 6 is a schematic diagram showing a voice input example of a voice interface and a recognition result example in a car navigation system to which the voice recognition emotion system of the present invention is applied.

【図７】従来の携帯型翻訳機のブロック図である。FIG. 7 is a block diagram of a conventional portable translator.

【図８】本発明の音声認識感情システムを適用したゲー
ム機における音声によるインタフェースの音声入力例お
よび認識結果による動作例を示した模式図である。FIG. 8 is a schematic diagram showing a voice input example of a voice interface and an operation example based on a recognition result in a game machine to which the voice recognition emotion system of the present invention is applied.

[Explanation of symbols]

１０１、３０１、５０１、６０１・・・マイク、１０２、
３０２、５０４・・・Ａ／Ｄ変換器、１０３、３０３・・・音
声分析部、１０４、３０４・・・音声感情認識部、１０
８、３０８・・・モデル連結部、１０５、３０５・・・音響モ
デル部、１０６・・・発声変形感情モデル部、１０７、３
０７・・・単語辞書部、２０１…ＨＭＭ音響モデル連結に
おける「あ」の状態、２０２…ＨＭＭ音響モデル連結に
おける「つ」の状態、２０３…ＨＭＭ音響モデル連結に
おける「い」の状態、２０４…状態「あ」から状態
「あ」へ遷移する確率、２０５…状態「つ」から状態
「つ」へ遷移する確率、２０６…状態「い」から状態
「い」へ遷移する確率、２０７…状態「あ」から状態
「つ」へ遷移する確率、２０８…状態「つ」から状態
「い」へ遷移する確率、２０９…状態「い」から他の状
態へ遷移する確率、２１１…状態「あ」の出力確率の連
続分布、２１２…状態「つ」の出力確率の連続分布、２
１３…状態「い」の出力確率の連続分布、２１４…ＨＭ
Ｍ発声変形感情モデルにおける状態「あ」の出力確率の
連続分布、２２１…ＨＭＭ発声変形感情モデルにおける
状態「あ」の出力確率の連続分布、２２２…ＨＭＭ発声
変形感情モデルにおける状態「つ」の出力確率の連続分
布、２２３…ＨＭＭ発声変形感情モデルにおける状態
「い」の出力確率の連続分布、５０２・・・ボタン、５０
３・・・ＣＰＵ、５０６・・・ＲＯＭ、５０７・・・ＲＡＭ、５
０９・・・キー、６０２・・・音声感情認識システム。101, 301, 501, 601: microphone; 102, 302, 504: A/D converter; 103, 303: speech analysis unit; 104, 304: voice emotion recognition unit; 108, 308: model connection unit; 105, 305: acoustic model unit; 106: utterance deformation emotion model unit; 107, 307: word dictionary unit; 201: state "あ" (a) in the HMM acoustic model connection; 202: state "つ" (tsu) in the HMM acoustic model connection; 203: state "い" (i) in the HMM acoustic model connection; 204: probability of transition from state "あ" to state "あ"; 205: probability of transition from state "つ" to state "つ"; 206: probability of transition from state "い" to state "い"; 207: probability of transition from state "あ" to state "つ"; 208: probability of transition from state "つ" to state "い"; 209: probability of transition from state "い" to another state; 211: continuous distribution of the output probability of state "あ"; 212: continuous distribution of the output probability of state "つ"; 213: continuous distribution of the output probability of state "い"; 214: continuous distribution of the output probability of state "あ" in the HMM utterance deformation emotion model; 221: continuous distribution of the output probability of state "あ" in the HMM utterance deformation emotion model; 222: continuous distribution of the output probability of state "つ" in the HMM utterance deformation emotion model; 223: continuous distribution of the output probability of state "い" in the HMM utterance deformation emotion model; 502: button; 503: CPU; 506: ROM; 507: RAM; 509: key; 602: voice emotion recognition system.

フロントページの続き (72)発明者近藤和夫東京都小平市上水本町五丁目20番１号株式会社日立製作所半導体事業部内 (72)発明者大淵康成東京都国分寺市東恋ケ窪一丁目280番地株式会社日立製作所中央研究所内 (72)発明者塔下哲司東京都小平市上水本町五丁目20番１号株式会社日立製作所半導体事業部内 (72)発明者石川泰代東京都小平市上水本町五丁目22番１号株式会社日立マイコンシステム内 Continued from the front page: (72) Inventor Kazuo Kondo, c/o Semiconductor Division, Hitachi, Ltd., 5-20-1 Josuihoncho, Kodaira-shi, Tokyo; (72) Inventor Yasunari Obuchi, c/o Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-Koigakubo, Kokubunji-shi, Tokyo; (72) Inventor Tetsuji Toushita, c/o Semiconductor Division, Hitachi, Ltd., 5-20-1 Josuihoncho, Kodaira-shi, Tokyo; (72) Inventor Yasuyo Ishikawa, c/o Hitachi Microcomputer System, Ltd., 5-22-1 Josuihoncho, Kodaira-shi, Tokyo.

Claims

[Claims]

1. A voice emotion recognition system, in a speech recognition system that collects words and sentences to be subjected to speech recognition and defines them as a dictionary, picks up those words and sentences from a dictionary unit as a speech recognition result, and outputs them using character string display and speech synthesis, the voice emotion recognition system comprising: a speech analysis unit that performs speech analysis processing on captured speech; an acoustic model unit that holds speech patterns in phoneme units; an utterance deformation emotion model unit that represents the deformation of the phoneme spectrum caused by emotion; and a speech recognition unit that performs speech recognition processing on the speech analysis result by connecting the acoustic model unit, the utterance deformation emotion model unit, and the dictionary unit, wherein, based on features of the speech, the system outputs the words and sentences to be subjected to speech recognition as the speech recognition result and also outputs a level indicating the degree of the speaker's emotion carried by the speech.

2. The voice emotion recognition system according to claim 1, wherein the level indicating the degree of the speaker's emotion carried by the speech is a number from 0 to N (N is an integer).

3. The voice emotion recognition system according to claim 1, wherein the dictionary unit has a dictionary that collects words and sentences to be subjected to speech recognition and a dictionary that collects, for those words and sentences, modifiers expressing the level of emotion carried by the speech, and wherein, as the speech recognition result, those words and sentences are picked up, the modifier expressing the level of emotion is also picked up, the modifier is attached to the word or sentence, and the result is output using characters or speech synthesis.

4. A voice emotion recognition system, in a speech recognition system that collects words and sentences to be subjected to speech recognition and defines them as a dictionary, picks up those words and sentences as a speech recognition result, and outputs them using character string display or speech synthesis, the voice emotion recognition system comprising: a speech analysis unit that performs speech analysis processing on captured speech; an acoustic model unit that holds speech patterns in phoneme units; and a speech recognition unit that performs speech recognition processing on the speech analysis result by connecting the acoustic model unit and the dictionary unit, wherein speech analysis processing is performed on the captured speech, and a level indicating the degree of emotion is output from a feature of the power, which indicates the strength of the sound and in which the degree of emotion appears.

5. A voice emotion recognition method employing a dictionary that collects words and sentences to be subjected to speech recognition, a speech analysis unit that performs speech analysis processing on captured speech, an acoustic model unit that holds speech patterns in phoneme units, an utterance deformation emotion model unit that represents the deformation of the phoneme spectrum caused by emotion, and a speech recognition unit that performs speech recognition processing on the speech analysis result by connecting the acoustic model unit, the utterance deformation emotion model unit, and the dictionary unit, the method comprising: outputting, based on features of the speech, the words and sentences to be subjected to speech recognition as the speech recognition result; and outputting the degree of the speaker's emotion carried by the speech using data from the utterance deformation emotion model unit.

6. A voice emotion recognition method employing a dictionary that collects words and sentences to be subjected to speech recognition, a speech analysis unit capable of performing speech analysis processing on captured speech, analyzing the power of the sound, and outputting an emotion level, an acoustic model unit that holds speech patterns in phoneme units, and a speech recognition unit that performs speech recognition processing on the speech analysis result by connecting the acoustic model unit and the dictionary, the method comprising: recognizing the degree of emotion from a feature of the power, which indicates the strength of the sound and in which the degree of emotion appears; and outputting the resulting emotion level.

7. The voice emotion recognition method according to claim 5, wherein the level indicating the degree of the speaker's emotion carried by the speech is a number from 0 to N (N is an integer).

8. The voice emotion recognition method according to claim 5, wherein the dictionary unit includes a dictionary that collects words and sentences to be subjected to speech recognition and a dictionary that collects, for each of those words and sentences, modifiers expressing the level of emotion carried by the speech, and wherein, as the speech recognition result, those words and sentences are picked up, the modifier expressing the level of emotion is also picked up, the modifier is attached to the word or sentence, and the result is output using characters or speech synthesis.
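
As a hedged illustration of the power-based approach recited in claims 4 and 6, the sketch below maps the peak frame power of an utterance to an emotion level of 0 to N. The frame length, the dB thresholds, the use of peak rather than average power, and the function names `frame_powers_db` and `emotion_level_from_power` are all assumptions made for illustration; the claims state only that the level is derived from a feature of the power.

```python
import math


def frame_powers_db(samples, frame_len=256):
    """Return the mean power of each non-overlapping frame, in dB
    relative to a mean square of 1.0 (samples assumed in [-1, 1])."""
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_square = sum(s * s for s in frame) / frame_len
        # Small offset avoids log10(0) on silent frames.
        powers.append(10.0 * math.log10(mean_square + 1e-12))
    return powers


def emotion_level_from_power(samples,
                             thresholds_db=(-30.0, -20.0, -10.0),
                             frame_len=256):
    """Map peak frame power to a level 0..N, where N = len(thresholds_db).

    The thresholds are assumed values; each threshold that the peak
    power reaches raises the level by one.
    """
    powers = frame_powers_db(samples, frame_len)
    if not powers:
        return 0  # too short to analyze; treat as the lowest level
    peak = max(powers)
    return sum(1 for t in thresholds_db if peak >= t)
```

With these assumed thresholds, a louder utterance of the same word yields a higher level, which could then be output alongside the recognized word.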