JP2001005487A

JP2001005487A - Voice recognition device

Info

Publication number: JP2001005487A
Application number: JP11172546A
Authority: JP
Inventors: Shusuke Yamasaki; 秀典山▲さき▼
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1999-06-18
Filing date: 1999-06-18
Publication date: 2001-01-12

Abstract

(57) [Abstract]

[Problem] In a voice recognition device, to identify the speaker and, based on the identification result, automatically set the speech synthesis parameters that determine the sound quality of recognition responses and other synthesized speech, thereby improving the quality of the synthesized speech.

[Solution] The device includes a voice switching unit 4A. When a speaker's utterance is input to a speaker discrimination unit 3, the speaker's sex and age are determined using speaker discrimination information 3a to 3b stored in advance in the speaker discrimination unit 3. According to this determination, the voice switching unit 4A automatically changes the parameters that determine the sound quality, such as speaking rate and pitch, of the synthesized speech uttered to the speaker.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a voice recognition device that, when outputting voice recognition results or utterance guidance based on those results to the speaker as synthesized speech, changes the speech synthesis parameters according to the speaker's sex and age so as to produce the synthesized speech best suited to each speaker.

[0002]

[Prior Art] A voice recognition device mounted on a moving body such as an automobile is used, for example, as a means of operating in-vehicle equipment by voice, such as a car stereo, a navigation device, or a telephone terminal installed in the vehicle. Its purpose is to reduce the driver's physical and psychological burden while operating that equipment and so contribute to safety.

[0003] There are two ways for a voice recognition device to notify the user of a recognition result, or to guide the user to the next utterance in response to what the user has said for recognition. One is visual notification, for example by displaying characters on the display panel or display device of the in-vehicle equipment. The other is auditory notification by synthesized speech.

[0004] Because this notification takes place while a vehicle is being driven, synthesized speech is generally used. It contributes to the driving safety mentioned above because, for example, the user does not have to look far away from the road ahead to check a displayed recognition result.

[0005] Next, the configuration of a conventional voice recognition device will be described with reference to the accompanying drawings. FIG. 11 is a block diagram showing the configuration of a conventional voice recognition device. In FIG. 11, reference numeral 9 denotes the main body of a conventional voice recognition device used, for example, for voice input/output control of a navigation device 10. The voice recognition device 9 comprises a voice recognition unit 2 that recognizes the content of an input utterance; a voice switching unit 4 that holds the voice information used to respond to the recognized utterance, or the synthesized speech information used to guide the user to the next utterance; and a speech synthesis unit 5 that performs speech synthesis based on the synthesized speech information and outputs a synthesized speech signal.

[0006] The voice switching unit 4 includes a text data storage memory 4b that outputs to the speech synthesis unit 5 the text data for speech synthesis Sig7, which is the content to be synthesized; a synthesis parameter storage memory 4c that outputs speech synthesis parameters Sig8 to the speech synthesis unit 5; and a speech synthesis control unit 4a that, by means of a text data control signal Sig5, instructs the text data storage memory 4b to output the text data Sig7 for the content of the synthesized speech, such as a recognition result response or utterance guidance.

[0007] The peripheral devices of the voice recognition device 9 are: a microphone 1 that picks up the user's utterance and inputs an utterance voice signal Sig1 to the voice recognition unit 2; an utterance switch 8 that outputs to the voice recognition unit 2 an utterance start signal Sig11 instructing it to wait for an utterance from the user; an amplifier 6 that amplifies the synthesized speech signal Sig9 output from the speech synthesis unit 5 and outputs a synthesized speech signal Sig10 to a loudspeaker 7; and the navigation device 10, which performs map searches, display, and so on according to the voice recognition result signal Sig2 output from the voice recognition unit 2.

[0008] The speech synthesis parameter data Sig8 output from the speech synthesis parameter storage memory 4c to the speech synthesis unit 5 are parameters that adjust, for example, the speaking rate, pitch, and intonation strength of the synthesized speech. The speech synthesis unit 5 receives the speech synthesis parameter data Sig8 and the text data for speech synthesis Sig7, performs speech synthesis, and outputs the synthesized speech signal Sig9 to the amplifier 6. The synthesized speech signal Sig10 amplified by the amplifier 6 is emitted to the user as synthesized speech from the loudspeaker 7.

[0009] Next, the function of each block will be described following the signal flow. When the utterance voice signal Sig1 picked up by the microphone 1 is input to the voice recognition unit 2, the voice recognition unit 2 outputs a voice recognition result signal Sig2, as the recognition result, to the navigation device 10 and to the speech synthesis control unit 4a of the voice switching unit 4.

[0010] By means of the text data control signal Sig5, the speech synthesis control unit 4a instructs the text data storage memory 4b to output the text data for speech synthesis Sig7 for the content of the synthesized speech, such as a recognition result response or utterance guidance. The text data storage memory 4b outputs the text data Sig7, which is the content to be synthesized, to the speech synthesis unit 5.

[0011] The speech synthesis algorithm executed by the speech synthesis unit 5 generally generates a synthesized speech signal from the text data for speech synthesis Sig7. This method of synthesis is generally called text-to-speech synthesis (Text to speech).

[0012] Meanwhile, the voice recognition result signal Sig2 input from the voice recognition unit 2 to the navigation device 10 is used for control by control means (not shown) in the navigation device 10, which executes the operation based on the recognized speech. When the user operates the utterance switch 8 connected to the voice recognition unit 2, the utterance start signal Sig11 is output to the voice recognition unit 2. This signal notifies the voice recognition unit 2 that utterance input is starting and instructs it to wait for an utterance.

[0013] FIG. 12 shows a configuration in which the voice recognition device 9 of FIG. 11 is applied to control of the navigation device 10. It shows an example in which a voice recognition operation has been performed for an address search, which is part of the map screen search operation of the navigation device 10. Based on the resulting image signal Sig12, the address name [1-chome, Marunouchi, Chiyoda-ku, Tokyo] is displayed on the address name display screen 12b of the display 12, and a map of the surrounding area is displayed on the address area map screen 12a.

[0014] In FIG. 12, for convenience of drawing, the synthesized speech signal Sig9 is shown as being output from the navigation device 10. In fact it is output from the speech synthesis unit 5 of the voice recognition device 9, as shown in FIG. 11.

[0015] Next, the operation of the conventional voice recognition device will be described with reference to the flowchart in FIG. 13. After the device starts up in step S1, step S2 monitors whether the user has operated the utterance switch 8. When the switch has been operated, the process proceeds to step S3, where the voice recognition unit 2 is activated so that voice recognition is possible.

[0016] In step S4, the voice recognition unit 2 monitors for the start of the user's utterance, for example by a rising trend in the level of the utterance voice signal Sig1 or by voice analysis. When it determines that there is voice input from an utterance, the voice recognition unit 2 executes voice recognition processing in step S5.

[0017] In the following step S13, based on the voice recognition result signal Sig2, the speech synthesis control unit 4a outputs to the text data storage memory 4b the text data control signal Sig5 corresponding to the phrase to be synthesized, so that the speech synthesis unit 5 creates that phrase from a combination of text data.

[0018] In the final step S14, a synthesized speech signal Sig9 is generated by rule-based text-to-speech synthesis from two inputs: the text data for speech synthesis Sig7 read from the text data storage memory 4b in response to the text data control signal Sig5, and the speech synthesis parameters Sig8 input from the speech synthesis parameter storage memory 4c. The process then returns to step S2 and monitors operation of the utterance switch 8.
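The conventional flow of steps S2 to S14 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function and class names, the callable parameters, and the fixed parameter values are hypothetical and do not come from the patent. The point it shows is that the synthesis parameters (Sig8) stay the same whoever is speaking.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical fixed parameter set standing in for Sig8 (values are assumptions).
FIXED_PARAMS = {"rate": "normal", "pitch": "normal"}


@dataclass
class SynthesisRequest:
    """What the speech synthesis unit 5 receives in step S14."""
    text: str     # stands in for the text data Sig7
    params: dict  # stands in for the synthesis parameters Sig8


def conventional_cycle(
    switch_pressed: bool,
    utterance: Optional[str],
    recognize: Callable[[str], str],
    response_text: Callable[[str], str],
) -> Optional[SynthesisRequest]:
    """One pass through steps S2 to S14 of the conventional device."""
    if not switch_pressed:         # S2: utterance switch not operated
        return None
    # S3: recognition is enabled; S4: wait for voice input
    if not utterance:
        return None
    result = recognize(utterance)  # S5: voice recognition (Sig2)
    text = response_text(result)   # S13: choose the response text (Sig5 -> Sig7)
    # S14: synthesis always uses the same parameters, regardless of the speaker
    return SynthesisRequest(text=text, params=FIXED_PARAMS)
```

A caller would supply its own recognizer and response-text lookup; both are passed in as plain functions here so that the sketch stays independent of any particular recognition engine.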

[0019] FIG. 14 shows the flow of result responses in the conventional voice recognition device. Using a map search by voice operation of the navigation device 10 as an example, FIG. 14 illustrates the user's operation of the utterance switch 8, the user's utterances, the synthesized speech responses based on the recognition results of the voice recognition device 9, and the map display.

[0020] The left side of FIG. 14 shows the time sequence of the user's operation of the utterance switch 8 and the user's utterances. The right side shows the corresponding time sequence of the synthesized speech output with which the voice recognition device 9 responds to the switch operation and utterances, and of changes in the display content of the navigation device 10.

[0021] First, when the user operates the utterance switch 8 at time H1, the voice recognition device 9 detects the start of the utterance at time K1 and transitions to a state in which it can accept voice recognition. It then issues the utterance guidance [Please speak a voice command.] as synthesized speech K1a, prompting the user to utter an operation word that can be recognized.

[0022] On hearing the synthesized speech K1a, the user says "address search" in utterance H2. "Address search" is a voice operation command word. Such command words are stored in advance, as reference data for the pattern matching operation in voice recognition, in a voice recognition dictionary held in a memory (not shown) built into the voice recognition unit 2 of the voice recognition device 9.

[0023] The voice recognition device 9 analyzes the words "address search" in utterance H2 as the utterance voice signal Sig1, executes the voice recognition operation by pattern matching, and at time K2 obtains the recognition result that the utterance content is "address search". The voice recognition device 9 then issues the utterance guidance [Please tell me the address name.] as synthesized speech K2a, prompting the user to utter a specific address name.

[0024] When, in response to the utterance guidance, the user utters the address name "1-chome, Marunouchi, Chiyoda-ku, Tokyo" as utterance H3, the voice recognition device 9 recognizes utterance H3 at time K3. It then outputs, as synthesized speech Ka3, the recognition result response [Displaying the area around 1-chome, Marunouchi, Chiyoda-ku, Tokyo.], which contains the address name based on the voice recognition result signal Sig2. Finally, at time K4, the navigation device 10 outputs image information Sig12 to the display 12 as shown in FIG. 12. The display 12 shows the map screen 12a of the area around [1-chome, Marunouchi, Chiyoda-ku, Tokyo] and the address name characters 12b.
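The three-turn exchange above (switch operation, command word, address name) behaves like a small state machine. The following Python sketch is one way to picture it. The state names and the handling of unrecognized input are assumptions made for illustration, and the prompt strings are paraphrased from the description rather than taken from any real device.

```python
from typing import Tuple

# Prompts paraphrased from the description; the state names are hypothetical.
PROMPT_COMMAND = "Please speak a voice command."
PROMPT_ADDRESS = "Please tell me the address name."


def next_response(state: str, event: str) -> Tuple[str, str]:
    """Return (new_state, synthesized_response) for one dialogue turn."""
    if state == "idle" and event == "switch":
        # H1 -> K1a: the switch is operated, the device asks for a command
        return "await_command", PROMPT_COMMAND
    if state == "await_command" and event == "address search":
        # H2 -> K2a: the command word is recognized, the device asks for an address
        return "await_address", PROMPT_ADDRESS
    if state == "await_address":
        # H3 -> K3: the address is recognized and confirmed by voice
        return "idle", f"Displaying the area around {event}."
    # Assumption: unrecognized input leaves the state unchanged and says nothing
    return state, ""
```

Feeding the events "switch", "address search", and an address name in order walks through the same sequence of prompts as FIG. 14 and returns to the idle state.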

[0025] Thus, with a voice recognition device that has a speech synthesis function, the user can operate the navigation device in the form of a dialogue with the voice recognition device through voice recognition and speech synthesis.

[0026]

[Problems to Be Solved by the Invention] As described above, in a voice recognition device with a speech synthesis function that uses synthesized speech for utterance guidance and recognition result responses, it is essential for the smooth flow of a series of operations that the user can reliably hear the synthesized speech.

[0027] Conventional voice recognition devices either kept the speech synthesis parameter settings fixed, or let the user choose the settings for the benefit of users who find the synthesized speech hard to hear or who wish to adjust the parameters. In either case, however, there were the following problems.

[0028] (1) When the device is not configured to let the user select the speech synthesis parameter settings, the user always hears speech in the same tone. Given that users vary in age, sex, and so on, they are made to listen to synthesized speech that does not suit their actual hearing ability or their psychological preferences. As a result, problems tend to arise: the user does not find the device comfortable to operate, the synthesized speech is hard to hear, the speaking rate does not match the tempo of operation, or the user cannot get used to the intonation or pitch of the synthesized speech.

[0029] (2) Even when the device is configured to let the user select the speech synthesis parameter settings, setting the parameters is laborious for users unfamiliar with operating the voice recognition device. Moreover, across the varied driving conditions of a vehicle, the conditions that affect the user acoustically or psychologically change with the noise environment inside and outside the vehicle and with the travel speed, and it is difficult for the user to set the speech synthesis parameters each time.

[0030] The present invention has been made to solve the above problems. Its object is to obtain a voice recognition device that can automatically change the content of the synthesized speech and the parameters that determine its speaking rate, pitch, and so on, according to the sex and age of the user of the device or the noise environment surrounding the user.

[0031]

[Means for Solving the Problems] The voice recognition device according to the invention of claim 1 comprises: voice collecting means for collecting voice information from a speaker who is the target of voice recognition; pattern storage means for storing, as a plurality of patterns, the physical attributes of speakers as characterized by voice information; pattern determining means for determining which of the stored attribute patterns the collected voice information falls under; sound quality determining means for determining the sound quality of the output voice according to the physical attributes of the determined attribute pattern; and speech synthesis means for synthesizing the voice uttered to the speaker so as to reflect the determined sound quality.

[0032] In the voice recognition device according to the invention of claim 2, the sound quality determining means includes environment detecting means for detecting the surrounding environment that psychologically affects the speaker, and reflects the detected surrounding environment in the sound quality of the output voice.

[0033] In the voice recognition device according to the invention of claim 3, the environment detecting means is speed detecting means for detecting the travel speed of the moving body carrying the speaker.

[0034] In the voice recognition device according to the invention of claim 4, the environment detecting means is noise detecting means for detecting noise inside and outside the moving body carrying the speaker.

[0035] In the voice recognition device according to the invention of claim 5, the sound quality determining means determines both the sound quality and the utterance content of the output voice according to the physical attributes.

[0036]

[Embodiments of the Invention] Embodiment 1. A voice recognition device according to Embodiment 1 of the present invention will now be described with reference to the accompanying drawings. FIG. 1 is a block diagram showing the configuration of an in-vehicle information system that combines the voice recognition device according to Embodiment 1 with a navigation device. In the figure, the same reference numerals as in FIG. 11 denote the same or corresponding parts.

[0037] In the figure, 9A denotes the main body of the voice recognition device according to Embodiment 1. In addition to the configuration of the conventional voice recognition device (see FIG. 10), the voice recognition device 9A includes a speaker discrimination unit 3 that determines, from the utterance voice signal Sig1 input from the microphone (voice collecting means) 1, whether the speaker is a man, a woman, or an elderly man.

[0038] The speaker discrimination unit 3 includes speaker pattern memories (pattern storage means) 3a to 3c and a speaker voice comparison unit (pattern determining means) 3d. The speaker pattern memories 3a to 3c store a standard pattern for each category, such as male voice, female voice, and elderly male voice, and output these patterns as phoneme pattern signals Sig3a to Sig3c. The speaker voice comparison unit 3d compares the phoneme pattern signals Sig3a to Sig3c with the utterance voice signal Sig1 and determines by pattern matching which speaker pattern the utterance voice signal Sig1 is closest to.
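The nearest-pattern decision made by the comparison unit 3d can be sketched as follows. The patent does not say which features or which distance measure are used, so the two-element feature vectors, their numeric values, and the Euclidean distance below are all assumptions chosen purely for illustration.

```python
import math
from typing import Dict, Sequence

# Hypothetical reference vectors standing in for the standard patterns held in
# the speaker pattern memories 3a to 3c. The features and values are made up.
REFERENCE_PATTERNS: Dict[str, Sequence[float]] = {
    "adult_male": (120.0, 1.0),
    "female": (220.0, 1.0),
    "elderly_male": (140.0, 0.6),
}


def classify_speaker(
    features: Sequence[float],
    patterns: Dict[str, Sequence[float]] = REFERENCE_PATTERNS,
) -> str:
    """Return the label of the stored pattern closest to `features`.

    Euclidean distance is an assumption; the description only says
    "pattern matching".
    """
    return min(patterns, key=lambda label: math.dist(features, patterns[label]))
```

The returned label plays the role of the speaker discrimination result signal Sig4. A real implementation would compare acoustic features extracted from Sig1 rather than hand-written vectors.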

[0039] Further, the speech synthesis control unit (sound quality determining means) 4a according to Embodiment 1 receives the speaker discrimination result signal Sig4 output from the utterance voice comparison unit 3d. It then outputs a speech synthesis parameter control signal Sig6 to the speech synthesis parameter storage memory 4c so that the optimum speech synthesis parameters Sig8 are output to the speech synthesis unit (speech synthesis means) 5.

[0040] Next, the operation of Embodiment 1 will be described in outline. When the utterance voice signal Sig1 is input from the microphone 1, the speaker discrimination unit 3 compares the utterance with the phoneme pattern signals Sig3a to Sig3c of the speaker pattern memories 3a to 3c stored in advance inside the speaker discrimination unit 3. It determines by pattern matching which speaker pattern the utterance voice signal Sig1, representing the user's utterance, is closest to. As a result, the speaker discrimination unit 3 estimates which phoneme pattern category the speaker's voice belongs to.

[0041] The estimation result is output from the utterance voice comparison unit 3d, as the speaker discrimination result signal Sig4, to the speech synthesis control unit 4a of the voice switching unit 4. On receiving Sig4, the speech synthesis control unit 4a outputs the speech synthesis parameter control signal Sig6 to the speech synthesis parameter storage memory 4c. As a result, the speech synthesis parameter storage memory 4c outputs to the speech synthesis unit 5 the speech synthesis parameters Sig8 that are optimum for the identified speaker category.

[0042] The detailed operation of Embodiment 1 will be described with reference to the flowchart in FIG. 2. FIG. 2 differs from the flowchart of FIG. 13, which describes the operation of the conventional device, in one respect: processing to select and output the optimum speech synthesis parameters based on the speaker discrimination result is inserted between the voice recognition processing of step S5 and the synthesized phrase creation processing of step S13.

[0043] The voice recognition device 9A executes voice recognition processing in step S5 and then performs speaker discrimination processing in step S7. If, as a result, the speaker is determined in step S8 to be an adult man, the speech synthesis parameters are adjusted for an adult man in step S9. "Adjustment" here means the following operation: the utterance voice comparison unit 3d shown in FIG. 1 outputs the speaker discrimination result signal Sig4 to the speech synthesis control unit 4a, and the speech synthesis control unit 4a, based on that signal, outputs the speech synthesis parameter control signal Sig6 to the speech synthesis parameter storage memory 4c to specify the speech synthesis parameter data Sig8 to be output to the speech synthesis unit 5.

[0044] If, however, the speaker is determined not to be an adult man, the process proceeds to step S10, where it is determined whether the speaker is a woman. If the speaker is determined to be a woman, the parameters are set for a woman in step S11. If the speaker is determined not to be a woman, that is, if it is difficult to judge whether the speaker is a man or a woman, the parameters are set for an elderly man in step S12. Step S12 performs the same kind of operation as step S9.
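The branch order of steps S8 to S12 can be written as a short Python function. The function name, the boolean inputs, and the returned labels are hypothetical; only the order of the tests and the fallback to the elderly-male setting follow the description.

```python
def select_parameter_set(is_adult_male: bool, is_female: bool) -> str:
    """Mirror the branch order of steps S8 to S12 in FIG. 2."""
    if is_adult_male:      # S8 -> S9: adult-male settings
        return "adult_male"
    if is_female:          # S10 -> S11: female settings
        return "female"
    return "elderly_male"  # S12: fallback when neither test succeeds
```

Because the adult-male test comes first, a speaker who passes it never reaches the female test, and any speaker who passes neither test receives the elderly-male settings.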

[0045] After any of the above parameter settings, the process proceeds to step S13. There the speech synthesis control unit 4a outputs the text data control signal Sig5 to the text data storage memory 4b and specifies the text data for speech synthesis Sig7 corresponding to the content to be spoken, which fixes the phrase to be synthesized.

[0046] Then, in step S14, the speech synthesis unit 5 performs speech synthesis based on the text data for speech synthesis Sig7 and the speech synthesis parameter data Sig8, and outputs the synthesized speech signal Sig9.

[0047] Next, how the speech synthesis parameters of a recognition result response are changed in Embodiment 1 based on the speaker discrimination result will be described with reference to FIG. 3. Here the sequence of the user's utterance switch operation, utterances, and response voice output of the voice recognition device is assumed to be the same as in the prior art shown in FIG. 14.

[0048] FIG. 3 takes as an example the recognition result response "Displaying 1-chome, Marunouchi, Chiyoda-ku, Tokyo" in response voice K5 of FIG. 14, and shows the speech synthesis parameters changed for each user category given by the speaker discrimination result. In what follows, men under 65 are called adult men and those aged 65 or over are called elderly men. In FIG. 3, (a) shows the voice response for adult men, (b) for women, and (c) for elderly men. Categories of this kind can generally be distinguished by voice.

[0049] As shown in FIG. 3, the utterance time of each voice response differs: 6 seconds, 7 seconds, and 8 seconds. This is because adult males generally hear the response voice well and tend to prefer, as a comfortable operating environment, a fast-tempo response with a shortened utterance time. For elderly males, conversely, the utterance must be slowed down to make it easier to hear. Females fall roughly in between.

[0050] The table in FIG. 3 lists the speech synthesis parameters by user class. The parameters, namely speech rate, pitch, intonation strength, loudness, and so on, are expressed qualitatively.

[0051] That is, for the user classes standard adult male (under 65), female, and elderly male (65 or over), in that order, the speech synthesis parameters are expressed qualitatively as follows: speech rate is "fast", "slightly slow", "slow"; pitch is "slightly high", "normal", "high"; intonation strength is "normal", "strong", "normal"; and loudness is "normal", "normal", "loud".

[0052] These are example settings that, for the speaker classes in the table, make the response easy to hear and improve intelligibility for users in each class, keep the utterance time of the response voice reasonable so that the operation tempo is comfortable, and produce a voice response that leaves a favorable impression on the listener.
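The qualitative settings listed in paragraph [0051] can be pictured as a lookup table keyed by user class. The following Python sketch is only an illustration: the class keys (`adult_male`, `female`, `elderly_male`), the `SynthesisParams` type, and the `params_for` helper are hypothetical names that do not appear in the specification, and the qualitative labels are kept as plain strings because the specification gives no numeric values for FIG. 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynthesisParams:
    """Qualitative speech synthesis parameters (labels as in paragraph [0051])."""
    speech_rate: str
    pitch: str
    intonation: str
    loudness: str


# Hypothetical class keys; the values follow the order given in paragraph [0051].
PARAMS_BY_CLASS = {
    "adult_male": SynthesisParams("fast", "slightly high", "normal", "normal"),
    "female": SynthesisParams("slightly slow", "normal", "strong", "normal"),
    "elderly_male": SynthesisParams("slow", "high", "normal", "loud"),
}


def params_for(user_class: str) -> SynthesisParams:
    """Return the parameter set for a discriminated user class."""
    return PARAMS_BY_CLASS[user_class]
```

For example, `params_for("elderly_male").loudness` yields `"loud"`, the only class for which the example settings raise the loudness.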

[0053] Next, the case where the content of the synthesized utterance is changed according to the user, based on the speaker discrimination result in Embodiment 1, will be described with reference to FIG. 4. For adult males, only the address, "Marunouchi 1-chome, Chiyoda-ku, Tokyo" (「とうきょうとちよだくまるのうちいっちょうめ」), is output, shortening the utterance time of the response voice to 6 seconds and reducing its redundancy. For elderly males, conversely, the content is not shortened and the utterance is slowed to 8 seconds, with the polite wording "Displaying the vicinity of Marunouchi 1-chome, Chiyoda-ku, Tokyo" (「とうきょうとちよだくまるのうちいっちょうめふきんをひょうじします」) to make the response easier to hear. The response for females lies between the two.

[0054] To change the utterance content according to the user in this way, when the speech synthesis control unit 4a receives from the uttered voice comparison unit 3d the speaker discrimination result signal Sig4 identifying the user, it reads from the text data storage memory 4b the speech-synthesis text data whose content corresponds to that user, and the speech synthesis unit 5 generates the synthesized words taking the speech synthesis parameters into account.
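The selection of response text by user class described in paragraphs [0053] and [0054] might be sketched as follows. This is a rough illustration under stated assumptions, not the device's implementation: the class keys, the `RESPONSE_TEMPLATES` table, and `response_text` are hypothetical, the wording is rendered in English, and the template for females is an assumption, since the specification only says that it lies between the other two.

```python
# Hypothetical templates standing in for the text data held in memory 4b.
RESPONSE_TEMPLATES = {
    # Adult males: address only, to shorten the response.
    "adult_male": "{address}",
    # Females: ASSUMPTION. The specification only says "between the two".
    "female": "Displaying {address}.",
    # Elderly males: full, polite wording.
    "elderly_male": "Displaying the vicinity of {address}.",
}


def response_text(user_class: str, address: str) -> str:
    """Pick the response wording for the discriminated user class."""
    return RESPONSE_TEMPLATES[user_class].format(address=address)
```

Keeping the wording in a table separate from the synthesis parameters mirrors the split between the text data and the parameter data that are both passed to the speech synthesis unit.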

[0055] Embodiment 2. Next, the speech recognition device according to Embodiment 2 will be described with reference to the accompanying drawings. FIG. 5 is a configuration diagram of the speech recognition device 9B according to Embodiment 2. It differs from the speech recognition device 9A according to Embodiment 1 shown in FIG. 1 in that a speed signal Sig12 output from speed detecting means 11 mounted on the vehicle is supplied to the speech synthesis control unit 4a of the voice switching unit 4B. Here, the speed detecting means 11 constitutes the environment detecting means.

[0056] As a result, the speech synthesis control unit 4a changes the utterance time among the speech synthesis parameters according to the speed of the vehicle. The utterance time is changed by setting the parameter quantitatively as an actual time, such as "6 seconds", "7 seconds", or "8 seconds", for each user and for each travel speed.

[0057] The operation of Embodiment 2 will be described with reference to the flowchart of FIG. 6. This flowchart differs from the flowchart of FIG. 2, which describes the operation of Embodiment 1, in that, after the speech recognition processing in step S5, the speed signal Sig12 is read in step S6a and the travel speed of the own vehicle is determined. Speaker discrimination is then performed in step S7.

[0058] Next, how the utterance content and utterance time of the voice response are changed for each speaker, taking the travel speed of the own vehicle into account, will be described with reference to FIG. 7. As an example, FIG. 7 shows the utterance content and utterance time when the travel speed of the own vehicle is 60 km/h or less and when it is 60 km/h or more, separately for the adult male, female, and elderly male classes described above.

[0059] In general, the higher the travel speed, the faster the voice response is uttered, which shortens the utterance time and gives a speedy response. This is because, when a navigation device or the like is operated through the speech recognition device, a quick recognition-result response hastens the transition to the next operation, so that the user can respond more promptly to the changes in circumstances that accompany the movement of the vehicle at that speed.

[0060] Even here, however, the utterance time of the voice response must be varied according to the age and sex of the user to give a comfortable feel of operation. For example, the utterance time of the voice response "Displaying a map of the vicinity of Marunouchi 1-chome, Chiyoda-ku, Tokyo" (「とうきょうとちよだくまるのうちいっちょうめふきんのちずをひょうじします」) is varied as follows. At a travel speed of 60 km/h or less, it is 6 seconds for adult males under 65, 7 seconds for females, and 8 seconds for males aged 65 or over. At a travel speed of 60 km/h or more, it is 5 seconds for adult males under 65, 6 seconds for females, and 7 seconds for males aged 65 or over.
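The utterance times in paragraph [0060] amount to a two-key lookup on user class and speed band. The Python sketch below illustrates this under stated assumptions: the class keys and function name are hypothetical, and because the specification says both "60 km/h or less" and "60 km/h or more", the handling of exactly 60 km/h (placed in the lower band here) is an assumption.

```python
SPEED_THRESHOLD_KMH = 60.0

# Utterance time in seconds, keyed by (user class, speed band), per paragraph [0060].
UTTERANCE_SECONDS = {
    ("adult_male", "low"): 6,
    ("adult_male", "high"): 5,
    ("female", "low"): 7,
    ("female", "high"): 6,
    ("elderly_male", "low"): 8,
    ("elderly_male", "high"): 7,
}


def utterance_seconds(user_class: str, speed_kmh: float) -> int:
    """Look up the target utterance time for a user class at a given vehicle speed."""
    # ASSUMPTION: exactly 60 km/h falls in the lower band; the source is ambiguous.
    band = "low" if speed_kmh <= SPEED_THRESHOLD_KMH else "high"
    return UTTERANCE_SECONDS[(user_class, band)]
```

In every class the higher speed band shortens the response by one second, while the ordering between classes is preserved.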

[0061] Embodiment 3. Next, the speech recognition device according to Embodiment 3 will be described with reference to the accompanying drawings. FIG. 8 is a configuration diagram of the speech recognition device 9C according to Embodiment 3. It differs from the speech recognition device 9A according to Embodiment 1 shown in FIG. 1 in that the noise inside and outside the vehicle picked up by the microphone 1 is supplied to the speech synthesis control unit 4a of the voice switching unit 4C. Here, the microphone 1 forms part of the environment detecting means.

[0062] As a result, noise level determining means (not shown) in the speech synthesis control unit 4a detects the noise level from the input acoustic signal, and the loudness of the utterance among the speech synthesis parameters is changed according to this level. The loudness is changed by setting the parameter quantitatively, such as "80 dBA", "85 dBA", or "90 dBA", for each user and for each noise level.

[0063] The operation of Embodiment 3 will be described with reference to the flowchart of FIG. 9. This flowchart differs from the flowchart of FIG. 6, which describes the operation of Embodiment 2, in that, after the speech recognition processing in step S5, the noise level inside and outside the vehicle is determined in step S6b from the acoustic signal taken in through the microphone 1. Speaker discrimination is then performed in step S7.

[0064] As an example of how the speech recognition device 9C of Embodiment 3 changes the voice response for each speaker class, FIG. 10 shows the speech synthesis parameters, including the synthesized voice level (loudness), when the noise level in the vehicle is 75 dBA or less and when it is 75 dBA or more, separately for adult males, females, and elderly males.

[0065] For example, when the noise level is 75 dBA or less, the synthesized voice level is 80 dBA for standard adult males (under 65) and for females, and 85 dBA for elderly males (65 or over).

[0066] When the noise level is 75 dBA or more, the synthesized voice level is 85 dBA for standard adult males (under 65) and for females, and 90 dBA for elderly males (65 or over). As these values show, the synthesized voice level is naturally raised as the noise level rises; for elderly males in particular, the response becomes harder to hear as the noise level rises, so the synthesized voice level should be raised further.
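The four example levels in paragraphs [0065] and [0066] happen to fit a simple additive rule: a base of 80 dBA, plus 5 dB in the noisier band, plus 5 dB for elderly males. The Python sketch below encodes that observation. It is one compact reading of the example values, not a method stated in the specification; the class keys and function name are hypothetical, and treating exactly 75 dBA as the quieter band is an assumption, since the source says both "75 dBA or less" and "75 dBA or more".

```python
NOISE_THRESHOLD_DBA = 75.0
BASE_LEVEL_DBA = 80
STEP_DB = 5

KNOWN_CLASSES = {"adult_male", "female", "elderly_male"}


def synthesized_level_dba(user_class: str, noise_dba: float) -> int:
    """Return the synthesized voice level for a user class and cabin noise level."""
    if user_class not in KNOWN_CLASSES:
        raise ValueError(f"unknown user class: {user_class}")
    level = BASE_LEVEL_DBA
    # ASSUMPTION: exactly 75 dBA counts as the quieter band; the source is ambiguous.
    if noise_dba > NOISE_THRESHOLD_DBA:
        level += STEP_DB
    if user_class == "elderly_male":
        level += STEP_DB
    return level
```

A plain lookup table keyed by class and noise band would serve equally well and would be easier to adjust if the values in FIG. 10 did not follow this pattern.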

[0067] In addition, when noise with a large proportion of low-frequency components is present, such as road noise and engine sound produced by a running automobile, a higher-pitched voice tends to be easier to hear.

[0068] Embodiment 4. In the description of Embodiments 1 to 3 above, the speaker discriminating unit 3 was provided separately from the speech recognition unit 2, as shown for example in FIG. 1.

[0069] However, there is a so-called multi-template speech recognition method in which templates of phoneme data for speech recognition (the standard phoneme data referenced when speech recognition is performed) are stored in advance, one for each speaker pattern, in a memory (not shown) in the speech recognition unit, and the template is swapped according to the result of speaker discrimination and used for speech recognition, thereby improving the recognition rate. In such a multi-template method, the uttered voice comparison unit 3d of the speaker discriminating unit 3 is included in the speech recognition unit 2.

[0070] In the description of each embodiment, users were classified into three classes: males under 65, males aged 65 or over, and females. A finer classification is possible, depending on the performance of the speaker discriminating unit.

【００７１】[0071]

[Effects of the Invention] According to the invention of claim 1, the device comprises voice collecting means for collecting voice information of a speaker who is the target of speech recognition, pattern storing means for storing the physical attributes of speakers, as characterized by voice information, classified into a plurality of attribute patterns, pattern determining means for determining which of the stored attribute patterns the collected voice information belongs to, sound quality determining means for determining the sound quality of the uttered sound according to the physical attribute included in the determined attribute pattern, and voice synthesizing means for synthesizing the voice uttered to the speaker so as to reflect the determined sound quality. Utterance guidance voice and recognition-result responses that are easy for the speaker to hear and give a comfortable feel of operation can therefore be obtained by speech synthesis.

[0072] According to the invention of claim 2, the sound quality determining means includes environment detecting means for detecting the surrounding environment that psychologically affects the speaker, and reflects the detected surrounding environment in the sound quality of the uttered sound. Utterance guidance voice and recognition-result responses that are easy for the speaker to hear and give a comfortable feel of operation can therefore be obtained by speech synthesis, even when the surrounding environment deteriorates.

[0073] According to the invention of claim 3, the environment detecting means is speed detecting means for detecting the travel speed of the moving body carrying the speaker, so that the speaker can hear the necessary voice without missing the timing of an operation.

[0074] According to the invention of claim 4, the environment detecting means is noise detecting means for detecting noise inside and outside the moving body carrying the speaker, so that the speaker can hear the voice with high intelligibility regardless of the surrounding noise.

[0075] According to the invention of claim 5, the sound quality determining means determines the sound quality and the utterance content of the uttered sound according to the physical attribute, so that the speaker can hear a voice that does not feel unnatural.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of Embodiment 1 of the speech recognition device of the invention and its peripheral devices.

FIG. 2 is a flowchart showing the operation of Embodiment 1 of the speech recognition device of the invention.

FIG. 3 is a diagram explaining how the utterance conditions of the recognition-result response are changed for each user class in Embodiment 1 of the speech recognition device of the invention.

FIG. 4 is a diagram explaining how the utterance content of the recognition-result response is changed for each user class in Embodiment 1 of the speech recognition device of the invention.

FIG. 5 is a block diagram showing the configuration of Embodiment 2 of the speech recognition device of the invention and its peripheral devices.

FIG. 6 is a flowchart showing the operation of Embodiment 2 of the speech recognition device of the invention.

FIG. 7 is a diagram explaining how the speech rate of the recognition-result response is changed for each user class in Embodiment 2 of the speech recognition device of the invention.

FIG. 8 is a block diagram showing the configuration of Embodiment 3 of the speech recognition device of the invention and its peripheral devices.

FIG. 9 is a flowchart showing the operation of Embodiment 3 of the speech recognition device of the invention.

FIG. 10 is a diagram explaining an example of optimizing the speech synthesis parameters for each user at each noise level in Embodiment 3 of the speech recognition device of the invention.

FIG. 11 is a block diagram showing the configuration of a conventional speech recognition device and its peripheral devices.

FIG. 12 is a block diagram showing a configuration example of a navigation device to which a speech recognition device is connected.

FIG. 13 is a flowchart showing the operation of the conventional speech recognition device.

FIG. 14 is a diagram explaining the sequence of switch operation, utterance, and response voice in the conventional speech recognition device.

[Explanation of symbols]

1: microphone; 2: speech recognition unit; 3: speaker discriminating unit; 4A to 4C: voice switching units; 5: speech synthesis unit; 6: amplifier; 7: speaker; 8: utterance switch; 9A to 9C: speech recognition devices; 10: navigation device; 11: speed detecting means.

Continued from the front page. (51) Int.Cl.⁷, identification symbol, FI, theme code (reference): G10L 3/00 571H

Claims

[Claims]

1. A speech recognition apparatus comprising: voice collecting means for collecting voice information of a speaker who is the target of speech recognition; pattern storing means for storing the physical attributes of speakers, as characterized by voice information, classified into a plurality of attribute patterns; pattern determining means for determining which of the stored attribute patterns the collected voice information belongs to; sound quality determining means for determining the sound quality of the uttered sound according to the physical attribute included in the determined attribute pattern; and voice synthesizing means for synthesizing the voice uttered to the speaker so as to reflect the determined sound quality.

2. The speech recognition apparatus according to claim 1, wherein said sound quality determining means includes environment detecting means for detecting an ambient environment that psychologically affects the speaker, and reflects the detected ambient environment in the sound quality of the uttered sound.

3. The speech recognition apparatus according to claim 2, wherein said environment detecting means is a speed detecting means for detecting a traveling speed of a moving body carrying a speaker.

4. The speech recognition apparatus according to claim 2, wherein said environment detecting means is a noise detecting means for detecting noise inside and outside a moving body carrying a speaker.

5. The speech recognition apparatus according to claim 1, wherein said sound quality determination means determines the sound quality and utterance content of the uttered sound according to physical attributes.