JP3681111B2

JP3681111B2 - Speech synthesis apparatus, speech synthesis method, and speech synthesis program

Info

Publication number: JP3681111B2
Application number: JP2001107849A
Authority: JP
Inventors: 俊夫赤羽
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2001-04-05
Filing date: 2001-04-05
Publication date: 2005-08-10
Anticipated expiration: 2021-04-05
Also published as: JP2002304186A

Abstract

PROBLEM TO BE SOLVED: To provide a speech synthesizer that generates distinctive, dialect-like accents from a language dictionary based on the standard language. SOLUTION: The synthesizer has text processing means 2, which outputs phoneme string information and prosodic information from text data; a language dictionary 1, which is referred to by the means 2; prosody conversion means 4, which converts the prosodic information based on a conversion rule; and rule-based speech synthesis means 3, which synthesizes speech waveforms from the phoneme string information and the converted prosodic information. The means 4 has an accent conversion rule storage section 401, an accent conversion section 402, a phoneme string conversion rule storage section 403, and a phoneme string conversion section 404. The section 402 converts the input accent type information into different accent type information in accordance with the conversion rule stored in the section 401. The section 404 converts the input phoneme string information into a different phoneme string in accordance with the conversion rule stored in the section 403.

Description

【０００１】
【発明の属する技術分野】
本発明は、文字列または記号列からなるテキストデータを入力して音声データに変換する音声合成装置、音声合成方法および音声合成プログラムに関する。
【０００２】
【従来の技術】
音声合成技術においては、単に情報を正確に伝えるだけではなく、ユーザーの好みに応じてカスタマイズを行ったり、状況に応じて音声を切り替えたり、またはゲーム等の娯楽において複数の登場人物を会話させたりするために、複数の個性のある音声を合成することが求められる。
【０００３】
音声を決定する主な要因としては、声質（太い、細い、ハスキーである等）、全体的（平均的）な声の高さ、発声速度、ポーズ（文節間の間（長さ）の取り方）、アクセント（単語、文節等、短い単位内の音の高低）、フレーズピッチ成分（文全体の音の高低（フレーズ成分の強度）、一般には最初が高く、徐々に下がる）等が挙げられる。このうち、声質以外の部分を一般に韻律と呼ぶ。韻律の中でも特に、アクセントは声の個性に対して、声質と並んで影響が大きい。
【０００４】
従来の音声を合成する方法（特開平１−３２１４９６４２１６号公報、特開平９−２４４６７７号公報）では、図７に示すように、テキスト処理手段２が参照する言語辞書１に文章を解析するための形態素情報記憶部１０１、標準的なアクセントを与えるための標準語アクセント記憶部１０２のデータの他に、方言アクセント記憶部１０３を設けて、そこからアクセントデータを選択することにより、方言の音声を合成していた。
【０００５】
【発明が解決しようとする課題】
しかしながら、上述した従来の音声合成方法は、実際にアクセント辞書を作成する困難さや、開発コストのために実現されていない。その理由を以下に説明する。
【０００６】
従来の日本語音声合成装置において、単語毎のアクセントを決定する言語辞書は、一般に、出版されている標準語のアクセント辞書（例えばＮＨＫ「日本語発音アクセント辞典」）を基に作成しているため、標準語アクセントにより音声を合成する。方言等の標準語以外のアクセントで正確に発声させるためには、数万単語に及ぶアクセント辞書を作成する必要がある。
【０００７】
しかし、方言と言っても実際には細かな地域や話者により各々異なり、方言を一つに特定して整理するのは非常に困難である。このため、正確な方言アクセント辞書を作成するためには、大阪弁のような一つの地方に限ったとしても、膨大な期間と費用がかかり、実現は困難である。また、辞書を作成できた場合でも、多数の単語に方言アクセントを付加するとメモリの増加につながり、さらに方言の種類を増やすこともメモリの増加につながるため、実現は困難である。さらに、コストをかけて作成しても、効果を評価しにくいことも実現されない理由の一つと考えられる。
【０００８】
例えば、「大阪・東京アクセント音声辞典：杉籐美代子、１９９５年」において、同じ大阪市出身の話者でも各単語に対するアクセントが異なることは、正しい大阪弁というのは存在せず、大阪弁らしさのような傾向があるに過ぎないことを示す。このことは、逆に、ある一部地域の方言アクセントを正確に再現しても、聞く人によって評価が異なるため、それを正しいと評価するユーザーは非常に少ないということになり得る。
【０００９】
さらに、一部の単語のみについて、方言アクセントを調査して辞書に付加しても、その他の単語を標準語アクセントで発声した場合には、発話の途中で方言の様式が急激に変化するためにユーザーが混乱し、文章の意味が理解できない等の障害が生じるおそれがある。
【００１０】
具体的なアクセント型の形状については後述するが、一般に、方言のアクセントは従来の標準語アクセントにおける０型からｎ型のアクセント型のみを用いたのでは再現することができず、標準語には無い新たなアクセント型を必要とする。一例として先に挙げた「大阪・東京アクセント辞典」によれば、大阪弁アクセントは、標準語の０型からｎ型とは異なる、高起の０型からｎ型および低起の０型からｎ型というアクセント型を用いなければ表現することができない。他方、これらのアクセント型は関西弁に共通するものであるため、これらのアクセント型を用いれば、アクセント型が特定の地域の方言と異なっていても、ある程度の関西弁らしさが表現できることになる。
【００１１】
本発明は、このような従来技術の課題を解決するべくなされたものであり、方言アクセント辞書が不完全であったり、方言アクセント辞書が無く、標準語を基にした言語辞書からでも、方言らしい個性のあるアクセントを生成することができる音声合成装置、音声合成方法および音声合成プログラムを提供することを目的とする。
【００１２】
【課題を解決するための手段】
本発明の音声合成装置は、単語毎の形態素情報および単語毎の標準語のアクセント情報が記憶された言語辞書と、入力されたテキストデータを前記言語辞書を参照して解析して、文節毎に、音素列情報と標準語における文節のアクセント核の位置を示すアクセント型を含む韻律情報とを出力するテキスト処理手段と、該テキスト処理手段から入力された韻律情報を変換規則に基づいて異なる韻律情報に変換する韻律変換手段と、該音素列情報および変換された韻律情報から音声波形を合成する規則音声合成手段とを有し、前記韻律変換手段は、アクセント変換規則記憶部と、アクセント変換部とを有し、該アクセント変換規則記憶部には、標準語における前記アクセント型から、それぞれの方言に特徴的な高起アクセント型または低起アクセント型への変換を行う変換規則が記憶されており、前記アクセント変換部では、入力された方言選択パラメータに基づいて該アクセント変換規則記憶部に記憶された前記変換規則を選択し、該変換規則に基づいて、入力されるテキストデータにおける全ての文節に対して、標準語における前記アクセント型から、前記方言選択パラメータに対応した方言に特徴的な高起アクセント型または低起アクセント型への変換を行うことを特徴とし、そのことにより上記目的が達成される。
【００１４】
前記韻律変換手段は、さらに、音素列変換規則記憶部と、音素列変換部とを有し、該音素列変換部では、入力された規則選択パラメータに基づいて該音素列変換規則記憶部に記憶された変換規則を選択し、選択された変換規則に従って、前記テキスト処理手段から入力される音素列情報を、入力された音素列とは異なる音素列に変換することを特徴としてもよい。
【００１６】
前記韻律変換手段は、前記テキスト処理手段の出力を、前記アクセント変換部と前記音素列変換部とをバイパスさせて出力するバイパススイッチを有し、前記言語辞書には、一部の単語に、方言のアクセント型が予め記憶されており、入力されるテキストデータに含まれる単語に対して前記方言のアクセント型が記憶されている場合には、前記テキスト処理手段が、前記言語辞書に保持されている方言のアクセント型を選択して出力するとともに、前記韻律変換手段は、該テキスト処理手段の出力を、前記バイパススイッチによって、前記アクセント変換部と前記音素列変換部とをバイパスさせることを特徴としてもよい。
【００１７】
本発明の音声合成方法は、前記音声合成装置を用いて音声合成を行う方法であって、前記該韻律変換手段の前記アクセント変換部により、入力された方言選択パラメータに基づいて該アクセント変換規則記憶部に記憶された前記変換規則を選択し、該変換規則に基づいて、入力されるテキストデータにおける全ての文節に対して、標準語における前記アクセント型から、前記方言選択パラメータに対応した方言に特徴的な高起アクセント型または低起アクセント型への変換を行うことを特徴とする。
【００１８】
前記韻律変換手段は、前記テキスト処理手段の出力を、前記アクセント変換部と前記音素列変換部とをバイパスさせて出力するバイパススイッチを有し、前記言語辞書には、一部の単語に、方言のアクセント型が予め記憶されており、入力されるテキストデータに含まれる単語に対して前記方言のアクセント型が記憶されている場合には、前記テキスト処理手段が、前記言語辞書に保持されている方言のアクセント型を選択して出力するとともに、前記韻律変換手段は、該テキスト処理手段の出力を、前記バイパススイッチによって、前記アクセント変換部と前記音素列変換部とをバイパスさせることを特徴としてもよい。
【００１９】
本発明の音声合成プログラムは、前記音声合成装置を用いて音声合成を行う方法を実行させるための音声合成プログラムであって、前記該韻律変換手段の前記アクセント変換部により、入力された方言選択パラメータに基づいて該アクセント変換規則記憶部に記憶された前記変換規則を選択し、該変換規則に基づいて、入力されるテキストデータにおける全ての文節に対して、標準語における前記アクセント型から、前記方言選択パラメータに対応した方言に特徴的な高起アクセント型または低起アクセント型への変換を行うことを特徴とする。
【００２０】
前記韻律変換手段は、前記テキスト処理手段の出力を、前記アクセント変換部と前記音素列変換部とをバイパスさせて出力するバイパススイッチを有し、前記言語辞書には、一部の単語に、方言のアクセント型が予め記憶されており、入力されるテキストデータに含まれる単語に対して前記方言のアクセント型が記憶されている場合には、前記テキスト処理手段が、前記言語辞書に保持されている方言のアクセント型を選択して出力するとともに、前記韻律変換手段は、該テキスト処理手段の出力を、前記バイパススイッチによって、前記アクセント変換部と前記音素列変換部とをバイパスさせることを特徴としてもよい。
【００２１】
以下に、本発明の作用について説明する。
【００２２】
本発明にあっては、標準語に基づいて解析されたアクセント型（または音素列）を変換規則に基づいて変換することができるため、方言アクセント辞書を作成する必要がなく、従来開発コストがかかりすぎて実現が困難であった、異なる方言アクセントによる音声合成が可能となる。さらに、メモリの増加量もほとんど無いため、複数の方言など、多様な音声合成を１つの装置により行うことが可能となる。
【００２３】
また、アクセント変換部において、変換規則に従って、標準語の０型からｎ型までの整数型アクセントから、関西弁に特徴的な高起アクセント型の０型からｎ型または低起アクセント型の０型からｎ型に変換を行うことにより、テキスト処理手段が標準語の言語辞書に基づく韻律情報を出力しても、関西弁風のアクセントで音声を合成することが可能となる。
【００２４】
さらに、標準語を基本とする言語辞書の一部の単語において、方言の韻律（アクセント）が登録されている場合には、それらの単語に関しては辞書に登録された正確な韻律を用いて音声合成を行い、調査されていない単語については変換規則を用いてその地方の方言風のアクセント（または音素列）に変換する。これにより、より自然な方言に近づくことが可能となり、方言アクセントと標準語アクセントが混ざることによって音声の聴取者が混乱するのを避けることができる。
【００２５】
音声合成を行うアルゴリズムは複雑であり、汎用マイクロコンピューターまたはＤＳＰを用いてソフトウェアにより実現するのが一般的である。本発明の音声合成プログラムによれば、パーソナルコンピューター等のマイクロコンピューターを用いたシステムやＤＳＰを用いたシステムにおいて、方言風の音声合成機能を追加することが可能である。
【００２６】
【発明の実施の形態】
以下に、本発明の実施の形態について、図面を参照しながら説明する。
【００２７】
（実施形態１）
図１は、本発明の一実施形態である音声合成装置の構成を説明するための図である。この音声合成装置は、言語辞書１と、テキスト処理手段２と、規則音声合成手段３と、韻律変換手段４を備えている。
【００２８】
言語辞書１は、単語毎の形態素情報および単語毎のアクセント情報を記憶しており、一般に読み出し専用メモリにより実現される。
【００２９】
テキスト処理手段２は、入力されたテキストデータから言語辞書１を用いて単語を検出して、動詞や名詞等の形態素から文の構造を決定し、文の構造に基づいてポーズの長さと、文章全体の抑揚を制御するフレーズ成分の強さを決定する。また、文節に含まれる自立語のアクセント情報から各文節のアクセント型を決定する。そして、文節毎に音素記号列、アクセント型、フレーズ成分の強さ、ポーズの長さおよび声の大きさの情報等を出力する。一般に、声の大きさは文章内では一定とされる場合が多いが、微視的に見た場合には、音声素片（基本的にはこれを繋ぎ合わせて音声を合成する）を収録したときの声の大きさの変化がそのまま出力される。韻律制御の際には、この声の大きさを文章毎に大きくしたり小さくしたりすることができる。また、音素毎の時間長を統計的に調べて、その長さに対してどの程度速くするか遅くするかということにより発声速度を制御することができる。さらに、文の構造から求められるフレーズ成分と単語各々によって決定されるアクセント成分との加算によって声の高さを制御することができる。それを全体的に高くしたり低くしたりすることにより声の高さを調整し、フレーズ成分やアクセント成分の強さを調整することにより、抑揚の強さを調整することができる。調整をどの程度行うかについては、ユーザーが決定できるようにされることが多い。このテキスト処理手段２は、一般に、ＣＰＵやＤＳＰを用いてソフトウェアにより実現される。
【００３０】
規則音声合成手段３は、入力された情報を基に音声波形を合成する。具体的には、アクセント生成部３０２は、アクセント形状記憶部３０１を参照して、入力された各文節の音素記号列の各音素に対して音の高さを設定する。次に、フレーズ生成部３０４は、フレーズ形状記憶部３０３を参照して、複数の文節にわたる音素記号列の各音素に対して、抑揚を表す音の高さの形状を、フレーズ成分の強さで調整して加算する。フレーズ形状については後述する。次に、波形生成部３０６では、音素毎に波形辞書３０５を参照して、アクセント生成部３０２およびフレーズ生成部３０４で生成された音の高さに応じた音声波形を合成する。この規則音声合成手段３は、一般に、ＣＰＵやＤＳＰを用いてソフトウェアにより実現される。
【００３１】
波形生成の方法としては、各種方式を利用することができるが、ここでは零位相化した単位波形を合成したい声の周波数の逆数で表される間隔（ピッチ）で重畳してゆく波形重畳方式を用いることにより、比較的高品位の音声を合成することが可能である。
【００３２】
図２に、フレーズ形状の一例を示す。フレーズ形状は、人間が文章を発声するときの文全体の声の高さの形状をモデル化したものであり、各文節の先頭で高くなり、文末へ向けて緩やかに下降する形状を有する。また、一般に藤崎モデルと称される「へ」の字型のカーブを用いる場合もある。図２では、対数周波数上で直線的に下降するフレーズ形状を示している。この例では、文節の先頭における声の高さをフレーズ成分の値で与え、文節の最後で定常の高さになるように下降速度を調整する。従って、文節が長くなると、下降速度が遅くなる。
【００３３】
以上は、通常の音声合成装置と同様の機能である。本実施形態では、さらに、韻律変換手段４を設けて、合成される音声の韻律を変更する。韻律とは、一般に、アクセント、フレーズ成分の強度および平均的な声の高さや、発声速度、ポーズ長等を指す。例えば、アクセント型を変換することにより、方言風の音声合成を行うことが可能である。また、アクセントの強さ、フレーズの強さ、声の大きさを文末に向けて徐々に大きくすることにより、演説風に誇張した音声を合成する、といった操作を行うことも可能である。さらに、韻律だけではなく、音素を変換することにより、方言や年齢に特有の言葉の変換も可能である。この韻律変換手段４は、一般にＣＰＵやＤＳＰを用いてソフトウェアにより実現される。
【００３４】
以下に、韻律変換手段４について、図１を参照しながらさらに詳細な説明を行う。韻律変換手段４の内部には、アクセント変換規則記憶部４０１およびアクセント変換部４０２を有する。アクセント変換部４０２は、方言選択パラメータに基づいてアクセント変換規則記憶部４０１に記憶された変換規則からアクセント変換規則を選択する。方言選択パラメータは、入力テキスト中に制御コマンドとして特殊記号を用いて埋め込まれるか、またはユーザーが何らかの方法（例えば方言選択パラメータ入力部を設ける等）により入力することができる。通常は、ソフトウェアにより設定画面を用いて男声や女声、声の高さや大きさ、速さ等を設定できるようになっており、その同じ画面上で方言を設定することができる。または別に設定装置を設けることもできる。そして、入力された音素列とアクセント型から、選択したアクセント変換規則に従って、異なるアクセント型へと変換する。具体的なアクセント型変換については、後述する。
【００３５】
さらに、韻律変換手段４の内部に、音素列変換規則記憶部４０３および音素列変換部４０４を設けてもよい。この場合、アクセントと同様に、規則選択パラメータに基づいて、音素列変換規則記憶部４０３に記憶された変換規則から音素列変換規則を選択する。規則選択パラメータについても、方言選択パラメータと同様に、入力テキスト中に制御コマンドとして特殊記号を用いて埋め込まれるか、またはユーザーが何らかの方法（例えば方言選択パラメータ入力部を設ける等）により入力することができる。そして、入力された音素列とアクセント型から、選択した音素列変換規則に従って、異なる音素列へと変換する。
【００３６】
まず、音素列変換規則の具体例について説明する。例えば、子供っぽい音声に変換するための規則の一例としては、
（１）「し」を「ち」に変換
（２）「す」、「つ」を「ちゅ」に変換
（３）「ら」を「だ」に変換
（４）「ぞ」を「ど」に変換
等が挙げられる。
【００３７】
この他にも、一部の方言における「ひ」と「し」が入れ替わる規則や、別の方言における「し」が「す」と発音される規則等を記憶しておき、目的に応じて規則選択パラメータを用いて選択することにより、各種の変換が可能となる。これらの変換規則の選び方は、文節毎に変るものではないので、音声合成の前に予めどの変換規則を適用するかを規則選択パラメータに設定しておくことにより実現することができる。
【００３８】
次に、具体的な日本語の方言アクセント変換への変換方法について、説明する。図３は、本実施形態で用いる具体的なアクセント型を示す図である。図３（Ａ）は、従来の音声合成でも用いられる標準語アクセントの形状を、アクセント型とモーラ数毎に表示したものである。ここで、モーラとは、音声の拍を表す単位であり、日本語の場合カナ文字に相当する。一般に、アクセントはモーラ毎の高低２段階で表される。０型は平板型とも称され、１モーラを除いて低音「Ｌ」で始まり、２モーラ目以後、最後まで高音「Ｈ」が続く。なお、（）で示されるのは、単語に付属語が続く場合における付属語の音高を示している。例えば、「わたし」は標準語では３モーラで０型であるので「ＬＨＨ」、付属語「は」が続いて「わたしは」となった場合には「ＬＨＨＨ」となる。１型は１モーラ目が高く、続くモーラは低音が並ぶ。２以上の型はアクセント型の数字がアクセント核の位置を表し、第２モーラからアクセント核までが高く、アクセント核の次のモーラから後ろが低音となる。例えば、「はなし」は３モーラで３型であるため「ＬＨＨ」となり、単独では０型と同じであるが、付属語「は」が続いて「はなしは」となった場合には「ＬＨＨＬ」となる。
【００３９】
図３（Ｂ）および図３（Ｃ）は各々関西弁に特徴的な高起アクセント型と低起アクセント型の一例を示す図である。図３（Ｂ）の高起アクセント型では、語頭のモーラが必ず高音「Ｈ」で始まる。また、アクセント型が１で１モーラの場合には、モーラ内で高音から低音に下がり、継続長が長くなる。これを図３（Ｂ）では「Ｄ」と表している。図３（Ｃ）の低起アクセント型では、０型の場合、最後のモーラのみが高音となる。１型は１モーラを除いて標準型や高起アクセント型と同じである。また、アクセント型が１型で１モーラの場合には、モーラ内で低音から高音に上がり、継続時間が長くなる。これを図３（Ｃ）では「Ｕ」と表している。２型以上では低音で始まってアクセント核の位置のみが高音となり、以後は低音になる。
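The high-start (高起) and low-start (低起) shapes described in this paragraph can be sketched as a small function that returns one level per mora. This is only an illustration: the function name and the "high"/"low" labels are hypothetical, and because FIG. 3 is not reproduced here, the high-start shapes beyond the first mora (high through the accent nucleus, all high for type 0) are an assumption rather than something the text states.

```python
def kansai_pattern(kind: str, n_moras: int, accent: int) -> str:
    """Return an H/L string (one symbol per mora) for a Kansai-style accent.

    kind is "high" (high-start) or "low" (low-start). "D" and "U" mark the
    one-mora type-1 cases that fall or rise within the mora.
    """
    if n_moras < 1 or not 0 <= accent <= n_moras:
        raise ValueError("accent type must satisfy 0 <= accent <= n_moras")
    if kind == "high":
        if n_moras == 1 and accent == 1:
            return "D"  # falls from high to low inside the single mora
        if accent == 0:
            return "H" * n_moras  # ASSUMPTION: type 0 stays high throughout
        # ASSUMPTION: high from the first mora through the nucleus, then low
        return "H" * accent + "L" * (n_moras - accent)
    if kind == "low":
        if accent == 0:
            return "L" * (n_moras - 1) + "H"  # only the last mora is high
        if accent == 1:
            # one mora: rises inside the mora; otherwise same as standard type 1
            return "U" if n_moras == 1 else "H" + "L" * (n_moras - 1)
        # type 2 and above: only the nucleus position is high
        return "L" * (accent - 1) + "H" + "L" * (n_moras - accent)
    raise ValueError("kind must be 'high' or 'low'")
```

For instance, `kansai_pattern("low", 7, 4)` gives the contour with a single high mora at position 4, and `kansai_pattern("low", 5, 0)` gives a contour that is high only on the final mora.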
【００４０】
本実施形態では、規則音声合成手段３内のアクセント形状記憶部３０１でこれらのアクセント形状を記憶し、韻律変換手段４内のアクセント変換部４０２でこれらの中からアクセント型を選択する。従来では、アクセント型の指定には０型からｎ型までの整数値で指定が可能であったが、本実施形態ではこれとは別の整数値を用いて標準型、高起型または低起型というアクセント種別を指定する。従って、文節毎の韻律変換手段４からは、音素記号列、アクセント型、フレーズ成分の強さ、ポーズの長さ、声の大きさおよびアクセント種別の情報を出力する。
【００４１】
関西弁以外のアクセント型を実現する際に、上述したような型で表現できない場合には、さらにアクセント種別を拡張して、アクセント型を追加することも考えられる。また、方言選択パラメータに基づいてアクセント変換規則を選択することにより、１つの装置で複数の方言アクセント型を切り替えて音声合成することが可能になる。方言選択パラメータは設定画面を用いて設定することができ、例えば整数で０が標準語、１が関西弁、２が鹿児島弁等というように設定することができる。
【００４２】
次に、関西弁風の音声を合成する際のアクセント変換規則の具体例を示す。例えば、
（１）２モーラの文節は０型を１型に、１型を０型に変換する
（２）３モーラ以上の１型文節で、第１母音と第２母音が異なる場合には２型に変換する
（３）２型アクセントは低起型にする
（４）上記（３）以外で（アクセント型≧モーラ数）であれば高起型に変換する
（５）上記（４）以外で２モーラであれば低起型に変換する
（６）上記（５）以外でモーラ数が奇数であれば低起型、偶数であれば高起型に変換する
といった規則の組を関西弁風アクセント変換規則として、アクセント変換規則記憶部４０１に記憶する。この変換規則（３）において、２型アクセントは低起型でも標準型でも同じ形状であるので、どちらを選んでもよい。
【００４３】
例文を用いて、この規則を用いた変換例を示す。例えば、
「テレビゲームやパソコンでゲームをして遊ぶ」
という例文をテキスト処理手段２へ入力した場合を考える。テキスト処理手段によって、「テレビゲームや・パソコンで・ゲームを・して・あそぶ」
という文節に区切られたとする。ここでは、文節区切りを中点（・）で表すことにする。このとき、各文節の標準語アクセント型は、
「４・０・１・０・０」
となる。このアクセントを音の高低（Ｈ：高音、Ｌ：低音）で表すと、
「ＬＨＨＨＬＬＬ・ＬＨＨＨＨ・ＨＬＬＬ・ＬＨ・ＬＨＨ」
のようになる。上記変換規則により各文節のアクセント型は、
「低起４・低起０・高起１・高起１・低起３」
のように変換される。このアクセントを音の高低で表すと、
「ＬＬＬＨＬＬＬ・ＬＬＬＬＨ・ＨＬＬＬ・ＨＬ・ＬＬＨ」
のようになる。これにより、規則を用いて概ね関西弁によくあるアクセントに変換できることが分かる。
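Rules (1) to (6) above can be sketched as a single function, reading them literally and in order: rules (1) and (2) first adjust the accent nucleus position, then rules (3) to (6) pick the high-start or low-start class. The function name, argument names, and the "high"/"low" labels are hypothetical, and the vowels of the first two morae are passed in directly rather than derived from a phoneme string. Under this literal reading the first three phrases of the example come out as 低起４, 低起０ and 高起１; for 「して」 and 「あそぶ」 the sketch yields low-start 1 and low-start 0, which differ in label from the 高起１ and 低起３ listed in the example, although by the shape descriptions given earlier they appear to denote the same H/L contours.

```python
def kansai_convert(n_moras: int, accent: int, vowel1: str = "", vowel2: str = ""):
    """Map a standard accent type to (kind, accent) following rules (1)-(6).

    vowel1 and vowel2 are the vowels of the first and second mora; they are
    only consulted by rule (2).
    """
    # Rule (1): two-mora phrases swap type 0 and type 1.
    if n_moras == 2:
        if accent == 0:
            accent = 1
        elif accent == 1:
            accent = 0
    # Rule (2): type 1 with three or more morae and differing first vowels -> type 2.
    elif n_moras >= 3 and accent == 1 and vowel1 != vowel2:
        accent = 2

    # Rules (3)-(6): choose the high-start or low-start class.
    if accent == 2:                 # rule (3)
        kind = "low"
    elif accent >= n_moras:         # rule (4)
        kind = "high"
    elif n_moras == 2:              # rule (5)
        kind = "low"
    elif n_moras % 2 == 1:          # rule (6), odd mora count
        kind = "low"
    else:                           # rule (6), even mora count
        kind = "high"
    return kind, accent


# The example sentence: (mora count, standard accent type, vowel 1, vowel 2)
example = [(7, 4, "e", "e"), (5, 0, "a", "o"), (4, 1, "e", "e"),
           (2, 0, "i", "e"), (3, 0, "a", "o")]
converted = [kansai_convert(*p) for p in example]
```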
【００４４】
本実施形態では、変換規則により変換を行うことにより、方言のアクセント辞書を必要としないため、少ないコストで読み上げ音声のバリエーションを広げることができる。
【００４５】
（実施形態２）
図４は、実施形態２の音声合成装置の構成を説明するための図である。この音声合成装置は、図１の構成に加えて、韻律変換手段４の内部にバイパススイッチ４０５を有している。そして、テキスト処理手段２により非変換を指示された場合には、アクセント変換部４０２によるアクセント変換および音素列変換部４０４による音素列変換を行わずに出力する。この機能が効果を発揮するのは、言語辞書１が図７に示したように方言アクセント記憶部１０３を有し、一部の重要語および頻出単語についてのみ正しいアクセントを記憶している場合である。
【００４６】
この構成では、方言選択パラメータとして、テキスト処理手段２とアクセント変換部４０２に同じパラメータを入力し、アクセントを統一するのが望ましい。
【００４７】
例えば、テキスト処理手段２において、方言アクセントで出力しようとする単語に対して、方言選択パラメータによって指定された方言アクセントが言語辞書１内に存在する場合には韻律変換指示コードとして０を出力し、指定された方言アクセントが言語辞書１内に存在しない場合には韻律変換指示コードとして１を出力する。
【００４８】
韻律変換手段４内のバイパススイッチ４０５は、韻律変換指示コードが１のとき、アクセント変換部４０２および音素列変換部４０４の方に情報を出力し、韻律変換指示コードが０のときには、変換せずにそのまま規則音声合成手段３へ情報を出力する。
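The interplay between the prosody conversion instruction code (韻律変換指示コード) and the bypass switch 405 can be sketched as follows. All names here (`Phrase`, `analyse`, `bypass_switch`, the dictionary layout) are hypothetical, and the sketch simplifies the source by looking up a whole phrase in the dialect dictionary rather than individual words.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Phrase:
    phonemes: str
    accent_type: int
    accent_kind: str   # "standard", "high" or "low"
    convert_code: int  # 1: send to the converters, 0: bypass them


def analyse(phonemes: str, std_accent: int,
            dialect_dict: Dict[str, Tuple[str, int]]) -> Phrase:
    """Text-processing side: emit code 0 when a dialect accent is registered."""
    entry = dialect_dict.get(phonemes)
    if entry is not None:
        kind, accent = entry
        return Phrase(phonemes, accent, kind, 0)
    return Phrase(phonemes, std_accent, "standard", 1)


def bypass_switch(phrase: Phrase,
                  converter: Callable[[Phrase], Phrase]) -> Phrase:
    """Prosody-conversion side: convert only when the code is 1."""
    return converter(phrase) if phrase.convert_code == 1 else phrase
```

A registered phrase therefore reaches the synthesizer with its dictionary accent untouched, while every other phrase goes through the rule-based conversion, so the two sources of accent are not mixed with standard-language accents.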
【００４９】
本実施形態では、文節毎のテキスト処理手段２からの出力および文節毎の韻律変換手段４からの出力は、音素記号列、アクセント型、フレーズ成分の強さ、ポーズの長さ、声の大きさおよびアクセントの種別の情報となる。さらに、韻律変換手段４からは韻律変換指示コードの情報も出力される。なお、アクセントの種別はテキスト処理手段２および韻律変換手段４から出力され、韻律変換指示コードにより変換が指示されていない場合にはテキスト処理手段からのアクセント種別が優先される。また、韻律変換指示コードにより変換が指示されている場合にはテキスト処理手段からのアクセント種別は無視され、韻律変換手段によりアクセント種別が決定される。
【００５０】
バイパススイッチ４０５の操作は、テキスト処理手段２から指示される韻律変換指示コードに従って韻律変換手段４が行ってもよく、アクセント変換部４０２自身が行ってもよい。また、標準アクセントで音声合成する場合にも、バイパススイッチ４０５を用いて、アクセント変換および音素列変換を行わない。
【００５１】
（実施形態３）
図５は、本発明の音声合成装置をコンピューターを用いて実現するための構成例を示す図である。入力装置５は、シリアル通信、ネットワーク通信またはキーボード等により音声合成対象となるテキストデータを入力する。
【００５２】
記憶媒体８は、実施形態１に示した言語辞書１、テキスト処理手段２、規則音声合成手段３および韻律変換手段４の各機能からなる音声合成プログラムを記録したＣＤＲＯＭやフロッピーディスク、フラッシュメモリ等の記録媒体である。記憶装置９は、記憶媒体８のプログラムを読み込んだハードディスクやフラッシュメモリ等の記憶装置である。ＲＡＭ１０は、処理に必要な一次記憶に用いられる。
【００５３】
出力装置７は、合成された音声を出力するためのＤ／Ａ変換器、アンプやスピーカ等で構成される。
【００５４】
処理装置６は、記憶装置９またはＲＡＭ１０に読み込まれた音声合成プログラムに従って、音声合成処理を行うマイクロコンピューターやＤＳＰ等の処理装置である。
【００５５】
図６は、関西弁風アクセント変換規則をアルゴリズムで実現した場合のフローチャートを示す。このアルゴリズム自体は、上記実施形態１と同様の処理動作を行う。ステップ１１０１〜ステップ１１０７までがアクセント核の位置を変換する処理であり、ステップ１１０８〜ステップ１１１４までが高起型または低起型を選択する処理である。ステップ１１０１〜ステップ１１０５までが上記変換規則（１）に相当し、ステップ１１０６およびステップ１１０７が上記変換規則（２）に相当する。ステップ１１０８およびステップ１１０９が上記変換規則（３）に相当し、ステップ１１１０およびステップ１１１１が上記変換規則（４）に相当する。ステップ１１１２〜ステップ１１１４が上記変換規則（５）および（６）に相当する。
【００５６】
上記実施形態２についても、図５と同様の構成により実現することが可能である。
【００５７】
【発明の効果】
以上詳述したように、本発明によれば、標準語アクセントに基づいて解析されて付与されたアクセント型を、変換規則に基づいて変換するため、特別なアクセント辞書を作成する必要がなく、従来開発コストがかかりすぎて実現が困難であった方言アクセントでの音声合成が可能となる。さらに、メモリの増加量もほとんど無いため、複数の方言や個人的な特徴のあるアクセント等、多様な音声合成を１つの装置で行うことができる。
【００５８】
また、標準語アクセントから関西弁に特徴的な高起アクセント型や低起アクセント型に変換する変換規則を用いることにより、関西弁風の音声を合成することができる。
【００５９】
さらに、一部の単語において方言のアクセントが言語辞書に登録されている場合でも、方言アクセントと標準語アクセントが混ざって音声の聴取者が混乱するのを防ぐことができる。
【図面の簡単な説明】
【図１】実施形態１の音声合成装置の構成を示す図である。
【図２】フレーズ形状の一例を示す図である。
【図３】（Ａ）〜（Ｃ）は実施形態１におけるアクセント型の一例を示す図である。
【図４】実施形態２の音声合成装置の構成を示す図である。
【図５】実施形態３の音声合成装置をコンピューターを用いて実現する構成を示す図である。
【図６】アクセント変換規則をアルゴリズムで実現するためのフローチャートである。
【図７】従来の音声合成装置の構成を示す図である。
【符号の説明】
１言語辞書
２テキスト処理手段
３規則音声合成手段
４韻律変換手段
５入力装置
６処理装置
７出力装置
８記憶媒体
９記憶装置
１０ＲＡＭ
１０１形態素情報記憶部
１０２標準語アクセント記憶部
１０３方言アクセント記憶部
３０１アクセント形状記憶部
３０２アクセント生成部
３０３フレーズ形状記憶部
３０４フレーズ生成部
３０５波形辞書
３０６波形生成部
４０１アクセント変換規則記憶部
４０２アクセント変換部
４０３音素列変換規則記憶部
４０４音素列変換部
４０５バイパススイッチ
[0001]
[Technical field to which the invention belongs]
The present invention relates to a speech synthesizer, a speech synthesis method, and a speech synthesis program for inputting text data consisting of a character string or a symbol string and converting it into speech data.
[0002]
[Prior art]
Speech synthesis technology is required not only to convey information accurately but also to synthesize multiple voices with distinct personalities, so that output can be customized to the user's preference, the voice can be switched according to the situation, or multiple characters can converse in entertainment such as games.
[0003]
The main factors that determine a voice include voice quality (thick, thin, husky, etc.), overall (average) pitch, speaking rate, pauses (the length of the gaps between phrases), accent (the pitch pattern within a short unit such as a word or phrase), and the phrase pitch component (the pitch contour of the whole sentence, i.e. the strength of the phrase component, which is generally high at the start and falls gradually). Of these, everything other than voice quality is generally called prosody. Within prosody, accent in particular influences the individuality of a voice as strongly as voice quality does.
[0004]
In conventional speech synthesis methods (JP-A-1-3214964216 and JP-A-9-244677), as shown in FIG. 7, the language dictionary 1 referred to by the text processing means 2 holds, in addition to the data of a morpheme information storage unit 101 for analyzing sentences and a standard-language accent storage unit 102 for assigning standard accents, a dialect accent storage unit 103; dialect speech is synthesized by selecting accent data from that unit.
[0005]
[Problems to be solved by the invention]
However, the conventional speech synthesis method described above has not been put into practice, both because an accent dictionary is difficult to create and because of the development cost. The reasons are explained below.
[0006]
In a conventional Japanese speech synthesizer, the language dictionary that determines the accent of each word is generally built from a published standard-language accent dictionary (for example, the NHK “Japanese Pronunciation Accent Dictionary”), so speech is synthesized with standard-language accents. To make the synthesizer speak accurately with a non-standard accent such as a dialect, an accent dictionary covering tens of thousands of words would have to be created.
[0007]
However, what is called a dialect actually varies from one small area to another and from speaker to speaker, and it is very difficult to pin down and organize a dialect as a single entity. Creating an accurate dialect accent dictionary would therefore take enormous time and expense, even if restricted to one region such as the Osaka dialect, and is difficult to realize. Even if such a dictionary could be created, attaching dialect accents to a large number of words increases memory use, and adding more dialects increases it further, which again makes realization difficult. A further likely reason it has not been realized is that, even after the cost is spent, the benefit is hard to evaluate.
[0008]
For example, the “Osaka-Tokyo Accent Speech Dictionary” (Miyoko Sugito, 1995) shows that speakers from the same city of Osaka give different accents to individual words. This indicates that there is no single correct Osaka dialect, only a tendency toward sounding like the Osaka dialect. Conversely, even if the dialect accent of one particular area is reproduced accurately, listeners will evaluate it differently, so very few users may judge it to be correct.
[0009]
Furthermore, if dialect accents are investigated and added to the dictionary for only some words while the remaining words are spoken with standard-language accents, the dialect style changes abruptly in the middle of an utterance; the user may be confused and may fail to understand the meaning of the sentence.
[0010]
The specific shapes of the accent types are described later, but in general a dialect accent cannot be reproduced using only the type 0 to type n accent types of the conventional standard-language accent; new accent types that do not exist in the standard language are needed. According to the “Osaka-Tokyo Accent Dictionary” cited above, for example, the Osaka dialect accent cannot be expressed without accent types that differ from the standard-language types 0 to n, namely high-start types 0 to n and low-start types 0 to n. On the other hand, because these accent types are common to the Kansai dialects, using them conveys a certain degree of Kansai flavor even when the accent type differs from that of a specific regional dialect.
[0011]
The present invention has been made to solve these problems of the prior art. Its object is to provide a speech synthesizer, a speech synthesis method, and a speech synthesis program that can generate distinctive, dialect-like accents from a language dictionary based on the standard language, even when the dialect accent dictionary is incomplete or absent.
[0012]
[Means for Solving the Problems]
The speech synthesizer of the present invention comprises: a language dictionary storing morpheme information for each word and standard-language accent information for each word; text processing means that analyzes input text data with reference to the language dictionary and outputs, for each phrase, phoneme string information and prosodic information including an accent type indicating the position of the accent nucleus of the phrase in the standard language; prosody conversion means that converts the prosodic information input from the text processing means into different prosodic information based on a conversion rule; and rule-based speech synthesis means that synthesizes a speech waveform from the phoneme string information and the converted prosodic information. The prosody conversion means has an accent conversion rule storage unit and an accent conversion unit. The accent conversion rule storage unit stores conversion rules for converting the standard-language accent type into a high-start accent type or a low-start accent type characteristic of each dialect. The accent conversion unit selects a conversion rule stored in the accent conversion rule storage unit based on an input dialect selection parameter and, based on that rule, converts the standard-language accent type of every phrase in the input text data into a high-start or low-start accent type characteristic of the dialect corresponding to the dialect selection parameter. The above object is thereby achieved.
[0014]
The prosody conversion means may further have a phoneme string conversion rule storage unit and a phoneme string conversion unit, in which case the phoneme string conversion unit selects a conversion rule stored in the phoneme string conversion rule storage unit based on an input rule selection parameter and, according to the selected rule, converts the phoneme string information input from the text processing means into a phoneme string different from the input phoneme string.
[0016]
The prosody conversion means may have a bypass switch that passes the output of the text processing means around the accent conversion unit and the phoneme string conversion unit, and the language dictionary may store dialect accent types in advance for some words. When a dialect accent type is stored for a word contained in the input text data, the text processing means selects and outputs the dialect accent type held in the language dictionary, and the prosody conversion means uses the bypass switch to route the output of the text processing means around the accent conversion unit and the phoneme string conversion unit.
[0017]
The speech synthesis method of the present invention is a method of performing speech synthesis using the speech synthesizer described above, in which the accent conversion unit of the prosody conversion means selects a conversion rule stored in the accent conversion rule storage unit based on an input dialect selection parameter and, based on that rule, converts the standard-language accent type of every phrase in the input text data into a high-start or low-start accent type characteristic of the dialect corresponding to the dialect selection parameter.
[0018]
The prosody conversion means may have a bypass switch that passes the output of the text processing means around the accent conversion unit and the phoneme string conversion unit, and the language dictionary may store dialect accent types in advance for some words. When a dialect accent type is stored for a word contained in the input text data, the text processing means selects and outputs the dialect accent type held in the language dictionary, and the prosody conversion means uses the bypass switch to route the output of the text processing means around the accent conversion unit and the phoneme string conversion unit.
[0019]
The speech synthesis program of the present invention is a program for executing the method of performing speech synthesis using the speech synthesizer described above, in which the accent conversion unit of the prosody conversion means selects a conversion rule stored in the accent conversion rule storage unit based on an input dialect selection parameter and, based on that rule, converts the standard-language accent type of every phrase in the input text data into a high-start or low-start accent type characteristic of the dialect corresponding to the dialect selection parameter.
[0020]
The prosody conversion means may have a bypass switch that passes the output of the text processing means around the accent conversion unit and the phoneme string conversion unit, and the language dictionary may store dialect accent types in advance for some words. When a dialect accent type is stored for a word contained in the input text data, the text processing means selects and outputs the dialect accent type held in the language dictionary, and the prosody conversion means uses the bypass switch to route the output of the text processing means around the accent conversion unit and the phoneme string conversion unit.
[0021]
The operation of the present invention will be described below.
[0022]
In the present invention, an accent type (or phoneme string) obtained by analysis based on the standard language can be converted according to conversion rules. No dialect accent dictionary therefore needs to be created, and speech synthesis with different dialect accents, which was previously difficult to realize because of excessive development cost, becomes possible. Moreover, because memory use hardly increases, a single device can perform varied speech synthesis, for example in several dialects.
[0023]
In addition, the accent conversion unit converts, according to the conversion rules, the integer accent types 0 to n of the standard language into high-start accent types 0 to n or low-start accent types 0 to n, which are characteristic of the Kansai dialect. Speech can thus be synthesized with a Kansai-style accent even though the text processing means outputs prosodic information based on a standard-language dictionary.
[0024]
Furthermore, when dialect prosody (accent) is registered for some words in a language dictionary based on the standard language, those words are synthesized using the accurate prosody registered in the dictionary, and words that have not been investigated are converted to an accent (or phoneme string) in the style of that region's dialect using the conversion rules. The result comes closer to a natural dialect and avoids confusing the listener with a mixture of dialect and standard-language accents.
[0025]
The algorithm for synthesizing speech is complicated and is generally realized by software using a general-purpose microcomputer or DSP. According to the speech synthesis program of the present invention, it is possible to add a dialect-like speech synthesis function in a system using a microcomputer such as a personal computer or a system using a DSP.
[0026]
[Embodiments of the invention]
Embodiments of the present invention will be described below with reference to the drawings.
[0027]
(Embodiment 1)
FIG. 1 is a diagram for explaining the configuration of a speech synthesizer according to an embodiment of the present invention. This speech synthesizer comprises a language dictionary 1, text processing means 2, rule-based speech synthesis means 3, and prosody conversion means 4.
[0028]
The language dictionary 1 stores morpheme information for each word and accent information for each word, and is generally realized by a read-only memory.
[0029]
The text processing means 2 detects words in the input text data using the language dictionary 1, determines the sentence structure from morphemes such as verbs and nouns, and, based on that structure, determines the pause lengths and the strength of the phrase component that controls the intonation of the whole sentence. It also determines the accent type of each phrase from the accent information of the independent word contained in the phrase. It then outputs, for each phrase, a phoneme symbol string, accent type, phrase component strength, pause length, voice loudness, and similar information. Loudness is usually treated as constant within a sentence, but at a fine scale the loudness variation captured when the speech segments (the units that are concatenated to synthesize speech) were recorded is output unchanged. During prosody control, this loudness can be raised or lowered sentence by sentence. Speaking rate can be controlled by statistically measuring the duration of each phoneme and deciding how much faster or slower than that duration to speak. Pitch is controlled by adding the phrase component, derived from the sentence structure, to the accent component determined by each word. Raising or lowering this sum as a whole adjusts the voice pitch, and adjusting the strength of the phrase or accent components adjusts the strength of the intonation. The degree of adjustment is often left to the user. The text processing means 2 is generally realized by software running on a CPU or DSP.
[0030]
The rule-based speech synthesis means 3 synthesizes a speech waveform from the input information. Specifically, the accent generation unit 302 refers to the accent shape storage unit 301 and sets a pitch for each phoneme in the phoneme symbol string of each input phrase. Next, the phrase generation unit 304 refers to the phrase shape storage unit 303 and, for each phoneme of the phoneme symbol string spanning multiple phrases, adds the pitch shape representing intonation, scaled by the strength of the phrase component. The phrase shape is described later. The waveform generation unit 306 then refers to the waveform dictionary 305 for each phoneme and synthesizes a speech waveform matching the pitch generated by the accent generation unit 302 and the phrase generation unit 304. The rule-based speech synthesis means 3 is generally realized by software running on a CPU or DSP.
[0031]
Various methods can be used for waveform generation. Here, a waveform superposition method is used, in which zero-phase unit waveforms are overlapped at intervals (the pitch period) given by the reciprocal of the frequency of the voice to be synthesized; this makes it possible to synthesize relatively high-quality speech.
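As a rough illustration of the superposition step only, the following sketch overlaps copies of a unit waveform at a spacing of one pitch period. It assumes a constant target frequency and treats the unit waveform as a plain list of samples; how the zero-phase unit waveform is derived is not covered, and the function and parameter names are hypothetical.

```python
from typing import List


def overlap_add(unit: List[float], f0_hz: float,
                sample_rate: int, n_samples: int) -> List[float]:
    """Sum copies of `unit` placed every 1/f0 seconds (one pitch period)."""
    period = round(sample_rate / f0_hz)  # pitch period in samples
    if period < 1:
        raise ValueError("f0 is too high for this sample rate")
    out = [0.0] * n_samples
    for start in range(0, n_samples, period):
        for i, value in enumerate(unit):
            if start + i < n_samples:
                out[start + i] += value  # overlapping copies simply add
    return out
```

Raising `f0_hz` shortens the spacing between copies, which is how the pitch of the output is set in this kind of scheme.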
[0032]
FIG. 2 shows an example of a phrase shape. The phrase shape models the pitch contour of a whole sentence as a person utters it: it rises at the beginning of each phrase and falls gently toward the end of the sentence. A curve shaped like the kana “へ”, generally called the Fujisaki model, is also sometimes used. FIG. 2 shows a phrase shape that falls linearly on a logarithmic frequency scale. In this example, the pitch at the beginning of the phrase is given by the value of the phrase component, and the rate of descent is adjusted so that the pitch reaches the steady level at the end of the phrase. The longer the phrase, therefore, the slower the descent.
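The linear descent on a logarithmic frequency scale can be sketched as below. The sketch assumes one pitch value per mora and uses hypothetical names (`phrase_contour`, `descent_per_mora`, `start_hz`, `steady_hz`); the source does not say at what resolution the contour is sampled.

```python
import math
from typing import List


def descent_per_mora(start_hz: float, steady_hz: float, n_moras: int) -> float:
    """Slope of log-frequency per mora so the phrase ends at the steady level."""
    if n_moras < 2:
        raise ValueError("need at least two morae to define a slope")
    return (math.log(steady_hz) - math.log(start_hz)) / (n_moras - 1)


def phrase_contour(start_hz: float, steady_hz: float, n_moras: int) -> List[float]:
    """Pitch per mora, falling linearly in log frequency from start to steady."""
    slope = descent_per_mora(start_hz, steady_hz, n_moras)
    return [math.exp(math.log(start_hz) + i * slope) for i in range(n_moras)]
```

Because the start and end levels are fixed, the slope is divided over more morae in a longer phrase, which reproduces the observation that longer phrases descend more slowly.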
[0033]
The functions described so far are the same as those of an ordinary speech synthesizer. In this embodiment, prosody conversion means 4 is additionally provided to change the prosody of the synthesized speech. Prosody generally refers to accent, phrase component strength, average voice pitch, speaking rate, pause length, and the like. For example, converting the accent type makes dialect-style speech synthesis possible. Gradually increasing the accent strength, phrase strength, and loudness toward the end of a sentence can produce speech exaggerated in the manner of a public address. Furthermore, converting phonemes as well as prosody makes it possible to reproduce wording characteristic of a dialect or an age group. The prosody conversion means 4 is generally realized by software running on a CPU or DSP.
[0034]
In the following, the prosody conversion means 4 will be described in more detail with reference to FIG. The prosody conversion means 4 includes an accent conversion rule storage unit 401 and an accent conversion unit 402. The accent conversion unit 402 selects an accent conversion rule from the conversion rules stored in the accent conversion rule storage unit 401 based on the dialect selection parameter. The dialect selection parameter can be embedded in the input text using a special symbol as a control command, or can be input by the user by some method (for example, providing a dialect selection parameter input unit). Normally, software can be used to set a male voice, female voice, voice pitch, loudness, speed, etc. using a setting screen, and a dialect can be set on the same screen. Alternatively, a setting device can be provided separately. Then, the input phoneme string and accent type are converted into different accent types according to the selected accent conversion rule. Specific accent type conversion will be described later.
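One way the dialect selection parameter might be embedded in the input text is sketched below. The `<dialect=N>` tag syntax is purely an assumption made for illustration, since the source only says that a special symbol is used as a control command. The integer assignments (0 for the standard language, 1 for the Kansai dialect, 2 for the Kagoshima dialect) follow the example given in paragraph 0041 of the description.

```python
import re
from typing import Tuple

# Parameter values follow the example in the description; the names are labels only.
DIALECTS = {0: "standard", 1: "kansai", 2: "kagoshima"}

# HYPOTHETICAL control-command syntax; the patent does not define one.
_COMMAND = re.compile(r"<dialect=(\d+)>")


def extract_dialect(text: str, default: int = 0) -> Tuple[int, str]:
    """Return (dialect parameter, text with control commands removed)."""
    match = _COMMAND.search(text)
    param = int(match.group(1)) if match else default
    if param not in DIALECTS:
        raise ValueError(f"unknown dialect selection parameter: {param}")
    return param, _COMMAND.sub("", text)
```

The returned parameter would then be used by the accent conversion unit 402 to pick a rule set, while the cleaned text goes on to text processing.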
[0035]
Furthermore, a phoneme string conversion rule storage unit 403 and a phoneme string conversion unit 404 may be provided inside the prosody conversion means 4. In this case, as with the accent, a phoneme string conversion rule is selected from the conversion rules stored in the phoneme string conversion rule storage unit 403 based on the rule selection parameter. Like the dialect selection parameter, the rule selection parameter can be embedded in the input text as a control command using a special symbol, or can be input by the user (for example, by providing a dialect selection parameter input unit). The input phoneme string is then converted into a different phoneme string according to the selected phoneme string conversion rule.
[0036]
First, a specific example of the phoneme string conversion rules will be described. For example, rules for converting to a childish voice include:
(1) Convert “shi” to “chi”
(2) Convert “su” and “tsu” to “chu”
(3) Convert “ra” to “da”
(4) Convert “zo” to “do”
and so on.
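The childish-voice rules above can be sketched as a lookup table selected by a rule selection parameter. This is only an illustration: representing the phoneme string as a list of romanized mora, and the names `RULE_SETS` and `convert_phonemes`, are assumptions and not part of the specification.

```python
# Hypothetical rule store: rule selection parameter -> mora substitution table.
RULE_SETS = {
    "childish": {
        "shi": "chi",   # rule (1)
        "su": "chu",    # rule (2)
        "tsu": "chu",   # rule (2)
        "ra": "da",     # rule (3)
        "zo": "do",     # rule (4)
    },
}


def convert_phonemes(morae, rule_name):
    """Apply the selected substitution table to a list of romanized mora.

    Mora that have no entry in the table pass through unchanged, and an
    unknown rule name leaves the whole phoneme string untouched.
    """
    table = RULE_SETS.get(rule_name, {})
    return [table.get(mora, mora) for mora in morae]
```

For instance, `convert_phonemes(["su", "shi"], "childish")` returns `["chu", "chi"]`. Further rule sets could be added to the table under other keys without changing the conversion function.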
[0037]
In addition, rules such as swapping “hi” and “shi” as in some dialects, or pronouncing “shi” as “su” as in other dialects, can be stored, and various conversions can be performed by selecting among them with the rule selection parameter. Because the choice of conversion rules does not change from phrase to phrase, it can be made by setting the rule selection parameter before speech synthesis to specify which conversion rules are to be applied.
[0038]
Next, a specific method of conversion to Japanese dialect accents will be described. FIG. 3 shows the specific accent types used in the present embodiment. FIG. 3A shows, for each accent type and number of mora, the standard-language accent shapes used in conventional speech synthesis. Here, a mora is a unit representing the beat of speech and corresponds to a kana character in Japanese. In general, an accent is expressed in two levels, high and low, for each mora. Type 0, also called the flat type, starts with a low tone “L” (except for one-mora words) and has a high tone “H” from the second mora onward. Entries shown in parentheses ( ) indicate the pitch of a particle when a particle follows the word. For example, the Japanese word for “I” has 3 mora and is type 0 in the standard language, so it is “LHH”; when the particle “ha” follows, it becomes “LHHH”. Type 1 has a high first mora, and all following mora are low. For types 2 and above, the accent type number indicates the position of the accent nucleus: the mora from the second up to the accent nucleus are high, and the mora after the accent nucleus are low. For example, “Haneashi” has 3 mora and is type 3, so it is “LHH”, the same as type 0 when it stands alone; when the particle “ha” follows to give “Haneashiha”, however, it becomes “LHHL”.
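The standard-language shapes described for FIG. 3A can be generated from the number of mora and the accent type, as in the following sketch. The function name `standard_shape` and the `with_particle` flag are illustrative assumptions; one-mora words are rejected because their special case is only shown in the figure and is not spelled out in the text.

```python
def standard_shape(n_mora, accent_type, with_particle=False):
    """Return the H/L pattern of a standard-language accent.

    n_mora        : number of mora in the word (2 or more in this sketch)
    accent_type   : 0 (flat) or the 1-based position of the accent nucleus
    with_particle : append the pitch of a following particle
    """
    if n_mora < 2:
        raise ValueError("one-mora words are a special case not modeled here")
    if not 0 <= accent_type <= n_mora:
        raise ValueError("accent type must lie between 0 and the mora count")
    if accent_type == 0:
        # Flat type: low first mora, high afterwards, particle stays high.
        tones, particle = "L" + "H" * (n_mora - 1), "H"
    elif accent_type == 1:
        # Type 1: high first mora, low afterwards.
        tones, particle = "H" + "L" * (n_mora - 1), "L"
    else:
        # Type 2 and above: high from the second mora up to the nucleus.
        tones = "L" + "H" * (accent_type - 1) + "L" * (n_mora - accent_type)
        particle = "L"
    return tones + (particle if with_particle else "")
```

With this sketch, a 3-mora type 0 word gives “LHH” alone and “LHHH” with a particle, while a 3-mora type 3 word with a particle gives “LHHL”, in line with the examples in the text.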
[0039]
FIG. 3B and FIG. 3C show examples of the high-rise accent type and the low-rise accent type, respectively, which are characteristic of the Kansai dialect. In the high-rise accent type shown in FIG. 3B, the first mora always starts with a high tone “H”. When the accent is type 1 and the word has 1 mora, the pitch falls from high to low within the mora and the duration is lengthened. This is represented as “D” in the figure. In the low-rise accent type of FIG. 3C, for type 0 only the last mora has a high tone. Type 1 is the same as in the standard type and the high-rise accent type, except for 1-mora words. When the accent is type 1 and the word has 1 mora, the pitch rises from low to high within the mora and the duration is lengthened. This is represented as “U” in FIG. 3C. For type 2 and above, the word starts with a low tone, only the position of the accent nucleus has a high tone, and the tone is low thereafter.
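The Kansai shapes can be sketched in the same H/L notation. The low-rise branch follows the description of FIG. 3C directly. For the high-rise branch the text only states that the first mora is high, so the sketch assumes that the tone stays high up to the accent nucleus and is low afterwards (all high for type 0); that assumption, like the function name `kansai_shape`, is not taken from the specification.

```python
def kansai_shape(n_mora, accent_type, style):
    """Return the pattern of a Kansai accent; style is "high" or "low"."""
    if style not in ("high", "low"):
        raise ValueError("style must be 'high' or 'low'")
    if not 0 <= accent_type <= n_mora:
        raise ValueError("accent type must lie between 0 and the mora count")
    if n_mora == 1:
        if accent_type != 1:
            raise ValueError("one-mora type 0 is not modeled in this sketch")
        # Contour within a single lengthened mora: D falls, U rises.
        return "D" if style == "high" else "U"
    if accent_type == 1:
        # Same shape in the standard, high-rise, and low-rise types.
        return "H" + "L" * (n_mora - 1)
    if style == "low":
        if accent_type == 0:
            return "L" * (n_mora - 1) + "H"      # only the last mora is high
        # Only the accent nucleus is high.
        return "L" * (accent_type - 1) + "H" + "L" * (n_mora - accent_type)
    # High-rise (ASSUMPTION: high through the nucleus, low afterwards).
    if accent_type == 0:
        return "H" * n_mora
    return "H" * accent_type + "L" * (n_mora - accent_type)
```

For example, a 5-mora low-rise type 0 phrase gives “LLLLH” and a 4-mora high-rise type 1 phrase gives “HLLL”.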
[0040]
In the present embodiment, these accent shapes are stored in the accent shape storage unit 301 in the regular speech synthesis means 3, and the accent conversion unit 402 in the prosody conversion means 4 selects among them. Conventionally, an accent type is specified by an integer value from type 0 to type n; in this embodiment, a separate integer value is additionally used to specify whether the shape is the standard type, the high-rise type, or the low-rise type. The prosody conversion means 4 therefore outputs, for each phrase, the phoneme symbol string, accent type, phrase component strength, pause length, voice volume, and accent type information (standard, high-rise, or low-rise).
[0041]
When accents of a dialect other than the Kansai dialect are to be realized and cannot be expressed with the types described above, further accent types can be added. Moreover, by selecting the accent conversion rule based on the dialect selection parameter, a single device can synthesize speech while switching among the accent types of a plurality of dialects. The dialect selection parameter can be set on a setting screen. For example, the parameter can be an integer, with 0 for the standard language, 1 for the Kansai dialect, 2 for the Kagoshima dialect, and so on.
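The integer dialect selection parameter can be sketched as an enumeration that indexes a rule store. The class name `Dialect`, the function `select_rule_set`, and the dictionary-shaped rule store are hypothetical; only the values 0, 1, and 2 come from the text.

```python
from enum import IntEnum


class Dialect(IntEnum):
    """Dialect selection parameter values given as examples in the text."""
    STANDARD = 0
    KANSAI = 1
    KAGOSHIMA = 2


def select_rule_set(param, rule_store):
    """Return the accent conversion rules for the given parameter.

    rule_store maps Dialect members to rule lists (a stand-in for the
    accent conversion rule storage unit 401). Unknown integers raise
    ValueError; a dialect without stored rules yields an empty list, so
    the accent types pass through unchanged.
    """
    dialect = Dialect(param)
    return rule_store.get(dialect, [])
```

Keeping the parameter as a plain integer at the interface means it can equally be set from a setting screen or parsed from a control command embedded in the text.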
[0042]
Next, a specific example of accent conversion rules for synthesizing Kansai dialect speech will be shown. For example:
(1) For 2-mora phrases, convert type 0 to type 1 and type 1 to type 0
(2) For a type 1 phrase of 3 or more mora, if the first vowel differs from the second vowel, convert it to type 2
(3) Make type 2 accents the low-rise type
(4) Other than (3) above, if (accent type ≧ number of mora)
(5) Other than (4) above, if the phrase has 2 mora, convert it to the low-rise type
(6) Other than (5) above, if the number of mora is odd, convert it to the low-rise type, and if it is even, convert it to the high-rise type
Such a set of rules is stored in the accent conversion rule storage unit 401 as the Kansai dialect accent conversion rules. Regarding conversion rule (3), the type 2 accent has the same shape whether it is the low-rise type or the standard type.
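The rule set above can be sketched as a two-stage function: rules (1) and (2) move the accent nucleus, and rules (3) to (6) choose between the high-rise and low-rise types. The outcome of rule (4) is not stated in the text, so the sketch takes it as a caller-supplied argument `rule4_style` rather than guessing it. The function name and the `vowels` argument (one vowel letter per mora) are likewise assumptions.

```python
def convert_accent_kansai(n_mora, accent_type, vowels, rule4_style):
    """Apply rules (1) to (6) and return (style, accent_type).

    vowels      : string with one vowel per mora, e.g. "aoi" (hypothetical)
    rule4_style : "high" or "low"; result of rule (4), NOT given in the text
    """
    if rule4_style not in ("high", "low"):
        raise ValueError("rule4_style must be 'high' or 'low'")

    # Stage 1: convert the position of the accent nucleus.
    if n_mora == 2 and accent_type in (0, 1):
        accent_type = 1 - accent_type                      # rule (1)
    elif n_mora >= 3 and accent_type == 1 and vowels[0] != vowels[1]:
        accent_type = 2                                    # rule (2)

    # Stage 2: select the high-rise or low-rise type.
    if accent_type == 2:
        style = "low"                                      # rule (3)
    elif accent_type >= n_mora:
        style = rule4_style                                # rule (4)
    elif n_mora == 2:
        style = "low"                                      # rule (5)
    elif n_mora % 2 == 1:
        style = "low"                                      # rule (6), odd
    else:
        style = "high"                                     # rule (6), even
    return style, accent_type
```

Under these assumptions, a 7-mora type 4 phrase becomes low-rise type 4, a 5-mora type 0 phrase becomes low-rise type 0, and a 4-mora type 1 phrase whose first two vowels are equal becomes high-rise type 1.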
[0043]
An example of conversion using these rules is shown with an example sentence. Suppose that the sentence
"Play with video games and computers"
is input to the text processing means 2, and that the text processing means divides it into phrases as
"Play video games, play on PCs, play games, etc."
Here, a phrase break is represented by a midpoint (・). The standard-language accent type of each phrase is then
"4 ・ 0 ・ 1 ・ 0 ・ 0"
Expressed as pitch (H: high tone, L: low tone), this accent is
"LHHHLLL / LHHHH / HLLL / LH / LHH"
According to the above conversion rules, the accent type of each phrase is converted to
"Low rise 4, low rise 0, high rise 1, high rise 1, low rise 3"
Expressed as pitch, this accent is
"LLLHLLL / LLLLH / HLLL / HL / LLH"
This shows that the rules convert the accents into ones often found in the Kansai dialect.
[0044]
In the present embodiment, by performing conversion according to the conversion rule, a dialect accent dictionary is not required, so that variations of the reading speech can be expanded at low cost.
[0045]
(Embodiment 2)
FIG. 4 is a diagram for explaining the configuration of the speech synthesizer according to the second embodiment. In addition to the configuration of the first embodiment, this speech synthesizer has a bypass switch 405 in the prosody conversion means 4. When the text processing means 2 instructs that no conversion be performed, the output is passed on without accent conversion by the accent conversion unit 402 and without phoneme string conversion by the phoneme string conversion unit 404. This function is effective when the language dictionary 1 has a dialect accent storage unit 103 as shown in FIG. 7 and stores correct dialect accents only for some important or frequent words.
[0046]
In this configuration, it is desirable to input the same parameters to the text processing means 2 and the accent conversion unit 402 as dialect selection parameters to unify the accents.
[0047]
For example, for a word that is to be output with a dialect accent, the text processing means 2 outputs 0 as the prosody conversion instruction code when the dialect accent specified by the dialect selection parameter is present in the language dictionary 1, and outputs 1 as the prosody conversion instruction code when the specified dialect accent is not present in the language dictionary 1.
[0048]
When the prosody conversion instruction code is 1, the bypass switch 405 in the prosody conversion means 4 passes the information to the accent conversion unit 402 and the phoneme string conversion unit 404. When the prosody conversion instruction code is 0, no conversion is performed and the information is output as it is to the regular speech synthesis means 3.
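The interplay between the instruction code and the bypass switch can be sketched as follows. The function names, the set-based dictionary lookup, and the `convert` callback are hypothetical stand-ins for the text processing means 2, the dialect accent storage unit 103, and the conversion units 402 and 404.

```python
def instruction_code(word, dialect, dialect_accents):
    """Return 0 if the dictionary already holds the dialect accent, else 1.

    dialect_accents is a set of (word, dialect) pairs standing in for the
    dialect accent storage of the language dictionary.
    """
    return 0 if (word, dialect) in dialect_accents else 1


def bypass_switch(phrase, code, convert):
    """Route a phrase through conversion (code 1) or around it (code 0)."""
    if code == 1:
        return convert(phrase)   # to the accent / phoneme string conversion
    return phrase                # straight to the speech synthesis stage
```

Because the code is decided per word from the dictionary contents, words with a registered dialect accent keep that accent, while all other words still receive a rule-based dialect accent.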
[0049]
In the present embodiment, the output of the text processing means 2 and the output of the prosody conversion means 4 for each phrase are the phoneme symbol string, accent type, phrase component strength, pause length, voice volume, and accent type information. The prosody conversion means 4 additionally outputs the prosody conversion instruction code. Note that an accent type is output from both the text processing means 2 and the prosody conversion means 4. When the prosody conversion instruction code does not instruct conversion, the accent type from the text processing means takes priority. When the prosody conversion instruction code instructs conversion, the accent type from the text processing means is ignored and the accent type determined by the prosody conversion means is used.
[0050]
The operation of the bypass switch 405 may be performed by the prosody conversion means 4 in accordance with the prosody conversion instruction code issued by the text processing means 2, or may be performed by the accent conversion unit 402 itself. Furthermore, when speech is synthesized with standard accents, the bypass switch 405 is likewise used so that accent conversion and phoneme string conversion are not performed.
[0051]
(Embodiment 3)
FIG. 5 is a diagram showing a configuration example for realizing the speech synthesizer of the present invention using a computer. The input device 5 inputs text data to be synthesized by serial communication, network communication, or a keyboard.
[0052]
The storage medium 8 is a recording medium, such as a CD-ROM, floppy disk, or flash memory, on which is recorded a speech synthesis program comprising the functions of the language dictionary 1, the text processing means 2, the regular speech synthesis means 3, and the prosody conversion means 4 shown in the first embodiment. The storage device 9 is a storage device, such as a hard disk or a flash memory, into which the program on the storage medium 8 has been read. The RAM 10 is used for the temporary storage necessary for processing.
[0053]
The output device 7 includes a D / A converter, an amplifier, a speaker, and the like for outputting synthesized voice.
[0054]
The processing device 6 is a processing device, such as a microcomputer or a DSP, that performs speech synthesis processing in accordance with the speech synthesis program read into the storage device 9 or the RAM 10.
[0055]
FIG. 6 shows a flowchart when the Kansai dialect accent conversion rule is realized by an algorithm. This algorithm itself performs the same processing operation as in the first embodiment. Steps 1101 to 1107 are processes for converting the position of the accent nucleus, and steps 1108 to 1114 are processes for selecting a high-rise type or a low-rise type. Steps 1101 to 1105 correspond to the conversion rule (1), and steps 1106 and 1107 correspond to the conversion rule (2). Steps 1108 and 1109 correspond to the conversion rule (3), and steps 1110 and 1111 correspond to the conversion rule (4). Steps 1112 to 1114 correspond to the conversion rules (5) and (6).
[0056]
The second embodiment can also be realized by the same configuration as that in FIG.
[0057]
[Effects of the Invention]
As described above in detail, according to the present invention, the accent type analyzed and assigned on the basis of the standard-language accent is converted according to conversion rules, so that no special accent dictionary needs to be created. This makes possible speech synthesis with dialect accents, which was previously difficult to implement because of excessive development costs. Furthermore, since there is almost no increase in memory, a single apparatus can synthesize a variety of speech, such as a plurality of dialects and accents with personal characteristics.
[0058]
Also, Kansai dialect speech can be synthesized by using a conversion rule for converting from a standard word accent to a high-rise accent type or low-rise accent type characteristic of Kansai dialect.
[0059]
Furthermore, even when dialect accents are registered in the language dictionary for only some words, it is possible to prevent the listener from being confused by a mixture of dialect accents and standard-language accents.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating a configuration of a speech synthesizer according to a first embodiment.
FIG. 2 is a diagram illustrating an example of a phrase shape.
FIGS. 3A to 3C are diagrams illustrating an example of an accent type according to the first embodiment.
FIG. 4 is a diagram illustrating a configuration of a speech synthesizer according to a second embodiment.
FIG. 5 is a diagram illustrating a configuration for realizing the speech synthesizer according to the third embodiment by using a computer.
FIG. 6 is a flowchart for realizing an accent conversion rule by an algorithm.
FIG. 7 is a diagram illustrating a configuration of a conventional speech synthesizer.
[Explanation of symbols]
1 Language dictionary
2 Text processing means
3 Regular speech synthesis means
4 Prosody conversion means
5 Input device
6 Processing device
7 Output device
8 Storage medium
9 Storage device
10 RAM
101 Morphological information storage unit
102 Standard word accent storage unit
103 Dialect accent storage unit
301 Accent shape storage unit
302 Accent generator
303 Phrase shape storage unit
304 Phrase generator
305 Waveform dictionary
306 Waveform generator
401 Accent conversion rule storage unit
402 Accent conversion unit
403 Phoneme string conversion rule storage unit
404 Phoneme string conversion unit
405 Bypass switch

Claims

A speech synthesizer comprising:
a language dictionary storing morpheme information for each word and standard-language accent information for each word;
text processing means for analyzing input text data with reference to the language dictionary and outputting, for each phrase, phoneme string information and prosodic information including an accent type that indicates the position of the accent nucleus of the phrase in the standard language;
prosody conversion means for converting the prosodic information input from the text processing means into different prosodic information based on conversion rules; and
regular speech synthesis means for synthesizing a speech waveform from the phoneme string information and the converted prosodic information,
wherein the prosody conversion means includes an accent conversion rule storage unit and an accent conversion unit,
the accent conversion rule storage unit stores conversion rules for converting from the accent type in the standard language to the high-rise accent type or low-rise accent type characteristic of each dialect, and
the accent conversion unit selects a conversion rule stored in the accent conversion rule storage unit based on an input dialect selection parameter and, based on the selected conversion rule, converts the accent type in the standard language of every phrase in the input text data into a high-rise accent type or a low-rise accent type characteristic of the dialect corresponding to the dialect selection parameter.

The speech synthesizer according to claim 1, wherein the prosody conversion means further includes a phoneme string conversion rule storage unit and a phoneme string conversion unit, and
the phoneme string conversion unit selects a conversion rule stored in the phoneme string conversion rule storage unit based on an input rule selection parameter and, according to the selected conversion rule, converts the phoneme string information input from the text processing means into a phoneme string different from the input phoneme string.

The speech synthesizer according to claim 2, wherein the prosody conversion means includes a bypass switch that outputs the output of the text processing means while bypassing the accent conversion unit and the phoneme string conversion unit,
dialect accent types are stored in advance in the language dictionary for some words, and
when a dialect accent type is stored for a word included in the input text data, the text processing means selects and outputs the dialect accent type held in the language dictionary, and the prosody conversion means uses the bypass switch to make the output of the text processing means bypass the accent conversion unit and the phoneme string conversion unit.

A speech synthesis method for performing speech synthesis using the speech synthesizer according to claim 1, wherein
the accent conversion unit of the prosody conversion means selects a conversion rule stored in the accent conversion rule storage unit based on the input dialect selection parameter and, based on the selected conversion rule, converts the accent type in the standard language of every phrase in the input text data into a high-rise accent type or a low-rise accent type characteristic of the dialect corresponding to the dialect selection parameter.

The speech synthesis method according to claim 4, wherein the prosody conversion means includes a bypass switch that outputs the output of the text processing means while bypassing the accent conversion unit and the phoneme string conversion unit,
dialect accent types are stored in advance in the language dictionary for some words, and
when a dialect accent type is stored for a word included in the input text data, the text processing means selects and outputs the dialect accent type held in the language dictionary, and the prosody conversion means uses the bypass switch to bypass the accent conversion unit and the phoneme string conversion unit.

A speech synthesis program for executing a method of performing speech synthesis using the speech synthesizer according to claim 1, wherein
the accent conversion unit of the prosody conversion means selects a conversion rule stored in the accent conversion rule storage unit based on the input dialect selection parameter and, based on the selected conversion rule, converts the accent type in the standard language for the input text data into a high-rise accent type or a low-rise accent type characteristic of the dialect corresponding to the dialect selection parameter.

The speech synthesis program according to claim 6, wherein the prosody conversion means includes a bypass switch that outputs the output of the text processing means while bypassing the accent conversion unit and the phoneme string conversion unit,
dialect accent types are stored in advance in the language dictionary for some words, and
when a dialect accent type is stored for a word included in the input text data, the text processing means selects and outputs the dialect accent type held in the language dictionary, and the prosody conversion means uses the bypass switch to make the output of the text processing means bypass the accent conversion unit and the phoneme string conversion unit.