JP2004126011A - Speech synthesis method and apparatus and speech synthesis program

Info

Publication number: JP2004126011A
Application number: JP2002287366A
Authority: JP
Inventors: Katsumi Tsuchiya; 土谷　勝美; Takehiko Kagoshima; 籠嶋　岳彦
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2002-09-30
Filing date: 2002-09-30
Publication date: 2004-04-22
Anticipated expiration: 2022-09-30
Also published as: JP3756864B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech synthesis method that generates natural voiced sounds. SOLUTION: Speech units corresponding to input prosody information are selected from a speech unit dictionary that is stored in a speech unit storage unit and consists of a plurality of speech units (S1 to S2); prosody control is applied to the selected speech units (S3); the phases of the voiced sounds contained in the speech units are randomized (S4); and the speech units that have undergone prosody control and phase randomization are connected to generate a synthesized speech signal (S5). COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、例えばテキスト合成に用いられる音声合成方法と装置及びプログラムに係り、特に有声音の音声合成技術に関する。
【０００２】
【従来の技術】
音声合成方式の一つである規則合成方式は、入力された韻律情報から自動的に合成音声信号を生成する技術である。規則合成方式では、音韻記号列、音韻継続時間長、ピッチパターン及びパワーなどの韻律情報に従って、音声素片記憶部に記憶されている音節、音素及び１ピッチ区間などの小さな単位（これを音声素片という）の特徴パラメータを選択し、これらをピッチや継続時間長の制御を行った後に接続することにより、合成音声信号を生成する。規則合成方式による音声合成技術は、例えば任意の文章（テキスト）から人工的に音声を作り出すテキスト合成に使用される。
【０００３】
このような規則合成方式による音声合成技術では、合成音声信号の品質を向上させるために、代表音声素片の特徴パラメータには、音声データから切り出した音声波形や、それを分析して得られる音源信号と合成フィルタの特性を表す係数との組が用いられることが多い。
【０００４】
例えば、音声データから切り出した音声波形を音声素片として用いる前者の音声合成技術では、まずテキスト合成に供されるテキストの文解析で得られた音韻記号列、音韻継続時間長、ピッチパターン及びパワーなどの韻律情報に従って音声素片辞書から音声素片が選択される。選択された音声素片に対して、入力された韻律情報に従ってピッチ及び音韻継続時間長の制御が行われた後、それらの音声素片が接続されることにより合成音声信号が生成される。
【０００５】
ここで、韻律情報に基づくピッチや音韻継続時間長の制御は、音声素片に格納されている単位波形の繰り返しや間引きなどによって実現される。一般には、有声音ではピッチ波形が単位波形として用いられ、無声音では予め決定した長さに分割された波形が単位波形として用いられる。
【０００６】
【発明が解決しようとする課題】
従来の音声合成技術では、同一の単位波形が繰り返されると合成音声の自然性が低下するという問題が生じる。これは、同一の単位波形の繰り返しが原因で本来周期性が弱くなるべき帯域の周期性が強調され、その結果、合成音声のブザー感が増加し、合成音声の自然性が損なわれるからである。同一の単位波形の繰り返しが行われない場合でも、音声素片の作成上の問題などで隣り合う単位波形の位相がそろっている場合（周期性が強調されている場合）には、同様の理由で合成音声の自然性が低下することがある。
【０００７】
特開平５−１０８０９５号公報（特許文献１）では、無声音の合成音声生成時に音声素片波形に対して位相ランダム化の処理を行い、位相ランダム化された音声素片波形をずらしながら加算もしくは重畳することにより、無声音の自然性を向上させる技術が開示されている。しかしながら、有声音に対する有効な解決手段は知られていない。
【０００８】
従って、本発明は自然な有声音を生成する音声合成方法と装置及び音声合成プログラムを提供することを目的とする。
【０００９】
【特許文献１】
特開平５−１０８０９５号公報
【００１０】
【課題を解決するための手段】
上記の課題を解決するため、本発明では複数の音声素片からなる音声素片辞書から選択された音声素片を接続することによって合成音声信号を生成する際に、合成音声信号の有声音の位相をランダム化させる処理を行う。合成音声信号の有声音の位相をランダム化させることによって、必要以上に周期性が強くなっている部分の周期性を弱めることでき、合成音声信号がより自然なものとなる。
【００１１】
周期性の強弱はすべての帯域で同じではなく、強い周期性が望ましい帯域もあれば弱い周期性が望ましい帯域もある。弱い周期性が望ましい帯域のみ位相のランダム化を行えば、合成音声信号はより自然なものになる。一般には、高域に強い周期性が存在すると合成音声信号にブザー感が生じることが多く、位相のランダム化を行う帯域を高域に特定することで簡単に効果が得られる。また、周期性の強弱はすべての素片で同じではなく素片毎に異なっているため、音声素片毎に位相のランダム化を制御することで更なる改善が得られる。
【００１２】
位相のランダム化は、韻律制御で繰り返しあるいは間引きが行われた単位波形に対して行われることが望ましいが、合成音声信号の生成の際の計算量を削減するため、予め位相のランダム化を行った音声素片を音声素片辞書として記憶部に格納しておくことも可能であり、音声素片の作成上の問題などで音声素片の隣り合う単位波形の周期性が必要以上に強くなっている場合に効果がある。
【００１３】
本発明に係る音声合成装置は、複数の音声素片からなる音声素片辞書を記憶した音声素片記憶部と、音声素片辞書から入力された韻律情報に対応する音声素片を選択する素片選択部と、選択された音声素片に含まれる有声音の位相をランダム化する手段と、韻律情報に従って音声素片の韻律を制御する韻律制御部と、韻律が制御された音声素片を接続して合成音声信号を生成する素片接続部とを具備する。
【００１４】
さらに、本発明によると、予め記憶した複数の音声素片からなる音声素片辞書から韻律情報に対応する音声素片を選択する処理と、選択された音声素片に含まれる有声音の位相をランダム化する処理と、韻律情報に従って音声素片の韻律を制御する処理と、韻律が制御された音声素片を接続して合成音声信号を生成する処理とをコンピュータに行わせるための音声合成プログラムが提供される。
【００１５】
【発明の実施の形態】
（第１の実施形態）
図１に、本発明の第１の実施形態に係る音声合成システムの構成を示す。この音声合成システムは素片選択部１１、音声素片辞書を構成する複数の音声素片が格納されている音声素片記憶部１２、韻律制御部１０、音声素片の接続を行って合成音声信号を生成する素片接続部１４、有声音と無声音の切り替えを行う切替部１７及び有声音の位相ランダム化処理を行う位相ランダム化処理部１３から構成されている。
【００１６】
本実施形態の音声合成システムの動作をテキスト合成の場合を例にとって、図２に示すフローチャートを用いて説明する。まず、図示しない文解析・韻律情報生成部においてテキスト合成に供されるテキストの文解析が行われ、これにより音韻記号列、音韻継続時間長、ピッチパターン及びパワーなどの韻律情報が生成される。こうして生成される韻律情報１００は、韻律制御部１０及び素片選択部１１に入力される（ステップＳ１）。
【００１７】
素片選択部１１では、入力された韻律情報１００に従って音声素片１０４が音声素片記憶部１２から選択される（ステップＳ２）。選択された音声素片１０４は韻律制御部１０に供給され、韻律制御部１０では入力された韻律情報１００に従って素片選択部１１から供給された音声素片１０４に対して、韻律制御、例えばピッチ及び音韻継続時間長の制御が行われる（ステップＳ３）。
【００１８】
韻律制御部１０によるピッチや音韻継続時間長の制御は、有声音部では図３に示されるように音声素片３０１，３０２に格納されているピッチ波形の繰り返しや間引きなどによって実現される。一方、無声音部での韻律制御部１０による音韻継続時間長の制御は、音韻継続時間長の雑音音源を音声素片１０２より決定されるフィルタ係数をもつ合成フィルタに通すことによって実現される。
【００１９】
こうして韻律制御部１０によってピッチや音韻継続時間長が制御された音声素片１０１のうち、有声音部は位相ランダム化処理部１３で位相のランダム化が行われる（ステップＳ４）。
【００２０】
位相のランダム化が行われた後の有声音部の音声素片１０２は、切替部１７を介して素片接続部１４に供給される。一方、無声音部は韻律制御部１０から切替部１７を介して素片接続部１４へ供給される。素片接続部１４では、有声音部の位相がランダム化された音声素片１０２が接続され、合成音声信号１０３が生成される（ステップＳ５）。
【００２１】
次に、本実施形態の特徴的な部分である位相ランダム化処理部１３について、図４及び図５を参照しながら詳細に述べる。
図４は、位相ランダム化処理部１３の具体的な構成例を示している。図４の位相ランダム化処理部は、ＦＴＴ（高速フーリエ変換）処理部２０，２４、雑音信号発生部２３、位相変換部２１及びＩＦＦＴ（逆高速フーリエ変換）処理部２２から構成される。
【００２２】
位相ランダム化処理部には、音声素片の有声音部に相当する単位波形（以下、ピッチ波形という）２００が入力される。ピッチ波形２００は、ＦＦＴ処理部２０によってＦＦＴ処理が行われ、フーリエ係数２０１が生成される。一方、雑音信号発生部２３で生成された雑音信号２０４についても、同様にＦＦＴ処理部２４によってＦＦＴ処理が行われ、フーリエ係数２０５が生成される。
【００２３】
位相変換部２１では、予め設定された帯域（周期性が弱くなるべき帯域）に対して、ピッチ波形２００に対応するフーリエ係数２０１の位相を雑音信号２０４に対応するフーリエ係数２０５の位相で置き換えることによって、新たなフーリエ係数２０２を生成する。位相変換部２１により特定帯域の位相が置き換えられたピッチ波形のフーリエ係数２０２は、ＩＦＦＴ処理部２２によってＩＦＦＴ処理され、特定帯域の位相がランダム化されたピッチ波形２０３が生成される。
【００２４】
こうして得られたピッチ波形２０３は、前述した韻律制御部１０での韻律制御によりピッチ波形の繰り返しが行われていても、周期性が弱くなるべき帯域では位相のランダム化の効果により周期性が弱まるので、合成音声信号１０３のブザー感は削減され、自然性が向上する。
【００２５】
図５を用いて、この効果を説明する。位相ランダム化処理を行わない合成音声のスペクトルの例を図５（ａ）に示し、それに対する自然音声のスペクトルの例を図５（ｂ）に示す。自然音声は、図５（ｂ）に示されるように高域部分の周期性が弱く無声音化している。これに対し、位相ランダム化処理を行わない合成音声信号では、図５（ａ）に示されるように全帯域で周期性が強くなっており、高域部分の周期性が合成音声のブザー感の原因となる。本実施形態では、この高域部分の帯域の位相をランダム化処理することによって、合成音声信号１０３のスペクトルを自然音声のスペクトルに近づけることができる。すなわち、合成音声信号１０３のブザー感が削減され、自然性が向上するという効果が得られる。
【００２６】
図６には、位相ランダム化処理部１３の他の具体的な構成例を示す。帯域別の周期性の強弱は音声素片毎に異なっているため、音声素片毎に位相ランダム化処理を行う帯域を決定すると、合成音声信号１０３のさらなる自然性の向上が得られる。図３に示す位相ランダム化処理部１３は、このように音声素片毎の位相ランダム化処理の制御を行う例であり、図５の構成に位相ランダム化処理を行う帯域を選択する帯域選択部２５が追加された構成となっている。
【００２７】
帯域選択部２５は、音韻情報などの情報に基づいて位相ランダム化処理を行う帯域を選択し、その結果を位相変換部２１に供給する。音声素片に位相ランダム化処理を行う帯域のみの情報を格納し、その情報を直接に位相変換部２１へ供給することも可能であり、その場合は帯域選択部２５を省略することができる。
【００２８】
位相変換部２１では、このようにして設定された特定の帯域についてのみ、ピッチ波形２００に対応するフーリエ係数２０１の位相を雑音信号２０４に対応するフーリエ係数２０５の位相で置き換えて新たなフーリエ係数２０２を生成することにより、位相ランダム化処理を行う。
【００２９】
このように本実施形態では、音声素片毎に位相ランダム化処理を行う帯域を設定することにより、合成音声信号１０３の自然性をより一層向上させることができる。
【００３０】
（第２の実施形態）
図７に、本発明の第２の実施形態に係る音声合成システムの構成を示す。図１と相対応する部分に同一符号を付して、第１の実施形態との相違点を中心に説明する。第１の実施形態では、音声素片記憶部１１に格納されている音声素片が音声波形で表されているのに対し、本実施形態では、音声素片は音源信号と合成フィルタの特性を表す係数との組で表されており、さらに素片接続部１４の後に合成フィルタ処理部１５が追加された構成となっている。
【００３１】
従って、本実施形態では位相ランダム化処理部１３による音声素片の位相ランダム化処理は、合成フィルタ処理部１５で処理される前の音源信号、例えば韻律制御部１０から出力される韻律制御された残差信号の音声素片１１１に対して行われる。このようにして位相ランダム化処理された韻律制御後の残差信号の音声素片１１２が切替部１７を介して素片接続部１４に入力され、ここで生成された位相ランダム化残差信号１１３が合成フィルタ部１５によって処理されることにより、合成音声信号１０３が生成される。
【００３２】
このように音声素片を音源信号と合成フィルタの特性を表す係数との組で表す音声合成システムにおいて、合成フィルタ処理前の音源信号に対して音声素片の位相ランダム化処理を行うようにしても、第１の実施形態と同様の効果が得られることは明らかである。
【００３３】
（第３の実施形態）
図８は、第２の実施形態を変形した本発明の第３の実施形態に係る音声合成システムであり、素片接続部１４の後段に位相ランダム化処理部１３を配置し、音声素片接続後の残差信号１１４に対して位相ランダム化処理を行う構成となっている。本実施形態によっても、第１の実施形態と同様の効果が得られることは明らかである。
【００３４】
（第４の実施形態）
図９に、本発明の第４の実施形態に係る音声合成システムの構成を示す。図１と相対応する部分に同一符号を付して、第１の実施形態との相違点を中心に説明する。第１の実施形態では、韻律制御されたピッチ波形に対して位相ランダム化処理が行われているのに対し、本実施形態では予めピッチ波形の位相をランダム化処理した音声素片１２０を位相ランダム化音声素片記憶部１６に格納した構成となっている。
【００３５】
本実施形態によると、予めオフラインで音声素片の位相ランダム化処理を行っておくことで、ピッチ波形の位相ランダム化処理をオンラインで行う必要がないため、合成音声信号生成時の計算量を削減できるという利点がある。本実施形態では、韻律制御でのピッチ波形の繰り返しが原因で合成音声信号の周期性が必要以上に強調される問題に対しては効果はないが、音声素片作成上の問題などによって、記憶されている音声素片の隣り合うピッチ波形の周期性が必要以上に強くなっている場合には効果があり、基本的に第１の実施形態と同様の効果が得られることは明らかである。
【００３６】
（第５の実施形態）
図１０は、第１の実施形態を変形した本発明の第５の実施形態に係る音声合成システムであり、位相ランダム化処理部１３を素片接続部１４の後段に配置し、合成音声信号１０５の有声音部に対して位相ランダム化処理を行う構成となっている。この場合、有声音と無声音の切り替えを行う切替部１７は、位相ランダム化処理部１３の後に配置される。このように音声素片を接続して得られる合成音声信号１０５に対して有声音の位相ランダム化を行っても、第１の実施形態と同様の効果が得られることはいうまでもない。
【００３７】
以上、本発明の実施形態を幾つか説明したが、本発明は上述した実施形態に限られるものではなく、種々変形して実施が可能である。例えば、上記実施形態では、位相ランダム化処理部はピッチ波形のフーリエ係数の位相を雑音信号のフーリエ係数の位相と入れ替える構成になっているが、ピッチ波形のフーリエ係数の位相に雑音を付加する構成にすることもできる。
【００３８】
また、上記実施形態では位相ランダム化処理は、合成音声信号の生成前に行われているが、隣り合うピッチ波形の位相をランダム化できる構成であればそれに限定されない。
【００３９】
さらに、上記実施形態では無声音部に対しては位相ランダム化処理を行わない構成になっているが、少なくとも有声音部において位相ランダム化が行われるような構成であればよく、無声音部に対しても位相ランダム化処理を行う構成としてもよい。
【００４０】
上述した本発明に基づく音声合成処理は、ハードウェアより実現することも可能であるが、コンピュータを用いてソフトウェア処理により実現することも可能である。従って、本発明によれば上述した音声合成処理をコンピュータに行わせるためのプログラムを提供することもできる。
【００４１】
【発明の効果】
以上説明したように、本発明によれば合成音声信号の有声音の位相をランダム化させる処理を行うことにより、自然性に優れた合成音声を得ることができる。
【図面の簡単な説明】
【図１】本発明に係る音声合成方法の第１の実施形態を説明するための音声合成システムの構成を示すブロック図
【図２】同実施形態における音声符号化処理の手順を説明するためのフローチャート
【図３】音声素片の韻律制御の方法を示す図
【図４】位相ランダム化処理部の具体的構成例を示すのブロック図
【図５】位相ランダム化処理部の他の具体的構成例を示すブロック図
【図６】位相ランダム化処理における周波数スペクトルの変化を表す図
【図７】本発明に係る音声合成方法の第２の実施形態を説明するための素片辞書符号化系の構成を示すブロック図
【図８】本発明に係る音声合成方法の第３の実施形態を説明するための音声合成システムの構成を示すブロック図
【図９】本発明に係る音声合成方法の第４の実施形態を説明するための音声合成システムの構成を示すブロック図
【図１０】本発明に係る音声合成方法の第５の実施形態を説明するための音声合成システムの構成を示すブロック図
【符号の説明】
１０…韻律制御部
１１…素片選択部
１２…音声素片記憶部
１３…位相ランダム化処理部
１４…素片接続部
１５…合成フィルタ部
１６…位相ランダム化音声素片記憶部
１７…切替部
２０…ＦＦＴ処理部
２１…位相変換部
２２…ＩＦＦＴ処理部
２３…雑音信号発生部
２４…ＦＦＴ処理部
２５…帯域選択部
１００…韻律情報
１０１…韻律制御音声素片
１０２…位相ランダム韻律制御音声素片
１０３…位相ランダム合成音声信号
１０４…音声素片
１０５…合成音声信号
１１１…韻律制御残差音声素片
１１２…位相ランダム韻律制御残差音声素片
１１３…位相ランダム残差信号
１１４…残差信号
１２０…位相ランダム音声素片
２００…単位波形
２０１…単位波形フーリエ係数
２０２…位相ランダム単位波形フーリエ係数
２０３…位相ランダム単位波形
２０４…雑音信号
２０５…雑音信号フーリエ係数
[0001]
[Technical Field of the Invention]
The present invention relates to a speech synthesis method, apparatus, and program used, for example, for text synthesis, and more particularly to a speech synthesis technique for voiced sounds.
[0002]
[Prior art]
The rule synthesis method, one of several speech synthesis methods, automatically generates a synthesized speech signal from input prosody information. In the rule synthesis method, characteristic parameters of small units such as syllables, phonemes, and single pitch periods (referred to as speech units), which are stored in a speech unit storage unit, are selected according to prosody information such as a phoneme symbol string, phoneme durations, a pitch pattern, and power. The selected units are connected after their pitch and duration have been controlled, thereby generating a synthesized speech signal. Speech synthesis based on the rule synthesis method is used, for example, for text synthesis, which artificially produces speech from arbitrary text.
[0003]
In speech synthesis based on such a rule synthesis method, in order to improve the quality of the synthesized speech signal, the characteristic parameters of a representative speech unit are often either a speech waveform cut out from speech data, or a pair consisting of a sound source signal obtained by analyzing that waveform and coefficients representing the characteristics of a synthesis filter.
[0004]
For example, in the former technique, which uses speech waveforms cut out from speech data as speech units, speech units are first selected from the speech unit dictionary according to prosody information such as the phoneme symbol string, phoneme durations, pitch pattern, and power obtained by sentence analysis of the text to be synthesized. The pitch and phoneme duration of the selected speech units are then controlled according to the input prosody information, and the speech units are connected to generate a synthesized speech signal.
[0005]
Here, the control of the pitch and the phoneme duration based on the prosody information is realized by repetition or thinning of the unit waveform stored in the speech unit. Generally, a pitch waveform is used as a unit waveform for a voiced sound, and a waveform divided into a predetermined length is used as a unit waveform for an unvoiced sound.
[0006]
[Problems to be solved by the invention]
In conventional speech synthesis, the naturalness of the synthesized speech decreases when the same unit waveform is repeated. Repeating the same unit waveform emphasizes periodicity in bands where the periodicity should be weak; as a result, the synthesized speech sounds more buzzy and less natural. Even when the same unit waveform is not repeated, the naturalness of the synthesized speech can decrease for the same reason if the phases of adjacent unit waveforms are aligned (that is, if periodicity is emphasized), for example because of problems in how the speech units were created.
[0007]
Japanese Unexamined Patent Publication No. Hei 5-108095 (Patent Document 1) discloses a technique that improves the naturalness of unvoiced sounds: when unvoiced synthesized speech is generated, phase randomization is applied to the speech unit waveforms, and the phase-randomized waveforms are added or superimposed while being shifted. However, no effective solution for voiced sounds is known.
[0008]
Accordingly, an object of the present invention is to provide a speech synthesis method and apparatus for generating a natural voiced sound and a speech synthesis program.
[0009]
[Patent Document 1]
JP-A-5-108095
[0010]
[Means for Solving the Problems]
To solve the above problem, the present invention randomizes the phase of the voiced sound of the synthesized speech signal when that signal is generated by connecting speech units selected from a speech unit dictionary consisting of a plurality of speech units. Randomizing the phase of the voiced sound weakens the periodicity of portions where the periodicity is stronger than necessary, so the synthesized speech signal becomes more natural.
[0011]
The desirable strength of periodicity is not the same in all bands: strong periodicity is desirable in some bands and weak periodicity in others. Randomizing the phase only in bands where weak periodicity is desirable makes the synthesized speech signal more natural. In general, strong periodicity in the high band often gives the synthesized speech signal a buzzy quality, so restricting phase randomization to the high band is a simple way to obtain the effect. Furthermore, because the strength of periodicity differs from one speech unit to another, controlling phase randomization for each speech unit yields a further improvement.
[0012]
Phase randomization is preferably applied to unit waveforms that have been repeated or thinned out by prosody control. However, to reduce the amount of computation when generating the synthesized speech signal, speech units whose phases have been randomized in advance can instead be stored in the storage unit as the speech unit dictionary. This is effective when the periodicity of adjacent unit waveforms within a speech unit is stronger than necessary, for example because of problems in how the speech units were created.
[0013]
A speech synthesis apparatus according to the present invention comprises: a speech unit storage unit that stores a speech unit dictionary consisting of a plurality of speech units; a unit selection unit that selects, from the speech unit dictionary, speech units corresponding to input prosody information; means for randomizing the phase of the voiced sound contained in the selected speech units; a prosody control unit that controls the prosody of the speech units according to the prosody information; and a unit connection unit that connects the prosody-controlled speech units to generate a synthesized speech signal.
[0014]
Further, the present invention provides a speech synthesis program for causing a computer to perform: a process of selecting speech units corresponding to prosody information from a speech unit dictionary consisting of a plurality of speech units stored in advance; a process of randomizing the phase of the voiced sound contained in the selected speech units; a process of controlling the prosody of the speech units according to the prosody information; and a process of connecting the prosody-controlled speech units to generate a synthesized speech signal.
[0015]
BEST MODE FOR CARRYING OUT THE INVENTION
(1st Embodiment)
FIG. 1 shows the configuration of a speech synthesis system according to a first embodiment of the present invention. This speech synthesis system comprises a unit selection unit 11, a speech unit storage unit 12 that stores the plurality of speech units constituting a speech unit dictionary, a prosody control unit 10, a unit connection unit 14 that connects speech units to generate a synthesized speech signal, a switching unit 17 that switches between voiced and unvoiced sounds, and a phase randomization processing unit 13 that performs phase randomization on voiced sounds.
[0016]
The operation of the speech synthesis system of the present embodiment is described with reference to the flowchart shown in FIG. 2, taking text synthesis as an example. First, a sentence analysis / prosody information generation unit (not shown) analyzes the text to be synthesized, thereby generating prosody information such as a phoneme symbol string, phoneme durations, a pitch pattern, and power. The prosody information 100 generated in this way is input to the prosody control unit 10 and the unit selection unit 11 (step S1).
[0017]
The unit selection unit 11 selects speech units 104 from the speech unit storage unit 12 according to the input prosody information 100 (step S2). The selected speech units 104 are supplied to the prosody control unit 10, which applies prosody control, for example control of the pitch and phoneme duration, to the speech units 104 supplied from the unit selection unit 11 according to the input prosody information 100 (step S3).
[0018]
In voiced portions, the prosody control unit 10 controls the pitch and phoneme duration by repeating or thinning out the pitch waveforms stored in the speech units 301 and 302, as shown in FIG. 3. In unvoiced portions, the prosody control unit 10 controls the phoneme duration by passing a noise source of the required phoneme duration through a synthesis filter whose filter coefficients are determined from the speech unit 102.
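The repetition and thinning of pitch waveforms described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `fit_pitch_waveform_count` and the proportional index-mapping rule are hypothetical, since the text does not specify which pitch waveforms are repeated or dropped.

```python
def fit_pitch_waveform_count(pitch_waveforms, target_count):
    """Repeat or thin out pitch waveforms so that target_count of them remain.

    Assumption: output position i is mapped proportionally onto the source
    sequence, so waveforms are repeated when target_count is larger than the
    source count and skipped when it is smaller. Order is preserved.
    """
    if not pitch_waveforms or target_count <= 0:
        return []
    source_count = len(pitch_waveforms)
    picked = []
    for i in range(target_count):
        # Integer proportional mapping from output index to source index.
        j = min(i * source_count // target_count, source_count - 1)
        picked.append(pitch_waveforms[j])
    return picked
```

Lengthening three pitch waveforms to six repeats each one twice, while shortening four to two keeps every second one.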
[0019]
In the speech unit 101 whose pitch and phoneme duration are controlled by the prosody control unit 10 in this manner, the voiced sound portion is subjected to phase randomization by the phase randomization processing unit 13 (step S4).
[0020]
The speech units 102 of the voiced portion, after phase randomization, are supplied to the unit connection unit 14 via the switching unit 17. The unvoiced portion, on the other hand, is supplied from the prosody control unit 10 to the unit connection unit 14 via the switching unit 17. The unit connection unit 14 connects the speech units 102, in which the phase of the voiced portion has been randomized, and generates a synthesized speech signal 103 (step S5).
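The routing performed by the switching unit 17 and the connection in step S5 might be sketched as below. The function name `connect_units`, the `(samples, voiced)` tuple layout, and the use of plain concatenation for "connection" are all assumptions made for illustration; the text does not describe how units are joined at their boundaries.

```python
def connect_units(units, randomize_phase):
    """Route voiced units through phase randomization, then connect all units.

    units: list of (samples, voiced) pairs, where samples is a list of floats
           and voiced is a bool (hypothetical layout).
    randomize_phase: callable applied only to voiced units (step S4).
    """
    signal = []
    for samples, voiced in units:
        # Switching: voiced units are phase-randomized, unvoiced units bypass.
        processed = randomize_phase(samples) if voiced else samples
        # Connection is modeled here as simple concatenation (assumption).
        signal.extend(processed)
    return signal
```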
[0021]
Next, the phase randomization processing unit 13, which is the characteristic part of the present embodiment, is described in detail with reference to FIGS. 4 and 5.
FIG. 4 shows a specific configuration example of the phase randomization processing unit 13. The phase randomization processing unit of FIG. 4 comprises FFT (fast Fourier transform) processing units 20 and 24, a noise signal generation unit 23, a phase conversion unit 21, and an IFFT (inverse fast Fourier transform) processing unit 22.
[0022]
A unit waveform (hereinafter, referred to as a pitch waveform) 200 corresponding to a voiced sound part of a speech unit is input to the phase randomization processing unit. The pitch waveform 200 is subjected to FFT processing by the FFT processing unit 20, and a Fourier coefficient 201 is generated. On the other hand, the FFT processing unit 24 similarly performs an FFT process on the noise signal 204 generated by the noise signal generation unit 23, and generates a Fourier coefficient 205.
[0023]
For a preset band (a band in which the periodicity should be weak), the phase conversion unit 21 replaces the phase of the Fourier coefficients 201 corresponding to the pitch waveform 200 with the phase of the Fourier coefficients 205 corresponding to the noise signal 204, thereby generating new Fourier coefficients 202. The Fourier coefficients 202 of the pitch waveform, in which the phase of the specific band has been replaced by the phase conversion unit 21, are subjected to IFFT processing by the IFFT processing unit 22, generating a pitch waveform 203 in which the phase of the specific band is randomized.
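The phase replacement described above can be illustrated with a short sketch. It assumes a naive DFT in place of the FFT units, Gaussian noise for the noise signal 204, and a hypothetical parameter `first_bin` that marks the start of the band to randomize; the DC bin and, for even lengths, the Nyquist bin are left untouched so that the output stays real. None of these choices are stated in the text.

```python
import cmath
import math
import random


def dft(x):
    """Naive DFT; stands in for the FFT processing units 20 and 24."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Naive inverse DFT; stands in for the IFFT processing unit 22."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def randomize_phase(pitch_waveform, first_bin, rng):
    """Replace the phase of bins >= first_bin with the phase of a noise spectrum."""
    n = len(pitch_waveform)
    coeffs = dft(pitch_waveform)                      # Fourier coefficients 201
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]   # noise signal 204 (assumed Gaussian)
    noise_coeffs = dft(noise)                         # Fourier coefficients 205
    out = list(coeffs)
    for k in range(max(first_bin, 1), (n - 1) // 2 + 1):
        # Keep the magnitude of the pitch waveform, take the phase of the noise.
        out[k] = cmath.rect(abs(coeffs[k]), cmath.phase(noise_coeffs[k]))
        # Mirror to the negative-frequency bin so the inverse transform is real.
        out[n - k] = out[k].conjugate()
    return [v.real for v in idft(out)]                # pitch waveform 203
```

Because only the phase is replaced, the magnitude spectrum of each pitch waveform is unchanged; calling the function once per repeated pitch waveform with fresh noise gives each repetition a different high-band phase.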
[0024]
In the pitch waveform 203 thus obtained, even if the pitch waveform is repeated by the prosody control in the above-described prosody control unit 10, the periodicity is weakened by the effect of the randomization of the phase in the band where the periodicity should be weakened. Therefore, the buzzing feeling of the synthesized voice signal 103 is reduced, and the naturalness is improved.
[0025]
This effect is described with reference to FIG. 5. FIG. 5A shows an example spectrum of synthesized speech without phase randomization, and FIG. 5B shows an example spectrum of the corresponding natural speech. As shown in FIG. 5B, natural speech has weak periodicity in the high band, where it becomes noise-like (unvoiced). In contrast, in the synthesized speech signal without phase randomization, periodicity is strong across the entire band, as shown in FIG. 5A, and the periodicity in the high band causes the buzzy quality of the synthesized speech. In the present embodiment, randomizing the phase in this high band brings the spectrum of the synthesized speech signal 103 closer to that of natural speech. That is, the buzziness of the synthesized speech signal 103 is reduced and its naturalness is improved.
[0026]
FIG. 6 shows another specific configuration example of the phase randomization processing unit 13. Because the strength of the periodicity in each band differs from one speech unit to another, determining the band to be phase-randomized for each speech unit further improves the naturalness of the synthesized speech signal 103. The phase randomization processing unit 13 shown in FIG. 6 is an example that controls phase randomization for each speech unit in this way; it adds to the configuration of FIG. 4 a band selection unit 25 that selects the band in which phase randomization is performed.
[0027]
The band selection unit 25 selects a band for which phase randomization processing is to be performed based on information such as phoneme information, and supplies the result to the phase conversion unit 21. It is also possible to store the information of only the band for which the phase randomization processing is performed in the speech unit and directly supply the information to the phase conversion unit 21. In that case, the band selection unit 25 can be omitted.
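One way to picture the band selection unit 25 is a lookup from phoneme information to the lower edge of the band to randomize. The sketch below is purely illustrative: the table `HYPOTHETICAL_CUTOFF_HZ`, its values, the default cutoff, and the function names are placeholders invented for this example and do not come from the text.

```python
import math

# Placeholder values for illustration only; real cutoffs would have to be
# determined per phoneme or per speech unit.
HYPOTHETICAL_CUTOFF_HZ = {"a": 4000.0, "i": 5000.0}
HYPOTHETICAL_DEFAULT_CUTOFF_HZ = 4000.0


def first_bin_for_cutoff(cutoff_hz, sample_rate, n):
    """Convert a cutoff frequency into the first DFT bin at or above it."""
    return math.ceil(cutoff_hz * n / sample_rate)


def select_first_bin(phoneme, sample_rate, n):
    """Pick the start of the band to randomize from phoneme information."""
    cutoff_hz = HYPOTHETICAL_CUTOFF_HZ.get(phoneme, HYPOTHETICAL_DEFAULT_CUTOFF_HZ)
    return first_bin_for_cutoff(cutoff_hz, sample_rate, n)
```

If the band information is instead stored with each speech unit, as the paragraph above allows, the lookup disappears and the stored value is passed straight to the phase conversion step.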
[0028]
Only for the specific band set in this way, the phase conversion unit 21 performs phase randomization by replacing the phase of the Fourier coefficients 201 corresponding to the pitch waveform 200 with the phase of the Fourier coefficients 205 corresponding to the noise signal 204 to generate new Fourier coefficients 202.
[0029]
As described above, in the present embodiment, the naturalness of the synthesized voice signal 103 can be further improved by setting the band in which the phase randomization process is performed for each voice unit.
[0030]
(Second embodiment)
FIG. 7 shows the configuration of a speech synthesis system according to a second embodiment of the present invention. Parts corresponding to those in FIG. 1 are denoted by the same reference numerals, and the description focuses on the differences from the first embodiment. In the first embodiment, the speech units stored in the speech unit storage unit 12 are represented by speech waveforms. In the present embodiment, by contrast, each speech unit is represented by a pair consisting of a sound source signal and coefficients representing the characteristics of a synthesis filter, and a synthesis filter processing unit 15 is added after the unit connection unit 14.
[0031]
Therefore, in the present embodiment, the phase randomization processing unit 13 applies phase randomization to the sound source signal before it is processed by the synthesis filter processing unit 15, for example to the speech unit 111 of the prosody-controlled residual signal output from the prosody control unit 10. The speech unit 112 of the prosody-controlled residual signal, phase-randomized in this way, is input to the unit connection unit 14 via the switching unit 17, and the phase-randomized residual signal 113 generated there is processed by the synthesis filter processing unit 15 to generate the synthesized speech signal 103.
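The final step of this embodiment, passing the residual signal through a synthesis filter, can be sketched as follows. The text does not state the filter structure, so an all-pole (LPC-style) filter with the sign convention y[t] = e[t] - sum_i a[i] * y[t-1-i] is assumed here, and the function name `synthesis_filter` is hypothetical.

```python
def synthesis_filter(residual, coeffs):
    """Apply an all-pole synthesis filter to a residual (sound source) signal.

    Assumed convention: y[t] = residual[t] - sum_i coeffs[i] * y[t - 1 - i],
    with zero initial state.
    """
    output = []
    for t, e in enumerate(residual):
        acc = e
        for i, a in enumerate(coeffs):
            if t - 1 - i >= 0:
                # Feedback from previously generated output samples.
                acc -= a * output[t - 1 - i]
        output.append(acc)
    return output
```

In this arrangement the phase-randomized residual would be the `residual` argument, so the spectral envelope imposed by the filter coefficients is unaffected by the randomization.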
[0032]
It is clear that the same effects as in the first embodiment are obtained even when, in a speech synthesis system that represents each speech unit by a pair consisting of a sound source signal and coefficients representing the characteristics of a synthesis filter, phase randomization of the speech unit is applied to the sound source signal before synthesis filtering.
[0033]
(Third embodiment)
FIG. 8 shows a speech synthesis system according to a third embodiment of the present invention, which is a modification of the second embodiment. The phase randomization processing unit 13 is arranged after the unit connection unit 14, and phase randomization is applied to the residual signal 114 obtained after the speech units are connected. It is clear that this embodiment also provides the same effects as the first embodiment.
[0034]
(Fourth embodiment)
FIG. 9 shows the configuration of a speech synthesis system according to a fourth embodiment of the present invention. Parts corresponding to those in FIG. 1 are denoted by the same reference numerals, and the description focuses on the differences from the first embodiment. In the first embodiment, phase randomization is applied to prosody-controlled pitch waveforms. In the present embodiment, by contrast, speech units 120 whose pitch waveform phases have been randomized in advance are stored in a phase-randomized speech unit storage unit 16.
[0035]
According to the present embodiment, phase randomization of the speech units is performed offline in advance, so phase randomization of the pitch waveforms need not be performed online, which has the advantage of reducing the amount of computation when generating the synthesized speech signal. This embodiment has no effect on the problem that the periodicity of the synthesized speech signal is emphasized more than necessary by the repetition of pitch waveforms in prosody control. It is effective, however, when the periodicity of adjacent pitch waveforms in the stored speech units is stronger than necessary, for example because of problems in creating the speech units, and in that respect it clearly provides basically the same effect as the first embodiment.
[0036]
(Fifth embodiment)
FIG. 10 shows a speech synthesis system according to a fifth embodiment of the present invention, which is a modification of the first embodiment. The phase randomization processing unit 13 is arranged after the unit connection unit 14, and phase randomization is applied to the voiced portion of the synthesized speech signal 105. In this case, the switching unit 17 that switches between voiced and unvoiced sounds is arranged after the phase randomization processing unit 13. Needless to say, the same effect as in the first embodiment is obtained even when the phase of the voiced sound is randomized in the synthesized speech signal 105 obtained by connecting the speech units in this way.
[0037]
Although several embodiments of the present invention have been described above, the present invention is not limited to these embodiments and can be implemented with various modifications. For example, in the above embodiments, the phase randomization processing unit replaces the phase of the Fourier coefficients of the pitch waveform with the phase of the Fourier coefficients of a noise signal, but it may instead be configured to add noise to the phase of the Fourier coefficients of the pitch waveform.
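The alternative mentioned above, adding noise to the phase rather than replacing it, might look like the following sketch. The uniform jitter distribution, the parameters `first_bin` and `max_jitter`, and the naive DFT helpers are assumptions for illustration; the text only says that noise is added to the phase.

```python
import cmath
import math
import random


def dft(x):
    """Naive DFT used in place of an FFT for brevity."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Naive inverse DFT."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def add_phase_noise(pitch_waveform, first_bin, max_jitter, rng):
    """Add uniform random phase offsets (radians) to bins >= first_bin."""
    n = len(pitch_waveform)
    coeffs = dft(pitch_waveform)
    out = list(coeffs)
    for k in range(max(first_bin, 1), (n - 1) // 2 + 1):
        # Rotate the coefficient by a random angle; its magnitude is unchanged.
        out[k] = coeffs[k] * cmath.exp(1j * rng.uniform(-max_jitter, max_jitter))
        # Mirror to the negative-frequency bin so the output stays real.
        out[n - k] = out[k].conjugate()
    return [v.real for v in idft(out)]
```

With `max_jitter` set to zero the waveform is returned unchanged, and with `max_jitter` equal to pi the result approaches full phase replacement, so this variant allows the amount of randomization to be graded.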
[0038]
Further, in the above embodiment, the phase randomization processing is performed before the generation of the synthesized voice signal, but the present invention is not limited to this, as long as the configuration is such that the phases of adjacent pitch waveforms can be randomized.
[0039]
Further, in the above embodiments, phase randomization is not applied to unvoiced portions. However, any configuration in which phase randomization is performed at least on voiced portions is acceptable, and phase randomization may also be applied to unvoiced portions.
[0040]
The above-described speech synthesis processing based on the present invention can be realized by hardware, but can also be realized by software processing using a computer. Therefore, according to the present invention, it is also possible to provide a program for causing a computer to perform the above-described speech synthesis processing.
[0041]
[Effects of the Invention]
As described above, according to the present invention, by performing the process of randomizing the phase of the voiced sound of the synthesized voice signal, it is possible to obtain a synthesized voice excellent in naturalness.
[Brief description of the drawings]
FIG. 1 is a block diagram showing the configuration of a speech synthesis system for explaining a first embodiment of the speech synthesis method according to the present invention.
FIG. 2 is a flowchart for explaining the procedure of the speech encoding processing in the first embodiment.
FIG. 3 is a diagram showing a method of prosody control of speech units.
FIG. 4 is a block diagram showing a specific configuration example of the phase randomization processing unit.
FIG. 5 is a block diagram showing another specific configuration example of the phase randomization processing unit.
FIG. 6 is a diagram showing the change in the frequency spectrum in the phase randomization processing.
FIG. 7 is a block diagram showing the configuration of a unit dictionary encoding system for explaining a second embodiment of the speech synthesis method according to the present invention.
FIG. 8 is a block diagram showing the configuration of a speech synthesis system for explaining a third embodiment of the speech synthesis method according to the present invention.
FIG. 9 is a block diagram showing the configuration of a speech synthesis system for explaining a fourth embodiment of the speech synthesis method according to the present invention.
FIG. 10 is a block diagram showing the configuration of a speech synthesis system for explaining a fifth embodiment of the speech synthesis method according to the present invention.
[Explanation of Reference Numerals]
10 ... prosody control unit
11 ... unit selection unit
12 ... speech unit storage unit
13 ... phase randomization processing unit
14 ... unit connection unit
15 ... synthesis filter unit
16 ... phase-randomized speech unit storage unit
17 ... switching unit
20 ... FFT processing unit
21 ... phase conversion unit
22 ... IFFT processing unit
23 ... noise signal generation unit
24 ... FFT processing unit
25 ... band selection unit
100 ... prosody information
101 ... prosody-controlled speech unit
102 ... phase-randomized prosody-controlled speech unit
103 ... phase-randomized synthesized speech signal
104 ... speech unit
105 ... synthesized speech signal
111 ... prosody-controlled residual speech unit
112 ... phase-randomized prosody-controlled residual speech unit
113 ... phase-randomized residual signal
114 ... residual signal
120 ... phase-randomized speech unit
200 ... unit waveform
201 ... unit waveform Fourier coefficients
202 ... phase-randomized unit waveform Fourier coefficients
203 ... phase-randomized unit waveform
204 ... noise signal
205 ... noise signal Fourier coefficients

Claims

1. A speech synthesis method comprising:
selecting speech units corresponding to specified prosody information from a speech unit dictionary consisting of a plurality of speech units;
generating a synthesized speech signal by connecting the selected speech units; and
performing a process of randomizing the phase of the voiced sound of the synthesized speech signal.

2. The speech synthesis method according to claim 1, wherein the step of performing the process of randomizing the phase of the voiced sound of the synthesized speech signal either replaces the phase of Fourier coefficients obtained by Fourier-transforming a unit waveform corresponding to the voiced sound of the speech unit with the phase of Fourier coefficients obtained by Fourier-transforming a noise signal, or adds noise to the phase of the Fourier coefficients obtained by Fourier-transforming the unit waveform corresponding to the voiced sound of the speech unit.

3. The speech synthesis method according to claim 1, wherein the step of performing the process of randomizing the phase of the voiced sound of the synthesized speech signal randomizes the phase of the voiced sound contained in the selected speech units before the selected speech units are connected.

4. The speech synthesis method according to any one of claims 1 to 3, wherein the step of performing the process of randomizing the phase of the voiced sound of the synthesized speech signal randomizes the phase of a specific band of the voiced sound.

5. The speech synthesis method according to claim 4, wherein the process of randomizing the phase of the specific band of the voiced sound is controlled for each speech unit.

6. The speech synthesis method according to claim 1, wherein the step of performing the process of randomizing the phase of the voiced sound of the synthesized speech signal randomizes in advance the phase of the speech units contained in the speech unit dictionary.

7. A speech synthesis apparatus comprising:
a speech unit storage unit that stores a speech unit dictionary consisting of a plurality of speech units;
a unit selection unit that selects, from the speech unit dictionary, speech units corresponding to input prosody information;
means for randomizing the phase of the voiced sound contained in the selected speech units;
a prosody control unit that controls the prosody of the selected speech units according to the prosody information; and
a unit connection unit that connects the prosody-controlled speech units to generate a synthesized speech signal.

8. A speech synthesis program for causing a computer to perform:
a process of selecting speech units corresponding to prosody information from a speech unit dictionary consisting of a plurality of speech units stored in advance;
a process of randomizing the phase of the voiced sound contained in the selected speech units;
a process of controlling the prosody of the selected speech units according to the prosody information; and
a process of connecting the prosody-controlled speech units to generate a synthesized speech signal.