JP2006030609A

JP2006030609A - Voice synthesis data generating device, voice synthesizing device, voice synthesis data generating program, and voice synthesizing program

Info

Publication number: JP2006030609A
Application number: JP2004209634A
Authority: JP
Inventors: Yasuo Yoshioka; 靖雄吉岡
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2004-07-16
Filing date: 2004-07-16
Publication date: 2006-02-02

Abstract

PROBLEM TO BE SOLVED: To easily generate voice synthesis data for generating a natural synthesized voice.

SOLUTION: An input voice waveform 1 is analyzed by a segmentation part 3, a pitch detection part 4, and a sound volume detection part 5 respectively, and segment information showing continuance lengths of a phoneme symbol and a phoneme, pitch information showing variation in pitch, and sound volume information showing variation in sound volume are outputted. Further, a timbre data detection part 7 of a timbre information extraction part 6 applies a timbre analysis of the same kind with an object voice synthesis part 12 to the input voice waveform 1 and a timbre difference extraction part 10 outputs the difference value between the analysis result and dictionary data of a corresponding phoneme as timbre data. A voice synthesis data generation part 11 generates voice synthesis data from the segment information, pitch information, sound volume information, and timbre data. The voice synthesis part 12 to which the voice synthesis data are inputted varies frame data read out of a dictionary by the pitch information, sound volume information, and timbre data, synthesizes a voice based upon the varied frame data, and outputs the synthesized voice.

COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to a speech synthesis data generation device and a speech synthesis data generation program that analyze input speech and generate speech synthesis data to be supplied to a speech synthesizer, and to a speech synthesizer and a speech synthesis program that synthesize speech based on the speech synthesis data generated by that device or program.

FIG. 5 shows the configuration of a general speech synthesis system.
In FIG. 5, 51 is a character string input unit, 52 is a speech synthesis data generation unit that generates speech synthesis data corresponding to the input character string, 53 is a speech synthesis unit that synthesizes speech based on the speech synthesis data, and 54 is a synthesized speech output unit.
Here, the "speech synthesis data" input to the speech synthesis unit 53 generally consists of phoneme symbols, phoneme lengths (durations), prosodic information (information on changes in pitch and volume), and voice quality information (the selection of a synthesized speech dictionary, or voice quality parameters corresponding to the synthesis method, for example the frequency of the first formant). These can be input arbitrarily, and natural speech synthesis becomes possible by creating and inputting appropriate speech synthesis data.
However, to create this speech synthesis data as preprocessing, a character string is generally input (the "input character string"): for Japanese, a string of mixed kana and kanji; for other languages, for example an alphabetic string for English or a Hangul string for Korean; or a string with special symbols added, such as "raise the pitch". The speech synthesis data is then generated from it according to rules defined inside the device (rules for generating phoneme lengths, prosodic data, and so on through syntactic analysis, symbol analysis, and the like).

Patent Document 1 describes a device for creating singing voice synthesis data. It creates reference data consisting of a character string and pitch, intensity, and length data from a musical score, segments an actual singing voice using that reference data, and generates data for speech synthesis.
Patent Document 1: JP-A-3-7995

However, because the conventional method described above generates the speech synthesis data for the speech synthesizer according to "rules", it is difficult to produce an "input character string" that yields synthesized speech whose utterance departs from those rules, mainly in prosodic changes and phoneme lengths. (For example, the same "Ohayo" can be uttered in many ways, and this is what expresses humanity, naturalness, and individuality.) It is possible to some extent with input types that let the user add prosodic symbols (such as "raise the pitch"), and impossible with types that do not accept prosodic symbols. Needless to say, creating the "speech synthesis data" directly is also difficult.

FIGS. 6 and 7 are examples of the varied expression found in real speech. Each figure shows the (A) speech waveform, (B) pitch, and (C) volume when a different speaker says "Ohayo" without being told how to say it.
Speech synthesis is required to reproduce such differences in expression and changes in voice quality.

In view of these circumstances, an object of the present invention is to provide a speech synthesis data generation device and a speech synthesis data generation program that can easily create speech synthesis data capable of producing natural synthesized speech, and a speech synthesizer and a speech synthesis program that synthesize speech based on the speech synthesis data generated by that device or program.

To achieve the above object, the speech synthesis data generation device and speech synthesis data generation program of the present invention analyze an actual speech waveform to extract segment information indicating phoneme symbols and phoneme durations, pitch information, and volume information. They also perform the same analysis as was used for the synthesized speech dictionary of the target speech synthesizer, obtain the difference from the data of the same phoneme in the synthesized speech dictionary, and output it as timbre data. Their main feature is that speech synthesis data is generated from this segment information, pitch information, volume information, and timbre data.
The speech synthesizer and speech synthesis program of the present invention receive this speech synthesis data, read from the synthesized speech dictionary the frame data corresponding to the phoneme to be uttered, and modify that frame data according to the pitch information, the volume information, and the timbre data before output.

According to the speech synthesis data generation device or speech synthesis data generation program of the present invention, speech synthesis data for natural speech synthesis, containing phoneme lengths, prosodic information, and also information for natural changes in voice quality, can be created easily and with a small amount of data.
According to the speech synthesizer or speech synthesis program of the present invention, natural speech can be synthesized using the speech synthesis data generated by that device or program.

FIG. 1 is a block diagram showing the configuration of an embodiment of the speech synthesis data generation device of the present invention and of a speech synthesizer that synthesizes speech based on the speech synthesis data it generates. These devices can also be realized by program processing on a computer.
In FIG. 1, 1 is the input phoneme symbol string supplied as input for segmentation. It is a symbol string corresponding to the content of the speech waveform 2 to be analyzed. For Japanese, for example, it may be hiragana such as "おはよ" (ohayo), or a phoneme symbol string in the phoneme symbol system actually used inside the speech synthesizer (for example one using SAMPA (Speech Assessment Methods Phonetic Alphabet), in this case "ohajo"). However, it is not required when the segmentation unit 3 is a pure speech recognition device that needs no such auxiliary information for segmentation.

2 is the input speech waveform to be analyzed. It may be a recorded speech waveform file or input from a microphone.
3 is a segmentation unit that segments the input speech based on the input phoneme symbol string 1 and the input speech waveform 2. The segment information it outputs indicates where each phoneme section begins and ends in time. For example, as shown in FIGS. 6 and 7, it gives the second at which "お" (o) starts and ends in the input speech waveform, the second at which "は" (ha) starts and ends, and so on. When the phoneme symbol string is SAMPA, "お" is "o", "は" is "ha", and "よ" is "jo". If the segmentation unit 3 is a "speech recognition unit", the same information can be obtained by segmentation through speech recognition of the input speech waveform 2, without the input phoneme symbol string 1.
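The segment information described above can be pictured as a list of labelled time intervals. The following is a minimal Python sketch rather than the patent's own format: the `Segment` class, the `to_sampa` and `phoneme_at` helpers, and all of the times are hypothetical, and only the kana-to-SAMPA pairs are taken from the text.

```python
from dataclasses import dataclass
from typing import List, Optional

# Kana-to-SAMPA pairs as given in the description.
KANA_TO_SAMPA = {"お": "o", "は": "ha", "よ": "jo"}


@dataclass
class Segment:
    symbol: str   # phoneme symbol, here a SAMPA string
    start: float  # start time in seconds
    end: float    # end time in seconds


def to_sampa(kana: str) -> List[str]:
    """Convert a hiragana string to SAMPA symbols using the table above."""
    return [KANA_TO_SAMPA[ch] for ch in kana]


def phoneme_at(segments: List[Segment], t: float) -> Optional[str]:
    """Return the symbol of the segment that contains time t, or None."""
    for seg in segments:
        if seg.start <= t < seg.end:
            return seg.symbol
    return None


# Hypothetical segmentation result for "おはよ"; the times are made up.
symbols = to_sampa("おはよ")
bounds = [(0.00, 0.12), (0.12, 0.30), (0.30, 0.55)]
segments = [Segment(s, a, b) for s, (a, b) in zip(symbols, bounds)]
```

A lookup such as `phoneme_at(segments, 0.20)` then tells later stages which phoneme a given analysis frame belongs to.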

4 is a pitch detection unit that detects pitch from the input speech waveform 2 by an appropriate method. Any pitch detection method may be used, such as measuring zero-crossing intervals.
The result is a pitch trajectory along the time axis, as in the (B) pitch plots of FIGS. 6 and 7. A time resolution equal to the frame period used at synthesis time is sufficient. If the pitch does not change much, the amount of information can be reduced further by holding data only where the pitch changes greatly and having the speech synthesizer linearly interpolate in between.
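As an illustration of the zero-crossing approach mentioned above, here is a minimal Python sketch. It assumes a clean, single-frequency test tone; the function name `zero_cross_pitch` and the 8000 Hz sample rate are assumptions made for the example, and real speech would normally need filtering and voicing decisions that the text does not specify.

```python
import math
from typing import List, Optional


def zero_cross_pitch(frame: List[float], sample_rate: int) -> Optional[float]:
    """Estimate pitch in Hz from the spacing of upward zero crossings.

    Returns None when fewer than two crossings are found (for example
    in silence), since no interval can be measured.
    """
    crossings = [i for i in range(1, len(frame))
                 if frame[i - 1] < 0.0 <= frame[i]]
    if len(crossings) < 2:
        return None
    # Average number of samples between successive upward crossings.
    mean_period = (crossings[-1] - crossings[0]) / (len(crossings) - 1)
    return sample_rate / mean_period


SAMPLE_RATE = 8000  # assumed for this example
# A 100 Hz test tone with a small phase offset, 0.1 s long.
tone = [math.sin(2 * math.pi * 100 * n / SAMPLE_RATE + 0.3) for n in range(800)]
estimate = zero_cross_pitch(tone, SAMPLE_RATE)
```

Running the estimator once per analysis frame would give a (time, pitch) trajectory of the kind plotted in panel (B) of FIGS. 6 and 7.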

5 is a volume detection unit that detects volume from the input speech waveform 2 by an appropriate method. One such method is to compute the average energy within a frame section, for example a 10 msec section.
The result is a volume trajectory along the time axis, as in the (C) volume plots of FIGS. 6 and 7. A time resolution equal to the frame period used at synthesis time is sufficient. If the volume does not change much, the amount of information can be reduced further by holding data only where the volume changes greatly and having the speech synthesizer linearly interpolate in between.
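The two ideas in this passage, average energy per 10 msec frame and linear interpolation between sparse data points on the synthesizer side, can be sketched in Python as follows. The function names, the use of mean squared amplitude as "average energy", and the sample values are assumptions for illustration only.

```python
from typing import List, Tuple


def frame_energy(samples: List[float], sample_rate: int,
                 frame_sec: float = 0.010) -> List[float]:
    """Mean squared amplitude of each consecutive, non-overlapping frame."""
    frame_len = max(1, round(sample_rate * frame_sec))
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(s * s for s in frame) / frame_len)
    return energies


def interpolate(points: List[Tuple[float, float]], t: float) -> float:
    """Linearly interpolate a sparse (time, value) track at time t.

    The points must be sorted by strictly increasing time. Times outside
    the track are clamped to the first or last value.
    """
    if t <= points[0][0]:
        return points[0][1]
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    return points[-1][1]


# Two 10 msec frames at an assumed 8000 Hz sample rate.
energies = frame_energy([0.5] * 80 + [1.0] * 80, sample_rate=8000)
# A sparse volume track with only two stored points.
sparse_volume = [(0.0, 0.0), (1.0, 10.0)]
midway = interpolate(sparse_volume, 0.25)
```

The same `interpolate` helper would apply unchanged to a sparse pitch track, since both trajectories are described as (time, value) points.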

The area inside the dotted line in FIG. 1 is the timbre information extraction unit 6. As illustrated, it includes a timbre data detection unit 7, an average timbre data extraction unit 9, and a timbre difference extraction unit 10. Supplying the information extracted by the timbre information extraction unit 6 to the speech synthesis unit 12 as timbre parameters makes it possible to reproduce changes in timbre as well.

The timbre data detection unit 7 performs timbre analysis by the same method as that used in the target speech synthesis unit 12.
FIG. 2 shows an example of timbre analysis in the timbre data detection unit 7.
FIG. 2(A) is the input speech waveform. The waveform is cut out at each time interval (analysis frame period) marked by the points below it, and timbre analysis is performed.
FIG. 2(B) shows an example of the timbre analysis result at one point in time.
In the analysis method shown in this figure, four (frequency, intensity) pairs are obtained for each frame. If these pairs are associated with formant frequencies and their intensities, it is easy to see that this information represents "timbre" and "phoneme" information. This analysis is performed for every frame. Needless to say, the resulting information includes the temporal change of the timbre. The four pairs are assigned index numbers in order for the processing that follows.

8 is the synthesized speech dictionary of the speech synthesizer. For each unit of speech synthesis (for example, each of "お" (o), "は" (ha), and "よ" (yo)), it stores information obtained by the same method as the analysis method of the timbre data detection unit 7.

9 is an average timbre data extraction unit. It works on the entry of the synthesized speech dictionary 8 that corresponds to the current segment information analyzed by the segmentation unit 3 (for example, the dictionary entry for "は" (ha)). It computes the average of the data over the consonant frames (for example, the "h" part of "ha") and the average of the data over the vowel frames (for example, the "a" part of "ha"), and uses each as average timbre data. The average is taken per index: the average frequency and average intensity of the first (frequency, intensity) pair across the frames, the average frequency and average intensity of the second pair, and so on.
As a modification, the average timbre data need not be computed; the data in the synthesized speech dictionary 8 (the frame data actually used at synthesis time) may be taken out as is.
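The per-index averaging described for unit 9 can be sketched in Python as below. The `Pair` and `Frame` type aliases, the function name, and the numeric values are hypothetical; the only thing taken from the text is that each frame holds four (frequency, intensity) pairs in index order and that frequencies and intensities are averaged separately for each index.

```python
from typing import List, Tuple

Pair = Tuple[float, float]  # (frequency in Hz, intensity)
Frame = List[Pair]          # one dictionary frame: pairs in index order


def average_timbre(frames: List[Frame]) -> Frame:
    """Average frequency and intensity separately for each pair index."""
    n = len(frames)
    return [
        (sum(f[k][0] for f in frames) / n,
         sum(f[k][1] for f in frames) / n)
        for k in range(len(frames[0]))
    ]


# Made-up dictionary frames for the vowel part of one phoneme.
vowel_frames = [
    [(700.0, 1.0), (1200.0, 0.5), (2500.0, 0.25), (3500.0, 0.125)],
    [(900.0, 0.5), (1000.0, 0.5), (2700.0, 0.75), (3300.0, 0.375)],
]
avg = average_timbre(vowel_frames)
```

Calling the same function once on the consonant frames and once on the vowel frames yields the two averages the description mentions.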

10 is a timbre difference extraction unit that obtains, for each frame, the difference between the average timbre data from the average timbre data extraction unit 9 and the timbre data obtained moment by moment by the timbre data detection unit 7. The difference is taken between the timbre data and the average timbre data of the phoneme to which that timbre data corresponds (for example, while the segment information indicates "h", the average timbre data of "h"). In the modification of the average timbre data extraction unit 9 (where the dictionary data is taken out as is), the difference from the frame data used at synthesis time is obtained.
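A minimal Python sketch of this per-frame difference follows. It assumes the analyzed frames have already been labelled with the phoneme active at their time, and that the difference is "analyzed minus reference" so that adding it back to the reference recovers the analyzed value; the names and numbers are hypothetical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[float, float]  # (frequency in Hz, intensity)
Frame = List[Pair]          # pairs in index order


def timbre_difference(analyzed: Frame, reference: Frame) -> Frame:
    """Per-index difference (analyzed minus reference) for one frame."""
    return [(f - rf, a - ra)
            for (f, a), (rf, ra) in zip(analyzed, reference)]


def difference_track(frames: List[Frame], labels: List[str],
                     averages: Dict[str, Frame]) -> List[Frame]:
    """Difference of every frame against the average of its own phoneme."""
    return [timbre_difference(frame, averages[label])
            for frame, label in zip(frames, labels)]


# Made-up average timbre data for the "h" and "a" parts of "ha".
averages = {
    "h": [(1000.0, 0.25), (2000.0, 0.25), (3000.0, 0.25), (4000.0, 0.25)],
    "a": [(800.0, 1.0), (1200.0, 0.5), (2500.0, 0.5), (3500.0, 0.25)],
}
# Made-up analysis results for two input frames and their phoneme labels.
frames = [
    [(1100.0, 0.5), (2000.0, 0.25), (2900.0, 0.25), (4000.0, 0.5)],
    [(750.0, 0.75), (1300.0, 0.5), (2500.0, 0.75), (3500.0, 0.25)],
]
labels = ["h", "a"]
diffs = difference_track(frames, labels, averages)
```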

FIG. 3 shows an example of the difference information (timbre data) extracted by the timbre difference extraction unit 10.
FIG. 3(A) plots, for each index number, the difference in intensity of each (frequency, intensity) pair, and FIG. 3(B) likewise plots the difference in frequency.
When this timbre data is added, at synthesis time, to the (frequency, intensity) pairs of the synthesis frame at each time, the change in timbre of the input speech can be reproduced. The new (frequency, intensity) pairs reproduce the temporal change of the timbre of the input speech.
As a modification, instead of supplying the timbre difference data to the speech synthesizer moment by moment, the timbre difference data may be averaged over the entire duration of the input waveform and supplied to the speech synthesizer once as the "timbre data". Moment-by-moment changes in timbre can then no longer be conveyed, but the overall timbre becomes closer to that of the input speech rather than that of the synthesis dictionary as is.
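The two ways of using the difference data, per frame or averaged once over the whole input, can be compared with a short Python sketch. The helper names and all values are hypothetical, and the frame layout (four (frequency, intensity) pairs in index order) follows the description.

```python
from typing import List, Tuple

Pair = Tuple[float, float]  # (frequency in Hz, intensity)
Frame = List[Pair]          # pairs in index order


def apply_difference(dictionary_frame: Frame, diff: Frame) -> Frame:
    """Add a timbre difference to a dictionary frame, index by index."""
    return [(f + df, a + da)
            for (f, a), (df, da) in zip(dictionary_frame, diff)]


def mean_difference(diffs: List[Frame]) -> Frame:
    """Average the differences of all frames into one overall difference."""
    n = len(diffs)
    return [
        (sum(d[k][0] for d in diffs) / n, sum(d[k][1] for d in diffs) / n)
        for k in range(len(diffs[0]))
    ]


dictionary_frame = [(800.0, 1.0), (1200.0, 0.5), (2500.0, 0.5), (3500.0, 0.25)]
diffs = [
    [(100.0, 0.25), (0.0, 0.0), (-100.0, 0.0), (0.0, 0.25)],
    [(-50.0, -0.25), (100.0, 0.0), (0.0, 0.25), (0.0, 0.0)],
]

# Per-frame use: the first frame gets its own difference.
per_frame = apply_difference(dictionary_frame, diffs[0])
# Modified use: one averaged difference is applied to every frame.
overall = mean_difference(diffs)
steady = apply_difference(dictionary_frame, overall)
```

The averaged variant needs only one set of four pairs for the whole utterance, which is the trade-off the text describes: less data, but no moment-by-moment timbre movement.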

11 is a speech synthesis data generation unit that converts the "segment information", "pitch information", "volume information", and "timbre data" obtained so far into the format of the target speech synthesizer.
FIG. 4 shows an example of the speech synthesis data output from the speech synthesis data generation unit 11. FIG. 4(A) shows the data for utterance (utterance data: utterance start time, utterance end time, and phoneme to be uttered), (B) the pitch trajectory (pitch data: time of each pitch change and the pitch), (C) the volume trajectory (volume data: time of each volume change and the volume), and (D) the timbre data (time of each timbre change, and the change (difference) in intensity and frequency for each index number). These are output in a format that can be input to the target speech synthesizer.
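The four parts (A) to (D) listed above suggest a simple container. The Python sketch below is only one possible layout: the class and field names, the JSON encoding, and every value are assumptions, since the text leaves the concrete format to the target synthesizer.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Tuple


@dataclass
class SpeechSynthesisData:
    # (A) utterance data: (start time, end time, phoneme to be uttered)
    utterances: List[Tuple[float, float, str]] = field(default_factory=list)
    # (B) pitch data: (time of change, pitch in Hz)
    pitch: List[Tuple[float, float]] = field(default_factory=list)
    # (C) volume data: (time of change, volume)
    volume: List[Tuple[float, float]] = field(default_factory=list)
    # (D) timbre data: (time of change, per-index (frequency, intensity) differences)
    timbre: List[Tuple[float, List[Tuple[float, float]]]] = field(default_factory=list)


# Made-up values for a short utterance.
data = SpeechSynthesisData(
    utterances=[(0.00, 0.12, "o"), (0.12, 0.30, "ha"), (0.30, 0.55, "jo")],
    pitch=[(0.00, 120.0), (0.30, 150.0), (0.55, 110.0)],
    volume=[(0.00, 0.2), (0.12, 0.8), (0.55, 0.1)],
    timbre=[(0.12, [(100.0, 0.25), (0.0, 0.0), (-100.0, 0.0), (0.0, 0.25)])],
)

# Serialize to a hypothetical interchange format and read it back.
encoded = json.dumps(asdict(data), ensure_ascii=False)
decoded = json.loads(encoded)
```

Because each track stores only change points, the encoded record stays short even when the utterance is long and steady.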

The components above constitute the speech synthesis data generation device of the present invention. The segmentation unit 3, the pitch detection unit 4, the volume detection unit 5, the timbre information extraction unit 6 (the timbre data detection unit 7, the average timbre data extraction unit 9, and the timbre difference extraction unit 10), and the speech synthesis data generation unit 11 can be realized by program processing on a computer (the speech synthesis data generation program).
Supplying the speech synthesis data generated by this device or program to a speech synthesizer makes it possible to synthesize high-quality speech with a small amount of data.

12 is a speech synthesis unit that synthesizes speech according to the input speech synthesis data. The speech synthesis unit 12 is connected to the same synthesized speech dictionary 8 as the one connected to the average timbre data extraction unit 9. That is, the speech synthesis data generation device and the speech synthesizer are each provided with the same synthesized speech dictionary 8.
The speech synthesis unit 12 modifies the frame data read from the synthesized speech dictionary 8 based on the input speech synthesis data, synthesizes speech based on the modified frame data, and outputs the synthesized speech. Specifically, it reads the frame data corresponding to the phoneme to be uttered from the synthesized speech dictionary 8 according to the utterance data, and modifies the frame data so that the pitch follows the pitch data and the volume follows the volume data. It also has means for adding the timbre (difference) data to the frame data. As a result, even changes in timbre are reproduced.
Addition is used with the synthesis method described in this embodiment; needless to say, multiplication, subtraction, or another operation is used as appropriate for the target speech synthesis method. The speech synthesis unit 12 can also be realized by program processing on a computer.
In this way, a high-quality synthesized speech waveform 13 is obtained.
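The frame modification step of the speech synthesis unit 12 can be sketched in Python with the combining operation left pluggable, since the text notes that addition suits this embodiment while other synthesis methods may call for multiplication or subtraction. The `SynthFrame` class, the `modify_frame` function, and the values are hypothetical; in the multiplicative case the timbre data is assumed to hold ratios rather than differences.

```python
import operator
from dataclasses import dataclass
from typing import Callable, List, Tuple

Pair = Tuple[float, float]  # (frequency in Hz, intensity)


@dataclass
class SynthFrame:
    pitch: float       # taken from the pitch data
    volume: float      # taken from the volume data
    pairs: List[Pair]  # dictionary pairs after the timbre data is applied


def modify_frame(dictionary_pairs: List[Pair], pitch: float, volume: float,
                 timbre: List[Pair],
                 combine: Callable[[float, float], float] = operator.add
                 ) -> SynthFrame:
    """Build the frame to synthesize from a dictionary frame.

    `combine` is applied index by index to frequencies and intensities;
    it defaults to addition, as in the embodiment.
    """
    pairs = [(combine(f, tf), combine(a, ta))
             for (f, a), (tf, ta) in zip(dictionary_pairs, timbre)]
    return SynthFrame(pitch=pitch, volume=volume, pairs=pairs)


dictionary_pairs = [(800.0, 1.0), (1200.0, 0.5), (2500.0, 0.5), (3500.0, 0.25)]

# Additive timbre data (differences), as in the embodiment.
added = modify_frame(dictionary_pairs, 150.0, 0.8,
                     [(100.0, 0.25), (0.0, 0.0), (-100.0, 0.0), (0.0, 0.25)])

# Multiplicative timbre data (ratios), an assumed alternative.
scaled = modify_frame(dictionary_pairs, 150.0, 0.8,
                      [(1.5, 2.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0)],
                      combine=operator.mul)
```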

As described above, the speech synthesis data generation device or program of the present invention can generate "speech synthesis data" directly from an input speech waveform. If this data is converted into an "input character string" with appropriate prosodic symbols, it can also be applied as is to a conventional synthesis system that accepts only character string input. For example, the converted input character string is used as the input to the character string input unit 51 in FIG. 5. However, this is limited to speech synthesis systems of the type that accept prosodic symbols.

The speech synthesis data generated by the speech synthesis data generation device or program of the present invention can be applied in cases such as the following.
(Application example 1)
A user creates speech synthesis data to their liking with the speech synthesis data generation device or program of the present invention and sends it to a terminal, such as a mobile phone, that has the speech synthesizer of the present invention described above. When another user who receives this data plays it back, the terminal produces not the uniform sound typical of speech synthesis but natural, emotionally expressive synthesized speech. The important point is that the amount of speech synthesis data is far smaller than that of raw speech data such as PCM.

(Application example 2)
In a system that performs singing synthesis from input such as note data, phoneme data, pitch bend, and control changes for volume, it is very difficult to create data for speech or rap. This is because phoneme lengths and intonation (changes in pitch and volume) vary far more sharply than in ordinary singing, which makes natural data very hard to author. In such a system, using the speech synthesis data generation device or program of the present invention allows the data to be extracted from natural speech input, so such speech synthesis data can be created easily.

The invention is not limited to speech synthesis systems. In a MIDI synthesizer system for musical tones, for example, when synthesizing a tone rich in inflection and timbre change such as a saxophone, the speech synthesis data generation device of the present invention can be used to extract the changes in pitch, volume, and timbre from an input performance. Supplying the resulting sequence data (program number, that is, the number of the base timbre; timbre parameters expressed as, for example, exclusive data; note number; control change values for volume control; pitch bend values; and so on) to the synthesizer yields a realistic synthesized sound. In this case the phoneme segmentation is no longer needed, but note boundaries should still be detected.
Furthermore, not only for speech but also when synthesizing expressive musical tones that carry a large amount of information, creating the input data for the synthesis system (for example, a formant sound source) by hand is difficult. Using the speech synthesis data generation device of the present invention to assist makes creating such data easy. However, a reference performance to serve as the basis is required. In CG terms, this corresponds to motion capture.
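For the MIDI case mentioned above, a detected pitch has to be expressed as a note number plus a pitch bend value. The Python sketch below shows one way to do that using standard MIDI conventions (A4 = 440 Hz = note 69, a 14-bit bend value centered at 8192). The function names and the bend range of plus or minus 2 semitones are assumptions; the actual range depends on how the receiving synthesizer is configured.

```python
import math


def hz_to_midi(freq_hz: float) -> float:
    """Convert a frequency to a fractional MIDI note number (A4 = 69)."""
    return 69.0 + 12.0 * math.log2(freq_hz / 440.0)


def pitch_bend_value(freq_hz: float, note_number: int,
                     bend_range: float = 2.0) -> int:
    """14-bit pitch bend (0..16383, center 8192) that moves note_number to freq_hz.

    bend_range is the synthesizer's bend range in semitones; offsets
    beyond that range are clamped to the nearest representable value.
    """
    offset = hz_to_midi(freq_hz) - note_number
    value = 8192 + round(offset / bend_range * 8192)
    return max(0, min(16383, value))


center = pitch_bend_value(440.0, 69)                       # no bend needed
one_up = pitch_bend_value(440.0 * 2 ** (1 / 12), 69)       # one semitone above A4
too_far = pitch_bend_value(880.0, 69)                      # an octave up: clamped
```

Applying this to each point of a detected pitch trajectory gives the stream of pitch bend values that the sequence data would carry alongside the note number.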

FIG. 1 is a block diagram showing the configuration of an embodiment of the speech synthesis data generation device of the present invention and of a speech synthesizer that synthesizes speech based on the speech synthesis data it generates.
FIG. 2 shows an example of timbre analysis in the timbre data detection unit 7.
FIG. 3 shows an example of the difference information extracted by the timbre difference extraction unit 10: (A) differences in intensity, (B) differences in frequency.
FIG. 4 shows an example of the speech synthesis data output from the speech synthesis data generation unit 11.
FIG. 5 shows the configuration of a general speech synthesis system.
FIG. 6 shows an example of the (A) speech waveform, (B) pitch, and (C) volume when "Ohayo" is spoken.
FIG. 7 shows an example of the (A) speech waveform, (B) pitch, and (C) volume when a speaker different from that of FIG. 6 says "Ohayo".

Explanation of symbols

1: input phoneme symbol string, 2: input speech waveform, 3: segmentation unit, 4: pitch detection unit, 5: volume detection unit, 6: timbre information extraction unit, 7: timbre data detection unit, 8: synthesized speech dictionary, 9: average timbre data extraction unit, 10: timbre difference extraction unit, 11: speech synthesis data generation unit, 12: speech synthesis unit, 13: synthesized speech waveform

Claims

1. A speech synthesis data generation device that analyzes an input speech waveform and generates speech synthesis data to be supplied to a speech synthesizer, comprising:
means for analyzing the input speech waveform and outputting segment information indicating phoneme symbols and phoneme durations, pitch information, and volume information;
means for performing on the input speech waveform the same kind of timbre analysis as in the speech synthesizer, and outputting as timbre data a difference value between the analysis result and dictionary data of the corresponding phoneme; and
means for generating speech synthesis data from the segment information, the pitch information, the volume information, and the timbre data.

2. The speech synthesis data generation device according to claim 1, wherein the difference value is a difference value between the analysis result and an average value of frame data in the dictionary data of the corresponding phoneme.

3. The speech synthesis data generation device according to claim 1, wherein the difference value is a difference value between the analysis result and data of temporally matching frames in the dictionary data of the corresponding phoneme.

4. The speech synthesis data generation device according to claim 1, wherein the means for outputting the difference value as timbre data outputs the difference value calculated for each frame as the timbre data.

5. The speech synthesis data generation device according to claim 1, wherein the means for outputting the difference value as timbre data outputs the difference value averaged over a plurality of frames as the timbre data.

6. A speech synthesizer that synthesizes speech based on speech synthesis data including segment information indicating the phoneme symbols and phoneme durations of the speech to be synthesized, pitch information, volume information, and timbre data, comprising:
means for modifying frame data, read from a synthesized speech dictionary and corresponding to the phoneme to be uttered, according to the pitch information, the volume information, and the timbre data, and outputting the result.

7. A speech synthesis data generation program for causing a computer to execute the steps of:
analyzing an input speech waveform and outputting segment information indicating phoneme symbols and phoneme durations, pitch information, and volume information;
performing on the input speech waveform the same kind of timbre analysis as in the target speech synthesizer, and outputting as timbre data a difference value between the analysis result and dictionary data of the corresponding phoneme; and
generating speech synthesis data from the segment information, the pitch information, the volume information, and the timbre data.

8. A speech synthesis program for causing a computer to execute a process of synthesizing speech based on speech synthesis data including segment information, pitch information, volume information, and timbre data of the speech to be synthesized, comprising the step of:
modifying frame data, read from a synthesized speech dictionary and corresponding to the phoneme to be uttered, according to the pitch information, the volume information, and the timbre data, and outputting the result.