JP2009133890A

JP2009133890A - Speech synthesis apparatus and method

Info

Publication number: JP2009133890A
Application number: JP2007307578A
Authority: JP
Inventors: Ryo Morinaka; 亮森中; Takehiko Kagoshima; 岳彦籠嶋
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2007-11-28
Filing date: 2007-11-28
Publication date: 2009-06-18

Abstract

PROBLEM TO BE SOLVED: To provide a voice synthesizing device capable of generating a natural synthesized voice of high tone quality.

SOLUTION: A voice synthesizing part 4 is composed of a voice element storage part 40, a phoneme environment storage part 41, a phoneme sequence-rhythm information input part 42, a voice element selecting-fusing part 43, and a voice element editing-connecting part 44. The voice element selecting-fusing part 43 is composed of a voice element selecting part 430, a voice element fusing part 431 and a phoneme environment computing part 432. An optimum element sequence is obtained in the voice element selecting part 430 using a phoneme environment parameter of the fused voice elements, and the phoneme environment parameter of the fused voice elements is obtained to synthesize voice.

COPYRIGHT: (C)2009,JPO&INPIT

Description

本発明は、テキスト音声合成に関し、特に音韻記号列、ピッチ、音韻継続時間長などの情報から音声信号を生成する音声合成装置及びその方法に関する。 The present invention relates to text-to-speech synthesis, and more particularly to a speech synthesis apparatus and method for generating a speech signal from information such as phoneme symbol strings, pitches, and phoneme durations.

任意の文章から人工的に音声信号を作り出すことを「テキスト音声合成」という。テキスト音声合成は、一般的に言語処理部、韻律処理部及び音声合成部の３つの段階から構成されるものである。 Artificially creating speech signals from arbitrary sentences is called “text-to-speech synthesis”. Text-to-speech synthesis is generally composed of three stages: a language processing unit, a prosody processing unit, and a speech synthesis unit.

In the first stage, the input text undergoes morphological analysis, syntactic analysis, and the like in the language processing unit. In the second stage, the prosody processing unit performs accent and intonation processing and outputs a phoneme sequence and prosodic information (fundamental frequency, phoneme duration, power, etc.). In the final stage, the speech signal synthesis unit synthesizes a speech signal from the phoneme sequence and prosodic information, which realizes text-to-speech synthesis.

The principle of a synthesizer capable of synthesizing such an arbitrary phoneme symbol string is as follows. With vowels denoted V and consonants denoted C, feature parameters (speech units) of small basic units of speech such as CV, CVC, and VCV are stored, and speech is synthesized by connecting them while controlling pitch and duration. In this method, the stored speech units largely determine the quality of the synthesized speech.

One such speech synthesis method is the multiple-speech-unit selection and fusion method (see, for example, Patent Document 1). With the input phoneme sequence and prosodic information as the target, a plurality of speech units is selected for each synthesis unit from a large number of speech units, the selected speech units are fused to generate a new speech unit, and the new speech units are connected to synthesize speech.

In this multiple-speech-unit selection and fusion method, speech units are first selected from a large number of pre-stored speech units on the basis of the input phoneme sequence and prosodic information. One selection method defines the degree of distortion of the synthesized speech as a cost function and selects speech units so that the cost becomes small.

For example, the target distortion, which represents differences in prosody, phoneme environment, and the like between the target speech and each speech unit, and the connection distortion that arises when speech units are connected are quantified as costs. The speech units used for synthesis are selected on the basis of these costs. The selected speech units are then fused, for example by averaging their pitch waveforms or by taking the centroid of the selected speech units. This suppresses the loss of sound quality caused by editing and connecting speech units and yields stable synthesized speech.
Patent Document 1: Japanese Unexamined Patent Application Publication No. 2005-164749 (JP 2005-164749)

In the multiple-speech-unit selection and fusion method described above, fused speech unit candidates are selected without considering the connection distortion with the speech units that are actually used in the synthesized speech. Instead, the candidates are determined from the connection distortion with the speech units on an optimal unit sequence obtained in advance from the stored speech units, that is, with speech units that are not actually used in the synthesized speech. As a result, discontinuities arise at the joins of the generated synthesized speech.

In view of the above problem, an object of the present invention is to provide a speech synthesis apparatus and method capable of generating more natural, higher-quality synthesized speech in multiple-speech-unit selection and fusion speech synthesis.

The present invention is a speech synthesis apparatus comprising: a storage unit that stores a speech unit group and a phoneme environment parameter for each speech unit in the group; a selection unit that, for each of a plurality of segments obtained by dividing a phoneme sequence corresponding to the target speech to be synthesized into synthesis units, selects a plurality of first speech units from the speech unit group; a generation unit that generates one second speech unit by fusing the plurality of first speech units; a parameter calculation unit that calculates the phoneme environment parameter of the second speech unit; and a synthesis unit that generates synthesized speech by connecting the second speech units generated for the respective segments.

The selection unit includes: a segment setting unit that sets, from among the segments, one segment for which the first speech units are to be selected as the segment of interest; an extraction unit that extracts from the speech unit group, as speech unit candidates, a plurality of speech units having the same characteristics as the phoneme of the segment of interest; a first cost calculation unit that calculates, for each speech unit candidate of the segment of interest, a target cost representing the amount of distortion of the synthesized speech generated using that candidate, from the phoneme environment parameter of the candidate and the prosodic information of the target speech; a second cost calculation unit that calculates connection costs, each representing the amount of distortion that arises when the speech unit of an adjacent segment (the segment immediately before or after the segment of interest) is connected to a speech unit candidate of the segment of interest, where (1) if the adjacent segment has the second speech unit, the connection cost is calculated from the phoneme environment parameter of the second speech unit and the phoneme environment parameter of each speech unit candidate, and (2) if the adjacent segment does not have the second speech unit, the connection cost is calculated from the phoneme environment parameter of the speech unit candidate corresponding to the segment of interest and the phoneme environment parameter of the speech unit candidate corresponding to the adjacent segment; and a speech unit selection unit that selects, as the first speech units, a plurality of speech unit candidates of the segment of interest for which the total of the target cost and the connection cost is low.

The present invention is also a speech synthesis apparatus comprising: a storage unit that stores a speech unit group and a phoneme environment parameter for each speech unit in the group; a segment setting unit that sets, as the segment of interest, one of a plurality of segments obtained by dividing a phoneme sequence corresponding to the target speech to be synthesized into synthesis units; an extraction unit that extracts from the speech unit group, as third speech units, a plurality of speech units having the same characteristics as the phoneme of the segment of interest; a generation unit that generates a fourth speech unit by fusing the plurality of third speech units; a parameter calculation unit that calculates the phoneme environment parameter of the fourth speech unit; a third cost calculation unit that calculates target costs, each representing the amount of distortion of the synthesized speech generated using one of the third speech units or the fourth speech unit of the segment of interest, from the phoneme environment parameters of the third and fourth speech units and the prosodic information of the target speech; a fourth cost calculation unit that calculates connection costs, each representing the amount of distortion that arises when the third and fourth speech units of an adjacent segment (the segment immediately before or after the segment of interest) are connected to the third and fourth speech units of the segment of interest, from the phoneme environment parameters of the third and fourth speech units of the segment of interest and those of the adjacent segment; a speech unit selection unit that selects, as fifth speech units, a plurality of speech units for which the total of the target cost and the connection cost is low from among the third speech units and the fourth speech unit of the segment of interest; and a synthesis unit that generates synthesized speech by connecting the fifth speech units generated for the respective segments.

According to the present invention, sound quality degrades less than when the connection distortion with the fused speech units is not taken into account, so a speech synthesis method that generates more natural, higher-quality synthesized speech can be provided.

本発明の実施形態におけるテキスト音声合成を行う音声合成装置について図面を参照して説明する。 A speech synthesizer that performs text-to-speech synthesis according to an embodiment of the present invention will be described with reference to the drawings.

(First embodiment)
A speech synthesizer according to a first embodiment of the present invention will be described with reference to FIGS. 1 to 14.

(1) Configuration of the speech synthesizer
FIG. 1 is a block diagram showing the configuration of the speech synthesizer according to this embodiment.

図１に示すように、音声合成装置は、テキスト入力部１、言語処理部２、韻律処理部３、音声合成部４、音声波形出力部５から構成される。 As shown in FIG. 1, the speech synthesizer includes a text input unit 1, a language processing unit 2, a prosody processing unit 3, a speech synthesis unit 4, and a speech waveform output unit 5.

各部の機能は、コンピュータに格納されたプログラムによっても実現できる。 The function of each part can also be realized by a program stored in a computer.

言語処理部２は、テキスト入力部１から入力されるテキストの形態素解析・構文解析を行い、その結果を韻律処理部３へ送る。 The language processing unit 2 performs morphological analysis / syntax analysis of the text input from the text input unit 1 and sends the result to the prosody processing unit 3.

韻律処理部３は、言語解析結果からアクセントやイントネーションの処理を行い、音韻系列（音韻記号列）及び韻律情報を生成し、音声合成部４へ送る。 The prosody processing unit 3 performs accent and intonation processing from the language analysis result, generates a phoneme sequence (phoneme symbol string) and prosody information, and sends them to the speech synthesis unit 4.

音声合成部４は、音韻系列及び韻律情報から音声波形を生成する。 The speech synthesizer 4 generates a speech waveform from the phoneme sequence and prosodic information.

音声波形出力部５は、こうして生成された音声波形を出力する。 The voice waveform output unit 5 outputs the voice waveform thus generated.

(2) Configuration of speech synthesis unit 4
FIG. 2 is a block diagram showing a configuration example of the speech synthesis unit 4 in FIG. 1.

As shown in FIG. 2, the speech synthesis unit 4 consists of a speech unit storage unit 40, a phoneme environment storage unit 41, a phoneme sequence / prosodic information input unit 42, a speech unit selection / fusion unit 43, and a speech unit editing / connection unit 44.

The functions of units 40 to 44 are described in detail below.

(3) Speech unit storage unit 40
The speech unit storage unit 40 holds a large number of speech units, stored in the unit of speech (synthesis unit) that is used when generating synthesized speech.

A synthesis unit is a phoneme or a combination of phonemes or of parts obtained by dividing phonemes. Examples include half-phonemes, phonemes (C, V), diphones (CV, VC, VV), triphones (CVC, VCV), and syllables (CV, V). The units may also be of variable length, for example a mixture of these. V denotes a vowel and C a consonant.

また、音声素片は、合成単位に対応する音声信号の波形もしくはその特徴を表すパラメータ系列などを表すものとする。 In addition, the speech segment represents a waveform of a speech signal corresponding to a synthesis unit or a parameter series representing its characteristics.

When the speech units are phonemes, for example, the speech unit storage unit 40 stores the waveform of the speech signal of each phoneme together with a speech unit number that identifies the phoneme, as shown in FIG. 4. The speech units stored in the speech unit storage unit 40 are obtained by labeling a large amount of separately collected speech data phoneme by phoneme and cutting out the speech waveform of each phoneme.

(4) Phoneme environment storage unit 41
The phoneme environment storage unit 41 stores the phoneme environment parameters of the speech units stored in the speech unit storage unit 40.

The phoneme environment parameter of a speech unit is information corresponding to the combination of factors that make up the phoneme environment of that speech unit. Examples of such factors are the phoneme name of the speech unit, the preceding phoneme, the succeeding phoneme, the phoneme after the succeeding phoneme, the fundamental frequency, the phoneme duration, the power, the presence or absence of stress, the position relative to the accent nucleus, the time since the last breath pause, the speaking rate, and the emotion.

When the speech units are phonemes, for example, the phoneme environment storage unit 41 stores the phoneme environment parameters of each phoneme held in the speech unit storage unit 40 in association with that phoneme's speech unit number, as shown in FIG. 5. Here, the stored phoneme environment parameters are the phoneme symbol (phoneme name), the fundamental frequency, the phoneme duration, and the cepstrum coefficients at both ends of the speech unit.
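
As a minimal sketch of the table described above, the phoneme environment parameters can be held in a mapping keyed by speech unit number. The class name, field names, units (Hz, ms), and sample values below are all hypothetical and chosen only for illustration; the patent text does not specify a storage layout.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class PhonemeEnvironment:
    """Phoneme environment parameters of one speech unit (hypothetical layout)."""
    phoneme: str                        # phoneme symbol (phoneme name)
    f0_hz: float                        # fundamental frequency (unit assumed)
    duration_ms: float                  # phoneme duration (unit assumed)
    cepstrum_start: Tuple[float, ...]   # cepstrum coefficients at the left edge
    cepstrum_end: Tuple[float, ...]     # cepstrum coefficients at the right edge


# Storage unit 41 sketched as a table keyed by speech unit number.
# The entries are made-up illustrative values, not data from the patent.
phoneme_environment_store: Dict[int, PhonemeEnvironment] = {
    1: PhonemeEnvironment("a", 200.0, 80.0, (0.1, 0.2), (0.3, 0.1)),
    2: PhonemeEnvironment("i", 210.0, 60.0, (0.2, 0.0), (0.1, 0.4)),
    3: PhonemeEnvironment("i", 190.0, 75.0, (0.0, 0.3), (0.2, 0.2)),
}


def candidates_for(phoneme: str) -> Dict[int, PhonemeEnvironment]:
    """Return every stored unit whose phoneme symbol matches the request."""
    return {number: env
            for number, env in phoneme_environment_store.items()
            if env.phoneme == phoneme}
```

Looking up candidates by phoneme symbol, as `candidates_for("i")` does, mirrors the later step in which only units with the same phoneme name as the segment of interest are considered.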

(5) Phoneme sequence / prosodic information input unit 42
The phoneme sequence / prosodic information input unit 42 receives the phoneme sequence and prosodic information of the target speech output from the prosody processing unit 3. The prosodic information input to the phoneme sequence / prosodic information input unit 42 includes the fundamental frequency, the phoneme duration, the power, and the like.

以下、音韻系列・韻律情報入力部４２に入力される音韻系列と韻律情報を、それぞれ「入力音韻系列」、「入力韻律情報」と呼ぶ。入力音韻系列は、例えば音韻記号の系列である。 Hereinafter, the phoneme sequence and the prosody information input to the phoneme sequence / prosodic information input unit 42 are referred to as “input phoneme sequence” and “input prosody information”, respectively. The input phoneme sequence is a sequence of phoneme symbols, for example.

When the speech units are phonemes, for example, the phoneme sequence / prosodic information input unit 42 receives the phoneme sequence and prosodic information obtained by applying morphological and syntactic analysis to the input text for text-to-speech synthesis and then performing accent and intonation processing. The input prosodic information is assumed to include the fundamental frequency and the phoneme duration.

(6) Speech unit selection / fusion unit 43
Next, the speech unit selection / fusion unit 43 will be described.

図３は、図２の音声素片選択・融合部４３の構成例を示すブロック図である。 FIG. 3 is a block diagram illustrating a configuration example of the speech unit selection / fusion unit 43 of FIG.

図３に示すように、音声素片選択・融合部４３は、音声素片選択部４３０、音声素片融合部４３１、音素環境算出部４３２により構成される。 As shown in FIG. 3, the speech unit selection / fusion unit 43 includes a speech unit selection unit 430, a speech unit fusion unit 431, and a phoneme environment calculation unit 432.

(6-1) Speech unit selection unit 430
For each of the segments obtained by dividing the input phoneme sequence into synthesis units, the speech unit selection unit 430 estimates an amount of distortion, that is, the degree of distortion between the input prosodic information and either the prosodic information contained in the phoneme environment parameters of the stored speech units or the phoneme environment parameters of the fused speech units obtained by the phoneme environment calculation unit 432 described later. It then selects speech units from those stored in the speech unit storage unit 40 so as to minimize this amount of distortion.

歪み量としては、後述するコスト関数を用いることができるが、これに限定するものではない。 As the amount of distortion, a cost function described later can be used, but is not limited to this.

(6-2) Speech unit fusion unit 431
The speech unit fusion unit 431 fuses the plurality of speech units selected by the speech unit selection unit 430 to generate a new speech unit.
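
The background section names averaging of pitch waveforms as one way to fuse units. The sketch below illustrates only that idea in its simplest form: it assumes the selected waveforms have already been aligned and resampled to the same length, which the patent text does not describe here, and the function name is hypothetical.

```python
from typing import List, Sequence


def fuse_by_averaging(waveforms: Sequence[Sequence[float]]) -> List[float]:
    """Fuse aligned, equal-length waveforms by sample-wise averaging.

    Assumption: alignment and length normalization happened beforehand.
    """
    if not waveforms:
        raise ValueError("at least one waveform is required")
    length = len(waveforms[0])
    if any(len(w) != length for w in waveforms):
        raise ValueError("waveforms must have equal length")
    # zip(*waveforms) yields, for each sample position, one sample per waveform.
    return [sum(samples) / len(waveforms) for samples in zip(*waveforms)]
```

For instance, fusing `[0.0, 2.0]` and `[2.0, 4.0]` gives `[1.0, 3.0]`.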

(6-3) Phoneme environment calculation unit 432
The phoneme environment calculation unit 432 calculates the phoneme environment parameters of the speech unit fused by the speech unit fusion unit 431. Performing this operation for every segment yields a new sequence of speech units corresponding to the sequence of phoneme symbols in the input phoneme sequence.

(7) Speech unit editing / connection unit 44
In the speech unit editing / connection unit 44, the new sequence of speech units is modified and connected on the basis of the input prosodic information to generate the speech waveform of the synthesized speech.

こうして生成された音声波形は図２の音声波形出力部５で出力される。 The speech waveform generated in this way is output by the speech waveform output unit 5 of FIG.

(8) Processing in the speech unit selection / fusion unit 43
Next, the flow of processing in the speech unit selection / fusion unit 43 is described with reference to FIG. 6, which is a flowchart of that processing. Here, the speech unit of the synthesis unit is assumed to be a phoneme.

なお、本実施形態では、合成音声のセグメントの数をＩ個とし、文頭から文末へ向けて（すなわち、時系列にしたがって）、音声素片を融合していくものとする。 In the present embodiment, the number of segments of the synthesized speech is I, and speech units are merged from the beginning of the sentence toward the end of the sentence (that is, in time series).

Steps S4300, S4310, S4320, and S4330 are repeated I times so that each of the I segments becomes the segment of interest exactly once. Each step is described below.
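
The repetition of steps S4300 to S4330 over the I segments can be sketched as the loop below. The four callables are hypothetical stand-ins for the processing of each step; their names and signatures are assumptions made for this illustration and are not part of the patent.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

Unit = TypeVar("Unit")


def select_and_fuse(
    num_segments: int,
    find_optimal_sequence: Callable[[List[Optional[Unit]]], List[Unit]],  # S4300
    select_top_units: Callable[[int, List[Unit]], Sequence[Unit]],        # S4310
    fuse: Callable[[Sequence[Unit]], Unit],                               # S4320
    attach_environment: Callable[[Unit], Unit],                           # S4330
) -> List[Optional[Unit]]:
    """Process segments from sentence start to end, one segment of interest at a time."""
    fused: List[Optional[Unit]] = [None] * num_segments
    for i in range(num_segments):
        # S4300: the optimal sequence uses fused units where they already exist.
        optimal = find_optimal_sequence(fused)
        # S4310: pick several units for segment i with the other segments fixed.
        chosen = select_top_units(i, optimal)
        # S4320 and S4330: fuse them and compute the fused unit's parameters.
        fused[i] = attach_environment(fuse(chosen))
    return fused
```

Because `fused` is passed back into step S4300 on every pass, later segments are selected against the units that will actually be used in the synthesized speech.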

(9) Step S4300
First, in step S4300, the optimal unit sequence is obtained on the basis of the cost function described below.

(9-1) Cost function
The cost function is defined as follows.

First, a sub-cost function C_n(u_i, u_{i-1}, t_i) (n = 1, ..., N, where N is the number of sub-cost functions) is defined for each factor of the distortion that arises when speech units are modified and connected to generate synthesized speech. Here, with the target speech corresponding to the input phoneme sequence and input prosodic information written as t = (t_1, ..., t_I), t_i denotes the target phoneme environment parameter information of the speech unit for the part corresponding to the i-th segment, and u_i denotes a speech unit, among those stored in the speech unit storage unit 40, that has the same phoneme as t_i.

サブコスト関数は、音声素片記憶部４０に記憶されている音声素片を用いて合成音声を生成したときに生ずる前記合成音声の目標音声に対する歪み量を推定するためのコストを算出するためのものである。 The sub-cost function is used to calculate a cost for estimating the amount of distortion of the synthesized speech with respect to the target speech that occurs when the synthesized speech is generated using the speech units stored in the speech unit storage unit 40. It is.

To calculate this cost, two kinds of sub-cost are used here: a target cost, which estimates the distortion of the synthesized speech relative to the target speech caused by using a given speech unit, and a connection cost, which estimates the distortion of the synthesized speech relative to the target speech caused by connecting that speech unit to another speech unit.

「目標コスト」としては、音声素片記憶部４０に記憶されている音声素片の基本周波数と目標の基本周波数との違い（差）を表す基本周波数コスト、音声素片の音韻継続時間長と目標の音韻継続時間長との違い（差）を表す音韻継続時間長コストを用いる。 As the “target cost”, the fundamental frequency cost representing the difference (difference) between the fundamental frequency of the speech element stored in the speech element storage unit 40 and the target fundamental frequency, the phoneme duration length of the speech element, A phoneme duration cost that represents a difference (difference) from the target phoneme duration is used.

「接続コスト」としては、接続境界でのスペクトルの違い（差）を表すスペクトル接続コストを用いる。 As the “connection cost”, a spectrum connection cost representing a difference (difference) in spectrum at the connection boundary is used.

Specifically, the fundamental frequency cost is calculated from equation (1) [equation not reproduced in this text]. Here, v_i denotes the phoneme environment parameter of the speech unit u_i stored in the speech unit storage unit 40, and f denotes a function that extracts the fundamental frequency from the phoneme environment parameter v_i.

The phoneme duration cost is calculated from equation (2) [equation not reproduced in this text]. Here, g denotes a function that extracts the phoneme duration from the phoneme environment parameter v_i.

The spectral connection cost is calculated from the cepstrum distance between two speech units, equation (3) [equation not reproduced in this text]. Here, h denotes a function that extracts the cepstrum coefficients at the connection boundary of the speech unit u_i as a vector.
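
The equation images for the three sub-costs do not survive in this text. As a hedged sketch only, one set of forms consistent with the definitions of f, g, and h above would be a squared difference of log fundamental frequencies, a squared difference of durations, and a Euclidean distance between boundary cepstrum vectors. The use of the logarithm, the squaring, and the choice of norm are assumptions, not taken from this document.

```latex
\begin{align}
C_1(u_i, u_{i-1}, t_i) &= \bigl\{\log f(v_i) - \log f(t_i)\bigr\}^2 \tag{1}\\
C_2(u_i, u_{i-1}, t_i) &= \bigl\{g(v_i) - g(t_i)\bigr\}^2 \tag{2}\\
C_3(u_i, u_{i-1}, t_i) &= \bigl\lVert h(u_i) - h(u_{i-1}) \bigr\rVert \tag{3}
\end{align}
```

In this sketch the first two terms depend only on the unit and its target, so they act as target costs, while the third depends on the unit and its predecessor, so it acts as the connection cost.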

The weighted sum of these sub-cost functions is defined as the synthesis unit cost function:

$$C(u_i, u_{i-1}, t_i) = \sum_{n=1}^{N} w_n \, C_n(u_i, u_{i-1}, t_i) \qquad (4)$$

Here, w_n is the weight of the n-th sub-cost function. In this embodiment, for simplicity, every w_n is set to 1. Equation (4) gives the synthesis unit cost of a speech unit when that speech unit is assigned to a given synthesis unit.
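
A small sketch of the sub-costs and their weighted sum follows. The weighted sum and the default weights of 1 come from the text; the concrete sub-cost forms (squared log-frequency difference, squared duration difference, Euclidean cepstrum distance) are assumptions, since the original equation images are not available, and all function names are hypothetical.

```python
import math
from typing import Optional, Sequence


def f0_cost(unit_f0: float, target_f0: float) -> float:
    """Fundamental frequency cost (assumed form: squared log difference)."""
    return (math.log(unit_f0) - math.log(target_f0)) ** 2


def duration_cost(unit_duration: float, target_duration: float) -> float:
    """Phoneme duration cost (assumed form: squared difference)."""
    return (unit_duration - target_duration) ** 2


def spectral_join_cost(prev_boundary: Sequence[float],
                       cur_boundary: Sequence[float]) -> float:
    """Spectral connection cost (assumed form: Euclidean cepstrum distance)."""
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(prev_boundary, cur_boundary)))


def unit_cost(sub_costs: Sequence[float],
              weights: Optional[Sequence[float]] = None) -> float:
    """Weighted sum of sub-costs; every weight defaults to 1 as in the text."""
    if weights is None:
        weights = [1.0] * len(sub_costs)
    return sum(w * c for w, c in zip(weights, sub_costs))
```

With unit weights, `unit_cost([1.0, 2.0, 3.0])` is simply the plain sum of the three sub-costs.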

For each of the segments obtained by dividing the input phoneme sequence into synthesis units, the synthesis unit cost is calculated from equation (4). The sum of these synthesis unit costs over all segments is called the cost, and the cost function that calculates it is defined by equation (5):

$$\mathrm{Cost} = \sum_{i=1}^{I} C(u_i, u_{i-1}, t_i) \qquad (5)$$
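
The sum over segments can be sketched as below. The target and connection cost functions are passed in as parameters so that no particular form is implied. Treating the first segment as having no connection cost is an assumption of this sketch, because the text does not say how the missing predecessor is handled.

```python
from typing import Callable, Sequence, TypeVar

U = TypeVar("U")
T = TypeVar("T")


def sequence_cost(
    units: Sequence[U],
    targets: Sequence[T],
    target_cost: Callable[[U, T], float],
    join_cost: Callable[[U, U], float],
) -> float:
    """Total cost of a unit sequence: per-segment costs summed over all segments."""
    total = 0.0
    for i, (unit, target) in enumerate(zip(units, targets)):
        total += target_cost(unit, target)
        if i > 0:  # assumption: the first segment has no predecessor to join with
            total += join_cost(units[i - 1], unit)
    return total
```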

(9-2) Optimal unit sequence
In step S4300 of FIG. 6, the cost functions of equations (1) to (5) are used to find, with one speech unit per segment (that is, per synthesis unit), the sequence of speech units that minimizes the cost calculated by equation (5).

また、後述する音声素片融合部４３１で生成された融合音声素片が存在しないセグメントに対しては音声素片記憶部４０中の音声素片の中から１つを用いて、上記式（５）で算出されるコストの値が最小の音声素片の系列を求める。 Further, for a segment for which there is no fused speech unit generated by the speech unit fusion unit 431 described later, one of the speech units in the speech unit storage unit 40 is used, and the above equation (5) is used. ) To obtain a speech unit sequence having the smallest cost value.

さらに、音声素片融合部４３１で生成された融合音声素片が存在するセグメントに対しては融合音声素片を用いて、上記式（５）で算出されるコストの値が最小の音声素片の系列を求める。 Furthermore, for the segment in which the fusion speech unit generated by the speech unit fusion unit 431 exists, the fusion speech unit is used, and the speech unit having the smallest cost value calculated by the above equation (5) is used. Find the series.

The combination of speech units that minimizes this cost is called the “optimal unit sequence”. Each speech unit in the optimal unit sequence corresponds to one of the segments obtained by dividing the input phoneme sequence into synthesis units, and the cost calculated by equation (5) from the synthesis unit costs of the units in the optimal unit sequence is smaller than that of any other speech unit sequence. The optimal unit sequence can be searched for efficiently using dynamic programming (DP).
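
A dynamic programming search of this kind can be sketched as a Viterbi-style pass over the candidates of each segment. This is a generic illustration, not the patent's implementation: the function and parameter names are hypothetical, the cost functions are supplied by the caller, and a segment that already has a fused unit is modeled simply as a candidate list with a single entry.

```python
from typing import Callable, List, Sequence, TypeVar

U = TypeVar("U")
T = TypeVar("T")


def optimal_unit_sequence(
    candidates: Sequence[Sequence[U]],
    targets: Sequence[T],
    target_cost: Callable[[U, T], float],
    join_cost: Callable[[U, U], float],
) -> List[U]:
    """Return one unit per segment minimizing summed target and connection costs."""
    # best[i][k]: lowest cost of any path that ends in candidates[i][k].
    best = [[target_cost(u, targets[0]) for u in candidates[0]]]
    # back[i][k]: index of the predecessor candidate on that cheapest path.
    back = [[-1] * len(candidates[0])]
    for i in range(1, len(candidates)):
        row, pointers = [], []
        for unit in candidates[i]:
            costs = [best[i - 1][j] + join_cost(prev, unit)
                     for j, prev in enumerate(candidates[i - 1])]
            j_min = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[j_min] + target_cost(unit, targets[i]))
            pointers.append(j_min)
        best.append(row)
        back.append(pointers)
    # Trace back from the cheapest final candidate.
    k = min(range(len(best[-1])), key=best[-1].__getitem__)
    path: List[U] = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][k])
        k = back[i][k]
    path.reverse()
    return path
```

The search visits each pair of adjacent candidates once, so its work grows linearly with the number of segments rather than exponentially as in an exhaustive enumeration of sequences.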

(9-3) Specific example
For example, as shown in FIG. 7, suppose that the input phoneme sequence is “ts, i, i, s, a, ...”. In this case, the synthesis units correspond to the phonemes “ts”, “i”, “i”, “s”, “a”, ..., and each of these phonemes corresponds to one segment.

Let the segment corresponding to the third phoneme “i” in the input phoneme sequence be the segment of interest. That is, for the first phoneme “ts” and the second phoneme “i”, the speech units have already been fused in step S4320, and the phoneme environment parameters of the fused speech units have already been calculated in step S4330.

In this case, the speech units 4301a and 4301b corresponding to the first phoneme “ts” and the second phoneme “i” on the optimal unit sequence are fused speech units, whereas for the remaining third phoneme “i”, fourth phoneme “s”, fifth phoneme “a”, and so on, the speech units 4301c, 4301d, 4301e, ... are selected from the speech unit storage unit 40 as the speech units on the optimal unit sequence 4301.

With the speech unit sequence made up of the speech units 4301a, 4301b, 4301c, 4301d, 4301e, ... on this optimal unit sequence 4301, the cost calculated by equation (5) from the synthesis unit costs is smaller than for any other speech unit sequence.

(10) Step S4310
Next, the process proceeds to step S4310, where a plurality of speech units is selected for each segment using the optimal unit sequence obtained in step S4300. Here, the description assumes that M speech units are selected for each of I segments. Details are shown in the flowchart of FIG. 8.

In step S4311, the speech units are ranked according to the cost calculated by equation (5), and in step S4312 the top M speech units are selected.

For example, in FIG. 9, as in FIG. 7, assume that the input phoneme sequence is "ts·i·i·s·a·...". FIG. 9 shows a case in which the segment corresponding to the third phoneme "i" in the input phoneme sequence is the segment of interest and a plurality of speech units is obtained for it. For the segments other than the one corresponding to the third phoneme "i", the speech units 4313a, 4313b, 4313d, 4313e, ... in the optimal unit sequence are held fixed. Compared with the speech units in the optimal sequence of FIG. 7, speech unit 4313a corresponds to the fused speech unit 4301a, speech unit 4313b to the fused speech unit 4301b, speech unit 4313d to speech unit 4301d, and speech unit 4313e to speech unit 4301e.

In this state, the cost is calculated using equation (5) for each of the speech units stored in the speech unit storage unit 40 that have the same phoneme name (phoneme symbol) as the phoneme "i" of the segment of interest. When the cost is computed for each of these speech units, however, the only values that change are the target cost of the segment of interest, the connection cost between the segment of interest and the preceding segment, and the connection cost between the segment of interest and the following segment, so only these costs need to be considered.

Specifically, the procedure is as follows.

(Procedure 1) One of the speech units stored in the speech unit storage unit 40 that has the same phoneme name (phoneme symbol) as the phoneme "i" of the segment of interest is taken as speech unit u_3. The fundamental frequency cost is calculated using equation (1) from the fundamental frequency f(v_3) of speech unit u_3 and the target fundamental frequency f(t_3).

(Procedure 2) The phoneme duration cost is calculated using equation (2) from the phoneme duration g(v_3) of speech unit u_3 and the target phoneme duration g(t_3).

(Procedure 3) A first spectral connection cost is calculated using equation (3) from the cepstrum coefficients h(u_3) of speech unit u_3 and the cepstrum coefficients h(u_2) of the fused speech unit 4313b (u_2). Likewise, a second spectral connection cost is calculated using equation (3) from the cepstrum coefficients h(u_3) of speech unit u_3 and the cepstrum coefficients h(u_4) of the fused speech unit 4313d (u_4).

(Procedure 4) The cost of speech unit u_3 is calculated as the weighted sum of the fundamental frequency cost, the phoneme duration cost, and the first and second spectral connection costs obtained with the respective sub-cost functions in Procedures 1 to 3.

(Procedure 5) Once the cost has been calculated according to Procedures 1 to 4 for every speech unit stored in the speech unit storage unit 40 that has the same phoneme name (phoneme symbol) as the phoneme "i" of the segment of interest, the speech units are ranked so that a smaller cost gives a higher rank (step S4311 in FIG. 8). The top M speech units are then selected (step S4312 in FIG. 8). For example, in FIG. 9, speech unit 4314a has the highest rank and speech unit 4314d the lowest.

Procedures 1 to 5 above are carried out for each segment. As a result, M speech units are obtained for each segment.
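Procedures 1 to 5 can be illustrated with the following sketch. It is written under several assumptions: equations (1) to (3) and (5) are not reproduced in this part of the text, so the squared log-frequency difference, squared duration difference, and Euclidean cepstral distance used here are placeholder sub-costs, not the patent's formulas. The `Unit` class, its field names (including the split into left and right boundary cepstra), and the weights are likewise hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Unit:
    name: str
    f0: float                    # fundamental frequency (Hz)
    duration: float              # phoneme duration (s)
    cep_left: Sequence[float]    # cepstrum at the left connection boundary
    cep_right: Sequence[float]   # cepstrum at the right connection boundary


def unit_cost(u: Unit, target_f0: float, target_dur: float,
              prev_unit: Unit, next_unit: Unit,
              weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the sub-costs for one candidate (Procedures 1 to 4)."""
    f0_cost = (math.log(u.f0) - math.log(target_f0)) ** 2     # placeholder for eq. (1)
    dur_cost = (u.duration - target_dur) ** 2                 # placeholder for eq. (2)
    conn_prev = math.dist(prev_unit.cep_right, u.cep_left)    # placeholder for eq. (3)
    conn_next = math.dist(u.cep_right, next_unit.cep_left)    # placeholder for eq. (3)
    w_f0, w_dur, w_conn = weights
    return w_f0 * f0_cost + w_dur * dur_cost + w_conn * (conn_prev + conn_next)


def select_top_m(candidates: Sequence[Unit], m: int, target_f0: float,
                 target_dur: float, prev_unit: Unit, next_unit: Unit) -> List[Unit]:
    """Rank candidates by ascending cost and keep the best m (Procedure 5)."""
    ranked = sorted(
        candidates,
        key=lambda u: unit_cost(u, target_f0, target_dur, prev_unit, next_unit),
    )
    return ranked[:m]
```

Only the terms that depend on the candidate are evaluated, which mirrors the observation above that the neighbouring units are held fixed while the segment of interest is ranked.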

The phoneme environment parameters have been described as the phoneme of a speech unit together with its fundamental frequency and phoneme duration, but they are not limited to these. As necessary, information such as the phoneme, fundamental frequency, phoneme duration, preceding phoneme, following phoneme, second following phoneme, power, presence or absence of stress, position relative to the accent nucleus, time since the last breath pause, speaking rate, and emotion can be used in combination.

(11) Step S4320
Next, the processing of step S4320 in FIG. 6 will be described.

In step S4320, the M speech units obtained for the segment of interest in step S4310 are fused to generate a new speech unit (a fused speech unit). Because the waveform of a voiced sound is periodic whereas that of an unvoiced sound is not, this step processes voiced and unvoiced speech units differently.

First, the voiced case is described. For voiced sound, pitch waveforms are extracted from the speech units and fused at the pitch-waveform level to create new pitch waveforms. A pitch waveform is a relatively short waveform whose length is up to about several times the fundamental period of the speech, which has no fundamental period of its own, and whose spectrum represents the spectral envelope of the speech signal.

Various extraction methods exist: simply cutting out the waveform with a fundamental-period-synchronous window; applying an inverse discrete Fourier transform to the power spectrum envelope obtained by cepstrum analysis or PSE analysis; obtaining the pitch waveform as the impulse response of a filter obtained by linear prediction analysis; and using closed-loop learning to obtain a pitch waveform that reduces distortion relative to natural speech at the level of the synthesized speech.

This embodiment is described with reference to the flowchart of FIG. 10, taking as an example the case in which pitch waveforms are extracted by cutting them out with a fundamental-period-synchronous window. The procedure described here fuses the M speech units of one of the segments into one new speech unit.

(11-1) Step S4321
First, in step S4321, marks (pitch marks) are placed on the speech waveform of each of the M speech units at intervals of its period.

FIG. 11(a) shows a case in which pitch marks 4321b have been placed at each period interval on the speech waveform 4321a of one of the M speech units.

(11-2) Step S4322
Next, in step S4322, as shown in FIG. 11(b), a window is applied with each pitch mark as the reference to cut out a pitch waveform.

A Hanning window 4321c is used as the window, and its length is twice the fundamental period.

Then, as shown in FIG. 11(c), the windowed waveform 4321d is cut out as a pitch waveform. The processing shown in FIG. 11 (the processing of step S4322) is applied to each of the M speech units.

As a result, a series of pitch waveforms, each series consisting of a plurality of pitch waveforms, is obtained for each of the M speech units.
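A minimal sketch of steps S4321 and S4322 follows, assuming that the pitch marks are already known as sample indices and that a single fundamental period (in samples) applies to the whole unit. The function name `cut_pitch_waveforms` is hypothetical, the periodic form of the Hann window is chosen here so that its peak falls exactly on the pitch mark, and samples outside the signal are treated as zero; none of these details are specified in the text.

```python
import math
from typing import List, Sequence


def cut_pitch_waveforms(signal: Sequence[float],
                        pitch_marks: Sequence[int],
                        period: int) -> List[List[float]]:
    """Cut one Hann-windowed pitch waveform per pitch mark.

    The window length is twice the fundamental period and the window is
    centered on the pitch mark.
    """
    n = 2 * period
    # Periodic Hann window: value 0 at k = 0 and value 1 at k = period.
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * k / n) for k in range(n)]
    waveforms = []
    for mark in pitch_marks:
        start = mark - period
        frame = []
        for k in range(n):
            idx = start + k
            sample = signal[idx] if 0 <= idx < len(signal) else 0.0
            frame.append(sample * window[k])
        waveforms.append(frame)
    return waveforms
```

In practice the fundamental period usually varies along the unit, so a per-mark period (for example, the distance to the neighbouring pitch mark) would be a natural refinement of this sketch.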

(11-3) Step S4323
Next, the process proceeds to step S4323. Among the pitch waveform series of the M speech units of the segment, the series with the largest number of pitch waveforms is taken as the reference, and pitch waveforms are duplicated in the series that have fewer so that all M series contain the same number of pitch waveforms.

FIGS. 12(a) and 12(b) show the pitch waveform series e1 to e3 cut out in step S4322 from the M (here, three) speech units d1 to d3 of the segment. Series e1 contains seven pitch waveforms, series e2 contains five, and series e3 contains six, so among the series e1 to e3 the one with the most pitch waveforms is e1.

Accordingly, FIG. 12(c) shows how, for the other series e2 and e3, some of the pitch waveforms in each series are copied so that the number of pitch waveforms becomes seven, matching the number in series e1.

The resulting new pitch waveform series corresponding to e2 and e3 are e2' and e3', respectively.
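The count-equalizing step can be sketched as below. The text only says that some pitch waveforms in the shorter series are copied; which ones are copied is not specified, so the evenly spread index mapping used here is an assumption, as is the function name `equalize_counts`. Every series is assumed to contain at least one pitch waveform.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def equalize_counts(series_list: Sequence[Sequence[T]]) -> List[List[T]]:
    """Duplicate pitch waveforms so every series has as many as the longest."""
    target = max(len(s) for s in series_list)
    result = []
    for s in series_list:
        # Map each of the `target` output positions back to a source index,
        # spreading the duplicates evenly along the series.
        indices = [(j * len(s)) // target for j in range(target)]
        result.append([s[i] for i in indices])
    return result
```

With series of 7, 5, and 6 pitch waveforms, as in the FIG. 12 example, the longest series is returned unchanged and the other two are stretched to seven entries each.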

(11-4) Step S4324
Next, the process proceeds to step S4324. In this step, processing is performed for each pitch waveform.

In step S4324, the pitch waveforms corresponding to the M speech units of the segment are averaged position by position to generate a new series of pitch waveforms. This newly generated series of pitch waveforms is the fused speech unit.

FIG. 12(d) shows the first to seventh pitch waveforms each being averaged across the three speech units to generate a new pitch waveform series f1 consisting of seven new pitch waveforms. For example, the centroid of the first pitch waveform of series e1, the first pitch waveform of series e2', and the first pitch waveform of series e3' is computed and used as the new first pitch waveform. The same applies to the second to seventh pitch waveforms of the new series f1. The pitch waveform series f1 is the "fused speech unit" referred to above.
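The position-by-position averaging of step S4324 might look like the following sketch. It assumes that all series already contain the same number of pitch waveforms and that the waveforms at a given position have the same number of samples; how waveforms of different lengths would be aligned is not stated in the text, so the sketch simply rejects that case. The function name `fuse_pitch_waveforms` is hypothetical.

```python
from typing import List, Sequence


def fuse_pitch_waveforms(
    series_list: Sequence[Sequence[Sequence[float]]],
) -> List[List[float]]:
    """Average the pitch waveforms of several units position by position."""
    count = len(series_list[0])
    if any(len(series) != count for series in series_list):
        raise ValueError("all series must hold the same number of pitch waveforms")
    fused = []
    for pos in range(count):
        frames = [series[pos] for series in series_list]
        n = len(frames[0])
        if any(len(frame) != n for frame in frames):
            raise ValueError("pitch waveforms at one position must share a length")
        # Sample-wise mean, i.e. the centroid of the waveforms at this position.
        fused.append([sum(frame[k] for frame in frames) / len(frames) for k in range(n)])
    return fused
```

The returned list plays the role of the series f1: one averaged pitch waveform per position.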

Although the number of pitch waveforms was matched here to that of the speech unit with the most pitch waveforms among the M speech units, it may instead be matched to the number of pitch marks of the synthesized speech unit to be created.

Also, although the pitch waveforms were fused by computing their centroid, the method is not limited to this. For example, the M pitch waveforms may be averaged, or the M speech units may be split into frequency bands and averaged after their phases have been aligned in each band.

(11-5) Unvoiced segments
In the processing of step S4320 in FIG. 6, for an unvoiced segment, the speech unit ranked first among the M speech units selected for that segment in the speech unit selection step S4310 is used as is.

(12) Step S4330
Next, the processing of step S4330 in FIG. 6 will be described.

In step S4330, the phoneme environment parameters of the fused speech unit obtained in step S4320 are calculated. The flow of processing in step S4330 of FIG. 6 is shown in the flowchart of FIG. 13.

The phoneme environment parameters of the fused speech unit are used when the optimal unit sequence is obtained in step S4300 of FIG. 6.

For that purpose, the fundamental frequency of the fused speech unit is obtained in step S4331.

In step S4332, the phoneme duration of the fused speech unit is obtained.

In step S4333, the cepstrum coefficient vector at the connection boundary of the fused speech unit is obtained. Together, these values serve as the phoneme environment parameters of the fused speech unit.

Here the fundamental frequency, phoneme duration, and connection-boundary cepstrum of the fused speech unit were obtained, but the parameters are not limited to these and can be changed according to the phoneme environment parameters needed for the cost calculation.

Once a new speech unit (fused speech unit) has been generated in this way for each of the segments corresponding to the input phoneme sequence, by fusing the M speech units selected for that segment, the process proceeds to the fused speech unit editing and connection step S4340 in FIG. 6.

(13) Step S4340
In step S4340, the speech unit editing and connection unit 44 modifies the fused speech unit obtained for each segment in step S4320 according to the input prosodic information and connects the results to generate the speech waveform of the synthesized speech.

Because the fused speech unit obtained in step S4320 is in fact in the form of pitch waveforms, a speech waveform can be generated by superimposing the pitch waveforms so that the fundamental frequency and phoneme duration of the fused speech unit match the fundamental frequency and phoneme duration of the target speech given in the input prosodic information.

FIG. 14 is a diagram for explaining the processing of step S4340. FIG. 14 shows the case in which the fused speech units obtained in step S4320 for the synthesis units of the phonemes "o", "N", "s", "e", "N" are modified and connected to generate the speech waveform for "おんせん" ("onsen"). As shown in FIG. 14, for each segment (synthesis unit), the fundamental frequency of the pitch waveforms in the fused speech unit is changed (changing the pitch of the sound) and the number of pitch waveforms is increased (changing the duration) according to the target fundamental frequency and target phoneme duration given in the input prosodic information. Adjacent pitch waveforms are then connected within and between segments to generate the synthesized speech.
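The superposition of pitch waveforms described for step S4340 can be sketched as a simple overlap-add. This is an illustration under assumptions rather than the embodiment itself: the function name `overlap_add` is hypothetical, the target fundamental period is taken as a single constant number of samples, and each pitch waveform is placed one target period after the previous one.

```python
from typing import List, Sequence


def overlap_add(pitch_waveforms: Sequence[Sequence[float]],
                target_period: int) -> List[float]:
    """Sum pitch waveforms placed `target_period` samples apart."""
    if not pitch_waveforms:
        return []
    longest = max(len(w) for w in pitch_waveforms)
    out = [0.0] * ((len(pitch_waveforms) - 1) * target_period + longest)
    for j, waveform in enumerate(pitch_waveforms):
        offset = j * target_period
        for k, sample in enumerate(waveform):
            out[offset + k] += sample   # overlapping regions add together
    return out
```

A smaller `target_period` raises the pitch of the output, while repeating or dropping pitch waveforms before calling the function lengthens or shortens the segment, which corresponds to the two kinds of modification described for FIG. 14.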

(14) Sub-cost requirements
The target cost should estimate (evaluate), as accurately as possible, the distortion of the synthesized speech relative to the target speech that arises when the fundamental frequency, phoneme duration, and so on of a fused speech unit are changed on the basis of the input prosodic information in order to generate the synthesized speech. The target cost calculated from equations (1) and (2), which is one example of such a target cost, computes this amount of distortion from the difference between the prosodic information of the target speech and the prosodic information of the speech units stored in the speech unit storage unit 40.

Likewise, the connection cost should estimate (evaluate), as accurately as possible, the distortion of the synthesized speech relative to the target speech that arises when fused speech units are connected in order to generate the synthesized speech. The connection cost calculated from equation (3), which is one example of such a connection cost, is computed from the difference in cepstrum coefficients at the connection boundary with a speech unit stored in the speech unit storage unit 40 or with a speech unit fused in the speech unit fusion step S4320.

(15) Effects
The effects of this embodiment are described while explaining how the speech synthesis method of this embodiment differs from the conventional speech synthesis method that selects and fuses a plurality of speech units.

The speech synthesizer of this embodiment shown in FIG. 3 differs from a conventional speech synthesizer (see, for example, Patent Document 1) in two respects. First, it has a phoneme environment calculation unit 432, and in processing step S4300 within the speech unit selection unit 430 it re-derives the optimal unit sequence using the phoneme environment parameters of the fused speech units. Second, it calculates the phoneme environment parameters of the fused speech units in step S4330.

In this embodiment, when speech units are selected for a segment whose adjacent segment has already undergone speech unit selection and holds a fused speech unit, the units are selected and fused taking into account the connection distortion with that fused speech unit. This removes the sense of discontinuity at the joins and yields high-quality speech units, and as a result more natural, higher-quality synthesized speech can be generated.

(Second embodiment)
Next, the speech unit selection unit 430 according to the second embodiment will be described with reference to FIG. 15.

In the optimal unit sequence 4301 of FIG. 7, for segments in which a fused speech unit exists, the fused speech unit was fixed as the speech unit on the optimal unit sequence, but the invention is not limited to this. The second embodiment is described as a modification of this point.

The cost of the optimal unit sequence 4301, calculated from the synthesis unit costs and equation (5), is handled as follows.

As shown in FIG. 15, consider the case in which that cost is worse than the cost of an optimal unit sequence 4302 whose speech units are all selected from the speech unit storage unit 40, regardless of whether fused speech units exist, so that the cost calculated from the synthesis unit costs and equation (5) is minimized. In this case, the connection cost in each segment is calculated using the optimal unit sequence 4302.

As a result, more natural, higher-quality synthesized speech can be generated more stably than in the first embodiment.

(Third embodiment)
Next, the speech unit selection and fusion unit 43 according to the third embodiment will be described with reference to FIG. 16.

In FIG. 6, speech units were fused from the beginning of the sentence toward the end (that is, in chronological order), but the invention is not limited to this. The third embodiment is described as a modification of this point.

When it is particularly desirable to improve the quality of the synthesized speech for a specific word in the phoneme sequence or for the end of the sentence, the segments corresponding to that word or sentence end can be fused first.

FIG. 16 is a flowchart showing the processing of the speech unit selection and fusion unit 43 according to this embodiment.

In step S4350, the order O_i (i = 1, ..., I, where I is the number of segments) specified by the user of the speech synthesis method of this embodiment is set. The values O_i, taken in order from i = 1, each hold one of the segment numbers 1 to I, with each number assigned exactly once. After the processing of steps S4300, S4310, S4320, and S4330 has been carried out for the O_i-th segment, the same processing is repeated for the O_(i+1)-th segment.
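The user-defined processing order can be sketched as a loop over a permutation of the segment numbers. The names `run_in_user_order` and `fuse_segment` are hypothetical; `fuse_segment` stands in for the whole chain of steps S4300 to S4330 for one segment and receives whichever adjacent segments have already been fused, so that their fused units can be taken into account.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

F = TypeVar("F")


def run_in_user_order(
    order: Sequence[int],
    fuse_segment: Callable[[int, Dict[int, F]], F],
) -> List[F]:
    """Process segments 1..I in the order O_1, ..., O_I given by `order`."""
    count = len(order)
    if sorted(order) != list(range(1, count + 1)):
        raise ValueError("order must use each segment number 1..I exactly once")
    fused: Dict[int, F] = {}
    for seg in order:
        # Adjacent segments that already hold a fused speech unit.
        neighbours = {k: fused[k] for k in (seg - 1, seg + 1) if k in fused}
        fused[seg] = fuse_segment(seg, neighbours)
    # Return the fused units in chronological (segment-number) order.
    return [fused[i] for i in range(1, count + 1)]
```

Segments placed early in `order` are processed with no fused neighbours yet, so no already-fused unit limits the choice of speech units for them.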

As a result, compared with the first embodiment, there is greater freedom in speech unit selection for the segments whose synthesized speech quality the user particularly wants to improve, such as a specific word or the end of a sentence, and consequently the more natural, higher-quality synthesized speech that the user desires can be generated.

(Fourth embodiment)
Next, the speech unit selection and fusion unit 43 according to the fourth embodiment will be described with reference to FIG. 17.

In step S4310 of FIG. 6, M speech units were selected for each segment, but the invention is not limited to this. The fourth embodiment is described as a modification of this point.

Instead of fusing speech units at the time the synthesized speech is generated, speech units generated by fusing L (< M) speech units from among those stored in the speech unit storage unit 40 can also be used. That is, if the speech unit storage unit 40 holds L_i speech units having the same phoneme as a given segment, then for each segment

[expression not reproduced in the source text]

speech units exist as speech unit selection candidates.

FIG. 17 is a flowchart showing the flow of processing in the speech unit selection and fusion unit 43 in FIG. 2.

In step S4320, m speech units are fused, and in step S4360 the phoneme environment parameters of the speech unit obtained by fusing the m speech units are calculated. This is repeated M times to generate the speech unit candidates for one segment. Steps S4320 and S4360 are then repeated in the same way for each segment.

Next, in step S4300, the optimal unit sequence is searched for over the speech unit candidates of the segments, and the speech unit selected for each segment is used for the synthesized speech.

As a result, compared with the first embodiment, speech units of better quality can be generated from the speech units stored in the speech unit storage unit 40, and more natural, higher-quality synthesized speech can be generated.

(Modifications)
The present invention is not limited to the embodiments exactly as described above; at the implementation stage, the components can be modified and embodied without departing from the gist of the invention.

In addition, various inventions can be formed by appropriately combining the components disclosed in the above embodiments. For example, some components may be removed from the full set of components shown in an embodiment. Components from different embodiments may also be combined as appropriate.

FIG. 1 is a block diagram showing the configuration of the speech synthesizer according to the first embodiment of the present invention.
FIG. 2 is a block diagram showing a configuration example of the speech synthesis unit.
FIG. 3 is a block diagram showing a configuration example of the speech unit selection and fusion unit.
FIG. 4 is a diagram showing an example of how speech units are stored in the speech unit storage unit.
FIG. 5 is a diagram showing an example of how phoneme environment parameters are stored in the phoneme environment storage unit.
FIG. 6 is a flowchart for explaining the processing operation of the speech unit selection and fusion unit.
FIG. 7 is a schematic diagram showing an example of an optimal unit sequence.
FIG. 8 is a flowchart for explaining the processing of the speech unit selection unit.
FIG. 9 is a flowchart for explaining the processing of the speech unit selection unit.
FIG. 10 is a flowchart for explaining the processing of the speech unit fusion unit.
FIG. 11 is a flowchart for explaining the processing that cuts out pitch waveforms.
FIG. 12 is a flowchart for explaining the processing that equalizes the number of pitch waveforms.
FIG. 13 is a flowchart for explaining the processing of the phoneme environment calculation unit.
FIG. 14 is a diagram for explaining the processing operation of the speech unit editing and connection unit.
FIG. 15 is a schematic diagram showing an example of an optimal unit sequence according to the second embodiment.
FIG. 16 is a flowchart for explaining the processing operation of the speech unit selection and fusion unit according to the third embodiment.
FIG. 17 is a flowchart for explaining the processing operation of the speech unit selection and fusion unit according to the fourth embodiment.

Explanation of symbols

1 Text input unit
2 Language processing unit
3 Prosody processing unit
4 Speech synthesis unit
5 Speech waveform output unit
40 Speech unit storage unit
41 Phoneme environment parameter storage unit
42 Phoneme sequence and prosodic information input unit
43 Speech unit selection and fusion unit
44 Speech unit editing and connection unit

Claims

A speech synthesizer comprising:
a storage unit that stores a speech unit group and a phoneme environment parameter for each speech unit of the speech unit group;
a selection unit that selects a plurality of first speech units from the speech unit group for each of a plurality of segments obtained by dividing, in units of synthesis, a phoneme sequence corresponding to target speech to be synthesized;
a generating unit that generates one second speech unit by fusing the plurality of first speech units;
a parameter calculation unit that calculates a phoneme environment parameter of the second speech unit; and
a synthesis unit that generates synthesized speech by connecting the second speech units generated for the respective segments,
wherein the selection unit includes:
a segment setting unit that sets, as a segment of interest, one of the segments for which the first speech units are to be selected;
an extraction unit that extracts, from the speech unit group, a plurality of speech units having the same characteristics as the phoneme of the segment of interest, as speech unit candidates;
a first cost calculation unit that calculates, from the phoneme environment parameter of each speech unit candidate and the prosodic information of the target speech, a target cost representing the amount of distortion of synthesized speech generated using each speech unit candidate of the segment of interest;
a second cost calculation unit that calculates a connection cost representing the amount of distortion that occurs when a speech unit of an adjacent segment, which is the segment before or after the segment of interest, is connected to each speech unit candidate of the segment of interest, wherein (1) when the adjacent segment has the second speech unit, the connection cost is calculated from the phoneme environment parameter of the second speech unit and the phoneme environment parameter of each speech unit candidate, or (2) when the adjacent segment does not have the second speech unit, the connection cost is calculated from the phoneme environment parameter of the speech unit candidate corresponding to the segment of interest and the phoneme environment parameter of the speech unit candidate corresponding to the adjacent segment; and
a speech unit selection unit that selects, as the first speech units, a plurality of speech unit candidates having a low total cost of the target cost and the connection cost from among the plurality of speech unit candidates of the segment of interest.
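
The candidate selection recited in the claim above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the claimed implementation: the `Unit` fields, the absolute-difference cost functions, the use of only the preceding segment as the adjacent segment, and the rule of taking the cheapest join in case (2) are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Unit:
    """Speech unit with a toy phoneme environment (all fields are assumptions)."""
    name: str
    f0: float        # fundamental frequency, Hz
    duration: float  # phoneme duration, ms
    edge: float      # stand-in for a spectral feature at the unit boundary


def target_cost(unit: Unit, target_f0: float, target_duration: float) -> float:
    """Distortion between a candidate's environment and the target prosody."""
    return abs(unit.f0 - target_f0) + abs(unit.duration - target_duration)


def connection_cost(adjacent: Unit, unit: Unit) -> float:
    """Distortion caused by joining a candidate to a unit of the adjacent segment."""
    return abs(adjacent.edge - unit.edge)


def select_first_units(
    candidates: List[Unit],
    target_f0: float,
    target_duration: float,
    adjacent_fused: Optional[Unit],
    adjacent_candidates: List[Unit],
    n: int,
) -> List[Unit]:
    """Return the n candidates with the lowest target + connection cost."""
    def total(unit: Unit) -> float:
        if adjacent_fused is not None:
            # Case (1): the adjacent segment already has a fused unit.
            conn = connection_cost(adjacent_fused, unit)
        else:
            # Case (2): use the adjacent segment's candidates instead
            # (taking the cheapest join is an assumption of this sketch).
            conn = min(connection_cost(a, unit) for a in adjacent_candidates)
        return target_cost(unit, target_f0, target_duration) + conn

    return sorted(candidates, key=total)[:n]
```

In this sketch the same candidate list can yield different selections depending on whether the neighbouring segment has already been fused, which is the point of computing the connection cost against the fused unit's own phoneme environment parameter.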

The speech synthesizer according to claim 1, wherein the selection unit includes:
a segment cost calculation unit that treats a plurality of the segments as one group and calculates, for each segment, a segment cost that is the sum of the target cost and the connection cost;
a total cost calculation unit that calculates a total cost that is the sum of the segment costs of the segments;
a segment selection unit that selects a speech unit for each segment, selecting (1) the second speech unit in a segment that has the second speech unit, or (2) in a segment that does not have the second speech unit, one speech unit from among the speech unit candidates such that the total cost is minimized; and
a sequence calculation unit that obtains an optimal unit sequence, which is a sequence of speech units, by associating the selected second speech unit or speech unit with each segment,
and wherein the first cost calculation unit and the second cost calculation unit respectively set the segment of interest and the adjacent segment for each segment on the optimal unit sequence.
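
One way to picture the optimal unit sequence of the claim above is a shortest-path search over the segments, where a segment that already has a fused unit offers only that unit and every other segment offers all of its candidates. The sketch below uses dynamic programming for the search; the claim only requires that the total cost be minimized, so the search method, the function names, and the cost callbacks are assumptions of this example.

```python
from typing import Callable, List, Tuple, TypeVar

U = TypeVar("U")


def optimal_unit_sequence(
    lattice: List[List[U]],
    target_cost: Callable[[int, U], float],
    connection_cost: Callable[[U, U], float],
) -> Tuple[float, List[U]]:
    """Return (total cost, unit sequence) minimizing the summed segment costs.

    lattice[i] lists the units allowed in segment i: a single fused unit if the
    segment has one, otherwise all of its speech unit candidates.
    """
    # best holds, per unit of the current segment, the cheapest path ending there.
    best = [(target_cost(0, u), [u]) for u in lattice[0]]
    for i in range(1, len(lattice)):
        nxt = []
        for u in lattice[i]:
            # Cheapest way to reach u from any unit of the previous segment.
            cost, path = min(
                ((c + connection_cost(p[-1], u), p) for c, p in best),
                key=lambda item: item[0],
            )
            nxt.append((cost + target_cost(i, u), path + [u]))
        best = nxt
    return min(best, key=lambda item: item[0])
```

For example, with units represented by plain numbers, `optimal_unit_sequence([[1.0, 5.0], [4.0], [3.0, 9.0]], lambda i, u: abs(u - [1.0, 4.0, 9.0][i]), lambda a, b: abs(a - b))` treats the middle segment as already fused (one fixed unit) and searches only over the outer segments.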

The speech synthesizer according to claim 2, wherein the selection unit calculates the connection cost using the phoneme environment parameter of the speech unit and the phoneme environment parameter of the speech unit candidate of the adjacent segment on the optimal unit sequence.

The speech synthesizer according to claim 2, wherein the selection unit selects the plurality of first speech units according to an order assigned in advance to each segment.

The speech synthesizer according to claim 4, wherein the order is assigned so that voiced segments rank higher.

The speech synthesizer according to claim 4, wherein the order is assigned according to the number of speech units included in the speech unit group.
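
The two ordering rules in the claims above can be illustrated with a short sketch. The `Segment` fields are hypothetical, and ranking segments with more stored units first is an assumption, since the claim does not state the direction of the ordering.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    phoneme: str
    voiced: bool
    unit_count: int  # number of stored speech units for this phoneme (assumed field)


def order_voiced_first(segments: List[Segment]) -> List[int]:
    """Segment indices with voiced segments ranked ahead of unvoiced ones."""
    return sorted(range(len(segments)), key=lambda i: not segments[i].voiced)


def order_by_unit_count(segments: List[Segment]) -> List[int]:
    """Segment indices ranked by how many speech units are available (most first)."""
    return sorted(range(len(segments)), key=lambda i: -segments[i].unit_count)
```

Because Python's `sorted` is stable, segments with equal rank keep their original left-to-right order in both functions.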

A speech synthesizer comprising:
a storage unit that stores a speech unit group and a phoneme environment parameter for each speech unit of the speech unit group;
a segment setting unit that sets, as a segment of interest, one of a plurality of segments obtained by dividing, in units of synthesis, a phoneme sequence corresponding to target speech to be synthesized;
an extraction unit that extracts, from the speech unit group, a plurality of speech units having the same characteristics as the phoneme of the segment of interest, as third speech units;
a generating unit that generates a fourth speech unit by fusing the plurality of third speech units;
a parameter calculation unit that calculates a phoneme environment parameter of the fourth speech unit;
a third cost calculation unit that calculates a target cost representing the amount of distortion of synthesized speech generated using each of the third speech units and the fourth speech unit of the segment of interest, from the phoneme environment parameter of each of the third speech units and the fourth speech unit and the prosodic information of the target speech;
a fourth cost calculation unit that calculates a connection cost representing the amount of distortion that occurs when the third speech units and the fourth speech unit of an adjacent segment, which is the segment before or after the segment of interest, are connected to the third speech units and the fourth speech unit of the segment of interest, from the phoneme environment parameters of the third speech units and the fourth speech unit of the segment of interest and the phoneme environment parameters of the third speech units and the fourth speech unit of the adjacent segment;
a speech unit selection unit that selects, as fifth speech units, a plurality of speech units having a low total cost of the target cost and the connection cost from among the plurality of third speech units and the fourth speech unit of the segment of interest; and
a synthesis unit that generates synthesized speech by connecting the fifth speech units generated for the respective segments.
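
In the claim above the fused (fourth) speech unit competes with the original (third) speech units as one more candidate. The sketch below illustrates only that pooling idea under heavy simplification: each unit is reduced to a single fundamental-frequency value, the fused unit's parameter is taken to be the mean of its sources, and only a target cost is used. All of these are assumptions of the example, and the connection cost is deliberately left out.

```python
from typing import List, Tuple


def select_fifth_units(
    third_f0s: List[float], target_f0: float, n: int
) -> List[Tuple[str, float]]:
    """Rank the third units plus their fused (fourth) unit by target cost only."""
    # Assumed phoneme environment parameter of the fused unit: mean f0 of its sources.
    fourth_f0 = sum(third_f0s) / len(third_f0s)
    pool = [("third", f0) for f0 in third_f0s] + [("fourth", fourth_f0)]
    # Lower absolute f0 error means lower (toy) target cost.
    return sorted(pool, key=lambda item: abs(item[1] - target_f0))[:n]
```

With `third_f0s = [90.0, 100.0, 140.0]` and a target of 108 Hz, the fused unit at 110 Hz ranks ahead of every original unit, showing how a fused unit can be the best match even when no stored unit is.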

The speech synthesizer according to claim 7, further comprising:
a segment cost calculation unit that treats a plurality of the segments as one group and calculates, for each segment, a segment cost that is the sum of the target cost and the connection cost;
a total cost calculation unit that calculates a total cost that is the sum of the segment costs of the segments;
a segment selection unit that selects a speech unit for each segment, selecting, from among the third speech units and the fourth speech unit, one speech unit such that the total cost is minimized; and
a sequence calculation unit that obtains an optimal unit sequence, which is a sequence of speech units, by associating the selected speech unit with each segment,
wherein the third cost calculation unit and the fourth cost calculation unit respectively set the segment of interest and the adjacent segment for each segment on the optimal unit sequence.

A speech synthesis program that causes a computer to implement:
a storage function of storing a speech unit group and a phoneme environment parameter for each speech unit of the speech unit group;
a selection function of selecting a plurality of first speech units from the speech unit group for each of a plurality of segments obtained by dividing, in units of synthesis, a phoneme sequence corresponding to target speech to be synthesized;
a generating function of generating one second speech unit by fusing the plurality of first speech units;
a parameter calculation function of calculating a phoneme environment parameter of the second speech unit; and
a synthesis function of generating synthesized speech by connecting the second speech units generated for the respective segments,
wherein the selection function includes:
a segment setting function of setting, as a segment of interest, one of the segments for which the first speech units are to be selected;
an extraction function of extracting, from the speech unit group, a plurality of speech units having the same characteristics as the phoneme of the segment of interest, as speech unit candidates;
a first cost calculation function of calculating, from the phoneme environment parameter of each speech unit candidate and the prosodic information of the target speech, a target cost representing the amount of distortion of synthesized speech generated using each speech unit candidate of the segment of interest;
a second cost calculation function of calculating a connection cost representing the amount of distortion that occurs when a speech unit of an adjacent segment, which is the segment before or after the segment of interest, is connected to each speech unit candidate of the segment of interest, wherein (1) when the adjacent segment has the second speech unit, the connection cost is calculated from the phoneme environment parameter of the second speech unit and the phoneme environment parameter of each speech unit candidate, or (2) when the adjacent segment does not have the second speech unit, the connection cost is calculated from the phoneme environment parameter of the speech unit candidate corresponding to the segment of interest and the phoneme environment parameter of the speech unit candidate corresponding to the adjacent segment; and
a speech unit selection function of selecting, as the first speech units, a plurality of speech unit candidates having a low total cost of the target cost and the connection cost from among the plurality of speech unit candidates of the segment of interest.
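
The generating and parameter calculation functions recited above can be sketched as follows. The sketch assumes that each speech unit is a list of equal-length pitch waveforms, that fusion is a sample-wise average after the number of pitch waveforms has been equalized, and that the fused unit's phoneme environment parameter is the mean of the source values. The claim itself does not fix any of these choices, so the function names and the averaging rules are hypothetical.

```python
from typing import List

PitchWaveform = List[float]


def equalize(waveforms: List[PitchWaveform], count: int) -> List[PitchWaveform]:
    """Duplicate or drop pitch waveforms so that exactly `count` remain."""
    n = len(waveforms)
    return [waveforms[i * n // count] for i in range(count)]


def fuse(units: List[List[PitchWaveform]], count: int) -> List[PitchWaveform]:
    """Fuse several speech units into one by averaging aligned pitch waveforms."""
    aligned = [equalize(unit, count) for unit in units]
    fused = []
    for k in range(count):
        group = [unit[k] for unit in aligned]
        # Average the k-th pitch waveform of every unit, sample by sample.
        fused.append([sum(samples) / len(group) for samples in zip(*group)])
    return fused


def fused_parameter(values: List[float]) -> float:
    """Assumed phoneme environment parameter of the fused unit: mean of the sources."""
    return sum(values) / len(values)
```

Recomputing a parameter for the fused unit, rather than reusing a parameter of any single source unit, is what allows later cost calculations to treat the fused unit like a stored unit.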

The speech synthesis program according to claim 9, wherein the selection function includes:
a segment cost calculation function of treating a plurality of the segments as one group and calculating, for each segment, a segment cost that is the sum of the target cost and the connection cost;
a total cost calculation function of calculating a total cost that is the sum of the segment costs of the segments;
a segment selection function of selecting a speech unit for each segment, selecting (1) the second speech unit in a segment that has the second speech unit, or (2) in a segment that does not have the second speech unit, one speech unit from among the speech unit candidates such that the total cost is minimized; and
a sequence calculation function of obtaining an optimal unit sequence, which is a sequence of speech units, by associating the selected second speech unit or speech unit with each segment,
and wherein the first cost calculation function and the second cost calculation function respectively set the segment of interest and the adjacent segment for each segment on the optimal unit sequence.

The speech synthesis program according to claim 10, wherein the selection function calculates the connection cost using the phoneme environment parameter of the speech unit and the phoneme environment parameter of the speech unit candidate of the adjacent segment on the optimal unit sequence.

The speech synthesis program according to claim 10, wherein the selection function selects the plurality of first speech units according to an order assigned in advance to each segment.

The speech synthesis program according to claim 12, wherein the order is assigned so that voiced segments rank higher.

The speech synthesis program according to claim 12, wherein the order is assigned according to the number of speech units included in the speech unit group.

A speech synthesis program that causes a computer to implement:
a storage function of storing a speech unit group and a phoneme environment parameter for each speech unit of the speech unit group;
a segment setting function of setting, as a segment of interest, one of a plurality of segments obtained by dividing, in units of synthesis, a phoneme sequence corresponding to target speech to be synthesized;
an extraction function of extracting, from the speech unit group, a plurality of speech units having the same characteristics as the phoneme of the segment of interest, as third speech units;
a generating function of generating a fourth speech unit by fusing the plurality of third speech units;
a parameter calculation function of calculating a phoneme environment parameter of the fourth speech unit;
a third cost calculation function of calculating a target cost representing the amount of distortion of synthesized speech generated using each of the third speech units and the fourth speech unit of the segment of interest, from the phoneme environment parameter of each of the third speech units and the fourth speech unit and the prosodic information of the target speech;
a fourth cost calculation function of calculating a connection cost representing the amount of distortion that occurs when the third speech units and the fourth speech unit of an adjacent segment, which is the segment before or after the segment of interest, are connected to the third speech units and the fourth speech unit of the segment of interest, from the phoneme environment parameters of the third speech units and the fourth speech unit of the segment of interest and the phoneme environment parameters of the third speech units and the fourth speech unit of the adjacent segment;
a speech unit selection function of selecting, as fifth speech units, a plurality of speech units having a low total cost of the target cost and the connection cost from among the plurality of third speech units and the fourth speech unit of the segment of interest; and
a synthesis function of generating synthesized speech by connecting the fifth speech units generated for the respective segments.

The speech synthesis program according to claim 15, further causing the computer to implement:
a segment cost calculation function of treating a plurality of the segments as one group and calculating, for each segment, a segment cost that is the sum of the target cost and the connection cost;
a total cost calculation function of calculating a total cost that is the sum of the segment costs of the segments;
a segment selection function of selecting a speech unit for each segment, selecting, from among the third speech units and the fourth speech unit, one speech unit such that the total cost is minimized; and
a sequence calculation function of obtaining an optimal unit sequence, which is a sequence of speech units, by associating the selected speech unit with each segment,
wherein the third cost calculation function and the fourth cost calculation function respectively set the segment of interest and the adjacent segment for each segment on the optimal unit sequence.

A speech synthesis method comprising:
a storage step of storing a speech unit group and a phoneme environment parameter for each speech unit of the speech unit group;
a selection step of selecting a plurality of first speech units from the speech unit group for each of a plurality of segments obtained by dividing, in units of synthesis, a phoneme sequence corresponding to target speech to be synthesized;
a generating step of generating one second speech unit by fusing the plurality of first speech units;
a parameter calculation step of calculating a phoneme environment parameter of the second speech unit; and
a synthesis step of generating synthesized speech by connecting the second speech units generated for the respective segments,
wherein the selection step includes:
a segment setting step of setting, as a segment of interest, one of the segments for which the first speech units are to be selected;
an extraction step of extracting, from the speech unit group, a plurality of speech units having the same characteristics as the phoneme of the segment of interest, as speech unit candidates;
a first cost calculation step of calculating, from the phoneme environment parameter of each speech unit candidate and the prosodic information of the target speech, a target cost representing the amount of distortion of synthesized speech generated using each speech unit candidate of the segment of interest;
a second cost calculation step of calculating a connection cost representing the amount of distortion that occurs when a speech unit of an adjacent segment, which is the segment before or after the segment of interest, is connected to each speech unit candidate of the segment of interest, wherein (1) when the adjacent segment has the second speech unit, the connection cost is calculated from the phoneme environment parameter of the second speech unit and the phoneme environment parameter of each speech unit candidate, or (2) when the adjacent segment does not have the second speech unit, the connection cost is calculated from the phoneme environment parameter of the speech unit candidate corresponding to the segment of interest and the phoneme environment parameter of the speech unit candidate corresponding to the adjacent segment; and
a speech unit selection step of selecting, as the first speech units, a plurality of speech unit candidates having a low total cost of the target cost and the connection cost from among the plurality of speech unit candidates of the segment of interest.

A speech synthesis method comprising:
a storage step of storing a speech unit group and a phoneme environment parameter for each speech unit of the speech unit group;
a segment setting step of setting, as a segment of interest, one of a plurality of segments obtained by dividing, in units of synthesis, a phoneme sequence corresponding to target speech to be synthesized;
an extraction step of extracting, from the speech unit group, a plurality of speech units having the same characteristics as the phoneme of the segment of interest, as third speech units;
a generating step of generating a fourth speech unit by fusing the plurality of third speech units;
a parameter calculation step of calculating a phoneme environment parameter of the fourth speech unit;
a third cost calculation step of calculating a target cost representing the amount of distortion of synthesized speech generated using each of the third speech units and the fourth speech unit of the segment of interest, from the phoneme environment parameter of each of the third speech units and the fourth speech unit and the prosodic information of the target speech;
a fourth cost calculation step of calculating a connection cost representing the amount of distortion that occurs when the third speech units and the fourth speech unit of an adjacent segment, which is the segment before or after the segment of interest, are connected to the third speech units and the fourth speech unit of the segment of interest, from the phoneme environment parameters of the third speech units and the fourth speech unit of the segment of interest and the phoneme environment parameters of the third speech units and the fourth speech unit of the adjacent segment;
a speech unit selection step of selecting, as fifth speech units, a plurality of speech units having a low total cost of the target cost and the connection cost from among the plurality of third speech units and the fourth speech unit of the segment of interest; and
a synthesis step of generating synthesized speech by connecting the fifth speech units generated for the respective segments.